<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Jamie Matthews' Blog]]></title><description><![CDATA[I'm a student studying Software Engineering at Queen's University Belfast in the UK. I enjoy creating systems that make complex technologies more accessible and useful for everyday applications.]]></description><link>https://blog.jmatthews.uk</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 10:47:32 GMT</lastBuildDate><atom:link href="https://blog.jmatthews.uk/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Two Systems, One Truth: Why the Best Forecasters Will Use Both AI and Financial Markets]]></title><description><![CDATA[Two days before the Academy Awards, I ran a prediction through Perspectives: "Will Timothée Chalamet win the Oscar for Best Actor at the 2026 Academy Awards?"
The system's consensus prediction across ]]></description><link>https://blog.jmatthews.uk/best-actor-oscars-2026-prediction</link><guid isPermaLink="true">https://blog.jmatthews.uk/best-actor-oscars-2026-prediction</guid><category><![CDATA[Oscars]]></category><category><![CDATA[Predictions]]></category><category><![CDATA[predictive modelling]]></category><category><![CDATA[perspectives]]></category><category><![CDATA[development]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Tue, 17 Mar 2026 13:19:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/79b188b7-47de-4432-b1a1-790a371aaa12.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two days before the Academy Awards, I ran a prediction through Perspectives: "Will Timothée Chalamet win the Oscar for Best Actor at the 2026 Academy Awards?"</p>
<p>The system's consensus prediction across seven personas was 26%. On the same day, Polymarket had Chalamet at roughly 29.5%, backed by over $13 million in trading volume.</p>
<p>On March 15, Michael B. Jordan won the Oscar for Best Actor for Sinners. Both systems had it right.</p>
<img src="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/545ea4df-71fa-40f6-8d7f-13883f61e46b.png" alt="" style="display:block;margin:0 auto" />

<h2>Recent Changes</h2>
<p>This is the first prediction I ran after two changes I made to the system based on patterns identified in the Iran and Noem evaluations.</p>
<p>The previous version of the interrogation protocol had no mechanism to force personas to consider outside intervention. Across the Iran debates, every persona analysed the situation through internal dynamics (succession planning, IRGC cohesion, protest momentum) and treated external military action as a fringe scenario. The Scenario Planner gave assassination or military strike only 5% in the Khamenei debate. I added an external intervention challenge dimension to the interrogation protocol. Each persona can now be challenged specifically on whether outside forces could override their internal analysis.</p>
<p>The previous report format presented a single aggregate probability. In the Iran debates, the aggregate compressed a range of 5% to 90% into a single figure of 30%. The Risk Analyst's 65-80% estimate and the Systems Thinker's 80-90% estimate were diluted into a number that obscured the disagreement. The report now splits its headline figure into a consensus view and a dissenting view when the distribution is sufficiently uneven. For this prediction, the report presented a consensus of ~26% (seven personas) and a dissenting view of ~80% (one persona).</p>
<h2>A Note on the Base Rate Analyst</h2>
<p>The Base Rate Analyst submitted a probability of 75-85% for Chalamet winning, placing it as the sole dissenting voice in the report's split view. However, reading its proposal reveals a problem: the entire argument is about why Michael B. Jordan will win. It cites the 72% SAG-Oscar correlation, argues that betting against the guild winner is "statistical malpractice," and concludes that Jordan's SAG victory is a decisive signal. The reasoning and the number point in opposite directions. The persona appears to have confused which candidate it was estimating a probability for.</p>
<p>This is a limitation of the current system that I'm working to correct. Structured probability outputs (where the persona submits a JSON object alongside its proposal) make the number extractable, but they don't prevent a persona from misaligning its reasoning with its estimate. The other seven personas are internally consistent and form the basis of the analysis below. The Base Rate Analyst's estimate is excluded from all figures in this post.</p>
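<p>A minimal sketch of that failure mode and a cheap guard against it follows. The field names, the threshold, and the check itself are illustrative assumptions on my part, not the system's actual schema:</p>

```python
import json

# Hypothetical structured submission of the kind described: a JSON object
# carrying the probability estimate alongside the persona's prose reasoning.
submission = json.loads("""{
  "question": "Will Timothee Chalamet win Best Actor?",
  "probability": 0.80,
  "reasoning": "The 72% SAG-Oscar correlation makes Michael B. Jordan the decisive favourite."
}""")

def reasoning_contradicts_estimate(sub, rival, threshold=0.5):
    # Crude consistency flag: a high probability whose reasoning is built
    # around the rival candidate suggests the persona swapped subjects.
    return sub["probability"] >= threshold and rival.lower() in sub["reasoning"].lower()

print(reasoning_contradicts_estimate(submission, "Michael B. Jordan"))  # True
```

<p>A check this crude wouldn't catch subtler misalignments between reasoning and number, but it would have flagged a submission like the Base Rate Analyst's for manual review.</p>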
<h2>What the System Predicted</h2>
<p>Seven personas produced estimates for Chalamet winning Best Actor. The majority clustered between 10% and 25%.</p>
<table>
<thead>
<tr>
<th>Persona</th>
<th>Probability Range</th>
<th>Midpoint</th>
</tr>
</thead>
<tbody><tr>
<td>The Contrarian</td>
<td>52-58%</td>
<td>55%</td>
</tr>
<tr>
<td>The Sceptic</td>
<td>35-45%</td>
<td>40%</td>
</tr>
<tr>
<td>The Scenario Planner</td>
<td>20-30%</td>
<td>25%</td>
</tr>
<tr>
<td>The Risk Analyst</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Trend Analyst</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Systems Thinker</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Insider</td>
<td>10-20%</td>
<td>15%</td>
</tr>
</tbody></table>
<p>Four personas converged at 10-20%. The Scenario Planner sat slightly above at 20-30%. The Sceptic and Contrarian formed the upper range, at 35-45% and 52-58% respectively.</p>
<p>The consensus average across these seven personas was approximately 26%.</p>
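<p>As described, the consensus figure is just the arithmetic mean of the seven range midpoints from the table above. A sketch of that aggregation step (not the system's actual code):</p>

```python
# Range midpoints from the table; the confused Base Rate Analyst is excluded.
midpoints = {
    "Contrarian": 55, "Sceptic": 40, "Scenario Planner": 25,
    "Risk Analyst": 15, "Trend Analyst": 15,
    "Systems Thinker": 15, "Insider": 15,
}

consensus = sum(midpoints.values()) / len(midpoints)
print(f"{consensus:.1f}%")  # 25.7%, rounded to ~26% in the report
```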
<h2>The Arguments</h2>
<p>The debate centred on a single question: how much weight to give the SAG Awards result.</p>
<p>Michael B. Jordan won the Screen Actors Guild Award for Best Actor on March 1, breaking Chalamet's run as frontrunner. Chalamet had won the Golden Globe in January and was sitting at roughly 79% on Polymarket before the SAG loss. By the time Oscar voting closed on March 5, his odds had dropped below 45%.</p>
<p>The majority camp treated the SAG result as a decisive signal. The Risk Analyst argued that the SAG-AFTRA membership represents the largest voting bloc within the Academy. When guild voters reject a performance, the Academy almost never overrides that. The Trend Analyst reinforced this with the BAFTA result, where Robert Aramayo won for I Swear, further isolating Chalamet from precursor momentum. The Insider pointed to reports of industry backlash against Chalamet's campaign, describing it as overwhelming.</p>
<p>The Systems Thinker framed the race structurally: with SAG, BAFTA, and narrative momentum all pointing away from Chalamet, the remaining paths to a win required multiple unlikely conditions (vote splitting, a sudden sentiment reversal, the Academy making a deliberate career anointment decision).</p>
<p>The Contrarian assigned the highest probability (52-58%), arguing that the market had overreacted to Jordan's SAG win. Their case rested on the Academy's tendency to crown "rising stars" and the idea that three nominations by age 30 created a career narrative with its own momentum. The Sceptic took a middle position (35-45%), respecting the SAG correlation while leaving room for uncertainty about how the preferential ballot might fracture.</p>
<h2>The Interrogation</h2>
<p>The debate generated 24 challenges: 3 concessions, 2 defences, and 19 disputes. The high dispute rate (79%) reflects how polarised the positions were.</p>
<p>The Contrarian conceded twice: once to the Insider on the weight of qualitative insider signals, and once to the Sceptic on a calibration challenge. These concessions weakened the strongest pro-Chalamet argument in the debate.</p>
<p>The Risk Analyst defended both of their challenges, contributing to their win in the ranked-choice vote. The Sceptic and Insider were challenged three times each and produced only disputes. Neither side was willing to yield on how to interpret the SAG signal.</p>
<h2>The Polymarket Comparison</h2>
<p>The Best Actor market on Polymarket processed over $13 million in trading volume. Traders put real money on their beliefs about the outcome, creating a strong incentive to be accurate.</p>
<p>On March 13, the day Perspectives ran this debate, Polymarket had Chalamet at 29.5% and Jordan at 54.5%. The Perspectives consensus came in at 26% for Chalamet.</p>
<p>Both systems identified Jordan as the clear favourite. Both assigned Chalamet a meaningful but minority probability.</p>
<p>The closeness in results is worth examining because the systems work in completely different ways. Polymarket aggregates the financial commitments of thousands of independent traders, each bringing their own information and analytical frameworks. Perspectives runs AI personas through a structured debate pipeline (blind proposals, interrogation, discussion, ranked-choice voting) and takes the arithmetic mean of their probability estimates.</p>
<p>The convergence suggests that for this category of question, the debate process produces probability estimates in the same range as a well-capitalised prediction market. Prediction markets provide a calibration benchmark: a well-tested number backed by financial incentives. The debate process provides the reasoning trail: why the personas predict what they predict, where they agree and disagree, and which arguments survive interrogation.</p>
<h2>A Different Pattern</h2>
<p>In the Iran and Noem evaluations, a consistent pattern appeared: cautious analyses won the debate vote while more aggressive estimates proved more accurate. The system underweighted outlier positions, and the aggregate landed too low.</p>
<p>This prediction breaks that pattern because the debate winner was correct. The Risk Analyst won the ranked-choice vote with a 10-20% probability for Chalamet, arguing that the SAG loss represented a systemic rejection. In the Iran and Noem predictions, the debate winner was consistently wrong.</p>
<p>The consensus was well-calibrated. A 26% probability for an event that didn't happen is reasonable. Proper calibration evaluation requires many predictions (a system that says 26% should see the event happen roughly 26% of the time), but the proximity to Polymarket's 29.5% suggests the number was in the right range.</p>
<p>The two system changes I made after the previous evaluations may have contributed. The external intervention challenge dimension, added to address the system's blind spot on outside forces, is less relevant for an Oscar prediction, but the split reporting addressed a problem visible here: the confused Base Rate Analyst submitted an 80% estimate that would have pulled the headline figure up to 32%. The previous report format would have presented that single number. The new format separated the consensus (26%) from the outlier (80%), making it clear that seven of eight personas clustered in a lower range. For a reader evaluating the prediction, the consensus figure is substantially more useful than the diluted average.</p>
<p>What separates this prediction from the geopolitical ones is the nature of the question. Oscar outcomes are determined by a known, finite voting body making a single decision. The information environment is dense (precursor awards, market odds, industry reporting) and the variables are well-understood. Geopolitical predictions involve open systems where external interventions, cascading failures, and unknown variables can override the base case.</p>
<p>The system appears to handle the structured, information-rich category more accurately. The conservative bias identified in the Iran and Noem evaluations may be specific to questions where tail risks and compounding factors are the dominant variables, rather than a universal problem with the aggregation method.</p>
<h2>What This Means for Perspectives</h2>
<p>Four resolved predictions aren't enough to draw firm conclusions about calibration, but the pattern across the evaluations published so far is becoming more defined.</p>
<p>On structured questions with rich information environments, the system produces probability estimates within a few percentage points of established prediction markets. On questions involving geopolitical disruption, external intervention, or compounding systemic risks, the system identifies the relevant variables but underweights their potential for interaction and cascade.</p>
<p>The plan is to integrate Polymarket tracking directly into the system, starting with manual comparison and progressing toward automated calibration. Each resolved prediction builds the dataset for per-persona accuracy tracking, which will eventually inform aggregation-layer adjustments based on demonstrated performance across different categories.</p>
<p>The next step is building a larger sample of resolved predictions, across both structured and open-ended categories, to confirm whether these patterns hold.</p>
<h2>Resolution Summary</h2>
<table>
<thead>
<tr>
<th></th>
<th>Prediction</th>
<th>Reality</th>
</tr>
</thead>
<tbody><tr>
<td>Consensus probability (7 personas)</td>
<td>~26%</td>
<td>Did not win</td>
</tr>
<tr>
<td>Polymarket probability (same day)</td>
<td>~29.5%</td>
<td>Did not win</td>
</tr>
<tr>
<td>Gap between systems</td>
<td>~3.5 points</td>
<td></td>
</tr>
<tr>
<td>Most confident (Chalamet wins)</td>
<td>The Contrarian: 52-58%</td>
<td>Wrong</td>
</tr>
<tr>
<td>Most confident (Chalamet loses)</td>
<td>The Risk Analyst: 10-20%</td>
<td>Correct</td>
</tr>
<tr>
<td>Debate winner</td>
<td>The Risk Analyst</td>
<td>Correct</td>
</tr>
</tbody></table>
<p>You can read the <a href="https://drive.google.com/file/d/1OGwxno58DZlPdtxpM1BL5jhlMG5wrQ-F/view?usp=sharing">full prediction report</a> generated by Perspectives and <a href="https://getperspectives.app/app/debate-692d6521-6ca1-456d-a1a1-4fe4c62fa1e0">the full debate</a>.</p>
<p>Read my previous breakdowns: predictions about <a href="https://blog.jmatthews.uk/three-predictions-on-iran">Iran</a> and the <a href="https://blog.jmatthews.uk/kristi-noem-prediction">Kristi Noem removal</a>.</p>
<p>Escape the echo chamber.</p>
]]></content:encoded></item><item><title><![CDATA[Developing a Writing Style with Claude (Update)]]></title><description><![CDATA[Update
As I've continued refining my approach to creating writing styles, I have created a skill that can be used within Claude. This means you don't have to do any prompting. Simply download the skil]]></description><link>https://blog.jmatthews.uk/developing-a-writing-style-with-claude</link><guid isPermaLink="true">https://blog.jmatthews.uk/developing-a-writing-style-with-claude</guid><category><![CDATA[writing style]]></category><category><![CDATA[claude.ai]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Tue, 17 Mar 2026 05:05:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/e0331e05-5016-460e-83a1-8bd3b9bdd2e2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Update</h2>
<p>As I've continued refining my approach to creating writing styles, I have packaged the process as a skill that can be used within Claude. This means you don't have to do any prompting. Simply download the skill <a href="https://drive.google.com/file/d/16_IBzXhPhw9wGA8tKJUMCls6m3YeZ5tm/view?usp=sharing">here</a>, open <a href="https://claude.ai">claude.ai</a>, click Customise -&gt; Skills -&gt; "+" (Create a new skill) -&gt; Upload a skill -&gt; Choose the .skill file you downloaded.</p>
<p>You can then invoke the skill in a new conversation and Claude will guide you through the entire process (I recommend you use Claude Opus 4.5 rather than Opus 4.6 or a Sonnet model).</p>
<img src="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/84821e95-c60e-4846-b5ac-8745bcc92a75.png" alt="" style="display:block;margin:0 auto" />

<p>Getting consistent, tolerable writing from an AI requires more precision than most people realise. I developed a method that produces substantially better results than the usual approach of describing your preferences or uploading samples. The core idea is to have Claude interview you with structured multiple-choice questions, rather than trying to articulate what you want from scratch.</p>
<p>This blog post is specific to Claude. Other LLMs like ChatGPT or Gemini aren't as strong at writing natural prose or at following nuanced, less prescriptive style guides. These are distinct capabilities, because generating natural prose and interpreting subtle stylistic rules both require a sensitivity to language that (much like in people) varies significantly between models.</p>
<h2>The Blank Page Problem</h2>
<p>Writing preferences are high-dimensional. They span sentence length, transition patterns, word choice, how to introduce technical features, how to handle limitations, what phrases feel natural, and dozens of other micro-decisions. Describing all of this in an instruction like "write in a professional but conversational tone" covers maybe 5% of the decisions that actually determine how the output reads.</p>
<p>The other difficulty is that the strongest preferences tend to be aversions. I didn't know the phrase "game changer" bothered me until Claude used it. These negative constraints are often the most important rules in a style guide, and they only surface when you encounter violations.</p>
<p>The standard approach (paste examples, describe the tone, iterate) works poorly because it relies on the user to identify and communicate preferences they haven't fully formed yet. The structured questioning method addresses this by having Claude surface the decisions for you.</p>
<h2>Structured Multi-Round Questioning</h2>
<p>Rather than describing what I wanted, I asked Claude to generate detailed multiple-choice questions about my preferences. I answered each with a letter and a confidence level (for example, "C, 80%").</p>
<p>The process ran across four rounds, each progressively more specific.</p>
<p><strong>Round 1</strong> covered the basics: tone, voice, structure, technical depth. Should the writing be matter-of-fact, personal, or system-focused? How direct should criticism of previous approaches be? Confidence levels mattered because some answers were clear (British English spelling, 100%) while others were ambiguous (how much implementation detail to include, 55%).</p>
<p><strong>Round 2</strong> went into sentence structure, word choice, paragraph length, and tone calibration. Claude generated questions I wouldn't have thought to ask. One example: "When something is genuinely exciting or valuable, how do you express that?" with options ranging from "just state it clearly, enthusiasm comes from the value itself" to "avoid evaluative language entirely." My answer (measured positive language, 80% confidence) was a preference I'd never articulated but quickly recognised as correct.</p>
<p><strong>Round 3</strong> was specifically about hunting down irritating patterns. I asked Claude to add a large section on specific wording and phrasing traps. This round covered tolerance ratings for transitions like "What's interesting is...", "The key difference is...", and "Worth noting:..." It also asked me to flag phrases I absolutely hate from a provided list. From 13 candidates, only four turned out to be genuine no-nos: "game changer," "powerful" (as in "powerful new feature"), "deep dive," and "level up." The rest were dislikes of varying intensity.</p>
<p><strong>Round 4</strong> was the most granular. Over 60 questions about specific decisions: self-reference style ("I added" versus "I built"), how to handle parenthetical clarifications, how to close sections, whether "this" and "these" are acceptable sentence openers. Each answered with a letter and confidence percentage.</p>
<h2>What Structured Questions Surface</h2>
<p>The value of this method is coverage, because Claude generates questions spanning dimensions of preference that don't naturally come to mind. Some examples of preferences I discovered through the questioning:</p>
<p>I have a strong preference for problem-solution framing over capability language. "Previously X was limited. The new approach addresses this..." rather than "This enables X." When presented with the options, the preference was obvious to me. I would never have articulated this as a rule unprompted.</p>
<p>I dislike "I built..." (too casual) but "I added..." feels fine. Subtle distinctions in self-reference that are invisible until someone asks you to choose between concrete alternatives.</p>
<p>After four rounds of questions, Claude produced a draft style guide. I found problems with it (it had added rules I didn't mention and excluded rules I specifically required), so you will likely need to edit and verify the guide manually.</p>
<h2>Beyond Writing</h2>
<p>The same method works for any domain where preferences are high-dimensional and partially tacit.</p>
<p>I used it for product design decisions when developing a design language for a desktop application. Multiple rounds of questions covered visual preferences: warm versus cool off-whites, elevation and depth, icon style, motion philosophy. The process surfaced preferences like 150-200ms animations over 300-400ms, the kind of decision that's difficult to specify upfront but obvious when presented as a concrete choice.</p>
<p>I also used it for naming. When evaluating product names, Claude structured the evaluation criteria: negative connotations in specific contexts, conflicts with existing products, cultural associations. The structured approach caught concerns that a casual brainstorm would have missed.</p>
<p>The underlying principle is the same. Preferences exist before you can articulate them. Structured questions force you to confront specific decisions. Your answers (especially confidence levels) reveal the preference landscape. The AI synthesises this into something actionable, and you correct the synthesis where it goes wrong. To <a href="https://www.goodreads.com/quotes/988332-some-people-say-give-the-customers-what-they-want-but">paraphrase Steve Jobs and Henry Ford: "People don't know what they want!"</a>.</p>
<h2>Why Confidence Levels Matter</h2>
<p>The confidence percentages carry valuable information. An answer at 55% confidence means the preference is weak. The guide should treat it as a soft default. An answer at 100% becomes non-negotiable.</p>
<p>Confidence levels also help when preferences are in tension. If problem-solution framing scores 75% confidence but varying structure naturally scores 80%, the framing preference is a tendency rather than a template. The higher-confidence rule takes priority in cases of conflict.</p>
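<p>One way this could operate mechanically is sketched below. The thresholds, banding, and conflict rule are my invention for illustration; the post doesn't specify an algorithm:</p>

```python
def strictness(confidence):
    # Hypothetical banding: 100% answers are non-negotiable, mid-range
    # answers are firm rules, low-confidence answers are soft defaults.
    if confidence >= 1.0:
        return "non-negotiable"
    return "firm rule" if confidence >= 0.7 else "soft default"

def resolve(conflicting_rules):
    # When two preferences collide, the higher-confidence one takes priority.
    return max(conflicting_rules, key=lambda r: r["confidence"])["rule"]

rules = [
    {"rule": "prefer problem-solution framing", "confidence": 0.75},
    {"rule": "vary structure naturally", "confidence": 0.80},
]
print(resolve(rules))    # vary structure naturally
print(strictness(0.55))  # soft default
```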
<p>Low-confidence answers from early rounds became refinement targets in later rounds because Claude uses them to decide where to probe further.</p>
<h2>Getting Started</h2>
<p>A few things that made this method work well:</p>
<ul>
<li><p>Ask Claude to generate many more questions than feels necessary - the precision came from answering over 60 specific ones across four rounds.</p>
</li>
<li><p>Answer with confidence levels. "B (75%)" carries significantly more information than just "B."</p>
</li>
<li><p>Commit to multiple rounds. The early rounds establish broad preferences while the later rounds target edge cases and irritants. You should plan for at least three rounds.</p>
</li>
<li><p>Use real output as a test. Apply the guide and look for violations. Each violation is a refinement opportunity.</p>
</li>
<li><p>Be precise about aversions. The style guide's most valuable section is the "phrases to avoid" list, because those rules prevent the most jarring output.</p>
</li>
<li><p>Correct specific misinterpretations. When the draft guide adds a rule you never stated or drops one you required, edit the guide directly rather than re-running the whole process.</p>
</li>
</ul>
<h2>What's Next</h2>
<p>I'm developing additional style guides for different contexts (an entertaining or narrative-oriented guide would require the same process with completely different answers). I'm also exploring whether this structured elicitation approach could work as a product feature. AI tools let you upload samples and describe preferences, but systematically helping users discover what they want is an underserved problem.</p>
]]></content:encoded></item><item><title><![CDATA[The Kristi Noem Forecast: How AI Underpriced Transactional Loyalty]]></title><description><![CDATA[On January 30, 2026, I ran a prediction through Perspectives: "Kristi Noem out by March 31?"
The system returned a 23% probability. Eight forecasting personas debated the question, interrogated each o]]></description><link>https://blog.jmatthews.uk/kristi-noem-prediction</link><guid isPermaLink="true">https://blog.jmatthews.uk/kristi-noem-prediction</guid><category><![CDATA[Kristi Noem]]></category><category><![CDATA[Predictions]]></category><category><![CDATA[predictive modelling]]></category><category><![CDATA[perspectives]]></category><category><![CDATA[development]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Thu, 05 Mar 2026 22:10:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/bb8a29ef-40c5-4591-b4d9-c7d3a95a0001.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On January 30, 2026, I ran a prediction through Perspectives: "Kristi Noem out by March 31?"</p>
<p>The system returned a 23% probability. Eight forecasting personas debated the question, interrogated each other's reasoning, voted, and produced a report concluding that Noem would almost certainly survive (at least until March 31st). The Insider won the debate with the lowest estimate of all eight personas (5-15%), arguing that Trump's public loyalty and Noem's policy alignment made removal implausible.</p>
<p>On March 5, 34 days later, Trump fired Noem as Secretary of Homeland Security.</p>
<p>This is the second prediction I've been able to evaluate against reality (the first covered three Iran-related forecasts published in a <a href="https://blog.jmatthews.uk/three-predictions-on-iran">previous post</a>). The pattern is becoming clearer, and it points to a specific structural weakness in how the system aggregates predictions.</p>
<img src="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/b6dcd20e-9e45-439c-87d0-a7f384d6aa02.png" alt="Screenshot of the front cover of a Prediction Report generated by Perspectives" style="display:block;margin:0 auto" />

<h2>What Actually Happened</h2>
<p>The system ran its analysis on January 30. At that point, Noem was under pressure from the Minneapolis shooting controversy (where two U.S. citizens were murdered by federal agents during an immigration operation) and facing bipartisan Senate criticism. Trump had publicly defended her just three days earlier, on January 27, saying she was doing a "very good job."</p>
<p>Over the following five weeks, the situation escalated. Noem was called to testify before both the Senate and House Judiciary Committees in early March (just a few days ago at the time of writing this post). During those hearings, she faced hostile questioning from both parties over immigration enforcement tactics, a $220 million ad campaign featuring herself, and allegations that her department had obstructed the Inspector General's office. Trump was reportedly "incensed" by her performance during the hearings.</p>
<p>On March 5, Trump announced Noem's removal and named Senator Markwayne Mullin as her replacement, effective March 31. An administration official cited "a culmination of her many unfortunate leadership failures" including the Minneapolis fallout, the ad campaign, allegations of infidelity, staff mismanagement, and feuding with other agency heads. She was offered a consolation role as Special Envoy for a new Western Hemisphere security initiative.</p>
<h2>What the System Predicted</h2>
<p>The eight Forecaster personas produced a wide range of estimates. Seven of the eight clustered between 10% and 25%. The Risk Analyst was the clear outlier at 65-80%.</p>
<table>
<thead>
<tr>
<th>Persona</th>
<th>Probability Range</th>
<th>Midpoint</th>
</tr>
</thead>
<tbody><tr>
<td>The Risk Analyst</td>
<td>65-80%</td>
<td>72%</td>
</tr>
<tr>
<td>The Trend Analyst</td>
<td>15-25%</td>
<td>20%</td>
</tr>
<tr>
<td>The Scenario Planner</td>
<td>15-25%</td>
<td>20%</td>
</tr>
<tr>
<td>The Systems Thinker</td>
<td>10-25%</td>
<td>18%</td>
</tr>
<tr>
<td>The Base Rate Analyst</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Contrarian</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Sceptic</td>
<td>10-20%</td>
<td>15%</td>
</tr>
<tr>
<td>The Insider</td>
<td>5-15%</td>
<td>10%</td>
</tr>
</tbody></table>
<p>The aggregate of 23% reflected the strong consensus among the majority. The Risk Analyst's high estimate pulled the average up, but the weight of agreement pushed the final number firmly into "unlikely" territory.</p>
<h2>Where the System Got It Right</h2>
<p>The system identified nearly every factor that contributed to Noem's removal.</p>
<p>The Risk Analyst flagged the Minneapolis shooting as a "systemic liability" and argued that operational failures would make Noem "politically toxic." That assessment proved accurate. The Scenario Planner mapped three branching paths and acknowledged that a "political pivot" scenario (where Trump cuts losses to redirect the news cycle) was plausible. The Contrarian identified Trump's "transactional loyalty" as a key variable, noting he could turn on allies when they became liabilities, a pattern Trump has demonstrated many times. Several personas identified the potential for Senate criticism to work against Noem, which is exactly how the final weeks played out.</p>
<p>The system's "What Would Change These Predictions" section is almost a checklist for what happened: Trump publicly distancing himself, media narrative escalation, internal White House frustration with Noem, and policy clashes between Noem and the administration (specifically, Noem telling Congress that Trump had approved the ad campaign, which Trump then publicly denied).</p>
<img src="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/eb9cae8d-6ba5-48b2-bb68-eed220d86591.png" alt="Screenshot of the &quot;Conditions Increasing Confidence in Noem's Departure&quot; section from the original prediction report" style="display:block;margin:0 auto" />

<h2>Where the System Got It Wrong</h2>
<p>The system correctly identified the variables. It failed to weight them properly.</p>
<p>Seven of eight personas treated Trump's January 27 public defence as a strong protective signal. "When Trump steps in front of a camera and explicitly says someone is doing a 'very good job' and 'won't step down' amid a PR crisis, he is telling us he is not looking for an exit," The Insider argued. This reasoning won the debate. It was also wrong.</p>
<p>That statement remained valid for only around five weeks. The system treated it as a durable indicator of loyalty. In reality, it was a snapshot of a position that quickly shifted. The Scenario Planner actually conceded this vulnerability during the interrogation phase, admitting that anchoring to a 72-hour-old statement to predict a 60-day horizon was "false precision." But this concession didn't move the broader consensus.</p>
<p>The second failure is more systemic. The system correctly identified that Senate criticism could escalate into a genuine threat, but most personas dismissed it as "political theatre" or "noise." The Base Rate Analyst argued that Senate criticism alone had rarely forced a Cabinet departure. The Contrarian argued it would cause Trump to "dig in" on loyalty. Both assessments were wrong. The Senate hearings in March appear to have been the proximate trigger for Noem's removal.</p>
<h2>The Outlier Was Right</h2>
<p>The Risk Analyst predicted a 65-80% probability of Noem's departure. This was the only estimate in the correct range.</p>
<p>The Risk Analyst's core argument was that the "covariance of risks" (multiple overlapping problems arriving simultaneously) made Noem's position fragile, and that Trump's public defence was a "lagging indicator," a temporary holding pattern while political damage was assessed. This assessment was correct. The convergence of the Minneapolis fallout, the ad campaign controversy, the Inspector General obstruction allegations, and the disastrous congressional hearings created precisely the kind of multi-factor collapse the Risk Analyst described.</p>
<p>During the interrogation phase, three challengers tested the Risk Analyst's reasoning. Two challenges resulted in disputed verdicts, meaning the challengers and the Risk Analyst couldn't reach agreement. One challenge was defended outright. The Risk Analyst held their position under pressure. The rest of the system voted against them anyway.</p>
<h2>Calibration Patterns</h2>
<p>This is the second resolved prediction where the same pattern appears. In the Iran predictions, cautious analyses won votes while aggressive predictions proved more accurate. The system correctly identified key variables but underweighted their potential impact and failed to model how they could interact.</p>
<p>Specifically, two recurring issues:</p>
<p>The system underweights compounding risk. When multiple negative factors exist simultaneously, the probability of removal increases faster than a simple addition of individual risks would suggest. The Minneapolis shooting alone might not have been enough. The ad campaign alone might not have been enough. The congressional hearings alone might not have been enough. Together, they created a cascade. The Risk Analyst modelled this interaction. The other seven personas treated each factor more or less independently.</p>
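<p>To make the distinction concrete, here is a minimal Python sketch. The probabilities are purely illustrative, and the interaction model is a toy (a flat odds multiplier per co-occurring factor), not the Risk Analyst's actual method:</p>

```python
from functools import reduce

def union_independent(ps):
    """P(at least one factor forces removal), treating factors as independent."""
    return 1 - reduce(lambda acc, p: acc * (1 - p), ps, 1.0)

def compounded(ps, interaction=1.5):
    """Toy covariance model: each additional co-occurring factor multiplies
    the combined odds by `interaction` beyond what independence implies."""
    base = union_independent(ps)
    odds = base / (1 - base)
    odds *= interaction ** max(0, len(ps) - 1)
    return odds / (1 + odds)

# Invented standalone risks for the shooting, ad campaign, and hearings
factors = [0.10, 0.10, 0.15]
print(union_independent(factors))  # ~0.31: factors treated separately
print(compounded(factors))         # ~0.50: factors amplifying each other
```

<p>The point is structural rather than numeric: a persona that models each factor in isolation caps out near the independent union, while a persona that models their covariance lands substantially higher.</p>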
<p>The voting mechanism amplifies consensus. When seven personas agree on a low probability and one persona disagrees, the STV voting system produces a result that reflects the majority view. The Insider won because most personas found the "Trump loyalty" argument convincing. The Risk Analyst's counterargument, that loyalty has an expiry date, was structurally disadvantaged in the voting. This is a known design tension: the voting system is meant to surface the most convincing argument, but "convincing" and "accurate" can diverge, particularly when the majority shares the same analytical blind spot.</p>
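<p>The amplification is easy to see in a toy instant-runoff election (a single-winner relative of STV; this is a sketch, not the system's actual implementation):</p>

```python
from collections import Counter

def instant_runoff(ballots):
    """Repeatedly eliminate the candidate with the fewest first preferences
    until someone holds a strict majority of remaining ballots."""
    candidates = {c for ballot in ballots for c in ballot}
    while True:
        tally = Counter(ballot[0] for ballot in ballots if ballot)
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > sum(tally.values()):
            return leader
        loser = min(candidates, key=lambda c: tally.get(c, 0))
        candidates.discard(loser)
        ballots = [[c for c in ballot if c != loser] for ballot in ballots]

# Seven personas rank the consensus view first; one ranks the outlier first.
ballots = [["Insider", "RiskAnalyst"]] * 7 + [["RiskAnalyst", "Insider"]]
print(instant_runoff(ballots))  # "Insider": the 7-1 majority decides round one
```

<p>However strong the outlier's argument, a 7-1 first-preference split ends the election in the first round. Accuracy never enters the tally; only persuasion does.</p>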
<h2>What This Means for Perspectives</h2>
<p>These results reinforce a finding from the Iran retrospective: prediction accuracy may benefit from mathematical corrections in the aggregation layer. The persona prompts should remain unchanged. The personas identified the right factors and the right risks. The aggregation underweighted the outlier.</p>
<p>The likely improvement path is to track persona-level accuracy over time and adjust weighting based on demonstrated performance in specific types of predictions. For example, if the Risk Analyst consistently outperforms on political removal predictions, the system should weight their estimates more heavily in that category.</p>
<p>The practical next step is building enough resolved predictions to calculate per-persona Brier scores (a standard accuracy metric for probabilistic forecasts) and establishing whether this pattern holds across a larger sample. Two predictions showing the same pattern is suggestive rather than definitive.</p>
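<p>The Brier score itself is simple to compute. A minimal sketch, using this prediction's ranges collapsed to midpoints (a real scoreboard would need many resolved questions per persona):</p>

```python
def brier_score(forecasts):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    Lower is better; always guessing 50% scores 0.25."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Midpoints of each persona's range on the Noem question (outcome = 1, departed)
risk_analyst = [(0.725, 1)]   # 65-80% departure estimate
insider      = [(0.10, 1)]    # 5-15% departure estimate
print(brier_score(risk_analyst))  # ~0.076
print(brier_score(insider))       # ~0.81
```

<p>On a single question the gap is stark, but one resolution tells you little; the value of the metric comes from averaging over a large sample of resolved predictions.</p>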
<h2>Resolution Summary</h2>
<table>
<thead>
<tr>
<th></th>
<th>Prediction</th>
<th>Reality</th>
</tr>
</thead>
<tbody><tr>
<td>Aggregate probability of departure</td>
<td>23%</td>
<td>Departed (March 5)</td>
</tr>
<tr>
<td>Most confident persona (retention)</td>
<td>The Insider: 5-15%</td>
<td>Wrong</td>
</tr>
<tr>
<td>Most confident persona (departure)</td>
<td>The Risk Analyst: 65-80%</td>
<td>Closest to reality</td>
</tr>
<tr>
<td>Debate winner</td>
<td>The Insider</td>
<td>Wrong</td>
</tr>
<tr>
<td>Key factors identified</td>
<td>Yes (all major factors present)</td>
<td>Confirmed</td>
</tr>
<tr>
<td>Factor weighting</td>
<td>Underweighted</td>
<td>Confirmed pattern</td>
</tr>
<tr>
<td>Time to resolution</td>
<td>34 days from prediction</td>
<td>Within timeframe</td>
</tr>
</tbody></table>
<p>You can read the <a href="https://drive.google.com/file/d/1A_mvCbC4QRAHS8y2-tX_RBC2QHCcpwtX/view?usp=sharing">full prediction report</a> generated by Perspectives and <a href="https://getperspectives.app/app/debate-90b5931c-cd8b-4419-a02d-a5207a2f0231">the full debate</a>.</p>
<p>Read my <a href="https://blog.jmatthews.uk/three-predictions-on-iran">previous breakdown of the predictions generated by Perspectives about Iran</a>.</p>
<p><strong>Escape the echo chamber.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Three Predictions on Iran: What Perspectives Got Right, Got Wrong, and Couldn't See Coming]]></title><description><![CDATA[On January 28, 2026, we ran three related predictions through Perspectives, a multi-agent forecasting system that uses AI personas to debate the likelihood of future events. The questions were all foc]]></description><link>https://blog.jmatthews.uk/three-predictions-on-iran</link><guid isPermaLink="true">https://blog.jmatthews.uk/three-predictions-on-iran</guid><category><![CDATA[iran]]></category><category><![CDATA[Predictions]]></category><category><![CDATA[predictive modelling]]></category><category><![CDATA[perspectives]]></category><category><![CDATA[development]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Sun, 01 Mar 2026 23:51:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/682cf17373b743844d62d0ab/115a1e31-0333-4626-a619-a06a3733fc66.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On January 28, 2026, we ran three related predictions through Perspectives, a multi-agent forecasting system that uses AI personas to debate the likelihood of future events. The questions were all focused on Iran over a short time horizon:</p>
<ol>
<li><p>Will Khamenei be out as Supreme Leader of Iran by March 31?</p>
</li>
<li><p>Will Israel strike Iran by March 31, 2026?</p>
</li>
<li><p>Will the Iranian regime fall by March 31?</p>
</li>
</ol>
<p>Exactly one month later, on February 28, a joint US-Israeli military operation struck Iran. Supreme Leader Ayatollah Ali Khamenei was killed in an Israeli airstrike on his Tehran compound. Iranian state media confirmed his death the following day.</p>
<p>Two of the three questions have now resolved. This article examines how the system handled each prediction: what the personas argued, where they agreed and disagreed, what reasoning held up, and where it failed. The goal is transparency about the system's current capabilities and limitations, with a focus towards improving calibration.</p>
<h2>How the System Works</h2>
<p>Each prediction ran through Perspectives' forecasting pipeline using the Forecaster persona set: eight analytically distinct personas, each approaching the question from a different angle. The personas are: <em>The Base Rate Analyst</em>, <em>The Contrarian</em>, <em>The Insider</em>, <em>The Risk Analyst</em>, <em>The Scenario Planner</em>, <em>The Sceptic</em>, <em>The Systems Thinker</em>, and <em>The Trend Analyst</em>.</p>
<p>The workflow proceeds in phases. First, the system conducts background research, running web searches to establish shared factual context. Individual personas then run additional searches to gather evidence for their own arguments.</p>
<p>After research, each persona independently writes a blind proposal containing their analysis and a probability estimate. Blind proposals are critical to the system's design. No persona sees another's work before committing their own position. This prevents early arguments from pulling everyone else toward the same conclusion, and forces analytical diversity.</p>
<p>Once proposals are locked in, the system enters structured interrogation. Each persona has their analysis challenged three times by other personas, who probe for weaknesses: whether their confidence is well-calibrated, whether they've accounted for unlikely scenarios, whether historical patterns support their reasoning, and whether events might unfold differently than they expect. Each challenge receives a response from the original author, who can either concede the point or defend it. If they defend, the challenger decides whether the defence held up.</p>
<p>Finally, the personas vote using ranked-choice balloting to select the most convincing overall analysis. The system aggregates all probability estimates into a final prediction.</p>
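<p>The article does not specify the exact aggregation formula, but a simple version (averaging the midpoint of each persona's stated range) illustrates the mechanics. The ranges below are invented to resemble the spreads discussed later:</p>

```python
def aggregate(ranges):
    """Average the midpoint of each persona's (low, high) probability range."""
    midpoints = [(lo + hi) / 2 for lo, hi in ranges]
    return sum(midpoints) / len(midpoints)

# Eight illustrative persona ranges: six clustered low, two high outliers
ranges = [(0.05, 0.15), (0.10, 0.20), (0.15, 0.25), (0.10, 0.20),
          (0.05, 0.25), (0.20, 0.30), (0.65, 0.80), (0.80, 0.90)]
print(aggregate(ranges))  # ~0.32: outliers pull the mean up, but only so far
```

<p>A flat mean like this gives every persona equal pull, which matters for the calibration discussion that follows: two high outliers among eight voices can only move the aggregate so much.</p>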
<hr />
<h2>Debate 1: Khamenei Out as Supreme Leader by March 31?</h2>
<p><strong>Aggregate prediction:</strong> 30% likelihood of Khamenei leaving office by March 31, 2026.</p>
<p><strong>Actual outcome:</strong> Khamenei was killed on February 28, 2026, in an Israeli airstrike on his Tehran compound. Confirmed by Iranian state media on March 1. This question resolves YES.</p>
<h3>What the Personas Argued</h3>
<p>The system ran twelve web searches for this debate, covering Khamenei's health reports, succession planning, and the constitutional provisions for removing a Supreme Leader.</p>
<p>Most personas clustered between 5% and 25%, predicting Khamenei would remain in office. Their reasoning converged on a few key points. The historical record showed only one leadership transition in 46 years. The succession deadlock around Khamenei's son Mojtaba (who lacks the required clerical rank for the role) meant the regime had every incentive to prop up the existing leader. And there was evidence of continued agency: CNN reported on January 17 that Khamenei had publicly addressed the nation. The Sceptic put it plainly: "You do not give speeches ordering security forces to crush dissent if you are on a ventilator."</p>
<p>The Trend Analyst (10-20%), who eventually won the ranked-choice vote, reinforced this reading. The scale of the crackdown (reports of up to 30,000 protesters killed) served as evidence of operational control: orchestrating violence at that scale requires active leadership.</p>
<p>The Insider (15-25%) reported that contacts in Tehran suggested a shift to "managed stasis," with Khamenei's schedule being planned months in advance. During interrogation, The Sceptic caught an inconsistency: the Insider's written argument implied near-certainty that Khamenei would stay, yet their probability range (15-25% chance of leaving) only represented 75-85% confidence. The Insider conceded the error. This is a case where the interrogation process caught and corrected a persona's own internal contradiction.</p>
<p>Two personas predicted much higher probabilities. The Risk Analyst (65-80%) argued that Khamenei's July 2025 public appearance was staged to prove he survived the Twelve-Day War, and that after it, he vanished again for seven months. They cited reports suggesting Khamenei was in a coma. The Systems Thinker (80-90%) went further, stating outright: "Biology is linear and unforgiving; at 86, he's likely dead or incapacitated." Yet their written argument also claimed the system would never <em>declare</em> him out because doing so would trigger the succession crisis. Their probability captured the biological reality, while their reasoning pointed in the opposite direction. This tension reveals a limitation in how the system reconciles numbers with reasoning.</p>
<h3>The Interrogation</h3>
<p>The debate generated 24 challenges: 4 concessions, 3 defences, and 17 disputes.</p>
<p>The most productive challenge came from <strong>The Insider targeting The Sceptic</strong>. The Insider argued that the January 17 televised address lacked the spontaneity of a live broadcast and that operational orders were bypassing Khamenei's office, flowing directly through his son and the IRGC (Iran's military-political force). The Sceptic conceded, acknowledging they had treated the televised address as solid evidence when, in a regime known for media manipulation, that was an assumption.</p>
<p><strong>The Scenario Planner challenged The Base Rate Analyst</strong> on their treatment of mortality risk. Dividing an annual mortality rate by six assumes risk is spread evenly across the year, but medical emergencies are sudden. An overnight stroke doesn't care about averages. The Base Rate Analyst conceded the blind spot.</p>
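<p>The arithmetic behind the concession is worth spelling out. Even the more careful constant-hazard conversion (sketched below, with an invented annual figure) still assumes risk is spread evenly through the year:</p>

```python
import math

def window_probability(annual_p, days):
    """Convert an annual event probability to a shorter window, assuming a
    constant hazard rate (i.e. risk spread evenly through the year)."""
    hazard = -math.log(1 - annual_p)            # implied annual hazard rate
    return 1 - math.exp(-hazard * days / 365)

annual_mortality = 0.30   # illustrative figure, not a real actuarial estimate
print(annual_mortality / 6)                      # the naive division
print(window_probability(annual_mortality, 60))  # constant-hazard, 60 days
```

<p>Both come out near 5-6%, which is the Scenario Planner's point: the flaw is not the division itself but the uniformity assumption baked into both methods. An overnight stroke concentrates the year's hazard into a single day.</p>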
<p><strong>The Systems Thinker faced three challenges and failed to respond to any of them</strong>, triggering content filter errors on all three attempts. This meant one of the two highest-probability personas (80-90%) never had their reasoning stress-tested. In a debate that ultimately resolved in the direction their probability indicated, that gap matters. If their reasoning had been interrogated and survived, it would have pulled the aggregate upward. If it had been interrogated and conceded, the aggregate would have better reflected the consensus.</p>
<h3>The Vote</h3>
<p>The ranked-choice vote eliminated personas over eight rounds. <strong>The Trend Analyst won</strong> in the final round. Their argument, that the crackdown evidenced continued operational control and the succession deadlock preserved the status quo, was selected as the most convincing analysis.</p>
<h3>What Actually Happened</h3>
<p>The Trend Analyst's victory, and the aggregate prediction of 30%, reflected a reasonable assessment of the information available on January 28. Most personas correctly identified the succession deadlock, IRGC loyalty, and regime survival instinct as stabilising factors. These were sound analytical inputs for the scenarios they modelled.</p>
<p>What the system could not account for was outside intervention at the scale that occurred. The Scenario Planner explicitly modelled assassination or military strike and assigned it only 5%, noting that "direct strikes on the Supreme Leader are extreme escalation risks that even Israel has historically avoided." This reasoning was defensible on January 28. The joint US-Israeli operation on February 28 was, by historical standards, unprecedented.</p>
<p>The Risk Analyst and Systems Thinker assigned higher probabilities (65-80% and 80-90%), but both were modelling biological collapse, not military action. The Risk Analyst's reasoning about Khamenei being in a coma was wrong about the <em>how</em>, even if their probability was closer to the actual outcome.</p>
<hr />
<h2>Debate 2: Israel Strikes Iran by March 31, 2026?</h2>
<p><strong>Aggregate prediction:</strong> 44% likelihood of Israel striking Iran by March 31, 2026.</p>
<p><strong>Actual outcome:</strong> On February 28, 2026, the United States and Israel launched a joint military operation against Iran, including Israeli missile strikes across Tehran and other locations. This question resolves YES.</p>
<p>This debate had the most formally specified resolution criteria of the three, defining a qualifying strike as "the use of aerial bombs, drones or missiles launched by Israeli military forces that impact Iranian ground territory or any official Iranian embassy or consulate." Intercepted missiles would not qualify.</p>
<h3>What the Personas Argued</h3>
<p>The system ran sixteen web searches for this debate, the highest of the three, covering Israeli emergency munitions procurement, Netanyahu's January 2026 statements, US carrier deployment, and Iran's nuclear breakout timeline.</p>
<p>The debate split into two clear camps: those who thought a strike was likely, and those who thought it was unlikely.</p>
<p><strong>Personas predicting a strike was likely (above 50%):</strong></p>
<p>The Insider (80-90%) pointed to a $183 million emergency procurement contract with Elbit Systems signed on January 27, Netanyahu's warning of force Iran "has never seen," and the arrival of the USS Abraham Lincoln carrier strike group. When challenged that procurement lead times made imminent use unlikely, The Insider countered: "You're reading the contract; I'm watching the loading docks. This is 'emergency procurement,' which draws down existing stockpiles immediately."</p>
<p>The Systems Thinker (70-85%) argued that the June 2025 Operation Rising Lion had proven the military option worked, creating a self-reinforcing pattern: successful action reduces the perceived cost of future action. The Contrarian (60-75%) agreed, arguing the June operation was a "proof of concept" and that with Iranian air defences degraded and enrichment continuing, the strategic logic had shifted from deterrence to active degradation.</p>
<p><strong>Personas predicting a strike was unlikely (below 30%):</strong></p>
<p>The Sceptic (15-25%) argued from post-conflict exhaustion: seven months is not enough to rebuild after a full campaign. They characterised Netanyahu's rhetoric as standard deterrence posturing. The Trend Analyst (15-25%) offered a different reading of the carrier deployment: rather than signalling an imminent attack, the carrier served as a security buffer that reduced the need for Israeli military action. The Base Rate Analyst (10-25%) pointed out that historically, countries very rarely launch a second major air campaign within months of the first.</p>
<p>The Risk Analyst (35-45%) fell between the two camps, acknowledging the conditions for a strike while noting that conditions and decisions are different things. They were the only persona to successfully defend all three of their interrogation challenges.</p>
<h3>The Interrogation</h3>
<p>This debate produced the highest tension of the three: 24 challenges with 0 concessions, 9 defences, and 15 disputes. Zero concessions is significant. No persona was willing to admit a fundamental flaw in their reasoning. The result was deeply entrenched disagreement, which the aggregate (44%) reflects.</p>
<p>The most revealing exchange came between <strong>The Sceptic and The Contrarian</strong>. The Contrarian challenged The Sceptic's assumption that stability would hold, arguing they were confusing practical constraints with political ones: "Netanyahu might strike because he has nothing left to lose." The Sceptic drew a distinction between political <em>intent</em> and military <em>capability</em>, arguing that even if Netanyahu wanted to strike, material constraints imposed a real floor on the timeline. The verdict was "disputed," which in hindsight captures the genuine uncertainty well: the political intent was there, and the capability turned out to be sufficient, though it required a joint operation with the United States, something the debate never modelled.</p>
<h3>The Vote</h3>
<p><strong>The Sceptic won</strong> the ranked-choice vote, collecting support from the more cautious personas. This is notable: the system's voting process selected the analysis arguing against a strike, and the strike happened.</p>
<h3>What Actually Happened</h3>
<p>The aggregate of 44% was closer to reality than most individual estimates, but the mechanism was wrong. No persona modelled a joint US-Israeli operation. The debate framed the question as an Israeli decision constrained by logistics, US diplomatic pressure, and political capital. The actual event was a coordinated campaign with explicit US participation, which removed the capability constraints several personas relied on.</p>
<p>The Insider's emphasis on procurement signals and the carrier deployment proved directionally correct, though even they framed it as US support enabling an Israeli decision, not a jointly planned operation. The Trend Analyst's reading of the carrier as a dampener turned out to be wrong.</p>
<p>The Sceptic's argument about post-conflict exhaustion was reasonable on its own terms: Israel alone might struggle to mount another operation so quickly after June 2025. What they missed was that US participation removed that constraint entirely.</p>
<hr />
<h2>Debate 3: Will the Iranian Regime Fall by March 31?</h2>
<p><strong>Aggregate prediction:</strong> 16% likelihood of regime collapse by March 31, 2026.</p>
<p><strong>Actual outcome:</strong> As of March 1, 2026, the regime has not fallen. A three-person interim leadership council (consisting of the president, the chief of the judiciary, and a jurist of the Guardian Council) has assumed temporary authority following Khamenei's death. The IRGC chain of command appears to have been severely damaged, with reports indicating the defence minister, the commander of the Revolutionary Guard Corps, and the secretary of the Iranian Security Council were killed in the strikes. The situation remains deeply unstable and this question is unresolved.</p>
<h3>What the Personas Argued</h3>
<p>This debate ran the most web searches of the three (twenty-one), reflecting the breadth of variables involved: economic data, protest dynamics, IRGC cohesion, succession planning, and historical patterns of regime collapse.</p>
<p>Every persona agreed on a single pivot point: the cohesion of the IRGC. Without a fracture in the security forces, the regime would survive regardless of other pressures. They diverged on how likely that fracture was.</p>
<p>Most personas clustered between 1% and 25%. The Base Rate Analyst (1-5%) presented the lowest estimate: once the regime demonstrated it could carry out mass killings without internal defections, the historical precedent for a 60-day collapse drops to near zero. The Contrarian (10-20%) argued that protesters lack weapons and two months is too short for uprisings to topple a military-backed authoritarian regime. The Insider (10-20%) provided the most detailed picture: capital flight patterns (private jet traffic at three-year highs from Tehran to Dubai and Istanbul), IRGC command centres relocated to Qom, and Basij paramilitary units experiencing delayed pay and food shortages. They argued the regime was in a "bloody stabilisation" phase that could sustain it for months or years.</p>
<p>The Risk Analyst (15-25%) focused on the Supreme Leader as a single point of failure: "In a system built around absolute authority, the death of the centre creates a vacuum that fills with chaos faster than the guards can contain it." They successfully defended this position against The Systems Thinker's challenge, arguing that the IRGC could only self-correct if there was a clear institution or successor to rally around, and there wasn't one.</p>
<p>The Trend Analyst (25-40%) assigned the highest probability, citing the accelerating pace of deterioration: currency collapse, escalating violence, and reports of IRGC war fatigue. They argued that historical comparisons fail when the rate of change is this extreme.</p>
<h3>The Interrogation</h3>
<p>Twenty-four challenges produced 3 concessions, 5 defences, and 16 disputes. The concessions show the system correcting its own errors in real time.</p>
<p><strong>The Scenario Planner conceded to The Sceptic</strong> that a single public appearance by Khamenei was insufficient to justify near-certainty in regime survival. <strong>The Systems Thinker conceded to The Sceptic</strong> that their model was fragile: if the reported death toll represented decentralised panic rather than deliberate strategy, it would actually indicate a breakdown in command, not evidence of control. <strong>The Sceptic conceded to The Base Rate Analyst</strong> that they had failed to anchor against the broader historical record of authoritarian survival.</p>
<h3>The Vote</h3>
<p><strong>The Sceptic won</strong> the ranked-choice vote for the second time across the three debates, collecting broad support for their balanced assessment of threat and resilience.</p>
<h3>What Is Happening Now</h3>
<p>The regime has not fallen as of this writing. However, the conditions have shifted dramatically. The joint US-Israeli strikes killed Khamenei, the defence minister, and the commander of the Revolutionary Guard Corps, the very institution every persona identified as the key variable. The operation explicitly aimed at regime change, with President Trump urging Iranians to "take over your government" and Netanyahu stating the goal was to "create the conditions for the brave Iranian people to take their destiny into their own hands."</p>
<p>The Risk Analyst's framing of the Supreme Leader as a "single point of failure" has been directly tested. The Insider's observations about capital flight and parallel command structures now read as prescient context for the current succession crisis. The Contrarian's argument that protesters "lack guns" remains structurally true, but external military intervention has introduced a variable none of the personas modelled: the systematic destruction of the regime's military command structure from outside.</p>
<p>Whether the regime survives in some form, transitions to military rule, or collapses entirely remains to be seen. The 16% aggregate may prove to have been too low, but as of March 1, the regime's formal institutions still exist, even if severely damaged.</p>
<hr />
<h2>Calibration Analysis: Patterns Across All Three Debates</h2>
<p>A single prediction resolving one way or another does not tell you whether a system is well-calibrated. A system that predicts 30% should see the event happen roughly 30% of the time. You cannot evaluate that from one instance: a 30% event happening is not evidence of a mistake. The value of examining these predictions lies in the reasoning patterns, not in grading individual numbers against outcomes.</p>
<p>That said, three related predictions on the same geopolitical situation do reveal structural tendencies in how the system reasons.</p>
<h3>Pattern 1: The System Systematically Underweighted Outside Intervention</h3>
<p>Across all three debates, the dominant analytical framework was internal dynamics: IRGC cohesion, succession planning, economic pressure, protest momentum, biological health. The possibility of a large-scale external military operation was discussed but consistently pushed to the edges of the probability distribution. The Scenario Planner gave assassination or military strike only 5% in the Khamenei debate. No persona in the Israel strikes debate modelled joint US-Israeli operations. The regime fall debate never considered a scenario where external strikes would destroy the military command structure.</p>
<p>The system's personas analyse situations from within their respective frameworks, and those frameworks tend to assume that the actors under study are the primary drivers of outcomes. When the decisive actor is external and operating at a scale that redefines the situation entirely, the system underweights it.</p>
<p>A possible improvement: a dedicated challenge dimension in the interrogation protocol that forces each persona to stress-test their analysis against outside interventions, regardless of how unlikely they seem.</p>
<h3>Pattern 2: The Base Rate Analyst Consistently Provided a Floor, Not a Forecast</h3>
<p>Across all three debates, The Base Rate Analyst assigned the lowest probabilities: 5-12% for Khamenei's departure, 10-25% for an Israeli strike, and 1-5% for regime collapse. Their methodology (start with the historical frequency of similar events, then adjust for current specifics) produced conservative estimates that consistently undershot the actual outcomes.</p>
<p>The interrogation protocol exposed a core weakness twice. The Scenario Planner and The Risk Analyst both forced concessions by pointing out that statistical averages break in the presence of sudden events. Dividing an annual mortality rate by six assumes risk is spread evenly across the year. Medical emergencies and missile strikes don't work that way.</p>
<p>This suggests the system would benefit from tracking each persona's accuracy over time. If The Base Rate Analyst consistently anchors too low on questions involving potential sudden changes, the aggregation methodology could account for that.</p>
<h3>Pattern 3: The Sceptic Won the Vote, But Higher Estimates Were Closer</h3>
<p>In the Khamenei and Israel strikes debates, The Sceptic or Trend Analyst won the ranked-choice vote with moderate-to-low probability estimates. The personas with higher estimates (The Insider, The Risk Analyst, The Systems Thinker) were closer to the actual outcomes in probability terms, but their reasoning about <em>mechanisms</em> was often wrong (biological collapse rather than assassination, a unilateral Israeli strike rather than a joint operation).</p>
<p>This creates a puzzle. The voting mechanism rewards the most <em>defensible</em> analysis, not the most <em>accurate</em> prediction. The Sceptic's arguments were well-constructed, evidence-based, and logically coherent. The Insider's arguments relied on unverifiable intelligence claims and aggressive extrapolation. In any given instance, the Sceptic's methodology is more robust. Over time, though, if cautious methodologies consistently win votes while more aggressive ones consistently get closer to outcomes, the voting process may be introducing a conservative bias to the final predictions.</p>
<p>A possible improvement: decoupling the "most convincing analysis" vote from the probability aggregation. The vote could assess reasoning quality while probability estimates are weighted independently, perhaps factoring in each persona's historical accuracy.</p>
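<p>A hypothetical sketch of what that decoupling could look like: the vote still picks the best-reasoned analysis, while the numeric aggregate weights each persona by the inverse of its historical Brier score. All names and numbers below are illustrative, not the system's actual data:</p>

```python
def weighted_aggregate(estimates, brier_history):
    """Weight each persona's probability estimate by the inverse of its
    historical Brier score, so better-calibrated personas count for more."""
    weights = {p: 1 / max(brier_history[p], 1e-6) for p in estimates}
    total = sum(weights.values())
    return sum(estimates[p] * weights[p] / total for p in estimates)

estimates = {"Sceptic": 0.20, "RiskAnalyst": 0.725, "Insider": 0.85}
history   = {"Sceptic": 0.20, "RiskAnalyst": 0.08, "Insider": 0.30}
print(weighted_aggregate(estimates, history))  # pulled toward the Risk Analyst
```

<p>Under a scheme like this, the Sceptic could still win the reasoning vote without dragging the aggregate toward caution, because the probability layer answers to track record rather than persuasion.</p>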
<h3>Pattern 4: Content Filter Failures Created Analytical Gaps</h3>
<p>In the Khamenei debate, The Systems Thinker hit content filter errors on all three interrogation responses. This meant one of the two highest-probability personas (80-90%) never had their reasoning stress-tested. In a debate that ultimately resolved in the direction their probability indicated, the absence of stress-testing matters.</p>
<p>Content filtering is an inherent constraint when running sensitive geopolitical predictions through LLM providers. The system cannot currently route around these failures. This limitation disproportionately affects the kinds of predictions that are most useful to evaluate: those involving conflict, political violence, and regime stability.</p>
<h3>Pattern 5: The System Correctly Identified the Key Variables, But Misjudged Their Interaction</h3>
<p>Across the three debates, the system identified nearly every variable that mattered: Iran's nuclear enrichment trajectory, Israeli military preparedness, US carrier deployment, IRGC cohesion, Khamenei's health, the succession deadlock, economic collapse, and protest dynamics. These were not obscure factors. But the system analysed each in relative isolation, or within the scope of individual persona frameworks.</p>
<p>What no persona modelled was the way these factors could combine: that Israeli military preparation, US naval deployment, Iranian internal instability, and the succession crisis could come together in a single coordinated action designed to exploit all of them simultaneously. The actual event was a joint operation targeting nuclear facilities, military infrastructure, and political leadership in one campaign. The system treated each variable as an input to its own probability estimate. Reality treated them as components of a single integrated plan.</p>
<p>This points toward a deeper architectural question: whether multi-agent debate, in its current form, is capable of modelling connected risks across geopolitical domains, or whether the persona-by-persona structure inherently separates interconnected variables.</p>
<h3>Summary of Aggregate Predictions vs. Outcomes</h3>
<table>
<thead>
<tr>
<th>Debate</th>
<th>Aggregate Prediction</th>
<th>Outcome</th>
<th>Persona Range</th>
</tr>
</thead>
<tbody><tr>
<td>Khamenei out by March 31</td>
<td>30%</td>
<td>Yes (killed Feb 28)</td>
<td>5% to 90%</td>
</tr>
<tr>
<td>Israel strikes Iran by March 31</td>
<td>44%</td>
<td>Yes (Feb 28)</td>
<td>10% to 90%</td>
</tr>
<tr>
<td>Iranian regime falls by March 31</td>
<td>16%</td>
<td>Unresolved (destabilised)</td>
<td>1% to 40%</td>
</tr>
</tbody></table>
<p>The wide persona ranges (5-90%, 10-90%, 1-40%) reflect uncertainty that the aggregation process compressed. Whether that compression was appropriate depends on future calibration data. On these three questions, the aggregate trended toward the cautious end of the distribution, because the voting process favoured careful, well-defended arguments over the more aggressive probability estimates that ultimately landed closer to the actual outcomes.</p>
<hr />
<h2>A Note on the Human Cost</h2>
<p>This article has discussed events in analytical terms - probabilities, persona frameworks, prediction accuracy. That framing should not obscure what actually happened on February 28 and the weeks preceding it.</p>
<p>The joint US-Israeli strikes killed at least 201 people across 24 Iranian provinces, according to the Iranian Red Crescent. Among the targets, Israeli strikes hit the Shajareh Tayyebeh girls' elementary school in Minab, killing at least 168 people, most of whom were children. Iran has retaliated with missile and drone strikes, including a ballistic missile that struck a residential area in Beit Shemesh, killing at least 9 people. The conflict continues to escalate.</p>
<p>This follows a period during which the Iranian regime itself carried out mass killings of its own citizens. Human rights organisations estimate that over 7,000 people were killed in the regime's crackdown on protests that began in late December 2025. Reports indicate these killings were carried out under direct orders from Khamenei, with reports of up to 36,000 protesters killed in a two-day span in early January. A nationwide internet blackout accompanied the violence.</p>
<p>Forecasting systems analyse the probabilities of events like these. They cannot capture what those events mean for the people living through them.</p>
<hr />
<p><em>These debates were created using</em> <a href="https://getperspectives.app/"><em>Perspectives</em></a><em>. The system is under active development, and this analysis is part of the ongoing effort to improve calibration and identify systematic biases.</em></p>
]]></content:encoded></item><item><title><![CDATA[Perspectives Development Update: Persona Sets & Prediction Mode]]></title><description><![CDATA[Perspectives previously used a fixed set of 8 philosophical personas for all debates. This created a mismatch between the analytical lens and the decision being made. Evaluating a feature prioritisation decision through Sociopath's strategic weakness...]]></description><link>https://blog.jmatthews.uk/perspectives-development-update-persona-sets-and-prediction-mode</link><guid isPermaLink="true">https://blog.jmatthews.uk/perspectives-development-update-persona-sets-and-prediction-mode</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><category><![CDATA[perspectives]]></category><category><![CDATA[development]]></category><category><![CDATA[dev]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Mon, 12 Jan 2026 21:23:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768252625295/eb4a5b61-4006-4c4a-8be2-3d24032baa7f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Perspectives previously used a fixed set of 8 philosophical personas for all debates. This created a mismatch between the analytical lens and the decision being made. Evaluating a feature prioritisation decision through Sociopath's strategic weakness lens produced interesting debate, but not necessarily useful data.</p>
<p>Two new features address this limitation. Persona Sets provide specialised analytical frameworks tailored to different decision types. Prediction Mode shifts focus from prescriptive recommendations to verifiable forecasts. Both features integrate fully with the Interrogation of <a target="_blank" href="https://blog.jmatthews.uk/perspectives-development-update-2026-w2">Blind Proposals Protocol (IBPP)</a>, maintaining the structured challenge-response-verdict system that generates high-quality debate data.</p>
<h2 id="heading-persona-sets">Persona Sets</h2>
<p>Decision-making requires cognitive diversity matched to the decision type. Strategic business decisions benefit from stakeholder analysis across competing interests. Product development decisions need perspectives from different roles in the development and usage cycle. Philosophical dilemmas require examination through distinct value systems.</p>
<p>The system now supports multiple persona sets, each optimised for a specific category of decision. Sets are selected at the start of each debate based on the question being asked.</p>
<h3 id="heading-available-persona-sets">Available Persona Sets</h3>
<p><strong>Philosophical (Default)</strong>: Represents different cognitive frameworks and value systems.</p>
<p>This set works best for ethical dilemmas, personal decisions with moral dimensions, and questions exploring how different value systems interpret the same situation.</p>
<p><strong>Product-Focused</strong>: Represents stakeholders in product development.</p>
<p>This set addresses feature prioritisation, UX decisions, technical trade-offs, and product roadmap planning. The perspectives map directly to conversations product teams already have, but structured through systematic interrogation rather than ad-hoc discussion.</p>
<p><strong>Business-Focused</strong>: Represents competing organisational interests.</p>
<p>This set supports strategic decisions, resource allocation, stakeholder management, and organisational change. The analysis reveals which stakeholder groups benefit from a decision and which bear its costs.</p>
<p><strong>Predictive (Specialised)</strong>: Represents distinct forecasting methodologies. More on this below.</p>
<h3 id="heading-how-sets-change-analysis-quality">How Sets Change Analysis Quality</h3>
<p>The philosophical set examines a feature request through abstract principles. The Empath asks how it affects users emotionally. The Idealist questions whether it aligns with product values. The Sociopath identifies ways users might exploit it.</p>
<p>The product-focused set examines the same request through practical concerns. User describes friction in specific workflows. Maintainer calculates operational burden. Architect evaluates system integrity impact. Market Sceptic demands adoption evidence.</p>
<p>The difference produces meaningfully different analysis outputs. Philosophical personas generate interesting ethical deliberation but limited implementation guidance. Product personas generate concrete technical and UX considerations directly relevant to shipping decisions.</p>
<p>Interestingly, when a query is framed differently for the Philosophical personas (e.g., using the new Prediction Mode - see below), each uses its inherent "world view" to tackle the question in character. They essentially already have views on how things <em>will</em> work because they have theories about how the world <em>actually</em> works. I also noticed notably higher tension in identical debates run with the Philosophical personas than with the new persona sets.</p>
<h3 id="heading-use-cases-by-set-type">Use Cases by Set Type</h3>
<p><strong>Philosophical Set:</strong></p>
<ul>
<li><p>Should we allow AI-generated content in our platform?</p>
</li>
<li><p>How do we balance privacy and convenience?</p>
</li>
<li><p>What ethical principles should guide content moderation?</p>
</li>
<li><p>Should we accept funding from controversial sources?</p>
</li>
</ul>
<p><strong>Product-Focused Set:</strong></p>
<ul>
<li><p>Should we add real-time collaboration to the editor?</p>
</li>
<li><p>How do we prioritise these five feature requests?</p>
</li>
<li><p>Do we rebuild the legacy architecture or patch it?</p>
</li>
<li><p>Should we ship this feature in beta or wait?</p>
</li>
</ul>
<p><strong>Business-Focused Set:</strong></p>
<ul>
<li><p>Should we expand into this market segment?</p>
</li>
<li><p>How do we allocate the marketing budget?</p>
</li>
<li><p>Do we pursue acquisition or organic growth?</p>
</li>
<li><p>Should we change our pricing model?</p>
</li>
</ul>
<h2 id="heading-prediction-mode">Prediction Mode</h2>
<p>Analysis Mode generates recommendations about what should happen. The system evaluates options, debates their merits, and produces a recommended resolution. This works well for prescriptive questions where success depends on implementation quality rather than external factors.</p>
<p>Prediction Mode generates forecasts about what will actually happen. The system estimates probabilities, identifies key uncertainties, and produces confidence-weighted predictions. This works for questions with verifiable outcomes where accuracy can be measured against reality.</p>
<p>The distinction matters because verification mechanisms differ fundamentally. Analysis Mode recommendations can only be evaluated subjectively through user feedback. Prediction Mode forecasts can be evaluated objectively by checking what happened.</p>
<h3 id="heading-benefits-of-verifiable-outcomes">Benefits of Verifiable Outcomes</h3>
<p>Objective measurement creates feedback loops. When predictions resolve, the system can calculate which personas forecasted accurately and which reasoning patterns led to correct forecasts. This data enables refinement that subjective evaluation cannot provide.</p>
<p>The feedback mechanism addresses a core limitation of analytical systems. Without external signal, there's no way to know if the debate process generates genuinely useful insights or plausible-sounding nonsense. Prediction accuracy provides that signal.</p>
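<p>One standard way to turn resolved forecasts into that signal is a per-persona Brier score: the mean squared error between the stated probability and the 0/1 outcome, where lower is better. This is a sketch of the general technique; the article doesn't specify Perspectives' actual scoring rule, and the numbers below are hypothetical:</p>

```python
def brier_score(forecasts):
    """forecasts: list of (probability, outcome) pairs, outcome in {0, 1}.
    Lower scores indicate better-calibrated predictions."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Two hypothetical personas scored on the same three resolved questions:
confident = brier_score([(0.90, 1), (0.80, 1), (0.30, 0)])
hedged = brier_score([(0.30, 1), (0.44, 1), (0.16, 0)])
print(confident, hedged)
```

<p>Accumulating these scores as predictions resolve is exactly the kind of objective feedback loop that Analysis Mode's subjective evaluation cannot provide.</p>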
<h3 id="heading-the-predictive-persona-set">The Predictive Persona Set</h3>
<p>Prediction Mode uses a specialised persona set optimised for forecasting rather than moral deliberation. These personas represent distinct epistemic approaches to understanding future events.</p>
<p><strong>Base Rate Analyst</strong> examines historical patterns and reference class forecasting. What happened in similar situations and what does statistical evidence suggest?</p>
<p><strong>Insider</strong> evaluates qualitative signals and expert judgment. What do people close to the situation observe and what information isn't captured in public data?</p>
<p><strong>Contrarian</strong> identifies consensus errors and unpriced risks. What is the crowd missing?</p>
<p><strong>Systems Thinker</strong> maps causal chains and feedback loops. What causes what and how do different forces interact?</p>
<p><strong>Trend Analyst</strong> tracks momentum and directional movement. What's accelerating or decelerating?</p>
<p><strong>Scenario Planner</strong> considers multiple possible futures. What are the discrete scenarios and what triggers each one?</p>
<p><strong>Implementer</strong> assesses execution capacity and operational constraints. Can the actors involved actually deliver this? What practical obstacles exist?</p>
<p><strong>Market Reader</strong> interprets signals from prediction markets and aggregated forecasts. What does money on the line suggest? How have similar predictions resolved?</p>
<p><em>[Image: Predictive set interface showing 8 forecasting personas]</em></p>
<p>The personas approach predictions through methodologically distinct frameworks.</p>
<h3 id="heading-use-cases-for-prediction-mode">Use Cases for Prediction Mode</h3>
<p><strong>Political and Electoral Outcomes:</strong></p>
<ul>
<li><p>Will Candidate X win the election?</p>
</li>
<li><p>Will this legislation pass by the target date?</p>
</li>
<li><p>Will the approval rating exceed 50% by year end?</p>
</li>
</ul>
<p><strong>Market and Business Predictions:</strong></p>
<ul>
<li><p>Will Company X reach $1B valuation within 18 months?</p>
</li>
<li><p>Will this product category grow more than 20% this year?</p>
</li>
<li><p>Will the merger be approved by regulators?</p>
</li>
</ul>
<p><strong>Technology and Product Forecasts:</strong></p>
<ul>
<li><p>Will this feature achieve 40% adoption within 6 months?</p>
</li>
<li><p>Will competitors launch a similar product by Q3?</p>
</li>
<li><p>Will this technical approach become industry standard?</p>
</li>
</ul>
<p><strong>Social and Cultural Trends:</strong></p>
<ul>
<li><p>Will this regulation be implemented as currently drafted?</p>
</li>
<li><p>Will public opinion shift on this issue by next year?</p>
</li>
<li><p>Will this controversy still be discussed in 6 months?</p>
</li>
</ul>
<h3 id="heading-integration-with-verification-systems">Integration with Verification Systems</h3>
<p>The system supports integration with external prediction platforms. Forecasts can be tracked against Polymarket resolutions, election results, product launches, and other verifiable events. This creates measurable accuracy data that feeds back into persona refinement.</p>
<h2 id="heading-ibpp-support-for-both-features">IBPP Support for Both Features</h2>
<p>Both Persona Sets and Prediction Mode operate through the Interrogation of Blind Proposals Protocol. Each persona generates an independent initial position without seeing other perspectives. Three challengers interrogate each position through structured challenge-response-verdict cycles. Tension metrics determine synthesis debate length. The full IBPP structure applies regardless of persona set or operational mode.</p>
<p>This means debates in Prediction Mode maintain the same analytical rigour as Analysis Mode debates. Personas defend their probability estimates through the same challenge system that tests ethical arguments. The only difference is the content being debated, not the debate structure.</p>
<h2 id="heading-whats-next">What's Next</h2>
<p>The core infrastructure now supports multiple analytical lenses and operational modes. The system adapts to the decision being made rather than forcing all questions through a single philosophical framework.</p>
<p>I'll be working on integrating Polymarket or similar data to begin testing predictions against real outcomes, providing the feedback needed to improve both the predictions and the personas.</p>
<p>I'm also working on context management, allowing the personas to analyse or predict more detailed scenarios with additional context. For example, the system may already know all about your product or business, making it easier to start new debates for your specific circumstances.</p>
<p><a target="_blank" href="https://getperspectives.app">Try Perspectives</a></p>
<p><strong>Escape the echo chamber.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Perspectives Development Update: Interrogation of Blind Proposals]]></title><description><![CDATA[I've rebuilt how debates work in Perspectives. The system now runs structured interrogations of blind proposals rather than open-ended discussion threads. The change addresses fundamental problems with how debates generate useful analysis.
The Proble...]]></description><link>https://blog.jmatthews.uk/perspectives-development-update-2026-w2</link><guid isPermaLink="true">https://blog.jmatthews.uk/perspectives-development-update-2026-w2</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><category><![CDATA[perspectives]]></category><category><![CDATA[development]]></category><category><![CDATA[dev]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Thu, 08 Jan 2026 13:00:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767835446700/2b750893-f4e4-49b2-9ff7-43c6874a32b7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've rebuilt how debates work in Perspectives. The system now runs structured interrogations of blind proposals rather than open-ended discussion threads. The change addresses fundamental problems with how debates generate useful analysis.</p>
<h2 id="heading-the-problem-with-threaded-debates">The Problem with Threaded Debates</h2>
<p>The previous system ran blind proposals followed by a threaded debate, where personas were encouraged to respond directly to the previous speakers. This created organic discussion but made analysis difficult. Personas would engage with topics unevenly. Some proposals would attract extensive criticism whilst others barely got challenged. The debate transcripts were interesting but the data was too messy to extract clear patterns.</p>
<p>More importantly, the system couldn't identify which arguments held up under scrutiny and which collapsed. When <em>Pragmatist</em> challenges <em>Idealist's</em> proposal, does the response actually address the concern or sidestep it? The old format gave no structured way to answer that question.</p>
<h2 id="heading-how-interrogation-works">How Interrogation Works</h2>
<p>The new system runs interrogations immediately after blind proposals. For each of the proposals, the system selects three challengers based on framework opposition. Each challenger receives a specific analytical dimension to probe.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767835795958/6878003f-2bd2-4fce-981d-9679b390b4a6.png" alt class="image--center mx-auto" /></p>
<p>The challenger submits targeted questions about that dimension. The proposal author responds to all three challenges and can either concede or defend the proposal. If defended, the challenger then evaluates whether the challenge was successfully rebutted or remains fundamentally disputed. This creates structured data about which aspects of each proposal are robust and which are vulnerable.</p>
<p>The system tracks verdicts across all interrogations. A proposal that successfully defends against most challenges demonstrates resilience. A proposal that attracts many disputed verdicts indicates fundamental disagreement that discussions should address.</p>
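<p>In outline, one interrogation cycle can be sketched like this. The function names and the <code>Verdict</code> type are mine, standing in for model calls, not the actual Perspectives internals:</p>

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    DEFENDED = "defended"   # author's response addressed the challenge
    CONCEDED = "conceded"   # author accepted the criticism
    DISPUTED = "disputed"   # disagreement remains fundamental

@dataclass
class Interrogation:
    proposal: str
    verdicts: list = field(default_factory=list)

def interrogate(proposal, personas, pick_challengers, challenge, respond, judge):
    """Run three challenge-response-verdict cycles for one blind proposal."""
    record = Interrogation(proposal)
    for challenger, dimension in pick_challengers(proposal, personas, n=3):
        question = challenge(challenger, proposal, dimension)  # targeted probe
        answer = respond(proposal, question)        # author concedes or defends
        verdict = judge(challenger, question, answer)  # challenger's verdict
        record.verdicts.append((challenger, dimension, verdict))
    return record
```

<p>A proposal's resilience then falls out of its verdict counts: mostly <code>DEFENDED</code> signals a robust position, while many <code>DISPUTED</code> verdicts flag dimensions the discussion phase should address.</p>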
<h2 id="heading-why-this-generates-better-analysis">Why This Generates Better Analysis</h2>
<p>The interrogation protocol creates quantifiable data about proposal strength. The analysis report can now identify which decision dimensions have the most unresolved conflict, which stakeholder perspectives align or diverge, and which information gaps prevent resolution.</p>
<p>The protocol also forces more honest assessment. In open debate, personas naturally defend their frameworks. When a challenger issues a verdict after seeing the response, they're evaluating whether the author actually addressed their concern rather than whether they agree with the conclusion. This produces more accurate signals about argument quality.</p>
<h2 id="heading-the-discussion-phase">The Discussion Phase</h2>
<p>After interrogations complete, the system calculates tension levels based on disputed verdict counts. High tension (many fundamental disagreements) triggers longer discussion phases where personas address the most contentious dimensions directly. Low tension (most challenges defended or conceded) runs shorter discussions.</p>
<p>This adaptive approach allocates discussion time where genuine uncertainty exists rather than forcing extended debate on decisions with clear patterns.</p>
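<p>A minimal sketch of that heuristic, assuming tension is simply the share of disputed verdicts (the real thresholds and scaling aren't documented here):</p>

```python
def discussion_rounds(verdicts, max_rounds=6, min_rounds=1):
    """Map interrogation verdicts to a discussion-phase length.
    verdicts: list of strings in {"defended", "conceded", "disputed"}."""
    if not verdicts:
        return min_rounds
    tension = verdicts.count("disputed") / len(verdicts)
    # Scale discussion length with tension, clamped to the configured bounds.
    return max(min_rounds, round(tension * max_rounds))

high = discussion_rounds(["disputed"] * 12 + ["defended"] * 12)
low = discussion_rounds(["defended"] * 20 + ["disputed"] * 4)
print(high, low)  # more disputed verdicts -> longer discussion
```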
<p>Because this phase adds less value than the interrogation system, it can be skipped using the new “Fast Mode” (think of it as an inverse “Thinking Mode”: the debate as a whole finishes much faster, at the cost of a slightly less useful output).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767835972878/2aed1a7e-c82a-42b5-8cc1-10e869039cab.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-implementation-changes">Implementation Changes</h2>
<p>The interrogation phase requires approximately 32 API calls per debate (8 proposals × 4 calls: challenger selection, challenge generation, author response, verdict evaluation). The calls run sequentially because verdicts depend on responses which depend on challenges. This adds processing time but creates substantially more structured data for analysis.</p>
<p>The frontend displays interrogations through expandable persona sections. Each section shows three challenger rows with verdict icons (defended, conceded, disputed). Clicking a row reveals the full challenge text, response, and verdict reasoning in the main panel.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767835882640/5dc81e30-a2c0-4aeb-b89a-aeeced23e202.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-what-this-enables">What This Enables</h2>
<p>This protocol creates analysable data about argument resilience, stakeholder alignment, decision dimensions, and information requirements. This feeds directly into analysis reports that map these patterns to actionable decision support.</p>
<p>The system now tracks which frameworks challenge which proposals most effectively, which dimensions attract the most dispute, and which information would resolve remaining conflicts. None of this was extractable from free-flowing debate transcripts.</p>
<p>The protocol also establishes a foundation for future analysis improvements. When verdict patterns indicate specific information gaps, the system could trigger targeted research or suggest follow-up interrogations on particular dimensions.</p>
<p>Additionally, the system can take advantage of higher-concurrency models. If a provider allows a model to be called five times simultaneously, for example, the parallel interrogation phase completes much more quickly because all five slots are in use. And because the discussion phase is now far shorter on average, much less time is spent waiting on that sequential stage, which can only ever occupy one of the five slots.</p>
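<p>A toy illustration of that effect, using <code>asyncio</code> purely for demonstration (the slot count and sleep times are invented; this is not the actual scheduler):</p>

```python
import asyncio

async def run_debate(n_interrogations=8, slots=5):
    """Count the peak number of concurrent 'model calls' the parallel
    interrogation phase achieves under a provider concurrency limit."""
    sem = asyncio.Semaphore(slots)
    active = peak = 0

    async def interrogate(_):
        nonlocal active, peak
        async with sem:                 # one provider slot per call
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stands in for one model call
            active -= 1

    # Interrogations are independent, so they fill every available slot...
    await asyncio.gather(*(interrogate(i) for i in range(n_interrogations)))
    # ...whereas the discussion phase runs turn by turn on a single slot.
    for _ in range(3):
        await asyncio.sleep(0.01)
    return peak

print(asyncio.run(run_debate()))
```

<p>With eight independent interrogation calls and five slots, the parallel phase keeps all five slots busy, while the sequential discussion phase can never use more than one.</p>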
<h2 id="heading-trade-offs">Trade-Offs</h2>
<p>The structured approach sacrifices some organic discussion flow. Personas respond to direct challenges rather than building on each other's points naturally. The discussion phase attempts to recover this by focusing on disputed challenges, but it's less spontaneous than a continuous threaded conversation.</p>
<h2 id="heading-looking-forward">Looking Forward</h2>
<p>The interrogation protocol establishes groundwork for analysis improvements that depend on structured verdict data. I'm exploring how to better visualise the patterns that emerge from interrogations (which dimensions create most conflict, which proposals survive scrutiny, which frameworks align or oppose).</p>
<p>The system is live at <a target="_blank" href="http://getperspectives.app">getperspectives.app</a>. Escape the echo chamber.</p>
]]></content:encoded></item><item><title><![CDATA[Perspectives Development Update: 2026 W1]]></title><description><![CDATA[Perspectives has received three major updates: a new analysis report system, web search integration, and a redesigned interface. These changes address core limitations in how the system supports decision-making.
Analysis Reports Replace Summary Secti...]]></description><link>https://blog.jmatthews.uk/perspectives-development-update-2026-w1</link><guid isPermaLink="true">https://blog.jmatthews.uk/perspectives-development-update-2026-w1</guid><category><![CDATA[agentic AI]]></category><category><![CDATA[decision making]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Sun, 04 Jan 2026 18:25:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767550563945/a30a7b7b-5501-409e-94cb-d6d7c1317bdd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Perspectives has received three major updates: a new analysis report system, web search integration, and a redesigned interface. These changes address core limitations in how the system supports decision-making.</p>
<h2 id="heading-analysis-reports-replace-summary-section">Analysis Reports Replace Summary Section</h2>
<p>The previous summary section offered limited decision support: it described what each persona said but didn't translate that into actionable guidance.</p>
<p>The new analysis system generates professional reports using 9 specialised subagents. Eight run in parallel to create different sections. A ninth reads the assembled report and writes a recommended resolution. Reports provide structured decision support rather than just debate summaries.</p>
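<p>The fan-out/fan-in shape of that pipeline can be sketched as follows. The section titles come from this article; the writer functions are stand-ins for subagent model calls, not the real implementation:</p>

```python
from concurrent.futures import ThreadPoolExecutor

SECTIONS = [
    "Executive Summary", "Key Decision Points", "Confidence Assessment",
    "Issue Breakdown", "Points of Agreement", "Stakeholder Impact Analysis",
    "Decision Pathway", "Information Gaps",
]

def build_report(write_section, write_resolution):
    """Eight section writers run in parallel; a ninth subagent then reads
    the assembled report before writing the recommended resolution."""
    with ThreadPoolExecutor(max_workers=len(SECTIONS)) as pool:
        bodies = list(pool.map(write_section, SECTIONS))
    assembled = "\n\n".join(f"## {title}\n{body}"
                            for title, body in zip(SECTIONS, bodies))
    resolution = write_resolution(assembled)  # sees the full report first
    return f"## Recommended Resolution\n{resolution}\n\n{assembled}"
```

<p>Running the resolution writer last, over the fully assembled report, is what lets it acknowledge trade-offs across sections rather than summarising any single one.</p>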
<p>Read an example <a target="_blank" href="https://jamiem.me/4fRr8m">here</a>.</p>
<p>The report includes nine sections:</p>
<p><strong>Recommended Resolution</strong> sits at the top because executives need the bottom line first. The subagent that writes this section reads the entire assembled report before generating its recommendation. It explains the reasoning, acknowledges conditions where the recommendation might be wrong, and notes trade-offs. This makes it substantially more useful than a generic summary.</p>
<p><strong>Executive Summary</strong> provides a brief overview of the core decision and main trade-offs.</p>
<p><strong>Key Decision Points</strong> identifies the specific choices that need to be made, with clear stakes for each one.</p>
<p><strong>Confidence Assessment</strong> evaluates how much confidence the analysis warrants. It identifies caveats, limitations, and assumptions.</p>
<p><strong>Issue Breakdown</strong> structures the dimensions where personas disagreed into a table. It maps which personas took which positions and categorises disagreements as values-based or empirical.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767545139357/115e6253-d9fd-41f8-abf5-6ad779d1817c.png" alt="Screenshot of the Issue Breakdown from the above linked analysis report" class="image--center mx-auto" /></p>
<p><strong>Points of Agreement</strong> identifies where all personas aligned. Areas of consensus are often the most actionable.</p>
<p><strong>Stakeholder Impact Analysis</strong> shows who wins and who loses under each major proposal in matrix form.</p>
<p><strong>Decision Pathway</strong> maps the choice points as a text-based decision tree. This traces how different priorities lead to different outcomes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767545196666/92b9eb77-adf3-499d-9bd4-5f4908b33c6e.png" alt="Screenshot of a snippet of the decision pathway from the above linked analysis report" class="image--center mx-auto" /></p>
<p><strong>Information Gaps</strong> categorises missing information into three types: empirical questions (testable), missing but obtainable information, and fundamental uncertainties.</p>
<p>Reports are downloadable as PDFs formatted for professional use. The styling matches executive-grade documentation rather than informal output.</p>
<h2 id="heading-web-search-integration">Web Search Integration</h2>
<p>The personas can now access real-time web search through a SearxNG instance. When starting a debate, a toggle on the welcome screen enables this feature.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767545255252/308a5c68-c1ec-4188-94cf-0c81560ee602.png" alt="Screenshot of the Welcome screen for the Perspectives interface" class="image--center mx-auto" /></p>
<p>When enabled, personas search during their analysis. The system requires at least one search during the blind proposal phase. Searches are optional during the main debate. This grounds discussions in current facts rather than purely theoretical reasoning.</p>
<p>The implementation uses SearxNG (a privacy-respecting metasearch engine). When a persona includes a search query in their response, the system executes it, injects the results, and the persona continues with that data available.</p>
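<p>The round trip might look roughly like this. The <code>SEARCH:</code> directive and the response handling are assumptions for illustration (SearxNG does expose a <code>format=json</code> search API), not the actual Perspectives protocol:</p>

```python
import json
import re
import urllib.parse
import urllib.request

# Hypothetical marker a persona might emit mid-response to request a search.
SEARCH_RE = re.compile(r"SEARCH:\s*(.+)")

def searx_query(base_url, query):
    """Run one query against a SearxNG instance's JSON API and format
    the top results for injection into the persona's context."""
    url = f"{base_url}/search?" + urllib.parse.urlencode(
        {"q": query, "format": "json"})
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp).get("results", [])[:5]
    return "\n".join(f"- {r['title']}: {r['url']}" for r in results)

def resolve_searches(draft, base_url, fetch=searx_query):
    """Replace each SEARCH: directive with fetched results, so the persona
    continues its analysis with that data available."""
    return SEARCH_RE.sub(lambda m: fetch(base_url, m.group(1)), draft)
```

<p>A stubbed <code>fetch</code> makes the injection step testable without a live SearxNG instance, which is also how the indicator UI can know exactly which queries each persona ran.</p>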
<p>Search queries appear in the interface as indicators below each message. This makes it clear what information each persona looked up.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767545317573/0c1b6848-f6b3-4b12-9fc1-6de223b8a930.png" alt="Screenshot of the blind proposals section where a persona has used the web search feature" class="image--center mx-auto" /></p>
<p>Previously, personas would make assumptions about how entities might behave. The search capability enables them to cite actual examples instead.</p>
<h2 id="heading-interface-redesign">Interface Redesign</h2>
<p>The web interface now matches the landing page aesthetic. The changes include a new colour scheme, updated typography, and a welcome screen.</p>
<p><strong>Colour scheme:</strong> The interface uses the same accent colour (#ff3333) from the landing page, with the dark obsidian background and paper-coloured text creating consistent visual language across the site.</p>
<p><strong>Typography:</strong> The system now uses Syncopate for headers and branding, IBM Plex Sans for body text, and JetBrains Mono for technical elements. This matches the landing page fonts.</p>
<p><strong>Welcome screen:</strong> When no debate is selected, the interface shows a proper welcome screen instead of an empty state. It displays the debate input box and offers debate suggestions randomly selected from a curated list of topics.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767545369506/9f328aae-763d-477f-8f4e-c1e6336f2ac2.png" alt="Screenshot of the Perspectives Welcome screen" class="image--center mx-auto" /></p>
<p>The suggestions are scenarios with genuine uncertainty that force personas to engage with real trade-offs rather than obvious answers.</p>
<h2 id="heading-whats-next">What's Next</h2>
<p>I'm working on improvements to the underlying threaded debate system to make the output more useful for decision-making. The current implementation produces good blind proposals but the debate phase generates less structured data than the analysis reports need.</p>
<p>I'm also exploring better visualisations for the decision tree section. The text-based format works but could be more intuitive.</p>
<p><strong>Escape the echo chamber:</strong> <a target="_blank" href="http://getperspectives.app"><strong>getperspectives.app</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Introducing Perspectives]]></title><description><![CDATA[Today’s AI assistants are marvels of engineering, but for high-stakes professional decisions, they share a critical limitation: they are designed to be agreeable. These chatbots excel at providing a single, coherent answer, often reinforcing the user...]]></description><link>https://blog.jmatthews.uk/introducing-perspectives</link><guid isPermaLink="true">https://blog.jmatthews.uk/introducing-perspectives</guid><category><![CDATA[decision making]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Thu, 11 Dec 2025 19:11:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765479194235/085e7271-404c-4e61-8982-1810fd5f4438.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today’s AI assistants are marvels of engineering, but for high-stakes professional decisions, they share a critical limitation: they are designed to be agreeable. These chatbots excel at providing a single, coherent answer, often reinforcing the user's initial assumptions. While helpful for simple tasks, this creates an intellectual echo chamber precisely when it's most dangerous: tackling complex strategic, ethical, or technical problems. True progress on difficult questions requires intellectual diversity, rigorous debate, and the surfacing of productive conflict.</p>
<p>I believe it’s time for a paradigm shift. That's why I'm introducing Perspectives, a new framework for decision-making. Perspectives is not another chatbot. It is a "Synthetic Council", a multi-agent debate system designed to simulate a team of expert advisors whose primary goal is to challenge, debate, and pressure-test ideas from every conceivable angle. Its purpose is not to agree, but to create productive conflict and synthesise robust, resilient solutions. It is a tool built to help you escape the echo chamber.</p>
<h2 id="heading-lets-meet-the-members-of-the-assembly">Let's meet the members of the assembly</h2>
<p>The analytical power of Perspectives is driven by its core component: The Assembly. This is a curated set of eight distinct AI personas, each engineered with unique priorities, analytical frameworks, reasoning styles, and strategic blind spots. This intentional diversity is the engine of the system. By forcing a structured debate among these conflicting viewpoints, Perspectives ensures that a problem is examined not just for its surface-level details, but for its hidden assumptions, second-order effects, and ethical implications.</p>
<ul>
<li><p>The Sociopath: Zero-sum strategist, detached from social norms</p>
</li>
<li><p>The Manipulator: Game theory expert, exploits systems and psychology</p>
</li>
<li><p>The Opportunist: Ruthlessly pragmatic, seizes advantages</p>
</li>
<li><p>The Pragmatist: Outcome-focused, dismisses sentiment</p>
</li>
<li><p>The Diplomat: Seeks consensus and bridges perspectives</p>
</li>
<li><p>The Empath: Prioritizes emotional impact and human wellbeing</p>
</li>
<li><p>The Idealist: Champions principled, values-driven solutions</p>
</li>
<li><p>The Community Organizer: Focuses on collective benefit and inclusion</p>
</li>
</ul>
<p>The value of Perspectives lies not only in its diverse Assembly but in its structured, four-phase process. This methodology is designed to prevent cognitive biases like groupthink and anchoring, forcing a rigorous and comprehensive analysis from start to finish.</p>
<ol>
<li><p>Blind Proposals</p>
<p> Each of the eight personas receives the prompt and generates an independent, initial analysis without seeing the work of the others. This "blind" process is critical for ensuring true intellectual diversity. It prevents the first or loudest idea from anchoring the group and encourages genuinely divergent starting points for the debate.</p>
</li>
<li><p>Threaded Debate</p>
<p> Once all blind proposals are submitted, they are revealed to the entire Assembly. The system agent then orchestrates a structured, threaded debate. Personas critique each other's proposals, defend their own, and challenge underlying assumptions. This adversarial process refines strong ideas and exposes the weaknesses in fragile ones.</p>
</li>
<li><p>Ranked-Choice Voting (STV)</p>
<p> After the debate concludes, the Assembly votes. Instead of a simple majority vote, which can lead to polarising outcomes, Perspectives uses Single Transferable Vote (STV). Each persona ranks all proposals from their most to least preferred. This method is designed to find the most broadly acceptable and resilient solution—the one that survives the rigorous scrutiny of the entire council, not just the one favored by a narrow faction.</p>
</li>
<li><p>Summary Generation</p>
<p> The final output is a comprehensive debrief. It presents the winning proposal, a detailed analysis of why it won based on the voting dynamics, and a clear summary of the most salient dissenting opinions and unresolved tensions. This provides the user not just with an answer, but with a full map of the decision space, including the risks and trade-offs that a single-answer AI would have missed.</p>
</li>
</ol>
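<p>To make the voting phase concrete, here is a minimal sketch of ranked-choice elimination with vote transfers, the mechanism STV builds on for a single winner. The function name and the ballots are illustrative only; the actual Perspectives implementation is not shown here and may differ.</p>

```python
from collections import Counter

def ranked_choice_winner(ballots):
    """Pick a winner from ranked ballots: repeatedly eliminate the
    proposal with the fewest first-choice votes and transfer those
    ballots to each voter's next surviving preference."""
    remaining = {p for ballot in ballots for p in ballot}
    while True:
        # Count each ballot toward its highest-ranked surviving proposal
        tally = Counter(next(p for p in ballot if p in remaining)
                        for ballot in ballots)
        top, votes = tally.most_common(1)[0]
        if votes * 2 > len(ballots):  # outright majority reached
            return top
        # Eliminate the weakest surviving proposal; its votes transfer
        weakest = min(remaining, key=lambda p: tally.get(p, 0))
        remaining.discard(weakest)

# Eight personas ranking three proposals (illustrative ballots)
ballots = ([["pragmatist", "diplomat", "sociopath"]] * 3
           + [["diplomat", "pragmatist", "sociopath"]] * 3
           + [["sociopath", "diplomat", "pragmatist"]] * 2)
```

<p>With these ballots, no proposal has a first-round majority; the weakest proposal is eliminated and its supporters' second choices decide the outcome. This is how a broadly acceptable option can beat a polarising one.</p>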
<p>This structured process transforms a simple question into a rich, multi-faceted analysis. To see it in practice, let’s examine a complex, hypothetical problem.</p>
<h2 id="heading-in-action-a-debate-on-building-a-dyson-sphere">In Action: A Debate on Building a Dyson Sphere</h2>
<p>To showcase the system's ability to handle multifaceted problems, we tasked it with a classic strategic, ethical, and technical question: "Is it a good idea to build a Dyson Sphere?" This problem is ideal because a simple "yes" or "no" is insufficient; it demands a deep consideration of engineering, resource allocation, power dynamics, and moral philosophy.</p>
<p>The core conflict of the debate emerged immediately during the Blind Proposals phase, with three key viewpoints defining the battle lines:</p>
<blockquote>
<p>The Pragmatist immediately grounded the debate in evidence, rejecting a complete sphere as "thermodynamically untenable, structurally impossible, and catastrophically inefficient." Its analysis focused on the quantified resource problem (requiring the mass of Jupiter) and the insurmountable engineering physics, dismissing the idea as fantasy.</p>
<p>The Sociopath, in stark contrast, ignored the engineering entirely and framed the Dyson Sphere as a "civilizational checkmate move." Its analysis focused solely on power dynamics, arguing that whoever builds the sphere gains absolute strategic dominance over all other actors. The question wasn't if it should be built, but who would build it first to consolidate permanent power.</p>
<p>The Community Organizer rejected both frames. Its critique reframed the entire prompt as a question of justice and sovereignty. It asked, "Who decides? Who benefits? Who bears the costs?" and argued that building such a structure would be an act of "colonialism in space," reinforcing global inequity and centralizing power without the consent of the communities it would affect.</p>
</blockquote>
<p>The subsequent debate was fierce. The Sociopath, for instance, attacked The Pragmatist’s evidence-based dismissal, arguing that treating "2025 industrial capacity as a binding constraint" was a failure of strategic imagination, not a law of physics. This is the crucible where assumptions are tested and weak points exposed.</p>
<p>After multiple rounds of voting and vote transfers, The Pragmatist's proposal emerged as the winner. Its argument prevailed not because it was the most inspiring, but because it was the most robust. It successfully dismantled the premise of a complete, rigid sphere on engineering grounds but offered a viable alternative: a staged, evidence-driven development of Dyson swarms (distributed collectors) only after proven energy technologies like nuclear and orbital solar were fully deployed. This incremental, testable, and realistic pathway was broadly acceptable to a majority of the council, outlasting both the high-risk power play of The Sociopath and the justice-based veto of The Community Organizer.</p>
<p>While a Dyson Sphere may be a far-future concept, the system’s ability to navigate this complexity has direct applications for the high-stakes decisions professionals face every day.</p>
<h2 id="heading-a-new-sparring-partner-for-professional-decision-making">A New Sparring Partner for Professional Decision-Making</h2>
<p>In an age of information overload, the true competitive advantage is not access to more data, but the capacity for better judgment. Perspectives is a tool designed to sharpen that judgment by providing a structured, adversarial environment to pressure-test ideas before they are deployed. It serves as a dedicated sparring partner for any professional facing a complex choice.</p>
<p>In each scenario, Perspectives moves beyond a simplistic pro/con list to a dynamic analysis of competing values and hidden risks. It is more than a tool: it's a new methodology for thinking. One that embraces conflict as the pathway to clarity.</p>
<h2 id="heading-your-invitation-to-escape-the-echo-chamber">Your Invitation to Escape the Echo Chamber</h2>
<p>Access to Perspectives is currently in early access as we continue to refine the system. To join the waitlist for access, to provide feedback, and to follow our progress, visit the site today: <a target="_blank" href="https://getperspectives.app">https://getperspectives.app</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Realistic Simulations: A Research-Driven Approach]]></title><description><![CDATA[Process simulations are only valuable if they're realistic. A simulation that looks impressive but doesn't capture how things actually work can lead to bad decisions and failed implementations. When we decided to create the first very realistic simul...]]></description><link>https://blog.jmatthews.uk/building-realistic-simulations-a-research-driven-approach</link><guid isPermaLink="true">https://blog.jmatthews.uk/building-realistic-simulations-a-research-driven-approach</guid><category><![CDATA[AI]]></category><category><![CDATA[automation]]></category><category><![CDATA[simulation]]></category><category><![CDATA[research]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Fri, 31 Oct 2025 20:56:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761944032940/1d19d841-4bba-4172-b33d-0b5c878beb9d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Process simulations are only valuable if they're realistic. A simulation that looks impressive but doesn't capture how things actually work can lead to bad decisions and failed implementations. When we decided to create the first very realistic simulation for the Universal Automation Wiki, I knew we needed a proper methodology.</p>
<p>This post walks through the approach we developed for building the "Parkside Bakery" simulation, a completely fictional, small commercial bread bakery with realistic timing constraints, resource limitations, and operational complexities. This methodology can be applied to any domain where operational accuracy matters.</p>
<p>The first step I took was to use Claude Sonnet 4.5 to generate a plan. For LLMs, context is everything: to ensure that this instance of Claude knew exactly what its task was, it had access to our meeting notes and the code for the playground itself.</p>
<p>You can <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/plan.md">read the full implementation plan</a> that guided this process.</p>
<h2 id="heading-why-traditional-approaches-fall-short">Why Traditional Approaches Fall Short</h2>
<p>Most simulation projects take one of two paths, both problematic.</p>
<p>Some start with theoretical models and ideal conditions, then layer on constraints. These simulations work perfectly on paper but break when confronted with reality. Others are built on what seems reasonable rather than what actually happens. Without grounding in real data, they encode our assumptions and blind spots instead of reality.</p>
<p>Our methodology addresses both issues by starting with deep research into actual operations and validating every aspect against real-world evidence.</p>
<h2 id="heading-the-four-phase-approach">The Four-Phase Approach</h2>
<h3 id="heading-phase-1-deep-research-foundation">Phase 1: Deep Research Foundation</h3>
<p>Understanding what actually happens in the real world is the foundation. For our bakery simulation, this meant answering questions like:</p>
<ul>
<li><p>When do commercial bakeries actually start operations?</p>
</li>
<li><p>How long does sourdough really need to proof in a commercial setting?</p>
</li>
<li><p>What's the actual usable capacity of a deck oven, not just the marketing specs?</p>
</li>
</ul>
<p>This <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/Phase%201/Bakery%20Research%20Raw.md">initial research</a> was followed by targeted deep dives into <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/Phase%201/Bakery%20Equipment%20Specs.md">equipment specifications</a>, <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/Phase%201/Bakery%20Recipes%20%26%20Processes.md">recipe details</a>, and <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/Phase%201/Bakery%20Financials.md">financial realities</a>. The result was a collection of research documents grounded in real bakeries, with citations for each piece of data.</p>
<h3 id="heading-phase-2-research-synthesis-and-validation">Phase 2: Research Synthesis and Validation</h3>
<p>Raw research needs to be transformed into a coherent operational model. I used Claude Pro to synthesize the research into a unified model for "Parkside Bakery", a fictional but realistic 4-person bakery.</p>
<p>The synthesis alone wasn't enough though. I ran the operational model through multiple validation passes checking for internal consistency, realism, completeness, and red flags. Does oven capacity actually support claimed daily production? Are profit margins realistic for this industry? Is equipment appropriate for the scale?</p>
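<p>As a flavour of what one such consistency check looks like, here is a small sketch. The function names and all figures are invented for illustration; they are not taken from the Parkside model.</p>

```python
def usable_capacity(spec_capacity, derating=0.15):
    # Research finding: real usable capacity runs 10-20% below the
    # manufacturer spec, so derate it before checking anything else
    return spec_capacity * (1 - derating)

def validate_production(spec_loaves_per_bake, bakes_per_day, target_daily_loaves):
    """Red-flag check: can the oven actually support the claimed
    daily production? Returns a list of issues (empty means OK)."""
    max_output = usable_capacity(spec_loaves_per_bake) * bakes_per_day
    issues = []
    if max_output < target_daily_loaves:
        issues.append(f"oven supports ~{max_output:.0f} loaves/day, "
                      f"but the model claims {target_daily_loaves}")
    return issues
```

<p>For example, an operational model claiming 150 loaves a day from a 24-loaf oven running six bakes would be flagged here, while eight bakes would pass.</p>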
<p>Download the full operational model <a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/Phase%202/pbs.md">here</a> (or as <a target="_blank" href="https://github.com/JamieM0/uaw-research/raw/refs/heads/main/Parkside%20Bakery/Phase%202/Parkside%20Bakery%20-%20Scenario.docx">a Word document</a>).</p>
<h3 id="heading-phase-3-implementation-in-uaw">Phase 3: Implementation in UAW</h3>
<p>With a validated operational model, the next step was <a target="_blank" href="https://github.com/JamieM0/uaw-research/tree/main/Parkside%20Bakery/Phase%203">translating it into the UAW simulation format</a>. This is where the Universal Object Model and the multiple day types feature become important.</p>
<p>Real businesses don't operate the same way every day. A commercial bakery's Monday looks different from its Saturday. Before implementing the full simulation, we added support for <a target="_blank" href="https://universalautomation.wiki/docs/playground/multi-day-simulations.html">multiple "day_types" to the Playground</a>. This allows simulations to model weekday operations, weekend operations, holiday operations, and maintenance days.</p>
<p>The implementation follows the Universal Object Model with objects definition, day type configurations, tasks with real timing, and economic tracking.</p>
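<p>As an illustration of why day types matter, the sketch below models a week with distinct day profiles. The field names and numbers are hypothetical, not the exact UAW schema.</p>

```python
# Hypothetical day-type configuration (illustrative, not the UAW schema)
simulation = {
    "day_types": {
        "weekday":     {"start": "03:00", "production_multiplier": 1.0},
        "saturday":    {"start": "02:00", "production_multiplier": 1.4},
        "maintenance": {"start": "08:00", "production_multiplier": 0.0},
    },
    # One entry per day of the week
    "week": ["weekday"] * 5 + ["saturday", "maintenance"],
}

def daily_output(base_loaves, day_type, config):
    """Scale a baseline production figure by the day type's multiplier."""
    return base_loaves * config["day_types"][day_type]["production_multiplier"]

# A single-day-type model would miss both the Saturday peak and the
# idle maintenance day entirely
weekly_total = sum(daily_output(150, d, simulation) for d in simulation["week"])
```

<p>Even this toy version shows how a one-size-fits-all day would misstate weekly totals and resource needs.</p>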
<h3 id="heading-phase-4-simulation-testing-and-analysis">Phase 4: Simulation Testing and Analysis</h3>
<p>Once implemented, the simulation needs to be run and analysed to verify it behaves as expected and produces realistic results.</p>
<p>The UAW Playground provides real-time visualization as the simulation runs. The timeline view shows all tasks executing across the day with color-coded actors and equipment states. Object state panels track inventory levels and equipment states live. The financial dashboard calculates costs, revenues, and profit margins in real-time.</p>
<p>We validate across several dimensions. Does the timeline show realistic patterns like a busy early morning and steady midday production? Are actors working reasonable schedules or racing constantly? Does total daily production match operational model targets? Are financial metrics within industry ranges?</p>
<h2 id="heading-what-makes-this-different">What Makes This Different</h2>
<p>Every number, every timing constraint, every cost figure can be traced back to real-world research. When the simulation says task A takes x minutes, that's based on equipment specifications and operational reports, not a guess.</p>
<p>The simulation passes through multiple validation gates for internal consistency, realism checks, research traceability, and operational testing.</p>
<p>This level of realism enables use cases that superficial simulations cannot support. Entrepreneurs can test their bakery concept before investing. Existing bakeries can model changes before implementing them. Students get exposure to realistic operational constraints. Investors can stress-test financial projections against realistic operational models.</p>
<h2 id="heading-adapting-to-other-domains">Adapting to Other Domains</h2>
<p>While we used a commercial bakery as our example, this methodology applies to any domain where operational realism matters: manufacturing, healthcare, logistics, retail, software development.</p>
<p>The key steps remain the same: deep research, synthesis, validation, implementation, testing, and documentation.</p>
<h2 id="heading-tools-and-technologies">Tools and Technologies</h2>
<p>Our methodology uses three complementary AI systems.</p>
<p>Gemini Deep Research excels at comprehensive research across multiple sources, synthesizing information from diverse documents, and providing citations with confidence levels.</p>
<p>Claude is great at synthesis and coherent narrative creation, excellent for critical analysis and validation, and strong at identifying inconsistencies and gaps.</p>
<p>Claude Code and Gemini 2.5 Pro (via AI Studio) validate JSON syntax and structure, check logical consistency in timing and dependencies, and automate testing and analysis scripts.</p>
<p>The Universal Automation Wiki Playground provides an interactive simulation editor with real-time validation, visual timeline for understanding task sequencing, economic analysis tools, and support for multiple day types.</p>
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<p>Building this realistic simulation taught me a few valuable lessons:</p>
<h3 id="heading-research-quality-varies-dramatically">Research quality varies dramatically</h3>
<p>The most valuable research came from actual business case studies with numbers, equipment manufacturer specifications (real specs, not marketing), industry interviews and operational reports, and government labour and financial data.</p>
<h3 id="heading-initial-assumptions-are-often-wrong">Initial assumptions are often wrong</h3>
<p>For example, I assumed bakeries might start around 5-6 AM, but many actually start at 2-4 AM. I also thought oven capacity equalled manufacturer specs, but in practice it's 10-20% less.</p>
<h3 id="heading-small-details-compound">Small details compound</h3>
<p>The difference between a 3-hour versus 4-hour proofing time cascades through the entire schedule. Getting these details right is the difference between a useful simulation and an expensive mistake.</p>
<h3 id="heading-additionally">Additionally:</h3>
<ul>
<li><p>The ability to model different operational modes with multiple day types dramatically increases simulation utility. Real businesses adapt, and simulations must too.</p>
</li>
<li><p>Even after extensive validation, running the simulation reveals emergent behaviours and unexpected bottlenecks.</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p>Want to build ultra-realistic simulations for your domain? Start with research before touching the simulation editor. Invest time understanding real operations by finding 3-5 real examples with detailed information.</p>
<p>Build a research base that synthesizes your findings into operational models, financial models, equipment specifications, and task dependencies.</p>
<p>Implement incrementally by starting with the core process flow and adding detail at each stage. Test carefully by running your simulation and analysing results against research-based targets. Document thoroughly by linking every element to supporting research and noting confidence levels.</p>
<p>Share and iterate with the UAW community to help improve your simulation.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building realistic simulations is complex, but necessary. The difference between a simulation grounded in research and one built on assumptions is the difference between a tool that supports good decisions and one that supports confident mistakes.</p>
<p>Our research-driven methodology, supported by powerful AI tools and the flexible UAW platform, makes it possible to create simulations with genuine operational fidelity. The multiple day types feature enables modelling real operational variety, while comprehensive validation ensures every element can be justified.</p>
<p>The goal isn't perfect simulations. Perfect is almost impossible when modelling messy reality. The goal is honest simulations that accurately represent what we know, clearly document what we've assumed, and help users understand both the possibilities and the limitations.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a target="_blank" href="https://universalautomation.wiki/playground">UAW Playground</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/JamieM0/uaw-research/blob/main/Parkside%20Bakery/plan.md">Download the Full Implementation Plan</a></p>
</li>
<li><p><a target="_blank" href="https://universalautomation.wiki/docs/simulations/universal-object-model.html">Universal Object Model Guide</a></p>
</li>
<li><p><a target="_blank" href="https://universalautomation.wiki/docs/playground/multi-day-simulations.html">Multiple Day Types Documentation</a></p>
</li>
</ul>
<h2 id="heading-how-you-can-help">How You Can Help</h2>
<p>The Universal Automation Wiki is just starting, and we need your help, feedback and input. Here's how you can get involved:</p>
<ul>
<li><p>Explore the project: <a target="_blank" href="https://universalautomation.wiki">https://universalautomation.wiki</a></p>
</li>
<li><p>Join the conversation: Suggest improvements and share your domain expertise</p>
</li>
<li><p>Contribute on GitHub: <a target="_blank" href="https://jamiem.me/uaw-github">https://jamiem.me/uaw-github</a></p>
</li>
<li><p>Contact us: <a target="_blank" href="mailto:contact@universalautomation.wiki">contact@universalautomation.wiki</a></p>
</li>
</ul>
<hr />
<p>Universal Automation Wiki is an independent open-core project being developed by <a target="_blank" href="https://jmatthews.uk/">Jamie Matthews</a> and supervised by <a target="_blank" href="https://linkedin.com/in/johndavidbustard">Dr. John Bustard</a> of Queen's University Belfast.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing the Simulation Playground]]></title><description><![CDATA[The goal of the Universal Automation Wiki is to provide a clear, data-driven understanding of the world's processes. A text description, however, is only half the story. To really analyse, improve, and understand a complex workflow, you need to inter...]]></description><link>https://blog.jmatthews.uk/introducing-the-simulation-playground</link><guid isPermaLink="true">https://blog.jmatthews.uk/introducing-the-simulation-playground</guid><category><![CDATA[simulation]]></category><category><![CDATA[AI]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[automation]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Wed, 27 Aug 2025 17:57:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756315349315/f636a6b6-707f-4fd4-9f0d-0f826168584b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The goal of the Universal Automation Wiki is to provide a clear, data-driven understanding of the world's processes. A text description, however, is only half the story. To really analyse, improve, and understand a complex workflow, you need to interact with it, test its limits, and see how its parts work together.</p>
<p>That’s where the Simulation Playground comes in: an interactive, web-based environment designed to build, test, and refine simulations. It serves as the bridge between the conceptual process maps on the wiki and the practical realities of dynamic, time-based models.</p>
<p>The Playground is a comprehensive toolset for anyone interested in process modelling, from students and researchers to engineers, compliance officers, HR professionals, and business analysts.</p>
<h2 id="heading-a-powerful-editor-at-its-core">A Powerful Editor at its Core</h2>
<p>At the heart of the Playground is a powerful JSON editor, built with <a target="_blank" href="https://github.com/microsoft/monaco-editor">Monaco</a>, the same editor found in modern web-based IDEs and code editors. It provides full syntax highlighting, error detection, and auto-formatting, making it straightforward to edit even the most complex simulation structures.</p>
<p>For those who prefer a guided approach, the interface includes a GUI for creating new objects via modals. You can add objects of certain types (actors, resources, equipment, or products) or define new tasks with specific timings, actor assignments, and resource dependencies. As you make changes to the JSON, the system provides immediate visual feedback, allowing for a rapid and iterative design workflow.</p>
<h2 id="heading-visualise-your-process-in-motion">Visualise Your Process in Motion</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756316565101/6adc8e40-de48-40d6-b958-0fac7c5dee8f.png" alt /></p>
<p>A simulation comes to life when you can see it unfold. The Playground’s primary feature is its interactive timeline renderer. This component transforms the raw simulation data into a clear, time-based chart, similar to a Gantt chart. You can instantly see which actors are assigned to which tasks, how long each step takes, and where potential idle time or bottlenecks exist.</p>
<p>The timeline is not just a static image; users can drag and drop tasks to change their start times or reassign them to different actors. A built-in animation player allows you to watch the process execute from start to finish, with variable speed controls and the ability to scrub to any point in time. This is an invaluable tool for debugging timing issues and presenting complex workflows to others.</p>
<h2 id="heading-add-space-to-your-process">Add Space to your Process</h2>
<p>To enhance the time-based simulation, the Playground includes a Space Editor for modelling the physical environment where a process occurs. This tool allows users to create a top-down layout of a workspace, such as a factory floor, office, or warehouse. Using simple drawing tools with grid snapping for precision, you can define specific zones, workstations, and equipment locations.</p>
<p>These defined areas can then be directly referenced by tasks in the simulation, adding a crucial spatial context to the workflow. This feature allows for a more holistic analysis, connecting the sequence of operations with the physical layout of a facility.</p>
<h2 id="heading-a-system-for-trustworthy-simulations">A System for Trustworthy Simulations</h2>
<p>A model is only as useful as it is reliable. To that end, the Playground is equipped with a real-time validation engine that constantly checks your work against a central catalogue of metrics and constraints.</p>
<p>This system automatically detects a wide range of common issues, including:</p>
<ul>
<li><p><strong>Structural Errors:</strong> Ensuring the simulation file conforms to the Universal Object Model.</p>
</li>
<li><p><strong>Scheduling Conflicts:</strong> Identifying when a single actor has been assigned to overlapping tasks.</p>
</li>
<li><p><strong>Resource Shortages:</strong> Tracking stock levels on a minute-by-minute (simulation time) basis to ensure a resource is never consumed before it is available.</p>
</li>
<li><p><strong>Logical Dependencies:</strong> Verifying that a task does not begin before its prerequisite tasks have been completed.</p>
</li>
</ul>
<p>The results are displayed in a clear, categorised panel, allowing you to quickly identify and resolve errors or warnings.</p>
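<p>To illustrate the kind of check the engine performs, here is a minimal sketch of scheduling-conflict detection. The data shape is invented for the example; the Playground's own validation catalogue is richer than this.</p>

```python
def find_scheduling_conflicts(tasks):
    """Flag any actor assigned to two overlapping tasks.
    Each task is a tuple: (actor, start_minute, duration_minutes, name)."""
    by_actor = {}
    for actor, start, duration, name in tasks:
        by_actor.setdefault(actor, []).append((start, start + duration, name))
    conflicts = []
    for actor, intervals in by_actor.items():
        intervals.sort()
        # Sorted by start time, so only adjacent pairs can overlap first
        for (s1, e1, n1), (s2, e2, n2) in zip(intervals, intervals[1:]):
            if s2 < e1:  # next task starts before the previous one ends
                conflicts.append((actor, n1, n2))
    return conflicts

tasks = [("baker", 0, 60, "mix dough"),
         ("baker", 30, 30, "shape loaves"),   # overlaps with mixing
         ("assistant", 0, 60, "prep station")]
```

<p>Run on the sample tasks, this flags the baker's overlapping assignments while leaving the assistant untouched.</p>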
<h4 id="heading-supporting-features-for-comprehensive-modelling"><strong>Supporting Features for Comprehensive Modelling</strong></h4>
<p>Beyond these core functions, the Playground includes a suite of tools to support a variety of use cases:</p>
<ul>
<li><p><strong>Space Editor:</strong> A drag-and-drop tool for designing the physical layout of a workspace, from a factory floor to an office, which can be integrated into simulations.</p>
</li>
<li><p><strong>Simulation Library:</strong> A collection of pre-built templates, such as the breadmaking process, providing a starting point for new projects or for learning modelling techniques.</p>
</li>
<li><p><strong>Save &amp; Load System:</strong> Simulations can be saved and shared via unique codes, allowing for collaboration and version control.</p>
</li>
<li><p><strong>Tutorial System:</strong> A guided, step-by-step introduction to the platform’s features, designed to onboard new users quickly.</p>
</li>
</ul>
<h4 id="heading-potential-use-cases"><strong>Potential Use Cases</strong></h4>
<p>The toolset is designed to be versatile. Process engineers can model complex manufacturing workflows to identify bottlenecks before they impact production. Educators can use the interactive tutorials and visualisations to teach simulation concepts. Business analysts can validate operational efficiency and resource utilisation, while facility planners can design and test workspace layouts in tandem with their process models.</p>
<h4 id="heading-the-next-step-for-the-uaw"><strong>The Next Step for the UAW</strong></h4>
<p>The Playground is a significant step towards our long-term vision. In the future, this tool will be directly integrated with the main wiki. Users will be able to open a simulation from a wiki article, refine it in the Playground, and contribute their improvements back to the community.</p>
<p>We invite you to <a target="_blank" href="https://universalautomation.wiki/playground">explore the Simulation Playground today</a>. For those interested in contributing to its development, our <a target="_blank" href="https://github.com/jamiem0/uaw">GitHub repository</a> is the best place to start.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing Metrics & Personas on the Universal Automation Wiki]]></title><description><![CDATA[With the Universal Automation Wiki, we aim to map out automation possibilities throughout various domains, and make this information accessible to as many people as possible.
To help make the site more useful and reliable for our users, we’re excited ...]]></description><link>https://blog.jmatthews.uk/introducing-metrics-and-personas-on-the-universal-automation-wiki</link><guid isPermaLink="true">https://blog.jmatthews.uk/introducing-metrics-and-personas-on-the-universal-automation-wiki</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[automation]]></category><category><![CDATA[knowledge graph]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Fri, 23 May 2025 20:52:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748033470794/a92fbe97-b3a6-4f29-b587-69427c155814.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the Universal Automation Wiki, we aim to map out automation possibilities throughout various domains, and make this information accessible to as many people as possible.</p>
<p>To help make the site more useful and reliable for our users, we’re excited to announce two new features: Personas and Metrics.</p>
<h2 id="heading-personas"><strong>Personas</strong></h2>
<p>A persona is a type of person who we expect will find the Universal Automation Wiki useful; together, the personas reflect the diverse target audience of the site. As many different people will want to use our site for their own purposes, we think it’s important to offer them more relevant information.</p>
<p><strong>Current Personas (may be subject to change)</strong>:</p>
<ul>
<li><p><strong>Hobbyists</strong>: Enthusiasts exploring automation for personal projects.</p>
</li>
<li><p><strong>Researchers</strong>: Academics and professionals conducting studies in automation.</p>
</li>
<li><p><strong>Investors</strong>: Individuals or entities interested in the financial aspects of automation technologies.</p>
</li>
<li><p><strong>Educators</strong>: Teachers and trainers seeking educational resources on automation.</p>
</li>
<li><p><strong>Field Experts</strong>: Industry professionals with specialized knowledge in their fields.</p>
</li>
</ul>
<p>These personas help us get users more relevant information: a user simply chooses the persona that fits them best, and the articles adapt accordingly.</p>
<p>In the future, it may be possible for users to effectively define their own personas by creating a list of the requirements that are useful for them.</p>
<h2 id="heading-metrics"><strong>Metrics</strong></h2>
<p>Metrics are boolean indicators that assess whether specific information is present in a content section: for example, whether a given section contains links or references to primary research.</p>
<p>These metrics are evaluated using a local LLM, keeping assessments efficient and accurate. The metric definitions live in a JSON <a target="_blank" href="https://github.com/JamieM0/uaw/blob/main/metrics/definitions.json">file</a>, and each persona is associated with a specific set of metrics. To use the earlier example, a researcher will care about links or references to primary research, whereas a hobbyist likely wouldn’t.</p>
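<p>As an illustrative sketch of the idea (the structure below is my own assumption, not the actual format of <code>definitions.json</code>), metric definitions and persona-to-metric mappings can be represented as plain data, and sections filtered per persona once a local LLM has scored them:</p>

```python
# Hypothetical metric and persona definitions; the real definitions.json
# in the uaw repository may use a different structure.
METRICS = {
    "primary_research_links": "Section links or refers to primary research",
    "step_by_step_detail": "Section breaks the task into concrete steps",
    "cost_estimates": "Section includes cost or investment figures",
}

# Each persona cares about a subset of the boolean metrics.
PERSONAS = {
    "researcher": ["primary_research_links"],
    "hobbyist": ["step_by_step_detail"],
    "investor": ["cost_estimates"],
}

def relevant_sections(persona, section_scores):
    """Return sections where every metric the persona cares about is True.

    section_scores maps section name -> {metric name: bool}, as might be
    produced by a local LLM evaluating each content section.
    """
    wanted = PERSONAS[persona]
    return [
        name for name, scores in section_scores.items()
        if all(scores.get(metric, False) for metric in wanted)
    ]

scores = {
    "intro": {"primary_research_links": False, "step_by_step_detail": True},
    "methods": {"primary_research_links": True, "step_by_step_detail": True},
}
print(relevant_sections("researcher", scores))  # -> ['methods']
```

<p>The key design point is that the persona dropdown never re-queries the LLM: the boolean scores are computed once per section, and switching persona only changes which subset of metrics is consulted.</p>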
<h2 id="heading-user-interface-changes"><strong>User Interface Changes</strong></h2>
<p>There is now a dropdown for users to select their preferred persona, and the content of each article automatically adapts, showing or hiding the sections relevant to that user. This replaces the previous system of a few arbitrary tabs.</p>
<h2 id="heading-how-you-can-help">How You Can Help</h2>
<p>The Universal Automation Wiki is just starting, and we need your help, feedback and input. Here’s how you can get involved:</p>
<ul>
<li><p>Explore the project: <a target="_blank" href="https://universalautomation.wiki">https://universalautomation.wiki</a></p>
</li>
<li><p>Join the conversation: Suggest new steps, vote on alternatives, and refine task trees</p>
</li>
<li><p>Contribute on GitHub: <a target="_blank" href="https://jamiem.me/uaw-github">https://jamiem.me/uaw-github</a></p>
</li>
<li><p>Contact us: <a target="_blank" href="mailto:contact@universalautomation.wiki">contact@universalautomation.wiki</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Introducing the Universal Automation Wiki]]></title><description><![CDATA[Automation is revolutionising the world again, in everything from software engineering, to healthcare, to logistics. Yet, as innovation accelerates, one thing remains unclear: how close are we to full automation? And where are the potential gaps in t...]]></description><link>https://blog.jmatthews.uk/introducing-the-universal-automation-wiki</link><guid isPermaLink="true">https://blog.jmatthews.uk/introducing-the-universal-automation-wiki</guid><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Wed, 21 May 2025 01:45:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747791747662/8275e3c8-fccb-4d5c-9235-bc0ca7abc6fc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Automation is revolutionising the world again, in everything from software engineering, to healthcare, to logistics. Yet, as innovation accelerates, one thing remains unclear: <strong>how close are we to full automation?</strong> And where are the potential gaps in the markets?</p>
<p>That’s why I’m creating the <strong>Universal Automation Wiki</strong>, a data-driven platform designed to track, structure and democratise the progress towards full autonomy in a wide variety of industries and domains. Our plan is backed by a novel approach we call <strong>Iterative AI</strong>.</p>
<p>The goal is to create a clear, grounded understanding of where automation stands right now, and where it’s going next.</p>
<h2 id="heading-a-global-map-of-automation">A Global Map of Automation</h2>
<p>The Universal Automation Wiki is a living, open-source platform that tracks the progress of automation across every field: software development, agriculture, education, logistics, you name it.</p>
<p>But it’s not just another collection of articles or a traditional wiki. It’s a new system: a set of structured processes that breaks tasks down into <strong>actionable steps</strong>, visualises them in interactive trees, and lets the community vote on the most effective approaches.</p>
<p>At the core is a technology we call <strong>Iterative AI</strong>. It builds knowledge not from speculation, but from what already works. Each task tree begins with existing tools and techniques and grows step-by-step, guided by real-world feasibility.</p>
<h2 id="heading-why-now">Why Now?</h2>
<p>Simply put, the timing for a project like this is excellent; a combination of factors in the AI space makes it possible for the project to exist at all:</p>
<ul>
<li><p>The explosion in the capabilities of large language models (LLMs), especially models capable of running on consumer-grade hardware, alongside the open-source ecosystem.</p>
</li>
<li><p>The rise of agent-based AI systems (just have a look at <a target="_blank" href="https://news.microsoft.com/build-2025/">Microsoft Build 2025</a>)</p>
</li>
<li><p>Fragmented and biased automation knowledge across domains</p>
</li>
<li><p>Growing demand for realistic and actionable automation roadmaps</p>
</li>
<li><p>Increasing tech readiness for collaborative knowledge platforms</p>
</li>
</ul>
<h2 id="heading-why-the-universal-automation-wiki-is-useful">Why the Universal Automation Wiki is useful</h2>
<p>Almost no one agrees on how far automation still has to go, and the information that does exist is often siloed, biased, or based on intuition rather than data. We want to do better by creating:</p>
<ul>
<li><p>A space where <em>anyone</em>, not just credentialed experts, can contribute meaningful insight.</p>
</li>
<li><p>A system where progress is measurable, not just inspirational</p>
</li>
<li><p>A platform that evolves as the technology evolves</p>
</li>
</ul>
<h2 id="heading-how-it-works">How It Works</h2>
<p>Most systems start from the top; they imagine an end goal and work backwards. This <em>sounds</em> logical, but it often leads to overly ambitious roadmaps, missed deadlines, and unpredictable timelines.</p>
<p>We’ve flipped that model on its head.</p>
<p>Instead of starting with a goal like “Automating customer support”, we start with a grounded question: “What tools already exist to handle customer inquiries?”, and from there, we build upwards.</p>
<p>This method doesn’t just reflect the real world more accurately, it helps identify where innovation is truly needed. It shows which components already work, where integration is possible, and what still needs to be invented.</p>
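<p>To make the bottom-up idea concrete, here is a minimal sketch (the names and structure are my own illustration, not the wiki’s internal model) of a task tree whose leaves record existing tools, whose feasibility is computed upwards, and whose alternatives are chosen by community votes:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One node in a task tree. Leaves are grounded in existing tools;
    higher nodes compose the steps beneath them."""
    name: str
    tools: list = field(default_factory=list)   # existing tools that handle this step
    children: list = field(default_factory=list)
    votes: int = 0                              # community votes for this approach

    def automatable(self):
        """A step is automatable if a tool already exists for it, or if
        every sub-step is automatable: feasibility is built bottom-up."""
        if self.tools:
            return True
        return bool(self.children) and all(c.automatable() for c in self.children)

def best_alternative(alternatives):
    """Community voting: pick the highest-voted alternative approach."""
    return max(alternatives, key=lambda s: s.votes)

# Grounded question: what already exists to handle customer inquiries?
triage = Step("Classify inquiry", tools=["intent classifier"])
answer = Step("Draft reply", tools=["LLM assistant"])
escalate = Step("Escalate edge cases")  # no tool yet: an innovation gap
support = Step("Handle customer inquiries", children=[triage, answer, escalate])
print(support.automatable())  # -> False: the gap at 'escalate' is visible
```

<p>Because feasibility propagates upwards from what already works, a <code>False</code> at the root points directly at the sub-steps where invention is still needed, which is exactly the gap-finding behaviour described above.</p>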
<h2 id="heading-what-makes-us-different">What Makes Us Different</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Universal Automation Wiki</td><td>Traditional Wikis</td><td>Expert Systems</td></tr>
</thead>
<tbody>
<tr>
<td>Design</td><td>Bottom-up, starts with existing</td><td>Flat structure with categories</td><td>Top-down, starts with goals/concepts</td></tr>
<tr>
<td>Quality</td><td>Transparent &amp; objective metrics-based scoring</td><td>Subjective editor consensus &amp; citations</td><td>Closed "trust us" authority with limited transparency</td></tr>
<tr>
<td>Structure</td><td>Dynamic trees showing multiple solutions</td><td>Static articles with fragmented information</td><td>Rigid frameworks resistant to innovation</td></tr>
<tr>
<td>Timelines</td><td>Data-backed forecasts with measurable accuracy</td><td>Absent or purely speculative</td><td>Absent or purely speculative</td></tr>
<tr>
<td>Bias Mitigation</td><td>Democratic voting system immune to individual bias</td><td>Dominated by the vocal few &amp; edit wars</td><td>Echo chambers reinforcing established viewpoints</td></tr>
<tr>
<td>Contribution</td><td>Inclusive system where quality speaks for itself</td><td>Requires editor/moderator approval</td><td>Exclusive club limited to established credentials</td></tr>
<tr>
<td>Adaptability</td><td>Rapid evolution through continuous feedback</td><td>Slow updates depending on editor/moderator availability</td><td>Resistant to change outside scheduled revision cycles</td></tr>
</tbody>
</table>
</div><h2 id="heading-who-could-benefit">Who Could Benefit</h2>
<p>Because we aim to cover as many industries and domains as possible, many different groups stand to benefit from the Universal Automation Wiki. Here are a few examples:</p>
<ul>
<li><p>Researchers &amp; developers who want to benchmark progress or spot automation gaps</p>
</li>
<li><p>Policy thinkers &amp; futurists curious about automation’s trajectory</p>
</li>
<li><p>Educators &amp; students looking for structured, real-world examples</p>
</li>
<li><p>Tech enthusiasts who want to understand, rather than just speculate about, where things are going</p>
</li>
</ul>
<h2 id="heading-how-you-can-help">How You Can Help</h2>
<p>The Universal Automation Wiki is just starting, and we need your help, feedback and input. Here’s how you can get involved:</p>
<ul>
<li><p>Explore the project: <a target="_blank" href="https://universalautomation.wiki">https://universalautomation.wiki</a></p>
</li>
<li><p>Join the conversation: Suggest new steps, vote on alternatives, and refine task trees</p>
</li>
<li><p>Contribute on GitHub: <a target="_blank" href="https://jamiem.me/uaw-github">https://jamiem.me/uaw-github</a></p>
</li>
<li><p>Contact us: <a target="_blank" href="mailto:contact@universalautomation.wiki">contact@universalautomation.wiki</a></p>
</li>
</ul>
<h2 id="heading-building-an-open-future">Building an Open Future</h2>
<p>Automation progress shouldn’t be overseen by a handful of organisations behind closed doors. It should be visible, participatory, and driven by collective intelligence.</p>
<p>With the Universal Automation Wiki, we’re creating a platform that reflects that belief.</p>
<p>We hope you’ll join us.</p>
<hr />
<blockquote>
<p>This project is being developed by <a target="_blank" href="https://jmatthews.uk">Jamie Matthews</a> and supervised by <a target="_blank" href="https://linkedin.com/in/johndavidbustard">Dr. John Bustard</a> in QLab, at Queen’s University Belfast.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Structured Outputs]]></title><description><![CDATA[A Structured Output is a way of interacting with a large language model while ensuring that its output conforms to a specific format that you can then use in your applications.
This is achieved by creating a JSON file which contains the information a...]]></description><link>https://blog.jmatthews.uk/structured-outputs</link><guid isPermaLink="true">https://blog.jmatthews.uk/structured-outputs</guid><category><![CDATA[AI]]></category><category><![CDATA[structured data]]></category><category><![CDATA[llm]]></category><category><![CDATA[json]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Jamie Matthews]]></dc:creator><pubDate>Wed, 21 May 2025 00:26:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747787825096/6c6c0e08-8afb-4a65-9bd1-9446d9c0c2f5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A Structured Output is a way of interacting with a large language model while ensuring that its output conforms to a specific format that you can then use in your applications.</p>
<p>This is achieved by creating a JSON schema that describes the properties you require the LLM to output.</p>
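<p>For example, in OpenAI’s Structured Outputs API the schema is attached to the request as a <code>response_format</code> of type <code>json_schema</code>; the property names inside the schema below are purely illustrative:</p>

```json
{
  "type": "json_schema",
  "json_schema": {
    "name": "review_summary",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "summary": { "type": "string" },
        "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] }
      },
      "required": ["summary", "sentiment"],
      "additionalProperties": false
    }
  }
}
```

<p>With <code>"strict": true</code> and <code>"additionalProperties": false</code>, the model is constrained to return exactly these fields, which is what makes the output safe to parse programmatically.</p>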
<p>To make this process more flexible, I’ve created a Java implementation of the most useful commands; it’s available on GitHub <a target="_blank" href="https://github.com/JamieM0/structured-outputs">here</a>.</p>
<p>To use it, simply create a Structure object and add your required properties. You can then choose to restrict the response to conform <em>exactly</em> to your structure, or allow the model to add its own properties.</p>
<p>For example, you can require that the LLM output a specific value, like the confidence level in its response. This can then be used to either disregard its response entirely, or use it for the originally intended purpose.</p>
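<p>As a sketch of that pattern (the response payload, field names, and threshold here are made up for illustration, and this is not the API of my Java library), you can parse the structured response and discard it whenever its self-reported confidence is too low:</p>

```python
import json

def accept_if_confident(raw_response: str, threshold: float = 0.7):
    """Parse a structured LLM response and return its answer only when the
    model's self-reported confidence meets the threshold; otherwise None."""
    data = json.loads(raw_response)
    if data["confidence"] >= threshold:
        return data["answer"]
    return None  # disregard low-confidence responses entirely

# Structured responses as an LLM might return them under such a schema.
print(accept_if_confident('{"answer": "Belfast", "confidence": 0.92}'))  # -> Belfast
print(accept_if_confident('{"answer": "unsure", "confidence": 0.35}'))   # -> None
```

<p>Because the schema guarantees the <code>confidence</code> field is present and numeric, the gating logic stays a one-line comparison instead of defensive parsing.</p>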
<p>To learn more about the OpenAI implementation of Structured Outputs, click <a target="_blank" href="https://platform.openai.com/docs/guides/structured-outputs/json-mode">here</a>.</p>
]]></content:encoded></item></channel></rss>