01
The TL;DR
- Anthropic (Opus / Sonnet / Haiku) returned +4.45% ($104,451). OpenAI (GPT-5.4) returned +2.48% ($102,484).
- SPY returned +8.36% over the same period. Neither AI council beat the index.
- Anthropic generated +11.76% annualized alpha with 0.84 beta. The AI did add skill beyond market exposure.
- The system launched with 4 bugs that took 2 days to fix. The first 12 days ran on partially broken plumbing.
- Both councils independently converged to the same conviction allocation despite opposite trading strategies.
- A weekly self-review caused the two councils to calibrate in opposite directions — Anthropic loosened, OpenAI tightened. This was the turning point.
The AI generated positive alpha. It also cost 2.5x more and took 6x longer than the cheaper model that nearly matched its returns.
| | Anthropic | OpenAI | SPY |
|---|---|---|---|
| Return | +4.45% | +2.48% | +8.36% |
| Alpha | +11.76% | +1.56% | — |
| AI Cost | $58.17 | $22.96 | $0 |
| Sessions | 101 | 100 | — |
02
The Scoreboard
Anthropic: $104,451 (+4.45%) · 1,842 trades · 65% win rate · $58 AI cost
vs
OpenAI: $102,484 (+2.48%) · 2,091 trades · 70% win rate · $23 AI cost
How did the AI councils compare against simple benchmarks? The theoretical replay removes execution friction (wash trades, regime sizing, drift thresholds) to isolate pure conviction signal quality.
| Strategy | Return | Max DD | Sharpe | Win Rate |
|---|---|---|---|---|
| Anthropic | +7.93% | -3.28% | 6.35 | 66.7% |
| OpenAI | +8.18% | -4.05% | 5.69 | 66.7% |
| SPY | +8.36% | -3.78% | 5.93 | 72.2% |
| Equal-Wt | +10.03% | -4.98% | 5.47 | 66.7% |
- Return: total profit/loss over 30 days.
- Max Drawdown: the worst peak-to-trough decline. Closer to 0% is better.
- Sharpe Ratio: return per unit of risk. Above 1.0 is good; above 2.0 is excellent.
- Win Rate: percentage of trading days that ended green.
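The benchmark metrics above can be computed directly from a daily equity curve. A minimal sketch (function names are illustrative, not the experiment's code; Sharpe is annualized by the usual √252):

```python
import numpy as np

def max_drawdown(equity):
    """Worst peak-to-trough decline of an equity curve (a negative number)."""
    peaks = np.maximum.accumulate(np.asarray(equity, dtype=float))
    return ((equity - peaks) / peaks).min()

def sharpe(daily_returns, risk_free_daily=0.0, trading_days=252):
    """Annualized Sharpe ratio from daily returns."""
    excess = np.asarray(daily_returns, dtype=float) - risk_free_daily
    return excess.mean() / excess.std() * np.sqrt(trading_days)
```

Over a 30-day window, annualizing in this way inflates the Sharpe values, which is why the table shows figures in the 5-6 range rather than the 1-2 range typical of full-year track records.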
03
Did the AI Add Value?
Beta is the free ride. If you own stocks and stocks go up, you go up. Alpha is the part that is actually the AI's doing. If alpha is positive, the AI made smart bets that beat the market. If alpha is zero, you could have replaced the AI with a single ETF purchase.
| | Alpha (annualized) | Beta | R-squared |
|---|---|---|---|
| Anthropic | +11.76% | 0.84 | 0.90 |
| OpenAI | +1.56% | 0.97 | 0.89 |
Anthropic's beta of 0.84 means it took less market risk than SPY — it wasn't just riding the market. Its alpha of +11.76% means the conviction-weighted allocation generated genuine excess returns. OpenAI's beta of 0.97 means it was mostly tracking the market, with only +1.56% alpha — minimal skill signal. R-squared of 0.90 means 90% of variance came from the market; the rest from AI decisions.
Alpha is skill. Beta is showing up. Anthropic showed skill. OpenAI showed up.
04
Which Themes Won?
| Theme | Anthropic | OpenAI | A / O |
|---|---|---|---|
| AI & Compute | +$4,159 | +$4,015 | 9 / 10 |
| Quantum Computing | +$1,044 | +$1,472 | 1 / 1 |
| Crypto & Digital | +$852 | +$927 | 2 / 1 |
| Biotech & Health | +$785 | +$546 | 6 / 1 |
| Real Estate | +$543 | +$416 | 3 / 1 |
| Energy & Grid | +$382 | +$638 | 2 / 1 |
| Defensive / Hedge | +$278 | +$321 | 1 / 1 |
| Defense & Aerospace | -$242 | -$280 | 1 / 2 |
Both councils correctly rotated into AI & Compute as the regime shifted from BEAR to BULL. The defensive and defense themes that dominated early allocations were shed as the market recovered. Only defense lost money — a casualty of the regime shift it was designed to protect against.
05
The Journey
Every intervention is documented. The council journals capture AI decisions. This captures builder decisions — what broke, what changed, and why.
March 23
Blind start. System launched with null momentum data, stale regime model, broken Haiku outputs, and wrong timezone scheduling. Both councils made Day 1 allocations on broken data.
March 24
The critical fix. Portfolio drawdown was being measured from 1-year stock peaks instead of experiment entry prices. SMR showed -79% drawdown when our actual loss was -2%. Fixing this unlocked Anthropic's first non-vetoed session.
March 26
Scheduler overhaul. Event-driven sessions burned all 4 daily slots by noon, missing the market close entirely. Switched to fixed 5-window schedule capturing open, mid-morning, afternoon, power hour, and pre-close.
End of Week 1
Conviction convergence. Despite opposite trading strategies (Anthropic aggressive, OpenAI frozen), both councils independently arrived at identical conviction allocations. $13 apart on $100k.
April 5
The recursive loop. Both flagship models reviewed their own 2-week performance and self-calibrated. They went opposite directions: Anthropic loosened its risk stance, OpenAI tightened. The divergence became structural.
April 5
Execution fix revealed a design flaw: the system was buying more shares of positions the Risk Sentinel flagged for exit. Fixing it triggered wash trade blocks on 5 tickers — the old bug's legacy.
April 7
Anthropic takes the lead for the first time. The self-calibration paid off.
April 10
Regime shift: BEAR to BULL. Both councils detected it and rotated from defense/defensive into AI & Compute.
April 17
Experiment ends. Anthropic +4.45%, OpenAI +2.48%. 201 sessions, 3,933 trades, $81 in AI cost.
06
The Veto Gamble
The Risk Sentinel can veto all conviction increases in a session. Decreases still go through. Anthropic was vetoed 64% of sessions. OpenAI was vetoed 74%. What if the veto hadn't existed?
| Scenario | Anthropic | OpenAI |
|---|---|---|
| Actual | +7.93% | +8.18% |
| No Veto | +7.27% | +7.38% |
| Static | +8.23% | +8.23% |
The vetoes created value — Actual beat No Veto by 0.66–0.80%. The Risk Sentinel was right to block the Thesis Analyst's more aggressive proposals. But the Static portfolio beat everything. The uncomfortable truth: doing nothing from Day 1 would have been the best strategy.
This doesn't mean the AI was useless. The initial allocation on Day 1 was itself an AI decision. And in a BEAR-to-BULL transition, the starting portfolio happened to be well-positioned. The question is whether active management added or subtracted from that starting point.
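The veto rule as described is simple to state: in a vetoed session, every conviction increase is clamped back to the current weight, while decreases pass through. A minimal sketch of that rule (an illustration of the mechanic, not the production engine):

```python
def apply_veto(current, proposed, vetoed):
    """Risk Sentinel veto: block conviction increases, allow decreases.

    current/proposed map ticker -> conviction weight. A new position is
    an increase from zero, so under a veto it is blocked entirely.
    """
    if not vetoed:
        return dict(proposed)
    return {t: min(p, current.get(t, 0.0)) for t, p in proposed.items()}
```

For example, with a current NVDA weight of 0.10, a vetoed proposal of 0.15 stays at 0.10, while a proposal of 0.03 goes through.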
07
The Cost of Intelligence
- Total AI Cost: $81.13. Less than a nice dinner for 30 days of autonomous trading.
- Profit per $1 AI Spend: $77 / $108. Anthropic $77, OpenAI $108. Both massively positive ROI on AI spend.
- Total Sessions: 201. 5 per day across 5 market windows, weekdays only.
- Fill Rate: 99% / 98%. Wash trades and API outages were the main friction.
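The profit-per-dollar figures fall straight out of the P&L and cost numbers above (profit is the final balance minus the $100k start):

```python
# Final P&L and total AI spend, from the scoreboard above.
profit = {"Anthropic": 104451 - 100000, "OpenAI": 102484 - 100000}
ai_cost = {"Anthropic": 58.17, "OpenAI": 22.96}

roi = {name: profit[name] / ai_cost[name] for name in profit}
# Roughly $77 of profit per $1 of AI spend for Anthropic, $108 for OpenAI.
```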
| Agent | Anthropic | OpenAI | Role |
|---|---|---|---|
| Macro Oracle | $2.31 | $0.22 | Scout (cheapest) |
| Narrative Intel | $17.02 | $9.17 | Scout (mid-tier) |
| Thesis Analyst | $21.50 | $6.59 | Advocate (mid-tier) |
| Risk Sentinel | $16.14 | $6.51 | Adversary (flagship) |
08
What Broke
Transparency is the point. Here is everything that went wrong.
Null momentum — Momentum data was null for the first 6 sessions. The pre-compute layer had a bug in skip-month calculation. Both councils made initial allocations without trend data.
Regime model divergence — The HMM model trained simultaneously on both services at restart, producing different models from the same data. Anthropic was stuck on a 216-day BEAR signal while OpenAI correctly showed 8 days.
Execution contradiction — The execution engine was buying more shares of positions the Risk Sentinel flagged for exit. This design flaw ran for 12 days before the weekly self-review caught it.
Wash trade trap — Fixing the execution bug triggered wash trade blocks on COIN, IONQ, MSTR, RGTI, and SMR. These positions were frozen for 30 days — the legacy of 12 days of compounding the wrong trades.
Opus API outages — Opus hit API capacity limits (529 errors) twice during peak market hours. The flagship model was the least reliable component in the stack.
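The "skip-month" calculation behind the null-momentum bug is presumably the classic 12-1 momentum signal: the trailing twelve-month return, excluding the most recent month because it tends to mean-revert. A generic sketch of that signal (an assumption about the pre-compute layer, not its actual code), including the not-enough-history case that yields a null:

```python
def skip_month_momentum(monthly_closes):
    """12-1 momentum: 12-month return, skipping the most recent month.

    monthly_closes is a chronological list of month-end closes. With
    fewer than 13 months of data the signal is undefined, so we return
    None -- one plausible way nulls leak into downstream allocations.
    """
    if len(monthly_closes) < 13:
        return None
    return monthly_closes[-2] / monthly_closes[-13] - 1.0
```

Handling the None explicitly (rather than letting it propagate silently into an allocation) is the kind of guard that would have surfaced the bug on Day 1 instead of session 7.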
09
The Verdict
The architecture works. The alpha is real. The cost premium is hard to justify.
What Worked
Adversarial council architecture produced genuine debate between bullish and bearish agents. The self-calibration loop caused meaningful behavioral divergence between identical systems. Regime detection correctly identified the BEAR-to-BULL transition. Deterministic guardrails prevented catastrophic decisions — conviction caps, turnover limits, cash reserves all held.
What Didn't
Neither council beat SPY on actual returns. Opus costs 2.5x more and takes 6x longer than GPT-5.4, with 2 outages vs 0. Active conviction management slightly underperformed a static portfolio. The execution layer had a fundamental design flaw that ran undetected for 12 days.
The Real Question
Is this worth advancing to real money? The alpha signal is positive but the execution gap is large. The architecture proved it can generate conviction, manage risk, and self-improve. But it also proved that simpler and cheaper can be just as effective in a trending market. The next phase needs backtesting against non-trending markets — sideways chop, flash crashes, regime uncertainty — where narrative-driven conviction should theoretically outperform passive strategies.
Built With
Co-authored with Claude Code
Every line of architecture, every agent prompt, every risk model, this analysis, and this page.