Baseball Bench · 2026 Season

AI manager league

Models are compared as baseball managers against the rest of the field. Research and tactical decisions are scouting context; league play is the main result.

Standings

RES, DEC, and PCT are shown as batting-average-style rates. DIFF is run differential; ELO is the manager rating.

Provisional order: the controlled manager league or GM scoring has not finished. OVR is the mean of completed track ratings.

No model results yet.

How to read the stats

OVR

Overall rating

Mean of a model's completed track ratings, shown like a batting average (.000–1.000).
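As a rough illustration of the OVR definition above, the sketch below averages only the tracks that have a rating and formats the result like a batting average. The track names used as dictionary keys and both function names are hypothetical; the page does not say how the site computes or stores these values.

```python
from statistics import mean
from typing import Optional


def overall_rating(track_ratings: dict[str, Optional[float]]) -> Optional[float]:
    """Mean of completed track ratings; None if no track has finished.

    Assumption: an unfinished track is represented by None and is simply
    left out of the mean rather than counted as zero.
    """
    completed = [r for r in track_ratings.values() if r is not None]
    return mean(completed) if completed else None


def as_batting_average(rate: float) -> str:
    """Format a 0.0-1.0 rate with three decimals and no leading zero."""
    text = f"{rate:.3f}"
    return text[1:] if text.startswith("0") else text


# Hypothetical track names; two of three tracks are complete.
ratings = {"research": 0.5, "decisions": 0.75, "league": None}
ovr = overall_rating(ratings)
print(as_batting_average(ovr))  # .625
```

Under this reading, a model with fewer finished tracks is averaged over fewer values, which is one reason the standings are marked provisional until every track completes.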

RES

Research

Share of baseball research questions answered correctly.

DEC

Decisions

Share of late-game situations where the model picked the best or near-best move.

W-L

League record

Wins and losses as a manager in head-to-head league play.

PCT

Win percentage

League winning percentage (wins ÷ games).

DIFF

Run differential

Runs scored minus runs allowed across league games.
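The W-L, PCT, and DIFF definitions above can be sketched as a small record type. This is a minimal example, assuming games equals wins plus losses (no ties); the class and field names are hypothetical and not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeagueRecord:
    wins: int
    losses: int
    runs_scored: int
    runs_allowed: int

    @property
    def games(self) -> int:
        # Assumption: every league game ends in a win or a loss.
        return self.wins + self.losses

    @property
    def pct(self) -> Optional[float]:
        """Winning percentage (wins / games); None before any game is played."""
        return self.wins / self.games if self.games else None

    @property
    def diff(self) -> int:
        """Run differential: runs scored minus runs allowed."""
        return self.runs_scored - self.runs_allowed


record = LeagueRecord(wins=3, losses=1, runs_scored=20, runs_allowed=14)
print(record.pct, record.diff)  # 0.75 6
```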

ELO

Manager rating

Skill rating from league results; everyone starts at 1500 and higher is stronger.
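The page states only that ratings start at 1500 and that higher is stronger. The sketch below shows the standard Elo update as one way such a rating could move after a game. The K-factor of 20, the 400-point scale, and the function names are assumptions, and the benchmark may use a different variant.

```python
START_RATING = 1500.0  # starting rating stated on the page
K_FACTOR = 20.0        # assumption: the page does not give a K-factor


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return the new (A, B) ratings after one head-to-head game."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two new managers meet; the winner gains what the loser gives up.
print(update_elo(START_RATING, START_RATING, a_won=True))  # (1510.0, 1490.0)
```

With this form of the update, beating a higher-rated manager moves the rating more than beating a lower-rated one, because the expected score is smaller.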

TRK

Tracks complete

How many of the 3 benchmark tracks this model has finished.

League Leaders

Top mark in each category across the public field.

Research AVG

—

Not finished yet

Research AVG scoring has not finished for the current public-model snapshot.

Decision AVG

—

Not finished yet

Decision AVG scoring has not finished for the current public-model snapshot.

GM Score

—

Not finished yet

GM Score scoring has not finished for the current public-model snapshot.

League PCT

—

Not finished yet

League PCT scoring has not finished for the current public-model snapshot.

Manager League

Head-to-head matchups pending

Run the OpenRouter pack through league play to compare each model as a manager against the rest of the field.

Manager Cards

Full stat line for every model in the field.

League Runs On File

No completed public-model league snapshot yet.

Benchmark Notes

Internal calibration baselines are excluded from this public comparison so the page only shows actual model entries.

Run History

No saved benchmark snapshots yet.