How I score a heads-up bot. And why I'll defend it.
My evaluation framework (and yeah, I'll defend it) has 4 dimensions: timing consistency, action-distribution shape, willingness to deviate from GTO when exploited, and lobby-grinding behaviour. Most bots pass 2 of 4. The ones that pass 3 are A-tier. I haven't found one that passes all 4.
1. Timing consistency
The boring one, but it's first because it's the easiest to fail. A real opponent's response-time distribution has a long tail. They snap-fold their bottom range in 400ms, they snap-shove their top range in 700ms, and they tank on the medium decisions for 4–9 seconds. A naive bot bolts a uniform 1.5s ± random-jitter onto every action. That's a timing tell visible inside 80 hands.
What I want to see: a tri-modal distribution. Fast / medium / slow buckets, with conditional means that track decision difficulty (not just random noise). Bot A passes this. Bot K does not: its timing histogram is a textbook Gaussian. I'd score Bot K a 2/10 on this dimension alone, and it drags the whole bot down.
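A minimal sketch of the "conditional means track difficulty" check, under stated assumptions. The difficulty labels (`easy`, `medium`, `hard`), the `min_ratio` threshold, and both function names are hypothetical and not part of the scoring method described above. The sketch only asks whether mean response time rises with difficulty and whether the slow bucket is clearly separated from the fast one.

```python
from collections import defaultdict
from statistics import mean


def conditional_means(samples):
    """Mean response time per difficulty label.

    samples: iterable of (difficulty_label, response_seconds) pairs.
    """
    buckets = defaultdict(list)
    for label, seconds in samples:
        buckets[label].append(seconds)
    return {label: mean(times) for label, times in buckets.items()}


def timing_tracks_difficulty(samples, order=("easy", "medium", "hard"), min_ratio=2.0):
    """True if mean time strictly rises along `order` and the slowest
    bucket is at least `min_ratio` times the fastest (assumed threshold)."""
    means = conditional_means(samples)
    if any(label not in means for label in order):
        return False  # a missing bucket cannot be judged
    ordered = [means[label] for label in order]
    increasing = all(a < b for a, b in zip(ordered, ordered[1:]))
    return increasing and ordered[-1] >= min_ratio * ordered[0]


# Illustrative, made-up timings in seconds.
long_tailed = [("easy", 0.4), ("easy", 0.5), ("medium", 0.7),
               ("medium", 1.1), ("hard", 4.0), ("hard", 9.0)]
flat_jitter = [("easy", 1.4), ("easy", 1.6), ("medium", 1.5),
               ("medium", 1.5), ("hard", 1.6), ("hard", 1.4)]
```

With real data the difficulty label has to come from somewhere (pot-odds closeness, solver EV gap between the top two actions, and so on). That labelling step is left out here.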
2. Action-distribution shape
Take 600 hands, dump the bot's actions by street and position, and compare to a solver's range realization. I don't need an exact match — that's unreasonable and probably a sign of a bot ripping pre-cached solutions, which is its own problem. I want the shape to be right.
Specifically: does the river check-raise frequency exist at all? Most weak bots have a check-raise frequency of zero on rivers (they just don't have it coded). Does the IP turn over-bet appear when the board polarizes? Does the bot 4-bet bluff at >0% from any position? For each of these that is flatlined at zero, I drop the action-shape score by 2 points.
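The flatline penalty can be sketched as below. The 2-point deduction per flatlined line comes from the text. The starting `base_score`, the floor of 1 (taken from the 1–10 scale), and the key names are assumptions for illustration only.

```python
def action_shape_score(base_score, frequencies, penalty=2, floor=1):
    """Subtract `penalty` points for each action whose observed frequency is zero.

    frequencies: dict mapping a (hypothetical) action name to its observed
    frequency in the sample. Returns (score, list_of_flatlined_actions).
    """
    flatlined = [name for name, freq in frequencies.items() if freq == 0]
    score = max(floor, base_score - penalty * len(flatlined))
    return score, flatlined


# Example: two of the three checked lines never appear in the sample.
observed = {
    "river_check_raise": 0.0,
    "ip_turn_overbet": 0.04,
    "four_bet_bluff": 0.0,
}
score, missing = action_shape_score(8, observed)
```

A frequency of exactly zero over 600 hands can also be a sample-size artifact for rare spots, so it is worth checking how many opportunities each line actually had before applying the penalty.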
The numbers I actually look at
- 3-bet frequency vs open-raise: solver wants roughly 24–30% HU at 100bb. Bots cluster around 18%. Suspect.
- River bluff frequency: solver wants ~32% of bets in polarized spots. Most bots ship 8–12%. They under-bluff hard.
- Check-call vs check-raise mix on the turn: should be roughly 70/30 on wet boards. Bots default to 95/5.
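The three reference numbers above can be turned into a quick gap report. This is a sketch, not a definitive implementation. The reference values are the ones quoted in the list, the key names are hypothetical, and treating "~32%" and "roughly 70/30" as exact points with no tolerance band is a simplifying assumption.

```python
# Reference ranges in percent, taken from the list above.
# Point estimates are stored as zero-width ranges (an assumption).
REFERENCE = {
    "three_bet_vs_open": (24.0, 30.0),
    "river_bluff_share": (32.0, 32.0),
    "turn_check_raise_share": (30.0, 30.0),
}


def gap_to_reference(observed, low, high):
    """Signed distance in percentage points from the reference range.

    Negative means the bot is below the range, positive above, 0.0 inside.
    """
    if observed < low:
        return observed - low
    if observed > high:
        return observed - high
    return 0.0


def gap_report(observed):
    """Gap per metric for a dict of observed percentages."""
    return {
        name: gap_to_reference(observed[name], low, high)
        for name, (low, high) in REFERENCE.items()
    }


# The "typical bot" figures quoted above (10.0 is the middle of 8-12).
typical_bot = {
    "three_bet_vs_open": 18.0,
    "river_bluff_share": 10.0,
    "turn_check_raise_share": 5.0,
}
report = gap_report(typical_bot)
```

For the typical bot every gap comes out negative, which matches the under-aggression pattern described in the list.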
3. Willingness to deviate from GTO when exploited
This is the one that separates the A-tier from everything below. If I openly over-fold to my opponent's flop continuation bet (like, 75% folds, way above equilibrium), does the bot notice and start barreling lighter? An A-tier engine adapts inside ~150 hands. A B-tier engine adapts after about 500 hands. A C-tier engine never adapts. It just plays its prior strategy until the heat death of the universe.
How I test it: I deliberately repeat one well-defined, exploitable mistake for 200 hands and watch what the bot does in hands 201–400. If its action distribution shifts toward exploiting me, I credit it. If it doesn't, I don't.
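One way to put a number on "its action distribution shifts" is a two-proportion z-test between the two windows. This test is a swapped-in technique for illustration, not something the method above specifies. Comparing hands 1–200 against hands 201–400 and tracking a single yes/no action (for example, "bot bet the flop when it could") are both assumptions.

```python
from math import sqrt


def shift_z_score(baseline, test):
    """Two-proportion z statistic for a change in action frequency.

    baseline, test: lists of booleans, one per opportunity, where True
    means the bot took the exploitative action. Positive z means the
    action became more frequent in the test window.
    """
    n1, n2 = len(baseline), len(test)
    if n1 == 0 or n2 == 0:
        raise ValueError("both windows need at least one observation")
    p1 = sum(baseline) / n1
    p2 = sum(test) / n2
    pooled = (sum(baseline) + sum(test)) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        return 0.0  # action always or never taken in both windows
    return (p2 - p1) / standard_error


# Made-up example: 40 of 100 opportunities taken early, 60 of 100 later.
early = [True] * 40 + [False] * 60
late = [True] * 60 + [False] * 40
z = shift_z_score(early, late)
```

A common rule of thumb is to treat |z| above roughly 2 as a real shift, but that cutoff is a convention rather than part of the scoring described here. With only a few dozen opportunities per window the statistic is noisy.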
4. Lobby-grinding behaviour
This one the bot devs forget. A real reg picks seats with intent, leaves when they're stuck-and-tired, doesn't sit at four tables of the same opponent in 10 minutes, and doesn't bot-hop tables with millisecond-accurate timing. The lobby layer is where the easiest detection signals live, and most bots are sloppy about it.
What scores well: variable session lengths (45m–3h), seat-selection logic that prefers tagged-as-fish opponents but not exclusively (which is a different kind of tell), some willingness to leave a profitable seat (humans get tired or distracted), and rate-limited table joins.
How the dimensions roll up
I score each dimension 1–10 and average them, equally weighted. I don't weight skill higher than behaviour, even though that's tempting. Reason: a 10/10 strategy bot with a 2/10 behaviour layer is going to get banned inside two months, and its EV-per-hour is zero from a player's perspective. I care about the whole package.
A bot that scores 8 / 7 / 6 / 8 averages to 7.25, which I'd round to 7. A bot that scores 9 / 9 / 8 / 5 averages to 7.75, which rounds to 8, but the breakdown matters: the headline number hides that 5. I publish both the rounded score and the average on the format comparison page when sample sizes allow.
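The roll-up arithmetic is simple enough to sketch directly. Equal weighting and the 1–10 range come from the text. The function name, the dimension keys, and the choice of round-half-up (rather than Python's built-in `round`, which rounds exact halves to the nearest even number) are assumptions.

```python
from math import floor

# Order follows the four sections above; key names are hypothetical.
DIMENSIONS = ("timing", "action_shape", "deviation", "lobby")


def roll_up(scores):
    """Return (average, rounded_headline) for four 1-10 dimension scores."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} scores")
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("each score must be between 1 and 10")
    average = sum(scores) / len(scores)
    headline = floor(average + 0.5)  # round half up
    return average, headline


first = roll_up((8, 7, 6, 8))   # average 7.25
second = roll_up((9, 9, 8, 5))  # average 7.75
```

Returning both values keeps the average next to the headline, and the raw tuple is still the only place a weak single dimension shows up.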
We run a small heads-up group. If you want a structured match-up rather than the public lobby grind, get in touch.