Show HN: Watch LLMs play 21,000 hands of Poker

PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning have all been included.

All code -> https://github.com/JoeAzar/pokerbench

Summary

The post introduces PokerBench, a benchmark in which frontier LLMs play Texas Hold'em against each other in an arena setting. The models included are Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning. It also includes a simulator for viewing individual games and observing how each model reasons about poker strategy.
