Ask HN: What benchmarks are you using to judge AI models?

There are so many models, with new ones released all the time, that I have a hard time deciding which ones to prioritize for hands-on testing. Which benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding:

https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility:

https://openrouter.ai/rankings

* LLM-Stats has charts for a lot of benchmarks, which I look at to compare models:

https://llm-stats.com/
