How we built a real-world evaluation platform for autonomous SRE agents at scale
Benjamin Barton

We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse. Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had…