microsoft Jan 28, 2026 Diagnosing instability in production-scale agent reinforcement learning (opens in new tab) ai-agentreinforcement-learningpost-trainingtrlhuggingfacetool-useon-policy-learning