Unweight: how we compressed an LLM 22% without sacrificing quality
2026-04-17
Mari Galicer, Ivan Nikulin, Chris Branch

Running inference within 50ms of 95% of the world's Internet-connected population means being ruthlessly efficient with GPU memory. Last year we improved memory u…