federated-analytics

1 post

google

Toward provably private insights into AI use

Google Research has introduced Provably Private Insights (PPI), a framework designed to analyze generative AI usage patterns while providing mathematical guarantees of user privacy. By integrating Large Language Models (LLMs) with differential privacy and trusted execution environments (TEEs), the system enables developers to derive aggregate trends from unstructured data without exposing individual user content. This approach ensures that server-side processing remains limited to privacy-preserving computations that are fully auditable by external parties.

### The Role of LLMs in Structured Summarization

The system employs "data expert" LLMs to transform unstructured generative AI data into actionable, structured insights (the first sketch at the end of this post illustrates the pattern).

* The framework uses open-source Gemma 3 models to perform specific analysis tasks, such as classifying transcripts into topics or identifying user frustration levels.
* This "structured summarization" occurs entirely within a TEE, ensuring that the model processes raw data in an environment inaccessible to human operators or external processes.
* Developers can update LLM prompts frequently to answer new research questions without compromising the underlying privacy architecture.

### Confidential Federated Analytics (CFA) Infrastructure

The PPI system is built on Confidential Federated Analytics, a technique that isolates data through hardware-based security and cryptographic verification (the second sketch below models its key-release check).

* User devices encrypt data and define the specific authorized processing steps before uploading it to the server.
* A TEE-hosted key management service releases decryption keys only to processing steps that match public, open-source code signatures.
* System integrity is verified using Rekor, a public, tamper-resistant transparency log that lets external parties confirm that the code running in the TEE is exactly what was published.

### Anonymization via Differential Privacy

Once the LLM extracts features from the data, the system applies differential privacy (DP) so that the final output does not reveal information about any specific individual (the third sketch below shows the noisy-histogram step).

* The extracted categories are aggregated into histograms, with DP noise added to the final counts to prevent the identification of single users.
* Because the privacy guarantee is applied at the aggregation stage, the system remains secure even if a developer uses a prompt specifically designed to isolate a single user's data.
* All aggregation algorithms are open source and reproducibly buildable, allowing end-to-end verifiability of the privacy claims.

By open-sourcing the PPI stack through the Google Parfait project and deploying it in applications like Pixel Recorder, this framework establishes a new standard for transparent data analysis. Developers should consider integrating similar TEE-based federated analytics to balance the need for product insights with the necessity of provable, hardware-backed user privacy.
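
To make the structured-summarization step concrete, here is a minimal sketch in Python. The topic vocabulary, the prompt wording, and the `generate` callable standing in for the TEE-hosted Gemma 3 runtime are all illustrative assumptions, not part of the published PPI stack; the point is that the model's free-text output is coerced onto a fixed categorical domain before anything leaves the enclave.

```python
# Minimal sketch of the "data expert" structured-summarization step.
# The generate() callable is a hypothetical stand-in for whatever LLM
# runtime hosts Gemma 3 inside the TEE; its name and signature are
# assumptions, not part of the published PPI stack.

from typing import Callable

# Fixed, closed category vocabulary: the LLM may only emit one of these,
# so downstream aggregation sees a bounded, well-typed domain.
TOPICS = ["coding", "writing", "travel", "health", "other"]

PROMPT_TEMPLATE = (
    "Classify the following transcript into exactly one topic from "
    "this list: {topics}.\n"
    "Answer with the topic name only.\n\nTranscript:\n{transcript}\n"
)

def classify_transcript(transcript: str, generate: Callable[[str], str]) -> str:
    """Map one raw transcript to a single categorical label."""
    prompt = PROMPT_TEMPLATE.format(topics=", ".join(TOPICS), transcript=transcript)
    answer = generate(prompt).strip().lower()
    # Coerce free-text model output onto the closed vocabulary; anything
    # unrecognized falls back to "other" rather than leaking raw text.
    return answer if answer in TOPICS else "other"
```

Only the categorical label flows onward to aggregation, which is what lets developers swap in new prompts without widening what the server-side pipeline can emit per record.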
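
The key-release policy in the CFA layer can be sketched the same way. In the real system the measurement comes from hardware attestation and the allow-list is anchored in the Rekor transparency log; the plain digest comparison below is a deliberate simplification, and every identifier and value in it is hypothetical.

```python
# Minimal sketch of the key-release check: the key management service
# hands out a decryption key only when the requested processing step
# matches a reproducibly built, publicly logged binary.

import hashlib

# Digests of published processing binaries (placeholders, not real
# Rekor log entries).
AUTHORIZED_DIGESTS = {
    hashlib.sha256(b"published-aggregation-step-v1").hexdigest(),
}

def release_key(step_binary: bytes, decryption_key: bytes) -> bytes:
    """Release the data key only if the step matches published code."""
    digest = hashlib.sha256(step_binary).hexdigest()
    if digest not in AUTHORIZED_DIGESTS:
        raise PermissionError("processing step does not match any logged build")
    return decryption_key

# A step that was never published (and logged) cannot obtain the key:
release_key(b"published-aggregation-step-v1", b"k" * 32)  # succeeds
```

Because the allow-list is derived from public, reproducibly buildable code, an external auditor can rebuild the binaries and confirm that only the published pipeline could ever have decrypted user uploads.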
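
Finally, a minimal sketch of the differentially private aggregation step. The assumption that each user contributes at most one record (so each count has L1 sensitivity 1), the epsilon value, and the choice of the Laplace mechanism are illustrative; the post states only that DP noise is added to histogram counts.

```python
# Minimal sketch of the anonymization step: aggregate per-record labels
# into a histogram and add Laplace noise to each count. Assumes each
# user contributes at most one record, giving L1 sensitivity 1.

from collections import Counter
import numpy as np

TOPICS = ["coding", "writing", "travel", "health", "other"]

def dp_histogram(labels: list[str], epsilon: float) -> dict[str, float]:
    """Return per-topic counts with Laplace(1/epsilon) noise added."""
    counts = Counter(labels)
    scale = 1.0 / epsilon  # noise scale calibrated to sensitivity 1
    rng = np.random.default_rng()
    return {t: counts.get(t, 0) + rng.laplace(0.0, scale) for t in TOPICS}

print(dp_histogram(["coding", "coding", "travel", "other"], epsilon=1.0))
```

Since the noise is injected at this final stage, even a prompt crafted to isolate one user only shifts a single count by at most 1, which the Laplace noise masks by design.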