federated-learning

2 posts

toss

Toss's AI Technology Recognized

Toss ML Engineer Jin-woo Lee presents FedLPA, a novel Federated Learning algorithm accepted at NeurIPS 2025 that addresses the critical challenges of data sovereignty and non-uniform data distributions. By allowing AI models to learn from localized data without transferring sensitive information across borders, this research provides a technical foundation for expanding services like Toss Face Pay into international markets with strict privacy regulations.

### The Challenge of Data Sovereignty in Global AI

* Traditional AI development requires centralizing data on a single server, which is often impossible due to international privacy laws and data sovereignty regulations.
* Federated Learning offers a solution by sending the model to the user's device (client) rather than moving the data, ensuring raw biometric information never leaves the local environment.
* Standard Federated Learning fails in real-world scenarios where data is non-IID (not Independent and Identically Distributed), meaning user patterns in different countries or regions vary significantly.

### Overcoming Limitations in Category Discovery

* Existing models assume all users share similar data distributions and that all data classes are known beforehand, which leads to performance degradation when encountering new demographics.
* FedLPA incorporates Generalized Category Discovery (GCD) to identify both known classes and entirely "novel classes" (e.g., new fraud patterns or ethnic features) that were not present in the initial training set.
* This approach prevents the model from becoming obsolete as it encounters new environments, allowing it to adapt to local characteristics autonomously.

### The FedLPA Three-Step Learning Pipeline

* **Confidence-guided Local Structure Discovery (CLSD):** The system builds a similarity graph by comparing feature vectors of local data. It refines these connections using "high-confidence" samples (data points the model is certain about) to strengthen the quality of the relational map.
* **InfoMap Clustering:** Instead of requiring a human to pre-define the number of categories, the algorithm uses the InfoMap community detection method. This allows the client to automatically estimate the number of unique categories within its own local data through random walks on the similarity graph. (A minimal sketch of these first two steps appears after the business implications below.)
* **Local Prior Alignment (LPA):** The model uses self-distillation to ensure consistent predictions across different views of the same data. Most importantly, an LPA regularizer forces the model's prediction distribution to align with the "Empirical Prior" discovered in the clustering phase, preventing the model from becoming biased toward over-represented classes. (A second sketch of this regularizer follows the closing recommendation.)

### Business Implications and Strategic Value

* **Regulatory Compliance:** FedLPA removes technical barriers to entry for markets like the EU or Southeast Asia by maintaining high model performance while strictly adhering to local data residency requirements.
* **Hyper-personalization:** Financial services such as Fraud Detection Systems (FDS) and Credit Scoring Systems (CSS) can be trained on local patterns, allowing for more accurate detection of region-specific scams or credit behaviors.
* **Operational Efficiency:** By enabling models to self-detect and learn from new patterns without manual labeling or central intervention, the system significantly reduces the cost and time required for global maintenance.
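
The first two pipeline steps can be illustrated with a short, self-contained sketch. The function names, thresholds, and toy data below are assumptions made for illustration, and connected components of a thresholded graph stand in for the InfoMap community detection the paper actually uses; treat this as a sketch of the idea, not the published implementation.

```python
import numpy as np

def build_similarity_graph(features, confidences, k=10, conf_threshold=0.9):
    """Step 1 (CLSD-style): k-NN cosine-similarity graph over local features.

    Edges whose endpoints are both low-confidence samples are down-weighted,
    loosely mirroring the confidence-guided refinement described above.
    All thresholds here are illustrative, not the paper's values.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                       # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)              # exclude self-edges

    n = len(features)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(sim[i])[-k:]:       # top-k most similar samples
            w = sim[i, j]
            if confidences[i] < conf_threshold and confidences[j] < conf_threshold:
                w *= 0.5                        # weaken edges between uncertain samples
            adj[i, j] = adj[j, i] = max(adj[i, j], w)
    return adj

def estimate_num_categories(adj, edge_threshold=0.7):
    """Step 2 stand-in: estimate how many categories the local data contains.

    The paper uses InfoMap community detection (random walks on the graph);
    connected components of a thresholded graph serve here as a simple,
    dependency-free substitute for that cluster-count estimate.
    """
    n = adj.shape[0]
    visited = np.zeros(n, dtype=bool)
    count = 0
    for start in range(n):
        if visited[start]:
            continue
        count += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if not visited[u]:
                visited[u] = True
                stack.extend(np.flatnonzero(adj[u] > edge_threshold).tolist())
    return count

# Toy usage: 60 local samples drawn around 3 latent class centroids.
rng = np.random.default_rng(0)
centroids = 5.0 * rng.normal(size=(3, 16))
features = np.vstack([c + 0.5 * rng.normal(size=(20, 16)) for c in centroids])
confidences = rng.uniform(0.5, 1.0, size=60)
graph = build_similarity_graph(features, confidences)
print("estimated local categories:", estimate_num_categories(graph))
```

The category count and cluster sizes recovered in this step are what the third step treats as the "Empirical Prior".
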
Implementing localized Federated Learning architectures like FedLPA is a recommended strategy for tech organizations seeking to scale AI services internationally while navigating the complex landscape of global privacy regulations and diverse data distributions.
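
As referenced above, the third step can be sketched as a loss with two terms: a self-distillation consistency term across augmented views, and a prior-alignment term. The objective below is a hedged illustration under assumed names (`lpa_objective`, `reg_weight`) and is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lpa_objective(logits_view1, logits_view2, empirical_prior, reg_weight=1.0):
    """Illustrative Local Prior Alignment (LPA) objective, not the paper's exact form.

    consistency: self-distillation term asking two augmented views of the same
        samples to agree (view 1's softened predictions act as targets for view 2).
    prior_alignment: KL term pulling the batch-averaged prediction distribution
        toward the empirical class prior estimated by local clustering, which
        discourages collapsing onto over-represented classes.
    """
    targets = F.softmax(logits_view1, dim=1).detach()
    consistency = -(targets * F.log_softmax(logits_view2, dim=1)).sum(dim=1).mean()

    avg_pred = F.softmax(logits_view2, dim=1).mean(dim=0)      # model's marginal over classes
    prior_alignment = F.kl_div(avg_pred.log(), empirical_prior,
                               reduction="sum")                 # KL(empirical_prior || avg_pred)
    return consistency + reg_weight * prior_alignment

# Toy usage: logits for two augmented views of an 8-sample batch over 5 classes,
# with a skewed prior discovered by the clustering step.
logits_a, logits_b = torch.randn(8, 5), torch.randn(8, 5)
prior = torch.tensor([0.4, 0.3, 0.15, 0.1, 0.05])
print(lpa_objective(logits_a, logits_b, prior))
```
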

google

Fine-tuning LLMs with user-level differential privacy

Researchers from Google investigated scaling user-level differential privacy (DP) to the fine-tuning of large language models in datacenter environments. While traditional example-level DP protects individual data points, user-level DP provides a stronger guarantee by masking the presence of an entire user's dataset, which is critical for privacy-sensitive, domain-specific tasks. The study explores how the flexibility of datacenter training can be used to optimize sampling strategies and contribution bounds to minimize the noise typically required for these stringent privacy guarantees.

## Limitations of Example-Level Privacy

* Standard differential privacy focuses on "example-level" protection, which prevents attackers from learning about specific individual data points.
* In many real-world scenarios, a single user contributes many examples to a dataset; if an attacker can analyze these multiple points together, they may still learn private information about the user even under example-level DP.
* User-level DP addresses this by ensuring a model remains essentially the same whether or not a specific user's entire data collection was used during training.
* While more robust, user-level DP is "strictly harder" to implement because it requires injecting significantly more noise into the training process, a problem that scales with the size of the model.

## Methodologies for User-Level DP Fine-Tuning

* Both primary algorithms require a "contribution bound" during pre-processing, which strictly limits the number of examples any single user can provide to the training set.
* Example-Level Sampling (ELS) involves sampling random individual examples for a batch and then applying a modified version of DP-SGD with high noise to compensate for the potential presence of multiple examples from the same user.
* User-Level Sampling (ULS) involves sampling random users and including all of their (bounded) examples in a batch, which more closely resembles the structure of federated learning (a minimal sketch appears at the end of this summary).
* The datacenter environment offers a unique advantage over federated learning because researchers can perform precise queries on both individual examples and whole users, allowing for better optimization of the noise-to-utility ratio.

## Optimization and Datacenter Flexibility

* The researchers focused on fine-tuning rather than full training because DP requires additional computation that is often unaffordable for base model training.
* A central challenge in this research is determining the optimal "contribution bound": if the bound is too low, valuable data is discarded, but if it is too high, more noise must be added to maintain privacy.
* Because the datacenter allows for random sampling of any user at any time (unlike federated learning, where devices must be online), the ULS algorithm can be tuned more effectively to achieve quality gains in the final model.

To maximize the utility of LLMs fine-tuned on private data, developers should prioritize User-Level Sampling (ULS) strategies and carefully calibrate the contribution bounds of their datasets. By leveraging the controlled environment of a datacenter to optimize these parameters, it is possible to achieve high-performance models that respect user privacy more effectively than traditional example-level methods.
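
Below is a minimal sketch of the User-Level Sampling idea described above, under assumed names and defaults (`user_level_dp_update`, `contribution_bound`, `noise_multiplier`). It follows a generic DP-FedAvg-style recipe of per-user clipping plus Gaussian noise and is not the paper's exact calibration; ELS is omitted.

```python
import numpy as np

def user_level_dp_update(user_examples, grad_fn, users_per_batch=16,
                         contribution_bound=8, clip_norm=1.0,
                         noise_multiplier=1.0, rng=np.random.default_rng(0)):
    """One noisy model update with User-Level Sampling (ULS), as a sketch.

    - Pre-processing: each user keeps at most `contribution_bound` examples.
    - Sampling: a random set of users is drawn; all of their bounded examples
      enter the batch (datacenter setting, so any user can be sampled anytime).
    - Privacy: each user's gradient contribution is clipped to `clip_norm`,
      then Gaussian noise scaled to that clip norm is added to the aggregate.
    Names, defaults, and the plain averaging here are illustrative only.
    """
    user_ids = list(user_examples)
    sampled = rng.choice(user_ids, size=min(users_per_batch, len(user_ids)),
                         replace=False)
    clipped_sum = 0.0
    for uid in sampled:
        examples = user_examples[uid][:contribution_bound]    # enforce contribution bound
        g = np.mean([grad_fn(x) for x in examples], axis=0)   # this user's average gradient
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip_norm / (norm + 1e-12))          # per-user clipping
        clipped_sum = clipped_sum + g
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=np.shape(clipped_sum))
    return (clipped_sum + noise) / len(sampled)               # noisy averaged update

# Toy usage: 100 users, each with a handful of 2-D "examples"; the gradient of
# a squared-loss toy objective stands in for an LLM fine-tuning gradient.
rng = np.random.default_rng(1)
data = {u: [rng.normal(size=2) for _ in range(rng.integers(3, 20))] for u in range(100)}
print("noisy user-level update:", user_level_dp_update(data, grad_fn=lambda x: 2 * x))
```

Because each user's contribution is clipped as a single unit before noise is added, the released update changes little whether or not any one user's entire (bounded) dataset participated, which is the user-level guarantee the post contrasts with example-level DP.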