nlp

15 posts

kakao

Building an Ultra-lightweight Morphological Analyzer

Kakao developed a specialized, lightweight morphological analyzer to meet the strict resource constraints of mobile environments, where modern deep-learning models are often too heavy. By opting for a classical Viterbi-based approach implemented in C++20, the team successfully reduced the library's binary size to approximately 200KB while ensuring high performance. This development highlights how traditional algorithmic optimization and careful language selection remain vital for mobile software efficiency.

## The Choice of C++ over Rust

- While Rust was considered for its safety, it was ultimately rejected because its default binary size (even with optimization) reached several megabytes, which was too large for the specific project requirements.
- C++ was chosen because mobile platforms like iOS and Android already include standard libraries (libc++ or libstdc++), allowing the final analyzer binary to be stripped down to core logic.
- The project utilized C++20 features such as Concepts and `std::span` to replace older patterns like SFINAE and `gsl::span`, resulting in more readable and maintainable code without sacrificing performance.

## Trie Compression using LOUDS

- To minimize the dictionary size, the team implemented a LOUDS (Level-Order Unary Degree Sequence) structure, which represents a trie using a bit sequence instead of pointers.
- This approach provides a compression rate near the information-theoretic lower bound, allowing approximately 760,000 nodes to be stored in just 9.4MB.
- Further optimization was achieved through a custom encoding scheme that represents Hangul in 2 bytes and English in 1 byte, significantly reducing the dictionary's memory footprint compared to standard UTF-8.

## Optimizing the Select Bit Operation

- Initial performance profiling showed that the `select0` operation (finding the N-th zero in a bit sequence) consumed 90% of the dictionary search time due to linear search overhead.
- The solution involved dividing the bit sequence into 64-bit chunks and storing the cumulative count of zeros at each chunk boundary in a separate array (see the sketch below).
- By using binary search to find the correct chunk and applying parallel bit-counting techniques for intra-chunk searching, the dictionary search time was reduced from 165ms to 10ms.
- These optimizations led to a total analysis time improvement from 182ms to 28ms, making the tool highly responsive for real-time mobile use.

For mobile developers facing strict hardware limitations, this project proves that combining classical data structures like LOUDS with modern low-level language features can yield performance and size benefits that deep learning alternatives currently cannot match.
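Below is a minimal, illustrative sketch of the chunked `select0` idea, written in Python for readability rather than the post's C++20 implementation; the class name and the linear intra-chunk scan are assumptions (the real code would use parallel bit-counting within a 64-bit word).

```python
# Conceptual sketch (not Kakao's actual C++ code): select0 over a bit sequence
# using per-64-bit-chunk cumulative zero counts plus binary search.
from bisect import bisect_left

class Select0Index:
    def __init__(self, bits: list[int]):
        # bits is a flat sequence of 0/1 values; group it into 64-bit chunks.
        self.chunks = [bits[i:i + 64] for i in range(0, len(bits), 64)]
        # cum_zeros[i] = number of zeros in chunks[0..i-1] (chunk-boundary prefix sums).
        self.cum_zeros = [0]
        for chunk in self.chunks:
            self.cum_zeros.append(self.cum_zeros[-1] + chunk.count(0))

    def select0(self, n: int) -> int:
        """Return the position of the n-th zero (1-indexed), or -1 if absent."""
        if n < 1 or n > self.cum_zeros[-1]:
            return -1
        # Binary search for the chunk whose cumulative zero count first reaches n.
        c = bisect_left(self.cum_zeros, n, lo=1) - 1
        remaining = n - self.cum_zeros[c]
        # Linear scan inside the 64-bit chunk (real code would use popcount tricks).
        for offset, bit in enumerate(self.chunks[c]):
            if bit == 0:
                remaining -= 1
                if remaining == 0:
                    return c * 64 + offset
        return -1

# Example: find the position of the 10th zero in a small bit sequence.
bits = [1, 0, 1, 0] * 40
idx = Select0Index(bits)
print(idx.select0(10))  # 19
```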

google

Gemini provides automated feedback for theoretical computer scientists at STOC 2026

Google Research launched an experimental program for the STOC 2026 conference using a specialized Gemini model to provide automated, rigorous feedback on theoretical computer science submissions. By identifying critical logical errors and proof gaps within a 24-hour window, the tool demonstrated that advanced AI can serve as a powerful pre-vetting collaborator for high-level mathematical research. The overwhelmingly positive reception from authors indicates that AI can effectively augment the human peer-review process by improving paper quality before formal submission.

## Advanced Reasoning via Inference Scaling

- The tool utilized an advanced version of Gemini 2.5 Deep Think specifically optimized for mathematical rigor.
- It employed inference scaling methods, allowing the model to explore and combine multiple possible solutions and reasoning traces simultaneously.
- This non-linear approach to problem-solving helps the model focus on the most salient technical issues while significantly reducing the likelihood of hallucinations.

## Structured Technical Feedback

- Feedback was delivered in a structured format that included a high-level summary of the paper's core contributions.
- The model provided a detailed analysis of potential mistakes, specifically targeting errors within lemmas, theorems, and logical proofs.
- Authors also received a categorized list of minor corrections, such as inconsistent variable naming and typographical errors.

## Identified Technical Issues and Impact

- The pilot saw high engagement, with over 80% of STOC 2026 submitters opting in for the AI-generated review.
- The tool successfully identified "critical bugs" and calculation errors that had previously evaded human authors for months.
- Survey results showed that 97% of participants found the feedback helpful, and 81% reported that the tool improved the overall clarity and readability of their work.

## Expert Verification and Hallucinations

- Because the users were domain experts, they were able to act as a filter, distinguishing between deep technical insights and occasional model hallucinations.
- While the model sometimes struggled to parse complex notation or interpret figures, authors valued the "neutral tone" and the speed of the two-day turnaround.
- The feedback was used as a starting point for human verification, allowing researchers to refine their arguments rather than blindly following the model's output.

## Future Outlook and Educational Potential

- Beyond professional research, 75% of surveyed authors see significant educational value in using the tool to train students in mathematical rigor.
- The experiment's success has led to 88% of participants expressing interest in having continuous access to such a tool throughout their entire research and drafting process.

The success of the STOC 2026 pilot suggests that researchers should consider integrating specialized LLMs early in the drafting phase to catch "embarrassing" or logic-breaking errors. While the human expert remains the final arbiter of truth, these tools provide a necessary layer of automated verification that can accelerate the pace of scientific discovery.

aws

Amazon Bedrock AgentCore adds quality evaluations and policy controls for deploying trusted AI agents | AWS News Blog

AWS has introduced several new capabilities to Amazon Bedrock AgentCore designed to remove the trust and quality barriers that often prevent AI agents from moving into production environments. These updates, which include granular policy controls and sophisticated evaluation tools, allow developers to implement strict operational boundaries and monitor real-world performance at scale. By balancing agent autonomy with centralized verification, AgentCore provides a secure framework for deploying highly capable agents across enterprise workflows.

**Governance through Policy in AgentCore**

* This feature establishes clear boundaries for agent actions by intercepting tool calls via the AgentCore Gateway before they are executed.
* By operating outside of the agent’s internal reasoning loop, the policy layer acts as an independent verification system that treats the agent as an autonomous actor requiring permission.
* Developers can define fine-grained permissions to ensure agents do not access sensitive data inappropriately or take unauthorized actions within external systems.

**Quality Monitoring with AgentCore Evaluations**

* The new evaluation framework allows teams to monitor the quality of AI agents based on actual behavior rather than theoretical simulations.
* Built-in evaluators provide standardized metrics for critical dimensions such as helpfulness and correctness.
* Organizations can also implement custom evaluators to ensure agents meet specific business-logic requirements and industry-specific compliance standards.

**Enhanced Memory and Communication Features**

* New episodic functionality in AgentCore Memory introduces a long-term strategy that allows agents to learn from past experiences and apply successful solutions to similar future tasks.
* Bidirectional streaming in the AgentCore Runtime supports the deployment of advanced voice agents capable of handling natural, simultaneous conversation flows.
* These enhancements focus on improving consistency and user experience, enabling agents to handle complex, multi-turn interactions with higher reliability.

**Real-World Application and Performance**

* The AgentCore SDK has seen rapid adoption with over 2 million downloads, supporting diverse use cases from content generation at the PGA TOUR to financial data analysis at Workday.
* Case studies highlight significant operational gains, such as a 1,000 percent increase in content writing speed and a 50 percent reduction in problem resolution time through improved observability.
* The platform emphasizes 100 percent traceability of agent decisions, which is critical for organizations transitioning from reactive to proactive AI-driven operations.

To successfully scale AI agents, organizations should transition from simple prompt engineering to a robust agentic architecture. Leveraging these new policy and evaluation tools will allow development teams to maintain the necessary control and visibility required for customer-facing and mission-critical deployments.

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for specific distribution shifts in disease types and cultural symptom descriptions.
* The lack of diverse data makes it difficult to assess how LLMs handle variations in language and linguistics, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure a broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller models on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been utilized to improve Google’s MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
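As a starting point, the released data can presumably be pulled with the Hugging Face `datasets` library; the repository identifier below is a placeholder assumption, not a confirmed name, so check the official AfriMed-QA release for the exact ID.

```python
# Hypothetical loading sketch; "intron-health/afrimed-qa" is a placeholder repo ID,
# not a confirmed dataset name on the Hugging Face Hub.
from datasets import load_dataset

afrimed = load_dataset("intron-health/afrimed-qa")  # assumed identifier
print(afrimed)  # inspect the available splits (e.g., MCQ, SAQ, consumer queries)
```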

line

Extracting Trending Keywords from Open Chat Messages

To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. This approach utilizes a combination of statistical Z-tests to identify frequency spikes and MinHash clustering to eliminate near-duplicate content, ensuring that the trending topics are both relevant and diverse.

**The Shift from Chat Rooms to Content-Driven Recommendations**

* Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
* Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
* Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.

**Statistical Detection of Trending Keywords**

* Simple frequency counts are ineffective because they capture common social fillers like greetings or expressions of gratitude rather than actual trends.
* Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
* The system uses a Z-test for two-sample proportions to assign a score to each word, filtering for terms with at least a 30% frequency growth (see the combined sketch below).
* A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.

**MinHash-based Message Deduplication**

* Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
* The system employs MinHash, a dimensionality reduction technique, to identify near-duplicate messages based on Jaccard similarity.
* The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together.
* To evaluate the efficiency of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that operates in linear time complexity.

By combining Z-test statistical modeling with MinHash deduplication, this methodology successfully transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, using a rolling weekly baseline and signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.
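The following is a compact sketch, under stated assumptions, of the two core pieces above: the two-sample proportion Z-test for trend scoring and MinHash signatures for near-duplicate clustering. Function names, the significance cutoff `min_z`, and the toy tokenization are illustrative, not LY Corporation's production code.

```python
import math
import hashlib
from collections import defaultdict

def trend_z_score(count_now: int, total_now: int, count_prev: int, total_prev: int) -> float:
    """Two-sample proportion Z-test: how unusual is today's keyword frequency
    versus the baseline from seven days earlier?"""
    p1, p2 = count_now / total_now, count_prev / total_prev
    p_pool = (count_now + count_prev) / (total_now + total_prev)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_now + 1 / total_prev))
    return (p1 - p2) / se if se > 0 else 0.0

def is_trending(count_now, total_now, count_prev, total_prev,
                min_growth=0.30, min_z=3.0):
    # min_z is an assumed significance cutoff; the post only specifies the 30% growth filter.
    p1, p2 = count_now / total_now, count_prev / total_prev
    grew_enough = p2 == 0 or (p1 - p2) / p2 >= min_growth
    return grew_enough and trend_z_score(count_now, total_now, count_prev, total_prev) >= min_z

def minhash_signature(tokens: set[str], k: int = 16) -> tuple[int, ...]:
    """k-length MinHash signature; messages with identical signatures are
    treated as near-duplicates and clustered together."""
    return tuple(
        min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for i in range(k)  # salt i simulates k independent hash functions
    )

# Deduplicate: keep one representative message per signature cluster.
clusters = defaultdict(list)
for msg, tokens in [("concert tickets on sale", {"concert", "ticket", "sale"}),
                    ("tickets for the concert sale", {"concert", "ticket", "sale"}),
                    ("weekend hiking plans", {"weekend", "hiking", "plan"})]:
    clusters[minhash_signature(tokens)].append(msg)
representatives = [msgs[0] for msgs in clusters.values()]
print(representatives)  # one message per near-duplicate cluster
```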

google

A scalable framework for evaluating health language models

Researchers at Google have developed a scalable framework for evaluating health-focused language models by replacing subjective, high-complexity rubrics with granular, binary criteria. This "Adaptive Precise Boolean" approach addresses the high costs and low inter-rater reliability typically associated with expert-led evaluation in specialized medical domains. By dynamically filtering rubric questions based on context, the framework significantly improves both the speed and precision of model assessments.

## Limitations of Traditional Evaluation

* Current evaluation practices for health LLMs rely heavily on human experts, making them cost-prohibitive and difficult to scale.
* Standard tools, such as Likert scales (e.g., 1-5 ratings) or open-ended text, often lead to subjective interpretations and low inter-rater consistency.
* Evaluating complex, personalized health data requires a level of detail that traditional broad-scale rubrics fail to capture accurately.

## Precise Boolean Rubrics

* The framework "granularizes" complex evaluation targets into a larger set of focused, binary (Yes/No) questions.
* This format reduces ambiguity by forcing raters to make definitive judgments on specific aspects of a model's response.
* By removing the middle ground found in multi-point scales, the framework produces a more robust and actionable signal for programmatic model refinement.

## The Adaptive Filtering Mechanism

* To prevent the high volume of binary questions from overwhelming human raters, the researchers introduced an "Adaptive" layer.
* The framework uses the Gemini model as a zero-shot classifier to analyze the user query and LLM response, identifying only the most relevant rubric questions (a minimal sketch follows below).
* This data-driven adaptation ensures that human experts only spend time on pertinent criteria, resulting in "Human-Adaptive Precise Boolean" rubrics.

## Performance and Reliability Gains

* The methodology was validated in the domain of metabolic health, covering topics like diabetes, obesity, and cardiovascular disease.
* The Adaptive Precise Boolean approach reduced human evaluation time by over 50% compared to traditional Likert-scale methods.
* Inter-rater reliability, measured through intra-class correlation coefficients (ICC), was significantly higher than the baseline, showing that simpler scoring can provide a higher quality signal.

This framework demonstrates that breaking down complex medical evaluations into simple, machine-filtered binary questions is a more efficient path toward safe and accurate health AI. Organizations developing domain-specific models should consider adopting adaptive binary rubrics to balance the need for expert oversight with the requirements of large-scale model iteration.
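A minimal sketch of the adaptive filtering plus binary scoring flow, assuming stand-in callables (`classify_relevance`, `answer_yes_no`) in place of the Gemini zero-shot classifier and rater judgments described above; the scoring convention for an empty relevant set is also an assumption.

```python
from typing import Callable

def adaptive_boolean_score(
    query: str,
    response: str,
    rubric: list[str],
    classify_relevance: Callable[[str, str, str], bool],  # stand-in for the LLM relevance filter
    answer_yes_no: Callable[[str, str, str], bool],        # stand-in for a rater's Yes/No judgment
) -> float:
    # Adaptive layer: keep only rubric questions relevant to this query/response pair.
    relevant = [q for q in rubric if classify_relevance(query, response, q)]
    if not relevant:
        return 1.0  # nothing applicable to fail (assumed convention)
    # Precise Boolean layer: each remaining question gets a definitive Yes/No.
    passed = sum(answer_yes_no(query, response, q) for q in relevant)
    return passed / len(relevant)

# Example of a rubric "granularized" into binary criteria.
rubric = [
    "Does the response mention consulting a clinician before changing medication?",
    "Does the response avoid recommending a specific insulin dose?",
    "Does the response address the user's reported fasting glucose value?",
]
```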

google

Enabling physician-centered oversight for AMIE

Guardrailed-AMIE (g-AMIE) is a diagnostic AI framework designed to perform patient history-taking while strictly adhering to safety guardrails that prevent it from providing direct medical advice. By decoupling data collection from clinical decision-making, the system enables an asynchronous oversight model where primary care physicians (PCPs) review and finalize AI-generated medical summaries. In virtual clinical trials, g-AMIE’s diagnostic outputs and patient communications were preferred by overseeing physicians and patient actors over those of human-led control groups.

## Multi-Agent Architecture and Guardrails

* The system utilizes a multi-agent setup powered by Gemini 2.0 Flash, consisting of a dialogue agent, a guardrail agent, and a SOAP note agent.
* The dialogue agent conducts history-taking in three distinct phases: general information gathering, targeted validation of a differential diagnosis, and a conclusion phase for patient questions.
* A dedicated guardrail agent monitors and rephrases responses in real time to ensure the AI abstains from sharing individualized diagnoses or treatment plans directly with the patient (illustrated in the sketch below).
* The SOAP note agent employs sequential multi-step generation to separate summarization tasks (Subjective and Objective) from more complex inferential tasks (Assessment and Plan).

## The Clinician Cockpit and Asynchronous Oversight

* To facilitate human review, researchers developed the "clinician cockpit," a web interface co-designed with outpatient physicians through semi-structured interviews.
* The interface is structured around the standard SOAP note format, presenting the patient’s perspective, measurable data, differential diagnosis, and proposed management strategy.
* This framework allows overseeing PCPs to review cases asynchronously, editing the AI’s proposed differential diagnoses and management plans before sharing a final message with the patient.
* The separation of history-taking from decision-making ensures that licensed medical professionals retain ultimate accountability for patient care.

## Performance Evaluation via Virtual OSCE

* The system was evaluated in a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) involving 60 case scenarios.
* g-AMIE’s performance was compared against primary care physicians, nurse practitioners, and physician assistants who were required to operate under the same restrictive guardrails.
* Overseeing PCPs and independent physician raters preferred g-AMIE’s diagnostic accuracy and management plans over those of the human control groups.
* Patient actors reported a preference for the messages generated by g-AMIE compared to those drafted by human clinicians in the study.

While g-AMIE demonstrates high potential for human-AI collaboration in diagnostics, the researchers emphasize that results should be interpreted with caution. The workflow was specifically optimized for AI characteristics, and human clinicians may require specialized training to perform effectively within such highly regulated guardrail frameworks.
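An illustrative sketch of the guardrail pattern, with hypothetical stand-in functions (`dialogue_agent`, `violates_guardrail`, `rephrase_safely`) rather than the actual g-AMIE agents; it only shows where the guardrail check sits relative to the dialogue agent.

```python
def guardrailed_turn(patient_message, history, dialogue_agent, violates_guardrail, rephrase_safely):
    """Produce the next history-taking turn while withholding individualized advice."""
    draft = dialogue_agent(history + [patient_message])
    # The guardrail agent reviews every draft before it reaches the patient.
    if violates_guardrail(draft):      # e.g., the draft contains a diagnosis or treatment plan
        draft = rephrase_safely(draft)  # keep the intake question, remove the advice
    return draft
```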

google

REGEN: Empowering personalized recommendations with natural language

Google Research has introduced REGEN, a benchmark dataset designed to evolve recommender systems from simple item predictors into conversational agents capable of natural language interaction. By augmenting the Amazon Product Reviews dataset with synthetic critiques and narratives using Gemini 1.5 Flash, the researchers provide a framework for training models to understand user feedback and explain their suggestions. The study demonstrates that integrating natural language critiques significantly improves recommendation accuracy while enabling models to generate personalized, context-aware content.

### Composition of the REGEN Dataset

* The dataset enriches the existing Amazon Product Reviews archive by adding synthetic conversational elements, specifically targeting the gap in datasets that support natural language feedback.
* **Critiques** are generated for similar item pairs within hierarchical categories, allowing users to guide the system by requesting specific changes, such as a different color or increased storage.
* **Narratives** provide contextual depth through purchase reasons, product endorsements, and concise user summaries, helping the system justify its recommendations to the end user.

### Unified Generative Modeling Approaches

* The researchers framed a "jointly generative" task where models must process a purchase history and an optional critique to output both a recommended item ID and a supporting narrative.
* The **FLARE (Hybrid)** architecture uses a sequential recommender for item prediction based on collaborative filtering, which then feeds into a Gemma 2B LLM to generate the final text narrative.
* The **LUMEN (Unified)** model functions as an end-to-end system where item IDs and text tokens are integrated into a single vocabulary, allowing one LLM to handle critiques, recommendations, and narratives simultaneously.

### Performance and Impact of User Feedback

* Incorporating natural language critiques consistently improved recommendation metrics across different architectures, demonstrating that language-guided refinement is a powerful tool for accuracy.
* In the Office domain, the FLARE hybrid model's Recall@10 (a measure of how often the desired item appears in the top 10 results) increased from 0.124 to 0.1402 when critiques were included (Recall@k is sketched below).
* Results indicate that models trained on REGEN can achieve performance comparable to state-of-the-art specialized recommenders while maintaining high-quality natural language generation.

The REGEN dataset and the accompanying LUMEN architecture provide a path forward for building more transparent and interactive AI assistants. For developers and researchers, utilizing these conversational benchmarks is essential for moving beyond "black box" recommendations toward systems that can explain their logic and adapt to specific user preferences in real time.
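For reference, the Recall@k metric quoted above can be computed as follows; this is a generic toy implementation, not REGEN's evaluation code.

```python
def recall_at_k(ranked_item_ids: list[str], target_item_id: str, k: int = 10) -> float:
    """1.0 if the ground-truth next item appears in the top-k recommendations, else 0.0."""
    return 1.0 if target_item_id in ranked_item_ids[:k] else 0.0

def mean_recall_at_k(predictions: list[list[str]], targets: list[str], k: int = 10) -> float:
    """Average the per-user hit indicator over the evaluation set."""
    return sum(recall_at_k(p, t, k) for p, t in zip(predictions, targets)) / len(targets)

# Toy usage: two users, one hit in the top 10, one miss.
preds = [["item_7", "item_3", "item_9"] + [f"item_{i}" for i in range(10, 17)],
         [f"item_{i}" for i in range(20, 30)]]
print(mean_recall_at_k(preds, targets=["item_3", "item_99"], k=10))  # 0.5
```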

google

Making complex text understandable: Minimally-lossy text simplification with Gemini

Google Research has introduced a novel system using Gemini models to perform minimally-lossy text simplification, a process designed to enhance readability while meticulously preserving original meaning and nuance. By utilizing an automated, iterative prompt-refinement loop, the system optimizes LLM instructions to achieve high-fidelity paraphrasing that avoids the information loss typical of standard summarization. A large-scale randomized study confirms that this approach significantly improves user comprehension across complex domains like law and medicine while simultaneously reducing cognitive load for the reader.

## Automated Evaluation and Fidelity Assessment

* The system moves beyond traditional metrics like Flesch-Kincaid by using a Gemini-powered 1-10 readability scale that aligns more closely with human judgment and comprehension ease.
* Fidelity is maintained through a specialized process using Gemini 1.5 Pro that maps specific claims from the original source text directly to the simplified output.
* This mapping method identifies and weights specific error types, such as information loss, unnecessary gains, or factual distortions, to ensure the output remains a faithful representation of the technical original.

## Iterative Prompt Optimization Loop

* To overcome the limitations and slow pace of manual prompt engineering, the researchers implemented a feedback loop where Gemini models optimize their own instructions (a minimal sketch follows below).
* In this "LLMs optimizing LLMs" setup, Gemini 1.5 Pro analyzes the performance of simplification prompts and proposes refinements based on automated readability and fidelity scores.
* The optimization process ran for 824 iterations before performance plateaued, allowing the system to autonomously discover highly effective strategies for simplifying text without sacrificing detail.

## Validating Impact through Randomized Studies

* The effectiveness of the model was validated with 4,563 participants across 31 diverse text excerpts covering specialized fields like aerospace, philosophy, finance, and biology.
* The study utilized a randomized complete block design to compare the original text against simplified versions, measuring outcomes through nearly 50,000 multiple-choice question responses.
* Beyond accuracy, researchers measured cognitive effort using the NASA Task Load Index and tracked self-reported user confidence to ensure the simplification actually lowered the barrier to understanding.

This technology provides a scalable method for democratizing access to specialist knowledge by making expert-level discourse understandable to a general audience. The system is currently available as the "Simplify" feature within the Google app for iOS, offering a practical tool for users navigating complex digital information.
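A hedged sketch of the "LLMs optimizing LLMs" loop, assuming stand-in functions (`simplify`, `readability_score`, `fidelity_score`, `propose_better_prompt`) in place of the Gemini calls; the hill-climbing acceptance rule is an illustrative simplification of whatever optimizer the team actually used.

```python
def optimize_simplification_prompt(seed_prompt, eval_texts, simplify, readability_score,
                                   fidelity_score, propose_better_prompt, iterations=50):
    """Hill-climb on the prompt: score each candidate, keep it if it improves,
    and ask the optimizer LLM for the next refinement."""
    def evaluate(prompt):
        outputs = [simplify(prompt, text) for text in eval_texts]
        # Combined objective: easier to read while preserving the original claims.
        return sum(readability_score(o) + fidelity_score(t, o)
                   for t, o in zip(eval_texts, outputs)) / len(eval_texts)

    best_prompt, best_score = seed_prompt, evaluate(seed_prompt)
    candidate = propose_better_prompt(best_prompt, best_score)
    for _ in range(iterations):  # the post reports 824 iterations before plateauing
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        candidate = propose_better_prompt(best_prompt, best_score)
    return best_prompt
```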

google

Amplify Initiative: Localized data for globalized AI

The Amplify Initiative by Google Research addresses the critical lack of linguistic and cultural diversity in generative AI training data by establishing an open, community-based platform for localized data collection. By partnering with regional experts to co-create structured, high-quality datasets, the initiative aims to ensure AI models are both representative and effective in solving local challenges across health, finance, and education. This approach shifts data collection from a top-down model to a participatory framework that prioritizes responsible, locally respectful practices in the Global South.

## The Amplify Platform Framework

The initiative is designed to bridge the gap between global AI capabilities and local needs through three core pillars:

* **Participatory Co-creation:** Researchers and local communities collaborate to define specific data needs, ensuring the resulting datasets address region-specific problems like financial literacy or localized health misinformation.
* **Open Access for Innovation:** The platform provides high-quality, multilingual datasets suitable for fine-tuning and evaluating models, specifically empowering developers in the Global South to build tools for their own communities.
* **Author Recognition:** Contributors receive tangible rewards, including professional certificates, research acknowledgments, and data authorship attribution, creating a sustainable ecosystem for expert participation.

## Pilot Implementation in Sub-Saharan Africa

To test the methodology, Google Research partnered with Makerere University’s AI Lab in Uganda to conduct an on-the-ground pilot program.

* **Expert Onboarding:** The program trained 259 experts across Ghana, Kenya, Malawi, Nigeria, and Uganda through a combination of in-person workshops and app-based modules.
* **Dataset Composition:** The pilot resulted in 8,091 annotated adversarial queries across seven languages, covering salient domains such as education and finance.
* **Adversarial Focus:** By focusing on adversarial queries, the team captured localized nuances of potential AI harms, including regional stereotypes and specialized advice that generic models often miss.

## Technical Workflow and App-Based Methodology

The initiative utilizes a structured technical pipeline to scale data collection while maintaining high quality and privacy.

* **Privacy-Preserving Android App:** A dedicated app serves as the primary interface for training, data creation, and annotation, allowing experts to contribute from their own environments.
* **Automated Validation:** The app includes built-in feedback loops that use automated checks to ensure queries are relevant and to prevent the submission of semantically similar or duplicate entries.
* **Domain-Specific Annotation:** Experts are provided with specialized annotation topics tailored to their professional backgrounds, ensuring that the metadata for each query is technically accurate and contextually relevant.

The Amplify Initiative provides a scalable blueprint for building inclusive AI by empowering experts in the Global South to define their own data needs. As the project expands to India and Brazil, it offers a vital resource for developers seeking to fine-tune models for local contexts and improve the safety and relevance of AI on a global scale.

google

Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis

Cell2Sentence-Scale (C2S-Scale) is a new family of open-source large language models designed to transform complex single-cell transcriptomic data into a text-based format accessible to natural language processing. By representing gene expression profiles as "cell sentences," the framework allows researchers to use general-purpose LLM architectures to "read" and "write" biological information. This approach simplifies single-cell analysis, enabling conversational queries and automated data interpretation that were previously limited to specialized tools and expert users.

### The Cell2Sentence Mapping Method

* Translates single-cell RNA sequencing (scRNA-seq) measurements into sequences of text by ordering gene names according to their expression levels (see the sketch below).
* Enables the integration of cellular data with text-based biological context, such as cell types, experimental metadata, and scientific literature.
* Leverages the existing vocabulary of biology (gene names and functions) to make high-dimensional data interpretable by standard language model tokenizers.

### C2S-Scale Model Architecture and Training

* Built upon Google’s Gemma open model family, maintaining the original architecture to benefit from existing scalability and infrastructure.
* Trained on a dataset exceeding 1 billion tokens derived from real-world transcriptomic data and biological metadata.
* Features a range of model sizes from 410 million to 27 billion parameters, allowing researchers to choose between computational efficiency for exploratory work and high performance for complex tasks.

### Functional Applications in Biology

* **Conversational Querying:** Researchers can interact with data through natural language to ask specific questions, such as predicting how a T cell might respond to a particular cancer therapy.
* **Automated Interpretation:** The models can generate biological summaries of experiments, describing everything from individual cell types to the characteristics of entire tissues.
* **Predictive Tasks:** The framework handles diverse tasks including cell type annotation and the generation of synthetic cells or tissues for research simulations.

### Performance and Biological Scaling Laws

* Research demonstrates that biological language models follow predictable scaling laws, where performance in tasks like cell type annotation improves as model size increases.
* Larger models show superior gene overlap and semantic similarity scores when interpreting datasets compared to smaller versions.
* Smaller models remain highly effective for parameter-efficient fine-tuning in resource-constrained environments.

C2S-Scale is available as an open-source resource on GitHub and HuggingFace, offering a flexible toolkit for the research community to apply large language models to next-generation genomic discovery.
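A toy sketch of the cell-sentence mapping described above: rank genes by expression and emit their names in order. The gene symbols, counts, and `top_k` cutoff are illustrative; the real pipeline operates on full scRNA-seq expression profiles.

```python
def cell_to_sentence(expression: dict[str, float], top_k: int = 10) -> str:
    """Rank genes by expression (highest first) and emit their names as a 'cell sentence'."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(gene for gene, count in ranked[:top_k] if count > 0)

# Toy expression profile for a single cell (values are made-up counts).
cell = {"CD3D": 42.0, "CD8A": 30.5, "GZMB": 12.0, "MS4A1": 0.0, "IL7R": 7.5}
print(cell_to_sentence(cell, top_k=4))  # "CD3D CD8A GZMB IL7R"
```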

google

ECLeKTic: A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs

ECLeKTic is a novel benchmark designed to evaluate how effectively large language models (LLMs) transfer knowledge between languages, addressing a common limitation where models possess information in a source language but fail to access it in others. By utilizing a closed-book question-answering format based on language-specific Wikipedia entries, the benchmark quantifies the gap between human-like cross-lingual understanding and current machine performance. Initial testing reveals that even state-of-the-art models have significant room for improvement, with the highest-performing model, Gemini 2.5 Pro, achieving only a 52.6% success rate.

## Methodology and Dataset Construction

The researchers built the ECLeKTic dataset by focusing on "information silos" within Wikipedia to ensure the models would need to perform internal transfer rather than simply recalling translated training data.

* The dataset targets 12 languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish.
* Researchers selected 100 articles per language from a July 2023 Wikipedia snapshot that existed exclusively in that specific language and had no equivalent articles in the other 11 targeted languages.
* This approach uses Wikipedia presence as a proxy to identify facts likely encountered by the model in only one language during its training phase.

## Human Refinement and Decontextualization

To ensure the quality and portability of the questions, the team employed native speakers to refine and verify the data generated by AI.

* Human annotators filtered Gemini-generated question-and-answer pairs to ensure they were answerable in a closed-book setting without referring to external context.
* Annotators performed "decontextualization" by adding specific details to ambiguous terms; for example, a reference to the "Supreme Court" was clarified as the "Israeli Supreme Court" to ensure the question remained accurate after translation.
* Questions were curated to focus on cultural and local salience rather than general global knowledge like science or universal current events.
* The final dataset consists of 384 unique questions, which were translated and verified across all 11 target languages, resulting in 4,224 total examples.

## Benchmarking Model Performance

The benchmark evaluates models using a specific metric called "overall success," which measures a model's ability to answer a question correctly in both the original source language and the target language (sketched below).

* The benchmark was used to test eight leading open and proprietary LLMs.
* Gemini 2.0 Pro initially set a high bar with 41.6% success, which was later surpassed by Gemini 2.5 Pro at 52.6%.
* The results demonstrate that while models are improving, they still struggle to maintain consistent knowledge across different linguistic contexts, representing a major hurdle for equitable global information access.

The release of ECLeKTic as an open-source benchmark on Kaggle provides a vital tool for the AI community to bridge the "knowledge gap" between high-resource and low-resource languages. Developers and researchers should use this data to refine training methodologies, aiming for models that can express their internal knowledge regardless of the language used in the prompt.
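A small sketch of the "overall success" metric as described above; the record layout (one entry per question and target language, with two correctness flags) is an assumption for illustration.

```python
def overall_success(results: list[dict]) -> float:
    """A record counts as a success only if the model answered correctly in BOTH
    the original source language and the target language."""
    hits = sum(1 for r in results if r["correct_in_source"] and r["correct_in_target"])
    return hits / len(results)

# Toy evaluation records.
results = [
    {"qid": 1, "target_lang": "hi", "correct_in_source": True,  "correct_in_target": True},
    {"qid": 1, "target_lang": "ja", "correct_in_source": True,  "correct_in_target": False},
    {"qid": 2, "target_lang": "he", "correct_in_source": False, "correct_in_target": True},
]
print(f"{overall_success(results):.1%}")  # 33.3%
```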

google

Deciphering language processing in the human brain through LLM representations

Recent research by Google Research and collaborating universities indicates that Large Language Models (LLMs) process natural language through internal representations that closely mirror neural activity in the human brain. By comparing intracranial recordings from spontaneous conversations with the internal embeddings of the Whisper speech-to-text model, the study found a high degree of linear alignment between artificial and biological language processing. These findings suggest that the statistical structures learned by LLMs via next-word prediction provide a viable computational framework for understanding how humans comprehend and produce speech.

## Mapping LLM Embeddings to Brain Activity

* Researchers utilized intracranial electrodes to record neural signals during real-world, free-flowing conversations.
* The study compared neural activity against two distinct types of embeddings from the Transformer-based Whisper model: "speech embeddings" from the model’s encoder and "language embeddings" from the decoder.
* A linear transformation was used to predict brain signals based on these embeddings, revealing that LLMs and the human brain share similar multidimensional spaces for coding linguistic information (a generic encoding-model sketch follows below).
* The alignment suggests that human language processing may rely more on statistical structures and contextual embeddings than on traditional symbolic rules or syntactic parts of speech.

## Neural Sequences in Speech Comprehension

* When a subject listens to speech, the brain follows a specific chronological sequence that aligns with model representations.
* Initially, speech embeddings predict cortical activity in the superior temporal gyrus (STG), which is responsible for processing auditory speech sounds.
* A few hundred milliseconds later, language embeddings predict activity in Broca’s area (located in the inferior frontal gyrus), marking the transition from sound perception to decoding meaning.

## Reversed Dynamics in Speech Production

* During speech production, the neural sequence is reversed, beginning approximately 500 milliseconds before a word is articulated.
* Processing starts in Broca’s area, where language embeddings predict activity as the brain plans the semantic content of the utterance.
* This is followed by activity in the motor cortex (MC), aligned with speech embeddings, as the brain prepares the physical articulatory movements.
* Finally, after articulation, speech embeddings predict activity back in the STG, suggesting the brain is monitoring the sound of the speaker's own voice.

This research validates the use of LLMs as powerful predictive tools for neuroscience, offering a new lens through which to study the temporal and spatial dynamics of human communication. By bridging the gap between artificial intelligence and cognitive biology, researchers can better model how the brain integrates sound and meaning in real time.
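A generic sketch of the linear encoding-model idea (predicting electrode activity from word-level embeddings with a ridge regression), using synthetic data; the shapes, regularization strength, and use of scikit-learn are illustrative assumptions, not the study's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 500, 384, 64

# Synthetic stand-ins: word-level embeddings and per-word electrode responses.
embeddings = rng.standard_normal((n_words, emb_dim))
true_weights = rng.standard_normal((emb_dim, n_electrodes)) * 0.1
neural = embeddings @ true_weights + rng.standard_normal((n_words, n_electrodes))

X_train, X_test, y_train, y_test = train_test_split(embeddings, neural, test_size=0.2, random_state=0)
model = Ridge(alpha=10.0).fit(X_train, y_train)   # the linear transformation
pred = model.predict(X_test)

# Per-electrode correlation between predicted and recorded activity.
corrs = [np.corrcoef(pred[:, e], y_test[:, e])[0, 1] for e in range(n_electrodes)]
print(f"mean encoding correlation: {np.mean(corrs):.2f}")
```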

coupang

Accelerating Coupang’s AI Journey with LLMs | by Coupang Engineering | Coupang Engineering Blog | Medium

Coupang is strategically evolving its machine learning infrastructure to integrate Large Language Models (LLMs) and foundation models across its e-commerce ecosystem. By transitioning from task-specific deep learning models to multi-modal transformers, the company aims to enhance customer experiences in search, recommendations, and logistics. This shift necessitates a robust ML platform capable of handling the massive compute, networking, and latency demands inherent in generative AI.

### Core Machine Learning Domains

Coupang’s existing ML ecosystem is built upon three primary pillars that drive business logic:

* **Recommendation Systems:** These models leverage vast datasets of user interactions, including clicks, purchases, and relevance judgments, to power home feeds, search results, and advertising.
* **Content Understanding:** Utilizing deep learning to process product catalogs, user reviews, and merchant data to create unified representations of customers and products.
* **Forecasting Models:** Predictive algorithms manage over 100 fulfillment centers, optimizing pricing and logistics for millions of products through a mix of statistical methods and deep learning.

### Enhancing Multimodal and Language Understanding

The adoption of Foundation Models (FM) has unified previously fragmented ML tasks, particularly in multilingual environments:

* **Joint Modeling:** Instead of separate embeddings, vision and language transformer models jointly model product images and metadata (titles/descriptions) to improve ad retrieval and similarity searches.
* **Cross-Border Localization:** LLMs facilitate the translation of product titles from Korean to Mandarin and improve the quality of shopping feeds for global sellers.
* **Weak Label Generation:** To overcome the high cost of human labeling in multiple languages, Coupang uses LLMs to generate high-quality "weak labels" for training downstream models, addressing label scarcity in under-resourced segments.

### Infrastructure for Large-Scale Training

Scaling LLM training requires a shift in hardware architecture and distributed computing strategies:

* **High-Performance Clusters:** The platform utilizes H100 and A100 GPU clusters interconnected with high-speed InfiniBand or RoCE (RDMA over Converged Ethernet) networking to minimize communication bottlenecks.
* **Distributed Frameworks:** To fit massive models into GPU memory, Coupang employs various parallelism techniques, including Fully Sharded Data Parallelism (FSDP), Tensor Parallelism (TP), and Pipeline Parallelism (PP).
* **Efficient Categorization:** Traditional architectures that required a separate model for every product category are being replaced by a single, massive multi-modal transformer capable of handling categorization and attribute extraction across the entire catalog.

### Optimizing LLM Serving and Inference

The transition to real-time generative AI features requires significant optimizations to manage the high computational cost of inference:

* **Quantization Strategies:** To reduce memory footprint and increase throughput, models are compressed to FP8, INT8, or INT4 precision without significant loss in accuracy.
* **Advanced Serving Techniques:** The platform implements Key-Value (KV) caching to avoid redundant computations during text generation (a toy illustration follows below) and utilizes continuous batching (via engines like vLLM or TGI) to maximize GPU utilization.
* **Lifecycle Management:** A unified platform vision ensures that the entire end-to-end lifecycle, from data preparation and fine-tuning to deployment, is streamlined for ML engineers.

To stay competitive, Coupang is moving toward an integrated AI lifecycle where foundation models serve as the backbone for both content generation and predictive analytics. This infrastructure-first approach allows for the rapid deployment of generative features while maintaining the resource efficiency required for massive e-commerce scales.
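A toy numpy illustration of the KV-caching idea mentioned above: keys and values for earlier tokens are cached and reused, so each decoding step only computes attention for the newest token's query. This is a conceptual sketch, not Coupang's serving stack or vLLM internals.

```python
import numpy as np

def attention_step(q_new, K_cache, V_cache):
    """Single-head attention for one new token against all cached keys/values."""
    scores = q_new @ K_cache.T / np.sqrt(q_new.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
rng = np.random.default_rng(0)
for step in range(4):                          # decode 4 tokens autoregressively
    k_new, v_new, q_new = rng.standard_normal((3, d))
    K_cache = np.vstack([K_cache, k_new])      # append to the cache instead of recomputing history
    V_cache = np.vstack([V_cache, v_new])
    out = attention_step(q_new, K_cache, V_cache)
print(K_cache.shape)  # (4, 8): one cached key per generated token
```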

coupang

Meet Coupang’s Machine Learning Platform | by Coupang Engineering | Coupang Engineering Blog | Medium

Coupang’s internal Machine Learning Platform (MLP) is a comprehensive "batteries-included" ecosystem designed to streamline the end-to-end lifecycle of ML development across its diverse business units, including e-commerce, logistics, and streaming. By providing standardized tools for feature engineering, pipeline authoring, and model serving, the platform significantly reduces the time-to-production while enabling scalable, efficient compute management. Ultimately, this infrastructure allows Coupang to leverage advanced models like Ko-BERT for search and real-time forecasting to enhance the customer experience at scale.

**Motivation for a Centralized Platform**

* **Reduced Time to Production:** The platform aims to accelerate the transition from ad-hoc exploration to production-ready services by eliminating repetitive infrastructure setup.
* **CI/CD Integration:** By incorporating continuous integration and delivery into ML workflows, the platform ensures that experiments are reproducible and deployments are reliable.
* **Compute Efficiency:** Managed clusters allow for the optimization of expensive hardware resources, such as GPUs, across multiple teams and diverse workloads like NLP and Computer Vision.

**Notebooks and Pipeline Authoring**

* **Managed Jupyter Notebooks:** Provides data scientists with a standardized environment for initial data exploration and prototyping.
* **Pipeline SDK:** Developers can use a dedicated SDK to define complex ML workflows as code, facilitating the transition from research to automated pipelines.
* **Framework Agnostic:** The platform supports a wide range of ML frameworks and programming languages to accommodate different model architectures.

**Feature Engineering and Data Management**

* **Centralized Feature Store:** Enables teams to share and reuse features, reducing redundant data processing and ensuring consistency across the organization.
* **Consistent Data Pipelines:** Bridges the gap between offline training and online real-time inference by providing a unified interface for data transformations.
* **Large-scale Preparation:** Streamlines the creation of training datasets from Coupang’s massive logs, including product catalogs and user behavior data.

**Training and Inference Services**

* **Scalable Model Training:** Handles distributed training jobs and resource orchestration, allowing for the development of high-parameter models.
* **Robust Model Inference:** Supports low-latency model serving for real-time applications such as ad ranking, video recommendations in Coupang Play, and pricing.
* **Dedicated Infrastructure:** Training and inference clusters abstract the underlying hardware complexity, allowing engineers to focus on model logic rather than server maintenance.

**Monitoring and Observability**

* **Performance Tracking:** Integrated tools monitor model health and performance metrics in live production environments.
* **Drift Detection:** Provides visibility into data and model drift, ensuring that models remain accurate as consumer behavior and market conditions change.

For organizations looking to scale their AI capabilities, investing in an integrated platform that bridges the gap between experimentation and production is essential. By standardizing the "plumbing" of machine learning, such as feature stores and automated pipelines, companies can drastically increase the velocity of their data science teams and ensure the long-term reliability of their production models.