naver

Naver TV

Processing complex PDF documents remains a significant bottleneck for Large Language Models (LLMs) due to the intricate layouts, nested tables, and visual charts that standard text extractors often fail to capture. To address this, NAVER developed PaLADIN, an LLM-friendly PDF parser designed to transform visual document elements into structured data that models can accurately interpret. By combining specialized vision models with advanced OCR, the system enables high-fidelity document understanding for demanding tasks like analyzing financial reports.

### Challenges in Document Intelligence

* Standard PDF parsing often loses the semantic structure of the document, such as the relationship between headers and body text.
* Tables and charts pose the greatest difficulty, as numerical values and trends must be extracted without losing the spatial context that defines their meaning.
* A "one-size-fits-all" approach to text extraction leads to hallucinations when LLMs attempt to reconstruct data from fragmented strings.

### The PaLADIN Architecture and Model Integration

* **Element Detection:** The system uses `Doclayout-Yolo` to identify and categorize document components such as text blocks, titles, tables, and figures.
* **Table Extraction:** Visual table structures are processed through `nemoretriever-table-structure-v1`, ensuring that cell boundaries and headers are preserved.
* **Chart Interpretation:** To convert visual charts into descriptive text or data, the parser employs `google/gemma3-27b-it`, allowing the LLM to "read" visual trends.
* **Text Recognition:** For high-accuracy character recognition, particularly in multilingual contexts, the pipeline integrates NAVER's `Papago OCR`.
* **Infrastructure:** The architecture leverages `nv-ingest` for optimized throughput and speed, making it suitable for large-scale document processing.
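The detect-then-route design above can be sketched in a few lines: a layout detector labels each region, and each label is dispatched to the extractor suited to it. This is a minimal illustration, not NAVER's implementation; the handler functions and `Element` type are hypothetical stand-ins for the specialized models named in the bullets.

```python
from dataclasses import dataclass
from typing import Callable

# A detected layout element: category label plus its bounding box on the page.
@dataclass
class Element:
    category: str   # e.g. "text", "title", "table", "figure"
    bbox: tuple     # (x0, y0, x1, y1) in page coordinates
    page: int

# Hypothetical handlers standing in for the specialized models above:
# a table-structure model for tables, a VLM for charts/figures, OCR for text.
def extract_table(el: Element) -> dict:
    return {"type": "table", "page": el.page, "cells": []}     # placeholder output

def describe_figure(el: Element) -> dict:
    return {"type": "figure", "page": el.page, "caption": ""}  # placeholder output

def run_ocr(el: Element) -> dict:
    return {"type": "text", "page": el.page, "text": ""}       # placeholder output

# Route each detected element to the handler suited to its category.
HANDLERS: dict[str, Callable[[Element], dict]] = {
    "table": extract_table,
    "figure": describe_figure,
    "text": run_ocr,
    "title": run_ocr,
}

def parse_page(elements: list[Element]) -> list[dict]:
    # Sort top-to-bottom, left-to-right so downstream LLMs see reading order.
    ordered = sorted(elements, key=lambda e: (e.page, e.bbox[1], e.bbox[0]))
    return [HANDLERS.get(e.category, run_ocr)(e) for e in ordered]
```

Keeping the handlers behind a single dispatch table is what makes the pipeline "layout-aware": each element type keeps its own structured output instead of being flattened into one text stream.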
### Evaluation and Real-world Application

* **Performance Metrics:** NAVER established a dedicated parsing evaluation set to measure accuracy across diverse document types, focusing on speed and structural integrity.
* **AIB Securities Reports:** The parser is currently applied to summarize complex stock market reports, where precision in numerical data is critical.
* **LLM-as-a-Judge:** To ensure summary quality, the system uses an automated evaluation framework in which a high-performing LLM judges the accuracy of the generated summaries against the parsed source data.

For organizations building Retrieval-Augmented Generation (RAG) systems, the transition from basic text extraction to a layout-aware parsing pipeline like PaLADIN is crucial. Future improvements focused on table cell coordinate precision and more granular chart analysis will further reduce error rates in automated document processing.

google

MedGemma: Our most capable open models for health AI development

Google Research has expanded its Health AI Developer Foundations (HAI-DEF) collection with the release of MedGemma and MedSigLIP, a series of open, multimodal models designed specifically for medical research and application development. These models offer a high-performance, privacy-preserving alternative to closed systems, allowing developers to maintain full control over their infrastructure while leveraging state-of-the-art medical reasoning. By providing both 4B and 27B parameter versions, the collection balances computational efficiency with complex longitudinal data interpretation, even enabling deployment on single GPUs or mobile hardware.

## MedGemma Multimodal Variants

The MedGemma collection builds on the Gemma 3 architecture to process both image and text inputs, providing robust generative capabilities for healthcare tasks.

* **MedGemma 27B Multimodal:** Designed for complex tasks such as interpreting longitudinal electronic health records (EHR), this model achieves 87.7% on the MedQA benchmark, within 3 points of DeepSeek R1 at approximately one-tenth the inference cost.
* **MedGemma 4B Multimodal:** A lightweight version that scores 64.4% on MedQA, outperforming most open models under 8B parameters; it is optimized for mobile hardware and specific tasks like chest X-ray report generation.
* **Clinical Accuracy:** In unblinded studies, 81% of chest X-ray reports generated by the 4B model were judged by board-certified radiologists to be sufficient for patient management, with a RadGraph F1 score of 30.3.
* **Versatility:** The models retain general-purpose capabilities from the original Gemma base, remaining effective at instruction following and non-English language tasks while handling specialized medical data.
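Because MedGemma follows the Gemma 3 architecture, a generative call pairs an image with a text prompt in the Hugging Face multimodal chat-message format. The sketch below only builds that message payload; the model id, system prompt, and file name are illustrative assumptions, and the commented lines show roughly how the payload would feed into `transformers`.

```python
# Hypothetical Hub id for the 4B variant; check the official release for the
# exact identifier before use.
MODEL_ID = "google/medgemma-4b-it"

# Build an image+text chat request in the message format accepted by
# `AutoProcessor.apply_chat_template` for multimodal models.
def build_xray_request(image_path: str, question: str) -> list[dict]:
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are an expert radiologist."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
    ]

# With transformers installed, the payload would be consumed roughly as:
#   processor = AutoProcessor.from_pretrained(MODEL_ID)
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
#   output = model.generate(**inputs, max_new_tokens=200)
messages = build_xray_request("cxr_001.png", "Draft a report for this chest X-ray.")
```

Keeping the image and question as separate content parts in one user turn is what lets the 4B model ground its report in the pixels rather than the prompt alone.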
## MedSigLIP Specialized Image Encoding

MedSigLIP serves as the underlying vision component for the MedGemma suite, but it is also available as a standalone 400M parameter encoder for structured data tasks.

* **Architecture:** Based on the Sigmoid loss for Language Image Pre-training (SigLIP) framework, it bridges medical imagery and text through a shared embedding space.
* **Diverse Modalities:** The encoder was fine-tuned on a wide variety of medical data, including fundus photography, dermatology images, histopathology patches, and chest X-rays.
* **Functional Use Cases:** It is specifically recommended for classification, retrieval, and search tasks, where structured outputs are preferred over free-text generation.
* **Data Retention:** Training protocols preserved the model's ability to process natural images, maintaining its utility for hybrid tasks that mix medical and non-medical visual information.

## Technical Implementation and Accessibility

Google has prioritized accessibility for developers by ensuring these models can run on consumer-grade or otherwise limited hardware.

* **Hardware Compatibility:** Both the 4B and 27B models are designed to run on a single GPU, while the 4B model and MedSigLIP are adaptable to edge computing and mobile devices.
* **Open Resources:** To support the community, Google has released the technical reports, model weights on Hugging Face, and implementation code on GitHub.
* **Developer Flexibility:** Because these are open models, researchers can fine-tune them on proprietary datasets without compromising data privacy or being locked into a specific cloud provider.

For medical AI development, the choice of model should depend on the specific output requirement: MedGemma is the optimal starting point for generative tasks like visual question answering or report drafting, while MedSigLIP is the preferred tool for building high-speed classification and image retrieval systems.
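The shared-embedding-space workflow that makes MedSigLIP suited to classification and retrieval can be sketched without the model itself: embed each candidate label as text, embed the image, and pick the label with the highest cosine similarity. The tiny three-dimensional vectors below are placeholders; a real pipeline would obtain the embeddings from the MedSigLIP encoder.

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Zero-shot classification in a shared image-text embedding space:
# return the label whose text embedding is closest to the image embedding.
def classify(image_emb, label_embs: dict) -> str:
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))

# Placeholder label embeddings; real ones come from the text tower.
label_embs = {
    "pneumonia": [0.9, 0.1, 0.0],
    "normal":    [0.1, 0.9, 0.1],
}
```

This is why structured tasks favor the encoder over a generative model: the output is a ranked similarity score per label, which is cheap to compute at scale and needs no free-text parsing.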