llm-as-a-judge | Techlist.io

daangn Feb 27, 2026

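Things we learned spending 2 trillion tokens on category classification

Hello. We are Winter (winter.jung) and Jiwon (jiwon) from the Daangn Taxonomy team. Our team builds the category system we call the Taxonomy and operates a pipeline that uses it to automatically classify posts uploaded to Daangn, such as secondhand listings and community group posts, and loads the results for production services to use. In this post, we cover how we use LLMs for category classification in our production pipeline, and performance, cost, operations…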

llm-as-a-judge llm prompt-engineering apache-kafka+4

spotify Feb 19, 2026

Background Coding Agents: Predictable Results Through Strong Feedback Loops (Honk, Part 3) | Spotify Engineering

This is part 3 in our series about Spotify's journey with background coding agents (internal codename: “Honk”) and the future of large-scale software maintenance. See also part 1 and part…

llm-as-a-judge llm ai-agent model-context-protocol+4

naver Dec 4, 2025

Naver TV

Processing complex PDF documents remains a significant bottleneck for Large Language Models (LLMs) due to the intricate layouts, nested tables, and visual charts that standard text extractors often fail to capture. To address this, NAVER developed PaLADIN, an LLM-friendly PDF parser designed to transform visual document elements into structured data that models can accurately interpret. By combining specialized vision models with advanced OCR, the system enables high-fidelity document understanding for demanding tasks like analyzing financial reports.

### Challenges in Document Intelligence

* Standard PDF parsing often loses the semantic structure of the document, such as the relationship between headers and body text.
* Tables and charts pose the greatest difficulty, as numerical values and trends must be extracted without losing the spatial context that defines their meaning.
* A "one-size-fits-all" approach to text extraction results in "hallucinations" when LLMs attempt to reconstruct data from fragmented strings.

### The PaLADIN Architecture and Model Integration

* **Element Detection:** The system utilizes `Doclayout-Yolo` to identify and categorize document components like text blocks, titles, tables, and figures.
* **Table Extraction:** Visual table structures are processed through `nemoretriever-table-structure-v1`, ensuring that cell boundaries and headers are preserved.
* **Chart Interpretation:** To convert visual charts into descriptive text or data, the parser employs `google/gemma3-27b-it`, allowing the LLM to "read" visual trends.
* **Text Recognition:** For high-accuracy character recognition, particularly in multi-lingual contexts, the pipeline integrates NAVER’s `Papago OCR`.
* **Infrastructure:** The architecture leverages `nv-ingest` for optimized throughput and speed, making it suitable for large-scale document processing.
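
The division of labor in the architecture list can be pictured as a dispatch table: a layout detector labels each region, and each label is routed to a specialized handler. The sketch below is only an illustration under that assumption. The `Element` type, the handler functions, and `parse_page` are hypothetical placeholders; they do not call `Doclayout-Yolo`, `nemoretriever-table-structure-v1`, `google/gemma3-27b-it`, or Papago OCR, whose real interfaces are not described in the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Element:
    """A detected page region (hypothetical type, for illustration only)."""
    kind: str      # label from the layout detector: "text", "title", "table", "figure"
    payload: str   # stand-in for the cropped image region


# Placeholder handlers. In a real pipeline each would wrap a model call.
def recognize_text(element: Element) -> str:
    return f"[ocr] {element.payload}"


def extract_table(element: Element) -> str:
    return f"[table] {element.payload}"


def describe_chart(element: Element) -> str:
    return f"[chart] {element.payload}"


# Route each element kind to the handler responsible for it.
HANDLERS: Dict[str, Callable[[Element], str]] = {
    "text": recognize_text,
    "title": recognize_text,
    "table": extract_table,
    "figure": describe_chart,
}


def parse_page(elements: List[Element]) -> List[str]:
    """Convert detected elements to text, preserving reading order."""
    results = []
    for element in elements:
        # Unknown kinds fall back to plain text recognition.
        handler = HANDLERS.get(element.kind, recognize_text)
        results.append(handler(element))
    return results
```

Keeping the routing in a plain dictionary makes it easy to swap one handler without touching the others, which is one plausible reason to structure a multi-model parser this way.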

### Evaluation and Real-world Application

* **Performance Metrics:** NAVER established a dedicated parsing evaluation set to measure accuracy across diverse document types, focusing on speed and structural integrity.
* **AIB Securities Reports:** The parser is currently applied to summarize complex stock market reports, where precision in numerical data is critical.
* **LLM-as-a-Judge:** To ensure summary quality, the system uses an automated evaluation framework where a high-performing LLM judges the accuracy of the generated summaries against the parsed source data.

For organizations building RAG (Retrieval-Augmented Generation) systems, the transition from basic text extraction to a layout-aware parsing pipeline like PaLADIN is crucial. Future improvements focusing on table cell coordinate precision and more granular chart analysis will further reduce the error rates in automated document processing.
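
The LLM-as-a-Judge step mentioned above can be sketched as a small harness that shows a judge model the parsed source and the generated summary, then parses a numeric score from its reply. This is a minimal sketch, not NAVER's framework: the prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT`, `parse_score`, and `judge_summaries` are all assumptions, and the judge is passed in as a plain callable so that no particular model API is implied.

```python
import re
from typing import Callable, List, Tuple

# Assumed prompt template; the real rubric is not given in the summary.
JUDGE_PROMPT = (
    "You are grading a summary against its source.\n"
    "Source:\n{source}\n\n"
    "Summary:\n{summary}\n\n"
    "Rate factual accuracy from 1 to 5. Reply with the number only."
)


def parse_score(reply: str) -> int:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return int(match.group(1))


def judge_summaries(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Return the mean judge score over (source, summary) pairs."""
    if not pairs:
        raise ValueError("need at least one (source, summary) pair")
    scores = []
    for source, summary in pairs:
        prompt = JUDGE_PROMPT.format(source=source, summary=summary)
        scores.append(parse_score(judge(prompt)))
    return sum(scores) / len(scores)
```

In use, `judge` would wrap a call to whichever strong model serves as the evaluator. Injecting it as a function also lets the harness be tested with a stub before any model is attached.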

llm-as-a-judge ai llm ocr+4

dropbox Oct 2, 2025

A practical blueprint for evaluating conversational AI at scale

LLM applications present a deceptively simple interface: a single text box. But behind that minimalism runs a chain of probabilistic stages, including intent classification, document retrieval, ranking, prompt const…

llm-as-a-judge llm nlp rag+4