
Building an Enterprise LLM Service

LY Corporation’s engineering team developed an AI assistant for their private cloud platform, Flava, by prioritizing "context engineering" over traditional prompt engineering. To manage a complex environment of 260 APIs and hundreds of technical documents, they implemented a strategy of progressive disclosure to ensure the LLM receives only the most relevant information for any given query. This approach allows the assistant to move beyond simple RAG-based document summarization to perform active diagnostics and resource management based on real-time API data.

Performance Limitations of Long Contexts

  • Research indicates that LLM performance can drop by 13.9% to 85% as context length increases, even if the model technically supports a large token window.
  • The phenomenon of "context rot" occurs when low-quality or irrelevant information is mixed into the input, causing the model to generate confident but incorrect answers.
  • Because LLMs are stateless, maintaining conversation history and processing dense JSON responses from multiple APIs quickly exhausts context windows and degrades reasoning quality; a sketch of budget-based history trimming follows below.
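
A minimal sketch of that budgeting problem, assuming OpenAI-style message dicts and a crude 4-characters-per-token estimate (both illustrative assumptions, not details from the Flava system): older turns are dropped first so the newest context survives within the window.

```python
# A minimal sketch of history trimming under a fixed token budget.
# The message shape and 4-chars-per-token heuristic are illustrative
# assumptions, not the Flava implementation described in the article.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message plus the most recent turns that fit."""
    system, turns = messages[0], messages[1:]
    kept: list[dict] = []
    used = estimate_tokens(system["content"])
    # Walk backwards so the newest context is retained first.
    for msg in reversed(turns):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [
    {"role": "system", "content": "You are the Flava assistant."},
    {"role": "user", "content": "Why is my Redis cluster slow?"},
    {"role": "assistant", "content": "Checking cluster metrics..."},
    {"role": "user", "content": "Now the writes are timing out."},
]
print(trim_history(history, budget=64))
```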

Progressive Disclosure and Tool Selection

  • The system avoids loading all 260+ API definitions at once; instead, it analyzes the user's intent to select only the necessary tools, such as loading only Redis-related APIs when a user asks about a cluster (see the sketch after this list).
  • Specific product usage hints, such as the distinction between private and CDN settings for Object Storage, are injected only when those specific services are invoked.
  • This phased approach significantly reduces token consumption and prevents the model from being overwhelmed by irrelevant technical specifications.
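
A minimal sketch of progressive disclosure over a large tool registry. The registry entries, tag scheme, and keyword overlap are illustrative assumptions; the article describes intent analysis, which a production system might perform with a lightweight classifier or LLM call rather than plain keyword matching.

```python
# A minimal sketch of progressive disclosure: expose only the tools
# whose tags match the user's apparent intent, instead of binding all
# 260+ API definitions on every turn. Registry contents are hypothetical.

TOOL_REGISTRY = [
    {"name": "redis_get_cluster_info", "tags": {"redis", "cluster"}},
    {"name": "redis_list_nodes", "tags": {"redis"}},
    {"name": "objstore_set_cdn_policy", "tags": {"object-storage", "cdn"}},
    {"name": "vm_resize_instance", "tags": {"compute"}},
    # ... hundreds more API definitions in the real system
]

def select_tools(user_query: str, registry: list[dict]) -> list[dict]:
    """Return only the tools whose tags overlap the query's keywords."""
    words = set(user_query.lower().replace("-", " ").split())
    return [tool for tool in registry if tool["tags"] & words]

# Only the Redis-related APIs are loaded for a Redis question.
tools = select_tools("Why is my redis cluster degraded?", TOOL_REGISTRY)
print([t["name"] for t in tools])  # ['redis_get_cluster_info', 'redis_list_nodes']
```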

Response Guidelines and the "Mock Tool Message" Strategy

  • The team distinguished between "System Prompts" (global rules) and "Response Guidelines" (situational instructions), such as directing users to a console UI before suggesting CLI commands.
  • Injecting specific guidelines into the system prompt often caused "instruction conflict," where the LLM might hallucinate information to satisfy a guideline while ignoring core requirements like using search tools.
  • To resolve these conflicts, the team injected guidelines via "ToolMessages": by formatting instructions as if they were results from a tool execution, the LLM treats the information as factual context rather than a command that might override the system prompt (a sketch of this pattern follows below).
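
A minimal sketch of the mock-tool-message pattern in OpenAI-style chat format. The get_response_guideline tool name, call id, and guideline text are hypothetical; the point is that the guideline arrives as tool output, which the model reads as retrieved fact rather than as a competing instruction in the system prompt.

```python
# A minimal sketch of the "mock tool message" strategy. All names and
# guideline text are illustrative assumptions, not the team's actual code.

import json

def mock_guideline_messages(call_id: str, guideline: str) -> list[dict]:
    """Fabricate an assistant tool call plus its 'result' carrying a guideline."""
    return [
        {   # The assistant appears to have requested the guideline itself.
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": call_id,
                "type": "function",
                "function": {"name": "get_response_guideline", "arguments": "{}"},
            }],
        },
        {   # The guideline is delivered as a tool result, so the model
            # treats it as retrieved context, not an overriding command.
            "role": "tool",
            "tool_call_id": call_id,
            "content": json.dumps({"guideline": guideline}),
        },
    ]

messages = [
    {"role": "system", "content": "You are the Flava assistant. Always use search tools."},
    {"role": "user", "content": "How do I change my Object Storage settings?"},
    *mock_guideline_messages(
        "call_guideline_1",
        "Direct users to the console UI before suggesting CLI commands.",
    ),
]
```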

To build a robust enterprise LLM service, developers should focus on dynamic context management rather than static prompt optimization. Treating operational guidelines as external data via mock tool messages, rather than system instructions, provides a scalable way to reduce hallucinations and maintain high performance across hundreds of integrated services.