prompt-injection

2 posts


Safety is a Given, Cost Reduction

AI developers often rely on system prompts to enforce safety rules, but this integrated approach frequently leads to "over-refusal" and unpredictable shifts in model performance. To ensure both security and operational efficiency, it is increasingly necessary to decouple safety mechanisms into separate guardrail systems that operate independently of the primary model's logic.

## Negative Impact on Model Utility

* Integrating safety instructions directly into system prompts often leads to a high False Positive Rate (FPR), where the model rejects harmless requests alongside harmful ones.
* Technical analysis using Principal Component Analysis (PCA) reveals that guardrail prompts shift the model's embeddings in a consistent direction toward refusal, regardless of the input's actual intent.
* Studies show that aggressive safety prompting can cause models to refuse benign technical queries, such as "how to kill a Python process", because the model adopts an overly conservative decision boundary.

## Positional Bias and Context Neglect

* Research on the "Lost in the Middle" phenomenon indicates that LLMs are most sensitive to information at the beginning and end of a prompt, while accuracy drops significantly for information placed in the middle.
* The Constraint Difficulty Distribution Index (CDDI) demonstrates that the order of instructions matters; models generally follow instructions better when difficult constraints are placed at the beginning of the prompt.
* In complex system prompts where safety rules are buried in the middle, the model may fail to prioritize these guardrails, leading to inconsistent safety enforcement depending on the prompt's structure.

## The Butterfly Effect of Prompt Alterations

* Small, seemingly insignificant changes to a system prompt, such as adding a single whitespace, a "Thank you" note, or switching the output format to JSON, can alter more than 10% of a model's predictions.
* Modifying safety-related lines within a unified system prompt can cause "catastrophic performance collapse," where the model's internal reasoning path is diverted, affecting unrelated tasks.
* Because LLMs treat every part of the prompt as a signal that moves their decision boundaries, managing safety and task logic in a single string makes the system brittle and difficult to iterate on.

To build robust and high-performing AI applications, developers should move away from bloated system prompts and instead implement external guardrails. This modular approach allows for precise security filtering without compromising the model's creative or logical capabilities.
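As a concrete illustration of the decoupled approach argued for above, here is a minimal Python sketch. The helper names (`guardrail_blocks`, `call_task_model`) and their placeholder bodies are assumptions for illustration only: the task system prompt carries no safety rules, and safety screening runs as independent checks on the input and the output.

```python
# Minimal sketch of external guardrails, decoupled from the task prompt.
# `guardrail_blocks` and `call_task_model` are hypothetical placeholders;
# a real system would call a dedicated safety classifier or moderation
# endpoint and the primary model API here.

TASK_SYSTEM_PROMPT = "You are a coding assistant. Answer concisely."

def guardrail_blocks(text: str) -> bool:
    """Placeholder guardrail check: returns True if `text` should be blocked."""
    banned_topics = ("build a weapon", "steal credentials")
    return any(topic in text.lower() for topic in banned_topics)

def call_task_model(system_prompt: str, user_input: str) -> str:
    """Placeholder for the primary model call; note the prompt has no safety rules."""
    return f"[model answer to: {user_input!r}]"

def answer(user_input: str) -> str:
    # Input guardrail: screen the request before it reaches the task model.
    if guardrail_blocks(user_input):
        return "Sorry, I can't help with that."
    draft = call_task_model(TASK_SYSTEM_PROMPT, user_input)
    # Output guardrail: screen the response independently before returning it.
    if guardrail_blocks(draft):
        return "Sorry, I can't help with that."
    return draft

if __name__ == "__main__":
    # A benign technical query is no longer rejected by an over-conservative
    # safety prompt inside the task model itself.
    print(answer("How do I kill a Python process?"))
```

Because the guardrail is a separate component, it can be tuned, replaced, or tested without touching the task prompt, avoiding the butterfly-effect regressions described above.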


Security threats and countermeasures in AI

Developing AI products introduces unique security vulnerabilities that extend beyond traditional software risks, ranging from package hallucinations to sophisticated indirect prompt injections. To mitigate these threats, organizations must move away from trusting LLM-generated content and instead implement rigorous validation, automated threat modeling, and input/output guardrails. The following summary details the specific risks and mitigation strategies identified by LY Corporation’s security engineering team.

## Slopsquatting and Package Hallucinations

- AI models frequently hallucinate non-existent library or package names when providing coding instructions (e.g., suggesting `huggingface-cli` instead of the correct `huggingface_hub[cli]`).
- Attackers exploit this by registering these hallucinated names on public registries to distribute malware to unsuspecting developers.
- Mitigation requires developers to manually verify all AI-suggested commands and dependencies before execution in any environment (see the sketch after this list for a first automated check).
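To make the mitigation concrete, the sketch below checks whether an AI-suggested package name is actually registered on PyPI, using its public `https://pypi.org/pypi/<name>/json` endpoint, before anything is installed. The package names in the example are illustrative assumptions, and registration alone does not prove a package is trustworthy; slopsquatted names are registered precisely so that they resolve.

```python
# Minimal sketch: verify an AI-suggested package name against PyPI before
# installing it. This catches hallucinated names that do not exist at all;
# it does NOT prove that a registered package is benign.
import urllib.error
import urllib.request

def exists_on_pypi(package_name: str) -> bool:
    """Return True if `package_name` is registered on PyPI."""
    url = f"https://pypi.org/pypi/{package_name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # PyPI answers 404 for unregistered names

if __name__ == "__main__":
    # Example names only: one real package and one obviously made-up name.
    for name in ("huggingface_hub", "totally-hallucinated-example-pkg"):
        state = "registered on PyPI" if exists_on_pypi(name) else "not registered"
        print(f"{name}: {state}")
```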
## Prompt Injection and Arbitrary Code Execution

- As seen in CVE-2024-5565 (Vanna AI), attackers can inject malicious instructions into prompts to force the application to execute arbitrary code.
- This vulnerability arises when developers grant LLMs the autonomy to generate and run logic within the application context without sufficient isolation.
- Mitigation involves treating LLM outputs as untrusted data, sanitizing user inputs, and strictly limiting the LLM's ability to execute system-level commands.

## Indirect Prompt Injection in Integrated AI

- AI assistants integrated into office environments (like Gemini for Workspace) are susceptible to indirect prompt injections hidden within emails or documents.
- A malicious email can contain "system-like" instructions that trick the AI into hiding content, redirecting users to phishing sites, or leaking data from other files.
- Mitigation requires robust guardrails that scan both the input data (the content being processed) and the generated output for instructional anomalies.

## Permission Risks in AI Agents and MCP

- The use of the Model Context Protocol (MCP) and coding agents creates risks where an agent might overstep its intended scope.
- If an agent has broad access to a developer's environment, a malicious prompt in a public repository could trick the agent into accessing or leaking sensitive data (such as salary information or private keys) from a private repository.
- Mitigation centers on the principle of least privilege, ensuring AI agents are restricted to specific, scoped directories and repositories.

## Embedding Inversion and Vector Store Vulnerabilities

- Attacks targeting the retrieval phase of RAG (Retrieval-Augmented Generation) systems can lead to data leaks.
- Embedding inversion techniques may allow attackers to reconstruct the original sensitive text from vector embeddings stored in a database.
- Securing AI products requires protecting the integrity of the vector store and ensuring that retrieved context does not bypass security filters.

## Automated Security Assessment Tools

- To scale security, LY Corporation is developing internal tools like "ConA" for automated threat modeling and "LAVA" for automated vulnerability assessment.
- These tools aim to identify AI-specific risks during the design and development phases rather than relying solely on manual reviews.

Effective AI security requires a shift in mindset: treat every LLM response as a potential security risk. Developers should adopt automated threat modeling and implement strict input/output validation layers to protect both the application infrastructure and user data from evolving AI-based exploits.
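As one way to act on the "treat every LLM response as a potential security risk" mindset, the sketch below puts an output-validation layer in front of LLM-generated SQL: only a single read-only SELECT statement is accepted, and the SQLite database is additionally opened read-only. The function names, the deny-list, and the database path are illustrative assumptions, not the actual remediation of CVE-2024-5565.

```python
# Minimal sketch of an output-validation layer for LLM-generated SQL.
# The LLM's output is treated as untrusted data: it is checked before
# execution, and the database is opened read-only as a second line of defense.
import re
import sqlite3

# Conservative deny-list of write/DDL keywords (illustrative, not exhaustive).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Accept only a single read-only SELECT statement."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:                 # reject multi-statement payloads
        return False
    if not statement.lower().startswith("select"):
        return False
    return FORBIDDEN.search(statement) is None

def run_untrusted_sql(sql: str, db_path: str = "analytics.db"):
    """Execute LLM-generated SQL only after validation, in read-only mode."""
    if not is_safe_select(sql):
        raise ValueError("Rejected LLM-generated SQL: not a single read-only SELECT")
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

The same principle applies to generated code in general: rather than running model output directly inside the application context, as in the Vanna AI example above, constrain what the output is allowed to express and execute it with the minimum privileges needed.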