python

11 posts

line

Advancing Guardrail Models through Automated Vulnerability Collection and Generation Using Coding Agents (opens in new tab)

LLM 시대의 보호 장치, 가드레일 LLM 기반 서비스가 빠르게 확산되면서 LLM 모델의 응답을 공격자가 의도대로 ‘조종’하려는 시도도 함께 늘고 있습니다. 특히 다음과 같은 공격 유형은 실제 서비스 환경에서 지속적으로 관찰되는 시도입니다. 프롬프트 인젝션(prompt injection): 사용자의 입력에 “이전 지시를 무시하라” 같은 문장을 섞어 시스템/개발자 지시보다 공격자의 지시를 우선하도록 유도하는 공격 방식입니다. 탈옥(jailbreaking): 모델이 따라야 할 안전 정책이나 제한을…

meta

Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption (opens in new tab)

The 2025 Typed Python Survey highlights that type hinting has transitioned from an optional feature to a core development standard, with 86% of respondents reporting frequent usage. While mid-career developers show the highest enthusiasm for typing, the ecosystem faces ongoing friction from tooling fragmentation and the complexity of advanced type logic. Overall, the community is pushing for a more robust system that mirrors the expressive power of TypeScript while maintaining Python’s hallmark flexibility. ## Respondent Demographics and Adoption Trends * The survey analyzed responses from 1,241 developers, the majority of whom are highly experienced, with nearly half reporting over a decade of Python expertise. * Adoption is highest among developers with 5–10 years of experience (93%), whereas junior developers (83%) and those with over 10 years of experience (80%) show slightly lower usage rates. * The lower adoption among seniors is attributed to the management of legacy codebases and long-standing habits formed before type hints were introduced to the language. ## Primary Drivers for Typing Adoption * **Incremental Integration:** Developers value the "gradual typing" approach, which allows them to add types to existing projects at their own pace without breaking the codebase. * **Improved Tooling and IDE Support:** Typing significantly enhances developer experience by enabling more accurate autocomplete, jump-to-definition, and inline documentation in IDEs. * **Bug Prevention and Readability:** Type hints act as living documentation that helps catch subtle bugs during refactoring and makes complex codebases easier for teams to reason about. * **Library Compatibility:** Features like Protocols and Generics are highly appreciated, particularly for their synergy with modern libraries like Pydantic and FastAPI that utilize type annotations at runtime. ## Technical Pain Points and Ecosystem Friction * **Third-Party Integration:** A major hurdle is the inconsistent quality or total absence of type stubs in massive libraries like NumPy, Pandas, and Django. * **Tooling Fragmentation:** Developers expressed frustration over inconsistencies between major type checkers like Mypy and Pyright, as well as the slow performance of Mypy in large projects. * **Conceptual Complexity:** Advanced features such as variance (co/contravariance), decorators, and complex nested Generics remain difficult for many developers to implement correctly. * **Runtime Limitations:** Because Python does not enforce types at the interpreter level, some developers find it difficult to justify the verbosity of typing when it offers no native runtime guarantees. ## Most Requested Type System Enhancements * **TypeScript Parity:** There is a strong demand for features found in TypeScript, specifically Intersection types (using the `&` operator), Mapped types, and Conditional types. * **Utility Types:** Developers are looking for built-in utilities like `Pick`, `Omit`, and `keyof` to handle dictionary shapes more effectively. * **Improved Structural Typing:** While `TypedDict` exists, respondents want more flexible, anonymous structural typing to handle complex data structures without excessive boilerplate. * **Performance and Enforcement:** There is a recurring request for an official, high-performance built-in type checker and optional runtime enforcement to bridge the gap between static analysis and execution. As the Python type system continues to mature, developers should prioritize incremental adoption in shared libraries and internal APIs to maximize the benefits of static analysis. While waiting for more advanced features like intersection types, focusing on tooling consistency—such as aligning team standards around a specific type checker—can mitigate much of the friction identified in the 2025 survey.

google

DS-STAR: A state-of-the-art versatile data science agent (opens in new tab)

DS-STAR is an advanced autonomous data science agent developed to handle the complexity and heterogeneity of real-world data tasks, ranging from statistical analysis to visualization. By integrating a specialized file analysis module with an iterative planning and verification loop, the system can interpret unstructured data and refine its reasoning steps dynamically based on execution feedback. This architecture allows DS-STAR to achieve state-of-the-art performance on major industry benchmarks, effectively bridging the gap between natural language queries and executable, verified code. ## Comprehensive Data File Analysis The framework addresses a major limitation of current agents—the over-reliance on structured CSV files—by implementing a dedicated analysis stage for diverse data formats. * The system automatically scans a directory to extract context from heterogeneous formats, including JSON, unstructured text, and markdown files. * A Python-based analysis script generates a textual summary of the data structure and content, which serves as the foundational context for the planning phase. * This module ensures the agent can navigate complex, multi-file environments where critical information is often spread across non-relational sources. ## Iterative Planning and Verification Architecture DS-STAR utilizes a sophisticated loop involving four specialized roles to mimic the workflow of a human expert conducting sequential analysis. * **Planner and Coder:** A Planner agent establishes high-level objectives, which a Coder agent سپس translates into executable Python scripts. * **LLM-based Verification:** A Verifier agent acts as a judge, assessing whether the generated code and its output are sufficient to solve the problem or if the reasoning is flawed. * **Dynamic Routing:** If the Verifier identifies gaps, a Router agent guides the refinement process by adding new steps or correcting errors, allowing the cycle to repeat for up to 10 rounds. * **Intermediate Review:** The agent reviews intermediate results before proceeding to the next step, similar to how data scientists use interactive environments like Google Colab. ## Benchmarking and State-of-the-Art Performance The effectiveness of the DS-STAR framework was validated through rigorous testing against existing agents like AutoGen and DA-Agent. * The agent secured the top rank on the public DABStep leaderboard, raising accuracy from 41.0% to 45.2% compared to previous best-performing models. * Performance gains were consistent across other benchmarks, including KramaBench (39.8% to 44.7%) and DA-Code (37.0% to 38.5%). * DS-STAR showed a significant advantage in "hard" tasks—those requiring the synthesis of information from multiple, varied data sources—demonstrating its superior versatility in complex environments. By automating the time-intensive tasks of data wrangling and verification, DS-STAR provides a robust template for the next generation of AI assistants. Organizations looking to scale their data science capabilities should consider adopting iterative agentic workflows that prioritize multi-format data understanding and self-correcting execution loops.

google

MLE-STAR: A state-of-the-art machine learning engineering agent (opens in new tab)

MLE-STAR is a state-of-the-art machine learning engineering agent designed to automate complex ML tasks by treating them as iterative code optimization challenges. Unlike previous agents that rely solely on an LLM’s internal knowledge, MLE-STAR integrates external web searches and targeted ablation studies to pinpoint and refine specific pipeline components. This approach allows the agent to achieve high-performance results, evidenced by its ability to win medals in 63% of Kaggle competitions within the MLE-Bench-Lite benchmark. ## External Knowledge and Targeted Ablation The core of MLE-STAR’s effectiveness lies in its ability to move beyond generic machine learning libraries by incorporating external research and specific performance testing. * The agent uses web search to retrieve task-specific, state-of-the-art models and approaches rather than defaulting to familiar libraries like scikit-learn. * Instead of modifying an entire script at once, the system conducts an ablation study to evaluate the impact of individual pipeline components, such as feature engineering or model selection. * By identifying which code blocks have the most significant impact on performance, the agent can focus its reasoning and optimization efforts where they are most needed. ## Iterative Refinement and Intelligent Ensembling Once the critical components are identified, MLE-STAR employs a specialized refinement process to maximize the effectiveness of the generated solution. * Targeted code blocks undergo iterative refinement based on LLM-suggested plans that incorporate feedback from prior experimental failures and successes. * The agent features a unique ensembling strategy where it proposes multiple candidate solutions and then designs its own method to merge them. * Rather than using simple validation-score voting, the agent iteratively improves the ensemble strategy itself, treating the combination of models as a distinct optimization task. ## Robustness and Safety Verification To ensure the generated code is both functional and reliable for real-world deployment, MLE-STAR incorporates three specialized diagnostic modules. * **Debugging Agent:** Automatically analyzes tracebacks and execution errors in Python scripts to provide iterative corrections. * **Data Leakage Checker:** Reviews the solution script prior to execution to ensure the model does not improperly access test dataset information during the training phase. * **Data Usage Checker:** Analyzes whether the script is utilizing all available data sources, preventing the agent from overlooking complex data formats in favor of simpler files like CSVs. By combining external grounding with a granular, component-based optimization strategy, MLE-STAR represents a significant shift in automated machine learning. For organizations looking to scale their ML workflows, such an agent suggests a future where the role of the engineer shifts from manual coding to high-level supervision of autonomous agents that can navigate the vast landscape of research and data engineering.

line

Implementing a RAG-based (opens in new tab)

To address the operational burden of handling repetitive user inquiries for the AWX automation platform, LY Corporation developed a support bot utilizing Retrieval-Augmented Generation (RAG). By combining internal documentation with historical Slack thread data, the system provides automated, context-aware answers that significantly reduce manual SRE intervention. This approach enhances service reliability by ensuring users receive immediate assistance while allowing engineers to focus on high-priority development tasks. ### Technical Infrastructure and Stack * **Slack Integration**: The bot is built using the **Bolt for Python** framework to handle real-time interactions within the company’s communication channels. * **LLM Orchestration**: **LangChain** is used to manage the RAG pipeline; the developers suggest transitioning to LangGraph for teams requiring more complex multi-agent workflows. * **Embedding Model**: The **paraphrase-multilingual-mpnet-base-v2** (SBERT) model was selected to support multi-language inquiries from LY Corporation’s global workforce. * **Vector Database**: **OpenSearch** serves as the vector store, chosen for its availability as an internal PaaS and its efficiency in handling high-dimensional data. * **Large Language Model**: The system utilizes **OpenAI (ChatGPT) Enterprise**, which ensures business data privacy by preventing the model from training on internal inputs. ### Enhancing LLM Accuracy through RAG and Vector Search * **Overcoming LLM Limits**: Traditional LLMs suffer from "hallucinations," lack of up-to-date info, and opaque sourcing; RAG fixes this by providing the model with specific, trusted context during the prompt phase. * **Embedding and Vectorization**: Textual data from wikis and chats are converted into high-dimensional vectors, where semantically similar phrases (e.g., "Buy" and "Purchase") are stored in close proximity. * **k-NN Retrieval**: When a user asks a question, the bot uses **k-Nearest Neighbors (k-NN)** algorithms to retrieve the top *k* most relevant snippets of information from the vector database. * **Contextual Generation**: Rather than relying on its internal training data, the LLM generates a response based specifically on the retrieved snippets, leading to higher accuracy and domain-specific relevance. ### AWX Support Bot Workflow and Data Sources * **Multi-Source Indexing**: The bot references two main data streams: the official internal AWX guide wiki and historical Slack inquiry threads where previous solutions were discussed. * **Automated First Response**: The workflow begins when a user submits a query via a Slack workflow; the bot immediately processes the request and provides an initial AI-generated answer. * **Human-in-the-Loop Validation**: After receiving an answer, users can click "Issue Resolved" to close the ticket or "Call AWX Admin" if the AI's response was insufficient. * **Efficiency Gains**: This tiered approach filters out "RTFM" (Read The F***ing Manual) style questions, ensuring that human administrators only spend time on unique or complex technical issues. Implementing a RAG-based support bot is a highly effective strategy for SRE teams looking to scale their internal support without increasing headcount. For the best results, organizations should focus on maintaining clean internal documentation and selecting embedding models that reflect the linguistic diversity of their specific workforce.

datadog

Our journey taking Kubernetes state metrics to the next level | Datadog (opens in new tab)

Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allowed for high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes. ### Challenges with KSM Scalability * KSM uses the informer pattern to expose cluster-level metadata via the Openmetrics format, but the volume of data grows exponentially with cluster size. * In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40 metrics. * In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes. * The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity. ### Limitations of Legacy Implementations * KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation. * The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once during query time. * To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments). * This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization. ### Architectural Improvements in KSM v2.0 * Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design. * The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time. * By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins. * The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags—such as joining deployment labels to specific metrics—without the significant latency overhead previously experienced. Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.

datadog

Cheering on coworkers: Building culture with Datadog dashboards | Datadog (opens in new tab)

Datadog engineers developed a real-time tracking dashboard to monitor a colleague’s progress during an 850km, six-day ultra-marathon challenge. By scraping public race statistics and piping the data into their monitoring platform, the team created a centralized visualization tool to provide remote support and office-wide engagement. ### Data Extraction and Parsing The team needed to harvest race data that was only available as plain HTML on the event’s official website. * A crawler was built using the Python `Requests` library to automate the retrieval of the webpage's source code. * The team utilized `BeautifulSoup` to parse the HTML and isolate specific data points, such as the runner's current ranking and total distance covered. ### Ingesting Metrics with StatsD Once the data was structured, it was converted into telemetry using the Datadog agent and the `statsd` Python library. * The script utilized `dog.gauge` to emit three primary metrics: `runner.distance`, `runner.ranking`, and `runner.elapsed_time`. * Each metric was assigned a "name" tag corresponding to the runner, allowing the team to filter data and compare participants within the Datadog interface. * The data was updated periodically to ensure the dashboard reflected the most current race standings. ### Dashboard Visualization and Results The final phase involved synthesizing the metrics into a high-visibility dashboard displayed in the company’s New York and Paris offices. * The dashboard combined technical performance graphs with multimedia elements, including live video feeds and GIFs, to create an interactive cheering station. * The system successfully tracked the athlete's 47km lead in real-time, providing the team with immediate updates on his physical progress and elapsed time over the 144-hour event. This project demonstrates how standard observability tools can be repurposed for creative "life-graphing" applications. By combining simple web scraping with metric ingestion, engineers can quickly build custom monitoring solutions for any public data source.