# Learning to clarify: Multi-turn conversations with Action-Based Contrastive Self-Training
Action-Based Contrastive Self-Training (ACT) is a novel approach designed to enhance the multi-turn conversational capabilities of large language models, specifically their ability to ask clarifying questions when faced with ambiguity. While standard models often default to guessing a user's intent or overhedging, ACT optimizes conversational action planning as an implicit subtask of response generation. This method demonstrates that data-efficient tuning can significantly improve dialogue policy learning and reasoning in complex, mixed-initiative interactive scenarios.

## Implicit Action Planning

* Traditional conversational agents use separate modules for dialogue planning (deciding when to clarify) and response generation.
* ACT introduces "implicit action planning," which integrates these steps by teaching the model to perform planning as an inherent part of end-to-end response generation.
* This approach addresses a limitation of standard Direct Preference Optimization (DPO), which often fails to account for the long-term, multi-turn consequences of specific dialogue actions.

## Action-Based Contrastive Data Generation

* The first phase builds a preference dataset by identifying "winning" and "losing" actions for specific conversation turns.
* Using an existing dataset, the system identifies a successful turn (e.g., a clarifying question) as the winning response.
* A synthetic rejected response is then generated to represent the converse, less optimal action (e.g., attempting to answer despite the ambiguity).
* This yields a pairwise dataset that contrastively defines successful versus unsuccessful conversational strategies.

## Quasi-Online Contrastive Self-Training

* Instead of relying solely on static, offline pairs, ACT employs on-policy sampling to simulate the multi-turn trajectory of a response.
* The model evaluates whether a sampled response (such as a clarifying question) leads to a successful final outcome given the user's original intent.
* If the simulated trajectory succeeds, the sampled response replaces the winning response in the DPO update; if it fails, it is used to refine the losing response.
* This quasi-online feedback loop ensures the model is optimized on the actual outcomes of its conversational decisions rather than on single-turn labels alone.

## Evaluation and the AmbigSQL Benchmark

* The researchers introduced AmbigSQL, a new benchmark task focused on disambiguating information-seeking requests for complex SQL code generation.
* ACT was also tested on real-world tasks, including tabular-grounded question answering and machine reading comprehension.
* Experimental results show that ACT substantially outperforms standard supervised fine-tuning (SFT) and standard DPO in multi-turn conversation modeling.

By focusing on the downstream consequences of dialogue actions, ACT provides a practical framework for developers to build mixed-initiative agents that know when to stop and ask for clarification, ultimately leading to higher accuracy in complex data-seeking tasks.
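The two phases described above can be sketched in a few lines of Python. This is only an illustrative outline, not the paper's implementation: the helpers `generate`, `simulate_trajectory`, and `dpo_step`, the `policy.sample` interface, and the CLARIFY/ANSWER action labels are all hypothetical stand-ins for the model calls, user simulator, and contrastive update ACT would use in practice.

```python
def contrast_action(action: str) -> str:
    """Return the converse dialogue action, used to synthesize a rejected response.
    The two-action space (CLARIFY vs. ANSWER) is an illustrative simplification."""
    return "ANSWER" if action == "CLARIFY" else "CLARIFY"


def build_preference_pair(context, winning_response, winning_action, generate):
    """Phase 1 (contrastive data generation): pair a known-good turn from an
    existing dataset with a synthetic rejected response that takes the
    opposite action (e.g., answering directly despite ambiguity).
    `generate` is a hypothetical conditional-generation helper."""
    rejected = generate(context, action=contrast_action(winning_action))
    return {"prompt": context, "chosen": winning_response, "rejected": rejected}


def quasi_online_update(policy, pair, simulate_trajectory, dpo_step):
    """Phase 2 (quasi-online self-training): sample on-policy, roll out the
    multi-turn trajectory with a user simulator, and route the sample into
    the preference pair based on the simulated outcome before one DPO step."""
    sampled = policy.sample(pair["prompt"])
    success = simulate_trajectory(pair["prompt"], sampled)
    if success:
        pair = {**pair, "chosen": sampled}    # successful sample becomes the winner
    else:
        pair = {**pair, "rejected": sampled}  # failed sample refines the loser
    dpo_step(policy, pair)  # one contrastive update on the revised pair
    return pair
```

The key design point this sketch highlights is that the preference pair is not fixed offline: each on-policy sample is judged by its simulated multi-turn outcome and then folded back into either side of the contrast before the update.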