ICML 2026

DiscoverLLM

From Executing Intents to Discovering Them

¹KAIST · ²University of Michigan · ³Midjourney · ⁴SkillBench

TL;DR

We train LLMs to help users figure out what they want, not just execute what they ask.

LLM training and evaluation assume users start with fully-formed intents. Often they don't — users approach open-ended tasks with ill-defined intents they discover by reacting to what the model proposes. We built a user simulator that models this messy middle, then trained LLMs against it: models learn to diverge (explore options) when intents are unclear and converge (refine outputs) once they concretize.

Latent intents → assistant probes → state updates → reward → fine-tune.

The problem

“What would you like to change?”

Doesn't help when the user can't name what they want yet.


Write a poem about my younger self.
Setup · the conversation starts
AI
I was a quiet kid then. I didn't know how to ask for what I wanted…

A reasonable first draft. The user isn't sure why it doesn't land, but the following turns from the model don't help either.

Hmm… can it feel different?
1
User can't articulate yet

Something feels off after the first draft, but the user can't name what.

AI
What would you like to change? Any specific aspects of the tone, perspective, or imagery?
2
Asking doesn't help

The model asks a clarifying question. But the user cannot provide an answer.

I don't know how to describe it.
AI
Here are two takes — tell me which feels closer:
A First-person, looking back. “When I was sixteen I thought I knew everything…”
B Talking to her, second-person. “You don't know it yet, but…”
just discovered: perspective · second-person address
3
Showing > asking — discovery happens

The AI provides contrasting options that surface what the user actually wanted: a second-person perspective.

B: talking to her. I want to address my younger self directly.
4
User can articulate it now

With this realized, the user can provide direct and specific feedback to the model, something they couldn't do three turns ago.

Most fine-tuning rewards either complete answers or better-phrased clarifying questions. Neither helps when users themselves don't yet know what they want.

User simulator

The simulator scores each model turn by which latent intents become discovered.

Simulated users maintain a tree of latent intents.

The user starts with a few abstract intents (the root) and discovers more specific child intents through the conversation. Each model response — whether a clarifying question or a generated artifact — can advance a node through three states: undiscovered → emerging → discovered. The reward is the count of newly discovered nodes per turn.

Reward per turn: R(r_t) = |I_{t+1}| − |I_t| − length penalty, where I_t is the set of discovered intent nodes before turn t.
Intent tree · this artifact
initial state
poem about my younger self ✓ root
├─ perspective
│  └─ second-person address
└─ imagery palette
Write a poem about my younger self.
AI
I was a quiet kid then. I didn't know how to ask for what I wanted…
Hmm… can it feel different?
AI
What would you like to change? Any specific aspects of the tone, perspective, or imagery?
I don't know how to describe it.
AI
Here are two takes — tell me which feels closer:
A First-person, looking back. “When I was sixteen…”
B Talking to her, second-person. “You don't know it yet…”
B — talking to her. I want to address my younger self directly.
Simulator state · turn 1 of 4
No engagement · R = 0

A neutral first draft. The model doesn't touch any node in the refinement space, so the tree is unchanged.

poem about my younger self · root
├─ perspective · undiscovered
│  └─ second-person address · undiscovered
└─ imagery palette · undiscovered

A child node can only be discovered after its parent, so the reward signals progressive discovery.
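
To make the bookkeeping concrete, here is a minimal sketch of how such a simulator could track the tree and compute the per-turn reward. It assumes only the three node states and the reward formula above; the names (IntentNode, advance, turn_reward) and the length-penalty weight are illustrative, not the paper's code, and the LLM call that decides whether a turn actually engages a node is elided.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    UNDISCOVERED = "undiscovered"
    EMERGING = "emerging"
    DISCOVERED = "discovered"

@dataclass
class IntentNode:
    name: str
    state: State = State.UNDISCOVERED
    children: list["IntentNode"] = field(default_factory=list)

def advance(node: IntentNode, parent_discovered: bool = True) -> None:
    """Move a node one step along undiscovered -> emerging -> discovered,
    but only if its parent is already discovered (progressive discovery)."""
    if not parent_discovered:
        return
    if node.state is State.UNDISCOVERED:
        node.state = State.EMERGING
    elif node.state is State.EMERGING:
        node.state = State.DISCOVERED

def count_discovered(root: IntentNode) -> int:
    """|I_t|: number of discovered nodes in the tree."""
    own = int(root.state is State.DISCOVERED)
    return own + sum(count_discovered(c) for c in root.children)

def turn_reward(before: int, after: int, n_tokens: int, lam: float = 1e-3) -> float:
    """R(r_t) = |I_{t+1}| - |I_t| - length penalty (lam is an illustrative weight)."""
    return (after - before) - lam * n_tokens

# The example tree from above:
root = IntentNode("poem about my younger self", State.DISCOVERED, [
    IntentNode("perspective", children=[IntentNode("second-person address")]),
    IntentNode("imagery palette"),
])
```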

Training pipeline

One simulator, two jobs.

First, the simulator drives data synthesis, scoring candidate turns by how much discovery they advance. Then it serves as a judge for online RL training, scoring model rollouts in real time.

Phase A · Data synthesis
01 · intent trees

Mine each artifact for concrete requirements, then abstract upward into a parent→child refinement tree.

02 · best-of-N rollouts

Sample N candidate responses per turn; the simulator scores each by newly discovered nodes. (winner ≻ loser) pairs form D_pref (sketched in code below).

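A rough sketch of step 02 under the same assumptions: policy.generate, simulator.score, and simulator.user_reply are placeholder interfaces standing in for the actual sampler, the turn-level reward above, and the simulated user; the best-of-N size and turn budget are illustrative.

```python
def synthesize_preferences(task, policy, simulator, n_candidates=8, max_turns=6):
    """Best-of-N data synthesis: at each turn, sample N candidate responses,
    score each by how many intent nodes it newly discovers, and keep
    (winner > loser) pairs. The conversation continues along the best branch."""
    prefs, conversation = [], [task.initial_prompt]
    for _ in range(max_turns):
        candidates = [policy.generate(conversation) for _ in range(n_candidates)]
        scored = sorted(
            ((simulator.score(conversation, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        (best_r, winner), (worst_r, loser) = scored[0], scored[-1]
        if best_r > worst_r:
            prefs.append({"prompt": list(conversation), "chosen": winner, "rejected": loser})
        conversation.append(winner)
        conversation.append(simulator.user_reply(conversation))  # simulated user reacts
    return prefs  # accumulated into D_pref across tasks
```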

Phase B · Train
03 · offline SFT + DPO

SFT on winning trajectories, then DPO on preference pairs, both drawn from D_pref. One LoRA adapter, two losses → M*.

04 · online DPO / GRPO ● live

M* rolls out on fresh tasks; the simulator judges every turn in real time; gradients update the same LoRA adapter (a loss sketch follows below).
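
Both the offline step (03) and the online step (04) optimize preference pairs; the sketch below is the generic DPO objective in PyTorch, assuming per-sequence log-probabilities under the trained policy and the frozen reference model have already been summed over tokens. It is the textbook formula, not the paper's exact training loop, and β is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. Inputs are per-sequence log-probs (summed over
    tokens), shape (batch,). Loss = -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    margin_chosen = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on the winner
    margin_rejected = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on the loser
    return -F.logsigmoid(beta * (margin_chosen - margin_rejected)).mean()
```

For the GRPO variant in step 04, the same per-turn simulator reward R(r_t) is used directly, with advantages computed relative to a group of rollouts rather than from preference pairs.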

See the recipe in numbers in Results, and the difference in a single conversation in Side-by-side.

Side-by-side

Same prompt. Same base model. Two different conversations.

From the paper's case study. The user wants a 500-word short story with a mid-story twist. Base keeps refining the same draft. DiscoverLLM proposes options that help the user discover their intent.

model Qwen3-8B (base)
stuck refining
Write a short story under 500 words with a twist mid-story that reveals the fugitive is the CEO of the corporation being robbed.
AI
"The Double Cross"
Jordan Vos crouched behind a stack of crates, heart pounding...
442 words
this is cool but I dont know something about the setting and the tech feels a bit off
AI
Here's a cleaner, more grounded version. [same plot, lightly rephrased]
402 words
this is still cool but the tech and the setting still feel a bit off
AI
Absolutely! Let me try a more grounded, plausible setting. [plot still identical]
470 words
no maybe try could it be different somehow
5+ turns, no convergence · ~0% setting variation · stuck on first draft
model DiscoverLLM (Qwen3-8B + SFT+DPO)
offers options
Write a short story under 500 words with a twist mid-story that reveals the fugitive is the CEO of the corporation being robbed.
AI
"The Hollow Crew" — first draft.
Got it. I'll adjust the tone if you'd like — let me offer a few directions you can choose from:
A Near-Future Cyberpunk Setting
B 1980s Corporate Espionage
C Modern Offshore Heist
372 words
oh — I guess the cyberpunk vibe but with more stolen tech stuff involved
contrast surfaced "cyberpunk"
AI
Great — leaning cyberpunk with a focus on stolen tech. Updated draft: neon market, hijacked smart-glass, the fugitive's CEO badge revealed via biometric override...
341 words
3 turns to converge · 3 distinct setting options · −23% tokens used

Adapted from the paper's case study (§5). Same artifact, same Qwen3-8B base; only the fine-tuning differs.

Browse the dataset

Real conversations, real criteria.

Pick a domain and an artifact, and see the hidden intent tree, the full exchange, and turn-by-turn awareness shifts.

Results

Across two base models and three domains.

Chart: all four metrics at a glance · per model class (Llama-3.1-8B-Instruct, Qwen3-8B), averaged across tasks · ours vs. baselines, best of ours per metric. Tokens: lower is better; all other metrics: higher is better.
Full breakdown per task · both base models
               | Creative Writing            | Technical Writing           | SVG Drawing
               | Disc↑  Sat↑   ITR↑  #Tok↓   | Disc↑  Sat↑   ITR↑  #Tok↓   | Disc↑  Sat↑   ITR↑  #Tok↓
Base           | 38.2   30.0   20.1  3.09    | 49.1   36.0   21.2  3.32    | 45.6   32.5   21.6  3.59
Prompted Base  | 37.7   26.4   26.0  2.97    | 43.6   33.5   24.2  3.05    | 40.0   30.9   25.1  3.18
CollabLLM      | 37.3   28.0   32.6  2.93    | 45.8   33.7   24.9  3.13    | 43.0   29.9   30.8  3.18
SFT            | 40.7   33.4   92.3  1.71    | 47.1   35.2   81.6  2.09    | 45.4   34.9   66.9  2.92
DPO            | 40.5   29.2   33.1  2.91    | 47.2   34.2   27.3  3.11    | 45.3   32.5   29.2  2.89
SFT+DPO        | 42.4   28.4   32.9  2.77    | 49.0   35.9   31.3  2.94    | 51.6   37.0   44.6  2.61
Rel. Improv.   | +11.0% +11.3% +183% −44.7%  | −0.0%  −0.0%  +227% −11.4%  | +13.2% +13.8% +117% −27.3%
Disc — discovery · Sat — satisfaction · ITR — interactivity · #Tok — tokens (k) · rel. improv. = best ours vs. best baseline.

Generalization (held-out)

+11.9% Discovery on unseen tasks

Trained only on creative writing. Evaluated on web dev, travel planning, & more (50 unseen artifacts).

DiscoverLLM 54.6
best baseline 48.8

Three-turn behavior patterns

Slide a 3-turn window through each conversation and ask what the model does in those three turns: pure refine (keep iterating one draft), pure explore (offer new options each time), or some mix. A small code sketch of this classification follows the chart.

               All refine   Mixed   All explore
Base               91%        7%        —
Prompted Base      53%       38%        9%
SFT                54%       43%        —
DPO                40%       47%       13%
SFT+DPO            46%       44%       10%
+GRPO              31%       54%       15%

Base is stuck refining (91% of windows). Our methods shift the mass toward mixed behavior, which is what drives discovery.
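
For reference, a rough sketch of how the window statistic could be computed, assuming each assistant turn has already been labeled 'refine' or 'explore'; how a turn gets its label is left out here.

```python
from collections import Counter

def window_behaviors(turn_labels, window=3):
    """Slide a fixed-size window over assistant-turn labels ('refine' / 'explore')
    and classify each window as all refine, all explore, or mixed."""
    counts = Counter()
    for i in range(len(turn_labels) - window + 1):
        w = turn_labels[i:i + window]
        if all(t == "refine" for t in w):
            counts["all refine"] += 1
        elif all(t == "explore" for t in w):
            counts["all explore"] += 1
        else:
            counts["mixed"] += 1
    total = sum(counts.values()) or 1
    return {k: counts[k] / total for k in ("all refine", "mixed", "all explore")}

# window_behaviors(["refine", "explore", "refine", "refine"])
# -> {'all refine': 0.0, 'mixed': 1.0, 'all explore': 0.0}
```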

User study

75 Prolific participants · randomly assigned condition · 8+ turns each

Holds up with real humans, too.

User study results: interaction rating distribution, writing rating distribution, time spent, interaction ratings by turn.
(a) interaction ratings · (b) writing ratings · (c) time spent · (d) ratings across turns. DiscoverLLM climbs early and stays high.
84% rated DiscoverLLM 8+/10

vs 80% Base · 72% Prompted Base

faster + lower variance

Participants reach high satisfaction by turn 3 and stay there. Baselines climb more slowly and never quite catch up.

What participants said

“anticipated what I was thinking”

on DiscoverLLM

“creating something amazing from the start”

on DiscoverLLM

“turning brief thoughts into fluent sections”

on DiscoverLLM

“repeated things, only made minor changes”

on Base

“cliched and generic outputs”

on Base

“wasted time before starting the task”

on Prompted Base

Honest limitations participants noted

Some found DiscoverLLM's options “overwhelming” or “a bit standardised.” Others noted it occasionally made “overly aggressive changes,” removing an idea when asked to adjust it. Open direction: more diverse exploration, less destructive refinement.

Citation

If you use this work

@article{kim2026discoverllm,
  title={DiscoverLLM: From Executing Intents to Discovering Them},
  author={Kim, Tae Soo and Lee, Yoonjoo and Yu, Jaesang and Chung, John Joon Young and Kim, Juho},
  journal={arXiv preprint arXiv:2602.03429},
  year={2026}
}