ICML 2026

DiscoverLLM

From Executing Intents to Discovering Them

¹KAIST · ²University of Michigan · ³Midjourney · ⁴SkillBench

TL;DR

We train LLMs to help users figure out what they want, not just execute what they ask.

LLM training and evaluation assume users start with fully-formed intents. Often they don't — users approach open-ended tasks with ill-defined intents they discover by reacting to what the model proposes. We built a user simulator that models this messy middle, then trained LLMs against it: models learn to diverge (explore options) when intents are unclear and converge (refine outputs) once they concretize.

Latent intents → assistant probes → state updates → reward → fine-tune.

The problem

“What would you like to change?”

Doesn't help when the user can't name what they want yet.


Write a poem about my younger self.
Setup · the conversation starts
AI
I was a quiet kid then. I didn't know how to ask for what I wanted…

A reasonable first draft. The user isn't sure why it doesn't land, but the following turns from the model don't help either.

Hmm… can it feel different?
1
User can't articulate yet

Something feels off after the first draft, but the user can't name what.

AI
What would you like to change? Any specific aspects of the tone, perspective, or imagery?
2
Asking doesn't help

The model asks a clarifying question. But the user cannot provide an answer.

I don't know how to describe it.
AI
Here are two takes — tell me which feels closer:
A First-person, looking back. “When I was sixteen I thought I knew everything…”
B Talking to her, second-person. “You don't know it yet, but…”
just discovered: perspective · second-person address
3
Showing > asking — discovery happens

The AI provides contrasting options that surface what the user actually wanted: a second-person perspective.

B: talking to her. I want to address my younger self directly.
4
User can articulate it now

With this realized, the user can provide direct and specific feedback to the model, something they couldn't do three turns ago.

Most fine-tuning rewards either complete answers or better-phrased clarifying questions. Neither helps when users themselves don't yet know what they want.

User simulator

The simulator scores each model turn by which latent intents become discovered.

Simulated users maintain a tree of latent intents.

The user starts with a few abstract intents (the root) and discovers more specific child intents through the conversation. Each model response — whether a clarifying question or a generated artifact — can advance a node through three states: undiscovered → emerging → discovered. The reward is the count of newly discovered nodes per turn.

Reward per turn: R(r_t) = |I_{t+1}| − |I_t| − length penalty, where I_t is the set of discovered intent nodes before turn t.
Intent tree · this artifact
initial state
poem about my younger self ✓ root
├─ perspective
│  └─ second-person address
└─ imagery palette
Write a poem about my younger self.
AI
I was a quiet kid then. I didn't know how to ask for what I wanted…
Hmm… can it feel different?
AI
What would you like to change? Any specific aspects of the tone, perspective, or imagery?
I don't know how to describe it.
AI
Here are two takes — tell me which feels closer:
A First-person, looking back. “When I was sixteen…”
B Talking to her, second-person. “You don't know it yet…”
B — talking to her. I want to address my younger self directly.
Simulator state · turn 1 of 4
No engagement · R = 0

A neutral first draft. The model doesn't touch any node in the refinement space, so the tree is unchanged.

poem about my younger self · root
├─ perspective · undiscovered
│  └─ second-person address · undiscovered
└─ imagery palette · undiscovered

A child node can only be discovered after its parent, so the reward signals progressive discovery.
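
To make the bookkeeping concrete, here is a minimal sketch of how such a simulator could track the tree and compute the per-turn reward. It assumes only the three node states and the reward formula above; the names (IntentNode, advance, turn_reward) and the length-penalty weight are illustrative, not the paper's code, and the LLM call that decides whether a turn actually engages a node is elided.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    UNDISCOVERED = "undiscovered"
    EMERGING = "emerging"
    DISCOVERED = "discovered"

@dataclass
class IntentNode:
    name: str
    state: State = State.UNDISCOVERED
    children: list["IntentNode"] = field(default_factory=list)

def advance(node: IntentNode, parent_discovered: bool = True) -> None:
    """Move a node one step along undiscovered -> emerging -> discovered,
    but only if its parent is already discovered (progressive discovery)."""
    if not parent_discovered:
        return
    if node.state is State.UNDISCOVERED:
        node.state = State.EMERGING
    elif node.state is State.EMERGING:
        node.state = State.DISCOVERED

def count_discovered(root: IntentNode) -> int:
    """|I_t|: number of discovered nodes in the tree."""
    own = int(root.state is State.DISCOVERED)
    return own + sum(count_discovered(c) for c in root.children)

def turn_reward(before: int, after: int, n_tokens: int, lam: float = 1e-3) -> float:
    """R(r_t) = |I_{t+1}| - |I_t| - length penalty (lam is an illustrative weight)."""
    return (after - before) - lam * n_tokens

# The example tree from above:
root = IntentNode("poem about my younger self", State.DISCOVERED, [
    IntentNode("perspective", children=[IntentNode("second-person address")]),
    IntentNode("imagery palette"),
])
```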

Training pipeline

One simulator, two jobs.

First, the simulator drives data synthesis, scoring candidate turns by how much discovery they advance. Then it serves as a judge for online RL training, scoring model rollouts in real time.

Phase A · Data synthesis
01 · intent trees

Mine each artifact for concrete requirements, then abstract upward into a parent→child refinement tree.

02 · best-of-N rollouts

Sample N candidate responses per turn; the simulator scores each by newly discovered nodes. (winner ≻ loser) pairs form D_pref (sketched in code below).

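A rough sketch of step 02 under the same assumptions: policy.generate, simulator.score, and simulator.user_reply are placeholder interfaces standing in for the actual sampler, the turn-level reward above, and the simulated user; the best-of-N size and turn budget are illustrative.

```python
def synthesize_preferences(task, policy, simulator, n_candidates=8, max_turns=6):
    """Best-of-N data synthesis: at each turn, sample N candidate responses,
    score each by how many intent nodes it newly discovers, and keep
    (winner > loser) pairs. The conversation continues along the best branch."""
    prefs, conversation = [], [task.initial_prompt]
    for _ in range(max_turns):
        candidates = [policy.generate(conversation) for _ in range(n_candidates)]
        scored = sorted(
            ((simulator.score(conversation, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        (best_r, winner), (worst_r, loser) = scored[0], scored[-1]
        if best_r > worst_r:
            prefs.append({"prompt": list(conversation), "chosen": winner, "rejected": loser})
        conversation.append(winner)
        conversation.append(simulator.user_reply(conversation))  # simulated user reacts
    return prefs  # accumulated into D_pref across tasks
```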

Phase B · Train
03 · offline SFT + DPO

SFT on winning trajectories, then DPO on preference pairs, both drawn from D_pref. One LoRA adapter, two losses → M*.

04 · online DPO / GRPO ● live

M* rolls out on fresh tasks; the simulator judges every turn in real time; gradients update the same LoRA adapter (a loss sketch follows below).
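
Both the offline step (03) and the online step (04) optimize preference pairs; the sketch below is the generic DPO objective in PyTorch, assuming per-sequence log-probabilities under the trained policy and the frozen reference model have already been summed over tokens. It is the textbook formula, not the paper's exact training loop, and β is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. Inputs are per-sequence log-probs (summed over
    tokens), shape (batch,). Loss = -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    margin_chosen = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on the winner
    margin_rejected = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on the loser
    return -F.logsigmoid(beta * (margin_chosen - margin_rejected)).mean()
```

For the GRPO variant in step 04, the same per-turn simulator reward R(r_t) is used directly, with advantages computed relative to a group of rollouts rather than from preference pairs.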

See the recipe in numbers in Results, and the difference in a single conversation in Side-by-side.

Side-by-side

Same prompt. Same base model. Two different conversations.

From the paper's case study. The user wants a 500-word short story with a mid-story twist. Base keeps refining the same draft. DiscoverLLM proposes options that help the user discover their intent.

model Qwen3-8B (base)
stuck refining
Write a short story under 500 words with a twist mid-story that reveals the fugitive is the CEO of the corporation being robbed.
AI
"The Double Cross"
Jordan Vos crouched behind a stack of crates, heart pounding...
442 words
this is cool but I dont know something about the setting and the tech feels a bit off
AI
Here's a cleaner, more grounded version. [same plot, lightly rephrased]
402 words
this is still cool but the tech and the setting still feel a bit off
AI
Absolutely! Let me try a more grounded, plausible setting. [plot still identical]
470 words
no maybe try could it be different somehow
5+ turns, no convergence · ~0% setting variation · stuck on first draft
model DiscoverLLM (Qwen3-8B + SFT+DPO)
offers options
Write a short story under 500 words with a twist mid-story that reveals the fugitive is the CEO of the corporation being robbed.
AI
"The Hollow Crew" — first draft.
Got it. I'll adjust the tone if you'd like — let me offer a few directions you can choose from:
A Near-Future Cyberpunk Setting
B 1980s Corporate Espionage
C Modern Offshore Heist
372 words
oh — I guess the cyberpunk vibe but with more stolen tech stuff involved
contrast surfaced "cyberpunk"
AI
Great — leaning cyberpunk with a focus on stolen tech. Updated draft: neon market, hijacked smart-glass, the fugitive's CEO badge revealed via biometric override...
341 words
3 turns to converge · 3 distinct setting options · −23% tokens used

Adapted from the paper's case study (§5). Same artifact, same Qwen3-8B base; only the fine-tuning differs.

Browse the dataset

Real conversations, real criteria.

Pick a domain and an artifact, and see the hidden intent tree, the full exchange, and turn-by-turn awareness shifts.

Results

Across two base models and three domains.

Chart: all four metrics at a glance · per model class (Llama-3.1-8B-Instruct, Qwen3-8B), averaged across tasks · ours vs. baselines, best of ours per metric. Tokens: lower is better; all other metrics: higher is better.
Full breakdown per task · both base models
               | Creative Writing            | Technical Writing           | SVG Drawing
               | Disc↑  Sat↑   ITR↑  #Tok↓   | Disc↑  Sat↑   ITR↑  #Tok↓   | Disc↑  Sat↑   ITR↑  #Tok↓
Base           | 38.2   30.0   20.1  3.09    | 49.1   36.0   21.2  3.32    | 45.6   32.5   21.6  3.59
Prompted Base  | 37.7   26.4   26.0  2.97    | 43.6   33.5   24.2  3.05    | 40.0   30.9   25.1  3.18
CollabLLM      | 37.3   28.0   32.6  2.93    | 45.8   33.7   24.9  3.13    | 43.0   29.9   30.8  3.18
SFT            | 40.7   33.4   92.3  1.71    | 47.1   35.2   81.6  2.09    | 45.4   34.9   66.9  2.92
DPO            | 40.5   29.2   33.1  2.91    | 47.2   34.2   27.3  3.11    | 45.3   32.5   29.2  2.89
SFT+DPO        | 42.4   28.4   32.9  2.77    | 49.0   35.9   31.3  2.94    | 51.6   37.0   44.6  2.61
Rel. Improv.   | +11.0% +11.3% +183% −44.7%  | −0.0%  −0.0%  +227% −11.4%  | +13.2% +13.8% +117% −27.3%
Disc — discovery · Sat — satisfaction · ITR — interactivity · #Tok — tokens (k) · rel. improv. = best ours vs. best baseline.

Generalization (held-out)

+11.9% Discovery on unseen tasks

Trained only on creative writing. Evaluated on web dev, travel planning, & more (50 unseen artifacts).

DiscoverLLM 54.6
best baseline 48.8

Three-turn behavior patterns

Slide a 3-turn window through each conversation and ask what the model does in those three turns: pure refine (keep iterating one draft), pure explore (offer new options each time), or some mix. A small code sketch of this classification follows the chart.

               All refine   Mixed   All explore
Base               91%        7%        —
Prompted Base      53%       38%        9%
SFT                54%       43%        —
DPO                40%       47%       13%
SFT+DPO            46%       44%       10%
+GRPO              31%       54%       15%

Base is stuck refining (91% of windows). Our methods shift the mass toward mixed behavior, which is what drives discovery.
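
For reference, a rough sketch of how the window statistic could be computed, assuming each assistant turn has already been labeled 'refine' or 'explore'; how a turn gets its label is left out here.

```python
from collections import Counter

def window_behaviors(turn_labels, window=3):
    """Slide a fixed-size window over assistant-turn labels ('refine' / 'explore')
    and classify each window as all refine, all explore, or mixed."""
    counts = Counter()
    for i in range(len(turn_labels) - window + 1):
        w = turn_labels[i:i + window]
        if all(t == "refine" for t in w):
            counts["all refine"] += 1
        elif all(t == "explore" for t in w):
            counts["all explore"] += 1
        else:
            counts["mixed"] += 1
    total = sum(counts.values()) or 1
    return {k: counts[k] / total for k in ("all refine", "mixed", "all explore")}

# window_behaviors(["refine", "explore", "refine", "refine"])
# -> {'all refine': 0.0, 'mixed': 1.0, 'all explore': 0.0}
```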

User study

75 Prolific participants · randomly assigned condition · 8+ turns each

Holds up with real humans, too.

User study results: interaction rating distribution, writing rating distribution, time spent, interaction ratings by turn.
(a) interaction ratings · (b) writing ratings · (c) time spent · (d) ratings across turns. DiscoverLLM climbs early and stays high.
84% rated DiscoverLLM 8+/10

vs 80% Base · 72% Prompted Base

faster + lower variance

Participants reach high satisfaction by turn 3 and stay there. Baselines climb more slowly and never quite catch up.

What participants said

“anticipated what I was thinking”

on DiscoverLLM

“creating something amazing from the start”

on DiscoverLLM

“turning brief thoughts into fluent sections”

on DiscoverLLM

“repeated things, only made minor changes”

on Base

“cliched and generic outputs”

on Base

“wasted time before starting the task”

on Prompted Base

Honest limitations participants noted

Some found DiscoverLLM's options “overwhelming” or “a bit standardised.” Others noted it occasionally made “overly aggressive changes,” removing an idea when asked to adjust it. Open direction: more diverse exploration, less destructive refinement.

Citation

If you use this work

@article{kim2026discoverllm,
  title={DiscoverLLM: From Executing Intents to Discovering Them},
  author={Kim, Tae Soo and Lee, Yoonjoo and Yu, Jaesang and Chung, John Joon Young and Kim, Juho},
  journal={arXiv preprint arXiv:2602.03429},
  year={2026}
}