IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari

IntelliAsk Architecture

IntelliAsk learns to generate critical, evidence-based peer review questions through reinforcement learning with human preferences.

Abstract

Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries. We find that LLM-generated questions draw over 50% of their tokens from a paper's first page, while human reviewers engage with the full text. Human questions also demonstrate greater effort and grounding, whereas LLM questions primarily mimic stylistic patterns.

To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable per-objective Transformer heads over the hidden states of the final 50 tokens, which outperforms API-based SFT baselines (Gemini 2.5 Flash, GPT-4.1) in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding.

We find consistent improvements on reasoning and writing benchmarks, suggesting that reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, notably on reasoning tasks such as MuSR (68.3 vs. 64.7 accuracy) and complex writing evaluations such as WritingBench (8.31 vs. 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.

Human Preference Annotation Study

We collected 15.5k high-quality questions from ICLR 2024 reviews through a multi-stage filtering process (length filtering, semantic deduplication, removing non-technical content and vague questions). To benchmark the gap between human and LLM-generated questions, we conducted a human annotation study with 572 annotated question-paper pairs sampled from 300 randomly selected ICLR 2025 submissions on OpenReview.
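The filtering stages above can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the word-count threshold and overlap cutoff are assumed values, and word-level Jaccard overlap stands in for the actual semantic (embedding-based) deduplication.

```python
def filter_questions(questions, min_words=8, dedup_threshold=0.8):
    """Length-filter questions, then greedily drop near-duplicates.

    Jaccard overlap on word sets is a simplified stand-in for
    embedding-based semantic deduplication; thresholds are illustrative.
    """
    kept = []
    for q in questions:
        words = set(q.lower().split())
        if len(words) < min_words:  # length filter: drop short, vague questions
            continue
        is_dup = any(
            len(words & set(k.lower().split()))
            / len(words | set(k.lower().split())) > dedup_threshold
            for k in kept
        )
        if not is_dup:              # keep only the first of each near-duplicate cluster
            kept.append(q)
    return kept

kept = filter_questions([
    "How does the proposed attention variant scale with sequence length in practice?",
    "How does the proposed attention variant scale with the sequence length in practice?",
    "Why?",
])
```

Here the second question is dropped as a near-duplicate of the first, and "Why?" is dropped by the length filter.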

Four expert annotators read each paper in full, including text, figures, and equations, to ensure proper context. All questions were anonymized to eliminate source bias. Annotators scored each question on three binary dimensions:

  • Effort: Does the question demand real thought to answer?
  • Evidence: Is the question backed by specific content from the paper?
  • Grounding: Is the question anchored in the actual content of the paper?

Results show that, on the 0–3 total scale, human-written questions scored 0.78 points higher on average than the strongest model and 1.53 points higher than the lowest-scoring model.
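The three binary rubrics combine into the 0–3 totals used throughout. A minimal sketch, assuming each annotator's three labels are simply summed and then averaged over questions (the exact aggregation across the four annotators is our assumption):

```python
def question_score(labels):
    """Total rubric score in [0, 3]: sum of the three binary dimensions."""
    return sum(labels[d] for d in ("effort", "evidence", "grounding"))

def mean_score(annotations):
    """Average total score over a list of per-question label dicts."""
    return sum(question_score(a) for a in annotations) / len(annotations)

score = question_score({"effort": 1, "evidence": 0, "grounding": 1})
avg = mean_score([
    {"effort": 1, "evidence": 1, "grounding": 1},
    {"effort": 0, "evidence": 0, "grounding": 1},
])
```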

Score Distribution

IntelliReward: Reward Model Architecture

Evaluating all 15,500 questions with human annotators across three rubrics is costly and risks bias from fatigue. Leading closed-source LLMs tested on reward prediction showed weak accuracy (Gemini 2.5 Flash: 37%, GPT-4.1: 32%), making them unsuitable for large-scale benchmarking. We trained IntelliReward on our human preference annotations to serve as an efficient and scalable substitute for human judgment.

Our reward model pairs a frozen causal LLM with per-objective Transformer heads. We extract the hidden states of the final 50 output tokens and pass this sequence to the heads: each evaluation objective (Effort, Evidence, Grounding) has an independent Transformer head that pools these states and produces binary logits. Only the per-objective heads are trained while the LLM backbone remains frozen, so training takes about 30 minutes on a single NVIDIA L40S GPU.
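This design can be sketched in PyTorch. The dimensions are illustrative (`d_model=64` stands in for the backbone's actual hidden size, e.g. 4096), mean-pooling inside each head is an assumption, and the random tensor stands in for hidden states that would come from the frozen LLM backbone:

```python
import torch
import torch.nn as nn

class ObjectiveHead(nn.Module):
    """Small trainable Transformer over the last-K token states of a frozen LLM."""
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, 2)   # binary rubric logits

    def forward(self, h):                         # h: (batch, K, d_model)
        z = self.encoder(h).mean(dim=1)           # pool over the K positions
        return self.classifier(z)                 # (batch, 2)

class IntelliRewardSketch(nn.Module):
    """One independent head per objective; only the heads hold trainable params."""
    def __init__(self, d_model=64, last_k=50):
        super().__init__()
        self.last_k = last_k
        self.heads = nn.ModuleDict({
            obj: ObjectiveHead(d_model)
            for obj in ("effort", "evidence", "grounding")
        })

    def forward(self, hidden_states):             # frozen-LLM states: (batch, T, d)
        h = hidden_states[:, -self.last_k:, :]    # keep only the final 50 tokens
        return {obj: head(h) for obj, head in self.heads.items()}

model = IntelliRewardSketch()
scores = model(torch.randn(2, 120, 64))           # stand-in for backbone states
```

Because the backbone never appears in the optimizer, only the three small heads are updated, which is what makes the 30-minute single-GPU training budget plausible.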

IntelliReward achieves 72% mean accuracy, substantially outperforming API-based baselines: Gemini 2.5 Flash (37%), GPT-4.1 (32%), GPT-5 (53%).

IntelliReward Architecture

IntelliAsk: Training with Human-Aligned Rewards

Supervised fine-tuning (SFT) performs poorly for review question generation: the model copies surface style but does not produce questions with real effort, evidence, or grounding. SFT-trained models achieve scores of only 0.03–0.10/3.0.

We use reinforcement learning with IntelliReward to align generation with human preferences. We train IntelliAsk-7B using DAPO and IntelliAsk-32B using GRPO. For each paper, the model generates candidate questions scored by IntelliReward to guide optimization.
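Both DAPO and GRPO sample a group of candidates per prompt and compare rewards within the group. A minimal sketch of the shared group-relative advantage step, taking IntelliReward totals for one paper's candidate questions as input; it omits the clipping and dynamic-sampling parts of the full objectives:

```python
def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score each candidate's reward within
    the group of questions sampled for the same paper.

    `rewards` holds IntelliReward totals for one group of candidates.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_advantages([0.2, 0.5, 0.8])  # rewards on the 0-3 scale
```

Candidates scoring above the group mean get positive advantages and are reinforced; below-mean candidates are suppressed, so the policy drifts toward questions IntelliReward rates highly.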

IntelliAsk-32B achieves 0.55/3.0 on automatic evaluation, outperforming SFT baselines. In human evaluation, IntelliAsk-32B achieves 0.66/3.0, outperforming Gemini 2.5 Pro (0.60/3.0). It achieves the lowest first-page bias (21.37%) among all models.

Training Curves: SFT vs RL

Evaluation Results

Question Generation Performance

IntelliAsk-32B substantially outperforms all small models (≤32B) and is competitive with frontier models.

| Model | Reasoning | Effort | Evidence | Grounding | Total (0–3) | 1st-Page Bias ↓ |
|---|---|---|---|---|---|---|
| Human questions | — | 0.54 | 0.46 | 0.57 | 1.57 | 28.21% |
| **Large Models** | | | | | | |
| o3 | Medium | 0.28 | 0.14 | 0.30 | 0.72 | 16.81% |
| Gemini 2.5 Pro | Default | 0.22 | 0.11 | 0.18 | 0.51 | 25.75% |
| GPT-5 | Default | 0.09 | 0.20 | 0.16 | 0.45 | 18.63% |
| Claude 3.7 Sonnet | Default | 0.08 | 0.16 | 0.13 | 0.37 | 47.13% |
| GPT-4.1 | No | 0.07 | 0.12 | 0.12 | 0.31 | 31.73% |
| **Small Models (≤32B)** | | | | | | |
| IntelliAsk-32B (Ours) | Default | 0.23 | 0.12 | 0.20 | 0.55 | 21.37% |
| Qwen3-32B (base) | Default | 0.05 | 0.13 | 0.09 | 0.28 | 26.73% |
| IntelliAsk-7B (Ours) | No | 0.03 | 0.07 | 0.07 | 0.17 | 27.44% |
| OpenReviewer-8B | No | 0.00 | 0.00 | 0.10 | 0.10 | 51.14% |
| DeepReviewer-7B | No | 0.00 | 0.00 | 0.10 | 0.10 | 48.14% |
| Qwen2.5-7B SFT (Ours) | No | 0.00 | 0.01 | 0.02 | 0.03 | 42.11% |

Automatic evaluation using IntelliReward. IntelliAsk-32B achieves the highest score among small models (0.55/3.0).

Generalization to Writing and Reasoning

IntelliAsk shows consistent improvements on external reasoning and writing benchmarks.

| Benchmark | IntelliAsk-32B | Qwen3-32B | Metric |
|---|---|---|---|
| **Reasoning & Comprehension** | | | |
| DROP | 95.1 | 93.3 | F1 / Acc |
| MuSR | 68.3 | 64.7 | Accuracy |
| BoolQ | 90.0 | 90.0 | Accuracy |
| GPQA-Diamond | 69.1 | 68.4 | Accuracy |
| **Writing & Generation** | | | |
| WritingBench | 8.31 | 8.07 | Score (0–10) |
| Arena Hard | 94.1 | 93.8 | Score (0–100) |

External benchmarks. Learning to ask better questions improves general writing ability.

Detailed WritingBench by Category

| Category | IntelliAsk | Qwen3 |
|---|---|---|
| Academic & Engineering | 8.33 | 8.09 |
| Finance & Business | 8.22 | 8.04 |
| Politics & Law | 8.29 | 8.02 |
| Medical & Health | 8.34 | 8.09 |
| Technology | 8.22 | 7.96 |
| Arts & Culture | 8.31 | 8.10 |
| Education | 8.41 | 8.22 |
| Marketing & Sales | 8.28 | 8.02 |
| Science & Nature | 8.35 | 8.14 |
| Social Sciences | 8.32 | 8.09 |

| Category | IntelliAsk | Qwen3 |
|---|---|---|
| Contract | 8.16 | 7.94 |
| Test Report | 8.35 | 8.01 |
| User Research | 7.93 | 7.72 |
| Review | 8.24 | 8.00 |
| Report | 8.36 | 8.13 |
| Blog Post | 8.39 | 8.16 |
| Creative Writing | 8.31 | 8.09 |
| Email | 8.24 | 8.03 |
| Narrative | 8.28 | 8.05 |
| Technical Writing | 8.30 | 8.07 |

WritingBench scores (out of 10). IntelliAsk-32B consistently outperforms Qwen3-32B across all categories.

BibTeX

@misc{sharma2026intelliasklearningaskhighquality,
  title={IntelliAsk: Learning to Ask High-Quality Research Questions via RLVR},
  author={Karun Sharma and Vidushee Vats and Shengzhi Li and Yuxiang Wang and Zhongtian Sun and Prayag Tiwari},
  year={2026},
  eprint={2602.15849},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15849},
}