Version: 2.0
Date: January 2026
Status: Honest Evaluation
Author: OpenAdapt Research Team
This document is written from the perspective of a skeptical reviewer at a top venue. The goal is not to inflate claims but to identify what is genuinely publishable, what experiments are actually needed, and what timeline is realistic given current resources.
Guiding principle: Better to publish a solid workshop paper than to submit an overreaching main track paper that gets rejected.
| Experiment | n | Result | Statistical Validity | Benchmark |
|---|---|---|---|---|
| macOS demo-conditioning (first-action) | 45 | 46.7% -> 100% | Moderate (single model, single platform) | Non-standard |
| WAA baseline (interrupted) | 8 | 12.5% success | Weak (incomplete, agent bugs) | Standard |
| Length-matched control | 45 | 57.8% | Useful (rules out token length) | Non-standard |
The 100% first-action accuracy claim:
The WAA baseline:
| Claimed Contribution | Novelty Assessment | Prior Work |
|---|---|---|
| Demo-conditioned GUI agents | Moderate - PbD is old; VLM+demo is emerging | UINav (2023), SUGILITE (2017) |
| “Show don’t tell” paradigm | Low - Standard few-shot prompting | GPT-3 (2020), chain-of-thought |
| Multimodal demo retrieval | Moderate - Novel application to GUI domain | RAG literature extensive |
| Modular architecture | Low - Engineering contribution | Many open-source frameworks |
| Cross-platform support | Low - Engineering contribution | SeeAct, UFO also support multiple platforms |
After honest assessment, the defensible novel contribution is:
Demonstration-conditioned prompting for VLM-based GUI agents: We show that providing a human demonstration in the VLM prompt substantially improves action selection accuracy compared to instruction-only prompting. This is a prompting strategy, not a new model architecture or training method.
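To make the two conditions concrete, the difference between instruction-only and demo-conditioned prompting can be sketched as below. This is a minimal illustration, not the project's actual prompt template: `DemoStep`, `build_prompt`, the section labels, and the action strings are all hypothetical, and screenshots are replaced by text descriptions for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoStep:
    """One recorded step of a human demonstration (hypothetical schema)."""
    observation: str  # text stand-in for the screenshot at this step
    action: str       # the action the human took at this step


def build_prompt(instruction: str, demo: Optional[List[DemoStep]] = None) -> str:
    """Return an instruction-only prompt, or a demo-conditioned one if a demo is given."""
    parts = []
    if demo:
        parts.append("DEMONSTRATION (a human completing a similar task):")
        for i, step in enumerate(demo, start=1):
            parts.append(f"  Step {i}: saw {step.observation} -> did {step.action}")
        parts.append("")
    parts.append(f"TASK: {instruction}")
    parts.append("Respond with the next action only.")
    return "\n".join(parts)


# Hypothetical demonstration and task, for illustration only.
demo = [
    DemoStep("desktop with menu bar", "CLICK('Apple menu')"),
    DemoStep("Apple menu open", "CLICK('System Settings')"),
]
zero_shot_prompt = build_prompt("Turn off Night Shift")
demo_prompt = build_prompt("Turn off Night Shift", demo)
```

The only difference between the two prompts is the prepended demonstration block, which is what makes the comparison a clean test of the prompting intervention.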
This is NOT:
Honest positioning: This is an empirical study showing that a simple prompting intervention (including demonstrations) improves GUI agent performance. The contribution is:
What reviewers will say: “This is straightforward few-shot prompting applied to GUI agents. What is technically novel?”
Our response must be: “The contribution is empirical, not algorithmic. We systematically evaluate demo-conditioning across N tasks and M models, providing the first rigorous study of this prompting strategy for GUI automation.”
| Criticism | Severity | Our Current Status | Mitigation |
|---|---|---|---|
| “All tasks share the same first action” | Critical | True - intentional design | Expand to diverse first actions |
| “Only one model tested” | High | True | Add GPT-4V, Gemini |
| “Non-standard benchmark” | High | True | Complete WAA evaluation |
| “No episode success rate” | High | True | Run multi-step evaluation |
| “Small sample size” | Medium | n=45 is reasonable | Add more tasks |
| “No statistical tests” | Medium | True | Add McNemar’s test, bootstrap CI |
| “Limited to English/macOS” | Medium | True | Acknowledge as limitation |
| “Retrieval system not evaluated” | Medium | True | Either evaluate or remove claims |
| “No comparison to fine-tuning” | Medium | True | Acknowledge; position as prompt-only |
| “Engineering contribution, not research” | Low | Partially true | Emphasize empirical findings |
| Experiment | Tasks | Models | Trials/Task | Total Runs | Effort |
|---|---|---|---|---|---|
| WAA zero-shot baseline | 20 | 2 | 3 | 120 | 1 week |
| WAA demo-conditioned | 20 | 2 | 3 | 120 | 1 week |
| Total | 20 | 2 | 6 | 240 | 2 weeks |
Why 3 trials per task?
| Experiment | Tasks | Models | Trials | Total Runs | Effort |
|---|---|---|---|---|---|
| WAA evaluation | 50+ | 3 | 3 | 450+ | 3 weeks |
| WebArena evaluation | 100+ | 2 | 3 | 600+ | 4 weeks |
| Ablation: demo format | 20 | 1 | 3 | 60 | 1 week |
| Ablation: demo length | 20 | 1 | 3 | 60 | 1 week |
| Ablation: # demos (k=1,3,5) | 20 | 1 | 3 | 180 | 2 weeks |
| Cross-task transfer | 20 | 1 | 3 | 60 | 1 week |
| Total | ~230 | 3-5 | 3+ | ~1500 | 10-12 weeks |
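The run counts in the table follow from tasks × models × trials, multiplied by the number of variants where an ablation sweeps a parameter. A small sketch of that arithmetic, using the table's lower bounds ("50+" is treated as 50) and a hypothetical `Experiment` helper:

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    tasks: int
    models: int
    trials: int
    variants: int = 1  # e.g. 3 for the k=1,3,5 ablation

    def runs(self) -> int:
        return self.tasks * self.models * self.trials * self.variants


plan = [
    Experiment("WAA evaluation", tasks=50, models=3, trials=3),
    Experiment("WebArena evaluation", tasks=100, models=2, trials=3),
    Experiment("Ablation: demo format", tasks=20, models=1, trials=3),
    Experiment("Ablation: demo length", tasks=20, models=1, trials=3),
    Experiment("Ablation: # demos", tasks=20, models=1, trials=3, variants=3),
    Experiment("Cross-task transfer", tasks=20, models=1, trials=3),
]

for exp in plan:
    print(f"{exp.name}: {exp.runs()} runs")

total_runs = sum(exp.runs() for exp in plan)
print(f"Lower-bound total: {total_runs} runs")
```

Because several rows are open-ended ("50+", "100+"), the computed sum is a floor; the table's ~1500 leaves some headroom above it.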
| Baseline | Description | Why Essential |
|---|---|---|
| Zero-shot instruction only | No demo, just task description | Primary comparison |
| Zero-shot + CoT | “Think step by step” | Fair comparison to prompting methods |
| Few-shot examples (text) | Text-only examples, no screenshots | Isolate visual contribution |
| SOTA on WAA | GPT-4V + OmniParser (~19.5%) | Establish relative performance |
| Random policy | Random clicks | Sanity check |
| Test | Purpose | When to Use |
|---|---|---|
| McNemar’s test | Paired comparison of binary outcomes | Zero-shot vs demo on same tasks |
| Bootstrap confidence intervals | Uncertainty estimation | All accuracy metrics |
| Effect size (Cohen’s h) | Practical significance | Accompany p-values |
| Bonferroni correction | Multiple comparisons | When testing multiple models/conditions |
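The four procedures in the table are simple enough to implement with the standard library alone. The sketch below is illustrative: the function names are our own, the per-task outcomes are made up, and the exact (binomial) form of McNemar's test is used because the number of discordant pairs is expected to be small.

```python
import math
import random


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: tasks solved only under condition A; c: tasks solved only under condition B.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a success rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


def bonferroni(p_values, alpha=0.05):
    """Flag which p-values remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Made-up paired outcomes for 20 tasks (1 = success), for illustration only.
zero_shot = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
demo      = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

b = sum(1 for z, d in zip(zero_shot, demo) if z == 1 and d == 0)
c = sum(1 for z, d in zip(zero_shot, demo) if z == 0 and d == 1)
p_value = mcnemar_exact(b, c)
ci_low, ci_high = bootstrap_ci(demo)
effect = cohens_h(sum(zero_shot) / len(zero_shot), sum(demo) / len(demo))
```

McNemar's test uses only the discordant pairs (`b` and `c`), which is why both conditions must be run on the same tasks.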
For detecting a 20 percentage point improvement with 80% power (alpha=0.05):
For detecting a 10 percentage point improvement:
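A rough way to sanity-check sample sizes for these two cases is the arcsine (Cohen's h) approximation for comparing two proportions. The sketch below rests on assumptions that are ours, not the plan's: a 20% baseline success rate (near the WAA SOTA figure cited above), a two-sided test, and independent rather than paired samples. Since the planned design is paired, this should be read as an approximation, not as the plan's own power calculation.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


def n_per_condition_from_h(h: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Observations needed per condition for a two-sided, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / h) ** 2)


def n_per_condition(p_baseline: float, improvement: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Same calculation, starting from a baseline rate and an absolute improvement."""
    h = cohens_h(p_baseline, p_baseline + improvement)
    return n_per_condition_from_h(h, alpha, power)


ASSUMED_BASELINE = 0.20  # assumption: roughly the reported WAA SOTA level

n_20pp = n_per_condition(ASSUMED_BASELINE, 0.20)
n_10pp = n_per_condition(ASSUMED_BASELINE, 0.10)
print(f"20pp improvement: {n_20pp} observations per condition")
print(f"10pp improvement: {n_10pp} observations per condition")
```

Halving the detectable effect more than triples the required sample, which is the main reason the 10pp case is so much more expensive. Repeated trials on the same task are also correlated, so they should not be counted as fully independent observations.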
Every result table must include:
Example:
| Condition | Accuracy | 95% CI | p-value (vs zero-shot) |
|-----------|----------|--------|------------------------|
| Zero-shot | 33.3% | [22.1, 46.0] | - |
| Demo-conditioned | 68.9% | [55.7, 80.1] | p<0.001 (McNemar) |
GUI Agents & Benchmarks:
VLM-based Agents:
Programming by Demonstration:
Visual Grounding:
Few-shot Prompting & RAG:
Based on related work, likely reviewers include researchers from:
Implication: Paper must respectfully position against UFO, SeeClick, and other Microsoft/Google work.
| Prior Work | Their Approach | Our Difference |
|---|---|---|
| UINav | Referee model for demo quality | We don’t evaluate demo quality |
| SUGILITE | NL + GUI disambiguation | We use full VLM reasoning |
| UFO | Dual-agent architecture | We use single VLM with demo context |
| WebVoyager | Web-specific agent | We target desktop applications |
| Claude Computer Use | Production agent, no demos | We add demo conditioning |
Honest assessment: The difference from Claude Computer Use is simply “add a demo to the prompt.” This is the core contribution, and we must own it.
| Venue | Fit | Honest Chance | Rationale |
|---|---|---|---|
| NeurIPS main track | Poor | <20% | Contribution too incremental for main track |
| NeurIPS Datasets & Benchmarks | Poor | N/A | We don’t propose a new benchmark |
| ICML main track | Poor | <20% | Same as NeurIPS |
| ICLR main track | Poor | <20% | Needs stronger learning contribution |
| CHI main track | Moderate | 30-40% | Good fit IF we add user study |
| UIST main track | Good | 40-50% | Systems + empirical evaluation |
| ACL/EMNLP | Poor | <20% | Not sufficiently NLP-focused |
| AAAI | Moderate | 30-40% | More accepting of applied work |
| LLM Agents Workshop (NeurIPS) | Excellent | 60-70% | Perfect scope and contribution level |
| CHI Late-Breaking Work | Excellent | 70%+ | Low barrier, good fit |
| UIST Demo Track | Excellent | 60-70% | Live demo is compelling |
Phase 1 (Immediate): Target LLM Agents Workshop @ NeurIPS 2026 or ICML 2026
Phase 2 (If workshop goes well): Expand to CHI 2027 or UIST 2026
Phase 3 (Long shot): Only pursue NeurIPS/ICML main track IF:
For CHI acceptance:
For Workshop acceptance:
| Week | Tasks | Dependencies |
|---|---|---|
| 1-2 | Fix WAA environment, run clean baseline | VM stable |
| 3-4 | Run demo-conditioned WAA experiments | Baseline done |
| 5 | Statistical analysis, write results | Experiments done |
| 6 | Write introduction, related work | - |
| 7 | Internal review, revisions | Draft done |
| 8 | Submit to workshop | - |
Total: 8 weeks from today to submission-ready
| Month | Tasks |
|---|---|
| 1-2 | Complete WAA + WebArena experiments |
| 3 | Design and run user study |
| 4 | Analyze user study, write draft |
| 5 | Internal review, revisions |
| 6 | Submit to CHI |
Total: 6 months (CHI 2027 deadline: ~September 2026)
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| WAA environment issues | High | 2-3 week delay | Have backup mock evaluation |
| Results don’t match expectations | Medium | May kill paper | Pivot to analysis/negative results |
| API rate limits/costs | Medium | 1-2 week delay | Budget API costs upfront |
| Co-author availability | Medium | Variable | Start writing in parallel |
Scenario: Demo-conditioning shows <10pp improvement on WAA
Options:
Scenario: Cannot complete experiments before deadline
Options:
Mitigation in paper:
This section provides a rigorous assessment of what would be required to publish in a main track venue (NeurIPS, ICML, ICLR) rather than a workshop. This is a parallel track that requires substantially more investment.
Our current contribution is fundamentally prompt engineering, not machine learning research. While valuable for practitioners, this positions us poorly for ML venues that expect learned components, theoretical insights, or architectural innovations.
Table: Anticipated Reviewer Concerns for Main Track Submission
| Concern | Severity | Our Current Status | What Main Track Requires |
|---|---|---|---|
| No learned component | Critical | True - retrieval uses heuristic similarity | Train retrieval end-to-end for downstream task |
| Single demo format | High | True - behavior-only format hardcoded | Learn optimal format/compression |
| Heuristic retrieval (BM25/embedding) | High | True - not optimized for action accuracy | Retrieval that optimizes task success, not similarity |
| Limited evaluation | High | 45 tasks, 1 model, 1 platform | 200+ tasks, 3+ models, 2+ benchmarks |
| No comparison to fine-tuning | High | True | Show when prompting beats/complements fine-tuning |
| No theoretical analysis | Medium | True - purely empirical | Information-theoretic or PAC-learning analysis |
| Engineering focus | Medium | True - system building, not research | Clear algorithmic or theoretical contribution |
| No ablation of demo components | Medium | Partial | Systematic ablation with significance tests |
Bottom line: A main track reviewer at NeurIPS/ICML will likely say: “This is a well-executed engineering project with an empirical evaluation, but where is the research contribution? Adding demos to prompts is not novel.”
To elevate from workshop to main track, we need at least ONE of the following technical contributions:
| Effort: 2-3 months | Risk: Medium | Novelty: High |
Core idea: Train the retrieval system to optimize action accuracy, not semantic similarity.
Why this works: Current retrieval uses off-the-shelf embeddings (CLIP, text similarity) that optimize for semantic match. But the best demo for a task may not be the most semantically similar - it may be one that provides the right procedural template or spatial priors.
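One possible reading of this idea, sketched under heavy assumptions: log which retrieved demo was used and whether the agent succeeded, then fit a scorer that predicts success from demo features instead of ranking by similarity alone. Everything below is hypothetical (the two features, `train_scorer`, `rank_demos`, and the logged outcomes), and plain logistic regression stands in for whatever learned model would actually be used.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_scorer(examples, epochs: int = 200, lr: float = 0.5):
    """Fit logistic regression by SGD.

    examples: list of (features, success) pairs with success in {0, 1}.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            bias -= lr * grad
    return w, bias


def rank_demos(candidates, w, bias):
    """Sort (demo_id, features) candidates by predicted success, best first."""
    scored = [
        (sigmoid(sum(wi * xi for wi, xi in zip(w, feats)) + bias), demo_id)
        for demo_id, feats in candidates
    ]
    return [demo_id for _, demo_id in sorted(scored, reverse=True)]


# Hypothetical logged outcomes: features = [semantic_similarity, same_application]
logged = [
    ([0.90, 0.0], 0),
    ([0.80, 0.0], 0),
    ([0.30, 0.0], 0),
    ([0.50, 1.0], 1),
    ([0.40, 1.0], 1),
    ([0.85, 1.0], 1),
]
w, bias = train_scorer(logged)

candidates = [
    ("demo_similar_other_app", [0.95, 0.0]),
    ("demo_same_app", [0.45, 1.0]),
]
ranking = rank_demos(candidates, w, bias)
```

In this toy log, success tracks whether the demo comes from the same application, so the scorer learns to prefer a less similar demo from the right application over a more similar one from the wrong application. Pure similarity ranking cannot express that preference.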
Technical approach:
Key experiments:
Related work to cite:
Why reviewers would accept: “First demonstration that learned retrieval improves demo-conditioned GUI agents, with analysis of what retrieval features matter.”
| Effort: 3-4 months | Risk: Medium-High | Novelty: High |
Core idea: Learn to synthesize optimal demo prompts rather than using fixed templates.
Technical approach:
Key experiments:
Related work to cite:
Why reviewers would accept: “Novel prompt synthesis method that learns to format demonstrations for maximal downstream utility.”
| Effort: 4-6 months | Risk: High | Novelty: Very High |
Core idea: Fine-tune a VLM using demonstration-augmented behavioral cloning.
Technical approach:
Key experiments:
Related work to cite:
Why reviewers would accept: “First demonstration that demo-augmentation improves fine-tuned GUI agents, with analysis of when prompting vs fine-tuning is preferred.”
Caveat: This requires significant compute ($2-5k GPU, 4-6 weeks training) and expertise in VLM fine-tuning.
| Effort: 2-3 months | Risk: High | Novelty: Medium |
Core idea: Provide theoretical analysis of why demonstrations help GUI agents.
Technical approach:
Key contributions:
Related work to cite:
Why reviewers would accept: “Theoretical understanding of demonstration utility for GUI agents, with empirical validation.”
Caveat: Requires theoretical ML expertise; risk of disconnect between theory and practice.
Beyond the technical contribution, main track requires substantially more empirical evidence:
Benchmark Coverage:
| Benchmark | Tasks Required | Current Status | Effort |
|---|---|---|---|
| Windows Agent Arena (WAA) | 50+ tasks | 8 tasks (incomplete) | 3-4 weeks |
| WebArena | 100+ tasks | 0 tasks | 4-6 weeks |
| OSWorld (optional) | 50+ tasks | 0 tasks | 4-6 weeks |
Evaluation Metrics:
Multi-Model Comparison:
| Model | Priority | Status |
|---|---|---|
| Claude Sonnet 4.5 | Required | Tested |
| GPT-4V | Required | Not tested |
| Gemini 1.5 Pro | Required | Not tested |
| Qwen-VL | Nice to have | Not tested |
| Open-source (LLaVA) | Nice to have | Not tested |
Ablation Studies:
Statistical Requirements:
Minimum timeline for main track submission:
| Phase | Duration | Activities |
|---|---|---|
| Phase 1: Technical contribution | 2-4 months | Implement learned retrieval or prompt synthesis |
| Phase 2: Large-scale evaluation | 2-3 months | WAA (50+), WebArena (100+), multi-model |
| Phase 3: Analysis & writing | 1-2 months | Ablations, significance tests, paper writing |
| Total | 5-9 months | From start to submission-ready |
Resource requirements:
| Resource | Estimate | Notes |
|---|---|---|
| Dedicated researchers | 1-2 FTE | Cannot be done part-time |
| GPU compute | $2-5k | For fine-tuning experiments (Option C) |
| API credits | $1-3k | Multi-model evaluation at scale |
| Azure VM (WAA) | $200-500 | Extended evaluation runs |
| Human annotation | $500-1k | Demo quality labels, retrieval training data |
Total estimated cost: roughly $4-10k (excluding researcher time)
For a small team with limited resources:
For a team with dedicated resources:
Do NOT attempt main track if:
The workshop path is not a consolation prize. Top workshops at NeurIPS/ICML have excellent visibility, lead to valuable feedback, and establish priority for your ideas. Many impactful papers started as workshop papers.
Retrieval-Augmented Learning:
Automatic Prompt Engineering:
GUI Agent Fine-Tuning:
We present an empirical study of demonstration-conditioned prompting for vision-language model (VLM) GUI agents. While prior work has explored VLMs for GUI automation, we systematically evaluate the effect of including human demonstrations in the prompt. Across N tasks on the Windows Agent Arena benchmark, we find that demo-conditioning improves task success rate from X% to Y% (p < 0.01), representing a Z percentage point improvement. We analyze which task categories benefit most and identify limitations where demonstrations do not help. Our findings suggest that simple prompting interventions can substantially improve GUI agent performance without fine-tuning, and we release our code and demo library to facilitate future research.
Our contributions are:
- Empirical study: We conduct the first systematic evaluation of demo-conditioning for VLM GUI agents across N tasks and M models
- Analysis: We identify which task categories and UI patterns benefit most from demonstrations
- Practical method: We provide an open-source implementation with demo retrieval capabilities
- Dataset: We release a library of K human demonstrations for GUI tasks
| Model | Input ($/1M) | Output ($/1M) | Est. calls | Est. cost |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3 | $15 | 1000 | ~$50-100 |
| GPT-4V | $10 | $30 | 1000 | ~$100-200 |
| Gemini Pro Vision | $0.25 | $0.50 | 1000 | ~$10-20 |
| Total | - | - | 3000 | ~$160-320 |
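The per-model estimates depend heavily on how many tokens each call consumes, which the table does not state. A small sketch of the underlying arithmetic, using the table's per-million-token prices and assumed token counts per call (15,000 input and 500 output tokens are our placeholders, not measured values):

```python
def estimate_cost(input_price_per_m: float, output_price_per_m: float,
                  calls: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in dollars for a batch of calls."""
    per_call = (input_tokens / 1_000_000) * input_price_per_m \
             + (output_tokens / 1_000_000) * output_price_per_m
    return per_call * calls


# Assumed token usage per call (hypothetical; screenshots and demos dominate input).
ASSUMED_INPUT_TOKENS = 15_000
ASSUMED_OUTPUT_TOKENS = 500

prices = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-4V": (10.00, 30.00),
    "Gemini Pro Vision": (0.25, 0.50),
}

estimates = {
    model: estimate_cost(inp, out, calls=1000,
                         input_tokens=ASSUMED_INPUT_TOKENS,
                         output_tokens=ASSUMED_OUTPUT_TOKENS)
    for model, (inp, out) in prices.items()
}
for model, cost in estimates.items():
    print(f"{model}: ~${cost:.2f}")
```

Input tokens dominate under these assumptions, so demo-conditioned runs, which add demonstration content to every prompt, will cost more per call than zero-shot runs. It is worth measuring real token usage on a handful of calls before committing to a budget.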
| Resource | Rate | Hours | Cost |
|---|---|---|---|
| D4ds_v5 (WAA VM) | $0.19/hr | 100 | ~$20 |
| Storage | $0.02/GB | 100GB | ~$2 |
| Total | - | - | ~$25 |
We agree that demo-conditioning can be viewed as a form of few-shot prompting. However, GUI automation presents unique challenges compared to standard NLP tasks: (1) visual grounding requires understanding spatial relationships in screenshots, (2) multi-step tasks require maintaining procedural context, and (3) UI variations across platforms and applications create distribution shift. Our contribution is threefold: showing that demonstrations substantially help in this domain (X% -> Y%), characterizing when they help (task category analysis), and providing practical infrastructure (demo retrieval, open-source code) for practitioners.
We acknowledge this limitation. With n=N tasks and 3 trials each, we have 80% power to detect a 20pp effect. Our observed effect of Zpp is well above this threshold, and our statistical tests (McNemar’s, bootstrap CI) confirm significance. We have expanded our task set to N tasks for the camera-ready version.
This is a valid concern. We have focused on WAA as it represents realistic enterprise desktop tasks. In future work, we plan to evaluate on WebArena and OSWorld to assess cross-benchmark generalization. However, we note that the WAA benchmark itself covers diverse applications (browser, office, file management, settings) and our positive results across these categories suggest some generalizability within desktop environments.
Last updated: January 2026
This is a living document. Update as experiments complete and understanding deepens.