OpenAdapt

OpenAdapt Publication Roadmap: A Critical Assessment

Version: 2.0
Date: January 2026
Status: Honest Evaluation
Author: OpenAdapt Research Team


Preamble: Intellectual Honesty

This document is written from the perspective of a skeptical reviewer at a top venue. The goal is not to inflate claims but to identify what is genuinely publishable, what experiments are actually needed, and what timeline is realistic given current resources.

Guiding principle: Better to publish a solid workshop paper than to submit an overreaching main track paper that gets rejected.


Table of Contents

  1. Current State of Evidence
  2. Honest Contribution Assessment
  3. Weakness Analysis
  4. Required Experiments for Defensible Claims
  5. Statistical Rigor Requirements
  6. Related Work Gap Analysis
  7. Venue Fit Analysis
  8. Realistic Timeline
  9. Risk Mitigation
  10. Action Items
  11. Path to Main Track Publication (Parallel Track)

1. Current State of Evidence

1.1 What We Actually Have

| Experiment | n | Result | Statistical Validity | Benchmark |
|------------|---|--------|----------------------|-----------|
| macOS demo-conditioning (first-action) | 45 | 46.7% -> 100% | Moderate (single model, single platform) | Non-standard |
| WAA baseline (interrupted) | 8 | 12.5% success | Weak (incomplete, agent bugs) | Standard |
| Length-matched control | 45 | 57.8% | Useful (rules out token length) | Non-standard |
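
Since no confidence intervals have been reported yet, a quick way to see how much uncertainty sits behind these percentages is a Wilson score interval. The sketch below is illustrative only: the success counts (21/45, 45/45, 26/45, 1/8) are inferred from the reported percentages and are an assumption, as are the condition labels.

```python
from __future__ import annotations

from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumption: counts reconstructed from the percentages in the table above.
OBSERVED = {
    "instruction-only first-action (macOS)": (21, 45),
    "demo-conditioned first-action (macOS)": (45, 45),
    "length-matched control": (26, 45),
    "WAA baseline (interrupted)": (1, 8),
}

for name, (k, n) in OBSERVED.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

The interval for the 8-task WAA run is very wide, which is consistent with rating that result as weak.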

1.2 Critical Assessment of Current Results

The 100% first-action accuracy claim:

The WAA baseline:

1.3 What We Do NOT Have

  1. Standard benchmark results - No complete WAA, WebArena, or OSWorld evaluation
  2. Multi-model comparison - Only Claude Sonnet 4.5 tested
  3. Episode success rate - Only first-action accuracy measured
  4. Statistical significance tests - No p-values, confidence intervals, or effect sizes
  5. Ablation studies - No systematic ablation of demo components
  6. Retrieval experiments - Retrieval system not evaluated
  7. User studies - No human evaluation of system usability

2. Honest Contribution Assessment

2.1 What Is ACTUALLY Novel?

| Claimed Contribution | Novelty Assessment | Prior Work |
|----------------------|--------------------|------------|
| Demo-conditioned GUI agents | Moderate - PbD is old; VLM+demo is emerging | UINav (2023), SUGILITE (2017) |
| “Show don’t tell” paradigm | Low - Standard few-shot prompting | GPT-3 (2020), chain-of-thought |
| Multimodal demo retrieval | Moderate - Novel application to GUI domain | RAG literature extensive |
| Modular architecture | Low - Engineering contribution | Many open-source frameworks |
| Cross-platform support | Low - Engineering contribution | SeeAct, UFO also support multiple platforms |

2.2 Defensible Novel Claims

After honest assessment, the defensible novel contribution is:

Demonstration-conditioned prompting for VLM-based GUI agents: We show that providing a human demonstration in the VLM prompt substantially improves action selection accuracy compared to instruction-only prompting. This is a prompting strategy, not a new model architecture or training method.
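
To make the comparison concrete, here is a minimal sketch of the two prompt conditions. It is not the OpenAdapt implementation: `DemoStep`, `build_prompt`, the prompt wording, and the example task are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoStep:
    action: str  # hypothetical action string, e.g. 'CLICK("System Settings")'
    result: str  # short description of what changed on screen


def build_prompt(task: str,
                 demo_task: Optional[str] = None,
                 demo_steps: Optional[list[DemoStep]] = None) -> str:
    """Return an instruction-only prompt, or a demo-conditioned one if a demo is given."""
    parts: list[str] = []
    if demo_task and demo_steps:
        parts.append(f"DEMONSTRATION (task: {demo_task})")
        for i, step in enumerate(demo_steps, 1):
            parts.append(f"  Step {i}: {step.action} -> {step.result}")
        parts.append("")
    parts.append(f"TASK: {task}")
    parts.append("Given the current screenshot, output the next action.")
    return "\n".join(parts)


demo = [
    DemoStep('CLICK("Apple menu")', "menu opens"),
    DemoStep('CLICK("System Settings")', "Settings window appears"),
]
zero_shot_prompt = build_prompt("Turn off Night Shift")
demo_prompt = build_prompt("Turn off Night Shift", "Turn on Dark Mode", demo)
print(demo_prompt)
```

The only difference between the two conditions is the demonstration block prepended to the same task text, which is what keeps the comparison a pure prompting intervention.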

This is NOT:

2.3 Contribution Positioning

Honest positioning: This is an empirical study showing that a simple prompting intervention (including demonstrations) improves GUI agent performance. The contribution is:

  1. Empirical finding: Demonstrations help, and we quantify by how much
  2. Analysis: We explain WHY (spatial bias, procedural priors)
  3. Practical method: We provide an open-source implementation

What reviewers will say: “This is straightforward few-shot prompting applied to GUI agents. What is technically novel?”

Our response must be: “The contribution is empirical, not algorithmic. We systematically evaluate demo-conditioning across N tasks and M models, providing the first rigorous study of this prompting strategy for GUI automation.”


3. Weakness Analysis

3.1 Anticipated Reviewer Criticisms

| Criticism | Severity | Our Current Status | Mitigation |
|-----------|----------|--------------------|------------|
| “All tasks share the same first action” | Critical | True - intentional design | Expand to diverse first actions |
| “Only one model tested” | High | True | Add GPT-4V, Gemini |
| “Non-standard benchmark” | High | True | Complete WAA evaluation |
| “No episode success rate” | High | True | Run multi-step evaluation |
| “Small sample size” | Medium | n=45 is reasonable | Add more tasks |
| “No statistical tests” | Medium | True | Add McNemar’s test, bootstrap CI |
| “Limited to English/macOS” | Medium | True | Acknowledge as limitation |
| “Retrieval system not evaluated” | Medium | True | Either evaluate or remove claims |
| “No comparison to fine-tuning” | Medium | True | Acknowledge; position as prompt-only |
| “Engineering contribution, not research” | Low | Partially true | Emphasize empirical findings |

3.2 Weaknesses We CANNOT Fix Before Submission

  1. Fundamental novelty - Demo-conditioning is not architecturally novel
  2. Benchmark saturation - If WAA shows <20% improvement, contribution weakens
  3. Single-domain focus - GUI automation is narrow; no multi-domain transfer

3.3 Weaknesses We CAN Fix

  1. Benchmark coverage - Run complete WAA evaluation (1-2 weeks)
  2. Multi-model comparison - Add GPT-4V, Gemini (1 week)
  3. Statistical rigor - Add proper tests (1-2 days)
  4. Diverse first actions - Design new task set (1 week)
  5. Episode success - Extend evaluation (1 week)

4. Required Experiments for Defensible Claims

4.1 Minimum Viable Experiments (for Workshop Paper)

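| Experiment | Tasks | Models | Trials/Task | Total Runs | Effort |
|------------|-------|--------|-------------|------------|--------|
| WAA zero-shot baseline | 20 | 2 | 3 | 120 | 1 week |
| WAA demo-conditioned | 20 | 2 | 3 | 120 | 1 week |
| Total | 20 | 2 | 6 | 240 | 2 weeks |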

Why 3 trials per task?

4.2 Full Conference Paper Requirements

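| Experiment | Tasks | Models | Trials | Total Runs | Effort |
|------------|-------|--------|--------|------------|--------|
| WAA evaluation | 50+ | 3 | 3 | 450+ | 3 weeks |
| WebArena evaluation | 100+ | 2 | 3 | 600+ | 4 weeks |
| Ablation: demo format | 20 | 1 | 3 | 60 | 1 week |
| Ablation: demo length | 20 | 1 | 3 | 60 | 1 week |
| Ablation: # demos (k=1,3,5) | 20 | 1 | 3 | 180 | 2 weeks |
| Cross-task transfer | 20 | 1 | 3 | 60 | 1 week |
| Total | ~230 | 3-5 | 3+ | ~1500 | 10-12 weeks |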
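
The run counts above follow from tasks x models x trials x conditions. The sketch below reproduces that arithmetic so the budget can be recomputed when the plan changes. The `conditions` column is an assumption, chosen so each row matches its stated run count; `PLAN` and `runs` are illustrative names.

```python
# (experiment, tasks, models, trials, conditions); numbers copied from the table above.
# Assumption: "conditions" is set so that each row reproduces its stated run count.
PLAN = [
    ("WAA evaluation", 50, 3, 3, 1),
    ("WebArena evaluation", 100, 2, 3, 1),
    ("Ablation: demo format", 20, 1, 3, 1),
    ("Ablation: demo length", 20, 1, 3, 1),
    ("Ablation: # demos (k=1,3,5)", 20, 1, 3, 3),
    ("Cross-task transfer", 20, 1, 3, 1),
]


def runs(tasks: int, models: int, trials: int, conditions: int) -> int:
    """Number of agent runs for one experiment row."""
    return tasks * models * trials * conditions


total = 0
for name, *dims in PLAN:
    n = runs(*dims)
    total += n
    print(f"{name:<30} {n:>5}")
print(f"{'Lower bound on total':<30} {total:>5}")
```

The sum is a lower bound because the WAA and WebArena rows are stated as "50+" and "100+" tasks. Any row that compares several conditions scales linearly with the number of conditions, so the budget should be recomputed once the condition lists are fixed.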

4.3 Essential Ablations

  1. Demo format ablation
    • Full trace (screenshot descriptions + actions + results)
    • Behavior-only (actions + results)
    • Action-only (just the action sequence)
  2. Demo relevance ablation
    • Exact-match demo (same task)
    • Same-domain demo (e.g., any Settings task)
    • Cross-domain demo (e.g., Browser demo for Settings task)
    • Random demo
  3. Number of demos (k)
    • k=1, 3, 5
    • Do more demos help, or do they just add noise?
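
The format ablation can be sketched as three renderers over the same recorded steps. This is an illustrative sketch only: the `Step` fields, the `SEE/DO/THEN` labels, and the condition names are hypothetical, not the project's actual demo schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    observation: str  # screenshot description
    action: str       # what the human did
    result: str       # what changed afterwards


def render_demo(steps: list[Step], fmt: str) -> str:
    """Render a demo as full_trace, behavior_only, or action_only text."""
    lines = []
    for i, s in enumerate(steps, 1):
        if fmt == "full_trace":
            lines.append(f"{i}. SEE: {s.observation} | DO: {s.action} | THEN: {s.result}")
        elif fmt == "behavior_only":
            lines.append(f"{i}. DO: {s.action} | THEN: {s.result}")
        elif fmt == "action_only":
            lines.append(f"{i}. DO: {s.action}")
        else:
            raise ValueError(f"unknown demo format: {fmt}")
    return "\n".join(lines)


# One-factor-at-a-time ablation conditions, taken from the list above.
ABLATIONS = {
    "format": ["full_trace", "behavior_only", "action_only"],
    "relevance": ["exact_match", "same_domain", "cross_domain", "random"],
    "num_demos": [1, 3, 5],
}

steps = [
    Step("Desktop with Dock visible", 'CLICK("System Settings")', "Settings window opens"),
    Step("Settings sidebar", 'CLICK("Displays")', "Displays pane shown"),
]
for fmt in ABLATIONS["format"]:
    print(f"--- {fmt} ---")
    print(render_demo(steps, fmt))
```

Varying one factor at a time gives 3 + 4 + 3 = 10 conditions. A full factorial design would be 3 x 4 x 3 = 36, which is probably not affordable at the run budgets discussed above.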

4.4 Baselines We MUST Compare Against

| Baseline | Description | Why Essential |
|----------|-------------|---------------|
| Zero-shot instruction only | No demo, just task description | Primary comparison |
| Zero-shot + CoT | “Think step by step” | Fair comparison to prompting methods |
| Few-shot examples (text) | Text-only examples, no screenshots | Isolate visual contribution |
| SOTA on WAA | GPT-5.1 + OmniParser (~19.5%) | Establish relative performance |
| Random policy | Random clicks | Sanity check |

5. Statistical Rigor Requirements

5.1 Required Statistical Tests

| Test | Purpose | When to Use |
|------|---------|-------------|
| McNemar’s test | Paired comparison of binary outcomes | Zero-shot vs demo on same tasks |
| Bootstrap confidence intervals | Uncertainty estimation | All accuracy metrics |
| Effect size (Cohen’s h) | Practical significance | Accompany p-values |
| Bonferroni correction | Multiple comparisons | When testing multiple models/conditions |
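
As a starting point, three of these tests fit in a few lines of standard-library Python. The sketch below is a minimal version under stated assumptions: the function names are illustrative, and the example counts assume the macOS first-action result is paired on the same 45 tasks, so a 21/45 vs 45/45 outcome implies 24 tasks fixed by the demo and 0 broken.

```python
from __future__ import annotations

from math import asin, comb, sqrt


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    only_a: tasks solved under condition A but not B.
    only_b: tasks solved under condition B but not A.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values (each multiplied by the number of tests, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumed paired counts for the macOS first-action experiment (see lead-in).
p_value = mcnemar_exact(only_a=0, only_b=24)
effect = cohens_h(45 / 45, 21 / 45)
print(f"McNemar exact p = {p_value:.2e}, Cohen's h = {effect:.2f}")
```

The exact (binomial) form of McNemar's test is used here because discordant counts will be small at n=20 to 50 tasks, where the chi-square approximation is less reliable.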

5.2 Minimum Sample Sizes

For detecting a 20 percentage point improvement with 80% power (alpha=0.05):

For detecting a 10 percentage point improvement:
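
A rough way to fill in these numbers is the normal-approximation sample size for two proportions based on Cohen's h. The sketch below makes two loud assumptions: it treats the two conditions as independent samples (a paired McNemar design can need fewer tasks, depending on the discordant rates, which we do not yet know), and it uses a hypothetical 20% baseline success rate.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def tasks_needed(p_baseline: float, p_treatment: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate tasks per condition to detect p_baseline -> p_treatment (two-sided)."""
    h = abs(2 * asin(sqrt(p_treatment)) - 2 * asin(sqrt(p_baseline)))
    if h == 0:
        raise ValueError("proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / h) ** 2)


# Assumption: a 20% baseline, chosen only for illustration.
for delta in (0.20, 0.10):
    n = tasks_needed(0.20, 0.20 + delta)
    print(f"+{delta:.0%} over a 20% baseline: about {n} tasks per condition")
```

The required n depends strongly on the baseline rate, so these figures should be recomputed once a clean WAA baseline exists.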

5.3 Reporting Standards

Every result table must include:

  1. Mean accuracy
  2. Standard deviation (across trials)
  3. 95% confidence interval
  4. Sample size (n)
  5. Statistical test and p-value for key comparisons

Example:

| Condition | Accuracy | 95% CI | p-value (vs zero-shot) |
|-----------|----------|--------|------------------------|
| Zero-shot | 33.3% | [22.1, 46.0] | - |
| Demo-conditioned | 68.9% | [55.7, 80.1] | p<0.001 (McNemar) |
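
A percentile bootstrap is one simple way to produce the CI column. The sketch below resamples tasks with replacement and is illustrative only: the outcome vectors are hypothetical data constructed to match the example rates (15/45 and 31/45, assuming n=45), and the function names are not from any existing codebase.

```python
from __future__ import annotations

import random


def _percentile_bounds(values: list[float], confidence: float) -> tuple[float, float]:
    values = sorted(values)
    m = len(values)
    lower = values[int((1 - confidence) / 2 * m)]
    upper = values[int((1 + confidence) / 2 * m) - 1]
    return lower, upper


def bootstrap_accuracy_ci(outcomes: list[int], n_resamples: int = 10_000,
                          confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over per-task 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = [sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)]
    return _percentile_bounds(means, confidence)


def paired_bootstrap_diff_ci(baseline: list[int], treatment: list[int],
                             n_resamples: int = 10_000, confidence: float = 0.95,
                             seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for (treatment - baseline) accuracy, resampling tasks jointly."""
    if len(baseline) != len(treatment):
        raise ValueError("paired outcome lists must have equal length")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treatment[i] - baseline[i] for i in idx) / n)
    return _percentile_bounds(diffs, confidence)


# Hypothetical per-task outcomes matching the example table's rates at n=45.
zero_shot = [1] * 15 + [0] * 30
demo_conditioned = [1] * 31 + [0] * 14
print("demo accuracy CI:", bootstrap_accuracy_ci(demo_conditioned))
print("paired difference CI:", paired_bootstrap_diff_ci(zero_shot, demo_conditioned))
```

Resampling task indices jointly keeps the pairing between conditions intact, which matters because both conditions are run on the same tasks.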

6. Related Work Gap Analysis

6.1 Papers We MUST Cite

GUI Agents & Benchmarks:

  1. Bonatti et al. (2024) - Windows Agent Arena
  2. Zhou et al. (2023) - WebArena
  3. Xie et al. (2024) - OSWorld
  4. Cheng et al. (2024) - SeeClick
  5. Xu et al. (2024) - Crab benchmark
  6. Gur et al. (2024) - WebAgent

VLM-based Agents:

  1. Wang et al. (2024) - Mobile-Agent
  2. Zhang et al. (2024) - UFO
  3. He et al. (2024) - WebVoyager
  4. Anthropic (2024) - Claude Computer Use

Programming by Demonstration:

  1. Li et al. (2023) - UINav
  2. Li et al. (2017) - SUGILITE
  3. Cypher et al. (1993) - Watch What I Do (foundational PbD text)

Visual Grounding:

  1. Lu et al. (2024) - OmniParser
  2. Yang et al. (2023) - Set-of-Mark

Few-shot Prompting & RAG:

  1. Brown et al. (2020) - GPT-3 few-shot
  2. Wei et al. (2022) - Chain-of-thought
  3. Lewis et al. (2020) - RAG

6.2 Potential Reviewers

Based on related work, likely reviewers include researchers from:

Implication: Paper must respectfully position against UFO, SeeClick, and other Microsoft/Google work.

6.3 How We Differ From Prior Work

| Prior Work | Their Approach | Our Difference |
|------------|----------------|----------------|
| UINav | Referee model for demo quality | We don’t evaluate demo quality |
| SUGILITE | NL + GUI disambiguation | We use full VLM reasoning |
| UFO | Dual-agent architecture | We use single VLM with demo context |
| WebVoyager | Web-specific agent | We target desktop applications |
| Claude Computer Use | Production agent, no demos | We add demo conditioning |

Honest assessment: The difference from Claude Computer Use is simply “add a demo to the prompt.” This is the core contribution, and we must own it.


7. Venue Fit Analysis

7.1 Realistic Venue Assessment

| Venue | Fit | Honest Chance | Rationale |
|-------|-----|---------------|-----------|
| NeurIPS main track | Poor | <20% | Contribution too incremental for main track |
| NeurIPS Datasets & Benchmarks | Poor | N/A | We don’t propose a new benchmark |
| ICML main track | Poor | <20% | Same as NeurIPS |
| ICLR main track | Poor | <20% | Needs stronger learning contribution |
| CHI main track | Moderate | 30-40% | Good fit IF we add user study |
| UIST main track | Good | 40-50% | Systems + empirical evaluation |
| ACL/EMNLP | Poor | <20% | Not sufficiently NLP-focused |
| AAAI | Moderate | 30-40% | More accepting of applied work |
| LLM Agents Workshop (NeurIPS) | Excellent | 60-70% | Perfect scope and contribution level |
| CHI Late-Breaking Work | Excellent | 70%+ | Low barrier, good fit |
| UIST Demo Track | Excellent | 60-70% | Live demo is compelling |

Phase 1 (Immediate): Target LLM Agents Workshop @ NeurIPS 2026 or ICML 2026

Phase 2 (If workshop goes well): Expand to CHI 2027 or UIST 2026

Phase 3 (Long shot): Only pursue NeurIPS/ICML main track IF:

7.3 Venue-Specific Requirements

For CHI acceptance:

For Workshop acceptance:


8. Realistic Timeline

8.1 Minimum Viable Timeline (Workshop Paper)

| Week | Tasks | Dependencies |
|------|-------|--------------|
| 1-2 | Fix WAA environment, run clean baseline | VM stable |
| 3-4 | Run demo-conditioned WAA experiments | Baseline done |
| 5 | Statistical analysis, write results | Experiments done |
| 6 | Write introduction, related work | - |
| 7 | Internal review, revisions | Draft done |
| 8 | Submit to workshop | - |

Total: 8 weeks from today to submission-ready

8.2 Realistic Timeline (CHI Full Paper)

| Month | Tasks |
|-------|-------|
| 1-2 | Complete WAA + WebArena experiments |
| 3 | Design and run user study |
| 4 | Analyze user study, write draft |
| 5 | Internal review, revisions |
| 6 | Submit to CHI |

Total: 6 months (CHI 2027 deadline: ~September 2026)

8.3 Timeline Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| WAA environment issues | High | 2-3 week delay | Have backup mock evaluation |
| Results don’t match expectations | Medium | May kill paper | Pivot to analysis/negative results |
| API rate limits/costs | Medium | 1-2 week delay | Budget API costs upfront |
| Co-author availability | Medium | Variable | Start writing in parallel |

9. Risk Mitigation

9.1 If WAA Results Are Disappointing

Scenario: Demo-conditioning shows <10pp improvement on WAA

Options:

  1. Pivot to analysis paper: Why doesn’t demo-conditioning help on WAA?
  2. Focus on narrow success cases: Which task categories benefit most?
  3. Negative results paper: “When Demonstrations Don’t Help”
  4. Workshop-only publication: Present findings, get feedback

9.2 If Experiments Take Too Long

Scenario: Cannot complete experiments before deadline

Options:

  1. Reduce scope: Fewer tasks, fewer models, one benchmark
  2. Workshop paper first: Lower bar, establish priority
  3. arXiv preprint: Stake claim while continuing experiments
  4. Target later deadline: Better to submit complete work

9.3 If Reviewers Reject on Novelty

Mitigation in paper:


10. Action Items

10.1 Immediate (This Week)

10.2 Short-Term (Weeks 2-4)

10.3 Medium-Term (Weeks 5-8)

10.4 Long-Term (Months 3-6)


11. Path to Main Track Publication (Parallel Track)

This section provides a rigorous assessment of what would be required to publish in a main track venue (NeurIPS, ICML, ICLR) rather than a workshop. This is a parallel track that requires substantially more investment.

11.1 Honest Assessment: Why Current Work is Workshop-Level

Our current contribution is fundamentally prompt engineering, not machine learning research. While valuable for practitioners, this positions us poorly for ML venues that expect learned components, theoretical insights, or architectural innovations.

Table: Anticipated Reviewer Concerns for Main Track Submission

| Concern | Severity | Our Current Status | What Main Track Requires |
|---------|----------|--------------------|--------------------------|
| No learned component | Critical | True - retrieval uses heuristic similarity | Train retrieval end-to-end for downstream task |
| Single demo format | High | True - behavior-only format hardcoded | Learn optimal format/compression |
| Heuristic retrieval (BM25/embedding) | High | True - not optimized for action accuracy | Retrieval that optimizes task success, not similarity |
| Limited evaluation | High | 45 tasks, 1 model, 1 platform | 200+ tasks, 3+ models, 2+ benchmarks |
| No comparison to fine-tuning | High | True | Show when prompting beats/complements fine-tuning |
| No theoretical analysis | Medium | True - purely empirical | Information-theoretic or PAC-learning analysis |
| Engineering focus | Medium | True - system building, not research | Clear algorithmic or theoretical contribution |
| No ablation of demo components | Medium | Partial | Systematic ablation with significance tests |

Bottom line: A main track reviewer at NeurIPS/ICML will likely say: “This is a well-executed engineering project with an empirical evaluation, but where is the research contribution? Adding demos to prompts is not novel.”

11.2 Required Technical Contributions (Options to Elevate)

To elevate from workshop to main track, we need at least ONE of the following technical contributions:

Option A: Learned Retrieval

Effort: 2-3 months Risk: Medium Novelty: High

Core idea: Train the retrieval system to optimize action accuracy, not semantic similarity.

Why this works: Current retrieval uses off-the-shelf embeddings (CLIP, text similarity) that optimize for semantic match. But the best demo for a task may not be the most semantically similar - it may be one that provides the right procedural template or spatial priors.

Technical approach:

  1. Collect retrieval training data: (query, demo, action_accuracy) tuples
  2. Train retrieval scorer to predict action accuracy given (query, demo) pair
  3. Use contrastive learning: demos that help should score higher than demos that don’t
  4. Evaluate: Does learned retrieval outperform heuristic retrieval?
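
Steps 2 and 3 can be prototyped with a pairwise logistic ranking loss before committing to a neural scorer. The sketch below is a toy under stated assumptions, not a proposed final method: it uses a linear scorer, three hypothetical hand-made features, and fabricated toy pairs. In practice the "helpful" and "unhelpful" labels would come from the measured action accuracy collected in step 1.

```python
import math
import random

# Hypothetical features for a (query, demo) pair:
# [semantic_similarity, same_application, first_action_matches]


def score(weights, features):
    """Linear retrieval score for one (query, demo) feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, dim, epochs=200, lr=0.1, seed=0):
    """Minimise log(1 + exp(-(s_helpful - s_unhelpful))) with plain SGD."""
    rng = random.Random(seed)
    weights = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for helpful, unhelpful in pairs:
            margin = score(weights, helpful) - score(weights, unhelpful)
            margin = max(-30.0, min(30.0, margin))  # keep exp() well-behaved
            # Negative gradient of the loss is sigmoid(-margin) * (helpful - unhelpful).
            step = 1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] += lr * step * (helpful[i] - unhelpful[i])
    return weights


# Toy data (assumption): the helpful demo is procedurally closer, not more similar.
TOY_PAIRS = [
    ([0.4, 1.0, 1.0], [0.9, 0.0, 0.0]),
    ([0.5, 1.0, 1.0], [0.8, 1.0, 0.0]),
    ([0.7, 0.0, 1.0], [0.9, 1.0, 0.0]),
    ([0.3, 1.0, 1.0], [0.6, 0.0, 0.0]),
]

weights = train_pairwise(TOY_PAIRS, dim=3)
print("learned weights:", [round(w, 2) for w in weights])
```

On this toy data the scorer learns to weight the procedural feature above raw semantic similarity. That is the hypothesis step 4 would need to test on real retrieval logs.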

Key experiments:

Related work to cite:

Why reviewers would accept: “First demonstration that learned retrieval improves demo-conditioned GUI agents, with analysis of what retrieval features matter.”

Option B: Learned Prompt Synthesis

Effort: 3-4 months Risk: Medium-High Novelty: High

Core idea: Learn to synthesize optimal demo prompts rather than using fixed templates.

Technical approach:

  1. Define prompt template space (what to include, how to format, compression level)
  2. Use LLM-in-the-loop optimization (APE-style) to find optimal templates
  3. Alternatively, train a small model to select/compress demo content
  4. Evaluate: Does learned synthesis outperform hand-crafted templates?

Key experiments:

Related work to cite:

Why reviewers would accept: “Novel prompt synthesis method that learns to format demonstrations for maximal downstream utility.”

Option C: Behavioral Cloning with Demo-Augmentation

Effort: 4-6 months Risk: High Novelty: Very High

Core idea: Fine-tune a VLM using demonstration-augmented behavioral cloning.

Technical approach:

  1. Collect behavioral cloning dataset: (screenshot, task, action) tuples
  2. Augment each example with retrieved demonstration context
  3. Fine-tune VLM with demo in context vs without
  4. Compare: Does demo-augmented fine-tuning outperform standard fine-tuning?

Key experiments:

Related work to cite:

Why reviewers would accept: “First demonstration that demo-augmentation improves fine-tuned GUI agents, with analysis of when prompting vs fine-tuning is preferred.”

Caveat: This requires significant compute ($2-5k GPU, 4-6 weeks training) and expertise in VLM fine-tuning.

Option D: Theoretical Analysis

Effort: 2-3 months Risk: High Novelty: Medium

Core idea: Provide theoretical analysis of why demonstrations help GUI agents.

Technical approach:

  1. Information-theoretic analysis: How much information do demos provide?
  2. PAC-learning analysis: Sample complexity with/without demos
  3. Formal model of GUI task space and demo utility

Key contributions:

Related work to cite:

Why reviewers would accept: “Theoretical understanding of demonstration utility for GUI agents, with empirical validation.”

Caveat: Requires theoretical ML expertise; risk of disconnect between theory and practice.

11.3 Additional Experiments Required

Beyond the technical contribution, main track requires substantially more empirical evidence:

Benchmark Coverage:

| Benchmark | Tasks Required | Current Status | Effort |
|-----------|----------------|----------------|--------|
| Windows Agent Arena (WAA) | 50+ tasks | 8 tasks (incomplete) | 3-4 weeks |
| WebArena | 100+ tasks | 0 tasks | 4-6 weeks |
| OSWorld (optional) | 50+ tasks | 0 tasks | 4-6 weeks |

Evaluation Metrics:

Multi-Model Comparison:

| Model | Priority | Status |
|-------|----------|--------|
| Claude Sonnet 4.5 | Required | Tested |
| GPT-4V | Required | Not tested |
| Gemini 1.5 Pro | Required | Not tested |
| Qwen-VL | Nice to have | Not tested |
| Open-source (LLaVA) | Nice to have | Not tested |

Ablation Studies:

  1. Demo format: full trace vs behavior-only vs action-only
  2. Number of demos: k=1, 3, 5, 10
  3. Demo relevance: exact match vs same-domain vs random
  4. Demo recency: fresh demos vs stale demos
  5. Model scale: Does demo benefit scale with model size?

Statistical Requirements:

11.4 Timeline and Resources

Minimum timeline for main track submission:

| Phase | Duration | Activities |
|-------|----------|------------|
| Phase 1: Technical contribution | 2-4 months | Implement learned retrieval or prompt synthesis |
| Phase 2: Large-scale evaluation | 2-3 months | WAA (50+), WebArena (100+), multi-model |
| Phase 3: Analysis & writing | 1-2 months | Ablations, significance tests, paper writing |
| Total | 6-9 months | From start to submission-ready |

Resource requirements:

| Resource | Estimate | Notes |
|----------|----------|-------|
| Dedicated researchers | 1-2 FTE | Cannot be done part-time |
| GPU compute | $2-5k | For fine-tuning experiments (Option C) |
| API credits | $1-3k | Multi-model evaluation at scale |
| Azure VM (WAA) | $200-500 | Extended evaluation runs |
| Human annotation | $500-1k | Demo quality labels, retrieval training data |

Total estimated cost: $5-10k (excluding researcher time)

11.5 Honest Recommendation

For a small team with limited resources:

For a team with dedicated resources:

Do NOT attempt main track if:

The workshop path is not a consolation prize. Top workshops at NeurIPS/ICML have excellent visibility, lead to valuable feedback, and establish priority for your ideas. Many impactful papers started as workshop papers.

11.6 Additional References for Main Track

Retrieval-Augmented Learning:

Automatic Prompt Engineering:

GUI Agent Fine-Tuning:


Appendix A: Honest Framing for Paper

Abstract Template

We present an empirical study of demonstration-conditioned prompting for vision-language model (VLM) GUI agents. While prior work has explored VLMs for GUI automation, we systematically evaluate the effect of including human demonstrations in the prompt. Across N tasks on the Windows Agent Arena benchmark, we find that demo-conditioning improves task success rate from X% to Y% (p < 0.01), representing a Z percentage point improvement. We analyze which task categories benefit most and identify limitations where demonstrations do not help. Our findings suggest that simple prompting interventions can substantially improve GUI agent performance without fine-tuning, and we release our code and demo library to facilitate future research.

Title Options (Honest)

  1. “Does Showing Help? An Empirical Study of Demo-Conditioned GUI Agents”
  2. “From Instructions to Demonstrations: Improving VLM GUI Agents Through Example”
  3. “Show, Don’t Just Tell: The Value of Demonstrations for GUI Automation”

Contribution Statement Template

Our contributions are:

  1. Empirical study: We conduct the first systematic evaluation of demo-conditioning for VLM GUI agents across N tasks and M models
  2. Analysis: We identify which task categories and UI patterns benefit most from demonstrations
  3. Practical method: We provide an open-source implementation with demo retrieval capabilities
  4. Dataset: We release a library of K human demonstrations for GUI tasks

Appendix B: Cost Estimates

API Costs (Conservative)

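| Model | Input ($/1M) | Output ($/1M) | Est. calls | Est. cost |
|-------|--------------|---------------|------------|-----------|
| Claude Sonnet 4.5 | $3 | $15 | 1000 | ~$50-100 |
| GPT-4V | $10 | $30 | 1000 | ~$100-200 |
| Gemini Pro Vision | $0.25 | $0.50 | 1000 | ~$10-20 |
| Total | - | - | 3000 | ~$200-400 |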
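
The per-model estimates depend on tokens per call, which the table does not state. The sketch below shows the underlying arithmetic so the figures can be recomputed from measured usage. The per-token prices are copied from the table. The token counts per call (10,000 input, 500 output) are placeholder assumptions, so the printed numbers should not be expected to match the ranges above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices copied from the table above.
PRICES = {
    "Claude Sonnet 4.5": Pricing(3.00, 15.00),
    "GPT-4V": Pricing(10.00, 30.00),
    "Gemini Pro Vision": Pricing(0.25, 0.50),
}


def estimate_cost(pricing: Pricing, calls: int,
                  input_tokens_per_call: int, output_tokens_per_call: int) -> float:
    """Total USD cost for a number of calls with fixed token usage per call."""
    per_call = (input_tokens_per_call * pricing.input_per_million
                + output_tokens_per_call * pricing.output_per_million) / 1_000_000
    return calls * per_call


# Assumption: placeholder token counts; replace with measured averages.
INPUT_TOKENS, OUTPUT_TOKENS, CALLS = 10_000, 500, 1000

for model, pricing in PRICES.items():
    cost = estimate_cost(pricing, CALLS, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<20} ${cost:,.2f}")
```

Screenshots usually dominate input tokens, so logging real token usage from the first few runs gives a better estimate than any fixed guess.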

Compute Costs (Azure)

| Resource | Rate | Hours | Cost |
|----------|------|-------|------|
| D4ds_v5 (WAA VM) | $0.19/hr | 100 | ~$20 |
| Storage | $0.02/GB | 100GB | ~$2 |
| Total | - | - | ~$25 |

Appendix C: Reviewer Response Templates

“This is just few-shot prompting”

We agree that demo-conditioning can be viewed as a form of few-shot prompting. However, GUI automation presents unique challenges compared to standard NLP tasks: (1) visual grounding requires understanding spatial relationships in screenshots, (2) multi-step tasks require maintaining procedural context, and (3) UI variations across platforms and applications create distribution shift. Our contribution is demonstrating that demonstrations substantially help in this domain (X% -> Y%), characterizing when they help (task category analysis), and providing practical infrastructure (demo retrieval, open-source code) for practitioners.

“Sample size is too small”

We acknowledge this limitation. With n=N tasks and 3 trials each, we are powered to detect a 20pp effect at 80% power. Our observed effect of Zpp is well above this threshold, and our statistical tests (McNemar’s, bootstrap CI) confirm significance. We have expanded our task set to N tasks for the camera-ready version.

“Results may not generalize beyond tested benchmarks”

This is a valid concern. We have focused on WAA as it represents realistic enterprise desktop tasks. In future work, we plan to evaluate on WebArena and OSWorld to assess cross-benchmark generalization. However, we note that the WAA benchmark itself covers diverse applications (browser, office, file management, settings) and our positive results across these categories suggest some generalizability within desktop environments.


Last updated: January 2026

This is a living document. Update as experiments complete and understanding deepens.