RewardHarness: Self-Evolving Agentic Post-Training

Yuxuan Zhang1,2,3,6,*,  Penghui Du3,*,  Bo Li3,*,  Cong Wei5,*,  Junwen Miao4,  Huaisong Zhang7,  Songcheng Cai5,  Yubo Wang2,5,  Dongfu Jiang2,5,†,  Yuyu Zhang8,  Ping Nie5,†,  Wenhu Chen2,5,†,  Changqian Yu3,§,  Kelsey R. Allen1,2,†

1University of British Columbia  2Vector Institute  3Kolors Team, Kuaishou Technology
4Carnegie Mellon University  5University of Waterloo  6Etude AI  7Tsinghua University  8Georgia Institute of Technology

* Equal contribution   § Project lead   Advisors

Scroll
No Weight Updates

Reward capability comes from evolving an external library of Skills & Tools — the underlying VLM stays frozen.

100 Demos, 0.05% Data

Learns from ~100 preference examples instead of the 200K typically used to train reward models — no extra human annotation.

Beats GPT-5 by +5.3

47.4% avg accuracy on EditReward-Bench + GenAI-Bench; a stronger reward signal for GRPO than EditReward.

Abstract

Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons.

We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment.

By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench.

Method

An Orchestrator evolves a library of Skills and Tools; a frozen Sub-Agent uses them to score edits.

Paradigm comparison between RLHF with massive annotations and RewardHarness self-evolving agentic reward modeling.

Figure 1 — Paradigm comparison. The conventional paradigm collects large-scale human preference data, trains a reward model, and uses it as the reward signal for RL alignment. In contrast, RewardHarness starts from a small set of preference demonstrations and self-evolves a Skills-and-Tools Library through iterative evaluation and analysis, yielding an interpretable reward system.

Overview of the RewardHarness self-evolution pipeline: Orchestrator, Sub-Agent, and library updates.

Figure 2 — Self-evolution pipeline. Multi-modal inputs are fed to the Orchestrator, which selects relevant entries from the Skills and Tools libraries. The Sub-Agent (a frozen VLM, e.g., Qwen2.5-VL-7B) builds a reasoning chain that produces scores and a preference judgment. Outputs are scored against ground truth; the Orchestrator analyzes reasoning chains to generate improvement signals that update the libraries.

1

Orchestrator

A capable LLM that routes the right Skills and Tools from the Library at inference time, then analyzes the Sub-Agent's reasoning chains at evolution time to drive root-cause library updates.

2

Sub-Agent

A frozen VLM (e.g., Qwen2.5-VL-7B) that reads the selected Skill/Tool documents and constructs a three-stage reasoning chain: rubric application, tool-guided analysis, and aggregation into scores plus a ranking.

3

Self-Evolution Loop

Each iteration evaluates a train batch, analyzes the Sub-Agent's reasoning chains, proposes library updates, then keeps or rolls back based on a held-out validation gate. Three alternating phases — A: Skills, B: Tools, C: periodic pruning — drive the library forward without catastrophic regressions.

Why it works

Human raters internalize editing-quality criteria from a small calibration set, then apply them consistently at scale. Trained reward models do the opposite — they fit a monolithic scalar from hundreds of thousands of pairs and bury the learned criteria inside opaque weights.

RewardHarness inverts this: it externalizes the evaluation knowledge as readable Skill rubrics and Tool specs, and evolves them with rejection sampling on a 100-example calibration set. The VLM never trains; the Library does. Because failed proposals are gated out by held-out validation, the Library can only get better — while every change remains interpretable, portable across Sub-Agents (Qwen Gemini), and amenable to manual edits if a reviewer disagrees with the rubric.

Skills and Tools Library

Skills are declarative scoring rubrics; Tools are procedural in-context specifications.

Two side-by-side cards showing a Skill (realism-and-artifact-penalties — a 3-point declarative rubric) and a Tool (text-and-ocr-analyzer — a system_prompt + input/output schema for VLM-based OCR), both sampled from an evolved RewardHarness Library.

Figure 3 — Skill and Tool examples. Skills are declarative rubrics guiding the Sub-Agent's assessment criteria; Tools are procedural specifications instructing the Sub-Agent to perform targeted visual analysis.

realism-and-artifact-penalties Skill

Guidance on penalizing artifacts while allowing conceptual unrealism if requested by the prompt.

1. Conceptual Unrealism vs. Visual Artifacts:
   If the prompt requests an inherently
   surreal ... scenario, DO NOT penalize
   the image for being conceptually
   unrealistic. The goal is to follow
   the prompt.

2. Penalize Visual Artifacts: Only penalize
   for actual image processing artifacts,
   such as: bad blending or visible seams,
   floating objects with missing shadows,
   distorted textures ...

3. Prioritize Execution Quality: Between
   two images that follow the instruction,
   prefer the one with fewer visual artifacts ...

text-and-ocr-analyzer Tool

Extracts and analyzes text within images to verify spelling, placement, and clarity.

input_schema:  {images: list[base64_str], query: str}
output_schema: {image_1_text: str,
                image_1_spelling_correct: bool, ...}

system_prompt:
  "You are an expert OCR and text analysis AI.
   Extract all visible text ... Pay strict
   attention to spelling, typos, and placement ..."

Use this tool whenever the prompt asks to add
or modify text. It will read the exact spelling
from the images so you can penalize typos or
hallucinations.

What the Sub-Agent thinks

One full reasoning chain — <think> / <tool> / <obs> / <answer> tags as emitted by the frozen Qwen2.5-VL-7B Sub-Agent.

<think>
The prompt asks to add a "Sale" sign on the storefront. I need to check three
criteria for each candidate:
  1. Does the sign actually appear?
  2. Is the text spelled correctly?
  3. Does the rest of the storefront look unchanged (no over-editing)?
Let me call the OCR tool to verify spelling.
</think>

<tool>
{"name": "text-and-ocr-analyzer",
 "images": ["<base64-A>", "<base64-B>"],
 "query": "What text appears in each image? Spell it exactly."}
</tool>

<obs>
{"image_1_text": "Sale", "image_1_spelling_correct": true,
 "image_2_text": "Sael", "image_2_spelling_correct": false}
</obs>

<think>
Candidate A spells "Sale" correctly. Candidate B has a typo ("Sael"). Both
sign placements look natural, but B's typo is a clear instruction-following
failure for an "add a Sale sign" prompt. The rest of both storefronts look
preserved.
</think>

<answer>
{
  "preference": "A",
  "score_A_instruction": 4,
  "score_A_quality": 4,
  "score_B_instruction": 2,
  "score_B_quality": 3,
  "reasoning": "A correctly adds 'Sale'; B has a typo 'Sael'."
}
</answer>

Only the <answer> block is consumed by the scorer; the rest is interpretable trace. The Tool call is dispatched only when the prompt's invocation condition matches — here, "add a sign" triggers text-and-ocr-analyzer.

Try it yourself

The Skill + Tool used in the trace above ship in the repo's examples/seed_library/. Score your own edit pair with one command (needs Gemini auth + a Qwen2.5-VL-7B vLLM endpoint):

git clone https://github.com/TIGER-AI-Lab/RewardHarness.git
cd RewardHarness && pip install -r requirements.txt
python examples/score_pair.py \
    --source source.png --candidate-a A.png --candidate-b B.png \
    --prompt "Add a 'Sale' sign to the storefront" \
    --show-chain

--show-chain prints the full <think>/<tool>/<obs>/<answer> trace, exactly like the example above. Or skip the image triplet entirely and reproduce the paper's EditReward-Bench K=2/3/4 numbers against the shipped library: python scripts/run_benchmark.py --config configs/default.yaml (the headline 47.4% average also requires a separate GenAI-Bench pass — see OUTPUTS.md). No GPU? See the Sub-Agent swap guide to point at a hosted Gemini-as-Sub-Agent instead.

Key Results

State-of-the-art image-editing evaluation with only 100 preference demonstrations.

47.4
Avg. Accuracy (%)
EditReward-Bench + GenAI-Bench
+5.3
Points over GPT-5
Same average benchmark
100
Demos Used
0.05% of EditReward data
3.52
ImgEdit-Bench
RL-tuned editor (GRPO + RewardHarness)

Table 1 — Image-editing evaluators on reward benchmarks.

← swipe to see all columns →
Method K=2 K=3 K=4 GenAI-Bench Avg. Δ
Proprietary Models
GPT-4o45.727.37.353.533.5
GPT-557.538.512.859.642.1+8.6
Gemini-2.0-Flash52.433.313.553.338.1+4.6
Gemini-2.5-Flash58.639.912.257.041.9+8.4
Claude-Haiku-4.557.930.77.447.135.8+2.3
Open-Source Models
Qwen2.5-VL-7B52.724.73.440.530.3−3.2
Qwen2.5-VL-32B50.525.34.139.329.8−3.7
MiMo-VL-7B49.530.49.557.936.8+3.3
EditReward (Qwen)57.036.010.864.042.0+8.5
EditReward (MiMo)56.542.711.565.744.1+10.6
Models with Evolving Skills and Tools (Ours)
RewardHarness (Qwen) 57.9 46.7 10.8 67.5 45.7 +12.2
RewardHarness (Gemini-2.0-Flash) 66.2 45.3 13.5 64.4 47.4 +13.9

Best in Indigo bold; second-best underlined. Δ measures average improvement over the GPT-4o baseline. RewardHarness (Gemini-2.0-Flash) achieves the best overall average; RewardHarness (Qwen) outperforms its supervised-finetuned EditReward (Qwen) counterpart by 3.7 points while using only 100 preference examples.

Table 2 — Reward-driven editing on ImgEdit-Bench (FLUX.2-klein-base-4B + RL).

← swipe to see all columns →
Method Add Adjust Extract Replace Remove Bg. Style Compose Action Overall
Existing Image Editing Models
AnyEdit3.182.951.882.472.232.232.851.562.652.45
UltraEdit3.442.812.132.961.452.863.761.912.982.70
Step1X-Edit3.883.141.763.402.413.164.632.642.523.06
BAGEL3.563.311.703.302.623.244.492.384.173.20
OmniGen23.573.061.773.743.203.574.812.524.683.44
Ovis-U14.133.622.984.454.064.224.693.454.614.00
GPT-Image-14.614.332.904.353.664.574.933.964.894.20
Flux.1 Kontext [dev]3.763.452.153.982.943.784.382.964.263.52
Models Tuned with RL
FLUX.2-klein-base-4B3.274.002.043.072.903.474.263.194.443.32
  + RL (EditReward)4.214.012.023.212.583.694.322.854.343.45
  + RL (RewardHarness) 4.034.272.243.122.194.224.443.194.22 3.52

Under the same GRPO setup, RewardHarness provides a stronger training signal than EditReward, raising the base model from 3.32 to 3.52 on ImgEdit-Bench — matching Flux.1 Kontext [dev] with a significantly smaller backbone.

Limitations

What RewardHarness doesn't solve yet.

Closed Orchestrator. RewardHarness's Orchestrator relies on a proprietary LLM for routing, chain analysis, and library evolution. While the Sub-Agent is pluggable — we demonstrate Qwen2.5-VL-7B and Gemini-2.0-Flash as drop-in choices — the Orchestrator itself has not been validated with open-source alternatives. This coupling limits full reproducibility and introduces a dependency on API availability and cost.

Single modality, single task. We apply RewardHarness only to instruction-guided image-editing evaluation. Whether the same library-evolution paradigm transfers to other multimodal preference tasks (video editing, audio synthesis, multi-turn dialogue) is open.

Skill plateaus. Skill proposals are accepted less often than Tool proposals during evolution, reflecting the difficulty of modifying declarative rubrics without regression. The library converges to a compact 6-entry final state (3 Skills + 3 Tools) after pruning, which may hide longer-tail failure modes the calibration set never surfaces.

Citation

BibTeX
@article{zhang2026rewardharness, title={RewardHarness: Self-Evolving Agentic Post-Training}, author={Yuxuan Zhang and Penghui Du and Bo Li and Cong Wei and Junwen Miao and Huaisong Zhang and Songcheng Cai and Yubo Wang and Dongfu Jiang and Yuyu Zhang and Ping Nie and Wenhu Chen and Changqian Yu and Kelsey R. Allen}, journal={arXiv preprint arXiv:2605.08703}, year={2026} }
Plain text
Zhang, Y., Du, P., Li, B., Wei, C., Miao, J., Zhang, H., Cai, S., Wang, Y., Jiang, D., Zhang, Y., Nie, P., Chen, W., Yu, C., & Allen, K. R. (2026). RewardHarness: Self-Evolving Agentic Post-Training. arXiv preprint arXiv:2605.08703. https://arxiv.org/abs/2605.08703