A newly published paper on arXiv (2605.26731) challenges two common assumptions in LLM agent deployment: that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance. The research, conducted by Yong Eun Cho and colleagues, is a controlled 432-run experiment that crosses six models from four capability tiers with three harness conditions (light, balanced, and strict) on HEAT-24, a synthetic benchmark that verifies results against a git-based workspace.

The Experiment Design

HEAT-24 consists of 24 tasks designed to stress-test LLM agents in realistic development scenarios. The researchers evaluated models spanning frontier chat models (Gemini 2.5 Flash), frontier reasoning models with extended thinking enabled (Qwen3.5-122B), and smaller constrained-tier models like Gemma4:e2B. Each harness condition varied the verbosity and strictness of task instructions, expected output formats, and error handling protocols.
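The run count follows from the design if every model is run once under every harness condition on every task. That single-run-per-cell reading is an assumption based on the arithmetic, not something the article states. A minimal Python sketch of such a crossed grid, with placeholder model and task names (the article names only three of the six models):

```python
from itertools import product

# Hypothetical placeholder names: the article names only three of the six models.
MODELS = [f"model_{i}" for i in range(1, 7)]
# The three harness conditions named in the article.
HARNESSES = ["light", "balanced", "strict"]
# 24 HEAT-24 tasks; these ids are made up for illustration.
TASKS = [f"task_{i:02d}" for i in range(1, 25)]


def build_run_grid(models, harnesses, tasks):
    """Return every (model, harness, task) combination exactly once."""
    return list(product(models, harnesses, tasks))


runs = build_run_grid(MODELS, HARNESSES, TASKS)
print(len(runs))  # 6 * 3 * 24 = 432
```

Each model-harness pair then covers all 24 tasks, which is the denominator a per-cell success rate would use.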

The Harness Complexity Paradox

The results turn conventional wisdom on its head. For Gemini 2.5 Flash, the frontier chat model evaluated, increasing harness verbosity lowered VTSR (the study's task success rate) by a staggering 29 to 38 percentage points. This "harness-complexity paradox" suggests that capable chat models become brittle when subjected to overly rigid scaffolding, as if the extra structure interferes with their natural instruction-following strengths.

Reasoning Models Tell a Different Story

Frontier reasoning models like Qwen3.5-122B showed the opposite trend: the strict harness achieved both the highest VTSR, at 91.7%, and the lowest latency. That is the reverse of what the monotone inverse relationship hypothesis (more capable models need less structure) would predict. The tight constraints appear to channel the model's extended thinking toward task completion rather than exploratory meandering.
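The percentages quoted so far are easier to interpret as task counts. The sketch below assumes VTSR is a simple pass fraction over the 24 tasks in one model-harness cell; the article does not spell out the formula, so treat the function names and the 24-task denominator as illustrative.

```python
def vtsr(passed: int, total: int) -> float:
    """Pass rate as a percentage, assuming each task is scored pass/fail."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 100.0 * passed / total


def points_to_tasks(points: float, total: int = 24) -> float:
    """Convert a percentage-point change into an equivalent number of tasks."""
    return points / 100.0 * total


print(round(vtsr(22, 24), 1))      # 91.7
print(round(points_to_tasks(29)))  # 7
print(round(points_to_tasks(38)))  # 9
```

Under that assumption, 91.7% corresponds to 22 of 24 tasks, and a drop of 29 to 38 percentage points corresponds to roughly 7 to 9 additional failed tasks.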

Small Models Punching Above Their Weight

Perhaps most surprising: a constrained-tier 2B model (Gemma4:e2B) matched frontier-level stability at 91.7% VTSR across all three harness conditions. This suggests that, in certain deployment contexts, smaller models can achieve parity with much larger counterparts, and that for this model the result did not depend on which harness was used. The finding has significant practical implications for resource-constrained environments.

Failure Taxonomy Reveals Tier-Specific Weaknesses

The researchers introduced a six-label failure taxonomy that reveals distinct patterns: format_violation dominates failures in capable models (they deviate from the expected output format), while wrong_file errors dominate failures in low-capability models. This diagnostic framework enables tier-aware harness selection: developers can tune harness complexity based on where they expect their target deployment to fail.
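A tally like the one described can be sketched in a few lines of Python. The records below are made up purely for illustration and are not data from the paper; only the two labels named above are used, since the article does not list the other four, and the tier keys are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up failure records for illustration only; NOT data from the paper.
# Each record is (tier, failure_label). Tier keys are hypothetical.
failures = [
    ("capable", "format_violation"),
    ("capable", "format_violation"),
    ("capable", "wrong_file"),
    ("low_capability", "wrong_file"),
    ("low_capability", "wrong_file"),
    ("low_capability", "format_violation"),
]


def dominant_failure_by_tier(records):
    """Return the most frequent failure label for each tier."""
    counts = defaultdict(Counter)
    for tier, label in records:
        counts[tier][label] += 1
    return {tier: c.most_common(1)[0][0] for tier, c in counts.items()}


print(dominant_failure_by_tier(failures))
# {'capable': 'format_violation', 'low_capability': 'wrong_file'}
```

Running the same tally over real run logs would show which label to target first for a given tier.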

Key Takeaways

  • The assumed monotone inverse relationship between model capability and optimal harness complexity is empirically refuted
  • Frontier chat models suffer under verbose harnesses; reasoning models thrive with strict constraints
  • Small models (2B class) can match frontier-tier stability, and in this study did so under all three harness conditions
  • Failure patterns differ systematically by tier, enabling targeted optimization

The Bottom Line

This research should be a wake-up call for teams treating harness design as a one-size-fits-all problem. If you're deploying frontier chat agents, strip back the scaffolding and let them breathe. If you're working with reasoning models, tighten the constraints: in this study, that produced faster, more reliable results. On this evidence, "more structure = better" is no longer a safe default.
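The advice above can be written down as a small lookup. This is a minimal sketch, not a method from the paper: the tier keys, the function name, and the "balanced" fallback for tiers the article gives no recommendation for are all assumptions.

```python
# Mapping taken from the article's recommendations; tier keys are hypothetical.
HARNESS_BY_TIER = {
    "frontier_chat": "light",        # verbose harnesses lowered VTSR
    "frontier_reasoning": "strict",  # strict gave highest VTSR, lowest latency
}


def pick_harness(tier: str, default: str = "balanced") -> str:
    """Pick a harness condition for a tier; the default is an assumption."""
    return HARNESS_BY_TIER.get(tier, default)


print(pick_harness("frontier_chat"))       # light
print(pick_harness("frontier_reasoning"))  # strict
print(pick_harness("constrained"))         # balanced (fallback, not from the paper)
```

A lookup keeps the choice explicit and easy to revise as a team measures failure patterns on its own workloads.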