RTL Historical Fix Replay Executive Summary v0.1
Buyer-Safe Claim
Ark historical-fix replay evaluates whether ranked RTL review targets overlap
with regions that were later modified in public RTL repair commits. This is
repair-region overlap evidence, not bug detection, not formal signoff, and not
a claim that Ark would have found the original issue independently.
Result Snapshot
Dataset: HWE-bench public RTL repair smoke set, HDL-FixBench-shaped.
| Metric | Result |
|---|---|
| Public repair cases analyzed | 15 |
| Projects covered | Ibex, CVA6, OpenTitan |
| Cases with ranked Ark targets | 13 |
| Top-3 exact-signal overlap | 12 / 13 |
| Top-1 blind-structural overlap | 12 / 13 |
| Random baseline mean top-3 rate | 0.3219 |
| Median review-compression ratio | 349.3333x |
| Median case runtime | 0.2013s |
| Max case runtime | 3.7657s |
| Misses / no-target cases | 2 |
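As a quick arithmetic check on the table above, the sketch below derives the rates implied by the reported counts. The variable names are illustrative, and the lift figure assumes the random baseline is directly comparable to the 12 / 13 rate; the source does not state whether the baseline mean was taken over the same 13 ranked cases.

```python
# Counts copied from the Result Snapshot table.
cases_total = 15
cases_ranked = 13
top3_hits = 12
baseline_top3_rate = 0.3219

coverage = cases_ranked / cases_total    # share of cases with ranked targets
top3_rate = top3_hits / cases_ranked     # observed top-3 exact-signal rate
lift = top3_rate / baseline_top3_rate    # observed rate relative to baseline (assumed comparable)

print(f"coverage={coverage:.4f} top3_rate={top3_rate:.4f} lift={lift:.2f}x")
# prints: coverage=0.8667 top3_rate=0.9231 lift=2.87x
```

The 12 / 13 rate is conditional on a case having ranked targets. If the two no-target cases are counted as misses, the same counts give 12 / 15 = 0.80.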
Interpretation
In this initial public historical repair smoke set, Ark produced ranked review
targets in 13 of 15 cases, and in 12 of those 13 the top-3 targets overlapped
identifiers in the eventual repair region. The exact-signal top-3 result is
compared against a deterministic random-signal baseline (mean top-3 rate
0.3219) over the same candidate designs.
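The comparison described above can be sketched as follows. This is a minimal illustration, not Ark's implementation: the function names, the seeded-sampling approach to making the baseline deterministic, the trial count, and the signal names in the usage lines are all assumptions introduced here.

```python
import random

def top_k_overlap(ranked_targets, repaired_ids, k=3):
    """Return True if any of the first k ranked targets is a repaired identifier."""
    return any(t in repaired_ids for t in ranked_targets[:k])

def random_baseline_rate(candidates, repaired_ids, k=3, trials=1000, seed=0):
    """Fraction of seeded random top-k draws that hit a repaired identifier."""
    rng = random.Random(seed)   # fixed seed keeps the baseline reproducible (assumed)
    pool = sorted(candidates)   # sort so set iteration order cannot change the draws
    k = min(k, len(pool))
    hits = sum(
        top_k_overlap(rng.sample(pool, k), repaired_ids, k)
        for _ in range(trials)
    )
    return hits / trials

# Hypothetical signal names, not taken from the smoke set.
ranked = ["fifo_full", "wr_ptr", "rd_ptr", "clk_en"]
repaired = {"wr_ptr"}
hit = top_k_overlap(ranked, repaired)             # True: wr_ptr is ranked second
rate = random_baseline_rate(ranked, repaired)     # same value on every run
```

Sampling from the same candidate pool that the ranked list is drawn from is what makes the baseline a like-for-like comparison: a small design with few candidate signals yields a high random hit rate, so the ranked result has to beat that rate rather than zero.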
The blind-structural check removes repaired identifiers from target tokens and
asks whether the remaining ranked-target neighborhood still carries structural
context. This helps guard against the objection that the result is only name
echo from the diff.
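One way to picture the blind-structural check is as a masked set comparison. The sketch below is an assumption-heavy illustration: the source does not define "structural context", so it is modeled here as non-repaired tokens (for example instance or neighboring signal names) found around the repair region. The function name, parameters, and token names are hypothetical.

```python
def blind_structural_hit(target_tokens, repaired_ids, region_context):
    """Check whether a target still overlaps the repair region once
    the repaired identifiers themselves are masked out.

    target_tokens:  tokens in the ranked target's neighborhood
    repaired_ids:   identifiers changed by the repair (removed before comparing)
    region_context: tokens around the repair region (assumed definition)
    """
    masked = set(repaired_ids)
    remaining = set(target_tokens) - masked    # target with repaired names removed
    context = set(region_context) - masked     # region context without repaired names
    return bool(remaining & context)

# Hypothetical tokens, not taken from the smoke set.
target = {"fifo_full", "fifo_wr_en", "u_fifo"}
repaired = {"fifo_full"}
context = {"u_fifo", "fifo_full", "clk_i"}
still_hits = blind_structural_hit(target, repaired, context)   # True, via u_fifo
```

Because the repaired names are removed from both sides, a target whose only link to the repair region is the repaired identifier itself fails this check, which is the name-echo case the check is meant to exclude.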
Misses
Two of the 15 cases produced no ranked targets. These are reported as useful
diligence signals, not hidden failures. Possible causes include:
- clean-equivalent or low-signal RTL deltas can erase ranked target surfaces;
- local cone extraction can miss repair regions that require wider file/module
context.
Call-Safe Language
We have started retrospective validation on public RTL repair history. In a
15-case HWE-bench-shaped smoke set across Ibex, CVA6, and OpenTitan, Ark
produced ranked targets in 13 cases. Top-3 exact-signal overlap with repaired
identifiers was 12/13, compared with a random baseline mean top-3 rate of
0.322. Median review compression was about 349x. We treat this as repair-region
overlap evidence, not bug detection or signoff. In this local smoke run, median
case runtime was about 0.2 seconds.
Boundaries
- Not proof of bug detection.
- Not final verification closure.
- Not security signoff.
- Not a vulnerability claim about upstream projects.
- Not a replacement for simulation, LEC, SAT, BMC, formal tools, or human
review.
- The current result comes from a smoke set, not a full benchmark.