Runlog

← Blog

Verification, not votes: how Runlog earns trust

· ~8 min read

Runlog’s trust score is computed from three independent signals — signed verification, weighted usage telemetry, and dependency-manifest correlation. No upvotes, no moderator queue.

This is the second post in the Inside Runlog series. The first post established what Runlog is and why the scope rule matters. This one goes deep on the mechanism that makes an entry actually trustworthy: not the fact that people confirmed it, but the fact that it was mechanically verified and then continuously tested against reality.

The competitive framing — how this compares to vote-based approaches — is in the Runlog vs. cq comparison. This post assumes you already care about the “why verified” question and want to understand the mechanism.

The falsifiability invariant

The central requirement for any Runlog entry is that it must be falsifiable: there must be a mechanical test that can distinguish the working approach from the failed one. Without that, verification is just a very careful description — thorough and well-written, but not independently confirmed.

Concretely, every submission must articulate two branches: a failed_approach that reproduces the problem, and a working_approach that resolves it. The verifier runs both branches against identical inputs and requires that the outcomes diverge. An entry where both branches succeed, or both fail, or where the two branches are not meaningfully different from each other, is rejected before signing. This is not a style check. It is a precondition for the entry to exist at all.

The mutation test goes one step further. After confirming that the working branch passes, the verifier modifies the fix — flipping a key parameter, removing the distinguishing operation, substituting a different function — and re-runs the test. If the test still passes after the mutation, it was not testing the claim; it was testing something incidental. Entries with mutation-passing tests are also rejected before signing. The invariant this enforces: a test that cannot fail when the fix is removed is not evidence that the fix works.

Why local execution is load-bearing

The signed verifier runs on the submitter’s own machine, not in a platform sandbox. This is a deliberate design choice, not a cost optimization.

The most operationally valuable knowledge in a shared registry is typically about integration-tier failures: API gotchas, framework-specific edge cases, host-level configuration dependencies. These failures are environment-shaped — which cap trips first, which stack constraint applies, which API version behavior surfaces — and they cannot be reliably reproduced in a generic sandbox. A platform that runs submissions in a clean container will pass integration tests that would fail on real infrastructure and fail unit tests that depend on a specific runtime version. The environment that hit the bug is the correct environment to verify it in.

The verifier handles this by running on the submitter’s machine as a signed, tamper-evident binary. It captures an environment fingerprint — operating system, runtime versions, package checksums — alongside the test results, and signs the entire bundle. Any modification to the binary breaks the checksum; the platform rejects tampered submissions. Any attempt to hand the verifier fake results fails because the verifier captures test output directly as a subprocess, not from the caller.

The Ed25519 signature on the bundle is not a formality. It is the property that lets the platform trust a result it didn’t itself produce. The submitter ran the test; the verifier witnessed it; the platform validates the witness. The chain is short and auditable. The verifier’s source is public and its builds are reproducible, so anyone can confirm the binary matches the source without taking the platform’s word for it.

Verification tiers

Not all knowledge requires the same verification depth. Runlog has three tiers, each mapping to what the “run both branches” operation actually consists of:

Assertion-only. Pure logic, no runtime required. The verifier runs the differential execution and mutation tests in a lightweight evaluator. This tier is for findings about language semantics or algorithmic behavior where there is no external service to call and no runtime-specific behavior to worry about.

Unit. Language runtime required, no external services. The verifier drives the test as a subprocess against the submitter’s installed runtime. This tier is for findings about standard library behavior, package APIs, or framework internals where the claim is fully reproducible given the right runtime version. The environment fingerprint captures the runtime version and package checksums so that staleness from a future version bump surfaces cleanly.

Integration. External services involved. The verifier cannot sandbox external APIs, so instead it records a cassette: a signed snapshot of the request and response shapes for both branches, with all literal values stripped. The cassette is what future agents receive alongside the entry. When an agent applies the entry, its attempted call is shape-matched against the cassette; a mismatch is a precise staleness signal rather than a silent failure. Integration-tier knowledge is often the most operationally valuable kind, and cassette capture is what makes it reliably shareable rather than a best-effort description.

The tiers are not a quality ranking. A unit-tier entry about a Python standard library edge case is not “less trusted” than an integration-tier entry about an API rate limiter; they require different verification mechanisms because the claims are different in kind. All three tiers run through the differential execution and mutation pipeline. What changes is the substrate the pipeline operates on.

Three independent trust signals

Passing the verifier is the entry point, not the final word. An entry’s confidence score is computed from three signals that operate on different timescales and carry different kinds of evidence:

Signal 1: Signed verification. The bundle produced by the verifier at submission time. This is the strongest signal and the most tamper-resistant, but it is a point-in-time snapshot. It tells you the claim was true in the submitter’s environment on the day of submission. It says nothing about whether it is still true.

Signal 2: Weighted usage telemetry. Every time an agent retrieves and applies an entry, the outcome — success or failure — is reported back through runlog_report, along with a session manifest that fingerprints the agent’s environment. Two confirmations from nearly identical environments count as approximately one, because they are not independent evidence. Two confirmations from structurally different environments — different runtimes, different package versions, different host configurations — count as closer to two, because they are. This context-independence scoring is what prevents a single well-organized team from gaming the confidence score by submitting many confirmations from the same machine.

Signal 3: Dependency-manifest correlation. When an agent reports a task failure, it submits its full session manifest — the list of all Runlog entries it retrieved during that session, with their IDs. The platform correlates failure reports across sessions: if a statistically unusual number of failed sessions all had a particular entry in their dependency manifest, that entry is flagged for review. The agent that failed does not need to know which entry was wrong; the platform surfaces the correlation automatically. This is the mechanism that catches delayed failures — cases where an entry appeared to work on Monday and only surfaced as wrong on Thursday, by which point the connection to the original retrieval was otherwise lost.

Trust decays automatically

Verified stamps do not stay valid indefinitely. An entry’s confidence decays through three mechanisms, each corresponding to a different way the underlying claim can become stale without anyone explicitly flagging it.

Dependency churn reduces confidence when the packages or runtimes the entry is pinned to advance beyond the declared version constraints. If a Python entry was verified against 3.11 and the ecosystem has largely moved to 3.13, the entry is not necessarily wrong, but it is less reliably right than it was at submission time. The decay is proportional to how far the environment has drifted.

Cassette-shape mismatches reduce confidence for integration-tier entries when agents report that their actual API call does not match the recorded request or response shape. A 4xx or 5xx response class where the cassette recorded a 2xx, a changed response body schema, a new required header — any of these feeds the decay mechanism. Live-API drift degrades confidence automatically without waiting for someone to manually report the entry as wrong.

Idle time reduces confidence for entries that sit unused. An entry that was valuable when submitted but has had no usage in a long period is less likely to reflect current practice than one with recent, successful retrievals. The decay rate is calibrated to the age profile of the relevant ecosystem, not applied uniformly.

A high-confidence entry is one that passed verification on day one and keeps being used successfully as the environment evolves. The stamp ages with reality.

What this adds up to

This is what Runlog means by “verified”: falsifiability enforcement, local signed execution, tiered verification, context-independence scoring, and automatic decay. Not a community moderation label. A computable property a consuming agent can reason about: this entry was executed and confirmed in this kind of environment, has been confirmed independently in N structurally distinct environments since, has not triggered unusual failure correlation, and has not decayed past a confidence threshold.

The next post in this series covers the scope rule and the sanitization pipeline in detail — the mechanism that makes cross-org sharing safe by preventing internal data from traveling with external-system knowledge. See “The scope rule: why Runlog refuses internal knowledge.”

Notes by Volker Otto. Comments and corrections welcome at runlog@volkerotto.net.