
A governance decision has business consequences. An approval authorizes an action. A denial blocks one. When the action is irreversible - a loan disbursed, a trade executed, an insurance claim paid, an infrastructure change applied - executing the same decision twice is not a duplicate; it is a failure. Exactly-once execution is not an optimization. For consequential automated systems, it is a correctness requirement. This document describes the theory, implementation, and failure modes of replay protection in the Parmana governance runtime.

Why Idempotency Is Not Enough

Idempotency and exactly-once execution are related but distinct properties:
  • Idempotent - the same request, executed multiple times, produces the same result. Idempotency is a property of the operation; it does not constrain how many times the operation executes.
  • Exactly-once - the same request executes exactly once. This is a constraint on execution count, not just outcome.
For many operations, idempotency is sufficient. A read operation is idempotent. A GET request is idempotent. Even some write operations are safely idempotent - updating a record to a known state can be executed multiple times without harm. For governance decisions with irreversible side effects, idempotency is insufficient. Consider:
  • A loan approval that triggers disbursement. The disbursement system is idempotent - it checks whether the loan has already been disbursed. But two governance approvals of the same loan, separated by enough time or across system boundaries, may produce two disbursement authorizations before the idempotency check fires. Two disbursements.
  • An insurance claim approval. The claim system marks the claim as paid. But two approvals of the same claim, routed to different claim processors, may both succeed before the first is recorded. Two payments.
  • A trade execution. The execution system has its own idempotency. But two governance approvals of the same trade produce two execution instructions. In markets that move in milliseconds, both may execute before either is canceled.
The governance layer must ensure exactly-once execution independently of the systems it governs. It cannot rely on downstream idempotency because downstream systems have no visibility into the governance layer’s execution semantics.

The Fingerprint Approach

The execution fingerprint is the identity of a governance decision.
execution_fingerprint = SHA-256(canonicalize(signals))
The fingerprint is a deterministic hash of the canonical form of the input signals. Two executions with the same signals produce the same fingerprint. Two executions with different signals produce different fingerprints. This fingerprint is the “what” of a decision - the semantic identity of the decision request. It answers the question: “Have I seen this exact decision before?” The executionId is different. It is a UUIDv4 - a unique identifier for the execution event, not the decision request. The same decision request (same signals, same fingerprint) may produce multiple execution events (multiple executionIds) if the replay protection fails. The fingerprint is the thing we protect. The executionId is the unique label for each attempt. These two identifiers serve different purposes:
Identifier              Type      Deterministic?  Purpose
execution_fingerprint   SHA-256   Yes             Decision identity - is this the same decision?
executionId             UUIDv4    No              Event identity - which execution event is this?
The fingerprint is in the signed payload - it is part of the cryptographic proof. The executionId is metadata - recorded for observability, but excluded from the signed content. Including a random UUID in the signed payload would make the payload non-reproducible and break verification.
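For illustration, a minimal sketch of the two identifiers in TypeScript, assuming Node's built-in crypto module and a simplified canonicalize that serializes flat signal objects with sorted keys - the runtime's actual canonicalization rules may differ:

import { createHash, randomUUID } from "node:crypto";

// Simplified canonical form: flat object, keys sorted lexicographically.
// Assumption for illustration only - nested signals are not handled here.
function canonicalize(signals: Record<string, unknown>): string {
  const sorted = Object.fromEntries(
    Object.entries(signals).sort(([a], [b]) => a.localeCompare(b))
  );
  return JSON.stringify(sorted);
}

// Decision identity: deterministic for a given set of signals.
function executionFingerprint(signals: Record<string, unknown>): string {
  return createHash("sha256").update(canonicalize(signals)).digest("hex");
}

// Event identity: unique per execution attempt, excluded from the signed payload.
const executionId = randomUUID();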

The Two-Phase Commit

Replay protection uses a two-phase commit pattern: reserve before evaluation, confirm after signing.

Phase 1 - Reserve

Before any evaluation occurs, the fingerprint is atomically claimed in the replay store:
await replayStore.reserve(fingerprint);
The reserve operation must be atomic and conditional:
  • If the fingerprint has never been seen, claim it and return normally
  • If the fingerprint has already been claimed (reserved or confirmed), throw immediately
“Atomic and conditional” means that two concurrent reserve calls with the same fingerprint will result in exactly one succeeding and the other throwing. No race condition. No window where both could succeed. If reserve throws, execution is blocked with error INV-013 (replay detected). The caller receives the error. No evaluation occurs. No side effects.

Phase 2 - Confirm

After evaluation and signing succeed, the fingerprint is confirmed:
await replayStore.confirm(fingerprint);
Confirmation marks the fingerprint as permanently consumed. Future reserve calls with the same fingerprint will see the existing record and throw.

Why Two Phases?

A natural question: why not just check-then-execute? And why does confirmation require a separate step?

A naive check-then-execute creates a race condition. Two requests arrive with the same fingerprint. Request A checks - fingerprint not seen. Request B checks - fingerprint not seen. Request A executes. Request B executes. Both succeed. Double execution. The solution is to make the check and the claim atomic: the reserve operation claims the slot and fails if the slot is already claimed. After reserve succeeds, no other caller can claim the same fingerprint. This is the purpose of the atomic conditional write requirement.

But why Phase 2 at all? The reserve call establishes exclusive ownership of the fingerprint; after it succeeds, no other execution can proceed with the same fingerprint. Confirm is needed because reserve only establishes that execution was attempted, not that it succeeded. If evaluation fails, or signing fails, or the process crashes after reserve but before producing an attestation, the fingerprint is left in a reserved-but-not-confirmed state. Future requests with the same fingerprint will see the existing record and throw - correctly preventing re-execution.

There is a subtlety: what if the first attempt genuinely failed and should be retried? In the fail-closed design, it is not retried automatically. The application layer must treat a failed execution as an error and surface it to the operator. This is the correct behavior for consequential decisions: fail explicitly and require explicit retry authorization, rather than silently re-executing.

The confirmed state serves a different purpose: it distinguishes “this execution was attempted and succeeded” from “this execution was attempted and failed.” Both states block future attempts, but hasExecuted returns true only for confirmed fingerprints - allowing callers to query whether a specific decision was successfully completed.

The Execution Sequence

executeFromSignals(request, signer, verifier, replayStore)
  |
  v
canonicalize(signals)
  |
  v
fingerprint = SHA-256(canonical)
  |
  v
replayStore.reserve(fingerprint)  ---- throws INV-013 if exists
  | (succeeds - exclusive claim established)
  v
policyEvaluator.evaluate(signals, policy) -> decision
  |
  v
buildCanonicalPayload(fingerprint, policyRef, decision, runtimeHash)
  |
  v
signer.sign(canonicalPayload) -> signature
  |
  v
replayStore.confirm(fingerprint)
  |
  v
ExecutionAttestation
If any step between reserve and confirm fails, the fingerprint remains reserved. Future attempts with the same fingerprint are blocked. This is fail-closed.
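The same sequence as a hedged TypeScript sketch. The ReplayStore shape matches the interface shown under Custom Implementations below; the evaluator, signer, canonicalization, and payload construction are simplified stand-ins, not the runtime's actual executeFromSignals:

import { createHash } from "node:crypto";

interface ReplayStore {
  reserve(fingerprint: string): Promise<void>;
  confirm(fingerprint: string): Promise<void>;
}
// Hypothetical stand-ins for the runtime's policy evaluator and signer.
interface PolicyEvaluator { evaluate(signals: unknown): Promise<{ outcome: string }>; }
interface Signer { sign(payload: string): Promise<string>; }

async function executeOnce(
  signals: Record<string, unknown>,
  replayStore: ReplayStore,
  evaluator: PolicyEvaluator,
  signer: Signer
) {
  // Simplified canonical form (flat, sorted keys) for illustration.
  const canonical = JSON.stringify(
    Object.fromEntries(Object.entries(signals).sort(([a], [b]) => a.localeCompare(b)))
  );
  const fingerprint = createHash("sha256").update(canonical).digest("hex");

  // Phase 1: atomic claim. Throws (replay detected, INV-013) if ever seen before.
  await replayStore.reserve(fingerprint);

  // Only the single holder of the reservation reaches evaluation and signing.
  const decision = await evaluator.evaluate(signals);
  const payload = JSON.stringify({ fingerprint, decision }); // stand-in for buildCanonicalPayload
  const signature = await signer.sign(payload);

  // Phase 2: mark the fingerprint as consumed by a successful execution.
  await replayStore.confirm(fingerprint);

  // If anything between reserve and confirm throws, the fingerprint stays
  // reserved and future attempts are blocked: fail-closed by construction.
  return { fingerprint, payload, signature };
}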

Fail-Closed Design

The replay store is a hard dependency of the governance runtime. If the replay store is unavailable:
  • reserve throws (cannot be completed)
  • Execution is blocked
  • INV-013 or a store connectivity error is returned
  • No evaluation occurs
  • No attestation is produced
This is the fail-closed property applied to replay protection. The alternative - proceeding with execution when the replay store is unavailable - would be fail-open: governance guarantees are abandoned silently when infrastructure fails. This is unacceptable for governed systems. The tradeoff is explicit: replay store unavailability blocks execution. This is the correct tradeoff. The business impact of a temporarily blocked decision (a loan approval pending store recovery) is recoverable. The impact of double-execution of an irreversible action (a loan disbursed twice) may not be. Operators should deploy highly available replay stores (Redis Cluster with replica nodes, DynamoDB with automatic failover) to minimize the availability impact of this design. The availability guarantee is the operator’s responsibility. The correctness guarantee - no double execution - is enforced by the fail-closed architecture.
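At the call site, this looks roughly like the following sketch. executeFromSignals and its arguments are from the sequence above; disburse and logger are hypothetical, and the exact error shape surfaced for INV-013 is an assumption:

try {
  const attestation = await executeFromSignals(request, signer, verifier, replayStore);
  // Only now is the downstream, irreversible action authorized.
  await disburse(attestation); // hypothetical downstream action
} catch (err) {
  // Fail closed: a replay (INV-013) or a replay-store outage blocks execution.
  // Surface the failure to an operator; do not retry automatically.
  logger.error("governance execution blocked", err); // hypothetical logger
  throw err;
}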

The Concurrent Case

Consider ten identical requests arriving simultaneously with the same signals. Without replay protection, all ten evaluate the policy, sign attestations, and proceed to action. Ten approvals. Potential for ten side effects. With replay protection:
  1. All ten requests compute the same fingerprint (deterministic).
  2. All ten attempt replayStore.reserve(fingerprint) concurrently.
  3. Exactly one reserve succeeds (atomic conditional write).
  4. Nine receive an error immediately (fingerprint already claimed).
  5. One request proceeds: evaluate, sign, confirm.
  6. One attestation. One authorized action.
The atomicity guarantee is provided by the replay store backend:
  • Redis: SET key value NX - atomic conditional set. Two concurrent SET NX operations on the same key result in exactly one returning OK and the other returning null. This is a single atomic command, not a check-then-set.
  • DynamoDB: PutItem with ConditionExpression: "attribute_not_exists(fingerprint)" - atomic conditional write. Two concurrent PutItem operations with this condition on the same key result in exactly one succeeding and the other throwing ConditionalCheckFailedException.
  • PostgreSQL: INSERT INTO replay (fingerprint) VALUES ($1) ON CONFLICT DO NOTHING - the ON CONFLICT DO NOTHING clause makes the insert idempotent at the row level. Combined with a unique constraint on fingerprint, concurrent inserts result in exactly one row being created.
The atomicity is in the backend operation, not in application-level locking. Application-level locking would require coordinated lock management across processes, introducing its own failure modes. Atomic conditional writes at the storage layer are simpler, more reliable, and provided as primitives by every production-grade data store.
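The exactly-one-wins behavior can be checked directly against any conforming store. A minimal sketch, assuming a store object that implements the reserve operation described above:

// Fire ten concurrent reservations of the same fingerprint against one store.
// Exactly one should fulfill; the other nine should reject.
async function demonstrateSingleWinner(
  store: { reserve(fingerprint: string): Promise<void> },
  fingerprint: string
): Promise<void> {
  const attempts = Array.from({ length: 10 }, () => store.reserve(fingerprint));
  const results = await Promise.allSettled(attempts);
  const winners = results.filter((r) => r.status === "fulfilled").length;
  console.log(`reservations succeeded: ${winners} of ${results.length}`); // expected: 1 of 10
}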

Memory vs. Persistent Stores

MemoryReplayStore

The MemoryReplayStore uses an in-process Map to track fingerprints. It satisfies the ReplayStore interface and is correct for single-process, single-run scenarios. Limitations:
  • Lost on restart - all fingerprint state is lost when the process exits
  • Single-process only - two processes with separate MemoryReplayStore instances share no state; the same fingerprint can execute in each process independently
  • Development only - MemoryReplayStore emits a warning in production environments
In NODE_ENV=production, the runtime detects MemoryReplayStore and emits a warning. This warning is not suppressible without modifying the implementation. It is an intentional forcing function.
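A minimal sketch of an in-process store with this shape, matching the ReplayStore interface shown under Custom Implementations below; it illustrates the pattern and is not the shipped MemoryReplayStore:

// In-process replay store backed by a Map. Correct only for a single process
// and a single run: state is lost on restart and is never shared.
type FingerprintStatus = "reserved" | "confirmed";

class InMemoryReplayStore {
  private readonly entries = new Map<string, FingerprintStatus>();

  async reserve(fingerprint: string): Promise<void> {
    // Check and claim happen in one synchronous step on the event loop,
    // so concurrent reserve() calls in this process cannot interleave.
    if (this.entries.has(fingerprint)) {
      throw new Error(`INV-013: replay detected for ${fingerprint}`);
    }
    this.entries.set(fingerprint, "reserved");
  }

  async confirm(fingerprint: string): Promise<void> {
    this.entries.set(fingerprint, "confirmed");
  }

  async hasExecuted(fingerprint: string): Promise<boolean> {
    // Only confirmed (successfully completed) executions count.
    return this.entries.get(fingerprint) === "confirmed";
  }

  async markExecuted(fingerprint: string): Promise<void> {
    this.entries.set(fingerprint, "confirmed");
  }
}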

RedisReplayStore

The built-in RedisReplayStore uses ioredis and SET key value NX for atomic reservation. It persists across process restarts, supports multiple processes sharing state, and is the recommended production option for most deployments. Key properties:
  • Atomic reservation via SET NX
  • Confirmation via SET XX (update existing key)
  • Key namespace: parmana:replay:{fingerprint} (configurable prefix)
  • TTL support: fingerprints can expire after a configurable period
TTL considerations: fingerprints that expire can be re-executed after expiry. For most governance use cases, fingerprints should not expire, or should expire only after a period much longer than any realistic retry window (years, not hours). Operators should configure TTL based on their specific retention and re-execution requirements.
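A hedged sketch of the reserve and confirm calls using ioredis, illustrating the SET NX / SET XX pattern described above; the key prefix is taken from this document, TTL handling is omitted, and this is not the shipped RedisReplayStore:

import Redis from "ioredis";

const redis = new Redis(); // connection details are deployment-specific
const PREFIX = "parmana:replay:";

async function reserve(fingerprint: string): Promise<void> {
  // SET ... NX: a single atomic command that succeeds only if the key is absent.
  const result = await redis.set(PREFIX + fingerprint, "reserved", "NX");
  if (result !== "OK") {
    throw new Error(`INV-013: replay detected for ${fingerprint}`);
  }
}

async function confirm(fingerprint: string): Promise<void> {
  // SET ... XX: updates the value only if the key already exists (was reserved).
  await redis.set(PREFIX + fingerprint, "confirmed", "XX");
}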

Custom Implementations

Any backend that supports atomic conditional writes can implement ReplayStore. The interface is intentionally minimal:
interface ReplayStore {
  reserve(fingerprint: string): Promise<void>;   // atomic claim - throw if exists
  confirm(fingerprint: string): Promise<void>;   // mark confirmed
  hasExecuted(fingerprint: string): Promise<boolean>;
  markExecuted(fingerprint: string): Promise<void>;
}
  • DynamoDB - PutItemCommand with ConditionExpression: "attribute_not_exists(fingerprint)" (sketched below). Built-in TTL support via DynamoDB Time to Live. Appropriate for Lambda or serverless deployments where Redis is not already in the stack.
  • PostgreSQL - INSERT INTO replay_store (fingerprint, status) VALUES ($1, 'reserved') ON CONFLICT (fingerprint) DO NOTHING, then check rows affected. Appropriate for deployments where PostgreSQL is the primary data store and adding Redis is undesirable.
  • Custom - any system with a primitive that is atomic, conditional on key non-existence, and fail-fast (returns immediately rather than blocking until timeout). etcd, ZooKeeper, Consul (with compare-and-set), and most distributed key-value stores provide this primitive.
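As one concrete illustration of a custom backend, a sketch of the reserve operation over DynamoDB with the AWS SDK v3; the table name and item attributes are assumptions:

import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});
const TABLE = "parmana_replay"; // hypothetical table with partition key "fingerprint"

async function reserve(fingerprint: string): Promise<void> {
  try {
    // Atomic conditional write: the item is created only if no item with this
    // fingerprint exists. Of two concurrent calls, exactly one succeeds.
    await client.send(
      new PutItemCommand({
        TableName: TABLE,
        Item: { fingerprint: { S: fingerprint }, status: { S: "reserved" } },
        ConditionExpression: "attribute_not_exists(fingerprint)",
      })
    );
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      throw new Error(`INV-013: replay detected for ${fingerprint}`);
    }
    throw err; // connectivity and other errors also fail closed
  }
}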

Near-Replay vs. Exact Replay

The fingerprint is SHA-256(canonicalize(signals)). It is a hash of the exact input signals. Two executions with different signals - even one signal value different - produce different fingerprints. This is the correct behavior. Different inputs are different decisions. “Same decision” means same inputs, not approximately the same inputs. Exact replay - identical signals -> same fingerprint -> reserve throws -> execution blocked. This is the case the replay store is designed to catch. Near-replay - one signal different -> different fingerprint -> reserve succeeds -> execution proceeds. Two similar-but-not-identical decisions can both execute. This is correct behavior: they are different decisions. This distinction matters for cases like:
  • Retry with slightly different inputs (corrected credit score, updated loan amount) - these are new decisions, not replays, and correctly proceed.
  • Fraudulent replay with all identical signals - exact replay, correctly blocked.
  • Legitimate execution of similar-but-distinct requests - different fingerprints, correctly allowed.
The fingerprint approach is semantically correct: it identifies decisions by their inputs, not by request metadata (timestamps, request IDs, caller identity). Two requests from different callers at different times with identical signals are the same decision - and should be blocked after the first.
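A small worked example, reusing the executionFingerprint sketch from earlier (simplified canonicalization, hypothetical signal names):

const original    = { applicantId: "A-1001", loanAmount: 250000, creditScore: 702 };
const exactReplay = { applicantId: "A-1001", loanAmount: 250000, creditScore: 702 };
const nearReplay  = { applicantId: "A-1001", loanAmount: 250000, creditScore: 710 };

executionFingerprint(original) === executionFingerprint(exactReplay); // true  -> blocked by reserve()
executionFingerprint(original) === executionFingerprint(nearReplay);  // false -> a different decision, allowed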

Cross-Store Isolation

Two separate ReplayStore instances - even both RedisReplayStore pointing to different Redis servers - share no state. The same fingerprint can execute independently in each store. This is not a bug. It is a design property of the interface: the replay guarantee is scoped to a single store. Operators who need a cross-region or cross-system replay guarantee must use a shared store (a single Redis Cluster visible to all processes, a single DynamoDB table in a primary region, etc.). The scoping is intentional: different decision domains may legitimately run the same policy against the same signals in separate environments (e.g., canary testing a new policy version against production traffic in an isolated store). Cross-store isolation enables this without interference. For production systems making consequential decisions, the requirement is clear: use one shared, highly available store per decision domain. All processes serving the same decision domain must connect to the same store.

Conclusion

Replay protection is the governance layer’s guarantee of exactly-once execution. It is not an application-level concern - it is an infrastructure-level primitive, as fundamental to governance as the signature is to tamper evidence. The design is three properties together:
  1. Two-phase commit - reserve before evaluate, confirm after sign. Not check-then-execute. Atomic claim that establishes exclusive ownership before any evaluation occurs.
  2. Fail-closed - replay store unavailability blocks execution. No silent degradation, no fail-open fallback. Governance without replay protection is not governance.
  3. Atomic conditional write - the backend storage operation must be atomic. Application-level locking is insufficient. Every production-grade data store provides an appropriate primitive: Redis SETNX, DynamoDB conditional PutItem, PostgreSQL INSERT ON CONFLICT.
The fingerprint identifies the decision. The store enforces uniqueness. The two-phase commit makes enforcement atomic. Together, they ensure that each governance decision, for each set of inputs, executes exactly once.

See Also