A governance decision has business consequences. An approval authorizes an action. A denial blocks one. When the action is irreversible - a loan disbursed, a trade executed, an insurance claim paid, an infrastructure change applied - executing the same decision twice is not a duplicate; it is a failure. Exactly-once execution is not an optimization. For consequential automated systems, it is a correctness requirement. This document describes the theory, implementation, and failure modes of replay protection in the Parmana governance runtime.
Why Idempotency Is Not Enough
Idempotency and exactly-once execution are related but distinct properties.

Idempotent - the same request, executed multiple times, produces the same result. Idempotency is a property of the operation. It does not constrain how many times the operation executes.

Exactly-once - the same request executes exactly once. This is a constraint on execution count, not just outcome.

For many operations, idempotency is sufficient. A read operation is idempotent. A GET request is idempotent. Even some write operations are safely idempotent - updating a record to a known state can be executed multiple times without harm.

For governance decisions with irreversible side effects, idempotency is insufficient. Consider:
- A loan approval that triggers disbursement. The disbursement system is idempotent - it checks whether the loan has already been disbursed. But two governance approvals of the same loan, separated by enough time or across system boundaries, may produce two disbursement authorizations before the idempotency check fires. Two disbursements.
- An insurance claim approval. The claim system marks the claim as paid. But two approvals of the same claim, routed to different claim processors, may both succeed before the first is recorded. Two payments.
- A trade execution. The execution system has its own idempotency. But two governance approvals of the same trade produce two execution instructions. In markets that move in milliseconds, both may execute before either is canceled.
The Fingerprint Approach
The execution fingerprint is the identity of a governance decision. The `executionId` is different: it is a UUIDv4 - a unique identifier for the execution event, not the decision request. The same decision request (same signals, same fingerprint) may produce multiple execution events (multiple `executionId`s) if the replay protection fails. The fingerprint is the thing we protect. The `executionId` is the unique label for each attempt.
These two identifiers serve different purposes:
| Identifier | Type | Deterministic? | Purpose |
|---|---|---|---|
| `execution_fingerprint` | SHA-256 | Yes | Decision identity - is this the same decision? |
| `executionId` | UUIDv4 | No | Event identity - which execution event is this? |
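The contrast between the two identifiers can be sketched as follows. This is an illustration only: the canonicalization shown (a flat object serialized with lexicographically sorted keys) is an assumption, not the runtime's actual canonical form, and the signal names are invented.

```typescript
import { createHash, randomUUID } from "crypto";

// Assumed canonicalization: JSON with lexicographically sorted keys.
// The real runtime's canonical form may differ.
function canonicalize(signals: Record<string, unknown>): string {
  return JSON.stringify(signals, Object.keys(signals).sort());
}

// Deterministic decision identity: same signals -> same fingerprint.
function executionFingerprint(signals: Record<string, unknown>): string {
  return createHash("sha256").update(canonicalize(signals)).digest("hex");
}

// Same signals, different key order: still the same decision.
const fp1 = executionFingerprint({ loanId: "L-1001", amount: 250000, creditScore: 712 });
const fp2 = executionFingerprint({ creditScore: 712, amount: 250000, loanId: "L-1001" });
console.log(fp1 === fp2); // true

// Non-deterministic event identity: each attempt gets a fresh label.
const attempt1 = randomUUID();
const attempt2 = randomUUID();
console.log(attempt1 !== attempt2); // true
```

Because the fingerprint is deterministic, any process that receives the same signals computes the same identity independently, with no coordination required.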
The Two-Phase Commit
Replay protection uses a two-phase commit pattern: reserve before evaluation, confirm after signing.
Phase 1 - Reserve
Before any evaluation occurs, the fingerprint is atomically claimed in the replay store. The `reserve` operation must be atomic and conditional:
- If the fingerprint has never been seen, claim it and return normally
- If the fingerprint has already been claimed (reserved or confirmed), throw immediately
Two concurrent `reserve` calls with the same fingerprint will result in exactly one succeeding and the other throwing. No race condition. No window where both could succeed.
If reserve throws, execution is blocked with error INV-013 (replay detected). The caller receives the error. No evaluation occurs. No side effects.
Phase 2 - Confirm
After evaluation and signing succeed, the fingerprint is confirmed. Future `reserve` calls with the same fingerprint will see the existing record and throw.
Why Two Phases?
A natural question: why not just check-then-execute? Why does confirmation require a separate step?

Check-then-execute creates a race condition. Two requests arrive with the same fingerprint. Request A checks - fingerprint not seen. Request B checks - fingerprint not seen. Request A executes. Request B executes. Both succeed. Double execution.

The solution is to make the check and the claim atomic: the `reserve` operation claims the slot and fails if already claimed. After `reserve` succeeds, no other caller can claim the same fingerprint. This is the purpose of the atomic conditional write requirement.
But why Phase 2 at all? The reserve call establishes exclusive ownership of the fingerprint. After reserve succeeds, no other execution can proceed with the same fingerprint. Why is confirm needed?
Because reserve only establishes that execution was attempted, not that it succeeded. If evaluation fails, or signing fails, or the process crashes after reserve but before producing an attestation, the fingerprint is in a reserved-but-not-confirmed state. Future requests with the same fingerprint will see the existing record and throw - correctly preventing re-execution.
But there is a subtlety: what if the first attempt genuinely failed and should be retried? In the fail-closed design, this is not retried automatically. The application layer must treat a failed execution as an error and surface it to the operator. This is the correct behavior for consequential decisions: fail explicitly, require explicit retry authorization, rather than silently re-executing.
The confirmed state serves a different purpose: it distinguishes “this execution was attempted and succeeded” from “this execution was attempted and failed.” Both states block future attempts, but hasExecuted returns true only for confirmed fingerprints - allowing callers to query whether a specific decision was successfully completed.
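The contract described above can be sketched as a minimal in-memory store. The method names follow the operations this document names (`reserve`, `confirm`, `hasExecuted`), but the class itself is illustrative, not the shipped interface; single-threaded `Map` access stands in for the backend's atomic conditional write.

```typescript
type ReplayState = "reserved" | "confirmed";

// Minimal sketch of the two-phase contract. Correct for a single process
// only: Map operations cannot interleave in a single-threaded runtime.
class SketchReplayStore {
  private entries = new Map<string, ReplayState>();

  // Phase 1: atomic, conditional claim. Throws if the fingerprint was ever seen.
  reserve(fingerprint: string): void {
    if (this.entries.has(fingerprint)) {
      throw new Error("INV-013: replay detected");
    }
    this.entries.set(fingerprint, "reserved");
  }

  // Phase 2: mark a previously reserved fingerprint as successfully executed.
  confirm(fingerprint: string): void {
    if (this.entries.get(fingerprint) !== "reserved") {
      throw new Error("confirm without prior reserve");
    }
    this.entries.set(fingerprint, "confirmed");
  }

  // True only for confirmed fingerprints: attempted-and-succeeded.
  hasExecuted(fingerprint: string): boolean {
    return this.entries.get(fingerprint) === "confirmed";
  }
}

const store = new SketchReplayStore();
store.reserve("fp-1");
console.log(store.hasExecuted("fp-1")); // reserved but not confirmed: false
store.confirm("fp-1");
console.log(store.hasExecuted("fp-1")); // confirmed: true
```

Note that both states block a second `reserve`; only `hasExecuted` distinguishes them.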
The Execution Sequence
If the process crashes between `reserve` and `confirm`, or if `confirm` fails, the fingerprint remains reserved. Future attempts with the same fingerprint are blocked. This is fail-closed.
Fail-Closed Design
The replay store is a hard dependency of the governance runtime. If the replay store is unavailable:
- `reserve` throws (cannot be completed)
- Execution is blocked
- `INV-013` or a store connectivity error is returned
- No evaluation occurs
- No attestation is produced
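The fail-closed ordering can be sketched as a guard around evaluation. The function and store shapes here are hypothetical; the point is only the sequence: `reserve` runs first, and any store error propagates before evaluation can begin.

```typescript
interface ReplayStore {
  reserve(fingerprint: string): Promise<void>;
}

// Fail-closed: reserve runs before evaluation, and any store error -- replay
// or connectivity -- propagates to the caller. No fallback, no evaluation.
async function executeGoverned<T>(
  store: ReplayStore,
  fingerprint: string,
  evaluate: () => Promise<T>,
): Promise<T> {
  await store.reserve(fingerprint); // throws on replay or store unavailability
  return evaluate();                // reached only if the claim succeeded
}

// A store whose backend is unreachable: every reserve fails.
const downStore: ReplayStore = {
  reserve: async () => { throw new Error("replay store unavailable"); },
};

let evaluated = false;
executeGoverned(downStore, "fp-1", async () => { evaluated = true; })
  .catch((err) => console.log(err.message, "| evaluated:", evaluated));
// prints: replay store unavailable | evaluated: false
```

There is deliberately no branch that proceeds without a successful claim: that branch would be fail-open.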
The Concurrent Case
Consider ten identical requests arriving simultaneously with the same signals. Without replay protection, all ten evaluate the policy, sign attestations, and proceed to action. Ten approvals. Potential for ten side effects.

With replay protection:
- All ten requests compute the same fingerprint (deterministic).
- All ten attempt `replayStore.reserve(fingerprint)` concurrently.
- Exactly one `reserve` succeeds (atomic conditional write).
- Nine receive an error immediately (fingerprint already claimed).
- One request proceeds: evaluate, sign, confirm.
- One attestation. One authorized action.
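The ten-request scenario can be simulated directly. This is a sketch: a single-threaded `Set` stands in for the backend's atomic conditional write, and the fingerprint value is invented.

```typescript
// Atomic conditional claim: in a single-threaded runtime the has/add pair
// cannot be interleaved, standing in for SET NX or a conditional PutItem.
const claimed = new Set<string>();
async function reserve(fingerprint: string): Promise<void> {
  if (claimed.has(fingerprint)) throw new Error("INV-013: replay detected");
  claimed.add(fingerprint);
}

async function main() {
  const fingerprint = "a".repeat(64); // ten identical requests, one fingerprint
  const results = await Promise.allSettled(
    Array.from({ length: 10 }, () => reserve(fingerprint)),
  );
  const won = results.filter((r) => r.status === "fulfilled").length;
  const blocked = results.filter((r) => r.status === "rejected").length;
  console.log(`succeeded: ${won}, blocked: ${blocked}`); // succeeded: 1, blocked: 9
}
main();
```

The nine rejections are immediate: the losers learn they lost at `reserve` time, before any evaluation work is done.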
Every production-grade backend provides an atomic conditional-write primitive:
- Redis: `SET key value NX` - atomic conditional set. Two concurrent SET NX operations on the same key result in exactly one returning `OK` and the other returning `null`. This is a single atomic command, not a check-then-set.
- DynamoDB: `PutItem` with `ConditionExpression: "attribute_not_exists(fingerprint)"` - atomic conditional write. Two concurrent PutItem operations with this condition on the same key result in exactly one succeeding and the other throwing `ConditionalCheckFailedException`.
- PostgreSQL: `INSERT INTO replay (fingerprint) VALUES ($1) ON CONFLICT DO NOTHING` - the `ON CONFLICT DO NOTHING` clause makes the insert idempotent at the row level. Combined with a unique constraint on `fingerprint`, concurrent inserts result in exactly one row being created.
Memory vs. Persistent Stores
MemoryReplayStore
The `MemoryReplayStore` uses an in-process `Map` to track fingerprints. It satisfies the `ReplayStore` interface and is correct for single-process, single-run scenarios.
Limitations:
- Lost on restart - all fingerprint state is lost when the process exits
- Single-process only - two processes with separate `MemoryReplayStore` instances share no state; the same fingerprint can execute in each process independently
- Development only - `MemoryReplayStore` emits a warning in production environments
Under `NODE_ENV=production`, the runtime detects `MemoryReplayStore` and emits a warning. This warning is not suppressible without modifying the implementation. It is an intentional forcing function.
RedisReplayStore
The built-in `RedisReplayStore` uses ioredis and `SET key value NX` for atomic reservation. It persists across process restarts, supports multiple processes sharing state, and is the recommended production option for most deployments.
Key properties:
- Atomic reservation via `SET NX`
- Confirmation via `SET XX` (update existing key)
- Key namespace: `parmana:replay:{fingerprint}` (configurable prefix)
- TTL support: fingerprints can expire after a configurable period
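A sketch of the reservation logic against an ioredis-compatible client. The class shape is illustrative, not the shipped implementation; only the key prefix and the SET NX / SET XX semantics come from this document. ioredis's `set(key, value, "NX")` resolves to `"OK"` on success and `null` when the condition is not met.

```typescript
// Minimal client surface, matching the subset of ioredis used here.
interface RedisLike {
  set(key: string, value: string, mode: "NX" | "XX"): Promise<"OK" | null>;
}

// Illustrative sketch of a Redis-backed ReplayStore, not the built-in class.
class RedisReplayStoreSketch {
  constructor(
    private redis: RedisLike,
    private prefix = "parmana:replay:", // key namespace from this document
  ) {}

  private key(fingerprint: string): string {
    return this.prefix + fingerprint;
  }

  // Phase 1: SET ... NX claims the key only if it does not already exist.
  async reserve(fingerprint: string): Promise<void> {
    const res = await this.redis.set(this.key(fingerprint), "reserved", "NX");
    if (res !== "OK") throw new Error("INV-013: replay detected");
  }

  // Phase 2: SET ... XX updates only a key that already exists.
  async confirm(fingerprint: string): Promise<void> {
    await this.redis.set(this.key(fingerprint), "confirmed", "XX");
  }
}
```

With a real client this would be constructed as `new RedisReplayStoreSketch(new Redis(url))`; because the claim is a single Redis command, no application-level locking is involved.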
Custom Implementations
Any backend that supports atomic conditional writes can implement `ReplayStore`. The interface is intentionally minimal:
- DynamoDB - `PutItemCommand` with `ConditionExpression: "attribute_not_exists(fingerprint)"`. Built-in TTL support via DynamoDB Time to Live. Appropriate for Lambda or serverless deployments where Redis is not already in the stack.
- PostgreSQL - `INSERT INTO replay_store (fingerprint, status) VALUES ($1, 'reserved') ON CONFLICT (fingerprint) DO NOTHING`, then check rows affected. Appropriate for deployments where PostgreSQL is the primary data store and adding Redis is undesirable.
- Custom - any system with a primitive that is: atomic, conditional on key non-existence, and fail-fast (returns immediately rather than blocking until timeout). etcd, ZooKeeper, Consul (with compare-and-set), and most distributed key-value stores provide this primitive.
Near-Replay vs. Exact Replay
The fingerprint is `SHA-256(canonicalize(signals))`. It is a hash of the exact input signals. Two executions with different signals - even one signal value different - produce different fingerprints.
This is the correct behavior. Different inputs are different decisions. “Same decision” means same inputs, not approximately the same inputs.
Exact replay - identical signals -> same fingerprint -> reserve throws -> execution blocked. This is the case the replay store is designed to catch.
Near-replay - one signal different -> different fingerprint -> reserve succeeds -> execution proceeds. Two similar-but-not-identical decisions can both execute. This is correct behavior: they are different decisions.
This distinction matters for cases like:
- Retry with slightly different inputs (corrected credit score, updated loan amount) - these are new decisions, not replays, and correctly proceed.
- Fraudulent replay with all identical signals - exact replay, correctly blocked.
- Legitimate execution of similar-but-distinct requests - different fingerprints, correctly allowed.
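The near-replay distinction is purely mechanical, as a small example shows. The canonicalization (sorted-key JSON over a flat object) and the signal names are assumptions for illustration, not the runtime's actual scheme.

```typescript
import { createHash } from "crypto";

// Assumed canonicalization: flat object, lexicographically sorted keys.
function fingerprint(signals: Record<string, unknown>): string {
  const canonical = JSON.stringify(signals, Object.keys(signals).sort());
  return createHash("sha256").update(canonical).digest("hex");
}

const original = { loanId: "L-1001", amount: 250000, creditScore: 690 };
const exactReplay = { loanId: "L-1001", amount: 250000, creditScore: 690 };
const corrected = { loanId: "L-1001", amount: 250000, creditScore: 712 };

// Exact replay: identical signals, identical fingerprint -> blocked.
console.log(fingerprint(original) === fingerprint(exactReplay)); // true

// Near-replay: one corrected signal, new fingerprint -> a new decision.
console.log(fingerprint(original) === fingerprint(corrected)); // false
```

There is no similarity threshold to tune: the hash either matches exactly or it does not.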
Cross-Store Isolation
Two separate `ReplayStore` instances - even both `RedisReplayStore` pointing to different Redis servers - share no state. The same fingerprint can execute independently in each store.
This is not a bug. It is a design property of the interface: the replay guarantee is scoped to a single store. Operators who need a cross-region or cross-system replay guarantee must use a shared store (a single Redis Cluster visible to all processes, a single DynamoDB table in a primary region, etc.).
The scoping is intentional: different decision domains may legitimately run the same policy against the same signals in separate environments (e.g., canary testing a new policy version against production traffic in an isolated store). Cross-store isolation enables this without interference.
For production systems making consequential decisions, the requirement is clear: use one shared, highly available store per decision domain. All processes serving the same decision domain must connect to the same store.
Conclusion
Replay protection is the governance layer’s guarantee of exactly-once execution. It is not an application-level concern - it is an infrastructure-level primitive, as fundamental to governance as the signature is to tamper evidence. The design is three properties together:
- Two-phase commit - reserve before evaluate, confirm after sign. Not check-then-execute. Atomic claim that establishes exclusive ownership before any evaluation occurs.
- Fail-closed - replay store unavailability blocks execution. No silent degradation, no fail-open fallback. Governance without replay protection is not governance.
- Atomic conditional write - the backend storage operation must be atomic. Application-level locking is insufficient. Every production-grade data store provides an appropriate primitive: Redis SETNX, DynamoDB conditional PutItem, PostgreSQL INSERT ON CONFLICT.
See Also
- Replay Protection - conceptual overview
- Fail-Closed Governance - the fail-closed property in full
- Custom Integrations - DynamoDB and PostgreSQL ReplayStore implementations
- Bring Your Own Infrastructure - ReplayStore interface and backend options
- Why Governance Must Be Deterministic - the fingerprint and its determinism requirements