The exam letter arrives, and somewhere in your stomach you already know which question is coming. Not "do you use AI" — they assume that now. The question is sharper: show me how this specific model reached this specific decision, prove the control that should have caught a problem actually fired, and demonstrate you knew when the model changed. If your honest answer is "the data science team can probably reconstruct that," you have already lost the room.
Examiners and internal auditors are not impressed by model accuracy. They are paid to be skeptical of it. When AI touches a regulated decision — a loan, a claim, a trade, an adverse action — the examiner's job is to assume the decision was wrong until your evidence says otherwise. Their questions are remarkably consistent across lending, insurance, and markets. Here is what they actually ask, and what a good answer looks like.
Ask 1: "Explain this decision to the person it affected."
This is the explainability ask, and it is more demanding than a SHAP chart. Under the Equal Credit Opportunity Act and Regulation B, an adverse action on a credit application requires a statement of specific, accurate reasons, not "your application did not meet our model's criteria." For housing-related credit, the Fair Housing Act adds that those reasons cannot be a polite cover for a prohibited basis. An examiner will pick one declined applicant and ask you to walk the decision from inputs to outcome in plain language.
The trap here is that many teams can explain the model in general but not this decision in particular. A global feature-importance plot describes the model's tendencies. It does not tell the examiner why Maria's application was denied on a Tuesday. What they want is decision-level reasoning: the inputs that drove this output, the policy that was applied, and a reason a human can read and a regulator can defend.
Deterministic governance changes the shape of this answer. When a policy pack — encoding ECOA, Reg B, and FHA constraints — evaluates the proposed decision before it dispatches, the reasons are not reverse-engineered after a complaint. They are recorded at the moment of decision, tied to the rule that produced them.
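To make "recorded at the moment of decision" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the rule IDs, the thresholds, the reason wording, and the record layout are illustrative assumptions, not the contents of any real policy pack and not figures taken from ECOA, Reg B, or the FHA. The point is only the shape: each reason is emitted by the rule that triggered it, in the same step that produces the outcome.

```python
from dataclasses import dataclass
from typing import Callable

POLICY_VERSION = "demo-policy-1.0"  # hypothetical version label


@dataclass(frozen=True)
class Rule:
    rule_id: str
    fails: Callable[[dict], bool]  # True when the applicant violates the rule
    reason: str                    # plain-language reason tied to this rule


# Illustrative rules only; the thresholds are assumptions made for this sketch.
RULES = [
    Rule("DTI_MAX", lambda a: a["debt_to_income"] > 0.43,
         "Debt-to-income ratio too high"),
    Rule("HIST_MIN", lambda a: a["months_of_credit_history"] < 12,
         "Length of credit history too short"),
]


def decide(applicant: dict) -> dict:
    """Evaluate every rule and record the reasons alongside the outcome."""
    triggered = [r for r in RULES if r.fails(applicant)]
    return {
        "policy_version": POLICY_VERSION,
        "inputs": dict(applicant),
        "outcome": "DECLINED" if triggered else "APPROVED",
        "reasons": [{"rule_id": r.rule_id, "reason": r.reason}
                    for r in triggered],
    }


record = decide({"debt_to_income": 0.51, "months_of_credit_history": 30})
```

With a record like this, the answer to "why was this applicant declined" is a lookup rather than a reconstruction.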
Ask 2: "Reproduce it. Exactly."
Reproducibility is where confident programs quietly fall apart. The examiner asks you to re-run the decision from six months ago and get the same result. Then they watch you discover that the model has been retrained twice, a feature pipeline changed, and the reference data the model saw that day no longer exists in its original form.
An examiner does not want to know what your model usually does. They want to know what it did, on this record, on this date — and they want to verify it without taking your word for it.
Reproducibility is not "we can probably get close." It is the ability to replay the exact decision — same inputs, same policy version, same logic — and produce a byte-identical record. This is the difference between a model-risk program that survives an exam and one that generates findings. The answer is a replayable decision trail: every regulated decision captured as an immutable record of what was evaluated, which policy version governed it, and what came out. When that record is cryptographically signed, the examiner can confirm it has not been altered since the day it was created.
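A byte-identical replay is easier to reason about with a small example. The sketch below assumes the decision logic is deterministic and that records are stored in a canonical JSON form (sorted keys, fixed separators). The `POLICIES` table, its thresholds, and the record fields are hypothetical stand-ins for illustration, not a description of any particular product.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so equal records yield equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True).encode("utf-8")


# Hypothetical policy versions; every version stays available for replay.
POLICIES = {
    "v1": lambda inputs: "DECLINED" if inputs["debt_to_income"] > 0.43 else "APPROVED",
    "v2": lambda inputs: "DECLINED" if inputs["debt_to_income"] > 0.40 else "APPROVED",
}


def make_record(inputs: dict, policy_version: str) -> dict:
    return {
        "inputs": inputs,
        "policy_version": policy_version,
        "outcome": POLICIES[policy_version](inputs),
    }


def replay_matches(stored: bytes) -> bool:
    """Re-run the decision from the stored inputs and policy version,
    then compare the result byte for byte with the stored record."""
    original = json.loads(stored)
    replayed = make_record(original["inputs"], original["policy_version"])
    return canonical_bytes(replayed) == stored


stored = canonical_bytes(make_record({"debt_to_income": 0.42}, "v1"))
digest = hashlib.sha256(stored).hexdigest()  # fingerprint to file with the record
```

Replaying the same inputs under `"v2"` would produce a different record, which is why the policy version has to be part of what is stored.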
Ask 3: "Prove the control worked — not that it exists."
Every program has a control inventory. Examiners stopped being impressed by control inventories a long time ago. The supervisory guidance on model risk management issued jointly by the Federal Reserve and the OCC (SR 11-7 at the Fed, Bulletin 2011-12 at the OCC) is built around a simple, brutal premise: a model is a source of risk, and the controls around it must be effective, independently validated, and evidenced. "We have a fair-lending review step" is a claim. "Here is the log showing the fair-lending control evaluated 1.2 million decisions and blocked the 340 that violated policy" is evidence.
The gap between those two sentences is where findings live. Auditors increasingly distinguish between design effectiveness (the control is well-built) and operating effectiveness (the control actually fires, every time, on every decision). You can pass the first and fail the second spectacularly — a beautifully designed control that was silently bypassed in production for a quarter.
This is the strongest case for enforcing governance before an action dispatches rather than filtering outputs after the fact. A pre-execution gate that returns ALLOWED, BLOCKED, or MODIFIED on every decision — and logs that verdict with the risk it computed — turns "we have a control" into "here is the control's complete operating history." A model update firewall that deterministically enforces ECOA and Reg B constraints in real time is the difference between catching a violation in an exam and stopping it before it reaches a customer.
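The gate described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real implementation: the `uses_prohibited_feature` flag, the `MAX_APR` cap, the risk values, and the in-memory `AUDIT_LOG` are all hypothetical. What matters is that the verdict is logged on every path, so the log doubles as the control's operating history.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    ALLOWED = "ALLOWED"
    BLOCKED = "BLOCKED"
    MODIFIED = "MODIFIED"


AUDIT_LOG = []   # stand-in for an append-only store
MAX_APR = 0.36   # hypothetical policy cap, for illustration only


def gate(decision: dict) -> tuple:
    """Evaluate a proposed decision before it is dispatched."""
    verdict, risk, final = Verdict.ALLOWED, 0.0, decision
    if decision.get("uses_prohibited_feature"):
        verdict, risk = Verdict.BLOCKED, 1.0
    elif decision["apr"] > MAX_APR:
        verdict, risk = Verdict.MODIFIED, 0.5
        final = {**decision, "apr": MAX_APR}  # clamp to the policy cap
    # Logged on every path: this is the operating-effectiveness evidence.
    AUDIT_LOG.append({"decision_id": decision["id"],
                      "verdict": verdict.value,
                      "risk": risk})
    return verdict, final


def dispatch(decision: dict):
    """Only decisions that pass the gate reach the customer."""
    verdict, final = gate(decision)
    return None if verdict is Verdict.BLOCKED else final


results = [dispatch(d) for d in [
    {"id": "d1", "apr": 0.12},
    {"id": "d2", "apr": 0.50},
    {"id": "d3", "apr": 0.10, "uses_prohibited_feature": True},
]]
history = Counter(entry["verdict"] for entry in AUDIT_LOG)
```

Because `dispatch` cannot run without calling `gate`, there is no production path that skips the control, and `history` is the count an examiner would ask to see.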
Ask 4: "Show me your change management."
Models drift. Teams retrain. A vendor pushes an update. Somebody adjusts a threshold to hit a quarterly target. Every one of these is a change to a regulated decision system, and the examiner wants to see that each was governed: who approved it, what was tested, when it went live, and how you would roll it back.
The most dangerous version of this question is the silent model update: a third-party model that the vendor updates behind an API, changing your decision behavior without a single internal approval. From the examiner's chair, that is an unmanaged change to a regulated control. The expectation under model-risk guidance is that you treat model changes with the same rigor as code changes: versioned, tested against policy, approved, and reversible. If you cannot name the policy version that governed a decision, you cannot prove the decision was made under the controls you had approved at the time.
Decision evidence solves this by binding every decision to a specific, versioned policy. When the policy changes, the record reflects it. When you replay a decision, you replay it under the policy that was actually in force — not the one you have today.
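One way to picture "the policy that was actually in force" is a version registry keyed by effective date. The sketch below is an assumption-laden illustration: the `PolicyVersion` fields, the approver name, and the dates are invented for the example. It shows how a decision timestamp resolves to exactly one approved version, or to none.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyVersion:
    version: str
    effective_from: datetime
    approved_by: str  # who signed off on the change


# Hypothetical registry, kept sorted by effective_from.
REGISTRY = [
    PolicyVersion("v1", datetime(2024, 1, 1, tzinfo=timezone.utc),
                  "model-risk-committee"),
    PolicyVersion("v2", datetime(2024, 6, 1, tzinfo=timezone.utc),
                  "model-risk-committee"),
]


def policy_in_force(at: datetime) -> PolicyVersion:
    """Return the approved version that governed decisions at time `at`."""
    starts = [p.effective_from for p in REGISTRY]
    i = bisect_right(starts, at)
    if i == 0:
        raise LookupError("no approved policy was in force at that time")
    return REGISTRY[i - 1]


march = policy_in_force(datetime(2024, 3, 15, tzinfo=timezone.utc))
```

In this model, rolling back is another dated, approved entry in the registry rather than an edit to an existing one, so the history itself stays intact.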
Ask 5: "Why should I trust your evidence?"
Here is the question underneath all the others, and it is the one most programs never prepare for. Even when you produce the logs, the explanations, and the control history, the examiner is entitled to ask why they should trust a record your own team generated and could, in principle, have edited. Self-attested evidence carries the weight of self-attested anything.
This is the case for cryptographic attestation. When each decision record is signed with HMAC-SHA256 and that signature can be verified independently — offline, without trusting the vendor or the team that produced it — the trust question changes character. The examiner is no longer asking you to vouch for the record. They are verifying the math. This is exactly what EVE Proof is built to do: turn every AI decision into a signed, independently verifiable certificate that a controls team or external auditor can validate without taking anyone's word for it.
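For readers who want to see what "verifying the math" looks like, here is a minimal sketch using Python's standard `hmac` and `hashlib` modules. It is not EVE Proof's implementation or record format, which the article does not describe. The record fields, the canonical JSON encoding, and the demo key are assumptions for illustration.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Deterministic encoding, so the same record always signs the same way."""
    return json.dumps(record, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sign(record: dict, key: bytes) -> str:
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(sign(record, key), signature)


key = b"demo-key-do-not-use"  # hypothetical; real keys belong in a key vault
record = {"decision_id": "d1", "policy_version": "v1", "outcome": "DECLINED"}
signature = sign(record, key)

# Any change to the record after signing invalidates the signature.
tampered = {**record, "outcome": "APPROVED"}
```

One caveat worth planning for: HMAC is a symmetric scheme, so whoever verifies a signature holds the same key that can create one. How independent the verification really is therefore depends on who has custody of that key.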
The takeaway
The five asks — explain it, reproduce it, prove the control fired, govern the change, and verify the evidence — are not separate problems. They are one problem viewed from five angles: can you produce trustworthy proof of what your AI decided and why? Programs that treat governance as documentation answer with effort and apology. Programs that enforce decisions deterministically and capture them as signed, replayable evidence answer with a file.
Build the evidence before the letter arrives, not after. If your AI touches lending, claims, or trading decisions and you want every one of them to come with an audit-ready, independently verifiable certificate, see how EVE Proof turns decision evidence into something an examiner can verify without trusting you at all — which is exactly the point.