Receipts or It Didn’t Happen: Evidence-Grade Logging for Agents

A support workflow does the thing everyone wanted.

A customer escalation comes in. An agent summarizes the issue, checks prior tickets, retrieves product documentation, looks at account context, updates a CRM field, opens a follow-up ticket, and posts a status note in Slack.

Everything appears to work.

The customer team has the context. The ticket is moving. The account record is cleaner than it was an hour ago. Nobody is screaming. No dashboard is blinking red.

Then, two weeks later, someone asks why the CRM field changed.

That is where the story gets less clean.

The CRM audit log shows a field changed. The agent runtime shows a tool call. The ticketing system shows follow-up work. Slack shows the summary. The identity provider shows access. Somewhere there may be a prompt, a retrieved document, an approval, or a workflow run.

But the question was not, “Did something happen?”

The question was, “Can we prove what happened, under whose authority, with what data, through which tool, and why?”

That is the difference between logging and receipts.

The workflow worked. The receipt did not.

The action is not the whole story

Earlier AI usage was easier to reason about because most of it stayed inside the answer.

A user asked a question. The model generated text. Maybe the answer was good. Maybe it was wrong. Maybe it needed review. But the system mostly stayed in the lane of producing content.

That is not where agentic workflows are going.

Once an agent can read systems, call tools, update records, open tickets, send messages, retrieve documents, or trigger downstream work, the final answer is not the whole event. It might not even be the most important part of the event.

A prompt is not a tool call.

A tool call is not a state change.

A state change is not evidence by itself.

That distinction matters because the risk is not “AI” in the abstract. The risk is that a system acted across boundaries and left behind a record that makes the work look simpler than it really was.

I wrote about that broader problem in “The Workflow Got Faster. The Record Got Fuzzier.” The productivity gain is real. The attribution problem is also real. Those two things can be true at the same time. That is usually where the work gets interesting.

The normal failure mode is boring

The easiest version to dismiss is the sci-fi version.

The rogue agent. The cinematic mistake. The demo-day disaster with the red siren spinning in the background.

That is not the version I would design against first.

I would start with the boring version.

An approved workflow gets a little more capable. A connector gets added. Retrieval gets wired in. A tool call moves from draft-only to update-capable. A Slack notification becomes a workflow trigger. A CRM field update starts feeding a downstream report.

Nobody thinks of this as a new system.

It feels like the old workflow, but faster.

The problem shows up later, when someone needs to reconstruct the path.

The CRM log says the field changed, but not why.

The agent trace says a tool was called, but not whether it was acting as itself, on behalf of the user, or under some delegated workflow.

The retrieval layer shows documents that were accessed, but not whether that access matched the requester’s policy boundary.

Slack shows the summary, but not the source data or approval trail.

The ticketing system shows follow-up work, but not what actually triggered it.

The identity provider shows access, but not intent.

More logs did not create more evidence.

They created more places to look.

That is not an anti-agent argument. It is an anti-fuzziness argument.

Receipts are connected evidence

A receipt is not just a log line.

It is the connected record that can answer basic questions after the workflow has moved on.

Who or what acted?
Was it acting as itself, as a service identity, on behalf of a user, or through delegated authority?
What task caused the action?
What data was retrieved?
What tool was called?
What scope was available at the time?
What policy, reasoning step, or approval allowed the workflow to continue?
What changed?
Did anything downstream get triggered?
Can someone reconstruct the path later without tribal knowledge, screenshots, Slack archaeology, or asking the one engineer who happened to be online?

That last part matters. A record that only works while the person who built the automation is still in the room is not evidence. It is a memory with a badge.

This is why I like the term receipts. It forces the conversation away from generic “AI governance” language and back toward reviewable proof.

If the workflow can act, the record has to survive review.
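
To make those questions concrete, here is a minimal sketch of a receipt as a single record with one field per question. Every field name is my own assumption for illustration, not a standard schema. The only point is that an empty field is a question nobody can answer later.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Receipt:
    """One connected record for one agent action. Field names are illustrative."""
    run_id: str                       # shared identifier other systems also log
    actor: str                        # who or what acted
    acting_as: str                    # "self", "service", "on_behalf_of", or "delegated"
    on_behalf_of: Optional[str]       # the user or workflow behind the actor, if any
    task: str                         # what task caused the action
    retrieved: list = field(default_factory=list)   # what data was retrieved
    tool: Optional[str] = None        # what tool was called
    scope: list = field(default_factory=list)       # scope available at the time
    allowed_by: Optional[str] = None  # policy, reasoning step, or approval reference
    change: Optional[dict] = None     # what changed (field, old value, new value)
    downstream: list = field(default_factory=list)  # anything triggered afterwards


def missing_answers(receipt: Receipt) -> list:
    """Return the names of empty fields: the questions the record cannot answer."""
    may_be_empty = {"downstream"}          # nothing downstream is a valid answer
    if receipt.acting_as == "self":
        may_be_empty.add("on_behalf_of")   # no principal expected in this case
    return [name for name, value in asdict(receipt).items()
            if name not in may_be_empty and value in (None, "", [], {})]


# Hypothetical example: the CRM update from the escalation workflow.
r = Receipt(
    run_id="run-123",
    actor="support-agent",
    acting_as="on_behalf_of",
    on_behalf_of="user:jlee",
    task="customer escalation",
    tool="crm.update_field",
    change={"field": "renewal_risk", "old": "low", "new": "high"},
)
print(missing_answers(r))  # ['retrieved', 'scope', 'allowed_by']
```

In this made-up example the action and the state change are recorded, but the read path, the scope, and the authority are not. That is the shape of the problem from the opening story.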

Evidence-grade logging is about joins, not volume

The lazy answer is “log everything.”

That is not enough.

Logging more data can help, but only if the pieces connect. A pile of unjoined logs is just a junk drawer with timestamps.

The useful question is not whether each system recorded something. The useful question is whether the records agree with each other.

For an agent workflow, the receipt may need to connect records across:

Agent runtime logs
Identity provider logs
Tool or API logs
Retrieval traces
CRM audit logs
Ticketing system records
Slack or collaboration audit logs
Approval workflow records
Security information and event management (SIEM) events
Access governance records
Human review notes

That is why LLM observability work matters here. Langfuse is a useful example because its observability and application tracing documentation treat traces as structured records across prompts, model responses, token usage, latency, tools, and retrieval steps.

That is the right direction.

But for enterprise review, the trace is only one piece of the receipt. The trace still has to join back to identity, authority, scope, approval, business state, and the system of record.

The receipt is the chain of records that survives the question, “Why did this happen?”
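
Here is a rough sketch of what "joins, not volume" can look like in practice. It assumes every system's records have already been exported as dictionaries and carry a shared run_id, which is a big assumption: that identifier only exists if someone propagates it on purpose. The source names and the required set are hypothetical.

```python
from collections import defaultdict

# Hypothetical set of sources a complete receipt needs for a CRM-changing workflow.
REQUIRED_SOURCES = {"agent_runtime", "identity_provider", "tool_api",
                    "crm_audit", "approval"}


def join_by_run(records):
    """Group records from every source under the run_id they claim to belong to."""
    joined = defaultdict(lambda: defaultdict(list))
    for rec in records:
        joined[rec.get("run_id")][rec["source"]].append(rec)
    return joined


def receipt_gaps(records, required=REQUIRED_SOURCES):
    """Return (gaps, orphans).

    gaps maps each run_id to the required sources with no joinable record.
    orphans are records that carry no run_id and cannot be joined at all.
    """
    joined = join_by_run(records)
    gaps = {}
    for run_id, by_source in joined.items():
        if run_id is None:
            continue  # reported separately as orphans
        missing = sorted(required - set(by_source))
        if missing:
            gaps[run_id] = missing
    orphans = [r for recs in joined.get(None, {}).values() for r in recs]
    return gaps, orphans


# Made-up records: the CRM audit entry exists but cannot be tied to the run.
records = [
    {"source": "agent_runtime", "run_id": "run-123", "event": "tool_call"},
    {"source": "tool_api", "run_id": "run-123", "event": "PATCH /accounts/42"},
    {"source": "identity_provider", "run_id": "run-123", "event": "token_issued"},
    {"source": "crm_audit", "run_id": None, "event": "field_changed"},
]
gaps, orphans = receipt_gaps(records)
print(gaps)          # {'run-123': ['approval', 'crm_audit']}
print(len(orphans))  # 1
```

Every system in that example logged something. The receipt still has two holes, and one of them is a record that exists but cannot be joined.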

Traces help, but raw traces are not the whole answer

Agent traces are valuable because they show the step-by-step path.

They can help answer what the agent tried, what tool it used, where it branched, and where something started to go sideways. That is useful for debugging. It is useful for testing. It is useful for review.

But raw traces can become their own problem.

A thousand lines of JavaScript Object Notation, or JSON, may technically contain the truth and still be useless to the person trying to understand what happened under pressure.

That is why I think more people should pay attention to Invariant Labs. Their Explorer work is aimed at visualizing and understanding agent traces, including key steps, decision points, anomalies, and failure points inside an agent run.

That is a practical problem.

If an agent fails, you do not just need a log archive. You need to understand the path.

Where did the agent read?
Where did it reason?
Where did it call a tool?
Where did it cross from suggestion into action?
Where did the state change happen?

That is the part that traditional logs often flatten.
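
A small sketch of the un-flattening, assuming each trace step has already been labeled with a kind. The labels, the "mutates" flag, and the trace shape are my assumptions for illustration. They are not the schema of Langfuse, Explorer, or any other tool.

```python
# Step kinds mirror the questions above.
READ, REASON, TOOL_CALL, STATE_CHANGE = "read", "reason", "tool_call", "state_change"


def summarize_path(trace):
    """Reduce a raw trace to the first index at which each kind of step occurred."""
    first_seen = {}
    for index, step in enumerate(trace):
        first_seen.setdefault(step["kind"], index)
    return first_seen


def crossing_point(trace):
    """Return (index, step) for the first step that moved from suggestion into action.

    "Action" here means a tool call flagged as mutating, or a recorded state change.
    Returns None if the run never acted.
    """
    for index, step in enumerate(trace):
        if step["kind"] == STATE_CHANGE:
            return index, step
        if step["kind"] == TOOL_CALL and step.get("mutates", False):
            return index, step
    return None


# Made-up trace for the escalation workflow.
trace = [
    {"kind": READ, "name": "tickets.search"},
    {"kind": READ, "name": "docs.retrieve"},
    {"kind": REASON, "name": "plan"},
    {"kind": TOOL_CALL, "name": "crm.get_account", "mutates": False},
    {"kind": TOOL_CALL, "name": "crm.update_field", "mutates": True},
    {"kind": STATE_CHANGE, "name": "renewal_risk: low -> high"},
]
print(summarize_path(trace))
# {'read': 0, 'reason': 2, 'tool_call': 3, 'state_change': 5}
print(crossing_point(trace)[0])  # 4
```

The reviewer under pressure does not need all six steps first. They need step 4, and then the path that led to it.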

The receipt has to include data access

In the customer escalation workflow, the agent probably did not start by updating the CRM.

It probably read first.

Prior tickets. Customer notes. Product documentation. Internal knowledge-base entries. Maybe account plans. Maybe support history. Maybe renewal risk notes. Maybe more than it should have seen.

That is why retrieval belongs in the receipt.

In RAG Is Data Access: Retrieval Authorization Is the Control, I made the point that retrieval-augmented generation, or RAG, is not just a way to improve answers. In enterprise systems, it creates a new read path into internal data.

That same idea shows up here.

If the agent used retrieved context to make a decision, summarize an issue, recommend a next step, or update a record, the receipt needs to show what it retrieved and whether that retrieval was allowed.

The final Slack message is not enough.

The CRM field change is not enough.

The ticket update is not enough.

The data path matters because the agent’s action may only make sense if you can prove what it read before it acted.

For the implementation side of that problem, Practical Retrieval Authorization Patterns for RAG Systems gets into the more concrete design work: preserving permissions, enforcing access at retrieval time, separating trust zones, and logging what actually got retrieved.

The short version of this post is simpler.

If retrieval shaped the action, retrieval belongs in the receipt.
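
As a sketch, the retrieval part of a receipt can be as small as two lists: what was read, and what was read outside the requester's boundary. The group-overlap check below is a deliberate simplification and an assumption on my part. Real authorization models are richer, and the names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalRecord:
    run_id: str
    doc_id: str
    requester: str               # identity the retrieval was performed for
    doc_groups: frozenset        # groups allowed to read the doc at retrieval time
    requester_groups: frozenset  # groups the requester belonged to at that time

    @property
    def allowed(self) -> bool:
        # Simplified rule: allowed if requester and document share at least one group.
        return bool(self.doc_groups & self.requester_groups)


def retrieval_section(records, run_id):
    """Build the retrieval part of a receipt for one run."""
    mine = [r for r in records if r.run_id == run_id]
    return {
        "run_id": run_id,
        "retrieved": [r.doc_id for r in mine],
        "out_of_policy": [r.doc_id for r in mine if not r.allowed],
    }


# Made-up example: a support user whose run also pulled a finance-only document.
support = frozenset({"support"})
records = [
    RetrievalRecord("run-123", "kb-204", "user:jlee", frozenset({"support", "sales"}), support),
    RetrievalRecord("run-123", "renewal-risk-notes", "user:jlee", frozenset({"finance"}), support),
    RetrievalRecord("run-999", "kb-001", "user:other", frozenset({"support"}), support),
]
section = retrieval_section(records, "run-123")
print(section["retrieved"])      # ['kb-204', 'renewal-risk-notes']
print(section["out_of_policy"])  # ['renewal-risk-notes']
```

Note that the permissions are captured as they were at retrieval time. Checking against today's group membership answers a different question.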

Tool calls are where intent becomes action

Tool calls are not magic. They are execution paths.

That is good. That is why agents are useful. A system that can only generate a suggestion is helpful. A system that can open the ticket, update the record, notify the owner, or trigger the workflow is more useful.

It is also harder to govern.

HumanLayer has a useful engineering lens here through its 12-Factor Agents work. The pieces that matter for this conversation are the ones that treat natural language, tool calls, execution state, business state, and human contact paths as distinct things.

That is exactly the distinction the receipt needs to preserve.

The user’s prompt is one record.

The agent’s plan or reasoning context may be another.

The tool call is another.

The business state change is another.

The human approval, if there was one, is another.

Flatten those into one generic event, and the story gets too clean.

That is also why Tool Calling Is Privileged Execution belongs close to this conversation. The moment a tool call can change something real, the logging model has to prove more than “the assistant ran.”

It has to prove what was allowed, what happened, and what survived review.
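
One way to keep those records distinct is to give each its own kind and link it to the record that caused it. The sketch below assumes a simple parent pointer on every event, which is my own illustrative choice and not something the 12-Factor Agents material prescribes. It checks whether a state change can be walked back to a prompt through a tool call.

```python
from enum import Enum


class Kind(Enum):
    PROMPT = "prompt"
    PLAN = "plan"
    TOOL_CALL = "tool_call"
    STATE_CHANGE = "state_change"
    APPROVAL = "approval"


def chain_to_prompt(events, state_change_id):
    """Walk parent links from a state change back toward whatever started it.

    events maps event id -> {"kind": Kind, "parent": id or None}.
    Returns the kinds found along the way, oldest first.
    """
    kinds, seen, current = [], set(), state_change_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        event = events[current]
        kinds.append(event["kind"])
        current = event.get("parent")
    return list(reversed(kinds))


def is_traceable(events, state_change_id):
    """True if the chain starts at a prompt and passes through a tool call."""
    kinds = chain_to_prompt(events, state_change_id)
    return (kinds[0] == Kind.PROMPT
            and Kind.TOOL_CALL in kinds
            and kinds[-1] == Kind.STATE_CHANGE)


# Made-up events: one full chain, and one flattened "something changed" event.
events = {
    "e1": {"kind": Kind.PROMPT, "parent": None},
    "e2": {"kind": Kind.PLAN, "parent": "e1"},
    "e3": {"kind": Kind.APPROVAL, "parent": "e2"},
    "e4": {"kind": Kind.TOOL_CALL, "parent": "e3"},
    "e5": {"kind": Kind.STATE_CHANGE, "parent": "e4"},
    "flat": {"kind": Kind.STATE_CHANGE, "parent": None},
}
print(is_traceable(events, "e5"))    # True
print(is_traceable(events, "flat"))  # False
```

The flattened event is the one most audit logs produce by default. It is accurate, and it explains nothing.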

Ownership is part of the receipt

There is another quiet failure mode in all of this.

Nobody owns the receipt.

The app owner owns the workflow. The platform team owns the integration. Security owns the policy. IT owns the identity provider. The business team owns the CRM field. The data team owns part of the retrieval layer. The SIEM has some of the telemetry. Slack has the message. The ticketing system has the follow-up.

Everyone owns a piece.

Nobody owns the story.

That is how agent workflows become identity debt.

In Agent Inventory and the Agent Register, I argued that agent ownership, scope, lifecycle, and evidence need a place to live before agent sprawl turns into a cleanup project. Receipts are part of that same control surface.

An agent receipt without an owner is just another artifact waiting for someone else to explain.

That is not governance. That is a future meeting with worse lighting.

What to do this week

Start small, and start honest.

Pick one agent workflow that can change state.

Not a demo. Not a toy. Pick something that updates a ticket, writes to CRM, posts into a shared channel, changes a field, triggers an approval, or opens work for someone else.

Then map the path:

What kicked off the workflow?
What identity initiated it?
What identity executed it?
What did the agent read?
What did it retrieve?
What tool did it call?
What scope was available at the time?
What approval, policy, or guardrail allowed the action?
What changed?
What downstream work started?
Where is each part recorded?
Can those records be joined later?

Then look for the gaps.

Where does the record flatten the actor chain?

Where does it show a tool call but not the authority behind it?

Where does it show access but not intent?

Where does it show a state change but not the prompt, retrieval path, approval, or policy decision that led there?

Where does the evidence disappear after 30, 60, or 90 days?

And the big one:

Who owns the receipt?

Do this before the workflow becomes business-critical. After that, every logging gap becomes archaeology.
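
The mapping exercise does not need a platform. A sketch like the one below is enough to start: one row per question, where the answer is recorded, and how long that system keeps it. The systems and retention numbers are invented for illustration, so substitute your own.

```python
from datetime import date, timedelta

# One row per question from the path map. recorded_in is None when nothing records it.
PATH_MAP = [
    {"question": "What kicked off the workflow?",
     "recorded_in": "ticketing", "retention_days": 365},
    {"question": "What identity executed it?",
     "recorded_in": "identity_provider", "retention_days": 90},
    {"question": "What did it retrieve?",
     "recorded_in": "agent_runtime", "retention_days": 30},
    {"question": "What approval, policy, or guardrail allowed the action?",
     "recorded_in": None, "retention_days": None},
    {"question": "What changed?",
     "recorded_in": "crm_audit", "retention_days": 180},
]


def unrecorded(path_map):
    """Questions that no system currently answers."""
    return [row["question"] for row in path_map if not row["recorded_in"]]


def receipt_expiry(path_map, event_date):
    """Return (date, system) for the first point at which part of the receipt expires.

    The receipt is only as durable as its shortest-lived recorded piece.
    Returns None if nothing is recorded at all.
    """
    recorded = [row for row in path_map if row["recorded_in"]]
    if not recorded:
        return None
    weakest = min(recorded, key=lambda row: row["retention_days"])
    return event_date + timedelta(days=weakest["retention_days"]), weakest["recorded_in"]


print(unrecorded(PATH_MAP))
# ['What approval, policy, or guardrail allowed the action?']
print(receipt_expiry(PATH_MAP, date(2025, 1, 1)))
# (datetime.date(2025, 1, 31), 'agent_runtime')
```

In this invented example the CRM remembers the change for six months, but the record of what the agent read is gone after thirty days. The answer to "why" expires long before the answer to "what."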

Attribution hygiene is part of the same mindset

This may sound like a content marketer detour, but it is not.

Attribution hygiene is the same operating instinct in a different system.

Clear data is usable. Murky data costs everyone time.

That is why every link in this article carries campaign context, internal and external. If I link to a company homepage, that team should be able to see the click came from TechThatMattRs. If I link to one of my own related posts, I should be able to reconstruct the content path later. If a reader follows the thread from agent logging to retrieval authorization to tool calling, the data should tell that story.

Same principle.

Receipts matter because they let people understand what happened after the fact.

The point

Agents are going to become part of normal business workflows.

That is not the issue.

The issue is whether the organization can explain what happened when one of those workflows matters.

If the customer record changed, the organization should not need a Slack archaeology dig to explain why.

If an agent can read, decide, call tools, update records, or trigger downstream work, then the receipt has to survive more than the demo.

It has to survive the incident review.

It has to survive the audit.

It has to survive the person asking, two weeks later, “Why did this happen?”

And if the answer is scattered across logs, prompts, traces, Slack threads, and hope, the workflow was never really governed.

It was just moving fast.

The speed is useful.

The receipt is what keeps the story honest.