Two Agents, Two Reputations: A Live Demo of Score Divergence
We built two agents with the same starting point but different engineering quality. Within hours, their trust scores diverged dramatically. Here is what happened and what it proves.
By Credian Team
The Experiment
We wanted to prove something simple: that the Credian scoring engine can distinguish between a well-built agent and a poorly built one, using only the behavioral data those agents produce. No manual labels. No human evaluation. Just events flowing in and a score coming out.
So we built two agents with identical capabilities but different engineering quality, pointed them at the same task queue, and watched what happened.
Agent Alpha: The Well-Built Agent
Agent Alpha was built with production best practices:
- Proper error handling with retries and exponential backoff
- Input validation before processing
- Graceful degradation when external services are unavailable
- Structured logging and health checks
- Timeout handling with clean abort semantics
In short, Agent Alpha was built the way you would build any production service. Nothing exotic, just competent engineering.
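As a rough illustration of the first bullet, retries with exponential backoff can be sketched in a few lines of Python. This is a minimal sketch, not Agent Alpha's actual code: `TransientError`, `call_with_retries`, and the default of 3 attempts with a 0.5 second base delay are all hypothetical names and values chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def call_with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    Returns the operation's result, or re-raises the last TransientError
    once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # The delay doubles on each retry: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice, then succeeds on the third attempt.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch easy to test; in a real agent you would leave the default and probably add jitter to the delay.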
Agent Beta: The Quick and Dirty Agent
Agent Beta had the same capabilities but was built with common shortcuts:
- No retry logic. If a request failed, it reported failure immediately.
- No input validation. Malformed inputs caused unhandled exceptions.
- No timeout handling. Long-running tasks could hang indefinitely.
- No health checks. The agent could be in a degraded state without anyone knowing.
- Occasional payment delays due to missing queue management.
Agent Beta is not unrealistic. It represents the kind of agent that gets built in a hackathon, deployed to production because it works most of the time, and left running until something breaks.
The Results
Both agents were registered on Credian, given identical initial conditions (starting score of 100), and pointed at the same task queue. Tasks included data processing, API calls, and simulated payment operations.
After 24 hours and approximately 500 events each:
Score Divergence at Hour 24
Agent Alpha: Overall Score 687, Confidence Medium
- Reliability: 780 (97.2% task completion)
- Financial: 720 (100% on-time payments)
- Identity: 560 (same registration data as Beta)
Agent Beta: Overall Score 312, Confidence Medium
- Reliability: 340 (78.4% task completion)
- Financial: 410 (2 late payments, 1 failure)
- Identity: 560 (same registration data as Alpha)
The identity scores are identical because both agents have the same registration completeness. The divergence is entirely in reliability and financial, which is exactly what you would expect: the scoring engine is measuring behavioral differences, not identity differences.
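The per-dimension gaps can be checked directly from the figures reported above. The snippet below only subtracts the published dimension scores; it makes no assumption about how Credian combines them into an overall score.

```python
# Dimension scores at hour 24, copied from the results above.
alpha = {"reliability": 780, "financial": 720, "identity": 560}
beta = {"reliability": 340, "financial": 410, "identity": 560}

# Difference per dimension (Alpha minus Beta).
gaps = {dim: alpha[dim] - beta[dim] for dim in alpha}
print(gaps)  # {'reliability': 440, 'financial': 310, 'identity': 0}
```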
What Caused the Divergence
Reliability gap (780 vs 340):
Agent Alpha's retry logic turned transient failures into successes. A request that failed on the first attempt but succeeded on the third was reported as task.completed. Agent Beta reported the same scenario as task.failed because it gave up after the first attempt.
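The difference in reported events can be reproduced with a small sketch. The event names `task.completed` and `task.failed` come from the description above; `run_task` and `make_flaky` are hypothetical helpers written for this example, and backoff is omitted for brevity.

```python
def run_task(operation, max_attempts):
    """Return the event name to report after up to max_attempts tries."""
    for _ in range(max_attempts):
        try:
            operation()
            return "task.completed"
        except Exception:
            continue  # this sketch treats every error as transient
    return "task.failed"


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    remaining = {"failures": failures_before_success}

    def operation():
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise ConnectionError("transient failure")
        return "done"

    return operation


# Same scenario for both agents: two transient failures, then success.
alpha_event = run_task(make_flaky(2), max_attempts=3)  # retries, like Alpha
beta_event = run_task(make_flaky(2), max_attempts=1)   # gives up, like Beta
print(alpha_event, beta_event)  # task.completed task.failed
```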
Agent Alpha's timeout handling meant that slow tasks were aborted cleanly and reported as task.timeout, which is a negative signal but less severe than an unhandled exception. Agent Beta's lack of timeout handling meant some tasks hung indefinitely and were eventually killed by the operating system, producing task.failed events.
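One way to get clean abort semantics in Python is `asyncio.wait_for`, which cancels the slow coroutine before raising a timeout. The sketch below is an assumption about how such handling could look, not Agent Alpha's implementation; `run_with_timeout` and the three demo coroutines are hypothetical.

```python
import asyncio


async def run_with_timeout(task_factory, timeout_s):
    """Await the coroutine from task_factory() and return the event name to report."""
    try:
        await asyncio.wait_for(task_factory(), timeout=timeout_s)
        return "task.completed"
    except asyncio.TimeoutError:
        # wait_for cancels the slow coroutine before raising, so nothing keeps running.
        return "task.timeout"
    except Exception:
        return "task.failed"


async def fast_task():
    await asyncio.sleep(0)


async def slow_task():
    await asyncio.sleep(1)  # longer than the timeout used below


async def broken_task():
    raise ValueError("malformed input")


events = [
    asyncio.run(run_with_timeout(factory, timeout_s=0.05))
    for factory in (fast_task, slow_task, broken_task)
]
print(events)  # ['task.completed', 'task.timeout', 'task.failed']
```

Without the `wait_for` wrapper, the slow task would simply keep running until something outside the agent killed it, which is the Agent Beta behavior described above.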
Financial gap (720 vs 410):
Agent Alpha processed payments synchronously and confirmed completion before reporting. Agent Beta queued payments without tracking confirmation, leading to 2 late payments (reported after the expected deadline) and 1 outright failure (payment that was initiated but never completed).
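The three payment outcomes described here (on time, late, failed) can be expressed as a small classifier. The function name and the outcome labels are hypothetical and are not Credian event names; the sketch assumes a payment counts as failed when no confirmation was ever recorded.

```python
from datetime import datetime, timedelta


def classify_payment(deadline, confirmed_at):
    """Classify a payment as 'on_time', 'late', or 'failed'.

    confirmed_at is None when the payment was initiated but never confirmed.
    """
    if confirmed_at is None:
        return "failed"
    return "on_time" if confirmed_at <= deadline else "late"


deadline = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary example deadline
outcomes = [
    classify_payment(deadline, deadline - timedelta(seconds=30)),  # confirmed early
    classify_payment(deadline, deadline + timedelta(minutes=5)),   # confirmed late
    classify_payment(deadline, None),                              # never confirmed
]
print(outcomes)  # ['on_time', 'late', 'failed']
```

An agent that queues payments without tracking confirmation has no reliable `confirmed_at` to check, which is how late and failed payments go unnoticed until they show up in the score.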
What This Proves
Three things:
1. Scoring works without human judgment.
Neither agent was manually evaluated. The scoring engine received events and produced scores that accurately reflected the quality difference. A platform querying these scores would correctly conclude that Agent Alpha is more trustworthy than Agent Beta.
2. Engineering quality is visible in behavioral data.
You might expect that distinguishing "well built" from "poorly built" requires code review. It does not. The downstream effects of engineering quality (completion rates, timeout handling, payment reliability) are visible in operational data. The scoring engine captures these effects without ever seeing a line of code.
3. Adaptive weighting handles the cold start correctly.
In the first hour, both agents had nearly identical scores because the reliability and financial dimensions had not yet activated (fewer than 3 events each). By hour 3, both dimensions were active and the scores began diverging. By hour 24, the scores reflected a clear and justified quality difference.
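The activation rule can be sketched as follows. Only the threshold of 3 events comes from the description above. Treating identity as always active (because it derives from registration data) is an assumption, and the sketch deliberately does not model how Credian weights the active dimensions, since that formula is not given here.

```python
MIN_EVENTS = 3  # activation threshold stated in the text


def active_dimensions(event_counts, always_active=("identity",)):
    """Return the dimensions that would contribute to a score.

    event_counts maps a dimension name to the number of events seen so far.
    Dimensions in always_active are included regardless of event count
    (an assumption made for this sketch).
    """
    active = list(always_active)
    for name, count in event_counts.items():
        if name not in active and count >= MIN_EVENTS:
            active.append(name)
    return active


first_hour = active_dimensions({"reliability": 2, "financial": 0})
hour_three = active_dimensions({"reliability": 40, "financial": 3})
print(first_hour)  # ['identity']
print(hour_three)  # ['identity', 'reliability', 'financial']
```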
Running the Demo Yourself
The demo script is available in our examples directory. It simulates the dual-agent scenario described above, running both agents against a simulated task queue and plotting the score divergence over time.
```shell
# Clone the repo and run the demo
git clone https://github.com/ZaneTech-Credian/credian.git
cd credian/examples/demo
python3 demo.py
```
The script registers two agents, generates realistic event streams with different quality profiles, reports events to the Credian API, and prints score comparisons at regular intervals. It takes about 5 minutes to run a compressed version of the 24-hour experiment.
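To give a feel for the event generation step, here is a minimal sketch of producing event streams with different quality profiles. It is not taken from `demo.py`: `generate_events` is a hypothetical helper, the completion rates are borrowed from the hour 24 results above, and registration and reporting to the Credian API are left out because those calls are not shown in this post.

```python
import random


def generate_events(success_rate, n_events, seed):
    """Simulate n_events task outcomes for an agent with the given success rate."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same stream
    return [
        "task.completed" if rng.random() < success_rate else "task.failed"
        for _ in range(n_events)
    ]


# Quality profiles taken from the completion rates reported in the results.
alpha_events = generate_events(success_rate=0.972, n_events=500, seed=1)
beta_events = generate_events(success_rate=0.784, n_events=500, seed=2)

for name, events in (("alpha", alpha_events), ("beta", beta_events)):
    completed = events.count("task.completed")
    print(f"{name}: {completed}/{len(events)} tasks completed")
```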
Implications for Agent Builders
If you are building AI agents, your engineering choices are visible, not just to your users but to any platform that queries your agent's trust score. Retry logic, error handling, timeout management, and payment reliability are not invisible infrastructure decisions. They are reputation-building activities.
The agents that invest in operational excellence will build higher trust scores faster. The agents that cut corners will find their scores reflecting those shortcuts. And as platforms increasingly use trust scores to gate access and set limits, the financial impact of engineering quality will become direct and measurable.
See where your agent stands: run npx credian score to see your current score and full dimension breakdown.