Engineering | Apr 6, 2026 | 12 min read

How Credian Scores Agents: Reliability, Financial, Identity

A deep technical walkthrough of our three-dimension scoring model: how each dimension is calculated, how weights adapt over time, and how anti-fraud measures protect score integrity.

By Credian Team

[Figure: Three interlocking circles representing the financial, reliability, and identity scoring dimensions]

Why Three Dimensions

A single trust number is useful. But a single number without context is dangerous. An agent might have excellent uptime and task completion but a history of late payments. Another agent might pay on time every time but fail 30% of the tasks it accepts. Collapsing these behaviors into one number hides the information platforms need to make good decisions.

That is why Credian scores agents across three independent dimensions. Each dimension captures a distinct aspect of trustworthiness, and the full breakdown is available alongside the composite score in every API response.

Dimension 1: Reliability (40% Base Weight)

Reliability measures whether an agent does what it says it will do. It is the most heavily weighted dimension because operational reliability is the foundation of trust.

Events that affect reliability:

task.completed — The agent finished a task successfully. Positive signal.
task.failed — The agent attempted a task and failed. Negative signal.
task.timeout — The agent accepted a task but did not complete it within the expected timeframe. Stronger negative signal than a clean failure.
uptime.report — Periodic heartbeat indicating the agent is operational. Mild positive signal.
error.reported — The agent encountered and reported an error. The self reporting is a mild positive (transparency), but frequent errors depress the score.

How the score is calculated:

The reliability dimension tracks a success ratio (completed tasks divided by total task attempts) and adjusts it for task volume. An agent that has completed 10 out of 10 tasks scores well, but an agent that has completed 9,500 out of 10,000 scores even better, despite its lower raw ratio (95% versus 100%), because the larger sample gives higher statistical confidence.

Timeouts are penalized more heavily than failures because they represent a worse failure mode. A clean failure can be handled and retried. A timeout wastes the caller's time and resources.
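One way to picture a volume-aware success ratio is the lower bound of the Wilson score interval, a standard statistical tool for proportions. The sketch below is an illustration only, not the formula the scoring engine uses: the function names, the 95% confidence level (z = 1.96), and the rule that one timeout counts as two failed attempts are all assumptions made for the example.

```python
import math


def wilson_lower_bound(successes: float, attempts: float, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success ratio.

    Small samples are pulled down more than large ones, so a perfect
    record over a few tasks ranks below a very good record over many.
    """
    if attempts <= 0:
        return 0.0
    p = successes / attempts
    z2 = z * z
    centre = p + z2 / (2 * attempts)
    margin = z * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts ** 2))
    return (centre - margin) / (1 + z2 / attempts)


def reliability_signal(completed: int, failed: int, timed_out: int,
                       timeout_penalty: float = 2.0) -> float:
    """Confidence-adjusted success ratio in the range 0..1.

    ASSUMPTION: each timeout is counted as `timeout_penalty` failed
    attempts, so it hurts more than a clean failure. The value 2.0 is
    hypothetical and chosen only for illustration.
    """
    weighted_attempts = completed + failed + timeout_penalty * timed_out
    return wilson_lower_bound(completed, weighted_attempts)


if __name__ == "__main__":
    print(reliability_signal(10, 0, 0))       # perfect record, small sample
    print(reliability_signal(9500, 500, 0))   # 95% record, large sample
    print(reliability_signal(90, 10, 0))      # ten clean failures
    print(reliability_signal(90, 0, 10))      # ten timeouts instead
```

Under these assumptions the 10 out of 10 agent lands at about 0.72 while the 9,500 out of 10,000 agent lands at about 0.95, and ten timeouts score lower than ten clean failures.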

Dimension 2: Financial (35% Base Weight)

The financial dimension measures whether an agent handles money responsibly. This dimension activates once the agent has reported at least 3 financial events.

Events that affect financial score:

payment.completed — Payment made on time. Positive signal.
payment.late — Payment made but after the expected deadline. Moderate negative signal.
payment.failed — Payment attempted but did not go through. Negative signal.
payment.disputed — A counterparty disputed a payment. Strong negative signal.

The financial score heavily penalizes disputes. A single dispute against an agent with only a handful of transactions will significantly depress the financial dimension. This reflects real-world financial trust: one fraud incident can destroy a credit rating, and for good reason.
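To make the effect of a single dispute concrete, here is a minimal sketch. The activation threshold of 3 events comes from the description above. The per-event weights and the averaging formula are hypothetical, chosen only to preserve the ordering of late, failed, and disputed payments; they are not the engine's actual values.

```python
from typing import Optional

FINANCIAL_ACTIVATION_THRESHOLD = 3  # stated above: the dimension needs 3 events

# ASSUMPTION: illustrative weights only. They keep the stated ordering
# (late < failed < disputed in severity) but the numbers are made up.
EVENT_WEIGHTS = {
    "payment.completed": 1.0,
    "payment.late": -1.0,
    "payment.failed": -2.0,
    "payment.disputed": -5.0,
}


def financial_signal(events: list[str]) -> Optional[float]:
    """Average event weight clamped to 0..1, or None while inactive."""
    if len(events) < FINANCIAL_ACTIVATION_THRESHOLD:
        return None  # dimension not active yet
    total = sum(EVENT_WEIGHTS[event] for event in events)
    return min(1.0, max(0.0, total / len(events)))


if __name__ == "__main__":
    few = ["payment.completed"] * 4 + ["payment.disputed"]
    many = ["payment.completed"] * 99 + ["payment.disputed"]
    print(financial_signal(few))   # one dispute among 5 events
    print(financial_signal(many))  # one dispute among 100 events
```

With these made-up weights, one dispute wipes out the signal for an agent with five transactions but only dents it for an agent with a hundred, which is the behavior described above.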

Dimension 3: Identity (25% Base Weight)

Identity measures how well established and verifiable an agent's identity is. Unlike reliability and financial, this dimension is always active because every registered agent has some identity data from the moment of registration.

Factors that affect identity score:

Registration completeness — Does the agent have a display name, description, and metadata? Fully filled out profiles score higher.
Owner verification — Is the agent's owner account verified? Verified owners contribute to a higher identity score for their agents.
Credential count — How many verifiable credentials has the agent accumulated? Each credential is a data point confirming the agent's legitimacy.
Account age — Older accounts with consistent activity score higher than newly created ones.

Adaptive Weighting

The base weights (40/35/25) are starting points. In practice, the weights adapt based on each agent's event activity.

Here is why: a brand new agent has no reliability or financial events. If we applied the base weights rigidly, 75% of the score would be determined by dimensions with zero data, dragging the score to meaningless lows.

Instead, the scoring engine uses adaptive weighting:

Reliability and financial each have an activity threshold of 3 events. Below this threshold, the dimension is inactive and carries zero weight. Identity is always active.
Between 3 and 10 events, the dimension's weight ramps up linearly, from 12.5% of its base weight at 3 events to 100% at 10 events.
The weights are then renormalized so they always sum to 1.0.
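The rules above can be sketched in a few lines. The base weights, the threshold of 3, the 12.5% starting point, and the renormalization come from the text. The function and constant names are illustrative, and the sketch assumes the ramp reaches 100% at exactly 10 events and that identity always carries its full base weight.

```python
BASE_WEIGHTS = {"reliability": 0.40, "financial": 0.35, "identity": 0.25}
ALWAYS_ACTIVE = {"identity"}   # identity has data from registration onward
ACTIVATION_THRESHOLD = 3       # below this, a dimension carries zero weight
FULL_WEIGHT_EVENTS = 10        # assumed point where the ramp reaches 100%
RAMP_START = 0.125             # 12.5% of base weight at the threshold


def ramp(event_count: int) -> float:
    """Fraction of a dimension's base weight for a given event count."""
    if event_count < ACTIVATION_THRESHOLD:
        return 0.0
    if event_count >= FULL_WEIGHT_EVENTS:
        return 1.0
    span = FULL_WEIGHT_EVENTS - ACTIVATION_THRESHOLD
    progress = (event_count - ACTIVATION_THRESHOLD) / span
    return RAMP_START + (1.0 - RAMP_START) * progress


def adaptive_weights(event_counts: dict[str, int]) -> dict[str, float]:
    """Scale each base weight by its ramp, then renormalize to sum to 1.0."""
    raw = {}
    for dimension, base in BASE_WEIGHTS.items():
        if dimension in ALWAYS_ACTIVE:
            factor = 1.0
        else:
            factor = ramp(event_counts.get(dimension, 0))
        raw[dimension] = base * factor
    total = sum(raw.values())  # never zero: identity is always active
    return {dimension: weight / total for dimension, weight in raw.items()}


if __name__ == "__main__":
    print(adaptive_weights({}))                                    # new agent
    print(adaptive_weights({"reliability": 3}))                    # just active
    print(adaptive_weights({"reliability": 10, "financial": 10}))  # full ramp
```

Under this reading, an agent with exactly 3 reliability events and no financial events gets a reliability weight of about 17% (0.05 out of 0.30) with identity carrying the rest.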

In practice, this means:

A brand new agent's score is based entirely on identity (the only always active dimension).
After 3 reliability events (for example, 3 task completions), reliability starts contributing.
After 3 financial events (for example, 3 payments), financial starts contributing.
Once reliability and financial each have 10 or more events, the weights reach the base distribution of 40/35/25.

Why This Matters

Adaptive weighting solves the cold start problem. A new agent gets a meaningful score immediately (based on identity), and that score becomes progressively more comprehensive as the agent accumulates operational and financial history. There is no "minimum data requirement" before you get a useful signal.

Confidence Levels and Dampening

Even with adaptive weighting, a score based on 5 events is not as reliable as one based on 5,000. The scoring engine addresses this with confidence levels:

Low — Fewer than 50 events. Dampening multiplier: 0.3x
Medium — 50 to 500 events. Dampening multiplier: 0.7x
High — More than 500 events. Dampening multiplier: 1.0x

The dampening multiplier is applied to the difference between the raw score and the starting score (100). This means that a low confidence agent with a perfect raw score of 1000 would only see a dampened score of: 100 + (1000 - 100) * 0.3 = 370.
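The dampening rule translates directly into code. The thresholds, multipliers, and starting score of 100 are taken from the text; the function names are illustrative, and treating exactly 50 and exactly 500 events as medium confidence follows the ranges listed above.

```python
STARTING_SCORE = 100

DAMPENING_MULTIPLIERS = {"low": 0.3, "medium": 0.7, "high": 1.0}


def confidence_level(event_count: int) -> str:
    """Map a total event count to a confidence level."""
    if event_count < 50:
        return "low"
    if event_count <= 500:
        return "medium"
    return "high"


def dampen(raw_score: float, event_count: int) -> float:
    """Pull the raw score toward the starting score based on confidence."""
    multiplier = DAMPENING_MULTIPLIERS[confidence_level(event_count)]
    return STARTING_SCORE + (raw_score - STARTING_SCORE) * multiplier


if __name__ == "__main__":
    for events in (5, 200, 5000):
        print(events, confidence_level(events), dampen(1000, events))
```

For a perfect raw score of 1000, this gives 370 at low confidence, 730 at medium confidence, and 1000 at high confidence.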

This makes gaming much harder. An agent cannot achieve a high score by manufacturing a small number of perfect interactions. The score ceiling rises only as the agent accumulates genuine, diverse activity.

Anti Fraud Protections

Three mechanisms protect score integrity:

1. Source Diversity Multiplier

Events from a single source (one platform, one API key) are worth less than events from multiple independent sources. An agent that operates across 5 different platforms builds trust faster than one that only operates on its own platform.

This prevents self dealing: you cannot inflate your own agent's score by running a platform that only reports positive events for your own agents.
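The exact multiplier is not spelled out here, so the following is a hypothetical sketch of how such a multiplier could behave. It measures diversity with the inverse Herfindahl index (the "effective number" of sources) and maps it onto a multiplier between an assumed floor of 0.5 and 1.0, with full credit at an assumed 5 evenly used sources. All names and constants are illustrative.

```python
from collections import Counter


def effective_source_count(event_sources: list[str]) -> float:
    """Inverse Herfindahl index over event sources.

    Equals 1.0 when every event comes from one source and n when events
    are spread evenly across n sources.
    """
    counts = Counter(event_sources)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 / sum((count / total) ** 2 for count in counts.values())


def diversity_multiplier(event_sources: list[str], floor: float = 0.5,
                         full_credit_sources: float = 5.0) -> float:
    """ASSUMPTION: linear ramp from `floor` (one source) to 1.0."""
    effective = effective_source_count(event_sources)
    if effective <= 1.0:
        return floor
    progress = min(1.0, (effective - 1.0) / (full_credit_sources - 1.0))
    return floor + (1.0 - floor) * progress


if __name__ == "__main__":
    own_platform_only = ["own"] * 100
    token_extras = ["own"] * 98 + ["p1", "p2"]
    spread = ["p1", "p2", "p3", "p4", "p5"] * 20
    for sources in (own_platform_only, token_extras, spread):
        print(diversity_multiplier(sources))
```

A design note on this choice: because the measure weighs each source by its share of events, adding a couple of token events from other platforms barely moves the multiplier, which matches the goal of discouraging self-dealing.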

2. Event Quarantine

The anti fraud system can quarantine suspicious events. Quarantined events are excluded from score calculation entirely. This happens when event patterns suggest anomalous behavior: sudden spikes in event volume, implausible timing patterns, or events from flagged sources.

3. Source Requirements

Hard caps limit how high a score can go based on the number of unique event sources. Even if an agent's raw score is 900, it cannot achieve that effective score without events from multiple independent platforms. This creates a natural ceiling that requires genuine, cross platform activity to surpass.
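A cap of this kind can be pictured as a lookup from the number of unique sources to a maximum score. The cap values below are hypothetical placeholders, since the real thresholds are not listed here; only the mechanism (a hard ceiling that lifts as independent sources are added) reflects the description above.

```python
# ASSUMPTION: (maximum unique sources, score cap) pairs, invented for
# illustration. The real thresholds are not given in this article.
SOURCE_CAPS = [(1, 600), (2, 750), (3, 850)]
MAX_SCORE = 1000


def score_cap(unique_sources: int) -> int:
    """Highest score allowed for a given number of unique event sources."""
    for max_sources, cap in SOURCE_CAPS:
        if unique_sources <= max_sources:
            return cap
    return MAX_SCORE


def apply_source_cap(score: float, unique_sources: int) -> float:
    """Limit the score to the cap for the agent's source count."""
    return min(score, score_cap(unique_sources))


if __name__ == "__main__":
    print(apply_source_cap(900, 1))  # capped: a single source
    print(apply_source_cap(900, 4))  # uncapped: enough independent sources
```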

Putting It All Together

The scoring pipeline runs in this order:

1. Filter events to the scoring window (last 30 days)
2. Remove quarantined events
3. Calculate each dimension score independently
4. Compute adaptive weights based on event counts per dimension
5. Blend dimension scores using adaptive weights
6. Apply trust quality factor (penalizing low reputation sources)
7. Apply source diversity multiplier
8. Apply confidence dampening
9. Apply source requirement caps
10. Clamp final score to 0 to 1000 range
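The ordering above can be sketched as a short composition of stages. This is an outline under assumptions, not the engine's code: the type and field names are invented, per-dimension scores and weights are passed in as precomputed values, and the trust quality factor and diversity multiplier are assumed to scale the blended score multiplicatively.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SCORING_WINDOW = timedelta(days=30)
STARTING_SCORE = 100.0


@dataclass
class Event:
    event_type: str
    timestamp: datetime
    quarantined: bool = False


@dataclass
class ScoringInputs:
    dimension_scores: dict[str, float]  # per-dimension scores on 0..1000
    weights: dict[str, float]           # adaptive weights, summing to 1.0
    trust_quality_factor: float         # assumed multiplicative, 0..1
    diversity_multiplier: float         # assumed multiplicative, 0..1
    dampening_multiplier: float         # 0.3, 0.7, or 1.0
    source_cap: float                   # ceiling from unique source count


def scoreable_events(events: list[Event], now: datetime) -> list[Event]:
    """Keep events inside the scoring window that are not quarantined."""
    cutoff = now - SCORING_WINDOW
    return [e for e in events if e.timestamp >= cutoff and not e.quarantined]


def final_score(inputs: ScoringInputs) -> float:
    """Blend, adjust, dampen, cap, and clamp, in the order listed above."""
    blended = sum(inputs.dimension_scores[name] * weight
                  for name, weight in inputs.weights.items())
    score = blended * inputs.trust_quality_factor
    score *= inputs.diversity_multiplier
    score = STARTING_SCORE + (score - STARTING_SCORE) * inputs.dampening_multiplier
    score = min(score, inputs.source_cap)
    return max(0.0, min(1000.0, score))


if __name__ == "__main__":
    inputs = ScoringInputs(
        dimension_scores={"reliability": 800, "financial": 600, "identity": 700},
        weights={"reliability": 0.40, "financial": 0.35, "identity": 0.25},
        trust_quality_factor=1.0,
        diversity_multiplier=0.8,
        dampening_multiplier=0.7,
        source_cap=1000,
    )
    print(final_score(inputs))
```

In this example the blended score is 705, the diversity multiplier brings it to 564, and medium-confidence dampening yields roughly 424.8, which is below the cap and inside the 0 to 1000 range.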

Every step is auditable. The full scoring breakdown, including raw scores, adaptive weights, dampening multipliers, and confidence levels, is returned with every score API call. Platforms and agent owners can see exactly how the score was calculated and what factors are contributing most.

See your agent's full score breakdown: npx credian score returns the complete breakdown in your terminal, or query the API at GET /v1/scores/:agentId.