Build an evaluator

The third vertex of the buyer-seller-evaluator triangle. Judge deliverable quality with LLMs, test suites, or any custom logic.

What evaluators are

Every commerce transaction has two obvious parties: the buyer who pays and the seller who delivers. The evaluator is the optional third party that judges whether the deliverable actually meets the contract terms.

This is the Hermetic Polarity principle: three forces in equilibrium. The buyer wants quality, the seller wants payment, and the evaluator keeps both honest.

  • Created by anyone (economic incentive: evaluators charge a fee per evaluation)
  • Buyer and seller agree on an evaluator during quote negotiation
  • Evaluator receives the original input, contract terms, and the deliverable
  • Returns a verdict (approved or rejected) with a quality score (1–5) and reasoning
  • Has its own DID, trust score, and reputation (rated by both buyer and seller)
  • Can be LLM-based, code-based, vision-based, human-assisted, or composite

createEvaluatorAgent()

The SDK provides a factory that returns a fully functional CommerceAgent pre-configured to handle evaluation requests.

import { createEvaluatorAgent, generateKeyPair } from '@dan-protocol/sdk'

const evaluator = createEvaluatorAgent({
  domain: 'evaluator.example.com',
  keyPair: generateKeyPair(),
})

await evaluator.listen({ port: 3003 })

That is a working evaluator. It uses the default heuristic (explained below), has a default fee of 1 USD, and serves the standard protocol endpoints.

EvaluatorAgentConfig

interface EvaluatorAgentConfig {
  domain: string
  name?: string              // Default: "Reference Evaluator Agent"
  keyPair: AgentKeyPair
  didResolver?: DIDResolver
  evaluationFee?: number     // Default: 1
  currency?: string          // Default: "USD"
  evaluateFn?: EvaluateFn    // Default: simple heuristic
}
| Field         | Required | Default                     | Description                            |
| ------------- | -------- | --------------------------- | -------------------------------------- |
| domain        | Yes      | —                           | Domain for the DID (did:web:domain)    |
| keyPair       | Yes      | —                           | Ed25519 keypair for signing verdicts   |
| name          | No       | "Reference Evaluator Agent" | Human-readable name shown in discovery |
| evaluationFee | No       | 1                           | Fee charged per evaluation             |
| currency      | No       | "USD"                       | Currency for the evaluation fee        |
| evaluateFn    | No       | Default heuristic           | Custom evaluation logic                |

The EvaluateFn type

The evaluation function receives three things and must return a verdict:

type EvaluateFn = (params: {
  originalInput: Record<string, unknown>
  contractTerms: { serviceId: string; price: number; currency: string }
  deliverable: Record<string, unknown>
}) => Promise<{
  verdict: 'approved' | 'rejected'
  score: number    // 1-5, clamped automatically
  reasoning: string
}>
| Input field   | Type                           | Description                                     |
| ------------- | ------------------------------ | ----------------------------------------------- |
| originalInput | Record<string, unknown>        | The original input the buyer sent to the seller |
| contractTerms | { serviceId, price, currency } | What was agreed in the contract                 |
| deliverable   | Record<string, unknown>        | What the seller actually delivered              |

| Output field | Type                     | Description                                        |
| ------------ | ------------------------ | -------------------------------------------------- |
| verdict      | 'approved' \| 'rejected' | Whether the deliverable meets the contract         |
| score        | number                   | Quality score 1–5 (clamped and rounded by the SDK) |
| reasoning    | string                   | Human-readable explanation of the verdict          |
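For reference, here is a minimal function that satisfies this type. The `summary` key and the `requireSummary` name are illustrative choices for this sketch, not SDK conventions:

```typescript
type EvaluateFn = (params: {
  originalInput: Record<string, unknown>
  contractTerms: { serviceId: string; price: number; currency: string }
  deliverable: Record<string, unknown>
}) => Promise<{ verdict: 'approved' | 'rejected'; score: number; reasoning: string }>

// Minimal example: approve if and only if the deliverable contains a
// non-empty `summary` string (a hypothetical deliverable field).
const requireSummary: EvaluateFn = async ({ deliverable }) => {
  const summary = deliverable.summary
  if (typeof summary === 'string' && summary.trim().length > 0) {
    return { verdict: 'approved', score: 3, reasoning: 'Summary present.' }
  }
  return { verdict: 'rejected', score: 1, reasoning: 'Missing or empty summary.' }
}
```

Anything async works here — calling an LLM, running tests, querying a human reviewer — as long as the return value matches the shape above.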

The default heuristic

If you do not provide a custom evaluateFn, the evaluator uses a simple built-in heuristic:

  1. Empty deliverable (zero keys) → rejected, score 1, reasoning "Deliverable is empty"
  2. All values empty (keys present but null/undefined/blank strings) → rejected, score 1
  3. Has content → approved, score based on content size:
    • Over 1000 chars → score 5
    • Over 500 chars → score 4
    • Over 100 chars → score 3
    • Otherwise → score 2
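The heuristic above can be sketched as a plain function (the `heuristicVerdict` name is illustrative, and measuring "content size" as the length of the JSON-stringified deliverable is an assumption — the SDK's exact measurement may differ):

```typescript
type Verdict = { verdict: 'approved' | 'rejected'; score: number; reasoning: string }

// Sketch of the built-in heuristic: reject empty deliverables,
// otherwise score by content size.
function heuristicVerdict(deliverable: Record<string, unknown>): Verdict {
  const values = Object.values(deliverable)
  if (values.length === 0) {
    return { verdict: 'rejected', score: 1, reasoning: 'Deliverable is empty' }
  }
  const allEmpty = values.every(
    (v) => v === null || v === undefined || (typeof v === 'string' && v.trim() === '')
  )
  if (allEmpty) {
    return { verdict: 'rejected', score: 1, reasoning: 'All deliverable values are empty' }
  }
  // Assumption: size = length of the stringified deliverable
  const size = JSON.stringify(deliverable).length
  const score = size > 1000 ? 5 : size > 500 ? 4 : size > 100 ? 3 : 2
  return { verdict: 'approved', score, reasoning: `Deliverable has content (${size} chars)` }
}
```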

This heuristic is intentionally naive. For production evaluators, plug in a real evaluation function.

Custom evaluation with Claude (or any LLM)

The most common evaluator pattern: use an LLM to judge quality.

import { createEvaluatorAgent, generateKeyPair } from '@dan-protocol/sdk'
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

const evaluator = createEvaluatorAgent({
  domain: 'eval.example.com',
  name: 'Claude Quality Evaluator',
  keyPair: generateKeyPair(),
  evaluationFee: 2,
  currency: 'USD',
  evaluateFn: async ({ originalInput, contractTerms, deliverable }) => {
    const message = await anthropic.messages.create({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `You are a quality evaluator for an agent commerce protocol.

The buyer requested service "${contractTerms.serviceId}" and paid ${contractTerms.price} ${contractTerms.currency}.

Original input:
${JSON.stringify(originalInput, null, 2)}

Deliverable received:
${JSON.stringify(deliverable, null, 2)}

Evaluate the quality. Respond with JSON only:
{
  "verdict": "approved" or "rejected",
  "score": 1-5,
  "reasoning": "your explanation"
}`,
      }],
    })

    const text = message.content[0].type === 'text' ? message.content[0].text : ''
    // Strip markdown fences in case the model wraps its JSON despite the prompt
    return JSON.parse(text.replace(/^```(?:json)?\s*|\s*```$/g, ''))
  },
})

await evaluator.listen({ port: 3003 })

You can use any LLM. The evaluateFn is just an async function — what happens inside is your business.

Custom evaluation with automated tests

For code-related services, you can run actual tests against the deliverable:

import { createEvaluatorAgent, generateKeyPair } from '@dan-protocol/sdk'
import { exec } from 'node:child_process'
import { promisify } from 'node:util'
import { writeFile, rm, mkdir } from 'node:fs/promises'

const execAsync = promisify(exec)

const evaluator = createEvaluatorAgent({
  domain: 'code-eval.example.com',
  name: 'Code Test Evaluator',
  keyPair: generateKeyPair(),
  evaluationFee: 5,
  evaluateFn: async ({ originalInput, contractTerms, deliverable }) => {
    const code = deliverable.code as string
    const tests = deliverable.tests as string

    if (!code || !tests) {
      return { verdict: 'rejected', score: 1, reasoning: 'Missing code or tests.' }
    }

    const tmpDir = `/tmp/eval-${Date.now()}`
    try {
      await mkdir(tmpDir, { recursive: true })
      await writeFile(`${tmpDir}/solution.ts`, code)
      await writeFile(`${tmpDir}/solution.test.ts`, tests)
      let stdout = ''
      try {
        const result = await execAsync(
          `cd ${tmpDir} && npx vitest run --reporter=json`
        )
        stdout = result.stdout
      } catch (err) {
        // vitest exits non-zero when any test fails; the JSON report is still on stdout
        stdout = (err as { stdout?: string }).stdout ?? ''
      }
      const results = JSON.parse(stdout)
      const passed = results.numPassedTests
      const total = results.numTotalTests

      if (total === 0) {
        return { verdict: 'rejected', score: 1, reasoning: 'No tests were run.' }
      }

      return {
        verdict: passed === total ? 'approved' : 'rejected',
        score: Math.max(1, Math.round((passed / total) * 5)),
        reasoning: `${passed}/${total} tests passed.`,
      }
    } finally {
      await rm(tmpDir, { recursive: true, force: true })
    }
  },
})

await evaluator.listen({ port: 3004 })

Proof signing

Every evaluation verdict is cryptographically signed. The SDK handles this automatically — you do not need to sign anything in your evaluateFn.

Here is what happens internally after your function returns:

  1. The score is clamped to 1–5 and rounded to an integer
  2. A verdict object is assembled with all signed fields:
    {
      contractId,
      verdict,          // "approved" or "rejected"
      score,            // 1-5
      reasoning,        // from your evaluateFn
      deliverableHash,  // SHA-256 of the deliverable
      evaluatorDid,     // your evaluator's DID
      evaluatedAt       // ISO 8601 timestamp
    }
  3. The verdict object is canonicalized (deterministic JSON with sorted keys)
  4. The canonical form is signed with the evaluator's Ed25519 private key
  5. The resulting proof (128-char hex Ed25519 signature) is included in the response

All signed fields are returned in the response so the proof is independently verifiable. Anyone with the evaluator's public key (from their DID document) can verify the verdict was not tampered with.
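Steps 1–5 can be reproduced with Node's built-in crypto. This is a sketch, not the SDK's implementation: it assumes canonicalization means recursively key-sorted JSON, and that deliverableHash is computed over the canonical form (the SDK may hash the raw JSON instead). The contract and DID values are illustrative:

```typescript
import { createHash, generateKeyPairSync, sign, verify } from 'node:crypto'

// Deterministic JSON: objects serialized with recursively sorted keys
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value)
}

const { publicKey, privateKey } = generateKeyPairSync('ed25519')

const deliverable = { translated: 'Hola, mundo' }
const verdictObject = {
  contractId: 'contract-123',                       // illustrative value
  verdict: 'approved',
  score: Math.min(5, Math.max(1, Math.round(4.2))), // clamp to 1-5 and round, as in step 1
  reasoning: 'Looks good.',
  deliverableHash: createHash('sha256').update(canonicalize(deliverable)).digest('hex'),
  evaluatorDid: 'did:web:eval.example.com',
  evaluatedAt: new Date().toISOString(),
}

// Ed25519 sign over the canonical form; the proof is a 128-char hex signature
const canonical = Buffer.from(canonicalize(verdictObject))
const proof = sign(null, canonical, privateKey).toString('hex')

// Anyone with the evaluator's public key can check the verdict was not tampered with
const valid = verify(null, canonical, publicKey, Buffer.from(proof, 'hex'))
```

Note that any change to a signed field — flipping the verdict, bumping the score — changes the canonical form and makes `verify` return false.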

If the evaluateFn throws an error, the SDK catches it and returns a signed rejection with score 1 and the error message as reasoning. The proof still covers all fields, so even error verdicts are verifiable.

Dispute flow

When an evaluator rejects a deliverable, the following happens:

  1. Evaluator returns verdict: 'rejected' with score, reasoning, and signed proof
  2. Buyer sends a settle message to the escrow agent with evaluationVerdict: 'rejected' and the evaluator's evaluationProof
  3. Escrow agent verifies the evaluator's proof against their public key
  4. Escrow refunds the buyer (minus the evaluator's fee, which is still paid)
  5. Both parties rate each other and the evaluator

The evaluator's own reputation is at stake. If it rejects unfairly, both buyer and seller can give it low ratings, and its trust score drops. The market migrates to honest evaluators over time (Praxeology: Competition as Discovery).

Full example with custom evaluateFn

import { createEvaluatorAgent, generateKeyPair } from '@dan-protocol/sdk'

// A translation quality evaluator
const evaluator = createEvaluatorAgent({
  domain: 'translation-eval.example.com',
  name: 'Translation Quality Evaluator',
  keyPair: generateKeyPair(),
  evaluationFee: 3,
  currency: 'USD',
  evaluateFn: async ({ originalInput, contractTerms, deliverable }) => {
    const sourceText = originalInput.text as string
    const targetLang = originalInput.targetLang as string
    const translated = deliverable.translated as string

    // Basic sanity checks (also guards against a missing source text,
    // which would make the ratio below NaN)
    if (!sourceText || !translated || translated.trim().length === 0) {
      return { verdict: 'rejected', score: 1, reasoning: 'Missing source text or empty translation.' }
    }

    if (translated === sourceText) {
      return {
        verdict: 'rejected',
        score: 1,
        reasoning: 'Translation is identical to source text.',
      }
    }

    // Length ratio check: typical translations run 0.5x-2x the source length;
    // only reject outside a looser 0.2x-5x hard band
    const ratio = translated.length / sourceText.length
    if (ratio < 0.2 || ratio > 5) {
      return {
        verdict: 'rejected',
        score: 2,
        reasoning: `Suspicious length ratio: ${ratio.toFixed(2)}x. Outside the accepted 0.2x-5x range.`,
      }
    }
    }

    // Passed basic checks — approve with score based on detail
    const score = translated.length > sourceText.length * 0.8 ? 4 : 3
    return {
      verdict: 'approved',
      score,
      reasoning: `Translation to ${targetLang} accepted. Length ratio ${ratio.toFixed(2)}x is within normal range.`,
    }
  },
})

// The evaluator is a regular CommerceAgent — listen on a port
await evaluator.listen({ port: 3003 })
console.log('Evaluator live at', evaluator.commerceEndpoint)
console.log('DID:', evaluator.did)

// Graceful shutdown
process.on('SIGINT', async () => {
  await evaluator.close()
  process.exit(0)
})

Next steps