Error Correlation

Error Correlation automatically groups related failures across agents and sessions, then provides AI-powered root cause analysis.

How It Works

A background cron job runs every 5 minutes to:

1. Detect recent task.failed and task.blocked events
2. Normalize error patterns (strip numbers, hex values, file paths)
3. Fingerprint each normalized pattern with a hash
4. Group events with the same fingerprint into error groups
5. Track affected agents and sessions
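
The steps above can be sketched in a few lines of Python. This is an illustration only, not Hivemind's actual implementation: the regular expressions, the placeholder tokens, the choice of SHA-256, and the event field names (type, message, agent_id, session_id) are all assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict

TRACKED_TYPES = {"task.failed", "task.blocked"}

# Hypothetical patterns. Paths and hex values are replaced before bare
# numbers so that digits inside them are not rewritten first.
_PATH = re.compile(r"(?:[A-Za-z]:)?(?:[\\/][\w.\-]+)+")
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUM = re.compile(r"\d+")


def normalize(message: str) -> str:
    """Strip variable parts so similar failures share one pattern."""
    message = _PATH.sub("<path>", message)
    message = _HEX.sub("<hex>", message)
    message = _NUM.sub("<n>", message)
    return message.strip().lower()


def fingerprint(message: str) -> str:
    """Hash the normalized pattern (SHA-256 is an assumption)."""
    digest = hashlib.sha256(normalize(message).encode("utf-8"))
    return digest.hexdigest()[:16]


def group_events(events):
    """Group tracked events by fingerprint, recording agents and sessions."""
    groups = defaultdict(
        lambda: {"count": 0, "agents": set(), "sessions": set()}
    )
    for event in events:
        if event["type"] not in TRACKED_TYPES:
            continue  # only task.failed and task.blocked are considered
        group = groups[fingerprint(event["message"])]
        group["count"] += 1
        group["agents"].add(event["agent_id"])
        group["sessions"].add(event["session_id"])
    return dict(groups)
```

Under these assumptions, two messages that differ only in a timeout value, a file path, or a memory address normalize to the same pattern and therefore land in the same group.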

Error Groups

Each error group shows:

  • Severity: low, medium, high, or critical (auto-classified)
  • Occurrence count: How many times this error has occurred
  • Affected agents: Which agents experienced the error
  • First/last seen: Time range of occurrences
  • Status: new, investigating, or resolved

AI Root Cause Analysis

Click Analyze Root Cause on any error group. Hivemind sends the error pattern, affected agents, and recent related events to GPT-4o-mini for analysis. The AI returns:

  • A concise root cause hypothesis
  • A suggested fix

Managing Errors

  • Investigate: Mark an error as "investigating" to signal to the team that it is being looked into
  • Resolve: Mark an error as resolved once it is fixed. If the error recurs, the group re-opens automatically
  • Filter: Use the tab bar to filter by status (All / New / Investigating / Resolved)
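
The status lifecycle described above can be modelled as a small state holder. The sketch below is hypothetical: the ErrorGroup class, its method names, and in particular the choice to re-open a resolved group as "new" are assumptions, since the docs only state that a recurring error re-opens automatically.

```python
from dataclasses import dataclass

STATUSES = ("new", "investigating", "resolved")


@dataclass
class ErrorGroup:
    """Hypothetical in-memory model of an error group."""
    fingerprint: str
    status: str = "new"
    occurrences: int = 0

    def mark(self, status: str) -> None:
        """Manually set the status (Investigate / Resolve actions)."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.status = status

    def record_occurrence(self) -> None:
        """Count a new matching event and re-open the group if resolved."""
        self.occurrences += 1
        if self.status == "resolved":
            # Assumption: a recurrence re-opens the group as "new".
            self.status = "new"


def filter_by_status(groups, status=None):
    """Mimic the tab bar: status=None corresponds to the All tab."""
    if status is None:
        return list(groups)
    return [g for g in groups if g.status == status]
```

A group marked resolved stays resolved only until the next matching event arrives, which is why resolving an error is safe even when the fix is not yet confirmed.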

REST API

GET /v1/errors?status=new&limit=20
GET /v1/errors/:groupId
POST /v1/errors/:groupId/resolve
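
As a rough sketch of how these endpoints might be called from Python, the helpers below build the three requests with the standard library. The base URL, the bearer-token Authorization header, and the function names are placeholders and assumptions, not documented values; the response format is not described here, so the example stops at constructing the requests.

```python
import urllib.parse
import urllib.request

# Placeholder host: substitute the base URL of your own deployment.
BASE_URL = "https://hivemind.example.com"


def _request(path: str, method: str, token: str) -> urllib.request.Request:
    # Assumption: the API accepts a bearer token in the Authorization header.
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {token}"},
        method=method,
    )


def list_errors_request(status="new", limit=20, token="YOUR_API_KEY"):
    """GET /v1/errors?status=...&limit=..."""
    query = urllib.parse.urlencode({"status": status, "limit": limit})
    return _request(f"/v1/errors?{query}", "GET", token)


def get_error_request(group_id, token="YOUR_API_KEY"):
    """GET /v1/errors/:groupId"""
    safe_id = urllib.parse.quote(str(group_id), safe="")
    return _request(f"/v1/errors/{safe_id}", "GET", token)


def resolve_error_request(group_id, token="YOUR_API_KEY"):
    """POST /v1/errors/:groupId/resolve"""
    safe_id = urllib.parse.quote(str(group_id), safe="")
    return _request(f"/v1/errors/{safe_id}/resolve", "POST", token)
```

Each helper returns a urllib.request.Request; passing it to urllib.request.urlopen sends it. Building and sending are kept separate here so the URLs and methods can be inspected without network access.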