LLM Scoring Service

Overview

llm-scoring-service is a self-hosted observability and scoring platform for LLM-backed applications. It captures requests/responses from LLM pipelines, scores them on configurable dimensions, and surfaces metrics through a React dashboard.

Why I Built This

Production LLM integrations have a measurement problem: without structured evaluation, you can't tell whether quality is degrading. This service addresses that by treating LLM scoring as a first-class infrastructure concern.

Architecture

Ingest layer: REST API receives LLM request/response pairs from client applications
Scoring engine: Evaluates responses using Groq's API against configurable rubrics (relevance, faithfulness, toxicity, custom criteria)
Event pipeline: Kafka decouples ingestion from scoring to handle burst traffic
Storage: PostgreSQL stores evaluations, scores, and aggregated metrics
SDKs: Java and Python client libraries for low-friction integration (Python SDK still in progress; see Status)
Dashboard: React UI for querying evaluations, viewing score distributions, and setting alert thresholds
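
The decoupling between the ingest layer and the scoring engine can be illustrated with a small sketch. This is not the service's code. It swaps Kafka for an in-process queue.Queue so the example runs with only the standard library, and the names ingest, score, and scoring_worker are hypothetical. The placeholder scorer just measures response length, where the real engine would call an LLM-based evaluator.

```python
import queue
import threading

# Assumption: an in-process queue stands in for the Kafka topic.
events = queue.Queue()
results = {}


def ingest(request_id, prompt, response):
    """Accept a request/response pair and return without waiting for scoring."""
    events.put({"id": request_id, "prompt": prompt, "response": response})
    return {"status": "accepted", "id": request_id}


def score(event):
    """Placeholder scorer (hypothetical); a real one would apply the rubrics."""
    return {"length": len(event["response"])}


def scoring_worker():
    """Consume events one at a time until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            events.task_done()
            break
        results[event["id"]] = score(event)
        events.task_done()


worker = threading.Thread(target=scoring_worker, daemon=True)
worker.start()

# The caller gets an acknowledgement immediately; scoring happens in the worker.
ack = ingest("req-1", "What is Kafka?", "A distributed event log.")

events.join()     # demo only: wait until the queued event has been scored
events.put(None)  # tell the worker to stop
worker.join()
```

The point of the sketch is that ingest only enqueues and returns, so a slow scorer delays the worker rather than the caller.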

Key Design Decisions

Kafka buffer prevents scoring latency from blocking the caller's request path
Scoring rubrics are runtime-configurable, so adding a new dimension requires no redeploy
SDK design mirrors OpenTelemetry's instrumentation pattern for familiar DX
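
One way runtime-configurable rubrics could work is a registry that is consulted on every evaluation, so a dimension added while the process is running applies to the next response scored. The sketch below is an assumption about the general shape, not the service's implementation. Rubric, RubricRegistry, and the two toy scorers are hypothetical names, and the scorers are simple heuristics standing in for LLM-backed evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rubric:
    """A named scoring dimension (hypothetical structure)."""
    name: str
    description: str
    scorer: Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]


class RubricRegistry:
    """Holds the active rubrics; changes take effect on the next evaluate()."""

    def __init__(self) -> None:
        self._rubrics: Dict[str, Rubric] = {}

    def register(self, rubric: Rubric) -> None:
        self._rubrics[rubric.name] = rubric

    def remove(self, name: str) -> None:
        self._rubrics.pop(name, None)

    def evaluate(self, prompt: str, response: str) -> Dict[str, float]:
        return {
            name: rubric.scorer(prompt, response)
            for name, rubric in self._rubrics.items()
        }


def brevity(prompt: str, response: str) -> float:
    """Toy heuristic: full score for responses of at most 50 words."""
    return 1.0 if len(response.split()) <= 50 else 0.0


def cites_source(prompt: str, response: str) -> float:
    """Toy heuristic: full score if the response mentions a source."""
    return 1.0 if "source" in response.lower() else 0.0


registry = RubricRegistry()
registry.register(Rubric("brevity", "Response is concise", brevity))
before = registry.evaluate("Summarize.", "Short answer.")

# A new dimension is added at runtime, with no restart of the process.
registry.register(Rubric("cites_source", "Response names a source", cites_source))
after = registry.evaluate("Summarize.", "Short answer. Source: docs.")
```

In a deployed service the registry contents would more plausibly be loaded from the database or an admin API than registered in code; that part is left out here.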

Status

Active development. Core scoring pipeline and Java SDK complete. Python SDK and dashboard in progress. 70+ test cases covering service and integration layers.