Building an Async Scoring Pipeline with Kafka and Spring Boot

The core promise of the scoring pipeline: a client submits an LLM evaluation, gets back a 202 immediately, and scoring happens in the background without affecting the caller's latency. Kafka is what makes that guarantee hold under pressure.

This post is about the actual implementation — producer setup, consumer setup, the error handling decisions, and the failure modes I had to reason through.

The Event Contract

Before any code, the event schema. This is the message published to Kafka after the ingest API receives a request:

public record EvaluationEvent(
    String evaluationId,
    String applicationId,
    String prompt,
    String response,
    String context,        // nullable
    String modelId,
    Instant createdAt
) {}

I use a Java record intentionally — immutable, no boilerplate. The event is serialized to JSON. I considered Avro for schema evolution guarantees, but for a v1 the operational overhead wasn't worth it. The tradeoff: adding a required field to the event in the future requires coordinating producer and consumer deploys carefully.

Producer: Ingest API → Kafka

The ingest service saves to PostgreSQL first, then publishes the event. Order matters here.

@Service
@RequiredArgsConstructor
public class EvaluationIngestService {
 
    private final EvaluationRepository evaluationRepository;
    private final KafkaTemplate<String, EvaluationEvent> kafkaTemplate;
 
    @Transactional
    public EvaluationResponse ingest(EvaluationRequest request) {
        // 1. Persist first — evaluation exists regardless of Kafka state
        Evaluation saved = evaluationRepository.save(
            Evaluation.from(request)
        );
 
        // 2. Publish event — keyed by applicationId for partition ordering
        EvaluationEvent event = EvaluationEvent.from(saved);
        kafkaTemplate.send("llm-evaluation-events", saved.getApplicationId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    log.error("Failed to publish evaluation event [id={}]: {}",
                        saved.getId(), ex.getMessage());
                    // Not rethrowing — DB record exists, scoring will be handled
                    // by the missed-events reconciliation job (see below)
                }
            });
 
        return new EvaluationResponse(saved.getId(), "PENDING");
    }
}

Why DB first, then Kafka — not the reverse?

If Kafka publish fails after a successful DB write, the evaluation exists but is unscored. That's recoverable — a reconciliation job can find unscored evaluations and republish. If Kafka publish succeeds but the DB write fails (say you do it the other way), you have an event in Kafka referencing an evaluation that doesn't exist. The consumer will fail and you have no recovery path without event replay logic.

Partition key = applicationId

All events for the same application go to the same partition. This guarantees ordering within an application's evaluations, which matters for the dashboard's time-series views. The downside: a single application with very high volume becomes a hot partition. For v1 this is acceptable.

Producer Config

@Configuration
public class KafkaProducerConfig {
 
    @Bean
    public ProducerFactory<String, EvaluationEvent> producerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
 
        // Critical: wait for all in-sync replicas to acknowledge
        config.put(ProducerConfig.ACKS_CONFIG, "all");
 
        // Retry on transient failures (network blip, leader election)
        config.put(ProducerConfig.RETRIES_CONFIG, 3);
        config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);
 
        // Idempotent producer — prevents duplicate messages on retry
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
 
        return new DefaultKafkaProducerFactory<>(config);
    }
}

ACKS=all + ENABLE_IDEMPOTENCE=true together ensure: if the producer retries because it didn't receive an ack, Kafka deduplicates the message server-side. Without idempotence, a retry after a network timeout where the message actually did land would produce a duplicate event — and a duplicate scoring run.

Consumer: Kafka → Scoring Engine

@Component
@RequiredArgsConstructor
@Slf4j
public class EvaluationScoringConsumer {
 
    private final ScoringEngine scoringEngine;
    private final EvaluationRepository evaluationRepository;
 
    @KafkaListener(
        topics = "llm-evaluation-events",
        groupId = "scoring-consumer-group",
        concurrency = "3"   // 3 consumer threads per instance
    )
    public void consume(
        @Payload EvaluationEvent event,
        Acknowledgment acknowledgment
    ) {
        log.info("Scoring evaluation [id={}] for app [{}]",
            event.evaluationId(), event.applicationId());
 
        try {
            List<Score> scores = scoringEngine.score(event);
            evaluationRepository.saveScores(event.evaluationId(), scores);
            acknowledgment.acknowledge();  // commit offset only on success
 
        } catch (ScoringException ex) {
            log.error("Scoring failed for evaluation [id={}]: {}",
                event.evaluationId(), ex.getMessage());
            // Do NOT acknowledge — message goes to retry topic via error handler
            throw ex;
        }
    }
}

Manual acknowledgment is non-negotiable here.

The default Spring Kafka behavior commits the offset as soon as the message is received, before your processing logic runs. If scoring fails mid-flight, the offset is already committed and the event is lost. With AckMode.MANUAL, the offset only advances after acknowledgment.acknowledge() is explicitly called — meaning a failed scoring run keeps the message available for retry.

Error Handling: Dead Letter Topic

Not every scoring failure is transient. If the Groq API returns an unrecoverable error or the event is malformed, you don't want infinite retries blocking the partition.

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template,
            (record, ex) -> new TopicPartition(
                "llm-evaluation-events.DLT",  // dead letter topic
                record.partition()
            )
        );
 
    // Retry up to 3 times with exponential backoff before sending to DLT
    ExponentialBackOffWithMaxRetries backOff =
        new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1000L);
    backOff.setMultiplier(2.0);
    backOff.setMaxInterval(10000L);
 
    return new DefaultErrorHandler(recoverer, backOff);
}

The DLT (llm-evaluation-events.DLT) is monitored separately. Failed events there represent evaluations that exist in PostgreSQL but have no scores — they show up in the dashboard with a SCORING_FAILED status. This is intentional: the data is preserved, the failure is visible, and you can replay the DLT once you fix the underlying issue.

The Failure Mode Inventory

Before shipping, I mapped out every failure scenario:

Failure	What happens	Recovery
Kafka broker down at ingest time	Publish fails, evaluation saved to DB with `PENDING` status	Reconciliation job republishes unscored evaluations older than N minutes
Groq API rate limit	`ScoringException` thrown, message retried 3× then DLT	Manual replay after rate limit window passes
Consumer crash mid-scoring	Offset not committed, message redelivered to next consumer	Idempotent score save (upsert by `evaluation_id + dimension`)
Malformed event	Deserialization error, straight to DLT (no retries)	Fix producer, replay DLT
DB down during score save	Exception thrown, offset not committed, message retried	Message held in Kafka until DB recovers

The reconciliation job (row 1) is the safety net for Kafka-level publish failures. It runs every 5 minutes and queries:

SELECT id FROM evaluations
WHERE created_at < now() - interval '10 minutes'
  AND id NOT IN (SELECT evaluation_id FROM scores)

Any evaluation that's been pending for more than 10 minutes without scores gets republished. It's a simple but complete guarantee: no evaluation silently stays unscored forever.

What I'd Do Differently

Outbox pattern instead of DB-first + reconciliation. The cleaner solution is to write the evaluation and the pending Kafka message to the DB in the same transaction (the outbox table), then have a relay process publish from the outbox. This eliminates the reconciliation job entirely. I skipped it because it adds schema + relay complexity, and for this project the reconciliation job is good enough. But in a high-throughput production system, the outbox pattern is the right call.

Schema registry. JSON serialization works but schema drift is a real risk. As the EvaluationEvent evolves, there's no enforcement that producers and consumers agree on the shape. Confluent Schema Registry with Avro would solve this at the cost of another dependency.