← Back to Blog

Why Preprocessing Matters More Than the Model

There is a persistent assumption in governance analytics that the hard problem is the model. Build a better classifier, train a smarter embedding, and insight will follow. This assumption is wrong in a specific and consequential way.

The hard problem is not what happens after the data is structured. It is what happens before.

The Invisible Layer

Consider a mid-size organization with 50,000 emails per month flowing through its communication systems. Some of those emails are about the same issue. Some involve the same participants discussing different issues. Some are forwarded through chains that introduce new participants, change the subject line, and cross organizational boundaries.

Before any analysis can happen, someone must answer a deceptively simple question: which messages belong together?

This is the preprocessing layer. It determines what is visible to every downstream step. If two messages about the same regulatory inquiry are not linked, the resulting issue path is truncated. The escalation appears to stop at depth 2 when it actually reached depth 5. The variance calculation is wrong. The shape classification is wrong. Every subsequent metric inherits that error silently.

Why Models Cannot Fix Bad Linkage

A common instinct is to throw an LLM at the linkage problem. Read the content of every message, infer semantic relationships, and let the model figure it out.

This fails for three reasons:

  • Reproducibility. Language models produce probabilistic outputs. The same pair of messages may be linked or not linked depending on inference conditions. This makes longitudinal comparison meaningless: you cannot track variance over time if the underlying graph changes with every run.
  • Auditability. When a regulator or actuary asks “why were these messages grouped together?”, the answer cannot be “because the model thought so.” Linkage must be explainable in terms of concrete structural properties.
  • Cost and latency. Processing 50,000 messages per month through an LLM for linkage is expensive and slow. Structural linkage based on metadata is fast, deterministic, and free.

What Structural Linkage Looks Like

The BBCO approach to cross-message linkage uses three structural signals:

  • Participant overlap. Messages sharing participants within a configurable time window are candidate links.
  • Temporal proximity. Messages close in time are more likely to concern the same issue.
  • Subject-line matching. Normalized subject lines are compared after stripping reply and forward prefixes, corroborated by participant overlap. This is a header operation, not a content analysis operation.

None of these signals require reading the full content of any message. They operate on the structural envelope (who, when, and which headers) rather than the semantic payload.

The Consequence

When preprocessing is done well, the issue graph is an honest representation of what actually happened. Shape types are classified correctly. Variance measurements reflect real behavioral patterns rather than linkage artifacts.

When preprocessing is done poorly, no downstream model can recover. The information was lost before analysis began.

This is why the BBCO community concentrates its shared effort here. The preprocessing layer is where fidelity is won or lost, and it is where open, inspectable, and deterministic methods matter most. For captive insurance programs, where every downstream metric feeds into capital and retention decisions reviewed by actuaries and boards, preprocessing quality is not an engineering detail — it is a governance credibility requirement.

The most important decision in any analytical pipeline is not which model to use. It is which observations to create.