There is a persistent assumption in governance analytics that the hard problem is the model. Build a better classifier, train a smarter embedding, and insight will follow. This assumption is wrong in a specific and consequential way.
The hard problem is not what happens after the data is structured. It is what happens before.
Consider a mid-size organization with 50,000 emails per month flowing through its communication systems. Some of those emails are about the same issue. Some involve the same participants discussing different issues. Some are forwarded through chains that introduce new participants, change the subject line, and cross organizational boundaries.
Before any analysis can happen, someone must answer a deceptively simple question: which messages belong together?
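One concrete piece of that question is recognizing that a forwarded or replied-to message carries the same underlying subject under a stack of prefixes. A minimal sketch, assuming plain "Re:"/"Fwd:" prefixes (the function name and rules here are illustrative, not a reference implementation):

```python
import re

# Hypothetical sketch: strip stacked reply/forward prefixes so messages
# in a forwarded chain can be matched on their underlying subject.
PREFIX = re.compile(r"^\s*(re|fwd?)\s*:\s*", re.IGNORECASE)

def normalize_subject(subject: str) -> str:
    """Remove stacked 'Re:'/'Fwd:' prefixes, collapse whitespace, lowercase."""
    prev = None
    while prev != subject:
        prev = subject
        subject = PREFIX.sub("", subject)
    return " ".join(subject.split()).lower()

print(normalize_subject("Fwd: RE: Re:  Regulatory inquiry Q3"))
# prints "regulatory inquiry q3" -- the same key as the original message
```

Real mail clients localize these prefixes and mangle subjects in other ways, so a production rule set would be longer, but the principle is the same: reduce each message to a comparable structural key.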
This is the preprocessing layer. It determines what is visible to every downstream step. If two messages about the same regulatory inquiry are not linked, the resulting issue path is truncated. The escalation appears to stop at depth 2 when it actually reached depth 5. The variance calculation is wrong. The shape classification is wrong. Every subsequent metric inherits that error silently.
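The depth-truncation failure is easy to demonstrate. In this toy example (message IDs and the reply-to structure are invented for illustration), dropping a single parent link during preprocessing makes a five-message escalation chain look like it stopped at depth 2:

```python
# Toy illustration: depth = longest reply chain reachable from the root.
# `links` maps each message to the message it replies to.

def max_depth(links: dict, root: str) -> int:
    children: dict = {}
    for child, parent in links.items():
        children.setdefault(parent, []).append(child)

    def walk(node: str) -> int:
        return 1 + max((walk(c) for c in children.get(node, [])), default=0)

    return walk(root)

full   = {"m2": "m1", "m3": "m2", "m4": "m3", "m5": "m4"}
broken = {"m2": "m1", "m4": "m3", "m5": "m4"}  # the m3 -> m2 link was lost

print(max_depth(full, "m1"))    # 5
print(max_depth(broken, "m1"))  # 2: m3..m5 are now an orphaned fragment
```

Nothing downstream can detect this from the broken graph alone; the orphaned fragment looks like a separate, shallow issue.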
A common instinct is to throw an LLM at the linkage problem. Read the content of every message, infer semantic relationships, and let the model figure it out.
This fails for three reasons. It does not scale economically: reading every message in a 50,000-email month, let alone comparing candidate pairs, is prohibitively expensive. It is not deterministic: the same corpus can yield different linkages on different runs. And it is not inspectable: a disputed link cannot be traced back to an auditable rule.
The BBCO approach to cross-message linkage uses three structural signals: shared participants, temporal proximity, and subject-header lineage.
None of these signals requires reading the full content of any message. They operate on the structural envelope (who, when, and which headers) rather than the semantic payload.
When preprocessing is done well, the issue graph is an honest representation of what actually happened. Shape types are classified correctly. Variance measurements reflect real behavioral patterns rather than linkage artifacts.
When preprocessing is done poorly, no downstream model can recover. The information was lost before analysis began.
This is why the BBCO community concentrates its shared effort here. The preprocessing layer is where fidelity is won or lost, and it is where open, inspectable, and deterministic methods matter most. For captive insurance programs, where every downstream metric feeds into capital and retention decisions reviewed by actuaries and boards, preprocessing quality is not an engineering detail — it is a governance credibility requirement.
The most important decision in any analytical pipeline is not which model to use. It is which observations to create.
Read more from the BBCO community.