Cleaning data is assumed to improve it, but it often removes the system itself.
We assume that cleaning data improves it.
In most systems, it does.
Noise is real. Instrument drift, electrical interference, sampling artifacts — these need to be removed. Without filtering, data is unusable. Without smoothing, patterns are invisible. Preprocessing exists because raw data, in its native form, is not interpretable.
The problem is not that we filter. The problem is what we define as noise.
Standard preprocessing assumes that meaningful signal is stable, repeatable, and strong. Deviations from that pattern (fluctuations, inconsistencies, transient spikes) are classified as noise and removed. In most systems, this works. In chemistry, physics, and industrial process control, where behavior is governed by stable dynamics and deviations are genuinely random, filtering improves the data because it removes what does not carry information.
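As a rough illustration of the case where filtering works, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the slow sine wave standing in for a stable process, the Gaussian measurement noise, the window length, and the helper names moving_average and rms_error. It only shows that when deviations really are random, smoothing moves the data closer to the underlying behavior.

```python
import math
import random


def moving_average(values, window):
    """Centered moving average; edges use whatever neighbours are available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


random.seed(1)  # fixed seed so the illustration is reproducible

n = 500
# Hypothetical stable, repeatable process: a slow sine wave.
truth = [math.sin(2 * math.pi * i / 250) for i in range(n)]
# Hypothetical measurement: the true signal plus purely random noise.
raw = [t + random.gauss(0, 0.2) for t in truth]

smoothed = moving_average(raw, window=11)

raw_err = rms_error(raw, truth)
smooth_err = rms_error(smoothed, truth)
print(f"RMS error before smoothing: {raw_err:.3f}")
print(f"RMS error after smoothing:  {smooth_err:.3f}")
```

Under these assumptions the smoothed series sits closer to the true signal than the raw one, because the only thing the filter removes is random deviation.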
Biological systems do not follow this pattern.
In living systems, the earliest indicators of change are weak, irregular, and transient. A shift in metabolic state does not announce itself as a clean trend. Emerging competition between populations does not produce a stable, repeatable signal. The system’s most consequential dynamics occur at a scale, and in a pattern, that standard filters are designed to remove.
This creates a specific failure mode: preprocessing that is correct by its own criteria systematically eliminates biological signal it was never built to recognize.
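A minimal sketch of this failure mode, again in Python and again under stated assumptions: the flat baseline, the three-sample irregular burst standing in for an early biological shift, the noise level, the 21-sample window, and the helper name moving_average are all invented for illustration and do not come from any real fermentation dataset. The filter does exactly what it was designed to do, and the burst all but disappears.

```python
import random


def moving_average(values, window):
    """Centered moving average; edges use whatever neighbours are available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


random.seed(0)  # fixed seed so the illustration is reproducible

n = 200
baseline = 1.0
# Hypothetical process reading: flat baseline plus small random measurement noise.
raw = [baseline + random.gauss(0, 0.02) for _ in range(n)]

# Hypothetical early-warning event: a brief, irregular burst of three samples.
for i, bump in zip(range(100, 103), (0.30, -0.25, 0.35)):
    raw[i] += bump

smoothed = moving_average(raw, window=21)

# Largest deviation from baseline in the neighbourhood of the burst.
raw_dev = max(abs(x - baseline) for x in raw[95:108])
smooth_dev = max(abs(x - baseline) for x in smoothed[95:108])
print(f"largest deviation near the burst, raw:      {raw_dev:.3f}")
print(f"largest deviation near the burst, smoothed: {smooth_dev:.3f}")
```

In the raw series the burst stands far above the measurement noise. After smoothing, the deviation at the same location is about the same size as the noise itself, so nothing downstream would flag it.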
In fermentation, a batch rarely fails without early signs. Before yield drops or growth slows, the system shifts — subtly. Activity becomes inconsistent and responses diverge. Small fluctuations appear that do not match expected behavior.
These signals are typically filtered out or averaged away. The processed data looks stable. The batch proceeds.
Only later does the failure become visible — lower yield, unexpected byproducts, unexplained deviation. At that point, the data is clean. And the cause is gone.
The signal was there. It entered the dataset. It was present in the raw measurement. But it was removed, not because it was irrelevant, but because it matched the pattern of what we define as noise.
Standard preprocessing works. The problem is that it works on the wrong definition of noise.
In systems where behavior is stable, our definitions hold. In systems where behavior is dynamic, adaptive, and context-dependent, those same definitions become a filter against the most informative signal the system produces.
We do not need to stop filtering. We need to recognize that the boundary between signal and noise is not fixed — it depends on the system. And in biological systems, that boundary has been drawn in the wrong place.
— Pegah Farr