How Industrial Systems Get Set Up to Fail (And What to Look for Before the Emergency)
Your System Isn't Fragile Because of Bad Luck
2026-Insider-Volume7
This newsletter is for educational purposes only. The frameworks and thinking approaches described here are starting points for developing professional judgment, not prescriptive procedures for any specific facility or application. Adapt these thinking approaches with the appropriate engineering and operational expertise for your context.
This month’s Snapshot introduced the mental chain experienced engineers use when something goes wrong.
This piece is about upstream issues: how they arise and how to spot the warning signs before you ever need to troubleshoot.
In this issue:
Why most plant failures are predictable, and what the early warning signs actually look like
The three ways integration failures get baked into a system during design, commissioning, and everyday operations, and what to look for in each one
The three “it seemed fine” assumptions that show up in almost every post-incident review
Two downloadable tools: a system fragility audit guide and a received-system checklist for engineers inheriting a system they didn’t build
The Problem
Most integration failures aren’t random. They’re predictable.
Industrial systems are designed in
one context,
commissioned in another, and
then operated in a third.
This often happens across different groups of people who never share the same understanding of how the entire system works. By the time something fails visibly, the conditions for that failure have usually been in place for months or years.
The failure event is just the moment the accumulated fragility finally shows up.
That’s a challenging thing to sit with, because it means that a lot of what looks like bad luck or sudden equipment failure is actually a slow-building situation that had early warning signs nobody knew to look for.
Understanding how systems accumulate fragility and what those early warning signs look like is a different kind of knowledge than troubleshooting skills.
Troubleshooting helps you find the problem after it surfaces. This piece is about developing the pattern recognition to see a fragile system before it breaks.
In my experience, integration failures trace back to one of three sources:
design-phase assumptions that were never validated,
commissioning shortcuts that left the system undercharacterized, or
accumulated change that nobody connected to the whole.
Usually it’s more than one.
The Framework
Source 1: Design-Phase Assumptions That Don’t Get Tested
Every industrial system is designed based on assumptions about
how the process will behave,
how different pieces of equipment will interact, and
how operators will respond under normal and abnormal conditions.
Those assumptions are encoded into
PID tuning parameters,
interlock logic,
alarm setpoints,
network timing configurations,
and dozens of other settings that nobody looks at twice once the system is running.
The problem is that some of those assumptions are never validated against the real system.
When the actual process behaves differently from the design model, the response is usually a local adjustment:
retune the controller,
raise the setpoint,
add a workaround in the logic.
The underlying assumption isn’t revisited. It gets papered over.
What this approach creates is a system that works under the conditions it was tuned for but becomes unpredictable when those conditions shift.
The original design intent is no longer visible in the running configuration. What’s visible instead is a layer of accumulated adjustments that each made sense at the time, but that nobody has a map of.
When I walk into a system and I see
PID controllers running at parameters far outside their original design values,
interlocks that have been bypassed for reasons nobody can clearly explain, or
alarm setpoints that drift slowly upward every year because the process is “always in alarm,”
those are signals that the system is carrying unvalidated assumptions. They’re worth asking about.
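The three signals above can be turned into a rough screening pass over exported configuration data. The sketch below is hypothetical throughout: the record shapes, the tag names, and the 50% deviation threshold are assumptions made for illustration, not values from any standard or vendor tool. It produces questions to ask, not conclusions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidLoop:
    tag: str
    design_gain: float    # value from the original design documentation
    current_gain: float   # value in the running controller


@dataclass
class Interlock:
    tag: str
    bypassed: bool
    bypass_reason: Optional[str] = None   # None or blank: nobody recorded why


@dataclass
class AlarmHistory:
    tag: str
    yearly_setpoints: list   # one high-alarm setpoint per year, oldest first


def fragility_signals(loops, interlocks, alarms, max_gain_deviation=0.5):
    """Return findings worth asking about. The threshold is an assumption."""
    findings = []
    for loop in loops:
        if loop.design_gain == 0:
            continue  # relative deviation is undefined; review by hand
        deviation = abs(loop.current_gain - loop.design_gain) / abs(loop.design_gain)
        if deviation > max_gain_deviation:
            findings.append(f"{loop.tag}: gain is {deviation:.0%} away from design value")
    for lock in interlocks:
        if lock.bypassed and not (lock.bypass_reason or "").strip():
            findings.append(f"{lock.tag}: bypassed with no recorded reason")
    for alarm in alarms:
        sp = alarm.yearly_setpoints
        # Flag setpoints that rose every single year across at least three years.
        if len(sp) >= 3 and all(later > earlier for earlier, later in zip(sp, sp[1:])):
            findings.append(f"{alarm.tag}: high-alarm setpoint raised every year on record")
    return findings
```

A finding here only means the item deserves a conversation. A gain far from its design value may be entirely justified; the point is whether anyone can explain it.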
The question that opens this up is a simple one: does anyone know what conditions the original design was tested against, and are those conditions still true?
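One way to make that question concrete is to write down the conditions the design was tested against and compare them with what the plant sees today. This is a minimal sketch, assuming the tested envelope is documented as simple minimum and maximum ranges; the variable names and numbers are invented for illustration.

```python
def outside_tested_envelope(tested_ranges, current_conditions):
    """List operating variables whose current value falls outside the range
    the design was tested against, or that testing never covered at all."""
    issues = []
    for name, value in current_conditions.items():
        if name not in tested_ranges:
            issues.append((name, value, "no tested range on record"))
            continue
        low, high = tested_ranges[name]
        if not low <= value <= high:
            issues.append((name, value, f"tested range was {low} to {high}"))
    return issues


# Hypothetical example values.
tested = {"feed_rate_m3_per_h": (40.0, 60.0), "inlet_temp_c": (15.0, 35.0)}
today = {"feed_rate_m3_per_h": 72.0, "inlet_temp_c": 22.0, "recycle_ratio": 0.3}

for name, value, note in outside_tested_envelope(tested, today):
    print(f"{name} = {value}: {note}")
```

The "no tested range on record" case is often the more telling one: it marks a condition the plant runs under that the original design never considered.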
If you want to work through this systematically, Section 1 of the ICS System Fragility Self-Assessment Guide (linked at the end of this issue) covers exactly this, with diagnostic questions for whether design assumptions are documented, whether current parameters and setpoints still reflect original intent, and whether the institutional knowledge behind the original design is still in the building.


