Systems-Theoretic Accident Model and Processes
Engineering a Safer World - Prof. Nancy Leveson - Massachusetts Institute of Technology
In today's fast-paced and complex world, it's crucial to approach safety from a different perspective. Our traditional methods of ensuring safety have limitations that hinder their effectiveness and cost-efficiency.
1: Why Our Efforts are Often Not Cost-Effective:
Firstly, current efforts toward safety are often superficial, isolated, or misdirected, focusing on assuring the safety of a finished system rather than designing the system to be safe in the first place. In addition, safety measures are often introduced too late in the development process, which limits their effectiveness. We also tend to rely on techniques that are unsuited to the complex systems being built today. Finally, our efforts concentrate on the technical components of systems, overlooking important factors like human error, new technology (especially software), conflicting expectations, management, and system evolution.
2: The Limitations of the Traditional Approach:
The traditional approach views safety as a failure problem, aiming to erect barriers between events or to prevent individual component failures. But as systems become more complex, accidents increasingly arise from the interactions among components rather than from individual failures, and neither designers nor operators can anticipate and account for every potential interaction. When we confuse safety with reliability, we neglect the dynamic and non-linear nature of accidents.
Non-serious events and incidents are often overlooked as learning opportunities, yet they hold tremendous value for enhancing safety. Labeling an incident "operator error" is an unproductive finding that fails to address the underlying causes. To truly improve safety, we must shift our focus from "who" or "what" to "why," because blame is the enemy of safety. The seduction of a single root cause can lead us astray, creating an illusion of control: such findings typically center on operator error or technical failures while disregarding systemic and management factors. We end up playing a sophisticated game of "whack-a-mole," fixing surface-level symptoms while leaving intact the flawed processes that produced them. This perpetual fire-fighting keeps the cycle of accidents repeating itself.
To truly grasp the complexities of safety, we must engage in three levels of analysis. First, we must examine the events themselves - the "what" - such as an explosion. Then, we look at the conditions surrounding the incident, considering the "who" and "how." This includes factors like flawed valve design or an operator failing to notice something. Lastly, and most importantly, we must explore the underlying systemic factors - the "why." This entails evaluating production pressures, cost concerns, flaws in design and reporting processes, and more. By understanding why the safety control structure failed, we can prevent future losses effectively.
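As a rough illustration, the three levels can be captured in a simple record; the incident details below are hypothetical, loosely echoing the explosion example above.

```python
from dataclasses import dataclass

@dataclass
class AccidentAnalysis:
    events: list[str]            # level 1: the "what"
    conditions: list[str]        # level 2: the "who" and "how"
    systemic_factors: list[str]  # level 3: the "why"

analysis = AccidentAnalysis(
    events=["Tank explosion"],
    conditions=["Flawed valve design", "Operator did not notice the anomaly"],
    systemic_factors=["Production pressures", "Cost concerns",
                      "Flawed design and reporting processes"],
)

# A report that stops at level 1 or 2 invites blame and symptom fixes;
# only level 3 explains why the safety control structure failed.
for label, items in [("What", analysis.events),
                     ("Who/How", analysis.conditions),
                     ("Why", analysis.systemic_factors)]:
    print(f"{label}: {'; '.join(items)}")
```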
Hindsight bias often clouds our judgment after an incident. Once we know the outcome, it becomes easy to pinpoint where individuals went wrong, what they should have done differently, or what crucial information they missed; but with that knowledge we can no longer reconstruct the perspective of someone who did not have it. Overcoming hindsight bias requires us to assume that nobody comes to work intending to do a bad job, and that people were acting reasonably given the complexities, dilemmas, trade-offs, and uncertainties they faced. Simply highlighting mistakes or stating what should have been done does not explain why people acted the way they did.
In the face of incidents, organizations often want simple, clean answers and a single root cause. But incidents typically result from a gradual increase in risk over time rather than from isolated, chance occurrences. As others like Deming have shown, instead of fixating on specific events or individuals, the focus should be on the broader systemic factors that contributed to the incident. That means prioritizing fixes to the slowest-changing parts of the system rather than only reacting to the immediate events.
Leveson challenges the assumption that improving the performance of individual system components will automatically increase safety. She emphasizes that reliability and safety are distinct constraints that must both be managed in system design: reliability concerns goals such as timely delivery, while safety concerns preventing harm to workers and other stakeholders. In complex systems, local autonomy and expertise alone are insufficient to ensure safety. Workers may be expert in their specific tasks yet unaware of the larger system and the hazards associated with it. We must therefore understand the role of the various processes, laws, and cultural influences that maintain control and feedback within the system. By comprehending the feedback loops at different levels, we gain a view of the entire system and recognize that incidents cannot be attributed to a single component failure; they arise from complex interactions within the system.
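A minimal sketch of one such feedback loop, with hypothetical names and numbers: the controller acts on its internal process model, which is only as good as the feedback that updates it. When feedback lags, the model diverges from the real process and the controller issues actions that are unsafe even though every individual component works as designed.

```python
class ControlledProcess:
    """A hypothetical tank whose pressure rises unless a relief valve is open."""
    def __init__(self):
        self.pressure = 0.0

    def step(self, valve_open: bool):
        self.pressure = max(self.pressure + (-2.0 if valve_open else 1.0), 0.0)

class Controller:
    LIMIT = 5.0  # pressure at which the controller should open the valve

    def __init__(self):
        self.believed_pressure = 0.0  # process model, updated only via feedback

    def decide(self) -> bool:
        # The control action is chosen from the model, not the real state.
        return self.believed_pressure > self.LIMIT

    def receive_feedback(self, measured: float):
        self.believed_pressure = measured

process, controller = ControlledProcess(), Controller()
for t in range(10):
    process.step(controller.decide())
    if t % 4 == 0:  # feedback arrives only intermittently
        controller.receive_feedback(process.pressure)
    print(t, process.pressure, controller.believed_pressure)
```

Around steps 5 to 7 the real pressure exceeds the limit while the believed pressure still reads 5.0, so the valve stays closed: an unsafe interaction with no failed component.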
3: Software-Related Accidents and Operator Error:
Software-related accidents are frequently caused by flawed requirements, incorrect assumptions, and unhandled states of the controlled system. Merely making the software reliable will not make it safe under these conditions. Similarly, blaming incidents and accidents on operator error is a limited perspective: human error is often a symptom, not the cause, of accidents. What matters is understanding the role operators play in complex systems, the changing nature of their work, and the design of the system in which they operate. Human error can be mitigated by designing systems that minimize the likelihood of errors and give operators the support and tools they need.
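A sketch of the requirements problem, using hypothetical plant states: the dispatch below is perfectly "reliable" in that it always returns exactly what it was written to return, yet it is only safe if every state the controlled system can actually reach is handled, including the ones the requirements never anticipated.

```python
from enum import Enum, auto

class PlantState(Enum):
    STARTUP = auto()
    NORMAL = auto()
    SHUTDOWN = auto()
    SENSOR_CONFLICT = auto()  # a state the original requirements omitted

ACTIONS = {
    PlantState.STARTUP: "ramp up slowly",
    PlantState.NORMAL: "maintain setpoint",
    PlantState.SHUTDOWN: "close feed valves",
}

def control_action(state: PlantState) -> str:
    # Defaulting to a safe hold covers states the requirements missed;
    # without it, perfectly reliable code could still drive the plant
    # into a hazard when an unanticipated state occurs.
    return ACTIONS.get(state, "enter safe hold and alert operator")

print(control_action(PlantState.SENSOR_CONFLICT))
```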
4: Systems Thinking - STAMP Approach:
To address these limitations and create a safer world, we need to adopt a systems thinking approach. One such approach is the Systems-Theoretic Accident Model and Processes (STAMP). STAMP treats safety as a dynamic control problem rather than just a reliability problem: it emphasizes enforcing a set of constraints on system behavior, and views accidents as the result of interactions among system components that violate those constraints. By shifting our focus from preventing failures to enforcing safety constraints, we can make significant progress in enhancing system safety.
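As a minimal sketch of "safety as a control problem" (the state variables and constraint texts are invented for illustration), constraints can be written as predicates over system state, and every control action checked against its predicted outcome before execution:

```python
from typing import Callable

State = dict[str, float]
Constraint = Callable[[State], bool]

# Invented example constraints on system behavior, not component reliability.
constraints: list[Constraint] = [
    lambda s: s["pressure"] <= 8.0,                 # never exceed design pressure
    lambda s: not (s["valve_a"] and s["valve_b"]),  # valves never open together
]

def enforce(state: State, action: State) -> State:
    """Allow a control action only if its predicted outcome satisfies
    every safety constraint; otherwise substitute a safe fallback."""
    predicted = {**state, **action}
    if all(check(predicted) for check in constraints):
        return action
    return {"valve_a": 0.0, "valve_b": 0.0}  # illustrative safe fallback

state = {"pressure": 7.5, "valve_a": 0.0, "valve_b": 1.0}
print(enforce(state, {"valve_a": 1.0}))  # rejected: both valves would be open
```

The point of the sketch is the shift in emphasis: nothing here predicts component failures; the enforcer simply refuses behaviors that would violate the constraints.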
5: Applying STAMP to Safety Engineering:
STAMP provides a comprehensive framework for safety engineering. It considers accidents as complex, dynamic processes arising from interactions among humans, machines, and the environment. By identifying safety constraints and designing an effective control structure, we can eliminate or reduce adverse events. The control structure encompasses physical design, operations, management, social interactions, and culture. Clear expectations, responsibilities, authority, and accountability must be defined at all levels of the safety control structure.
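One way to make that concrete (the levels, responsibilities, and channel names below are hypothetical) is to record, for each level of the control structure, what it controls downward and what feedback it provides upward, then check that no loop is left open:

```python
from dataclasses import dataclass

@dataclass
class ControlLevel:
    name: str
    responsibilities: list[str]
    controls_sent_down: list[str]  # directives, procedures, commands
    feedback_sent_up: list[str]    # reports, audits, measurements

structure = [
    ControlLevel("Management",
                 ["Set safety policy", "Allocate resources"],
                 controls_sent_down=["Safety standards", "Work procedures"],
                 feedback_sent_up=[]),  # reporting to regulators omitted here
    ControlLevel("Operations",
                 ["Operate within procedures", "Report anomalies"],
                 controls_sent_down=["Operator commands"],
                 feedback_sent_up=["Status and incident reports"]),
    ControlLevel("Physical process",
                 ["Stay within the design envelope"],
                 controls_sent_down=[],
                 feedback_sent_up=["Process measurements", "Alarms"]),
]

# Gap check: every adjacent pair needs a downward control channel and an
# upward feedback channel, or the loop is open and constraints go unenforced.
for upper, lower in zip(structure, structure[1:]):
    assert upper.controls_sent_down, f"{upper.name}: no control channel"
    assert lower.feedback_sent_up, f"{lower.name}: no feedback channel"
```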
6: STPA - A New Hazard Analysis Technique:
System-Theoretic Process Analysis (STPA) is a powerful hazard analysis technique built on STAMP. STPA starts from the hazards, derives the safety constraints, and identifies the scenarios that could lead to their violation. Because it can be applied before a detailed design exists, it can influence early design decisions.
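A rough sketch of the bookkeeping STPA implies, assuming the four standard ways a control action can be unsafe as described in the STPA literature; the hazard and the control action named below are hypothetical:

```python
from dataclasses import dataclass

# The four standard types of unsafe control action (UCA) from the STPA
# literature; each is checked against every hazard for every control action.
UCA_TYPES = (
    "not provided when needed",
    "provided in a state where it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)

@dataclass
class Hazard:
    id: str
    description: str
    safety_constraint: str  # each hazard is inverted into a constraint

hazards = [
    Hazard("H-1",
           "Tank pressure exceeds the design limit",
           "Pressure must never exceed the design limit"),
]

# Walking every UCA type against every hazard generates the questions whose
# answers become the loss scenarios and design constraints that feed early design.
for hazard in hazards:
    for uca in UCA_TYPES:
        print(f"{hazard.id}: can 'open relief valve' be {uca}?")
```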
Implementation of STPA requires a learning-by-doing approach. It is most effective to immerse people in small interdisciplinary project teams comprising, for example, engineers, designers, subject matter experts, and testers. Facilitators provide expertise, guidance, and coordination throughout the analysis: they assist with project selection, scoping, team formation, and control structure development; manage the meetings and interactions among team members; ensure that the STPA process is followed correctly; and compile the results into a final report. The initial use of STPA may cost more because of the learning curve, but the cost decreases with subsequent projects.