Pragmatic 11: Cause And Effect

3 February, 2014


John and Ben discuss the process of analyzing failure, from broken alternator belts to plane crashes. Starting with Toyota’s five (sometimes six!) why’s, John explains some different approaches including fault tree and root cause analysis.

[MUSIC] >> This is Pragmatic, a weekly discussion show contemplating the practical application of technology. Exploring the real-world trade-offs, we look at how great ideas are transformed into products and services that can change our lives. Nothing is as simple as it seems. I'm Ben Alexander and my co-host is John Chidjie. How are you doing, John? I'm doing very well. How you doing Ben? - Doing well. Scratchy throat. - Awesome. Yeah, I'm sorry to hear you still haven't quite kicked that bug you've got. Unfortunately, the time of year I suppose over there and our turn is coming. But anyway, so hello to everyone in the chat room. Thanks for stopping by to listen live. So starting off again with a general thank you to everyone from Twitter and who's been saying and continuing to get great feedback about the show. It is all greatly appreciated. So, thank you to all those people. Also, a quick apology to Michael Solis. I mispronounced his name in episode four. He actually kind of begged to some extent on Twitter not to correct his name, but it's one of those sort of bugbears of mine. So, I kind of had to get it out of my system and yeah, just going to have to deal with it. So, thank you. Special thanks also to the listeners that have emailed me directly. I have read every email that's come in. I have responded to some, but there is a bit of a backlog. Ben's working on a more efficient way to distribute them for hosting everything. He's building up everything at Fiat Lux. So, I will get back to you guys that have emailed. I haven't got back to you. I will get back to you shortly. So, thanks again for that. Three more iTunes reviews. We're still getting them, which is fantastic. a different mix of countries, one from my own country, and then one from my home away from home, Canada, and also one from Kazakhstan. Now, a big thanks to Zri, Jason Clark, and Nikita Kushkin, respectively, which I'm really hoping is their real name, because that's a cool name. 
Just a side note, I actually am thrilled to have a listener from Kazakhstan. And the reason is that when I was living in Calgary as an intern, a student, but I was interning at Nortel at the time back in '97, I actually was sharing a room. It was one of those four rooms in the campus at the University of Calgary. And one of my roommates, Ed, was a Kazakhstanian. And he used to get really annoyed when I beat him at chess, which did happen from time to time. Apparently, that's not allowed. His neck of the world, they're supposed to be like chess geniuses and stuff, and an Australian's not supposed to be any good, but anyway, so there you go. Go Kazakhstan. Also, thank you to Tim Radke, who recently changed his website from Grim Trigger to a site using his own name and wrote some lovely things about the show. I actually missed it originally because it was in German, but I eventually found it. So, thank you very much for that. Also much appreciated. Quickly, an apology for feedback for episode two. We actually have more. That's just the episode that keeps on giving. And unfortunately, Ben's been waylaid this week with other things and was unable to get that feed excerpt out. You'll be seeing it shortly. I know I've told a couple of people about it. Don't worry, it's coming if you're hanging on for that. So I think it's a good one too. So episode two, we've got four bits of follow-up on that. And, like I said, the episode that keeps on giving. Also, for those reading Tech Distortion, my blog, please note that I'm sort of in the process of migrating from WordPress to Statamic for a whole bunch of reasons. Those following my Twitter feed will see all of those. But in any case, I'm following in the footsteps of Sid O'Neill and Harry Marks. And in the next week or two, you should see that coming up. When that does happen, please let me know what you think. I'm always interested in any and all feedback about that. - Sid's article though, that was pretty good.
Yeah, and yeah, it was. And so was Harry's actually. It was sometime before that in early 2013. Yeah. So, yeah, I've read them both, and it's great. - I remember he talked about that on Storming Mortal too. - Yes, he did. - Yeah. - Yeah, he did. So, yeah, I just sort of reached my limit with WordPress at this point and I know I've been an advocate for it for quite some time. I don't hate it, I just want to try something different and I'm a geek so I'm allowed to do crazy stuff like that, like just wiping the slate clean and saying, "Oh, let's try this new shiny toy," which is kind of what Statamic is right now. So, I've got that sort of giddy excitement while I'm playing with a new toy. So, anyhow. Okay. The topic today may sound a bit odd at first, but I actually want to talk about cause and effect. And I realise on the surface, it sounds a little bit weird, maybe. I'm not sure. Why I want to talk about it is it's something that's fascinated me. So, the idea of domino effect, the idea of a chain of events and the chain of events lead to an outcome. And when you start at the beginning of the chain of events, you can't see the outcome. At the end of the chain of events, you look back and you're amazed, or I'm amazed, at the different connections between the different separate events that lead to that one last event. Everyone sort of has their own way of thinking about this, but it happens every day in all sorts of things. And there's all sorts of scenarios you can play out in your head, "Oh, well, if this didn't happen, then that wouldn't have happened," and that sort of thing. I find that fascinating. And I also find it fascinating from an engineering point of view, because a very big part of what I was doing in the first few years of my career was reliability engineering, where we did a lot of these things I'm going to talk about today.
So I want to start talking about the analytical component of cause and effect and how people can use that in their daily lives, if they want to. So should be an interesting one. So I'd like to start with Toyota. And why I want to start with Toyota is they came up with this concept of the five whys. I don't know if you've heard of the five whys or not, Ben. - Yeah, it's a little familiar, but it's been a while since I read The Toyota Way. - Okay. In Toyota, they developed a methodology many years ago called the five whys. And the concept is you ask why five times. And it was a methodology to help improve their manufacturing processes and the quality of their vehicles. And the idea is you simply ask, well, just like the title says, why, five times in a row. So, the example from the Wikipedia article is very simplistic, but it gets the idea across. So, I'll just read it for you. The vehicle won't start. Why? The battery is dead. Why is it dead? The alternator is not functioning. OK, why is that? The alternator belt has broken. And why is that? The alternator belt was well beyond its useful service life and wasn't replaced. And why was that? Well, the vehicle was not maintained according to the recommended service schedule. Now, at that point, you've reached five whys. At that point, the idea is that you've reached your conclusion. So the final recommendation would be follow a proper maintenance schedule for your vehicle. The problem is that you can actually keep asking why and get further than that, potentially. So let's say that you did, you ask why again. Why was the vehicle not maintained according to the service schedule? Well, it turns out that the alternator belt in question hasn't been made for years. It's not available anymore. So it wasn't that they intentionally didn't maintain it. It was because they were unable to maintain it, which completely changes the complexion of your conclusion.
So the problem is, well, you know, Toyota should have been continuing to manufacture this stupid belt, right? This petrol engine. Anyway, so the thing with five whys is that it's a great place to start. You can start with the idea of systematically going through a query process to reach a point where you are satisfied that you've reached the true cause of the problem. But why stop at five? What if it takes eight questions or 10 questions to get there? So it's not, shall I say, serious. Well, I mean, in many respects, most people don't take it seriously in reliability because it's kind of a thought experiment. It's to get you thinking about how you should be doing analysis of a failure, how to get you started, and this is one approach. But it's simplistic because the depth is arbitrarily set to five. Why five? Why not four? Why not two thousand? I mean, at what point do you stop? You know, are we going to get down to the molecular level? You know, the belt isn't made anymore because it's made of steel and the steel is made of carbon and the carbon's got this many atoms. And I mean, at what point do you stop? The point of ridiculousness is also difficult to define. The other issue is that each path is basically a singular path. So, you go from one question to the previous, to the previous, to the previous, and essentially it assumes that there is only one way to get there. And that's not really always the case. In fact, it usually isn't the case. Normally, there is a predominant reason. But what about the less dominant reasons that were contributing factors? The five whys methodology doesn't address any of that. So, it's generally not used anymore. It's considered to be somewhat of a thought experiment more than anything. So, that leads to the next evolution of that kind of thinking, something called root cause analysis.
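To make the single-path limitation concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `CAUSES` table and both function names are hypothetical, the starting problem ("vehicle will not start") is assumed from the Wikipedia example being read out, and the "interior light" branch is an invented contributing factor that does not appear in the discussion.

```python
# Hypothetical cause table: each effect maps to the causes recorded for it.
# The main chain follows the battery example; the "interior light" entry is
# an invented contributing factor, included only to show a second branch.
CAUSES = {
    "vehicle will not start": ["battery is dead"],
    "battery is dead": [
        "alternator is not functioning",
        "interior light left on (hypothetical)",
    ],
    "alternator is not functioning": ["alternator belt has broken"],
    "alternator belt has broken": [
        "belt was beyond its service life and not replaced"
    ],
    "belt was beyond its service life and not replaced": [
        "vehicle not maintained to the service schedule"
    ],
    "vehicle not maintained to the service schedule": [
        "replacement belt no longer manufactured"
    ],
}


def ask_why(effect, max_depth):
    """Follow only the first recorded cause at each step, up to max_depth whys."""
    chain = [effect]
    for _ in range(max_depth):
        causes = CAUSES.get(chain[-1])
        if not causes:
            break  # nothing further recorded, so the chain ends here
        chain.append(causes[0])
    return chain


def root_causes(effect):
    """Explore every branch and return the causes that have no recorded cause."""
    causes = CAUSES.get(effect)
    if not causes:
        return [effect]
    found = []
    for cause in causes:
        found.extend(root_causes(cause))
    return found


five = ask_why("vehicle will not start", 5)
six = ask_why("vehicle will not start", 6)
all_roots = root_causes("vehicle will not start")
```

With this table, stopping at five whys ends on the maintenance schedule, a sixth why reaches the discontinued belt, and only the branching search surfaces the second contributing factor.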
So, root cause analysis is essentially a method of looking at a failure event and breaking it down into sub-events or a series of sub-events that led up to the failure. And you simply repeat that for each level down until you reach a point where you cannot break it down any further. It's the sort of exercise that's done where you have a failure event that has occurred and you're trying to analyse why. It's not so much a design tool, but it's the sort of process that is very similar to the next one that I'm going to talk about, except it's not, how should I say, statistically analysable. In other words, there's no probabilities used, there's no calculation of what's the most likely path. It's simply, we had a failure, here are the possible causes. And we're simply going to investigate each of them until we figure out what caused it. So, it's an investigative tool more than anything else. Sort of thing that is used in accident investigations, and we'll get to that. So, the most calculated, rigorous method of fault analysis I'd like to suggest from a top down approach is something called fault tree analysis, or FTA for short. And fault tree analysis was developed by a bloke called H. Watson. Don't know what his first name was. Sorry, Mr. Watson. And it was developed when he was working for Bell Labs in 1962. They were a group of engineers and they were trying to understand the best areas of a design to focus on to ensure that they had the best possible outcome for their design. So, it's meant to be a design tool. It's top down again. So, you start by defining an undesirable state at the top of the tree. And they call it a fault tree. Yeah, it ends up looking like a tree, but the tree is upside down, unless you want to think that the branches of the tree are the roots of the tree, I guess. But, you know, or it's a boab tree or something like that.
The idea is, it's very similar to root cause analysis, but what you do is you say, well, if my undesirable state is a failure of a computer, let's say, well, what could cause that? Well, it could be the hard drive or it could be a software problem or it could be the CPU or it could be memory. There's a whole bunch of different possible reasons. Each of those will have a failure rate associated with it. So what's the probability that it's going to fail in the next thousand hours, ten thousand hours, hundred thousand hours, whatever it is? And by doing that, you can then figure out which of these branches of the fault analysis tree is the most critical. And that critical path, that critical series of events is the one that you should spend the time working on the hardest. So in essence, it guides the way you design for improved reliability. So adding redundancy is the simplest one. So, you know, the most common thing in a server computer that fails is the power supply, statistically; therefore you have redundant power supplies. It's that simple. And originally, that analysis would have been done decades ago as a fault tree, looking at the reliability of all the components, and they would have figured that out and said, "Okay, well, we should add redundancy in our power supplies." That's a whole other discussion about Markov models and reliability prediction. I didn't want to get too much in depth into that and don't get me started on that, or I'll go on forever. So each of these events, they've got to figure out their deterministic failure rate. And that failure rate is sometimes quoted as the mean time between failures, or MTBF. A lot of products will have that. So you'll buy a hard drive and it'll have an MTBF of 200,000 hours or whatever the heck it is. Pulling a number out of my butt there, I'm afraid. But in any case, the number's not the point.
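As a rough illustration of the MTBF and redundancy point, the sketch below turns an MTBF into a probability of failure over a given number of hours and shows the effect of a second, redundant power supply. It rests on assumptions not stated in the discussion: a constant failure rate (the exponential model), independent failures, and no repair during the period. The function names and the figures are made up for the example.

```python
import math


def failure_probability(mtbf_hours, mission_hours):
    """P(unit fails within mission_hours), assuming a constant rate of 1/MTBF."""
    return 1.0 - math.exp(-mission_hours / mtbf_hours)


def redundant_failure_probability(p_single, units):
    """P(all redundant units fail), assuming failures are independent."""
    return p_single ** units


# Made-up figures: a power supply with a 100,000 hour MTBF over 10,000 hours.
psu_mtbf_hours = 100_000
mission_hours = 10_000

p_one_supply = failure_probability(psu_mtbf_hours, mission_hours)
p_two_supplies = redundant_failure_probability(p_one_supply, 2)

print(f"single supply fails:        {p_one_supply:.4f}")
print(f"both redundant supplies fail: {p_two_supplies:.4f}")
```

Under these assumptions a single supply has roughly a 9.5% chance of failing in the period, while losing both drops to under 1%. Handling repair and replacement of a failed unit is where the Markov models mentioned above come in.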
One of the interesting things is that you can reach the failure rate theoretically by using a different kind of failure analysis. So you could look at an individual circuit board and say, I'm going to do something referred to as an FMEA or FMECA. FMEA is failure modes and effects analysis. And the other one is FMECA, which has a C and it stands for criticality. So it's failure modes, effects and criticality analysis. And if you know the FIT, the failure rate, of each of the components on the board, and you look at the different results when different pins are stuck at a high or a low or a transient state, and you look at the end result of what's corrupted and what isn't and so on, all of that stuff can be rolled up into a failure rate for a product that doesn't exist, a theoretically predicted failure rate. And that's what they would use to feed into the fault tree analysis. And I actually did a lot of FMECAs when I worked at Nortel in '97 and again in '99 and 2000, sort of the bread and butter in R&D, because you don't have, I mean, we did have real products that were coming back from the field, of course, with failure rates, and we tried to map our predictions that had been made previously to the field returns in the real world. And we had some reasonably good alignment, but in any case, I'm getting off topic. So these fault tree analyses are quite detailed, and you've got a bunch of basic function blocks. You've got a basic event, undeveloped event, external event, intermediate events. The Wikipedia article goes through all of them, and I don't really want to go into too many of the specifics, but the interconnections between them are simple logic gates. So you've got, you know, ands, ors, xors, all that stuff. So if you know your basic, you know, Boolean logic gates, then it'll look very familiar to you.
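The roll-up and the gate logic can be sketched in a few lines of Python. This is a simplified illustration, not the method used at Nortel: it assumes FIT means failures per billion device-hours, that any component failure fails the board (a real FMECA refines this per failure mode), and that events feeding a gate are independent. All component names, FIT values, and probabilities are invented.

```python
from math import prod


def board_fit(component_fits):
    """Roll component FIT values up to a board figure (series assumption)."""
    return sum(component_fits.values())


def fit_to_mtbf_hours(fit):
    """Convert FIT (failures per 1e9 device-hours) to MTBF in hours."""
    return 1e9 / fit


def and_gate(probabilities):
    """Output event occurs only if every input event occurs."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Invented FIT values for a hypothetical board.
fits = {"regulator": 50, "capacitor": 20, "controller": 130}
total_fit = board_fit(fits)
board_mtbf = fit_to_mtbf_hours(total_fit)

# Invented probabilities for the "computer fails" tree: it fails if the
# hard drive, CPU, or memory fails, or if both redundant supplies fail.
both_supplies_fail = and_gate([0.1, 0.1])
computer_fails = or_gate([0.02, 0.001, 0.005, both_supplies_fail])
```

Comparing the inputs to the top-level OR gate is what points at the branch worth the design effort, and the AND gate is how redundancy shows up in the tree.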
So, it accounts for cases where you need to have multiple failures of certain things in order to actually have a failure or an undesired event. Sorry, they call it an undesired event. So anyway, feel free to read up on that. The Wikipedia article is not too bad. So, the question, however, is that you've got all these methods developed for failure analysis now. So, we've got fault tree analysis, root cause analysis, and it all starts with five whys sort of as a thought experiment. But how is it actually usable, useful? Maybe it's obvious, maybe it isn't. But I find this sort of thing fascinating and also scary at the same time, because there are actually a couple of documentaries on TV, like documentary TV series, that explore this stuff. So, two of them in particular. So, you've got one of them called Seconds from Disaster. And I don't know if you've seen that one, Ben. Yeah. Or heard of it? Yeah. Yeah. National Geographic Channel. Mm-hmm. It's very hyped, you know, it's very... Oh, God, how do I express it? It's very sensationalised. Yeah, the big dramatic voice, you know, "Oh, and they're seconds from disaster." And it's like, "Oh, okay, yeah, just get to the point." Well, I guess, but I found that there's another one called Mayday, but it's got five different names. I didn't realise because in Australia, it's called Air Crash Investigation. And apparently it's also called that in the UK and Ireland, and in some parts of the world it's called Air Emergency or Air Disasters. So, I don't know why on earth they do that. But anyway, that's a Canadian documentary. It's produced by Cineflix. And that particular one is far more, how should I say, straightforward, far more clinical in its delivery of the news. It's the, yeah. So, it's less hyped. But both shows are good. It's just that Seconds From Disaster is a broad spectrum of topics. I kind of like this show. It's not focused on one thing. It's focused on, okay, so they covered the Piper Alpha oil platform explosion.
They covered a tunnel fire under the Alps. They cover airplane crashes. But if you look at Mayday Air Crash Investigation or Air Disasters, Air Emergency, which whatever name you want to call it, that one focuses purely on air crashes, plane crashes. And planes, of course, are an area where all this fault analysis and root cause analysis is critical because, you know, if you have a problem up there in the sky, you're on your own and if something goes wrong, you are going to fall out of the sky and that's generally a bad thing. - It's the same Mayday on Discovery here. I think I found it. - Okay. - Is it acted? It's like, it's dramatic. - Yeah, what they do is they've got periods where they walk you through the events that lead up to the accident and then they walk you through the investigation afterwards. And they often have like 20, 30 second segments of dramatisations, usually in the cockpit or wherever the... Well, if it's an aircraft one usually. So, and they sort of dramatise that bit a little bit, but they often will have the actual audio. Sometimes they can't legally release the audio for all sorts of different reasons, like they're still investigating it or there was request that they were sealed for some reason. So, you can't always hear that, but that's okay. It is... Yeah, so there are a lot on there and I've watched that on and off over the years and I found it to be fascinating and disturbing at the same time because, you know, you realise that a lot of people died in some of these things. And when I look at what I do from day to day, I don't work on aircraft. Even when I worked at Boeing, I didn't actually work on aircraft. I work on other military stuff, but not commercial aircraft. And, you know, most of the stuff that I deal with, it is technically possible for me to kill somebody, but, yeah, they have to be trying very hard. Even so, you still have to be very careful with what you do and what you design. It's not a joke and it's not a game. 
It's just that it doesn't have an immediate, obvious effect like a plane falling out of the sky does. So, there's one in particular that stuck in my mind that I wanted to talk about. And the reason I want to talk about it, I guess I can't really explain it. It's just that it's lodged itself in my mind. I've watched lots of episodes of this and this particular one has stayed with me for years. And I don't really- I think the reason is because I see parallels with what I've seen in my own experience. So, forgive me this little tangent, but it illustrates a point. So, this particular one is Aeroperú Flight 603. So, on the 2nd of October in 1996, just after midnight, 70 people were on a flight from Miami to Lima in Peru. They were on a Boeing 757. And it's important to set a bit more of the scene. Moonrise was at about 10.47pm on that day. There was about a 79% illumination. So, it wasn't pitch black by any means. But the flight started out, leaving Miami, predominantly over water. And when you're flying at night over water and there's no mountain ranges or anything that you might see in the moonlight, perhaps, essentially you are relying on your instrumentation. So, without instruments, you've got a problem. So, on this particular flight, sorry, a little bit more background first. I'm getting ahead of myself. Most of the instruments on an aircraft require sampling of the outside atmosphere. So, the entire plane can't be sealed. You need to have airspeed indication. You need air pressure. There's a bunch of, they call them static ports. And these static ports are usually at the front of the plane, at the nose of the plane, and they're sort of like small tubes. They've usually got a 90 degree bend in them. The exact configuration varies from aircraft to aircraft. The idea is you get them up the front, away from any turbulence, air turbulence and so on. Nice clean air to work with, that sort of thing, to get the most accurate reading.
Now, these small ports are critical for the instrumentation to work. So we'll keep that in mind for a second. So after they'd been in the air for a few minutes, they started to get a bunch of alarms from their instruments. They were all very conflicting. So you had overspeed alarm, under speed alarm, too low to the ground terrain alarm. Their primary altimeter and the airspeed indicators were not indicating correctly at all. And they were essentially flying without their primary instruments. Now, I just want to say I am not a pilot and I suck at Microsoft Flight Simulator. However, based on my understanding of the aircraft, the bottom line is that their primary instruments were not functional and they were getting a flood of alarms. And these are audible alarms as well as visual flashing alarms. So this is a 757. It was an older plane, even at the time. So, this is not like a modern Airbus or a Dreamliner or anything. It wasn't digital. It wasn't, you know, it was old fashioned, old school. Okay. They did the right thing, which was, they don't know how high they are. They don't know how fast they're going. Something is clearly not right. They turn around and head back to the airport, or at least they try to. But they'd gone too far out to get any guidance. They couldn't see the land from where they were. They had no idea how fast they were going, no idea how high they were. And as a result, they tried to descend and slow down to try and land. And essentially, the plane starts to stall. They have several stalls and drop a lot of altitude very quickly. And eventually, they crashed into the water and everybody died. The National Transportation Safety Board, NTSB, working with another investigator as well, they look at these incidents and try and figure out what went wrong.
And they do root cause analysis and they try to understand the sequence of events that led to the accident so that people can learn from that and not make these mistakes again. So, they found several causes. The first one, funnily enough, I say funnily enough, it's not funny at all, oddly enough, we talked about this on Pulling the String, actually, which was information overload or abnormal situation management, where the user interface, and obviously, as I said before, this is old school. This is not a digital touchscreen, iPad or anything like that. This is old school, you know, audible alarms, flashing annunciator panels and flashing indicator lights. So all of this information, it doesn't matter how it's displayed. It's still a user interface. So all these alarms were going off, all of them at the same time, and there was no grouping of these alarms, there's no logical rationale to it. So in a modern system, the system would try to give you a summary alarm and say, "You know what? I've got this, I've got this, I've got this, I've got this. Most likely cause is blah." And it should give you a list with the most likely cause on top and each subsequent cause that could have been beneath that. But it didn't do that. These pilots had logged tens of thousands of hours. These were not amateurs, they were not new, they weren't wet behind the ears. They were experienced pilots, very experienced, and they got information overload. And to the point where there's actually a radar altimeter. The radar altimeter doesn't rely on the ports, the static ports, and as a result, it was giving them a reading. My understanding is that it's not as accurate a reading, or maybe it wasn't in the 757, but it was still a reading of altitude. So, had they checked, they would have known at that time that that backup instrument, that second instrument, the redundant instrument was actually working to a point where they would know roughly how high they were.
At least they could have prevented them from crashing into the ground and given them time to get over land and where they may have been able to navigate to the airport visually based on city lights and so on. Unfortunately, because of the information overload, they did not see that. Sounds crazy afterwards, looking at it afterwards, but it's true. They did not see it. So, that was the first cause. It was operator confusion, pilot confusion. However, the most direct cause was that the ports had adhesive tape placed over the front of them. And you might say, why the hell would they have tape over the front of these ports? They need the ports for the instrumentation on the aircraft. So, what's happened is, in this particular model of plane, it was their standard method of cleaning. So, when they clean the plane down, they don't want water, high-pressure water or any other grit getting into these static ports, so they plug them up with some tape. They wash the plane down, clean it all up. They're supposed to take the tape off, and then the plane goes on its merry way. Someone forgot to take the tape off. And because of that, and in conjunction with the fact that there was multiple conflicting alarms, essentially an alarm storm happening in the cockpit, 70 people died because of a piece of tape. Now, you think, okay, well, I'm going to keep asking why, right? So, why, how is this possible? So, the guy who, I've written his name down, but I'm not going to try and pronounce it. It's the maintenance technician who had cleaned the plane and who was responsible for putting the tape on there in the first place and not removing it, went to jail for two years for negligent homicide. But the question that I didn't get an answer to, and I don't have the full report, obviously, I only have the Wikipedia article and the episode itself, they don't release the full details of the report, or if they do, I don't know where to get it. So, I don't have that report in front of me. 
I don't know if it answers these questions, but the questions that occur to me are, how is it that he just forgot? If there's a procedure that says, you know, tape them up, surely there's a step in the procedure that says remove them, you know, if it's there in a manual, there's a checklist that says, you know, remove tape, surely that's what it said. If that's the case, did he skip that step for some reason? Was he distracted? Did someone distract him at the critical moment and he forgot to go back to it because of distraction? Was he overtired? Had he been pulling double shifts or whatever because he needed the money? You know, was it because he was working two jobs because he needed the money? I mean, I don't know. I'm not sure what it was like. And was it symptomatic of his work ethic? That's the other thing. I mean, what if he had a long history of forgetting to put a nut and a bolt back on or something? Some people just lack that attention to detail and probably shouldn't be doing that job. So then what about the employee screening? What about supervision? What about someone to double check their work? And these are all procedural things that could have been put into place. They could have, but I don't have answers to any of those questions. But those are the questions that I would like to find out. But of course, we'll never know the final details, even if they're in the report. I guess we'd know if I had the report. But Aeroperú went out of business; they weren't in a good way before the accident. After the accident, it was kind of like the last nail in the coffin. So they went out of business within a few years of the accident and that was it.
However, the other interesting part of this is that Boeing actually got sued and Boeing was sued because they, apart from being the last person standing that had any money to sue, they were shown to have been neglectful in their design procedure, or their training procedure, sorry, for cleaning the aeroplane. So, they put it down to a training issue and, you know, sued the hell out of Boeing. Boeing coughed up massive amounts of money. People wonder why planes cost so much. It's not how much it costs to build them. It's for the damned insurance, right? So, although I'm sure a lot of it's how much it costs to build them, but still, insurance isn't cheap. So, in any case, Boeing took the majority of the blame for that. But the question I've got is, why... The 757 was an older plane at that point. So, with all of the procedures and everything in place, it's never happened before. Was it something that they should have been able to predict? I mean, did they do their fault tree analysis correctly when they went through this, you know, when they're doing the design for these things, you know, surely the thought occurred to them that this was kind of important. I mean, there's probably a thousand things, probably 2,000, 5,000, 10,000 things that are important with an aeroplane that if they fail, you're in deep trouble. Maybe they figured the backup altimeter was enough. Maybe they figured that having two people doing the checks after the plane was cleaned, maybe that was enough. They figured that was enough statistically. I mean, I don't know. I mean, how many other planes of that era had a proper system? Apparently, either as a result of this or around the same time as this happened, a lot of other planes have physical plugs that fit in those ports and they have a dangling bright orange coloured flag, so it's visually very obvious.
So that if you see a plane rocking up, about to leave, and it's got a bunch of dangling red flags under it, you might suspect that you should look at it. So, in any case, the question is, was that standard procedure? Was it required by law? What happened? I don't know. So, that sounds dramatic, maybe. I mean, I guess the point is that it's a terrible thing. But from my own experience, I can't claim that I've been involved in any fault analysis where someone died. I've been involved with ones where they were injured and I've been involved at a place where there was a massive explosion. So I'll talk about that one next. So I did some work at the Stanwell Power Station as an intern back in 1996. It was actually really, really great. I really enjoyed it. When I was there, or rather I wasn't there at the time and most people weren't, the incident happened about two in the morning. There's a 6.6 kV switch room full of massive switchboards and circuit breakers. And one of the circuit breakers had a phase-to-phase fault and exploded. The explosion was so big it actually blew a hole in the wall. The wall was, oh geez, half a foot thick rock block. So yeah, you know, significant blast energy from this. So what happened? Just out of the blue, randomly bang. And this power station was new. And when I say new, I mean, that particular unit and switchboard and circuit breaker had been in there for about a year tops, but operational for maybe nine months, 10 months. They'd followed all of their maintenance requirements. So circuit breakers after a certain period of time, even if they don't operate, are service checked. So they go through and they check everything's all hunky-dory inside. But they also have a number of operations. Every time you operate the breaker, after a certain number of breaker operations, what do you do? You go and check it out, make sure it's still good. So, well, they'd done all that correctly. So what happened?
Fortunately, there was no one in the room at the time of the explosion. Two in the morning, why would you be hanging out in the HV switch room? Well, normally, you don't even walk through the switch rooms. They advise you not to. Explosion is a good reason why. It's very rare, very uncommon. Okay, so what happened? What happened was, okay, the Stanwell Power Station, just for a bit of background, is a 1,440 megawatt coal-fired power station. There's four turbines and essentially the buildings, they tried to cut costs. Why does it always start with that? They tried to cut costs on ancillary buildings. And what they did is they built the ancillary buildings in the adjoining spaces between the different turbine generator sets. I'm not sure how many people here know the basic architecture of a power plant, but you tend to break them up into generator sizes because the bigger the generators get, the more unwieldy they get, law of diminishing returns, that kind of thing. So, to keep them a manageable size, a maintainable size, you tend to limit their size and have multiples of them. So, rather than having one massive one gigawatt generator, you'll have four smaller 350 megawatt generators. And that just makes everything easier in a lot of ways. And all of these are spaced apart because you don't want an incident at one to cause an incident in one of the others. So, a bit of independence there. So, what do they do? They figure, well, we'll build these ancillary buildings in the connecting spaces between the turbine halls and save the cost of X number of buildings. That required a subtle change to the way the concrete foundations had been poured, such that the pulverizers, which were directly beneath the offices, were on the same slab as the actual circuit breakers in the HV switch room. So, they're on a common slab. Common slab means not isolated, which means vibration will be carried from one to the other and back again. 
One big happy vibrating family. It was a common issue actually in the buildings in between, because you'd walk between them and there were hot spots. So, you'd stand in one spot on the floor and you'd feel the vibration and you'd see a desk nearby and there'd be like a pencil doing a little dance on the table. And you're like, "Is that normal or is that possessed or something?" First time I saw it, I thought poltergeist or something. That's crazy. Then when they explained all of this, I'm like, "Oh, that makes sense." Okay. Don't worry, it has a happy ending. No one gets hurt. Okay. But still. So, what happened? Pulverizers. Of course, people may not understand the way the coal works. So, dig the coal up out of the ground. It's a bunch of little rocks, right? Little black rocks, if it's black coal, which is what they use at Stanwell because, you know, black coal is plentiful in that part of Queensland. So, the coal comes up on the conveyors and splits across all the different boiler houses, and each boiler house has its own... actually that's not entirely true. I think there's... is it one pulverizer for each or is it... I can't remember actually. Interesting, my memory's rusty, sorry. Anyway, let's say there's a pulverizer for each turbine generator set. So the coal falls into the pulverizer and all the pulverizer is, is a bunch of enormous ball bearings sitting in a rut of sorts. You drop the coal in, spin the ball bearings around and crush, crush, crush, crush, crush. What do you get? You get fine coal dust. And then that is extracted. I think it's vacuum extracted or positive pressure. I can't remember which one, but anyway, it's blown or sucked one or the other into the boiler where it's torched. And of course, that generates heat that heats up the steam. Steam makes the generator spin and hallelujah, you got electricity. 
So the pulverizers, though, as you can imagine, massive ball bearings going round and round and round and round. What's that going to do? It's going to create a lot of vibration. So, the problem was that the circuit breakers were being shaken to pieces. All the previous plants had been built on exactly the same idea, but they just changed this one thing. They put it on a common slab and suddenly it caused a massive problem. So, it's the sort of thing that someone somewhere in a meeting room, I can just picture it in my mind. They're like, how can we save 250 grand on a bunch of, you know, and maybe that wasn't it. Maybe it was the decision that had been made and maybe there was a better way of doing it. They said, oh, we'll just pour it on a common slab. And someone just thought, yeah, that's fine. That'll work. Anyway. So what was the solution to all this? Okay, the happy ending. Happy ending was they simply separated the slabs, which you can do. A lot of cutting, a bit of digging. You know, they separate the slabs and cut back the vibration immensely. I mean, you'll still get vibration conducted through the soil, obviously, but nowhere near to that level. And I believe they also stepped up the frequency and the thoroughness of the inspections of the circuit breakers. And to the best of my knowledge, there hasn't been a problem since. But I find that whole thing fascinating because it's fascinating how you had a functional design, one tweak to that design. And I say a functional design. Stanwell was actually, there were three plants built at the same time. You had Tarong, followed by Callide B, then followed by Stanwell. They were all built to the same blueprint, same control system, which is a Siemens, whatever the hell it was, Teleperm, some ancient now, goodness. Anyway, and yeah, anyway, and there were Hitachi generators. And anyway, the details don't matter. But the point is, they were essentially carbon copies, except for that one detail. 
And that was the only plant that had a problem with that. So, OK, when you do this for a while, ultimately, you reach the conclusion that all problems are caused by people. And I realise it's a bit like saying the school would run perfectly without any students in it to get in the way, which I don't know how many times I've heard that. But all the trains would run on time if people would just not use them. But, you know, obviously, maybe, obviously, maybe not. But all problems come from people in one form or another because we make mistakes. And like I've said before, people have bad days. They get up on the wrong side of the bed, things go wrong. So what we try to do to control that is we create procedures. We create safety measures and some kind of structure and rigidity around what we do to ensure in higher risk activities that we reduce the risk of there being a problem. You can never eliminate it, though. That's the problem. That's the ultimate problem. You just cannot eliminate it. So anyway, one of the more interesting modern developments is the concept of fatigue management. For the longest time, if you wanted to work extra hours, it was a sign of, gee, you're pulling for the company, you're doing a great job. Yay. Way to go. Go you. You're bucking for a raise or whatever the hell. Right. Here's the thing. That leads to fatigue. Fatigue leads to mistakes and mistakes in certain parts and jobs can cost people's lives. One of the big pushes in the construction industry that I've witnessed in the last 10 years in Australia in particular, and my understanding is it's a global thing, it's not just Australia, is the focus on fatigue management and safety. Now, safety is a topic for another show. That's another very long topic, which maybe we'll get to one day. But honestly, people are now having enforced maximum work hours. So if you want to earn overtime, that's nice, but you're capped at like 12 hours a day. 
And that has to include travel to and from your place of primary residence to the place that you are working. So, if you've got a two-hour commute each way, then you're stuck working an eight-hour day. That's it. That includes your lunch break and all that time. It's the time you are consciously engaged on site. Because too many mistakes were happening because people were fatigued. They'd worked really long hours to meet deadlines. And now it was causing people to get hurt or die. So fatigue is something also on a non-industrial front, on non-engineering front, modern vehicles, I'm led to believe that they're developing systems for like Mercedes and BMWs, where they actually track how alert the driver is. Now that sort of stuff is amazing and fantastic and every car should have it. Driving fatigued is like driving when you're drunk. It's terrible. And yet people do it. Why? Because there's no definitive test for it. Well, not really. I suppose if you were to apply a DUI test, you know, the roadside test where they try and make you walk in a straight line, you put your finger on your nose. I don't even know if they still do that. But that's a meme in the movies, right? I mean, all I know is that nowadays I get pulled over and they get me to blow into a white hollow tube. Say the alphabet backwards, John. No, I can't. Let me write it down first and then I can do it. I was always scared that that would happen. So, yeah, but you know what I mean, right? Where's the test for fatigue? And there really isn't a good one. So, but the problem is that fatigue causes so many of these issues. I mean, the thing is that Aero Peru disaster, who is to say that that wasn't the reason? I mean, what if the guy had been up all night with a baby or something, which is something you could probably personally attest to right now? You know what I mean? You just don't know. 
And the rules when you're dealing with something that's so important like aircraft maintenance, I mean, surely they would have to have fatigue clauses. Now, back then, they probably wouldn't have. These days, it's becoming more and more common because people are realizing how dangerous it is. So, the problem with that is, there's always a problem. You can't force someone to switch off. You can't force someone to have a good night's sleep. I'll restrict you to 12 hours door to door. What you do in the other 12 hours is your business. I can't stop you. You want to spend the 12 hours partying hard at the local pub or nightclub or whatever. Can't stop you, you know? And therein lies the problem. So people still have to be responsible up to a certain point, but they're tightening up on fatigue. And I'm actually debating whether I would actually admit this one. I may as well, since I'm on the subject. But when my third, so I've got four kids, when my third child was still a baby, he was keeping me awake a lot. And I came into work, and there's a very big safety culture at the place that I was working at the time. And it became obvious that I was not getting enough sleep. And I was pulled into my manager's office and told quite succinctly that they noticed that the quality of some of my work in the last two or three weeks had been not as good because generally, I have, well, I like to think I have a high standard. Some people say I do, but depends on what they're talking about, I suppose. In the end, they noticed a slide in the quality of my work and they knew that I was yawning all the time and looking very tired, bags under my eyes, all the signs, all the warning signs. And I was told in no uncertain terms, they said, John, get your fatigue under control or we will get it under control for you. Which is, of course, a nice way of saying, because it's in the employment contract, that if it remains unresolved, that they can actually fire you. 
You're expected to show up to work well rested and ready for work, which is, you know what? Fair enough. They're paying me to show up. So I had no choice but to seek out alternative arrangements. I won't go into the details. I was still living in the same house, though. But the point was that fortunately for me in the weeks following that, just as luck would have it, my son started sleeping through and problem solved through indirect means, but temporary measures were taken. And that's going to become more and more serious as it comes more to the fore. Anyway. All right. So this is all well and sanitary and lovely and, you know, domino effect, cause and effect, you know, chain of events, all that sort of stuff. I find it fascinating, but it's the thought process that's really useful. And a lot of people do this without thinking about it. A lot of people don't do it as deeply as they should. So, they'll look at something that goes wrong in their life and they'll go down a step or two and say, oh, yeah, it was just because of blah. When the truth is, if they had kept digging, they would have found that there are actually other reasons that they were either too afraid to admit or were not able to see because they were happy to accept some higher level reason for what went wrong. So, I'll do a little bit of a retelling, I suppose, of what happened to me when I was in Calgary. So, as you know, I worked in Calgary for two and a half years all up, but it was in two segments. First time as an intern, second time as a full-time employee at Nortel Networks. And it was wonderful. I loved it. I loved every minute of it, except the minute where I was laid off. I didn't like that minute. And what happened is Nortel stock had gone in the tank and essentially they were cutting the budgets and I think I've talked about this before and the project I was working on the appliance BTS got canned along with me and worldwide 60,000 other people. 
And when something like that happens to you, you really need to ask yourself the hard questions as to well, why? So I'll walk you through my thought process on this, to illustrate how I think it can be a useful mechanism for people to keep digging until they get down to what they want. So if you've ever failed at anything, I recommend you do a root cause analysis on why. But you've got to be honest with yourself and it's going to be painful, some of it. The answers you're going to get might not be what you hope to find, but if you're honest about it, that will help you a lot. So for me, okay, I'm an engineer at heart. I always have been, even though I didn't realize it early on. I wanted to be a theoretical physicist, or an inventor, whatever the hell that meant. But I didn't realize it till later on. I guess people call this process I'm about to walk through soul searching. But as an engineer, I can't think of it as soul searching. I think of it as a root cause analysis. What does that tell you? But anyway, so why was I laid off? I was laid off because the stock went in the tank, they ran out of money, they canceled a bunch of R&D projects that weren't key and I was on a project that got canned. Okay, well, why? I moved to that department only four months previously. I used to be working in the reliability department. You know, why? Why did I leave? I was chasing a dream. I wanted to be at that point in my life, an RF hardware design engineer. That's what I wanted to be. All my background, amateur radio, I loved it. I wanted to do electronics, I wanted to do more of that stuff. So I thought, you know, great, I'll be an RF hardware designer, fantastic. I can't wait, this is brilliant. So why did I leave? I was chasing my dream. But at this point, you can't just keep asking why because it splits into two pieces. So the first piece to split off is, would I have been laid off if I'd stayed in the reliability team? The answer is yes, because they were cut back as well. 
Less projects to support, less requirement for reliability, last in, first out. Statistically, I would have been laid off there. So had I stayed in reliability, it wouldn't have mattered. End of path, switch to the other path. So could I still chase my dream job if I could lose it at any time through no fault of my own? In other words, have the rug pulled out from under me, was there still a point to chasing a dream job if I'm sacrificing everything else for it? That's where things get interesting. Truth was, I love Canada. It's a beautiful country. I miss it. Love North America. I mean, time I spent in the US was great too, but I missed home. I was homesick, that's true. I missed the beaches. I mean, Calgary was what? 1200 kilometers inland from the nearest ocean. So, it's a little ways inland. You've got the lakes nearby. I guess there are a handful of lakes and there's the Bow River. Yeah. But it's not the same as the ocean. Well, not to me anyway, I guess. I don't know. Didn't see it that way. Not that I'm a surfer or anything. And not that I'm a great swimmer either. But I just love the sound of the ocean. Anyway, so I missed the ocean. I missed my family. I mean, I was over there by myself. The only family I ever had was still in Australia. So, I missed them. So, I reached that point after two and a half years, I'd been laid off. Was I actually happy chasing this dream of mine? If my end goal was to be happy, was that going to actually make me happy? What I came to understand as I kept digging down deeper and deeper and deeper is I actually just wanted to be happy. And that's the simplest thing that everyone wants. But what did it mean for me? I realized that I wanted to be living where I wanted to live and my job and career was simply going to have to be whatever it turned out to be, and that chasing that dream, going around the world to try and find it, ultimately would not make me happy. And that's where my root cause analysis ended. 
It ended there. I packed up, sold my stuff and came back to Australia. I didn't go back to where I grew up because there were no jobs there. I went to Brisbane instead, and that's where I still am. So, that's how I see it's a useful exercise. And I know it's maybe, I don't know how that sounds. Does that sound strange to you? I mean, I know it's soul searching, but at the same time, there's a process to it. - No, I went through the same thing when the startup imploded. - And can you tell me a bit about that? - Yeah, well, it actually relates back in a way to the discussion of fatigue because it was, you know, it was one of those stereotypical death marches. Right. You know, kind of one mistake after another. None of us really knew what we were doing, what we got ourselves into. I've gone through a number of failure analyses on it. I could go through and point out all the moments when we made serious mistakes and the moments when we made little ones, but I'd say for me, the big one was the same thing, which was never really asking myself if this is what I wanted to do, or I guess how it fit into my goals and my plans for what was actually important for me. The same kind of thing. When you're working 12, 16 hours a day, every day of the week, you get that deep fatigue, right? That's beyond just being tired, it's that some parts of your brain are just kind of off, stay off, and you end up kind of doing crappy work and you end up kind of being a crappy person. At least that's how it worked out for me. And yeah, I think, sometimes I wonder if it's the kind of thing that you really can figure out ahead of time, right? I think you'd have to be pretty self-aware and really have it together to not ever make a mistake like that. - Absolutely. - I mean, it burned a couple of years of my life. Set it on. I kind of became unhealthy physically, mentally, emotionally, all these sort of things. 
On the other hand, there were good things about the experience, though. So it's not like it was a complete loss. I think going through a process, you know, not unlike what you're talking about, to kind of unpack it all, even though it took months, I was able to, you know, if I had to today, I could map it all out and point to the good things and point to the bad things. But on the whole, I would not recommend doing it that way. - Yeah. Well, the thing that I... Oh, go ahead. Sorry. I was just going to say, one of the things that I found fascinating about my experience is that I went over there with my eyes closed, essentially. And like you said, you don't know until you try it, I suppose, is the way to think about it. But I went over there chasing a dream. And the truth was that I learnt root cause analysis when I was working in the reliability team over there. I didn't want to do reliability out of the box. I found it interesting up to a point, despite the fact I nearly failed statistics at uni. Here I was doing statistics. Oh, my goodness. Terrible. It's very nice. Yeah, cheers, statistics. G+ GME. Anyway, I'm just like, go down that one. But goodness me. What happens is you learn about this technique of root cause analysis, and then you use it to basically realise that you shouldn't be there. The irony of that is not lost on me. It's like I went somewhere to learn about something that eventually taught me that I didn't need to be there. So, anyway, I just find that funny, amusing. This is years later, looking back, I can sort of laugh. But you know what, the funny thing is that like you said, there were good things too. And I mean, I got to live in another country for years. It's changed my view on the world and on life and people and it was wonderful. And if you haven't done it, I highly recommend it if you're in a position to do it. 
Because staying in one country and seeing things through one set of, well, one country's perspective is always going to be limiting. And there's so much more out there. And I consider Canada my home away from home. And I miss the Rocky Mountains and I miss snow, crazy as that sounds. I don't miss de-icing my car in the morning, I will admit. But, you know, and I know you guys are just going through that, you know, polar vortex or whatever the heck they called it. And it's still cold over there, I hear. But even so, I still miss it. And there were good bits. There really were. So, and I miss Tim Hortons, which is not good. But I mean, there you go. Anyway, I don't have too much more to add to that. But... - The one thing I just noted down, we were talking about, you know, people, I forget where that was, but in terms of being the source of failure. I don't know if you're familiar with the source of Murphy's Law and Edward A. Murphy, who is the guy it's named after. But he was working, it was an Air Force project, I forget what it was, it was some sort of crash test set up, one of those things where they shoot you in a rocket and see how much damage you can dummy. But the actual quote, I've found it here, is, "One day after finding that a transducer was wired wrong, he cursed the technician responsible and said, 'If there's any way to do it wrong, he'll find it.'" And I remember reading that a few years ago, and it just stuck with me that it actually, it's been misquoted and it's, keep in mind. - Yeah, well, that's, that's what, yeah, that's true. - And that's what I think about myself too. If there's any way to do it wrong, I'll find it. So. - Oh, that sounds a bit depressing now when you look at it that way. - Well, it's not, it's, but it's the, the point is if I'm making a system to not leave, you know, have the smart, the smart Ben be kind to the dumb Ben. - Cool. All righty, mate. Well, I think we're done. - Good. - You want to wrap it up? 
All right. If you want to talk more about this, you can find John on Twitter at JohnChidgey. It's the same on You should check out John's site, If you'd like to send an email, you can send it to [email protected]. I'm Ben Alexander, and you can reach me on Twitter @fiatluxfm. You can follow @PragmaticShow on Twitter to see show announcements and other related materials. Thanks, John. - Thank you, Ben. And thanks to everyone who listened live. Take care, everyone. Bye. [MUSIC]
Duration 58 minutes and 8 seconds

Ben Alexander

Ben created and runs and Fiat Lux

John Chidgey

John is an Electrical, Instrumentation and Control Systems Engineer, software developer, podcaster, vocal actor and runs TechDistortion and the Engineered Network. John is a Chartered Professional Engineer in both Electrical Engineering and Information, Telecommunications and Electronics Engineering (ITEE) and a semi-regular conference speaker.

John has produced and appeared on many podcasts including Pragmatic and Causality and is available for hire for Vocal Acting or advertising. He has experience and interest in HMI Design, Alarm Management, Cyber-security and Root Cause Analysis.

Described as the David Attenborough of disasters, and a Dreamy Narrator with Great Pipes by the Podfather Adam Curry.

You can find him on the Fediverse and on Twitter.