Causality 50: 737 MAX Ethiopian Air

23 April, 2023

CURRENT

Five months after Lion Air 610 crashed, another 737-MAX went down with a similar cause. However the official report was at odds with two other internationally respected investigative organisations. We dig into the detail of how the AOA Sensor was claimed to have failed, and review checklist discrepancies to extract fact from opinion as to what most likely triggered this horrible chain of events.

Transcript available
[Music] Chain of events, cause and effect. We analyse what went right, what went wrong, as we discover that many outcomes can be predicted, planned for, and even prevented. I'm John Chidgey and this is Causality. To celebrate the 50th episode of Causality, I'll be hosting three live Q&A sessions for current patrons in May 2023 to accommodate listeners' time zones all around the world. Details will be published on Patreon in coming weeks. A competition is now open where you could win your own Causality T-shirt. To enter, all you need to do is write a short or long post either on your own blog, the Fediverse, Twitter or Facebook, linking to and celebrating your favorite episode of Causality or just the show in general. Then, submit a link to your post via email to [email protected] to enter. The competition closes on the 31st of May, 2023 and you can enter as many times as you like. The best post will be chosen and the winner published on the network blog the following week. If you don't want to wait, you can just buy your own from the TEN store, where T-shirts for this and other TEN shows, smartphone cases and more are all available now, but for a limited time. The TEN store will be closing on the 14th of June, 2023 so get in while you can. Visit https://engineered.network/celebrate for details, and keep an eye on Patreon posts for all the details. Causality is entirely supported by you, our listeners. If you'd like to support us and keep the show ad-free, you can by becoming a Premium Supporter. Premium Supporters have access to high-quality versions of episodes, as well as bonus material from all of our shows not available anywhere else. Just visit https://engineered.network/causality to learn how you can help this show to continue to be made. Thank you. 737 Max: Ethiopian Air. In Episode 33 of this show, we delved deeply into the history of the Boeing 737 MAX and specifically the incident relating to Lion Air Flight 610.
Mention was made during that episode of a second incident relating to the 737 MAX regarding Ethiopian Air Flight 302 that occurred on 10 March 2019, less than 5 months following Lion Air 610. The draft investigation report was rather thin on detail at the time, shall we say, and when Episode 33 aired on the 31st of January 2020, insufficient detail was available to compare and contrast the two incidents. With the final report on Flight 302 eventually released on the 23rd of December 2022, we can now, finally, conclude regarding the Boeing 737 MAX. Only technical details that weren't already covered in Episode 33 will be covered here in this episode. For the full context of technical points relating to MCAS operation, please listen to Episode 33 first. Even if you've listened to it before, it's useful to re-listen to it as the reason MCAS exists and the other findings for the most part will apply here, plus a few others. With that all said, now let's talk about the incident. At 8.36am local time on Sunday 10 March 2019, Ethiopian Air Flight 302 lined up for takeoff on runway 07R at Addis Ababa Bole International Airport in Ethiopia. Aboard were 149 passengers, 5 cabin crew, 2 flight deck crew and an in-flight security officer, otherwise known as an Air Marshal, and they were heading to Nairobi, specifically to the Kenya Jomo Kenyatta International Airport, a trip that takes approximately 2 hours. The captain was Yared Getachew. He was 29 years old and had been flying for nine years. Yared had 4,017 hours experience on the Boeing 737 Next Generation model with 103 hours on the Boeing 737 Max. In Ethiopia people are addressed by their given names only, and moving forward he'll therefore be referred to only as Yared. The co-pilot was Ahmed Nur Mohammod. He was 25 years old and was a relatively recent graduate. He had 151 hours on the Boeing 737 Next Generation model with only 56 hours on the Boeing 737 Max.
The aircraft itself was relatively new and had only 1330 hours of flight time for a total of 382 cycles at that time. At 8.37am and 36 seconds, air traffic control cleared the flight for takeoff and 15 seconds later the aircraft began its takeoff roll. At 8.38 and 43 seconds, VR was reached and the aircraft rotated or lifted its nose and started departing ground level. Within a second of the nose rising, the AOA sensors, that's angle of attack sensors, began reporting disagree, activating the left-hand stick shaker, and calculated airspeed on the left-hand side began showing erroneous values relative to the right-hand side. Five seconds after takeoff, both the master caution and anti-ice lamps turned on. The captain and first officer attempted to engage autopilot from 400 feet of altitude. However, it kept disengaging, a total of four times, with the longest autopilot active duration lasting only 32 seconds. At 8.39 and 59 seconds, the captain radioed the tower, indicating they were having flight control problems and requested to maintain their initial heading. During this radio call, the autopilot once again disengaged. The MCAS system activated its first nose down trim command lasting 9 seconds, leaving a 2.1 unit stabilizer position requiring 90 lbs of force from the pilot to pitch up the nose of the plane at that point. Red-black stripes then presented across the speed tape on the left-hand controls followed by a GPWS don't sink warning for 3 seconds and a pull up warning message on both flight displays for 14 seconds. At 8.40 and 14 seconds, the captain further trimmed nose up for 2 seconds using the electric trim switches on the control wheel, reaching 2.3 stabilizer units. At 8.40 and 22 seconds, MCAS applied its second nose down trim command for 7 seconds, though it would have run for 9 seconds except the captain again applied a manual electric trim during its activation, which cancelled out the MCAS impact, returning to 2.3 stabilizer units.
During this engagement, the GPWS don't sink once again sounded and the pull-up was once again displayed. At approximately 8.40 and 38 seconds, the first officer and captain agreed to apply the stabilizer trim cut-out switches. These are guarded switches and disable the automatic electric trim which is found in all 737s. It was in fact the only way to stop MCAS from functioning and doing so meant they now had to trim the aircraft manually using the manual trim wheels. At this point in time the stabilizer was still at 2.3 stabilizer units. The aircraft was at 1,500 feet above ground level, although the left-hand side reported 500 feet lower than this. It was travelling at 332 knots, that's 615 kilometers an hour, with pitch at 2.5 degrees, climbing at 350 feet per minute. At 8.40 and 43 seconds the MCAS attempted to operate for a third time, however the cutout switches inhibited its operation. The plane's pitch was now effectively in the hands of the pilots manually, with one or both pulling up periodically, leading to pitch values varying between +7° and -2°. At 8.40 and 50 seconds the aircraft had reached 9,500 feet (2,900m) and air traffic control were advised by the first officer that the captain wanted to maintain 14,000 feet (4,300m), noting they still had a flight control problem. Over the next few minutes the captain and first officer applied considerable physical force to pitch up the aircraft in an attempt to reach 14,000 feet, with the overspeed alert now sounding. In the midst of the confusion, the captain and first officer had both failed to realize they were still at 94% N1 reference, with autothrottle still set to ARM mode. As an aside, N1 relates to the low pressure spool in a jet engine where N2 relates to the high pressure spool. It's referenced against maximum in percent and is roughly analogous to revs per minute in a reciprocating internal combustion engine. N2 is more important during initial engine startup, however once operating N1 is generally referenced.
Hence a 94% N1 reference is close to maximum thrust which is required for takeoff. Autothrottle maintains either a speed or thrust setting and was set to N1 which is a thrust mode. ARM mode on a Boeing 737 allows the pilot to set autothrottle without a set speed, which sounds odd but its intention is to allow for speed control but still have minimum speed protection. The use of ARM in the 737 flight crew training manual is described as follows. "The A/T ARM mode is not normally recommended because its function can be confusing. The primary feature the A/T ARM mode provides is minimum speed protection in the event the airplane slows to minimum maneuvering speed. Other features normally associated with the A/T are not provided." Now back to the incident. The captain asked the first officer to confirm that the trim was functioning properly in manual. However, the first officer concluded, stating and I quote, "it is not working." At 8:42 and 15 seconds the first officer requested a vector to return to the airport which was granted. They began a banking turn to return to the airport changing their heading from 102 degrees to a new heading of 262 degrees. At this point in time, the stabilizer was still at 2.3 units. The aircraft was at 6,200 feet or 1,900 meters above ground level, although the left-hand side reported 1,250 feet lower than this. It was travelling at 367 knots, that's 680 kilometers an hour, with pitch at plus 1 degree, descending at 125 feet per minute and banking 21 degrees right. In order to re-engage autopilot, the captain and first officer turned off the stabilizer trim cutout switches and began trim control once again using the electric trim switches. However, this action also re-engaged MCAS. At 8:43 and 21 seconds, the MCAS operated for the fourth time, pitching down for 5 seconds, moving the stabilizer to 1 unit.
During this activation, the captain and first officer decreased their average force pulling up from 100 pounds to 78 over three and a half seconds, during which time the pitch went from plus 0.5 degrees, nose up, to minus 7.8 degrees, now nose down. This increased the descent rate from minus 100 feet per minute to minus 5000 feet per minute. The combined force applied by the pilot and first officer registered at 180 pounds and despite this they could not pull out. At 8:43 and 36 seconds the Enhanced Ground Proximity Warning System (EGPWS) alerted "Terrain, Terrain, Pull up, Pull up." Impact occurred at approximately 8:43 and 44 seconds. According to the onboard sensors the aircraft was traveling at approximately 500 knots, that's 926 kilometers per hour, at the time of impact. The plane had crashed in farmer's fields near the town of Bishoftu, 62km or 39mi southeast of the airport that it took off from, leaving an impact crater of 27 meters, that's 90 feet wide, and 37 meters, that's 120 feet long, with wreckage found as much as 9 meters or 30 feet deep in the ground. There were no survivors. Let's talk about the investigation and reports. The Federal Democratic Republic of Ethiopia, Ministry of Transport and Logistics, Aircraft Accident Investigation Bureau or Ethiopian Accident Investigation Bureau or EAIB for short, conducted the investigation on behalf of the Ethiopian Civil Aviation Authority (ECAA). They issued multiple reports. The first was a preliminary report on the 19th of April 2019. The next was an interim report on the 9th of March 2020. The first final draft for internal review was released on the 12th of January 2021. The second final draft for internal review was released on the 26th of May 2021. The third final draft for internal review was released on the 30th of March 2022 and as we said previously the final was released on the 23rd of December 2022. Now this may seem odd but there is a good reason for this.
The International Civil Aviation Organization (ICAO) Annex 13, Aircraft Accident and Incident Investigation, requires that a preliminary report be released within 30 days of the incident. For the final report it requests, and I quote, "The state conducting the investigation should release the final report in the shortest possible time and, if possible, within 12 months of the date of the occurrence. If the report cannot be released within 12 months, the State conducting the investigation should release an interim report on each anniversary of the occurrence detailing the progress of the investigation and any safety issues raised." Noting the timing and dates of the reports released, it's clear that these were intended to meet this expectation. What's interesting is whether they intended these drafts to be an opportunity for feedback, and whether they were actually interested in taking it on. There are two external entities that were requested to assist in the investigation that are worthy of call out. The first was the Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA) from France. The second was a United States team comprising representatives from the National Transportation Safety Board (NTSB), Federal Aviation Administration (FAA), the aircraft manufacturer Boeing, and the engine manufacturer General Electric. In addition to this, the US team called Collins Aerospace in as a technical advisor in April 2019 after the EAIB requested assistance into the most likely failure modes of an AOA sensor. More about this in a minute. To quickly revisit the AOA sensors previously discussed in Episode 33 and how they relate to MCAS, that's the Maneuvering Characteristics Augmentation System: there are two AOA sensors, also sometimes called alpha vanes, fitted on every 737 on either side of the nose directly beneath a respective pair of pitot tubes.
The AOA sensors pivot around a central axis with the small reverse swept blade or fin, often referred to as the vane. Unlike an aerofoil, the fin operates just like a wind vane. It is blown backwards to a position where it has the least cross-sectional wind resistance, which is directly in the downstream direction of the airflow. The flight control computers receive their process values from sensors via multiple systems, including the Air Data Inertial Reference System, or ADIRS. The left ADIRU (Air Data Inertial Reference Unit) takes its value from the left AOA sensor and the right ADIRU from the right AOA sensor. MCAS is a flight control law, executed in a single Flight Control Computer only, based on the angle of attack value from a single sensor. MCAS is only present on the 737 MAX range of aircraft and it becomes active during manual flight (meaning autopilot is not engaged) with flaps fully up in position 0 when the AOA value received by the Master Flight Control Computer exceeds a determined setpoint value. The intention of MCAS was to avoid the need to retrain pilots who were used to the previous generation, the 737NG, since design changes gave the MAX a different pitch response. Upon customer release there was no mention of MCAS in any training materials and Boeing had advertised the newer plane as not requiring any additional retraining over the 737NG. So what went wrong? The investigators implicated MCAS as the primary cause of the incident, for much the same reasons as Lion Air 610. With an incorrect AOA sensor reading feeding the master flight control computer during manual flight, MCAS had operated when it should not have. The pilot and first officer then began a series of trim corrections, cutting MCAS out and back in again until MCAS commanded a nose down into the ground. In essence, the pilots were fighting the control system due to an erroneous input and lost.
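For those who think better in code, the activation conditions just described can be put into a rough Python sketch. This is only an illustration of the logic as described in this episode, not Boeing's implementation: the names `FlightState` and `mcas_would_activate` are made up for the example, and the setpoint value is an assumption since the real threshold isn't given here.

```python
from dataclasses import dataclass

# Hypothetical setpoint for illustration only; the real value is not stated in the episode.
ASSUMED_AOA_SETPOINT_DEG = 12.0


@dataclass
class FlightState:
    autopilot_engaged: bool
    flap_position: int  # 0 means flaps fully up
    aoa_deg: float      # reading from the single AOA sensor feeding the master FCC


def mcas_would_activate(state: FlightState,
                        setpoint_deg: float = ASSUMED_AOA_SETPOINT_DEG) -> bool:
    """Original-style logic: one sensor, no cross-check against the other side."""
    manual_flight = not state.autopilot_engaged
    flaps_up = state.flap_position == 0
    return manual_flight and flaps_up and state.aoa_deg > setpoint_deg


# A failed vane reporting an implausibly high angle (hypothetical number)
# satisfies every condition, because nothing compares it with the other sensor.
faulty = FlightState(autopilot_engaged=False, flap_position=0, aoa_deg=60.0)
print(mcas_would_activate(faulty))
```

The point of the sketch is that a single bad reading is enough on its own to satisfy every condition.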
Admittedly, the Lion Air 610 pilots managed to fight MCAS for longer, about twice as long for 12 minutes, but the issue here should be, why are we fighting MCAS at all? A more detailed deep dive into MCAS and its issues is in Episode 33 if you're interested. Given that the erroneous AOA sensor values triggered the MCAS behavior, the investigators found that, and I quote, "an AOA sensor malfunction most likely occurred as the result of a power quality problem that resulted in the loss of power to the left AOA sensor heater." So let's look at the AOA sensor heater just for a moment. The EAIB report states that the AOA sensors have an "embedded heater in a vane that thermally compensates to increase the vane surface temperature in high flow and for de-icing." It's potentially a bad translation, I tweaked it slightly, but in essence, the AOA vane, shaft and coupling must be kept free-moving at all times, otherwise it won't settle into the correct position when airflow travels over it. Seems simple enough. So if the heater had failed and ice had accumulated, then you might see readings like those encountered leading up to the incident. However, heaters are subject to the laws of thermodynamics and basic physics, of which we learn that thermal coefficients and hence thermal lag is a problem, both good and bad. Assuming a sensor is already iced up, when we turn on the power systems and activate the heater, it could take several minutes for the heating coils to transfer enough heat through the shaft and vane to start melting any ice that might be present. Conversely, if the sensor is not yet iced up and the heater turns off, there will be enough residual heat that it won't ice up for at least several minutes, and only then if external atmospheric conditions are right. So what was the weather like that morning? I'm glad you asked. 
The report indicated conditions at the time of the incident were approximately +13°C, that's 55°F, with a dew point of +11°C or 52°F, which aligns with the historical data from the local weather bureau. The aircraft had been on the ground for nearly three hours between flights, noting that the overnight low was +11°C and was far above freezing temperature. Its previous flight was an overnight from Johannesburg, a nearly six-hour flight that landed at 5:52am local time. The maintenance log showing that flight's arrival recorded no write-ups or rectification actions and there were no notes from the flight crew either. If there had been icing during the prior flight, an AOA disagree error should have indicated, but even if somehow it didn't, the heater did not show any errors until 6 seconds after the moment of rotation, which suggests the heater was still working fine up until that moment. I find it extremely implausible that the heater failure was the primary cause of the AOA sensor's incorrect readings. I have other reasons for thinking that too. A little more from the EAIB about the sensor though. The EAIB explored the maintenance log for the aircraft and noted that it had, and I quote, "suffered intermittent electrical electronic anomalies in addition to the flight control system malfunctions" and "three days before the crash the auxiliary power unit APU fault light illuminated and the APU had a protective shutdown. Where the onboard maintenance function computer message also indicated the start converter unit SCU showed the APU start system was inoperative." There was one other interesting note that they added and I quote "the captain's personal computer power outlet also had no power. They concluded the possibility of intermittent electrical-electronic system defects were an underlying issue." Okay then. This is supposed to be an engineering-based, fact-driven analysis that follows specific evidence that directly leads us to a specific root cause or causes. 
The above conclusion, to me at least, reads very much like it was "...an electrical glitch of some kind, probably...?" Somewhat flimsy? If this statement sounds harsh, maybe I should say it feels a little presumptive and inconclusive. I have no doubt that there were electrical glitches with this 737 MAX. The only important question is, did a specific electrical glitch specifically cause the MCAS to malfunction or not? Let's talk about the Collins Aerospace involvement. Collins Aerospace provided a report to the EAIB with their findings based on the flight data recorder information provided to them by the EAIB. One of the findings of the Collins report was the most likely cause for erroneous readings from the sensor was a bird strike. To search for evidence of a potential bird strike during or shortly following take-off, the investigators inspected the immediate take-off runway area for signs of debris to explain the damage to the AOA sensor or potential damage to the tail. They found no evidence that this was the case, responding in their final official report that, and I quote, "the investigation team cannot comment and verify on the conclusions noted in the Collins report." Now this is where things start to get a bit more interesting. The NTSB had provided comments and feedback to each of the internal drafts, the first within six weeks, the second within two weeks, and the third they took longer to submit but requested it be incorporated into the final report. It wasn't. Regarding the AOA sensor conclusion, the NTSB agreed with the Collins Aerospace Report which correlated sensor data with known failure modes of an AOA sensor. They had developed a detailed fault tree analysis that considered the following things. 1) Manufacturing defects; 2) Internal component failures; 3) Heater failures; 4) Non-impact structural failures of the AOA vane attachment hardware; and finally 5) AOA vane impact failures. The NTSB called out the following findings. 
The AOA reading began deviating on the left-hand sensor at 44 seconds after the beginning of take-off roll. The left alpha vane fail annunciation on the probe heat panel, indicating vane heater current below the monitor threshold, 6 seconds after the AOA deviations began, is consistent with a vane breaking at the hub and separating from the AOA sensor. These timings align with the moments following rotation or nose lift-off, and they noted that a small to moderate bird weighing approximately 230 grams or 1/2lb impacting at 170 knots would be sufficient to cause damage of this suspected kind. The NTSB were also critical of the delay the EAIB team took in searching the runway for debris and bird activity, which was eight days after the incident, and of the subsequent lack of any search of Taxiway D, the taxiway EA302 would have been directly above at the most likely time of bird impact. Additionally, the EAIB had reported officially on an engine failure event that had occurred months before this incident due to a bird strike, making a recommendation that the Ethiopian Airlines Group Airport Authority (EAGAA) "take practical measures to minimize/eliminate bird hazards around the airport so that arriving and departing flights are conducted safely without any human and material loss." Given this recommendation occurred 8 months after the EA302 incident, it's clear no additional bird control measures had been put in place. However, with some investigator personnel overlap between investigations, it's unclear why this was dismissed as a potential root cause in the EA302 final report. Let's talk a bit about crew training. In the EAIB report, Finding 83 states the following and I quote, "The Emergency Airworthiness Directive (AD) pilot procedures were inadequate and unverified. AD 2018-23-51 does not mention the possibility of an auto throttle malfunction due to an erroneous AOA input."
About that Airworthiness Directive, that's 2018-23-51, that was issued by the Federal Aviation Administration as an emergency directive on the 7th of November 2018, four months before the incident. The NTSB responded to this specifically stating, and I quote, "Even if such a reference document did *not* exist, the flight crew should have been trained on 737-8 MAX non-normal procedures. Non-normal procedures related to erroneous AOA inputs instruct the crew to disengage both the autopilot and autothrottle, thereby preventing the erroneous AOA inputs from affecting flight control and throttle movements." Interestingly, if you read the detail of AD 2018-23-51, it states the following for a runaway stabilizer condition. It says the pilot should "disengage autopilot and control airplane pitch attitude with control column and main electric trim as required. If relaxing the column causes the trim to move, set stabilizer trim switches to cut out. If runaway continues, hold the stabilizer trim wheel against rotation and trim the airplane manually." It goes on to say, "Initially, higher control forces may be needed to overcome any stabilizer nose-down trim already applied. Electric stabilizer trim can be used to neutralize control column pitch forces before moving the stab trim cutout switches to cut out. Manual stabilizer trim can be used before and after the stab trim cutout switches are moved." This was issued by Boeing to Ethiopian Air on the 6th of November 2018 following the initial findings from the Lion Air 610 incident, issued as a Flight Crew Operations Manual Bulletin, reference number ETH-12. So the pilots actually did attempt some of those things, however they applied the trim cutout switches perhaps too early and then, against the directive, un-cut-out the stabilizer trim when they probably shouldn't have. The extreme forces they had to apply were in part due to the near full thrust condition of the engines at the time.
Technically, though, it's true that that specific directive doesn't mention autothrottle. But the issue of speed and force applied to the trim stabilizers is one of basic aerodynamics. The NTSB's comments and their feedback specifically about autothrottle are interesting. The following direct quote from their feedback is read verbatim: "Because the autothrottle remained engaged and responsive to the erroneous AOA inputs, the autothrottle did not transition to N1 mode and remained in ARM mode with take-off thrust. The expected crew response is to manually control thrust in this situation. However, the lack of manual control and the absence of flight crew conversation regarding the thrust settings indicate that the crew did not notice the autothrottle's failure to transition to N1, even when the aural overspeed warning triggered as the airplane accelerated beyond about 340 knots. As airspeed increased, the required control forces increased on both the control column and the manual trim wheel." But it wasn't just the NTSB that was saying this. The BEA also deconstructed the crew's actions in their feedback, stating the following, "In the case of the IAS disagree, the flight crew has to apply the airspeed unreliable non-normal checklist. This checklist states to first disengage the autopilot, then the autothrottle." So after the updated directive was given to Ethiopian Air four months before this incident, do we know that the pilots were made aware of it? Ethiopian Air uses a system called LogiPad, which pilots are required to upload as standard procedure before going on a flight to grab the latest directives and bulletins. The company confirmed that at least every seven days this was done by the pilots involved. There is however no test for comprehension, no review or check that the uploaded documents have been read. The system can only confirm that they were uploaded to the pilot's device. The BEA stated the following regarding this system.
A contributing factor in this incident was, and I quote, "The use of the LogiPad system by the airline as a sole means to disseminate information on new systems and/or procedures which doesn't allow the evaluation of the crew's understanding and knowledge acquisition on new systems and procedures. The system was used to disseminate information related to the MCAS system issued following the previous 737 MAX accident and did not allow the airline to ensure that the crews had read and correctly understood this information." So this all feels a bit odd. Something that's bugged me for a while with these incidents is how the relative experience of the pilots changed how long the loss of control lasted before the ultimate crash. There were a lot of 737 MAXs flying out there, so it's a numbers game: with the AOA sensor now playing such a vital role in that mode of operation, surely we'd had a near miss before either of these incidents? It turns out there was, and whilst MCAS wasn't involved, the investigators did find that the pilots were not following the Boeing training manual, and this could or should have been a warning of how other pilots might react under the same alerts as observed on both Lion Air 610 and Ethiopian Air 302. The report I'm referring to is BEA 2018-0071 released on the 16th of November 2020 regarding the same aircraft with the same faulty AOA sensor over two flights on subsequent days, the 7th and 8th of February 2018. Whilst the incident did not involve a crash and no loss or damage occurred, the crew performance may have provided some key insights into how to address the issue with other 737 MAX flight crews. For both flights, an incorrect reading AOA sensor triggered AOA Disagree and Alt Disagree alerts, with one flight pressing on to its destination and the other radioing "PAN-PAN" and returning to their originating airport.
Both flight crews chose to follow the AOA Disagree and Alt Disagree checklists followed by the IAS Disagree checklist, and whilst they noted a brief reference to the airspeed unreliable checklist, the pilots did not follow it. The Boeing 737-800 flight manual airspeed unreliable procedure has the following key memory items. 1) Autopilot, if engaged, disengage; 2) Autothrottle, if engaged, disengage; 3) Flight director switches, both set to off. The BEA in this report states and I quote "In these two incidents, the pilots did not immediately carry out the memory items. In both cases, they first tried to identify the side which was supplying the erroneous information and initially used this assessment to continue the flight with the automatic systems engaged." Had Boeing dug deeper into the order in which these checklists were executed, ensuring flight crews followed them with revised training, it's possible that neither EA302 nor LA610 would have happened. Of course, that's leaving a lot in the hands of procedures and training for a system, MCAS, that wasn't even named in any training manuals or checklists. Also, of course, that would have required the BEA report to have been completed, reviewed and published before LA610, not years after it had actually happened. So it's been a while since Episode 33. And Episode 33's discussion regarding the fallout from Lion Air 610 was somewhat limited at that time since that was released on the 31st of January 2020. A few things have happened in the world since then, not just in the case of the 737 MAX. The SARS-CoV-2 pandemic, also known as COVID-19, had spread globally by March 2020, with many countries locking down all but essential air travel in and out. At its peak, the FAA reported in late April 2020 that air traffic in the United States had dropped by 96%. The 737 MAX fleet of aircraft were already grounded whilst investigations and rectifications continued from March 2019 following the EA302 incident.
Whilst the FAA issued a Continued Airworthiness Notification to the International Community (CANIC) for the 737 MAX on the 18th of November 2020, many 737 MAXs took significant time to return to service due to poor COVID-19 demand. On the 7th of January 2021 Boeing were charged with 737 Fraud Conspiracy and agreed to settle with the US Department of Justice for a total criminal amount of just over $2.5B USD. $0.5B of that figure is to a crash victim beneficiaries fund for both the Lion Air and Ethiopian Air incidents. The statement from the acting assistant Attorney General David P. Burns of the Justice Department's criminal division is worth reading and I quote "the tragic crashes of Lion Air flight 610 and Ethiopian Airlines Flight 302 exposed fraudulent and deceptive conduct by employees of one of the world's leading commercial airplane manufacturers. Boeing's employees chose the path of profit over candor by concealing material information from the FAA concerning the operation of its 737 MAX airplane and engaging in an effort to cover up their deception. This resolution holds Boeing accountable for its employees' criminal misconduct, addresses the financial impact to Boeing's airline customers and hopefully provide some measure of compensation to the crash victims' families and beneficiaries." Now for an incident of this sort of scale, as one might expect, it didn't end there for Boeing. But to briefly quote myself from Episode 33, "Certainly, Boeing, and to a lesser extent the FAA, for less than ideal oversight of Boeing's qualification of the 737 MAX have to shoulder most of the responsibility for these events." Well then. On the 16th of September 2020, a 238-page congressional report by the House Committee on Transportation and Infrastructure was released, taking 18 months of investigation to produce, that placed fault primarily with Boeing, with some resting also with the FAA. Nice to be validated there.
The report described "disturbing cultural issues relating to employee surveys showing some employees had experienced undue pressure as Boeing pressed to complete the 737 MAX ahead of other offerings released at the time by Airbus." Regarding the two 737 MAX crashes, the report stated and I quote, "They were the horrific culmination of a series of faulty technical assumptions by Boeing's engineers, a lack of transparency on the part of Boeing's management and grossly insufficient oversight by the FAA. The pernicious results of regulatory capture on the part of the FAA with respect to its responsibilities to perform robust oversight of Boeing and to ensure the safety of the flying public." On 2 September 2021, Boeing and Ethiopian Airlines reached an out-of-court settlement for an undisclosed amount. On 5 November 2021, Boeing's directors settled the shareholder lawsuit for $237.5M US dollars. The shareholder lawsuit claimed, and I quote, "Boeing's directors and officers failed them in overseeing mission-critical airplane safety to protect enterprise and stockholder value." On 11 November 2021, Boeing accepted liability for overseas family compensation claims relating to Ethiopian Air 302 to be submitted by the US court system. The final amount of compensation via this pathway remains unclear. And again, if I may quote myself, again from Episode 33, "I think that aviation authorities around the world need to reconsider what constitutes a genuine derivative design and when grandfathering provisions should and should not apply, as it encourages aircraft manufacturers to make incremental changes to an aircraft's design and avoid a full regression test of all of the impacted aspects of those changes." So about those regulations... 
The Aircraft Certification, Safety and Accountability Act was passed on 17 November 2020, which requires the FAA to do many things, but key items of interest to me are, require manufacturers to disclose to the FAA certain safety critical information related to an aircraft, and revise and improve its process of issuing amended type certificates for modifying an aircraft. I should damn well hope so. On the 6th of March 2023, the FAA released another policy proposal that would require applicants who want to modify the original transport category aircraft designs to disclose all proposed changes in a single document at the beginning of the certification process. Again, yes...indeed, great idea! Finally!! In terms of the final costs to Boeing, the costs of the 737 MAX incidents to Boeing as a company are continuing even today. At the end of 2020, over 800 737 MAX orders had been cancelled, and with the production shut down between the COVID-19 pandemic and the order reduction, it's difficult to separate the two. Best estimates are that by mid-2022 Boeing had lost approximately $20 billion between lost orders, re-compliance and compensation claims. So what did Boeing do to get the 737 MAX re-certified? On the 20th of November 2020, the FAA issued AD 2020-24-02 that superseded their previous airworthiness directive, 2018-23-15, that we mentioned previously regarding the 737 MAX aircraft. Rather than dig into every detail, the summary of key points from the directive is as follows. Boeing to install new flight control computer software. This change is intended to prevent erroneous MCAS activation, among other safeguards. A direct extract states, "The new flight control laws now require inputs from both AOA sensors in order to activate MCAS." Boeing to install updated cockpit display system software to generate an AOA disagree alert. This will alert pilots that the airplane's two AOA sensors are disagreeing by a certain amount indicating a potential AOA sensor failure. 
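As an aside for readers of the transcript, the two software changes just described (MCAS needing inputs from both AOA sensors, and an AOA disagree alert) can be illustrated with a minimal sketch. This is not Boeing's flight control logic. The directive as quoted only says the sensors must disagree "by a certain amount", so the threshold value, the names `AoaReadings`, `aoa_disagree_alert` and `mcas_may_activate`, and the rule that a missing reading blocks activation are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder value. The directive as quoted only says
# "a certain amount"; the real threshold is not stated in this episode.
DISAGREE_THRESHOLD_DEG = 5.0


@dataclass(frozen=True)
class AoaReadings:
    """Angle-of-attack readings in degrees; None means no valid input."""
    left_deg: Optional[float]
    right_deg: Optional[float]


def both_valid(readings: AoaReadings) -> bool:
    """True only when both sensors are supplying a reading."""
    return readings.left_deg is not None and readings.right_deg is not None


def aoa_disagree_alert(readings: AoaReadings,
                       threshold_deg: float = DISAGREE_THRESHOLD_DEG) -> bool:
    """Raise the alert when two valid readings differ by more than the threshold."""
    if not both_valid(readings):
        return False  # assumption: a missing input is handled by a separate fault path
    return abs(readings.left_deg - readings.right_deg) > threshold_deg


def mcas_may_activate(readings: AoaReadings,
                      threshold_deg: float = DISAGREE_THRESHOLD_DEG) -> bool:
    """Permit activation only with inputs from both sensors that agree."""
    return both_valid(readings) and not aoa_disagree_alert(readings, threshold_deg)


# Illustrative values only: one vane reading far higher than the other.
failed_vane = AoaReadings(left_deg=74.5, right_deg=4.0)
alert = aoa_disagree_alert(failed_vane)      # both sensors valid but far apart
permitted = mcas_may_activate(failed_vane)   # activation is blocked
```

The point of the sketch is the design choice rather than the numbers: a single bad sensor can no longer drive the system on its own, because the second sensor has to be present and has to agree before activation is permitted.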
Boeing to incorporate new and revised operating procedures to the airplane flight manual. This change is intended to ensure the flight crew has the means to recognize and respond to erroneous stabilizer movement and the effects of a potential AOA sensor failure. In addition to these design changes, FAA also will require operators to conduct an AOA sensor system test and perform an operational readiness flight prior to returning each airplane to service. Now Transport Canada and the European Union Aviation Safety Agency (EASA) they didn't fully accept this directive, suggesting that it didn't go quite far enough and thus they didn't adopt it at the time. With Transport Canada instead issuing its own directive on the 18th of January 2021. Some, but not all, of the additional requirements included the addition of coloured caps on the circuit breakers for the stick shaker to allow for easier identification, an enhanced flight deck procedure that provides the option for a pilot in command to disable a loud and intrusive warning system, in other words the stick shaker, when the system has been erroneously activated by a failure of an AOA. The EASA requested similar additions to Transport Canada before finally accepting the 737 MAX was fit to return to service once those requirements were met and the EASA followed on the 27th of January 2021. So what do we learn from all this? There are a few things to take away from this incident beyond those that were already discussed in Lion Air 610 but perhaps not what you might think. Firstly let's talk about Root Cause analysis and opinions. Clearly the BEA, NTSB and Collins had a different opinion from the EAIB, but this shouldn't be about opinions, it should be about facts. Yes, the official report told its story, but the facts by equivalence disagree with the formal report. It's important to remember when you're reading these formal reports to keep some level of skepticism. 
Reports are written by people and people have opinions, but facts don't have opinions and that's what makes them facts. If you're ever investigating anything and doing a root cause analysis and you or your team or organization might be implicated by a finding, you have to be prepared to admit fault if that's the real root cause because otherwise you and others won't really learn anything. Ultimately though, we each have to sift through the opinions and facts that are jumbled together and take our own learnings as best we can. The other learning that's perhaps less clear is the adherence to the intent of the ICAO requirement to release a report with findings as soon as possible following an incident. For comparison, the time lag between the incidents and the final report for Lion Air 610 was 49 weeks, not even one year. The time lag for Ethiopian Air 302 was 197 weeks, that's just under 4 years or 4 times longer. During that time, multiple other organisations had completed their reviews of the aircraft and of Boeing as an organisation, the FAA had passed new regulations, Boeing had updated the flight control software and a pandemic came and went. Well...mostly went...and all the while, we were still waiting for a report that incorporated very little feedback and whose contents did not improve proportionally with the amount of additional time that had passed. The BEA report on the near miss could have been quicker as well, coming in at 144 weeks or just under three years. And that may have potentially prevented these incidents if Boeing and the FAA had it early enough but would have required a very short turnaround. To be clear, do your investigations with some hustle, listen to the experts, incorporate their findings into your own and be prepared to throw yourself under the bus if that's a true root cause. Ultimately, investigating is hard. For the record, I have no staff on this show. I never have. It's just me. Only me. 
I'm not a formal investigator outside of the company I work for, and even within the company, it's more of a role that Technical Authorities like myself are asked to undertake from time to time. I've done a lot of Root Cause Analyses, Fault Trees, HAZOPs, CHAZOPs, and 5-Whys in the nearly three decades of my professional engineering career around the world. I have never investigated an incident where one or more people lost their lives. I have never investigated an incident where the cost outcomes broke the $1M mark. I've never been in charge of a team of investigators because I didn't need a team to undertake the investigations at the scale that I was asked to. I have, however, organized and performed post-incident reviews, traced alarm logs, maintenance histories, operating procedures, organizational, structural changes, local weather conditions, and challenged biases, had my own biases challenged, and written ridiculously detailed reports with many, many findings. And my conclusion from all of that is simple. Investigating is hard. Or rather, it's very hard to do it well. Sometimes you just can't find the facts, and in my experience, it's not the facts that present the problem, it's the people. And people aren't evil. They're not good either. They're just people. Our recollections vary from day to day with the passage of time, and we're subject to influence, both real and perceived. And sometimes those subtleties can dramatically change the findings, even though they shouldn't. I have worked at companies with a no-blame mantra, where knowledge sharing and admitted fault is celebrated as an opportunity for mutual learning, and I think that's noble and I'm glad that environments like that exist. But don't kid yourself. If the stakes are real, the consequences need to be real too. I like to say to others, "It's not about blame, but it actually is..." as a caution to other people but also a reminder to myself that I'm not above judgment. 
When you're investigating an incident, no matter how many times people press for the true root cause for mutual learnings, the human element will always be there pushing back if it feels the facts portray them in a bad light, and they may be blamed, either in whole or in part. Beyond egotistical reasons, there's legal consequences, contractual consequences, and litigative consequences to admission of guilt...in certain circumstances. You have to be prepared to throw yourself under the bus you're driving, even if that's physically difficult (you know what I mean.) I rely on official reports produced by people whose job it was to overturn every rock, pull apart every thread, and dig into every minute detail. Sometimes, reading other investigation reports, you reach a point where you know there were issues with the report. And this is an example of that. It's clear to me that there's a fundamental disagreement with what set these events into motion with the failed AOA sensor. The EAIB blame Boeing and their bad wiring, and Collins and their AOA sensor. The NTSB and Collins blame an uninvited external third party: a bird. If we briefly suspend evidence for a moment, why would each party be debating this at all? MCAS was certainly the root cause, and no one's arguing about that. But what was the initiating event? Who really is to blame, beyond just the Boeing 737 MAX MCAS design at that time? If an Ethiopian authority blames poor bird controls at an airport in their country, there may be liability pressures from companies and families for compensation, bad publicity and reputational damage. If the NTSB and Collins demonstrate it was a third party event, then Collins protect their reputation and may avoid some liability concerns as well. But this show isn't about that kind of speculation. So when we're investigating something, what do we do? We follow the evidence as best we can. 
There were insufficient clues in the debris to conclude the initiating event, or even if the vane was connected at the moment of impact, based on debris at the crash site. In cases like this, investigators have to do the next best thing, replicate the circumstances with equivalent equipment and attempt to inject the same suspected failure modes to mimic and get the same results. This is exactly what Collins did and their conclusion was a bird strike fit the failure they observed in the lead up to the incident. If you have to weigh, on the one hand, some wiring problem with the heater, which should not have an immediate impact on the sensor, against failure testing under equivalent circumstances that led to the same behaviour, then the choice should be obvious. So far as crew training goes, that's yet another issue. But then, it's not like the MCAS system was obvious, nor was it mentioned, nor was there specific training mentioning it as part of the 737-NG to 737-MAX migration training, as previously discussed in Episode 33. I could go on about it, but I'm not going to again. Between the two incidents, 346 people died because of the MCAS system. Both reports into each incident agree on that point. The regulatory improvements made by the United States government, if applied correctly by the FAA and Boeing specifically, should prevent incidents like this from occurring in the future. The updates Boeing made to the 737 MAX address the deficiencies that MCAS had and it's now as safe a plane as most out there. I'd happily fly on a retrofitted one at this point. So perhaps in the 4-1/2yrs since Lion Air 610, the right outcomes have eventuated. The correct course corrections have been made. And that's a good thing. But I can't shake my concerns about Boeing as a company and how they performed engineering in their designs in recent times. Boeing was a well-established business with a solid engineering reputation. 
But as we discussed in Episode 33, the push internally was to get the design out the door and minimise cost for the end customer, and risk probabilities were downrated when they should not have been. Younger engineers look at the history of the solid, reliable designs that went through detailed rigour and risk assessments, but because they haven't seen the failures, they haven't lived the consequences of cost-centric decision making, they make the mistake of putting cost ahead of risk. This isn't a problem that's unique to Boeing either. As engineers, we need to be constantly vigilant to ensure that those that make decisions understand those risks before the decisions are made. We need to ensure that the engineering processes are followed. We need to ensure that risks are fairly assessed and to stop the job if we need to, because if we don't, someday our inaction will lead to a failure and someone may be injured or may die. Finally though, on a self-reflecting note, and perhaps fittingly, after 50 episodes of this show, that's a show that places formal investigative reports in very high regard as the source of facts. Not all reports are created equally. We should read each of them carefully. Be mindful of the biases in play, including your own, and be balanced in the learnings that we take from every incident. Be sure within yourself that you've taken away the best things to avoid future incidents from occurring, and not just blindly trusting a one-page summary. I think it's good to be a little skeptical. I think it's good to keep an open mind. Forever a skeptic? I guess I am. I think that's okay. To celebrate the 50th episode of Causality, I'll be hosting 3 live Q&A sessions for current patrons in May 2023 to accommodate listeners' time zones all around the world. Details will be published on Patreon in coming weeks. A competition is now open where you could win your own Causality T-shirt. 
To enter, all you need to do is write a short or long post either on your own blog, the Fediverse, Twitter or Facebook, linking to and celebrating your favorite episode of Causality, or just the show in general. Then submit a link to your post via email to [email protected] to enter. The competition closes on the 31st of May 2023, and you can enter as many times as you like. The best post will be chosen and the winner published on the network blog the following week. If you don't want to wait, you can just buy your own from the TEN store, with T-shirts for this and other TEN shows, smartphone cases and more, all available now, but for a limited time. The TEN store will be closing on the 14th of June 2023, so get in while you can. Visit https://engineered.network/celebrate for details and keep an eye on Patreon posts for all the details. If you're enjoying Causality and you'd like to support us and keep the show ad free, you can by becoming a premium supporter. Just visit https://engineered.network/causality to learn how you can help this show to continue to be made. Thank you. A big thank you to all of our supporters. A special thank you to our Silver Producers Mitch Biegler, Lesley, Shane O'Neill, Jared Roman, Joel Maher, Katharina Will, Chad Juehring, Dave Jones and Kellen Frodelius-Fujimoto. And an extra special thank you to both of our Gold Producers, Steven Bridle and our Gold Producer known only as "R". Causality is heavily researched and links to all materials used for the creation of this episode are contained in the show notes. You can find them in the text of the episode description of your podcast player or on our website. Causality is a Podcasting 2.0 enhanced show and with the right podcast player you'll have episode locations, enhanced chapters, and real-time subtitles on selected episodes. And you can also stream Satoshis and Boost with a message if you like. There's details on how, along with a Boostagram leaderboard on our website. 
You can follow me on the Fediverse @[email protected] or the network @[email protected]. This was Causality. I'm John Chidgey. Thanks so much for listening.
Duration: 52 minutes and 7 seconds

Show Notes

This show is Podcasting 2.0 Enhanced

Celebrating Causality’s 50th Episode!!

Previous Episode:

Reports:

General Information:

Articles:

Regulations:

Litigation:


Episode Gold Producers: 'r' and Steven Bridle.
Episode Silver Producers: Mitch Biegler, Shane O'Neill, Lesley, Jared Roman, Joel Maher, Katharina Will, Chad Juehring, Dave Jones and Kellen Frodelius-Fujimoto.
Premium supporters have access to high-quality, early released episodes and a full back-catalogue of previous episodes.

People


John Chidgey

John is an Electrical, Instrumentation and Control Systems Engineer, software developer, podcaster, vocal actor and runs TechDistortion and the Engineered Network. John is a Chartered Professional Engineer in both Electrical Engineering and Information, Telecommunications and Electronics Engineering (ITEE) and a semi-regular conference speaker.

John has produced and appeared on many podcasts including Pragmatic and Causality and is available for hire for Vocal Acting or advertising. He has experience and interest in HMI Design, Alarm Management, Cyber-security and Root Cause Analysis.

Described as the David Attenborough of disasters, and a Dreamy Narrator with Great Pipes by the Podfather Adam Curry.

You can find him on the Fediverse and on Twitter.