Analytical 33: Sinking

21 December, 2018


Looking at how maintenance teams function can reveal a lot about the system they’re maintaining. We dive into the consequences of poor design and long-term operational costs.

Transcript available
[Music] Everything can be improved, iterated and refined. And if you don't think that's true, maybe you haven't analysed it enough. Calculated choices, carefully considered, absolutely analytical.

This episode is brought to you by ManyTricks, makers of helpful apps for the Mac. Visit for more information about their amazingly useful apps. We'll talk more about them during the show. Analytical is part of the Engineered Network. To support our shows, including this one, head over to our Patreon page or for other great shows at today.

Sinking. That's a bit of an odd title for an episode, I guess, but drawing on an analogy from episode 22, Teamwork, I want to talk a bit about boats, floating and sinking ones. Now stick with me. That's not really what I want to talk about.

Anyway, in the past two decades, I've been really fortunate in engineering to work with and observe many different teams and groups within different size organizations around the world. The thing that strikes me though, when I think about the architecture of organizations and groups, is the base design that they're supporting. In organizations where you have a system that you need to build, extend and maintain, you need one or more people supporting and maintaining it or your business is ultimately gonna fail.

So specifically, I'd like to focus on the maintenance aspect just for this episode. The reason being that when you're building something from nothing, it's a very different mindset, and maintaining some system is the core business of all businesses once they've been built anyway, and they're all up and running. The trick though, seems to be ensuring that when you're maintaining, you're keeping on top of issues as they arise, because nothing lasts forever. Nothing is absolutely reliable. Something will always fail, so you have to keep on top of issues when they come up.

The kinds of issues I'm talking about could be broken down into two types.
Service impacting, and hence presumably they're very urgent and you should probably resolve them as quickly as possible, and redundancy impacting. That is to say they're less urgent, you haven't lost service, but if you don't resolve them then another failure will lead to a loss of service. Now for the purposes of this exercise we'll ignore relative scales. That is to say, a service impact to one person or one customer versus a service impact to thousands of customers. So we'll just take that off the table for now.

Of course, ideally, you would design a system to be fully redundant in every single aspect, in every respect. But in reliability engineering, we balance upfront and ongoing costs with system availability. And you quickly learn that pushing redundancy in all aspects isn't viable. You will always have single points of failure, and the trick seems to be to keep those single points of failure as simple and as low failure rate as you possibly can.

Now irrespective of Markov modeling and system design and reliability engineering and all that other really good stuff, which is probably worthy of a different podcast to be honest, I want to focus on two scenarios instead. Firstly, a system that has poor reliability driven by a poor design, and secondly, a system with excellent reliability and a solid design. So what do the ultimate outcomes for an organization look like when they are driven to support option A or option B?

So let's look at our first option, the poorly designed system. In a poorly designed system, what can you expect if you're maintaining it? First of all, you could expect a large number of service impacting faults directly related to the lack of redundancy. Now that large number of faults is gonna drive a requirement for a ticketing system that can capture and track dozens or hundreds of breakdown requests, maybe even every single day. And to support that, you're probably gonna need multiple levels of triage.
The trivial level first, maybe we could call that first level support. More complex would be second, and then a third for the most complex of fixes and investigations. Staffing for that first level, that bumper bar level, if you prefer to think of it like that, has to be quite high. And as we go up the layers of support, the number of people and staff at those levels will be decreasing. And there'll either be a manager for each layer or perhaps a manager for all the layers, to ensure that those tickets are being closed satisfactorily and in a timely fashion. KPI driven behaviors will lead to fast closeouts, refer to episode 23 about metrics. Or, either that or it'll be palming issues off, pushing back to the people raising tickets with false closeouts, leading to low customer satisfaction, both inside and outside the organization. That tends to be what can happen.

Internally, however, the management of this support team would tell anyone that would listen, "Hey, we've got the best ticketing system and all our staff are super busy. They're working hard, constantly under the pump to deliver outcomes for this business." But they have only got that by necessity, and they've become exceptionally good at patching a leaking boat. And there's tying it back to that analogy again about sinking.

Now, before we go on any further, I wanna talk about a sponsor for this episode, and that's ManyTricks, makers of helpful apps for the Mac, whose apps do, you guessed it, many tricks. Their apps include Butler, Keymo, Leech, Desktop Curtain, Time Sink, Moom, NameMangler, Resolutionator, and Witch. And there's so much to talk about for each app that they make, we're just gonna touch on highlights for five of them.

Witch: you should think about Witch as a supercharged Command-Tab app switcher. If you've got 3 or 4 documents open at once in any one app, then Witch's beautiful, simple pop-up will let you pick exactly the one you're looking for.
Recently updated, you can now also switch between tabs as well as apps and app windows with horizontal, vertical or menu bar switching panels. With Text Search for Switching, you can show the front-most app in the menu bar icon. It also now has Touch Bar support and much, much more.

Time Sink: track the time you spend in apps or activities on your Mac. It's a simple and easy way to do it. You can pool your apps by common activities, create custom trackers for non-Mac activities, and its simple but powerful reporting feature shows you exactly where your time went so you can start to plan and stay focused.

NameMangler: you've got a whole bunch of files to rename quickly, efficiently and in large numbers. NameMangler is great for creating staged renaming sequences with powerful Regex pattern matching, recently enhanced, showing you the results as you go, and if you mess it up just revert back to where you started and try again.

Moom: it makes it easy to move any of your windows to whichever screen positions you want (halves, corners, edges, fractions of the screen), and then you can even save and recall your favourite window arrangements with a special auto-arrange feature when you connect or disconnect your external display. It was recently updated to be even faster. It now has Touch Bar support and keyboard integration with Adobe's apps. It's the first app I load on a new Mac because it's just awesome.

Resolutionator is so simple. A drop-down menu from the menu bar and you can change the resolution of whatever display you like that's currently connected to your Mac. The best part though, you can even set your resolution to fit more pixels than are actually there. It's very handy if you're stuck on your laptop but you need more screen real estate.

Now that's just 5 of their great apps, and that's only half of them. All these apps have free trials and you can download them from ManyTricks or and you can easily try them out before you buy them.
They're all available from their website or through the Mac App Store. However, if you visit that URL, you can take advantage of a special discount off their very helpful apps, exclusively for Engineered Network listeners. Simply use Pragmatic18 (that's Pragmatic the word and 1-8 the numbers) in the discount code box in the shopping cart to receive 25% off. This offer is only available to Engineered Network listeners for a limited time, so take advantage of it while you can. Thank you to ManyTricks for sponsoring the Engineered Network.

Now for our well designed system. There's a large number of redundancy impacting faults, but none of them are service impacting. Support teams monitor the system for redundancy and they raise their own support tickets against themselves to track their own workload. Rectification of redundancy faults happens in the background at a far more regulation pace, if you'd rather, and is generally unnoticed by the greater organization and its customers. And if they're transitioning from a less reliable system, the first sign of improvement is many months later, when the business organization realizes that they've needed to raise fewer external tickets and there have been a lot fewer outages.

KPI driven behaviors aren't as prevalent and there's less pressure to return to service, because service was never lost. It becomes the business of maintaining redundancy and maintaining integrity. There's significantly less noise, allowing a flattening of the tiered support structure, often to a single level, leading to a different model where you have generalists and subject matter experts. And this subgrouping internally helps to balance the load. Fewer support staff are needed since there are fewer service outages and fewer tickets, hence you don't need to have as much administration to respond to all of them.
So they have, essentially by good design, not needed to be exceptionally good at patching the leaking boat, since their boat was designed not to spring leaks constantly from the beginning.

What I've observed is that some people believe that cutting costs, flattening infrastructure to save monthly hardware or virtualization costs, hence making the system less reliable, is how one adds value to a business. "Look, I've cut hosting costs, that's great." But the truth is that costs need to be considered in a truly end-to-end fashion, and the cost of labor is worth a lot more than the cost of hardware. Paying more up front for that hardware, or even monthly virtualized hosting costs, can still save huge amounts of labor in the longer term if you consider the indirect support costs over that period.

Pushing for cheaper inevitably leads to a larger number of incidents. This then requires a more polished, detailed incident management system, which is then thought by some to be a positive outcome. Wouldn't it make more sense to reflect on what about the design has led to the requirement for this outcome? Had the design been more fault tolerant, then there wouldn't have been so many service impacting incidents in the first place. Why do I have such a large support team, and what does that truly tell us?

Ultimately, these are two very extreme ends of the spectrum, but the reality is most organizations will sit somewhere in the middle. So if you're part of such an organization, I would encourage you to think about the sort of organization that you're in and where it sits on that spectrum. Are you exceptionally good at patching a leaking boat, or are you good at building a boat that doesn't need constant patching in the first place? Now, once you've figured that out, where would you rather work?

If you're enjoying Analytical and want to support the show, you can. Like some of our backers, Carsten Hansen and John Widlow.
They and many others are patrons of the show via Patreon, and you can find it at or one word. Patron rewards include a named thank you on the website, a named thank you at the end of episodes, access to raw detailed show notes, as well as ad-free high-quality releases of every episode. So if you'd like to contribute something, anything at all, there's lots of great rewards, and beyond that, it's all really, really appreciated. Beyond Patreon, there's also a PayPal for one-off contributions at or one word, but if you're not in a position to support the show financially, that's totally fine. There are other ways you can still help. Leave a rating or a review in iTunes, favorite this episode in your podcast player app of choice, or share the episode or the show with your friends or via social. All of these things help others to discover the show and can make a huge difference too.

I'd personally like to thank ManyTricks for sponsoring the Engineered Network. If you're looking for some Mac software that can do many tricks, remember, specifically visit this URL, for more information about their amazingly useful apps. Check them out.

Analytical is part of the Engineered Network, and you can find it at, and you can follow me on Mastodon at [email protected] or the network on Twitter at engineered_net. Accept nothing. Question everything. It's always a good time to analyze something. I'm John Chidgey. Thanks for listening. [Music]
Duration: 12 minutes and 41 seconds

Premium supporters have access to high-quality, early-release episodes with a full back-catalogue of previous episodes.


John Chidgey


John is an Electrical, Instrumentation and Control Systems Engineer, software developer, podcaster, vocal actor and runs TechDistortion and the Engineered Network. John is a Chartered Professional Engineer in both Electrical Engineering and Information, Telecommunications and Electronics Engineering (ITEE) and a semi-regular conference speaker.

John has produced and appeared on many podcasts including Pragmatic and Causality and is available for hire for Vocal Acting or advertising. He has experience and interest in HMI Design, Alarm Management, Cyber-security and Root Cause Analysis.

You can find him on the Fediverse and on Twitter.