Analytical 23: Metrics

2 March, 2018


Metrics can be very useful if they are carefully chosen. Badly chosen metrics can do more harm than good, so how can you tell the difference?

Transcript available
[Music] Everything can be improved, iterated and refined, and if you don't think that's true, maybe you haven't analysed it enough. Calculated choices, carefully considered, absolutely analytical. Analytical is part of the Engineered Network. To support our shows, including this one, head over to our Patreon page, and for other great shows, visit today.

Metrics. People that know me know how I'm a little bit wary of metrics. Some people refer to them as key performance indicators or KPIs; I mentioned them on the last episode. Well, there's actually a subtle difference between metrics, which can be a measure of absolutely anything, and KPIs, which are a subset of metrics that are specifically performance-focused and are used as a litmus test to determine the overall health of something complicated. But I digress. I've been chewing over what I don't like about KPIs and metrics, and it's not actually that I don't like them, it's more the fact that I don't like how some people focus on the wrong ones. Comprehension of a metric determines its overall usefulness, but at the same time it limits its communicability. So managers are incentivized to choose rolled-up metrics that are easy to communicate but are otherwise essentially useless. That's just one problem, though.

So let's go through a specific example of how it can go right or wrong. Imagine a facility, a factory if you want to call it that, that transports boxes from one end of the factory to the other. There are 10 conveyors, all travelling at the same design speed, and they're all designed to transfer the same amount of mass from one end to the other in exactly the same amount of time. The problem is that the conveyors have moving parts and therefore they sometimes break. Everything that moves eventually breaks. The designers designed those conveyors to travel at an optimal speed.
That was about a decade ago, but unfortunately the knowledge of what they were optimizing for was lost through the years. And now our virtual stage is set, so let's see what happens when the metrics take over.

Today, a new engineer has started at the factory and their goal is to improve production rate. But no one measures any metrics. People say, "Oh, it's been a good week for transfers this week, and a few weeks ago was bad, mumble mumble, something about a local power blackout for a while, they don't remember how long it went for, but it was just a bad week, it was rough." Well, at this point, the machines had been run based on the manufacturer's recommendations. They were never challenged, they were never questioned, no one sought to refine or improve them, they just did as everyone else had always done.

So, shaking their head, the engineer starts by thinking about the first metric. Let's look at boxes transferred per hour; we'll call it boxes per hour. Turns out that each box that week had exactly the same items and quantities in it, meaning every box weighed exactly the same. So in week one, there were 100 boxes per hour per line on average. In week two, there was a surge of products to be transferred, and the engineer approved half the lines to have their boxes filled twice as full. This meant they weighed twice as much, and that would meet the demand. Previously this simply would have been kept to the side as a backlog, and the machines would have stayed within the manufacturer's recommendations and just transferred what they ordinarily would. The additional weight, however, had the effect of halving the throughput to only 50 boxes per hour for those lines. The other five lines stayed at 100 because they hadn't changed. So in week two it was an average of 75 boxes per hour per line.
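As a rough sketch, the week-one and week-two arithmetic above can be reproduced like this; the per-line figures are the ones quoted in the episode:

```python
# Sketch of the boxes-per-hour metric from the factory example.
# Figures are those quoted above: 10 lines, 100 boxes/hour at design speed.

def average_boxes_per_hour(lines):
    """Average boxes per hour across all conveyor lines."""
    return sum(lines) / len(lines)

week_one = [100] * 10            # all ten lines at design speed
week_two = [50] * 5 + [100] * 5  # five lines carry double-weight boxes at half the rate

print(average_boxes_per_hour(week_one))  # 100.0
print(average_boxes_per_hour(week_two))  # 75.0
```

The rolled-up average hides that half the lines halved their throughput, which is exactly the comprehension-versus-communicability trade-off mentioned earlier.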
At this point the engineer scratched their head, puzzled, and said, well, maybe the best thing to do is to measure each of the lines individually. And after the third week they figured that something was wrong with half of those lines, because the five that were carrying twice the mass were only travelling at 50 boxes per hour. Anyhow, the engineer then ordered that the speed of the slower lines be doubled, and that would improve their overall throughput.

So during week four, on the first day, several boxes were damaged during transportation through the factory. The overall production rate returned to 100 boxes per hour per line on average. Yay! However, they had to create a new metric. There'd never been damaged boxes in the factory before, so it was time to start measuring the total percentage of damaged boxes, and for week four it was a 5% damage rate overall. In week five, one of the accelerated conveyors broke. That had never happened before. It took the rest of the week to fix it. That led to the need for a new metric, conveyor downtime. So for week five: 90 boxes per hour, 4% damage rate, 168 hours of downtime.

The engineer had spent some time investigating the failed line and discovered that it was due to overheating of the drive motor, most likely because of excessive load and a lack of maintenance. Trying to understand how this had happened led the engineer back to the speed increase that they had ordered several weeks before. Not only that, the damaged boxes had come from those same accelerated conveyors, as the momentum was causing the boxes to not stop in time at the end of the conveyor belt.

So now we're going to replay this in an alternate reality, where the metrics chosen initially were subtly different, and see what happens. In this timeline, the engineer instead chose to measure throughput in terms of production weight transferred per hour, not boxes.
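The week-five figures can be reconstructed the same way if the failed conveyor is treated as a line contributing zero boxes; the per-line split is an assumption, since the episode only quotes the totals:

```python
# Week five, reconstructed: one of the ten lines is down for the whole week.
# The per-line split is an assumption; only the totals are quoted in the episode.

lines = [100] * 9 + [0]            # nine lines running at 100 boxes/hour, one failed
average = sum(lines) / len(lines)  # 90.0 boxes per hour per line

downtime_hours = 7 * 24            # the broken conveyor was out all week: 168 hours

print(average, downtime_hours)  # 90.0 168
```

Note how each incident spawned a new metric (damage rate, downtime) to explain movement in the first one, a sign the first metric wasn't capturing what actually mattered.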
Under this regime, the weights of the boxes changed in the same pattern as they did last time, but no attempt was made to increase the conveyor speeds, since the lines were transferring the same amount of overall weight, just more slowly. Now, I'm not suggesting you shouldn't track downtime specifically, but the point is that if you look at the right metrics, you'll get a better outcome. If you choose the wrong metrics, you'll end up optimising the wrong variables and you'll just make it harder for yourself, or worse, in the long term.

That's all well and good, I hear you say, dear listener, but how does that help me? Well, there are a few attributes of metrics that I think can help guide whether they're good or bad. It's not an exhaustive list and it's not universally applicable, but it might help save you from picking a bad one someday. Who can say? Let's have a crack at this.

A good metric must be accurately and directly measurable. If you can't measure it directly, it may be influenced by other factors.

A good metric should measure a value-assignable attribute. In the above example, the mistake was measuring the container and not its contents. The weight of the contents had the value; the box had no direct value, it was just a container. So the metric you measure has to have some value assignable to it.

A good metric, when evaluated with associated metrics, must show balance overall. Engineering is about balance and optimization. If a metric is rolled up to a higher level or shown alongside other metrics, it must be explainable how it balances overall, how it interacts. Meaning if you take from one metric, you often give to another. In the above example, you can increase the total transferred weight by increasing the speed, at the cost of increasing damage and conveyor downtime. So you have to consider your metrics as a whole.
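To see why the alternate timeline plays out differently, compare the two metrics on the same week-two data. The box weights here (10 kg normally, 20 kg when double-filled) are illustrative assumptions, since the episode never gives actual weights:

```python
# Boxes/hour versus kg/hour for the same ten week-two lines.
# Box weights (10 kg normal, 20 kg double-filled) are illustrative assumptions.

# (boxes_per_hour, kg_per_box) for each line in week two
week_two = [(50, 20)] * 5 + [(100, 10)] * 5

boxes_metric = sum(bph for bph, _ in week_two) / len(week_two)       # 75.0
mass_metric = sum(bph * kg for bph, kg in week_two) / len(week_two)  # 1000.0

# The boxes metric drops 25%, inviting intervention; the mass metric is flat,
# matching week one's 100 boxes/hour x 10 kg = 1000 kg/hour per line.
print(boxes_metric, mass_metric)  # 75.0 1000.0
```

Same factory, same week, but the weight-based metric measures the thing with value (the contents), so it raises no alarm and prompts no risky speed increase.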
And a good metric also reflects the origin of its intent. The original conveyor specification was lost, but it was actually designed for 100% uptime and longevity at a consistent throughput. The engineer that came in didn't appreciate this, because the knowledge had been lost, and their metrics didn't align with the original design intent, and that is what led to the problems.

Of course, you could argue that there are optimizations to be made in this example, and that's fine, that's true. Certainly, if you wanna increase throughput, you could just add more conveyors. But we didn't talk about the other constraints: there wasn't enough physical space in the factory, there wasn't enough land to expand the factory, there wasn't enough power in the power feeder into the factory, and that's just a few that come to mind. Alternatively, you could upsize the motors, improve the cushioning systems at the point of receival, and so on, but that's really not the point.

Metrics can be useful, but I've just seen so many badly chosen metrics in my time, and even well-chosen metrics can be misinterpreted by those that don't understand their meaning. As a result, I've become skeptical of metrics on the whole. Maybe then, the best advice I can provide is: choose your metrics carefully. Think about the bad behaviors they could drive just as much as the good behaviors, and take time to understand those that others choose, assess their usefulness, and never take them at face value. I've heard people say, if you aren't measuring it, you can't improve it. I would rather say, if you're measuring the wrong thing, in time you'll just make it worse. So choose your metrics wisely.

If you're enjoying Analytical and wanna support the show, you can, like some of our backers, Chris Stone, Ivan, and Karsten Hansen.
They, and many others, are patrons of the show via Patreon, and you can find it at, one word. Patron rewards include a named thank you on the website, a named thank you at the end of episodes, access to pages of raw show notes, as well as an ad-free, higher quality release of every episode. There's now a back catalogue of ad-free episodes available, and a new "making an episode" tier as well. So if you'd like to contribute something, anything at all, there are lots of great rewards, and beyond that it's all very much appreciated. Analytical is part of the Engineered Network and you can find it at, and you can follow me on Mastodon at [email protected], or for our shows on Twitter at engineered_net. Accept nothing, question everything. It's always a good time to analyze something. I'm John Chidgey, thanks so much for listening. [Music]
Duration: 10 minutes and 20 seconds

Show Notes


Premium supporters have access to high-quality, early-released episodes, with a full back-catalogue of previous episodes


John Chidgey

John is an Electrical, Instrumentation and Control Systems Engineer, software developer, podcaster, vocal actor and runs TechDistortion and the Engineered Network. John is a Chartered Professional Engineer in both Electrical Engineering and Information, Telecommunications and Electronics Engineering (ITEE) and a semi-regular conference speaker.

John has produced and appeared on many podcasts including Pragmatic and Causality and is available for hire for Vocal Acting or advertising. He has experience and interest in HMI Design, Alarm Management, Cyber-security and Root Cause Analysis.

Described as the David Attenborough of disasters, and a Dreamy Narrator with Great Pipes by the Podfather Adam Curry.

You can find him on the Fediverse and on Twitter.