On metrics

There's an ongoing debate about whether software development in particular, and IT in general, are engineering practices. In any case, there's no denying that our industry is based on scientific foundations.

Most of the organizations I've worked for implement the Deming wheel in one form or another:

Plan-Do-Check-Act

From a bird's eye view, it makes a lot of sense. And given the success of the method in the Japanese car industry, there's no denying that it's effective. However, in this post, I'd like to focus on the Check (or Study) phase: how do you implement it?

Most developers and engineers will have no problem coming up with the solution - metrics. Aye, there's the rub. Most of the time, we choose the wrong metrics, and the later phases of the above cycle stray away from our target instead of converging toward it.

Some examples of wrong metrics

Let me detail three examples of those metrics that developers are pretty familiar with.

Code coverage

Code coverage is the first metric that made me realize how wrong a metric can be - apart from management-oriented metrics. The idea behind code coverage is to check how well tested your software is. You run your test suites on your code and mark every line (or branch) executed. Then, you divide this number of executed lines by the total number of lines, and presto, you've got your code coverage.

That's all well and good, but it's very straightforward to reach 100% code coverage on any codebase. The reason is that code coverage checks that a line is executed during a test, but not that the test verifies anything meaningful. Here's a test that asserts nothing but still increases code coverage:

@Test
public void dummy_test_to_improve_coverage() {
    Math.addExact(1, 1);           // 1
    // Oops, no assertion...       // 2
}

1. Math.addExact() will be executed
2. But there's no assertion on the result

At this point, you'd be wrong to confidently refactor your project just because it has 100% code coverage.
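To make this concrete, here's a minimal, self-contained sketch. It doesn't use a real coverage tool: the `CoverageSketch` class and its `executed` set are hypothetical stand-ins that record which lines ran, and the `add()` method is deliberately buggy. The assertion-free test still yields 100% coverage, while a test that checks the result exposes the bug.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, hand-rolled stand-in for a coverage tool.
final class CoverageSketch {

    static final int TOTAL_LINES = 2;                      // tracked lines in add()
    static final Set<Integer> executed = new HashSet<>();  // line numbers hit so far

    // Deliberately buggy: subtracts instead of adding.
    static int add(int a, int b) {
        executed.add(1);
        int result = a - b;
        executed.add(2);
        return result;
    }

    // Executed lines divided by total lines, as a percentage.
    static double coverage() {
        return 100.0 * executed.size() / TOTAL_LINES;
    }

    // Runs the code but checks nothing, like the dummy test above.
    static void dummyTest() {
        add(1, 1);
    }

    // Returns true only if add() gives the expected result.
    static boolean assertingTest() {
        return add(1, 1) == 2;
    }

    public static void main(String[] args) {
        dummyTest();
        System.out.println("Coverage after dummy test: " + coverage() + "%");
        System.out.println("Asserting test passes: " + assertingTest());
    }
}
```

Running `main` reports full coverage after the dummy test, yet the asserting test fails: the coverage figure alone says nothing about whether the code is correct.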

If you're interested in an alternative - and relevant - way to assess confidence in refactoring, please check this talk of mine.

Docker image size

With Docker, I regularly stumble upon another wrong metric: the smallest possible image size.

On the surface, it makes sense. Fewer bytes in the image mean fewer bytes to transfer over the network, and thus a faster download. Yet, as with code coverage, it ignores how things work under the hood.

Docker images are not monolithic but designed around layers. Each layer but the first one references a parent. It's similar to how Git works; a commit references a parent. The similarity doesn't stop there. In Git, when you pull commits from a repository, you don't pull them all but only those added after your latest pull.

Likewise, in Docker, locally cached layers play an important role. If images A and B both use the parent image P and you already have A locally, you only need to download the layers that B adds on top of P.

While layers do add to the overall size of the image, the absolute size of an image is much less important than how it's layered.
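The effect of the local cache can be sketched with a deliberately simplified model. Everything below is an assumption made for illustration: the `LayerCacheSketch` class, the layer IDs, and the sizes in MB are invented, and the model ignores compression and registry details. It only shows that what you download is the sum of the layers you don't have yet.

```java
import java.util.Map;
import java.util.Set;

// Illustrative model only: layer IDs and sizes (in MB) are made up.
final class LayerCacheSketch {

    // Sums the sizes of the layers that are not already in the local cache.
    static long megabytesToPull(Map<String, Long> imageLayers, Set<String> cachedLayerIds) {
        long total = 0;
        for (Map.Entry<String, Long> layer : imageLayers.entrySet()) {
            if (!cachedLayerIds.contains(layer.getKey())) {
                total += layer.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Image A is already local: parent P (two layers) plus A's own layer.
        Set<String> cached = Set.of("p-1", "p-2", "a-1");

        // Image B shares parent P with A: 135 MB in total.
        Map<String, Long> imageB = Map.of("p-1", 80L, "p-2", 40L, "b-1", 15L);
        // Image C is smaller overall (50 MB) but shares nothing with the cache.
        Map<String, Long> imageC = Map.of("c-1", 50L);

        System.out.println("Pull B: " + megabytesToPull(imageB, cached) + " MB");
        System.out.println("Pull C: " + megabytesToPull(imageC, cached) + " MB");
    }
}
```

With these made-up numbers, the larger image B costs 15 MB to pull, while the "smaller" image C costs 50 MB.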

Application startup time

Right now, lowering application startup time is all the rage in the Java world. It's part of the advertised features of frameworks such as Quarkus and Micronaut, with the help of GraalVM native image. While it's a lofty goal, there's no such thing as a free lunch. What you gain in startup time, you lose in raw performance.

The JVM has been able to compete on an equal footing with native binaries because of how it works. Java compilation produces bytecode. At runtime, the JVM first interprets this bytecode; the performance is not outstanding. But the JVM analyzes the "hot" execution paths and, after some time, compiles them to native code.

This is very different from a standard ahead-of-time compilation from source code to native code. In that case, you need to assume the workload at compile time and configure the compilation accordingly. The JVM, on the other hand, adapts its optimizations to the actual workload. Even better, because the JVM continues to analyze the execution paths, it can recompile to different native code if the workload changes.

However, this optimization process has a cost, and the price is warm-up time. The time it takes for the JVM to optimize code is non-trivial. All industry-grade performance tests that involve the JVM take this time into account.
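One rough way to observe warm-up is to time identical batches of work in a fresh JVM. The sketch below is a naive timing loop, not a rigorous benchmark: the `WarmUpSketch` class, the `work()` method, and the batch sizes are all illustrative assumptions, and actual timings depend on the JVM, its flags, and the hardware. For real measurements, a dedicated harness such as JMH is the safer choice.

```java
// Naive timing loop for illustration; not a substitute for a benchmark harness.
final class WarmUpSketch {

    // Some arbitrary CPU-bound work that the JIT compiler can optimize.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i % 7;
        }
        return sum;
    }

    // Times each batch of identical calls and returns the elapsed nanoseconds per batch.
    static long[] timeBatches(int batches, int callsPerBatch) {
        long[] nanos = new long[batches];
        long sink = 0; // accumulates results so the work is not optimized away
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int c = 0; c < callsPerBatch; c++) {
                sink += work(1_000);
            }
            nanos[b] = System.nanoTime() - start;
        }
        if (sink == 42) {
            System.out.println(sink); // keeps 'sink' observable
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeBatches(10, 2_000);
        for (int b = 0; b < nanos.length; b++) {
            System.out.println("Batch " + b + ": " + nanos[b] / 1_000 + " µs");
        }
    }
}
```

On a JIT-enabled JVM, the first batches tend to take longer than the later ones, though the exact shape of the curve varies from run to run.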

Native compilation is a trade-off: faster startup time vs. lower peak performance over time. For short-running processes, it's a good one; for long-running ones, not so much.
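A back-of-the-envelope model shows where the trade-off flips. It's a sketch under strong assumptions: total time is a fixed startup cost plus a constant per-request cost, which ignores the gradual nature of warm-up. The `BreakEvenSketch` class and all numbers are hypothetical, chosen only to make the arithmetic visible, not measured on any real application.

```java
// Simplistic cost model; all figures used with it are hypothetical.
final class BreakEvenSketch {

    // Total wall-clock time: fixed startup cost plus a constant cost per request.
    static double totalMillis(double startupMillis, double perRequestMillis, long requests) {
        return startupMillis + perRequestMillis * requests;
    }

    // Number of requests after which the slow starter has caught up with the fast starter.
    static long breakEvenRequests(double fastStartupMillis, double fastPerRequestMillis,
                                  double slowStartupMillis, double slowPerRequestMillis) {
        double startupGap = slowStartupMillis - fastStartupMillis;
        double perRequestGain = fastPerRequestMillis - slowPerRequestMillis;
        if (perRequestGain <= 0) {
            throw new IllegalArgumentException("The slow starter never catches up");
        }
        return (long) Math.ceil(startupGap / perRequestGain);
    }

    public static void main(String[] args) {
        // Hypothetical native binary: 50 ms startup, 1.5 ms per request.
        // Hypothetical JVM process: 2000 ms startup, 1.0 ms per request.
        long requests = breakEvenRequests(50, 1.5, 2000, 1.0);
        System.out.println("Break-even after " + requests + " requests");
    }
}
```

With these invented figures, the break-even point is 3900 requests: a process that handles fewer requests than that before it dies benefits from the fast startup, while a longer-lived one does not.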

Hence, fast startup time plays a prominent role when you need to scale up fast - you spin up multiple short-running processes. It happens mostly in serverless contexts. On Kubernetes, it may also be the case when resources are limited and pods are killed and relaunched regularly, but then you have another problem.

For that reason, startup time is meaningless in most contexts.

Measuring influences the measured metric

The above examples are just illustrations, but everybody is aware of such wrong metrics in their specific context.

Readers with a background in physics know that measuring influences the measured metric. It's the infamous Schrödinger's cat: it's both alive and dead, but when you open the box, it becomes either alive or dead.

But even in fields unrelated to physics, a metric stops being relevant once you start to measure it. It's known as Goodhart's law:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

The reason is that once people understand how the system works, they will start to game it to meet the goal. As a colleague put it:

Once you set a metric to achieve a goal, the metric becomes the goal.

Measuring is challenging

Even without any influence from the observer, the act of measuring is challenging in itself. Let's think about car designers. It stands to reason that they have a significant impact on car sales. But how would you measure their impact?

The first issue is the position of the actor on the value production chain. The further you are from the end of the chain, the harder it is to measure. That's why salespeople are easy to measure; they stand at the very end.

The second issue is that it's easy to measure one's effort but harder to measure one's impact. Logically, the latter, known as outcome, depends on the former, known as output.

It means that for developers, it's easy to measure the number of lines of code they produce, but not their impact on the software. Widening our perspective, it's easier to measure the number of user stories delivered by the IT department than their added value to the business. And so on, ad nauseam.

Measuring narrows down your perspective

Reality is impossible to encompass in its infinity. Some of its aspects are not relevant to the problem we are to solve. That's the reason behind metrics: to simplify things enough so we can understand them.

But even then, we have to simplify further. Thousands of indicators are too many; we need to reduce them to tens. By doing that, we also remove context and nuance. When you aggregate team indicators into service indicators into department indicators so that they're digestible by the board, the most valuable insights are already lost.
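Here's a small sketch of what aggregation can hide. The `AggregationSketch` class and the scores are invented for illustration; the "indicator" could be anything measured per team. Two departments end up with the same average, even though one of them has a team in serious trouble.

```java
import java.util.Arrays;

// Illustrative only: the scores are invented to show what an average can hide.
final class AggregationSketch {

    // Aggregated indicator, as it might be reported upward.
    static double average(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // One piece of context that the average drops: the worst team's score.
    static double min(double[] values) {
        return Arrays.stream(values).min().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] departmentX = {70, 70, 70};  // three steady teams
        double[] departmentY = {95, 95, 20};  // two strong teams, one in trouble

        System.out.println("X: average=" + average(departmentX) + ", min=" + min(departmentX));
        System.out.println("Y: average=" + average(departmentY) + ", min=" + min(departmentY));
    }
}
```

Both departments report an average of 70, so a board that only sees the aggregated figure cannot tell them apart.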

There's a wide range of options between completeness and digestibility. Most of the time, we tend to prefer the latter at the cost of the former.

All models are wrong, but some are useful.

-- en.wikipedia.org/wiki/All_models_are_wrong

Conclusion

Let's recap my arguments:

Good metrics require an understanding of the context
The act of measuring itself will change the context, whether you like it or not
The more relevant the metric, the harder it will be to measure correctly
Whatever the metric, it doesn't reflect reality

This being said, I'm not advocating against metrics, just wrong ones. In general, defining valuable metrics will require considerable effort. No pain, no gain. Good luck!

I, unfortunately, cannot offer any more specific advice at this point. It depends on the specific goal you're trying to achieve and your context.

Originally published at A Java Geek on September 5th, 2021