Andrey Listopadov

100% is the only acceptable test coverage

@random-thoughts @programming ~12 minutes read

…kinda.

I think the only acceptable test coverage percentage is about 100%. And in this post, I’ll try to explain why I choose to believe that.

Why test coverage matters

I have a lot of projects of various sizes. Most of them are libraries that I host publicly, plus some tools I’ve written for myself only. What’s common about these projects is that, no matter whether it’s a public project or a personal one, I try to test it as well as I can.

But this raises some important questions, one of which is: how do we measure the quality of our tests? A rather obvious answer to that question is test coverage. And I can only say that this is the stupidest metric for measuring test quality, because coverage doesn’t actually guarantee that the tests check all the various corner cases in the data.

Test quality is a slightly different topic, and while code coverage is not what measures it, coverage is still important, and in my opinion the reasonable percentage is 100%, or something really close to it. I’ll explain why in a moment, but before that, I’ll address a question that has probably appeared in your head: “why go as high as 100%?”. And that’s a valid question.

Recently, I’ve talked with some developers who use different languages, and was kinda surprised to hear that, on average, they consider ~60% test coverage acceptable. Some say that writing unit tests for functions that are inherently simple is meaningless, as those can be fully verified while writing the function itself. Others say that 60% is just enough to cover the public API and catch most of the bugs that the end user may experience. Some even say that test coverage doesn’t matter at all and only integration coverage does, but it’s hard to count, so nobody really cares. But even they agreed, when I asked, that 60% would be enough for them.

I disagree strongly.

You may have heard of the 80/20 rule, also known as the Pareto principle. The Wikipedia article I’ve linked has a section related to computer science, which states:

In computer science, the Pareto principle can be applied to optimization efforts. For example, Microsoft noted that by fixing the top 20% of the most-reported bugs, 80% of the related errors and crashes in a given system would be eliminated. Lowell Arthur expressed that “20% of the code has 80% of the errors. Find them, fix them!”

And after a while of writing tests and measuring code coverage for my projects, I noticed something similar. The last 20% of code coverage held 80% of the bugs. And the interesting part was that no matter how good my tests were, they weren’t touching the parts of the code that actually had unnoticed bugs. So whenever I stopped at 80% code coverage, telling myself that this was good enough, I was lying to myself, as it clearly wasn’t.

What code coverage is about

So, now let me address some misconceptions about code coverage.

Foremost, test coverage is not about covering every function in your project with a unit test. That is pointless extra work that will be negated by the first tiny refactoring. All these tests will become obsolete, and new ones will likely never be written, because why bother if they’ll become irrelevant with the next refactoring? And the functions are really simple, so why even test them?

Second, for those who love numbers: 60% coverage in a project with just 2k lines of code (LOC) means that 800 lines were never even touched by the existing tests. In a project with 10k LOC, it’s already 4k lines. Not only can you not reason about correctness, you also can’t tell what’s actually used in your project and what’s not!

And that’s the point of code coverage - telling you which parts of your application/library are not reachable from the public API, which is the thing you always must test. In almost all cases, you don’t need to write unit tests for anything but the public API! Yes, it’s good to write tests that check complex functions, which do some data transformation for example, with a lot of different inputs - I’m not against that at all. However, if you’re writing a unit test for a private function, stop and think for a moment: maybe you’re doing something wrong.

As of this moment, I mainly use two languages - Fennel and Clojure - and the part about testing private functions is what makes these languages a bit different for me when I reason about code and its public API. The difference is that in Fennel there’s no way of accessing a private function of a module, and hence no way to test it directly. In Clojure, there is, and when I see, or even have to use, (#'some-ns/some-private-fn)¹ in a test, it’s a huge red flag for me that something is not right in the application code itself. Fennel (actually Lua) doesn’t have this problem, because if something is declared as local in a script file and not exported from it in a table, that’s that - you can’t have it.

So when writing tests in Fennel or Lua, your only option is to test the exported functions, which are, in a sense, your public API. And if you wrote tests and see that all lines are covered, all expressions are visited, and you still can’t reach some private functions - well, that’s a clear indication of dead code. Dead code is bad because it rots even if you don’t touch it. Even more: if you actually go and write unit tests for such dead code, you’ll make the situation worse for yourself in the long run. After a certain period, these tests may be the reason you can’t move forward, because you’ll be afraid to change or even remove code that appears to be unused - but you can’t be sure, because there are tests for this code.
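To make this concrete, here’s a minimal sketch of what I mean (the module and all names in it are made up for illustration):

-- mod.lua (hypothetical module)
local function helper(x)
  return x * 2
end

local function forgotten(x) -- neither exported nor called anywhere
  return x - 1
end

return {
  -- `helper` is reachable only through the exported `double`,
  -- so testing `double` also covers it
  double = function (x)
    return helper(x)
  end,
}

-- mod_test.lua
local mod = require "mod"
assert(mod.double(21) == 42)

With this single test, every line of double and helper gets hit, while forgotten stays at zero hits in the report - exactly the dead-code signal I described above.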

Now, I know what you might say to this - modern IDEs can detect unused code with static analysis. It’s a great feature that helps with this problem to some degree, but it’s not a silver bullet, and not all languages have it. For one thing, some IDEs see that code is used in tests and consider it not dead for that reason alone, which isn’t actually true. And while in most cases static analysis helps you see that some code is indeed unused, that’s the only thing it does. Test coverage shows the same thing, plus it provides additional correctness guarantees, thanks to the tests you’ve written.

Why 100% is not always possible

I hope you now see why I think code coverage is very important. But the percentage alone doesn’t tell you what exactly is left uncovered, so all code coverage tools provide some kind of report, usually in a color-coded form. Some coverage tools build interactive web pages that allow easy project navigation, and some just print the results to stdout using ANSI color codes. Both are equally helpful, because you see the results of your work.

However, unfortunately, it’s not as simple as that. If it were, I guess the average coverage expectation would be higher than 60%. So what can cause problems?

If you test an application, then in many cases you probably can’t test the main function, especially when it starts a server, for example. This means that the entry point to your application is untested, and that might cause problems in the future. As a workaround, it is possible to mock the server-starting functions so that main returns immediately. This probably won’t produce any meaningful values for the test itself, but you’ll be able to see which parts of the code are triggered by the initialization process, and whether there are some dead parts. Alternatively, if all the main function does is initialization and service startup, those things can be tested separately and main can be excluded from testing, but this will lower the coverage percentage, as expected.
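Here’s a minimal sketch of that workaround, assuming a hypothetical app module whose main function starts a server through an injectable server table:

-- main_test.lua (sketch; `app`, `app.server`, and `app.main` are
-- hypothetical names, not a real framework)
local app = require "app"

-- stub out the real server startup so that `main` performs all of
-- its initialization and then returns immediately
app.server.start = function (config)
  return true
end

-- the return value isn't meaningful here; we call `main` so the
-- coverage report shows which initialization paths are reachable
app.main({})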

Another problem is again solvable with mocks, but seems to defeat the purpose of testing. It does not. I’m talking about interactions with external services. A common misconception is that by mocking the server we create an idealized model that doesn’t help at unit testing. While this is somewhat true, it also depends on how you mock it. You can write a mock that causes a timeout, or a mock that sends invalid or corrupted data, to test error recovery. You can even write a small service inside your codebase that mimics the public API of the real one, which can help with testing. And with mock-based testing, your application code can mostly consist of private definitions, as you don’t need to export them for testing.
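As a rough sketch of what such mocks might look like (the get interface here is a made-up stand-in for whatever HTTP client your code actually uses):

-- http_mocks.lua (sketch; the client interface is hypothetical)
local M = {}

-- a client that simulates a timeout on every request
function M.timeout_client()
  return {
    get = function (url)
      return nil, "timeout"
    end,
  }
end

-- a client that returns corrupted data, to exercise error recovery
function M.corrupted_client()
  return {
    get = function (url)
      return '{"incomplete": ' -- truncated JSON body
    end,
  }
end

return M

Injecting one of these clients into the code under test drives the error-handling branches that would never fire against a well-behaved real service.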

This is something in between integration and unit testing, and while it’s surely not as good as real integration testing, it is still worth doing. Your public API probably won’t change as much as your code, and you’ll be more certain that nothing broke if you have such tests. This is also a very good way of seeing which parts of your code in the coverage report are touched solely by interacting with external services, without manually patching in the data for transformation. Though I must say that writing mocks for every network problem imaginable is hard, or even impossible, so it’s not always a solution.

Expression and line coverage

The last main problem is that coverage tools are not always accurate. This includes quirks in the language runtime or tooling that prevent it from tracking all lines or expressions. I often see this in Lua: I run some tests, collect the coverage report, and see some branches marked as not being hit. I then add a print call to these branches and see the prints in the output, meaning the code is indeed touched by the test. But the coverage tool doesn’t register it for some reason, and even that particular call to print is shown as uncovered.
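A sketch of that probe technique, with made-up names:

-- probe.lua (hypothetical code)
local function classify(x)
  if x > 0 then
    return "positive"
  else
    -- temporary probe: this prints during the test run, yet the
    -- coverage tool may still mark the branch as unhit
    print("DEBUG: negative branch reached")
    return "non-positive"
  end
end

assert(classify(-1) == "non-positive")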

And the line-or-expression distinction in the coverage report is another problem, with the opposite outcome. If the previous problem was about the impossibility of achieving 100% because some code just won’t register, this one is about achieving 100% without hitting all the code. The reason is that if you put multiple expressions on the same line, a coverage tool that doesn’t know about expressions may show completely wrong coverage. Here’s an oversimplified example:

-- lib.lua
return {
  f = function (a, b)
    local res
    if a == 10 then res = a * b else res = a / b end -- both branches on one line
    return res
  end
}
-- lib_test.lua
local lib = require "lib"
local res = lib.f(10, 20)
assert(res == 200, "expected 200, got " .. res)

The if statement in the function f is written as a one-liner (deliberately, for the sake of this example). The test checks only one specific case, which triggers only the true branch of the if. There are no tests for the false branch, which may easily result in a division by zero if a is not 10 and b is 0. While this is not a problem in Lua itself, as it will just return inf, it may be a problem for your application, which won’t know what to do with infinity.

This is an exaggerated example, only meant as an illustration of the problem. Running the test module with Luacov shows 100% coverage:

$ lua -lluacov lib_test.lua && luacov-console && luacov-console -s
==============================================================================
Summary
==============================================================================

File         Hits Missed Coverage
---------------------------------
lib.lua      4    0      100.00%
lib_test.lua 4    0      100.00%
---------------------------------
Total        8    0      100.00%
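To actually exercise the false branch, the test needs at least one more input. A minimal extension of the test above:

-- lib_test.lua, extended to hit both branches
local lib = require "lib"

local res = lib.f(10, 20)
assert(res == 200, "expected 200, got " .. res)

-- a ~= 10, so `f` takes the division branch
res = lib.f(5, 0)
assert(res == math.huge, "expected inf, got " .. res)

Of course, Luacov would report 100% either way; only an expression-aware tool, or more careful formatting, reveals that the first test alone misses the division branch.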

Thankfully, I write my code in Fennel, and it usually generates somewhat well-formatted Lua code, so the coverage is more or less accurate. Doing the same in Clojure doesn’t have this problem at all, because Cloverage provides two separate statistics - line coverage and form coverage:

;; src/lib/core.clj
(ns lib.core)

(defn f [a b]
  (if (= a 10) (* a b) (/ a b)))

;; test/lib/core_test.clj
(ns lib.core-test
  (:require [clojure.test :refer [deftest is]]
            [lib.core :refer [f]]))

(deftest f-test
  (is (= 200 (f 10 20))))

As you can see, I’ve done the same thing as in Lua, but the coverage report suggests that something is not right:

$ lein cloverage
|-----------+---------+---------|
| Namespace | % Forms | % Lines |
|-----------+---------+---------|
|  lib.core |   75.00 |  100.00 |
|-----------+---------+---------|
| ALL FILES |   75.00 |  100.00 |
|-----------+---------+---------|

Again, the main point I wanted to make here is that not all coverage percentage reports are fully trustworthy; that’s why the visual report is important. Seeing a pretty long line with an if statement in Lua should tell you to be more careful with that code, as some parts of it may not have been touched. Cloverage colors such a line orange, meaning that not all forms were covered, but it doesn’t tell you which forms weren’t touched. Both cases can be fixed with proper formatting.
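For the Lua example above, “proper formatting” simply means giving each branch its own line, so that even a line-based tool can see the miss:

-- lib.lua, reformatted so each branch occupies its own line
return {
  f = function (a, b)
    local res
    if a == 10 then
      res = a * b
    else
      res = a / b -- reported as unhit with the original single test
    end
    return res
  end
}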

Closing thoughts

Overall, I hope I was able to convince you that 60% code coverage is far from ideal, and that 80% is still not enough. In my personal opinion, something starting from 97% is the way to go. Of course, it’s not always possible, but when it is, there’s no reason not to go for it. And I’ll repeat myself: having a higher coverage percentage doesn’t mean your code is well tested, it’s merely well covered.

To summarize - test your public API and collect coverage of it specifically. Having 100% code coverage achieved through such testing means that:

  • There’s no dead code.
  • You have tests for everything in your application/library.
  • You have a very reasonable starting point for improving all tests with more cases.
    • And because of that, there are more chances to find bugs in some hard-to-reach places (the 80/20 rule).

I’ve already said that I’m against writing unit tests for private functions, but, well, decide for yourself. Personally, I think that needing such a test means the code was meant to be public, and making it private was a design mistake. But your view may vary.

While 100% coverage may not always be possible, carefully observe what kind of code was left uncovered. If it’s uncovered because the coverage tool simply misbehaves, well, there’s little you can do about it. If it’s due to the inability to test something like a main function, or a very tricky part that communicates with another service - make sure it’s well tested through integration testing. Everything else probably means you have some potentially incorrect design decisions that should be corrected.

Hope this was an interesting read. Feel free to share your thoughts, and tag me on any of the social networks if you disagree or have something to discuss on the topic!


  1. This is how you access a private var in Clojure. ↩︎