Copilot is a worrisome proposal, but not for the reasons you may think of.

Thoughts on Github Copilot

TLDR;

Github Copilot is a disruptive proposition that could change forever how developers work. In this post, I will give you examples of Copilot's successes and failures; I will also elaborate on its positive and negative impacts, and will risk a prognosis.

While it is difficult to take a firm position, I think it will bring significant, yet incremental, progress. But it raises many questions around collaboration, and I think we should be concerned for the future of OSS (and I am not thinking about licensing issues).
I propose to address the following questions:

  1. What is Copilot?
  2. Is it useful?
  3. What are the impacts of such a tool?
  4. What does it tell us about our trade?
  5. How actually useful is it?

Disclaimer

I have yet to experience Copilot first-hand, but I have seen enough videos and read enough feedback to get the gist of it. In any case, I will mostly talk about the concept, not the product. This is not a product review!

What is Copilot?

GitHub Copilot is touted as ‘Your AI pair programmer’.
From a user experience point of view, it works somewhat like an auto-completion engine, except that it does not simply suggest the end of the word you are typing (such as ToS ==> ToString()), but full functions/methods or chunks of code.

SendTweet sample: Copilot generates a SendTweet function in Python

The most impressive results are achieved when Copilot is able to suggest a full function based solely on the code comments you typed.

Comments to shell command

Note that Copilot is often able to offer several alternate suggestions, which you can navigate with the keyboard before choosing one with Enter; then you can change, complete and alter the code as usual.

From an implementation perspective, Copilot is a service built on top of Codex, OpenAI's GPT-3 derivative dedicated to code generation (see this article for more). In short, GPT-3 couples NLP with a (huge) neural net to produce very convincing documents based on correlation with provided keywords.

Here, Codex has been trained on every public GitHub repo, disregarding their respective licences: if it's public, it is fair game!

Github support answer to the licence question.

Side note: I wonder how GPT-3 deals with the grammar of various languages. I have only seen text examples in English; I wonder how good it is with languages that have a stricter and more complex grammar, such as German or French. This is a relevant question for Codex, since correct grammar is an important topic for computer languages.

Wait! What?! Disregard for the licences?!

Yes, this issue has already been heavily discussed elsewhere. In short, there is no actual issue. A more detailed answer is:

  1. No, GPL does not result in Copilot generated code being GPL as well (same for any viral license)
  2. This is akin to reading other people’s code to learn from them, definitely fair use
  3. Trying to fight this means applying copyright laws (and principles) to something that exists BECAUSE copyright laws were seen as hampering creation

Nope, not a good fight, sorry.

It is a benefit, right?

Not sure… Let’s see.

Simple case

Let’s assume that Copilot works flawlessly for simple requirements and works partially for more complex ones.
The following logical demonstration is based on a simplistic view of the development effort, but I assume everyone understands it.

So it will help developers code simple requirements properly with little effort, with a significant production increase for those.
Production here is expressed both as the number of requirements covered (KPI #1) and as the number of lines of code (KPI #2). Neat, isn't it?

You see the problem? As a trade, we know that in general we want to maximize the number of implemented requirements while minimizing the amount of written code; that is, keep the ratio KPI #2/KPI #1 as low as possible.

Why? Because we know there is a maintenance cost associated with every line of code. Even if it is a simple function that rarely needs change, what if the code needs to be migrated, or another team using different coding patterns takes the code over? A line of code is both an asset and a liability!

Today, almost no one boasts about how large their code base is!

One may retort that just because 'producing' code gets simpler does not mean it will result in more code; I simply suggest looking into Jevons' paradox, and at IT history, which is a constant demonstration that whenever code gets cheaper to create, we end up with more and more of it.

So a system that ends up favoring the amount of written code does not seem so smart. In these simple terms, I don't think it brings value if it is only able to support simple requirements.

What about more complex requirements?

Here be dragons

Everybody with some professional code experience knows how hard it is to extract and capture real world requirements in a written, structured form (spoiler alert, Copilot will not help you there).

For the sake of the argument, let's say that Copilot can process simple business requirements (process, not understand; it does not understand anything). All examples I have seen so far imply there is still significant work to be done by the human developer once she/he has chosen the best Copilot proposal. So we end up with some hybrid AI/human code with no marking to tell them apart. Code generation history has told us repeatedly that this is not a good idea:
Those requirements are likely to change over time. Sadly, Copilot knows how to generate code, not how to change it in the face of a shift in requirements. In all likelihood, it means regenerating the code as a whole, not altering it.
And God forbid this implies some signature change: Copilot does not rely on an understanding of the language syntax and is not able to perform any refactoring, such as dealing with the impacts of a signature change.

So Copilot may help in the short term here, but this contribution may turn out to be a blessing or a curse.

So what about productivity then?

I am now pretty convinced Copilot brings little benefit in terms of raw productivity, and I think MS thinks the same:

If anyone knows how to sell software, it is Microsoft (remember, Bill Gates kind of invented the concept of paid software). Hence I am pretty sure the MS people themselves know this pretty well, otherwise they would already have a commercial/paid tier to sell.
As of now, we have an MVP released into the wild to see where it gets traction and how to extract value from it.
It may very well end up as a failed experiment (remember Tay?) or it may find its market. My best guess is that it will remain a niche market, like being used by some coding sweatshops producing low-quality websites/apps for SOHOs.

So why worry?

A bit of context first

First, let me tell you a bit about my personal experience with coding, so that you understand where I come from and guess my biases:
I started to code in the mid-eighties; professional developers were in short supply everywhere, and as working code was really expensive to produce (compared with today), there was a strong focus on DRY and code reuse. Libraries were seen as THE solution; alas, libraries were scarce. The languages provided some (standard libraries), a few specialized vendors provided commercial products, but most of the existing libraries were internal/private. Fast forward a couple of decades: in the early 21st century, the Internet and the OSS movement proved to be the enablers of a thriving library ecosystem, which ended up fully reinventing our technical stacks (from vendors to open source).

An ode to OSS libraries

Sorry, I had to do this. 😀

Libraries are great. They provide us with ready-made solutions for some of our requirements, but most of all, they allow for a separation of concerns!
The library's team is in charge of identifying the correct abstractions and building an efficient implementation. As such, using a library helps you right now, when implementing, as well as in the future, when issues are found or changes are required.
If you copy-paste the library code, instead of depending on its distribution package, you will have to deal with any needed changes in the future. But the worst part is that you will have to understand its design and internal abstractions first if you want to maintain and fix it, and you need deep understanding if you want to extend it.

Using an external dependency beats the s**t out of copy/pasting part of it, hands down, every day; that is, assuming a decent dependency ecosystem (see the LeftPad debacle five years ago).

The problem for OSS

Let's take a systemic view of this:

If Copilot is useful for the short term (and this is a big ‘if’, as we will discuss later on):

  1. Copilot will provide bits of code to cover part of the requirements.
  2. So coders are less likely to look for OSS libraries to help them.
  3. Fewer users means less feedback (issues and feature requests), fewer contributors and less motivation for OSS authors.
  4. Less energy in OSS results in a slightly less dynamic OSS ecosystem
  5. A less dynamic ecosystem increases the relative value of Copilot
  6. Back to (1)

Here we have a (slow) Copilot usage reinforcement loop that could theoretically lead to a complete drying out of OSS ecosystems.
Which would be a bummer, since the OSS ecosystem is the source material for Copilot.

I am not saying this will, or even could, happen. But I see no interesting equilibrium point beyond a marginal use of Copilot.

Note that there is a parallel to be drawn between Copilot and (arguably) the most famous coding website: StackOverflow.

The parodic idiotic coder who copy-pastes StackOverflow-found code without adjusting it to his/her own project would be replaced by the idiotic Copilot user who fails to correct the generated code.

Except that fixing Copilot's output will likely require more work and better skills.

Also, the value of StackOverflow does not reside in the posted code extracts, but in the embedded social network that increases its value a hundredfold by providing context and assistance to people looking for help.
Features that are sorely lacking in Copilot.

But Copilot is still useful, right?

Watch out for the bugs

Well, it is still early to get a definitive answer, but I am getting more skeptical by the day.
I think we can draw a parallel with self-driving cars: we are, slowly, getting Level 4 assistance (see here for level definitions) but Level 5 seems further away every time we look at it.

The main problem with Level 4 is making sure the driver takes over when necessary. For a car, the problem is that the driver's focus will not be on the road when the problem arises, leading to a dangerous delay. For Copilot, the problem is that the issue will be hidden in the complexity of the code.

Let me illustrate this with a Copilot example (see original tweet).

You see the problem? You probably won’t at first sight. You may very well never see it if you are not familiar enough with color representation.

The color can never be 16777215, i.e. 0xFFFFFF, a.k.a. pure white!

  1. The fix is simple: you need to use 16777216 instead (see the sketch below).
  2. How do you fix every copy of this code? You don't, as you can't identify them, since it is likely that Copilot will have 'generated' slightly different versions: varying variable or function names, for example.
  3. How do you make sure future versions of this algorithm are correct? You can't, as you cannot identify its source!
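
To make the off-by-one concrete, here is a minimal sketch in C# (the original tweet's snippet was not necessarily C#; the bound values are the point):

var rng = new Random();

// Buggy version: Random.Next(maxValue) excludes maxValue, so with 16777215
// the value 0xFFFFFF (pure white) can never be produced.
var buggyColor = rng.Next(16777215);   // yields 0 .. 16777214

// Fixed version: with 16777216 the full 24-bit range is reachable.
var fixedColor = rng.Next(16777216);   // yields 0 .. 16777215 (0xFFFFFF)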

Furthermore, this example also illustrates that Copilot has absolutely no semantic understanding of what it does: if it relied on some internal understanding of what a color is (from an IT perspective), the code would have been correct.

It is likely a source of subtle bugs…

But Copilot will resolve the problem of boilerplate code

This one is very likely. Boilerplate code, the lines that must be written because of some technical requirement (generally imposed by some library) and that bring little value to the general requirements, can be masterfully managed by Copilot.

From my point of view, boilerplate code is the sure sign of a design in need of improvement. If Copilot removes this pain, the design will never be improved and we will rely on Copilot as a crutch instead.

The best way to deal with boilerplate code is to review the design that led to it in the first place.

It will help people write tests

I have seen several examples of using Copilot to generate unit tests out of comments. That could be an interesting approach, but I am not sure how this could prove to be better than using BDD tooling (Cucumber/Specflow….).

Being a TDD practitioner, I see writing a test as a design exercise; as such, I think of this step as the one that requires the most skill. Hence, it does not appear natural to me to delegate this to an AI.

What could be interesting is if Copilot were able to generate code that makes tests pass. Codex has been tested against some coding challenges (see the white paper here), and the paper shows that using sample results to select the most efficient Copilot suggestion can achieve a 78% success rate.

I see reason to worry again here, as such a tool will divert coders from TDD, or TDD-like practices. Indeed, why bother with tests if Copilot generates good code for you? To that, I will retort that:

  1. Tests are useful when writing but also when maintaining code (and code needs to be maintained, until Copilot does it for you)
  2. Copilot may not generate good code at first attempt.

First users are delighted

First-time users are definitely amazed by Copilot, often talking about 'magic-like' results (a nice reminder of Clarke's third law), but I am waiting for longer-term evaluations. Not holding my breath for those, as I expect them to reveal several limitations that reduce the interest of the tool.

There is one important thing to bear in mind: one of the problems with neural nets, especially deep ones, is that nobody really understands how they work. Don't get me wrong: their architecture, principles and general design are well documented, but since this is a self-calibrating engine, it is hard to explain how decisions are made afterward.
Let me illustrate that with a clear contradiction. See what Github says about the risks of Copilot duplicating something from the training set:

We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

Not so fast!!
As you can see from the following video, you can twist Copilot's arm until it reveals its sources of inspiration.
The conclusion is not that the authors lied about how Copilot works, but that they have only an approximate grasp of how Codex behaves in the real world.
This is sorcerer's apprentice stuff.

Why would anyone think this is a good idea in the first place?

I must say I was really mad at seeing someone push for yet another tool that results in producing even more code lines than necessary. Then I realized that Copilot is a solution in search of its problem, or a technology looking for its market.

It does not pretend to be a solution to anything, just an incremental improvement. Still, this is disappointing news: we need new tools that help us increase the quality of the code we produce, not the amount of it. Code is pervasive in our lives and the impact of imperfect code is getting higher by the day: there is open source software on Mars. Nowadays no developer can affirm that none of their code has life-or-death stakes; we simply do not know. But that warrants a specific post. In the meantime, we are still waiting for new tools and practices to be invented and to emerge.

Is there any chance Github Copilot could be useful?

Not in its current form, as I said in the previous paragraphs. But I think different approaches could be useful (assuming they are possible, I am no GPT-3 expert):

  1. Use it as a library/framework search tool. Picking a library is a really difficult exercise, and finding one is an even harder challenge. A tool for that could be a game changer; it would also be a power-play quagmire, but that's another discussion.
  2. Generate code out of unit tests. This would be a boon for TDD-like practices.
  3. Use it to suggest code improvements/simplifications. Crafting good code is difficult; we need help with this.
  4. Make it contextual: Copilot should learn from your project context and adjust its suggestions accordingly, so that it can base its proposals on your dependencies, for example. Probably hard to do.
  5. Create a feedback loop, à la StackOverflow. There should be a way for Copilot to learn directly from its users. Maybe it is already the case, but there is no evidence to corroborate this.

Conclusion

In short, GitHub Copilot:

  1. Is a significant technical achievement
  2. May be a danger to OSS in general
  3. Is unlikely to be a success in its current form
  4. May never end up in a commercially available form
  5. Is a sure sign of similar AI-powered tools to come.

References

Here are a few further references:

  • Is GitHub Copilot a blessing, or a curse?
  • GitHub Copilot and the unfulfilled promises of an artificial intelligence future
  • Captain Stack

Video applications for iOS

What is this about?

Disclaimer: for once, this article was originally written in French because it discusses France-related apps.

For some time now I have wanted to write a post about the state of streaming/TV applications, because my assessment of the French applications is rather damning, at least as far as iOS is concerned. And even though I see no reason why the situation would be any brighter on Android, do not hesitate to share your remarks if your findings differ.

Beware, this article is not meant to be a serious comparison, just a panorama of the current situation (as of 8 June 2021).
Obviously, there is absolutely no discussion of the content offered by the applications, just the navigation/player part.

The setup

For practical reasons, a good part of my usage happens on an iPad connected to a good monitor via HDMI, serving as an extra TV in the kitchen.

General assessment

The overall quality level varies a lot: the user interfaces are very diverse, stability is generally good, but the behavior once plugged into an external screen is generally appalling.

The list

I have ranked the 8 applications I use regularly or have used, in decreasing order of quality.

Netflix ★★★★★

No need to introduce the historical leader, which defined quite a few of the usage and interface patterns.

  1. Stability ★★★★★: very stable
  2. Navigation ★★★★✩: efficient despite a rich catalog; sometimes hard to find the videos in progress.
  3. Video playback ★★★★★
  4. Hardware support ★★★★✩: capricious on an external screen, with occasional lasting glitches

Prime Video ★★★★★

A later arrival, Amazon's application has nevertheless quickly become quite excellent.

  1. Stability ★★★★★: very stable
  2. Navigation ★★★✩✩: videos in progress hard to find, categories could be improved
  3. Video playback ★★★★✩: works very well. I take one star off for the annoying habit of injecting trailers at the start of playback. I hate ads.
  4. Hardware support ★★★★★: impeccable, very good support of external screens.

Arte ★★★★★

The ARTE application is simply impressive: it is the best French TV application, by far. Efficient, stable, pleasant. A reference. Its quality means I watch Arte much more than before.

  1. Stability ★★★★★: very stable
  2. Navigation ★★★★★: excellent. This is admittedly easier thanks to a (relatively) small catalog; still, browsing the catalog is very pleasant. Special mention for the ability to simply browse the schedule to find all the programs of a given day.
  3. Video playback ★★★★✩: no issues. I take one star off because some programs are unavailable for streaming, including live.
  4. Hardware support ★★★★★: impeccable, very good support of external screens

Apple TV ★★★★✩

The application's ergonomics could be improved, and behavior on an external screen is capricious. Globally decent, but a notch below the best Apple applications.

  1. Stability ★★★★★: stable
  2. Navigation ★★★★✩: rather confusing; this comes from the portal side of the application, which integrates the catalogs of other applications (Prime and Arte in my case). It lists the videos in progress from all applications, but this mostly results in confusion. The playback controls are small and relegated to the edges. Browsing the catalogs is quite lavish and showcases the content.
  3. Video playback ★★★★★: impeccable
  4. Hardware support ★★★✩✩: often complains about not being connected to an HDCP screen. I do not know whether this is due to a bug or a loose connection, but the problem occurs with 2 different adapters.

TF1 ★★★✩✩

An insufficient level of quality for a broadcasting giant. The application is minimal in terms of features, and the lack of external screen support is irritating.

  1. Stability ★★★★★: stable
  2. Navigation ★★★★✩: simple but efficient
  3. Video playback ★★★✩✩: correct, but the injection of targeted ads is rather haphazard; it is common to see the same ad several times in a row.
  4. Hardware support ★★★✩✩: does not support external screens, relies solely on mirroring. As a result, screen area is lost because the iPad is 4:3 vs 16:9 for a screen

France.TV ★★★✩✩

In this list, this is the application with the most quality problems: there are often regressions between versions, live streaming is capricious and often produces strange error messages (live seems to go through the replay system, but only within the right time slot, which leads to error messages at the start time of a program if it is running late, and at the end time if the navigation has not refreshed the current program). Some versions of the application crashed regularly between programs because of this. Usage on an external screen is very irritating: the screen is supported via mirroring for the programs (hence black bars), but the targeted ads play in full screen; this is proof that France TV has access to the technology to support full screen, but that it is provided by a third-party component dedicated to ads. A small annoying flaw: an icon too similar to Apple TV's.

  1. Stability ★★★★✩: rare crashes
  2. Navigation ★✩✩✩✩: complex and rather slow, with several entry points (live, channels and categories), but it is not possible to reach the live stream from the channel.
  3. Video playback ★★★✩✩: correct, but the injection of targeted ads is rather haphazard; it is common to see the same ad several times in a row.
  4. Hardware support ★★★✩✩: does not support external screens, relies solely on mirroring. As a result, screen area is lost because the iPad is 4:3 vs 16:9 for a screen

Molotov ★★★✩✩

A very disappointing level of quality for an application that wants you to pay to access free content. I used this application a lot as long as I did not have an external screen; since it is unusable once plugged into a screen, I have abandoned it.

  1. Stability ★★★★★: stable, but I have not used it recently
  2. Navigation ★★★★✩: rather pleasant, but devotes a lot of screens to paid offers
  3. Video playback ★★★★✩: video streams start very quickly, good interface
  4. Hardware support ★✩✩✩✩: no video on the external screen once plugged in; the video is only visible on the iPad, the screen showing a black square where the stream should be. The first version of my post wrongly stated that there was no video at all. I checked, and either my memory played tricks on me, or the bug has been fixed.

M6 Replay ★✩✩✩✩

The worst app from the French TV groups. To put it plainly, I prefer the word 'endure' to the word 'use'; its sole purpose is to make you watch as many ads as possible. The navigation is flashy but totally inefficient, streaming does not work… in short, I cannot stand it.

  1. Stability ★★★★✩: rare crashes, but bear in mind that I rarely use it because of its other blocking flaws
  2. Navigation ★★✩✩✩: the program catalog images are far too big, no use of history to highlight frequently watched programs, the live function is hidden away
  3. Video playback ✩✩✩✩✩: appalling. 30-60 seconds of ads are injected as soon as you try to watch anything, including live (so you lose the same amount of program time). The ad playback is reliable, but the program replay crashes regularly. It is common to sit through 60 seconds of ads only to land on a black screen and, of course, any new attempt means another 60 seconds of ads. Ad breaks interrupt the stream arbitrarily at any moment (including mid-word) and there is no guarantee that the program resumes correctly afterwards.
  4. Hardware support ✩✩✩✩✩: appalling as well. You will read what follows twice because you will think you misunderstood: if you plug in an external screen, the ads (very frequent, remember) are played in full screen, whereas the programs rely on mirroring (same as France.TV), with the bonus that the video is not displayed: the video stream stays black on the external screen

Mutation testing

first steps with Stryker-Mutator .Net

TLDR;

I will explain why mutation testing is an extraordinary tool that pushes you toward superior code quality.
I will also sketch how Stryker-Mutator is implemented.

Mutation testing: WTF?!

Mutation testing is second-order testing, i.e. it tests your tests, not your code. You therefore use it on top of your favourite automated testing practices: TDD, BDD, ATDD…
Having a good mutation score means your tests are good, and you can trust them to catch errors. On the other hand, a low score means your test base won't catch errors! Something that should alarm you.

Underlying Principle

In order to assess whether your tests can spot bugs, mutation testing tools inject bugs into your code base, then run the tests. At least one test should fail, confirming the bug has been found. Otherwise, the bug went undetected, which is obviously not cool!

Workflow

The tool generates mutants, i.e. new versions of your code into each of which it has injected a single bug. Then the mutants are tested using your test base. If the tests succeed, the mutant is a survivor, and this means your test base is imperfect. Conversely, if at least one test fails, the mutant has been killed, and everything is fine. Actually, there is a third option: the mutant can break the logic of the code and create an infinite loop. To handle this situation, mutation testing tools have a timeout feature that kills long-running tests. Timeouts are counted as test failures, but reported separately, as it is impossible to distinguish between an endless loop and a bit of code that simply takes a long time to run (see the halting problem).

The tool generates several mutants, tests them all and then reports the survival rate, that is, the percentage of survivors. It also provides details on each generated mutant and its final status.
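
To make this concrete, here is a small, hypothetical C# illustration (the names are mine, not the output of any particular tool): a tiny production method, a mutant a tool could generate from it, and a boundary test that kills that mutant.

// Original production code.
public static bool IsAdult(int age) => age >= 18;

// A possible mutant: the comparison operator has been altered.
public static bool IsAdultMutant(int age) => age > 18;

// This test passes against the original code but fails against the mutant,
// so the mutant is reported as killed.
[Test]
public void EighteenYearOldsAreAdults()
{
    Check.That(IsAdult(18)).IsTrue();
}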

You want your survival percentage to be as low as possible. 0 would be great.

Limitations

You have to bear in mind that those tools are no small undertaking and come with limitations.

1. Speed

Mutation testing is SLOW. Remember that the tool has to:

  1. Analyze your project
  2. Generate mutants
  3. For each mutant:
     • Compile it
     • Test it
  4. Generate a resulting report

The costly part is of course the inner loop, where the tool needs to build and test each mutant.
For example, for NFluent, Stryker-net generates around 1600 mutants, and a test run takes around 10 seconds. This gives a grand total of roughly 4 (four) hours for complete testing (1,600 × 10 s ≈ 16,000 s). Run time can be significantly improved by using test coverage details so that the engine only runs the tests that may be impacted by the mutation, but this implies a tight collaboration between the test runner, the coverage tool and the mutation testing tool.

2. Mutations

The tool has to generate mutants, but this raises two conflicting goals:

  • On the one hand, you want to inject a lot of mutants to really exercise your tests.
  • On the other hand, you need an acceptable running time, hence a reasonable number of test runs (= mutants).

The tool must also try to generate interesting mutants: the objective is to inject non-trivial bugs. The usual approach is to rely on simple patterns that can be mutated quite easily, such as replacing comparisons (less by greater, or less by less-or-equal, and vice versa), increments by decrements, inverting if conditions…
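
As an illustration, here is a hypothetical C# method (made up for this post) annotated with the kind of simple mutations such tools typically apply:

public static int CountSmallItems(IEnumerable<int> items, int limit)
{
    var count = 0;
    foreach (var item in items)
    {
        if (item < limit)   // typical mutations: '<' becomes '<=' or '>', or the condition is negated
        {
            count++;        // typical mutation: '++' becomes '--'
        }
    }
    return count;
}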

3. Implementation

Creating a mutation tool is a daunting task: in order to generate realistic and interesting mutants, the tool must perform syntactic analysis of the source code to identify patterns it can modify.

For languages having an intermediate form, such as Java or C#/F#, the tool can mutate the bytecode/IL directly (I think PITest does this). This has the great advantage of being simpler to implement (IL/bytecode is a pretty limited language, close to assembler), but with a significant drawback: it may be difficult or even impossible to show the mutation at the source code level.

As a user, being able to read the mutated code is important, as it helps you reproduce the mutants if the need arises.

On the .Net front, this implementation complexity has long been a major obstacle; the most advanced project, Ninja-Turtle, uses IL modification.

Prerequisites

There is an important prerequisite: having decent test coverage. Keep in mind that any significant uncovered block of code/method will be a mutants' nest and will drag your score down.

Discovering Stryker-Mutator.Net

We have a clear and simple test coverage strategy for NFluent: 100% line and branch coverage. Quality is paramount for an assertion library, as the quality of other projects depends on it, and I made a personal commitment to keep the bar at 100%. It sometimes accidentally drops a bit, but the top priority is to restore it as soon as the slip is discovered. You can check it for yourself on codecov.io.

For the past 3 years, some people (well, @alexandre_victoor, mostly) have said I should look into mutation testing to assess the quality of our tests. But when I tried, a couple of years ago, I discovered a bleak reality: there was no actively supported mutation testing tool for .Net.

That is, until September 2018, when the first alpha versions of Stryker-Mutator were released for .Net Core.

First steps

I immediately decided to try it on NFluent; so in mid-October 2018, I installed Stryker-Mutator (V0.3) and made my first runs. This meant adding a dependency to the NFluent test project for .Net Core 2.1 and using the dotnet stryker command to initiate the test.
Sadly, I kept getting a bleak result: no mutants were generated, so no score was computed.

I suspected it was related to NFluent's heavy reliance on Visual Studio shared projects. A glance at Stryker's source code seemed to confirm my intuition, as I discovered that Stryker reads the csproj file in order to discover the source files of the project. The next step was to fork the project on GitHub and debug it to confirm my suspicion. I saw this issue as a perfect opportunity for me. Indeed, it allowed me to fulfil several important ambitions that had been sitting in my backlog for a couple of years:

  1. Contribute to an OSS (besides NFluent)
  2. Secure the reliability of NFluent
  3. Increase attractiveness of Net Core platform
  4. Satisfy my curiosity and understand the underlying design of a mutation testing tool.

In the following weeks I opened 5 pull requests to Stryker for features I thought were important for the project's success:

  1. Support for shared projects (a feature often used by OSS projects for supporting multiple Net framework versions)
  2. Improve performance (of the running tests)
  3. Improve experience when used on a laptop (see footnote 1 for details)
  4. Improve experience when Stryker failed to generate proper mutants
  5. Improve experience by providing estimated remaining running time

I must say that the Stryker project team was helpful and receptive to my suggestions, which is great considering they are still in the early stages of the project and very busy adding features.

Getting the first results

I got the first successful run in mid-November, and the results did surprise me:
224 mutants survived out of 1557 generated, roughly a 15% survival rate. Definitely more than I anticipated, bearing in mind that the project has a 100% test coverage rate.

I assumed I had a lot of false positives, i.e. mutants that were bound to survive anyway.

I was wrong!

Once I started reviewing those survivors, I quickly realised that almost all of them were relevant, and moreover that they were strong indications of actual code weaknesses.

I have been improving the code and tests since then, and on my latest run the survival rate is down to 10.5% (174 out of 1644).

Post mortem of failed kills

The surviving mutants can be classified into categories. Please note that I have not established any objective statistics regarding those categories; I am only sharing my impression of the size of those various groups.

1. No test coverage

That is, mutants that survived simply because the code was not exercised by any test whatsoever. It should not have happened, since I have 100% test coverage. Yes, but NFluent relies on several test assemblies to reach 100% coverage, and current Stryker versions can only be applied to a single test assembly.
We use several assemblies for good reasons, as we have one per supported .Net framework version (2.0, 3.0, 3.5, 4.0, 4.5, 4.7, .Net Standard 1.3 and 2.0) as well as one per testing framework we explicitly support (NUnit, xUnit, MSTest).
But also for less valid reasons, such as testing edge cases for low-level utility code.

For me, those survivors are signs of a slight ugliness that should be fixed but may not be, due to external constraints, in the context of NFluent. As I said earlier, I suspect this is the largest group, 25-30% of the overall population (in NFluent's case).

2. Insufficient assertions

That is, mutants that survived due to some lacking assertions. That was the category I predicted I would have a lot of. NFluent puts a strong emphasis on error messages and, as such, the tests must check the generated error messages. It turns out that we did not test some error messages, so any mutation of the associated text strings or code could survive. Sometimes it was a simple oversight, so fixing it simply meant adding the appropriate assertion.

Sometimes it was a bit trickier; for example, NFluent has an assertion for the execution time of a lambda. Here is (part of) the failing check that is part of the test code base.

// this always fails as the execution time is never 0
Check.ThatCode(() => Thread.Sleep(0)).LastsLessThan(0, TimeUnit.Milliseconds);

The problem is that since the actual execution time will vary, the error message contains a variable part (the actual execution time).

Here is the original test in full:

[Test]
public void FailDurationTest()
{
    Check.ThatCode(() =>
        {
            Check.ThatCode(() => Thread.Sleep(0)).LastsLessThan(0, TimeUnit.Milliseconds);
        })
        .ThrowsAny()
        .AndWhichMessage().StartsWith(Environment.NewLine +
            "The checked code took too much time to execute." + Environment.NewLine +
            "The checked execution time:");
}

As you can see, the assertion only checks the beginning of the message. But the actual message looks like this:

The checked code's execution time was too high.
The checked code's execution time:
    [0.7692 Milliseconds]
The expected code's execution time: less than
    [0 Milliseconds]

So any mutant that corrupts the second part of the message would not be caught by the test. To improve the efficiency of the test, I added support for regular expressions.

[Test]
public void FailDurationTest()
{
    Check.ThatCode(() =>
        {
            Check.ThatCode(() => Thread.Sleep(0)).LastsLessThan(0, TimeUnit.Milliseconds);
        })
        .IsAFailingCheckWithMessage("",
            "The checked code's execution time was too high.",
            "The checked code's execution time:",
            "#\\[.+ Milliseconds\\]",
            "The expected code's execution time: less than",
            "\t[0 Milliseconds]");
}

Yes, the regular expression is still a bit permissive. But all related mutants are killed.

And you know the best part: in the actual NFluent code there was a regression that garbled the error message. It turned out it had been introduced a year earlier by a refactoring, and the insufficient assertions had let it pass undetected.
So I was able to fix an issue thanks to Stryker-Mutator!

3. Limit cases

That is, mutants that survived because they relate to how limits are handled in the code and the tests. Mutations of the limit-handling logic may survive if you do not explicitly have tests for the limit cases.
The typical case is this one:

public static ICheckLink<ICheck<TimeSpan>> IsGreaterThan(this ICheck<TimeSpan> check, Duration providedDuration)
{
    ExtensibilityHelper.BeginCheck(check)
        .CheckSutAttributes( sut => new Duration(sut, providedDuration.Unit), "")
        // important line is here
        .FailWhen(sut => sut <= providedDuration, "The {0} is not more than the limit.")
        .OnNegate("The {0} is more than the limit.")
        .ComparingTo(providedDuration, "more than", "less than or equal to")
        .EndCheck();
    return ExtensibilityHelper.BuildCheckLink(check);
}

As you can see, IsGreaterThan implements a strict comparison, hence if the duration is equal to the provided limit, the check will fail.
Here are the tests for this check:

[Test]
public void IsGreaterThanWorks()
{
    var testValue = TimeSpan.FromMilliseconds(500);
    Check.That(testValue).IsGreaterThan(TimeSpan.FromMilliseconds(100));

    Check.ThatCode(() =>
        {
            Check.That(TimeSpan.FromMilliseconds(50)).IsGreaterThan(100, TimeUnit.Milliseconds);
        })
        .IsAFailingCheckWithMessage("",
            "The checked duration is not more than the limit.",
            "The checked duration:",
            "\t[50 Milliseconds]",
            "The expected duration: more than",
            "\t[100 Milliseconds]");
}

Stryker-Mutator.Net will mutate the comparison, replacing <= by <:

.FailWhen(sut => sut < providedDuration, "The {0} is not more than the limit.")

And the tests keep on passing. My initial reaction was to regard those as false positives. On second thought, I realised that not having a test for the limit case was equivalent to considering the limit case as undefined behaviour. Indeed, any change of behaviour would introduce a silent breaking change. Definitely not something I am ok with…

Of course, the required change is trivial: adding the following test.

[Test]
public void IsGreaterThanFailsOnLimitValue()
{
    Check.ThatCode(() =>
        {
            Check.That(TimeSpan.FromMilliseconds(50)).IsGreaterThan(50, TimeUnit.Milliseconds);
        })
        .IsAFailingCheckWithMessage("",
            "The checked duration is not more than the limit.",
            "The checked duration:",
            "\t[50 Milliseconds]",
            "The expected duration: more than",
            "\t[50 Milliseconds]");
}

4. Refactoring needed

This is the category that hurts me the most, but I deserve it so much I can't complain. Wherever I have complex code, ridden with multi-criteria conditions and multi-line expressions, I get a high survival rate (high as in 30-40%). This method is a good example of such code: it has such a high cyclomatic complexity, and so many overlapping conditions, that many mutants are able to survive. Each of them is a false positive, in essence, but the sheer number of them is a clear smell.
Here is an example of a surviving mutant:

// original line
var boolSingleLine = actualLines.Length == 1 && expectedLines.Length == 1;
...
// mutant
var boolSingleLine = actualLines.Length == 1 || expectedLines.Length == 1;

This flag (boolSingleLine) is used in string comparison to optimize error messages.

It turns out that you cannot devise a test that would kill this mutant: due to the logic in the previous lines (not displayed here), either actualLines and expectedLines both have exactly one line, or they both have more than one.

I was tempted to just mark it as a false positive and do nothing about it; but then I realised that it was a smell, the smell of bugs to come: the flow was objectively so complex that I could no longer understand it, let alone anticipate the impact of any change (link to original code).

So I refactored it toward a simpler and cleaner design (new version here).

5. Algorithm simplification

This relates to the need for refactoring, but with deeper benefits: sometimes you end up with a needlessly complex algorithm, but you do not know how to simplify it. If you already have full test coverage (line and branch), having a survivor may be the sign that you have useless conditions, or unreachable program states.
Here is an example: the InThatOrder implementation method. Its purpose is to verify that a set of values appears in the proposed order within another enumeration (the sut), ignoring missing values. My original implementation's algorithm was:

  1. select the first value V from the list of expected values expected
  2. for each entry T in sut
  3. if it is different from the expected one:
  4. check if T is present in the rest of expected
     • if yes, select its position in expectedLines, and skip duplicates
     • if no and T is present before the current position, return an error
  5. if T is not present in the rest but is present before the current position, return an error
  6. when every sut entry has been checked, return that everything is fine

But Stryker generated a mutant with an inverted condition for line 3 (if it is the same as the expected one)!

I peered at the code and tried to add some tests to kill the mutant, to no avail. In fact, this condition was useless, or to be more specific, it was redundant with the overall logic. So I removed it, achieving cleaner code.

6. Conclusion

A few years ago, Alexandre Victoor (@Alex_victoor) kept telling me about the virtues of mutation testing. I was somewhat convinced, but I saw it as a bit overkill and somewhat impractical; still, I was eager to try it nonetheless. Alas, nothing was available for .Net. The good news is that this is no longer true. And you should try it too:

  1. At the very least it will show you how much risk remains in your code and help you identify where you should add some tests.
  2. If you have decent coverage, it will help you improve your code and your design.

You should try it now. At the time of writing, Stryker has reached version 0.11 and is fully usable. Try it, discover what it entails and what it provides.

It will help you improve your skills and move forward.

Notes:

  1. I had a lot of timeout results on my laptop. I realized it was related to my MacBook Air going to sleep. I revised Stryker's timeout strategy to handle this situation and voilà, no more random timeouts.

Unit testing, you’re doing it wrong

(this is a simple reposting of the Medium version)

TLDR; The existence of untested code in the wild should worry you: most of our lives are now somehow software-controlled. The good news is that you can do something about it. Also, there is confusion about what unit testing means.

Disclaimer

I understand that I am addressing a very sensitive topic; I will probably offend many readers who will say that I am an insane troll and that my views are bullshit. Offending is not my objective, but I stand by my opinions. Of course, comments are here to help you voice your opinion. And yes, this piece is biased by my past experiences, but that's the point of it: sharing my experiences.

'How legitimate are you?'

Fair question. I have a 35-year career in IT; I have worked at companies of various sizes and cultures. I have often been in some transversal position and had the opportunity to meet and work with a lot of developers (think x000s). While most of my positions involved code, I also touched on QA and BA activities. I am now in a CTO-like position for 2500 IT staff and have had the great privilege to work with well-known French experts, as well as lesser-known ones.

So my opinion is based on things and events I have experienced first-hand as a developer, things I have seen others struggle or succeed with, problems encountered by teams I have helped, as well as views and issues that other experts taught me about.
Basically, I have been through all this sh*t and made most of the mistakes listed here.
Of course, this does not imply that I am right, but at least grant me that I have a comprehensive view of what I am talking about.

Fallacies about unit testing

1. TDD is all about unit tests

You keep using that word

Big NO. TDD, a.k.a. 'Test-First Development', is about defining what the code is expected to produce, capturing this as some test(s), and then implementing just enough code to make it pass. Unit testing is about testing small parts of the code in isolation, e.g. testing some class's methods, maybe using some stubs/mocks to strip dependencies.


Unit tests are promoted for their speed and focus: they are small, with limited dependencies, and hence (usually) run fast. When a unit test fails, it is easy to identify which part of the code is responsible.

Actually, TDD is about every form of test. For example, I often write performance tests as part of my TDD routine; end-to-end tests as well.
Furthermore, this is about requirements, not implementation: you write a new test when you need to fulfill a requirement. You do not write a test when you need to code a new class or a new method. Subtle, but important nuance.
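
To illustrate the nuance with a hypothetical C# example (Order and ShippingCalculator are made-up names): the first test captures a requirement, the second merely mirrors a method and couples the test to the current class design.

// Requirement-driven: expresses an expected behavior of the system.
[Test]
public void OrdersAboveOneHundredEurosGetFreeShipping()
{
    var order = new Order(amount: 120m);

    Check.That(order.ShippingCost).IsEqualTo(0m);
}

// Implementation-driven: restates what one method does; it will break on any
// refactoring of ShippingCalculator without protecting any requirement.
[Test]
public void ComputeReturnsZeroWhenFreeShippingFlagIsSet()
{
    var calculator = new ShippingCalculator();

    Check.That(calculator.Compute(120m, freeShipping: true)).IsEqualTo(0m);
}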

And when Kent Beck wrote about tests being isolated, he meant from one another. For example, having one test insert records in a table while another reads that same table is probably a bad idea, as the result of the tests may vary depending on the order in which the tests are run.


2. Automated testing is all about unit tests

No, automated testing describes a process: having tests automatically run as part of your build/delivery chain. It covers every kind of test you can perform automatically: behavior tests, stress tests, performance tests, integration tests, system tests, UI tests…

There is an emphasis on unit tests because they are fast, localized and you can execute them en masse. But feature tests, use case tests, system tests, performance tests, you name it, must be part of your build chain.
You must reduce manual tests as much as you can. Manual tests are expensive and give slow feedback.



3. 100% code coverage requires extensive unit testing

NO, NO, NO and f…g no. In a perfect TDD world, untested code does not exist in the first place.

Writing a test is akin to writing down a contract or a specification: it fixes and enforces many decisions.
Your tests must focus on behavior; behavior-driven and use case tests are the most important ones. Code coverage must include every test, disregarding its type.
Improving coverage by simply adding specific tests for untested methods and classes is wrong. Added tests must be justified by some requirement (new or existing); otherwise, it means the code has no actual use. It will lead your codebase to excessive coupling between tests and implementation details, and your tests will break whenever refactoring occurs.

For example, if you implemented a calendar module that supports Gregorian to Julian conversion, either you have a pertinent test for this feature, or you just remove it.


4. You have to make private methods public to reach 100%

Again, no: private methods will be tested through public entry points. Once again, unit testing is not about testing methods one by one.

Wondering how to test private methods is a clear sign you've got TDD wrong. If this is not clear to you, I suggest you stop unit testing altogether and contemplate BDD. When you get the grasp of BDD, you will be able to embrace TDD.
If they cannot be tested in full, you need to challenge the relevance of the non-covered part: it is probably useless code.
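
A minimal, hypothetical C# sketch of the idea (PriceCalculator is a made-up class): the private helper is never called directly by a test, yet it is fully exercised through the public behavior.

public class PriceCalculator
{
    // Public entry point, used by the tests.
    public decimal TotalWithVat(decimal netPrice) => RoundToCents(netPrice * 1.2m);

    // Private helper: covered through TotalWithVat, never tested directly.
    private static decimal RoundToCents(decimal value) => Math.Round(value, 2);
}

[Test]
public void TotalIncludesVatRoundedToCents()
{
    Check.That(new PriceCalculator().TotalWithVat(10.004m)).IsEqualTo(12.00m);
}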


5. Some code does not need to be tested

The design of the Death Star is Rebel-proof, right?!

This will never happen, right?

This one is somewhat true, but probably not to the extent you think it is: code that works by construction does not require testing if it never changes. That being said, please show me some code that will never change.

Plus, I am an average developer, and my long experience has taught me that my code working on the first attempt is a happy accident.
Even if you are the god of code, chances are somebody else will break your code in a couple of months, weeks or even hours. And yes, that somebody else is probably the future you. Remember, as I said earlier, a test is a contract. And contracts exist because people change, contexts change, etc.

I often get this remark: "Testing getters or setters is simply a waste of time." Seems pretty obvious, doesn't it?
What is wrong with this remark is the implicit notion of testing (trivial) getters or setters in isolation, which would probably be not only useless but likely harmful.
Unit testing is not about testing methods in isolation. Your getters and setters should be tested as part of a larger, behavior-related, test.


6. You need to use a mocking framework

Isn't it cute and mesmerizing?

Nifty, isn't it?

Nope, chances are you don't. Mocking frameworks are great pieces of engineering, but almost every time I have seen a team using one, mocks were pervasive within the test base with little to no added value. I have seen tests that ultimately tested no production code whatsoever, but it took me hours peering at the code to come to that conclusion.

Often teams use mocks to test classes in isolation, mocking every dependency. Remember, the 'unit' in unit testing is to be understood as a module or a component, not a class.

Whenever you decide to introduce a mock, you enforce a contract that makes refactoring more difficult.

Mocks are here to help you get rid of slow or unstable dependencies, such as remote services or some persistent storage.

You should not test for collaboration/dependencies between classes. Those tests are useful if you do bottom-up/inside-out TDD, but you must get rid of them once the feature is complete.
Philippe Bourgau has a great set of posts on this topic if you want to dig further.

7. Tests are expensive to write

Yes, testing is expensive in most industries: think about testing a home appliance, a drug or a new car…

Expensive test run in real life

Actual crash test

But code is incredibly cheap, giving the impression that tests are needlessly costly, in a relative way.

They do require extra effort, but they are an efficient complement to, or even replacement for, specifications; they improve quality, bring fast feedback, and secure knowledge for newcomers.

But green tests look useless both to the team and to management.


8. The ‘testing pyramid’ is the ultimate testing strategy

You have probably heard of the testing pyramid. It basically states that you should have a lot of unit tests, fewer component tests, then fewer integration tests, and so on, up to the top of the pyramid where you have a few use-case-based/acceptance tests. It is used as the default testing strategy for most projects.

Pyramids can be dangerous!

Testing Pyramid

Truth be told, the testing pyramid has outlived its usefulness.
Its original purpose was to address the fact that high-level tests can have a long execution time and that their causes of failure may be hard to identify. It therefore pushes you to invest more in unit tests, which are both fast and local, by definition.

This is also a dangerous analogy, giving the impression that a ratio of 1000 to 1 between unit tests and use-case-based tests is a desirable thing.

You should focus on the top of the pyramid, not the bottom!

I often see teams that have only a couple of high-level tests covering some of the core use cases, or worse, nothing more than glorified smoke tests, and then thousands of method tests to ensure a high coverage. This is not good.

You need to have a decent set of use-case-based tests for your system, ideally covering all use cases, but the major ones are a good start.
These tests must rely on your high-level public APIs, just 'below' the user interface.
Then have some performance tests for the performance-sensitive parts of the application; also integrate failure-reproducing tests, such as external dependencies being down (thanks to mocks), to make sure your system handles those properly.
And then, unit (as in module) tests for the dynamic parts of your code base.
Then understand the trade-off:
* Having few unit tests means your design can easily be changed, but it also means that finding the root cause of a failing high-level test will take time (and probably debugging).
* Having a lot of them means you find issues as soon as they are introduced in the code base, but any significant redesign of your solution will be ridden with failing tests.

If at any point in time you need finer-grained tests, such as class or method tests, throw them away as soon as you no longer need them, such as when the initial design and implementation phase is over. Otherwise they will slowly drag your product down.


What about some truths?

1. Unit tests are not about testing a method in isolation

Here is what Wikipedia proposes:

In computer programming, unit testing is a software testing method by which
individual units of source code, sets of one or more computer program
modules together with associated control data, usage procedures, and
operating procedures, are tested to determine whether they are fit for use.[1]

Good tests must test a behavior in isolation from other tests. Calling them unit, system or integration tests has no relevance to this.

Kent Beck says it so much better than I could ever do.

From this perspective, the integration/unit test frontier is a frontier of
design
, not of tools or frameworks or how long tests run or how many lines
of code we wrote get executed while running the test.

Kent Beck


2. 100% coverage does not mean your code is bug free

This is the first rebuttal I get whenever I talk about 100% coverage.
Of course, it does not. Coverage only shows which parts of the code have been executed. It does not guarantee that the code will work in all circumstances, and it may still fail for specific parameter values, some application state, or due to concurrency issues. Also, it does not prove the code produces the required output in itself; you need to have adequate assertions to that effect.

This is especially true if you only perform unit testing!
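
A hypothetical C# illustration (Basket is a made-up class): the first test executes every line of the production method, yet a broken Total() would still pass it, because nothing is asserted.

// Covers the code, but a broken implementation would still pass.
[Test]
public void TotalIsComputedWithoutAnyAssertion()
{
    var basket = new Basket();
    basket.Add(price: 10m, quantity: 3);

    basket.Total();   // executed, never checked
}

// The same scenario with an assertion actually protects the behavior.
[Test]
public void TotalIsPriceTimesQuantity()
{
    var basket = new Basket();
    basket.Add(price: 10m, quantity: 3);

    Check.That(basket.Total()).IsEqualTo(30m);
}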

Coverage metrics are not about what is covered, but about what is not covered.

Non-covered means not tested. So at least make sure that the non-tested parts are non-critical and that the important parts of your code are properly tested!


3. There is a tooling problem

The truth is that unit tests are in the spotlight mostly thanks to tooling!
We should all be eternally grateful to Kent Beck for creating SUnit, the library which triggered a testing revolution, but we must not stop there.

Are you using test coverage tools (JCov, Clover, NCover, Jasmine…)?
Do you look at their report?

Have you tried continuous testing tools (Infinitest, NCrunch, Wallaby…)?
I have a bias: I am addicted to NCrunch.

Having your tests running continuously is a game changer for TDD!

Me

No seriously, do it, now! It will change your perceived value for tests.

Have you tried Cucumber to have a more use-case-driven approach? You may also consider using mutation testing to assess the quality of your tests. Property-based testing is useful to check for invariants and higher-level abstractions.



4. It is difficult

Yes, but this is no more difficult than designing the software up front.
You face complexity, but what is interesting in test-first approaches is that you have an opportunity to focus on essential complexity, as test code ought to be simpler than the actual implementation.

I have animated many craftsmanship discovery sessions based on Lego exercises (French deck). After the TDD exercise, attendants often express that the difficult part was choosing the right test, and that building the solution was straightforward.
Interestingly, even non-coder profiles (BA, managers, CxO, …) share this feeling, sometimes even saying how comfortable it was just to follow requirements, versus the hardship of identifying a test (in TDD mode).

Choosing the next test is an act of design.

(attributed to) Kent Beck

I attribute this difficulty to a set of factors:
1. it forces you to think problem-first, while solution-first is everyone's comfort zone
2. it constrains your design, and nobody likes extra constraints
3. it gives you the impression of being unproductive

But all those factors turn into benefits:
1. Problem-first is the right focus!
2. Constraints help you drive the design. And as you are problem-first, this is bound to be a good design.
3. Worst case, tests will be thrown away. But they helped you build a solution and a deep understanding of the problem. At best, they prevent future regressions, and provide help and documentation for future developers.

Writing tests is never unproductive.


5. Tests require maintenance


Tests require maintenance effort like any other piece of code. They need refactoring along with the source code, but they may also require refactoring on their own.
They will have to be updated if new use cases are identified, or if existing ones must be altered.

To sum it up: tests are part of your codebase and must be treated as such.
Which leads to the next truth:


6. Having too many tests is a problem

Since tests need to evolve with the production code, too many tests will hamper your productivity: if changing a few lines of code breaks a hundred tests or more, the cost (of change) becomes an issue.
This is a sure sign of failing to tend to your tests appropriately: tests may be replicated with only minor variations, each one adding little value.

I have seen projects and teams that were ground to a halt due to having a far too large test base. In such cases there is a strong likelihood that the test base will simply be thrown away, or cut through savagely.


Ultimately, tests also increase build time, and as you are doing continuous
build/delivery (you are, aren’t you?), you need to keep build time as low as
possible.

This has a clear consequence:


7. Throwing away tests is a hygienic move

It should be obvious by now that you need to maintain a manageable
number of tests.

Therefore you must have some form of optimization strategy for your test base. Articles are pretty much non-existent for this kind of activity, so let me make a proposal:
– getting rid of scaffolding tests should be part of your TDD/BDD coding cycle. By scaffolding tests, I mean tests that you used to write the code in the first place, identify algorithm(s) and explore the problem space. Only keep use-case-based tests.
– make regular code coverage reviews, identify highly tested lines and remove tests you find redundant.

You can see this thread for an extensive discussion on having too many tests.


8. Automated tests are useful

Last but not least: automated tests have a lot of value.
Yes, a green test looks useless, like any safety device: seat belt, life vest, emergency brakes…

If you practice
TDD, tests have value right now. But even if you don’t, tests have value in
the long run.

An interesting and important 2014 study analyzed 198 user-reported issues on distributed systems. Among several important findings, it concluded that 77% of the analyzed production issues can be reproduced by a unit test.

Another key finding was that almost all catastrophic failures were the result of incorrect error handling.

Simple testing can prevent most critical failures — source study


Conclusion

First of all, thanks for having the patience to read this far. If you are dubious about unit tests, I hope this article cleared some of your concerns and gave you some reasons to try them.
If you are already doing unit testing, I hope I offered you some guidance to help you avoid the dangerous mines that lie ahead.
And if you think you're a master of unit testing, I hope you share my point of view and that I gave you strong arguments to convince others.

Each of the facts I listed previously is worthy of a dedicated talk or article.
Digging further is left as an exercise for the so minded reader.

Remember:
1. Tests are useful; they can prevent catastrophic failures.
2. Test behaviors, not implementation. A.k.a. understand what 'unit' stands for in unit tests.
3. Maintain your test base with the delicate but strong hand of the gardener: gently refactoring when necessary and pruning when no longer useful.

100% code coverage is good

A quick return on experience about 100% code coverage

read comments here

TLDR; maintaining 100% coverage brings many benefits, you need to try it.

A few years ago I blogged about aiming for 100% code coverage with your tests. That post made some noise and the feedback was essentially negative. I was even called a troll a few times…

Being stubborn and dedicated, I understood I needed to put my money where my mouth was and start practicing what I preached. I did, and this post is about what I learned by reaching 100% code coverage with my tests.
