Copilot is a worrisome proposal, but not for the reasons you may think of.

Thoughts on GitHub Copilot

TL;DR

GitHub Copilot is a disruptive proposition that could change forever how developers work. In this post, I will give you examples of Copilot's successes and failures; I will also elaborate on its positive and negative impacts, and will risk a prognosis.

While it is difficult to take a firm position, I think it will bring significant, yet incremental, progress. But it raises many questions around collaboration, and I think we should be concerned for the future of OSS (and I am not thinking about licensing issues).
I propose to address the following questions:

  1. What is Copilot?
  2. Is it useful?
  3. What are the impacts of such a tool?
  4. What does it tell us about our trade?
  5. Could it be made truly useful?

Disclaimer

I have yet to experience Copilot first-hand, but I have seen enough videos and read enough feedback to get the gist of it. In any case, I will mostly talk about the concept, not the product. This is not a product review!

What is Copilot?

GitHub Copilot is touted as ‘Your AI pair programmer’.
From a user-experience point of view, it works somewhat like an auto-completion engine, except that it does not simply suggest the end of the word you are typing (such as ToS ==> ToString()), but full functions/methods or chunks of code.

SendTweet sample: Copilot generates a SendTweet function in Python

The most impressive results are achieved when Copilot is able to suggest a full function based simply on the code comments you typed.

Comments to shell command

Note that Copilot can often offer several alternative suggestions, which you can cycle through with the keyboard before accepting one; then you can change, complete and alter the code as usual.
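
To give a feel for the interaction, here is a sketch in C# (the screenshots above show the Python originals; the comment and signature are what you type, the body is the kind of thing Copilot proposes):

    // You type the comment and the signature...
    // compute the greatest common divisor of a and b
    static int Gcd(int a, int b)
    {
        // ...and Copilot proposes a body like this one, which you can
        // accept as-is or edit like any other code.
        while (b != 0)
            (a, b) = (b, a % b);
        return a;
    }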

From an implementation perspective, Copilot is a service built on top of Codex, OpenAI's GPT-3 derivative dedicated to code generation (see this article for more). In short, GPT-3 couples NLP with a (huge) neural net to produce very convincing documents based on correlations with the provided keywords.

Here, Codex has been trained on every public GitHub repo, disregarding their respective licenses: if it's public, it is fair game!

GitHub support's answer to the license question.

Side note: I wonder how GPT-3 deals with the grammar of various languages. I have only seen text examples in English; I wonder how good it is with languages with stricter and more complex grammar, such as German or French. This is a relevant question for Codex, since correct grammar is essential for computer languages.

Wait! What?! Disregard for the licenses?!

Yes, this issue has already been heavily discussed elsewhere. In short, there is no actual issue. A more detailed answer is:

  1. No, GPL does not result in Copilot-generated code being GPL as well (same for any viral license)
  2. This is akin to reading other people’s code to learn from them, definitely fair use
  3. Trying to fight this means applying copyright laws (and principles) to something that exists BECAUSE copyright laws were seen as hampering creation

Nope, not a good fight, sorry.

It is a benefit, right?

Not sure… Let’s see.

Simple case

Let's assume that Copilot works flawlessly for simple requirements and only partially for more complex ones.
The following logical demonstration is based on a simplistic view of the development effort, but I trust everyone will follow it.

So it will help developers code simple requirements properly with little effort, with a significant production increase for those.
Production here is expressed both as the number of requirements covered (KPI #1) and as the number of lines of code (KPI #2). Neat, isn't it?

You see the problem? As a trade, we know that in general we want to maximize the number of implemented requirements while minimizing the amount of written code; that is, keep the ratio KPI #2 / KPI #1 (lines of code per requirement) as low as possible.

Why? Because we know there is a maintenance cost associated with every line of code. Even for a simple function that rarely needs change: what if the code needs to be migrated, or another team using different coding patterns takes the code over? A line of code is both an asset and a liability!

Today, almost no one boasts about how large their code base is!

One may retort that making code 'production' simpler will not necessarily result in more code; I simply suggest looking into Jevons' paradox, and at IT history, which constantly demonstrates that whenever code gets cheaper to create, we end up with more and more of it.

So a system that ends up favoring the amount of written code does not seem so smart. In these simple terms, I don't think Copilot brings value if it is only able to support simple requirements.

What about more complex requirements?

Here be dragons

Everybody with some professional coding experience knows how hard it is to extract and capture real-world requirements in a written, structured form (spoiler alert: Copilot will not help you there).

For the sake of the argument, let's say that Copilot can process simple business requirements (process, not understand; it does not understand anything). All the examples I have seen so far imply there is still significant work left for the human developer once she/he has chosen the best Copilot proposal. So we end up with some hybrid AI/human code, with no marking to tell the two apart. Code generation history has told us repeatedly this is not a good idea:

  1. Those requirements are likely to change over time. Sadly, Copilot knows how to generate code, not how to change it in the face of a shift in requirements. In all likelihood, it means regenerating the code as a whole, not altering it.
  2. And God forbid this implies some signature change: since Copilot does not rely on an understanding of the language syntax, it is not able to perform any refactoring, such as dealing with the impacts of a signature change.

So Copilot may help in the short term here, but this contribution may turn out to be a blessing or a curse.

So what about productivity then?

I am now pretty convinced Copilot brings little benefit in terms of raw productivity, and I think MS thinks the same:

If anyone knows how to sell software, it is Microsoft (remember, Bill Gates kinda invented the concept of paid software). Hence I am pretty sure MS people know this pretty well; otherwise they would already have a commercial/paid tier on offer.
As of now, we have an MVP released in the wild to see where it gets traction and how to extract value from it.
It may very well end up as a failed experiment (remember Tay?) or it may find its market. My best guess is that it will remain a niche market, e.g. used by coding sweatshops producing low-quality websites/apps for SOHOs.

So why worry?

A bit of context first

First, let me tell you a bit about my personal experience with coding, so that you understand where I come from and can guess my biases.
I started to code in the mid-eighties; professional developers were in short supply everywhere, and as working code was really expensive to produce (compared with today), there was a strong focus on DRY and code reuse. Libraries were seen as THE solution; alas, libraries were scarce. The languages provided some (standard libraries) and a few specialized software vendors sold commercial ones, but most existing libraries were internal/private. Fast forward a couple of decades to the early 21st century: the Internet and the OSS movement proved to be the enablers of a thriving library ecosystem, one that ended up fully reinventing our technical stacks (from vendor-provided to open source).

An ode to OSS libraries

Sorry, I had to do this. 😀

Libraries are great. They provide us with ready-made solutions for some of our requirements, but most of all, they allow for a separation of concerns!
The library's team is in charge of identifying the correct abstractions and of building an efficient implementation. As such, using a library helps you right now, when implementing, as well as in the future, when issues are found or changes are required.
If you copy-paste the library code instead of depending on its distribution package, you will have to deal with any needed changes yourself in the future. But the worst part is that you will first have to understand its design and internal abstractions if you want to maintain and fix it, and you will need deep understanding if you want to extend it.

Using an external dependency beats the s**t out of copy/pasting part of it, hands down, every day; that is, assuming a decent dependency ecosystem (remember the left-pad debacle five years ago).

The problem for OSS

Let's take a systemic view of this:

If Copilot is useful for the short term (and this is a big ‘if’, as we will discuss later on):

  1. Copilot will provide bits of code to cover part of the requirements.
  2. So coders are less likely to look for OSS libraries to help them.
  3. Fewer users means less feedback (issues and feature requests), fewer contributors and less motivation for OSS authors.
  4. Less energy in OSS results in a slightly less dynamic OSS ecosystem.
  5. A less dynamic ecosystem increases the relative value of Copilot.
  6. Back to (1)

Here we have a (slow) Copilot usage reinforcement loop that could theoretically lead to a complete drying out of OSS ecosystems.
Which would be a bummer, since the OSS ecosystem is the source material for Copilot.

I am not saying this will, or even could, happen. But I see no interesting equilibrium point beyond a marginal use of Copilot.

Note that there is a parallel to be drawn between Copilot and (arguably) the most famous coding website: StackOverflow.

The parodic idiotic coder who copy-pastes StackOverflow-found code without adjusting it to his/her own project would be replaced by the idiotic Copilot user who fails to correct the generated code.

Except that fixing Copilot's output will likely require more work and better skills.

Also, the value of StackOverflow does not reside in the posted code extracts, but in the embedded social network, which increases their value a hundredfold by providing context and assistance to people looking for help.
Features that Copilot sorely lacks.

But Copilot is still useful, right?

Watch out for the bugs

Well, it is still early for a definitive answer, but I am getting more skeptical by the day.
I think we can draw a parallel with self-driving cars: we are, slowly, getting Level 4 assistance (see here for the level definitions), but Level 5 seems further away every time we look at it.

The main problem with Level 4 is making sure the driver takes over when necessary. For a car, the issue is that the driver's focus will not be on the road when the problem arises, leading to a dangerous delay. For Copilot, the issue is that the problem will be hidden in the complexity of the code.

Let me illustrate this with a Copilot example (see original tweet):

Bug sample

You see the problem? You probably won't at first glance. You may very well never see it if you are not familiar enough with color representation.

The color value can never be 16777215, i.e. 0xFFFFFF, a.k.a. pure white!

  1. The fix is simple: you need to use 16777216 instead.
  2. How do you fix every copy of this code? You don't, as you can't identify them: Copilot will likely have 'generated' slightly different versions, with varying variable or function names for example.
  3. How do you make sure future versions of this algorithm are correct? You can't, as you cannot identify its source!
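
To make the defect concrete, here is a minimal C# rendition of the faulty pattern (illustrative: the original suggestion was not C#, and the names are mine):

    // Buggy version, as suggested: 16777215 == 0xFFFFFF, and Random.Next's
    // upper bound is exclusive, so it yields 0..16777214: pure white (#FFFFFF)
    // can never be produced.
    static string RandomColorBuggy(Random rng) => $"#{rng.Next(16777215):X6}";

    // Fixed version: 16777216 == 0x1000000, so the range is 0..16777215 inclusive.
    static string RandomColorFixed(Random rng) => $"#{rng.Next(16777216):X6}";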

Furthermore, this example also illustrates that Copilot has absolutely no semantic understanding of what it does: if it relied on some internal understanding of what a color is (from an IT perspective), the code would have been correct.

It is likely a source of subtle bugs…

But Copilot will resolve the problem of boilerplate code

This one is very likely. Boilerplate code, the lines that must be written due to some technical requirement (generally imposed by a library) and that bring little value to the business requirements, can be masterfully handled by Copilot.

From my point of view, boilerplate code is the sure sign of a design in need of improvement. If Copilot removes this pain, the design will never be improved and we will rely on Copilot as a crutch instead.

The best way to deal with boilerplate code is to review the design that led to it in the first place.

It will help people write tests

I have seen several examples of using Copilot to generate unit tests out of comments. That could be an interesting approach, but I am not sure how it could prove better than using BDD tooling (Cucumber/SpecFlow…).

Being a TDD practitioner, I see writing a test as a design exercise; as such, I think of this step as the one that requires the most skill. Hence, it does not feel natural to me to delegate it to an AI.

What could be interesting is if Copilot were able to generate code that makes tests pass. Codex has been tested against some coding challenges (see the white paper here), and the paper shows that using sample results to select the most promising Copilot suggestion can achieve a 78% success rate.
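
Here is a rough sketch of that selection scheme; both delegates are hypothetical stand-ins for the model call and the test harness, not a real Copilot/Codex API:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class SuggestionPicker
    {
        // Sample many candidate implementations and keep the first one
        // that passes the provided sample tests.
        public static string? Pick(
            string prompt,
            Func<string, IEnumerable<string>> generateCandidates,
            Func<string, bool> passesSampleTests)
            => generateCandidates(prompt).FirstOrDefault(passesSampleTests);
    }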

I see reason to worry again here, as such a tool may divert coders from TDD, or TDD-like practices. Indeed, why bother with tests if Copilot generates good code for you? To that, I will retort that:

  1. Tests are useful when writing but also when maintaining code (and code needs to be maintained, until Copilot does it for you)
  2. Copilot may not generate good code at first attempt.

First users are delighted

First-time users are definitely amazed by Copilot, often talking about 'magic-like' results (a nice reminder of Clarke's third law), but I am waiting for longer-term evaluations. I am not holding my breath for those, as I expect them to reveal several limitations that reduce the interest of the tool.

There is one important thing to bear in mind: one of the problems with neural nets, especially deep ones, is that nobody really understands how they work. Don't get me wrong: their architecture, principles and general design are well documented, but since such a net is a self-calibrating engine, it is hard to explain afterwards how decisions are made.
Let me illustrate that with a clear contradiction. See what GitHub says about the risk of Copilot duplicating something from the training set:

We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

Not so fast!!
As you can see from the following video, you can twist Copilot's arm until it reveals its sources of inspiration.
The conclusion is not that the authors lied about how Copilot works, but that they have only an approximate grasp of how Codex behaves in the real world.
This is sorcerer's apprentice stuff.

Why would anyone think this is a good idea in the first place?

I must say I was initially really mad at seeing someone push for yet another tool that results in producing even more lines of code than necessary. Then I realized that Copilot is a solution in search of a problem, or a technology looking for its market.

It does not pretend to be a solution to anything, just an incremental improvement. Still, this is disappointing news: we need new tools that help us increase the quality of the code we produce, not its quantity. Code is pervasive in our lives and the impact of imperfect code gets higher by the day: there is open source software on Mars. Nowadays, no developer can affirm that none of their code has life-or-death stakes; we simply do not know. But that warrants a specific post. In the meanwhile, we are still waiting for new tools and practices to be invented and to emerge.

Is there any chance GitHub Copilot could be useful?

Not in its current form, as I said in the previous paragraphs. But I think different approaches could be useful (assuming they are feasible; I am no GPT-3 expert):

  1. Use it as a library/framework search tool. Picking a library is a really difficult exercise, and finding candidates is an even harder challenge. A tool for that could be a game changer; it would also be a power-play quagmire, but that's another discussion.
  2. Generate code out of unit tests. This would be a boon for TDD like practices.
  3. Use it to suggest code improvements/simplifications. Crafting good code is difficult, we need help for this.
  4. Make it contextual: Copilot should learn from your project context and adjust its suggestions accordingly, so that it can base its proposals on your dependencies, for example. Probably hard to do.
  5. Create a feedback loop, à la StackOverflow. There should be a way for Copilot to learn directly from its users. Maybe it already does, but there is no evidence to corroborate this.

Conclusion

In short, GitHub Copilot

  1. Is a significant technical achievement
  2. May be a danger to OSS in general
  3. Is unlikely to be a success in its current form
  4. May never reach a commercially available form
  5. Is the sure sign of similar AI-powered tools to come.

References

Here are a few further references:

  • Is GitHub Copilot a blessing, or a curse?

  • GitHub Copilot and the Unfulfilled Promises of an Artificial Intelligence Future

  • Captain Stack

Workload management strategies

In a React world, everything is an event that anybody can grab and process according to its own responsibilities. The notion of event allows for very low contract coupling; it acts as a medium between classes. A message is similar, but with more coupling, as it embeds endpoints or endpoint addresses.
This model offers a flexible design where events flow and are processed through the system. But what happens when you have too many events to process?


The secret for 100% test coverage: remove code

Update note

Based on the interesting feedback I got (visible on Tom's ramblings), I realized this post probably needed some tweaking and scope precision. I have put them at the end.

What is the adequate objective for test coverage?

60%?

80%?

99%?

100%?

I have often pondered this question, like many before me and many after me, I suppose. Why aim for 100%? The 80/20 law clearly applies to test coverage: trying to cover every corner case lying in the code requires a significant investment in time and brain cells. Plus, integration points can never really be properly covered.

On the other hand, having 100% coverage provides huge benefits:

  1. Every single line of code is constantly tested
  2. Trust in code is high
  3. Any uncovered line is a regression

What happens if the target is, say, 80%?

  1. A significant part of the code is never tested
  2. Trust in code is moderate and can degrade
  3. 20% of uncovered lines is significant: 2,000 lines for a 10K-line code base. Entire namespaces can hide in there.

For me, there is no question: 100% coverage is the only worthy objective. Do not settle for less.

Yes, there are exceptions, usually at integration points. Mocks are not a real solution either: they can help you increase your coverage, but not by that much. The pragmatic solution is to wrap integration points into isolated modules (jars/assemblies); think hexagonal architecture. You will have specific coverage targets for those, you also need to make sure that no other code creeps in, and finally, you must understand that those are weak points in your design.
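
A minimal sketch of that isolation, hexagonal style (all names are illustrative):

    using System.Net.Mail;

    // The port: the fully covered core depends only on this abstraction.
    public interface IMailPort
    {
        void Send(string to, string subject, string body);
    }

    // The adapter wraps the real integration point. It lives in its own
    // assembly, with its own (explicitly lower) coverage target.
    public sealed class SmtpMailAdapter : IMailPort
    {
        public void Send(string to, string subject, string body)
        {
            using var client = new SmtpClient("localhost");
            client.Send(new MailMessage("noreply@example.org", to, subject, body));
        }
    }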

While working on NFluent, I constantly make sure unit tests exercise every single line of code of the library. It also means that I help contributors reach the target. It is not that difficult, especially if you do TDD!

There is one simple golden rule: to reach and maintain 100% coverage, you do not need to add tests, you have to remove uncovered lines!

This is important, so let me restate it: a line of code that is not covered is not maintainable, must be seen as not working, and must be removed!

Think about it:

  1. The fact that no automated test exists means that the behavior can be silently changed, not just the implementation!
  2. Any newcomer, including your proverbial future self, will have to guess the expected behavior!
  3. What will happen if the code gets executed some day in production?
  4. If you are doing TDD, you can safely assume the code is useless!

So, when you discover uncovered lines, refactor to remove them or to make sure they are exercised. But do not add tests for the sake of coverage.

Having 100% coverage does not mean the code is bug free

Tom's comments implied that I was somehow trying to promote this idea. 100% coverage is no proof of bug-free code at all, and does not imply it at all. The quality and relevance of your tests are essential attributes; that is exactly why I promote removing untested lines rather than adding tests: any test specially crafted for coverage would not be driven by an actual need and would be artificial. The net result would be a less agile code base.

On the other hand, if you have 100% coverage and you discover some reproducible bug, either by manual testing or in production, you should be able to add an automated test to prevent any recurrence.

When coverage is insufficient, there is a high probability that you will not be able to add this test, keeping the door open for future regressions!
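
When you can add it, such a non-regression test might look like this (illustrative names; NFluent syntax, Invoice.ComputeVat being a hypothetical stand-in for the faulty code):

    using NFluent;
    using NUnit.Framework;

    [TestFixture]
    public class InvoiceRegressionTests
    {
        // Pins the reproduced bug so it cannot silently come back.
        [Test]
        public void VatShouldBeRoundedToTwoDecimals()
        {
            Check.That(Invoice.ComputeVat(10.004m)).IsEqualTo(2.00m);
        }
    }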

If you want to build trust based on coverage metrics, you need to look into branch coverage and property based testing at the very least. But I do not think this is a smart objective.

Note

  • This post focuses on new code! For legacy code, the approach should be to add tests before anything else, and never remove working code 🙂

Enlarge your TDD with NFluent

Full disclosure: I am an active contributor to NFluent.

Imagine your first days on a new project that has a decent unit test base (if not, change this or walk away).

You start coding away on a fix or a new feature, you extend the test base, you happily commit! Then the build factory signals a failed build due to failing tests.

Well, that's actually expected: unit tests are there as a safety net.

But then you realize that the test error message does not help. Neither does looking at the test code, as the assertion syntax does not properly reflect the test's intent.
Or maybe the previous developer failed to realize that the expected value comes first in Assert.AreEqual, or the comment has not been updated.

In the end, you have to debug the test to understand what is going wrong.


This hassle fuels the naysayers who claim that unit tests are a cute idea but:

  • do not provide added value to the product
  • are expensive to write
  • are a burden in maintenance

They are basically right. Of course this is a short-sighted vision, but those are actual issues with many code bases today, with the notable exception of OSS.

This is a serious issue that needs to be addressed.

Part of the problem comes from the fact that unit test tools have not significantly changed in the past decade, a time when the main challenges were implementing test runners and building testing infrastructures. Then there was interest in building more efficient test runners, with the integration of multiple scenarios for a single test, or the generation of variants. But no significant effort has gone into the API.

Coming back to my earlier example: we, TDD craftsmen, need an API that allows us to be expressive in the assertions/checks we make. We want the IDE to help us, typically through IntelliSense; we want to be able to add our own checks; we want to express conjoined test criteria as a single check; and we want newcomers (including our proverbial 'future self') to understand clearly what the test is about.

NFluent has been designed to answer those requirements, as well as some others. It is an OSS assertion library that works on top of all unit test frameworks (NUnit, MbUnit, xUnit, even MSTest). You can start using it now on any existing code base or new project. It is available through NuGet and is guided by the brilliant T. Pierrain (@Tpierrain).
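
A small taste of the syntax (NUnit shown here; the checked values are mine):

    using NFluent;
    using NUnit.Framework;

    [TestFixture]
    public class SyntaxSample
    {
        [Test]
        public void ChecksReadAsSentences()
        {
            var heroes = "Batman and Robin";

            // The value under test comes first, so there is no
            // expected-vs-actual ordering trap, and failures report both sides.
            Check.That(heroes).StartsWith("Batman").And.Contains("Robin");
        }
    }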

Check it: http://n-fluent.net/

Date and time pitfalls

I just finished adding a feature to a small tool, one that lets the user define trigger times in any time zone they wish. It gave me the opportunity to assess the situation regarding date and time classes. The status is appalling: both C# and Java standard classes simply suck at the exercise, for different reasons, but they lead to the 'flatland of desolation' instead of the 'pit of success'. And they will leave you stranded if you need more than 'local to UTC' and 'UTC to local' conversions. This gives us an opportunity to take a tour d'horizon of the topics of date, time, DST and time zones. Let's start with some useful definitions…

Quick definitions

  • Calendar: gives names to periods of time. The smallest period represented in a calendar is a day. Multiple calendars exist; most of you are probably aware of some of them: Jewish, Muslim, Chinese… Keep in mind: a calendar is just a representation, and many calendars exist.
  • GMT: Greenwich Mean Time. The current time of day as viewed from the Greenwich meridian. Long used as the global time standard.
  • UTC: Coordinated Universal Time; it has replaced GMT as the global standard. The difference with TAI is that UTC accounts explicitly for leap seconds, whereas TAI does not. As of today, I am not aware of any library that manages leap seconds (see below).
  • TAI: International Atomic Time, a reference time established by averaging the output of more than 200 atomic clocks, based on the current definition of the second. The acronym comes from the French: Temps Atomique International. This standard is an absolute that disregards the Earth's rotation. It is roughly synchronized with GMT; there is currently a 35-second gap.
  • Leap seconds: due to the slowdown of the Earth's rotation, extra seconds have to be inserted so that the actual time of day still matches its definition. They are added at the end of the day (GMT time) as an extra second: after 23:59:59, the clock marks 23:59:60 and then 00:00:00. They are added either on June 30th or December 31st; as of today, 25 leap seconds have been added since 1972, roughly one per year up to 1998 and around one every 4 years since.
  • Time zone: a geographical zone where the current time is the same. Originally, time zones were defined by an offset against GMT, but since the introduction of daylight saving time, they have become more complex.
  • Local time: the time as seen within a given time zone.

Timezones in the world since September 20, 2011 (Photo credit: Wikipedia)
Now it is time to discuss the pitfalls…

#1 Time vs time: same difference?

This is a big one: when the user/requirement speaks about time, what does it actually mean? Local time or universal time? When dealing with humans, you can assume local time, but turn the implicit into the explicit and ask the question. For machine-to-machine exchanges, universal time is your best bet, unless humans are somehow involved.

The next question is: are multiple time zones involved? Probably; most systems are global in some way. The use case probably needs refining, such as: as a user, I want this process to happen at 10 AM Paris time and 4 PM Hong Kong time. Pretty straightforward, don't you think? Nope! Make the implicit explicit: does it mean simultaneously? Yes, of course, as it happens once, at 10 AM Paris, so 4 PM Hong Kong. Are we there yet? Nope!!! It turns out that 10 AM Paris may not be equivalent to 4 PM Hong Kong, because Paris has daylight saving time and Hong Kong does not: half of the year the gap is 6 hours, and it is 7 hours the other half. So, now the question is: is the reference Paris time (10 AM) or Hong Kong time (4 PM)? Or maybe 8 AM UTC? Conclusion: when dealing with time, make sure to properly identify the time zone.
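
To show how explicit zone handling settles the question, here is a minimal NodaTime sketch of the Paris/Hong Kong case (the dates are mine):

    using System;
    using NodaTime;

    // 10 AM Paris is 4 PM Hong Kong only while Paris is on DST (UTC+2).
    var paris = DateTimeZoneProviders.Tzdb["Europe/Paris"];
    var hongKong = DateTimeZoneProviders.Tzdb["Asia/Hong_Kong"];

    var summer = paris.AtStrictly(new LocalDateTime(2015, 7, 1, 10, 0));
    var winter = paris.AtStrictly(new LocalDateTime(2015, 12, 1, 10, 0));

    Console.WriteLine(summer.ToInstant().InZone(hongKong)); // 16:00 in Hong Kong
    Console.WriteLine(winter.ToInstant().InZone(hongKong)); // 17:00 in Hong Kong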

#2 A date is not defined at midnight

In C#, there is no date-only type, therefore the (enforced, see DateTime.Date) usage is to set the time to 00:00, as in: March 23rd is 2015/03/23 00:00:00 when you have no concern for time.

Bad! As soon as you do any kind of transformation (arithmetic, time zone conversions, etc.), you run a significant risk of losing a day in the process. For example, if the initial date is the day DST ends, moving forward by 1 day (i.e. 24 hours) will stay on the same date, at 11 PM. Look for a date-only class.
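
A minimal reproduction with the BCL (assuming the Windows id for Paris time; October 25th, 2015 was the day DST ended in Paris, a 25-hour day):

    using System;

    var paris = TimeZoneInfo.FindSystemTimeZoneById("Romance Standard Time");
    var midnight = new DateTime(2015, 10, 25, 0, 0, 0, DateTimeKind.Unspecified);

    var utc = TimeZoneInfo.ConvertTimeToUtc(midnight, paris);
    var plusOneDay = TimeZoneInfo.ConvertTimeFromUtc(utc.AddHours(24), paris);

    Console.WriteLine(plusOneDay); // 2015-10-25 23:00: still the 25th, a day was 'lost'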

#3 Full UTC may not save you

Once you have been bitten by some time zone issue, you may be tempted to go full UTC, i.e. have all internal dates expressed as UTC datetimes. Alas! As soon as you need to relate those to the user, you can either show them as UTC or in his/her current local time. But this may lead to some unexpected results if the user has changed time zone for some reason, or entered/exited DST. Conclusion: store the reference time zone whenever you need to store a DateTime or a time (without the date).

#4 Scheduling is a nightmare

Well, any type of scheduling is more complex than one may think: 99% of the time, it will be implemented by computing the delay before the event occurs and then waiting for that delay to elapse. Maybe someday OSes will natively provide date-and-time-based scheduling, but for now, you need to deal with self-computed wait times and delays. To compute a delay, you need to use the same base, i.e. the same time zone, for both the start time and the scheduled event; then the delay can be used as-is with any timer primitive that suits your need.
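
A sketch of that computation with the BCL (the user's zone id is an assumption of this sketch):

    using System;
    using System.Threading.Tasks;

    // Goal: fire at 10:00 tomorrow, *user* time.
    var userZone = TimeZoneInfo.FindSystemTimeZoneById("Romance Standard Time");

    // 'Tomorrow' must be evaluated in the user's zone, not the server's.
    var nowInUserZone = TimeZoneInfo.ConvertTimeFromUtc(DateTime.UtcNow, userZone);
    var localTarget = nowInUserZone.Date.AddDays(1).AddHours(10);

    // Convert both sides to the same base (UTC) before computing the delay.
    var utcTarget = TimeZoneInfo.ConvertTimeToUtc(localTarget, userZone);
    var delay = utcTarget - DateTime.UtcNow;

    await Task.Delay(delay); // any timer primitive works once the delay is known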

Remember that a user's time zone is not a constant attribute: when I travel, I no longer expect my phone to wake me up at 6 AM Paris time; that would make me very angry in New York!

#5 There is no shared standard for timezone identification

Microsoft OSes were among the first to fully support time zones, i.e. not only supporting the standard time offset but also the daylight saving time period. And they established their own time zone referential. But due to its proprietary dimension, it never got any traction outside Windows. To be fair, it exhibits shortcomings, the biggest one being its poor ids for the time zones.

But there is a good alternative: the IANA time zone database. It offers nice ids, based on regions and cities, and it is used in Java, iOS, Linux… basically everywhere but Windows. But the naming convention is not practical for storage or communication: it is Area/Location, such as Europe/Amsterdam, which is a bit long to include with each exchanged date and impacts storage/bandwidth requirements.

#6 There is no adequate framework for unit testing

I am not sure what the situation is for Java, but trying to simulate/set a time zone in unit testing is hard, to say the least. And it gets even harder if you need two: one for the server and one for the (mock) user!

Rules of thumb

  1. Store application events in UTC
  2. Properly identify each user's time zone
  3. Use a date-only class if you only need dates
  4. Schedule events in the user's time zone
  5. Use J/NodaTime!

Note:

Edited on October 8th, 2014 to add TAI definition.

Edited on November 1st, 2015 to fix misspellings and improve some wording.