“80% of the Code Your Team is Writing is Duplicate Code”1
Amount of Duplication
“Eighty percent of the code your team is writing is duplicate code.”1 I first ran across this statement in a book called Software Factories1. When I first read that line many years ago, as a developer, I thought to myself, “Me? Eighty percent duplicate code? No way! I write deep and profound code all day” :). The authors then go on to explain that the duplication may not be obvious, because the duplicate code and the original new code are interleaved with each other. Good point! One could go even further and say that the two can be so finely interleaved that developers believe they are creating a whole and entirely original piece of code.
Now, given this 80/20 ratio of duplicate to original code, one might think there is a copy/paste party going on throughout the development team. Perhaps their keyboards have a button like this:
Or perhaps their mouse looks like this:
Now, this can certainly be the case. We have all done it at some point, even if only by copying and pasting a working part of our own code and changing the bits necessary to make it work in the new context. But in most cases the duplication is much more subtle, both in how it occurs and in how it can be detected.
In many cases, the developers honestly think they are creating original code but really are not.
The Duplication Situation in Pictures
Let’s revisit the concept of the interleaving of duplicate code with original code as mentioned above. Graphically, the 80/20 split could look like this:
In a) above, the block shows the non-interleaved case where the 80% duplicate code (blue) is obvious and well separated from the new creative and original code (yellow).
In the interleaved case, the situation looks more like b). Taking it further, the duplicate code can be so finely interleaved or otherwise mixed up with the new creative code that it looks like c).
The result of this situation is that someone thinks they are creating one entirely new piece of original code when in fact they are not.
So, somehow we need to be able to put on our duplicate code detection glasses:
Once the duplication has been identified, we can do something effective about the large blue chunk of 80% duplicated code.
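As a rough illustration of what such "detection glasses" might look like, here is a minimal sketch in Python. It is not a method taken from the books cited here; it assumes that replacing identifiers and literals with placeholders is enough to make renamed copies look alike, and the names normalize, find_duplicates and SAMPLE are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a small, incomplete keyword list is enough for this sketch.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}

def normalize(line):
    """Replace identifiers and literals with placeholders so renamed copies match."""
    line = re.sub(r"#.*", "", line).strip()      # drop comments (naive)
    line = re.sub(r'"[^"]*"', "STR", line)       # double-quoted strings
    line = re.sub(r"'[^']*'", "STR", line)       # single-quoted strings
    line = re.sub(r"\b\d+(\.\d+)?\b", "NUM", line)

    def placeholder(match):
        word = match.group(0)
        return word if word in KEYWORDS or word in ("STR", "NUM") else "ID"

    return re.sub(r"\b[A-Za-z_]\w*\b", placeholder, line)

def find_duplicates(source, window=3):
    """Map each repeated run of `window` normalized lines to its start line numbers."""
    numbered = [(n + 1, normalize(text)) for n, text in enumerate(source.splitlines())]
    numbered = [(n, text) for n, text in numbered if text]  # skip blank lines
    seen = defaultdict(list)
    for i in range(len(numbered) - window + 1):
        chunk = tuple(text for _, text in numbered[i:i + window])
        seen[chunk].append(numbered[i][0])
    return {chunk: starts for chunk, starts in seen.items() if len(starts) > 1}

# Two functions that look original but share the same shape.
SAMPLE = '''
def total_price(items):
    total = 0
    for item in items:
        total += item.price
    return total

def total_weight(parcels):
    acc = 0
    for parcel in parcels:
        acc += parcel.weight
    return acc
'''

if __name__ == "__main__":
    for chunk, starts in find_duplicates(SAMPLE).items():
        print(starts, chunk)
```

Both functions normalize to the same sequence of lines, so the sketch reports them as duplicates even though no identifier is shared. Real clone detectors work on tokens or syntax trees and are considerably more robust than this.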
Taking it a bit further, one could also make a similar observation with regard to domain-specific code and domain-independent code. That is, code that is directly related to your domain versus code that is there to glue things together or otherwise make the technologies work.
As with duplicate code, in many instances the amount of domain-independent code can vastly outweigh the code that actually has to do with your business. If we can properly isolate the domain-independent code from the domain-specific code, we can perhaps do something effective and efficient about that domain-independent code too.
If you really inspect the state of your code from the perspective of duplication you will be surprised, if not shocked, at the state of affairs.
What Does Duplication Mean?
Robert Martin asks an interesting question and provides a well-stated answer in his book “Agile Software Development”2. “What does duplicate code mean?” It means you are “missing an abstraction”2.
Abstractions can come in many forms, e.g. refactorings, generators, frameworks, tools, templates, base classes, etc.
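To make "missing an abstraction" concrete, here is a small hypothetical sketch in Python of the simplest form listed above, a refactoring. The function names and the dictionary-based records are assumptions made for illustration only.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Before: two functions that differ only in which value they add up.
def total_price(items):
    total = 0
    for item in items:
        total += item["price"]
    return total

def total_weight(items):
    total = 0
    for item in items:
        total += item["weight"]
    return total

# After: the shared loop is written once; the part that varies becomes a parameter.
def total_of(items: Iterable[T], measure: Callable[[T], float]) -> float:
    return sum(measure(item) for item in items)

if __name__ == "__main__":
    cart = [{"price": 2.5, "weight": 1}, {"price": 4.0, "weight": 3}]
    print(total_of(cart, lambda item: item["price"]))
    print(total_of(cart, lambda item: item["weight"]))
```

The duplicated loop was the symptom; the missing abstraction was "sum some measure over a collection". Once it has a name, the two original functions become one-line calls or disappear entirely.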
Ideally, I suppose, there would be no duplication. In the real world, sometimes there is not much you can do to remove the duplication. But there are things you can do about it. Code generation is one technique that deals with duplication without eliminating it. It is a technique we, as software developers, use every day, often without realizing it, whenever we use a compiler. Without such an approach, we just would not be able to scale our work up to tackle the complex problems that we have in front of us.
Again, the point is that we can (and should) do something effective about the duplication, whether we are developing new products or maintaining existing systems.
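The following Python sketch shows one way code generation can handle duplication without eliminating it. It is only an illustration under stated assumptions: generate_class, generate_source and SPEC are hypothetical names, and a real generator would normally write files instead of calling exec.

```python
from typing import Dict, Sequence

def generate_class(name: str, fields: Sequence[str]) -> str:
    """Emit the source of a small record class; the repetitive shape lives here, once."""
    args = ", ".join(fields)
    lines = [f"class {name}:", f"    def __init__(self, {args}):"]
    lines += [f"        self.{field} = {field}" for field in fields]
    pairs = ", ".join(f'"{field}": self.{field}' for field in fields)
    lines += ["", "    def as_dict(self):", f"        return {{{pairs}}}"]
    return "\n".join(lines) + "\n"

def generate_source(spec: Dict[str, Sequence[str]]) -> str:
    """Generate one class per entry in the spec."""
    return "\n".join(generate_class(name, fields) for name, fields in spec.items())

# Hypothetical spec: the only part a developer edits by hand.
SPEC = {
    "Customer": ["name", "email"],
    "Order": ["order_id", "customer", "total"],
}

if __name__ == "__main__":
    source = generate_source(SPEC)
    print(source)                      # the generated code still repeats itself
    namespace = {}
    exec(source, namespace)            # for demonstration only
    print(namespace["Customer"]("Ada", "ada@example.com").as_dict())
```

The generated output is just as repetitive as hand-written code would be, but nobody maintains it by hand. A change to the repeated shape is made once, in the generator, and the output is regenerated.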
Code Generation and Brick Roads in The Netherlands
It is interesting sometimes to see non-software examples of how people solve duplication.
Our company has offices in the United States and in Europe. Our European headquarters are located in Holland. There they have some beautiful brick streets. I came across one the other day and was amazed at how good it looked and how smooth it was, and that it had stayed that way for years despite the weather. I thought about the meticulous and careful work it must have taken to lay down each brick by hand. How long must that have taken? Here are photos of such a street:
I was talking about it with a Dutch friend and showed him the above photos. He just laughed and sent me the following photos in response:
I got a good laugh as well. This is a great non-software example of handling duplication without removing it. In this case, someone designed a domain-specific4 tool that works within the constraints of the brick street building domain. These machines are literally called “street printers”. From a practical standpoint, the machine automates much of the work but not all of it. The top of the machine has a place where a worker lays the bricks according to rules about color, orientation and quality, and throws away defective bricks. Could this be automated further? I suppose so, but ultimately there are tradeoffs. In this case, a large part of the duplicate work has been handled or automated.
Back to Software
Now, in this example, the workers are laying down a street where there was not one before. In the software world, this approach certainly works for new systems one is starting to develop. But what if one has rampant duplication throughout the codebase, much of which is very hard to detect? In this case, some amount of work will most probably have to be done to isolate where and what the duplication is. Once that is done, it can be targeted for generation. This approach is one of the key aspects of Model Driven Engineering (MDE) and Domain-Specific Languages (DSLs).
The main point here is that duplication means a missing abstraction. One has to establish what the duplication is and then reapproach the current software solution with this abstraction in place. Many times this is the starting point for a domain-specific language3, which in most cases involves code generation as a means to handle duplication. Taking such an approach speeds up development and increases both productivity and quality. This leads to faster time-to-market and reduced costs.
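As a hedged sketch of how a domain-specific language and code generation can fit together, the Python example below parses a tiny textual model and generates code from it. The "entity Name: field, field" notation, and the names parse_model and generate_validators, are invented for this illustration and do not come from any particular MDE tool.

```python
import re

# Hypothetical mini-DSL: one entity per line, "entity <Name>: <field>, <field>, ...".
MODEL_TEXT = """
entity Customer: name, email
entity Order: order_id, total
"""

LINE = re.compile(r"^entity\s+(\w+)\s*:\s*(.+)$")

def parse_model(text):
    """Turn the DSL text into a plain dictionary: entity name -> list of fields."""
    model = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, fields = match.groups()
        model[name] = [field.strip() for field in fields.split(",")]
    return model

def generate_validators(model):
    """Emit one validate_<entity> function per entity; each returns the missing fields."""
    out = []
    for name, fields in model.items():
        out.append(f"def validate_{name.lower()}(record):")
        out.append(f"    return [f for f in {fields!r} if f not in record]")
        out.append("")
    return "\n".join(out)

if __name__ == "__main__":
    model = parse_model(MODEL_TEXT)
    source = generate_validators(model)
    print(source)
    namespace = {}
    exec(source, namespace)            # for demonstration only
    print(namespace["validate_order"]({"order_id": 7}))
```

The model states only what is specific to the domain (which entities exist and which fields they have), while the generator carries the repeated, domain-independent shape. Adding a new entity means adding one line to the model.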
MDE Systems Inc. specializes in Model Driven Engineering (MDE) and Domain Specific Language (DSL) software technologies. Their staff is expert in detecting, isolating and handling duplication in existing software code bases or preventing it from happening in the first place in new systems. Their staff has successfully applied these approaches to many complex and varying domains. MDE Systems has offices in the United States and Europe.
1 Software Factories, Greenfield, Short, Cook, Kent, ISBN 0471202843, John Wiley & Sons, September 2004
2 Agile Software Development, Martin, ISBN 0135974445, Pearson; 1st edition, October 25, 2002