At the Philly ETE conference Michael Tiemann presented some interesting facts about open source quality, and in particular mentioned that open source software has an average defect density that is 50-150 times lower than that of proprietary software. As it stands, this statement is somewhat inaccurate, and I would like to clarify the context and the real values:
- First of all, the average that Michael mentioned refers to a small number of projects, in particular the Linux kernel, the Apache web server (and later the entire LAMP stack), and a few additional “famous” projects. For all of these projects, the defect density really is substantially lower than that of comparable proprietary products. A very good article reporting such a study is Succi, Paulson, Eberlein, “An Empirical Study of Open-Source and Closed-Source Software Products”, IEEE Transactions on Software Engineering, vol. 30, no. 4, April 2004. It was not the only study on the subject, but all of them pointed to more or less the same results.
- Beyond the software engineering research community, companies working in the code defect identification industry also published results, such as Reasoning Inc. with “A Quantitative Analysis of TCP/IP Implementations in Commercial Software and in the Linux Kernel” and “How Open Source and Commercial Software Compare: Database Implementations in Commercial Software and in MySQL”. All of these results confirm the much higher quality (in terms of defects per line of code) found by the academic research.
- Additional research identified a common pattern: the initial quality of the source code is roughly the same for proprietary and open source software, but the defect density decreases much faster with open source. So it is not that OSS coders are, on average, coding wonders, but that the process itself creates more opportunities for defect resolution. As Succi et al. pointed out: “In terms of defects, our analysis finds that the changing rate or the functions modified as a percentage of the total functions is higher in open-source projects than in closed-source projects. This supports the hypothesis that defects may be found and fixed more quickly in open-source projects than in closed-source projects and may be an added benefit for using the open-source development model.” (emphasis mine).
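To make the metric concrete: defect density is usually expressed as defects per thousand lines of code (KLOC). The following is a minimal sketch of the pattern described in the last point, in which two code bases start from the same density but one resolves defects faster. The numbers are purely hypothetical, and the constant fix rate per period is a simplifying assumption of mine, not a model taken from any of the cited studies.

```python
from __future__ import annotations


def defect_density(defects: int, lines_of_code: int) -> float:
    """Return defects per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def density_over_time(initial_density: float, fix_rate: float, periods: int) -> list[float]:
    """Assumed toy model: in each period a fixed fraction (fix_rate)
    of the remaining defects is found and fixed."""
    if not 0.0 <= fix_rate <= 1.0:
        raise ValueError("fix_rate must be between 0 and 1")
    return [initial_density * (1.0 - fix_rate) ** t for t in range(periods + 1)]


# Hypothetical starting point shared by both projects.
start = defect_density(defects=50, lines_of_code=100_000)

# Hypothetical fix rates: the second project resolves defects faster.
slow = density_over_time(start, fix_rate=0.10, periods=10)
fast = density_over_time(start, fix_rate=0.40, periods=10)

for t in (0, 5, 10):
    print(f"period {t:2d}: slow={slow[t]:.4f}  fast={fast[t]:.4f}  "
          f"ratio={slow[t] / fast[t]:.1f}x")
```

Even in this crude model the ratio between the two densities keeps growing with every period, which is the qualitative behaviour the studies describe: same initial quality, increasingly different quality later on.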
I have a personal opinion on why this happens, and it is really related to two different phenomena. The first is code reuse: the general modularity and extensive reuse of components helps developers, because instead of recoding something (and introducing new bugs), reusing an already debugged component reduces the overall defect density. This effect was also found by other research groups focusing on reuse; for example, in a work by Mohagheghi, Conradi, Killi and Schwarz called “An Empirical Study of Software Reuse vs. Defect-Density and Stability” (available here) we can find that reuse introduces a similar degree of improvement in the bug density and the number of trouble reports for the code.
As can be observed from the graph, code originating from reuse has significantly higher quality than traditional code, and the gap between the two grows with size (as expected from basic probabilistic models of defect generation and discovery).
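The remark about the gap growing with size can be illustrated with the simplest possible probabilistic model: assume every line has a fixed, independent probability of containing an undiscovered defect, and that this probability is lower for reused code because it has already been debugged elsewhere. The two probabilities below are hypothetical values chosen only for illustration, not figures from the paper.

```python
# Hypothetical per-line probabilities of a residual defect.
P_NEW = 0.002      # assumed value for newly written code
P_REUSED = 0.0005  # assumed value for reused, already debugged code


def expected_defects(lines_of_code: int, defect_prob_per_line: float) -> float:
    """Expected number of defects under an independent per-line model."""
    return lines_of_code * defect_prob_per_line


def defect_gap(lines_of_code: int) -> float:
    """Expected extra defects in new code compared with reused code of the same size."""
    return (expected_defects(lines_of_code, P_NEW)
            - expected_defects(lines_of_code, P_REUSED))


for size in (10_000, 50_000, 100_000):
    print(f"{size:>7} lines: expected gap = {defect_gap(size):.1f} defects")
```

In this model the expected gap in defect counts is proportional to code size, so it widens as components grow; real data is of course noisier than a linear model suggests.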
The second aspect is that public bug data allows for “prioritization” and better coordination among developers when triaging and, in general, fixing things. This explains why the faster improvement appears not only in reused code, but in newly written code as well; the sum of the two effects explains the incredible difference in quality (50-150 times), higher than that achieved by any previous effort such as formal methods, automated code generation and so on. And this quality differential can only grow with time, leading to a long-term push for proprietary vendors to include more and more open source code inside their own products to reduce the growing effort of bug isolation and fixing.