Are you stuck in a rabbit hole where too many bugs are found in production?

This seems to be an increasingly common problem in agile at scale. Apart from the bugs being numerous, we also need to ask ourselves why they are found so late. We will go through the reasons why this is happening, as well as the cure.

To understand why this happens, it is very important that we deeply understand the difference between verification, validation and inspection. We will later show that it is very easy to get into deep trouble if this difference is not understood, but let us start by examining a quote from one of our most famous quality gurus.

Inspection does not improve the quality, nor guarantee quality. Inspection is too late. The quality, good or bad, is already in the product. Quality cannot be inspected into a product or service; it must be built into it.

—W. Edwards Deming

It is interesting that the quote comes from manufacturing, yet agile scaling frameworks often refer to it, even though software product development never touches manufacturing at all. Still, the quote is very useful for helping us fully distinguish between verification, validation and inspection. We must of course also understand in which phases inspection (which is the same as production test, of parts or of the whole product) is performed and quality is actually created, so let us start our analysis.

In manufacturing, where Deming was a quality guru, the most important thing is of course that the components used meet the drawings, tolerances, etc. made by product development in the earlier phase. These drawings, tolerances, etc. in turn originate from a proper and complete systems design. This gives the sentence in his quote, “Quality cannot be inspected into a product or service; it must be built into it.”, since it means that the quality level can be assured at manufacturing, and it also gives the phrase “The quality, good or bad, is already in the product”. If manufacturing performs a production test and other manual inspections of the product after the parts have been manufactured, we will at least be sure that we do not send a non-working product to the customer, i.e., this “copy” of the specifications and drawings from product development has passed the inspection. This means that manufacturing can never affect the product’s quality as long as the parts follow the specifications and drawings; the quality is already built in by product development. Product development has already taken care of both the validation (we have developed the right product) and the verification (we have developed the product right). But of course, if a component for example has unexpected ageing problems, or if components do not follow the specifications, the production test (inspection) at manufacturing will not be able to find that fault, which will sometimes happen. The production itself can of course also be malfunctioning, which would likewise lead to bad quality, even if the specifications and drawings of the product are top notch.
This means that there is always a risk that a hardware product does not meet customer expectations in the long run, which is another type of quality issue. Even though this will most probably only happen for a few of the products, it should be expected to happen.

Let us look at an example from our daily life, one that is common whenever we buy a product. Validation, verification and inspection really are context-dependent activities, and must therefore be considered in context when we develop and manufacture our products. Let us consider buying a car.

When we buy a car, we can only look at it and make some small tests to see that it fulfils our needs, such as colour, size, number of passengers, comfort when test driving, functionality within the driver’s compartment, etc. As customers, this is our validation of the car, the whole system (product): whether it fulfils our needs or not. Of course, the car manufacturer’s validation of the car has already been done, since they must have a very good idea that enough buyers will accept the car as it is, or in the possible variants that can be manufactured. As mentioned above, the car manufacturer has also done a production test, an inspection of the car, to be sure that all the components the car is built of, as well as the aggregated whole car, are working as intended and fulfil the specifications and drawings. The instruction book of the car (if it is correct) is also a kind of indirect validation of the car that a customer can do, and can later refer to when something is wrong, since we (normally) never validate all the details of the instruction book.

But since most of us do not have deep car expertise, we have no idea whether the car of our choice also has good quality, and not even experts can make that evaluation unless they take the car apart. This is what Deming starts his quote with, “Inspection does not improve the quality, nor guarantee quality.”, which means that no matter how much we inspect the car, we cannot see the quality inside the parts, or even of the outside of the whole, like the lacquer quality. The reason is that product development has already set the quality level of the car during development, where the system verification (part of the system testing) of the product to release has finally also shown that the parts work together as intended, as an integrated and unified whole, with good quality. And as long as the manufacturing of the car also works as intended, and the components and their integrations up to the whole car meet all the requirements and drawings set by product development, the quality level of the car is already in the car. So it is no coincidence that we have guarantees, such as a 3-year guarantee on the motor and a 7-year guarantee on the car body, since an ordinary customer inspection can never judge these kinds of quality levels.

This means that quality is already built into the car, and that this quality has been secured by product development when developing the car. Most probably, product development uses prototypes, where the verification and validation of each prototype give us the knowledge needed to make the next prototype. This means that product development is iterative as well, until we have our intended product with the right quality level, since we can be sure that we will not get the specifications right the first time; it is simply too complex. We actually talk about the need to reduce transdisciplinary complexity, and to do so iteratively; see this article about why we need to do it, and this article about how we do it.

So, apart from the case where a hardware component is out of specification at manufacturing, it is hardware product development that sets the quality level and ensures that we have developed the right product right. The production test and other inspections performed manually at manufacturing only secure that a faulty product is not sold to the customer, since components or integrated components can never all be fully within the tolerances. This means that the quality level of the product has already been confirmed by a passed (and full) system test during the product development phase. The production test and manual inspections only concern hardware products, since a new physical product from hardware manufacturing means making another copy, another copy of the drawings of the product.

Let us now go back to software product development. The software code that is set to operate in the production environment is not a copy; it is the original code that the customer uses live, whether distributed locally to the customer or provided as a distributed service. Since built-in quality apparently is not achieved in the manufacturing phase for hardware products, and since manufacturing is not even a phase in software development, inspection can of course not be performed in software development. This in turn means that Deming’s quote above about built-in quality really makes no sense at all in software development, since for all product development, hardware or software, we only have verification and validation phases, never an inspection phase. Instead, when the quote is referenced in software development, especially in agile using scaling frameworks, it feels awkward, since product development has always been, and will always be, about integrating parts. Product development has never been about aggregating parts (assembly), which is completely reserved for the manufacturing phase. Integration, on the other hand, really means synthesis, where its antipode, the analysis, has been done first and has led to a well-thought-out systemic hypothesis that the parts will integrate into a united and unified whole. This also means that we should be very suspicious of Continuous Integration without any systems design first, since that instead implies Continuous Aggregation, which is only valid for manufacturing. Please see this article for more information about aggregation vs integration, and this article for a deep dive into analysis and synthesis. So we really need to dig deeper into agile at scale’s interest in Deming’s quote.

Let us leave inspection now, and dig into verification and validation. How about verification and validation in the nowadays so common agile software development, and especially at scale, since the built-in quality mantra seems to have such a dominant role?

In agile software development, with one agile team developing one product, it is rather easy to separate verification from validation: the former is for the agile team to secure and the latter for the Product Owner. The Product Owner relies on the agile team to develop, integrate and verify the product with good quality. Since only software is developed, with no manufacturing, there will be no inspection, as stated above.

But at scale, when many teams are developing bigger products, and especially when using agile scaling frameworks and their incremental bottom-up transformation strategy, ending up in value streams and not the full system, we need to be vigilant. The reason is that integration and verification of the whole have vanished and turned into verification of the parts only, together with an impossibility: validation of parts. This is wrong per se, since validation can only be done on whole systems. The reason is that the systems design of the whole is always a hypothesis about how the interdependent parts will integrate into a united, unified and well-functioning whole. This in turn means that we need to verify this hypothesis first, to understand that we have a correct solution, before we can talk about validation of our product. Our parts can therefore never be validated until we understand that we have developed the product right, i.e., after system verification. And now we have finally come down to the meat, since this is the difference between manufacturing, where we can aggregate the parts to a whole, and product development, where we are integrating the parts to a whole. In product development, we can therefore never talk about built-in quality if the parts do not integrate into a united, unified and well-functioning whole system. This is also the reason why we need to be very careful with the mantra of continuous improvements, since that thinking also comes from manufacturing, which always, in one way or another, concerns already correctly validated parts (or the process making the part). Of course, on system level it is also impossible to replace verification with validation alone, no matter if the validation is done by Product Management (at scale) or by putting the software in production, where the latter is a very bad idea.

How come there is an increasing belief that we can validate parts?

The reason is that agile software development tries to avoid developing too big parts, since iterative prototypes are normally omitted in software development (although they are mandatory in hardware development). Developing too big parts without prototypes puts the delivery at high risk, and also means that the customers do not get any incremental deliveries (where these are possible). When a few agile teams together develop a system, validation (and normally also verification) of their total work package is done. But when the agile way of working is scaled, this thinking, that a few agile teams develop a work package together, is scaled as it is. This is apparently done without consideration, since the work package is now only a part of the whole system, no matter if we have feature teams or component teams, or any other split of the parts of the solution of the whole. This means that our agile teams developing the work package can no longer validate their work; they can only verify it, no matter if there is a preceding systems design or not. In the end, it is about the fact that the complexity that needs to be reduced when making big systems is so much higher than for small systems. In big systems (especially novel ones), the complexity level is extremely high, with tonnes of interdependencies between the parts of the system that need to be understood and mastered through early iterations on the whole. Not to mention all the non-functional requirements, which affect each part of the whole system in a different way. This means that systems design shall (and can) never be omitted for a system, no matter if the system is a product or a service.
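The point about non-functional requirements can be made concrete with a minimal sketch. It assumes three hypothetical parts (the names, latencies and budgets are invented for illustration only) that are called in sequence, where each team has verified its part against its own latency budget, while the requirement on the whole is an end-to-end budget that nobody has verified.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Part:
    name: str
    latency_ms: float      # measured latency of this part (hypothetical figure)
    part_budget_ms: float  # budget the owning team verified against (assumption)


def part_verified(part: Part) -> bool:
    """Verification of one part against its own, local budget."""
    return part.latency_ms <= part.part_budget_ms


def system_verified(parts: Sequence[Part], end_to_end_budget_ms: float) -> bool:
    """Verification of the whole: the parts run in sequence, so latencies add up."""
    return sum(p.latency_ms for p in parts) <= end_to_end_budget_ms


# Hypothetical parts; every one of them is within its own budget.
PARTS = [
    Part("gateway", 40.0, 50.0),
    Part("pricing", 45.0, 50.0),
    Part("inventory", 48.0, 50.0),
]
END_TO_END_BUDGET_MS = 100.0  # requirement on the whole system (assumption)

all_parts_ok = all(part_verified(p) for p in PARTS)
whole_ok = system_verified(PARTS, END_TO_END_BUDGET_MS)

if __name__ == "__main__":
    print("every part verified:", all_parts_ok)
    print("whole system verified:", whole_ok)
```

In this sketch every part passes its own verification, yet the whole fails, because the requirement on the whole was never broken down by a systems design into budgets that add up.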

But the reason above is not enough. When the belief is that we can divide the whole into parts without any well-thought-out hypothesis of the synthesis to the whole, and instead just deliver them “validated” part by part, what follows is that the way of working is divided into parts as well. The belief is thereby extended to say that we can also have an iterative way of working, with short loops, delivering increments on the whole, regardless of the complexity of the whole. But since the whole for big systems can take years to deliver in full, it will take a long time to confirm that this extended belief, that the way of working can be divided into parts, is erroneous. This is also the reason why the pilots when transforming to agile at scale are made with only simple and short-looped tasks, like maintenance or non-complicated new functionality in an existing systems-designed architecture. This gives the false impression that the way of working has been verified and the organizational purpose fulfilled, and that it can therefore also work for completely new initiatives of any size in software development.

But the fact is that no way of working can be truly verified until the product has been successfully released, i.e., until we have delivered the right product right, from idea to full release. This means that while we use our way of working throughout the full life cycle of the product, we are actually verifying it; we are verifying the solution of our way of working. We can also reflect on the fact that the overarching purpose of all organizations is in fact the same, no matter the context, and therefore no matter the domain, not only for product development. This means that in any organization we really do not have any uncertainty about what the way of working shall achieve (the specification of requirements for the overarching purpose is always correct and the same), and therefore we frankly can never consider any validation of our way of working. So when we are establishing a new way of working, the (only) hard part is to reduce the complexity (get the way of working right), not the uncertainty (get the right way of working), since the specification of what the way of working shall achieve is firm. If we did not deliver the right product right, it was our solution of the way of working that was incorrect, not the overarching purpose of the organization. As long as we do not want to change this (firm) specification, we can, in the cases where our way of working is malfunctioning, only say that the implementation of our way of working failed the verification, never the validation. There is really no difference regarding verification when we compare a product with a way of working: if the product does not fulfil a correct specification of requirements on the whole, we need to fix the solution, not the specification of the requirements for the whole.
But it is easy to make the mistake of thinking that the way of working is not fulfilling the requirements when it is not fulfilling our needs, while it is actually the organizational principles that it is not fulfilling. This instead means that our way of working failed the verification, so we need to (dis)solve the root causes to be able to fulfil the specification.

The fact is that the way of working is continuously verified during the work on our product, and as soon as we get problems, we will get symptoms showing us that something is wrong, i.e., that we are not fulfilling all the organizational principles for our context. This will happen in any malfunctioning way of working in any domain, meaning that it is out of context, i.e., the way of working is operating in a context in which it is not operable, since it is not fulfilling the organizational principles. The root causes from which the symptoms originate are always non-fulfilled organizational principles, meaning that we need another solution for our way of working, one that fulfils all organizational principles and thereby dissolves the root causes. Another way to view it is that the symptoms tell us that our hypothesis of our way of working is wrong; our thinking, or the specification for the way of working, is not correctly fulfilled. But without continuous and properly made test cases on the bigger sub-systems or the whole (meaning that there are only continuous integrations of tested parts), with no long-term planning, no planning towards a release date, and without being observant of the emerging symptoms, it will of course be very hard to understand that the way of working is inferior. This blindfolded behaviour in agile at scale will lead to Gig-bang integration of small, independently solved parts (gigs). By comparison, waterfall with a systems-designed architecture, but with the prototypes omitted, will lead to Big-bang integration; see this blog post for more information about these different Xig-bang integrations.

But the reasons above are still not enough to explain the still increasing belief that we can validate the parts. There must, as always, also be convincing material claiming that the new is better than the old, preferably by backbiting the old way of working, i.e., backbiting common sense as well as long-established and in many cases well-known (but unfortunately old?) science.

Another way to put it, starting even earlier in this chain of dividing everything into small pieces, gives this summary: the belief is that the requirements can be divided into parts, which means that the implementation and testing can be done on parts, which leads to the belief that we can validate parts, which leads to the belief that we can build quality into the parts, which leads to the belief that we can optimize the parts, which leads to the belief that the way of working can also be divided into similar parts, and finally, by only making trivial functionality, to the belief that the way of working divided into similar parts can also be verified. The bad news, as we have shown, is that this thinking instead only leads to the bad quality that we tried to avoid in the first place. This is the normal thing to happen, i.e., sub-optimization makes the symptoms grow in number as well as get worse. These quality issues will also be found very late, when the code is in production, since neither “validation” nor “built-in quality” in a part can be aggregated to a whole. This is also the normal case during sub-optimization: the new and much worse symptoms will be found much later, since when we are blindfolded, we omit important things that we must take action on. In this case the sub-optimization is clear: the new parts that we try to integrate are always interdependent, and are therefore part of a hypothesis that we can only verify on the whole to understand its correctness.

As we can now see when we put everything together, the start of all this really is the belief in scaling frameworks that dividing the functionality into small parts is always the main key to success. And the reason for this is the scaling of agile thinking without consideration. This means that the key is not whether we have done the systems design properly at the start; instead, the key is that aggregating “validated” parts with “built-in quality” is impossible, and only leads to an enormous number of severe bugs being found very late, as late as in production at the time of the release. So, by hammering the mantra “built-in quality” into our minds, our common sense is weakened, and we instead believe that “validation” of parts is possible, which is the main root of all the problems.

The belief that small is beautiful propagates level by level down in the system, and often means that the “validation” of new incremental functionality, which should have been shown by all teams together during the “system” demos, is instead shown by the teams one by one. This instead means “sub-validation”, i.e., each team shows its own unit-tested features/user stories (functionality) without connection to the work of the other teams. And too many times, there are many sub-subsystems that in the end will integrate into a real whole system, which really means that yet another level of verification is missing. From this, the following deductions can be drawn:

The new functionality per team will be put in production behind a feature toggle, since the functionality is not usable yet; all the functionality for the total system, or at least enough of it, has not yet been implemented.
Even if every team has integrated its solution into the Main branch, adding the tests belonging to the feature, the requirements of the whole are not verified, i.e., no system verification of the functional and non-functional requirements on the whole system has been done. This means that Continuous Integration, which is a good approach since it avoids adding too big code chunks, can when misused easily lead to Continuous Aggregation, if the wholeness is not verified when the part is added.
The system test environment, including all the system test cases, is inferior or non-existent (which depends on the wrong thinking about built-in quality mentioned above, which in turn leads to the belief that it is enough to run tests within the Continuous Delivery pipeline).
System tests will finally be added, but reactively, because all the bugs are found as late as in the production phase.
Small is beautiful leads to the belief that only agile teams are needed, meaning that the knowledge of the former production staff is no longer considered necessary; they are frankly excluded. That also excludes their expert knowledge about production, like firewalls, configuration files, IT security, etc., leading to many unnecessary bugs just because of this exclusion, bugs that are not actually connected to the code.
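The difference between Continuous Integration and Continuous Aggregation in the deductions above can be illustrated with a minimal sketch. It assumes two hypothetical teams (all function names and figures are invented for illustration): one delivers an order total in cents, the other expects a net amount in whole euros. Each part passes its own tests on the Main branch, but a test on the whole would reveal that the parts do not integrate.

```python
def order_total_cents(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical part from team A: returns the order total in cents."""
    return unit_price_cents * quantity


def invoice_total_eur(net_amount_eur: int, vat_percent: int = 25) -> int:
    """Hypothetical part from team B: expects whole euros, returns euros incl. VAT."""
    return net_amount_eur + net_amount_eur * vat_percent // 100


def part_tests_pass() -> bool:
    """Each team's own tests: every part is correct against its own assumptions."""
    return order_total_cents(2000, 2) == 4000 and invoice_total_eur(40) == 50


def system_test_passes() -> bool:
    """Assumed requirement on the whole: two items at 20.00 EUR with 25% VAT
    shall be invoiced as 50 EUR."""
    invoiced = invoice_total_eur(order_total_cents(2000, 2))
    return invoiced == 50


if __name__ == "__main__":
    print("part tests pass:", part_tests_pass())
    print("system test passes:", system_test_passes())
```

With only the part tests in the pipeline, both parts are merged as “verified” and the mismatch is left to be found in production. It is the system test, derived from a requirement on the whole, that exposes the wrong hypothesis about how the parts fit together.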

This means that we will not only find severe and mainly irreparable quality issues in production, which is bad, but we will also find the quality issues extremely late. And the later we find problems in our product, the higher the risk that the bugs are severe and that the systems design of our product is inferior. This will in turn lead to a prematurely dead product, due to the spaghetti programming needed to fix these late bugs; adding extra cables in hardware to fix problems at the customer is an apt comparison.

This also means that agile as it is can simply never be scaled, since the difference between small-scale agile and agile at scale is simply huge. When we are developing big systems, we also need to consider the wholeness, which requires add-ons like systems design, systems verification and systems validation when scaling agile. By doing this, we will give our many teams of teams the right prerequisites to be able to achieve built-in quality, both of the whole and of its parts.

Another piece of bad news is that dividing the functionality into small parts most probably also means that a proper systems design has never been done, which will then be the number one root cause of the problems we will find later (maybe years later), in this case severe bugs in production. And yet another piece of bad news is that symptoms, as you already know, are impossible to solve, which means that no Continuous Improvement at all (in production or earlier) can help us fix an inferior systems design, since any attempt to solve symptoms will lead to sub-optimisation of the whole, which in turn will only lead to spaghetti programming.

As you have seen in this blog post about verification, validation and inspection, and in the blog post about aggregation and (false) integration, it is now common to take a bad product development path. Leaving the bugs to be found as late as in production is a clear sign that our way of working is substandard, probably with a high risk of giving non-working products. This is also the reason for the development of the TDSD method, which solves all these problems in today’s agile scaling frameworks; see this article for further information about TDSD.

A remaining question to bring up in a later blog post is if there is a common denominator between the blog post about aggregation and this blog post, since it really seems like that, doesn’t it?
