Take the Red Pill and Push Errors to the Left
#devops #agile #productivity
DevOps Concepts
How many times has the software development team said an enhancement was done, only for Quality Assurance (QA) testing to find many errors? How many times did the new code break existing functionality? This can be a frustrating cycle that will burn out engineers and cause them to leave. There is a solution that will create a better working environment, preserve your best engineers, and enhance performance in the long run, but it takes leadership commitment to temporarily slow down software delivery (or hire additional help) to build it. If your deadlines are so urgent that you feel you need to keep throwing bodies at the problem, then take the blue pill and stop reading. If you want to open your eyes to a whole new world of possibility, take the red pill and see where it takes you.
“This is your last chance. After this, there is no turning back. You take the blue pill — the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill — you stay in Wonderland, and I show you how deep the rabbit hole goes. Remember: all I’m offering is the truth. Nothing more.” Morpheus
Much of the following is adapted from DevOps Foundations: Core Concepts and Fundamentals on Pluralsight.
Push Errors to the Left with a Poka-Yoke
Poka what? What kind of rabbit hole is this? Do not fear, my friend … remember you took the red pill. Poka-yoke is a Japanese term used in Lean Management that refers to anything that helps with error avoidance, also known as mistake-proofing. Have you ever had a problem with a tool, and someone told you, “Yeah, everyone makes that mistake. You just need to do it this way.” All of those people took the blue pill: they live in a world where you need to ask about “the right way” to use a tool, and that is just the way it is. We call it “user error”. Now you may think, “What’s the big deal? I learned how to do it, and now it works every time.” Ah, but what if you and the many other users never had to waste time learning the right way? Some examples of tools that prevent errors are a multiple-choice drop-down menu, which only lets you pick a valid option, and a GFCI electrical outlet, which breaks the circuit when it detects a ground fault.
The Design of Everyday Things, by Donald A. Norman, has many examples of tool designs that intuitively lead you to operate them correctly, and tool designs that lead you to make the wrong choice. The book cover has a fun bad design example.
For example, the design of a door should indicate how it works without any need for a sign that says “Push” or “Pull”. You may laugh at the following Far Side cartoon, but how many times have you pulled on a door handle only to realize you were supposed to push it? The architect should have put a door plate on the “push” side, and then you would have made the correct decision the first time. In many such cases the design is at fault, not the users.
The Far Side, by Gary Larson
When we stay in the Matrix we have someone submit a form and then tell them what errors they made on the form. When we live in the Real World we have an error check during data input. Some examples of error avoidance in software development:
- Least Privilege: Users should have exactly the permissions they need and no more. This prevents both malicious behavior and honest errors: what you cannot do, you cannot do mistakenly or maliciously.
- Version Control: Prevent code from being merged into the main branch without an approved pull request.
- Quality and Testing: Write tests that must pass before new code can be merged. Once a test exists, it keeps guarding against that defect in every future change.
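The form example above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the `Environment` enum, `parse_environment`, and `deploy` are hypothetical names invented for this sketch. The enum plays the role of the drop-down menu, and the check happens at data input, so downstream code never sees an invalid value.

```python
from enum import Enum


class Environment(Enum):
    """Hypothetical list of valid targets; it acts like a drop-down menu."""
    DEV = "dev"
    STAGING = "staging"
    PROD = "prod"


def parse_environment(raw: str) -> Environment:
    """Check the value at input time and name the valid choices on failure."""
    try:
        return Environment(raw.strip().lower())
    except ValueError:
        choices = ", ".join(e.value for e in Environment)
        raise ValueError(
            f"Unknown environment {raw!r}; choose one of: {choices}"
        ) from None


def deploy(target: Environment) -> str:
    """Downstream code only ever receives a valid Environment."""
    return f"deploying to {target.value}"
```

With this shape, a typo such as `"produciton"` is rejected the moment it is entered, with a message listing the valid options, rather than failing somewhere deep inside the deployment.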
Avoid Errors by Eliminating Waste
Remove non-value added steps to avoid errors and perform better. Attack waste in the software development life cycle (SDLC):
- Partially Done Work. “It’s 90% done” actually means half of the work remains. Engineers are only accounting for the logic to create the code, but the work is only complete after testing. This is caused by siloed goals: every department has its own measure of success. The software engineering (SWE) team's goal is to “finish” code by the deadline, and the validation team's goal is to quickly implement tests and push issues back to the SWE team. Instead, there should be a shared team goal to deploy code more quickly with good performance and a lower change failure rate.
“If you aim at speed, you may get speed, but you’ll get waste. If you aim at the elimination of waste, you’ll eliminate waste and get speed.” Chris Behrens
- Extra Features. A feature produced at the wrong time, or built just in case and possibly never used, is like a part sitting on a shelf costing time and money. The SWE created it now because it would take more effort later. Instead, make it possible to add later: treat it as a deferred commitment and make the product more agile.
“The number one mistake of a star engineer is optimizing a thing that shouldn’t exist.” Elon Musk
- Relearning. Reacquiring knowledge you already had. You learned it before but need a refresher. You created it before, but you need to remember how the code works. Create and store knowledge that is easy to understand and access. Make the code easy to comprehend even without comments, but still add comments where appropriate.
- Handoffs. Knowledge transfer to a new engineer. Some engineer turnover is inevitable, so cutting corners on documentation will create waste later. Be deliberate about cross-training, rotate responsibilities among the team.
- Task Switching. People are not good multitaskers. If two tasks take one hour, and you switch only once, in the middle of the first task, then the first task will take 40% longer. That is 40% waste. Here are some great books on work productivity. Look for future articles from me on this topic.
Deep Work: Rules for Focused Success in a Distracted World by Cal Newport
Great at Work: The Hidden Habits of Top Performers by Morten T. Hansen
- Delays. This is a waste that cascades from the wastes above. The project is delayed by authority concentrated at the wrong level, siloed communication, and commitments that are not deferred (i.e., “Extra Features”).
- Defects. This is another waste that cascades from the wastes above. To fix a defect, the developer must switch away from new work, relearn the old work, and then switch back to the new task. It also assumes the developer is available at any time to fix defects. Defects are a sign you are not managing the other wastes.
Reduce Time and Steps with Automated Testing
Sometimes we think it will waste time to thoroughly test our code.
“Perfect is the enemy of good enough.” Engineering Adage
And yet we know there is a risk that insufficient software testing may result in a bug that will cause rework. But we got away with it before, and we hope we can get away with it again. Then we can deliver more instead of confirming what we already “know”: the code works properly. But here is where we begin to digest the red pill: we can have sufficient testing without any additional work. How is this possible? Follow me further down the rabbit hole.
Build a system where the code review catches the bugs. No code will be error-free. There is a debate about whether to write the code first or the tests first. Writing the test first can inform how to write the code, but it can also waste time: during coding you may change your approach, so the tests no longer apply and you need to rewrite them. Writing the code first can waste time too: when you write the tests later, you may discover corner cases you did not consider that require significant changes to the code. Your choice depends on the situation. Either way, you may have a failure of imagination: if you do not think of a scenario, you cannot build the associated test case. You cannot anticipate all bugs, but you can learn from the past.
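To picture the test-first side of that debate, here is a small sketch. The function `parse_version` and its rules are hypothetical, chosen only for illustration. The tests are written first and pin down the corner cases; the code is then written to satisfy them. The tests use plain `assert` statements in functions named `test_*`, a convention that runners such as pytest collect automatically.

```python
# Step 1: write the tests first. They record the corner cases up front.
def test_plain_version():
    assert parse_version("1.2.3") == (1, 2, 3)


def test_leading_v_and_whitespace():
    assert parse_version(" v1.2.3 ") == (1, 2, 3)


def test_rejects_missing_part():
    _expect_value_error("1.2")


def test_rejects_non_numeric_part():
    _expect_value_error("1.x.3")


def _expect_value_error(text):
    """Helper: fail the test unless parse_version rejects the input."""
    try:
        parse_version(text)
    except ValueError:
        return
    raise AssertionError(f"expected ValueError for {text!r}")


# Step 2: write the code that makes those tests pass.
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into a tuple."""
    cleaned = text.strip()
    if cleaned.startswith("v"):
        cleaned = cleaned[1:]
    parts = cleaned.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    return tuple(int(p) for p in parts)
```

Once these tests exist, they become regression tests: any later change that breaks a corner case fails before it reaches QA.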
You may eliminate steps in the process because of automation. For example, you may remove some approval stages if there is sufficient confidence in the automated testing or if it is always approved anyway.
“The best part is no part. The best process is no process. It weighs nothing. Costs nothing. Can’t go wrong.” Elon Musk
Some leaders may complain that the software release cycles already take too long and adding automatic regression testing would add development time. Yes, it would take more time at first, but automating regression testing would speed up testing later. The SWE manager may be focused on optimizing software development time. But the mistake is assuming that optimizing the parts will optimize the whole. Google’s best practice is for Site Reliability Engineers (SREs) to cap manual operational work at 50% of their time, leaving at least 50% for engineering work such as automation.
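The “more time at first, less time later” trade-off can be put into rough numbers. The sketch below is only a back-of-the-envelope model with a hypothetical function name and made-up inputs; it ignores test maintenance and the cost of escaped bugs, both of which matter in practice.

```python
import math


def break_even_releases(automation_hours, manual_hours_per_release,
                        automated_hours_per_release):
    """Number of releases after which the automation has paid for itself.

    All inputs are estimates supplied by the team; this is a simple linear model.
    """
    saved_per_release = manual_hours_per_release - automated_hours_per_release
    if saved_per_release <= 0:
        raise ValueError("automation must save time per release to pay back")
    return math.ceil(automation_hours / saved_per_release)


# Illustrative numbers only: 80 hours to build the suite, manual regression
# testing costs 10 hours per release, the automated run costs 2 hours.
releases_needed = break_even_releases(80, 10, 2)
```

With those illustrative numbers the automation saves 8 hours per release, so the 80-hour investment is recovered after 10 releases, and every release after that is faster than before.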
Take the Red Pill and Live in the Real World
Sometimes we fail to weigh the impact on the customer or tester when they catch a bug. It damages the relationship, reduces trust, shifts the burden of work to others, and passes the buck. Nobody likes to be dumped on. Quality issues are deprioritized to ship on time. We incorrectly think the best approach is to deliver quickly and “hope for the best”. But when we push bugs to the left, the overall cost of the bugs is lower, the performance is better, and your best engineers will stay. Take the red pill.
What are examples where you found your group making good steps toward pushing errors to the left? What waste do you see in your SDLC? What resistance did you face when trying to automate?