These are my brushed-up notes for a presentation I gave during one of First Opinion's all-hands meetings in October 2015.
Types of Mistakes
I want to start with a discussion of the three types of mistakes, as relayed to me by a good friend of mine:
- Honest mistakes - Everyone falls victim to an honest mistake sometime in their life (I fall victim to them more than most). You accidentally push incomplete code to production or you leave an ad campaign running just a little too long. We're human and these mistakes happen, so we fix them and move on.
- Incompetent mistakes - The people committing these mistakes are either out of their league or incompetent. Either way, it's time to cut them loose or give them a different job.
- Process mistakes - Most mistakes happen because of a lack of process, so I want to spend the majority of my time talking about this type of mistake and how we seek to minimize them (nobody's perfect) at First Opinion.
Minimizing Process Mistakes
In Engineering
Let's start with Engineering, since I happen to know a thing or two about how the engineering team works.
Testing
The first line of defense for engineering is our automated tests. Each major piece of our codebase has a decent number of tests backing it up. As a completely unscientific, informal look at how much code we have written just to test the codebase, here are some conservative stats on our three biggest software areas[1]:
- Server - 500+ tests comprising around 15,000 lines of code.
- iOS - 60+ tests comprising over 10,000 lines of code.
- Web-client - Many tests comprising around 5,000 lines of code.
- Other - We have multiple open source projects that also have decent testing suites.
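For the curious, here is a rough sketch of how a count like this can be gathered. The `count_lines` helper and the `./Tests` path in the usage line are made up for illustration; my actual method was the far lazier one-liner in the footnotes.

```shell
# count_lines DIR EXT: print the total number of lines across every file
# under DIR whose name ends in .EXT.
# (Hypothetical helper for illustration, not part of our real tooling.)
count_lines() {
  local dir="$1" ext="$2"
  # "-exec cat {} +" hands the matching files to cat in batches, so file
  # names with spaces are safe; wc -l then counts the combined output.
  # tr strips the padding some wc implementations add around the number.
  find "$dir" -type f -name "*.${ext}" -exec cat {} + | wc -l | tr -d ' '
}
```

Something like `count_lines ./Tests m` would then print a single total for the Objective-C test files. Note that `wc -l` counts newline characters, so this is still a ballpark figure rather than a precise measure.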
We are constantly adding new tests and making sure the existing tests are still relevant, so our engineers can have that sweet, sweet peace of mind that comes from knowing any changes they make to the codebase don't ripple outward like silent chaos ninjas to cause unforeseen bugs elsewhere in our system. Because without automated tests...
Making even small changes will become increasingly difficult. Eric Evans in Domain-Driven Design: "When complexity gets out of hand, developers can no longer understand the software well enough to change or extend it easily and safely." Facebook [needs] a huge staff to keep up their momentum maintaining a big ball of mud. ... Releases will break things, because you don't understand the relationships well enough to predict the impact of your changes. ... Next time management or clients try to convince you to move faster and throw quality under the bus, you can say sure, that will work, as long as you can hire 429 engineers to work on our iOS app.
In fact, our iOS engineers just spent the better part of the last two weeks knocking down the last untested part of the iOS application in an effort to release the most stable app we've ever released[2].
Jarid also recently spent a solid week figuring out how to automate the testing of a new feature we're working on[3]. I guarantee our competitors didn't devote that much time to ensuring they could automatically test that feature; that's what makes us different.
Code Reviews
But wait, there's more. The engineers also do code reviews, where one engineer has another engineer look over their code and suggest how it can be improved. These are incredibly valuable: they help all the engineers understand the codebases, and they help the codebase stay high quality and well documented, since your best critic isn't yourself, it's the colleague who has to maintain your bug-infested code.
Automation
We try and automate all the things. When we deploy, we deploy with one simple command. When we add servers to our system, that's also one command. Automation means we do things the same way each and every time, thus minimizing mistakes.
When we do find mistakes, we fix the automation scripts, for a change-once, fix-everywhere workflow.
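To make the one-command idea concrete, here is a minimal sketch of what such a deploy command might look like. This is not our actual script: `deploy`, `run_step`, the `DRY_RUN` switch, and the three `./*.sh` step scripts are all hypothetical names invented for this example.

```shell
# Hypothetical one-command deploy. Every step name and script below is a
# placeholder for illustration, not our actual tooling.
# Set DRY_RUN=1 to print the steps without running them.

run_step() {
  # Print the step name, then run the given command unless this is a dry run.
  local name="$1"
  shift
  echo "==> ${name}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  "$@"
}

deploy() {
  local env="$1"
  # Same steps, same order, every time; a failing step stops the deploy.
  run_step "run tests" ./run_tests.sh || return 1
  run_step "build release" ./build.sh "$env" || return 1
  run_step "push to servers" ./push.sh "$env" || return 1
  echo "deployed to ${env}"
}
```

With something like this in place, a deploy is just `deploy production`. If a post-mortem shows a step was missing or wrong, the fix goes into `deploy` once and every future deploy picks it up.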
In Product
We have a full-time user researcher who is constantly going out and talking to people about our product and planned upcoming features. Think about that for a minute: we have someone whose whole job is to go out and speak to real people about our product. Do you know any other company that does that?
Before any major feature gets anywhere near engineering, it has gone through multiple rounds of user testing, first with paper wireframes and then with small interactive prototypes on the phone. Only after all that is the feature ready to be passed to engineering to be built.
On the doctor side
Each of our main doctors has a staff of support doctors that help them out. Our matched doctors review the conversations and monitor the quality of the interaction each one of our users has on our service.
We also have a full-time QA staff that works continuously to make sure each and every interaction with a doctor on our app is a high-quality one.
Besides that, we ask the users themselves to rate their interaction with our doctors, and we take that feedback incredibly seriously.
But things always go wrong
It's true, they do, and we're no exception. So when a user does have a bad interaction, we kill them with kindness, and we work with them personally to make sure their issue is resolved to their satisfaction.
Post-Mortems
"Reverse engineer your successes and turn best practices into best processes"
-Howard Lindzon
Whenever we have problems, we go through what caused the problem and talk about how we fixed it and what we are going to do differently to make sure we never see the problem again. This is the best way to make sure we are always solving new problems and moving forward, instead of going insane.
In Engineering, we also change the code, add tests, or automate the problem away. The goal of these post-mortems, and of implementing the solutions, is to minimize the chance of the same mistake happening a second time.
We also do one more thing...
We pass it on down
Anyone familiar with the Banana story?
Start with a cage containing five monkeys. Inside the cage, hang a banana on a string and place a set of stairs under it. Before long, a monkey will go to the stairs and start to climb towards the banana. As soon as he touches the stairs, spray all of the other monkeys with cold water. After a while, another monkey makes an attempt with the same result - all the other monkeys are sprayed with cold water. Pretty soon, when another monkey tries to climb the stairs, the other monkeys will try to prevent it.
Now, put away the cold water. Remove one monkey from the cage and replace it with a new one. The new monkey sees the banana and wants to climb the stairs. To his surprise and horror, all of the other monkeys attack him. After another attempt and attack, he knows that if he tries to climb the stairs, he will be assaulted.
Next, remove another of the original five monkeys and replace it with a new one. The newcomer goes to the stairs and is attacked. The previous newcomer takes part in the punishment with enthusiasm! Likewise, replace a third original monkey with a new one, then a fourth, then the fifth.
Every time the newest monkey takes to the stairs, he is attacked. Most of the monkeys that are beating him have no idea why they were not permitted to climb the stairs or why they are participating in the beating of the newest monkey.
After replacing all the original monkeys, none of the remaining monkeys have ever been sprayed with cold water. Nevertheless, no monkey ever again approaches the stairs to try for the banana.
Why not?
Because as far as they know that's the way it's always been done around here.
While this story usually has a negative connotation, it doesn't have to be that way. Passing down good company culture is a great thing and helps make us stronger.
We learn from each other, and hopefully we retain mostly the good stuff and fix the bad stuff.
In Conclusion
This fanatical obsession with scalable quality assurance has been going on from the very first line of code and the very first user/doctor interaction, to where we sit now at hundreds of interactions a day, to the future at thousands of interactions, and tens of thousands of interactions, and millions of interactions[4]. We focus so much on our internal quality because we know our internal quality eventually becomes our external quality.
This is what makes us different, this is what makes us special, and this is what will help us change healthcare.
[1] This was incredibly unscientific. For example, in the iOS repo, I went into the testing directory and ran:
find . -name "*.m" | xargs wc -l
to get the line count.
[2] Which we did release... in 2015. A release about 100x more stable is still available in the App Store.
[3] Sadly, I removed the specifics of the feature and now I'm not 100% sure what feature it was.
[4] Fun fact, we've hit most of these interaction goals over the last few years.