Scaling Continuous Delivery: Happiness as a Metric

A few days ago Jeff Atwood (Coding Horror) suggested a good measure of a tech company’s health is the time it takes to have a simple change become available to customers:

And while there are numerous metrics that determine the health of a tech company (see Accelerate, by Nicole Forsgren, Jez Humble, and Gene Kim, for an amazingly comprehensive overview), Continuous Delivery strongly correlates with successful outcomes.

I support Jeff’s assertion – I witnessed the value created by Continuous Delivery at IMVU, where we pioneered some of the crazy processes that would be followed by more sane practitioners. From day one, IMVU placed value on the speed of product iteration and “designed” build systems accordingly. In 2006, development was done in Windows using Reactor Server to provide the LAMP-ish stack, and the deploy process looked something like this:

 svn-server$ rcp website/* production:/var/www/

If you’re wondering why I omitted the test framework, I didn’t. Code went from a local Windows sandbox environment, to source control, to live on a Linux environment running a version of PHP different from the local sandbox. Fun! While there were numerous problems with this system, iteration velocity was amazing (those one-word copy changes could ship in less than 5 minutes, and so could full features). This development velocity was a key component enabling IMVU to build a large, successful business in a space where the failed companies outnumber survivors 50 to 1.

And to be clear, time to get a change live to customers doesn’t in itself indicate healthy tech, but a lot of tech health comes from the corresponding systems necessary to make rapid deployment work.

Fast forward a few years to 2008, when I transitioned from leading the operations team to leading the engineering organization. By then the build and deploy systems had matured, with reasonable test coverage and automated deployment, including automated rollbacks when something unfortunate made it into production. It was pretty cool, even though publicly the process was mostly received with the sentiment, “that will never work, and certainly won’t scale”.

Scaling Problems

One of my first challenges as the new engineering leader was a team unhappy about their ability to get work done because it was taking hours for a commit to become live to customers. It’s astonishing when you think about it – every engineer in the company had come from companies where the commit to live process was typically measured in months, but once they experienced the value of Continuous Delivery, anything more than minutes seemed unbearable.

Digging into the problem, I came to understand that slow builds were only part of it. The most significant issues came from shared responsibility for the build systems, which, combined with the desire to deliver features to customers, created a tragedy of the commons. When an engineer had a failure in the build system, the optimal solution for that engineer was to fix the problem in place, blocking the build system for anybody else in the queue, which meant the number of commits in the next build increased, which meant the chance of a failure in that build increased, ad infinitum. The result was that pushing to production could be blocked for hours, sometimes most of the work day.
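To put rough numbers on that feedback loop, here is a minimal sketch. It assumes each commit independently breaks the build with some fixed probability; the 5% figure is made up for illustration and is not a measurement from IMVU. Under that assumption, a build containing n commits fails with probability 1 - (1 - p)^n.

```python
def batch_failure_probability(per_commit_failure: float, commits_in_build: int) -> float:
    """Chance that at least one commit in a build breaks it, assuming independent commits."""
    return 1.0 - (1.0 - per_commit_failure) ** commits_in_build


# Hypothetical per-commit failure rate, chosen only for illustration.
P_FAIL = 0.05

for n in (1, 5, 10, 20):
    chance = batch_failure_probability(P_FAIL, n)
    print(f"{n:>2} commits in build: {chance:.0%} chance the build fails")
```

With these illustrative inputs, a build that has piled up 10 commits fails about 40% of the time, and one with 20 commits about 64% of the time, which is why a blocked queue tends to stay blocked.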

Solving for Happiness

I thought the best solution was to formalize a project, have a clear success outcome, and have a single person with the responsibility for (and therefore authority over) the build / deploy systems. The first problem was determining clear success criteria… any time metric I chose would be somewhat arbitrary, so instead I chose engineering happiness as the success criterion, or more specifically, that pushing to production was no longer causing unhappiness. While I generally hate subjective success criteria, there were ways to assess progress through 1:1 conversations and Likert scale surveys. We also had great (highly objective) data around commit to deploy times, so we could see the correlation to the more subjective happiness index.

There was some pretty straightforward work to improve the actual test and deploy speeds, including simple things like adding more hardware, the slightly less simple step of sorting tests to run by speed (a surprisingly large performance gain), and fixing the slowest of the tests. But some of the most important gains came from the human parts of the deployment system… engineers were required to immediately revert code and fix the issue in their sandbox rather than blocking the build system. This was not a popular policy change, as engineers immediately experienced the direct impact of a failed commit but didn’t immediately see any gains to the overall system. But after a few weeks the improvements were clear in the average commit to deploy time. And giving credit where it is due, Eric Prestemon was the “Buildbot Sheriff” who identified so many of the opportunities for improvement and delivered the results… many people helped, but Eric had the burden of hearing a lot of critical feedback about unpopular policy changes (eventually outweighed by the praise for the results he produced).
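I haven't spelled out above how the sorted order was actually used, so treat the following as one plausible illustration rather than what IMVU ran. When tests are spread across parallel workers, handing out the slowest tests first and always giving the next test to the least-loaded worker tends to keep any single worker from finishing long after the others. The test names, durations, and worker count are all hypothetical.

```python
import heapq


def schedule_longest_first(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Greedy longest-first plan: each test goes to the currently least-loaded worker."""
    # Heap entries are (seconds already assigned, worker index).
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    plan: list[list[str]] = [[] for _ in range(workers)]
    # Walk the tests from slowest to fastest.
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        plan[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return plan


def makespan(plan: list[list[str]], durations: dict[str, float]) -> float:
    """Wall-clock time of the run: the total duration on the busiest worker."""
    return max(sum(durations[t] for t in worker_tests) for worker_tests in plan)


# Hypothetical historical durations in seconds.
durations = {
    "test_checkout": 300,
    "test_chat": 200,
    "test_login": 120,
    "test_avatar": 100,
    "test_search": 80,
}

plan = schedule_longest_first(durations, workers=2)
print(plan)
print(f"wall-clock time: {makespan(plan, durations)}s of {sum(durations.values())}s total work")
```

The greedy approach is not guaranteed to be optimal, but it is cheap, needs only historical timings, and in this small example splits the 800 seconds of work evenly across the two workers.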

Eventually the build system frustration ceased being a common topic in 1:1 meetings, and it faded away as a meaningful problem in engineering surveys. 12 minutes. When the commit to live time is 12 minutes, this system is operating well. That became the new value for alerting – under 12 minutes, all is good, after that we need to actively drive improvements. In practice, deploy time was usually around 11 minutes, 8 for parallel test builds/runs and 3 minutes for rollout checks (thanks for the reminder, @jwatte).

Diminishing Returns

I have been asked why we didn’t try to make the build and deploy systems as fast as possible… why not 2 minutes? We constantly worked on optimizing these systems, adding separate hypothesis builds, automatically isolating build servers to allow diagnosing and fixing without blocking, etc. And sometimes deployment would take less than 9 minutes.

However, much like the difference between 99.99% and 99.999% uptime for a service, the difference to the customer can be negligible while the resources necessary to deliver that improvement can be extraordinary. When business requirements are being met and engineering is happy with deploy times, the resources necessary to dramatically improve were better spent delivering value to customers.

Key Takeaways

  1. Working in a (well functioning) Continuous Delivery environment is empowering, naturally encourages other strong technical practices, and is hard to retreat from once experienced.
  2. Certain problems fall into what I call the “roommates and dishes” category, where “it’s everybody’s responsibility” sounds good, but in practice actually means “it’s nobody’s responsibility”. In these cases it is better to find a results-driven person and ensure they have responsibility and corresponding authority.
  3. Hire Eric Prestemon or somebody like him.

 

Have you worked in a Continuous Delivery environment and experienced non-obvious scaling challenges? I’d like to hear about your experience – please leave a comment!

Avoiding the Perils of A/B Split Testing

A/B testing is widely used in product development, popularized as a fundamental component of the Lean Startup framework, and provides a scientific way of validating product and business improvements. The concept is simple… put some customers in the new experience, compare the results against customers that didn’t get the new experience, and better metrics validate the improvement. In reality, this process of validation is very complicated and there is no shortage of hazards leading you to poor outcomes.

Creating Information out of Data is Hard

IMVU had a culture of data-validated decisions from almost day one, and as a result we made it easy for anybody to create their own split test and validate the business results of their efforts. It took minutes to implement the split test and compare oh so many metrics between the cohorts. All employees had access to this system and we tested everything, all the time. A paper released in 2009, Controlled experiments on the web: survey and practical guide, reinforced that split testing was the undisputed arbiter of truth. We were clearly on the right path.

While the ability to self-assess progress created a very empowering culture, we were largely ill-equipped to understand the nuances of what the data actually meant. Years later we would start to better understand how much we didn’t know.

First Know Why

The first opportunity to make a mistake with split testing is deciding to test in the first place. When creating a split test has a very low barrier, it is easy to err on the side of just testing everything so that you can have the data if you need it. But every test has a lot of hidden costs that come from false positives, clarification of data, shiny-object distractions, inconsistent customer experiences, and additional opportunities for introducing bugs.

Recognizing that being a split test packrat has a real cost, there should be some requirement for incurring this cost: at the very least, answering the question, “What are the significant changes that will be made as a result of this test?” Additional pre-test work to specify what will be measured, and what results will determine success or failure, can also go a long way towards ensuring time spent testing is valuable.

Test Implementation is a Project

IMVU had a great framework to make test implementation a seemingly simple task, with a few lines of code creating a branch for the test experience and leaving the current experience as the control. Again, this made creating tests seem deceptively easy, and left openings for measuring the wrong thing.

Often a split test is a cross-functional effort, with an engineer handling the implementation and the customer being any combination of a product manager, acquisition team, marketing representative, revenue officer, or generally interested party. In some cases, the interpretation of test data is done by another person altogether. Correctly understanding what the internal customer wants to know, capturing the right data, and converting that data into information ends up with many points of communication that must be accurate to deliver a valid test.

For example, the acquisition team wants to test a new landing page, simply reordering the registration fields because they think it will improve the registration completion rate. The engineer, realizing this is a no-brainer, takes the 15 minutes before lunch to create the quick test: two paths, and the test is running. However, the registration page has both manual registration and sign-in with a social network account, so the test is including a lot of users who sign in with a social account, for whom the registration fields are irrelevant. This subtle nuance means that the impact of the registration field changes will likely be lost as the irrelevant data acts as a damper. What the customer wanted to know isn’t what the test is answering, and it’s likely that nobody on the project knows there is an error.
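The damper effect in the registration example can be shown with simple arithmetic. This is a sketch with invented numbers: the share of manual registrations and every completion rate below are assumptions for illustration, not IMVU data. It assumes social-login users complete at the same rate in both cohorts, since the field reordering never touches them.

```python
def blended_rates(share_manual: float, manual_control: float,
                  manual_test: float, social_rate: float) -> tuple[float, float]:
    """Completion rate each cohort reports when social-login users are mixed into the test."""
    share_social = 1.0 - share_manual
    control = share_manual * manual_control + share_social * social_rate
    test = share_manual * manual_test + share_social * social_rate
    return control, test


# Hypothetical inputs: 40% of visitors register manually, and the reordered
# fields lift manual completion from 20% to 22%. Social logins complete at 90%.
control, test = blended_rates(share_manual=0.4, manual_control=0.20,
                              manual_test=0.22, social_rate=0.90)

print(f"lift among manual registrations: {0.22 / 0.20 - 1:.1%}")
print(f"lift the blended test reports:   {test / control - 1:.1%}")
```

With these made-up inputs, a 10% relative improvement among the users the change actually affects shows up as roughly a 1.3% lift in the blended metric, which is far easier to lose in the noise. Restricting the test population to manual registrations measures the question the acquisition team was really asking.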

The ease of creating a split test should not be conflated with delivering quality results from a test. Doing it right is a project and requires investment of resources consistent with any other project.

WTF Do These Results Actually Mean?

Assuming you were diligent in your experiment design, you captured all of the relevant data, and you avoided some of the common errors of A/B testing, you now need to make sense of the data. In the best cases, you’re looking at something like “the registration landing page increased conversions from 1.83% to 2.01%”; in the worst cases you find something like “customers are engaging with messaging feature 17% longer… but their lifetime value has dropped by 4%”, and now there is work to put together a narrative that explains the perplexing results.

In 2012 I read a paper, Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained, and I had what I like to call an “oh shit” moment. Highly controlled experiments, run by companies with world-class, dedicated analytics teams, were getting perplexing results that required substantial research to understand what was actually happening. What chance did we have of getting this right when we were running 15+ experiments a week with training consisting of a one-page internal wiki version of “A/B Testing for Dummies”?

The tl;dr summary of the paper: without deep consideration for the “why” behind the change in metrics, positive results may be antithetical to what you are actually trying to achieve.

The up-front work to limit the scope of the experiment and how it will be measured / interpreted can help, assuming you have the self control to ignore the data outside of scope. Often these perplexing results require follow-up experiments to better isolate cause and effect. I also highly recommend talking to customers – qualitative insights from hearing their experiences can often help make sense of what the quantitative results were hiding.

You’re Biased. No, Really, You Are

I’m sure there are a lot of great reasons we humans are wired to think the way we do, and this wiring probably served us very well in many situations. However, humans also come standard with cognitive biases, built-in tendencies to make irrational decisions. Unfortunately, putting a bunch of effort into building something and then getting a giant pile of metrics is a perfect enabler for cognitive biases and craptastic decisions.

While numerous biases are working against you, with a buffet of metrics one of the most common is the Texas sharpshooter fallacy, in which all of the test metrics that are improvements over the control metrics are used to demonstrate the success of the test. With a 95% confidence level, about 1 out of 20 metrics tracked is expected to show a false positive, so even an A/A test (two separate cohorts with identical experiences) would likely show “improvements”. Before we eliminated the practice of metric-sniping at IMVU, it wasn’t uncommon to hear somebody say something like, “my pet project to streamline registration didn’t change registration, but it does deliver a 5% improvement in [the completely unrelated] customer lifetime value, so we should keep it.”
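The "1 out of 20" arithmetic gets worse quickly as more metrics are tracked. As a rough sketch, assume every metric is tested at the same 5% significance threshold and that the metrics are independent of each other. That is a simplification, since real product metrics are usually correlated. The chance that an A/A test shows at least one "significant" result is then 1 - (1 - alpha)^k for k metrics.

```python
def chance_of_any_false_positive(metrics_tracked: int, alpha: float = 0.05) -> float:
    """Probability that at least one independent metric looks significant purely by chance."""
    return 1.0 - (1.0 - alpha) ** metrics_tracked


def expected_false_positives(metrics_tracked: int, alpha: float = 0.05) -> float:
    """Average number of metrics that cross the threshold by chance alone."""
    return metrics_tracked * alpha


for k in (1, 5, 20, 50):
    print(f"{k:>2} metrics: {chance_of_any_false_positive(k):.0%} chance of at least one "
          f"false positive, {expected_false_positives(k):.2f} expected")
```

Under those assumptions, a dashboard of 20 metrics has about a 64% chance of handing you at least one spurious result to snipe, which is why deciding what will be measured before the test runs matters so much.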

There are process controls that can help reduce the potential impact of various biases, in particular around defining and constraining each test. However, being aware of these biases and encouraging a culture consistent with the dialectical method can help make better product decisions, even beyond interpreting test results.

Talk to Your Customers!

One of the biggest risks that come from over-reliance on split testing is seeing it as a more convenient method of getting customer feedback. Why spend 30 minutes on the phone with one customer when you can simply measure the actual actions of thousands of customers?

Looking at data and sending surveys may seem like an efficient use of time, but that highly structured approach is unlikely to surface critical customer insights. Metrics and surveys will often answer the “what”, but almost always miss the “why”, the most critical driver of valuable insights. There is no substitute for talking to your customers.

In the words of Steve Blank, “Get Out of the Building.”

 

I’m interested in hearing other stories where split testing has made an impact, either positive or negative. Please share a comment if you have one!

Interviewed on #ModernAgileShow

I recently had the pleasure of being interviewed by Joshua Kerievsky on the #ModernAgileShow, where we talked about a lot of my experience working at IMVU, ranging from the early days of Continuous Deployment (without all of those fancy automated tests or cluster immune systems) to changes in experiment systems and challenges of building a culture where people feel safe.  I also provide some insights into the sausage making of The Lean Startup.

In the interest of accuracy, my title in the video should be “former CEO of IMVU”.

For more information about Josh’s work to set up agile processes and cultures independent of a specific framework, check out the Modern Agile website.

On a semi-related note, Josh mentioned that the original video of Timothy Fitz presenting on Continuous Deployment at IMVU: Doing the impossible fifty times a day was lost as the result of server corruption… if anybody happens to have a local copy please let me know – it would be great to restore this historic presentation for the Interwebs!

 

Being a Great Engineer != Being a Great Engineering Manager

I just read Google’s Quest to Build a Better Boss, describing “Project Oxygen”, which analyzed Google’s performance and review data to determine which characteristics are most important to being a successful manager at Google.  This was summarized into eight key success behaviors and three common pitfalls.  The big surprise?  Google “…found that technical expertise – the ability, say, to write computer code in your sleep – ranked dead last among Google’s big eight.”

This is not a surprise to me and supports what I have come to believe after years of engineering management – being a great engineer does not necessarily prepare you for being a good manager.  This is not to say that great engineers can’t also be great managers, but the process many companies use of taking their best engineers and “promoting” them to management is flawed.  In many cases, it leads to a company losing a great engineer and gaining an ineffective (or worse, harmful) manager.  Many companies compound this problem by creating career ladders that effectively force engineers to choose between a career ceiling and a management path.

There are many characteristics that I see in successful managers.  First and foremost, good managers have to always be working to ensure the success of the team and their individual reports.  Success goes beyond just getting projects and tasks done – it also means helping their individual reports understand their strengths and opportunities for growth.  It requires taking a real interest in where each person wants to go in their career and creating opportunities for them to reach their goals.  Good managers need a lot of block and tackle type skills to unblock people and ensure they have an environment that helps them remain productive.  Good managers encourage growth for their employees by giving direction when needed but empowering them to try (and sometimes fail) in the interest of helping them learn and improve.  Of course, good managers must also be proactive about confronting tough issues and addressing performance problems to maintain a high-quality team.

Those characteristics are not necessarily the same characteristics necessary to be a great engineer.   It is not uncommon to see great engineers also be really great mentors and solve problems (beyond just engineering) in creative ways, but it is not typically their focus.  Also, the way they work is typically different.  Most managers have a tremendous amount of context switching during their day and need to make themselves available and interruptible to unblock others – this can be highly detrimental to an engineer that typically pays a high cost for context switching and getting back into the flow.

Another critical characteristic of good managers is knowing how to get problems solved.  This is very different than knowing the solution to a problem. The manager adds value by unblocking their report, not by being smarter than their report.  Many times I see very technical employees go to a much less technical manager with a technical problem.  While the manager may not be able to solve the problem directly, they can usually identify the steps (and people) required to get a solution.  This is where I see many organizations make mistakes when looking for managers – they assume that a manager can’t manage engineers if she is less technical than the engineers in the organization.  As an example of how this can manifest itself, at my company we were looking for an additional engineering manager and the bar was set pretty high based on the performance and 360 feedback of our existing manager – engineers thought he was great.  The engineers interviewing the candidate used the exact same very technical questions we use to identify great engineers.  The candidate did not do well.  In the wrap-up meeting I asked if they had ever needed their great manager to answer these types of technical questions and the response was, “no – we have really solid tech leads for that”.  We quickly adjusted the engineering manager candidate questions to stop looking for successful engineer skills and instead identify manager skills that make other engineers successful.

For most of my life I have had the privilege of working with some truly exceptional programmers (far better than myself).  It did not take long for me to realize that the value I could create for each company as an engineer was much less significant than the value I could create by ensuring that other (better) engineers were effective and successful.  However, some companies make management the only option for career progression, which encourages great engineers that are passionate about coding to switch to a role for which they are less passionate and probably less capable (yes this is a generalization and I apologize to the truly amazing individuals that are both deeply technical and exceptional managers).  More companies should have parallel career ladders that allow engineers to remain with their hands on the keyboard and heads in the code while obtaining a career level as high (or higher) than management positions.

On a side note, one of the things I really liked about Project Oxygen is the approach of using data to analyze business processes.  I find that many companies that are data driven and have a deep understanding of their customer metrics often don’t have the same understanding of how they work and what makes them (in)effective.  We regularly collect data at my company and use it as an input to redefine how we work, and we constantly benefit from that evaluation.

Here is a summary of Google’s findings from Project Oxygen:

Here are the 8 top behaviors of managers in order of importance:

  1. Be a good coach
  2. Empower your team and don’t micromanage
  3. Express interest in team members’ success and personal well-being
  4. Don’t be a sissy: Be productive and results-oriented
  5. Be a good communicator and listen to your team
  6. Help your employees with career development
  7. Have a clear vision and strategy for the team
  8. Have key technical skills so you can help advise the team

Here are an additional 3 manager pitfalls:

  1. Have trouble making a transition to the team
  2. Lack a consistent approach to performance management and career development
  3. Spend too little time managing and communicating