Scaling Continuous Delivery: Happiness as a Metric

A few days ago Jeff Atwood (Coding Horror) suggested a good measure of a tech company’s health is the time it takes to have a simple change become available to customers:

And while there are numerous metrics that determine the health of a tech company (see Jez Humble’s book, Accelerate, for an amazingly comprehensive overview), Continuous Delivery strongly correlates to successful outcomes.

I support Jeff’s assertion – I witnessed the value created by Continuous Delivery at IMVU, where we pioneered some of the crazy processes that would be followed by more sane practitioners. From day one, IMVU placed value on the speed of product iteration and “designed” build systems accordingly. In 2006, development was done in Windows using Reactor Server to provide the LAMP-ish stack, and the deploy process looked something like this:

 svn-server$ rcp website/* production:/var/www/

If you’re wondering why I omitted the test framework, I didn’t. Code went from a local Windows sandbox environment, to source control, to live on a Linux environment running a version of PHP different than the local sandbox. Fun! While there were numerous problems with this system, iteration velocity was amazing (those 1 word copy changes could ship in less than 5 minutes, and so could full features). This development velocity was a key component enabling IMVU to build a large, successful business in a space where the failed companies outnumber survivors 50 to 1.

And to be clear, time to get a change live to customers doesn’t in itself indicate healthy tech, but a lot of tech health comes from the corresponding systems necessary to make rapid deployment work.

Fast forward a few years to 2008 and I transition from leading the operations team to leading the engineering organization, where the build and deploy systems had matured, with reasonable test coverage, and automated deployment, with automated rollbacks when something unfortunate made it into production. It was pretty cool, even though publicly the process was mostly received with the sentiment, “that will never work , and certainly won’t scale”.

Scaling Problems

One of my first challenges as the new engineering leader was a team unhappy about their ability to get work done because it was taking hours for a commit to become live to customers. It’s astonishing when you think about it – every engineer in the company had come from companies where the commit to live process was typically measured in months, but once they experienced the value of Continuous Delivery, anything more than minutes seemed unbearable.

Digging into the problem, I came to understand that the problem was not slow builds (although that was part of it), the most significant issues were caused from the shared responsibility for build systems, combined with the desire to deliver features to customers, created a tragedy of the commons. When an engineer had a failure in the build system, the optimal solution for that engineer was to fix the problem in place, blocking the build system for anybody else in the queue, which meant the number of commits in the next build increased, which meant the chance of a failure in the that build increased, ad infinitum. The result was pushing to production could be blocked for hours, sometimes most of the work day.

Solving for Happiness

I thought the best solution was to formalize a project, have a clear success outcome, and have a single person with the responsibility for (and therefore authority over) the build / deploy systems. The first problem was determining a clear success criteria… anything time metrics I chose would be somewhat arbitrary, so instead I chose engineering happiness as the success criteria, or more specifically, pushing to production was no longer causing unhappiness. While I generally hate subjective success criteria, there were ways to assess progress through 1:1 conversations and Likert scale surveys. We also had great (highly objective) data around commit to deploy times, so we could see the correlation to the more subjective happiness index.

There was some pretty straightforward work to improve the actual test and deploy speeds, including simple things like adding more hardware and the slightly less simple sorting tests to run by speed (a surprisingly large performance gain), and fixing the slowest of the tests. But some of the most important gains came from the human parts of the deployment system… engineers were required to immediately revert code and fix the issue in their sandbox rather than blocking the build system. This was not a popular policy change as immediately engineers experienced the direct impact from a failed commit, but didn’t immediately see any gains to the overall system.  But after a few weeks the improvements were clear in the average commit to deploy time. And giving credit where it is due, Eric Prestemon was the “Buildbot Sheriff” that identified so many of the opportunities for improvement and delivered the results… many people helped, but Eric had the burden of hearing a lot of critical feedback about unpopular policy changes (eventually outweighed by the praise for the results he produced).

Eventually the build system frustration ceased being a common topic in 1:1 meetings, and it faded away as a meaningful problem in engineering surveys. 12 minutes. When the commit to live time is 12 minutes, this system is operating well. That became the new value for alerting – under 12 minutes, all is good, after that we need to actively drive improvements. In practice, deploy time was usually around 11 minutes, 8 for parallel test builds/runs and 3 minutes for rollout checks (thanks for the reminder, @jwatte).

Diminishing Returns

I have been asked why we didn’t try to make the build and deploy systems as fast as possible… why not 2 minutes? We constantly worked on optimizing these systems, adding separate hypothesis builds, automatically isolating build servers to allow diagnosing and fixing without blocking, etc. And sometimes deployment would take less than 9 minutes.

However, much like the difference between 99.99% and 99.999% uptime for a service, the difference to the customer can be negligible while the resources necessary to deliver that improvement can be extraordinary. When business requirements are being met and engineering is happy with deploy times, the resources necessary to dramatically improve were better spent delivering value to customers.

Key Takeaways

  1. Working in a (well functioning) Continuous Delivery environment is empowering, naturally encourages other strong technical practices, and is hard to retreat from once experienced.
  2. Certain problems fall into what I call the “roommates and dishes” category, where “it’s everybody’s responsibility” sounds good, but in practice actually means “it’s nobody’s responsibility”. In these cases it is better to find a results-driven person and ensure they have responsibility and corresponding authority.
  3. Hire Eric Prestemon or somebody like him.

 

Have you worked in a Continuous Delivery environment and experienced non-obvious scaling challenges? I’d like to hear about your experience – please leave a comment!

You Are Wrong About Your Stupid Account

You’re wrong – hackers are interested in your boring personal account, you are making it easy for them to get access, and it will likely end up being a bigger problem than you imagine.

Those are the stern words I want to use whenever I witness a friend doing the online equivalent of parking and leaving a stack of $100 bills on their car dashboard in a crime-ridden neighborhood. Instead I tend to suggest some easy steps to take to be more secure, which are almost invariably met with “it’s not a big deal”. I decided to write up my thoughts, so I can just point friends to this article and hopefully help others. This is absolutely not for altruistic reasons… I’ve had multiple experiences where somebody else’s bad online security habits resulted in nights and weekends of work for me and entire teams of people. I just want to sleep.

Hackers Want Your Stupid [insert lame service] Account

It seems absurd that your Lint Sculptures Discussion Forums password is of value to anybody… it’s just you and people you’ve met over the last 15 years that love to talk about dryer lint sculpting… security doesn’t matter. However, it was 15 years ago, so you chose a really lame password at the time (like “123456”), and now that an elite hacker has broken that code, they see your basic account details (your email, IP address, real name and city you live in). Again, who cares… that’s useless. Well, except you used the same password for everything back then, so with your email and password they can run a script to check 100,000 other sites and hey… looks like your genealogy, old photo sharing, and that antique Hotmail account you abandoned had the same password. Unfortunately, that banking thing you signed up for 12 years ago used that Hotmail address, and you forgot to unlink the Hotmail address from a few other accounts, including Paypal and LinkedIn. Now the hacker has the ability to access your LinkedIn account, change account credentials on your banking and possibly access accounts you don’t even remember you had. You can imagine how this gets problematic… the ability to send and receive from your email address typically provides the ability to get access to all other accounts, if by no other means than requesting a password reset. And this is just the annoying scenario where you have to deal with correcting identify theft on your own… at least you didn’t drag your friends down.

Instead the Hacker could exploit your Lint Sculptures Discussion Forums friends of 15 years. Does everybody need a direct message and 10,000 forum posts offering black market Viagra? No problem. Or how about a few messages to trusted friends to install this Lint Sculpting Simulation program… you know it doesn’t have a virus because your trusted friend of 15 years swears it’s great. Everybody wants to be part of a botnet, right? All of these acts may seem pointless to you, but hackers have a way of generating value (and money) from these pointless acts, and it isn’t much effort (a lot of it is automated), so it happens.

These scenarios may sound ridiculous, but two years ago I was contacted by a long-time friend that was traveling abroad and all of his possessions has been stolen, his family was stranded and he needed me to send money. What was true is he was traveling with family, the rest was made up by a hacker that got enough information to know I was a friend that would help, knew when the family was traveling, and when the story might make sense. Everything hackers needed to make this happen came from accessing worthless accounts.

Steps to Making Yourself More Secure

Security must be balanced with convenience. When being secure is a hassle, people naturally find (unfortunate) workarounds that make things less secure. If you require a password that is 20 characters long and random, look around the person’s desk for the PostIt (or possibly worse, in their “passwords.txt” file on their desktop). The sweet spot is a mild inconvenience that dramatically improves security. I find there’s a few easy practices that fit into this sweet spot…

Two-factor Authentication

Systems that require two components to authenticate are substantially more secure than password-only systems. To access an account, it requires something you know (e.g. the password), and something you have, like a key. The “key” today is typically an application like Google Authenticator, or an SMS message with a code sent to your phone, both of which provide a unique code that is only valid for 1-5 minutes. Many services offer this, including Gmail, Facebook, Twitter, Dropbox, and a few banks (seriously banks, WTF?)

The beauty of Two-factor Authentication is, even if your password is breached, it doesn’t allow the hacker to access your account. So when you are are that hotel and using the guest computer with a key-logger to print your flight itinerary from your Gmail account, it doesn’t matter… the hacker only has 50% of what they need.

The inconvenience of adding Two-factor Authentication is typically an additional 20 seconds and, since many services allow you to say “remember me for 30 days”, it’s less than a minute a month (and… don’t use “remember me” on any shared machine).

Unique Passwords

If I told you I had every lock I use in my life (home, office, safety deposit box, cars, bike lock, vacation house) re-keyed to use the exact same key, you’d probably agree that it would be disproportionately bad if somebody found my bike key. When you apply this to online habits, people seem oddly comfortable with one key for almost everything, and a special key for their bank account (but online, weak keys often provide access to special keys).

Use a different (and strong) password for everything. This, of course, is a hassle… nobody can remember 150 different strong passwords, especially when you have to change them all every 3 weeks when you get the latest exploit notice from Yahoo!

One solution is to have a hard password that is modified in a way that you know for each service. As an example, my password is “nS72!la^mq” and I add the first four letters of the website it uses, in reverse… so for Yahoo! it becomes “nS72!la^mqohaY” and for Google it is “nS72!la^mqgooG”. This has a few flaws, including making it hard to change passwords, but it’s a substantial improvement over “swordfish” for everything.

A better solution is a password manager. Services like LastPass and Passpack provide a secure way for you to store and retrieve complicated passwords. Legitimate services encrypt your data in a way where they don’t actually know or even have access to your password, so a hacker that steals their database ends-up with a ton of encrypted files and no keys. While there are ways that could be exploited, these services are certainly better than any other options available at a consumer-level (and if you’re really paranoid, some make the source code available for you to keep the encrypted data only on your computer).

Whatever you do, never, ever, ever keep a password file on you computer, even if you think you’re clever by naming it “groceries.doc”.

Don’t Share Accounts

Sharing accounts invariably leads to other poor security practices, like the need to email everybody when a password changes or having a shared password file somewhere. And, when one of the people sharing your account gets hacked, this means the shared account gets hacked (and probably every other account in that shared password file so cleverly named “groceries.doc”)

This isn’t 1997 -these days there are very few reasons why each person can’t have their own credentials, especially for email. Only share accounts when separate accounts are not possible (I’m looking at you, Netflix). If you do need to share accounts, use a password manager that offers sharing of specific entries, which means that only the minimum exposure is shared and it is simple to update credentials (Passpack does this nicely).

Don’t Click Links

Okay, so the Interwebs sort of suck if you follow this rule exactly and dead-end on a website. However, for any site you are going to access and provide your credentials, enter the URL directly.

Did you just receive a weird email from PayPal telling you that Ned just paid you $42 for a lint sculpture you don’t remember selling? Instead of clicking on the “collect your money” link in the email, type “paypal.com” in your browser bar directly and see if the transaction is in your account history. Many phishing emails look and smell like the real thing because it is pretty simple to copy the real thing and send you to “paypaI.com” (see what I did there? that was a capital “i”, not an “l” in that URL) to steal your password. Of course, if you’re using Two-factor Authentication, a stolen password is less of a problem.

Secure Your Family

I used to get sick a couple of times a year… no big deal, just a sniffle every now and then. When I had kids, my health status flipped and it seemed like a couple of times a year I wasn’t infected with whatever was festering in the cesspool of Cheerios, finger paint, juice boxes and runny noses known as preschool.

My point is, there is almost certainly going to be an overlap of your family’s online account footprint, and when one person gets hacked it will likely be a vector for the rest of your family. Sharing documents in Dropbox, G Suite (Google Docs), or Amazon family all provide opportunities for a hack to spread. Protect your accounts by having those close to you keep their accounts secure (and… that is the real reason I wrote this post – pure selfishness as I protect my own accounts).

Do you have other tips or suggestions to help make the average person more secure? Share them in the comments section!

Interviewed on #ModernAgileShow

I recently had the pleasure of being interviewed by Joshua Kerievsky on the #ModernAgileShow, where we talked about a lot of my experience working at IMVU, ranging from the early days of Continuous Deployment (without all of those fancy automated tests or cluster immune systems) to changes in experiment systems and challenges of building a culture where people feel safe.  I also provide some insights into the sausage making of The Lean Startup.

In the interest of accuracy, my title in the video should be “former CEO of IMVU“.

For more information about Josh’s work to setup agile processes and cultures independent of a specific framework, check out the Modern Agile website.

On a semi-related note, Josh mentioned that the original video of Timothy Fitz presenting on Continuous Deployment at IMVU: Doing the impossible fifty times a day was lost as the result of server corruption…. if anybody happens to have a local copy please let me know – it would be great to restore this historic presentation for the Interwebs!

 

3 Things You Can (and Should) Change In Vendor Agreements

Over the last 10+ years I reviewed and negotiated all sorts of vendor agreements for technical operations.  Companies that are starting to build out their production environments occasionally contact me looking for advice.  Being on vacation (and having time to write), I decided to share some of the more common problems I see in vendor agreements.

In almost all cases you can (and should) get better terms on what are presented as these “standard” clauses.

SLA

The Service Level Agreement (SLA) is probably the most critical to the availability of your business.  For vendors providing services like DNS or bandwidth, any vendor failure can result in failure of your business.  In other words, your uptime is no better than their uptime. The SLA is typically expressed and a percentage of availability.  If the SLA is 99.9% uptime, you are accepting 45 minutes of downtime per month.  Failure to meet the SLA usually means reimbursement for the cost of the service, not for the cost of your lost revenue resulting from the failure.   For example, if you pay a DNS service $31 per month and they are down for a full day, your reimbursement would be $1, not the revenue you lost during that full day.  Also, when the failure begins is usually defined as your notification to the vendor, not by the actual beginning of the failure.  In other words, if you didn’t report it, the problem never happened.

The availability percentages for an SLA are usually difficult to alter but there are a few things in that you can change to limit your liability.  Most (all) services have occasional failures, but it’s how they fail that become problematic for your business.  An occasional failure might be okay but if this is a pattern you want the option of moving to a new vendor.  You can usually add a clause that allow a termination of the agreement if the vendor fails to provide service more than N times in a 1-month period.  Also, you can usually require that SLA failure begin at the time of the actual failure (when it is detected by either party) rather than your notification to the vendor.

Term and Renewal

Automatic renewals are also common in agreements, in which the duration of the agreement is automatically extended by the length of the initial term.  Typically these require you to opt-out of the renewal by providing written notice within a narrow window of time.    For example, your initial duration is a 1-year after which the contract will automatically renews under the same terms for an additional year unless notice is provided in writing 30 – 45 days prior to the automatic renewal.  Vendors generally don’t contact you to remind you that your opt-out window is approaching and that you might want to negotiate a better deal while you can.

In most cases you want to avoid this simply because the prices for the service are almost always cheaper at the end of the initial term.  This is especially true for things like CDN and bandwidth.   If you’re not good at remembering to do things 11 months in the future, you may find yourself stuck in an agreement with the least favorable pricing.

The initial duration of the agreement is usually a requirement, or at least a requirement for favorable pricing.  However, you should be able to change the automatic renewal to transition into a month-to-month agreement instead of the initial term.  This will provide a better negotiation position when the agreement is up for renewal and will allow you flexibility in the timing.

Change of Terms

It is not uncommon to have a clause in the agreement that sates something to the effect of, “these terms are subject to change” with a link to the vendor website with the current terms.  In effect, this says “you agree to whatever we decide to publish on our website”.  I find these clauses ridiculous… I would love to respond with a clause stating, “our payment terms are subject to change subject to the amount I decide to write on the check”.

In these cases I find it useful to add a clause that requires notification (in writing) of any changes with a short period allowing an opt-out if the changes are seen as a material change.  If you are unable to get a clause to allow termination of the agreement you should be able to get the option to stick with the original terms.

It’s worth noting that with any change to an agreement, a vendor may not have systems helping them enforce or react to the change.  For example, if you are the only customer requiring written notice of changes, this may require manual work that they forgot shortly after signing the contract.  You should consider this and word your changes in a way where a failure on the part of the vendor does not put you at a disadvantage.