Google I/O 2019, Some Exciting Bits that Were not Obviously Exciting

Over the last couple of days I’ve been looking at the various product announcements that came out of Google I/O 2019, and there were a couple of themes that got me pretty excited about where Google can go and how that can make a pretty positive impact on millions of people.

Creating Opportunities for People… All People

I loved the Google Lens announcements from Aparna Chennapragada because the application of the technology can make such a huge difference in people’s lives, and not just the people I typically see wearing fleece vests and sipping cold brew coffee in Silicon Valley. What was most compelling to me was the transcribing / Google Translate integration that was demonstrated, especially when combined with the processing being done on device (not cloud), and being accessible to extremely low-end ($35) devices. Visual translation was always a very cool feature and, when I was trying to figure out menus in Paris, I was happy to have the privilege of a high-end phone and data plan. Making this technology widely accessible enables breaking down barriers created by illiteracy, assisting the visually impaired, and helping human interactions in regions with language borders.

Google also announced Live Caption, where pretty much every form of video (including third party apps and live chat) can have real-time subtitles. This is also done on-device, and works offline, so it can be applied to live events, like watching a speaker at a conference. A shoutout to my friend and former colleague KR Liu for her work with Google on this project, which makes the world far more accessible to people with hearing challenges.

Also notable, Google’s Project Euphonia is making speech recognition more accessible to people with impaired speech.

Movement Towards Device vs. Cloud

The “on device” and “offline” features I mentioned (and that were part of other announcements like Google Assistant improvements) are important because of the implications they have in making the technology available to everyone, and also because of the personal privacy that capability will enable.

Of course, my data, Google’s access to it, and personal privacy are a much larger, complicated conversation… for now I am going to focus on possibilities, not challenges.

For years there has been a move for all aspects of people’s lives to be captured and collected in the cloud. There are many reasons this may have been necessary, including correlating data to make it useful, raw computer processing power requirements, over-reaching policies, and business models requiring all the things to win. Once in the cloud, personal information can be used for purposes never imagined by the consumer, including detailed profiling, sharing with third parties, accidentally leaking to malicious parties, revealing personal content, and various other exploitations that can negatively impact the consumer.

As the processing stays on your device and does not require transferring data off of your device, it enables products that can still provide incredible benefits while also being respectful of customer privacy. This is exciting as there are product opportunities in areas like personal health (physical and mental) that will likely require deep trust and protection of consumer information to gain wide acceptance and benefit the most people.

Personal Assistant of My Dreams

And something I am more selfishly excited about…

For several years I wished that all of the products in Google would integrate with each other and eliminate almost every manual step I have in organizing my day. I am going to side-step the discussion about how much data a company has about an individual and say that I intentionally choose to trust my information with two companies (Google being one), because of the value I get from them. I use Google to organize most aspects of my life, from email communication to coordinating my kid’s schedules, video conferencing, travel planning, finding my way around anywhere, and almost every form of document. As a result, all the parts of Google know a lot about me. But still, when I send an email to set up a meeting, I usually need to manually add that to my calendar and then I also need to add in the travel details (I frequently take trains instead of driving)… it’s a couple of extra minutes that I could be spending on better things, or just looking at pictures of cats on the Internet.

With the progress of Google Assistant and Google Duplex, I am seeing a path where administrivia is eliminated, where email, text messages, phone calls and video conferencing can also provide inputs that guide this assistant into organizing my life behind the scenes… Action items discussed in a Hangout can automatically result in a summary document, a coordinated follow-up lunch, optimal travel details, and a task list.

There is an obvious contradiction between my excitement for the announcements that emphasize better human outcomes and my “let Google know all the things” excitement over a personal assistant, but again, this is about my personal, intentional choice to share data vs. products that mandate supplying personal data, often far in excess of what is necessary to deliver the product or service.

There were some other “that’s cool” announcements, and I’ll probably be buying a Pixel 3a, which seems like a great deal for the feature set, but overall I’m more excited about the direction than the specific products showcased.

Hinder, Don’t Halt: Griefing Content Thieves for Fun and Profit

The art of deterring content theft is an ongoing game of cat and mouse – generally any barrier you create to prevent theft is temporary, as thieves continue to find new ways to steal the content, so long as the value of the content exceeds the effort necessary to steal it. For this reason, it can often be more effective to hinder thieves instead of trying to stop them.

I encounter this “hinder don’t halt” pattern with others that run large services, and you can see this reflected in solutions like shadow banning. One of the most common themes I hear is the satisfaction that comes from solutions that cause frustration for bad actors, so I’m sharing one from my personal experiences…

At IMVU, customers called Creators make content that they sell to other IMVU customers. The content they create is 3D items like avatar clothing, items to decorate an environment, and ways to customize an avatar. This content creates real value for other IMVU customers, who spend real money to purchase it from the catalog of over 10 million items. While many Creators create content just for the enjoyment of creating, some do it as a business, with a few making over $100K US annually. Whether creating for pleasure or business, all Creators hated having their work stolen. And, since there is real money from the sales of content, there is real incentive for thieves to try to steal it.

At one point we discovered a site that was selling a service that would allow people to steal Creator content without paying for it. It was pretty easy to detect the service and the initial response was blocking them, which immediately broke their service completely and, not surprisingly, made the thieves quickly respond by finding a new way around the block. The block lasted less than a day and the thieves were back in business.

The next response was more fun… rather than blocking the thieves, we made their service not work… sometimes… and inconsistently. Code was added to detect thieves accessing content and, at random, some of the content being accessed would be mildly corrupted. The corruption could be configured to occur at certain rates, on certain items, at certain times of day, and be disabled based on what appeared to be testing for the corruption. As a result, customers of the thieves started getting inconsistent results that would sometimes lead to content failing to load and even crashes. If you are an engineer reading this, you understand why this is a nightmare scenario to debug and fix… customers are reporting different failure cases with no consistent way of reproducing the problem to understand the cause. And, since your code is working fine, the bug isn’t going to be found… you eventually have to discover that you are being served different content than is being served to legitimate customers.
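As a rough illustration of the idea (this is not the code IMVU ran), here is a minimal Python sketch of configurable, randomized corruption. Every name in it (`CorruptionPolicy`, `mildly_corrupt`, `serve`, the `is_suspect` flag) is hypothetical, and "mild corruption" is assumed to mean flipping a single byte; how thieves were actually detected is out of scope.

```python
import random
from dataclasses import dataclass
from datetime import time


@dataclass
class CorruptionPolicy:
    """Hypothetical knobs mirroring the ones described above."""
    rate: float = 0.1                  # fraction of suspect requests to corrupt
    item_ids: frozenset = frozenset()  # empty set means "any item"
    start: time = time(0, 0)           # start of the active window
    end: time = time(23, 59, 59)       # end of the window (no midnight wrap)
    enabled: bool = True               # switch off when probing is suspected

    def should_corrupt(self, item_id, now, rng):
        if not self.enabled:
            return False
        if self.item_ids and item_id not in self.item_ids:
            return False
        if not (self.start <= now <= self.end):
            return False
        return rng.random() < self.rate


def mildly_corrupt(payload: bytes, rng) -> bytes:
    """Flip one byte at a random position (an assumed form of 'mild')."""
    if not payload:
        return payload
    i = rng.randrange(len(payload))
    return payload[:i] + bytes([payload[i] ^ 0xFF]) + payload[i + 1:]


def serve(payload, item_id, is_suspect, policy, now, rng=random):
    """Legitimate requests always get the real payload."""
    if is_suspect and policy.should_corrupt(item_id, now, rng):
        return mildly_corrupt(payload, rng)
    return payload
```

Because the decision is probabilistic and gated by item and time of day, the same request can succeed on one attempt and fail on the next, which is what makes the failure so hard to reproduce.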

The result of hindering was much more effective than blocking… it took many weeks for the thieves to understand what was happening and, during this time, we could see them getting bashed by the people that paid them because the stolen content was ruining their experience. By the time the thieves had found another solution, they had such a bad reputation that people were less willing to give them money.

If you have dealt with content thieves I would be interested in hearing your stories, successful or not. Please leave a reply, below!

Credits
Cat and mouse chase image by Jeroen Moes
Dungeons & Dragons dice by Lydia

Scaling Continuous Delivery: Happiness as a Metric

A few days ago Jeff Atwood (Coding Horror) suggested a good measure of a tech company’s health is the time it takes to have a simple change become available to customers.

And while there are numerous metrics that determine the health of a tech company (see Jez Humble’s book, Accelerate, for an amazingly comprehensive overview), Continuous Delivery strongly correlates with successful outcomes.

I support Jeff’s assertion – I witnessed the value created by Continuous Delivery at IMVU, where we pioneered some of the crazy processes that would be followed by more sane practitioners. From day one, IMVU placed value on the speed of product iteration and “designed” build systems accordingly. In 2006, development was done in Windows using Reactor Server to provide the LAMP-ish stack, and the deploy process looked something like this:

 svn-server$ rcp website/* production:/var/www/

If you’re wondering why I omitted the test framework, I didn’t. Code went from a local Windows sandbox environment, to source control, to live on a Linux environment running a version of PHP different than the local sandbox. Fun! While there were numerous problems with this system, iteration velocity was amazing (those one-word copy changes could ship in less than 5 minutes, and so could full features). This development velocity was a key component enabling IMVU to build a large, successful business in a space where the failed companies outnumber survivors 50 to 1.

And to be clear, time to get a change live to customers doesn’t in itself indicate healthy tech, but a lot of tech health comes from the corresponding systems necessary to make rapid deployment work.

Fast forward a few years to 2008 and I transitioned from leading the operations team to leading the engineering organization, where the build and deploy systems had matured, with reasonable test coverage, and automated deployment, with automated rollbacks when something unfortunate made it into production. It was pretty cool, even though publicly the process was mostly received with the sentiment, “that will never work, and certainly won’t scale”.

Scaling Problems

One of my first challenges as the new engineering leader was a team unhappy about their ability to get work done because it was taking hours for a commit to become live to customers. It’s astonishing when you think about it – every engineer in the company had come from companies where the commit to live process was typically measured in months, but once they experienced the value of Continuous Delivery, anything more than minutes seemed unbearable.

Digging into the problem, I came to understand that the problem was not slow builds (although that was part of it). The most significant issues came from shared responsibility for the build systems which, combined with the desire to deliver features to customers, created a tragedy of the commons. When an engineer had a failure in the build system, the optimal solution for that engineer was to fix the problem in place, blocking the build system for anybody else in the queue, which meant the number of commits in the next build increased, which meant the chance of a failure in that build increased, ad infinitum. The result was that pushing to production could be blocked for hours, sometimes most of the work day.
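To see why a growing queue compounds the problem, here is a back-of-the-envelope Python sketch. It assumes each commit breaks the build independently with the same probability; the 5% figure is invented purely for illustration and is not IMVU data.

```python
def batch_failure_probability(per_commit_failure: float, commits: int) -> float:
    """Chance that a build containing `commits` independent commits fails.

    The build passes only if every commit passes, so the failure
    probability is 1 - (1 - p) ** n.
    """
    return 1.0 - (1.0 - per_commit_failure) ** commits


if __name__ == "__main__":
    # Hypothetical 5% chance that any single commit breaks the build.
    for n in (1, 5, 10, 20):
        print(n, round(batch_failure_probability(0.05, n), 3))
```

Under that assumption the failure chance climbs from about 5% for a single commit to roughly 23%, 40%, and 64% for builds of 5, 10, and 20 commits, so every blocked build makes the next one more likely to fail as well.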

Solving for Happiness

I thought the best solution was to formalize a project, have a clear success outcome, and have a single person with the responsibility for (and therefore authority over) the build / deploy systems. The first problem was determining a clear success criterion… any time metric I chose would be somewhat arbitrary, so instead I chose engineering happiness as the success criterion, or more specifically, that pushing to production was no longer causing unhappiness. While I generally hate subjective success criteria, there were ways to assess progress through 1:1 conversations and Likert scale surveys. We also had great (highly objective) data around commit to deploy times, so we could see the correlation to the more subjective happiness index.

There was some pretty straightforward work to improve the actual test and deploy speeds, including simple things like adding more hardware, the slightly less simple sorting of tests to run by speed (a surprisingly large performance gain), and fixing the slowest of the tests. But some of the most important gains came from the human parts of the deployment system… engineers were required to immediately revert code and fix the issue in their sandbox rather than blocking the build system. This was not a popular policy change, as engineers immediately experienced the direct impact from a failed commit but didn’t immediately see any gains to the overall system. But after a few weeks the improvements were clear in the average commit to deploy time. And giving credit where it is due, Eric Prestemon was the “Buildbot Sheriff” who identified so many of the opportunities for improvement and delivered the results… many people helped, but Eric had the burden of hearing a lot of critical feedback about unpopular policy changes (eventually outweighed by the praise for the results he produced).
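I haven’t described which direction the tests were sorted or how they were split across machines, so treat the following Python sketch as one plausible reading rather than what we built: when tests run in parallel, starting the slowest tests first (greedy “longest first” scheduling) tends to shorten the total wall-clock time. The function name and the durations are made up for illustration.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each test, in the given order, is handed to
    whichever worker currently has the least work queued."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for d in durations:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)


if __name__ == "__main__":
    # Hypothetical test durations in minutes, with one slow test at the end.
    tests = [1, 1, 1, 1, 2, 2, 8]
    print("given order:  ", makespan(tests, workers=2))
    print("longest first:", makespan(sorted(tests, reverse=True), workers=2))
```

With these numbers the given order leaves the 8-minute test until last and finishes in 12 minutes, while running the longest tests first finishes in 8.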

Eventually the build system frustration ceased being a common topic in 1:1 meetings, and it faded away as a meaningful problem in engineering surveys. 12 minutes. When the commit to live time is 12 minutes, this system is operating well. That became the new value for alerting – under 12 minutes, all is good, after that we need to actively drive improvements. In practice, deploy time was usually around 11 minutes, 8 for parallel test builds/runs and 3 minutes for rollout checks (thanks for the reminder, @jwatte).

Diminishing Returns

I have been asked why we didn’t try to make the build and deploy systems as fast as possible… why not 2 minutes? We constantly worked on optimizing these systems, adding separate hypothesis builds, automatically isolating build servers to allow diagnosing and fixing without blocking, etc. And sometimes deployment would take less than 9 minutes.

However, much like the difference between 99.99% and 99.999% uptime for a service, the difference to the customer can be negligible while the resources necessary to deliver that improvement can be extraordinary. When business requirements are being met and engineering is happy with deploy times, the resources necessary to dramatically improve were better spent delivering value to customers.
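For a sense of scale on the uptime comparison, here is a small Python calculation of the downtime each availability level allows per year (assuming a 365-day year; the function name is just for illustration).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

That works out to roughly 53 minutes a year versus roughly 5 minutes a year: a gap most customers will never notice, but one that can be very expensive to close.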

Key Takeaways

  1. Working in a (well functioning) Continuous Delivery environment is empowering, naturally encourages other strong technical practices, and is hard to retreat from once experienced.
  2. Certain problems fall into what I call the “roommates and dishes” category, where “it’s everybody’s responsibility” sounds good, but in practice actually means “it’s nobody’s responsibility”. In these cases it is better to find a results-driven person and ensure they have responsibility and corresponding authority.
  3. Hire Eric Prestemon or somebody like him.

 

Have you worked in a Continuous Delivery environment and experienced non-obvious scaling challenges? I’d like to hear about your experience – please leave a comment!