technical Archives - Page 2 of 2 - Brett's Random Musings

The art of deterring content theft is an ongoing game of cat and mouse – generally any barrier you create to prevent theft is temporary, as thieves continue to find new ways to steal the content, so long as the value of the content exceeds the effort necessary to steal it. For this reason, it can often be more effective to hinder thieves instead of trying to stop them.

I encounter this “hinder don’t halt” pattern with others that run large services, and you can see this reflected in solutions like shadow banning. One of the most common themes I hear is the satisfaction that comes from solutions that cause frustration for bad actors, so I’m sharing one from my personal experiences…

At IMVU, customers called Creators make content that they sell to other IMVU customers. The content they create is 3D items like avatar clothing, items to decorate an environment, and ways to customize an avatar. This content creates real value for other IMVU customers, who spend real money to purchase it from the catalog of over 10 million items. While many Creators create content just for the enjoyment of creating, some do it as a business, with a few making over $100K US annually. Whether creating for pleasure or business, all Creators hated having their work stolen. And, since there is real money from the sales of content, there is real incentive for thieves to try to steal it.

At one point we discovered a site that was selling a service that would allow people to steal Creator content without paying for it. It was pretty easy to detect the service and the initial response was blocking them, which immediately broke their service completely and, not surprisingly, made the thieves quickly respond by finding a new way around the block. The block lasted less than a day and the thieves were back in business.

The next response was more fun… rather than blocking the thieves, we made their service not work… sometimes… and inconsistently. Code was added to detect thieves accessing content and randomly some content being accessed would be mildly corrupted. The corruption could be configured to occur at certain rates, on certain items, at certain times of day, and be disabled based on what appeared to be testing for the corruption. As a result, customers of the thieves started getting inconsistent results, that would sometimes lead to content failing to load and even crashes. If you are an engineer reading this, you understand why this is a nightmare scenario to debug and fix… customers are reporting different failure cases with no consistent way of reproducing the problem to understand the cause. And, since your code is working fine, the bug isn’t going to be found… you eventually have to discover that you are being served different content than is being served to legitimate customers.

The result of hindering was much more effective than blocking… it took many weeks for the thieves to understand what was happening and, during this time, we could see them getting bashed by the people that paid them because the stolen content was ruining their experience. By the time the thieves had found another solution, they had such a bad reputation that people were less willing to give them money.

If you have dealt with content thieves I would be interested in hearing your stories, successful or not. Please leave a reply, below!

Credits
Cat and mouse chase image by Jeroen Moes
Dungeons & Dragons dice by Lydia

A few days ago Jeff Atwood (Coding Horror) suggested a good measure of a tech company’s health is the time it takes to have a simple change become available to customers:

In 2011, it was 11 minutes at IMVU (the delay largely as a result of a gradual rollout to ensure it didn’t impact key metrics). https://t.co/RGMtR9Lo3T

— Brett G. Durrett (@bdurrett) October 26, 2018

And while there are numerous metrics that determine the health of a tech company (see Jez Humble’s book, Accelerate, for an amazingly comprehensive overview), Continuous Delivery strongly correlates to successful outcomes.

I support Jeff’s assertion – I witnessed the value created by Continuous Delivery at IMVU, where we pioneered some of the crazy processes that would be followed by more sane practitioners. From day one, IMVU placed value on the speed of product iteration and “designed” build systems accordingly. In 2006, development was done in Windows using Reactor Server to provide the LAMP-ish stack, and the deploy process looked something like this:

 svn-server$ rcp website/* production:/var/www/

If you’re wondering why I omitted the test framework, I didn’t. Code went from a local Windows sandbox environment, to source control, to live on a Linux environment running a version of PHP different than the local sandbox. Fun! While there were numerous problems with this system, iteration velocity was amazing (those 1 word copy changes could ship in less than 5 minutes, and so could full features). This development velocity was a key component enabling IMVU to build a large, successful business in a space where the failed companies outnumber survivors 50 to 1.

And to be clear, time to get a change live to customers doesn’t in itself indicate healthy tech, but a lot of tech health comes from the corresponding systems necessary to make rapid deployment work.

Fast forward a few years to 2008 and I transition from leading the operations team to leading the engineering organization, where the build and deploy systems had matured, with reasonable test coverage, and automated deployment, with automated rollbacks when something unfortunate made it into production. It was pretty cool, even though publicly the process was mostly received with the sentiment, “that will never work , and certainly won’t scale”.

Scaling Problems

One of my first challenges as the new engineering leader was a team unhappy about their ability to get work done because it was taking hours for a commit to become live to customers. It’s astonishing when you think about it – every engineer in the company had come from companies where the commit to live process was typically measured in months, but once they experienced the value of Continuous Delivery, anything more than minutes seemed unbearable.

Digging into the problem, I came to understand that the problem was not slow builds (although that was part of it), the most significant issues were caused from the shared responsibility for build systems, combined with the desire to deliver features to customers, created a tragedy of the commons. When an engineer had a failure in the build system, the optimal solution for that engineer was to fix the problem in place, blocking the build system for anybody else in the queue, which meant the number of commits in the next build increased, which meant the chance of a failure in the that build increased, ad infinitum. The result was pushing to production could be blocked for hours, sometimes most of the work day.

Solving for Happiness

I thought the best solution was to formalize a project, have a clear success outcome, and have a single person with the responsibility for (and therefore authority over) the build / deploy systems. The first problem was determining a clear success criteria… anything time metrics I chose would be somewhat arbitrary, so instead I chose engineering happiness as the success criteria, or more specifically, pushing to production was no longer causing unhappiness. While I generally hate subjective success criteria, there were ways to assess progress through 1:1 conversations and Likert scale surveys. We also had great (highly objective) data around commit to deploy times, so we could see the correlation to the more subjective happiness index.

There was some pretty straightforward work to improve the actual test and deploy speeds, including simple things like adding more hardware and the slightly less simple sorting tests to run by speed (a surprisingly large performance gain), and fixing the slowest of the tests. But some of the most important gains came from the human parts of the deployment system… engineers were required to immediately revert code and fix the issue in their sandbox rather than blocking the build system. This was not a popular policy change as immediately engineers experienced the direct impact from a failed commit, but didn’t immediately see any gains to the overall system. But after a few weeks the improvements were clear in the average commit to deploy time. And giving credit where it is due, Eric Prestemon was the “Buildbot Sheriff” that identified so many of the opportunities for improvement and delivered the results… many people helped, but Eric had the burden of hearing a lot of critical feedback about unpopular policy changes (eventually outweighed by the praise for the results he produced).

Eventually the build system frustration ceased being a common topic in 1:1 meetings, and it faded away as a meaningful problem in engineering surveys. 12 minutes. When the commit to live time is 12 minutes, this system is operating well. That became the new value for alerting – under 12 minutes, all is good, after that we need to actively drive improvements. In practice, deploy time was usually around 11 minutes, 8 for parallel test builds/runs and 3 minutes for rollout checks (thanks for the reminder, @jwatte).

Diminishing Returns

I have been asked why we didn’t try to make the build and deploy systems as fast as possible… why not 2 minutes? We constantly worked on optimizing these systems, adding separate hypothesis builds, automatically isolating build servers to allow diagnosing and fixing without blocking, etc. And sometimes deployment would take less than 9 minutes.

However, much like the difference between 99.99% and 99.999% uptime for a service, the difference to the customer can be negligible while the resources necessary to deliver that improvement can be extraordinary. When business requirements are being met and engineering is happy with deploy times, the resources necessary to dramatically improve were better spent delivering value to customers.

Key Takeaways

Working in a (well functioning) Continuous Delivery environment is empowering, naturally encourages other strong technical practices, and is hard to retreat from once experienced.
Certain problems fall into what I call the “roommates and dishes” category, where “it’s everybody’s responsibility” sounds good, but in practice actually means “it’s nobody’s responsibility”. In these cases it is better to find a results-driven person and ensure they have responsibility and corresponding authority.
Hire Eric Prestemon or somebody like him.

Have you worked in a Continuous Delivery environment and experienced non-obvious scaling challenges? I’d like to hear about your experience – please leave a comment!

Category: technical

Hinder, Don’t Halt: Griefing Content Thieves for Fun and Profit

Scaling Continuous Delivery: Happiness as a Metric

Scaling Problems

Solving for Happiness

Diminishing Returns

Key Takeaways