Aaron Kalair

@AaronKalair

Scale Summit 2018 Notes

This year I went to my 3rd Scale Summit event, an unconference about building scalable, high performance systems.

I’ve summarised notes from the various sessions I attended…

Scaling Teams & Culture Keynote

The day started with a Keynote from @Geek_Manager titled 5 Things I Wish I Knew Sooner About Scaling Teams & Culture…

1. DRY doesn’t work for human communication

  • People need to hear things more than once for it to register
  • Architectural decision records show context for why things were done in the past

2. Scaling teams is about creating conditions for success

3. Inflection points

  • Different things come for free at different inflection points
  • In smaller teams it’s easier for everyone to know what’s going on
  • Focus on the right problems at the right time
  • You’re not Netflix and probably don’t have the same problems
  • Stumbling on Happiness
  • Humans edit the past and are bad at predicting the future
  • The most useful person to learn from is just ahead of you

4. With people, observability > testing

  • Your impact may not match your intent, no matter how hard you try
  • Need to check if your intentions match reality

5. Culture add matters a lot more than culture fit

  • Focus on getting the most out of difference
  • We’re not interchangeable resource units
  • Think of people and roles as a matter of casting
  • Assemble teams with complementary abilities

Pragmatic Approaches to Real Problems

  • 1 massive server is easier to deal with than a distributed system and costs about as much as a few smaller ones
  • The risk of failed automation can be greater than the cost of the downtime you incur by doing things like manual failovers
  • Simpler, smaller, systems let you get to market faster
  • What's the cost of being wrong vs the cost of making sure you’re right?
  • What’s the business context?
  • GDPR may tip the cost of taking the risk in a different direction
  • Tradeoffs against the cost of it becoming a legacy system you have to maintain
  • Legacy systems can just become anything you don’t understand
  • Make it clear what the priorities are and why you are making the tradeoffs
  • You can build new replacement systems for legacy things and never fully migrate to them
  • Make old things better before building replacements. Ringfence the old systems and put APIs in front of them
  • Phrase Jira tickets in terms of problems so you can discuss lots of solutions
  • How will you know it’s working when you ship it?
  • Explain “Why can’t we just…?”

Tin for Cloud Kids

If you’ve only ever used the cloud, how do you deal with a project that requires you to run on ‘real’ servers?

  • Run your own hypervisor? Many devices like NICs support passthrough to the VM

Killing and replacing machines

  • Can netboot and install machines
  • Speed of doing so can be limited by your BIOS
  • Can achieve 10-15 minute cycles for a full install

Create ground rules that no server is sacred

  • Cattle not pets
  • Think about availability when building everything
  • Have hot spares
  • Raw hardware is cheap compared to AWS
  • Have a lower threshold for capacity when you buy new hardware
  • Consider support packages and hardware life cycles
  • Classify expense as opex rather than capex
  • You can get a new server within 6 hours
  • Capacity planning needs to be someone’s job
  • Can use out of warranty hardware for things like Jenkins
  • You have a pool of compute rather than servers with dedicated jobs
  • You will hit the physical limits of the hardware at some point if you continue to run more VMs on it. E.g. switching on NICs is done in hardware up to a point, after which it’s emulated in software much more slowly
  • Have to deal with disk failures
  • Metal as a Service from Ubuntu
  • Packet.net for buying bare metal compute

Production Performance Monitoring

New Relic

  • Easy to get started
  • Very expensive
  • You push data to them, so if your server falls over it may not be able to send the crucial data you need

Tracing

  • Zipkin
  • Jaeger
  • XRay, cheap, UX is poor
  • Sampling can lose the traces you really need
  • Can’t choose to trace retroactively after you’ve hit an error condition, need to choose to trace at the ingress point
  • XRay < OpenTracing because you can’t switch it out as easily
  • Hard to get started as you need people to modify their code
  • Get buy-in by showing it off on hack days
  • Span tracing can be tricky if context passes between threads
  • Even with 1 thread it needs to work with all the libraries you use
  • Easy for new code, but no one’s really going to go back and instrument all the old code
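
The point about having to choose to trace at the ingress point can be sketched in a few lines. This is a minimal illustration of head-based sampling, not how Zipkin, Jaeger or XRay implement it. The `TraceContext` class and the `start_trace`, `child_context` and `record_span` functions are hypothetical names made up for this example.

```python
import random
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceContext:
    """Hypothetical trace context passed along with every downstream call."""
    trace_id: str
    sampled: bool


def start_trace(sample_rate: float, rng: random.Random) -> TraceContext:
    """Make the sampling decision once, at the ingress point."""
    return TraceContext(trace_id=uuid.uuid4().hex, sampled=rng.random() < sample_rate)


def child_context(parent: TraceContext) -> TraceContext:
    """Downstream calls inherit the decision; they cannot change it later."""
    return TraceContext(trace_id=parent.trace_id, sampled=parent.sampled)


def record_span(ctx: TraceContext, name: str, spans: list) -> None:
    """Only sampled requests produce spans, even if the request later fails."""
    if ctx.sampled:
        spans.append((ctx.trace_id, name))
```

Because the decision is made before anyone knows whether the request will fail, an unsampled request that hits an error leaves no spans behind, which is the trade-off described above.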

Graphite vs Prometheus

  • Pull model can be hard to deploy for existing projects, if you are in regulated environments or have security teams that make it hard to get the access to the scraping endpoint
  • The Pushgateway can be a way around this
  • Need code changes to expose metrics endpoint for Prometheus
  • Managing TSDBs is hard
  • Prometheus struggles with a 300–400 node Kubernetes cluster, need to add more instances, federation is hard
  • Hosted Graphite is nice, adds alarms on top

Why use Prometheus?

  • Kubernetes adds endpoints for it
  • Standardised a metrics format
  • Can add labels to metrics
  • Query language and data storage is nice
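
To give a feel for the metrics format and labels, here is a rough sketch that renders a counter in the Prometheus text exposition format. The `render_counter` function is a hypothetical helper written for this example; a real service would more likely use an official client library and serve the output from an HTTP endpoint. The sketch assumes every sample has at least one label and does not escape special characters in label values.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "get"), ("status", "200")): 3},
)
```

Each distinct combination of label values becomes its own time series, which is what makes the query language useful for slicing a metric by method, status and so on.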

HoneyComb?

  • Nice for seeing the bigger picture and then drilling down into common factors
  • Not a logging replacement
  • Cheap way to store lots of datapoints, costs less than building and maintaining a similar solution yourself
  • Can sample events based on if they are a success or error
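
Sampling based on success or error might look something like the sketch below. This is not HoneyComb’s API; `sample_event` and the `status` and `sample_rate` field names are assumptions made for illustration.

```python
import random


def sample_event(event: dict, success_rate: int, rng: random.Random):
    """Keep every error, but only about 1 in `success_rate` successful events.

    Returns the event annotated with its sample rate, or None if dropped.
    """
    if event["status"] == "error":
        return {**event, "sample_rate": 1}
    if rng.randrange(success_rate) == 0:
        return {**event, "sample_rate": success_rate}
    return None
```

Recording the sample rate on each kept event means whatever reads the events later can weight them, so one kept success with a rate of 100 can stand in for roughly 100 real ones.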

Logging

  • Buy an ELK stack, everyone has them and maintaining your own is hard
  • Splunk is amazing if you have the money
  • Attach trace ids to logs so you can link them back
  • Turn async writes on for Elasticsearch
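
One way to attach trace ids to logs, sketched with Python’s standard `logging` module. The `current_trace_id` variable, the `TraceIdFilter` class and the `build_logger` helper are hypothetical names; in a real service the trace id would be set by whatever handles the incoming request.

```python
import contextvars
import io
import logging

# Hypothetical per-request variable holding the current trace id.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Attach a handler whose format includes the trace id."""
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# Usage: log to an in-memory stream so the result can be inspected.
stream = io.StringIO()
logger = build_logger("trace-demo", logging.StreamHandler(stream))
current_trace_id.set("abc123")  # would normally be set at the ingress point
logger.info("payment failed")
output = stream.getvalue()
```

With the id in every line, a search for `trace_id=abc123` in the log store pulls up everything that happened during that one request.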

Strategies for Testing in Production

Monitoring systems that log into prod and perform a bunch of actions work well for a few people in the room

  • Have found breakages and issues belonging to other teams
  • Add headers to requests to identify them as test runs so that systems can decide how to respond to them. Useful for things like payment systems that can drop the request and return dummy data
  • Need to be sure that you update the mock data when the systems change
  • Be careful when interacting with 3rd party APIs, it’s easy to get banned for posting data that looks like spam, or to hit rate limits and break your prod app
  • Running these tests against prod systems means they also emit metrics / logs to the standard systems that you can monitor
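
The test run header idea can be sketched as follows. The `X-Test-Run` header name, the `handle_payment` function and the shape of the dummy response are all assumptions for illustration, and `charge_card` is a stand-in for a real payment provider call.

```python
TEST_RUN_HEADER = "X-Test-Run"  # hypothetical header name


def charge_card(amount_pence: int) -> dict:
    """Stand-in for the real call to a payment provider."""
    raise RuntimeError("real payment provider not available in this sketch")


def handle_payment(headers: dict, amount_pence: int) -> dict:
    """Return dummy data for flagged test runs instead of charging a card."""
    if headers.get(TEST_RUN_HEADER) == "true":
        return {"status": "ok", "charged": 0, "test_run": True}
    return charge_card(amount_pence)
```

This sketch treats headers as a plain dict, so the lookup is case sensitive; real HTTP frameworks usually normalise header names for you. The dummy response is also exactly the kind of mock data that needs updating when the real system changes.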

Canarying

  • CNAMEs that let you switch between environments
  • Roll out your code to a subset of users, controlled by things like feature flags
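
A common way to pick the subset of users is a percentage rollout keyed on a hash of the user id. This is a minimal sketch of that idea rather than any particular feature flag product; `in_rollout` and its parameters are hypothetical.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percentage: int) -> bool:
    """Deterministically place a user inside or outside a rollout.

    Hashing the flag name with the user id gives each user a stable
    bucket from 0 to 99, so the same user always sees the same behaviour.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Because the bucket is stable, raising the percentage from 10 to 25 only adds users; nobody who already had the new code path loses it.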

Shadow Traffic deploys

  • Be careful about the extra load you’re placing on downstream systems

Ways to duplicate traffic

  • Have code in the client, controlled by a feature flag that makes requests to new and old systems
  • Use something like Kafka streams to replay prod traffic against new systems, or pipe it into development environments
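
The client-side approach could look roughly like this. `call_with_shadow` is a hypothetical helper, and the old and new systems are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable


def call_with_shadow(
    request: dict,
    old_system: Callable[[dict], dict],
    new_system: Callable[[dict], dict],
    shadow_enabled: bool,
    mismatches: list,
) -> dict:
    """Serve from the old system; optionally mirror the request to the new one."""
    response = old_system(request)
    if shadow_enabled:
        try:
            shadow_response = new_system(request)
            if shadow_response != response:
                mismatches.append((request, response, shadow_response))
        except Exception as exc:
            # The new system must never break the user-facing path.
            mismatches.append((request, response, repr(exc)))
    return response
```

The user only ever sees the old system’s response. In this sketch the shadow call is made inline, which adds latency; in practice you would probably want to fire it asynchronously.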

Envoy Proxy

  • Rate limiting requests to 3rd party APIs
  • Test credential swapping
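
Envoy does its rate limiting in the proxy through configuration, which isn’t reproduced here. As an illustration of the general idea only, this is a token bucket sketch in Python; the `TokenBucket` class is a hypothetical example and not part of Envoy.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a call may go out at time `now` (in seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Time is passed in explicitly to keep the sketch easy to test; a real caller would pass something like `time.monotonic()` and hold back or queue requests to the 3rd party API whenever `allow` returns False.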

Feature Flags

  • Put all new features behind feature flags so you can deploy them in small pieces early on
  • Be sure to remove the flag at the end
  • Product owners can control who the flag is turned on for and when

Observability or has anyone tried HoneyComb?

  • If you have events you can generate metrics
  • If you just store metrics you lose the events
  • Can correlate things like CPU spikes with other events occurring in the system
  • Don’t know what data you’ll need until the event has passed
  • Metrics show you that you have a problem; observability shows you the common traits of anomalous metrics
  • Etsy Skyline
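
The events vs metrics point can be shown with a small sketch. The event fields (`endpoint`, `status`, `customer`) and the `count_by` helper are made up for illustration.

```python
from collections import Counter


def count_by(events: list, *fields: str) -> Counter:
    """Derive a counter metric from raw events, grouped by any chosen fields."""
    return Counter(tuple(event.get(f) for f in fields) for event in events)


events = [
    {"endpoint": "/pay", "status": 500, "customer": "a"},
    {"endpoint": "/pay", "status": 500, "customer": "a"},
    {"endpoint": "/pay", "status": 200, "customer": "b"},
    {"endpoint": "/cart", "status": 200, "customer": "a"},
]

# A metric you might have thought to store up front.
by_status = count_by(events, "status")

# A question you only think of after the incident: which customers saw errors?
errors_by_customer = count_by([e for e in events if e["status"] == 500], "customer")
```

If only `by_status` had been stored, the second question could not be answered afterwards, because the customer field is gone once the events have been aggregated.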

eBPF

  • Awesome but new, not in a lot of kernels
  • Can hook into events happening in the kernel with very little overhead
  • Hard to use currently

What’s Changed Since Scale Summit 2017 / Predictions for Next Year

Predictions from last year

Alexa / Voice interfaces won't take off

  • Seem to be bigger in the USA than in Europe
  • HomePod doesn’t seem to have taken off
  • Alexa laughs at you
  • If you live in a small house and have an Alexa that controls your lock, you can shout “Alexa open the door” from outside the house

Rust

  • Go seems to be big in the operations field but not as popular elsewhere
  • Firefox Servo rewrite big success for Rust
  • WASM looks interesting

Brexit

  • Still a mess and unclear

IR35 / Gov Tech

  • People have been leaving GDS and no one is really taking over the community leader roles

Yarn

  • Package lock files have become more popular
  • People are moving from npm -> yarn
  • Hard to keep up with the rate packages are updated
  • Dependabot alerts you about updates and looks at the test run results for the new version across the internet to work out how safe it is

Kubernetes on AWS

  • Happened
  • All the major clouds now have it
  • Not ready for prime time yet, released at re:Invent as a marketing thing
  • Kubernetes is complex, Nomad is easier to run if you’re going to do it yourself
  • Lots of excitement about managing stateful services on Kubernetes now

ELK CVE

  • Didn’t happen
  • Ransomware for Elasticsearch clusters accidentally exposed to the internet

Predictions for next year

Smarter viruses that don’t kill the host

  • More valuable to stay hidden and mine cryptocurrencies
  • Until that market collapses

SWE Ethics

  • Will become more of a hot topic, is growing after the Volvo incident
  • Machine Learning ethics will become a bigger topic
  • More attacks against machine learning

Crowd Sourcing Behind the Scenes

  • Expensify using Mechanical Turk
  • Duolingo swapping translations with users learning the opposite languages
  • Will continue to grow

More attacks against hardware

A country will have its CA chain revoked

Social Media Regulation

  • More transparency around who paid for ads you see
  • More spam messages that are really close to looking like a human wrote them

Private companies will start competing with branches of Government

  • Citymapper buses

You can view tweets from the event on the #ScaleSummit18 hashtag; I’m even in one of them.
