This year I went to my 3rd <a href="http://www.scalesummit.org/" target="_blank">Scale Summit</a> event, an unconference about building scalable, high performance systems.
I’ve summarised notes from the various sessions I attended…
Scaling Teams & Culture Keynote
The day started with a keynote from @Geek_Manager titled “5 Things I Wish I Knew Sooner About Scaling Teams & Culture”…
Humans edit the past and are bad at predicting the future
The most useful person to learn from is just ahead of you
4. With people, observability > testing
Your impact may not match your intent, no matter how hard you try
Need to check if your intentions match reality
5. Culture add matters a lot more than culture fit
Focus on getting the most out of difference
We’re not interchangeable resource units
Think of people and roles as a matter of casting
Assemble teams with complementary abilities
Pragmatic Approaches to Real Problems
1 massive server is easier to deal with than a distributed system and costs about as much as a few smaller ones
The risk of failed automation can be greater than the cost of the downtime you’d incur by doing things like manual failovers
Simpler, smaller systems let you get to market faster
What’s the cost of being wrong vs the cost of making sure you’re right?
What’s the business context?
GDPR may tip the cost of taking the risk in a different direction
Tradeoffs against the cost of it becoming a legacy system you have to maintain
“Legacy system” can end up meaning anything you don’t understand
Make it clear what the priorities are and why you are making the tradeoffs
You can build new replacement systems for legacy things and never fully migrate to them
Make old things better before building replacements. Ringfence the old systems and put APIs in front of them (see the sketch at the end of this section)
Phrase Jira tickets in terms of problems so you can discuss lots of solutions
How will you know it’s working when you ship it?
Explain “Why can’t we just…?”
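To make the “APIs in front of legacy systems” point concrete, here’s a minimal sketch of a facade service. The legacy host, path and field names are made up for illustration; the point is only that new consumers depend on the facade’s API shape, never on the legacy system directly.

```python
# A minimal sketch of ringfencing a legacy system behind a thin API facade.
# "legacy-billing.internal", the CGI path and the field names are hypothetical.
from flask import Flask, jsonify
import requests

app = Flask(__name__)
LEGACY_BASE = "http://legacy-billing.internal"  # hypothetical legacy host


@app.route("/api/v1/invoices/<invoice_id>")
def get_invoice(invoice_id):
    # Call the legacy system, but keep its quirks hidden behind this facade
    # so new consumers only ever see the new API shape.
    resp = requests.get(
        f"{LEGACY_BASE}/cgi-bin/inv.pl", params={"id": invoice_id}, timeout=5
    )
    resp.raise_for_status()
    legacy = resp.json()
    return jsonify({
        "id": invoice_id,
        "amount_pence": legacy.get("amt"),       # translate legacy field names
        "status": legacy.get("state", "unknown"),
    })


if __name__ == "__main__":
    app.run(port=8080)
```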
Tin for Cloud Kids
If you’ve only ever used the cloud, how do you deal with a project that requires you to run on ‘real’ servers?
Run your own hypervisor? Many devices like NICs support passthrough to the VM
Killing and replacing machines
Can netboot and install machines
Speed of doing so can be limited by your BIOS
Can achieve 10–15 minute cycles for a full install
Create ground rules that no server is sacred
Cattle not pets
Think about availability when building everything
Have hot spares
Raw hardware is cheap compared to AWS
Have a lower threshold for capacity when you buy new hardware
Consider support packages and hardware life cycles
Classify the expense as opex rather than capex
You can get a new server within 6 hours
Capacity planning needs to be someone’s job
Can use out of warranty hardware for things like Jenkins
You have a pool of compute rather than servers with dedicated jobs
You will hit the physical limits of the hardware at some point if you keep running more VMs on it. E.g. switching on NICs is done in hardware up to a point, beyond which it’s emulated in software much more slowly
Tracing
You can’t choose to trace retroactively after you’ve hit an error condition; you need to choose to trace at the ingress point
X-Ray < OpenTracing, because you can’t switch X-Ray out as easily
Hard to get started as you need people to modify their code
Get buy-in by showing it off on hack days
Span tracing can be tricky if context passes between threads (see the sketch after this list)
Even with 1 thread it needs to work with all the libraries you use
Easy for new code, but no one’s really going to go back and instrument all the old code
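As an illustration of starting the trace at the ingress point and carrying context across a thread boundary, here’s a minimal sketch using the OpenTracing Python API. The operation names and request id are made up, and a real tracer (Jaeger, X-Ray, etc.) would have to be registered for the spans to go anywhere; by default this uses the library’s no-op tracer.

```python
# A minimal sketch: start the trace at the ingress point, pass the span
# explicitly to work running on another thread, and create a child span there.
import threading
import opentracing

tracer = opentracing.tracer  # global tracer; a no-op unless one is registered


def handle_request(request_id):
    # Decide to trace here, at the ingress point -- you can't turn tracing on
    # retroactively once an error has already happened downstream.
    root = tracer.start_span("handle_request")
    root.set_tag("request.id", request_id)
    try:
        worker = threading.Thread(target=do_work, args=(root,))
        worker.start()
        worker.join()
    finally:
        root.finish()


def do_work(parent_span):
    # Context doesn't follow threads automatically: the parent span is passed
    # in, and the child span is created from it explicitly.
    child = tracer.start_span("do_work", child_of=parent_span)
    try:
        pass  # the actual work would go here
    finally:
        child.finish()


if __name__ == "__main__":
    handle_request("req-123")
```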
Graphite vs Prometheus
The pull model can be hard to deploy for existing projects if you are in a regulated environment, or have security teams that make it hard to get access to the scraping endpoint
The Pushgateway can be a way around this
Need code changes to expose a metrics endpoint for Prometheus (see the sketch at the end of this section)
Managing TSDBs is hard
Prometheus struggles with a 300–400 node Kubernetes cluster; you need to add more instances, and federation is hard
Nice for seeing the bigger picture and then drilling down into common factors
Not a logging replacement
Cheap way to store lots of datapoints; costs less than building and maintaining a similar solution yourself
Can sample events based on whether they are a success or an error
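As an example of the code change needed to expose a scrape endpoint, here’s a minimal sketch using the official prometheus_client library; the metric names, port and fake workload are all illustrative.

```python
# A minimal sketch of exposing a Prometheus metrics endpoint from a service.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")


@LATENCY.time()  # record how long each call takes
def handle_request():
    REQUESTS.inc()
    time.sleep(random.random() / 10)  # stand-in for real work


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        handle_request()
```

If the pull model is the sticking point, the same library also provides push_to_gateway for pushing metrics from short-lived or locked-down jobs to a Pushgateway instead.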
Logging
Buy an ELK stack, everyone has them and maintaining your own is hard
Splunk is amazing if you have the money
Attach trace IDs to logs so you can link them back to traces (see the sketch after this list)
Turn async writes on for Elasticsearch
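Here’s a minimal sketch of attaching a trace ID to every log line using only the standard library’s logging filters; the trace ID is just a generated UUID standing in for whatever your tracer actually provides.

```python
# A minimal sketch: inject a trace id into every log record so log lines can
# be linked back to the trace they belong to.
import logging
import uuid


class TraceIdFilter(logging.Filter):
    """Adds the current trace id to every record passing through the logger."""

    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id
        return True


def get_logger(trace_id):
    logger = logging.getLogger("app")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s trace=%(trace_id)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.addFilter(TraceIdFilter(trace_id))
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    logger = get_logger(trace_id=uuid.uuid4().hex)
    logger.info("charging card")  # => ... trace=3f2a... INFO charging card
```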
Strategies for Testing in Production
Monitoring systems that log into prod and perform a bunch of actions work well for a few people in the room
Have found breakages and issues belonging to other teams
Add headers to requests to identify them as test runs so that systems can decide how to respond (see the sketch after this list). Useful for things like payment systems, which can drop the request and return dummy data
Need to be sure that you update the mock data when the systems change
Be careful when interacting with 3rd party APIs; it’s easy to get banned for posting data that looks like spam, or to hit rate limits and break your prod app
Running these tests against prod systems means they also emit metrics / logs to the standard systems that you can monitor
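A minimal sketch of the test-header idea, assuming a hypothetical “X-Synthetic-Test” header and a toy payments endpoint; real systems would need to agree on the header name and, as noted above, keep the dummy response in step with the real one.

```python
# A minimal sketch of a service recognising synthetic test traffic by a
# request header and returning canned data instead of charging anyone.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/payments", methods=["POST"])
def take_payment():
    if request.headers.get("X-Synthetic-Test") == "true":
        # Test run: drop the request and answer with dummy data. This mock
        # payload has to be updated whenever the real response changes.
        return jsonify({"status": "authorised", "reference": "TEST-0000"})
    return jsonify(charge_card(request.get_json()))


def charge_card(payload):
    # Real payment logic would live here.
    return {"status": "authorised", "reference": "LIVE-1234"}


if __name__ == "__main__":
    app.run(port=8081)
```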
Canarying
CNAMEs that let you switch between environments
Roll out your code to a subset of users, controlled by things like feature flags (see the sketch below)
If you can roll back fast, there’s less need for things like blue/green deploys
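A minimal sketch of a percentage-based feature flag for canarying: a stable subset of users gets the new code path, and rolling back is just dropping the percentage to zero. The flag store here is a plain dict for illustration; a real system would use a flag service or config store.

```python
# A minimal sketch of percentage-based canarying with a feature flag.
import hashlib

FLAGS = {"new_checkout": 10}  # percentage of users who get the new path


def is_enabled(flag, user_id):
    percentage = FLAGS.get(flag, 0)
    # Hash the user id so the same user consistently lands in the same bucket.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percentage


def checkout(user_id):
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"


if __name__ == "__main__":
    enabled = sum(is_enabled("new_checkout", f"user-{i}") for i in range(1000))
    print(enabled, "of 1000 users canaried")
```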