This year I went to my 3rd Scale Summit, an unconference about building scalable, high-performance systems. I've summarised my notes from the various sessions I attended below.

## Scaling Teams & Culture Keynote

The day started with a keynote from @Geek_Manager titled "5 Things I Wish I Knew Sooner About Scaling Teams & Culture". She has written the talk up on her own blog: _Scaling Teams & Culture Keynote at Scale Summit_ (blog.geekmanager.co.uk).

1. Dry doesn't work for human communication
   - People need to hear things more than once for it to register
   - Architecture decision records show the context for why things were done in the past
2. Scaling teams is about creating conditions for success
   - _First, Break All The Rules_
   - Understand _Drive_: Purpose, Autonomy, Mastery, and Inclusion
3. Inflection points
   - Different things come for free at different inflection points
   - In smaller teams it's easier for everyone to know what's going on
   - Focus on the right problems at the right time: you're not Netflix, and probably don't have the same problems
   - _Stumbling on Happiness_: humans edit the past and are bad at predicting the future
   - The most useful person to learn from is the one just ahead of you
4. With people, observability > testing
   - Your impact may not match your intent, no matter how hard you try
   - You need to check whether your intentions match reality
5. Culture add matters a lot more than culture fit
   - Focus on getting the most out of difference: we're not interchangeable resource units
   - Think of people and roles as a matter of casting; assemble teams with complementary abilities

## Pragmatic Approaches to Real Problems

- One massive server is easier to deal with than a distributed system, and costs about as much as a few smaller ones
- The risk of failed automation can be greater than the cost of the downtime from doing things like manual failovers
- Simpler, smaller systems let you get to market faster
- What's the cost of being wrong vs the cost of making sure you're right? What's the business context? GDPR may tip the cost of taking the risk in a different direction
- Trade off against the cost of it becoming a legacy system you have to maintain; "legacy" can end up meaning anything you no longer understand
- Make it clear what the priorities are and why you are making the trade-offs
- You can build new replacement systems for legacy things and never fully migrate to them
- Make old things better before building replacements: ring-fence the old systems and put APIs in front of them
- Phrase Jira tickets in terms of problems, so you can discuss lots of solutions
- How will you know it's working when you ship it?
- Explain the answers to "why can't we just…?"

## Tin for Cloud Kids

If you've only ever used the cloud, how do you deal with a project that requires you to run on 'real' servers? Do you run your own hypervisor?
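As a minimal sketch of that last question: before running your own hypervisor (e.g. KVM) you would first check that the host actually supports hardware virtualization. This check is my own illustration, not from the session, and assumes a Linux host:

```python
from pathlib import Path


def virtualization_available() -> bool:
    """Best-effort check that a Linux host can run hardware-accelerated VMs."""
    cpuinfo = Path("/proc/cpuinfo")
    # vmx = Intel VT-x, svm = AMD-V; on non-Linux hosts this simply returns False
    flags_present = cpuinfo.exists() and any(
        flag in cpuinfo.read_text() for flag in ("vmx", "svm")
    )
    # /dev/kvm appears once the kvm kernel module is loaded
    return flags_present and Path("/dev/kvm").exists()


print(virtualization_available())
```

On a cloud VM without nested virtualization this will typically report `False`, which is exactly the kind of surprise the session title is about.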
- Many devices, like NICs, support passthrough to the VM
- Killing and replacing machines: you can netboot and reinstall them, though the speed of doing so can be limited by your BIOS; 10–15 minute cycles for a full install are achievable
- Create ground rules that no server is sacred: cattle, not pets
- Think about availability when building everything; have hot spares
- Raw hardware is cheap compared to AWS, so have a lower threshold for capacity when you buy new hardware
- Consider support packages and hardware life cycles
- Classify the expense as opex rather than capex
- You can get a new server within 6 hours
- Capacity planning needs to be someone's job
- You can use out-of-warranty hardware for things like Jenkins: you have a pool of compute rather than servers with dedicated jobs
- You will hit the physical limits of the hardware at some point if you keep running more VMs on it, e.g. switching for NICs is done in hardware up to a point, beyond which it's emulated in software much more slowly
- You have to deal with disk failures
- Metal as a Service from Ubuntu, or Packet.net, for buying bare-metal compute

## Production Performance Monitoring

**New Relic**

- Easy to get started with, but very expensive
- You push data to them, so if your server falls over it may not be able to push the crucial data you needed

**Tracing**

- Zipkin and Jaeger are cheap, but the UX is poor; there's also AWS X-Ray
- Sampling can lose the traces you really need: you can't choose to trace retroactively after you've hit an error condition, you have to decide at the ingress point
- X-Ray < OpenTracing, because you can't switch it out as easily
- Hard to get started with, as you need people to modify their code; get buy-in by showing it off on hack days
- Span tracing can be tricky if context passes between threads, and even with one thread it needs to work with all the libraries you use
- Easy for new code, but no one's really going to go back and instrument all the old code

**Graphite vs Prometheus**

- The pull model can be hard to deploy for existing projects if you're in a regulated environment, or have security teams that make it hard to get access to the scraping endpoint; the push gateway can be a way around this
- You need code changes to expose a metrics endpoint for Prometheus
- Managing TSDBs is hard: Prometheus struggles with a 300–400 node Kubernetes cluster, you need to add more instances, and federation is hard
- Hosted Graphite is nice, and adds alarms on top
- Why use Prometheus? Kubernetes adds endpoints for it, it standardised a metrics format, you can add labels to metrics, and the query language and data storage are nice

**HoneyComb**

- Nice for seeing the bigger picture and then drilling down into common factors
- Not a logging replacement
- A cheap way to store lots of datapoints; costs less than building and maintaining a similar solution yourself
- Can sample events based on whether they are a success or an error

**Logging**

- Buy an ELK stack: everyone has one, and maintaining your own is hard
- Splunk is amazing if you have the money
- Attach trace ids to logs so you can link them back
- Turn async writes on for Elasticsearch

## Strategies for Testing in Production

- Monitoring systems that log into prod and perform a bunch of actions work well for a few people in the room, and have found breakages and issues belonging to other teams

One approach: add headers to requests that identify them as test runs, so that systems can decide how to respond.
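That header idea can be sketched as below. The header name `X-Test-Run` and the handler shape are my own illustrative assumptions, not from the session:

```python
def handle_charge(headers: dict, amount_pence: int) -> dict:
    """Charge a card, unless the request is flagged as synthetic test traffic."""
    if headers.get("X-Test-Run") == "true":
        # Drop the real charge and return dummy data, so an end-to-end test can
        # exercise the full request path in prod without moving real money.
        return {"status": "ok", "charge_id": "test-dummy", "amount": amount_pence}
    return charge_real_card(amount_pence)


def charge_real_card(amount_pence: int) -> dict:
    # Stand-in for the real payment-provider call.
    raise NotImplementedError("would call the real payment provider")


print(handle_charge({"X-Test-Run": "true"}, 500))
```

The dummy response is effectively mock data living inside the prod system, which is why the next point about keeping it in sync matters.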
- This is useful for things like payment systems, which can drop the request and return dummy data; you need to be sure to update the mock data when the systems change
- Be careful when interacting with 3rd-party APIs: it's easy to get banned for posting data that looks like spam, or to hit rate limits and break your prod app
- Running these tests against prod systems means they also emit metrics and logs to the standard systems, which you can monitor

**Canarying**

- CNAMEs that let you switch between environments
- Roll out your code to a subset of users, controlled by things like feature flags
- If you can roll back fast, there's less need for things like blue/green deploys
- _How to Deploy Software_
- GitHub Scientist, for testing new codepaths safely (available for a bunch of languages)

**Shadow traffic deploys**

- Be careful about the extra load you're placing on downstream systems
- Ways to duplicate traffic:
  - Code in the client, controlled by a feature flag, that makes requests to both the new and old systems
  - Something like Kafka streams to replay prod traffic against new systems, or to pipe it into development environments
  - Envoy Proxy
- Rate-limit requests to 3rd-party APIs; test credential swapping

**Feature Flags**

- Put all new features behind feature flags so you can deploy them in small pieces early on
- Be sure to remove the flag at the end
- Product owners can control who the flag is turned on for, and when

## Observability (or, has anyone tried HoneyComb?)

- If you have events you can generate metrics, but if you just store metrics you lose the events
- Can correlate things like CPU spikes with other events occurring in the system
- You don't know what data you'll need until the event has passed

Metrics show you that you have a problem.
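That metrics-versus-events distinction can be sketched as follows; the event's field names are hypothetical:

```python
import time
from collections import Counter

# A metric aggregates context away: after this increment we only know *that*
# something failed, not which kind of request failed.
metrics = Counter()
metrics["checkout.errors"] += 1

# A wide, structured event keeps per-request context, so you can later query
# for traits the failures share, e.g. "all the 500s came from free-tier users
# in one region".
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "status": 500,
    "duration_ms": 842,
    "customer_tier": "free",
    "region": "eu-west-1",
}
```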
Observability shows you the common traits of the anomalous metrics.

- Etsy Skyline

**eBPF**

- Awesome but new; not in a lot of kernels yet
- Can hook into events happening in the kernel without overhead
- Hard to use currently

## What's Changed Since Scale Summit 2017 / Predictions for Next Year

**Predictions from last year**

- Alexa / voice interfaces won't take off
  - They seem to be bigger in the USA than in Europe
  - The HomePod doesn't seem to have taken off
  - Alexa laughs at you
  - If you live in a small house and have an Alexa that controls your lock, you can shout "Alexa, open the door" from outside the house
- Rust
  - Go seems to be big in the operations field, but not as popular elsewhere
  - The Firefox Servo rewrite has been a big success for Rust
  - WASM looks interesting
- Brexit
  - Still a mess, and unclear
- IR35 / Gov Tech
  - People have been leaving GDS, and no one is really taking over the community leader roles
- Yarn
  - Package lock files have become more popular
  - People are moving from npm to yarn
  - It's hard to keep up with the rate at which packages are updated; Dependabot alerts you about updates, and looks at the test results for the new version across the internet to work out how safe it is
- Kubernetes on AWS: happened
  - All the major clouds now have it
  - Not ready for prime time yet; released at re:Invent as a marketing thing
  - Kubernetes is complex; Nomad is easier to run if you're going to do it yourself
  - Lots of excitement about managing stateful services on Kubernetes now
- ELK CVE: didn't happen
  - But there has been ransomware for Elasticsearch clusters accidentally exposed to the internet

**Predictions for next year**

- Smarter viruses that don't kill the host: it's more valuable to stay hidden and mine cryptocurrencies, at least until that market collapses
- SWE ethics will become more of a hot topic; interest is growing after the Volvo incident, and machine learning ethics will become a bigger topic
- More attacks against machine learning
- Crowdsourcing behind the scenes (Expensify using Mechanical Turk, Duolingo swapping translations between users learning the opposite languages) will continue to grow
- More attacks against hardware; maybe new attacks against AMD chips have already happened?
- A country will have its CA chain revoked
- Social media regulation: more transparency around who paid for the ads you see
- More spam messages that are really close to looking like a human wrote them
- Private companies will start competing with branches of government, e.g. Citymapper buses

You can view tweets from the event on the #ScaleSummit18 hashtag. I'm even in one of them!