Technologies for GasBuddy’s Future

Written by djmax | Published 2016/08/12
Tech Story Tags: docker | cloud-computing | kubernetes | postgres | scaling


This is a version of an email I sent around to our team this week, with minor edits to protect the innocent. We have set a course for an evolved platform; it will clearly take a while to get there, but we are building and deploying services on this platform right now.

Hello all, I wanted to write one of my long-winded emails about the decisions we’ve made so far about “the new stack” — the set of technologies that will power new services and slowly take load from the existing stack. Many of you are already working with elements of it. The primary goal of the new stack is efficiency and horizontal scale. For us, this definition is most appropriate:

Horizontal scaling means that you scale by adding more machines to your pool of resources, whereas vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.

Our technologies must have the capability to scale by adding more (virtual) machines at a cost that the business can sustain. That is the money line in this email, literally and figuratively. By a somewhat recent estimate, if we were to pick up our infrastructure and stick it in AWS, it would cost something like $80,000 per month. This just isn’t a good value for money given our existing workload and our scaling plans. This business can’t sustain a $250,000 a month bill for triple the monthly unique visitors (a lot of assumptions in that projection, but it’s a big number). The bigger problem of course is that much of our functionality is dependent on a very non-horizontally scalable Microsoft SQL Server database, so without major application changes we couldn’t actually GET to that scale even with big money. Obviously there are a number of efforts underway to address this on the existing infrastructure and they are incredibly important for the long term health of our platform.

Our cost analysis has to include workload efficiency, software licensing and developer productivity. The technologies, services and partners that make up our infrastructure will change over time, and we have to architect systems to anticipate that change. So we’ve settled on a set of core components and patterns that we hope will set us up for long term success while providing a path for short term improvement. It is not a goal or a reasonable expectation to just “change everything.” We will continue to improve our existing services but look for opportunities to gain meaningful production experience with what we believe to be the right long term platform choices.

Microservices

In general, one of the bigger things we’re doing is moving away from monolithic services and binary inclusion (i.e. referenced assemblies) to small, composable services. We’re using Swagger as an API specification format, which gives us automatic server and client generation in both C# and JavaScript. A service should be responsible for a very narrow band of functionality — as an example, geo-serv is solely responsible for non-user-specific geographic data (shape analysis, distance computations, etc.), but NOT for searching stations based on those things. In addition, we are moving to a stricter three-tier architecture — front-end APIs and pages, middle-tier services / background tasks, and databases. Authorization happens at the front tier (in fact it’s integrated into the Swagger specs for most situations). The middle tier just does what it’s told — it should not be rechecking the user-related rights of operations. This makes orchestration significantly simpler and more powerful. The database should basically enforce service-layer access rights.
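
As a hedged illustration only (the real geo-serv contract may differ), a narrowly scoped Swagger 2.0 definition for a single distance endpoint might look something like this:

```json
{
  "swagger": "2.0",
  "info": { "title": "geo-serv", "version": "1.0.0" },
  "basePath": "/v1",
  "paths": {
    "/distance": {
      "get": {
        "summary": "Great-circle distance between two points",
        "operationId": "getDistance",
        "parameters": [
          { "name": "from", "in": "query", "required": true, "type": "string", "description": "origin latitude,longitude" },
          { "name": "to", "in": "query", "required": true, "type": "string", "description": "destination latitude,longitude" }
        ],
        "responses": {
          "200": {
            "description": "Distance in meters",
            "schema": {
              "type": "object",
              "properties": { "meters": { "type": "number" } }
            }
          }
        }
      }
    }
  }
}
```

From a spec like this, the generators produce both the server scaffolding and the C#/JavaScript clients, so the contract itself is the single source of truth rather than a referenced assembly.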

Over time — and with the right amount of engineering discipline — the microservices approach yields significantly more flexible components with frequent releases and better code quality maintained by smaller teams. It encodes a lot of functionality in the relationships and call models between services, which is more mentally taxing but less manually taxing. That means if something is confusing or not working as expected, communicate/ask quickly rather than assuming the code is a “golden city on a hill.” Smaller services also mean we can revisit, rewrite and gradually replace services without inordinate fear that it will have broader impact than the service itself. Another minor point with major value — all requests that come into this infrastructure are assigned a unique identifier and that identifier is passed along with all inter-service requests. That “correlation id” allows us to view cross-service logs from an end-user request perspective, which was an immensely valuable tool at PayPal.
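
To make the correlation id concrete, here is a minimal sketch of the pattern in Node/Express; the header name, endpoint, and coordinates are made up for illustration:

```javascript
// Minimal correlation-id propagation sketch (hypothetical header and endpoint names).
const express = require('express');
const { v4: uuidv4 } = require('uuid');
const fetch = require('node-fetch');

const app = express();

// Reuse the caller's correlation id if present, otherwise mint one.
app.use((req, res, next) => {
  req.correlationId = req.get('correlation-id') || uuidv4();
  res.set('correlation-id', req.correlationId);
  next();
});

// Every inter-service call forwards the same id, so logs across services
// can be stitched back together from the perspective of one end-user request.
app.get('/stations/nearby', async (req, res) => {
  const geoRes = await fetch('http://geo-serv/v1/distance?from=42.36,-71.06&to=42.35,-71.08', {
    headers: { 'correlation-id': req.correlationId },
  });
  res.json(await geoRes.json());
});

app.listen(3000);
```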

Like most ideas in software development, the concept of building microservices has both passionate proponents and opponents whose criticism runs the gamut in terms of validity and usefulness. It has the added “benefit” of having a muddled definition since becoming one of the buzziest buzzwords in the industry in recent times. Despite an astronomical uptick in interest in “microservices” over the last two years, the concept is nothing new. The idea of service-oriented architecture has been around for ~40 years and really boils down to building subsystems with well-defined interfaces, expecting dependent services to simply be available. In the past, this often meant little more than a shift in complexity from the app level to the infrastructure level — deconstructing one big problem into 50 little ones. However, leveraging modern, battle-tested tooling (as described below) helps us reduce that new complexity, hopefully giving us the best of both worlds.

Containers

Containers allow us to orchestrate the microservices as very small and manageable deployment units. We are using Docker as a container platform because it has the broadest industry buy-in — AWS, Joyent, Google, pretty much everybody supports it. Docker (the company) has lots of add-on products around Docker, with varying levels of success, but the core container format is very stable and does what we need. We do need to orchestrate the creation of several non-containery things (like the VMs that host the containers or other non-Docker components), and for that we intend to use Terraform and Ansible. With those tools we expect to be able to version all our infrastructure, which gives us both change management and the ability to spin up production-like environments (staging, QA, etc.) more simply and accurately.

AWS

They used to say “nobody gets fired for choosing IBM.” I think that’s Amazon’s position in cloud infrastructure at the moment. It’s not the cheapest hosting option by any stretch, and it probably doesn’t even perform the best (Joyent and GCE claim to be a lot better), but pretty much anything you want to do with AWS has been done by someone else already. The number of value-added services they offer is staggering — things like Lambda, which allows you to run code on external event triggers and pay only for that runtime; Glacier, which lets you store near-line data for super cheap; RDS, which lets you provision databases as services… Most of the other providers just don’t do that — they give you a VM infrastructure and the rest is up to you. Of course this is why the others are cheaper, and as we scale our applications our goal is to be multi-provider and able to shift workloads across providers as necessary. That is hard with something like a database, but with stateless services like geo data lookups it would be pretty easy™. The nearest competitor looks to be Google, and we have our eyes on them.
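
As a tiny, hedged illustration of the Lambda model mentioned above (the event and response shape here are made up), a handler is just a function that runs when something triggers it, and you pay only for the time it spends executing:

```javascript
// Hypothetical AWS Lambda handler (Node.js runtime): invoked on an event
// trigger, billed only for the time it spends running.
exports.handler = async (event) => {
  console.log('received event:', JSON.stringify(event));
  // e.g. resize an image, process a queue message, react to an upload...
  return { statusCode: 200, body: 'processed' };
};
```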

Kubernetes (aka k8s)

For some services (like databases), AWS provides external management of the underlying hardware so we don’t really have to “worry about it.” But for the rest of our services, some that we write and some that we use (like RabbitMQ), we have to manage where they run, how many there are, etc. This sort of “container orchestration” has become a market of its own. We have chosen Kubernetes (http://kubernetes.io/). Jean-Charles will send a longer mail at some point explaining it in more detail, but essentially k8s manages a set of virtual machines and a set of services that run on those machines. It can automatically scale them up and down, move them around, and manage their health. It provides service discovery and connectivity between those services as well as to external services like RDS (a database-as-a-service product). It manages secrets and other security primitives and takes us a long way towards versioned infrastructure — meaning our production environment and its changes over time are versioned just like all our other software.
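
For a hedged sense of what that means from inside one of our services (the Service name, path, and env var below are assumptions for illustration), k8s handles discovery and secrets so the code never tracks pod IPs or embeds credentials:

```javascript
// Hypothetical view from inside a pod of what k8s provides at runtime.
// Assumes a Service named "geo-serv" in the same namespace and a secret
// exposed to the pod as the DB_PASSWORD environment variable.
const fetch = require('node-fetch');

// Service discovery: cluster DNS resolves the Service name to whichever
// healthy pods back it, so callers never hard-code IPs or replica counts.
async function getDistance(from, to) {
  const res = await fetch(`http://geo-serv/v1/distance?from=${from}&to=${to}`);
  return res.json();
}

// Secrets: injected by k8s at deploy time (env vars or mounted files),
// so credentials never live in the container image or the repo.
const dbPassword = process.env.DB_PASSWORD;
```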

Postgres

I think I’ve sent mail on this before, but to summarize — SQL Server is awesome; SQL Server is too expensive; therefore SQL Server is not awesome. Over time, we will be moving to PostgreSQL. I’m somewhat concerned this might be unfortunate timing, as MSFT has announced SQL Server on Linux, which in theory could reduce the total cost. But Postgres is “free as in beer.” It is extremely reliable, with support for replication/clustering and read-only replicas for performance. It has robust JSON support, which allows us to combine the needs of transactional systems with the flexibility of NoSQL-like systems. I just finished some work on our identity services that allows “patch-based” profile changes in a JSON column with transactional consistency. I’m not even sure SQL Server could pull that off. Postgres is also offered by AWS as a service, where we basically don’t have to configure high availability manually or manage it too much.
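
For a flavor of what that looks like (a sketch only — the table and column names are made up, and it assumes node-postgres with a jsonb column on Postgres 9.5+), a JSON patch can be merged into a profile inside a normal transaction:

```javascript
// Hypothetical sketch: merge a JSON "patch" into a user's profile inside a
// transaction, so partial profile updates stay consistent.
const { Pool } = require('pg');
const pool = new Pool();

async function patchProfile(userId, patch) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // jsonb || merges the patch into the existing document (Postgres 9.5+)
    await client.query(
      'UPDATE users SET profile = profile || $2::jsonb WHERE id = $1',
      [userId, JSON.stringify(patch)]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

// e.g. patchProfile(42, { favoriteFuel: 'diesel' });
```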

One other change is that we’re moving towards a higher number of independent logical databases. This is a double-edged sword: we will lose foreign keys in many cases and have our data more scattered, but I’m hoping it reduces our exposure to single-db hacks and helps enforce proper service boundaries. While there will be some performance loss from application-layer joins and the like, we feel the benefits in scaling efficiency and separation of concerns, plus healthy caching (both Redis for shared cache and in-memory for very slow-changing data like geo data), are worth it.
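
A minimal sketch of those caching layers, assuming the node-redis v4 client (the key names and loader function are made up; connect() would be awaited once at startup):

```javascript
// Hypothetical two-level cache: in-memory for very slow-changing data,
// Redis as the shared cache, the owning service's database as source of truth.
const { createClient } = require('redis');

const memoryCache = new Map();       // per-process, great for data like geo shapes
const redisClient = createClient();  // shared across every instance of the service

async function getGeoShape(shapeId, loadFromDatabase) {
  // 1. in-memory: cheapest, fine for data that almost never changes
  if (memoryCache.has(shapeId)) return memoryCache.get(shapeId);

  // 2. shared Redis cache, so one instance's work helps the others
  const cached = await redisClient.get(`geo:${shapeId}`);
  if (cached) {
    const value = JSON.parse(cached);
    memoryCache.set(shapeId, value);
    return value;
  }

  // 3. the owning service's database remains the source of truth
  const value = await loadFromDatabase(shapeId);
  await redisClient.set(`geo:${shapeId}`, JSON.stringify(value), { EX: 3600 });
  memoryCache.set(shapeId, value);
  return value;
}
```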

Kong

Lastly, we are using a single “gatekeeper” for all public APIs built on the new stack. That product is called Kong, and it handles various aspects of authentication (not authorization — just providing tokens and turning those tokens back into user contexts), rate limiting, logging and more. It supports first-party access (I am the user acting on my own behalf) and third-party access (I am a developer acting on behalf of a user who has given me permission to do so). Our experience so far hasn’t been perfect — we’ve had to fork it and make some changes, but the fact that we can is another reason we like it (not that I enjoyed having to write Lua). It will allow us to host all our APIs behind api.gasbuddy.com (or maybe api.gasbuddy.io, not sure yet) and have it farm requests out to the appropriate services. It can handle transformations of input and output such that we could handle version migrations with configuration rather than service-layer code. We can also write plugins to do things like favor a particular data center or retry certain types of failed requests in other data centers before failing back to the caller. It also allows us to test more easily, because our services now just get some HTTP headers with user context, and the tests can simply make those up rather than going through an authentication process.
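
A hedged sketch of that last point (the header name is illustrative, and supertest stands in for whatever test harness we settle on): behind the gateway the service trusts the user-context headers, so tests just fabricate them.

```javascript
// Hypothetical sketch: a service behind Kong reads user context from headers
// added by the gateway after authentication; tests simply make those headers up.
const express = require('express');
const request = require('supertest');

const app = express();

app.get('/v1/profile', (req, res) => {
  const userId = req.get('x-authenticated-userid');
  if (!userId) return res.status(401).json({ error: 'no user context' });
  res.json({ userId });
});

// No OAuth dance in tests: fabricate the header and exercise the service directly.
request(app)
  .get('/v1/profile')
  .set('x-authenticated-userid', 'user-123')
  .expect(200)
  .then((response) => console.log(response.body));
```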

Hooking things up to Kong is currently done in our bootstrap.sh script in the tooling repo. Over time, we’ll come up with a better way of tracking precisely what APIs exist and how they’re named and arranged.

The tooling repo should be enough to get any developer up and running with a personal “GasBuddy” new-stack environment in about 10 minutes. If it fails, speak up and we’ll figure out why. Over the next couple of weeks I’d like us to work on our “Developer Manifesto” together — what are the expectations of our infrastructure in making our lives easier, and what are our responsibilities to each other to make that a reality. The one I just mentioned has always been a big one for me — if I know what an “IDE” is and can use a command line, I should be able to get up and running with just the contents of some README.md in less than half an hour (internet weather dependent).

Thanks for sticking with me this long, and hopefully these new services and patterns will help us reach our goals more quickly and with less pain than is typical. Specifically, we want to get a meaningful percentage of our overall traffic onto these new services by the end of the year. If we can do that, and make a corresponding reduction in the load on our most expensive resource (SQL Server), we may be able to get out of our self-hosted environment around the same time, which gives us a great deal more control over our destiny.

One of my favorite sayings, and words I try to live by, is “have passionate beliefs, loosely held.” Many of you have been involved in vetting the components described in this mail, and I think we’re in a good place in terms of collective buy-in and engineering headroom. But things change, and that’s ok. Maybe Kong decides Lua is a bad idea and rewrites the whole thing in Java (ewww). We’ll roll with those punches. But our overall architectural philosophy and development practices should endure.
