Microservices for Startups: An Interview with Steven Czerwinski of Scalyr

This interview was done for our Microservices for Startups ebook. Be sure to check it out for practical advice on microservices. Thanks to Steven for his time and input!

Steven Czerwinski is the co-founder and head of engineering at Scalyr, a scalable log monitoring and analysis SaaS product. Prior to co-founding Scalyr, Steven worked at Google for eight years, specializing in building distributed database systems for consumer apps.

For context, how big is your engineering team?

We’re still a fairly small startup. We are on the order of about 23 full-time employees; 11 of them are engineers. I’m head of engineering, so I kind of watch out for the different teams we have on the engineering side.

We’ve been around since 2011, but for the first three years of the company’s life it was just me and the founder of the company working on it and building it, moving the product along to MVP and getting some initial experience with users. We had actually already launched it to paying customers at that point. But we didn’t start trying to grow the company aggressively until 2015, and that’s when we took our first round of seed funding. Since then we’ve been hiring people and growing the engineering team and all that good stuff.

We both actually came from Google. I was at Google for about eight years or so. Steve himself was there for four. He was the founder of a company called Writely that Google basically acquired to make Google Docs. So we both were responsible for running a lot of the storage layer services in what was then called the Apps Division at Google.

Are you using microservices? If so, how are you using them?

Some of my experiences with microservices are colored by my experience at Google. They definitely were big believers in microservices, and it was probably the first place where I saw them used to the extent that they were used.
That experience shaped how we’re using and choosing to use microservices at Scalyr.

In terms of how we’re using microservices now, let me give you a little bit of background on our setup, our infrastructure. We’re a log management product. The core part of that service comes down to ingestion of logs. We’re SaaS, so all of our customers are uploading their logs to us in real time and we have an ingestion pipeline that takes that in. Then of course we provide a search service over that, as well as some analytical services to pull certain metrics out of log files so you can do graphs. For instance: show me the average response time of HTTP requests that have status 200, and things like that. Similarly, you can also use that to build dashboards and alerting and all that good stuff. In fact, our key differentiator from the other competitors in this market is that we’ve actually built our own custom NoSQL database essentially from scratch.

Did you start with a monolith and later adopt microservices?

When it was just Steve and I, the first versions of the service were just a monolithic server. We had one Java-based service that did everything: it implemented the NoSQL database, implemented our front end user interface and served up the JavaScript, all that good stuff. And even though we had had these positive experiences of using microservices at Google, we went this route (just having one monolithic server) because it was less work for us as two engineers to maintain and deploy. So we started off with the monolithic approach, and then about two years ago we started pulling some of those pieces off the monolithic server for particular reasons.

If so, what was the motivation to adopt microservices? How did you evaluate the tradeoffs?

I’ll talk about the three things that we’re moving to microservices and the different ways that we’re doing it. The first one is our front end proxy layer.
This is the layer where an incoming HTTP request gets mapped to a particular shard of our database servers in order to be handled. Not surprisingly, our NoSQL database is sharded. Every slice of the row space is owned by a particular master node, and behind that we have two replicas. Those two replicas and the master constitute the shard for that row space. So when we have an incoming request, we have to find out what account that request is originating from and then map it to the shard where that account lives in the row space. Before, that was done by the monolithic server itself. We’re on AWS, so the AWS ELBs would just send a request to a random monolithic server, and that server would turn around and reroute it to the right place. Obviously that’s not that efficient, and it’s a justification for microservices. So we pulled that proxying out into its own service.

The second microservice that we created was what we call the queue servers. What the queue servers really provide is reliability and availability on log ingestion. It’s very important for us that every second of the day we’re accepting the log files that are being uploaded by our customers, with no blips. And as I mentioned, our backend servers (the monolithic servers) are stateful. When they have to restart and come back up, there is some disruption in service as they read in their state from disk. The queue servers are meant to be a kind of buffer between our customers and the database servers. So again, this is a choice of microservice based on the statefulness of the data. We can much more easily drain the individual queue servers in order to do software updates without disrupting the service. And we can also size them differently.
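
To make the proxy layer’s routing step concrete, here is a minimal sketch, written in Python for brevity even though the interview describes a Java stack. It looks up the account’s position in the row space, then finds the shard that owns that slice. The range-based slicing, the class names, the host names, and the row keys are all assumptions made for illustration, not Scalyr’s actual implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Shard:
    start_key: str             # first row key this shard owns (inclusive); range slicing is assumed
    master: str                # hypothetical host name of the master node
    replicas: Tuple[str, str]  # the two replicas behind the master


class ShardRouter:
    """Maps an account to the shard that owns its slice of the row space."""

    def __init__(self, account_rows: Dict[str, str], shards: List[Shard]):
        # Global mapping: account id -> row key where that account lives.
        self._account_rows = dict(account_rows)
        self._shards = sorted(shards, key=lambda s: s.start_key)
        self._starts = [s.start_key for s in self._shards]

    def route(self, account_id: str) -> Shard:
        row_key = self._account_rows[account_id]
        # Pick the last shard whose start_key is <= row_key.
        idx = bisect_right(self._starts, row_key) - 1
        if idx < 0:
            raise KeyError(f"no shard owns row key {row_key!r}")
        return self._shards[idx]


# Hypothetical two-shard layout and two accounts.
router = ShardRouter(
    account_rows={"acme": "c:acme", "zeta": "q:zeta"},
    shards=[
        Shard("", "db-1", ("db-1a", "db-1b")),
        Shard("m", "db-2", ("db-2a", "db-2b")),
    ],
)
print(router.route("acme").master)  # the proxy would forward the request to this node
```

Doing this lookup in a small stateless tier, rather than in a randomly chosen database server, is what removes the extra hop described above.
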
I mean, again, it’s one of these things where the monolithic server has such requirements for disk and CPU, but the queue servers, all they need is enough disk to act as a buffer for normal, expected hiccups in the service. We know what matters in terms of computing its resources. We know this is our average log ingestion volume. We know we want to plan for 45 minutes, and boom, that determines the amount of disk that we’re going to give it. So you can make the argument that breaking these things up into microservices makes estimating resource usage much easier. Whereas with a monolithic server, there are so many different things that it does that it’s really more of a black box in terms of the amount of resources you have to devote to it.

How did you approach the topic of microservices as a team/engineering organization? Was there discussion on aligning around what a microservice is?

When it was just Steve and I, we both had these experiences at Google, so to a certain degree there was not much of a decision process with us. We both agreed microservices were good and ideally the right thing to do if we had enough engineering resources.

When we started up this new microservice, we had more junior team members working on that project and there was a bit of an education. Like, why is this good? Why do we need to move this out of the monolithic server? And for this one, this is a worker microservice. Of course you try to be very data-driven and very fact-driven. For us, we had clear monitoring data that showed the impact that these S3 archive jobs were having on the database servers. So you can point at a dashboard [and say], “Look at these increases in lag. This is not good. We need to solve this somehow.” And we actually did this with the team, where we went through the justification: okay, we could maybe solve this in the monolithic server by trying to cap the CPU more. But that’s going to be such a difficult game to play.
You’re not going to catch all the cases, and so you’re still going to have some impact. So really it was an opportunity where we could move it out to a completely separate process, a completely separate node, and get the isolation we want.

The other things that we went through in the meeting were the resource sizing arguments. If we move this off to something else, then we might be able to decrease the JVM size on the database servers, because they don’t need this extra RAM around to read terabytes of data in order to archive them to S3. So it’s not just the live data we’re showing, but also going through the metrics.

One of the pushbacks from one of the team members was: why don’t we just use an off-the-shelf queue server, a worker queue model, because there are ones out there. And it’s an interesting point about microservices. Once you have things factored out along meaningful API boundaries, you might find open source alternatives that fit those boundaries, and you can swap them in and maybe save yourself time. We talked briefly about that. For something simple like this, we decided that we have very particular ways that we want the service to behave. We want to monitor it. We want to be able to control it. So we decided the engineering effort it was going to take to build it was worth it rather than using something off the shelf. But it was definitely a discussion point.

Did you change the way your team(s) were organized or operated in response to adopting microservices?

We were already on the path of breaking up the teams. Maybe two years ago everyone was a full stack engineer. Everyone was touching every part of the stack. We’re now at the point where you kind of have to have specialized skills, especially for us where we have this NoSQL database that we implement. For the last year, we’ve been separating out the engineering disciplines. We have a front end team [and] a backend team.
We have a dev ops team, and then we have an integrations team. So in some ways, if we hadn’t done that already, I think developing a microservice would have pushed us in that direction. I definitely saw that at Google. Microservices had clear team owners. I remember at Google we’d actually constitute new teams in order to own particular microservices: this is your baby now, you get to take care of it. For us, our backend team is about three or four engineers, depending on how you count half-time commitments, because people wear different hats and all that. And for us the backend team was clearly going to own this.

How much freedom is there on technology choices? Did you all agree on sticking with one stack or is there flexibility to try new? How did you arrive at that decision?

To some degree our stack is fixed. We’re a Java-heavy shop, so we’re going to use Java. But in our monolithic server, we’re using Tomcat as a Java container. We actually don’t like Tomcat. We would love to move away from it. We did give ourselves the freedom with this new worker microservice to do a kind of reset. I mentioned that we’re using gRPC for the RPC framework. This is our first example of building a service purely on gRPC. So we allowed ourselves a little bit of freedom and exploration with that, because we strongly feel that that’s what we’re going to adopt throughout the entire ecosystem.

Have you broken a monolithic application into smaller microservices? If so, can you take us through that process? How did you approach the task? What were some unforeseen issues and lessons learned?

We did monolithic because it was just faster for us at first. We didn’t have to think about generalizing things, we didn’t have to write libraries to be generic and all that good stuff. But we knew at some point we were going to be peeling stuff off. And so you have to know when is the right time to peel off the layers.
That means that as you’re building your monolithic server, or building your service, you have to have the right monitoring and the right information in place to warn yourself when you’re getting close to that cliff.

This approach, where we took the existing code base and then just turned off various parts that we didn’t need, did reveal bugs where we didn’t realize that particular parts of the code depended on particular pieces of global state. When you start peeling off microservices, of course you’re going to look at your global state and say, OK, this service doesn’t need this state, it doesn’t need that state. But you’re using the same libraries, so under the covers we sometimes had surprises like: it turns out that for some reason this library expects to be passed a database handle, even though in the proxy layer we don’t have a database. So we had to find those places and fix them. Ideally we would have had this all factored appropriately from the get-go. We would have had the right boundaries between the different layers and there would have been no crosstalk. But in practice it just doesn’t work out that way.

How do you determine service boundaries?
[We consider whether the service is stateless or stateful], but it’s also the type of data and what type of impact it’s going to have on the customer. Some data you have is global. For example, we have to have a global mapping from customer to the part of the row space that they’re in. For that we need high availability, because if that data is gone we can’t service any request. So we have it stored in a separate place and replicated in different ways because it’s so important. Whereas the per-shard information is in its own little partition. It sucks if it goes down, because that portion of the customer population is not going to have their logs available, but it’s only impacting 5 percent of the customers rather than 100 percent. So yeah, stateless versus stateful. I think also resource sizing. That’s another important consideration.

What was that discussion like within your team? Can you give some examples?

As we saw the number of incoming requests increase, to where we knew we were starting to exhaust the thread pools we had for proxying, we knew that it didn’t make sense to spin up more monolithic servers just to handle that additional proxying. There we had a little bit of runway. We were like, okay, we could increase our costs by spinning up more database servers just to handle this need. But we knew that was just not a good idea. So instead we gave ourselves enough time to spin up the proxy layer so that we could then start sizing it in that different way.

So we’re pulling out the motivations for why to do microservices. For us the two additional ones were isolation of operations (sometimes you want isolation to make sure the resource usage of one operation doesn’t impact other, more critical operations) and also the isolation of diceyness.
Part of the decision to use microservices is this security implication and the diceyness of integration with whatever you’re using.

What lessons have you learned around sizing services?

At the start we certainly did play around with the knobs, and it was partly dependent on the EC2 instance type that we were going to use for each of these types of servers. For our proxy servers we of course use a different EC2 instance type than what we use for our database servers. So there was some playing around, like: okay, let’s buy a reserved instance of this CPU-heavy, network-heavy instance type and then benchmark and see how many requests per second the proxy server can serve on that. There was a learning curve. We had to play with that.

But since then we don’t make changes to individual resources on a per-instance or per-server basis. It’s more that we just spin up more shards. When we see that the number of requests coming in per second from our customers is exceeding some maximum cap based on the number of proxies we have, then we spin up more proxies. We just add more EC2 instances. You could make an argument that we could use AWS autoscaling to do that, but we’re kind of a conservative bunch and we don’t trust that as much. We’d rather have a human in the loop for it. And a lot of times we see this stuff coming, so we can react.

How have microservices impacted your development process?

It’ll be interesting, actually, when we add in the queueing servers, because with the approach that we took with the queueing server and the front end server, that code was part of the monolithic server and those code bases are still shared. So when we develop locally, we still just bring up the monolithic server without actually turning on a separate front end or a separate queueing service, because that code is already in there. We just configure it to use that code rather than rely on the external microservices.
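
The local-development arrangement described here, where the monolith runs the queueing code in-process instead of calling a separately deployed service, amounts to choosing an implementation from configuration. Below is a hypothetical sketch in Python (the interview says the real stack is Java). The `queue_mode` and `queue_address` keys, the class names, and the address are invented for illustration, and the remote call is stubbed out rather than made.

```python
from typing import List, Protocol, Tuple


class LogQueue(Protocol):
    """What callers need from the queueing layer, wherever it runs."""

    def enqueue(self, event: str) -> None: ...


class InProcessQueue:
    """Runs the shared queueing code path inside the monolith process."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def enqueue(self, event: str) -> None:
        self.events.append(event)


class RemoteQueueClient:
    """Stands in for a client of a separately deployed queue server."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.sent: List[Tuple[str, str]] = []

    def enqueue(self, event: str) -> None:
        # A real client would make an RPC here; this sketch only records the call.
        self.sent.append((self.address, event))


def build_queue(config: dict) -> LogQueue:
    """Pick the in-process code path unless the config asks for the external service."""
    if config.get("queue_mode", "in_process") == "remote":
        return RemoteQueueClient(config["queue_address"])
    return InProcessQueue()


# Local development: no separate queue server needs to be running.
local_queue = build_queue({})
local_queue.enqueue("GET /index.html 200")
```

Because both implementations satisfy the same interface, the rest of the server does not need to know which one it was handed.
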
You are right, though, that that is going to have to change a bit when we add in the new worker service. It’s one of these things where that functionality, the S3 archiving and the other things that it’s going to do, is not critical to our operation. For a front end developer bringing up a local build in order to test some new UI they’ve built, it doesn’t matter that the S3 archiver isn’t running on their local service. So we will probably have it set up so that we can run without the worker service, and we will only need to bring that up when there are features or issues that we’re testing specifically. It will mean, though, that we will have to modify our continuous build system; right now we use Jenkins for that. We already use Docker in order to spin up the microservice.

Your ops and deployment processes? What were some challenges that came up and how did you solve them? Can you give some examples?

When we started having microservices, we of course had to start generalizing our dev ops scripts. We want to have the same core dev ops libraries do these pushes, but now we’re pushing different types of things: the queue server, the front end proxy, the monolithic server. So we had to refactor that code. We went through a process of pulling out the common stuff, adding the ability to tailor for the specific server type, and that sort of thing.

For us it also impacted how we do alerting. Like I said, we use our own service to monitor and alert on performance problems or availability problems on our servers. With the monolithic server we just had one big set of alerts that applied to every server that was in the NoSQL database. But now that we had more server types, we had to factor the alerts. There was a base set of alerts that should apply to anything: is the CPU pegged, is a disk full, that type of thing.
But then there are server-type-specific alerts. For the databases: is the database not able to be opened, is it seeing errors. For the proxies: is there unavailability on the backend that it’s trying to send requests to, and all that. So definitely there was a big step of refactoring and making that more general.

We couldn’t have done that without our growing dev ops team. We have one fully dedicated person on dev ops, but then we have a lot of people that wear that hat half the time or something like that. We need manpower to do that.

How have microservices impacted the way you approach testing? What are lessons learned or advice around this? Can you give some examples?

In our environment we have a couple of different realms that we use, and we refer to them as instances. We have the production instance. That is, of course, all the services that are being used to service our customers’ data. And then we have a staging instance, which is the servers that we use to test things. We also eat our own dog food, so the staging servers monitor the production servers using the Scalyr system, and the production servers monitor the staging servers, so they’re kind of pointing at each other.

And this might be obvious, but whenever you deploy a new microservice, you need to make sure that you actually have the microservice faithfully represented in both places. We can’t cheat in the staging instance and use the production proxy layer or something like that. If I was back in Google, I wouldn’t use the same image parsers between the two instances. Because you really need the ability to canary and test new versions of each of the services. So we do make an effort, anytime we’re making a change to the front end proxy servers, that it gets pushed first to staging and bakes there for a period of time before we push those servers on to production.
The other thing about testing is that up until now all the code has been in the monolithic server. So for integration testing and automated testing, we just went in through the monolithic front end to test those different pieces. We didn’t have to spin up a separate front end or separate queue server just to test those code paths, because they are represented in the monolithic server.

How have microservices impacted security and controlling access to data? What are lessons learned or advice around this? Can you give some examples?

In terms of security, we talked about the image processing, and Steve also brought up the fact that some of these third-party libraries that we integrate, we don’t necessarily trust that much. So it’s better to run them in a microservice where we don’t have critical information, where they can’t necessarily impact or somehow… there are no back doors that would allow other people to get access to our data or something like that.

But beyond that, microservices don’t really impact us that much. And that’s partly because of our size. We’re already at the point where the dev ops team has the keys to the kingdom anyway, and they’re going to need the keys to the kingdom for the database servers. All these microservices (the queueing server, the front end proxy layer, the workers) are lower security concerns for us. So we don’t feel the need to restrict access to those beyond the people that already have access to the database servers. Not that much need yet.

[At Google], certainly they use microservices as service boundaries. In around 2006, 2007, for the RPC mechanism that they had, they added in a very strong authentication and encryption layer to make sure that requests were coming from authenticated services.
And then they used that as boundaries between teams and so on in terms of protecting information and all that, giving responsibilities to just core people that then protected that very critical microservice and the data it represented.

Thanks again to Steven for his time and input!

This article was originally published at buttercms.com. It’s part of an interview series for the book Microservices for Startups.