FTL usually refers to "faster than light". A theoretical particle known as the tachyon, which powers certain spaceships in the Star Trek universe, keeps the plot going for decades across multiple series and movie franchises.
Today though, we are going to be talking about running linux applications faster than linux. Our company, NanoVMs, has been working with unikernels for a while now, but we've mostly been focused on their security properties and hadn't put much thought into their performance.
Some have even correctly pointed out that nanos is not a traditional "true blue" unikernel as it retains different privilege levels for kernel vs user code. This is because there are certain privileged instructions that allow you to change page mappings, and if user code has that capability then all the ASLR and page protections in the world don't matter. You're going to get hacked.
Having said that, there's nothing preventing us from running software faster than most linux distributions, or linux in general. In fact it is almost a guarantee, because linux is a general purpose operating system and is built to run everything from the latest FPS game to a load-bearing database to inference at the edge.
It was also purpose-built to run on bare metal and so it has facilities that a unikernel would never incorporate. Over half of linux is device drivers, and it's not a case of you deciding what to pick and choose for your hand-rolled kernel. Keep in mind that these same design concepts (multiple processes, multiple users, interactivity) are what powered the PDP-7.
As much trash as we talk about the multiple process model, you kind of need it for bare metal installations. You want that capability for the laptop or phone you are reading this article on. However, for production server-side applications that always run inside of a virtual machine, which is the vast majority of everything server-side nowadays, that condition goes away.
In fact we don't want users, nor do we want remote interactive access, nor do we want a bunch of random crap running that is not our software. There's already one layer of linux running as the hypervisor - does the guest need to be a heavy-weight GPOS as well? It is these environmental characteristics that allow us to do what we do.
Before we go further down the road it should be clear that linux is a general purpose operating system and as such has many use-cases - for example, linux is and will forever be our favorite development environment.
Not only do we get security benefits by allowing only one application to run, that application tends to run much, much faster. There are a lot of unikernel projects out there, but at the end of the day cloning decades of kernel work just takes a ton of time and effort, and it can be years before you see performance gains for commonly used software.
Since OPS is written in Go and we have numerous other software projects at our company written in Go, we've used Go quite a lot for testing, and thus that's where we've started seeing some results. You should be able to replicate this with most Go releases just by using the latest Nanos release, but if you are using Go 1.14 you'll need to build nanos from source (or use the nightly release) as we just threw in an SA_ONSTACK fix. A scheduling change was made between Go 1.13 and 1.14. Ordinary application developers don't get much exposure to these types of changes, but when they interface with the operating system we have to deal with them. There's a reason why they call it bleeding edge. :)
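As an aside, Go 1.14's new preemption works by delivering signals to running goroutines, and Go ships a debug knob to switch it off. To be clear, this is not the Nanos fix - it's just a quick way to check whether that particular runtime change is what you're tripping over (binary name here is a placeholder):

# disables Go 1.14's signal-based async preemption for comparison purposes
GODEBUG=asyncpreemptoff=1 ./your-go-binary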
To build from source you'll need to clone Nanos and run a simple make. If you haven't installed OPS yet you should do so. From there you can copy your local build into the latest ops release like so:
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/boot/boot.img ~/.ops/0.1.25/.
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/mkfs/bin/mkfs ~/.ops/0.1.25/.
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/stage3/bin/stage3.img ~/.ops/0.1.25/
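For completeness, the clone/build and OPS install steps referenced above look roughly like this (the install one-liner comes from the OPS site; double-check it against the current docs, and adjust paths to taste):

# build nanos from source; this produces the output/ artifacts copied above
git clone https://github.com/nanovms/nanos.git
cd nanos && make

# install ops if you don't already have it
curl https://ops.city/get.sh -sSfL | sh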
It is highly recommended to just use the pre-built releases from OPS if you can.
Now let's use this simple little Go hello world:
package main

import (
    "fmt"
    "net/http"
)

func main() {
    fmt.Println("hello world!")

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Welcome to my website!")
    })

    fs := http.FileServer(http.Dir("static/"))
    http.Handle("/static/", http.StripPrefix("/static/", fs))

    http.ListenAndServe("0.0.0.0:8080", nil)
}
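Build it as a plain linux binary before handing it to OPS (the output name just matches the -a flag in the ops commands below):

# linux ELF named to match the ops commands below
GOOS=linux go build -o hackernoon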
I'm using this ops config:
{
  "CloudConfig": {
    "ProjectID": "prod-1033",
    "Zone": "us-west2-a",
    "BucketName": "my-bucket"
  }
}
Then we build the GCE image:
eyberg@box:~/y$ cat build.sh
#!/bin/sh
GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json ops image create \
-c config.json -t gcp -a hackernoon
Let's spin it up:
eyberg@box:~/y$ cat create-instance.sh
#!/bin/sh
GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json ops instance create \
-t gcp -i hackernoon-image -z us-west2-a
Now let's spin up 2 more instances on GCE directly: one to do the benchmarking with ab, and one plain debian instance to run the same go webserver.
Don't be dumb like the author and wonder why the latency is two orders of magnitude worse because you're benchmarking from a different region - use the same region and zone you are using with OPS:
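If you're following along, creating those two debian instances with gcloud looks something like this (the instance names, machine type, and image family here are just examples):

gcloud compute instances create bench gtest \
  --zone us-west2-a \
  --machine-type n1-standard-1 \
  --image-family debian-9 \
  --image-project debian-cloud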
Then let's transfer our little go app:
➜ ~ scp -i ~/.ssh/nope [email protected]:~/y/hackernoon .
hackernoon 100% 7321KB 9.0MB/s 00:00
➜ ~ gcloud beta compute scp --zone "us-west2-a" --project "project-something" hackernoon "gtest":~/.
Then install ab:
eyberg@bench:~$ sudo apt-get install apache2-utils
Let's hit both of the instances up:
eyberg@bench:~$ curl -XGET http://10.240.0.41:8080/
Welcome to my website!eyberg@bench:~$ ^C
eyberg@bench:~$ curl -XGET http://10.240.0.38:8080/
Welcome to my website!eyberg@bench:~$
Seems legit. Also - just so we know who's who: 10.240.0.38 is the Nanos unikernel and 10.240.0.41 is the debian instance.
Now let's run with concurrency of 1:
eyberg@bench:~$ ab -c 1 -n 100 http://10.240.0.38:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.38 (be patient).....done
Server Software:
Server Hostname: 10.240.0.38
Server Port: 8080
Document Path: /
Document Length: 22 bytes
Concurrency Level: 1
Time taken for tests: 0.028 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 13900 bytes
HTML transferred: 2200 bytes
Requests per second: 3634.65 [#/sec] (mean)
Time per request: 0.275 [ms] (mean)
Time per request: 0.275 [ms] (mean, across all concurrent requests)
Transfer rate: 493.37 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 0 0 0.0 0 0
Waiting: 0 0 0.0 0 0
Total: 0 0 0.2 0 2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 0
99% 2
100% 2 (longest request)
eyberg@bench:~$ ab -c 1 -n 100 http://10.240.0.41:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.41 (be patient).....done
Server Software:
Server Hostname: 10.240.0.41
Server Port: 8080
Document Path: /
Document Length: 22 bytes
Concurrency Level: 1
Time taken for tests: 0.037 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 13900 bytes
HTML transferred: 2200 bytes
Requests per second: 2684.06 [#/sec] (mean)
Time per request: 0.373 [ms] (mean)
Time per request: 0.373 [ms] (mean, across all concurrent requests)
Transfer rate: 364.34 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 0 0 0.1 0 1
Waiting: 0 0 0.1 0 1
Total: 0 0 0.2 0 2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 1
99% 2
100% 2 (longest request)
Ok - got them warmed up. We see the unikernel outpacing the linux instance just by a bit.
Now let's hit it more:
eyberg@bench:~$ ab -c 10 -n 1000 http://10.240.0.41:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.41 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 10.240.0.41
Server Port: 8080
Document Path: /
Document Length: 22 bytes
Concurrency Level: 10
Time taken for tests: 0.087 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 139000 bytes
HTML transferred: 22000 bytes
Requests per second: 11444.27 [#/sec] (mean)
Time per request: 0.874 [ms] (mean)
Time per request: 0.087 [ms] (mean, across all concurrent requests)
Transfer rate: 1553.47 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 0 1 0.2 1 1
Waiting: 0 1 0.2 1 1
Total: 0 1 0.3 1 3
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 2
100% 3 (longest request)
eyberg@bench:~$ ab -c 10 -n 1000 http://10.240.0.38:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.38 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 10.240.0.38
Server Port: 8080
Document Path: /
Document Length: 22 bytes
Concurrency Level: 10
Time taken for tests: 0.055 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 139000 bytes
HTML transferred: 22000 bytes
Requests per second: 18313.68 [#/sec] (mean)
Time per request: 0.546 [ms] (mean)
Time per request: 0.055 [ms] (mean, across all concurrent requests)
Transfer rate: 2485.94 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 0 0 0.1 0 1
Waiting: 0 0 0.1 0 1
Total: 0 1 0.2 1 2
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 2
100% 2 (longest request)
As you can see this is a fairly decent percentage difference. Now before you get all crazy on the twitters, keep in mind that benchmarking can measure lots of things. To be utterly, painfully clear - these benchmarks are very crude and naive. They are only meant to get you interested - not necessarily prove anything. This was merely looking at a go webserver's requests/second. Measuring a different language like Rust or Node will produce very different results.
In fact - let's go ahead and do just that. Let's look at a simple Rust webserver real quick:
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_read(mut stream: &TcpStream) {
    let mut buf = [0u8; 4096];
    match stream.read(&mut buf) {
        Ok(_) => {
            let req_str = String::from_utf8_lossy(&buf);
            // println!("{}", req_str);
        }
        Err(e) => println!("Unable to read stream: {}", e),
    }
}

fn handle_write(mut stream: TcpStream) {
    let response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<html><body>Hello world</body></html>\r\n";
    match stream.write(response) {
        Ok(_) => {} //println!("Response sent"),
        Err(e) => println!("Failed sending response: {}", e),
    }
}

fn handle_client(stream: TcpStream) {
    handle_read(&stream);
    handle_write(stream);
}

fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").unwrap();
    println!("Listening for connections on port {}", 8080);

    for stream in listener.incoming() {
        match stream {
            Ok(stream) => {
                thread::spawn(|| handle_client(stream));
            }
            Err(e) => {
                println!("Unable to connect: {}", e);
            }
        }
    }
}
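Assuming a standard cargo project whose binary ends up named main (that's what the -a flag below expects), the build is the usual:

# assumes the package, and therefore the binary, is named main
cargo build --release
cp target/release/main .

Then creating the image and the instance mirrors the Go example: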
export GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json
ops image create -c config.json -a main -i rustz1
ops instance create -z us-west2-a -i rustz1
Using the same config as before we'll upload it to Google and spin up three new instances: one for the rust unikernel, one plain debian instance on the same subnet (that's the one we'll run ab from), and one debian instance running the rust webserver. Note: I'm choosing debian for no other reason than that it's the default choice and so is probably what gets used the most.
We do a quick live-check:
eyberg@dtest:~$ curl -XGET http://10.240.0.94:8080/
<html><body>Hello world</body></html>
eyberg@dtest:~$ curl -XGET http://10.240.0.8:8080/
<html><body>Hello world</body></html>
For the one running on debian:
eyberg@dtest:~$ ab -c 1 -n 100 http://10.240.0.94:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.94 (be patient).....done
Server Software:
Server Hostname: 10.240.0.94
Server Port: 8080
Document Path: /
Document Length: 39 bytes
Concurrency Level: 1
Time taken for tests: 0.025 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 9800 bytes
HTML transferred: 3900 bytes
Requests per second: 3948.98 [#/sec] (mean)
Time per request: 0.253 [ms] (mean)
Time per request: 0.253 [ms] (mean, across all concurrent requests)
Transfer rate: 377.93 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 0 0 0.0 0 0
Waiting: 0 0 0.0 0 0
Total: 0 0 0.2 0 2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 0
99% 2
100% 2 (longest request)
We can see the rust unikernel outperforming just slightly:
eyberg@dtest:~$ ab -c 1 -n 100 http://10.240.0.8:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.8 (be patient).....done
Server Software:
Server Hostname: 10.240.0.8
Server Port: 8080
Document Path: /
Document Length: 39 bytes
Concurrency Level: 1
Time taken for tests: 0.021 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 9800 bytes
HTML transferred: 3900 bytes
Requests per second: 4778.97 [#/sec] (mean)
Time per request: 0.209 [ms] (mean)
Time per request: 0.209 [ms] (mean, across all concurrent requests)
Transfer rate: 457.36 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 2
Processing: 0 0 0.0 0 0
Waiting: 0 0 0.0 0 0
Total: 0 0 0.2 0 2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 0
99% 2
100% 2 (longest request)
Keep in mind these are on 1vCPU instances, but let's go ahead and up the concurrency:
eyberg@dtest:~$ ab -c 10 -n 1000 http://10.240.0.94:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.94 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 10.240.0.94
Server Port: 8080
Document Path: /
Document Length: 39 bytes
Concurrency Level: 10
Time taken for tests: 0.072 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 98000 bytes
HTML transferred: 39000 bytes
Requests per second: 13919.63 [#/sec] (mean)
Time per request: 0.718 [ms] (mean)
Time per request: 0.072 [ms] (mean, across all concurrent requests)
Transfer rate: 1332.15 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 2
Processing: 0 1 0.2 1 2
Waiting: 0 1 0.2 1 2
Total: 0 1 0.2 1 3
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 2
100% 3 (longest request)
Not bad. Now let's check out the rust webserver running under Nanos:
eyberg@dtest:~$ ab -c 10 -n 1000 http://10.240.0.8:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.240.0.8 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: 10.240.0.8
Server Port: 8080
Document Path: /
Document Length: 39 bytes
Concurrency Level: 10
Time taken for tests: 0.046 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 98000 bytes
HTML transferred: 39000 bytes
Requests per second: 21736.30 [#/sec] (mean)
Time per request: 0.460 [ms] (mean)
Time per request: 0.046 [ms] (mean, across all concurrent requests)
Transfer rate: 2080.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 0 0 0.1 0 1
Waiting: 0 0 0.1 0 1
Total: 0 0 0.2 0 2
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 1
95% 1
98% 1
99% 1
100% 2 (longest request)
Well - that's a pretty large difference! Keep in mind that these are on 1 vCPU. Nanos has SMP support because, even though there are a ton of languages/applications that are inherently single-process/single-threaded, that's not what our future holds. Google actually has support for instances with 416 threads! Holy multi-threading, Batman.
If you are testing and you just run OPS locally without any taps, you'll be using user-mode networking with no hardware acceleration. Both of those will produce far worse results than what you see here. That's why we use gcloud as a neutral testing ground, and since you can upload the image and start the instance in about 2 minutes it's not really that big of a deal.
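Local testing is still handy for a quick sanity check; something like this (binary name from the earlier Go example) is enough, just don't read performance into the numbers:

# user-mode networking - fine for a sanity check, not for benchmarking
ops run hackernoon -p 8080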
Likewise, measuring something like filesystem writes is not something we're looking at here. That you'll need to wait for the next blogpost for!
Also keep in mind - Nanos is not linux. A lot of people seem to think we've trimmed down the linux kernel and created something like Alpine. That is most definitely not what we have done - go look at the source.
We are also testing on Google Cloud here. Your results will most assuredly be different on AWS. Again, most of our testing has been done on Google, and even though we can deploy to AWS today I am aware of at least one feature that needs to be implemented to make it go much faster than it does today. Why the difference? The instances we run on AWS use Xen and the ones on Google are KVM based. Everything from clocks to network drivers to storage is different.
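If you're curious which hypervisor a given linux guest is sitting on, systemd-detect-virt (or a quick look at dmesg) will tell you - kvm on GCE, xen on the AWS instance types mentioned above:

# prints kvm, xen, none, etc.
systemd-detect-virt
dmesg | grep -i hypervisor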
This is just the start of the performance work, but it is promising. Some people think that down the road, as the codebase grows, we will see significant slowdowns. I don't think this is realistic though, as at the end of the day we are comparing a multi-threaded, single-process system to a system that is not just multi-process but massively multi-process. They are just two different beasts.
Plus, there are a ton of optimizations on our roadmap that already exist in other general purpose systems such as fbsd and linux that we haven't even started on yet. A simple scroll through the issue tracker enumerates many of those. So if anything I expect these numbers to improve by a lot. Plus, one of the cooler things, imo, that we can do since we know these are unikernels is make app-specific optimizations, such as subbing out various schedulers easily.
The Reality Is That We Have a Lot to Look Forward To
Most of the heavy context switching that you read about comes from large, heavily multi-process systems. Writing software in this style has quite a lot of ramifications that sadly just aren't being taught anymore, even though the "hella-core"™ future makes them extremely important to be aware of.
For instance, remember that a process shares a heap between its threads, but two processes each have to have their own. This is why we see databases and the JVM utilizing things like huge_pages. This is also why forking as a concurrency primitive can be so horribly slow.
Also, there are a lot of interfaces in the kernel that exist to facilitate its environment of being a multiple process system, and just as many facilities to ensure that multiple users don't stomp on each other's memory. If you have a kernel that is focused on running one and only one application, it is no coincidence that it is going to run a lot faster even without tuning. It's honestly not even fair to compare the two types.
It's not fair to compare.
Linux is built with the expectation that it could be deployed onto real hardware. We know we will only ever live as a vm, and that allows us to take advantage of that fact. Today it is quite possible to run a vm faster than native linux because of how good hypervisors and virtualization have gotten. The hardware being produced today is actually optimized for hyperscaler deployments (eg: running in a virtualized environment).
Tack on the fact that there are quite a few syscalls we don't support, nor care to, and we automatically get performance boosts.
Then couple that with the fact that a brand new instance of debian or ubuntu or whatever comes with all of this:
root@bench:/home/eyberg# ps aux | wc -l
72
That was debian. Let's look at ubuntu:
root@instance-1:/home/eyberg# ps aux | wc -l
93
Keep in mind these are processes that are already running. We haven't installed anything yet. This is a fresh boot! This is on a single-thread instance!! 93 processes are all fighting each other for that one thread.
Other things stand out immediately on this host:
root@instance-1:/home/eyberg# ps aux | grep python
root 1332 0.1 0.5 171708 19432 ? Ssl 00:14 0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root 1497 0.1 0.5 65148 21352 ? Ss 00:14 0:00 /usr/bin/python3 /usr/bin/google_network_daemon
root 1499 0.1 0.5 65112 21228 ? Ss 00:14 0:00 /usr/bin/python3 /usr/bin/google_clock_skew_daemon
root 1501 0.1 0.5 65448 21476 ? Ss 00:14 0:00 /usr/bin/python3 /usr/bin/google_accounts_daemon
If you think any of these random python programs are going to be fast you might be mistaken. Let's not forget that I'm an active user on this system screwing with the performance for every single command I type. We call them "commands" but what are they really? That's right - yet another program.
On some of the smaller instances you will be throttled too. The f1-micro and g1-small instances, for example, are labeled misleadingly in the gcloud dashboard as "1 vCPU" - they really give you 0.2 and 0.5 of a vCPU respectively, and after playing with them enough I feel the 20% and 50% are bursty all the time. You can really tell the difference between a shared thread and one that is all yours.
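You can check what a machine type actually gives you from the CLI; for example (the zone is just whatever you're using):

# shows guestCpus, memory, and whether the core is shared
gcloud compute machine-types describe f1-micro --zone us-west2-a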
All this is to say that running a general purpose operating system in the cloud works, and everyone does it because it was the only tool we had.
However, it's not the only tool in your toolbelt anymore.
The server side operating system revolution is long overdue.