What if we could verify npm packages?

Reproducible steps for identifying unwanted and malicious code

The state of NPM security

2018 brought us some fairly high-profile talking points about the state of NPM security.

In January, David Gilbertson broke the internet (pun intended) with a plausible attack on everyone’s PII in his post “I’m harvesting credit card numbers and passwords from your site. Here’s how.” The hypothetical attack centered on shipping malicious code in a package without that code ever appearing in source control.

In July, attackers added malicious code to eslint-scope that allowed them to steal npm tokens from other packages. Our friends over at NPM nuked all tokens before the attack could spread further.

In September, malicious code was added to event-stream via a dependency and remained undetected for months.

Photo by João Silas on Unsplash

NPM has a few attack vectors

I hesitate to call the following vectors “security flaws” because some of them are just the way package managers work. However, most of the high-profile stories from last year targeted the same things. Let’s take a look at what they are and then see if anything can be done to mitigate their risk.

Legitimate packages have contents that aren’t in source control

An npm package is just a tarball of files. In fact, all package managers (npm, NuGet, Maven, etc.) just distribute tarballs, zip files, or some other bundle of content. Any responsible developer is going to keep their code in source control; however, this code may or may not be the only thing in the package.

For compiled languages like Java or .NET, packages contain build artifacts, not source code. Especially if the build output is obfuscated, it is difficult, if not impossible, to casually discover security flaws in the package contents. JavaScript is a bit different in that many simple packages are just tarballs of unmodified source code. However, TypeScript requires a transpilation step, and many non-TypeScript codebases include some form of bundling or minification process before an npm package is created.

Security concerns are exacerbated by the fact that minified code is hard to read. The chances of someone finding a flaw without sifting through the code character by character are very low.

I cannot overemphasize how normal all of this is; however, the disparity between what’s in Github and what’s in the package is a great place to hide malicious code.

This is what happened in the attack on eslint-scope. The attackers gained access to the token required to publish the package, but not the project’s Github account. In an oversimplified explanation, they added malicious code on their local machine and republished the package. When the dust settled, NPM unpublished the offending version, but it is important to note that the latest version of eslint-scope still does not match the repository per the git HEAD published via the registry. (I’ll explain this in detail later.)

Legitimate packages can be updated with bad dependencies

This is the basis of David Gilbertson’s hypothetical attack and the very real attack on event-stream. By targeting dependencies, attackers never had to gain access to anything but the attention of maintainers. Gilbertson (fictitiously) created a few quasi-useful packages and then sent out a blast of PRs to get them into trusted packages. Flatmap-stream was actually created and added to event-stream.

The only guard against such an attack is to meticulously review every commit that affects any part of the dependency tree. Additionally, using a package-lock.json file helps to ensure that predictable dependencies are used, which mitigates the risk of an attack on a sub-dependency. But security by code review has to be on point all the time, whereas attackers only have to get lucky once.
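To make the lock file point a little more concrete, here is a minimal sketch that flags lock file entries with no integrity hash, since those are the ones npm cannot check against a known checksum at install time. It assumes the lockfileVersion 2/3 layout with a top-level "packages" map; the function name and the sample data are made up for illustration, and bundled dependencies may legitimately lack a hash.

```javascript
// Sketch: list package-lock.json entries that have no integrity hash.
// Assumption: lockfileVersion 2 or 3, where entries live under "packages".
function entriesWithoutIntegrity(lock) {
  return Object.entries(lock.packages || {})
    // Skip the root project ("") and symlinked workspace entries.
    .filter(([path, entry]) => path !== '' && !entry.link && !entry.integrity)
    .map(([path]) => path);
}

// Hypothetical lock file contents, for illustration only.
const sampleLock = {
  lockfileVersion: 2,
  packages: {
    '': { name: 'my-app', version: '1.0.0' },
    'node_modules/left': {
      version: '1.0.0',
      resolved: 'https://registry.example/left/-/left-1.0.0.tgz',
      integrity: 'sha512-EXAMPLE',
    },
    'node_modules/right': {
      version: '2.0.0',
      resolved: 'https://registry.example/right/-/right-2.0.0.tgz',
    },
  },
};

console.log(entriesWithoutIntegrity(sampleLock));
```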

There is no canonical way to prove that a package is legitimate

This is somewhat of a combination of the previous two vectors. If NPM knew that eslint-scope contained malicious code, they could have kept it from ever being published.

Let’s say that the code in a package must always match what is in source control. Because the attackers never gained access to the GitHub account, they couldn’t change the repo, so they could not have published a new package version. But that rule would also keep TypeScript-based projects such as RxJS from existing, because their published packages contain compiled output rather than the committed source. So obviously, that isn’t a good heuristic.

It would also be nice if dependencies could be verified before they are installed, or if problematic dependencies could be flagged by something like npm audit.

Is there a viable solution?

Ideally publishing an npm package would look something like this:

Edit code
Commit and push to Github/Gitlab/Bitbucket/etc
???
Publish

There is a missing automated process that verifies that the package about to be published matches what is in source control, or at least is a deterministic result of what is in source control.

Let’s take a look at what NPM offers for building such a process.

NPM pre/post scripts

I’m going to give a quick refresher here about npm scripts so that we are all on the same page.

You can create a “pre” or “post” version of any script, which runs before or after the script it is named for. For example, if you have a “build” script that runs the TypeScript compiler, you could also create a “prebuild” script to clear the build output folder of any previous files. Breaking apart complex scripts this way helps to produce small, easy-to-read scripts.

You can also write “pre” and “post” scripts for the “built-in” npm scripts. For example, the “prepack,” “prepublish,” and weirdly named “prepare” scripts let you run builds or tests before creating or publishing your package.

Here is the actual “prepare” script from the Redux package.json:

"prepare": "npm run clean && npm run format:check && npm run lint && npm test"

As you can see, packaging Redux will first clean, check, lint, and test all the things. If a developer somehow snuck bugs or even bad formatting past code review, this script would prevent that change from being packaged and subsequently published. This is pretty neat!
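The pre/post naming convention is simple enough to sketch in a few lines. This is only an illustration of the lookup order for npm run, not npm’s actual implementation; the function name and the sample script commands are made up.

```javascript
// Sketch: which scripts run, and in what order, for `npm run <name>`.
function lifecycleOrder(scripts, name) {
  // npm refuses to run a script that is not defined at all.
  if (!(name in scripts)) return [];
  return [`pre${name}`, name, `post${name}`].filter((key) => key in scripts);
}

// Hypothetical "scripts" section of a package.json.
const scripts = {
  prebuild: 'rimraf dist',
  build: 'tsc',
  test: 'jest',
};

console.log(lifecycleOrder(scripts, 'build')); // [ 'prebuild', 'build' ]
console.log(lifecycleOrder(scripts, 'test')); // [ 'test' ]
```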

NPM registry and source control

While npm is a command line application, all of its data comes from registry.npmjs.com. Any publicly available data about a package is provided by this API. If you want to see all of the current and historical data about Redux, just GET https://registry.npmjs.com/redux.
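If you would rather script that request, here is a rough sketch. It assumes Node 18 or later (for the global fetch); packumentUrl and fetchPackument are names I made up for this example.

```javascript
// Sketch: build the registry URL for a package and fetch its metadata.
function packumentUrl(name, registry = 'https://registry.npmjs.com') {
  // Scoped names such as @scope/pkg keep the "@" but escape the slash.
  const escaped = name.startsWith('@') ? name.replace('/', '%2f') : name;
  return `${registry.replace(/\/+$/, '')}/${escaped}`;
}

async function fetchPackument(name) {
  const response = await fetch(packumentUrl(name));
  if (!response.ok) {
    throw new Error(`Registry returned ${response.status} for ${name}`);
  }
  return response.json();
}

// Usage (performs a network request, so it is left commented out):
// fetchPackument('redux').then((data) => console.log(Object.keys(data.versions)));
```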

As you can see, the “versions” object contains data about each version indexed by version ID. At the time of writing, redux@4.0.1 is the latest stable version. There are a few interesting things to note about the version info that comes from the registry. Here is the abbreviated object so that we can focus on the relevant bits:

"4.0.1": {
  ...
  "repository": {
    "type": "git",
    "url": "git+https://github.com/reduxjs/redux.git"
  },
  ...
  "gitHead": "c5d87d95f3b9b0ebdb57791f69b53d8507cebbed",
  ...
  "dist": {
    ...
    "shasum": "436cae6cc40fbe4727689d7c8fae44808f1bfef5",
    ...
  }
}

The “gitHead” corresponds to the current commit on the machine from which the package was published. For a bit more context, you can access that commit on Github to see more info about it:

https://github.com/reduxjs/redux/commit/c5d87d95f3b9b0ebdb57791f69b53d8507cebbed
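Pulling those three fields out of the registry response can be sketched like this. The sample object only contains the fields shown above, and extractVersionInfo is a made-up helper name.

```javascript
// Sketch: pick the repository URL, gitHead, and shasum for one version.
function extractVersionInfo(packument, version) {
  const entry = packument.versions && packument.versions[version];
  if (!entry) throw new Error(`Version ${version} not found`);
  const rawUrl = entry.repository && entry.repository.url;
  return {
    // Drop the "git+" prefix so the URL can be handed straight to git clone.
    repoUrl: rawUrl ? rawUrl.replace(/^git\+/, '') : null,
    gitHead: entry.gitHead || null,
    shasum: entry.dist ? entry.dist.shasum : null,
  };
}

// Abbreviated registry data for Redux, as shown above.
const samplePackument = {
  versions: {
    '4.0.1': {
      repository: { type: 'git', url: 'git+https://github.com/reduxjs/redux.git' },
      gitHead: 'c5d87d95f3b9b0ebdb57791f69b53d8507cebbed',
      dist: { shasum: '436cae6cc40fbe4727689d7c8fae44808f1bfef5' },
    },
  },
};

console.log(extractVersionInfo(samplePackument, '4.0.1'));
```

Not every published version has a gitHead or a repository entry, which is why the sketch returns null rather than throwing when one is missing.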

NPM pack dry run

The “shasum” in the registry output is the checksum of the published package tarball. Assuming you have the Redux repo cloned, you can check out the package’s “gitHead” with git checkout c5d87d95 (the first eight characters of the SHA) and install dependencies from the lock file with npm ci.

Now we can do a dry run of the pack command which performs the packaging process without actually generating the file: npm pack --dry-run. Here is the output when run at this specific commit:

Result of packing Redux at c5d87d95

This is what would go into the package’s tarball if we didn’t run it as a “dry run.” There are two really cool things going on here. First, anyone who checks out this commit on a clean repo and runs the “pack” script will get the EXACT SAME OUTPUT. Secondly, note that the shasum in the output is the EXACT SAME as the one from the registry. That proves that when Tim Dorr packed and published Redux, he performed the EXACT SAME steps we did. No undocumented manual steps, no hidden malicious code, no sneaky business.

We know this because we started with the same git HEAD and ended with the same shasum. We can look at everything in the repo including both the code and the “prepare” script and see that everything is above board. Armed with this knowledge, I think that we can definitely trust this version of Redux!
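For the curious, the shasum itself is nothing exotic: it is the SHA-1 digest of the tarball’s bytes, written as hex. Here is a small sketch of computing and comparing it; the helper names are mine, and the usage lines assume you have a tarball from npm pack on disk.

```javascript
const { createHash } = require('node:crypto');

// Sketch: compute the shasum (SHA-1, hex encoded) of a tarball's bytes.
function shasumOf(tarballBytes) {
  return createHash('sha1').update(tarballBytes).digest('hex');
}

// Compare a locally built tarball against the shasum from the registry.
function matchesRegistry(tarballBytes, publishedShasum) {
  return shasumOf(tarballBytes) === publishedShasum.toLowerCase();
}

// Usage (assumes redux-4.0.1.tgz was created by running `npm pack`):
// const fs = require('node:fs');
// const tarball = fs.readFileSync('redux-4.0.1.tgz');
// console.log(matchesRegistry(tarball, '436cae6cc40fbe4727689d7c8fae44808f1bfef5'));
```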

👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍

RealScience™ is repeatable

Here are the steps we just took to verify Redux:

Get package data from the NPM registry
Find the repository URL
Find the gitHead/commit
Check out the commit (into a temp folder)
Install dependencies
Pack dry run
Compare published and re-created shasums
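The steps above can be sketched roughly as follows. This is not how TBV is implemented; it is a bare-bones illustration that assumes git and npm are on the PATH of a POSIX-like system, that npm pack --dry-run --json prints a JSON array whose first entry has a shasum field, and that nothing else ends up on stdout. The function and parameter names are made up, and the command runner is injectable so the logic can be exercised without touching the network.

```javascript
const { execFileSync } = require('node:child_process');

// Default runner: execute a command in `cwd` and return its stdout as text.
function defaultRun(command, args, cwd) {
  return execFileSync(command, args, { cwd, encoding: 'utf8' });
}

// Sketch: rebuild a published version from source and compare shasums.
function verifyPublishedVersion(info, run = defaultRun) {
  const { repoUrl, gitHead, publishedShasum, workDir } = info;
  // A full clone keeps things simple; a shallow clone would be faster.
  run('git', ['clone', repoUrl, workDir], undefined);
  run('git', ['checkout', gitHead], workDir);
  run('npm', ['ci'], workDir);
  const packOutput = run('npm', ['pack', '--dry-run', '--json'], workDir);
  const [{ shasum }] = JSON.parse(packOutput);
  return { shasum, verified: shasum === publishedShasum };
}

// Usage (clones and installs for real, so it is left commented out):
// console.log(verifyPublishedVersion({
//   repoUrl: 'https://github.com/reduxjs/redux.git',
//   gitHead: 'c5d87d95f3b9b0ebdb57791f69b53d8507cebbed',
//   publishedShasum: '436cae6cc40fbe4727689d7c8fae44808f1bfef5',
//   workDir: '/tmp/verify-redux',
// }));
```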

I created a proof-of-concept package validator called TBV (trust but verify) that automates these steps for verifying packages. The package’s repo is “shallow cloned” into a temporary folder where all of the npm operations are performed. Here is an example output when we run TBV on Redux:

Redux passes. So does Express:

Note that Express doesn’t have a “prepare” or “prepack” step, so we didn’t need to install dependencies.

But not all popular packages verify. At the time of writing, lodash is the most depended upon package on NPM; however, it doesn’t have a “prepack” step and the published shasum doesn’t match the one generated from the corresponding version tag in Github:

This isn’t to say that Lodash is dangerous to use; it just means that we can’t prove it isn’t with the same steps we performed on other packages. In the future, I intend to do further research on the most popular packages to see if there are ways to reduce false negatives.

Pre-publish verification

There is a similar set of steps for self-verifying prior to publishing. The intent here is to introduce packages into the wild that are easy to verify:

Ensure that your package.json specifies a repository
Pack dry run (local files)
Check out the latest local commit (into a temp folder)
Install dependencies
Pack dry run (temp folder)
Compare local and temp folder shasums

This process is similar to the previous verification process; however, instead of the shasum coming from the registry (we haven’t published yet), we get it from a dry run of the local code. This is then compared against what is in source control. The result is that if there are any uncommitted or unpushed changes on your local machine, the process will fail.

The “test” process indicates whether or not other developers would be able to verify your package if it were to be published. Because it doesn’t pull from the NPM registry at all, you can make and push as many changes as necessary to get the package to validate before publishing.
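Here is a rough sketch of that pre-publish check. As before, it is an illustration rather than TBV’s actual code: it assumes git and npm are on the PATH, that npm pack --dry-run --json prints a JSON array whose first entry has a shasum field, and that the repository URL is passed in rather than read from package.json. All names are made up.

```javascript
const { execFileSync } = require('node:child_process');

// Default runner: execute a command in `cwd` and return its stdout as text.
function defaultRun(command, args, cwd) {
  return execFileSync(command, args, { cwd, encoding: 'utf8' });
}

// Dry-run the pack step in `dir` and return the resulting shasum.
function packShasum(dir, run) {
  const [{ shasum }] = JSON.parse(run('npm', ['pack', '--dry-run', '--json'], dir));
  return shasum;
}

// Sketch: would other developers be able to verify this package?
function testBeforePublish({ projectDir, repoUrl, tempDir }, run = defaultRun) {
  const localShasum = packShasum(projectDir, run);
  const head = run('git', ['rev-parse', 'HEAD'], projectDir).trim();
  run('git', ['clone', repoUrl, tempDir], projectDir);
  // This checkout fails if the local HEAD commit was never pushed.
  run('git', ['checkout', head], tempDir);
  run('npm', ['ci'], tempDir);
  const cleanShasum = packShasum(tempDir, run);
  // Uncommitted local changes show up as a shasum mismatch.
  return { localShasum, cleanShasum, verifiable: localShasum === cleanShasum };
}
```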

Here is an example output from testing the TBV project itself:

Proof-of-Concept to Production

The TBV tool is nothing more than a quick and dirty experiment to test the hypothesis that automated package verification is possible. And it looks like it is! But in its current form, it is nothing more than a proof-of-concept. Where do we go from here?

Official Verified Packages

I think that NPM (and probably all package managers) ought to have this sort of package verification baked in. Npmjs.com could display the verification status of packages to indicate which ones are much less likely to contain unpredictable or non-reproducible contents:

Twitter, please don’t sue me. You’ll benefit from this, too. I promise! :D

Verified-only accounts/packages

NPM could allow accounts to be configured such that publication of unverified packages is prohibited. This would prevent attacks like the one on eslint-scope. Hackers wouldn’t be able to update the code in Github which means it would be impossible to publish a package that would validate, even if they had access to the publish token.

To flip it around, package.json could have a “verifiedOnly” flag to prevent installation of unverified dependencies or sub-dependencies.

Improve NPM audit

Currently, the npm audit command checks for known security vulnerabilities in the project’s full dependency tree. A useful addition to the current audit would be reporting how many “unverified” packages exist in the dependency tree.

Many legitimate packages exist at the moment that do not validate per the simplistic rules I define in this post. If such a feature makes it into npm (or yarn), there would be a need for whitelisting certain known good versions of unverified packages.
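To give a feel for what such a report might look like, here is a purely hypothetical sketch. Nothing like this exists in npm audit today; the data shape, the function name, and the package IDs are all invented for illustration.

```javascript
// Sketch: summarize verification results for a dependency tree.
// `results` is a hypothetical list of { id, verified } entries and
// `allowlist` holds known good versions that fail verification.
function summarizeVerification(results, allowlist = []) {
  const allowed = new Set(allowlist);
  const summary = { verified: [], allowlisted: [], unverified: [] };
  for (const { id, verified } of results) {
    if (verified) summary.verified.push(id);
    else if (allowed.has(id)) summary.allowlisted.push(id);
    else summary.unverified.push(id);
  }
  return summary;
}

// Made-up package IDs, for illustration only.
const sampleResults = [
  { id: 'pkg-a@1.0.0', verified: true },
  { id: 'pkg-b@2.3.0', verified: false },
  { id: 'pkg-c@0.4.1', verified: false },
];

console.log(summarizeVerification(sampleResults, ['pkg-b@2.3.0']));
```

Keying the list on exact versions matters here: a later release of a whitelisted package should have to earn its place again.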

Actually catch a bad guy

The best thing that could happen is for a news story to break that someone DIDN’T introduce malicious code into a popular package. No, you’re right, such a non-story would never even hit the blogosphere in the first place. 👍

I actually learned a lot about the NPM ecosystem while researching this project! Watching high-profile security flaws hit the news machine is painful because it always brings out the NPM and NodeJS haters. I am optimistic that solutions are within reach! Let’s make 2018 the last time that something like [email protected] ever happens. Let’s also make sure that David Gilbertson keeps his grimy mitts off our credit card numbers and passwords 😃.

I’m genuinely interested in your feedback on this sort of approach. Drop a comment or leave some 👏 to let me know what you think. Also, check out TBV (yes, it validates 😃) and give it some 🌟🌟🌟 if you think it earned them.

Happy coding!

EDIT: I have put a bit of work into building an experimental version of “Package Verification as a Service.” You can read about it here: https://hackernoon.com/npm-package-verification-ep-2-2b2ec66eb610