Taming node_modules at Facebook
About a year ago I joined Facebook and the React Native team. I checked recently - I’m among the top 10 contributors in the GitHub. If you look at my commits, especially recently, they are mostly fixes and tests.
If something got broken, I jumped into it and fixed the feature. Or I ping a guy or a girl, who committed something that broke other people’s work. That is what drives me - unblocking other people, and allowing them to contribute to the project, not only inside Facebook, but from the outside.
Node.js and NPM
Quite recently, NPM rose so high and so fast, that it now has more packages than all the other package registrars combined.
Why is NPM Successful?
I think the first reason is how easy it is to publish to NPM. Some people think it’s too easy to publish. And by easy, I mean you type something on the keyboard, press enter, and in just a few seconds the package is up there.
Another reason is how NPM resolves dependencies. It’s unique to many client-side package managers and many server-side. So let’s look at an example.
How Dependencies are Resolved in NPM
I have application A that depends on dependency B, and it depends on dependency C. In Java, for example, you just download them all, combine them, link them together, and you have an application.
Everything is fine, except when you start relying on a dependency C of a different version in your application. You basically get a conflict between C and C2.
In Gradle, you have to choose one of them. It forces you, and doesn’t allow you to proceed without making the choice. Sometimes they may be incompatible. And what is worse, you cannot just choose any one. The application will not compile.
What Node.js did was make it so that it doesn’t really matter. In this example, the packages in Node.js and NPM were installed for application A, and they were extracted in node_module’s folder.
The dependencies of A are directly under node_modules B and C2 here, but B has its own version of dependency C. This way, when B works with its code, and requires something from library C, it doesn’t interfere. It just goes inside - and that’s quite nice.
I mean, you don’t need to care about dependency resolution ever. Isn’t it amazing? The problem is that library C can exist four times in your application, sometimes 20, sometimes 200.
Get more development news like this
When you have so many files, you have to manage them, which is not that easy. You can make your application less stable if you have too many files in some areas of it.
Let’s return to our application graph. In Node.js and NPM, it is usually described with a package JSON file: you have the name, and you have the dependencies, B and C. Quite often, people set a loose dependency in the identifier to the versions, so that if the community pushes a patch, it will be automatically sucked into your application.
Here, we only list the direct dependencies, not the sub-dependencies. The problem arises when you install something on your development machine, or on your test server, and test your application all through. Here in our graph, it’s done with the B3 dependency.
Imagine everything goes fine with the test. However, let’s say you deploy to production, or you complete a production build. At that moment, the new dependency - Dependency B - gets resolved, and is patched by the community. Because of the loose versioning, your application catches the one that is the latest and the best thing.
But this is something you may not expect. Things may break - things break quite often. And even worse, sometimes a dependency or a package may be removed from the registry. Perhaps it could not be accessed because there was a network partition while you were installing.
This is not something that you want while you are developing business critical applications, but it was a problem that we faced at Facebook.
Problems Installing NPM Packages
There are two problems that I just highlighted.
Firstly, node_modules get bigger and bigger. Secondly, the same package or json installation could result in different applications. Different because the dependencies either were not available, or the dependencies had changed over time.
Ways to Stabilize NPM Installs
Let me go through some of the things that companies and people in the Node.js community do to make their applications more stable, and to make their application development cycle more predictable.
First of all, a method called shrink wrapping.
Shrink Wrap Dependencies
As I showed before, imagine you have your packages on the same dependency graph in our application A. However, instead of package json for your installation, you generate a shrink wrap.
Shrink wrap is the same as package json, but the difference is that you list all the dependencies deep in the tree, and you freeze them. You don’t allow them to be updated. The dependencies are resolved to exact URLs of the tar.gz zip files that get downloaded remotely, unzipped - and then it works.
The only problem is, with shrink wrapping, you are still not protected from a dependency being removed from the registry. Or again, the network could be unavailable. We don’t want our build server to depend on the network connection, and neither do we want a DDoS attack on the dependency resolution on the DNS.
Shrink wrapping is actually very good. I would recommend that everyone should use it in their applications. But it does not guarantee that we will have a nice stable build system.
Check in Node_Modules
The second method, one that has been recommended for quite long, is checking in node_modules.
Once the Node community and NPM community started to get bigger, people started noticing problems. For example, we have an application in Node.js which uses a dependency they call shell.js. It’s a small library. I decided to update it from version 0.6 to 0.7. Should be a minor patch, right?
It resulted in a pull request with 228 files changed. If we check in those node_modules and check in every dependency, reviewing this pair becomes quite hard. There are too many files.
It’s not just that having too many files makes it hard. At Facebook we have a large repository. It’s very convenient for us to have all the applications in the same repositories so that they share the libraries. This is how we prevent a dependency hell inside Facebook.
What it means is that we have a special dedicated team which reviews and works with the version control systems and makes sure they’re as fast as possible.
When we were committing node_modules for React Native into the repository, that being the easiest and the most reliable way to freeze your dependencies and not depend on the network, we were committing 100,000 files, which is comparable to the whole size of the Facebook repository.
Every week we update one or two versions - imagine there were 200,000 file changes! The version control team started knocking on our door and saying, “Hey guys, you’re slowing down everyone in the company because when they re-base, they’re updating thousands of files at a time, and you need to do something about it.”
Remember the shrink wrap, and that it depends on the tar.gz zip files? What if we don’t check in the node_modules with the 100,000 files? The node_modules are now a bit smaller because we managed to shrink the size a little bit.
But let’s return our attention to the tar.gz zip files. They are quite small - most of them are less than 30 kilobytes and people are quite comfortable to having them checked in. There are just under 1,000 of them, compared to 50 to one hundred thousand of them exploded.
Addy Osmani from Google posted a nice blog covering this problem. The question is, “can we install node_modules from files located somewhere on the disc locally?” The blog post goes through some explanations why it doesn’t work, and ideas as to what could be done instead.
The bottom line is that an NPM client needs a server in order to respond to you, because NPM goes to the server and checks if a dependency was updated. If no, it takes something from cache. Nevertheless, it needs to go to the server.
When we were faced with the rage of the version control team, we were forced or asked politely to find another solution. The most logical method was, rather than store node_modules in the version control, we should store it somewhere else.
Third Party Storage
Let’s look at an example. We have a basic project that goes to GitHub. But node_modules doesn’t go there - it goes to a network storage or a Dropbox folder, or somewhere else. Anytime you change the package json that represents your modules, you zip your node_modules, send it to the network, and associate it with your package json, probably by the hash of the whole file.
Whenever your teammates are re-basing on the master branch and see the new package json, our tools go to the network and get the same zip file that you have uploaded before. You would think this should work quite nicely. Not exactly. The problem is that version control systems are for team collaboration, but the network storage is not for collaboration.
For example, I can change a file, and my teammate can do the same. GitHub will automatically merge the changes, if there are no conflicts on the same lines.
Imagine I change one dependency on a package json, while my teammate changes the package json for another dependency at roughly the same time. We both upload the zip files to the network storage.
When it gets resolved into the master branch after both of us merge it, we have a nice resolved package json file in GitHub. However, we cannot three-way merge two zip files which are binary. It doesn’t happen automatically.
There we have two problems. First, we cannot do this three-way merge for something that is a binary. Second, it doesn’t work offline. At least how it worked at Facebook.
At Facebook, we have offices in London, in other parts of America, and in Asia. People travel quite a lot. At least twice a year we travel between offices, and spend a considerable amount of time on planes where we don’t always have internet access.
When you checkout a branch in Git or Mercurial, you have it cached, and you can switch between features quite easily. But if you don’t have access to the network, where your zip files are connected, you cannot run the application, you cannot compile it.
We were thinking it could work better and considered a few options: should we invest a little bit more time in making the network cacheable? Or resolve the three-way merge? Maybe we should automate it somehow?
Can We Do Better?
Let’s take one step back and look at how npm works, to know how much effort it will take to improve on the third option. While it worked, it was not completely suitable for us, and is a considerable investment.
NPM CLI in a Nutshell
Let’s return to our application A with the dependency graph. We have its package json. Now, we resolve all the dependencies and download them from the registry. We unzip them and place them in the folders.
Is it really harder than the solution that I explained, that of loading something in the remote folder?
Something else that motivated us while we were investigating the problem was that there was a big change, as NPM version 2 was migrated to NPM 3. A lot of engineers at Facebook didn’t want to update to the latest node version because NPM 3 became much slower. The new version had optimized how files are extracted in the folder.
For example, you have the same version of dependency C installed many times, deep in the trees. It could be optimized to have C installed only once, with a means to link to the one-time installed thing. With the optimization, they sacrificed performance of installation quite significantly.
NPM 2 to NPM 3 Slowdown
Here we have a network waterfall of installing the package with NPM 2. At the top is NPM 2 and its many packages stacked quite nicely together: they just go and download. There is not much space in between them.
What happened after NPM 3 got updated is that we saw a lot of white space between them. NPM 2 installed this package in 12 seconds, while NPM 3 installed in 50 seconds. It was quite noticeable.
While we were thinking about this, we started hacking. That’s what people do at Facebook - try to find a solution that works. And that was how Yarn was born.
Yarn is basically a reverse engineered client to NPM.
The work started at Facebook London office. We saw some potential, and people from other companies joined us: from Tilde, from Google, from Exponent. They offered help and offered to create a community project rather than belonging to some one company.
Right now it is managed by the open source community, although big companies are involved, helping and providing the engineering strength. Just two weeks after Yarn was released, we had 87 people who had contributed code to the project. We have plenty of issues, but we’re on the track of resolving them and finding out what can be done better.
When Yarn was designed, we put constraints on ourselves: we wanted it to be faster than NPM, not to mention the pain of node_modules management inside Facebook. The solution had to be reliable, it could not be dependent on the network, and we had to protect ourselves from malicious packages.
Here is how Yarn works versus NPM 3. This is the timing of installing React Native, which is very topical to me. Red is NPM times, while blue is Yarn. It varies - depending on whether you have cache locally, or whether you have a log file already generated.
Most of the time Yarn works just faster.
Secondly, you don’t need to generate shrink wrap. Yarn generates it for you. And Yarn manages it better. Here’s an example, with the same shrink wrap that I showed you before, this time in Yarn.
The difference is that we moved away from json format, because we noticed that people review those files. We needed them to be editable and understandable by people. Another detail is that we added a checksum to all the packages after they’re installed.
If you installed something on your development machine for the first time, and you test something in production, but you want to install it again, you can be quite sure that no one mangled with the file that was downloaded from the internet.
Now, the question that I asked before: can NPM work with files local, if they’re checked in? The answer is, of course. That’s the idea of Yarn. That’s why we wanted it to be hackable.
Remember that pull request that I showed you where we had 228 files? This is the same change, version 0.6 to 0.7.
What changed is three tar.gz zip files that got downloaded. I checked them into the repository. And package json changed, and then I review the Yarn log file because I care what changed and I want to know what happened.
Here is a new dependency that got edited with the shell.js one in it, and the location local to the folder. Here is another edit, and here is another thing replaced. Yarn makes it is easy to review and easy to three-way merge.
To summarize: NPM is quite amazing - it works quite well for the community. However, it is very easy to mismanage node_modules with the current tools we have. That is why Yarn was created, and that’s why we think it’s useful for the community.
Questions and Answers
“Q: Isn’t it better to restrict the number of people who can modify node_modules and their dependency?
To have a Node package manager who will be deciding when to watch the request by other developers and tend to have one or two people, or like five bucks in, not like every developer, to be able to change the packages, but to have - you know what I mean?”
“A: Not exactly, no.”
“Q: Probably, in Facebook, every developer can change the packages and then the dependencies.
It is better to do it on request by some people who are on top and then to have an - or stable node_modules than to change it to 100 times a day, right?”
“A: Oh, that works, but doesn’t work at Facebook’s scale. At Facebook, we have 1,000 commits happening a day. It’s just impossible to have one person owning a piece, because people need sub-dependencies - people need to be unblocked.
And they work across the globe, so the person responsible for generating node_modules could be fast asleep while the other person needs to fix a bug.
That’s one concern. As for the other one, I guess we could dedicate one or two or three people but we would rather fix the tool rather than rely on people not making mistakes.
For smaller teams, yeah it can work. Whatever works for you. It’s a volunteer thing, and it’s something that everyone should check for themselves - something that they should evaluate their risks, and the gains and the losses.
One person is quite fine, as long as, here is the situation that we had. We had a CI server running Linux and most of the developers running Mac. If that person generates node_modules, and if here she checks in those node_modules, or sends them to a third party storage, they are basically Mac generated C dependencies.
If a Linux server tries to run it, it would fail. So there are extra tools needed to provide for the cross-platform thing.”
“Q: I heard that on Facebook there’s a single repository, right, for everything?”
“Q: Now, when you have a single repository and commit lots of binary parsing to it, how do you handle that? At some point my local clone would be just big. What are the strategies for that?”
“A: The question is, what happens if the files that we commit, for example for React Native, the tar.gz zip sources of packages in the source control, what happens when they become too big?
First of all, we need to understand the scale of the problem. When we were checking in node_modules as they are installed into the repository, we were checking in 150, 200 megabytes and 100,000 files.
Right now we are checking in 800 files, 100 times less and 20 megabytes total, so those 20 megabytes gets unzipped into 200 megabytes. We have quite a little bit of breathing space here.
Another trick we have is a common storage for all the offline things. For example, React Native and NOOK Light Project, they share roughly 50% of dependencies.
So these common storage that we have in one large repo is reused quite a lot. Adding a new project in the same system is just adding a few tar.gz files that are not yet mentioned. But again, these will come. At some point they will be a problem.
“Q: So what should we do? Should we use Yarn now, or should we commit our node_modules to the repository?”
“A: I personally, outside of Facebook infrastructure, would use Yarn and check in the tar.gz zip files, the sources of the packages. And that would use Yarn to rebuild the packages when I need a fresh node_modules. This is quite fast and doesn’t take too much space.
But it all depends. People use node_modules and npm right now for other languages like OCaml, Reason, C++, whatever goes into the registry. So some of the dependencies could be large. Sometimes it could 50 megabytes per file. So you have to use you judgment.”
“Q: To get back to the scale of your project: are you able to review your code if you happen to team like 100 commits per day? Are you doing code reviews?”
“A: Yeah, of course. No one actually should be lending changes without someone else approving it. In very rare situations, when something is completely broken, you could go through these. But we trust the engineers to use their best judgment not to do that.
People at Facebook, even though we make 1,000 commits per day, we don’t change the package json 100 times a day. People are quite familiar with which package json does what, so we don’t need one person to review that.
Teammates can see and open it quite well, as long as they are all working on the project and as long as they can explain why they are making a change. With continuous integration, and with builds running all the time for every commit we make, we can ensure that people don’t break each other’s code.
So the answer to scale is continuous integration. Tests, tests, tests.”
About the content
This talk was delivered live in October 2016 at Mobilization. The video was transcribed by Realm and is published here with the permission of the conference organizers.