Boosting Continuous Integration Pipelines

September 21, 2024

I’ve recently been working with teammates to improve the performance of our CI pipelines, and I wanted to share some of what I’ve learned. We’ve had some pretty slow pipelines: if you want to deploy an approved PR, it’s not unheard of to wait 30 minutes or more for everything to pass before you can merge.

Slow pipelines hurt productivity because engineers end up waiting and watching pipelines rather than doing work. As an engineer faced with this dilemma, you could let the pipeline do its thing while you start something else, but you risk another engineer merging their PR while you’re focused elsewhere. If that happens, you have to run the pipeline once more, against the new version of main (merge queues would help with this; we’re working on implementing them now).

The first step to improving our pipelines was to make sense of the whole process: what does this step do? Why does step A depend on step B? Is a particular step even necessary? It turns out that a lot of what we were doing wasn’t necessary. For example, we downloaded binaries for Cypress and Playwright tests by default in the step that checks TypeScript. This was likely unintentional, but it added unnecessary time, and it was easily fixed by setting a couple of environment variables: CYPRESS_INSTALL_BINARY=0 and PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1.
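To make that concrete, here’s a minimal sketch of the type-check step with those variables set (the yarn commands are illustrative of our setup; the two environment variables are the documented Cypress and Playwright switches):

```bash
# Tell Cypress and Playwright not to fetch browser binaries during install;
# a type-check step never launches a browser.
export CYPRESS_INSTALL_BINARY=0
export PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1

yarn install --frozen-lockfile
yarn tsc --noEmit
```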

We also installed or downloaded node_modules in many steps where it wasn’t actually needed. For example, all of our static analysis steps ran a script that checked for dependencies; if they had already been installed in a prior step, it downloaded the node_modules artifact and unarchived it. While that’s reasonably quick, it isn’t instantaneous, and linting, type-checking, and formatting don’t need to run yarn install at all. We removed that logic and gained a couple of minutes.
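The restore logic we deleted looked roughly like this (a sketch; download_artifact stands in for our CI’s artifact helper, and the names are hypothetical):

```bash
# Sketch of the restore step removed from the static analysis jobs.
restore_node_modules() {
  if download_artifact node_modules.tar.gz; then  # hypothetical CI helper
    tar -xzf node_modules.tar.gz                  # quick, but not free
  else
    yarn install --frozen-lockfile
  fi
}
```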

Of course, some steps did need dependencies installed, but given that we use a yarn cache and an optimized, internal package registry (Artifactory), I wondered whether we’d see faster times simply by running yarn install instead of downloading and unarchiving node_modules as an artifact. This turned out to be the case, and we saved more time.
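This is easy to check for yourself; a rough comparison along these lines (again with the hypothetical download_artifact helper) settled it for us:

```bash
# Time both approaches on a warm cache; your numbers will vary.
time { download_artifact node_modules.tar.gz && tar -xzf node_modules.tar.gz; }
time yarn install --frozen-lockfile  # resolves against the internal registry
```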

Some bigger gains came from adding parallelization wherever possible. One step handled both Cypress component tests and end-to-end tests, meaning the component tests wouldn’t start running until the e2e tests were finished. By splitting this into two steps, we could run them in parallel, which essentially shaved off the component tests’ time, since the e2e tests were the slower of the two.
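In Cypress 10 and later, the runner already distinguishes the two kinds of tests, so the split mostly comes down to giving each its own pipeline step. A sketch (exact flags depend on your Cypress version and config):

```bash
# Step A: component tests only (runs in parallel with step B)
yarn cypress run --component

# Step B: end-to-end tests only
yarn cypress run --e2e
```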

Another engineer and I brainstormed about adding concurrency to the end-to-end tests: couldn’t we split them into several chunks and run them at the same time on different CPU cores? It turned out we could. We identified all the specs, then divided them into chunks based on the number of cores.
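Here’s a minimal sketch of that idea in shell, assuming the specs live under cypress/e2e and that one Cypress process per core is acceptable (the paths, glob, and round-robin scheme are illustrative, not our exact script):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Round-robin the spec files into one chunk per CPU core, then run the
# chunks concurrently.
cores=$(nproc)
mapfile -t specs < <(find cypress/e2e -name '*.cy.*' | sort)

pids=()
for ((i = 0; i < cores; i++)); do
  chunk=()
  for ((j = i; j < ${#specs[@]}; j += cores)); do
    chunk+=("${specs[j]}")
  done
  if [ ${#chunk[@]} -eq 0 ]; then continue; fi
  # Cypress's --spec flag accepts a comma-separated list of files.
  (IFS=','; yarn cypress run --e2e --spec "${chunk[*]}") &
  pids+=($!)
done

# Wait on each chunk individually so one failing chunk fails the step.
for pid in "${pids[@]}"; do
  wait "$pid"
done
```

One caveat with this approach: concurrent runs can clobber each other’s output (videos, screenshots, reports), so each chunk may need its own output directory, and the results may need a merge step at the end.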

Most recently, my team deployed a change to clone the repo more quickly by shallow cloning. Previously, we were cloning the entire repo as part of the build process, but this isn’t necessary: you need to know how your branch differs from main, but you don’t need the repo’s entire history. The full repo is about 5GB; shallow cloning to a depth of 1 would make this much smaller. We decided to start less aggressively with a depth of 50, which gives us more than enough history without too much excess weight.
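In git terms, the change is just a flag on the clone (the URL here is a placeholder):

```bash
# Fetch only the 50 most recent commits instead of the full ~5GB history.
git clone --depth=50 git@example.com:org/repo.git

# An existing full clone can be made shallow the same way:
git fetch --depth=50 origin main
```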

All this work has made our CI pipelines around 30% faster, and we’re still working on improvements. The next area where I anticipate big gains is our Docker build step. I’ll write about it once we start.