One job server to rule them all – Michał Górny

A common problem with managing Gentoo builds is concurrency. Many packages contain extensive build steps that are either fully serial or cannot utilize the available CPU threads throughout. This problem is less pronounced when you build multiple packages in parallel, but then we run the risk of overscheduling for packages that do benefit from parallel builds.

Fortunately, there are some tools at our disposal that can improve the situation. Most recently, they were joined by two experimental system-wide job servers: guildmaster and steve. In this post, I would like to provide some background on them, and discuss the problems they face.

The job multiplication problem

You can use the MAKEOPTS variable to specify the number of parallel tasks to be executed:

MAKEOPTS="-j12"

It’s not only used by GNU make, but it’s also recognized by a plethora of eclasses and ebuilds, and turned into appropriate options for various build systems, test runners, and other tools that can benefit from concurrency. So far this is good news; whenever we can, we’re going to run 12 tasks and use all the CPU threads.

The problems start when we run multiple builds in parallel. It can be either due to running emerge --jobs, or simply having to start another emerge process. The latter happens to me quite often, as I test several packages at once.

For example, if we end up building four packages at once, and all of them support -j12, we can end up spawning 48 jobs. Saturating the CPU is not the only problem: imagine running 48 memory-hungry C++ compilers at once!

Load-Average Scheduling to the rescue

One possible solution is the --load-average option, e.g.:

MAKEOPTS="-j12 -l13"

This causes tools that support the option to refrain from starting new tasks when the current load exceeds 13, which roughly corresponds to 13 processes running at once. However, the option is not universally supported, and the exact behavior varies from tool to tool. For example, CTest does not start any tasks at all when the load is exceeded, effectively stalling test execution, while GNU make and Ninja throttle themselves down to a single task.

Of course, this is a rough approximation. While GNU make makes an effort to determine the current load from /proc/loadavg, most tools only use the one-minute average from getloadavg(), which suffers from some lag. It is entirely possible to end up with alternating periods of overscheduling while the load is still building up, followed by periods of underscheduling before it subsides again. Still, it’s better than nothing, and it can be especially useful for running builds as background load: a build process that uses the idle CPU threads, and backs off when other tasks need them.
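
The throttling logic described above can be sketched in a few lines of Python (the function name is illustrative; real tools implement this internally):

```python
import os

def may_start_job(max_load: float, running_jobs: int) -> bool:
    """Mimic GNU make's -l behavior: only start a new task if the
    one-minute load average is below the limit, but always allow at
    least one job so the build never stalls completely (this is what
    make and Ninja do; CTest would stall instead)."""
    if running_jobs == 0:
        return True  # throttle down to one task, never to zero
    load1, _, _ = os.getloadavg()  # one-minute average: lags behind reality
    return load1 < max_load

# With MAKEOPTS="-j12 -l13", a scheduler would effectively ask:
# may_start_job(13.0, currently_running)
```

The lag mentioned above comes from the one-minute averaging window: by the time getloadavg() reflects the new processes, many more may already have been started.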

The nested Makefile problem and the GNU make jobserver

Nested Makefiles are processed by calling make recursively, and thus face a similar problem: if you run multiple make processes in parallel, and each of them runs multiple tasks simultaneously, you end up overscheduling. To avoid this, GNU make introduced the jobserver. It ensures that the specified job count is respected across multiple make invocations.

At the time of writing, GNU make supports three variants of the jobserver protocol:

  1. The legacy Unix pipe-based protocol that relies on passing file descriptors to child processes.
  2. The modern Unix protocol that uses a named pipe.
  3. The Windows protocol using a shared semaphore.
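
On Unix, make advertises the jobserver to its children through the MAKEFLAGS environment variable, so a client's first step is parsing it. A rough sketch (the function name is mine; note that older make versions used --jobserver-fds instead of --jobserver-auth):

```python
import re

def find_jobserver(makeflags: str):
    """Locate the jobserver advertised in MAKEFLAGS.

    Modern named-pipe style:  --jobserver-auth=fifo:/path/to/fifo
    Legacy pipe style:        --jobserver-auth=R,W  (inherited fds)
    (On Windows, a shared semaphore name is passed instead.)
    """
    m = re.search(r'--jobserver-auth=(\S+)', makeflags)
    if not m:
        return None  # no jobserver: fall back to local -j handling
    auth = m.group(1)
    if auth.startswith('fifo:'):
        return ('fifo', auth[len('fifo:'):])
    r, w = auth.split(',')
    return ('pipe', int(r), int(w))
```

The named-pipe variant is what makes the system-wide servers discussed below possible, since any process can open the path without inheriting descriptors.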

All of these variants follow roughly the same design principles, and are peer-to-peer protocols over shared state rather than true servers in the network sense. The jobserver’s role is mostly limited to initializing the state and seeding it with the appropriate number of job tokens. After that, clients are responsible for obtaining a token when they are about to start a job, and returning it once the job is done. The availability of job tokens therefore limits the total number of processes started.
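
The token exchange itself is minimal: a token is a single byte, acquiring means reading it, releasing means writing it back. A sketch using an ordinary pipe to simulate the shared state:

```python
import os

def acquire_token(read_fd: int) -> bytes:
    """Block until a job token (one byte) can be read from the jobserver."""
    return os.read(read_fd, 1)

def release_token(write_fd: int, token: bytes) -> None:
    """Return the token so another client may start a job.
    The protocol requires this even on error paths."""
    os.write(write_fd, token)

# Simulated jobserver seeded with 2 tokens (GNU make uses '+' bytes):
r, w = os.pipe()
os.write(w, b'++')

token = acquire_token(r)   # one job may start now
release_token(w, token)    # job finished; the token goes back
```

When all tokens are in use, acquire_token() simply blocks until some client writes one back, which is the entire scheduling mechanism.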

The flexibility of the modern protocols has allowed more tools to support them. The Ninja build system has recently gained support for the protocol, enabling proper parallelism in complex build systems combining Makefiles and Ninja. The jobserver protocol is also supported by Cargo and various Rust tools, as well as by GCC and LLVM, where it can be used to limit the number of parallel LTO jobs.

A system-wide job server

With a growing number of tools capable of parallel processing, and at the same time gaining support for the GNU make jobserver protocol, this is starting to look like an interesting solution to the overscheduling problem. If we could run one jobserver shared across all build processes, we could control the total number of jobs running concurrently, and thus let all the concurrent builds dynamically adjust to one another!

In fact, this is not a new idea. An issue requesting jobserver integration was filed for Portage back in 2019. The NixOS jobserver effort dates back to at least 2021, although it has not been merged yet. Guildmaster and steve joined the effort very recently.

There are two primary problems with using a system-wide job server: token release reliability and the “implicit slot” problem.

The token release problem

The first problem is the more important one. As previously mentioned, the jobserver protocol relies entirely on clients releasing the job tokens they have obtained, and the documentation explicitly emphasizes that they should be returned even on error conditions. Unfortunately, this is not always possible: if the client is killed, it cannot run any cleanup code and therefore cannot return its tokens! For jobservers with a limited scope, such as GNU make’s, this is usually not a big problem, since make normally terminates when a child is killed. However, a system-wide jobserver can easily end up drained of job tokens this way!

This problem can’t really be solved within the strict confines of the jobserver protocol. After all, it’s just a named pipe, and there are limits to how much you can monitor what happens to the pipe buffer. Fortunately, there is a way around this: you can implement a proper server for the jobserver protocol using FUSE, and provide it in place of the named pipe. The good news is that most tools don’t check the file type, and those that do can be easily fixed.

The current NixOS jobserver proof of concept provides a regular file with special behavior via FUSE, while guildmaster and steve both provide a character device via the CUSE API. The NixOS jobserver and guildmaster both return unreleased tokens as soon as the process closes the jobserver file, while steve returns them as soon as the process that acquired them exits. This way, they can guarantee that a process that either cannot release its tokens (e.g. because it was killed), or one that does not due to an implementation issue (e.g. Cargo), does not end up effectively blocking other builds. It also means that they can provide live information about which processes hold the tokens, or even implement additional features such as limiting the token supply based on the system load, or setting per-process limits.
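
The reclamation logic these servers implement can be reduced to a per-client ledger. A simplified in-memory sketch of the idea (names are mine; the real servers track this per open file description or per PID, behind FUSE/CUSE):

```python
class TokenLedger:
    """Track outstanding tokens per client, so that they can be
    reclaimed when the client disappears without releasing them."""

    def __init__(self, total: int):
        self.free = total
        self.held = {}  # client id -> number of tokens currently held

    def acquire(self, client) -> bool:
        if self.free == 0:
            return False  # caller should block and retry later
        self.free -= 1
        self.held[client] = self.held.get(client, 0) + 1
        return True

    def release(self, client) -> None:
        self.held[client] -= 1
        self.free += 1

    def client_gone(self, client) -> None:
        """Called when the client closes the jobserver file or exits:
        reclaim whatever it never gave back (e.g. it was killed)."""
        self.free += self.held.pop(client, 0)
```

This bookkeeping is exactly what a plain named pipe cannot provide, and it is also what enables the extra features mentioned above, such as reporting which processes hold tokens.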

The implicit slot problem

The second problem is related to the implicit assumption that a jobserver is inherited from a parent GNU make process that has already acquired a token in order to spawn the subprocess. Since the make subprocess doesn’t really do any work itself, it can “use” that token to run another job instead. Therefore, each GNU make process running under a jobserver has one implicit slot that runs a task without consuming any token. If the jobserver is running externally and no job token was acquired when the top-level make process started, every build ends up running an extra process without a job token: steve -j12 permits 12 jobs, plus one extra job for every package being built.

Fortunately, the solution is quite simple: one needs to implement token acquisition at the Portage level. Portage obtains a token before a build begins, and releases it once the build is complete. In fact, this solves two problems: it accounts for the implicit slot in build systems that implement the jobserver protocol, and it limits the total number of jobs executed across parallel builds.
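
The acquire-before-build, release-after pattern could look roughly like this, assuming a fifo-style jobserver (a hypothetical sketch, not actual Portage code):

```python
import contextlib
import os

@contextlib.contextmanager
def build_slot(jobserver_path: str):
    """Hold one job token for the whole duration of a package build.
    This accounts for the implicit slot of the build's top-level make,
    and caps the number of packages building concurrently."""
    fd = os.open(jobserver_path, os.O_RDWR)
    try:
        token = os.read(fd, 1)   # blocks until a token is free
        try:
            yield
        finally:
            os.write(fd, token)  # return it even if the build failed
    finally:
        os.close(fd)

# Hypothetical usage in a package manager:
# with build_slot('/run/jobserver'):
#     run_ebuild(...)
```

Note that the blocking read is precisely the interactivity trade-off discussed below: a new build simply waits until some other job returns a token.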

However, this is a double-edged sword. On the one hand, it limits the risk of overscheduling when executing parallel build tasks. On the other hand, it means that a new emerge job may not start immediately, but instead wait for other jobs to release job tokens first, which negatively affects interactivity.

A semi-related issue is that obtaining a single token does not properly account for processes that are themselves parallel but do not implement the jobserver protocol, such as pytest-xdist runs. It might be possible to handle these better by acquiring multiple tokens before running them (or possibly while they are running), but in the former case one must be careful to acquire them atomically, so as not to end up with the equivalent of a deadlock: two processes each acquiring part of the tokens they need, and waiting forever for the rest.
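
An all-or-nothing acquisition avoids that scenario: grab tokens without blocking, and if the full set cannot be obtained, return everything and retry later. A sketch against a pipe-style jobserver (function name is mine):

```python
import os

def try_acquire_n(read_fd: int, write_fd: int, n: int):
    """Acquire n tokens atomically, or none at all.

    Blocking while holding a partial set is how two clients can
    deadlock each other, so on failure we roll everything back and
    let the caller retry later."""
    os.set_blocking(read_fd, False)
    got = []
    try:
        while len(got) < n:
            try:
                token = os.read(read_fd, 1)
            except BlockingIOError:
                token = b''
            if not token:
                for t in got:          # roll back: never hoard tokens
                    os.write(write_fd, t)
                return None
            got.append(token)
        return got
    finally:
        os.set_blocking(read_fd, True)
```

A caller that gets None back can sleep and retry, or fall back to running the tool with a single token.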

The implicit slot also causes problems for other clients. For example, nasm-rs writes an extra token into the jobserver pipe to compensate for the implicit slot without needing special wrappers. However, this violates the protocol and breaks implementations that track per-process tokens. Steve carries a special workaround for that package.

Summary

A growing number of tools are capable of some degree of concurrency: from builders that can traditionally launch multiple parallel tasks, to multithreaded compilers. While this provides some control over how many jobs to start, it is non-trivial to avoid overscheduling while running multiple builds in parallel. Some builders can use load averaging to partially mitigate the problem, but it is far from a perfect solution.

Job servers are currently our best bet. Originally designed to handle job scheduling for recursive GNU make calls, they have been extended to control other parallel processes throughout the build, and can be extended further to control job counts across different builds, and even across different build containers.

While NixOS seems to have dropped the ball, Gentoo is now actively pursuing global jobserver support. Guildmaster and steve both prove that a server-side implementation is possible, and integration is just around the corner. At this point, it’s not clear whether a jobserver-enabled system will become the default in the future, but it’s certainly an interesting experiment to run.
