NixOS Planet

January 13, 2021

nixbuild.net

Finding Non-determinism with nixbuild.net

During the last decade, many initiatives focussing on making builds reproducible have gained momentum. reproducible-builds.org is a great resource for anyone interested in how the work progresses in multiple software communities. r13y.com tracks the current reproducibility metrics in NixOS.

Nix is particularly suited for working on reproducibility, since it by design isolates builds and comes with tools for finding non-determinism. The Nix community also works on related projects, like Trustix and the content-addressed store.

This blog post summarises how nixbuild.net can be useful for finding non-deterministic builds, and announces a new feature related to reproducibility!

Repeated Builds

The way to find non-reproducible builds is to run the same build multiple times and compare the results bit for bit. Since Nix guarantees that all inputs will be identical between the runs, finding differing outputs is enough to conclude that a build is non-deterministic. Of course, we can never prove that a build is deterministic this way, but if we run the build many times, we gain a certain confidence in it.

To run a Nix build multiple times, simply add the --repeat option to your build command. It will run your build the number of extra times you specify.

Suppose we have the following Nix expression in deterministic.nix:

let
  inherit (import <nixpkgs> {}) runCommand;
in {
  stable = runCommand "stable" {} ''
    touch $out
  '';

  unstable = runCommand "unstable" {} ''
    echo $RANDOM > $out
  '';
}

We can run repeated builds like this (note that the --builders "" option forces a local build, so nixbuild.net is not used):

$ nix-build deterministic.nix --builders "" -A stable --repeat 1
these derivations will be built:
  /nix/store/0fj164aqyhsciy7x97s1baswygxn8lzf-stable.drv
building '/nix/store/0fj164aqyhsciy7x97s1baswygxn8lzf-stable.drv' (round 1/2)...
building '/nix/store/0fj164aqyhsciy7x97s1baswygxn8lzf-stable.drv' (round 2/2)...
/nix/store/6502c5490rap0c8dhvfwm5rhi22i9clz-stable

$ nix-build deterministic.nix --builders "" -A unstable --repeat 1
these derivations will be built:
  /nix/store/psmn1s3bb97989w5a5b1gmjmprqcmf0k-unstable.drv
building '/nix/store/psmn1s3bb97989w5a5b1gmjmprqcmf0k-unstable.drv' (round 1/2)...
building '/nix/store/psmn1s3bb97989w5a5b1gmjmprqcmf0k-unstable.drv' (round 2/2)...
output '/nix/store/g7a5sf7iwdxs7q12ksrzlvjvz69yfq3l-unstable' of '/nix/store/psmn1s3bb97989w5a5b1gmjmprqcmf0k-unstable.drv' differs from previous round
error: build of '/nix/store/psmn1s3bb97989w5a5b1gmjmprqcmf0k-unstable.drv' failed

Running repeated builds on nixbuild.net works exactly the same way:

$ nix-build deterministic.nix -A stable --repeat 1
these derivations will be built:
  /nix/store/wnd5y30jp3xwpw1bhs4bmqsg5q60vc8i-stable.drv
building '/nix/store/wnd5y30jp3xwpw1bhs4bmqsg5q60vc8i-stable.drv' (round 1/2) on 'ssh://eu.nixbuild.net'...
copying 1 paths...
copying path '/nix/store/z3wlpwgz66ningdbggakqpvl0jp8bp36-stable' from 'ssh://eu.nixbuild.net'...
/nix/store/z3wlpwgz66ningdbggakqpvl0jp8bp36-stable

$ nix-build deterministic.nix -A unstable --repeat 1
these derivations will be built:
  /nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv
building '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' (round 1/2) on 'ssh://eu.nixbuild.net'...
[nixbuild.net] output '/nix/store/srch6l8pyl7z93c7gp1xzf6mq6rwqbaq-unstable' of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' differs from previous round
error: build of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' on 'ssh://eu.nixbuild.net' failed: build was non-deterministic
builder for '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' failed with exit code 1
error: build of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' failed

As you can see, the log output differs slightly between the local and the remote builds. This is because when Nix submits a remote build, it will not do the determinism check itself; instead, it leaves the check to the builder (nixbuild.net in our case). This is actually a good thing, because it allows nixbuild.net to perform some optimizations for repeated builds. The following sections describe those optimizations.

Finding Non-determinism in Past Builds

If you locally try to rebuild something that has failed due to non-determinism, Nix will build it again at least two times (due to --repeat) and fail again with the same non-determinism error, since it keeps no record of the previous build failure (other than the build log).

However, nixbuild.net keeps a record of every build performed, including repeated builds. So when you try to build the same derivation again, nixbuild.net is smart enough to look at its past builds and figure out that the derivation is non-deterministic without having to rebuild it. We can demonstrate this by re-running the last build from the example above:

$ nix-build deterministic.nix -A unstable --repeat 1
these derivations will be built:
  /nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv
building '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' (round 1/2) on 'ssh://eu.nixbuild.net'...
[nixbuild.net] output '/nix/store/srch6l8pyl7z93c7gp1xzf6mq6rwqbaq-unstable' of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' differs from previous round
error: build of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' on 'ssh://eu.nixbuild.net' failed: a previous build of the derivation was non-deterministic
builder for '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' failed with exit code 1
error: build of '/nix/store/6im1drv4pklqn8ziywbn44vq8am977vm-unstable.drv' failed

As you can see, the exact same derivation fails again, but now the build status message says: a previous build of the derivation was non-deterministic. This means nixbuild.net didn’t have to run the build, it just checked its past outputs for the derivation and noticed they differed.

When nixbuild.net looks at past builds it considers all outputs that have been signed by a key that the account trusts. That means that it can even compare outputs that have been fetched by substitution.

Scaling Out Repeated Builds

When you use --repeat, nixbuild.net will create multiple copies of the build and schedule all of them like any other build would have been scheduled. This means that every repeated build will run in parallel, saving time for the user. As soon as nixbuild.net has found proof of non-determinism, any repeated build still running will be cancelled.

Provoking Non-determinism through Filesystem Randomness

As promised in the beginning of this blog post, we have a new feature to announce! nixbuild.net is now able to inject randomness into the filesystem that builds see when they run. This can be used to provoke builds into revealing non-deterministic behavior.

The idea is not new; it is in fact the same concept that has been implemented in the disorderfs project by reproducible-builds.org. However, we’re happy to make it easily accessible to nixbuild.net users. The feature is disabled by default, but can be enabled through a new user setting.

For the moment, when the feature is enabled, directory entries are returned in random order. In the future we might inject randomness into more filesystem metadata.

To demonstrate this feature, let’s use this build:

let
  inherit (import <nixpkgs> {}) runCommand;
in rec {
  files = runCommand "files" {} ''
    mkdir $out
    touch $out/{1..10}
  '';

  list = runCommand "list" {} ''
    ls -f ${files} > $out
  '';
}

The files build just creates ten empty files as its output, and the list build lists those files with ls. The -f option of ls disables sorting entirely, so the file names will be printed in the order the filesystem returns them. This means that the build output will depend on how the underlying filesystem is implemented, which can be considered non-deterministic behavior.

First, we build it locally with --repeat:

$ nix-build non-deterministic-fs.nix --builders "" -A list --repeat 1
these derivations will be built:
  /nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv
building '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' (round 1/2)...
building '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' (round 2/2)...
/nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list

As you can see, the build succeeded. Then we delete the result from our Nix store so we can run the build again:

rm result
nix-store --delete /nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list

We enable the inject-fs-randomness feature through the nixbuild.net shell:

nixbuild.net> set inject-fs-randomness true

Then we run the build (with --repeat) on nixbuild.net:

$ nix-build non-deterministic-fs.nix -A list --repeat 1
these derivations will be built:
  /nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv
building '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' (round 1/2) on 'ssh://eu.nixbuild.net'...
copying 1 paths...
copying path '/nix/store/vl13q40hqp4q8x6xjvx0by06s1v9g3jy-files' to 'ssh://eu.nixbuild.net'...
[nixbuild.net] output '/nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list' of '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' differs from previous round
error: build of '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' on 'ssh://eu.nixbuild.net' failed: build was non-deterministic
builder for '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' failed with exit code 1
error: build of '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' failed

Now, nixbuild.net found the non-determinism! We can double check that the directory entries are in a random order by running without --repeat:

$ nix-build non-deterministic-fs.nix -A list
these derivations will be built:
  /nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv
building '/nix/store/153s3ir379cy27wpndd94qlfhz0wj71v-list.drv' on 'ssh://eu.nixbuild.net'...
copying 1 paths...
copying path '/nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list' from 'ssh://eu.nixbuild.net'...
/nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list

$ cat /nix/store/h1591y02qff8vls5v41khgjz2zpdr2mg-list
6
1
2
5
10
7
8
..
9
4
3
.

Future Work

There are lots of possibilities to improve the utility of nixbuild.net when it comes to reproducible builds. Your feedback and ideas are very welcome at support@nixbuild.net.

Here are some of the things that could be done:

  • Make it possible to trigger repeated builds for any previous build, without submitting a new build with Nix. For example, there could be a command in the nixbuild.net shell allowing a user to trigger a repeated build and report back any non-determinism issues.

  • Implement functionality similar to diffoscope to be able to find out exactly what differs between builds. This could be available as a shell command or through an API.

  • Make it possible to download specific build outputs. The way Nix downloads outputs (and stores them locally) doesn’t allow for having multiple variants of the same output, but nixbuild.net could provide this functionality through the shell or an API.

  • Inject more randomness inside the sandbox. Since we have complete control over the sandbox environment we can introduce more differences between repeated builds to provoke non-determinism. For example, we can schedule builds on different hardware or use different kernels between repeated builds.

  • Add support for listing known non-deterministic derivations.

by nixbuild.net (support@nixbuild.net) at January 13, 2021 12:00 AM

December 29, 2020

nixbuild.net

The First Year

One year ago nixbuild.net was announced to the Nix community for the very first time. The service then ran as a closed beta for 7 months until it was made generally available on the 28th of August 2020.

This blog post will try to summarize how nixbuild.net has evolved since GA four months ago, and give a glimpse of the future for the service.

Stability and Performance

Thousands of Nix builds have been built by nixbuild.net so far, and every build helps in making the service more reliable by uncovering possible edge cases in the build environment.

These are some of the stability-related improvements and fixes that have been deployed since GA:

  • Better detection and handling of builds that time out or hang.

  • Improved retry logic should our backend storage not deliver Nix closures as expected.

  • Fixes to the virtual file system inside the KVM sandbox.

  • Better handling of builds that have binary data in their log output.

  • Changes to the virtual sandbox environment so it looks even more like a “standard” Linux environment.

  • Application of the Nix sandbox inside our KVM sandbox. This basically guarantees that the Nix environment provided through nixbuild.net is identical to the Nix environment for local builds.

  • Support for following HTTP redirects from binary caches.

Even Better Build Reuse

One of the fundamental ideas in nixbuild.net is to try as hard as possible to not build your builds, if an existing build result can be reused instead. We can trivially reuse an account’s own builds since they are implicitly trusted by the user, but untrusted builds can also be reused under certain circumstances. This has been described in detail in an earlier blog post.

Since GA we’ve introduced a number of new ways build results can be reused.

Reuse of Build Failures

Build failures are now also reused. This means that if someone tries to build a build that is identical (in the sense that the derivation and its transitive input closure are bit-by-bit identical) to a previously failed build, nixbuild.net will immediately serve back the failed result instead of re-running the build. You will even get the build log replayed.

Build failures can be reused since we are confident that our sandbox is pure, meaning that it will behave exactly the same as long as the build is exactly the same. Only non-transient failures will be reused. So if the builder misbehaves in some way that is outside Nix’s control, that failure will not be reused. This can happen if the builder machine breaks down or something similar. In such cases we will automatically re-run the build anyway.

When we fix bugs or make major changes in our sandbox it can happen that we alter the behavior in terms of which builds succeed or fail. For example, we could find a build that fails just because we have missed implementing some specific detail in the sandbox. Once that is fixed, we don’t want to reuse such failures. To avoid that, all existing build failures will be “invalidated” on each major update of the sandbox.

If a user really wants to re-run a failed build on nixbuild.net, failure reuse can be turned off using the new user settings (see below).

Reuse of Build Timeouts

In a similar vein to reused build failures, we can also reuse build timeouts. This is not enabled by default, since users can select different timeout limits. A user can activate reuse of build timeouts through the user settings.

The reuse of timed out builds works like this: Each time a new build is submitted, we check if we have any previous build results of the exact same build. If no successful results or plain failures are found, we look for builds that have timed out. We then check if any of the existing timed out builds ran for longer than the user-specified timeout for the new build. If we can find such a result, it will be served back to the user instead of re-running the build.

This feature can be very useful if you want to avoid re-running builds that time out over and over again (which can be a very time-consuming exercise). For example, say that you have your build timeout set to two hours, and some input needed for a build takes longer than that to build. The first time that input is needed you have to wait two hours to detect that the build will fail. If you then try building something else that happens to depend on the very same input you will save two hours by directly being served the build failure from nixbuild.net!

Wait for Running Builds

When a new build is submitted, nixbuild.net will now check if there is any identical build currently running (after checking for previous build results or failures). If there is, the new build will simply hold until the running build has finished. After that, the result of the running build will likely be served back as the result of the new build (as long as the running build wasn’t terminated in a transient way, in which case the new build will have to run from scratch). Identical running builds are detected and reused across accounts.

Before this change, nixbuild.net would simply start another build in parallel even if the builds were identical.

New Features

User Settings

A completely new feature has been launched since GA: User Settings. This allows end users to tweak the behavior of nixbuild.net. For example, the build reuse described above can be controlled by user settings. Other settings include controlling the maximum build time used per month, and the possibility to lock down specific SSH keys, which is useful in CI setups.

The user settings can be set in various ways: through the nixbuild.net shell, the SSH client environment and even through the Nix derivations themselves.

Even though many users will probably never need to change any settings, it can be helpful to read through the documentation to get a feeling for what is possible. If you need to differentiate permissions in any way (different settings for account administrators, developers, CI etc.) you should definitely look into the various user settings.

GitHub CI Action

A GitHub Action has been published. This action makes it very easy to use nixbuild.net as a remote Nix builder in your GitHub Actions workflows. Instead of running your Nix builds on the two vCPUs provided by GitHub you can now enjoy scale-out Nix builds on nixbuild.net with minimal setup required.

The nixbuild.net GitHub Action is developed by the nixbuild.net team, and there are plans to add more of the functionality that nixbuild.net can offer its users, like automatically generated cost and performance reports for your Nix builds.

Shell Improvements

Various minor improvements have been made to the nixbuild.net shell. For example, it is now much easier to get an overview of how large your next invoice will be, through the usage command.

The Future

After one year of real world usage, we are very happy with the progress of nixbuild.net. It has been well received in the Nix community, proved both reliable and scalable, and it has delivered on our initial vision of a simple service that can integrate into any setup using Nix.

We feel that we can go anywhere from here, but we also realize that we must be guided by our users’ needs. We have compiled a small and informal roadmap below. The items on this list are things that we, based on the feedback we’ve received throughout the year, think are natural next steps for nixbuild.net.

The roadmap has no dates and no prioritization, and should be seen as merely a hint about which direction the development is heading. Any question or comment concerning this list (or what’s missing from the list) is very welcome to support@nixbuild.net.

Support aarch64-linux Builds

Work is already underway to add support for aarch64-linux builds to nixbuild.net, and so far it is looking good. With the current surge in performant ARM hardware (Apple M1, Ampere Altra etc), we think having aarch64 support in nixbuild.net is an obvious feature. It is also something that has been requested by our users.

We don’t know yet how the pricing of aarch64 builds will look, or what scalability promises we can make. If you are interested in evaluating aarch64 builds on nixbuild.net in an early access setting, just send us an email to support@nixbuild.net.

Provide an API over SSH and HTTP

Currently the nixbuild.net shell is the administrative tool we offer end users. We will keep developing the shell and make it more intuitive for interactive use. But we will also add an alternative, more scriptable variant of the shell.

This alternative version will provide roughly the same functionality as the original shell, only more adapted to scripting instead of interactive use. The reason for providing such an SSH-based API is to make it easy to integrate nixbuild.net more tightly into CI and similar scenarios.

There is in fact already a tiny version of this API deployed. You can run the following command to try it out:

$ ssh eu.nixbuild.net api show public-signing-key
{"keyName":"nixbuild.net/bob-1","publicKey":"PmUhzAc4Ug6sf1uG8aobbqMdalxW41SHWH7FE0ie1BY="}

The above API command is in use by the nixbuild-action for GitHub. So far, this is the only API command implemented, and it should be seen as a very first proof of concept. Nothing has been decided on how the API should look and work in the future.

The API will also be offered over HTTP in addition to SSH.

Upload builds to binary caches

Adding custom binary caches that nixbuild.net can fetch dependencies from is supported today, although such requests are still handled manually through support.

We also want to support uploading to custom binary caches. That way users could gain performance by not having to first download build results from nixbuild.net and then upload them somewhere else. This could be very useful for CI setups that can spend a considerable amount of their time just uploading closures.

Provide an HTTP-based binary cache

Using nixbuild.net as a binary cache is handy since you don’t have to wait for any uploads after a build has finished. Instead, the closures will be immediately available in the binary cache, backed by nixbuild.net.

It is actually possible to use nixbuild.net as a binary cache today, by configuring an SSH-based cache (ssh://eu.nixbuild.net). This works out of the box right now. You can even use nix-copy-closure to upload paths to nixbuild.net. We just don’t yet give any guarantees on how long store paths are kept.
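
For illustration, here is a rough, untested sketch of how that can look from the command line, based purely on the description above (my-project.nix and the build result are placeholders of mine, not taken from the post):

# Upload the closure of a local build result to nixbuild.net
$ nix-copy-closure --to eu.nixbuild.net $(readlink ./result)

# Ask Nix to consider nixbuild.net as an additional substituter for one build
$ nix-build my-project.nix --option extra-substituters "ssh://eu.nixbuild.net"

Depending on your Nix configuration, substituting from the SSH cache may also require trusting the nixbuild.net signing key (see the api show public-signing-key command above) in trusted-public-keys.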

However, there are benefits to providing an HTTP-based cache. It would most probably have better performance (serving nar files over HTTP instead of using the nix-store protocol over SSH), but more importantly it would let us use a CDN for serving cache contents. This could help mitigate the fact that nixbuild.net is only deployed in Europe so far.

Support builds that use KVM

The primary motivation for this is to be able to run NixOS tests (with good performance) on nixbuild.net.

Thank You!

Finally we’d like to thank all our users. We look forward to an exciting new year with lots of Nix builds!

by nixbuild.net (support@nixbuild.net) at December 29, 2020 12:00 AM

December 24, 2020

Cachix

Postmortem of outage on 20th December

On 20 December, Cachix experienced a six-hour downtime, the second significant outage since the service started operating on 1 June 2018. Here are the details of what exactly happened and what has been done to prevent similar events from happening.

Timeline (UTC)

  • 02:55:07 - Backend starts to emit errors for all HTTP requests
  • 02:56:00 - Pagerduty tries to notify me of outage via email, phone and mobile app
  • 09:01:00 - I wake up and see the notifications
  • 09:02:02 - Backend is restarted and recovers

Root cause analysis

All ~24k HTTP requests that reached the backend during the outage failed with the following exception:

by Domen Kožar (support@cachix.org) at December 24, 2020 11:30 AM

December 23, 2020

Ollie Charles

Monad Transformers and Effects with Backpack

A good few years ago Edward Yang gifted us an implementation of Backpack - a way for us to essentially abstract modules over other modules, allowing us to write code independently of implementation. A big benefit of doing this is that it opens up new avenues for program optimization. When we provide concrete instantiations of signatures, GHC compiles it as if that were the original code we wrote, and we can benefit from a lot of specialization. So aside from organizational concerns, Backpack gives us the ability to write some really fast code. This benefit isn’t just theoretical - Edward Kmett gave us unpacked-containers, removing a level of indirection from all keys, and Oleg Grenrus showed us how we can use Backpack to “unroll” fixed sized vectors. In this post, I want to show how we can use Backpack to give us the performance benefits of explicit transformers, but without having library code commit to any specific stack. In short, we get the ability to have multiple interpretations of our program, but without paying the performance cost of abstraction.

The Problem

Before we start looking at any code, let’s look at some requirements, and understand the problems that come with some potential solutions. The main requirement is that we are able to write code that requires some effects (in essence, writing our code to an effect interface), and then run this code with different interpretations. For example, in production I might want to run as fast as possible, in local development I might want further diagnostics, and in testing I might want a pure or in memory solution. This change in representation shouldn’t require me to change the underlying library code.

Seasoned Haskellers might be familiar with the use of effect systems to solve these kinds of problems. Perhaps the most familiar is the mtl approach - perhaps unfortunately named, as the technique itself doesn’t have much to do with the library. In the mtl approach, we write our interfaces as type classes abstracting over some Monad m, and then provide instances of these type classes - either by stacking transformers (“plucking constraints”, in the words of Matt Parson), or by a “mega monad” that implements many of these instances at once (e.g., Tweag’s capability approach).
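
As a minimal illustration of that style (this snippet is my own and not from the post; the class and functions are made up for demonstration), an effect interface and library code written against it might look like this:

-- A hypothetical effect interface in the mtl style: library code is written
-- against the class, and each concrete monad supplies an instance.
class Monad m => MonadKeyValue m where
  getKey :: String -> m (Maybe String)
  putKey :: String -> String -> m ()

-- Library code only mentions the constraint; the concrete monad (and hence
-- the interpretation) is chosen by whoever instantiates m.
copyKey :: MonadKeyValue m => String -> String -> m ()
copyKey from to = getKey from >>= maybe (pure ()) (putKey to)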

Despite a few annoyances (e.g., the “n+k” problem, the lack of implementations being first-class, and a few other things), this approach can work well. It also has the potential to generate great code, but in practice it’s rarely possible to achieve maximal performance. In her excellent talk “Effects for Less”, Alexis King hits the nail on the head - despite being able to provide good code for the implementations of particular parts of an effect, the majority of effectful code is really just threading around inside the Monad constraint. When we’re being polymorphic over any Monad m, GHC is at a loss to do any further optimization - and how could it? We know nothing more than “there will be some >>= function when you get here, promise!” Let’s look at this in a bit more detail.

Say we have the following:

foo :: Monad m => m Int
foo = go 0 1_000_000_000
  where
    go acc 0 = return acc
    go acc i = return acc >> go (acc + 1) (i - 1)

This is obviously “I needed an example for my blog” levels of contrived, but at least small. How does it execute? What are the runtime consequences of this code? To answer, we’ll go all the way down to the STG level with -ddump-stg:

$wfoo =
    \r [ww_s2FA ww1_s2FB]
        let {
          Rec {
          $sgo_s2FC =
              \r [sc_s2FD sc1_s2FE]
                  case eqInteger# sc_s2FD lvl1_r2Fp of {
                    __DEFAULT ->
                        let {
                          sat_s2FK =
                              \u []
                                  case +# [sc1_s2FE 1#] of sat_s2FJ {
                                    __DEFAULT ->
                                        case minusInteger sc_s2FD lvl_r2Fo of sat_s2FI {
                                          __DEFAULT -> $sgo_s2FC sat_s2FI sat_s2FJ;
                                        };
                                  }; } in
                        let {
                          sat_s2FH =
                              \u []
                                  let { sat_s2FG = CCCS I#! [sc1_s2FE]; } in  ww1_s2FB sat_s2FG;
                        } in  ww_s2FA sat_s2FH sat_s2FK;
                    1# ->
                        let { sat_s2FL = CCCS I#! [sc1_s2FE]; } in  ww1_s2FB sat_s2FL;
                  };
          end Rec }
        } in  $sgo_s2FC lvl2_r2Fq 0#;

foo =
    \r [w_s2FM]
        case w_s2FM of {
          C:Monad _ _ ww3_s2FQ ww4_s2FR -> $wfoo ww3_s2FQ ww4_s2FR;
        };

In STG, whenever we have a let we have to do a heap allocation - and this code has quite a few! Of particular interest is what’s going on inside the actual loop $sgo_s2FC. This loop first compares i to see if it’s 0. In the case that it’s not, we allocate two objects and call ww_s2FA. If you squint, you’ll notice that ww_s2FA is the first argument to $wfoo, and it ultimately comes from unpacking a C:Monad dictionary. I’ll save you the labor of working out what this is - ww_s2FA is the >>. We can see that every iteration of our loop incurs two allocations for each argument to >>. A heap allocation doesn’t come for free - not only do we have to do the allocation, the entry into the heap incurs a pointer indirection (as heap objects have an info table that points to their entry), and also by merely being on the heap we increase our GC time as we have a bigger heap to traverse. While my STG knowledge isn’t great, my understanding of this code is that every time we want to call >>, we need to supply it with its arguments. This means we have to allocate two closures for this function call - which is basically whenever we pressed “return” on our keyboard when we wrote the code. This seems crazy - can you imagine if you were told in C that merely using ; would cost time and memory?

If we compile this code in a separate module, mark it as {-# NOINLINE #-}, and then call it from main - how’s the performance? Let’s check!

module Main (main) where

import Foo

main :: IO ()
main = print =<< foo

$ ./Main +RTS -s
1000000000
 176,000,051,368 bytes allocated in the heap
       8,159,080 bytes copied during GC
          44,408 bytes maximum residency (1 sample(s))
          33,416 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     169836 colls,     0 par    0.358s   0.338s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   54.589s  ( 54.627s elapsed)
  GC      time    0.358s  (  0.338s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   54.947s  ( 54.965s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,224,078,302 bytes per MUT second

  Productivity  99.3% of total user, 99.4% of total elapsed

OUCH. My i7 laptop took almost a minute to iterate a loop 1 billion times.

A little disclaimer: I’m intentionally painting a severe picture here - in practice this cost is irrelevant to all but the most performance-sensitive programs. Also, notice where the let bindings are in the STG above - they are nested within the loop. This means that we’re essentially allocating “as we go” - these allocations are incredibly cheap, and the extra GC work is equally trivial, resulting in more or less constant GC pressure rather than impending doom. For code that is likely to do any IO, this cost is likely negligible compared to the rest of the work. Nonetheless, it is there, and when it’s there, it’s nice to know if there are alternatives.

So, is the TL;DR that Haskell is completely incapable of writing effectful code? No, of course not. There is another way to compile this program, but we need a bit more information. If we happen to know what m is and we have access to the Monad dictionary for m, then we might be able to inline >>=. When we do this, GHC can be a lot smarter. The end result is code that now doesn’t allocate for every single >>=, and instead just gets on with doing work. One trivial way to witness this is to define everything in a single module (Alexis rightly points out this is a trap for benchmarking that many fall into, but for our uses it’s the behavior we actually want).

This time, let’s write everything in one module:

module Main ( main ) where
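
-- A sketch of the rest of the module: presumably just the foo definition and
-- main from above, now living in this single module.

foo :: Monad m => m Int
foo = go 0 1_000_000_000
  where
    go acc 0 = return acc
    go acc i = return acc >> go (acc + 1) (i - 1)

main :: IO ()
main = print =<< foo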

And the STG:

lvl_r4AM = CCS_DONT_CARE S#! [0#];

lvl1_r4AN = CCS_DONT_CARE S#! [1#];

Rec {
main_$sgo =
    \r [void_0E sc1_s4AY sc2_s4AZ]
        case eqInteger# sc1_s4AY lvl_r4AM of {
          __DEFAULT ->
              case +# [sc2_s4AZ 1#] of sat_s4B2 {
                __DEFAULT ->
                    case minusInteger sc1_s4AY lvl1_r4AN of sat_s4B1 {
                      __DEFAULT -> main_$sgo void# sat_s4B1 sat_s4B2;
                    };
              };
          1# -> let { sat_s4B3 = CCCS I#! [sc2_s4AZ]; } in  Unit# [sat_s4B3];
        };
end Rec }

main2 = CCS_DONT_CARE S#! [1000000000#];

main1 =
    \r [void_0E]
        case main_$sgo void# main2 0# of {
          Unit# ipv1_s4B7 ->
              let { sat_s4B8 = \s [] $fShowInt_$cshow ipv1_s4B7;
              } in  hPutStr' stdout sat_s4B8 True void#;
        };

main = \r [void_0E] main1 void#;

main3 = \r [void_0E] runMainIO1 main1 void#;

main = \r [void_0E] main3 void#;

The same program compiled down to a much tighter loop that is almost entirely free of allocations. In fact, the only allocation that happens is when the loop terminates, and it’s just boxing the unboxed integer that’s been accumulating in the loop.

As we might hope, the performance of this is much better:

$ ./Main +RTS -s
1000000000
  16,000,051,312 bytes allocated in the heap
         128,976 bytes copied during GC
          44,408 bytes maximum residency (1 sample(s))
          33,416 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15258 colls,     0 par    0.031s   0.029s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    9.402s  (  9.405s elapsed)
  GC      time    0.031s  (  0.029s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    9.434s  (  9.434s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,701,712,595 bytes per MUT second

  Productivity  99.7% of total user, 99.7% of total elapsed

Our time in the garbage collector dropped by a factor of 10, from 0.3s to 0.03s. Our total allocation dropped from 176GB (yes, you read that right) to 16GB (I’m still not entirely sure what this means, maybe someone can enlighten me). Most importantly our total runtime dropped from 54s to just under 10s. All this from just knowing what m is at compile time.

So GHC is capable of producing excellent code for monads - what are the circumstances under which this happens? We need, at least:

  1. The source code of the thing we’re compiling must be available. This means it’s either defined in the same module, or is available with an INLINABLE pragma (or GHC has chosen to add this itself).

  2. The definitions of >>= and friends must also be available in the same way.

These constraints start to feel a lot like needing whole program compilation, and in practice are unreasonable constraints to reach. To understand why, consider that most real world programs have a small Main module that opens some connections or opens some file handles, and then calls some library code defined in another module. If this code in the other module was already compiled, it will (probably) have been compiled as a function that takes a Monad dictionary, and just calls the >>= function repeatedly in the same manner as our original STG code. To get the allocation-free version, this library code needs to be available to the Main module itself - as that’s the module that chooses what type to instantiate m with - which means the library code has to have marked that code as being inlinable. While we could add INLINE everywhere, this leads to an explosion in the amount of code produced, and can skyrocket compilation times.
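
For concreteness (this snippet is my own addition, not from the post), exposing the unfolding of the earlier example across module boundaries would mean annotating it in the library module:

module Foo (foo) where

{-# INLINABLE foo #-}
foo :: Monad m => m Int
foo = go 0 1_000_000_000
  where
    go acc 0 = return acc
    go acc i = return acc >> go (acc + 1) (i - 1)

With that pragma GHC keeps foo’s unfolding in the interface file, so a downstream Main that picks a concrete m can specialize it - at the cost of larger interfaces and longer compile times, as discussed above.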

Alexis’ eff library works around this by not being polymorphic in m. Instead, it chooses a concrete monad with all sorts of fancy continuation features. Likewise, if we commit to a particular monad (a transformer stack, or maybe using RIO), we again avoid this cost. Essentially, if the monad is known a priori at time of module compilation, GHC can go to town. However, the latter also commits to semantics - by choosing a transformer stack, we’re choosing a semantics for our monadic effects.

With the scene set, I now want to present you with another approach to solving this problem using Backpack.

A Backpack Primer

Vanilla GHC has a very simple module system - modules are essentially a method for name-spacing and separate compilation; they don’t do much more. The Backpack project extends this module system with a new concept - signatures. A signature is like the “type” of a module - a signature might mention the presence of some types, functions and type class instances, but it says nothing about what the definitions of these entities are. We’re going to (ab)use this system to build up transformer stacks at configuration time, and allow our library to be abstracted over different monads. By instantiating our library code with different monads, we get different interpretations of the same program.

I won’t sugar coat it - what follows is going to be pretty miserable. Extremely fun, but miserable to write in practice. I’ll let you decide if you want to inflict this misery on your coworkers - I’m just here to show you it can be done!

A Signature for Monads

The first thing we’ll need is a signature for data types that are monads. This is essentially the “hole” we’ll rely on with our library code - it will give us the ability to say “there exists a monad”, without committing to any particular choice.

In our Cabal file, we have:

library monad-sig
  hs-source-dirs:   src-monad-sig
  signatures:       Control.Monad.Signature
  default-language: Haskell2010
  build-depends:    base

The important line here is signatures: Control.Monad.Signature which shows that this library is incomplete and exports a signature. The definition of Control/Monad/Signature.hsig is:

signature Control.Monad.Signature where

data M a
instance Functor M
instance Applicative M
instance Monad M

This simply states that any module with this signature has some type M with instances of Functor, Applicative and Monad.

Next, we’ll put that signature to use in our library code.

Library Code

For our library code, we’ll start with a new library in our Cabal file:

library business-logic
  hs-source-dirs:   lib
  signatures:       BusinessLogic.Monad
  exposed-modules:  BusinessLogic
  build-depends:
    , base
    , fused-effects
    , monad-sig

  default-language: Haskell2010
  mixins:
    monad-sig requires (Control.Monad.Signature as BusinessLogic.Monad)

Our business-logic library itself exports a signature, which is really just a re-export of Control.Monad.Signature, but we rename it to something more meaningful. It’s this module that will provide the monad that has all of the effects we need. Along with this signature, we also export the BusinessLogic module:

{-# language FlexibleContexts #-}
module BusinessLogic where

import BusinessLogic.Monad ( M )
import Control.Algebra ( Has )
import Control.Effect.Empty ( Empty, guard )

businessCode :: Has Empty sig M => Bool -> M Int
businessCode b = do
  guard b
  return 42

In this module I’m using fused-effects as a framework to say which effects my monad should have (though this is not particularly important, I just like it!). Usually Has would be applied to a type variable m, but here we’re applying it to the type M. This type comes from BusinessLogic.Monad, which is a signature (you can confirm this by checking against the Cabal file). Other than that, this is all pretty standard!

Backpack-ing Monad Transformers

Now we get into the really fun stuff - providing implementations of effects. I mentioned earlier that one possible way to do this is with a stack of monad transformers. Generally speaking, one would write a single newtype T m a for each effect type class, and have that transformer dispatch any effects in that class and lift any effects from other classes - deferring their implementation to m.

We’re going to take the same approach here, but we’ll absorb the idea of a transformer directly into the module itself. Let’s look at an implementation of the Empty effect. The Empty effect gives us a special empty :: m a function, which serves the purpose of stopping execution immediately. As a monad transformer, one implementation is MaybeT:

newtype MaybeT m a = MaybeT { runMaybeT :: m (Maybe a) }

But we can also write this using Backpack. First, our Cabal library:

library fused-effects-empty-maybe
  hs-source-dirs:   src-fused-effects-backpack
  default-language: Haskell2010
  build-depends:
    , base
    , fused-effects
    , monad-sig

  exposed-modules: Control.Carrier.Backpack.Empty.Maybe
  mixins:
    monad-sig requires (Control.Monad.Signature as Control.Carrier.Backpack.Empty.Maybe.Base)

Our library exports the module Control.Carrier.Backpack.Empty.Maybe, but also has a hole - the type of base monad this transformer stacks on top of. As a monad transformer, this would be the m parameter, but when we use Backpack, we move that out into a separate module.

The implementation of Control.Carrier.Backpack.Empty.Maybe is short, and almost identical to the body of Control.Monad.Trans.Maybe - we just change any occurrences of m to instead refer to M from our .Base module:

{-# language BlockArguments, FlexibleContexts, FlexibleInstances, LambdaCase,
      MultiParamTypeClasses, TypeOperators, UndecidableInstances #-}

module Control.Carrier.Backpack.Empty.Maybe where

import Control.Algebra
import Control.Effect.Empty
import qualified Control.Carrier.Backpack.Empty.Maybe.Base as Base

type M = EmptyT

-- We could also write: newtype EmptyT a = EmptyT { runEmpty :: MaybeT Base.M a }
newtype EmptyT a = EmptyT { runEmpty :: Base.M (Maybe a) }

instance Functor EmptyT where
  fmap f (EmptyT m) = EmptyT $ fmap (fmap f) m

instance Applicative EmptyT where
  pure = EmptyT . pure . Just
  EmptyT f <*> EmptyT x = EmptyT do
    f >>= \case
      Nothing -> return Nothing
      Just f' -> x >>= \case
        Nothing -> return Nothing
        Just x' -> return (Just (f' x'))

instance Monad EmptyT where
  return = pure
  EmptyT x >>= f = EmptyT do
    x >>= \case
      Just x' -> runEmpty (f x')
      Nothing -> return Nothing

Finally, we make sure that EmptyT can handle the Empty effect:

instance Algebra sig Base.M => Algebra (Empty :+: sig) EmptyT where
  alg handle sig context = case sig of
    L Empty -> EmptyT $ return Nothing
    R other -> EmptyT $ thread (maybe (pure Nothing) runEmpty ~<~ handle) other (Just context)

Base Monads

Now that we have a way to run the Empty effect, we need a base case to our transformer stack. As our transformer is now built out of modules that conform to the Control.Monad.Signature signature, we need some modules for each monad that we could use as a base. For this POC, I’ve just added the IO monad:

library fused-effects-lift-io
  hs-source-dirs:   src-fused-effects-backpack
  default-language: Haskell2010
  build-depends:    base
  exposed-modules:  Control.Carrier.Backpack.Lift.IO

module Control.Carrier.Backpack.Lift.IO where
type M = IO

That’s it!

Putting It All Together

Finally we can put all of this together into an actual executable. We’ll take our library code, instantiate the monad to be a combination of EmptyT and IO, and write a little main function that unwraps this all into an IO type. First, here’s the Main module:

module Main where

import BusinessLogic
import qualified BusinessLogic.Monad

main :: IO ()
main = print =<< BusinessLogic.Monad.runEmptyT (businessCode True)

The BusinessLogic module we’ve seen before, but previously BusinessLogic.Monad was a signature (remember, we renamed Control.Monad.Signature to BusinessLogic.Monad). In executables, you can’t have signatures - executables can’t be depended on, so it doesn’t make sense for them to have holes; they must be complete. The magic happens in our Cabal file:

executable test
  main-is:          Main.hs
  hs-source-dirs:   exe
  build-depends:
    , base
    , business-logic
    , fused-effects-empty-maybe
    , fused-effects-lift-io
    , transformers

  default-language: Haskell2010
  mixins:
    fused-effects-empty-maybe (Control.Carrier.Backpack.Empty.Maybe as BusinessLogic.Monad) requires (Control.Carrier.Backpack.Empty.Maybe.Base as BusinessLogic.Monad.Base),
    fused-effects-lift-io (Control.Carrier.Backpack.Lift.IO as BusinessLogic.Monad.Base)

Wow, that’s a mouthful! The work is really happening in mixins. Let’s take this step by step:

  1. First, we can see that we need to mixin the fused-effects-empty-maybe library. The first (X as Y) section specifies a list of modules from fused-effects-empty-maybe and renames them for the test executable that’s currently being compiled. Here, we’re renaming Control.Carrier.Backpack.Empty.Maybe as BusinessLogic.Monad. By doing this, we satisfy the hole in the business-logic library, which was otherwise incomplete.

  2. But fused-effects-empty-maybe itself has a hole - the base monad for the transformer. The requires part lets us rename this hole, but we’ll still need to plug it. For now, we rename this hole from Control.Carrier.Backpack.Empty.Maybe.Base to BusinessLogic.Monad.Base.

  3. Next, we mixin the fused-effects-lift-io library, and rename Control.Carrier.Backpack.Lift.IO to be BusinessLogic.Monad.Base. We’ve now satisfied the hole for fused-effects-empty-maybe, and our executable has no more holes and can be compiled.

We’re Done!

That’s “all” there is to it. We can finally run our program:

$ cabal run
Just 42

If you compare against businessCode you’ll see that we got past the guard and returned 42. Because we instantiated BusinessLogic.Monad with a MaybeT-like transformer, this 42 got wrapped up in Just.

Is This Fast?

The best check here is to just look at the underlying code itself. If we add

{-# options -ddump-simpl -ddump-stg -dsuppress-all #-}

to BusinessLogic and recompile, we’ll see the final code output to STDERR. The core is:

businessCode1
  = \ @ sig_a2cM _ b_a13P eta_B1 ->
      case b_a13P of {
        False -> (# eta_B1, Nothing #);
        True -> (# eta_B1, lvl1_r2NP #)
      }

and the STG:

businessCode1 =
    \r [$d(%,%)_s2PE b_s2PF eta_s2PG]
        case b_s2PF of {
          False -> (#,#) [eta_s2PG Nothing];
          True -> (#,#) [eta_s2PG lvl1_r2NP];
        };

Voila!

Conclusion

In this post, I’ve hopefully shown how we can use Backpack to write effectful code without paying the cost of abstraction. What I didn’t answer is the question of whether or not you should. There’s a lot more to effectful code than I’ve presented, and it’s unclear to me whether this approach can scale to meet those needs. For example, if we needed something like mmorph’s MFunctor, what do we do? Are we stuck? I don’t know! Beyond these technical challenges, it’s clear that Backpack here is also not remotely ergonomic, as is. We’ve had to write five components just to get this done, and I pray for anyone who comes to read this code and has to orientate themselves.

Nonetheless, I think this is an interesting point in the effect design space that hasn’t been explored, and maybe I’ve motivated some people to do some further exploration.

The code for this blog post can be found at https://github.com/ocharles/fused-effects-backpack.

Happy holidays, all!

by Oliver Charles at December 23, 2020 12:00 AM

December 16, 2020

Tweag I/O

Trustix: Distributed trust and reproducibility tracking for binary caches

Downloading binaries from well-known providers is the easiest way to install new software. After all, building software from source is a chore — it requires both time and technical expertise. But how do we know that we aren’t installing something malicious from these providers?

Typically, we trust these binaries because we trust the provider. We believe that they were built from trusted sources, in a trusted computational environment, and with trusted build instructions. But even if the provider does everything transparently and in good faith, the binaries could still be anything if the provider’s system is compromised. In other words, the build process requires trust even if all build inputs (sources, dependencies, build scripts, etc…) are known.

Overcoming this problem is hard — after all, how can we verify the output of arbitrary build inputs? Excitingly, the last few years have brought about ecosystems such as Nix, where all build inputs are known and where significant amounts of builds are reproducible. This means that the correspondence between inputs and outputs can be verified by building the same binary multiple times! The r13y project, for example, tracks non-reproducible builds by building them twice on the same machine, showing that this is indeed practical.

But we can go further, and that’s the subject of this blog post, which introduces Trustix, a new tool we are working on. Trustix compares build outputs for given build inputs across independent providers and machines, effectively decentralizing trust. This establishes what I like to call build transparency because it verifies what black box build machines are doing. Behind the scenes Trustix builds a Merkle tree-based append-only log that maps build inputs to build outputs, which I’ll come back to in a later post. This log can be used to establish consensus whether certain build inputs always produce the same output — and can therefore be trusted. Conversely, it can also be used to uncover non-reproducible builds, corrupted or not, on a large scale.

The initial implementation of Trustix, and its description in this post are based on the Nix package manager. Nix focuses on isolated builds, provides access to the hashes of all build inputs as well as a high quantity of bit-reproducible packages, making it the ideal initial testing ecosystem. However, Trustix was designed to be system-independent, and is not strongly tied to Nix.

The development of Trustix is funded by NLnet foundation and the European Commission’s Next Generation Internet programme through the NGI Zero PET (privacy and trust enhancing technologies) fund. The tool is still in development, but I’m very excited to announce it already!

How Nix verifies binary cache results

Most Linux package managers use a very simple signature scheme to secure binary distribution to users. Some use GPG keys, some use OpenSSL certificates, and others use some other kind of key, but the idea is essentially the same for all of them. The general approach is that binaries are signed with a private key, and clients can use an associated public key to check that a binary was really signed by the trusted entity.

Nix for example uses an ed25519-based key signature scheme and comes with a default hard-coded public key that corresponds to the default cache. This key can be overridden or complemented by others, allowing the use of additional caches. The list of signing keys can be found in /etc/nix/nix.conf. The default base64-encoded ed25519 public key with a name as additional metadata looks like this:

trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=
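
For example (an illustrative sketch of my own; the second cache and its key are placeholders), a cache operator generates a key pair with nix-store --generate-binary-cache-key, and users trust an additional cache by appending its public key to this whitespace-separated list:

$ nix-store --generate-binary-cache-key example-cache-1 secret-key-file public-key-file

trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= example-cache-1:<contents of public-key-file>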

Now, in Nix, software is addressed by the hash of all of its build inputs (sources, dependencies and build instructions). This hash, or output path, is used to query a cache (like https://cache.nixos.org) for a binary.

Here is an example: The hash of the hello derivation can be obtained from a shell with nix-instantiate:

$ nix-instantiate '<nixpkgs>' --eval -A hello.outPath
"/nix/store/w9yy7v61ipb5rx6i35zq1mvc2iqfmps1-hello-2.10"

Here, behind the scenes, we have evaluated and hashed all build inputs that the hello derivation needs (.outPath is just a helper). This hash can then be used to query the default Nix binary cache:

$ curl https://cache.nixos.org/w9yy7v61ipb5rx6i35zq1mvc2iqfmps1.narinfo
StorePath: /nix/store/w9yy7v61ipb5rx6i35zq1mvc2iqfmps1-hello-2.10
URL: nar/15zk4zszw9lgkdkkwy7w11m5vag11n5dhv2i6hj308qpxczvdddx.nar.xz
Compression: xz
FileHash: sha256:15zk4zszw9lgkdkkwy7w11m5vag11n5dhv2i6hj308qpxczvdddx
FileSize: 41232
NarHash: sha256:1mi14cqk363wv368ffiiy01knardmnlyphi6h9xv6dkjz44hk30i
NarSize: 205968
References: 9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31 w9yy7v61ipb5rx6i35zq1mvc2iqfmps1-hello-2.10
Sig: cache.nixos.org-1:uP5KU8MCmyRnKGlN5oEv6xWJBI5EO/Pf5aFztZuLSz8BpCcZ1fdBnJkVXhBAlxkdm/CemsgQskhwvyd2yghTAg==

Besides links to the archive that contains the compressed binaries, this response includes two relevant pieces of information which are used to verify binaries from the binary cache(s):

  • The NarHash is a hash over all Nix store directory contents
  • The Sig is a cryptographic signature over the NarHash

With this information, the client can check that this binary really comes from the provider’s Nix store.

What are the limitations of this model?

While this model has served Nix and others well for many years it suffers from a few problems. All of these problems can be traced back to a single point of failure in the chain of trust:

  • First, if the key used by cache.nixos.org is ever compromised, all builds that were ever added to the cache can be considered tainted.
  • Second, one needs to put either full trust or no trust at all in the build machines of a binary cache — there is no middle ground.
  • Finally, there is no inherent guarantee that the build inputs described in the Nix expressions were actually used to build what’s in the cache.

Trustix

Trustix aims to solve these problems by assembling a mapping from build inputs to (hashes of) build outputs provided by many build machines.

Instead of relying on verifying package signatures, like the traditional Nix model does, Trustix only exposes packages that it considers trustworthy. Concretely, Trustix is configured as a proxy for a binary cache, and hides the packages which are not trustworthy. As far as Nix is concerned, a package not being trustworthy is exactly as if the package wasn’t stored in the binary cache to begin with. If such a package is required, Nix will therefore build it from source.

Trustix doesn’t define what a trustworthy package is. What your Trustix instance considers trustworthy is up to you. The rules for accepting packages are entirely configurable. In fact, in the current prototype, there isn’t a default rule for packages to count as trustworthy: you need to configure trustworthiness yourself.

With this in mind, let’s revisit the above issues:

  • In Trustix, if an entity is compromised, you can rely on all other entities in the network to establish that a binary artefact is trustworthy. Maybe a few hashes are wrong in the Trustix mapping, but if an overwhelming majority of the outputs are the same, you can trust that the corresponding artefact is indeed what you would have built yourself.

    Therefore you never need to invalidate an entire binary cache: you can still verify the trustworthiness of old packages, even if newer packages are built by a malicious actor.

  • In Trustix, you never typically consider any build machine to be fully trusted. You always check their results against the other build machines. You can further configure this by considering some machines as more trusted (maybe because it is a community-operated machine, and you trust said community) or less trusted (for instance, because it has been compromised in the past, and you fear it may be compromised again).

    Moreover, in the spirit of having no single point of failure, Trustix’s mapping is not kept in a central database. Instead every builder keeps a log of its builds; these logs are aggregated on your machine by your instance of the Trustix daemon. Therefore even the mapping itself doesn’t have to be fully trusted.

  • In Trustix, package validity is not ensured by a signature scheme. Instead Trustix relies on the consistency of the input to output mapping. As a consequence, the validity criterion, contrary to a signature scheme, links the output to the input. It makes it infeasible to pass the build result of input I as a build result for input J: it would require corrupting the entire network.

Limitations: reproducibility tracking and non-reproducible builds

A system like Trustix will not work well with builds that are non-reproducible, which is a limitation of this model. After all, you cannot reach consensus if everyone’s opinions differ.

However, Trustix can still be useful, even for non-reproducible builds! By accumulating all the data in the various logs and aggregating them, we can track which derivations are non-reproducible over all of Nixpkgs, in a way that is easier than previously possible. Whereas the r13y project builds a single closure on a single machine, Trustix will index everything ever built on every architecture.

Conclusion

I am very excited to be working on the next generation of tooling for trust and reproducibility, and for the purely functional software packaging model pioneered by Nix to keep enabling new use cases. I hope that this work can be a foundation for many applications other than improving trust — for example, by enabling the Nix community to support new CPU architectures with community binary caches.

Please check out the code at the repo or join us for a chat over in #untrustix on Freenode. And stay tuned — in the next blog post, we will talk more about Merkle trees and how they are used in Trustix.


December 16, 2020 12:00 AM

November 29, 2020

Sander van der Burg

Constructing a simple alerting system with well-known open source projects


Some time ago, I was experimenting with all kinds of monitoring and alerting technologies. For example, with the following tools, I can develop a simple alerting system with relative ease:

  • Telegraf is an agent that can be used to gather measurements and transfer the corresponding data to all kinds of storage solutions.
  • InfluxDB is a time series database platform that can store, manage and analyze timestamped data.
  • Kapacitor is a real-time streaming data processing engine that can be used for a variety of purposes. I can use Kapacitor to analyze measurements and see if a threshold has been exceeded so that an alert can be triggered.
  • Alerta is a monitoring system that can store and de-duplicate alerts, and arrange blackouts.
  • Grafana is a multi-platform open source analytics and interactive visualization web application.

These technologies appear to be quite straightforward to use. However, as I was learning more about them, I discovered a number of oddities that may have big implications.

Furthermore, testing and making incremental changes also turn out to be much more challenging than expected, making it very hard to diagnose and fix problems.

In this blog post, I will describe how I built a simple monitoring and alerting system, and elaborate on my learning experiences.

Building the alerting system


As described in the introduction, I can combine several technologies to create an alerting system. I will explain them more in detail in the upcoming sections.

Telegraf


Telegraf is a pluggable agent that gathers measurements from a variety of inputs (such as system metrics, platform metrics, database metrics etc.) and sends them to a variety of outputs, typically storage solutions (database management systems such as InfluxDB, PostgreSQL or MongoDB). Telegraf has a large plugin ecosystem that provides all kinds of integrations.

In this blog post, I will use InfluxDB as an output storage backend. For the inputs, I will restrict myself to capturing a subset of system metrics only.

With the following telegraf.conf configuration file, I can capture a variety of system metrics every 10 seconds:


[agent]
interval = "10s"

[[outputs.influxdb]]
urls = [ "https://test1:8086" ]
database = "sysmetricsdb"
username = "sysmetricsdb"
password = "sysmetricsdb"

[[inputs.system]]
# no configuration

[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false
## If true, compute and report the sum of all non-idle CPU states.
report_active = true

[[inputs.mem]]
# no configuration

With the above configuration file, I can collect the following metrics:
  • System metrics, such as the hostname and system load.
  • CPU metrics, such as how much the CPU cores on a machine are utilized, including the total CPU activity.
  • Memory (RAM) metrics.

The data will be stored in an InfluxDB database named sysmetricsdb, hosted on a remote machine with the host name test1.
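
Jumping slightly ahead to the influx client (described in the next section), a quick sanity check along the following lines should list the cpu, mem and system measurements created by the inputs configured above:


$ influx -database sysmetricsdb -host test1
> SHOW MEASUREMENTS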

InfluxDB


As explained earlier, InfluxDB is a time series platform that can store, manage and analyze timestamped data. In many ways, InfluxDB resembles relational databases, but there are also some notable differences.

The query language that InfluxDB uses is called InfluxQL (that shares many similarities with SQL).

For example, with the following query I can retrieve the first three data points from the cpu measurement, that contains the CPU-related measurements collected by Telegraf:


> precision rfc3339
> select * from "cpu" limit 3

providing me the following result set:


name: cpu
time cpu host usage_active usage_guest usage_guest_nice usage_idle usage_iowait usage_irq usage_nice usage_softirq usage_steal usage_system usage_user
---- --- ---- ------------ ----------- ---------------- ---------- ------------ --------- ---------- ------------- ----------- ------------ ----------
2020-11-16T15:36:00Z cpu-total test2 10.665258711721098 0 0 89.3347412882789 0.10559662090813073 0 0 0.10559662090813073 0 8.658922914466714 1.79514255543822
2020-11-16T15:36:00Z cpu0 test2 10.665258711721098 0 0 89.3347412882789 0.10559662090813073 0 0 0.10559662090813073 0 8.658922914466714 1.79514255543822
2020-11-16T15:36:10Z cpu-total test2 0.1055966209080346 0 0 99.89440337909197 0 0 0 0.10559662090813073 0 0 0

As you can probably see by looking at the output above, every data point has a timestamp and a number of fields capturing CPU metrics:

  • cpu identifies the CPU core.
  • host contains the host name of the machine.
  • The remainder of the fields contain all kinds of CPU metrics, e.g. how much CPU time is consumed by the system (usage_system), the user (usage_user), by waiting for IO (usage_iowait) etc.
  • The usage_active field contains the total CPU activity percentage, which is going to be useful to develop an alert that will warn us if there is too much CPU activity for a long period of time.

Aside from the fact that all data is timestamp based, data in InfluxDB has another notable difference compared to relational databases: an InfluxDB database is schemaless. You can add an arbitrary number of fields and tags to a data point without having to adjust the database structure (and migrate existing data to the new structure).

Fields and tags can contain arbitrary data, such as numeric values or strings. Tags are also indexed so that you can search for these values more efficiently. Furthermore, tags can be used to group data.

For example, the cpu measurement collection has the following tags:


> SHOW TAG KEYS ON "sysmetricsdb" FROM "cpu";
name: cpu
tagKey
------
cpu
host

As shown in the above output, the cpu and host columns are tags in the cpu measurement.

We can use these tags to search for all data points related to a CPU core and/or host machine. Moreover, we can use these tags for grouping, allowing us to compute aggregate values, such as the mean value per CPU core and host.
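
For example, a query along the following lines computes the mean CPU activity per host and CPU core over the last hour (a sketch, using the field and tag names established above):


> SELECT mean("usage_active") FROM "cpu" WHERE time > now() - 1h GROUP BY "host", "cpu"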

Beyond storing and retrieving data, InfluxDB has many useful additional features:

  • You can automatically downsample data by running continuous queries that generate and store the sampled data in the background.
  • You can configure retention policies so that data is not stored indefinitely. For example, you can configure a retention policy to drop raw data after a certain amount of time, but retain the corresponding sampled data (see the example below).
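
To illustrate these two features, the following InfluxQL statements create a retention policy that keeps raw data for 30 days, and a continuous query that stores 5-minute means of the CPU activity in a separate measurement. This is only a sketch: the policy name (raw), the query name (cq_cpu_5m) and the target measurement (cpu_5m) are made up for this example:


> CREATE RETENTION POLICY "raw" ON "sysmetricsdb" DURATION 30d REPLICATION 1 DEFAULT
> CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "sysmetricsdb" BEGIN SELECT mean("usage_active") AS "usage_active" INTO "sysmetricsdb"."autogen"."cpu_5m" FROM "cpu" GROUP BY time(5m), * END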

InfluxDB has an "open core" development model. The free and open source (FOSS) edition of the InfluxDB server (that is MIT licensed) allows you to host multiple databases on multiple servers.

However, if you also want horizontal scalability and/or high availability, then you need to switch to the hosted InfluxDB versions -- data in InfluxDB is partitioned into so-called shards that cover a fixed time range (the default shard duration is 168 hours).

These shards can be distributed over multiple InfluxDB servers. It is also possible to deploy multiple read replicas of the same shard to multiple InfluxDB servers, improving read speed.

Kapacitor


Kapacitor is a real-time streaming data process engine developed by InfluxData -- the same company that also develops InfluxDB and Telegraf.

It can be used for all kinds of purposes. In my example cases, I will only use it to determine whether some threshold has been exceeded and an alert needs to be triggered.

Kapacitor works with custom tasks that are written in a domain-specific language called the TICK script language. There are two kinds of tasks: stream and batch tasks. Both task types have advantages and disadvantages.

We can easily develop an alert that gets triggered if the CPU activity level is high for a relatively long period of time (more than 75% on average over 1 minute).

To implement this alert as a stream job, we can write the following TICK script:


dbrp "sysmetricsdb"."autogen"

stream
|from()
.measurement('cpu')
.groupBy('host', 'cpu')
.where(lambda: "cpu" != 'cpu-total')
|window()
.period(1m)
.every(1m)
|mean('usage_active')
|alert()
.message('Host: {{ index .Tags "host" }} has high cpu usage: {{ index .Fields "mean" }}')
.warn(lambda: "mean" > 75.0)
.crit(lambda: "mean" > 85.0)
.alerta()
.resource('{{ index .Tags "host" }}/{{ index .Tags "cpu" }}')
.event('cpu overload')
.value('{{ index .Fields "mean" }}')

A stream job is built around the following principles:

  • A stream task does not execute queries on an InfluxDB server. Instead, it creates a subscription to InfluxDB -- whenever a data point gets inserted into InfluxDB, the data point gets forwarded to Kapacitor as well.

    To make subscriptions work, both InfluxDB and Kapacitor need to be able to connect to each other with a public IP address.
  • A stream task defines a pipeline consisting of a number of nodes (connected with the | operator). Each node can consume data points, filter, transform, aggregate, or execute arbitrary operations (such as calling an external service), and produce new data points that can be propagated to the next node in the pipeline.
  • Every node also has property methods (such as .measurement('cpu')) making it possible to configure parameters.

The TICK script example shown above does the following:

  • The from node consumes cpu data points from the InfluxDB subscription, groups them by host and cpu and filters out data points with the cpu-total label, because we are only interested in the CPU consumption per core, not the total amount.
  • The window node states that we should aggregate data points over the last 1 minute and pass the resulting (aggregated) data points to the next node after one minute in time has elapsed. To aggregate data, Kapacitor will buffer data points in memory.
  • The mean node computes the mean value for usage_active for the aggregated data points.
  • The alert node is used to trigger an alert of a specific severity level: WARNING if the mean activity percentage is bigger than 75%, and CRITICAL if it is bigger than 85%. In all other cases, the status is considered OK. The alert is sent to Alerta.
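
Once such a TICK script has been written, it needs to be registered with and enabled in Kapacitor. With the kapacitor command-line client, this roughly looks as follows (a sketch: the file name cpu_alert.tick is chosen for this illustration, and the exact flags may differ per Kapacitor version):


$ kapacitor define cpu_alert -type stream -tick cpu_alert.tick
$ kapacitor enable cpu_alert
$ kapacitor show cpu_alert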

It is also possible to write a similar kind of alerting script as a batch task:


dbrp "sysmetricsdb"."autogen"

batch
|query('''
SELECT mean("usage_active")
FROM "sysmetricsdb"."autogen"."cpu"
WHERE "cpu" != 'cpu-total'
''')
.period(1m)
.every(1m)
.groupBy('host', 'cpu')
|alert()
.message('Host: {{ index .Tags "host" }} has high cpu usage: {{ index .Fields "mean" }}')
.warn(lambda: "mean" > 75.0)
.crit(lambda: "mean" > 85.0)
.alerta()
.resource('{{ index .Tags "host" }}/{{ index .Tags "cpu" }}')
.event('cpu overload')
.value('{{ index .Fields "mean" }}')

The above TICK script looks similar to the stream task shown earlier, but instead of using a subscription, the script queries the InfluxDB database (with an InfluxQL query) for data points over the last minute with a query node.

Which approach for writing a CPU alert is best, you may wonder? Each of these two approaches has its pros and cons:

  • Stream tasks offer low latency responses -- when a data point appears, a stream task can immediately respond, whereas a batch task needs to query all the data points every minute to compute the mean percentage over the last minute.
  • Stream tasks maintain a buffer for aggregating the data points, making it possible to only send incremental updates to Alerta. Batch tasks are stateless. As a result, they need to update the status of all hosts and CPUs every minute.
  • Processing data points is done synchronously and in sequential order -- if an update round to Alerta takes too long (which is more likely to happen with a batch task), then the next processing run may overlap with the previous, causing all kinds of unpredictable results.

    It may also cause Kapacitor to eventually crash due to growing resource consumption.
  • Batch tasks may also miss data points -- while querying data over a certain time window, it may happen that a new data point gets inserted in that time window (that is being queried). This new data point will not be picked up by Kapacitor.

    A subscription made by a stream task, however, will never miss any data points.
  • Stream tasks can only work with data points that appear from the moment Kapacitor is started -- it cannot work with data points in the past.

    For example, if Kapacitor is restarted and some important event is triggered in the restart time window, Kapacitor will not notice that event, causing the alert to remain in its previous state.

    To work effectively with stream tasks, a continuous data stream is required that frequently reports on the status of a resource. Batch tasks, on the other hand, can work with historical data.
  • The fact that nodes maintain a buffer may also cause the RAM consumption of Kapacitor to grow considerably, if the data volumes are big.

    A batch task on the other hand, does not buffer any data and is more memory efficient.

    Another compelling advantage of batch tasks over stream tasks is that InfluxDB does all the work. The hosted version of InfluxDB can also horizontally scale.
  • Batch tasks can also aggregate data more efficiently (e.g. computing the mean value or sum of values over a certain time period).

I consider neither of these script types the optimal solution. However, for implementing the alerts I tend to have a slight preference for stream jobs, because of their low latency and incremental update properties.

Alerta


As explained in the introduction, Alerta is a monitoring system that can store and de-duplicate alerts, and arrange blackouts.

The Alerta server provides a REST API that can be used to query and modify alerting data and uses MongoDB or PostgreSQL as a storage database.

There is also a variety of Alerta clients: the alerta CLI allows you to control the service from the command line, and there is a web user interface that I will show later in this blog post.
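
For example, the list of active alerts can be retrieved either with the CLI or with a plain HTTP request against the REST API. The request below is only a sketch: the endpoint and API key are placeholders, and the exact URL path depends on how the API is mounted:


$ alerta --output json query
$ curl -H "Authorization: Key $ALERTA_API_KEY" "$ALERTA_ENDPOINT/alerts"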

Running experiments


With all the components described above in place, we can start running experiments to see if the CPU alert will work as expected. To gain better insights into the process, I can install Grafana, which allows me to visualize the measurements that are stored in InfluxDB.

Configuring a dashboard and panel for visualizing the CPU activity rate was straightforward. I configured a new dashboard with the following variables:


The above variables allow me to select, for each machine in the network, which CPU core's activity percentage I want to visualize.

I have configured the CPU panel as follows:


In the above configuration, I query the usage_active field from the cpu measurement collection, using the dashboard variables cpu and host to filter for the right target machine and CPU core.

I have also configured the field unit to be a percentage value (between 0 and 100).
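
Since the dashboard screenshots are not reproduced here, the definitions roughly correspond to the following InfluxQL queries: two query-type template variables (for host and cpu) and the panel query that uses them together with Grafana's time range. This is an approximation of the original dashboard, not an exact copy:


SHOW TAG VALUES FROM "cpu" WITH KEY = "host"
SHOW TAG VALUES FROM "cpu" WITH KEY = "cpu"

SELECT "usage_active" FROM "cpu" WHERE "host" =~ /^$host$/ AND "cpu" =~ /^$cpu$/ AND $timeFilter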

When running the following command-line instruction on a test machine that runs Telegraf (test2), I can deliberately hog the CPU:


$ dd if=/dev/zero of=/dev/null

The above command reads bytes from /dev/zero (one by one) and discards them by writing them to /dev/null, causing the CPU to remain utilized at a high level:


In the graph shown above, it is clearly visible that CPU core 0 on the test2 machine remains utilized at 100% for several minutes.

(As a sidenote, we can also hog both the CPU and consume RAM at the same time with a simple command line instruction).
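
The exact command the sidenote refers to is not shown here. One possible example (not necessarily the command referred to, and one that should only be run on a disposable test machine) is to let tail buffer an endless stream in memory:


# WARNING: keeps allocating memory until the process is killed (e.g. by the OOM killer)
$ tail /dev/zero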

If we keep hogging the CPU and wait for at least a minute, the Alerta web interface dashboard will show a CRITICAL alert:


If we stop the dd command, then the TICK script should eventually notice that the mean percentage drops below the WARNING threshold causing the alert to go back into the OK state and disappearing from the Alerta dashboard.

Developing test cases


Being able to trigger an alert with a simple command-line instruction is useful, but not always convenient or effective -- one of the inconveniences is that we always have to wait at least one minute to get feedback.

Moreover, when an alert does not work, it is not always easy to find the root cause. I have encountered the following problems that contribute to a failing alert:

  • Telegraf may not be running and, as a result, not capturing the data points that need to be analyzed by the TICK script.
  • A subscription cannot be established between InfluxDB and Kapacitor. This may happen when Kapacitor cannot be reached through a public IP address.
  • Data points are collected, but only the wrong kinds of measurements.
  • The TICK script is functionally incorrect.

Fortunately, for stream tasks it is relatively easy to quickly find out whether an alert is functionally correct or not -- we can generate test cases that almost instantly trigger each possible outcome with a minimal number of data points.

An interesting property of stream tasks is that they have no notion of time -- the window node's .period(1m) property may suggest that Kapacitor computes the mean value of the data points every minute, but that is not what it actually does. Instead, Kapacitor only looks at the timestamps of the data points that it receives.

As long as the timestamps of the data points fit in the 1 minute time window, Kapacitor keeps buffering. As soon as a data point appears that falls outside this time window, the window node relays an aggregated data point to the next node (that computes the mean value, which in turn is consumed by the alert node deciding whether an alert needs to be raised or not).

We can exploit that knowledge to create a very minimal bash test script that triggers every possible outcome: OK, WARNING and CRITICAL:


influxCmd="influx -database sysmetricsdb -host test1"

export ALERTA_ENDPOINT="https://test1"

### Trigger CRITICAL alert

# Force the average CPU consumption to be 100%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 0000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 60000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 120000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "critical" ]
then
echo "Expected severity: critical, but we got: $actualSeverity" >&2
false
fi

### Trigger WARNING alert

# Force the average CPU consumption to be 80%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 180000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 240000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=80 300000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "warning" ]
then
echo "Expected severity: warning, but we got: $actualSeverity" >&2
false
fi

### Trigger OK alert

# Force the average CPU consumption to be 0%
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 300000000000"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 360000000000"
# This data point triggers the alert
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=0 420000000000"

sleep 1
actualSeverity=$(alerta --output json query | jq '.[0].severity')

if [ "$actualSeverity" != "ok" ]
then
echo "Expected severity: ok, but we got: $actualSeverity" >&2
false
fi

The shell script shown above automatically triggers all three possible outcomes of the CPU alert:

  • CRITICAL is triggered by generating data points that force a mean activity percentage of 100%.
  • WARNING is triggered by a mean activity percentage of 80%.
  • OK is triggered by a mean activity percentage of 0%.

It uses the Alerta CLI to connect to the Alerta server to check whether the alert's severity level has the expected value.

We need three data points to trigger each alert type -- the first two data points are on the boundaries of the 1 minute window (0 seconds and 60 seconds), forcing the mean value to become the specified CPU activity percentage.

The third data point is deliberately outside the time window (of 1 minute), forcing the alert node to be triggered with a mean value over the previous two data points.

Although the above test strategy works to quickly validate all possible outcomes, one impractical aspect is that the timestamps in the above example start with 0 (meaning 0 seconds after the epoch: January 1st 1970 00:00 UTC).

If we also want to observe the data points generated by the above script in Grafana, we need to configure the panel to go back in time 50 years.

Fortunately, I can also easily adjust the script to start with a base timestamp that is 1 hour in the past:


offset="$(($(date +%s) - 3600))"
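
# The INSERT statements can then use this offset as a base timestamp, e.g.
# (an illustrative sketch; InfluxDB line protocol timestamps are in nanoseconds):
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 $((offset * 1000000000))"
$influxCmd -execute "INSERT cpu,cpu=cpu0,host=test2 usage_active=100 $(((offset + 60) * 1000000000))"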

With this tiny adjustment, we should see the following CPU graph (displaying data points from the last hour) after running the test script:


As you may notice, we can see that the CPU activity level quickly goes from 100%, to 80%, to 0%, using only 9 data points.

Although testing stream tasks (from a functional perspective) is quick and convenient, testing batch tasks in a similar way is difficult. Contrary to the stream task implementation, the query node in the batch task does have a notion of time (because of the WHERE clause that includes the now() expression).

Moreover, the embedded InfluxQL query evaluates the mean values every minute, but the test script does not exactly know when this event triggers.

The only way I could think of to (somewhat reliably) validate the outcomes is by creating a test script that continuously inserts data points for at least double the time window size (2 minutes) until Alerta reports the right alert status (if it does not after a while, I can conclude that the alert is incorrectly implemented).

Automating the deployment


As you have probably guessed already, to be able to conveniently experiment with all these services, and to reliably run tests in isolation, some form of deployment automation is an absolute must-have.

Most people who do not know anything about my deployment technology preferences will probably go for Docker or docker-compose, but I have decided to use a variety of solutions from the Nix project.

NixOps is used to automatically deploy a network of NixOS machines -- I have created a logical and physical NixOps configuration that deploys two VirtualBox virtual machines.

With the following command I can create and deploy the virtual machines:


$ nixops create network.nix network-virtualbox.nix -d test
$ nixops deploy -d test

The first machine (test1) is responsible for hosting the entire monitoring infrastructure (InfluxDB, Kapacitor, Alerta, Grafana), and the second machine (test2) runs Telegraf and the load tests.

Disnix (my own deployment tool) is responsible for deploying all services, such as InfluxDB, Kapacitor, Alerta, and the database storage backends. Contrary to docker-compose, Disnix does not work with containers (or other Docker objects, such as networks or volumes), but with arbitrary deployment units that are managed with a plugin system called Dysnomia.

Moreover, Disnix can also be used for distributed deployment in a network of machines.

I have packaged all the services and captured them in a Disnix services model that specifies all deployable services, their types, and their inter-dependencies.

If I combine the services model with the NixOps network models, and a distribution model (that maps Telegraf and the test scripts to the test2 machine and the remainder of the services to the first: test1), I can deploy the entire system:


$ export NIXOPS_DEPLOYMENT=test
$ export NIXOPS_USE_NIXOPS=1

$ disnixos-env -s services.nix \
-n network.nix \
-n network-virtualbox.nix \
-d distribution.nix

The following diagram shows a possible deployment scenario of the system:


The above diagram describes the following properties:

  • The light-grey colored boxes denote machines. In the above diagram, we have two of them: test1 and test2 that correspond to the VirtualBox machines deployed by NixOps.
  • The dark-grey colored boxes denote containers in a Disnix-context (not to be confused with Linux or Docker containers). These are environments that manage other services.

    For example, a container service could be the PostgreSQL DBMS managing a number of PostgreSQL databases or the Apache HTTP server managing web applications.
  • The ovals denote services that could be any kind of deployment unit. In the above example, we have services that are running processes (managed by systemd), databases and web applications.
  • The arrows denote inter-dependencies between services. When a service has an inter-dependency on another service (i.e. the arrow points from the former to the latter), then the latter service needs to be activated first. Moreover, the former service also needs to know how the latter can be reached.
  • Services can also be container providers (as denoted by the arrows in the labels), stating that other services can be embedded inside this service.

    As already explained, the PostgreSQL DBMS is an example of such a service, because it can host multiple PostgreSQL databases.

Although the process components in the diagram above can also be conveniently deployed with Docker-based solutions (i.e. as I have explained in an earlier blog post, containers are somewhat confined and restricted processes), the non-process integrations need to be managed by other means, such as writing extra shell instructions in Dockerfiles.

In addition to deploying the system to machines managed by NixOps, it is also possible to use the NixOS test driver -- the NixOS test driver automatically generates QEMU virtual machines with a shared Nix store, so that no disk images need to be created, making it possible to quickly spawn networks of virtual machines, with very small storage footprints.

I can also create a minimal distribution model that only deploys the services required to run the test scripts -- Telegraf, Grafana and the front-end applications are not required, resulting in a much smaller deployment:


As can be seen in the above diagram, there are far fewer components required.

In this virtual network that runs a minimal system, we can run automated tests for rapid feedback. For example, the following test driver script (implemented in Python) will run my test shell script shown earlier:


test2.succeed("test-cpu-alerts")

With the following command I can automatically run the tests on the terminal:


$ nix-build release.nix -A tests

Availability


The deployment recipes, test scripts and documentation describing the configuration steps are stored in the monitoring playground repository that can be obtained from my GitHub page.

Besides the CPU activity alert described in this blog post, I have also developed a memory alert that triggers if too much RAM is consumed for a longer period of time.

In addition to virtual machines and services, there is also deployment automation in place that allows you to easily deploy Kapacitor TICK scripts and Grafana dashboards.

To deploy the system, you need to use the very latest version of Disnix (version 0.10) that was released very recently.

Acknowledgements


I would like to thank my employer, Mendix, for giving me the opportunity to write this blog post. Mendix allows developers to work two days per month on research projects, making projects like these possible.

Presentation


I have given a presentation about this subject at Mendix. For convenience, I have embedded the slides:

by Sander van der Burg (noreply@blogger.com) at November 29, 2020 08:57 PM

November 18, 2020

Tweag I/O

Self-references in a content-addressed Nix

In a previous post I explained why we were eagerly trying to change the Nix store model to allow for content-addressed derivations. I also handwaved that this was a real challenge, but without giving any hint as to why this could be tricky. So let's dive a bit into the gory details and understand some of the conceptual pain points with content-addressability in Nix, which forced us to make some trade-offs in how we handle content-addressed paths.

What are self-references?

This is a self-reference

Théophane Hufschmitt, This very article

A very trivial Nix derivation might look like this:

with import <nixpkgs> {};
writeScript "hello" ''
  #!${bash}/bin/bash

  ${hello}/bin/hello
''

The result of this derivation will be an executable file containing a script that will run the hello program. It will depend on the bash and hello derivations as we refer to them in the file.

We can build this derivation and execute it:

$ nix-build hello.nix
$ ./result
Hello, world!

So far, so good. Let’s now change our derivation to change the prompt of hello to something more personalized:

with import <nixpkgs> {};
writeScript "hello-its-me" ''
  #!${bash}/bin/bash

  echo "Hello, world! This is ${placeholder "out"}"
''

where ${placeholder "out"} is a magic value that will be replaced by the output path of the derivation during the build.

We can build this and run the result just fine

$ nix-build hello-its-me.nix
$ ./result
Hello, world! This is /nix/store/c0qw0gbp7rfyzm7x7ih279pmnzazg86p-hello-its-me

And we can check that the file is indeed who it claims to be:

$ /nix/store/c0qw0gbp7rfyzm7x7ih279pmnzazg86p-hello-its-me
Hello, world! This is /nix/store/c0qw0gbp7rfyzm7x7ih279pmnzazg86p-hello-its-me

While the hello derivation depends on bash and hello, hello-its-me depends on bash and… itself. This is something rather common in Nix. For example, it's rather natural for a C program to have /nix/store/xxx-foo/bin/foo depend on /nix/store/xxx-foo/lib/libfoo.so.

Self references and content-addressed paths

How do we build a content-addressed derivation foo in Nix? The recipe is rather simple:

  1. Build the derivation in a temporary directory /some/where/
  2. Compute the hash xxx of that /some/where/ directory
  3. Move the directory under /nix/store/xxx-foo/

You might see where things will go wrong with self-references: the reference will point to /some/where rather than /nix/store/xxx-foo, and so will be wrong (in addition to leaking a path to what should just be a temporary directory).

To work around that, we would need to compute this xxx hash before the build, but that’s quite impossible as the hash depends on the content of the directory, including the value of the self-references.

However, we can hack our way around it in most cases by allowing ourselves a bit of heuristic. The only assumption that we need to make is that all the self-references will appear textually (i.e. running strings on a file that contains self-references will print all the self-references out).

Under that assumption, we can:

  1. Build the derivation in our /some/where directory
  2. Replace all the occurrences of a self-reference by a magic value
  3. Compute the hash of the resulting path to determine the final path
  4. Replace all the occurrences of the magic value by the final path
  5. Move the resulting store path to its final path

Now you might think that this is a crazy hack − there are so many ways it could break. And in theory you'll be right. But, surprisingly, this works remarkably well in practice. You might also notice that pedantically speaking this scheme isn't exactly content-addressing because of the “modulo the final hash” part. But this is close enough to keep all the desirable properties of proper content addressing, while also enabling self-references, which wouldn't be possible otherwise. For example, the Fugue cloud deployment system used a generalisation of this technique which not only deals with self-references, but with reference cycles of arbitrary length.

However, there’s a key thing that’s required for this to work: patching strings in binaries is generally benign, but the final string must have the same length as the original one. But we can do that: we don’t know what the final xxx hash will be, but we know its length (because it’s a fixed-length hash), so we can just choose a temporary directory that has the right length (like a temporary store path with the same name), and we’re all set!
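
To make the scheme more concrete, here is a small shell sketch of the rewriting trick on a single text file. It is purely illustrative: the directory names are invented, sha256sum and the truncation stand in for Nix's real hashing of the serialised output, and nothing is actually moved into the store:

# temporary build directory, deliberately as long as a real store path
tmpdir=/tmp/store/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa-hello
# same-length placeholder used while hashing
placeholder=/tmp/store/00000000000000000000000000000000-hello

mkdir -p "$tmpdir"
printf '#!/bin/sh\necho "my own path is %s"\n' "$tmpdir" > "$tmpdir/hello"

# 1. normalise self-references to the placeholder and hash the result
hash=$(sed "s|$tmpdir|$placeholder|g" "$tmpdir/hello" | sha256sum | cut -c1-32)
final="/nix/store/$hash-hello"

# 2. rewrite the self-references to the final path (same length, so offsets in
#    binaries would be preserved) and move the result into place
sed -i "s|$tmpdir|$final|g" "$tmpdir/hello"
echo "would now move $tmpdir to $final"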

The annoying thing is that there’s no guarantee that there are no self-references hidden in such a way that a textual replacement won’t catch it (for example inside a compressed zip file). This is the main reason why content-addressability will not be the default in Nix, at first at least.

Non-deterministic builds − the diamond problem strikes back

No matter how hard Nix tries to isolate the build environment, some actions will remain inherently non-deterministic − anything that can yield a different output depending on the order in which concurrent tasks will be executed for example. This is an annoyance as it might prevent early cutoff (see our previous article on the subject in case you missed it).

But more than limiting the efficiency of the cache, this could also hurt the correctness of Nix if we’re not careful enough.

For example, consider the following dependency graph:

Dependency graph for foo

Alice wants to get foo installed. She already built lib0 and lib1 locally. Let’s call them lib0_a and lib1_a. The binary cache contains builds of lib0 and lib2. Let’s call them lib0_b and lib2_b. Because the build of lib0 is not deterministic, lib0_a and lib0_b are different — and so have a different hash. In a content-addressed word, that means they will be stored in different paths.

A simple cache implementation would want to fetch lib2_b from the cache and use it to build foo. This would also pull lib0_b, because it’s a dependency of lib2_b. But that would mean that foo would depend on both lib0_a and lib0_b.

Buggy runtime dependency graph for foo

In the happy case this would just be a waste of space − the dependency is duplicated, so we use twice as much memory to store it. But in many cases this would simply blow-up at some point — for example if lib0 is a shared library, the C linker will fail because of the duplicated symbols. Besides that, this breaks down the purity of the build as we get a different behavior depending on what’s already in the store at the start of the build.

Getting out of this

Nix's foundational paper shows a way out of this by rewriting hashes in substituted paths. This is however quite complex to implement for a first version, so the current implementation settles on a simpler (though not optimal) behavior where we only allow one build for each derivation. In the example above, lib0 has already been instantiated (as lib0_a), so we don't allow pulling in lib0_b (nor lib2_b, which depends on it) and we rebuild both lib2 and foo.

While not optimal − we’ll end-up rebuilding foo even if it’s already in the binary cache − this solution has the advantage of preserving correctness while staying conceptually and technically simple.

What now?

Part of this has already been implemented but there’s still quite a long way forward.

I hope for it to be usable (though maybe still experimental) for Nix 3.0.

And in the meantime stay tuned with our regular updates on discourse. Or wait for the next blog post that will explain another change that will be necessary — one that is less fundamental, but more user-facing.

November 18, 2020 12:00 AM

November 10, 2020

Cachix

Write access control for binary caches

As Cachix is growing, I have noticed a few issues along the way: Signing keys are still the best way to upload content and not delegate trust to Cachix, but users have also found that they can be difficult to manage, particularly if the secret key needs to be rotated. At this point, the best option is to clear out the cache completely, and re-sign everything with a newly generated key.

by Domen Kožar (support@cachix.org) at November 10, 2020 11:00 AM

October 31, 2020

Sander van der Burg

Building multi-process Docker images with the Nix process management framework

Some time ago, I described my experimental Nix-based process management framework that makes it possible to automatically deploy running processes (sometimes also ambiguously called services) from declarative specifications written in the Nix expression language.

The framework is built around two concepts. As its name implies, the Nix package manager is used to deploy all required packages and static artifacts, and a process manager of choice (e.g. sysvinit, systemd, supervisord and others) is used to manage the life-cycles of the processes.

Moreover, it is built around flexible concepts allowing integration with solutions that are not qualified as process managers (but can still be used as such), such as Docker -- each process instance can be deployed as a Docker container with a shared Nix store using the host system's network.

As explained in an earlier blog post, Docker has become such a popular solution that it has become a standard for deploying (micro)services (often as a utility in the Kubernetes solution stack).

When deploying a system that consists of multiple services with Docker, a typical strategy (and recommended practice) is to use multiple containers that have only one root application process. Advantages of this approach are that Docker can control the life-cycles of the applications, and that each process is (somewhat) isolated/protected from other processes and the host system.

By default, containers are isolated, but if they need to interact with other processes, then they can use all kinds of integration facilities -- for example, they can share namespaces, or use shared volumes.

In some situations, it may also be desirable to deviate from the one root process per container practice -- for some systems, processes may need to interact quite intensively (e.g. with IPC mechanisms, shared files or shared memory, or a combination of these), in which case the container boundaries introduce more inconveniences than benefits.

Moreover, when running multiple processes in a single container, common dependencies can also typically be more efficiently shared leading to lower disk and RAM consumption.

As explained in my previous blog post (that explores various Docker concepts), sharing dependencies between containers only works if containers are constructed from images that share the same layers with the same shared libraries. In practice, this form of sharing is not always as efficient as we want it to be.

Configuring a Docker image to run multiple application processes is somewhat cumbersome -- the official Docker documentation describes two solutions: one that relies on a wrapper script that starts multiple processes in the background and a loop that waits for the "main process" to terminate, and the other is to use a process manager, such as supervisord.

I realised that I could solve this problem much more conveniently by combining the dockerTools.buildImage {} function in Nixpkgs (that builds Docker images with the Nix package manager) with the Nix process management abstractions.

I have created my own abstraction function: createMultiProcessImage that builds multi-process Docker images, managed by any supported process manager that works in a Docker container.

In this blog post, I will describe how this function is implemented and how it can be used.

Creating images for single root process containers


As shown in earlier blog posts, creating a Docker image with Nix for a single root application process is very straightforward.

For example, we can build an image that launches a trivial web application service with an embedded HTTP server (as shown in many of my previous blog posts), as follows:


{dockerTools, webapp}:

dockerTools.buildImage {
  name = "webapp";
  tag = "test";

  runAsRoot = ''
    ${dockerTools.shadowSetup}
    groupadd webapp
    useradd webapp -g webapp -d /dev/null
  '';

  config = {
    Env = [ "PORT=5000" ];
    Cmd = [ "${webapp}/bin/webapp" ];
    Expose = {
      "5000/tcp" = {};
    };
  };
}

The above Nix expression (default.nix) invokes the dockerTools.buildImage function to automatically construct an image with the following properties:

  • The image has the following name: webapp and the following version tag: test.
  • The web application service requires some state to be initialized before it can be used. To configure state, we can run instructions in a QEMU virtual machine with root privileges (runAsRoot).

    In the above deployment Nix expression, we create an unprivileged user and group named: webapp. For production deployments, it is typically recommended to drop root privileges, for security reasons.
  • The Env directive is used to configure environment variables. The PORT environment variable is used to configure the TCP port where the service should bind to.
  • The Cmd directive starts the webapp process in foreground mode. The life-cycle of the container is bound to this application process.
  • Expose exposes TCP port 5000 to the public so that the service can respond to requests made by clients.

We can build the Docker image as follows:


$ nix-build

load it into Docker with the following command:


$ docker load -i result

and launch a container instance using the image as a template:


$ docker run -it -p 5000:5000 webapp:test

If the deployment of the container succeeded, we should get a response from the webapp process, by running:


$ curl http://localhost:5000
<!DOCTYPE html>
<html>
<head>
<title>Simple test webapp</title>
</head>
<body>
Simple test webapp listening on port: 5000
</body>
</html>

Creating multi-process images


As shown in previous blog posts, the webapp process is part of a bigger system, namely: a web application system with an Nginx reverse proxy forwarding requests to multiple webapp instances:


{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  sharedConstructors = import ../services-agnostic/constructors.nix {
    inherit pkgs stateDir runtimeDir logDir cacheDir tmpDir forceDisableUserChange processManager;
  };

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
  };
in
rec {
  webapp = rec {
    port = 5000;
    dnsName = "webapp.local";

    pkg = constructors.webapp {
      inherit port;
    };
  };

  nginx = rec {
    port = 8080;

    pkg = sharedConstructors.nginxReverseProxyHostBased {
      webapps = [ webapp ];
      inherit port;
    } {};
  };
}

The Nix expression above shows a simple processes model variant of that system, that consists of only two process instances:

  • The webapp process is (as shown earlier) an application that returns a static HTML page.
  • nginx is configured as a reverse proxy to forward incoming connections to multiple webapp instances using the virtual host header property (dnsName).

    If somebody connects to the nginx server with the following host name: webapp.local then the request is forwarded to the webapp service.

Configuration steps


To allow all processes in the process model shown to be deployed to a single container, we need to execute the following steps in the construction of an image:

  • Instead of deploying a single package, such as webapp, we need to refer to a collection of packages and/or configuration files that can be managed with a process manager, such as sysvinit, systemd or supervisord.

    The Nix process management framework provides all kinds of Nix function abstractions to accomplish this.

    For example, the following function invocation builds a configuration profile for the sysvinit process manager, containing a collection of sysvinit scripts (also known as LSB Init compliant scripts):


    profile = import ../create-managed-process/sysvinit/build-sysvinit-env.nix {
    exprFile = ./processes.nix;
    stateDir = "/var";
    };

  • Similar to single root process containers, we may also need to initialize state. For example, we need to create common FHS state directories (e.g. /tmp, /var etc.) in which services can store their relevant state files (e.g. log files, temp files).

    This can be done by running the following command:


    nixproc-init-state --state-dir /var
  • Another property that multiple process containers have in common is that they may also require the presence of unprivileged users and groups, for security reasons.

    With the following commands, we can automatically generate all required users and groups specified in a deployment profile:


    ${dysnomia}/bin/dysnomia-addgroups ${profile}
    ${dysnomia}/bin/dysnomia-addusers ${profile}
  • Instead of starting a (single root) application process, we need to start a process manager that manages the processes that we want to deploy. As already explained, the framework allows you to pick multiple options.

Starting a process manager as a root process


Of all the process managers that the framework currently supports, the most straightforward option to use in a Docker container is supervisord.

To use it, we can create a symlink to the supervisord configuration in the deployment profile:


ln -s ${profile} /etc/supervisor

and then start supervisord as a root process with the following command directive:


Cmd = [
  "${pkgs.pythonPackages.supervisor}/bin/supervisord"
  "--nodaemon"
  "--configuration" "/etc/supervisor/supervisord.conf"
  "--logfile" "/var/log/supervisord.log"
  "--pidfile" "/var/run/supervisord.pid"
];

(As a sidenote: creating a symlink is not strictly required, but makes it possible to control running services with the supervisorctl command-line tool).

Supervisord is not the only option. We can also use sysvinit scripts, but doing so is a bit tricky. As explained earlier, the life-cycle of a container is bound to a running root process (in foreground mode).

sysvinit scripts do not run in the foreground, but start processes that daemonize and terminate immediately, leaving daemon processes behind that remain running in the background.

As described in an earlier blog post about translating high-level process management concepts, it is also possible to run "daemons in the foreground" by creating a proxy script. We can also make a similar foreground proxy for a collection of daemons:


#!/bin/bash -e

_term()
{
    nixproc-sysvinit-runactivity -r stop ${profile}
    kill "$pid"
    exit 0
}

nixproc-sysvinit-runactivity start ${profile}

# Keep process running, but allow it to respond to the TERM and INT
# signals so that all scripts are stopped properly

trap _term TERM
trap _term INT

tail -f /dev/null & pid=$!
wait "$pid"

The above proxy script does the following:

  • It first starts all sysvinit scripts by invoking the nixproc-sysvinit-runactivity start command.
  • Then it registers a signal handler for the TERM and INT signals. The corresponding callback triggers a shutdown procedure.
  • We invoke a dummy command that keeps running in the foreground without consuming too many system resources (tail -f /dev/null) and we wait for it to terminate.
  • The signal handler properly deactivates all processes in reverse order (with the nixproc-sysvinit-runactivity -r stop command), and finally terminates the dummy command causing the script (and the container) to stop.

In addition to supervisord and sysvinit, we can also use Disnix as a process manager by using a similar strategy with a foreground proxy.

Other configuration properties


The above configuration properties suffice to get a multi-process container running. However, to make working with such containers more practical from a user perspective, we may also want to:

  • Add basic shell utilities to the image, so that you can control the processes, investigate log files (in case of errors), and do other maintenance tasks.
  • Add a .bashrc configuration file to make file coloring work for the ls command, and to provide a decent prompt in a shell session (a small sketch follows below).
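
A minimal sketch of such a .bashrc could look as follows (purely illustrative; this is not the file that the framework actually generates):


# enable colored ls output and provide a simple informative prompt
alias ls='ls --color=auto'
export PS1='\u@\h:\w\$ '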

Usage


The configuration steps described in the previous section are wrapped into a function named createMultiProcessImage, which itself is a thin wrapper around the dockerTools.buildImage function in Nixpkgs -- it accepts the same parameters, plus a number of additional parameters that are specific to multi-process configurations.

The following function invocation builds a multi-process container deploying our example system, using supervisord as a process manager:


let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
}

After building the image, and deploying a container, with the following commands:


$ nix-build
$ docker load -i result
$ docker run -it --network host multiprocess:test

we should be able to connect to the webapp instance via the nginx reverse proxy:


$ curl -H 'Host: webapp.local' http://localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Simple test webapp</title>
</head>
<body>
Simple test webapp listening on port: 5000
</body>
</html>

As explained earlier, the constructed image also provides extra command-line utilities to do maintenance tasks, and control the life-cycle of the individual processes.

For example, we can "connect" to the running container, and check which processes are running:


$ docker exec -it mycontainer /bin/bash
# supervisorctl
nginx RUNNING pid 11, uptime 0:00:38
webapp RUNNING pid 10, uptime 0:00:38
supervisor>

If we change the processManager parameter to sysvinit, we can deploy a multi-process image in which the foreground proxy script is used as a root process (that starts and stops sysvinit scripts).

We can control the life-cycle of each individual process by directly invoking the sysvinit scripts in the container:


$ docker exec -it mycontainer /bin/bash
$ /etc/rc.d/init.d/webapp status
webapp is running with Process ID(s) 33.

$ /etc/rc.d/init.d/nginx status
nginx is running with Process ID(s) 51.

Although having extra command-line utilities to do administration tasks is useful, a disadvantage is that they considerably increase the size of the image.

To save storage costs, it is also possible to disable interactive mode to exclude these packages:


let
  pkgs = import <nixpkgs> {};

  createMultiProcessImage = import ../../nixproc/create-multi-process-image/create-multi-process-image.nix {
    inherit pkgs system;
    inherit (pkgs) dockerTools stdenv;
  };
in
createMultiProcessImage {
  name = "multiprocess";
  tag = "test";
  exprFile = ./processes.nix;
  stateDir = "/var";
  processManager = "supervisord";
  interactive = false; # Do not install any additional shell utilities
}

Discussion


In this blog post, I have described a new utility function in the Nix process management framework: createMultiProcessImage -- a thin wrapper around the dockerTools.buildImage function that can be used to conveniently build multi-process Docker images, using any Docker-capable process manager that the Nix process management framework supports.

Besides the fact that we can conveniently construct multi-process images, this function also has the advantage (similar to the dockerTools.buildImage function) that Nix is only required for the construction of the image. To deploy containers from a multi-process image, Nix is not a requirement.

There is also a drawback: similar to "ordinary" multi-process container deployments, when it is desired to upgrade a process, the entire container needs to be redeployed, also requiring a user to terminate all other running processes.

Availability


The createMultiProcessImage function is part of the current development version of the Nix process management framework that can be obtained from my GitHub page.

by Sander van der Burg (noreply@blogger.com) at October 31, 2020 03:05 PM

October 22, 2020

Tweag I/O

Nickel: better configuration for less

We are making the Nickel repository public. Nickel is an experimental configuration language developed at Tweag. While this is not the time for the first release yet, it is an occasion to talk about this project. The goal of this post is to give a high-level overview of the project. If your curiosity is tickled but you are left wanting to learn more, fear not, as we will publish more blog posts on specific aspects of the language in the future. But for now, let’s have a tour!

[Disclaimer: the actual syntax of Nickel still being worked on, I'm freely using as-of-yet non-existing syntax for illustrative purposes. The underlying features are however already supported.]

The inception

We, at Tweag, are avid users of the Nix package manager. As it happens, the configuration language for Nix (also called Nix) is a pretty good configuration language, and would be applicable to many more things than just package management.

All in all, the Nix language is a lazy JSON with functions. It is simple yet powerful. It is used to generate Nix’s package descriptions but would be well suited to write any kind of configuration (Terraform, Kubernetes, etc…).

The rub is that the interpreter for Nix-the-language is tightly coupled with Nix-the-package manager. So, as it stands, using the Nix language for anything else than package management is a rather painful exercise.

Nickel is our attempt at answering the question: what would Nix-the-language look like if it was split from the package manager? While taking the opportunity to improve the language a little, building on the experience of the Nix community over the years.

What’s Nickel, exactly ?

Nickel is a lightweight generic configuration language. As such, it can replace YAML as your application's configuration language. Unlike YAML, though, it anticipates large configurations by being programmable. Another way to use Nickel is to generate static configuration files — e.g. in JSON, YAML — that are then fed to another system. Like Nix, it is designed to have a simple, well-understood core: at its heart, it is JSON with functions.

But past experience with Nix also brings some insights on which aspects of the language could be improved. Whatever the initial scope of a language is, it will almost surely be used in a way that deviates from the original plan: you create a configuration language to describe software packages, and next thing you know, somebody needs to implement a topological sort.

Nickel strives to retain the simplicity of Nix, while extending it according to this feedback. Though, you can do perfectly fine without the new features and just write Nix-like code.

Yet another configuration language

At this point you’re probably wondering if this hasn’t already been done elsewhere. It seems that more and more languages are born every day, and surely there already exist configuration languages with a similar purpose to Nickel: Starlark, Jsonnet, Dhall or CUE, to name a few. So why Nickel?

Typing

Perhaps the most important difference with other configuration languages is Nickel’s approach to typing.

Some languages, such as Jsonnet or Starlark, are not statically typed. Indeed, static types can be seen as superfluous in a configuration language: if your program is only run once on fixed inputs, any type error will be reported at run-time anyway. Why bother with a static type system?

On the other hand, more and more systems rely on complex configurations, such as cloud infrastructure (Terraform, Kubernetes or NixOps), leading the corresponding programs to become increasingly complex, to the point where static types are beneficial. For reusable code — that is, library functions — static types add structure, serve as documentation, and eliminate bugs early.

Although less common, some configuration languages are statically typed, including Dhall and CUE.

Dhall features a powerful type system that is able to type a wide range of idioms. But it is complex, requiring some experience to become fluent in.

CUE is closer to what we are striving for. It has an optional and well-behaved type system with strong guarantees. In exchange, one cannot in general write or type higher-order functions, even if some simple functions can be encoded.

Gradual typing

Nickel features a gradual type system. Gradual types are unobtrusive: they make it possible to statically type the reusable parts of your programs, but you are still free to write configurations without any types. The interpreter safely handles the interaction between the typed and untyped worlds.

Concretely, typed library code like this:

// file: mylib.ncl
{
  numToStr : Num -> Str = fun n => ...;
  makeURL : Str -> Str -> Num -> Str = fun proto host port =>
    "${proto}://${host}:${numToStr port}/";
}

can coexist with untyped configuration code like this:

// file: server.ncl
let mylib = import "mylib.ncl" in
let host = "myproject.com" in
{
  host = host;
  port = 1;
  urls = [
    mylib.makeURL "myproto" host port,
    {protocol = "proto2"; server = "sndserver.net"; port = 4242}
  ];
}

In the first snippet, the bodies of numToStr and makeURL are statically checked: wrongly calling numToStr proto inside makeURL would raise an error, even if makeURL is never used. On the other hand, the second snippet is not annotated, and thus not statically checked. In particular, we mix a URL represented as a string together with one represented as a record in the same list. Instead of rejecting this, the interpreter inserts run-time checks, or contracts, so that if makeURL is misused, the program fails with an appropriate error.

Gradual types also let us keep the type system simple: even in statically typed code, if you want to write a component that the type checker doesn't know how to verify, you don't have to type-check that part.

Contracts

Complementary to the static type system, Nickel offers contracts. Contracts offer precise and accurate dynamic type error reporting, even in the presence of function types. Contracts are used internally by Nickel’s interpreter to insert guards at the boundary between typed and untyped chunks. Contracts are available to the programmer as well, to give them the ability to enforce type assertions at run-time in a simple way.

One pleasant consequence of this design is that the exposure of the user to the type system can be progressive:

  • Users writing configurations can just write Nix-like code while ignoring (almost) everything about typing, since typed functions can be called seamlessly from untyped code.
  • Users writing consumers or verifiers of these configurations would use contracts to model data schemas.
  • Users writing libraries would instead use the static type system.

An example of a contract is given in the next section.

Schemas

While the basic computational blocks are functions, the basic data blocks in Nickel are records (or objects in JSON). Nickel supports writing self-documenting record schemas, such as:

{
  host | type: Str
       | description: "The host name of the server."
       | default: "fallback.myserver.net"
  ;

  port | type: Num
       | description: "The port of the connection."
       | default: 4242
  ;

  url | type: Url
      | description: "The URL of the server."
  ;
}

Each field can contain metadata, such as a description or a default value. These are meant to be displayed in documentation, or queried by tools.

The schema can then be used as a contract. Imagine that a function has swapped two values in its output and returns:

{
  host = "myproject.com",
  port = "myproto://myproject.com:1/",
  url = 1
}

Without types, this is hard to catch. Surely, an error will eventually pop up downstream in the pipeline, but how and when? Using the schema above will make sure that, whenever the fields are actually evaluated, the function will be blamed in the type error.

Schemas are actually part of a bigger story involving merging records together, which, in particular, lets the schema instantiate missing fields with their default values. It is very much inspired by the NixOS module system and the CUE language, but that is a story for another time.

Conclusion

I hope that I gave you a sense of what Nickel is trying to achieve. I only presented its most salient aspects: its gradual type system with contracts, and built-in record schemas. But there is more to explore! The language is not ready to be used in real-world applications yet, but a good share of the design presented here is implemented. If you are curious about it, check it out!

October 22, 2020 12:00 AM

October 08, 2020

Sander van der Burg

Transforming Disnix models to graphs and visualizing them

In my previous blog post, I have described a new tool in the Dynamic Disnix toolset that can be used to automatically assign unique numeric IDs to services in a Disnix service model. Unique numeric IDs can represent all kinds of useful resources, such as TCP/UDP port numbers, user IDs (UIDs), and group IDs (GIDs).

Although I am quite happy to have this tool at my disposal, implementing it was much more difficult and time consuming than I expected. Aside from the fact that the problem is not as obvious as it may sound, the main reason is that the Dynamic Disnix toolset was originally developed as a proof-of-concept implementation for a research paper under very high time pressure. As a result, it has accumulated quite a bit of technical debt that, as of today, is still fairly high (but much better than it was when I completed the PoC).

For the ID assigner tool, I needed to make changes to the foundations of the tools, such as the model parsing libraries. As a consequence, all kinds of related aspects in the toolset started to break, such as the deployment planning algorithm implementations.

Fixing some of these algorithm implementations was much more difficult than I expected -- they were not properly documented, not decomposed into functions, had little to no reuse of common concepts and as a result, were difficult to understand and change. I was forced to re-read the papers that I used as a basis for these algorithms.

To prevent myself from having to go through such a painful process again, I have decided to revise them in such a way that they are better understandable and maintainable.

Dynamically distributing services


The deployment models in the core Disnix toolset are static. For example, the distribution of services to machines in the network is done in a distribution model in which the user has to manually map services in the services model to target machines in the infrastructure model (and optionally to container services hosted on the target machines).
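
For illustration, such a static distribution model has the following shape (a minimal sketch that uses the example service and machine names appearing later in this post); each service is simply mapped to a list of targets from the infrastructure model:

{infrastructure}:

{
  testService1 = [ infrastructure.testtarget1 ];
  testService2 = [ infrastructure.testtarget1 ];
  testService3 = [ infrastructure.testtarget2 ];
}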

Each time a condition changes, e.g. the system needs to scale up or a machine crashes and the system needs to recover, a new distribution model must be configured and the system must be redeployed. For big complex systems that need to be reconfigured frequently, manually specifying new distribution models becomes very impractical.

As I have already explained in older blog posts, to cope with the limitations of static deployment models (and other static configuration aspects), I have developed Dynamic Disnix, in which various configuration aspects can be automated, including the distribution of services to machines.

A strategy for dynamically distributing services to machines can be specified in a QoS model, which typically consists of two phases:

  • First, a candidate target selection must be made, in which for each service the appropriate candidate target machines are selected.

    Not all machines are capable of hosting a certain service, for functional and non-functional reasons -- for example, an i686-linux machine is not capable of running a binary compiled for an x86_64-linux machine.

    A machine can also be exposed to the public internet, and as a result, may not be suitable to host a service that exposes privacy-sensitive information.
  • After the suitable candidate target machines are known for each service, we must decide to which candidate machine each service gets distributed.

    This can be done in many ways. The strategy that we want to use is typically based on all kinds of non-functional requirements.

    For example, we can optimize a system's reliability by minimizing the amount of network links between services, requiring a strategy in which services that depend on each other are mapped to the same machine, as much as possible.

Graph-based optimization problems


In the Dynamic Disnix toolset, I have implemented various kinds of distribution algorithms/strategies for all kinds of purposes.

I did not "invent" most of them -- for some, I got inspiration from papers in the academic literature.

Two of the more advanced deployment planning algorithms are graph-based, to accomplish the following goals:

  • Reliable deployment. Network links are a potential source of unreliability in a distributed system -- connections may fail, become slow, or be interrupted frequently. By minimizing the amount of network links between services (by co-locating them on the same machine), their impact can be reduced. To keep deployments from becoming too expensive, this should be done with a minimal number of machines.

    As described in the paper: "Reliable Deployment of Component-based Applications into Distributed Environments" by A. Heydarnoori and F. Mavaddat, this problem can be transformed into a graph problem: the multiway cut problem (which is NP-hard).

    Unless a proof that P = NP exists, it can only be solved in polynomial time with an approximation algorithm that comes close to the optimal solution.
  • Fragile deployment. Inspired by the above deployment problem, I also came up with the opposite problem (as my own "invention") -- how can we make every connection between services a true network link (not a local one), so that we can test a system for robustness, using a minimal number of machines?

    This problem can be modeled as a graph coloring problem (which is NP-hard as well). I used one of the approximation algorithms described in the paper: "New Methods to Color the Vertices of a Graph" by D. Brélaz to implement a solution.

To work with these graph-based algorithms, I originally did not apply any transformations -- because of time pressure, I directly worked with objects from the Disnix models (e.g. services, target machines) and somewhat "glued" these together with generic data structures, such as lists and hash tables.

As a result, when looking at the implementation, it is very hard to get an understanding of the process and how an implementation aspect relates to a concept described in the papers shown above.

In my revised version, I have implemented a general purpose graph library that can be used to solve all kinds of general graph related problems.

Aside from using a general graph library, I have also separated the graph-based generation processes into the following steps:

  • After opening the Disnix input models (such as the services, infrastructure, and distribution models) I transform the models to a graph representing an instance of the problem domain.
  • After the graph has been generated, I apply the approximation algorithm to the graph data structure.
  • Finally, I transform the resolved graph back to a distribution model that should provide our desired distribution outcome.

This new organization provides better separation of concerns, common concepts can be reused (such as graph operations), and as a result, the implementations are much closer to the approximation algorithms described in the papers.

Visualizing the generation process


Another advantage of having a reusable graph implementation is that we can easily extend it to visualize the problem graphs.

When I combine these features together with my earlier work that visualizes services models, and a new tool that visualizes infrastructure models, I can make the entire generation process transparent.

For example, the following services model:


{system, pkgs, distribution, invDistribution}:

let
  customPkgs = import ./pkgs { inherit pkgs system; };
in
rec {
  testService1 = {
    name = "testService1";
    pkg = customPkgs.testService1;
    type = "echo";
  };

  testService2 = {
    name = "testService2";
    pkg = customPkgs.testService2;
    dependsOn = {
      inherit testService1;
    };
    type = "echo";
  };

  testService3 = {
    name = "testService3";
    pkg = customPkgs.testService3;
    dependsOn = {
      inherit testService1 testService2;
    };
    type = "echo";
  };
}

can be visualized as follows:


$ dydisnix-visualize-services -s services.nix


The above services model and corresponding visualization capture the following properties:

  • They describe three services (as denoted by ovals).
  • The arrows denote inter-dependency relationships (the dependsOn attribute in the services model).

    When a service has an inter-dependency on another service, the latter has to be activated first, and the dependent service needs to know how to reach it.

    testService2 depends on testService1 and testService3 depends on both the other two services.

We can also visualize the following infrastructure model:


{
  testtarget1 = {
    properties = {
      hostname = "testtarget1";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
      echo = {};
    };
  };

  testtarget2 = {
    properties = {
      hostname = "testtarget2";
    };
    containers = {
      mysql-database = {
        mysqlPort = 3306;
      };
    };
  };

  testtarget3 = {
    properties = {
      hostname = "testtarget3";
    };
  };
}

with the following command:


$ dydisnix-visualize-infra -i infrastructure.nix

resulting in the following visualization:


The above infrastructure model declares three machines. Each target machine provides a number of container services (such as a MySQL database server, and echo that acts as a testing container).

With the following command, we can generate a problem instance for the graph coloring problem using the above services and infrastructure models as inputs:


$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
--output-graph

resulting in the following graph:


The graph shown above captures the following properties:

  • Each service translates to a node.
  • When an inter-dependency relationship exists between services, it gets translated to a (bi-directional) link representing a network connection (the rationale is that a service that has an inter-dependency on another service interacts with it by using a network connection).

Each target machine translates to a color, which we can represent with a numeric index -- 0 is testtarget1, 1 is testtarget2 and so on.

The following command generates the resolved problem instance graph in which each vertex has a color assigned:


$ dydisnix-graphcol -s services.nix -i infrastructure.nix \
--output-resolved-graph

resulting in the following visualization:


(As a side note: in the above graph, colors are represented by numbers. In theory, I could also use real colors, but if I also want the graph to remain visually appealing, I need to solve a color picking problem, which is beyond the scope of my refactoring objective).

The resolved graph can be translated back into the following distribution model:


$ dydisnix-graphcol -s services.nix -i infrastructure.nix
{
  "testService2" = [
    "testtarget2"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget3"
  ];
}

As you may notice, every service is distributed to a separate machine, so that every network link between services is a real network connection between machines.

We can also visualize the problem instance of the multiway cut problem. For this, we also need a distribution model that declares, for each service, which target machines are candidates.

The following distribution model makes all three target machines in the infrastructure model a candidate for every service:


{infrastructure}:

{
  testService1 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService2 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
  testService3 = [ infrastructure.testtarget1 infrastructure.testtarget2 infrastructure.testtarget3 ];
}

With the following command we can generate a problem instance representing a host-application graph:


$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
-d distribution.nix --output-graph

providing me the following output:


The above problem graph has the following properties:

  • Each service translates to an app node (prefixed with app:) and each candidate target machine to a host node (prefixed with host:).
  • When a network connection between two services exists (implicitly derived from having an inter-dependency relationship), an edge is generated with a weight of 1.
  • When a target machine is a candidate target for a service, then an edge is generated with a weight of n², representing a very large number.

The objective of solving the multiway cut problem is to cut edges in the graph in such a way that each terminal (host node) is disconnected from the other terminals (host nodes), in which the total weight of the cuts is minimized.

When applying the approximation algorithm in the paper to the above graph:


$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
-d distribution.nix --output-resolved-graph

we get the following resolved graph:


that can be transformed back into the following distribution model:


$ dydisnix-multiwaycut -s services.nix -i infrastructure.nix \
-d distribution.nix
{
  "testService2" = [
    "testtarget1"
  ];
  "testService1" = [
    "testtarget1"
  ];
  "testService3" = [
    "testtarget1"
  ];
}

As you may notice by looking at the resolved graph (in which the terminals testtarget2 and testtarget3 are disconnected) and the distribution model output, all services are distributed to the same machine, testtarget1, making all connections between the services local connections.

In this particular case, the solution is not only close to the optimal solution, but it is the optimal solution.

Conclusion


In this blog post, I have described how I have revised the deployment planning algorithm implementations in the Dynamic Disnix toolset. Their concerns are now much better separated, and the graph-based algorithms now use a general purpose graph library, that can also be used for generating visualizations of the intermediate steps in the generation process.

This revision was not on my short-term planned features list, but I am happy that I did the work. Retrospectively, I regret that I never took the time to finish things up properly after the submission of the paper. Although Dynamic Disnix's quality is still not where I want it to be, it is quite a step forward in making the toolset more usable.

Sadly, it is almost 10 years ago that I started Dynamic Disnix and there is still no official release. The technical debt in Dynamic Disnix is one of the important reasons why I never made one. Hopefully, with this step, I can do it some day. :-)

The good news is that I made the paper submission deadline and that the paper got accepted for presentation. It brought me to the SEAMS 2011 conference (co-located with ICSE 2011) in Honolulu, Hawaii, allowing me to take pictures such as this one:


Availability


The graph library and new implementations of the deployment planning algorithms described in this blog post are part of the current development version of Dynamic Disnix.

The paper: "A Self-Adaptive Deployment Framework for Service-Oriented Systems" describes the Dynamic Disnix framework (developed 9 years ago) and can be obtained from my publications page.

Acknowledgements


To generate the visualizations I used the Graphviz toolset.

by Sander van der Burg (noreply@blogger.com) at October 08, 2020 09:29 PM

October 01, 2020

Cachix

Changes to Garbage Collection

Based on your feedback, I have made the following two changes:

  • When downloading <store-hash>.narinfo, the timestamp of last access is now updated; previously this would happen only with NAR archives. This change allows tools like nix-build-uncached to prevent unneeded downloads while playing nicely with the Cachix garbage collection algorithm!
  • The ordering used by the garbage collection algorithm has changed. Previously, the algorithm ordered paths first by last-accessed timestamp and then by creation timestamp. That worked well until all entries had a last-accessed timestamp, at which point newly created store paths would get deleted first.

by Domen Kožar (support@cachix.org) at October 01, 2020 09:00 AM

September 30, 2020

Tweag I/O

Fully statically linked Haskell binaries with Bazel

Deploying and packaging Haskell applications can be challenging at times, and runtime library dependencies are one reason for this. Statically linked binaries have no such dependencies and are therefore easier to deploy. They can also be quicker to start, since no dynamic loading is needed. In exchange, all used symbols must be bundled into the application, which may lead to larger artifacts.

Thanks to the contribution of Will Jones of Habito¹, rules_haskell, the Haskell Bazel extension, has gained support for fully static linking of Haskell binaries.

Habito uses Bazel to develop, build, test and deploy Haskell code in a minimal Docker container. By building fully-statically-linked binaries, Docker packaging (using rules_docker) becomes straightforward and easy to integrate into existing build workflows. A static binary can also be stripped once it is built to reduce the size of production artifacts. With static binaries, what you see (just the binary) is what you get, and this is powerful.

In the following, we will discuss the technical challenges of statically linking Haskell binaries and how these challenges are addressed in rules_haskell. Spoiler alert: Nix is an important part of the solution. Finally, we will show you how you can create your own fully statically linked Haskell binaries with Bazel and Nix.

Technical challenges

Creating fully statically linked Haskell binaries is not without challenges. The main difficulties for doing so are:

  • Not all library dependencies are suited for statically linked binaries.
  • Compiling template Haskell requires dynamic libraries on Linux by default.

Library dependencies

Like most binaries on Linux, the Haskell compiler GHC is typically configured to link against the GNU C library glibc. However, glibc is not designed to support fully static linking and explicitly depends on dynamic linking in some use cases. The alternative C library musl is designed to support fully static linking.
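
As a quick illustration of the Nix side of this (an aside, not part of the rules_haskell setup described below): recent Nixpkgs revisions expose a variant of the package set built against musl, which can be used to experiment with musl-linked binaries.

# A minimal sketch, assuming a Linux machine with Nix installed.
# pkgsMusl is the ordinary Nixpkgs package set rebuilt against musl.
let
  pkgs = import <nixpkgs> { };
in
pkgs.pkgsMusl.hello  # GNU hello linked against musl instead of glibc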

Relatedly, there may be licensing reasons to not link some libraries statically. Common instances in the Haskell ecosystem are again glibc which is licensed under GPL, and the core Haskell dependency libgmp which is licensed under LGPL. For the latter GHC can be configured to use the core package integer-simple instead of integer-gmp.

Fortunately, the Nix community has made great progress towards fully statically linked Haskell binaries and we can build on much of this work in rules_haskell. The rules_nixpkgs extension makes it possible to import Nix derivations into a Bazel project, and rules_haskell has first class support for Nix-provided GHC toolchains using rules_nixpkgs under the hood. In particular, it can import a GHC toolchain based on musl from static-haskell-nix.

Template Haskell

By default, GHC is configured to require dynamic libraries when compiling template Haskell. GHC's runtime system (RTS) can be built in various combinations of so-called ways. The relevant way in this context is called dynamic. On Linux, GHC itself is built with a dynamic RTS. However, statically linked code targets a non-dynamic RTS. This may sound familiar if you have ever tried to compile code using template Haskell in profiling mode. As the GHC user guide points out, when evaluating template Haskell splices, GHC will execute compiled expressions in its built-in bytecode interpreter, and this code has to be compatible with the RTS of GHC itself. In short, a GHC configured with a dynamic RTS will not be able to load static Haskell libraries to evaluate template Haskell splices.

One way to solve this issue is to compile all Haskell libraries twice, once with dynamic linking and once with static linking. C library dependencies will similarly need to be available in both static and dynamic forms. This is the approach taken by static-haskell-nix. However, in the context of Bazel we found it preferable to only compile Haskell libraries once in static form and also only have to provide C libraries in static form. To achieve this we need to build GHC with a static RTS and to make sure that Haskell code is compiled as position independent code so that it can be loaded into a running GHC for template Haskell splices. Thanks to Nix, it is easy to override the GHC derivation to include the necessary configuration.

Make your project fully statically linked

How can you benefit from this? In this section we will show how you can set up a Bazel Haskell project for fully static linking with Nix. For further details please refer to the corresponding documentation on haskell.build. A fully working example repository is available here. For a primer on setting up a Bazel Haskell project take a look at this tutorial.

First, you need to configure a Nixpkgs repository that defines a GHC toolchain for fully static linking based on musl. We start by pulling in a base Nixpkgs revision and the static-haskell-nix project. Create a default.nix with the following contents.

let
  baseNixpkgs = builtins.fetchTarball {
    name = "nixos-nixpkgs";
    url = "https://github.com/NixOS/nixpkgs/archive/dca182df882db483cea5bb0115fea82304157ba1.tar.gz";
    sha256 = "0193bpsg1ssr93ihndyv7shz6ivsm8cvaxxl72mc7vfb8d1bwx55";
  };

  staticHaskellNixpkgs = builtins.fetchTarball
    "https://github.com/nh2/static-haskell-nix/archive/dbce18f4808d27f6a51ce31585078b49c86bd2b5.tar.gz";
in

Then we import a Haskell package set based on musl from static-haskell-nix. The package set provides GHC and various Haskell packages. However, we will only use the GHC compiler and use Bazel to build other Haskell packages.

let
  staticHaskellPkgs = (
    import (staticHaskellNixpkgs + "/survey/default.nix") {}
  ).approachPkgs;
in

Next we define a Nixpkgs overlay that introduces a GHC based on musl that is configured to use a static runtime system and core packages built with position independent code so that they can be loaded for template Haskell.

let
  overlay = self: super: {
    staticHaskell = staticHaskellPkgs.extend (selfSH: superSH: {
      ghc = (superSH.ghc.override {
        enableRelocatedStaticLibs = true;
        enableShared = false;
      }).overrideAttrs (oldAttrs: {
        preConfigure = ''
          ${oldAttrs.preConfigure or ""}
          echo "GhcLibHcOpts += -fPIC -fexternal-dynamic-refs" >> mk/build.mk
          echo "GhcRtsHcOpts += -fPIC -fexternal-dynamic-refs" >> mk/build.mk
        '';
      });
    });
  };
in

Finally, we extend the base Nixpkgs revision with the overlay. This makes the newly configured GHC available under the Nix attribute path staticHaskell.ghc.

  args@{ overlays ? [], ... }:
    import baseNixpkgs (args // {
      overlays = [overlay] ++ overlays;
    })

This concludes the Nix part of the setup and we can move on to the Bazel part.

You can import this Nixpkgs repository into Bazel by adding the following lines to your WORKSPACE file.

load(
    "@io_tweag_rules_nixpkgs//nixpkgs:nixpkgs.bzl",
    "nixpkgs_local_repository",
)
nixpkgs_local_repository(
    name = "nixpkgs",
    nix_file = "default.nix",
)

Now you can define a GHC toolchain for rules_haskell that uses the Nix built GHC defined above. Note how we declare that this toolchain has a static RTS and is configured for fully static linking. Add the following lines to your WORKSPACE file.

load(
    "@rules_haskell//haskell:nixpkgs.bzl",
    "haskell_register_ghc_nixpkgs",
)
haskell_register_ghc_nixpkgs(
    version = "X.Y.Z",  # Make sure this matches the GHC version.
    attribute_path = "staticHaskell.ghc",
    repositories = {"nixpkgs": "@nixpkgs"},
    static_runtime = True,
    fully_static_link = True,
)

GHC relies on the C compiler and linker during compilation. rules_haskell will always use the C compiler and linker provided by the active Bazel C toolchain. We need to make sure that we use a musl-based C toolchain as well. Here we will use the same Nix-provided C toolchain that is used by static-haskell-nix to build GHC.

load(
    "@io_tweag_rules_nixpkgs//nixpkgs:nixpkgs.bzl",
    "nixpkgs_cc_configure",
)
nixpkgs_cc_configure(
    repository = "@nixpkgs",
    nix_file_content = """
      with import <nixpkgs> { config = {}; overlays = []; }; buildEnv {
        name = "bazel-cc-toolchain";
        paths = [ staticHaskell.stdenv.cc staticHaskell.binutils ];
      }
    """,
)

Finally, everything is configured for fully static linking. You can define a Bazel target for a fully statically linked Haskell binary as follows.

haskell_binary(
    name = "example",
    srcs = ["Main.hs"],
    features = ["fully_static_link"],
)

You can build your binary and confirm that it is fully statically linked as follows.

$ bazel build //:example
$ ldd bazel-bin/example
      not a dynamic executable

Conclusion

If you’re interested in further exploring the benefits of fully statically linked binaries, you might combine them with rules_docker (e.g. through its container_image rule) to build Docker images as Habito have done. With a rich enough set of Bazel rules and dependency specifications, it’s possible to reduce your build and deployment workflow to a bazel test and bazel run!

The current implementation depends on a Nix-provided GHC toolchain capable of fully static linking that is imported into Bazel using rules_nixpkgs. However, there is no reason why it shouldn’t be possible to use a GHC distribution capable of fully static linking that was provided by other means, for example a Docker image such as ghc-musl. Get in touch if you would like to create fully statically linked Haskell binaries with Bazel but can’t or don’t want to integrate Nix into your build. Contributions are welcome!

We thank Habito for their contributions to rules_haskell.


  1. Habito is fixing mortgages and making homebuying fit for the future. Habito gives people tools, jargon-free knowledge and expert support to help them buy and finance their homes. Built on a rich foundation of functional programming and other cutting-edge technology, Habito is a long time user of and contributor to rules_haskell.

September 30, 2020 12:00 AM

September 24, 2020

Sander van der Burg

Assigning unique IDs to services in Disnix deployment models

As described in some of my recent blog posts, one of the more advanced features of Disnix as well as the experimental Nix process management framework is to deploy multiple instances of the same service to the same machine.

To make running multiple service instances on the same machine possible, these tools rely on conflict avoidance rather than isolation (which is typically used for containers). To allow multiple service instances to co-exist on the same machine, they need to be configured in such a way that they do not allocate any conflicting resources.

Although for small systems it is doable to configure multiple instances by hand, this process gets tedious and time consuming for larger and more technologically diverse systems.

One particular kind of conflicting resource that could be configured automatically is numeric IDs, such as TCP/UDP port numbers, user IDs (UIDs), and group IDs (GIDs).

In this blog post, I will describe how multiple service instances are configured (in Disnix and the process management framework) and how we can automatically assign unique numeric IDs to them.

Configuring multiple service instances


To facilitate conflict avoidance in Disnix and the Nix process management framework, services are configured as follows:


{createManagedProcess, tmpDir}:
{port, instanceSuffix ? "", instanceName ? "webapp${instanceSuffix}"}:

let
  webapp = import ../../webapp;
in
createManagedProcess {
  name = instanceName;
  description = "Simple web application";
  inherit instanceName;

  # This expression can both run in foreground or daemon mode.
  # The process manager can pick which mode it prefers.
  process = "${webapp}/bin/webapp";
  daemonArgs = [ "-D" ];

  environment = {
    PORT = port;
    PID_FILE = "${tmpDir}/${instanceName}.pid";
  };
  user = instanceName;
  credentials = {
    groups = {
      "${instanceName}" = {};
    };
    users = {
      "${instanceName}" = {
        group = instanceName;
        description = "Webapp";
      };
    };
  };

  overrides = {
    sysvinit = {
      runlevels = [ 3 4 5 ];
    };
  };
}

The Nix expression shown above is a nested function that describes how to deploy a simple self-contained REST web application with an embedded HTTP server:

  • The outer function header (first line) specifies all common build-time dependencies and configuration properties that the service needs:

    • createManagedProcess is a function that can be used to define process manager agnostic configurations that can be translated to configuration files for a variety of process managers (e.g. systemd, launchd, supervisord etc.).
    • tmpDir refers to the temp directory in which temp files are stored.
  • The inner function header (second line) specifies all instance parameters -- these are the parameters that must be configured in such a way that conflicts with other process instances are avoided:

    • The instanceName parameter (that can be derived from the instanceSuffix) is a value used by some of the process management backends (e.g. the ones that invoke the daemon command) to derive a unique PID file for the process. When running multiple instances of the same process, each of them requires a unique PID file name.
    • The port parameter specifies which TCP port the service binds to. Binding the service to a port that is already taken by another service causes the deployment of this service to fail.
  • In the function body, we invoke the createManagedProcess function to construct configuration files for all supported process manager backends to run the webapp process:

    • As explained earlier, the instanceName is used to configure the daemon executable in such a way that it allocates a unique PID file.
    • The process parameter specifies which executable we need to run, both as a foreground process or daemon.
    • The daemonArgs parameter specifies which command-line parameters need to be propagated to the executable when the process should daemonize on its own.
    • The environment parameter specifies all environment variables. The webapp service uses these variables for runtime property configuration.
    • The user parameter is used to specify that the process should run as an unprivileged user. The credentials parameter is used to configure the creation of the user account and corresponding user group.
    • The overrides parameter is used to override the process manager-agnostic parameters with process manager-specific parameters. For the sysvinit backend, we configure the runlevels in which the service should run.

Although the convention shown above makes it possible to avoid conflicts (assuming that all potential conflicts have been identified and exposed as function parameters), these parameters are typically configured manually:


{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp1 = rec {
    name = "webapp1";
    port = 5000;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "1";
    };
    type = processType;
  };

  webapp2 = rec {
    name = "webapp2";
    port = 5001;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
  };
}

The above Nix expression is both a valid Disnix services model and a valid processes model. It composes two web application process instances that can run concurrently on the same machine by invoking the nested constructor function shown in the previous example:

  • Each webapp instance has its own unique instance name, by specifying a unique numeric instanceSuffix that gets appended to the service name.
  • Every webapp instance binds to a unique TCP port (5000 and 5001) that should not conflict with system services or other process instances.

Previous work: assigning port numbers


Although configuring two process instances is still manageable, the configuration process becomes more tedious and time consuming when the amount and the kind of processes (each having their own potential conflicts) grow.

Five years ago, I already identified a resource that could be automatically assigned to services: port numbers.

I have created a very simple port assigner tool that allows you to specify a global ports pool and a target-specific ports pool. The former is used to assign globally unique port numbers to all services in the network, whereas the latter assigns port numbers that are unique to the target machine where the service is deployed to (this is to cope with the scarcity of port numbers).

Although the tool is quite useful for systems that do not consist of too many different kinds of components, I ran into a number of limitations when I wanted to manage a more diverse set of services:

  • Port numbers are not the only numeric IDs that services may require. When deploying systems that consist of self-contained executables, you typically want to run them as unprivileged users for security reasons. User accounts on most UNIX-like systems require unique user IDs, and the corresponding users' groups require unique group IDs.
  • We typically want to manage multiple resource pools, for a variety of reasons. For example, when we have a number of HTTP server instances and a number of database instances, then we may want to pick port numbers in the 8000-9000 range for the HTTP servers, whereas for the database servers we want to use a different pool, such as 5000-6000.

Assigning unique numeric IDs


To address these shortcomings, I have developed a replacement tool that acts as a generic numeric ID assigner.

This new ID assigner tool works with ID resource configuration files, such as:


rec {
  ports = {
    min = 5000;
    max = 6000;
    scope = "global";
  };

  uids = {
    min = 2000;
    max = 3000;
    scope = "global";
  };

  gids = uids;
}

The above ID resource configuration file (idresources.nix) defines three resource pools: ports is a resource that represents port numbers to be assigned to the webapp processes, uids refers to user IDs and gids to group IDs. The group IDs' resource configuration is identical to the users' IDs configuration.

Each resource attribute supports the following configuration properties:

  • The min value specifies the minimum ID to hand out, max the maximum ID.
  • The scope value specifies the scope of the resource pool. global (which is the default option) means that the IDs assigned from this resource pool to services are globally unique for the entire system.

    The machine scope can be used to assign IDs that are unique for the machine where a service is distributed to. When the latter option is used, services that are distributed to two separate machines may have the same ID (see the sketch below).
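
As a small sketch, a machine-scoped ports pool would look as follows in the configuration format shown above (only the scope value differs from the global ports pool):

{
  ports = {
    min = 5000;
    max = 6000;
    scope = "machine"; # the same port number may be reused on different target machines
  };
}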

We can adjust the services/processes model in such a way that every service will use dynamically assigned IDs and that each service specifies for which resources it requires a unique ID:


{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager ids;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp1 = rec {
    name = "webapp1";
    port = ids.ports.webapp1 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "1";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };

  webapp2 = rec {
    name = "webapp2";
    port = ids.ports.webapp2 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };
}

In the above services/processes model, we have made the following changes:

  • In the beginning of the expression, we import the dynamically generated ids.nix expression that provides ID assignments for each resource. If the ids.nix file does not exist, we generate an empty attribute set. We implement this construction (in which the absence of ids.nix can be tolerated) to allow the ID assigner to bootstrap the ID assignment process.
  • Every hardcoded port attribute of every service is replaced by a reference to the ids attribute set that is dynamically generated by the ID assigner tool. To allow the ID assigner to open the services model in the first run, we provide a fallback port value of 0.
  • Every service specifies for which resources it requires a unique ID through the requiresUniqueIdsFor attribute. In the above example, both service instances require unique IDs to assign a port number, user ID to the user and group ID to the group.

The port assignments are propagated as function parameters to the constructor functions that configure the services (as shown earlier in this blog post).

We could also implement a similar strategy with the UIDs and GIDs, but a more convenient mechanism is to compose the function that creates the credentials, so that it transparently uses our uids and gids assignments.

As shown in the expression above, the ids attribute set is also propagated to the constructors expression. The constructors expression indirectly composes the createCredentials function as follows:


{pkgs, ids ? {}, ...}:

{
  createCredentials = import ../../create-credentials {
    inherit (pkgs) stdenv;
    inherit ids;
  };

  ...
}

The ids attribute set is propagated to the function that composes the createCredentials function. As a result, it will automatically assign the UIDs and GIDs in the ids.nix expression when the user configures a user or group with a name that exists in the uids and gids resource pools.

To make these UIDs and GIDs assignments go smoothly, it is recommended to give a process instance the same process name, instance name, user and group names.

Using the ID assigner tool


By combining the ID resources specification with the three Disnix models: a services model (that defines all distributable services, shown above), an infrastructure model (that captures all available target machines and their properties) and a distribution model (that maps services to target machines in the network), we can automatically generate an ids configuration that contains all ID assignments:


$ dydisnix-id-assign -s services.nix -i infrastructure.nix \
-d distribution.nix \
--id-resources idresources.nix --output-file ids.nix

The above command will generate an ids configuration file (ids.nix) that provides, for each resource in the ID resources model, a unique assignment to services that are distributed to a target machine in the network. (Services that are not distributed to any machine in the distribution model will be skipped, to not waste too many resources).

The output file (ids.nix) has the following structure:


{
  "ids" = {
    "gids" = {
      "webapp1" = 2000;
      "webapp2" = 2001;
    };
    "uids" = {
      "webapp1" = 2000;
      "webapp2" = 2001;
    };
    "ports" = {
      "webapp1" = 5000;
      "webapp2" = 5001;
    };
  };
  "lastAssignments" = {
    "gids" = 2001;
    "uids" = 2001;
    "ports" = 5001;
  };
}

  • The ids attribute contains for each resource (defined in the ID resources model) the unique ID assignments per service. As shown earlier, both service instances require unique IDs for ports, uids and gids. The above attribute set stores the corresponding ID assignments.
  • The lastAssignments attribute memorizes the last ID assignment per resource. Once an ID is assigned, it will not be immediately reused. This is to allow rollbacks and to prevent data from incorrectly getting owned by the wrong user accounts. Once the maximum ID limit is reached, the ID assigner will start searching for a free assignment from the beginning of the resource pool.

In addition to assigning IDs to services that are distributed to machines in the network, it is also possible to assign IDs to all services (regardless of whether they have been deployed or not):


$ dydisnix-id-assign -s services.nix \
--id-resources idresources.nix --output-file ids.nix

Since the above command does not know anything about the target machines, it only works with an ID resources configuration that defines global scope resources.

When you intend to upgrade an existing deployment, you typically want to retain already assigned IDs, while obsolete ID assignments should be removed, and new IDs should be assigned to services that do not have any yet. This is to prevent unnecessary redeployments.

When removing the first webapp service and adding a third instance:


{ pkgs, system
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager ? "sysvinit"
, ...
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir forceDisableUserChange processManager ids;
  };

  processType = import ../../nixproc/derive-dysnomia-process-type.nix {
    inherit processManager;
  };
in
rec {
  webapp2 = rec {
    name = "webapp2";
    port = ids.ports.webapp2 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "2";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };

  webapp3 = rec {
    name = "webapp3";
    port = ids.ports.webapp3 or 0;
    dnsName = "webapp.local";
    pkg = constructors.webapp {
      inherit port;
      instanceSuffix = "3";
    };
    type = processType;
    requiresUniqueIdsFor = [ "ports" "uids" "gids" ];
  };
}

And running the following command (that provides the current ids.nix as a parameter):


$ dydisnix-id-assign -s services.nix -i infrastructure.nix -d distribution.nix \
--id-resources idresources.nix --ids ids.nix --output-file ids.nix

we will get the following ID assignment configuration:


{
  "ids" = {
    "gids" = {
      "webapp2" = 2001;
      "webapp3" = 2002;
    };
    "uids" = {
      "webapp2" = 2001;
      "webapp3" = 2002;
    };
    "ports" = {
      "webapp2" = 5001;
      "webapp3" = 5002;
    };
  };
  "lastAssignments" = {
    "gids" = 2002;
    "uids" = 2002;
    "ports" = 5002;
  };
}

As may be observed, since the webapp2 process is in both the current and the previous configuration, its ID assignments will be retained. webapp1 gets removed because it is no longer in the services model. webapp3 gets the next numeric IDs from the resource pools.

Because the configuration of webapp2 stays the same, it does not need to be redeployed.

The models shown earlier are valid Disnix services models. As a consequence, they can be used with Dynamic Disnix's ID assigner tool: dydisnix-id-assign.

Although these Disnix services models are also valid processes models (used by the Nix process management framework), not every processes model is guaranteed to be compatible with a Disnix services model.

For process models that are not compatible, it is possible to use the nixproc-id-assign tool that acts as a wrapper around the dydisnix-id-assign tool:


$ nixproc-id-assign --id-resources idresources.nix processes.nix

Internally, the nixproc-id-assign tool converts a processes model to a Disnix service model (augmenting the process instance objects with missing properties) and propagates it to the dydisnix-id-assign tool.

A more advanced example


The webapp processes example is fairly trivial and only needs unique IDs for three kinds of resources: port numbers, UIDs, and GIDs.

I have also developed a more complex example for the Nix process management framework that exposes several commonly used system services on Linux systems, such as the Apache HTTP server, PostgreSQL and InfluxDB:


{ pkgs ? import <nixpkgs> { inherit system; }
, system ? builtins.currentSystem
, stateDir ? "/var"
, runtimeDir ? "${stateDir}/run"
, logDir ? "${stateDir}/log"
, cacheDir ? "${stateDir}/cache"
, tmpDir ? (if stateDir == "/var" then "/tmp" else "${stateDir}/tmp")
, forceDisableUserChange ? false
, processManager
}:

let
  ids = if builtins.pathExists ./ids.nix then (import ./ids.nix).ids else {};

  constructors = import ./constructors.nix {
    inherit pkgs stateDir runtimeDir logDir tmpDir cacheDir forceDisableUserChange processManager ids;
  };
in
rec {
  apache = rec {
    port = ids.httpPorts.apache or 0;

    pkg = constructors.simpleWebappApache {
      inherit port;
      serverAdmin = "root@localhost";
    };

    requiresUniqueIdsFor = [ "httpPorts" "uids" "gids" ];
  };

  postgresql = rec {
    port = ids.postgresqlPorts.postgresql or 0;

    pkg = constructors.postgresql {
      inherit port;
    };

    requiresUniqueIdsFor = [ "postgresqlPorts" "uids" "gids" ];
  };

  influxdb = rec {
    httpPort = ids.influxdbPorts.influxdb or 0;
    rpcPort = httpPort + 2;

    pkg = constructors.simpleInfluxdb {
      inherit httpPort rpcPort;
    };

    requiresUniqueIdsFor = [ "influxdbPorts" "uids" "gids" ];
  };
}

The above processes model exposes three service instances: an Apache HTTP server (that works with a simple configuration that serves web applications from a single virtual host), PostgreSQL and InfluxDB. Each service requires a unique user ID and group ID so that their privileges are separated.

To make these services more accessible/usable, we do not use a shared ports resource pool. Instead, each service type consumes port numbers from their own resource pools.

The following ID resources configuration can be used to provision the unique IDs to the services above:


rec {
  uids = {
    min = 2000;
    max = 3000;
  };

  gids = uids;

  httpPorts = {
    min = 8080;
    max = 8085;
  };

  postgresqlPorts = {
    min = 5432;
    max = 5532;
  };

  influxdbPorts = {
    min = 8086;
    max = 8096;
    step = 3;
  };
}


The above ID resources configuration defines a shared UIDs and GIDs resource pool, but separate ports resource pools for each service type. This has the following implications if we deploy multiple instances of each service type:

  • All Apache HTTP server instances get a TCP port assignment between 8080-8085.
  • All PostgreSQL server instances get a TCP port assignment between 5432-5532.
  • All InfluxDB server instances get a TCP port assignment between 8086-8096. An InfluxDB instance allocates two port numbers: one for the HTTP server and one for the RPC service (the latter's port number is the base port number + 2). We use a step count of 3 so that we can retain this convention for each InfluxDB instance.

Conclusion


In this blog post, I have described a new tool: dydisnix-id-assign that can be used to automatically assign unique numeric IDs to services in Disnix service models.

Moreover, I have described nixproc-id-assign, which acts as a thin wrapper around this tool to automatically assign numeric IDs to services in the Nix process management framework's processes model.

This tool replaces the old dydisnix-port-assign tool in the Dynamic Disnix toolset (described in the blog post written five years ago), which is much more limited in its capabilities.

Availability


The dydisnix-id-assign tool is available in the current development version of Dynamic Disnix. The nixproc-id-assign is part of the current implementation of the Nix process management framework prototype.

by Sander van der Burg (noreply@blogger.com) at September 24, 2020 06:24 PM

September 16, 2020

Tweag I/O

Implicit Dependencies in Build Systems

In making a build system for your software, you codified the dependencies between its parts. But, did you account for implicit software dependencies, like system libraries and compiler toolchains?

Implicit dependencies give rise to the biggest and most common problem with software builds: the lack of hermeticity. Without hermetic builds, reproducibility and cacheability are lost.

This post motivates the desire for reproducibility and cacheability, and explains how we achieve hermetic, reproducible, highly cacheable builds by taking control of implicit dependencies.

Reproducibility

Consider a developer newly approaching a code repository. After cloning the repo, the developer must install a long list of “build requirements” and plod through multiple steps of “setup”, only to find that, yes indeed, the build fails. Yet, it worked just fine for their colleague! The developer, typically not expert in build tooling, must debug the mysterious failure not of their making. This is bad for morale and for productivity.

This happens because the build is not reproducible.

One very common reason for the failure is that the compiler toolchain on the developer’s system is different from that of the colleague. This happens even with build systems that use sophisticated build software, like Bazel. Bazel implicitly uses whatever system libraries and compilers are currently installed in the developer’s environment.

A common workaround is to provide developers with a Docker image equipped with a certain compiler toolchain and system libraries, and then to mandate that the Bazel build occurs in that context.

That solution has a number of drawbacks. First, if the developer is using macOS, the virtualized build context runs substantially slower. Second, the Bazel build cache, developer secrets, and the source code remain outside of the image and this adds complexity to the Docker invocation. Third, the Docker image must be rebuilt and redistributed as dependencies change and that’s extra maintenance. Fourth, and this is the biggest issue, Docker image builds are themselves not reproducible - they nearly always rely on some external state that does not remain constant across build invocations, and that means the build can fail for reasons unrelated to the developer’s code.

A better solution is to use Nix to supply the compiler toolchain and system library dependencies. Nix is a software package management system somewhat like Debian’s APT or macOS’s Homebrew. Nix goes much farther to help developers control their environments. It is unsurpassed when it comes to reproducible builds of software packages.

Nix facilitates use of the Nixpkgs package set. That set is the largest single set of software packages. It is also the freshest package set. It provides build instructions that work both on Linux and macOS. Developers can easily pin any software package at an exact version.
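
For example, pinning the entire package set to an exact revision takes only a few lines of Nix. The following is a minimal sketch; the revision and hash are simply the ones reused from the rules_haskell example earlier on this page.

let
  pkgs = import (builtins.fetchTarball {
    # An exact Nixpkgs revision: every developer and CI machine gets the
    # same compiler toolchain and system libraries.
    url = "https://github.com/NixOS/nixpkgs/archive/dca182df882db483cea5bb0115fea82304157ba1.tar.gz";
    sha256 = "0193bpsg1ssr93ihndyv7shz6ivsm8cvaxxl72mc7vfb8d1bwx55";
  }) {};
in
pkgs.gcc  # a C compiler pinned at an exact version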

Learn more about using Nix with Bazel, here.

Cacheability

Not only should builds be reproducible, but they should also be fast. Fast builds are achieved by caching intermediate build results. Cache entries are keyed based on the precise dependencies as well as the build instructions that produce the entries. Builds will only benefit from a (shared, distributed) cache when they have matching dependencies. Otherwise, cache keys (which depend on the precise dependencies) will be different, and there will be cache misses. This means that the developer will have to rebuild targets locally. These unnecessary local rebuilds slow development.

The solution is to make the implicit dependencies into explicit ones, again using Nix, making sure to configure and use a shared Nix cache.

Learn more about configuring a shared Bazel cache, here.

Conclusion

It is important to eliminate implicit dependencies in your build system in order to retain build reproducibility and cacheability. Identify Nix packages that can replace the implicit dependencies of your Bazel build and use rules_nixpkgs to declare them as explicit dependencies. That will yield a fast, correct, hermetic build.

September 16, 2020 12:00 AM