
feature: NixOS Container rewrite
start-date: 2021-02-14
author: Maximilian Bosch <maximilian@mbosch.me>
co-authors: n/a
shepherd-team: @Mic92, @lheckemann, @SuperSandro2000, @arianvp, @Lassulus
shepherd-leader: @Lassulus
related-issues:
  • https://github.com/NixOS/nixpkgs/issues/69414
  • https://github.com/NixOS/nixpkgs/issues/67265
  • https://github.com/NixOS/nixpkgs/pull/67232
  • https://github.com/NixOS/nixpkgs/pull/67336
  • POC: https://github.com/NixOS/nixpkgs/pull/140669

Summary

This document proposes a full replacement of the nixos-container subsystem of NixOS with a new implementation based on systemd-nspawn(5), using systemd-networkd(8) for the networking stack rather than imperative networking, while providing a reasonable upgrade path for existing installations.

Motivation

The nixos-container feature originally appeared in nixpkgs in 2013, at a time when systemd support was relatively new to NixOS.

Back then, systemd-nspawn was only designed as a development tool for systemd developers and NixOS didn't support networkd. Due to those circumstances the entire feature was implemented in a fairly ad-hoc way. One of the most notable issues is the broken uplink during boot of a container:

  • Containers are started via the template unit container@.service. This service configures the network interfaces after the container has started.

  • This means that even though the network-online.target is reached, no uplink is available until the container is fully booted.

    The implication is that a lot of services won't work as-is when installed into a container. For instance, oneshot services such as nextcloud-setup.service will hang if a database in e.g. a local network is used. Other examples are rspamd or clamav.

Additionally, we currently maintain a Perl script called nixos-container.pl which serves as the CLI frontend for the feature. This is not only an additional maintenance burden for us, but largely duplicates functionality already provided by machinectl(1).

The main reasons why machinectl cannot be used as a complete replacement are imperative containers and the fact that state is lost after the container@<container-name>.service unit has stopped, since .nspawn units aren't used.

In the following section the design of a replacement is proposed with these goals:

  • Use networkd as the networking stack: systemd-nspawn is part of the same project, so both components are designed to work together, which resolves issues such as the missing uplink until the container has fully booted.

  • Provide a useful base to easily use systemd-nspawn features:

    • When using actual .nspawn units defined with Nix expressions, it will be trivial to define and override configuration per-container (in contrast to listing flags passed to the CLI interface, as is the case in the old module).
    • With this design, it won't be necessary to implement adjustments for advanced features such as MACVLAN interfaces since administrators can directly use the upstream configuration format. The current module supports MACVLAN interfaces for instance, but not IPVLAN.
    • Another side effect is that existing knowledge about this configuration can be re-used.
  • Provide a reasonable upgrade path for existing installations. Even though this RFC suggests deprecating the existing nixos-container subsystem, this measure is purely optional. However, for this to happen, a smooth migration path must be provided.

Detailed design

Bootstrapping

To be fully consistent with upstream systemd, the template unit systemd-nspawn@.service will be used.

The way a container is bootstrapped won't change and thus consists of the following steps (executed via a custom ExecStartPre= script):

  • Create an empty directory in /var/lib/machines named like the container-name.
  • systemd-nspawn only expects /etc/os-release, /etc/machine-id and /var to exist inside this directory; they may be empty.
  • To get a running NixOS inside, /nix/store is bind-mounted into it. As an init process, the stage-2 script is started which eventually exec(2)s into systemd and ensures that everything is correctly set up.
  • The option boot.isContainer = true; will be automatically set for new containers as well. This is necessary to
    • avoid bogus modprobe calls since nspawn doesn't have its own kernel.
    • avoid building a stage-1 boot script and initramfs as part of the container's NixOS system

This init-script can be built by evaluating a NixOS config against <nixpkgs/nixos/lib/eval-config.nix>.
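
For illustration, a minimal sketch of how such an init script could be obtained (the system value and the inline module are placeholders; the actual implementation may look different):

let
  nixos = import <nixpkgs/nixos/lib/eval-config.nix> {
    system = "x86_64-linux";
    modules = [
      ({ ... }: {
        # Mandatory for containers: no own kernel, no stage-1/initramfs.
        boot.isContainer = true;
        # Further container configuration goes here.
      })
    ];
  };
in
  # The stage-2 init script of the resulting system, usable as the container's init.
  "${nixos.config.system.build.toplevel}/init"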

Support for existing tarballs to be imported with machinectl pull-tar is explicitly out of scope in this RFC.

Network

The following section provides an overview of how networking for containers can be configured and how this will be implemented. A proposal for how the API of the NixOS module could look is given in the next chapter.

"public" networking

This is the most trivial networking mode. It is used if private networking is disabled in the container's .nspawn unit (the Private= setting in its [Network] section). In this case, the container has full access to the host's network; otherwise the container runs in its own network namespace.
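
A minimal sketch of this, assuming the existing systemd.nspawn NixOS options (the container name demo is a placeholder):

{
  # Disable private networking so that the container "demo" shares
  # the host's network namespace.
  systemd.nspawn.demo.networkConfig.Private = false;
}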

Default Mode

If nothing else is specified, the default settings of systemd-nspawn will be used for networking. To briefly summarize, this means:

  • A veth interface-pair will be created, one "host-side" interface and a container interface inside its own namespace.
  • A subnet from an RFC 1918 private IPv4 range is assigned to the host-side interface. IPv4 addresses are distributed to containers via DHCP.
  • Analogous to IPv4, an RFC 4193 IPv6 ULA prefix is assigned to the host-side interface. Containers can assign themselves addresses from this prefix using RFC 4862 SLAAC.

Containers are resolvable from the host via the mymachines NSS module. This means that container names can be resolved to addresses like DNS names, i.e. ping containername works.

Static networking

It's also possible to assign an arbitrary number of IPv4 and IPv6 addresses statically. This is internally implemented by using the Address= setting of systemd.network(5).

An example of how this can be done is shown in the next chapter.

DNS

The current implementation uses networking.useHostResolvConf to configure DNS via /etc/resolv.conf in the container. This option will be deprecated as systemd can take care of it:

  • If networkd is enabled via NixOS, systemd-resolved is enabled as well.
    • By default, resolved will use the DNS servers obtained via DHCP, which is enabled in the Default Mode.
    • With only Static networking enabled, it is necessary to configure DNS servers for resolved statically, which can be done by setting DNS servers via a .network unit for the host0 interface.
  • The behavior of networking.useHostResolvConf can be implemented with pure systemd by setting ResolvConf= in the container's .nspawn unit (see the sketch below).
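
A minimal sketch of both variants, assuming the existing systemd.network and systemd.nspawn NixOS options (the container name demo and the resolver address are placeholders):

{
  # Inside the container's NixOS configuration: static DNS servers for host0,
  # needed when only static networking (i.e. no DHCP) is used.
  systemd.network.networks."10-host0" = {
    matchConfig.Name = "host0";
    networkConfig.DNS = "192.0.2.53";
  };
}

To reproduce the behavior of networking.useHostResolvConf instead, the host can instruct nspawn to copy its own resolv.conf into the container (ResolvConf= lives in the [Exec] section of the .nspawn unit):

{
  systemd.nspawn.demo.execConfig.ResolvConf = "copy-host";
}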

Migration plan

All features from the old implementation are still supported; however, several abstractions (such as networking.useHostResolvConf or containers.<name>.macvlans) are dropped and have to be implemented by specifying the corresponding systemd unit options in the NixOS module system.

The state directory in /var/lib/containers/<name> is also usable by systemd-nspawn directly. Thus, the following steps are necessary:

  • Port existing container options to the new module (documentation describing how this can be done for each feature has to be written before this is considered ready).
  • Most of the NixOS configuration can be reused as-is, except for the following differences:
    • networkd is used inside the container rather than scripted networking. This means that NixOS's networking configuration may require adjustment. However, the basic networking.interfaces interface is also supported by the networkd stack. More notably, eth0 inside the container is named host0 by default.
  • As soon as the config is ready to deploy, the state directory in /var/lib/containers has to be copied to /var/lib/machines (see the sketch below).
  • Deploy & reboot.
  • See also https://github.com/Ma27/nixpkgs/blob/networkd-containers/nixos/tests/container-migration.nix as a POC.
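
A rough sketch of the state-directory step (the container name mycontainer is a placeholder; the old container should be stopped first so the copy is consistent):

$ systemctl stop container@mycontainer.service
$ cp -a /var/lib/containers/mycontainer /var/lib/machines/mycontainer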

Imperative management

systemd differentiates between "privileged" and "unprivileged" settings. Each privileged (also called "trusted") nspawn unit lives in /etc/systemd/nspawn. Since unprivileged settings don't allow bind mounts, they are out of scope here. This also means that /etc/systemd/nspawn has to be writable for administrative users and can no longer be a symlink to a store path.

The new implementation is written in Python since it's expected to be more accessible than Perl, meaning more folks are willing to maintain this code (just as was the case after porting the VM test driver from Perl to Python).

The following features won't be available anymore in the new script:

  • Start/Stop operations, logging into containers: this can be entirely done via machinectl(1).
  • No configuration will be made via CLI flags. Instead, the option set from the NixOS module will be used to declare not only the container's configuration, but also networking. This approach is inspired by erikarvstedt/extra-container.

Still, not all features of declarative containers are implemented here; for instance:

  • One has to explicitly specify whether to restart/reload a container when updating the config. This is done on purpose to avoid duplicating the logic from switch-to-configuration.pl here.
  • IPv6 prefix delegation is turned off because radvd's configuration is declaratively specified when building the host's NixOS.

Examples are in the next chapter.

Config activation

NixOS has to decide how to activate configuration changes for a container: unnecessary reboots should be avoided, but a reload isn't always sufficient either, because changes such as new bind mounts require a reboot. The host's switch-to-configuration.pl implements this as follows:

  • systemctl reload systemd-nspawn@container-name.service runs switch-to-configuration test inside the container named container-name.
  • When activating a new config on the host, the following things happen:
    • If the Parameters= setting in the container's .nspawn unit is the only thing that has changed, a reload will be done. This setting contains the init script for the container's NixOS and changes every time the container's NixOS config changes.
    • If anything else changes, systemd-nspawn@container-name.service will be scheduled for a restart which effectively reboots the container.
  • This behavior can be turned off completely, in which case the container won't be touched at all by switch-to-configuration. Additionally, it's possible to always force a reload or restart. See Examples & Interactions for details.
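
As a sketch of what this could look like for a declarative container (the option path used here is hypothetical; only the strategy names reload, restart and none from the imperative interface described below are given in this RFC):

{
  # Hypothetical option path; always reload instead of letting
  # switch-to-configuration decide between reload and restart.
  nixos.containers.instances.demo.activation.strategy = "reload";
}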

Deprecation

The current nixos-container implementation should be considered deprecated as soon as the new implementation is part of a stable NixOS release. To give end users a reasonable time to migrate, it should be kept and maintained for at least two release cycles, if necessary even longer.

Examples and Interactions

Basics

A container with a private IPv4 & IPv6 address can be configured like this:

{
  nixos.containers.instances.demo = {
    network = {};
    system-config = { pkgs, ... }: {
      environment.systemPackages = [ pkgs.hello ];
    };
  };
}

It's reachable locally like this thanks to systemd's mymachines NSS module:

[root@server:~]# ping demo -c1
PING demo(fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6)) 56 data bytes
64 bytes from fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6): icmp_seq=1 ttl=64 time=0.292 ms

--- demo ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.214/0.214/0.214/0.000 ms

The container can be entirely controlled via machinectl:

$ machinectl reboot demo
$ machinectl shell demo
demo$ ...

Optionally, containers can be grouped into a networking zone. Instead of a veth pair for each container, all containers in a zone are attached to a bridge interface named vz-<zone>:

{
  nixos.containers.zones.demo = {};
  nixos.containers.instances = {
    test1.network.zone = "demo";
    test2.network.zone = "demo";
  };
}

IP addresses can be statically assigned to a container as well:

{
  nixos.containers.instances.static = {
    network = {
      v4.static.containerPool = [ "10.237.1.3/16" ];
      v6.static.containerPool = [ "2a01:4f9:4b:1659:3aa3:cafe::3/96" ];
    };
    system-config = {};
  };
}

With this change, the containers live in the given subnets, and the network is configured accordingly on both the host and the container side.

Advanced Features

MACVLANs are an example of how every unit setting from networkd and nspawn can be used. They are helpful to assign multiple virtual interfaces with distinct MAC addresses to a single physical NIC.

A sub-interface, which is actually part of the physical one, can then be moved into the container's namespace:

{ lib, ... }: {
  # Config for the physical interface itself with DHCP enabled and associated to a MACVLAN.
  systemd.network.networks."40-eth1" = {
    matchConfig.Name = "eth1";
    networkConfig.DHCP = "yes";
    dhcpConfig.UseDNS = "no";
    networkConfig.MACVLAN = "mv-eth1-host";
    linkConfig.RequiredForOnline = "no";
    address = lib.mkForce [];
    addresses = lib.mkForce [];
  };

  # The host-side sub-interface of the MACVLAN. This means that the host is reachable
  # at `192.168.2.2`, both on the physical interface and from the container.
  systemd.network.networks."20-mv-eth1-host" = {
    matchConfig.Name = "mv-eth1-host";
    networkConfig.IPForward = "yes";
    dhcpV4Config.ClientIdentifier = "mac";
    address = lib.mkForce [
      "192.168.2.2/24"
    ];
  };
  systemd.network.netdevs."20-mv-eth1-host" = {
    netdevConfig = {
      Name = "mv-eth1-host";
      Kind = "macvlan";
    };
    extraConfig = ''
      [MACVLAN]
      Mode=bridge
    '';
  };

  # Assign a MACVLAN to a container. This is done by pure nspawn.
  systemd.nspawn.vlandemo.networkConfig.MACVLAN = "eth1";
  nixos.containers = {
    instances.vlandemo.system-config = {
      systemd.network = {
        networks."10-mv-eth1" = {
          matchConfig.Name = "mv-eth1";
          address = [ "192.168.2.5/24" ];
        };
        netdevs."10-mv-eth1" = {
          netdevConfig.Name = "mv-eth1";
          netdevConfig.Kind = "veth";
        };
      };
    };
  };
}

Imperative containers

Create a container with a pinned nixpkgs

Let the following expression be called imperative-container.nix:

{
  nixpkgs = <nixpkgs>;
  system-config = { pkgs, ... }: {
    services.nginx.enable = true;
    networking.firewall.allowedTCPPorts = [ 80 ];
  };

  # This implies that the "default" networking mode (i.e. DHCPv4) is used
  # and not the host's network (which is the default for imperative containers).
  network = {};
  forwardPorts = [ { hostPort = 8080; containerPort = 80; } ];
}

The container can be built like this now:

$ nixos-nspawn create imperative ./imperative-container.nix

The default page of nginx is now reachable like this:

$ curl imperative:80 -i
$ curl <IPv4 of the host-side veth interface>:8080 -i

Modify a container's config imperatively

When imperative-container.nix is updated, it can be rebuilt like this:

$ nixos-nspawn update imperative --config ./imperative-container.nix

By default, it will be restarted. This can be overridden via activation.strategy; however, only reload, restart and none are supported.

Additionally, the way the container's new config is activated can be specified by passing --reload or --restart to nixos-nspawn update.
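
For example, to force a reload when updating:

$ nixos-nspawn update imperative --config ./imperative-container.nix --reload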

If an attempt is made to modify a declarative container this way, the script will terminate early with an error.

Manage an imperative container's lifecycle

Reboot/Login/etc can be managed via machinectl(1):

$ machinectl reboot imperative
$ machinectl shell imperative
[root@imperative:~]$ …

Drawbacks

  • Explicit dependency on networkd (networking.useNetworkd = true;) on both the host-side and container-side.

    • Since there's an ongoing effort to make systemd-networkd the default on NixOS, this is, from the author's point of view, not a big problem.
  • Need to migrate from existing containers.

    • As demonstrated in Migration plan, a sane path exists.
    • With a long deprecation time, a rush to migrate can be avoided.
    • This also means that the container backend for nixops needs to be deprecated.

Alternatives

  • Implement this feature in e.g. its own (optionally community-maintained) repository:
    • This is problematic because the behavior described in Config activation requires changes in switch-to-configuration.pl.
  • Keep both the proposed feature and the existing nixos-container subsystem in NixOS. In contrast to systemd-nspawn@, the current container subsystem uses /var/lib/containers as state-directory, so clashes shouldn't happen:
    • The main concern is increased maintenance workload. Also, the rather prominent name nixos-container shouldn't advertise the old, problematic implementation.
  • Do nothing.
    • As shown above, this change leverages the full feature set of systemd-nspawn and also solves a few existing problems that are non-trivial to solve with the old implementation.
    • Since it's planned to move to networkd in the long term anyway, fundamental changes in the container subsystem will be mandatory either way.

Unresolved questions

  • None that I'm aware of.

Future work

  • Write documentation for the new module.
  • Get the PR into a mergeable state.