diff --git a/rfcs/0108-nixos-containers.md b/rfcs/0108-nixos-containers.md
new file mode 100644
index 0000000..8d626d5
--- /dev/null
+++ b/rfcs/0108-nixos-containers.md
@@ -0,0 +1,465 @@
+---
+feature: NixOS Container rewrite
+start-date: 2021-02-14
+author: Maximilian Bosch
+co-authors: n/a
+shepherd-team: @Mic92, @lheckemann, @SuperSandro2000, @arianvp, @Lassulus
+shepherd-leader: @Lassulus
+related-issues:
+  - https://github.com/NixOS/nixpkgs/issues/69414
+  - https://github.com/NixOS/nixpkgs/issues/67265
+  - https://github.com/NixOS/nixpkgs/pull/67232
+  - https://github.com/NixOS/nixpkgs/pull/67336
+  - POC: https://github.com/NixOS/nixpkgs/pull/140669
+---
+
+# Summary
+[summary]: #summary
+
+This document suggests a full replacement of the
+[`nixos-container`](https://nixos.org/manual/nixos/stable/#ch-containers) subsystem of NixOS with
+a new implementation based on
+[`systemd-nspawn(5)`](https://man7.org/linux/man-pages/man5/systemd.nspawn.5.html) that uses
+[`systemd-networkd(8)`](https://man7.org/linux/man-pages/man8/systemd-networkd.service.8.html) for
+the networking stack rather than imperative networking, while providing a reasonable upgrade path
+for existing installations.
+
+# Motivation
+[motivation]: #motivation
+
+The `nixos-container` feature originally appeared in `nixpkgs` in
+[2013](https://github.com/nixos/nixpkgs/commit/9ee30cd9b51c46cea7193993d006bb4301588001), at a time when `systemd` support was relatively new to NixOS.
+
+Back then, `systemd-nspawn` was
+[only designed as a development tool for systemd developers](https://lwn.net/Articles/572957/) and NixOS
+didn't [support networkd](https://github.com/NixOS/nixpkgs/commit/59f512ef7d2137586330f2cabffc41a70f4f0346).
+Due to those circumstances the entire feature was implemented
+in a fairly ad-hoc way.
+One of the most notable issues is the broken uplink during boot
+of a container:
+
+* Containers are started in a template unit named [`container@.service`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Description). This
+  service [configures the network interfaces after the container has started](https://github.com/NixOS/nixpkgs/blob/2f96b9a7b4c083edf79374ceb9d61b5816648276/nixos/modules/virtualisation/nixos-containers.nix#L178-L229).
+
+* This means that even though the `network-online.target` is reached, no uplink is available
+  until the container is fully booted.
+
+  The implication is that a lot of services won't work as-is when installed into a container.
+  For instance, [oneshot](https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=) services
+  such as `nextcloud-setup.service` will hang if a database in e.g. a local network is used. Other
+  examples are `rspamd` or `clamav`.
+
+Additionally, we currently maintain a Perl script called `nixos-container.pl` which serves
+as the CLI frontend for the feature. This is not only an additional maintenance burden for us, but also largely duplicates functionality already provided by [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html).
+
+The main reasons why `machinectl` cannot be used as a complete replacement are
+[imperative containers](https://nixos.org/manual/nixos/stable/index.html#sec-imperative-containers)
+and state getting lost after the `container@.service` unit
+has stopped since `.nspawn` units aren't used.
+
+In the following section the design of a replacement is proposed with these goals:
+
+* Use [`networkd`](https://www.freedesktop.org/software/systemd/man/systemd.network.html) as the networking stack since `systemd-nspawn` is part of the same project and
+  thus both components are designed to work together, which resolves issues such as the missing uplink until the container is fully booted.
+
+* Provide a useful base to easily use `systemd-nspawn` features:
+  * When using actual `.nspawn` units defined with Nix expressions, it will be trivial
+    to define and override configuration per-container (in contrast to listing flags
+    passed to the CLI interface, as is the case in the old module).
+  * With this design, it won't be necessary to implement adjustments for advanced features
+    such as [MACVLAN interfaces](https://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/)
+    since administrators can directly use the upstream configuration format. The current module
+    supports `MACVLAN` interfaces for instance, but not `IPVLAN`.
+  * Another side effect is that existing knowledge about this configuration can be re-used.
+
+* Provide a reasonable upgrade path for existing installations. Even though this RFC suggests
+  deprecating the existing `nixos-container` subsystem, this measure is purely optional. However,
+  for this to happen, a smooth migration path must be provided.
+
+# Detailed design
+[design]: #detailed-design
+
+## Bootstrapping
+
+To be fully consistent with upstream `systemd`, the template unit
+[`systemd-nspawn@.service`](https://github.com/systemd/systemd/blob/v247/units/systemd-nspawn@.service.in) will be used.
+
+The way a container is bootstrapped won't change and will thus consist of the
+following steps (executed via a custom `ExecStartPre=`-script):
+
+* Create an empty directory in `/var/lib/machines` named after the container.
+* `systemd-nspawn` only expects `/etc/os-release`, `/etc/machine-id` and `/var` to exist
+  inside; they don't need any content.
+* To get a running NixOS inside, `/nix/store` is bind-mounted into it. As an `init` process,
+  the [stage-2 script](https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/boot/stage-2-init.sh)
+  is started which eventually `exec(2)`s into `systemd` and ensures that everything
+  is correctly set up.
+* The option `boot.isContainer = true;` will be automatically set for new containers as well.
+  This is necessary to
+  * avoid bogus `modprobe` calls since `nspawn` doesn't have its own kernel.
+  * avoid building a `stage-1` boot script and initramfs as part of the container's NixOS system.
+
+This init-script can be built by evaluating a NixOS config against ``.
+
+Support for existing tarballs to be imported with `machinectl pull-tar` is explicitly out of
+scope in this RFC.
+
+## Network
+
+The following section provides an overview of how to configure networking for containers and
+how this will be implemented. A proposal for what the API of the NixOS module could look like is
+demonstrated in the [next chapter](#examples-and-interactions).
+
+### "public" networking
+
+This is the most trivial networking mode. It is used if the `PrivateNetwork`-option of the
+`.nspawn`-unit is set to `no`. In this case, the container has full access to the host's network,
+otherwise the container will run in its own network namespace.
+
+### Default Mode
+
+If nothing else is specified, the [default settings of `systemd-nspawn`](https://github.com/systemd/systemd/blob/v247/network/80-container-ve.network) will
+be used for networking. To briefly summarize, this means:
+
+* A [`veth`](https://man7.org/linux/man-pages/man4/veth.4.html) interface-pair will be created,
+  one "host-side" interface and a container interface inside its own namespace.
+* A subnet from an [RFC1918](https://datatracker.ietf.org/doc/html/rfc1918) private IP range
+  is assigned to the host-side interface. IPv4 addresses will be distributed via DHCP to containers.
+* Analogous to IPv4, an [RFC4193 IPv6 ULA prefix](https://tools.ietf.org/html/rfc4193) will be
+  assigned to the host-side interface. Containers can assign themselves addresses from this
+  prefix by utilizing [RFC4862 SLAAC](https://tools.ietf.org/html/rfc4862).
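To make the SLAAC step in the list above more concrete, here is a minimal sketch of how an address can be formed from an advertised `/64` prefix. It assumes the container derives its interface identifier from the MAC address of its interface using the modified EUI-64 scheme of RFC 4291; other schemes exist, and which one is in effect depends on the container's configuration. The function name `slaac_eui64_address` and the MAC address in the usage note are hypothetical and only serve as illustration.

```python
import ipaddress


def slaac_eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with a modified EUI-64 interface identifier.

    Assumption: the interface identifier is derived from the MAC address
    (RFC 4291, Appendix A). This is an illustration, not networkd's code.
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 based SLAAC expects a /64 prefix")

    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a MAC address with six octets")

    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC address.
    interface_id = bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])

    # Indexing a network with an integer yields network address + offset.
    return network[int.from_bytes(interface_id, "big")]
```

With the hypothetical MAC address `92:0e:81:78:e9:d6` and the prefix `fdd1:98a7:f71:61f0::/64`, the function yields an address of the same shape as the one in the `ping` output in the Examples section; the `ff:fe` in the middle of the lower 64 bits is the typical sign of an EUI-64 derived address.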
+
+Hosts will be available on the current system via the
+[`mymachines` `nss` module](https://www.freedesktop.org/software/systemd/man/nss-mymachines.html).
+This means that container names can be resolved to addresses like DNS names, i.e. `ping containername` works.
+
+### Static networking
+
+It's also possible to assign an arbitrary number of IPv4 and IPv6 addresses statically. This
+is internally implemented by using the `Address=` setting of [`systemd.network(5)`](https://www.freedesktop.org/software/systemd/man/systemd.network.html).
+
+An example of how this can be done is shown in the [next chapter](#examples-and-interactions).
+
+### DNS
+
+The current implementation uses [`networking.useHostResolvConf`](https://search.nixos.org/options?channel=20.09&show=networking.useHostResolvConf&from=0&size=50&sort=relevance&query=networking.useHostResolvConf)
+to configure DNS via `/etc/resolv.conf` in the container. This option will be **deprecated** as
+`systemd` can take care of it:
+
+* If `networkd` is enabled via NixOS, [`systemd-resolved`](https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html) is enabled as well.
+  * By default, `resolved` will be configured via DHCP which is enabled in [Default Mode](#default-mode).
+  * With only [Static networking](#static-networking) enabled, it is necessary to configure
+    DNS servers for resolved statically which can be done by setting DNS
+    servers via a `.network` unit for the `host0` interface.
+* The behavior of `networking.useHostResolvConf` can be implemented with pure `systemd`
+  by setting the `ResolvConf`-setting for the container's `.nspawn`-unit.
+
+## Migration plan
+
+All features from the old implementation are still supported; however, several abstractions
+(such as `networking.useHostResolvConf` or `containers..macvlans`) are dropped and have
+to be implemented by specifying unit options for `systemd` in the NixOS module system.
+
+The state directory in `/var/lib/containers/` is also usable by `systemd-nspawn` directly.
+Thus, the following steps are necessary:
+
+* Port existing container options to the new module (documentation describing how this can be
+  done for each feature **has** to be written before this is considered ready).
+* Most of the NixOS configuration can be easily reused except for the following differences:
+  * `networkd` is used inside the container rather than scripted networking. This means that
+    NixOS's networking configuration may require adjustment. However, the basic `networking.interfaces` interface
+    is also supported by the `networkd` stack. More notable is that `eth0` inside the container is
+    named `host0` by default.
+  * As soon as the config is ready to deploy, the state directory in `/var/lib/containers` has to
+    be copied to `/var/lib/machines`.
+  * Deploy & reboot.
+  * See also https://github.com/Ma27/nixpkgs/blob/networkd-containers/nixos/tests/container-migration.nix as POC.
+
+## Imperative management
+
+`systemd` differentiates between "privileged" & "unprivileged" settings. Each privileged (also
+called "trusted") `nspawn` unit lives in `/etc/systemd/nspawn`. Since unprivileged containers
+don't allow bind mounts, these will be out of scope. Additionally, this means that
+`/etc/systemd/nspawn` has to be writable for administrative users and can't be a symlink to
+a store path anymore.
+
+The new implementation is written in Python since it's expected to be more accessible than Perl
+and thus more folks are expected to be willing to maintain this code (just as was the case after porting
+the VM test driver from Perl to Python).
+
+The following features won't be available anymore in the new script:
+
+* Start/Stop operations, logging into containers: this can be entirely done via [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html).
+* No configuration will be made via CLI flags.
+  Instead, the option set from the
+  NixOS module will be used to declare not only the container's configuration, but also
+  networking. This approach is inspired by [erikarvstedt/extra-container](https://github.com/erikarvstedt/extra-container).
+
+But still, not all features from declarative containers are implemented here, for instance:
+
+* One has to explicitly specify whether to restart/reload a container when updating the config.
+  This is done on purpose to avoid duplicating the logic from `switch-to-configuration.pl` here.
+* IPv6 prefix delegation is turned off because `radvd`'s configuration is declaratively specified
+  when building the host's NixOS.
+
+Examples are in the next chapter.
+
+## Config activation
+
+By default, NixOS has to decide how to activate configuration changes for a container to avoid
+unnecessary reboots, but `reload`s aren't necessarily sufficient either because changes such as
+new bind mounts require a reboot. The host's `switch-to-configuration.pl` implements it like
+this:
+
+* `systemctl reload systemd-nspawn@container-name.service` runs `switch-to-configuration test`
+  inside the container `container-name`.
+* When activating a new config on the host, the following things happen:
+  * If the setting `Parameters=` in the container's `.nspawn`-unit is the only thing that has changed,
+    a `reload` will be done. This parameter contains the `init`-script for the container's NixOS
+    and changes every time the container's NixOS config changes.
+  * If anything else changes, `systemd-nspawn@container-name.service` will be scheduled for a restart
+    which effectively reboots the container.
+* This behavior can be turned off completely which means that the container where this is turned
+  off won't be touched at all on `switch-to-configuration`. Additionally, it's possible to always
+  force a `reload` or `restart`. See [Examples & Interactions](#examples-and-interactions) for
+  details.
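The decision rules above can be sketched as a small function. This is a minimal illustration in Python, not the logic of `switch-to-configuration.pl`: the dictionary representation of an `.nspawn` unit, the names `plan_activation` and `changed_settings`, and the strategy name `dynamic` for the default behavior are all hypothetical. It also assumes that the init-script lives in the `Parameters=` setting of the `[Exec]` section and that an unchanged unit requires no action.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RELOAD = "reload"
    RESTART = "restart"


# Hypothetical in-memory form of an `.nspawn` unit: {section: {key: value}}.
NspawnUnit = dict[str, dict[str, str]]


def changed_settings(old: NspawnUnit, new: NspawnUnit) -> set[tuple[str, str]]:
    """Return every (section, key) whose value differs between both units."""
    changed = set()
    for section in old.keys() | new.keys():
        old_section = old.get(section, {})
        new_section = new.get(section, {})
        for key in old_section.keys() | new_section.keys():
            if old_section.get(key) != new_section.get(key):
                changed.add((section, key))
    return changed


def plan_activation(
    old: NspawnUnit,
    new: NspawnUnit,
    strategy: str = "dynamic",
    init_setting: tuple[str, str] = ("Exec", "Parameters"),
) -> Action:
    """Decide how a container is activated when the host switches configs."""
    # Forced strategies bypass the comparison entirely.
    if strategy in ("none", "reload", "restart"):
        return Action(strategy)
    if strategy != "dynamic":
        raise ValueError(f"unknown activation strategy: {strategy}")

    changed = changed_settings(old, new)
    if not changed:
        # Assumption: nothing changed, so the container is left alone.
        return Action.NONE
    if changed == {init_setting}:
        # Only the init-script changed: a reload is sufficient.
        return Action.RELOAD
    # Anything else (e.g. a new bind mount) needs a container reboot.
    return Action.RESTART
```

For example, two units that differ only in `Parameters=` yield `Action.RELOAD`, whereas an additional change to a bind mount yields `Action.RESTART`.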
+
+## Deprecation
+
+The current `nixos-container`-implementation should be considered deprecated as soon as the new
+implementation is part of a stable NixOS release. To give end-users a reasonable time to migrate,
+it should be kept and maintained for **at least two release cycles**, if necessary even longer.
+
+# Examples and Interactions
+[examples-and-interactions]: #examples-and-interactions
+
+### Basics
+
+A container with a private IPv4 & IPv6 address can be configured like this:
+
+``` nix
+{
+  nixos.containers.instances.demo = {
+    network = {};
+    system-config = { pkgs, ... }: {
+      environment.systemPackages = [ pkgs.hello ];
+    };
+  };
+}
+```
+
+It's reachable locally like this thanks to systemd's `mymachines` NSS module:
+
+```shell
+[root@server:~]# ping demo -c1
+PING demo(fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6)) 56 data bytes
+64 bytes from fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6): icmp_seq=1 ttl=64 time=0.292 ms
+
+--- demo ping statistics ---
+1 packets transmitted, 1 received, 0% packet loss, time 0ms
+rtt min/avg/max/mdev = 0.214/0.214/0.214/0.000 ms
+```
+
+The container can be entirely controlled via `machinectl`:
+
+```shell
+$ machinectl reboot demo
+$ machinectl shell demo
+demo$ ...
+```
+
+Optionally, containers can be grouped into a networking zone.
+Instead of a `veth` pair for each
+container, all containers will live in an interface named `vz-`:
+
+```nix
+{
+  nixos.containers.zones.demo = {};
+  nixos.containers.instances = {
+    test1.network.zone = "demo";
+    test2.network.zone = "demo";
+  };
+}
+```
+
+IP addresses can be statically assigned to a container as well:
+
+``` nix
+{
+  nixos.containers.instances.static = {
+    network = {
+      v4.static.containerPool = [ "10.237.1.3/16" ];
+      v6.static.containerPool = [ "2a01:4f9:4b:1659:3aa3:cafe::3/96" ];
+    };
+    system-config = {};
+  };
+}
+```
+
+With this change, the container lives in the given subnets, and the network is configured
+accordingly on both the host side and the container side.
+
+### Advanced Features
+
+MACVLANs are an example of how every unit setting from `networkd` and `nspawn` can be used.
+These are helpful to assign multiple virtual interfaces with distinct MAC addresses to a single
+physical NIC.
+
+A sub-interface which is actually part of the physical one can then be moved into the container's
+namespace:
+
+``` nix
+{
+  # Config for the physical interface itself with DHCP enabled and associated to a MACVLAN.
+  systemd.network.networks."40-eth1" = {
+    matchConfig.Name = "eth1";
+    networkConfig.DHCP = "yes";
+    dhcpConfig.UseDNS = "no";
+    networkConfig.MACVLAN = "mv-eth1-host";
+    linkConfig.RequiredForOnline = "no";
+    address = lib.mkForce [];
+    addresses = lib.mkForce [];
+  };
+
+  # The host-side sub-interface of the MACVLAN. This means that the host is reachable
+  # at `192.168.2.2`, both on the physical interface and from the container.
+  systemd.network.networks."20-mv-eth1-host" = {
+    matchConfig.Name = "mv-eth1-host";
+    networkConfig.IPForward = "yes";
+    dhcpV4Config.ClientIdentifier = "mac";
+    address = lib.mkForce [
+      "192.168.2.2/24"
+    ];
+  };
+  systemd.network.netdevs."20-mv-eth1-host" = {
+    netdevConfig = {
+      Name = "mv-eth1-host";
+      Kind = "macvlan";
+    };
+    extraConfig = ''
+      [MACVLAN]
+      Mode=bridge
+    '';
+  };
+
+  # Assign a MACVLAN to a container. This is done by pure nspawn.
+  systemd.nspawn.vlandemo.networkConfig.MACVLAN = "eth1";
+  nixos.containers = {
+    instances.vlandemo.system-config = {
+      systemd.network = {
+        networks."10-mv-eth1" = {
+          matchConfig.Name = "mv-eth1";
+          address = [ "192.168.2.5/24" ];
+        };
+        netdevs."10-mv-eth1" = {
+          netdevConfig.Name = "mv-eth1";
+          netdevConfig.Kind = "veth";
+        };
+      };
+    };
+  };
+}
+```
+
+### Imperative containers
+
+#### Create a container with a pinned `nixpkgs`
+
+Let the following expression be called `imperative-container.nix`:
+
+```nix
+{
+  nixpkgs = ;
+  system-config = { pkgs, ... }: {
+    services.nginx.enable = true;
+    networking.firewall.allowedTCPPorts = [ 80 ];
+  };
+
+  # This implies that the "default" networking mode (i.e. DHCPv4) is used
+  # and not the host's network (which is the default for imperative containers).
+  network = {};
+  forwardPorts = [ { hostPort = 8080; containerPort = 80; } ];
+}
+```
+
+The container can be built like this now:
+
+```
+$ nixos-nspawn create imperative ./imperative-container.nix
+```
+
+The default page of `nginx` is now reachable like this:
+
+```
+$ curl imperative:80 -i
+$ curl :8080 -i
+```
+
+#### Modify a container's config imperatively
+
+When `imperative-container.nix` is updated, it can be rebuilt like this:
+
+```
+$ nixos-nspawn update imperative --config ./imperative-container.nix
+```
+
+By default, it will be **restarted**. This can be overridden via `activation.strategy`,
+however only `reload`, `restart` and `none` are supported.
+
+Additionally, the way the container's new config is activated can be specified
+via `--reload` or `--restart` passed to `nixos-nspawn update`.
+
+If an attempt is made to modify a declarative container, the script will terminate early with an
+error.
+
+#### Manage an imperative container's lifecycle
+
+Reboot/Login/etc can be managed via [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html):
+
+```
+$ machinectl reboot imperative
+$ machinectl shell imperative
+[root@imperative:~]$ …
+```
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* Explicit dependency on `networkd` (`networking.useNetworkd = true;`) on both the
+  host-side and container-side.
+  * Since there's a movement to make `systemd-networkd` the default on NixOS, this
+    is not a big problem from the author's point of view.
+
+* Need to migrate from existing containers.
+  * As demonstrated in [*Migration plan*](#migration-plan), a sane path exists.
+  * With a long deprecation time, a rush to migrate can be avoided.
+  * This also means that [the container backend](https://github.com/PsyanticY/nixops-container)
+    for `nixops` needs to be deprecated.
+
+# Alternatives
+[alternatives]: #alternatives
+
+
+* Implement this feature in e.g. its own (optionally community-maintained) repository:
+  * This is problematic due to the changes for [Config activation](#config-activation) that
+    require changes in `switch-to-configuration.pl`.
+* Keep both the proposed feature and the existing `nixos-container` subsystem in NixOS. In contrast
+  to `systemd-nspawn@`, the current container subsystem uses `/var/lib/containers` as state-directory,
+  so clashes shouldn't happen:
+  * The main concern is increased maintenance workload. Also, with the rather prominent
+    name `nixos-container` we shouldn't advertise the old, problematic implementation.
+* Do nothing.
+  * As shown above, this change leverages the full feature set of `systemd-nspawn` and also
+    solves a few existing problems that are non-trivial to solve when keeping the old
+    implementation.
+  * Since it's planned to move to `networkd` in the long term anyway, fundamental changes
+    in the container subsystem will be mandatory.
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+* None that I'm aware of.
+
+# Future work
+[future]: #future-work
+
+* Write documentation for the new module.
+* Get the PR into a mergeable state.
+