[RFC 0108] NixOS Container rewrite (#108)

* Initial nixos container commit because I want to view diffs between changes

* Updates / Improvements

* Note regarding static networking

* Updates

* update

* Update.

* Update.

* Update

Thanks Linus for the proof-read :)

* Update.

Mention that nixops container backend is also affected as pointed out by
fpletz.

* fixup! wording, spelling

* Move document according to assigned RFC number

* Linkify related issues

As it's done in https://github.com/NixOS/rfcs/blob/master/rfcs/0004-replace-unicode-quotes.md

* Bind-mounting Nix state isn't strictly needed

It's sufficient to have the store-paths available.

* Wording

Co-authored-by: Kevin Cox <kevincox@kevincox.ca>

* Clarify "default mode", remove a few unnecessary implementation details

* rfc108: add shepherds

Co-authored-by: Jörg Thalheim <Mic92@users.noreply.github.com>

* Apply wording suggestions from @lheckemann

Co-authored-by: Linus Heckemann <git@sphalerite.org>

* Usage of `nsenter` to update a container's configuration is not relevant for the RFC

* Rename `config` to `system-config` to avoid ambiguities in the module system

* Explicitly talk about deprecation of current implementation

Co-authored-by: Erik Arvstedt <erik.arvstedt@gmail.com>
Co-authored-by: Kevin Cox <kevincox@kevincox.ca>
Co-authored-by: Jörg Thalheim <Mic92@users.noreply.github.com>
Co-authored-by: Linus Heckemann <git@sphalerite.org>
---
feature: NixOS Container rewrite
start-date: 2021-02-14
author: Maximilian Bosch <maximilian@mbosch.me>
co-authors: n/a
shepherd-team: @Mic92, @lheckemann, @SuperSandro2000, @arianvp, @Lassulus
shepherd-leader: @Lassulus
related-issues:
- https://github.com/NixOS/nixpkgs/issues/69414
- https://github.com/NixOS/nixpkgs/issues/67265
- https://github.com/NixOS/nixpkgs/pull/67232
- https://github.com/NixOS/nixpkgs/pull/67336
- POC: https://github.com/NixOS/nixpkgs/pull/140669
---
# Summary
[summary]: #summary
This document proposes a full replacement of the
[`nixos-container`](https://nixos.org/manual/nixos/stable/#ch-containers) subsystem of NixOS with
a new implementation that is based on
[`systemd-nspawn(5)`](https://man7.org/linux/man-pages/man5/systemd.nspawn.5.html) and uses
[`systemd-networkd(8)`](https://man7.org/linux/man-pages/man8/systemd-networkd.service.8.html) for
the networking stack rather than imperative networking, while providing a reasonable upgrade path
for existing installations.
# Motivation
[motivation]: #motivation
The `nixos-container` feature originally appeared in `nixpkgs` in
[2013](https://github.com/nixos/nixpkgs/commit/9ee30cd9b51c46cea7193993d006bb4301588001), at a time when `systemd` support was relatively new to NixOS.
Back then, `systemd-nspawn` was
[only designed as a development tool for systemd developers](https://lwn.net/Articles/572957/) and NixOS
didn't [support networkd](https://github.com/NixOS/nixpkgs/commit/59f512ef7d2137586330f2cabffc41a70f4f0346).
Due to those circumstances, the entire feature was implemented
in a fairly ad-hoc way. One of the most notable issues is the broken uplink during the boot
of a container:
* Containers will be started in a template unit named [`container@.service`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Description). This
service [configures the network interfaces after the container has started](https://github.com/NixOS/nixpkgs/blob/2f96b9a7b4c083edf79374ceb9d61b5816648276/nixos/modules/virtualisation/nixos-containers.nix#L178-L229).
* This means that even though the `network-online.target` is reached, no uplink is available
until the container is fully booted.
The implication is that a lot of services won't work as-is when installed into a container.
For instance, [oneshot](https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=) services
such as `nextcloud-setup.service` will hang if they use a database that is e.g. only reachable via the
local network. Other examples are `rspamd` and `clamav`.
Additionally, we currently maintain a Perl script called `nixos-container.pl` which serves
as the CLI frontend for the feature. This is not only an additional maintenance burden for us, but largely duplicates functionality already provided by [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html).
The main reasons why `machinectl` cannot be used as a complete replacement are
[imperative containers](https://nixos.org/manual/nixos/stable/index.html#sec-imperative-containers)
and the fact that state is lost after the `container@<container-name>.service` unit
has stopped, since no `.nspawn` units are used.
In the following section the design of a replacement is proposed with these goals:
* Use [`networkd`](https://www.freedesktop.org/software/systemd/man/systemd.network.html) as the networking stack: since `systemd-nspawn` is part of the same project,
  both components are designed to work together, which resolves issues such as the missing uplink until the container is fully booted.
* Provide a useful base to easily use `systemd-nspawn` features:
* When using actual `.nspawn` units defined with Nix expressions, it will be trivial
to define and override configuration per container (in contrast to listing flags
passed to the CLI interface, as is the case in the old module).
* With this design, it won't be necessary to implement adjustments for advanced features
such as [MACVLAN interfaces](https://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/)
since administrators can directly use the upstream configuration format. The current module,
for instance, supports `MACVLAN` interfaces but not `IPVLAN`.
* Another side effect is that existing knowledge about this configuration can be re-used.
* Provide a reasonable upgrade path for existing installations. Even though this RFC suggests
deprecating the existing `nixos-container` subsystem, this measure is purely optional. However,
for this to happen, a smooth migration path must be provided.
# Detailed design
[design]: #detailed-design
## Bootstrapping
To be fully consistent with upstream `systemd`, the template unit
[`systemd-nspawn@.service`](https://github.com/systemd/systemd/blob/v247/units/systemd-nspawn@.service.in) will be used.
The approach to bootstrapping a container won't change and will thus consist of the
following steps (executed via a custom `ExecStartPre=` script):
* Create an empty directory in `/var/lib/machines` named after the container.
* `systemd-nspawn` only expects `/etc/os-release`, `/etc/machine-id` and `/var` to exist
inside; they may be empty.
* To get a running NixOS inside, `/nix/store` is bind-mounted into it. As an `init` process,
the [stage-2 script](https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/system/boot/stage-2-init.sh)
is started, which eventually `exec(2)`s into `systemd` and ensures that everything
is correctly set up.
* The option `boot.isContainer = true;` will be automatically set for new containers as well.
This is necessary to
* avoid bogus `modprobe` calls since `nspawn` doesn't have its own kernel.
* avoid building a `stage-1` boot script and initramfs as part of the container's NixOS system.
The container's init script can be built by evaluating a NixOS config against `<nixpkgs/nixos/lib/eval-config.nix>`.
Support for importing existing tarballs with `machinectl pull-tar` is explicitly out of
scope for this RFC.
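For illustration, obtaining such an init script by evaluating a container's NixOS configuration could look roughly like the following sketch (not part of the proposed interface; the module contents are placeholders):
```nix
# Sketch only: evaluate a container's NixOS configuration and take the
# stage-2 init script from the resulting system closure.
let
  containerSystem = import <nixpkgs/nixos/lib/eval-config.nix> {
    system = "x86_64-linux";
    modules = [
      # Set automatically for containers, as described above.
      { boot.isContainer = true; }
      # The container's own configuration (placeholder).
      ({ pkgs, ... }: {
        environment.systemPackages = [ pkgs.hello ];
      })
    ];
  };
in
  # The path that systemd-nspawn is eventually pointed at as init process.
  "${containerSystem.config.system.build.toplevel}/init"
```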
## Network
The following section provides an overview of how to configure networking for containers and
how this will be implemented. A proposal for how the API of the NixOS module could look is
demonstrated in the [next chapter](#examples-and-interactions).
### "public" networking
This is the most trivial networking mode. It is used if the `Private=` option in the `[Network]`
section of the `.nspawn` unit is set to `no`. In this case, the container has full access to the
host's network; otherwise the container runs in its own network namespace.
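A minimal sketch of how this could be expressed with the existing `systemd.nspawn` NixOS options (the container/unit name `demo` is a placeholder; the new module may expose a higher-level switch for this):
```nix
# Sketch only: run the container `demo` in the host's network namespace by
# disabling private networking in its .nspawn unit.
{
  systemd.nspawn.demo.networkConfig.Private = false;
}
```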
### Default Mode
If nothing else is specified, the [default settings of `systemd-nspawn`](https://github.com/systemd/systemd/blob/v247/network/80-container-ve.network) will
be used for networking. To briefly summarize, this means:
* A [`veth`](https://man7.org/linux/man-pages/man4/veth.4.html) interface-pair will be created,
one "host-side" interface and a container interface inside its own namespace.
* A subnet from an [RFC1918](https://datatracker.ietf.org/doc/html/rfc1918) private IP range
is assigned to the host-side interface. IPv4 addresses will be distributed via DHCP to containers.
* Analogous to IPv4, an [RFC4193 IPv6 ULA prefix](https://tools.ietf.org/html/rfc4193) will be
assigned to the host-side interface. Containers can assign themselves addresses from this
prefix by utilizing [RFC4862 SLAAC](https://tools.ietf.org/html/rfc4862).
Containers will be resolvable on the host via the
[`mymachines` NSS module](https://www.freedesktop.org/software/systemd/man/nss-mymachines.html).
This means that container names can be resolved to their addresses like DNS names, i.e. `ping containername` works.
### Static networking
It's also possible to assign an arbitrary number of IPv4 and IPv6 addresses statically. This
is internally implemented by using the `Address=` setting of [`systemd.network(5)`](https://www.freedesktop.org/software/systemd/man/systemd.network.html).
An example of how this can be done is shown in the [next chapter](#examples-and-interactions).
### DNS
The current implementation uses [`networking.useHostResolvConf`](https://search.nixos.org/options?channel=20.09&show=networking.useHostResolvConf&from=0&size=50&sort=relevance&query=networking.useHostResolvConf)
to configure DNS via `/etc/resolv.conf` in the container. This option will be **deprecated** as
`systemd` can take care of it:
* If `networkd` is enabled via NixOS, [`systemd-resolved`](https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html) is enabled as well.
* By default, `resolved` will be configured via DHCP, which is enabled in [Default Mode](#default-mode).
* With only [Static networking](#static-networking) enabled, DNS servers for `resolved` have to be
configured statically, which can be done via a `.network` unit for the `host0` interface
(see the sketch after this list).
* The behavior of `networking.useHostResolvConf` can be implemented with pure `systemd`
by setting `ResolvConf=` in the container's `.nspawn` unit.
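As an illustration, static DNS configuration inside a container could look like the following sketch, assuming the container-side interface keeps the default name `host0` (the DNS server addresses are placeholders):
```nix
# Sketch only: container-side configuration that hands static DNS servers
# to systemd-resolved via a .network unit matching host0.
{
  systemd.network.networks."10-host0" = {
    matchConfig.Name = "host0";
    networkConfig.DNS = [ "9.9.9.9" "149.112.112.112" ];
  };
}
```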
## Migration plan
All features from the old implementation are still supported; however, several abstractions
(such as `networking.useHostResolvConf` or `containers.<name>.macvlans`) are dropped and have
to be expressed by setting `systemd` unit options in the NixOS module system.
The state directory in `/var/lib/containers/<name>` is also usable by `systemd-nspawn` directly.
Thus, the following steps are necessary:
* Port existing container options to the new module (documentation describing how this can be
done for each feature **has** to be written before this is considered ready).
* Most of the NixOS configuration can be easily reused except for the following differences:
* `networkd` is used inside the container rather than scripted networking. This means that
NixOS's networking configuration may require adjustment. However, the basic `networking.interfaces` interface
is also supported by the `networkd` stack. Most notably, the container's primary interface is
named `host0` rather than `eth0` by default (see the sketch after this list).
* As soon as the config is ready to deploy, the state directory in `/var/lib/containers` has to
be copied to `/var/lib/machines`.
* Deploy & reboot.
* See https://github.com/Ma27/nixpkgs/blob/networkd-containers/nixos/tests/container-migration.nix for a proof of concept.
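As a sketch of the container-side adjustment (interface name and addresses are placeholders), a static `networking.interfaces` definition only needs to refer to `host0` instead of `eth0`:
```nix
# Sketch only: inside the container, the basic interface options keep working
# under networkd; only the interface name changes from eth0 to host0.
{
  networking.useNetworkd = true;
  networking.useDHCP = false;
  networking.interfaces.host0.ipv4.addresses = [
    { address = "10.231.136.2"; prefixLength = 24; }
  ];
  networking.defaultGateway = "10.231.136.1";
}
```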
## Imperative management
`systemd` differentiates between "privileged" & "unprivileged" settings. Each privileged (also
called "trusted") `nspawn` unit lives in `/etc/systemd/nspawn`. Since unprivileged containers
don't allow bind mounts, these will be out of scope. Additionally, this means that
`/etc/systemd/nspawn` has to be writable for administrative users and can't be a symlink to
a store path anymore.
The new implementation is written in Python since it's expected to be more accessible than Perl
and thus more folks will be willing to maintain the code (as was the case after porting
the VM test driver from Perl to Python).
The following features won't be available anymore in the new script:
* Start/Stop operations, logging into containers: this can be entirely done via [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html).
* No configuration will be made via CLI flags. Instead, the option set from the
NixOS module will be used to declare not only the container's configuration, but also
networking. This approach is inspired by [erikarvstedt/extra-container](https://github.com/erikarvstedt/extra-container).
Still, not all features of declarative containers are implemented for imperative ones, for instance:
* One has to explicitly specify whether to restart/reload a container when updating the config.
This is done on purpose to avoid duplicating the logic from `switch-to-configuration.pl` here.
* IPv6 prefix delegation is turned off because `radvd`'s configuration is declaratively specified
when building the host's NixOS.
Examples are in the next chapter.
## Config activation
When switching to a new configuration, NixOS has to decide how to activate changes for each
container: unnecessary reboots should be avoided, but a `reload` isn't always sufficient either,
because changes such as new bind mounts require a reboot. The host's `switch-to-configuration.pl`
implements this as follows:
* `systemctl reload systemd-nspawn@container-name.service` runs `switch-to-configuration test`
inside the container `container-name`.
* When activating a new config on the host, the following things happen:
* If the `Parameters=` setting in the container's `.nspawn` unit is the only thing that has changed,
a `reload` will be done. This setting contains the path to the init script of the container's NixOS
and changes every time the container's NixOS config changes.
* If anything else changes, `systemd-nspawn@container-name.service` will be scheduled for a restart
which effectively reboots the container.
* This behavior can also be turned off completely, in which case the container won't be touched
at all by `switch-to-configuration`. Additionally, it's possible to always force a `reload` or
`restart`. See [Examples & Interactions](#examples-and-interactions) for details, and the sketch
below.
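For illustration, per-container activation behavior could be configured roughly like this (a sketch only; the option path mirrors the `activation.strategy` setting used for imperative containers below and is an assumption for declarative ones):
```nix
# Sketch only: never touch this container on switch-to-configuration, and
# always reload (rather than restart) another one.
{
  nixos.containers.instances.critical.activation.strategy = "none";
  nixos.containers.instances.frequently-updated.activation.strategy = "reload";
}
```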
## Deprecation
The current `nixos-container` implementation should be considered deprecated as soon as the new
implementation is part of a stable NixOS release. To give end users a reasonable amount of time to
migrate, it should be kept and maintained for **at least two release cycles**, or even longer if necessary.
# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions
### Basics
A container with a private IPv4 & IPv6 address can be configured like this:
``` nix
{
  nixos.containers.instances.demo = {
    network = {};
    system-config = { pkgs, ... }: {
      environment.systemPackages = [ pkgs.hello ];
    };
  };
}
```
It's reachable locally like this thanks to systemd's `mymachines` NSS module:
```shell
[root@server:~]# ping demo -c1
PING demo(fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6)) 56 data bytes
64 bytes from fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6 (fdd1:98a7:f71:61f0:900e:81ff:fe78:e9d6): icmp_seq=1 ttl=64 time=0.292 ms
--- demo ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.214/0.214/0.214/0.000 ms
```
The container can be entirely controlled via `machinectl`:
```shell
$ machinectl reboot demo
$ machinectl shell demo
demo$ ...
```
Optionally, containers can be grouped into a networking zone. Instead of a dedicated `veth` pair per
container, all containers in the zone are attached to a shared interface named `vz-<zone>`:
```nix
{
  nixos.containers.zones.demo = {};
  nixos.containers.instances = {
    test1.network.zone = "demo";
    test2.network.zone = "demo";
  };
}
```
IP addresses can be statically assigned to a container as well:
``` nix
{
  nixos.containers.instances.static = {
    network = {
      v4.static.containerPool = [ "10.237.1.3/16" ];
      v6.static.containerPool = [ "2a01:4f9:4b:1659:3aa3:cafe::3/96" ];
    };
    system-config = {};
  };
}
```
With this configuration, the containers get addresses from the given subnets and the network is
configured accordingly on both the host and the container side.
### Advanced Features
MACVLANs are an example of how arbitrary unit settings from `networkd` and `nspawn` can be used.
They are helpful for assigning multiple virtual interfaces with distinct MAC addresses to a single
physical NIC.
A sub-interface that is part of the physical one can then be moved into the container's
namespace:
``` nix
{ lib, ... }:
{
  # Config for the physical interface itself with DHCP enabled and associated to a MACVLAN.
  systemd.network.networks."40-eth1" = {
    matchConfig.Name = "eth1";
    networkConfig.DHCP = "yes";
    dhcpConfig.UseDNS = "no";
    networkConfig.MACVLAN = "mv-eth1-host";
    linkConfig.RequiredForOnline = "no";
    address = lib.mkForce [];
    addresses = lib.mkForce [];
  };

  # The host-side sub-interface of the MACVLAN. This means that the host is reachable
  # at `192.168.2.2`, both on the physical interface and from the container.
  systemd.network.networks."20-mv-eth1-host" = {
    matchConfig.Name = "mv-eth1-host";
    networkConfig.IPForward = "yes";
    dhcpV4Config.ClientIdentifier = "mac";
    address = lib.mkForce [
      "192.168.2.2/24"
    ];
  };
  systemd.network.netdevs."20-mv-eth1-host" = {
    netdevConfig = {
      Name = "mv-eth1-host";
      Kind = "macvlan";
    };
    extraConfig = ''
      [MACVLAN]
      Mode=bridge
    '';
  };

  # Assign a MACVLAN to a container. This is done by pure nspawn.
  systemd.nspawn.vlandemo.networkConfig.MACVLAN = "eth1";

  nixos.containers = {
    instances.vlandemo.system-config = {
      systemd.network = {
        networks."10-mv-eth1" = {
          matchConfig.Name = "mv-eth1";
          address = [ "192.168.2.5/24" ];
        };
        netdevs."10-mv-eth1" = {
          netdevConfig.Name = "mv-eth1";
          netdevConfig.Kind = "veth";
        };
      };
    };
  };
}
```
### Imperative containers
#### Create a container with a pinned `nixpkgs`
Let the following expression be called `imperative-container.nix`:
```nix
{
  nixpkgs = <nixpkgs>;
  system-config = { pkgs, ... }: {
    services.nginx.enable = true;
    networking.firewall.allowedTCPPorts = [ 80 ];
  };
  # This implies that the "default" networking mode (i.e. DHCPv4) is used
  # and not the host's network (which is the default for imperative containers).
  network = {};
  forwardPorts = [ { hostPort = 8080; containerPort = 80; } ];
}
```
The container can now be created like this:
```
$ nixos-nspawn create imperative ./imperative-container.nix
```
The default page of `nginx` is now reachable like this:
```
$ curl imperative:80 -i
$ curl <IPv4 of the host-side veth interface>:8080 -i
```
#### Modify a container's config imperatively
When `imperative-container.nix` is updated, it can be rebuilt like this:
```
$ nixos-nspawn update imperative --config ./imperative-container.nix
```
By default, the container will be **restarted**. This can be overridden via `activation.strategy`;
only `reload`, `restart` and `none` are supported.
Additionally, the way the container's new config is activated can be specified by passing
`--reload` or `--restart` to `nixos-nspawn update`.
If an attempt is made to modify a declarative container this way, the script terminates early with
an error.
#### Manage an imperative container's lifecycle
Reboot/Login/etc can be managed via [`machinectl(1)`](https://www.freedesktop.org/software/systemd/man/machinectl.html):
```
$ machinectl reboot imperative
$ machinectl shell imperative
[root@imperative:~]$ …
```
# Drawbacks
[drawbacks]: #drawbacks
* Explicit dependency on `networkd` (`networking.useNetworkd = true;`) on both the
  host side and the container side.
  * Since there's a movement to make `systemd-networkd` the default on NixOS, this
    is, from the author's point of view, not a big problem.
* Need to migrate from existing containers.
* As demonstrated in [*Migration plan*](#migration-plan), a sane path exists.
* With a long deprecation time, a rush to migrate can be avoided.
* This also means that [the container backend](https://github.com/PsyanticY/nixops-container)
for `nixops` needs to be deprecated.
# Alternatives
[alternatives]: #alternatives
* Implement this feature in e.g. its own (optionally community-maintained) repository:
* This is problematic because the behavior described in [Config activation](#config-activation)
requires changes in `switch-to-configuration.pl`.
* Keep both the proposed feature and the existing `nixos-container` subsystem in NixOS. In contrast
to `systemd-nspawn@`, the current container subsystem uses `/var/lib/containers` as state-directory,
so clashes shouldn't happen:
* The main concern is increased maintenance workload. Also, with the rather prominent
name `nixos-container` we shouldn't advertise the old, problematic implementation.
* Do nothing.
* As shown above, this change leverages the full feature set of `systemd-nspawn` and also
solves a few existing problems that are non-trivial to fix while keeping the old
implementation.
* Since it's planned to move to `networkd` in the long term anyway, fundamental changes
to the container subsystem will be necessary regardless.
# Unresolved questions
[unresolved]: #unresolved-questions
* None that I'm aware of.
# Future work
[future]: #future-work
* Write documentation for the new module.
* Get the PR into a mergeable state.