nixos/systemd-boot: init boot counting

Julien Malka 2024-01-06 00:30:23 +00:00
parent 67ebfe5a80
commit eb435897a6
6 changed files with 366 additions and 65 deletions


@ -14,6 +14,8 @@ In addition to numerous new and upgraded packages, this release has the followin
- This can be disabled through the `environment.stub-ld.enable` option.
- If you use `programs.nix-ld.enable`, no changes are needed. The stub will be disabled automatically.
- NixOS now supports *automatic boot assessment* for `systemd-boot` users (see [here](https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/) for a detailed description of the feature). Available as [boot.loader.systemd-boot.bootCounting](#opt-boot.loader.systemd-boot.bootCounting.enable).
- Julia environments can now be built with arbitrary packages from the ecosystem using the `.withPackages` function. For example: `julia.withPackages ["Plots"]`.
## New Services {#sec-release-24.05-new-services}


@ -0,0 +1,38 @@
# Automatic boot assessment with systemd-boot {#sec-automatic-boot-assessment}
## Overview {#sec-automatic-boot-assessment-overview}
Automatic boot assessment (or boot-counting) is a feature of `systemd-boot` that allows for automatically detecting invalid boot entries.
When the feature is active, each boot entry has an associated counter initialised to a user-defined number of trials. Whenever `systemd-boot` boots an entry, its counter is decreased by one; an entry whose counter reaches zero is marked as *bad*. However, if an entry boots successfully, systemd permanently marks it as *good* and removes the counter altogether. Entries marked as *bad* are sorted last in the systemd-boot menu.
A complete explanation of how this feature works can be found [here](https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/).
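The counter lives in the entry's file name, as the tests in this commit show (`nixos-generation-2+2.conf`, then `nixos-generation-2+1-1.conf`, then `nixos-generation-2+0-2.conf`). The following is a minimal sketch of that life cycle, not the code used by `systemd-boot` or by the NixOS builder; the helper names `after_boot_attempt` and `mark_good` are hypothetical and exist only for illustration.

```python
import re

# Suffix format assumed from the entry names used in this commit's tests:
# "+<tries left>" or "+<tries left>-<tries done>", right before ".conf".
COUNTER_RE = re.compile(r"^(?P<stem>.+?)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


def after_boot_attempt(filename: str) -> str:
    """Return the entry name after one more boot attempt (hypothetical helper)."""
    m = COUNTER_RE.match(filename)
    if m is None:
        # No counter in the name: the entry was already marked good.
        return filename
    left = int(m["left"])
    done = int(m["done"] or 0)
    if left == 0:
        # No tries left: the entry is considered bad and is not decremented further.
        return filename
    return f"{m['stem']}+{left - 1}-{done + 1}.conf"


def mark_good(filename: str) -> str:
    """Return the entry name once the boot was assessed as successful."""
    m = COUNTER_RE.match(filename)
    return f"{m['stem']}.conf" if m else filename


name = "nixos-generation-2+2.conf"
first = after_boot_attempt(name)
second = after_boot_attempt(first)
print(first, second, mark_good(name))
```

With `trials = 2`, two failed attempts leave the entry with zero tries left, while a single successful boot strips the suffix entirely.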
## Enabling the feature {#sec-automatic-boot-assessment-enable}
The feature can be enabled by toggling the [boot.loader.systemd-boot.bootCounting](#opt-boot.loader.systemd-boot.bootCounting.enable) option.
## The boot-complete.target unit {#sec-automatic-boot-assessment-boot-complete-target}
A *successful boot* for an entry is defined in terms of the `boot-complete.target` synchronisation point. It is up to the user to schedule all necessary units for the machine to be considered successfully booted before that synchronisation point.
For example, suppose you are running `nsd`, an authoritative DNS server, on a machine and you want a *good* entry to be one where that DNS server started successfully. A configuration for that NixOS machine could look like this:
```
boot.loader.systemd-boot.bootCounting.enable = true;
services.nsd.enable = true;
/* rest of nsd configuration omitted */
systemd.services.nsd = {
before = [ "boot-complete.target" ];
wantedBy = [ "boot-complete.target" ];
unitConfig.FailureAction = "reboot";
};
```
## Interaction with specialisations {#sec-automatic-boot-assessment-specialisations}
When the boot-counting feature is enabled, `systemd-boot` will still try the boot entries in the same order as they are displayed in the boot menu. This means that the specialisations of a given generation will be tried directly after that generation. A generation being marked as *bad* does not mean that its specialisations will also be marked as *bad* (as its specialisations could very well be booting successfully).
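As a rough illustration of the ordering, here is a simplified model that uses only two rules stated in this commit: *bad* entries are sorted last, and the `sort-key` value `default` sorts before `non-default`. The `MenuEntry` class and `menu_order` function are hypothetical, and the further tie-breakers that `systemd-boot` applies are deliberately not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MenuEntry:
    # Hypothetical model of a boot entry, for illustration only.
    name: str
    sort_key: str              # "default" or "non-default", as written by the builder
    tries_left: Optional[int]  # None once the entry has been marked good


def is_bad(entry: MenuEntry) -> bool:
    # An entry is bad when it still carries a counter and no tries are left.
    return entry.tries_left == 0


def menu_order(entries: List[MenuEntry]) -> List[MenuEntry]:
    # Bad entries go last; among the others, "default" sorts before "non-default".
    # sorted() is stable, so entries that tie keep their input order.
    return sorted(entries, key=lambda e: (is_bad(e), e.sort_key))


entries = [
    MenuEntry("nixos-generation-2", "non-default", 0),
    MenuEntry("nixos-generation-1", "default", None),
    MenuEntry("nixos-generation-3", "non-default", 3),
]
print([e.name for e in menu_order(entries)])
```

Each entry carries its own counter in this model, which mirrors the point above: a generation running out of tries says nothing about the counters of its specialisations.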
## Limitations {#sec-automatic-boot-assessment-limitations}
This feature has to be used with care to avoid data integrity issues. Rolling back to past generations can sometimes be dangerous, for example if some services have undefined behavior when they encounter data migrations performed by newer versions of themselves.


@ -12,8 +12,9 @@ import subprocess
import sys
import warnings
import json
from typing import NamedTuple, Dict, List
from typing import NamedTuple, Dict, List, Type, Generator, Iterable
from dataclasses import dataclass
from pathlib import Path
@dataclass
@ -28,7 +29,114 @@ class BootSpec:
specialisations: Dict[str, "BootSpec"]
initrdSecrets: str | None = None
@dataclass
class Entry:
profile: str | None
generation_number: int
specialisation: str | None
@classmethod
def from_path(cls: Type["Entry"], path: Path) -> "Entry":
filename = path.name
# Matching nixos-$profile-generation-*.conf
rex_profile = re.compile(r"^nixos-(.*)-generation-.*\.conf$")
# Matching nixos*-generation-$number*.conf
rex_generation = re.compile(r"^nixos.*-generation-([0-9]+).*\.conf$")
# Matching nixos*-generation-$number-specialisation-$specialisation_name*.conf
rex_specialisation = re.compile(r"^nixos.*-generation-([0-9]+)-specialisation-([a-zA-Z0-9]+).*\.conf$")
profile = rex_profile.sub(r"\1", filename) if rex_profile.match(filename) else None
specialisation = rex_specialisation.sub(r"\2", filename) if rex_specialisation.match(filename) else None
try:
generation_number = int(rex_generation.sub(r"\1", filename))
except ValueError:
raise
return cls(profile, generation_number, specialisation)
BOOT_ENTRY = """title {title}
version Generation {generation} {description}
linux {kernel}
initrd {initrd}
options {kernel_params}
machine-id {machine_id}
sort-key {sort_key}
"""
@dataclass
class DiskEntry():
entry: Entry
default: bool
counters: str | None
title: str
description: str
kernel: str
initrd: str
kernel_params: str
machine_id: str
@classmethod
def from_path(cls: Type["DiskEntry"], path: Path) -> "DiskEntry":
entry = Entry.from_path(path)
with open(path, 'r') as f:
data = f.read().splitlines()
if '' in data:
data.remove('')
entry_map = dict(l.split(' ', 1) for l in data)
assert "title" in entry_map
assert "version" in entry_map
version_splitted = entry_map["version"].split(" ", 2)
assert version_splitted[0] == "Generation"
assert version_splitted[1].isdigit()
assert "linux" in entry_map
assert "initrd" in entry_map
assert "options" in entry_map
assert "machine-id" in entry_map
assert "sort-key" in entry_map
filename = path.name
# Matching nixos*-generation-*$counters.conf
# Counters may have several digits (e.g. "+10-2"), so match \d+ rather than a single \d
rex_counters = re.compile(r"^nixos.*-generation-.*(\+\d+(-\d+)?)\.conf$")
counters = rex_counters.sub(r"\1", filename) if rex_counters.match(filename) else None
disk_entry = cls(
entry=entry,
default=(entry_map["sort-key"] == "default"),
counters=counters,
title=entry_map["title"],
description=entry_map["version"],
kernel=entry_map["linux"],
initrd=entry_map["initrd"],
kernel_params=entry_map["options"],
machine_id=entry_map["machine-id"])
return disk_entry
def write(self) -> None:
tmp_path = self.path.with_suffix(".tmp")
with tmp_path.open('w') as f:
# We use "sort-key" to sort the default generation first.
# The "default" string is sorted before "non-default" (alphabetically)
f.write(BOOT_ENTRY.format(title=self.title,
generation=self.entry.generation_number,
kernel=self.kernel,
initrd=self.initrd,
kernel_params=self.kernel_params,
machine_id=self.machine_id,
description=self.description,
sort_key="default" if self.default else "non-default"))
f.flush()
os.fsync(f.fileno())
tmp_path.rename(self.path)
@property
def path(self) -> Path:
pieces = [
"nixos",
self.entry.profile or None,
"generation",
str(self.entry.generation_number),
f"specialisation-{self.entry.specialisation}" if self.entry.specialisation else None,
]
prefix = "-".join(p for p in pieces if p)
return Path(f"@efiSysMountPoint@/loader/entries/{prefix}{self.counters if self.counters else ''}.conf")
libc = ctypes.CDLL("libc.so.6")
@ -56,29 +164,14 @@ def system_dir(profile: str | None, generation: int, specialisation: str | None)
else:
return d
BOOT_ENTRY = """title {title}
version Generation {generation} {description}
linux {kernel}
initrd {initrd}
options {kernel_params}
"""
def generation_conf_filename(profile: str | None, generation: int, specialisation: str | None) -> str:
pieces = [
"nixos",
profile or None,
"generation",
str(generation),
f"specialisation-{specialisation}" if specialisation else None,
]
return "-".join(p for p in pieces if p) + ".conf"
def write_loader_conf(profile: str | None, generation: int, specialisation: str | None) -> None:
def write_loader_conf(profile: str | None) -> None:
with open("@efiSysMountPoint@/loader/loader.conf.tmp", 'w') as f:
if "@timeout@" != "":
f.write("timeout @timeout@\n")
f.write("default %s\n" % generation_conf_filename(profile, generation, specialisation))
if profile:
f.write("default nixos-%s-generation-*\n" % profile)
else:
f.write("default nixos-generation-*\n")
if not @editor@:
f.write("editor 0\n")
f.write("console-mode @consoleMode@\n")
@ -86,6 +179,17 @@ def write_loader_conf(profile: str | None, generation: int, specialisation: str
os.fsync(f.fileno())
os.rename("@efiSysMountPoint@/loader/loader.conf.tmp", "@efiSysMountPoint@/loader/loader.conf")
def scan_entries() -> Generator[DiskEntry, None, None]:
"""
Scan all entries in $ESP/loader/entries/*
Type 2 entries are skipped, as they are not supported for now.
Yields DiskEntry objects.
"""
for path in Path("@efiSysMountPoint@/loader/entries/").glob("nixos*-generation-[1-9]*.conf"):
try:
yield DiskEntry.from_path(path)
except ValueError:
continue
def get_bootspec(profile: str | None, generation: int) -> BootSpec:
system_directory = system_dir(profile, generation, None)
@ -120,7 +224,7 @@ def copy_from_file(file: str, dry_run: bool = False) -> str:
return efi_file_path
def write_entry(profile: str | None, generation: int, specialisation: str | None,
machine_id: str, bootspec: BootSpec, current: bool) -> None:
machine_id: str, bootspec: BootSpec, entries: Iterable[DiskEntry], current: bool) -> None:
if specialisation:
bootspec = bootspec.specialisations[specialisation]
kernel = copy_from_file(bootspec.kernel)
@ -142,28 +246,30 @@ def write_entry(profile: str | None, generation: int, specialisation: str | None
f'for "{title} - Configuration {generation}", an older generation', file=sys.stderr)
print("note: this is normal after having removed "
"or renamed a file in `boot.initrd.secrets`", file=sys.stderr)
entry_file = "@efiSysMountPoint@/loader/entries/%s" % (
generation_conf_filename(profile, generation, specialisation))
tmp_path = "%s.tmp" % (entry_file)
kernel_params = "init=%s " % bootspec.init
kernel_params = kernel_params + " ".join(bootspec.kernelParams)
build_time = int(os.path.getctime(system_dir(profile, generation, specialisation)))
build_date = datetime.datetime.fromtimestamp(build_time).strftime('%F')
counters = "+@bootCountingTrials@" if @bootCounting@ else ""
entry = Entry(profile, generation, specialisation)
# We check if the entry we are writing is already on disk
# and we update its "default entry" status
for entry_on_disk in entries:
if entry == entry_on_disk.entry:
entry_on_disk.default = current
entry_on_disk.write()
return
with open(tmp_path, 'w') as f:
f.write(BOOT_ENTRY.format(title=title,
generation=generation,
kernel=kernel,
initrd=initrd,
kernel_params=kernel_params,
description=f"{bootspec.label}, built on {build_date}"))
if machine_id is not None:
f.write("machine-id %s\n" % machine_id)
f.flush()
os.fsync(f.fileno())
os.rename(tmp_path, entry_file)
DiskEntry(
entry=entry,
title=title,
kernel=kernel,
initrd=initrd,
counters=counters,
kernel_params=kernel_params,
machine_id=machine_id,
description=f"{bootspec.label}, built on {build_date}",
default=current).write()
def get_generations(profile: str | None = None) -> list[SystemIdentifier]:
gen_list = subprocess.check_output([
@ -188,30 +294,19 @@ def get_generations(profile: str | None = None) -> list[SystemIdentifier]:
return configurations[-configurationLimit:]
def remove_old_entries(gens: list[SystemIdentifier]) -> None:
rex_profile = re.compile(r"^@efiSysMountPoint@/loader/entries/nixos-(.*)-generation-.*\.conf$")
rex_generation = re.compile(r"^@efiSysMountPoint@/loader/entries/nixos.*-generation-([0-9]+)(-specialisation-.*)?\.conf$")
def remove_old_entries(gens: list[SystemIdentifier], disk_entries: Iterable[DiskEntry]) -> None:
known_paths = []
for gen in gens:
bootspec = get_bootspec(gen.profile, gen.generation)
known_paths.append(copy_from_file(bootspec.kernel, True))
known_paths.append(copy_from_file(bootspec.initrd, True))
for path in glob.iglob("@efiSysMountPoint@/loader/entries/nixos*-generation-[1-9]*.conf"):
if rex_profile.match(path):
prof = rex_profile.sub(r"\1", path)
else:
prof = None
try:
gen_number = int(rex_generation.sub(r"\1", path))
except ValueError:
continue
if not (prof, gen_number, None) in gens:
os.unlink(path)
for disk_entry in disk_entries:
if (disk_entry.entry.profile, disk_entry.entry.generation_number, None) not in gens:
os.unlink(disk_entry.path)
for path in glob.iglob("@efiSysMountPoint@/efi/nixos/*"):
if not path in known_paths and not os.path.isdir(path):
if path not in known_paths and not os.path.isdir(path):
os.unlink(path)
def get_profiles() -> list[str]:
if os.path.isdir("/nix/var/nix/profiles/system-profiles/"):
return [x
@ -284,16 +379,17 @@ def install_bootloader(args: argparse.Namespace) -> None:
gens = get_generations()
for profile in get_profiles():
gens += get_generations(profile)
remove_old_entries(gens)
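# Materialise the generator: it is iterated by remove_old_entries and again by every write_entry call
entries = list(scan_entries())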
remove_old_entries(gens, entries)
for gen in gens:
try:
bootspec = get_bootspec(gen.profile, gen.generation)
is_default = os.path.dirname(bootspec.init) == args.default_config
write_entry(*gen, machine_id, bootspec, current=is_default)
write_entry(*gen, machine_id, bootspec, entries, current=is_default)
for specialisation in bootspec.specialisations.keys():
write_entry(gen.profile, gen.generation, specialisation, machine_id, bootspec, current=is_default)
write_entry(gen.profile, gen.generation, specialisation, machine_id, bootspec, entries, current=is_default)
if is_default:
write_loader_conf(*gen)
write_loader_conf(gen.profile)
except OSError as e:
# See https://github.com/NixOS/nixpkgs/issues/114552
if e.errno == errno.EINVAL:


@ -49,6 +49,8 @@ let
${pkgs.coreutils}/bin/install -D $empty_file "${efi.efiSysMountPoint}/efi/nixos/.extra-files/loader/entries/"${escapeShellArg n}
'') cfg.extraEntries)}
'';
bootCountingTrials = cfg.bootCounting.trials;
bootCounting = if cfg.bootCounting.enable then "True" else "False";
};
checkedSystemdBootBuilder = pkgs.runCommand "systemd-boot" {
@ -69,7 +71,10 @@ let
'';
in {
meta.maintainers = with lib.maintainers; [ julienmalka ];
meta = {
maintainers = with lib.maintainers; [ julienmalka ];
doc = ./boot-counting.md;
};
imports =
[ (mkRenamedOptionModule [ "boot" "loader" "gummiboot" "enable" ] [ "boot" "loader" "systemd-boot" "enable" ])
@ -238,6 +243,15 @@ in {
'';
};
bootCounting = {
enable = mkEnableOption (lib.mdDoc "automatic boot assessment");
trials = mkOption {
default = 3;
type = types.int;
description = lib.mdDoc "number of trials each entry should start with";
};
};
};
config = mkIf cfg.enable {


@ -101,6 +101,10 @@ let
"systemd-rfkill.service"
"systemd-rfkill.socket"
# Boot counting
"boot-complete.target"
] ++ lib.optional config.boot.loader.systemd-boot.bootCounting.enable "systemd-bless-boot.service" ++ [
# Hibernate / suspend.
"hibernate.target"
"suspend.target"


@ -13,9 +13,11 @@ let
boot.loader.systemd-boot.enable = true;
boot.loader.efi.canTouchEfiVariables = true;
environment.systemPackages = [ pkgs.efibootmgr ];
# Needed for machine-id to be persisted between reboots
environment.etc."machine-id".text = "00000000000000000000000000000000";
};
in
{
rec {
basic = makeTest {
name = "systemd-boot";
meta.maintainers = with pkgs.lib.maintainers; [ danielfullmer julienmalka ];
@ -252,15 +254,15 @@ in
'';
};
garbage-collect-entry = makeTest {
name = "systemd-boot-garbage-collect-entry";
garbage-collect-entry = { withBootCounting ? false, ... }: makeTest {
name = "systemd-boot-garbage-collect-entry" + optionalString withBootCounting "-with-boot-counting";
meta.maintainers = with pkgs.lib.maintainers; [ julienmalka ];
nodes = {
inherit common;
machine = { pkgs, nodes, ... }: {
imports = [ common ];
boot.loader.systemd-boot.bootCounting.enable = withBootCounting;
# These are configs for different nodes, but we'll use them here in `machine`
system.extraDependencies = [
nodes.common.system.build.toplevel
@ -275,8 +277,12 @@ in
''
machine.succeed("nix-env -p /nix/var/nix/profiles/system --set ${baseSystem}")
machine.succeed("nix-env -p /nix/var/nix/profiles/system --delete-generations 1")
# At this point generation 1 has already been marked as good so we reintroduce counters artificially
${optionalString withBootCounting ''
machine.succeed("mv /boot/loader/entries/nixos-generation-1.conf /boot/loader/entries/nixos-generation-1+3.conf")
''}
machine.succeed("${baseSystem}/bin/switch-to-configuration boot")
machine.fail("test -e /boot/loader/entries/nixos-generation-1.conf")
machine.fail("test -e /boot/loader/entries/nixos-generation-1*")
machine.succeed("test -e /boot/loader/entries/nixos-generation-2.conf")
'';
};
@ -322,4 +328,145 @@ in
machine.wait_for_unit("multi-user.target")
'';
};
# Check that we are booting the default entry and not the generation with largest version number
defaultEntry = { withBootCounting ? false, ... }: makeTest {
name = "systemd-boot-default-entry" + optionalString withBootCounting "-with-boot-counting";
meta.maintainers = with pkgs.lib.maintainers; [ julienmalka ];
nodes = {
machine = { pkgs, lib, nodes, ... }: {
imports = [ common ];
system.extraDependencies = [ nodes.other_machine.system.build.toplevel ];
boot.loader.systemd-boot.bootCounting.enable = withBootCounting;
};
other_machine = { pkgs, lib, ... }: {
imports = [ common ];
boot.loader.systemd-boot.bootCounting.enable = withBootCounting;
environment.systemPackages = [ pkgs.hello ];
};
};
testScript = { nodes, ... }:
let
orig = nodes.machine.system.build.toplevel;
other = nodes.other_machine.system.build.toplevel;
in
''
orig = "${orig}"
other = "${other}"
def check_current_system(system_path):
machine.succeed(f'test $(readlink -f /run/current-system) = "{system_path}"')
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
check_current_system(orig)
# Switch to other configuration
machine.succeed("nix-env -p /nix/var/nix/profiles/system --set ${other}")
machine.succeed(f"{other}/bin/switch-to-configuration boot")
# Rollback, default entry is now generation 1
machine.succeed("nix-env -p /nix/var/nix/profiles/system --rollback")
machine.succeed(f"{orig}/bin/switch-to-configuration boot")
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
${if withBootCounting
then ''machine.succeed("test -e /boot/loader/entries/nixos-generation-2+3.conf")''
else ''machine.succeed("test -e /boot/loader/entries/nixos-generation-2.conf")''}
machine.shutdown()
machine.start()
machine.wait_for_unit("multi-user.target")
# Check that we booted generation 1 (default)
# even though generation 2 comes first in alphabetical order
check_current_system(orig)
'';
};
bootCounting =
let
baseConfig = { pkgs, lib, ... }: {
imports = [ common ];
boot.loader.systemd-boot.bootCounting.enable = true;
boot.loader.systemd-boot.bootCounting.trials = 2;
};
in
makeTest {
name = "systemd-boot-counting";
meta.maintainers = with pkgs.lib.maintainers; [ julienmalka ];
nodes = {
machine = { pkgs, lib, nodes, ... }: {
imports = [ baseConfig ];
system.extraDependencies = [ nodes.bad_machine.system.build.toplevel ];
};
bad_machine = { pkgs, lib, ... }: {
imports = [ baseConfig ];
systemd.services."failing" = {
script = "exit 1";
requiredBy = [ "boot-complete.target" ];
before = [ "boot-complete.target" ];
serviceConfig.Type = "oneshot";
};
};
};
testScript = { nodes, ... }:
let
orig = nodes.machine.system.build.toplevel;
bad = nodes.bad_machine.system.build.toplevel;
in
''
orig = "${orig}"
bad = "${bad}"
def check_current_system(system_path):
machine.succeed(f'test $(readlink -f /run/current-system) = "{system_path}"')
# Ensure we booted using an entry with counters enabled
machine.succeed(
"test -e /sys/firmware/efi/efivars/LoaderBootCountPath-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f"
)
# systemd-bless-boot should have already removed the "+2" suffix from the boot entry
machine.wait_for_unit("systemd-bless-boot.service")
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
check_current_system(orig)
# Switch to bad configuration
machine.succeed("nix-env -p /nix/var/nix/profiles/system --set ${bad}")
machine.succeed(f"{bad}/bin/switch-to-configuration boot")
# Ensure new bootloader entry has initialized counter
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
machine.succeed("test -e /boot/loader/entries/nixos-generation-2+2.conf")
machine.shutdown()
machine.start()
machine.wait_for_unit("multi-user.target")
check_current_system(bad)
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
machine.succeed("test -e /boot/loader/entries/nixos-generation-2+1-1.conf")
machine.shutdown()
machine.start()
machine.wait_for_unit("multi-user.target")
check_current_system(bad)
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
machine.succeed("test -e /boot/loader/entries/nixos-generation-2+0-2.conf")
machine.shutdown()
# Should boot back into original configuration
machine.start()
check_current_system(orig)
machine.wait_for_unit("multi-user.target")
machine.succeed("test -e /boot/loader/entries/nixos-generation-1.conf")
machine.succeed("test -e /boot/loader/entries/nixos-generation-2+0-2.conf")
machine.shutdown()
'';
};
defaultEntryWithBootCounting = defaultEntry { withBootCounting = true; };
garbageCollectEntryWithBootCounting = garbage-collect-entry { withBootCounting = true; };
}