* [PATCH v7 0/2] System Generation ID driver and VMGENID backend
@ 2021-02-24  8:47 ` Adrian Catangiu
  0 siblings, 0 replies; 23+ messages in thread
From: Adrian Catangiu @ 2021-02-24  8:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390
  Cc: gregkh, graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mst, mhocko, rafael,
	pavel, mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer, Adrian Catangiu

This feature is aimed at virtualized or containerized environments
where VM or container snapshotting duplicates memory state, which is a
challenge for applications that want to generate unique data such as
request IDs, UUIDs, and cryptographic nonces.

The patch set introduces a mechanism that provides a userspace
interface through which applications and libraries are made aware of
uniqueness-breaking events such as VM or container snapshotting, and
which allows them to react and adapt to such events.

Solving the uniqueness problem strongly enough for cryptographic
purposes requires a mechanism which can deterministically reseed
userspace PRNGs with new entropy at restore time. This mechanism must
also support the high-throughput and low-latency use-cases that led
programmers to pick a userspace PRNG in the first place; be usable by
both application code and libraries; allow transparent retrofitting
behind existing popular PRNG interfaces without changing application
code; it must be efficient, especially on snapshot restore; and be
simple enough for wide adoption.

The first patch in the set implements a device driver which exposes
the /dev/sysgenid char device to userspace. Its associated filesystem
operations can be used to build a system-level safe workflow
that guest software can follow to protect itself from negative system
snapshot effects.

The second patch in the set adds a VmGenId driver which makes use of
the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
following VM snapshots.

**Please note**, SysGenID alone does not guarantee complete snapshot
safety to applications using it. A certain workflow needs to be
followed at the system level, in order to make the system
snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
section in the included SysGenID documentation.

---

v6 -> v7:
  - remove sysgenid uevent

v5 -> v6:

  - sysgenid: watcher tracking disabled by default
  - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each
    file descriptor to set whether they should be tracked as watchers
  - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE
  - rework all documentation to clearly capture all prerequisites for
    achieving snapshot safety when using the provided mechanism
  - sysgenid documentation: replace individual filesystem operations
    examples with a higher level example showcasing system-level
    snapshot-safe workflow

v4 -> v5:

  - sysgenid: generation changes are also exported through uevents
  - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
  - document sysgenid ioctl major/minor numbers

v3 -> v4:

  - split functionality in two separate kernel modules: 
    1. drivers/misc/sysgenid.c which provides the generic userspace
       interface and mechanisms
    2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
       kernel entropy and acts as a driving backend for the generic
       sysgenid
  - rename /dev/vmgenid -> /dev/sysgenid
  - rename uapi header file vmgenid.h -> sysgenid.h
  - rename ioctls VMGENID_* -> SYSGENID_*
  - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
  - fix races in documentation examples

v2 -> v3:

  - separate the core driver logic and interface, from the ACPI device.
    The ACPI vmgenid device is now one possible backend
  - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
  - add locking to avoid races between fs ops handlers and hw irq
    driven generation updates
  - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
    outdated or a generation change happens while waiting (thus making
    current caller outdated), the ioctl returns -EINTR to signal the
    user to handle event and retry. Fixes blocking on oneself
  - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
    CAP_CHECKPOINT_RESTORE capability, through which software can force
    generation bump

v1 -> v2:

  - expose to userspace a monotonically increasing u32 Vm Gen Counter
    instead of the hw VmGen UUID
  - since the hw/hypervisor-provided 128-bit UUID is not public
    anymore, add it to the kernel RNG as device randomness
  - insert driver page containing Vm Gen Counter in the user vma in
    the driver's mmap handler instead of using a fault handler
  - turn driver into a misc device driver to auto-create /dev/vmgenid
  - change ioctl arg to avoid leaking kernel structs to userspace
  - update documentation

Adrian Catangiu (2):
  drivers/misc: sysgenid: add system generation id driver
  drivers/virt: vmgenid: add vm generation id driver

 Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 Documentation/virt/vmgenid.rst                     |  36 +++
 MAINTAINERS                                        |  15 +
 drivers/misc/Kconfig                               |  15 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
 drivers/virt/Kconfig                               |  13 +
 drivers/virt/Makefile                              |   1 +
 drivers/virt/vmgenid.c                             | 153 ++++++++++
 include/uapi/linux/sysgenid.h                      |  18 ++
 11 files changed, 804 insertions(+)
 create mode 100644 Documentation/misc-devices/sysgenid.rst
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/misc/sysgenid.c
 create mode 100644 drivers/virt/vmgenid.c
 create mode 100644 include/uapi/linux/sysgenid.h

-- 
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

* [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
  2021-02-24  8:47 ` Adrian Catangiu
@ 2021-02-24  8:47   ` Adrian Catangiu
  -1 siblings, 0 replies; 23+ messages in thread
From: Adrian Catangiu @ 2021-02-24  8:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390
  Cc: gregkh, graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mst, mhocko, rafael,
	pavel, mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer, Adrian Catangiu

- Background and problem

The System Generation ID feature is required in virtualized or
containerized environments by applications that work with local copies
or caches of world-unique data such as random values, UUIDs,
monotonically increasing counters, etc.
Such applications can be negatively affected by VM or container
snapshotting when the VM or container is either cloned or returned to
an earlier point in time.

Furthermore, simply finding out about a system generation change is
only the starting point of a process to renew internal states of
possibly multiple applications across the system. This process requires
a standard interface that applications can rely on and through which
orchestration can be easily done.

- Solution

The System Generation ID is meant to help in these scenarios by
providing a monotonically increasing u32 counter that changes each time
the VM or container is restored from a snapshot.

The `sysgenid` driver exposes a monotonic incremental System Generation
u32 counter via a char-dev filesystem interface accessible
through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
counter update notifications, as well as counter retrieval and
confirmation mechanisms.
The counter starts from zero when the driver is initialized and
monotonically increments every time the system generation changes.

Userspace applications or libraries can (a)synchronously consume the
system generation counter through the provided filesystem interface, to
make any necessary internal adjustments following a system generation
update.
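
The read/confirm cycle described above can be sketched in userspace C
as follows. This is a minimal illustration, not code from the patch:
the helper names `sysgenid_wait_new()` and `sysgenid_ack()` are
hypothetical, and the only assumption about the device is the
documented behaviour (readable while outdated, `read()` returns the
u32 counter, `write()` of the same u32 confirms it).

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for a new generation
 * counter on fd. Returns 1 and stores the counter in *gen when one is
 * available, 0 on timeout, -1 on error (errno set).
 */
int sysgenid_wait_new(int fd, int timeout_ms, uint32_t *gen)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;
	int ret;

	assert(gen != NULL);

	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return ret;	/* 0: timeout, -1: poll() error */
	if (!(pfd.revents & POLLIN)) {
		errno = EIO;
		return -1;
	}

	/* Partial reads are not allowed: always ask for the full u32. */
	n = read(fd, gen, sizeof(*gen));
	if (n < 0)
		return -1;
	if (n != (ssize_t)sizeof(*gen)) {
		errno = EIO;
		return -1;
	}
	return 1;
}

/*
 * Hypothetical helper: confirm generation `gen` back to the driver.
 * Returns 0 on success, -1 on error (errno set).
 */
int sysgenid_ack(int fd, uint32_t gen)
{
	ssize_t n = write(fd, &gen, sizeof(gen));

	if (n < 0)
		return -1;
	if (n != (ssize_t)sizeof(gen)) {
		errno = EIO;
		return -1;
	}
	return 0;
}
```

A consumer would open `/dev/sysgenid` read-write, call
`sysgenid_wait_new()` from its event loop, regenerate its
snapshot-sensitive state, and only then call `sysgenid_ack()` with the
value it read.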

The provided filesystem interface operations can be used to build a
system level safe workflow that guest software can follow to protect
itself from negative system snapshot effects.
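
For latency-sensitive libraries, the included documentation describes
a read-only, shared, single-page `mmap()` whose first 4 bytes hold the
current counter. A rough sketch of an in-line check built on that
description might look like the code below. The function names
`sysgenid_map()` and `sysgenid_changed()` are hypothetical and not part
of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first page of an open /dev/sysgenid
 * descriptor with PROT_READ, MAP_SHARED. Returns NULL on failure.
 */
const volatile uint32_t *sysgenid_map(int fd)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/*
 * Hypothetical helper: compare the live counter with the caller's
 * cached copy. Returns 1 and refreshes *cached when they differ,
 * 0 when the cached generation is still current.
 */
int sysgenid_changed(const volatile uint32_t *live, uint32_t *cached)
{
	uint32_t now = *live;

	if (now == *cached)
		return 0;
	*cached = now;
	return 1;
}
```

A PRNG could call `sysgenid_changed()` at the top of each request and
reseed when it returns 1. Such lazy users should leave watcher tracking
disabled, as the documentation explains.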

The `sysgenid` driver exports the `void sysgenid_bump_generation()`
symbol which can be used by backend drivers to drive system generation
changes based on hardware events.
System generation changes can also be driven by userspace software
through a dedicated driver ioctl.
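
Both paths end up in the same counter update. The arithmetic can be
modelled in plain userspace C as shown below. This is only a model of
the `max(min_gen, 1 + counter)` step in the driver's
`_bump_generation()`; the function name `next_generation()` is
hypothetical, and the locking, wakeups and mapped-page update done by
the driver are left out.

```c
/*
 * Userspace model (not driver code) of how the next generation value
 * is chosen: the counter always advances by at least one, and jumps
 * to min_gen when that is larger.
 */
int next_generation(int current, int min_gen)
{
	int bumped = current + 1;

	return min_gen > bumped ? min_gen : bumped;
}
```

With a current generation of 5, a backend-driven bump (min_gen of 0)
yields 6, while a userspace trigger with a minimum of 8 yields 8.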

**Please note**, SysGenID alone does not guarantee complete snapshot
safety to applications using it. A certain workflow needs to be
followed at the system level, in order to make the system
snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
section in the included documentation.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
 Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 MAINTAINERS                                        |   8 +
 drivers/misc/Kconfig                               |  15 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
 include/uapi/linux/sysgenid.h                      |  18 ++
 7 files changed, 594 insertions(+)
 create mode 100644 Documentation/misc-devices/sysgenid.rst
 create mode 100644 drivers/misc/sysgenid.c
 create mode 100644 include/uapi/linux/sysgenid.h

diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
new file mode 100644
index 0000000..0b8199b
--- /dev/null
+++ b/Documentation/misc-devices/sysgenid.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+SYSGENID
+========
+
+The System Generation ID feature is required in virtualized or
+containerized environments by applications that work with local copies
+or caches of world-unique data such as random values, UUIDs,
+monotonically increasing counters, etc.
+Such applications can be negatively affected by VM or container
+snapshotting when the VM or container is either cloned or returned to
+an earlier point in time.
+
+The System Generation ID is meant to help in these scenarios by
+providing a monotonically increasing counter that changes each time the
+VM or container is restored from a snapshot. The driver for it lives at
+``drivers/misc/sysgenid.c``.
+
+The ``sysgenid`` driver exposes a monotonic incremental System
+Generation u32 counter via a char-dev filesystem interface accessible
+through ``/dev/sysgenid`` that provides sync and async SysGen counter
+update notifications. It also provides SysGen counter retrieval and
+confirmation mechanisms.
+
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the system generation changes.
+
+The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
+symbol which can be used by backend drivers to drive system generation
+changes based on hardware events.
+System generation changes can also be driven by userspace software
+through a dedicated driver ioctl.
+
+Userspace applications or libraries can (a)synchronously consume the
+system generation counter through the provided filesystem interface, to
+make any necessary internal adjustments following a system generation
+update.
+
+**Please note**, SysGenID alone does not guarantee complete snapshot
+safety to applications using it. A certain workflow needs to be
+followed at the system level, in order to make the system
+snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
+section below.
+
+Driver filesystem interface
+===========================
+
+``open()``:
+  When the device is opened, a copy of the current SysGenID (counter)
+  is associated with the open file descriptor. Every open file
+  descriptor will have readable data available (EPOLLIN) while its
+  current copy of the SysGenID is outdated. Reading from the fd will
+  provide the latest SysGenID, while writing to the fd will update the
+  fd-local copy of the SysGenID and is used as a confirmation
+  mechanism.
+
+``read()``:
+  Read is meant to provide the *new* system generation counter when a
+  generation change takes place. The read operation blocks until the
+  associated counter is no longer up to date, at which point the new
+  counter is provided/returned.  Nonblocking ``read()`` returns
+  ``EAGAIN`` to signal that there is no *new* counter value available.
+  The generation counter is considered *new* for each open file
+  descriptor that hasn't confirmed the new value following a generation
+  change. Therefore, once a generation change takes place, all
+  ``read()`` calls will immediately return the new generation counter
+  and will continue to do so until the new value is confirmed back to
+  the driver through ``write()``.
+  Partial reads are not allowed - read buffer needs to be at least
+  32 bits in size.
+
+``write()``:
+  Write is used to confirm the up-to-date SysGenID counter back to the
+  driver.
+  Following a VM generation change, all existing watchers are marked
+  as *outdated*. Each file descriptor will maintain the *outdated*
+  status until a ``write()`` containing the new up-to-date generation
+  counter is used as an update confirmation mechanism.
+  Partial writes are not allowed - write buffer should be exactly
+  32 bits in size.
+
+``poll()``:
+  Poll is implemented to allow polling for generation counter updates.
+  Such updates result in ``EPOLLIN`` polling status until the new
+  up-to-date counter is confirmed back to the driver through a
+  ``write()``.
+
+``ioctl()``:
+  The driver also adds support for waiting on open file descriptors
+  that haven't acknowledged a generation counter update, as well as a
+  mechanism for userspace to *trigger* a generation update:
+
+  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
+    status for current file descriptor. When watcher tracking is
+    enabled, the driver tracks this file descriptor as an independent
+    *watcher*. The driver keeps accounting of how many watchers have
+    confirmed the latest Sys-Gen-Id counter and how many of them are
+    *outdated*; an outdated watcher is a *tracked* open file descriptor
+    that has lived through a Sys-Gen-Id change but has not yet confirmed
+    the new generation counter.
+    Software that wants to be waited on by the system while it adjusts
+    to generation changes, should turn tracking on. The sysgenid driver
+    then keeps track of it and can block system-level adjustment process
+    until the software has finished adjusting and confirmed it through a
+    ``write()``.
+    Tracking is disabled by default and file descriptors need to
+    explicitly opt-in using this IOCTL.
+  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+    tracked watchers or, if a ``timeout`` argument is provided, until
+    the timeout expires.
+    If the current caller is *outdated* or a generation change happens
+    while waiting (thus making current caller *outdated*), the ioctl
+    returns ``-EINTR`` to signal the user to handle event and retry.
+  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
+    It takes a ``minimum-generation`` argument which represents the
+    minimum value the generation counter will be set to. For example if
+    current generation is ``5`` and ``SYSGENID_TRIGGER_GEN_UPDATE(8)``
+    is called, the generation counter will increment to ``8``.
+    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
+    or CAP_SYS_ADMIN capabilities.
+
+``mmap()``:
+  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+  in size. The first 4 bytes of the mapped page will contain an
+  up-to-date u32 copy of the system generation counter.
+  The mapped memory can be used as a low-latency generation counter
+  probe mechanism in critical sections.
+  The mmap() interface is targeted at libraries or code that needs to
+  check for generation changes in-line, where an event loop is not
+  available or read()/write() syscalls are too expensive.
+  In such cases, logic can be added in-line with the sensitive code to
+  check and trigger on-demand/just-in-time readjustments when changes
+  are detected on the memory mapped generation counter.
+  Users of this interface that plan to lazily adjust should not enable
+  watcher tracking, since waiting on them doesn't make sense.
+
+``close()``:
+  Removes the file descriptor as a system generation counter *watcher*.
+
+Snapshot Safety Prerequisites
+=============================
+
+If VM, container or other system-level snapshots happen asynchronously,
+at arbitrary times during an active workload, there is no practical way
+to ensure that in-flight local copies or caches of world-unique data
+such as random values, secrets, UUIDs, etc. are properly scrubbed and
+regenerated.
+The challenge stems from the fact that the categorization of data as
+snapshot-sensitive is only known to the software working with it, and
+this software has no logical control over the moment in time when an
+external system snapshot occurs.
+
+Let's take an OpenSSL session token for example. Even if the library
+code is made 100% snapshot-safe, meaning the library guarantees that
+the session token is unique (any snapshot that happened during the
+library call did not duplicate or leak the token), the token is still
+vulnerable to snapshot events while it transits the various layers of
+the library caller, then the various layers of the OS before leaving
+the system.
+
+To catch a secret while it's in-flight, we'd have to validate system
+generation at every layer, every step of the way. Even if that would
+be deemed the right solution, it would be a long road and a whole
+universe to patch before we get there.
+
+Bottom line is we don't have a way to track all of these in-flight
+secrets and dynamically scrub them from existence with snapshot
+events happening arbitrarily.
+
+Simplifying assumption - safety prerequisite
+--------------------------------------------
+
+**Control the snapshot flow**, disallow snapshots coming at arbitrary
+moments in the workload lifetime.
+
+Use a system-level overseer entity that quiesces the system before the
+snapshot and, after resuming from it, verifies that software components
+have readjusted to the new environment and the new generation. Only then
+will the overseer un-quiesce the system and allow active workloads.
+
+Software components can choose whether they want to be tracked and
+waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
+IOCTL.
+
+The sysgenid framework standardizes the API for system software to
+find out about needing to readjust and at the same time provides a
+mechanism for the overseer entity to wait for everyone to be done, the
+system to have readjusted, so it can un-quiesce.
+
+Example snapshot-safe workflow
+------------------------------
+
+1) Before taking a snapshot, quiesce the VM/container/system. Exactly
+   how this is achieved is very workload-specific, but the general
+   description is to get all software to an expected state where their
+   event loops dry up and they are effectively quiesced.
+2) Take snapshot.
+3) Resume the VM/container/system from said snapshot.
+4) SysGenID counter will either automatically increment if there is
+   a vmgenid backend (hw-driven), or overseer will trigger generation
+   bump using ``SYSGENID_TRIGGER_GEN_UPDATE`` IOCTL (sw-driven).
+5) Software components which have ``/dev/sysgenid`` in their event
+   loops (either using ``poll()`` or ``read()``) are notified of the
+   generation change.
+   They do their specific internal adjustments. Some may have requested
+   to be tracked and waited on by the overseer, others might choose to
+   do their adjustments out of band and not block the overseer.
+   Tracked ones *must* signal when they are done/ready with a ``write()``
+   while the rest *should* also do so for cleanliness, but it's not
+   mandatory.
+6) Overseer will block and wait for all tracked watchers by using the
+   ``SYSGENID_WAIT_WATCHERS`` IOCTL. Once all tracked watchers are done
+   in step 5, the overseer will return from this blocking ioctl knowing
+   that the system has readjusted and is ready for active workload.
+7) Overseer un-quiesces system.
+8) There is a class of software, usually libraries, most notably PRNGs
+   or SSLs, that don't fit the event-loop model and also have strict
+   latency requirements. These can take advantage of the ``mmap()``
+   interface and lazily adjust on-demand whenever they are called after
+   un-quiesce.
+   For a well-designed service stack, these libraries should not be
+   called while system is quiesced. When workload is resumed by the
+   overseer, on the first call into these libs, they will safely JIT
+   readjust.
+   Users of this lazy on-demand readjustment model should not enable
+   watcher tracking since doing so would introduce a logical deadlock:
+   lazy adjustments happen only after un-quiesce, but un-quiesce is
+   blocked until all tracked watchers are up-to-date.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index d02ba2f..39f9482 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -357,6 +357,7 @@ Code  Seq#    Include File                                           Comments
 0xDB  00-0F  drivers/char/mwave/mwavepub.h
 0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
                                                                      <mailto:aherrman@de.ibm.com>
+0xE4  01-03  uapi/linux/sysgenid.h                                   SysGenID misc driver
 0xE5  00-3F  linux/fuse.h
 0xEC  00-01  drivers/platform/chrome/cros_ec_dev.h                   ChromeOS EC driver
 0xF3  00-3F  drivers/usb/misc/sisusbvga/sisusb.h                     sisfb (in development)
diff --git a/MAINTAINERS b/MAINTAINERS
index 1d75afa..b812dad8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17261,6 +17261,14 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSGENID
+M:	Adrian Catangiu <acatan@amazon.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	Documentation/misc-devices/sysgenid.rst
+F:	drivers/misc/sysgenid.c
+F:	include/uapi/linux/sysgenid.h
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index fafa8b0..a2b7cae 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -456,6 +456,21 @@ config PVPANIC
 	  a paravirtualized device provided by QEMU; it lets a virtual machine
 	  (guest) communicate panic events to the host.
 
+config SYSGENID
+	tristate "System Generation ID driver"
+	help
+	  This is a System Generation ID driver which provides a system
+	  generation counter. The driver exposes FS ops on /dev/sysgenid
+	  through which it can provide information and notifications on system
+	  generation changes that happen because of VM or container snapshots
+	  or cloning.
+	  This enables applications and libraries that store or cache
+	  sensitive information, to know that they need to regenerate it
+	  after process memory has been exposed to potential copying.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called sysgenid.
+
 config HISI_HIKEY_USB
 	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
 	depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d23231e..4b4933d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
 obj-$(CONFIG_HISI_HIKEY_USB)	+= hisi_hikey_usb.o
+obj-$(CONFIG_SYSGENID)		+= sysgenid.o
diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
new file mode 100644
index 0000000..ace292b
--- /dev/null
+++ b/drivers/misc/sysgenid.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * System Generation ID driver
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ *	Authors:
+ *	  Adrian Catangiu <acatan@amazon.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+struct sysgenid_data {
+	unsigned long		map_buf;
+	wait_queue_head_t	read_waitq;
+	atomic_t		generation_counter;
+
+	unsigned int		watchers;
+	atomic_t		outdated_watchers;
+	wait_queue_head_t	outdated_waitq;
+	spinlock_t		lock;
+};
+static struct sysgenid_data sysgenid_data;
+
+struct file_data {
+	bool tracked_watcher;
+	int acked_gen_counter;
+};
+
+static int equals_gen_counter(unsigned int counter)
+{
+	return counter == atomic_read(&sysgenid_data.generation_counter);
+}
+
+static void _bump_generation(int min_gen)
+{
+	unsigned long flags;
+	int counter;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
+	atomic_set(&sysgenid_data.generation_counter, counter);
+	*((int *) sysgenid_data.map_buf) = counter;
+	atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
+
+	wake_up_interruptible(&sysgenid_data.read_waitq);
+	wake_up_interruptible(&sysgenid_data.outdated_waitq);
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+}
+
+void sysgenid_bump_generation(void)
+{
+	_bump_generation(0);
+}
+EXPORT_SYMBOL_GPL(sysgenid_bump_generation);
+
+static void put_outdated_watchers(void)
+{
+	if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
+		wake_up_interruptible(&sysgenid_data.outdated_waitq);
+}
+
+static void start_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (!fdata->tracked_watcher) {
+		/* enable tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+		fdata->tracked_watcher = 1;
+		++sysgenid_data.watchers;
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			atomic_inc(&sysgenid_data.outdated_watchers);
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static void stop_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (fdata->tracked_watcher) {
+		/* stop tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			put_outdated_watchers();
+		--sysgenid_data.watchers;
+		fdata->tracked_watcher = 0;
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static int sysgenid_open(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+	if (!fdata)
+		return -ENOMEM;
+	fdata->tracked_watcher = 0;
+	fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	file->private_data = fdata;
+
+	return 0;
+}
+
+static int sysgenid_close(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = file->private_data;
+
+	stop_fd_tracking(fdata);
+	kfree(fdata);
+
+	return 0;
+}
+
+static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
+		size_t nbytes, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	ssize_t ret;
+	int gen_counter;
+
+	if (nbytes == 0)
+		return 0;
+	/* disallow partial reads */
+	if (nbytes < sizeof(gen_counter))
+		return -EINVAL;
+
+	if (equals_gen_counter(fdata->acked_gen_counter)) {
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+		ret = wait_event_interruptible(
+			sysgenid_data.read_waitq,
+			!equals_gen_counter(fdata->acked_gen_counter)
+		);
+		if (ret)
+			return ret;
+	}
+
+	gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+	if (ret)
+		return -EFAULT;
+
+	return sizeof(gen_counter);
+}
+
+static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	unsigned int new_acked_gen;
+	unsigned long flags;
+
+	/* disallow partial writes */
+	if (count != sizeof(new_acked_gen))
+		return -ENOBUFS;
+	if (copy_from_user(&new_acked_gen, ubuf, count))
+		return -EFAULT;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	/* wrong gen-counter acknowledged */
+	if (!equals_gen_counter(new_acked_gen)) {
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+		return -EINVAL;
+	}
+	/* update acked gen-counter if necessary */
+	if (!equals_gen_counter(fdata->acked_gen_counter)) {
+		fdata->acked_gen_counter = new_acked_gen;
+		if (fdata->tracked_watcher)
+			put_outdated_watchers();
+	}
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+	return (ssize_t)count;
+}
+
+static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
+{
+	__poll_t mask = 0;
+	struct file_data *fdata = file->private_data;
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		return EPOLLIN | EPOLLRDNORM;
+
+	poll_wait(file, &sysgenid_data.read_waitq, wait);
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		mask = EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static long sysgenid_ioctl(struct file *file,
+		unsigned int cmd, unsigned long arg)
+{
+	struct file_data *fdata = file->private_data;
+	bool tracking = !!arg;
+	unsigned long timeout_ns, min_gen;
+	ktime_t until;
+	int ret = 0;
+
+	switch (cmd) {
+	case SYSGENID_SET_WATCHER_TRACKING:
+		if (tracking)
+			start_fd_tracking(fdata);
+		else
+			stop_fd_tracking(fdata);
+		break;
+	case SYSGENID_WAIT_WATCHERS:
+		timeout_ns = arg * NSEC_PER_MSEC;
+		until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+		ret = wait_event_interruptible_hrtimeout(
+			sysgenid_data.outdated_waitq,
+			(!atomic_read(&sysgenid_data.outdated_watchers) ||
+					!equals_gen_counter(fdata->acked_gen_counter)),
+			until
+		);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			ret = -EINTR;
+		break;
+	case SYSGENID_TRIGGER_GEN_UPDATE:
+		if (!checkpoint_restore_ns_capable(current_user_ns()))
+			return -EACCES;
+		min_gen = arg;
+		_bump_generation(min_gen);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file_data *fdata = file->private_data;
+
+	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+		return -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) != 0)
+		return -EPERM;
+
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags &= ~VM_MAYWRITE;
+	vma->vm_private_data = fdata;
+
+	return vm_insert_page(vma, vma->vm_start,
+			virt_to_page(sysgenid_data.map_buf));
+}
+
+static const struct file_operations fops = {
+	.owner		= THIS_MODULE,
+	.mmap		= sysgenid_mmap,
+	.open		= sysgenid_open,
+	.release	= sysgenid_close,
+	.read		= sysgenid_read,
+	.write		= sysgenid_write,
+	.poll		= sysgenid_poll,
+	.unlocked_ioctl	= sysgenid_ioctl,
+};
+
+static struct miscdevice sysgenid_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sysgenid",
+	.fops = &fops,
+};
+
+static int __init sysgenid_init(void)
+{
+	int ret;
+
+	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
+	if (!sysgenid_data.map_buf)
+		return -ENOMEM;
+
+	atomic_set(&sysgenid_data.generation_counter, 0);
+	atomic_set(&sysgenid_data.outdated_watchers, 0);
+	init_waitqueue_head(&sysgenid_data.read_waitq);
+	init_waitqueue_head(&sysgenid_data.outdated_waitq);
+	spin_lock_init(&sysgenid_data.lock);
+
+	ret = misc_register(&sysgenid_misc);
+	if (ret < 0) {
+		pr_err("misc_register() failed for sysgenid\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+
+	return ret;
+}
+
+static void __exit sysgenid_exit(void)
+{
+	misc_deregister(&sysgenid_misc);
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+}
+
+module_init(sysgenid_init);
+module_exit(sysgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("System Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
new file mode 100644
index 0000000..7279df6
--- /dev/null
+++ b/include/uapi/linux/sysgenid.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_SYSGENID_H
+#define _UAPI_LINUX_SYSGENID_H
+
+#include <linux/ioctl.h>
+
+#define SYSGENID_IOCTL			0xE4
+#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
+#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
+#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)
+
+#ifdef __KERNEL__
+void sysgenid_bump_generation(void);
+#endif /* __KERNEL__ */
+
+#endif /* _UAPI_LINUX_SYSGENID_H */
+
-- 
2.7.4




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.



* [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
@ 2021-02-24  8:47   ` Adrian Catangiu
  0 siblings, 0 replies; 23+ messages in thread
From: Adrian Catangiu @ 2021-02-24  8:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390
  Cc: gregkh, graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mst, mhocko, rafael,
	pavel, mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer, Adrian Catangiu

- Background and problem

The System Generation ID feature is required in virtualized or
containerized environments by applications that work with local copies
or caches of world-unique data such as random values, UUIDs,
monotonically increasing counters, etc.
Such applications can be negatively affected by VM or container
snapshotting when the VM or container is either cloned or returned to
an earlier point in time.

Furthermore, simply finding out about a system generation change is
only the starting point of a process to renew internal states of
possibly multiple applications across the system. This process requires
a standard interface that applications can rely on and through which
orchestration can be easily done.

- Solution

The System Generation ID is meant to help in these scenarios by
providing a monotonically increasing u32 counter that changes each time
the VM or container is restored from a snapshot.

The `sysgenid` driver exposes a monotonic incremental System Generation
u32 counter via a char-dev filesystem interface accessible
through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
counter update notifications, as well as counter retrieval and
confirmation mechanisms.
The counter starts from zero when the driver is initialized and
monotonically increments every time the system generation changes.

Userspace applications or libraries can (a)synchronously consume the
system generation counter through the provided filesystem interface, to
make any necessary internal adjustments following a system generation
update.

The provided filesystem interface operations can be used to build a
system level safe workflow that guest software can follow to protect
itself from negative system snapshot effects.

The `sysgenid` driver exports the `void sysgenid_bump_generation()`
symbol which can be used by backend drivers to drive system generation
changes based on hardware events.
System generation changes can also be driven by userspace software
through a dedicated driver ioctl.
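
Both paths end up in the same update rule: the counter advances by at
least one and is never left below the requested minimum (the backend
path requests a minimum of 0). The following userspace model is only a
sketch of that rule; the function name is hypothetical, and the driver
itself operates on an `int` while holding its spinlock.

```c
#include <stdint.h>

/*
 * Userspace model of the update rule in the driver's _bump_generation():
 * new counter = max(min_gen, current + 1).
 * Hypothetical name; not an interface provided by the patch.
 */
static uint32_t sysgenid_next_generation(uint32_t current, uint32_t min_gen)
{
	uint32_t next = current + 1;

	return min_gen > next ? min_gen : next;
}
```

With a current generation of 5, requesting a minimum of 8 yields 8,
while requesting 0 or 3 yields 6.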

**Please note**, SysGenID alone does not guarantee complete snapshot
safety to applications using it. A certain workflow needs to be
followed at the system level, in order to make the system
snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
section in the included documentation.
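
For the software-driven case, the overseer side of such a workflow
might look like the sketch below. It is an assumption-laden example,
not part of the patch: `overseer_post_resume()` is a hypothetical
helper, `fd` is assumed to be a `/dev/sysgenid` descriptor opened
before the snapshot, and the ioctl numbers are copied from the uapi
header added by this patch.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* ioctl numbers as defined in include/uapi/linux/sysgenid.h (this patch) */
#define SYSGENID_IOCTL			0xE4
#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)

/*
 * Hypothetical overseer step, run right after resuming from a snapshot:
 * bump the generation, confirm it on our own fd, then wait for all
 * tracked watchers. timeout_ms == 0 waits without a timeout.
 * Returns 0 on success, -1 on failure.
 */
static int overseer_post_resume(int fd, unsigned long min_gen,
				unsigned long timeout_ms)
{
	unsigned int gen;

	/* Needs CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN. */
	if (ioctl(fd, SYSGENID_TRIGGER_GEN_UPDATE, min_gen) < 0)
		return -1;

	/*
	 * Our own fd is now outdated. Read and confirm the new counter,
	 * otherwise SYSGENID_WAIT_WATCHERS fails with EINTR.
	 */
	if (read(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -1;
	if (write(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -1;

	/* Block until no tracked watcher is outdated any more. */
	if (ioctl(fd, SYSGENID_WAIT_WATCHERS, timeout_ms) < 0)
		return -1;

	return 0;
}
```

Only once this returns 0 would the overseer un-quiesce the system. An
`EINTR` failure from the wait means another generation change arrived
and the sequence should be retried.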

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
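Note for reviewers: the mmap() interface described in the included
documentation is meant for in-line checks from latency-sensitive
library code. A minimal sketch of such a probe follows; the helper
names `sysgenid_map()` and `sysgenid_changed()` are hypothetical, and
`fd` is assumed to be an open `/dev/sysgenid` descriptor.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the single generation page read-only and
 * shared. Returns NULL on failure.
 */
static const volatile uint32_t *sysgenid_map(int fd)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/*
 * Compare the mapped counter (first 4 bytes of the page) with the
 * caller's cached copy. Returns 1 and refreshes *cached when they
 * differ, 0 otherwise.
 */
static int sysgenid_changed(const volatile uint32_t *map, uint32_t *cached)
{
	uint32_t now = *map;

	if (now == *cached)
		return 0;
	*cached = now;
	return 1;
}
```

A PRNG could call `sysgenid_changed()` at the top of every request and
reseed when it returns 1. Such lazy users should leave watcher tracking
disabled.
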
 Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 MAINTAINERS                                        |   8 +
 drivers/misc/Kconfig                               |  15 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
 include/uapi/linux/sysgenid.h                      |  18 ++
 7 files changed, 594 insertions(+)
 create mode 100644 Documentation/misc-devices/sysgenid.rst
 create mode 100644 drivers/misc/sysgenid.c
 create mode 100644 include/uapi/linux/sysgenid.h

diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
new file mode 100644
index 0000000..0b8199b
--- /dev/null
+++ b/Documentation/misc-devices/sysgenid.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+SYSGENID
+========
+
+The System Generation ID feature is required in virtualized or
+containerized environments by applications that work with local copies
+or caches of world-unique data such as random values, UUIDs,
+monotonically increasing counters, etc.
+Such applications can be negatively affected by VM or container
+snapshotting when the VM or container is either cloned or returned to
+an earlier point in time.
+
+The System Generation ID is meant to help in these scenarios by
+providing a monotonically increasing counter that changes each time the
+VM or container is restored from a snapshot. The driver for it lives at
+``drivers/misc/sysgenid.c``.
+
+The ``sysgenid`` driver exposes a monotonic incremental System
+Generation u32 counter via a char-dev filesystem interface accessible
+through ``/dev/sysgenid`` that provides sync and async SysGen counter
+update notifications. It also provides SysGen counter retrieval and
+confirmation mechanisms.
+
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the system generation changes.
+
+The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
+symbol which can be used by backend drivers to drive system generation
+changes based on hardware events.
+System generation changes can also be driven by userspace software
+through a dedicated driver ioctl.
+
+Userspace applications or libraries can (a)synchronously consume the
+system generation counter through the provided filesystem interface, to
+make any necessary internal adjustments following a system generation
+update.
+
+**Please note**, SysGenID alone does not guarantee complete snapshot
+safety to applications using it. A certain workflow needs to be
+followed at the system level, in order to make the system
+snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
+section below.
+
+Driver filesystem interface
+===========================
+
+``open()``:
+  When the device is opened, a copy of the current SysGenID (counter)
+  is associated with the open file descriptor. Every open file
+  descriptor will have readable data available (EPOLLIN) while its
+  current copy of the SysGenID is outdated. Reading from the fd will
+  provide the latest SysGenID, while writing to the fd will update the
+  fd-local copy of the SysGenID and is used as a confirmation
+  mechanism.
+
+``read()``:
+  Read is meant to provide the *new* system generation counter when a
+  generation change takes place. The read operation blocks until the
+  associated counter is no longer up to date, at which point the new
+  counter is provided/returned.  Nonblocking ``read()`` returns
+  ``EAGAIN`` to signal that there is no *new* counter value available.
+  The generation counter is considered *new* for each open file
+  descriptor that hasn't confirmed the new value following a generation
+  change. Therefore, once a generation change takes place, all
+  ``read()`` calls will immediately return the new generation counter
+  and will continue to do so until the new value is confirmed back to
+  the driver through ``write()``.
+  Partial reads are not allowed - read buffer needs to be at least
+  32 bits in size.
+
+``write()``:
+  Write is used to confirm the up-to-date SysGenID counter back to the
+  driver.
+  Following a VM generation change, all existing watchers are marked
+  as *outdated*. Each file descriptor will maintain the *outdated*
+  status until a ``write()`` containing the new up-to-date generation
+  counter is used as an update confirmation mechanism.
+  Partial writes are not allowed - write buffer should be exactly
+  32 bits in size.
+
+``poll()``:
+  Poll is implemented to allow polling for generation counter updates.
+  Such updates result in ``EPOLLIN`` polling status until the new
+  up-to-date counter is confirmed back to the driver through a
+  ``write()``.
+
+``ioctl()``:
+  The driver also adds support for waiting on open file descriptors
+  that haven't acknowledged a generation counter update, as well as a
+  mechanism for userspace to *trigger* a generation update:
+
+  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
+    status for current file descriptor. When watcher tracking is
+    enabled, the driver tracks this file descriptor as an independent
+    *watcher*. The driver keeps accounting of how many watchers have
+    confirmed the latest SysGenID counter and how many of them are
+    *outdated*; an outdated watcher is a *tracked* open file descriptor
+    that has lived through a SysGenID change but has not yet confirmed
+    the new generation counter.
+    Software that wants the system to wait for it while it adjusts to
+    generation changes should turn tracking on. The sysgenid driver
+    then keeps track of it and can block the system-level adjustment
+    process until the software has finished adjusting and confirmed it
+    through a ``write()``.
+    Tracking is disabled by default and file descriptors need to
+    explicitly opt-in using this IOCTL.
+  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+    tracked watchers or, if a ``timeout`` argument is provided, until
+    the timeout expires.
+    If the current caller is *outdated* or a generation change happens
+    while waiting (thus making current caller *outdated*), the ioctl
+    returns ``-EINTR`` to signal the user to handle the event and retry.
+  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
+    It takes a ``minimum-generation`` argument which represents the
+    minimum value the generation counter will be set to. For example if
+    current generation is ``5`` and ``SYSGENID_TRIGGER_GEN_UPDATE(8)``
+    is called, the generation counter will increment to ``8``.
+    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
+    or CAP_SYS_ADMIN capabilities.
+
+``mmap()``:
+  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+  in size. The first 4 bytes of the mapped page will contain an
+  up-to-date u32 copy of the system generation counter.
+  The mapped memory can be used as a low-latency generation counter
+  probe mechanism in critical sections.
+  The mmap() interface is targeted at libraries or code that needs to
+  check for generation changes in-line, where an event loop is not
+  available or read()/write() syscalls are too expensive.
+  In such cases, logic can be added in-line with the sensitive code to
+  check and trigger on-demand/just-in-time readjustments when changes
+  are detected on the memory mapped generation counter.
+  Users of this interface that plan to lazily adjust should not enable
+  watcher tracking, since waiting on them doesn't make sense.
+
+``close()``:
+  Removes the file descriptor as a system generation counter *watcher*.
+
+Snapshot Safety Prerequisites
+=============================
+
+If VM, container or other system-level snapshots happen asynchronously,
+at arbitrary times during an active workload, there is no practical way
+to ensure that in-flight local copies or caches of world-unique data
+such as random values, secrets, UUIDs, etc. are properly scrubbed and
+regenerated.
+The challenge stems from the fact that the categorization of data as
+snapshot-sensitive is only known to the software working with it, and
+this software has no logical control over the moment in time when an
+external system snapshot occurs.
+
+Let's take an OpenSSL session token for example. Even if the library
+code is made 100% snapshot-safe, meaning the library guarantees that
+the session token is unique (any snapshot that happened during the
+library call did not duplicate or leak the token), the token is still
+vulnerable to snapshot events while it transits the various layers of
+the library caller, then the various layers of the OS before leaving
+the system.
+
+To catch a secret while it's in-flight, we'd have to validate system
+generation at every layer, every step of the way. Even if that would
+be deemed the right solution, it would be a long road and a whole
+universe to patch before we get there.
+
+Bottom line is we don't have a way to track all of these in-flight
+secrets and dynamically scrub them from existence with snapshot
+events happening arbitrarily.
+
+Simplifying assumption - safety prerequisite
+--------------------------------------------
+
+**Control the snapshot flow**, disallow snapshots coming at arbitrary
+moments in the workload lifetime.
+
+Use a system-level overseer entity that quiesces the system before the
+snapshot and, after resuming from it, verifies that software components
+have readjusted to the new environment and generation. Only then does
+the overseer un-quiesce the system and allow active workloads.
+
+Software components can choose whether they want to be tracked and
+waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
+IOCTL.
+
+The sysgenid framework standardizes the API through which system
+software finds out that it needs to readjust, and at the same time
+provides a mechanism for the overseer entity to wait until every
+tracked component has readjusted, so it can un-quiesce the system.
+
+Example snapshot-safe workflow
+------------------------------
+
+1) Before taking a snapshot, quiesce the VM/container/system. Exactly
+   how this is achieved is very workload-specific, but the general
+   description is to get all software to an expected state where their
+   event loops dry up and they are effectively quiesced.
+2) Take snapshot.
+3) Resume the VM/container/system from said snapshot.
+4) SysGenID counter will either automatically increment if there is
+   a vmgenid backend (hw-driven), or overseer will trigger generation
+   bump using ``SYSGENID_TRIGGER_GEN_UPDATE`` IOCTL (sw-driven).
+5) Software components which have ``/dev/sysgenid`` in their event
+   loops (either using ``poll()`` or ``read()``) are notified of the
+   generation change.
+   They do their specific internal adjustments. Some may have requested
+   to be tracked and waited on by the overseer, others might choose to
+   do their adjustments out of band and not block the overseer.
+   Tracked ones *must* signal when they are done/ready with a ``write()``
+   while the rest *should* also do so for cleanliness, but it's not
+   mandatory.
+6) Overseer will block and wait for all tracked watchers by using the
+   ``SYSGENID_WAIT_WATCHERS`` IOCTL. Once all tracked watchers are done
+   in step 5, the overseer returns from this blocking ioctl knowing
+   that the system has readjusted and is ready for active workload.
+7) Overseer un-quiesces system.
+8) There is a class of software, usually libraries, most notably PRNGs
+   or SSLs, that don't fit the event-loop model and also have strict
+   latency requirements. These can take advantage of the ``mmap()``
+   interface and lazily adjust on-demand whenever they are called after
+   un-quiesce.
+   For a well-designed service stack, these libraries should not be
+   called while system is quiesced. When workload is resumed by the
+   overseer, on the first call into these libs, they will safely JIT
+   readjust.
+   Users of this lazy on-demand readjustment model should not enable
+   watcher tracking since doing so would introduce a logical deadlock:
+   lazy adjustments happen only after un-quiesce, but un-quiesce is
+   blocked until all tracked watchers are up-to-date.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index d02ba2f..39f9482 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -357,6 +357,7 @@ Code  Seq#    Include File                                           Comments
 0xDB  00-0F  drivers/char/mwave/mwavepub.h
 0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
                                                                      <mailto:aherrman@de.ibm.com>
+0xE4  01-03  uapi/linux/sysgenid.h                                   SysGenID misc driver
 0xE5  00-3F  linux/fuse.h
 0xEC  00-01  drivers/platform/chrome/cros_ec_dev.h                   ChromeOS EC driver
 0xF3  00-3F  drivers/usb/misc/sisusbvga/sisusb.h                     sisfb (in development)
diff --git a/MAINTAINERS b/MAINTAINERS
index 1d75afa..b812dad8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17261,6 +17261,14 @@ L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSGENID
+M:	Adrian Catangiu <acatan@amazon.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	Documentation/misc-devices/sysgenid.rst
+F:	drivers/misc/sysgenid.c
+F:	include/uapi/linux/sysgenid.h
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index fafa8b0..a2b7cae 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -456,6 +456,21 @@ config PVPANIC
 	  a paravirtualized device provided by QEMU; it lets a virtual machine
 	  (guest) communicate panic events to the host.
 
+config SYSGENID
+	tristate "System Generation ID driver"
+	help
+	  This is a System Generation ID driver which provides a system
+	  generation counter. The driver exposes FS ops on /dev/sysgenid
+	  through which it can provide information and notifications on system
+	  generation changes that happen because of VM or container snapshots
+	  or cloning.
+	  This enables applications and libraries that store or cache
+	  sensitive information, to know that they need to regenerate it
+	  after process memory has been exposed to potential copying.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called sysgenid.
+
 config HISI_HIKEY_USB
 	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
 	depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d23231e..4b4933d 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
 obj-$(CONFIG_HISI_HIKEY_USB)	+= hisi_hikey_usb.o
+obj-$(CONFIG_SYSGENID)		+= sysgenid.o
diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
new file mode 100644
index 0000000..ace292b
--- /dev/null
+++ b/drivers/misc/sysgenid.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * System Generation ID driver
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ *	Authors:
+ *	  Adrian Catangiu <acatan@amazon.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+struct sysgenid_data {
+	unsigned long		map_buf;
+	wait_queue_head_t	read_waitq;
+	atomic_t		generation_counter;
+
+	unsigned int		watchers;
+	atomic_t		outdated_watchers;
+	wait_queue_head_t	outdated_waitq;
+	spinlock_t		lock;
+};
+static struct sysgenid_data sysgenid_data;
+
+struct file_data {
+	bool tracked_watcher;
+	int acked_gen_counter;
+};
+
+static int equals_gen_counter(unsigned int counter)
+{
+	return counter == atomic_read(&sysgenid_data.generation_counter);
+}
+
+static void _bump_generation(int min_gen)
+{
+	unsigned long flags;
+	int counter;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
+	atomic_set(&sysgenid_data.generation_counter, counter);
+	*((int *) sysgenid_data.map_buf) = counter;
+	atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
+
+	wake_up_interruptible(&sysgenid_data.read_waitq);
+	wake_up_interruptible(&sysgenid_data.outdated_waitq);
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+}
+
+void sysgenid_bump_generation(void)
+{
+	_bump_generation(0);
+}
+EXPORT_SYMBOL_GPL(sysgenid_bump_generation);
+
+static void put_outdated_watchers(void)
+{
+	if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
+		wake_up_interruptible(&sysgenid_data.outdated_waitq);
+}
+
+static void start_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (!fdata->tracked_watcher) {
+		/* enable tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+		fdata->tracked_watcher = 1;
+		++sysgenid_data.watchers;
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			atomic_inc(&sysgenid_data.outdated_watchers);
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static void stop_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (fdata->tracked_watcher) {
+		/* stop tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			put_outdated_watchers();
+		--sysgenid_data.watchers;
+		fdata->tracked_watcher = 0;
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static int sysgenid_open(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+	if (!fdata)
+		return -ENOMEM;
+	fdata->tracked_watcher = 0;
+	fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	file->private_data = fdata;
+
+	return 0;
+}
+
+static int sysgenid_close(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = file->private_data;
+
+	stop_fd_tracking(fdata);
+	kfree(fdata);
+
+	return 0;
+}
+
+static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
+		size_t nbytes, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	ssize_t ret;
+	int gen_counter;
+
+	if (nbytes == 0)
+		return 0;
+	/* disallow partial reads */
+	if (nbytes < sizeof(gen_counter))
+		return -EINVAL;
+
+	if (equals_gen_counter(fdata->acked_gen_counter)) {
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+		ret = wait_event_interruptible(
+			sysgenid_data.read_waitq,
+			!equals_gen_counter(fdata->acked_gen_counter)
+		);
+		if (ret)
+			return ret;
+	}
+
+	gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+	if (ret)
+		return -EFAULT;
+
+	return sizeof(gen_counter);
+}
+
+static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	unsigned int new_acked_gen;
+	unsigned long flags;
+
+	/* disallow partial writes */
+	if (count != sizeof(new_acked_gen))
+		return -ENOBUFS;
+	if (copy_from_user(&new_acked_gen, ubuf, count))
+		return -EFAULT;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	/* wrong gen-counter acknowledged */
+	if (!equals_gen_counter(new_acked_gen)) {
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+		return -EINVAL;
+	}
+	/* update acked gen-counter if necessary */
+	if (!equals_gen_counter(fdata->acked_gen_counter)) {
+		fdata->acked_gen_counter = new_acked_gen;
+		if (fdata->tracked_watcher)
+			put_outdated_watchers();
+	}
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+	return (ssize_t)count;
+}
+
+static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
+{
+	__poll_t mask = 0;
+	struct file_data *fdata = file->private_data;
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		return EPOLLIN | EPOLLRDNORM;
+
+	poll_wait(file, &sysgenid_data.read_waitq, wait);
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		mask = EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static long sysgenid_ioctl(struct file *file,
+		unsigned int cmd, unsigned long arg)
+{
+	struct file_data *fdata = file->private_data;
+	bool tracking = !!arg;
+	unsigned long timeout_ns, min_gen;
+	ktime_t until;
+	int ret = 0;
+
+	switch (cmd) {
+	case SYSGENID_SET_WATCHER_TRACKING:
+		if (tracking)
+			start_fd_tracking(fdata);
+		else
+			stop_fd_tracking(fdata);
+		break;
+	case SYSGENID_WAIT_WATCHERS:
+		timeout_ns = arg * NSEC_PER_MSEC;
+		until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+		ret = wait_event_interruptible_hrtimeout(
+			sysgenid_data.outdated_waitq,
+			(!atomic_read(&sysgenid_data.outdated_watchers) ||
+					!equals_gen_counter(fdata->acked_gen_counter)),
+			until
+		);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			ret = -EINTR;
+		break;
+	case SYSGENID_TRIGGER_GEN_UPDATE:
+		if (!checkpoint_restore_ns_capable(current_user_ns()))
+			return -EACCES;
+		min_gen = arg;
+		_bump_generation(min_gen);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file_data *fdata = file->private_data;
+
+	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+		return -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) != 0)
+		return -EPERM;
+
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags &= ~VM_MAYWRITE;
+	vma->vm_private_data = fdata;
+
+	return vm_insert_page(vma, vma->vm_start,
+			virt_to_page(sysgenid_data.map_buf));
+}
+
+static const struct file_operations fops = {
+	.owner		= THIS_MODULE,
+	.mmap		= sysgenid_mmap,
+	.open		= sysgenid_open,
+	.release	= sysgenid_close,
+	.read		= sysgenid_read,
+	.write		= sysgenid_write,
+	.poll		= sysgenid_poll,
+	.unlocked_ioctl	= sysgenid_ioctl,
+};
+
+static struct miscdevice sysgenid_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sysgenid",
+	.fops = &fops,
+};
+
+static int __init sysgenid_init(void)
+{
+	int ret;
+
+	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
+	if (!sysgenid_data.map_buf)
+		return -ENOMEM;
+
+	atomic_set(&sysgenid_data.generation_counter, 0);
+	atomic_set(&sysgenid_data.outdated_watchers, 0);
+	init_waitqueue_head(&sysgenid_data.read_waitq);
+	init_waitqueue_head(&sysgenid_data.outdated_waitq);
+	spin_lock_init(&sysgenid_data.lock);
+
+	ret = misc_register(&sysgenid_misc);
+	if (ret < 0) {
+		pr_err("misc_register() failed for sysgenid\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+
+	return ret;
+}
+
+static void __exit sysgenid_exit(void)
+{
+	misc_deregister(&sysgenid_misc);
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+}
+
+module_init(sysgenid_init);
+module_exit(sysgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("System Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
new file mode 100644
index 0000000..7279df6
--- /dev/null
+++ b/include/uapi/linux/sysgenid.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_SYSGENID_H
+#define _UAPI_LINUX_SYSGENID_H
+
+#include <linux/ioctl.h>
+
+#define SYSGENID_IOCTL			0xE4
+#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
+#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
+#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)
+
+#ifdef __KERNEL__
+void sysgenid_bump_generation(void);
+#endif /* __KERNEL__ */
+
+#endif /* _UAPI_LINUX_SYSGENID_H */
+
-- 
2.7.4
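
The read() and write() handlers in the patch above suggest a simple
acknowledge cycle for userspace: read() returns the current 32-bit
counter once it differs from the value last acknowledged on the file
descriptor, and write() acknowledges it. A minimal sketch of two
helpers, assuming the patch is applied and fd refers to an open
/dev/sysgenid. The helper names are hypothetical and not part of the
patch:

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the generation counter from fd.
 * With the patch applied, read() on /dev/sysgenid blocks (or fails
 * with EAGAIN under O_NONBLOCK) until the counter differs from the
 * value last acknowledged on this descriptor.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sysgenid_read_gen(int fd, uint32_t *gen)
{
	ssize_t n;

	do {
		n = read(fd, gen, sizeof(*gen));
	} while (n < 0 && errno == EINTR);

	if (n == (ssize_t)sizeof(*gen))
		return 0;
	if (n >= 0)
		errno = EIO;	/* short read: not a whole counter */
	return -1;
}

/*
 * Hypothetical helper: acknowledge generation 'gen' on fd.
 * The driver rejects an ack for anything but the current counter
 * with EINVAL, which tells the caller that the generation changed
 * again and the adjust-and-ack cycle has to be repeated.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sysgenid_ack_gen(int fd, uint32_t gen)
{
	ssize_t n;

	do {
		n = write(fd, &gen, sizeof(gen));
	} while (n < 0 && errno == EINTR);

	if (n == (ssize_t)sizeof(gen))
		return 0;
	if (n >= 0)
		errno = EIO;	/* short write: ack not delivered whole */
	return -1;
}
```

A watcher would call sysgenid_read_gen(), regenerate or reseed its
cached unique state, then call sysgenid_ack_gen() with the value it
read, and start over if the ack fails with EINVAL.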




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
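
For reference, the ioctls defined by the uapi header in the patch above
could be wrapped from userspace roughly as follows. This is a sketch
under the assumption that the patch is applied; the request numbers are
copied from the patch because the header is not installed on typical
systems, and the wrapper names are hypothetical:

```c
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Request numbers copied from include/uapi/linux/sysgenid.h in the patch. */
#define SYSGENID_IOCTL			0xE4
#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)

/*
 * Hypothetical wrapper: enable (nonzero) or disable (zero) watcher
 * tracking for this file descriptor. The driver treats the ioctl
 * argument as a boolean.
 */
int sysgenid_set_tracking(int fd, int enable)
{
	return ioctl(fd, SYSGENID_SET_WATCHER_TRACKING,
		     (unsigned long)!!enable);
}

/*
 * Hypothetical wrapper: wait until no tracked watcher is outdated.
 * The driver interprets the argument as a timeout in milliseconds,
 * with 0 meaning "no timeout". Per the patch, the call fails with
 * EINTR when the caller itself is outdated.
 */
int sysgenid_wait_watchers(int fd, unsigned long timeout_ms)
{
	return ioctl(fd, SYSGENID_WAIT_WATCHERS, timeout_ms);
}
```

Both wrappers return whatever ioctl() returns, so callers check for -1
and inspect errno as usual.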



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v7 2/2] drivers/virt: vmgenid: add vm generation id driver
  2021-02-24  8:47 ` Adrian Catangiu
@ 2021-02-24  8:47   ` Adrian Catangiu
  -1 siblings, 0 replies; 23+ messages in thread
From: Adrian Catangiu @ 2021-02-24  8:47 UTC (permalink / raw)
  To: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390
  Cc: gregkh, graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mst, mhocko, rafael,
	pavel, mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer, Adrian Catangiu

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature can be used to drive the `sysgenid` mechanism required in
virtualized environments by software that works with local copies and
caches of world-unique data such as random values, UUIDs, monotonically
increasing counters, etc.

The VM Generation ID is a hypervisor/hardware provided 128-bit unique
ID that changes each time the VM is restored from a snapshot. It can be
used to differentiate between VMs or different generations of the same
VM.
This VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors.

The `vmgenid` driver acts as a backend for the `sysgenid` kernel module
(`drivers/misc/sysgenid.c`, `Documentation/misc-devices/sysgenid.rst`)
to drive changes to the "System Generation ID", which is further exposed
to userspace as a monotonically increasing counter.

The driver uses ACPI events to be notified by hardware of changes to the
128-bit VM Generation ID UUID. Since the actual UUID value is not directly
exposed to userspace, but only used to drive the System Generation Counter,
the driver also adds it as device randomness to improve kernel entropy
following VM snapshot events.
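
To make the flow concrete, here is a rough userspace model of the two
steps described above. It is a sketch and not the driver code: the ACPI
ADDR method yields two integers that are combined, low half first, into
a 64-bit physical address, and a notification bumps the generation
counter only when the UUID read from that address has changed. All
names below are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Combine the two 32-bit halves returned by ADDR, low half first. */
uint64_t vmgenid_addr_from_parts(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Last UUID seen plus the counter that would be exposed to userspace. */
struct gen_state {
	uint8_t uuid[16];
	uint32_t counter;
};

/*
 * Model of the notify handler: compare the freshly read UUID with the
 * cached one and bump the counter only on a real change.
 * Returns 1 if the generation was bumped, 0 otherwise.
 */
int gen_state_update(struct gen_state *st, const uint8_t new_uuid[16])
{
	if (memcmp(st->uuid, new_uuid, sizeof(st->uuid)) == 0)
		return 0;

	memcpy(st->uuid, new_uuid, sizeof(st->uuid));
	st->counter++;
	return 1;
}
```

In the driver itself the new UUID is additionally fed to the kernel RNG
as device randomness, which this model leaves out.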

This patch builds on top of Or Idgar <oridgar@gmail.com>'s proposal
https://lkml.org/lkml/2018/3/1/498

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
 Documentation/virt/vmgenid.rst |  36 ++++++++++
 MAINTAINERS                    |   7 ++
 drivers/virt/Kconfig           |  13 ++++
 drivers/virt/Makefile          |   1 +
 drivers/virt/vmgenid.c         | 153 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 210 insertions(+)
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/virt/vmgenid.c

diff --git a/Documentation/virt/vmgenid.rst b/Documentation/virt/vmgenid.rst
new file mode 100644
index 0000000..a429c2a3
--- /dev/null
+++ b/Documentation/virt/vmgenid.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+VMGENID
+=======
+
+The VM Generation ID is a feature defined by Microsoft (paper:
+http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
+multiple hypervisor vendors.
+
+The feature is required in virtualized environments by applications
+that work with local copies/caches of world-unique data such as random
+values, UUIDs, monotonically increasing counters, etc.
+Such applications can be negatively affected by VM snapshotting when
+the VM is either cloned or returned to an earlier point in time.
+
+The VM Generation ID is a simple concept through which a hypervisor
+notifies its guest that a snapshot has taken place. The vmgenid device
+provides a unique ID that changes each time the VM is restored from a
+snapshot. The hardware provided UUID value can be used to differentiate
+between VMs or different generations of the same VM.
+
+The VM Generation ID is exposed through an ACPI device by multiple
+hypervisor vendors. The driver for it lives at
+``drivers/virt/vmgenid.c``.
+
+The ``vmgenid`` driver acts as a backend for the ``sysgenid`` kernel module
+(``drivers/misc/sysgenid.c``, ``Documentation/misc-devices/sysgenid.rst``)
+to drive changes to the "System Generation ID", which is further exposed
+to userspace as a monotonically increasing counter.
+
+The driver uses ACPI events to be notified by hardware of changes to the
+128-bit VM Generation ID UUID. Since the actual UUID value is not directly
+exposed to userspace, but only used to drive the System Generation Counter,
+the driver also adds it as device randomness to improve kernel entropy
+following VM snapshot events.
diff --git a/MAINTAINERS b/MAINTAINERS
index b812dad8..f21451e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19086,6 +19086,13 @@ F:	drivers/staging/vme/
 F:	drivers/vme/
 F:	include/linux/vme*
 
+VMGENID
+M:	Adrian Catangiu <acatan@amazon.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	Documentation/virt/vmgenid.rst
+F:	drivers/virt/vmgenid.c
+
 VMWARE BALLOON DRIVER
 M:	Nadav Amit <namit@vmware.com>
 M:	"VMware, Inc." <pv-drivers@vmware.com>
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..95d82c9 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,19 @@ menuconfig VIRT_DRIVERS
 
 if VIRT_DRIVERS
 
+config VMGENID
+	tristate "Virtual Machine Generation ID driver"
+	depends on ACPI && SYSGENID
+	help
+	  The driver uses the hypervisor provided Virtual Machine Generation ID
+	  to drive the system generation counter mechanism exposed by sysgenid.
+	  The vmgenid changes on VM snapshots or VM cloning. The hypervisor
+	  provided 128-bit vmgenid is also used as device randomness to improve
+	  kernel entropy following VM snapshot events.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called vmgenid.
+
 config FSL_HV_MANAGER
 	tristate "Freescale hypervisor management driver"
 	depends on FSL_SOC
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index f28425c..889be01 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -4,6 +4,7 @@
 #
 
 obj-$(CONFIG_FSL_HV_MANAGER)	+= fsl_hypervisor.o
+obj-$(CONFIG_VMGENID)		+= vmgenid.o
 obj-y				+= vboxguest/
 
 obj-$(CONFIG_NITRO_ENCLAVES)	+= nitro_enclaves/
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
new file mode 100644
index 0000000..d9d089a
--- /dev/null
+++ b/drivers/virt/vmgenid.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual Machine Generation ID driver
+ *
+ * Copyright (C) 2018 Red Hat Inc. All rights reserved.
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ *	Authors:
+ *	  Adrian Catangiu <acatan@amazon.com>
+ *	  Or Idgar <oridgar@gmail.com>
+ *	  Gal Hammer <ghammer@redhat.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+#define DEV_NAME "vmgenid"
+ACPI_MODULE_NAME(DEV_NAME);
+
+struct vmgenid_data {
+	uuid_t uuid;
+	void *uuid_iomap;
+};
+static struct vmgenid_data vmgenid_data;
+
+static int vmgenid_acpi_map(struct vmgenid_data *priv, acpi_handle handle)
+{
+	int i;
+	phys_addr_t phys_addr;
+	struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+	acpi_status status;
+	union acpi_object *pss;
+	union acpi_object *element;
+
+	status = acpi_evaluate_object(handle, "ADDR", NULL, &buffer);
+	if (ACPI_FAILURE(status)) {
+		ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+		return -ENODEV;
+	}
+	pss = buffer.pointer;
+	if (!pss || pss->type != ACPI_TYPE_PACKAGE || pss->package.count != 2)
+		return -EINVAL;
+
+	phys_addr = 0;
+	for (i = 0; i < pss->package.count; i++) {
+		element = &(pss->package.elements[i]);
+		if (element->type != ACPI_TYPE_INTEGER)
+			return -EINVAL;
+		phys_addr |= element->integer.value << i * 32;
+	}
+
+	priv->uuid_iomap = acpi_os_map_memory(phys_addr, sizeof(uuid_t));
+	if (!priv->uuid_iomap) {
+		pr_err("Could not map memory at %pa, size %u\n",
+			   &phys_addr,
+			   (u32) sizeof(uuid_t));
+		return -ENOMEM;
+	}
+
+	memcpy_fromio(&priv->uuid, priv->uuid_iomap, sizeof(uuid_t));
+
+	return 0;
+}
+
+static int vmgenid_acpi_add(struct acpi_device *device)
+{
+	int ret;
+
+	if (!device)
+		return -EINVAL;
+	device->driver_data = &vmgenid_data;
+
+	ret = vmgenid_acpi_map(device->driver_data, device->handle);
+	if (ret < 0) {
+		pr_err("vmgenid: failed to map acpi device\n");
+		device->driver_data = NULL;
+	}
+
+	return ret;
+}
+
+static int vmgenid_acpi_remove(struct acpi_device *device)
+{
+	if (!device || acpi_driver_data(device) != &vmgenid_data)
+		return -EINVAL;
+	device->driver_data = NULL;
+
+	if (vmgenid_data.uuid_iomap)
+		acpi_os_unmap_memory(vmgenid_data.uuid_iomap, sizeof(uuid_t));
+	vmgenid_data.uuid_iomap = NULL;
+
+	return 0;
+}
+
+static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
+{
+	uuid_t old_uuid;
+
+	if (!device || acpi_driver_data(device) != &vmgenid_data) {
+		pr_err("VMGENID notify with unexpected driver private data\n");
+		return;
+	}
+
+	/* update VM Generation UUID */
+	old_uuid = vmgenid_data.uuid;
+	memcpy_fromio(&vmgenid_data.uuid, vmgenid_data.uuid_iomap, sizeof(uuid_t));
+
+	if (memcmp(&old_uuid, &vmgenid_data.uuid, sizeof(uuid_t))) {
+		/* HW uuid updated */
+		sysgenid_bump_generation();
+		add_device_randomness(&vmgenid_data.uuid, sizeof(uuid_t));
+	}
+}
+
+static const struct acpi_device_id vmgenid_ids[] = {
+	{"VMGENID", 0},
+	{"QEMUVGID", 0},
+	{"", 0},
+};
+
+static struct acpi_driver acpi_vmgenid_driver = {
+	.name = "vm_generation_id",
+	.ids = vmgenid_ids,
+	.owner = THIS_MODULE,
+	.ops = {
+		.add = vmgenid_acpi_add,
+		.remove = vmgenid_acpi_remove,
+		.notify = vmgenid_acpi_notify,
+	}
+};
+
+static int __init vmgenid_init(void)
+{
+	return acpi_bus_register_driver(&acpi_vmgenid_driver);
+}
+
+static void __exit vmgenid_exit(void)
+{
+	acpi_bus_unregister_driver(&acpi_vmgenid_driver);
+}
+
+module_init(vmgenid_init);
+module_exit(vmgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("Virtual Machine Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
-- 
2.7.4






^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 0/2] System Generation ID driver and VMGENID backend
  2021-02-24  8:47 ` Adrian Catangiu
@ 2021-02-24  9:05   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2021-02-24  9:05 UTC (permalink / raw)
  To: Adrian Catangiu
  Cc: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390, gregkh,
	graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46, borntraeger,
	Jason, jannh, w, colmmacc, luto, tytso, ebiggers, dwmw, bonzini,
	sblbir, raduweis, corbet, mhocko, rafael, pavel, mpe, areber,
	ovzxemul, avagin, ptikhomirov, gil, asmehra, dgunigun, vijaysun,
	oridgar, ghammer

On Wed, Feb 24, 2021 at 10:47:30AM +0200, Adrian Catangiu wrote:
> This feature is aimed at virtualized or containerized environments
> where VM or container snapshotting duplicates memory state, which is a
> challenge for applications that want to generate unique data such as
> request IDs, UUIDs, and cryptographic nonces.
> 
> The patch set introduces a mechanism that provides a userspace
> interface for applications and libraries to be made aware of uniqueness
> breaking events such as VM or container snapshotting, and allow them to
> react and adapt to such events.
> 
> Solving the uniqueness problem strongly enough for cryptographic
> purposes requires a mechanism which can deterministically reseed
> userspace PRNGs with new entropy at restore time. This mechanism must
> also support the high-throughput and low-latency use-cases that led
> programmers to pick a userspace PRNG in the first place; be usable by
> both application code and libraries; allow transparent retrofitting
> behind existing popular PRNG interfaces without changing application
> code; it must be efficient, especially on snapshot restore; and be
> simple enough for wide adoption.
> 
> The first patch in the set implements a device driver which exposes
> the /dev/sysgenid char device to userspace. Its associated filesystem
> operations can be used to build a system level safe workflow
> that guest software can follow to protect itself from negative system
> snapshot effects.
> 
> The second patch in the set adds a VmGenId driver which makes use of
> the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
> following VM snapshots.
> 
> **Please note**, SysGenID alone does not guarantee complete snapshot
> safety to applications using it. A certain workflow needs to be
> followed at the system level, in order to make the system
> snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> section in the included SysGenID documentation.
> 
> ---
> 
> v6 -> v7:
>   - remove sysgenid uevent

How about we drop mmap too?

There's simply no way I can see to make it safe, and
no implementation is worse than a racy one imho.

Yea there's some documentation explaining how it is not
supposed to be used but it will *seem* to work for people
and we will be stuck trying to maintain it.

Let's see if userspace using this often enough to make the
system call 



> v5 -> v6:
> 
>   - sysgenid: watcher tracking disabled by default
>   - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each
>     file descriptor to set whether they should be tracked as watchers
>   - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE
>   - rework all documentation to clearly capture all prerequisites for
>     achieving snapshot safety when using the provided mechanism
>   - sysgenid documentation: replace individual filesystem operations
>     examples with a higher level example showcasing system-level
>     snapshot-safe workflow
> 
> v4 -> v5:
> 
>   - sysgenid: generation changes are also exported through uevents
>   - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
>   - document sysgenid ioctl major/minor numbers
> 
> v3 -> v4:
> 
>   - split functionality in two separate kernel modules: 
>     1. drivers/misc/sysgenid.c which provides the generic userspace
>        interface and mechanisms
>     2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
>        kernel entropy and acts as a driving backend for the generic
>        sysgenid
>   - rename /dev/vmgenid -> /dev/sysgenid
>   - rename uapi header file vmgenid.h -> sysgenid.h
>   - rename ioctls VMGENID_* -> SYSGENID_*
>   - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
>   - fix races in documentation examples
> 
> v2 -> v3:
> 
>   - separate the core driver logic and interface, from the ACPI device.
>     The ACPI vmgenid device is now one possible backend
>   - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
>   - add locking to avoid races between fs ops handlers and hw irq
>     driven generation updates
>   - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
>     outdated or a generation change happens while waiting (thus making
>     current caller outdated), the ioctl returns -EINTR to signal the
>     user to handle event and retry. Fixes blocking on oneself
>   - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
>     CAP_CHECKPOINT_RESTORE capability, through which software can force
>     generation bump
> 
> v1 -> v2:
> 
>   - expose to userspace a monotonically increasing u32 Vm Gen Counter
>     instead of the hw VmGen UUID
>   - since the hw/hypervisor-provided 128-bit UUID is not public
>     anymore, add it to the kernel RNG as device randomness
>   - insert driver page containing Vm Gen Counter in the user vma in
>     the driver's mmap handler instead of using a fault handler
>   - turn driver into a misc device driver to auto-create /dev/vmgenid
>   - change ioctl arg to avoid leaking kernel structs to userspace
>   - update documentation
> 
> Adrian Catangiu (2):
>   drivers/misc: sysgenid: add system generation id driver
>   drivers/virt: vmgenid: add vm generation id driver
> 
>  Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  Documentation/virt/vmgenid.rst                     |  36 +++
>  MAINTAINERS                                        |  15 +
>  drivers/misc/Kconfig                               |  15 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
>  drivers/virt/Kconfig                               |  13 +
>  drivers/virt/Makefile                              |   1 +
>  drivers/virt/vmgenid.c                             | 153 ++++++++++
>  include/uapi/linux/sysgenid.h                      |  18 ++
>  11 files changed, 804 insertions(+)
>  create mode 100644 Documentation/misc-devices/sysgenid.rst
>  create mode 100644 Documentation/virt/vmgenid.rst
>  create mode 100644 drivers/misc/sysgenid.c
>  create mode 100644 drivers/virt/vmgenid.c
>  create mode 100644 include/uapi/linux/sysgenid.h
> 
> -- 
> 2.7.4
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 0/2] System Generation ID driver and VMGENID backend
@ 2021-02-24  9:05   ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2021-02-24  9:05 UTC (permalink / raw)
  To: Adrian Catangiu
  Cc: Jason, areber, kvm, linux-doc, ghammer, vijaysun, 0x7f454c46,
	qemu-devel, mhocko, dgunigun, avagin, pavel, ptikhomirov,
	linux-s390, corbet, mpe, rafael, ebiggers, borntraeger, sblbir,
	bonzini, arnd, jannh, raduweis, asmehra, graf, rppt, luto, gil,
	oridgar, colmmacc, tytso, gregkh, rdunlap, linux-kernel,
	ebiederm, ovzxemul, w, dwmw

On Wed, Feb 24, 2021 at 10:47:30AM +0200, Adrian Catangiu wrote:
> This feature is aimed at virtualized or containerized environments
> where VM or container snapshotting duplicates memory state, which is a
> challenge for applications that want to generate unique data such as
> request IDs, UUIDs, and cryptographic nonces.
> 
> The patch set introduces a mechanism that provides a userspace
> interface for applications and libraries to be made aware of uniqueness
> breaking events such as VM or container snapshotting, and allow them to
> react and adapt to such events.
> 
> Solving the uniqueness problem strongly enough for cryptographic
> purposes requires a mechanism which can deterministically reseed
> userspace PRNGs with new entropy at restore time. This mechanism must
> also support the high-throughput and low-latency use-cases that led
> programmers to pick a userspace PRNG in the first place; be usable by
> both application code and libraries; allow transparent retrofitting
> behind existing popular PRNG interfaces without changing application
> code; it must be efficient, especially on snapshot restore; and be
> simple enough for wide adoption.
> 
> The first patch in the set implements a device driver which exposes
> the /dev/sysgenid char device to userspace. Its associated filesystem
> operations can be used to build a system level safe workflow
> that guest software can follow to protect itself from negative system
> snapshot effects.
> 
> The second patch in the set adds a VmGenId driver which makes use of
> the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
> following VM snapshots.
> 
> **Please note**, SysGenID alone does not guarantee complete snapshot
> safety to applications using it. A certain workflow needs to be
> followed at the system level, in order to make the system
> snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> section in the included SysGenID documentation.
> 
> ---
> 
> v6 -> v7:
>   - remove sysgenid uevent

How about we drop mmap too?

There's simply no way I can see to make it safe, and
no implementation is worse than a racy one imho.

Yeah, there's some documentation explaining how it is not
supposed to be used, but it will *seem* to work for people
and we will be stuck trying to maintain it.

Let's see if userspace using this often enough to make the
system call 



> v5 -> v6:
> 
>   - sysgenid: watcher tracking disabled by default
>   - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each
>     file descriptor to set whether they should be tracked as watchers
>   - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE
>   - rework all documentation to clearly capture all prerequisites for
>     achieving snapshot safety when using the provided mechanism
>   - sysgenid documentation: replace individual filesystem operations
>     examples with a higher level example showcasing system-level
>     snapshot-safe workflow
> 
> v4 -> v5:
> 
>   - sysgenid: generation changes are also exported through uevents
>   - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
>   - document sysgenid ioctl major/minor numbers
> 
> v3 -> v4:
> 
>   - split functionality in two separate kernel modules: 
>     1. drivers/misc/sysgenid.c which provides the generic userspace
>        interface and mechanisms
>     2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
>        kernel entropy and acts as a driving backend for the generic
>        sysgenid
>   - rename /dev/vmgenid -> /dev/sysgenid
>   - rename uapi header file vmgenid.h -> sysgenid.h
>   - rename ioctls VMGENID_* -> SYSGENID_*
>   - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
>   - fix races in documentation examples
> 
> v2 -> v3:
> 
>   - separate the core driver logic and interface, from the ACPI device.
>     The ACPI vmgenid device is now one possible backend
>   - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
>   - add locking to avoid races between fs ops handlers and hw irq
>     driven generation updates
>   - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
>     outdated or a generation change happens while waiting (thus making
>     current caller outdated), the ioctl returns -EINTR to signal the
>     user to handle event and retry. Fixes blocking on oneself
>   - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
>     CAP_CHECKPOINT_RESTORE capability, through which software can force
>     generation bump
> 
> v1 -> v2:
> 
>   - expose to userspace a monotonically increasing u32 Vm Gen Counter
>     instead of the hw VmGen UUID
>   - since the hw/hypervisor-provided 128-bit UUID is not public
>     anymore, add it to the kernel RNG as device randomness
>   - insert driver page containing Vm Gen Counter in the user vma in
>     the driver's mmap handler instead of using a fault handler
>   - turn driver into a misc device driver to auto-create /dev/vmgenid
>   - change ioctl arg to avoid leaking kernel structs to userspace
>   - update documentation
> 
> Adrian Catangiu (2):
>   drivers/misc: sysgenid: add system generation id driver
>   drivers/virt: vmgenid: add vm generation id driver
> 
>  Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  Documentation/virt/vmgenid.rst                     |  36 +++
>  MAINTAINERS                                        |  15 +
>  drivers/misc/Kconfig                               |  15 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
>  drivers/virt/Kconfig                               |  13 +
>  drivers/virt/Makefile                              |   1 +
>  drivers/virt/vmgenid.c                             | 153 ++++++++++
>  include/uapi/linux/sysgenid.h                      |  18 ++
>  11 files changed, 804 insertions(+)
>  create mode 100644 Documentation/misc-devices/sysgenid.rst
>  create mode 100644 Documentation/virt/vmgenid.rst
>  create mode 100644 drivers/misc/sysgenid.c
>  create mode 100644 drivers/virt/vmgenid.c
>  create mode 100644 include/uapi/linux/sysgenid.h
> 
> -- 
> 2.7.4



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
  2021-02-24  8:47   ` Adrian Catangiu
@ 2021-02-24  9:19     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2021-02-24  9:19 UTC (permalink / raw)
  To: Adrian Catangiu
  Cc: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390, gregkh,
	graf, rdunlap, arnd, ebiederm, rppt, 0x7f454c46, borntraeger,
	Jason, jannh, w, colmmacc, luto, tytso, ebiggers, dwmw, bonzini,
	sblbir, raduweis, corbet, mhocko, rafael, pavel, mpe, areber,
	ovzxemul, avagin, ptikhomirov, gil, asmehra, dgunigun, vijaysun,
	oridgar, ghammer

On Wed, Feb 24, 2021 at 10:47:31AM +0200, Adrian Catangiu wrote:
> - Background and problem
> 
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.
> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.
> 
> Furthermore, simply finding out about a system generation change is
> only the starting point of a process to renew internal states of
> possibly multiple applications across the system. This process requires
> a standard interface that applications can rely on and through which
> orchestration can be easily done.
> 
> - Solution
> 
> The System Generation ID is meant to help in these scenarios by
> providing a monotonically increasing u32 counter that changes each time
> the VM or container is restored from a snapshot.
> 
> The `sysgenid` driver exposes a monotonic incremental System Generation
> u32 counter via a char-dev filesystem interface accessible
> through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
> counter update notifications, as well as counter retrieval and
> confirmation mechanisms.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the system generation changes.
> 
> Userspace applications or libraries can (a)synchronously consume the
> system generation counter through the provided filesystem interface, to
> make any necessary internal adjustments following a system generation
> update.
> 
> The provided filesystem interface operations can be used to build a
> system level safe workflow that guest software can follow to protect
> itself from negative system snapshot effects.
> 
> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
> symbol which can be used by backend drivers to drive system generation
> changes based on hardware events.
> System generation changes can also be driven by userspace software
> through a dedicated driver ioctl.
> 
> **Please note**, SysGenID alone does not guarantee complete snapshot
> safety to applications using it. A certain workflow needs to be
> followed at the system level, in order to make the system
> snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> section in the included documentation.
> 
> Signed-off-by: Adrian Catangiu <acatan@amazon.com>
> ---
>  Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  MAINTAINERS                                        |   8 +
>  drivers/misc/Kconfig                               |  15 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
>  include/uapi/linux/sysgenid.h                      |  18 ++
>  7 files changed, 594 insertions(+)
>  create mode 100644 Documentation/misc-devices/sysgenid.rst
>  create mode 100644 drivers/misc/sysgenid.c
>  create mode 100644 include/uapi/linux/sysgenid.h
> 
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..0b8199b
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,229 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is meant to help in these scenarios by
> +providing a monotonically increasing counter that changes each time the
> +VM or container is restored from a snapshot. The driver for it lives at
> +``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev filesystem interface accessible
> +through ``/dev/sysgenid`` that provides sync and async SysGen counter
> +update notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided filesystem interface, to
> +make any necessary internal adjustments following a system generation
> +update.
> +
> +**Please note**, SysGenID alone does not guarantee complete snapshot
> +safety to applications using it. A certain workflow needs to be
> +followed at the system level, in order to make the system
> +snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> +section below.
> +
> +Driver filesystem interface
> +===========================
> +
> +``open()``:
> +  When the device is opened, a copy of the current SysGenID (counter)
> +  is associated with the open file descriptor. Every open file
> +  descriptor will have readable data available (EPOLLIN) while its
> +  current copy of the SysGenID is outdated. Reading from the fd will
> +  provide the latest SysGenID, while writing to the fd will update the
> +  fd-local copy of the SysGenID and is used as a confirmation
> +  mechanism.
> +
> +``read()``:
> +  Read is meant to provide the *new* system generation counter when a
> +  generation change takes place. The read operation blocks until the
> +  associated counter is no longer up to date, at which point the new
> +  counter is provided/returned.  Nonblocking ``read()`` returns
> +  ``EAGAIN`` to signal that there is no *new* counter value available.
> +  The generation counter is considered *new* for each open file
> +  descriptor that hasn't confirmed the new value following a generation
> +  change. Therefore, once a generation change takes place, all
> +  ``read()`` calls will immediately return the new generation counter
> +  and will continue to do so until the new value is confirmed back to
> +  the driver through ``write()``.
> +  Partial reads are not allowed - read buffer needs to be at least
> +  32 bits in size.
> +
> +``write()``:
> +  Write is used to confirm the up-to-date SysGenID counter back to the
> +  driver.
> +  Following a VM generation change, all existing watchers are marked
> +  as *outdated*. Each file descriptor will maintain the *outdated*
> +  status until a ``write()`` containing the new up-to-date generation
> +  counter is used as an update confirmation mechanism.
> +  Partial writes are not allowed - write buffer should be exactly
> +  32 bits in size.
> +
> +``poll()``:
> +  Poll is implemented to allow polling for generation counter updates.
> +  Such updates result in ``EPOLLIN`` polling status until the new
> +  up-to-date counter is confirmed back to the driver through a
> +  ``write()``.
> +
> +``ioctl()``:
> +  The driver also adds support for waiting on open file descriptors
> +  that haven't acknowledged a generation counter update, as well as a
> +  mechanism for userspace to *trigger* a generation update:
> +
> +  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
> +    status for current file descriptor. When watcher tracking is
> +    enabled, the driver tracks this file descriptor as an independent
> +    *watcher*. The driver keeps accounting of how many watchers have
> +    confirmed the latest Sys-Gen-Id counter and how many of them are
> +    *outdated*; an outdated watcher is a *tracked* open file descriptor
> +    that has lived through a Sys-Gen-Id change but has not yet confirmed
> +    the new generation counter.
> +    Software that wants to be waited on by the system while it adjusts
> +    to generation changes, should turn tracking on. The sysgenid driver
> +    then keeps track of it and can block system-level adjustment process
> +    until the software has finished adjusting and confirmed it through a
> +    ``write()``.
> +    Tracking is disabled by default and file descriptors need to
> +    explicitly opt-in using this IOCTL.
> +  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> +    tracked watchers or, if a ``timeout`` argument is provided, until
> +    the timeout expires.
> +    If the current caller is *outdated* or a generation change happens
> +    while waiting (thus making current caller *outdated*), the ioctl
> +    returns ``-EINTR`` to signal the user to handle event and retry.
> +  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
> +    It takes a ``minimum-generation`` argument which represents the
> +    minimum value the generation counter will be set to. For example if
> +    current generation is ``5`` and ``SYSGENID_TRIGGER_GEN_UPDATE(8)``
> +    is called, the generation counter will increment to ``8``.

And what if it's 9?
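
Reading _bump_generation() further down in this patch, the new value
looks like max(min_gen, current + 1), so with a current generation of 9
and a requested minimum of 8 the counter would end up at 10. A
standalone userspace model of that arithmetic, as a rough sketch
(next_generation() is a made-up helper name, not something the driver
provides):

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace model of the counter update in _bump_generation():
 * the counter always advances by at least one, and jumps further
 * when the requested minimum is ahead of it.
 * next_generation() is a hypothetical helper, not driver code.
 */
int next_generation(int current, int min_gen)
{
	int bumped;

	/* Guard the model itself against signed overflow. */
	assert(current < INT_MAX);
	bumped = current + 1;

	return min_gen > bumped ? min_gen : bumped;
}
```

So next_generation(5, 8) gives 8 as in the documented example, while
next_generation(9, 8) gives 10. The documentation could spell out the
second case.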

> +    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> +    or CAP_SYS_ADMIN capabilities.
> +
> +``mmap()``:
> +  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> +  in size. The first 4 bytes of the mapped page will contain an
> +  up-to-date u32 copy of the system generation counter.
> +  The mapped memory can be used as a low-latency generation counter
> +  probe mechanism in critical sections.
> +  The mmap() interface is targeted at libraries or code that needs to
> +  check for generation changes in-line, where an event loop is not
> +  available or read()/write() syscalls are too expensive.
> +  In such cases, logic can be added in-line with the sensitive code to
> +  check and trigger on-demand/just-in-time readjustments when changes
> +  are detected on the memory mapped generation counter.
> +  Users of this interface that plan to lazily adjust should not enable
> +  watcher tracking, since waiting on them doesn't make sense.
> +
> +``close()``:
> +  Removes the file descriptor as a system generation counter *watcher*.
> +
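
For reference, the poll()/read()/write() protocol described above could
be consumed from userspace roughly as follows. This is only a sketch
against the interface as documented in this patch:
sysgenid_open_watcher(), sysgenid_handle_update() and the readjust
callback are made-up names, and none of it has been run against the
actual driver.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Open the device node named in the documentation; -1 on failure. */
int sysgenid_open_watcher(void)
{
	return open("/dev/sysgenid", O_RDWR);
}

/*
 * Wait for a generation change on fd, read the new counter, run the
 * caller's readjustment hook, then confirm the value with write().
 * Returns 0 on success, -1 on failure.
 *
 * readjust is a hypothetical application callback and may be NULL.
 */
int sysgenid_handle_update(int fd, void (*readjust)(uint32_t),
			   uint32_t *gen_out)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	uint32_t gen;
	ssize_t n;

	assert(gen_out != NULL);

	/* The fd stays readable while its acked counter is outdated. */
	while (poll(&pfd, 1, -1) < 0) {
		if (errno != EINTR)
			return -1;
	}

	/* Partial reads are rejected, so pass a full 32-bit buffer. */
	n = read(fd, &gen, sizeof(gen));
	if (n != (ssize_t)sizeof(gen))
		return -1;

	if (readjust)
		readjust(gen);

	/*
	 * Confirm the new generation. Going by sysgenid_write(), this
	 * fails with EINVAL if the counter moved again in the meantime,
	 * in which case the caller should simply call this function again.
	 */
	n = write(fd, &gen, sizeof(gen));
	if (n != (ssize_t)sizeof(gen))
		return -1;

	*gen_out = gen;
	return 0;
}
```

A tracked watcher would additionally issue SYSGENID_SET_WATCHER_TRACKING
on the fd before entering its event loop.
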
> +Snapshot Safety Prerequisites
> +=============================
> +
> +If VM, container or other system-level snapshots happen asynchronously,
> +at arbitrary times during an active workload there is no practical way
> +to ensure that in-flight local copies or caches of world-unique data
> +such as random values, secrets, UUIDs, etc are properly scrubbed and
> +regenerated.
> +The challenge stems from the fact that the categorization of data as
> +snapshot-sensitive is only known to the software working with it, and
> +this software has no logical control over the moment in time when an
> +external system snapshot occurs.
> +
> +Let's take an OpenSSL session token for example. Even if the library
> +code is made 100% snapshot-safe, meaning the library guarantees that
> +the session token is unique (any snapshot that happened during the
> +library call did not duplicate or leak the token), the token is still
> +vulnerable to snapshot events while it transits the various layers of
> +the library caller, then the various layers of the OS before leaving
> +the system.
> +
> +To catch a secret while it's in-flight, we'd have to validate system
> +generation at every layer, every step of the way. Even if that would
> +be deemed the right solution, it would be a long road and a whole
> +universe to patch before we get there.
> +
> +Bottom line is we don't have a way to track all of these in-flight
> +secrets and dynamically scrub them from existence with snapshot
> +events happening arbitrarily.

The above should try harder to explain what needs to be scrubbed
and why. For example, I personally don't really know what the
OpenSSL session token example is or what makes the token vulnerable.
I guess snapshots can attack each other?




Here's a simple example of a workflow that submits transactions
to a database and wants to avoid duplicate transactions.
This does not require overseer magic. It does however require
a correct genid from hypervisor, so no mmap tricks work.



	int genid, oldgenid, transid;

	read(&genid);
start:
	oldgenid = genid;
	transid = submit transaction;
	read(&genid);
	if (genid != oldgenid) {
		/* a snapshot happened while the transaction was in flight */
		revert transaction (transid);
		goto start;
	}
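
The same loop as a compilable model, in case it helps the discussion.
Everything here is illustrative: struct txn_ops, submit_once() and the
fake_* backend are invented names, and read_genid stands for whatever
query returns the current generation without blocking. The fake backend
changes the generation once, right after the first submission, to
exercise the retry path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for the genid source and database. */
struct txn_ops {
	int (*read_genid)(void *ctx);
	int (*submit)(void *ctx);		/* returns a transaction id */
	void (*revert)(void *ctx, int id);
};

/*
 * Submit one transaction, reverting and resubmitting if the generation
 * id changed while it was in flight. Returns the id that was kept.
 */
int submit_once(const struct txn_ops *ops, void *ctx)
{
	int genid, oldgenid, transid;

	assert(ops && ops->read_genid && ops->submit && ops->revert);

	genid = ops->read_genid(ctx);
	for (;;) {
		oldgenid = genid;
		transid = ops->submit(ctx);
		genid = ops->read_genid(ctx);
		if (genid == oldgenid)
			return transid;
		/* A snapshot/restore happened in between: undo and retry. */
		ops->revert(ctx, transid);
	}
}

/* Fake backend for demonstration only. */
struct fake_env {
	int genid;
	int submitted;
	int reverted;
};

int fake_read_genid(void *ctx)
{
	return ((struct fake_env *)ctx)->genid;
}

int fake_submit(void *ctx)
{
	struct fake_env *env = ctx;

	/* Simulate a restore racing with the first submission. */
	if (env->submitted == 0)
		env->genid++;
	return ++env->submitted;
}

void fake_revert(void *ctx, int id)
{
	(void)id;
	((struct fake_env *)ctx)->reverted++;
}
```

With the fake backend, the first transaction is reverted and the second
one is kept, so exactly one transaction survives the generation change.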






> +Simplifying assumption - safety prerequisite
> +--------------------------------------------
> +
> +**Control the snapshot flow**, disallow snapshots coming at arbitrary
> +moments in the workload lifetime.
> +
> +Use a system-level overseer entity that quiesces the system before
> +snapshot, and post-snapshot-resume oversees that software components
> +have readjusted to new environment, to the new generation. Only after,
> +will the overseer un-quiesce the system and allow active workloads.
> +
> +Software components can choose whether they want to be tracked and
> +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
> +IOCTL.
> +
> +The sysgenid framework standardizes the API for system software to
> +find out about needing to readjust and at the same time provides a
> +mechanism for the overseer entity to wait for everyone to be done, the
> +system to have readjusted, so it can un-quiesce.
> +
> +Example snapshot-safe workflow
> +------------------------------
> +
> +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
> +   how this is achieved is very workload-specific, but the general
> +   description is to get all software to an expected state where their
> +   event loops dry up and they are effectively quiesced.

If you have the ability to do this by communicating with
all processes, e.g. through a unix domain socket,
why do you need the rest of the stuff in the kernel?
Quiescing is a harder problem than waking up.

> +2) Take snapshot.
> +3) Resume the VM/container/system from said snapshot.
> +4) SysGenID counter will either automatically increment if there is
> +   a vmgenid backend (hw-driven), or overseer will trigger generation
> +   bump using ``SYSGENID_TRIGGER_GEN_UPDATE`` IOCTL (sw-driven).
> +5) Software components which have ``/dev/sysgenid`` in their event
> +   loops (either using ``poll()`` or ``read()``) are notified of the
> +   generation change.
> +   They do their specific internal adjustments. Some may have requested
> +   to be tracked and waited on by the overseer, others might choose to
> +   do their adjustments out of band and not block the overseer.
> +   Tracked ones *must* signal when they are done/ready with a ``write()``
> +   while the rest *should* also do so for cleanliness, but it's not
> +   mandatory.
> +6) Overseer will block and wait for all tracked watchers by using the
> +   ``SYSGENID_WAIT_WATCHERS`` IOCTL. Once all tracked watchers are done
> +   in step 5, this overseer will return from this blocking ioctl knowing
> +   that the system has readjusted and is ready for active workload.
> +7) Overseer un-quiesces system.
> +8) There is a class of software, usually libraries, most notably PRNGs
> +   or SSLs, that don't fit the event-loop model and also have strict
> +   latency requirements. These can take advantage of the ``mmap()``
> +   interface and lazily adjust on-demand whenever they are called after
> +   un-quiesce.
> +   For a well-designed service stack, these libraries should not be
> +   called while system is quiesced. When workload is resumed by the
> +   overseer, on the first call into these libs, they will safely JIT
> +   readjust.
> +   Users of this lazy on-demand readjustment model should not enable
> +   watcher tracking since doing so would introduce a logical deadlock:
> +   lazy adjustments happen only after un-quiesce, but un-quiesce is
> +   blocked until all tracked watchers are up-to-date.
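
To make step 8 concrete, here is roughly what the lazy check might look
like inside a library. This is a sketch under assumptions, not code from
the patch: struct lazy_prng and its functions are invented, gen_ptr is
assumed to point at the first 4 bytes of a PROT_READ, MAP_SHARED mapping
of /dev/sysgenid, and the generator is a placeholder xorshift rather
than anything cryptographic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace PRNG state guarded by a generation check. */
struct lazy_prng {
	const volatile uint32_t *gen_ptr;	/* mapped generation counter */
	uint32_t seen_gen;			/* generation at last (re)seed */
	uint64_t state;
	uint64_t (*fresh_seed)(void);		/* e.g. a getrandom() wrapper */
	unsigned int reseeds;
};

/* Demo seed source; a real user would pull from the kernel RNG. */
uint64_t demo_seed(void)
{
	static uint64_t n;

	return 0x9e3779b97f4a7c15ULL * ++n;
}

void lazy_prng_init(struct lazy_prng *p, const volatile uint32_t *gen_ptr,
		    uint64_t (*fresh_seed)(void))
{
	assert(p && gen_ptr && fresh_seed);
	p->gen_ptr = gen_ptr;
	p->seen_gen = *gen_ptr;
	p->fresh_seed = fresh_seed;
	p->state = fresh_seed() | 1;	/* xorshift state must be nonzero */
	p->reseeds = 0;
}

uint64_t lazy_prng_next(struct lazy_prng *p)
{
	uint32_t gen = *p->gen_ptr;

	/* Just-in-time readjustment: reseed before producing output. */
	if (gen != p->seen_gen) {
		p->state = p->fresh_seed() | 1;
		p->seen_gen = gen;
		p->reseeds++;
	}

	/* xorshift64 step; placeholder generator only. */
	p->state ^= p->state << 13;
	p->state ^= p->state >> 7;
	p->state ^= p->state << 17;
	return p->state;
}
```

Note the window between the generation check and the caller actually
using the returned value: a snapshot taken in that window is not noticed
until the next call. That gap is the race raised against the mmap
interface elsewhere in this thread.
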
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index d02ba2f..39f9482 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -357,6 +357,7 @@ Code  Seq#    Include File                                           Comments
>  0xDB  00-0F  drivers/char/mwave/mwavepub.h
>  0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
>                                                                       <mailto:aherrman@de.ibm.com>
> +0xE4  01-03  uapi/linux/sysgenid.h                                   SysGenID misc driver
>  0xE5  00-3F  linux/fuse.h
>  0xEC  00-01  drivers/platform/chrome/cros_ec_dev.h                   ChromeOS EC driver
>  0xF3  00-3F  drivers/usb/misc/sisusbvga/sisusb.h                     sisfb (in development)
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1d75afa..b812dad8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17261,6 +17261,14 @@ L:	linux-mmc@vger.kernel.org
>  S:	Maintained
>  F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
>  
> +SYSGENID
> +M:	Adrian Catangiu <acatan@amazon.com>
> +L:	linux-kernel@vger.kernel.org
> +S:	Supported
> +F:	Documentation/misc-devices/sysgenid.rst
> +F:	drivers/misc/sysgenid.c
> +F:	include/uapi/linux/sysgenid.h
> +
>  SYSTEM CONFIGURATION (SYSCON)
>  M:	Lee Jones <lee.jones@linaro.org>
>  M:	Arnd Bergmann <arnd@arndb.de>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index fafa8b0..a2b7cae 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -456,6 +456,21 @@ config PVPANIC
>  	  a paravirtualized device provided by QEMU; it lets a virtual machine
>  	  (guest) communicate panic events to the host.
>  
> +config SYSGENID
> +	tristate "System Generation ID driver"
> +	help
> +	  This is a System Generation ID driver which provides a system
> +	  generation counter. The driver exposes FS ops on /dev/sysgenid
> +	  through which it can provide information and notifications on system
> +	  generation changes that happen because of VM or container snapshots
> +	  or cloning.
> +	  This enables applications and libraries that store or cache
> +	  sensitive information, to know that they need to regenerate it
> +	  after process memory has been exposed to potential copying.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called sysgenid.
> +
>  config HISI_HIKEY_USB
>  	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
>  	depends on (OF && GPIOLIB) || COMPILE_TEST
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d23231e..4b4933d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI)		+= habanalabs/
>  obj-$(CONFIG_UACCE)		+= uacce/
>  obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
>  obj-$(CONFIG_HISI_HIKEY_USB)	+= hisi_hikey_usb.o
> +obj-$(CONFIG_SYSGENID)		+= sysgenid.o
> diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
> new file mode 100644
> index 0000000..ace292b
> --- /dev/null
> +++ b/drivers/misc/sysgenid.c
> @@ -0,0 +1,322 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * System Generation ID driver
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + *	Authors:
> + *	  Adrian Catangiu <acatan@amazon.com>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/minmax.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/sysgenid.h>
> +
> +struct sysgenid_data {
> +	unsigned long		map_buf;
> +	wait_queue_head_t	read_waitq;
> +	atomic_t		generation_counter;
> +
> +	unsigned int		watchers;
> +	atomic_t		outdated_watchers;
> +	wait_queue_head_t	outdated_waitq;
> +	spinlock_t		lock;
> +};
> +static struct sysgenid_data sysgenid_data;
> +
> +struct file_data {
> +	bool tracked_watcher;
> +	int acked_gen_counter;
> +};
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> +	return counter == atomic_read(&sysgenid_data.generation_counter);
> +}
> +
> +static void _bump_generation(int min_gen)
> +{
> +	unsigned long flags;
> +	int counter;
> +
> +	spin_lock_irqsave(&sysgenid_data.lock, flags);
> +	counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
> +	atomic_set(&sysgenid_data.generation_counter, counter);
> +	*((int *) sysgenid_data.map_buf) = counter;
> +	atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
> +
> +	wake_up_interruptible(&sysgenid_data.read_waitq);
> +	wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +}
> +
> +void sysgenid_bump_generation(void)
> +{
> +	_bump_generation(0);
> +}
> +EXPORT_SYMBOL_GPL(sysgenid_bump_generation);
> +
> +static void put_outdated_watchers(void)
> +{
> +	if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
> +		wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +}
> +
> +static void start_fd_tracking(struct file_data *fdata)
> +{
> +	unsigned long flags;
> +
> +	if (!fdata->tracked_watcher) {
> +		/* enable tracking this fd as a watcher */
> +		spin_lock_irqsave(&sysgenid_data.lock, flags);
> +			fdata->tracked_watcher = 1;
> +			++sysgenid_data.watchers;
> +			if (!equals_gen_counter(fdata->acked_gen_counter))
> +				atomic_inc(&sysgenid_data.outdated_watchers);
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +	}
> +}
> +
> +static void stop_fd_tracking(struct file_data *fdata)
> +{
> +	unsigned long flags;
> +
> +	if (fdata->tracked_watcher) {
> +		/* stop tracking this fd as a watcher */
> +		spin_lock_irqsave(&sysgenid_data.lock, flags);
> +		if (!equals_gen_counter(fdata->acked_gen_counter))
> +			put_outdated_watchers();
> +		--sysgenid_data.watchers;
> +		fdata->tracked_watcher = 0;
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +	}
> +}
> +
> +static int sysgenid_open(struct inode *inode, struct file *file)
> +{
> +	struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
> +
> +	if (!fdata)
> +		return -ENOMEM;
> +	fdata->tracked_watcher = 0;
> +	fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
> +	file->private_data = fdata;
> +
> +	return 0;
> +}
> +
> +static int sysgenid_close(struct inode *inode, struct file *file)
> +{
> +	struct file_data *fdata = file->private_data;
> +
> +	stop_fd_tracking(fdata);
> +	kfree(fdata);
> +
> +	return 0;
> +}
> +
> +static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
> +		size_t nbytes, loff_t *ppos)
> +{
> +	struct file_data *fdata = file->private_data;
> +	ssize_t ret;
> +	int gen_counter;
> +
> +	if (nbytes == 0)
> +		return 0;
> +	/* disallow partial reads */
> +	if (nbytes < sizeof(gen_counter))
> +		return -EINVAL;
> +
> +	if (equals_gen_counter(fdata->acked_gen_counter)) {
> +		if (file->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +		ret = wait_event_interruptible(
> +			sysgenid_data.read_waitq,
> +			!equals_gen_counter(fdata->acked_gen_counter)
> +		);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	gen_counter = atomic_read(&sysgenid_data.generation_counter);
> +	ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> +	if (ret)
> +		return -EFAULT;
> +
> +	return sizeof(gen_counter);
> +}
> +
> +static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct file_data *fdata = file->private_data;
> +	unsigned int new_acked_gen;
> +	unsigned long flags;
> +
> +	/* disallow partial writes */
> +	if (count != sizeof(new_acked_gen))
> +		return -ENOBUFS;
> +	if (copy_from_user(&new_acked_gen, ubuf, count))
> +		return -EFAULT;
> +
> +	spin_lock_irqsave(&sysgenid_data.lock, flags);
> +	/* wrong gen-counter acknowledged */
> +	if (!equals_gen_counter(new_acked_gen)) {
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +		return -EINVAL;
> +	}
> +	/* update acked gen-counter if necessary */
> +	if (!equals_gen_counter(fdata->acked_gen_counter)) {
> +		fdata->acked_gen_counter = new_acked_gen;
> +		if (fdata->tracked_watcher)
> +			put_outdated_watchers();
> +	}
> +	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> +	return (ssize_t)count;
> +}
> +
> +static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
> +{
> +	__poll_t mask = 0;
> +	struct file_data *fdata = file->private_data;
> +
> +	if (!equals_gen_counter(fdata->acked_gen_counter))
> +		return EPOLLIN | EPOLLRDNORM;
> +
> +	poll_wait(file, &sysgenid_data.read_waitq, wait);
> +
> +	if (!equals_gen_counter(fdata->acked_gen_counter))
> +		mask = EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static long sysgenid_ioctl(struct file *file,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	struct file_data *fdata = file->private_data;
> +	bool tracking = !!arg;
> +	unsigned long timeout_ns, min_gen;
> +	ktime_t until;
> +	int ret = 0;
> +
> +	switch (cmd) {
> +	case SYSGENID_SET_WATCHER_TRACKING:
> +		if (tracking)
> +			start_fd_tracking(fdata);
> +		else
> +			stop_fd_tracking(fdata);
> +		break;
> +	case SYSGENID_WAIT_WATCHERS:
> +		timeout_ns = arg * NSEC_PER_MSEC;
> +		until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> +		ret = wait_event_interruptible_hrtimeout(
> +			sysgenid_data.outdated_waitq,
> +			(!atomic_read(&sysgenid_data.outdated_watchers) ||
> +					!equals_gen_counter(fdata->acked_gen_counter)),
> +			until
> +		);
> +		if (!equals_gen_counter(fdata->acked_gen_counter))
> +			ret = -EINTR;
> +		break;
> +	case SYSGENID_TRIGGER_GEN_UPDATE:
> +		if (!checkpoint_restore_ns_capable(current_user_ns()))
> +			return -EACCES;
> +		min_gen = arg;
> +		_bump_generation(min_gen);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct file_data *fdata = file->private_data;
> +
> +	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> +		return -EINVAL;
> +
> +	if ((vma->vm_flags & VM_WRITE) != 0)
> +		return -EPERM;
> +
> +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags &= ~VM_MAYWRITE;
> +	vma->vm_private_data = fdata;
> +
> +	return vm_insert_page(vma, vma->vm_start,
> +			virt_to_page(sysgenid_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> +	.owner		= THIS_MODULE,
> +	.mmap		= sysgenid_mmap,
> +	.open		= sysgenid_open,
> +	.release	= sysgenid_close,
> +	.read		= sysgenid_read,
> +	.write		= sysgenid_write,
> +	.poll		= sysgenid_poll,
> +	.unlocked_ioctl	= sysgenid_ioctl,
> +};
> +
> +static struct miscdevice sysgenid_misc = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "sysgenid",
> +	.fops = &fops,
> +};
> +
> +static int __init sysgenid_init(void)
> +{
> +	int ret;
> +
> +	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> +	if (!sysgenid_data.map_buf)
> +		return -ENOMEM;
> +
> +	atomic_set(&sysgenid_data.generation_counter, 0);
> +	atomic_set(&sysgenid_data.outdated_watchers, 0);
> +	init_waitqueue_head(&sysgenid_data.read_waitq);
> +	init_waitqueue_head(&sysgenid_data.outdated_waitq);
> +	spin_lock_init(&sysgenid_data.lock);
> +
> +	ret = misc_register(&sysgenid_misc);
> +	if (ret < 0) {
> +		pr_err("misc_register() failed for sysgenid\n");
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +
> +	return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> +	misc_deregister(&sysgenid_misc);
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..7279df6
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL			0xE4
> +#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
> +#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
> +#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)
> +
> +#ifdef __KERNEL__
> +void sysgenid_bump_generation(void);
> +#endif /* __KERNEL__ */
> +
> +#endif /* _UAPI_LINUX_SYSGENID_H */
> +
> -- 
> 2.7.4
> 
> 
> 
> 
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.



* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
@ 2021-02-24  9:19     ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2021-02-24  9:19 UTC (permalink / raw)
  To: Adrian Catangiu
  Cc: Jason, areber, kvm, linux-doc, ghammer, vijaysun, 0x7f454c46,
	qemu-devel, mhocko, dgunigun, avagin, pavel, ptikhomirov,
	linux-s390, corbet, mpe, rafael, ebiggers, borntraeger, sblbir,
	bonzini, arnd, jannh, raduweis, asmehra, graf, rppt, luto, gil,
	oridgar, colmmacc, tytso, gregkh, rdunlap, linux-kernel,
	ebiederm, ovzxemul, w, dwmw

On Wed, Feb 24, 2021 at 10:47:31AM +0200, Adrian Catangiu wrote:
> - Background and problem
> 
> The System Generation ID feature is required in virtualized or
> containerized environments by applications that work with local copies
> or caches of world-unique data such as random values, uuids,
> monotonically increasing counters, etc.
> Such applications can be negatively affected by VM or container
> snapshotting when the VM or container is either cloned or returned to
> an earlier point in time.
> 
> Furthermore, simply finding out about a system generation change is
> only the starting point of a process to renew internal states of
> possibly multiple applications across the system. This process requires
> a standard interface that applications can rely on and through which
> orchestration can be easily done.
> 
> - Solution
> 
> The System Generation ID is meant to help in these scenarios by
> providing a monotonically increasing u32 counter that changes each time
> the VM or container is restored from a snapshot.
> 
> The `sysgenid` driver exposes a monotonic incremental System Generation
> u32 counter via a char-dev filesystem interface accessible
> through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
> counter update notifications, as well as counter retrieval and
> confirmation mechanisms.
> The counter starts from zero when the driver is initialized and
> monotonically increments every time the system generation changes.
> 
> Userspace applications or libraries can (a)synchronously consume the
> system generation counter through the provided filesystem interface, to
> make any necessary internal adjustments following a system generation
> update.
> 
> The provided filesystem interface operations can be used to build a
> system level safe workflow that guest software can follow to protect
> itself from negative system snapshot effects.
> 
> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
> symbol which can be used by backend drivers to drive system generation
> changes based on hardware events.
> System generation changes can also be driven by userspace software
> through a dedicated driver ioctl.
> 
> **Please note**, SysGenID alone does not guarantee complete snapshot
> safety to applications using it. A certain workflow needs to be
> followed at the system level, in order to make the system
> snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> section in the included documentation.
> 
> Signed-off-by: Adrian Catangiu <acatan@amazon.com>
> ---
>  Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  MAINTAINERS                                        |   8 +
>  drivers/misc/Kconfig                               |  15 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
>  include/uapi/linux/sysgenid.h                      |  18 ++
>  7 files changed, 594 insertions(+)
>  create mode 100644 Documentation/misc-devices/sysgenid.rst
>  create mode 100644 drivers/misc/sysgenid.c
>  create mode 100644 include/uapi/linux/sysgenid.h
> 
> diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
> new file mode 100644
> index 0000000..0b8199b
> --- /dev/null
> +++ b/Documentation/misc-devices/sysgenid.rst
> @@ -0,0 +1,229 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +SYSGENID
> +========
> +
> +The System Generation ID feature is required in virtualized or
> +containerized environments by applications that work with local copies
> +or caches of world-unique data such as random values, UUIDs,
> +monotonically increasing counters, etc.
> +Such applications can be negatively affected by VM or container
> +snapshotting when the VM or container is either cloned or returned to
> +an earlier point in time.
> +
> +The System Generation ID is meant to help in these scenarios by
> +providing a monotonically increasing counter that changes each time the
> +VM or container is restored from a snapshot. The driver for it lives at
> +``drivers/misc/sysgenid.c``.
> +
> +The ``sysgenid`` driver exposes a monotonic incremental System
> +Generation u32 counter via a char-dev filesystem interface accessible
> +through ``/dev/sysgenid`` that provides sync and async SysGen counter
> +update notifications. It also provides SysGen counter retrieval and
> +confirmation mechanisms.
> +
> +The counter starts from zero when the driver is initialized and
> +monotonically increments every time the system generation changes.
> +
> +The ``sysgenid`` driver exports the ``void sysgenid_bump_generation()``
> +symbol which can be used by backend drivers to drive system generation
> +changes based on hardware events.
> +System generation changes can also be driven by userspace software
> +through a dedicated driver ioctl.
> +
> +Userspace applications or libraries can (a)synchronously consume the
> +system generation counter through the provided filesystem interface, to
> +make any necessary internal adjustments following a system generation
> +update.
> +
> +**Please note**, SysGenID alone does not guarantee complete snapshot
> +safety to applications using it. A certain workflow needs to be
> +followed at the system level, in order to make the system
> +snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
> +section below.
> +
> +Driver filesystem interface
> +===========================
> +
> +``open()``:
> +  When the device is opened, a copy of the current SysGenID (counter)
> +  is associated with the open file descriptor. Every open file
> +  descriptor will have readable data available (EPOLLIN) while its
> +  current copy of the SysGenID is outdated. Reading from the fd will
> +  provide the latest SysGenID, while writing to the fd will update the
> +  fd-local copy of the SysGenID and is used as a confirmation
> +  mechanism.
> +
> +``read()``:
> +  Read is meant to provide the *new* system generation counter when a
> +  generation change takes place. The read operation blocks until the
> +  associated counter is no longer up to date, at which point the new
> +  counter is provided/returned.  Nonblocking ``read()`` returns
> +  ``EAGAIN`` to signal that there is no *new* counter value available.
> +  The generation counter is considered *new* for each open file
> +  descriptor that hasn't confirmed the new value following a generation
> +  change. Therefore, once a generation change takes place, all
> +  ``read()`` calls will immediately return the new generation counter
> +  and will continue to do so until the new value is confirmed back to
> +  the driver through ``write()``.
> +  Partial reads are not allowed - read buffer needs to be at least
> +  32 bits in size.
> +
> +``write()``:
> +  Write is used to confirm the up-to-date SysGenID counter back to the
> +  driver.
> +  Following a VM generation change, all existing watchers are marked
> +  as *outdated*. Each file descriptor will maintain the *outdated*
> +  status until a ``write()`` containing the new up-to-date generation
> +  counter is used as an update confirmation mechanism.
> +  Partial writes are not allowed - write buffer should be exactly
> +  32 bits in size.
> +
> +``poll()``:
> +  Poll is implemented to allow polling for generation counter updates.
> +  Such updates result in ``EPOLLIN`` polling status until the new
> +  up-to-date counter is confirmed back to the driver through a
> +  ``write()``.
> +
> +``ioctl()``:
> +  The driver also adds support for waiting on open file descriptors
> +  that haven't acknowledged a generation counter update, as well as a
> +  mechanism for userspace to *trigger* a generation update:
> +
> +  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
> +    status for current file descriptor. When watcher tracking is
> +    enabled, the driver tracks this file descriptor as an independent
> +    *watcher*. The driver keeps accounting of how many watchers have
> +    confirmed the latest Sys-Gen-Id counter and how many of them are
> +    *outdated*; an outdated watcher is a *tracked* open file descriptor
> +    that has lived through a Sys-Gen-Id change but has not yet confirmed
> +    the new generation counter.
> +    Software that wants to be waited on by the system while it adjusts
> +    to generation changes, should turn tracking on. The sysgenid driver
> +    then keeps track of it and can block system-level adjustment process
> +    until the software has finished adjusting and confirmed it through a
> +    ``write()``.
> +    Tracking is disabled by default and file descriptors need to
> +    explicitly opt-in using this IOCTL.
> +  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
> +    tracked watchers or, if a ``timeout`` argument is provided, until
> +    the timeout expires.
> +    If the current caller is *outdated* or a generation change happens
> +    while waiting (thus making current caller *outdated*), the ioctl
> +    returns ``-EINTR`` to signal the user to handle event and retry.
> +  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
> +    It takes a ``minimum-generation`` argument which represents the
> +    minimum value the generation counter will be set to. For example if
> +    current generation is ``5`` and ``SYSGENID_TRIGGER_GEN_UPDATE(8)``
> +    is called, the generation counter will increment to ``8``.

And what if it's 9?

> +    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
> +    or CAP_SYS_ADMIN capabilities.
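
Going by _bump_generation() further down in this patch, the new value
is max(min_gen, current + 1), so the counter always advances by at least
one. A small userspace model of that rule, for illustration only
(next_generation() is a made-up helper, not something the driver
exports):

```c
/*
 * Hypothetical userspace model of the update rule in _bump_generation():
 * the counter moves to at least min_gen and always by at least one.
 */
static int next_generation(int current, int min_gen)
{
	int bumped = current + 1;

	return min_gen > bumped ? min_gen : bumped;
}
```

With these semantics next_generation(5, 8) gives 8, while
next_generation(9, 8) gives 10.
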
> +
> +``mmap()``:
> +  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
> +  in size. The first 4 bytes of the mapped page will contain an
> +  up-to-date u32 copy of the system generation counter.
> +  The mapped memory can be used as a low-latency generation counter
> +  probe mechanism in critical sections.
> +  The mmap() interface is targeted at libraries or code that needs to
> +  check for generation changes in-line, where an event loop is not
> +  available or read()/write() syscalls are too expensive.
> +  In such cases, logic can be added in-line with the sensitive code to
> +  check and trigger on-demand/just-in-time readjustments when changes
> +  are detected on the memory mapped generation counter.
> +  Users of this interface that plan to lazily adjust should not enable
> +  watcher tracking, since waiting on them doesn't make sense.
> +
> +``close()``:
> +  Removes the file descriptor as a system generation counter *watcher*.
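
A minimal sketch of the lazy mmap() probe described above, assuming a
single read-only MAP_SHARED page whose first 4 bytes hold the counter.
The names map_generation(), struct gen_cache and generation_changed()
are made up for illustration; the path is a parameter so the same code
can be pointed at /dev/sysgenid or, for experiments, at a regular file:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one read-only page of `path` and return a pointer to its first
 * 32 bits, or NULL on failure. The mapping stays valid after close().
 */
static const volatile uint32_t *map_generation(const char *path)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/* Hypothetical cache that remembers which generation its contents belong to. */
struct gen_cache {
	uint32_t seen_gen;
};

/*
 * Returns 1 when the mapped counter differs from the generation the cache
 * was built for (and records the new value), 0 otherwise. On 1 the caller
 * is expected to regenerate its snapshot-sensitive state before using it.
 */
static int generation_changed(struct gen_cache *c, const volatile uint32_t *gen)
{
	uint32_t now = *gen;

	if (now == c->seen_gen)
		return 0;
	c->seen_gen = now;
	return 1;
}
```

A library would call generation_changed() at the top of each sensitive
entry point and reseed when it returns 1, without enabling watcher
tracking.
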
> +
> +Snapshot Safety Prerequisites
> +=============================
> +
> +If VM, container or other system-level snapshots happen asynchronously,
> +at arbitrary times during an active workload there is no practical way
> +to ensure that in-flight local copies or caches of world-unique data
> +such as random values, secrets, UUIDs, etc are properly scrubbed and
> +regenerated.
> +The challenge stems from the fact that the categorization of data as
> +snapshot-sensitive is only known to the software working with it, and
> +this software has no logical control over the moment in time when an
> +external system snapshot occurs.
> +
> +Let's take an OpenSSL session token for example. Even if the library
> +code is made 100% snapshot-safe, meaning the library guarantees that
> +the session token is unique (any snapshot that happened during the
> +library call did not duplicate or leak the token), the token is still
> +vulnerable to snapshot events while it transits the various layers of
> +the library caller, then the various layers of the OS before leaving
> +the system.
> +
> +To catch a secret while it's in-flight, we'd have to validate system
> +generation at every layer, every step of the way. Even if that would
> +be deemed the right solution, it would be a long road and a whole
> +universe to patch before we get there.
> +
> +Bottom line is we don't have a way to track all of these in-flight
> +secrets and dynamically scrub them from existence with snapshot
> +events happening arbitrarily.

The above should try harder to explain what needs to be scrubbed and
why. For example, I personally don't really know what the OpenSSL
session token in the example is or what makes it vulnerable. I guess
snapshots can attack each other?

Here's a simple example of a workflow that submits transactions
to a database and wants to avoid duplicate transactions.
This does not require overseer magic. It does, however, require
a correct genid from the hypervisor, so no mmap tricks work.



	int genid, oldgenid, transid;

	read(&genid);
start:
	oldgenid = genid;
	transid = submit transaction
	read(&genid);
	if (genid != oldgenid) {
		revert transaction (transid);
		goto start;
	}
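
The same loop as compilable C, as a rough sketch only. Everything named
here is hypothetical: struct txn_ops stands in for whatever the
application uses to read the genid and to submit or revert a
transaction, and struct demo is a toy scripted backend that exists only
to exercise the loop:

```c
#include <stdint.h>

/* Hypothetical hooks supplied by the application. */
struct txn_ops {
	uint32_t (*current_gen)(void *ctx);	/* read the current genid */
	int (*submit)(void *ctx);		/* returns a transaction id */
	void (*revert)(void *ctx, int id);
};

/*
 * Submit one transaction; if the generation changed while it was in
 * flight, revert it and try again. Returns the id that stuck.
 */
static int submit_once(const struct txn_ops *ops, void *ctx)
{
	uint32_t before, after;
	int id;

	after = ops->current_gen(ctx);
	for (;;) {
		before = after;
		id = ops->submit(ctx);
		after = ops->current_gen(ctx);
		if (after == before)
			return id;
		ops->revert(ctx, id);
	}
}

/* Toy backend: replays a scripted list of generation values. */
struct demo {
	const uint32_t *gens;
	int pos, len;
	int next_id;
	int reverted;
};

static uint32_t demo_gen(void *ctx)
{
	struct demo *d = ctx;
	uint32_t g = d->gens[d->pos];

	if (d->pos + 1 < d->len)
		d->pos++;
	return g;
}

static int demo_submit(void *ctx)
{
	struct demo *d = ctx;

	return ++d->next_id;
}

static void demo_revert(void *ctx, int id)
{
	struct demo *d = ctx;

	(void)id;
	d->reverted++;
}
```

With the script {1, 2, 2} the first transaction is reverted because the
genid moved from 1 to 2 underneath it, and the second one is kept.
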






> +Simplifying assumption - safety prerequisite
> +--------------------------------------------
> +
> +**Control the snapshot flow**, disallow snapshots coming at arbitrary
> +moments in the workload lifetime.
> +
> +Use a system-level overseer entity that quiesces the system before
> +snapshot, and post-snapshot-resume oversees that software components
> +have readjusted to new environment, to the new generation. Only after,
> +will the overseer un-quiesce the system and allow active workloads.
> +
> +Software components can choose whether they want to be tracked and
> +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
> +IOCTL.
> +
> +The sysgenid framework standardizes the API for system software to
> +find out about needing to readjust and at the same time provides a
> +mechanism for the overseer entity to wait for everyone to be done, the
> +system to have readjusted, so it can un-quiesce.
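
A rough sketch of the overseer side for the software-driven case,
assuming the ioctl numbers from the uapi header proposed in this patch
(copied locally here, since the header is not assumed to be installed).
overseer_resume() is a made-up name. Because the overseer's own fd is
outdated after the bump, the sketch reads and confirms the new value
before waiting; otherwise SYSGENID_WAIT_WATCHERS is documented to fail
with EINTR:

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copied from the include/uapi/linux/sysgenid.h proposed in this patch. */
#define SYSGENID_IOCTL			0xE4
#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)

/*
 * After resume: bump the generation, confirm it on our own fd, then wait
 * for tracked watchers. timeout_ms is in milliseconds; 0 means no timeout
 * in this version of the driver. Returns 0 or a negative errno.
 */
static int overseer_resume(int fd, unsigned long min_gen,
			   unsigned long timeout_ms)
{
	unsigned int gen;

	if (ioctl(fd, SYSGENID_TRIGGER_GEN_UPDATE, min_gen) < 0)
		return -errno;
	/* Fetch and acknowledge the new counter on the overseer's fd. */
	if (read(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -EIO;
	if (write(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -EIO;
	if (ioctl(fd, SYSGENID_WAIT_WATCHERS, timeout_ms) < 0)
		return -errno;
	return 0;
}
```

A caller that gets -EINTR back would be expected to redo the read and
write and retry, as the ioctl description asks.
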
> +
> +Example snapshot-safe workflow
> +------------------------------
> +
> +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
> +   how this is achieved is very workload-specific, but the general
> +   description is to get all software to an expected state where their
> +   event loops dry up and they are effectively quiesced.

If you have the ability to do this by communicating with
all processes, e.g. through a Unix domain socket,
why do you need the rest of the stuff in the kernel?
Quiescing is a harder problem than waking up.

> +2) Take snapshot.
> +3) Resume the VM/container/system from said snapshot.
> +4) SysGenID counter will either automatically increment if there is
> +   a vmgenid backend (hw-driven), or overseer will trigger generation
> +   bump using ``SYSGENID_TRIGGER_GEN_UPDATE`` IOCTL (sw-driven).
> +5) Software components which have ``/dev/sysgenid`` in their event
> +   loops (either using ``poll()`` or ``read()``) are notified of the
> +   generation change.
> +   They do their specific internal adjustments. Some may have requested
> +   to be tracked and waited on by the overseer, others might choose to
> +   do their adjustments out of band and not block the overseer.
> +   Tracked ones *must* signal when they are done/ready with a ``write()``
> +   while the rest *should* also do so for cleanliness, but it's not
> +   mandatory.
> +6) Overseer will block and wait for all tracked watchers by using the
> +   ``SYSGENID_WAIT_WATCHERS`` IOCTL. Once all tracked watchers are done
> +   in step 5, this overseer will return from this blocking ioctl knowing
> +   that the system has readjusted and is ready for active workload.
> +7) Overseer un-quiesces system.
> +8) There is a class of software, usually libraries, most notably PRNGs
> +   or SSLs, that don't fit the event-loop model and also have strict
> +   latency requirements. These can take advantage of the ``mmap()``
> +   interface and lazily adjust on-demand whenever they are called after
> +   un-quiesce.
> +   For a well-designed service stack, these libraries should not be
> +   called while system is quiesced. When workload is resumed by the
> +   overseer, on the first call into these libs, they will safely JIT
> +   readjust.
> +   Users of this lazy on-demand readjustment model should not enable
> +   watcher tracking since doing so would introduce a logical deadlock:
> +   lazy adjustments happen only after un-quiesce, but un-quiesce is
> +   blocked until all tracked watchers are up-to-date.
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index d02ba2f..39f9482 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -357,6 +357,7 @@ Code  Seq#    Include File                                           Comments
>  0xDB  00-0F  drivers/char/mwave/mwavepub.h
>  0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
>                                                                       <mailto:aherrman@de.ibm.com>
> +0xE4  01-03  uapi/linux/sysgenid.h                                   SysGenID misc driver
>  0xE5  00-3F  linux/fuse.h
>  0xEC  00-01  drivers/platform/chrome/cros_ec_dev.h                   ChromeOS EC driver
>  0xF3  00-3F  drivers/usb/misc/sisusbvga/sisusb.h                     sisfb (in development)
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1d75afa..b812dad8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -17261,6 +17261,14 @@ L:	linux-mmc@vger.kernel.org
>  S:	Maintained
>  F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
>  
> +SYSGENID
> +M:	Adrian Catangiu <acatan@amazon.com>
> +L:	linux-kernel@vger.kernel.org
> +S:	Supported
> +F:	Documentation/misc-devices/sysgenid.rst
> +F:	drivers/misc/sysgenid.c
> +F:	include/uapi/linux/sysgenid.h
> +
>  SYSTEM CONFIGURATION (SYSCON)
>  M:	Lee Jones <lee.jones@linaro.org>
>  M:	Arnd Bergmann <arnd@arndb.de>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index fafa8b0..a2b7cae 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -456,6 +456,21 @@ config PVPANIC
>  	  a paravirtualized device provided by QEMU; it lets a virtual machine
>  	  (guest) communicate panic events to the host.
>  
> +config SYSGENID
> +	tristate "System Generation ID driver"
> +	help
> +	  This is a System Generation ID driver which provides a system
> +	  generation counter. The driver exposes FS ops on /dev/sysgenid
> +	  through which it can provide information and notifications on system
> +	  generation changes that happen because of VM or container snapshots
> +	  or cloning.
> +	  This enables applications and libraries that store or cache
> +	  sensitive information, to know that they need to regenerate it
> +	  after process memory has been exposed to potential copying.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called sysgenid.
> +
>  config HISI_HIKEY_USB
>  	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
>  	depends on (OF && GPIOLIB) || COMPILE_TEST
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d23231e..4b4933d 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -57,3 +57,4 @@ obj-$(CONFIG_HABANA_AI)		+= habanalabs/
>  obj-$(CONFIG_UACCE)		+= uacce/
>  obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
>  obj-$(CONFIG_HISI_HIKEY_USB)	+= hisi_hikey_usb.o
> +obj-$(CONFIG_SYSGENID)		+= sysgenid.o
> diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
> new file mode 100644
> index 0000000..ace292b
> --- /dev/null
> +++ b/drivers/misc/sysgenid.c
> @@ -0,0 +1,322 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * System Generation ID driver
> + *
> + * Copyright (C) 2020 Amazon. All rights reserved.
> + *
> + *	Authors:
> + *	  Adrian Catangiu <acatan@amazon.com>
> + *
> + */
> +#include <linux/acpi.h>
> +#include <linux/kernel.h>
> +#include <linux/minmax.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/random.h>
> +#include <linux/uuid.h>
> +#include <linux/sysgenid.h>
> +
> +struct sysgenid_data {
> +	unsigned long		map_buf;
> +	wait_queue_head_t	read_waitq;
> +	atomic_t		generation_counter;
> +
> +	unsigned int		watchers;
> +	atomic_t		outdated_watchers;
> +	wait_queue_head_t	outdated_waitq;
> +	spinlock_t		lock;
> +};
> +static struct sysgenid_data sysgenid_data;
> +
> +struct file_data {
> +	bool tracked_watcher;
> +	int acked_gen_counter;
> +};
> +
> +static int equals_gen_counter(unsigned int counter)
> +{
> +	return counter == atomic_read(&sysgenid_data.generation_counter);
> +}
> +
> +static void _bump_generation(int min_gen)
> +{
> +	unsigned long flags;
> +	int counter;
> +
> +	spin_lock_irqsave(&sysgenid_data.lock, flags);
> +	counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
> +	atomic_set(&sysgenid_data.generation_counter, counter);
> +	*((int *) sysgenid_data.map_buf) = counter;
> +	atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
> +
> +	wake_up_interruptible(&sysgenid_data.read_waitq);
> +	wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +}
> +
> +void sysgenid_bump_generation(void)
> +{
> +	_bump_generation(0);
> +}
> +EXPORT_SYMBOL_GPL(sysgenid_bump_generation);
> +
> +static void put_outdated_watchers(void)
> +{
> +	if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
> +		wake_up_interruptible(&sysgenid_data.outdated_waitq);
> +}
> +
> +static void start_fd_tracking(struct file_data *fdata)
> +{
> +	unsigned long flags;
> +
> +	if (!fdata->tracked_watcher) {
> +		/* enable tracking this fd as a watcher */
> +		spin_lock_irqsave(&sysgenid_data.lock, flags);
> +			fdata->tracked_watcher = 1;
> +			++sysgenid_data.watchers;
> +			if (!equals_gen_counter(fdata->acked_gen_counter))
> +				atomic_inc(&sysgenid_data.outdated_watchers);
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +	}
> +}
> +
> +static void stop_fd_tracking(struct file_data *fdata)
> +{
> +	unsigned long flags;
> +
> +	if (fdata->tracked_watcher) {
> +		/* stop tracking this fd as a watcher */
> +		spin_lock_irqsave(&sysgenid_data.lock, flags);
> +		if (!equals_gen_counter(fdata->acked_gen_counter))
> +			put_outdated_watchers();
> +		--sysgenid_data.watchers;
> +		fdata->tracked_watcher = 0;
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +	}
> +}
> +
> +static int sysgenid_open(struct inode *inode, struct file *file)
> +{
> +	struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
> +
> +	if (!fdata)
> +		return -ENOMEM;
> +	fdata->tracked_watcher = 0;
> +	fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
> +	file->private_data = fdata;
> +
> +	return 0;
> +}
> +
> +static int sysgenid_close(struct inode *inode, struct file *file)
> +{
> +	struct file_data *fdata = file->private_data;
> +
> +	stop_fd_tracking(fdata);
> +	kfree(fdata);
> +
> +	return 0;
> +}
> +
> +static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
> +		size_t nbytes, loff_t *ppos)
> +{
> +	struct file_data *fdata = file->private_data;
> +	ssize_t ret;
> +	int gen_counter;
> +
> +	if (nbytes == 0)
> +		return 0;
> +	/* disallow partial reads */
> +	if (nbytes < sizeof(gen_counter))
> +		return -EINVAL;
> +
> +	if (equals_gen_counter(fdata->acked_gen_counter)) {
> +		if (file->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +		ret = wait_event_interruptible(
> +			sysgenid_data.read_waitq,
> +			!equals_gen_counter(fdata->acked_gen_counter)
> +		);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	gen_counter = atomic_read(&sysgenid_data.generation_counter);
> +	ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
> +	if (ret)
> +		return -EFAULT;
> +
> +	return sizeof(gen_counter);
> +}
> +
> +static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
> +		size_t count, loff_t *ppos)
> +{
> +	struct file_data *fdata = file->private_data;
> +	unsigned int new_acked_gen;
> +	unsigned long flags;
> +
> +	/* disallow partial writes */
> +	if (count != sizeof(new_acked_gen))
> +		return -ENOBUFS;
> +	if (copy_from_user(&new_acked_gen, ubuf, count))
> +		return -EFAULT;
> +
> +	spin_lock_irqsave(&sysgenid_data.lock, flags);
> +	/* wrong gen-counter acknowledged */
> +	if (!equals_gen_counter(new_acked_gen)) {
> +		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +		return -EINVAL;
> +	}
> +	/* update acked gen-counter if necessary */
> +	if (!equals_gen_counter(fdata->acked_gen_counter)) {
> +		fdata->acked_gen_counter = new_acked_gen;
> +		if (fdata->tracked_watcher)
> +			put_outdated_watchers();
> +	}
> +	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
> +
> +	return (ssize_t)count;
> +}
> +
> +static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
> +{
> +	__poll_t mask = 0;
> +	struct file_data *fdata = file->private_data;
> +
> +	if (!equals_gen_counter(fdata->acked_gen_counter))
> +		return EPOLLIN | EPOLLRDNORM;
> +
> +	poll_wait(file, &sysgenid_data.read_waitq, wait);
> +
> +	if (!equals_gen_counter(fdata->acked_gen_counter))
> +		mask = EPOLLIN | EPOLLRDNORM;
> +
> +	return mask;
> +}
> +
> +static long sysgenid_ioctl(struct file *file,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	struct file_data *fdata = file->private_data;
> +	bool tracking = !!arg;
> +	unsigned long timeout_ns, min_gen;
> +	ktime_t until;
> +	int ret = 0;
> +
> +	switch (cmd) {
> +	case SYSGENID_SET_WATCHER_TRACKING:
> +		if (tracking)
> +			start_fd_tracking(fdata);
> +		else
> +			stop_fd_tracking(fdata);
> +		break;
> +	case SYSGENID_WAIT_WATCHERS:
> +		timeout_ns = arg * NSEC_PER_MSEC;
> +		until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
> +
> +		ret = wait_event_interruptible_hrtimeout(
> +			sysgenid_data.outdated_waitq,
> +			(!atomic_read(&sysgenid_data.outdated_watchers) ||
> +					!equals_gen_counter(fdata->acked_gen_counter)),
> +			until
> +		);
> +		if (!equals_gen_counter(fdata->acked_gen_counter))
> +			ret = -EINTR;
> +		break;
> +	case SYSGENID_TRIGGER_GEN_UPDATE:
> +		if (!checkpoint_restore_ns_capable(current_user_ns()))
> +			return -EACCES;
> +		min_gen = arg;
> +		_bump_generation(min_gen);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct file_data *fdata = file->private_data;
> +
> +	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
> +		return -EINVAL;
> +
> +	if ((vma->vm_flags & VM_WRITE) != 0)
> +		return -EPERM;
> +
> +	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags &= ~VM_MAYWRITE;
> +	vma->vm_private_data = fdata;
> +
> +	return vm_insert_page(vma, vma->vm_start,
> +			virt_to_page(sysgenid_data.map_buf));
> +}
> +
> +static const struct file_operations fops = {
> +	.owner		= THIS_MODULE,
> +	.mmap		= sysgenid_mmap,
> +	.open		= sysgenid_open,
> +	.release	= sysgenid_close,
> +	.read		= sysgenid_read,
> +	.write		= sysgenid_write,
> +	.poll		= sysgenid_poll,
> +	.unlocked_ioctl	= sysgenid_ioctl,
> +};
> +
> +static struct miscdevice sysgenid_misc = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "sysgenid",
> +	.fops = &fops,
> +};
> +
> +static int __init sysgenid_init(void)
> +{
> +	int ret;
> +
> +	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> +	if (!sysgenid_data.map_buf)
> +		return -ENOMEM;
> +
> +	atomic_set(&sysgenid_data.generation_counter, 0);
> +	atomic_set(&sysgenid_data.outdated_watchers, 0);
> +	init_waitqueue_head(&sysgenid_data.read_waitq);
> +	init_waitqueue_head(&sysgenid_data.outdated_waitq);
> +	spin_lock_init(&sysgenid_data.lock);
> +
> +	ret = misc_register(&sysgenid_misc);
> +	if (ret < 0) {
> +		pr_err("misc_register() failed for sysgenid\n");
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +
> +	return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> +	misc_deregister(&sysgenid_misc);
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);
> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");
> diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
> new file mode 100644
> index 0000000..7279df6
> --- /dev/null
> +++ b/include/uapi/linux/sysgenid.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_LINUX_SYSGENID_H
> +#define _UAPI_LINUX_SYSGENID_H
> +
> +#include <linux/ioctl.h>
> +
> +#define SYSGENID_IOCTL			0xE4
> +#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
> +#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
> +#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)
> +
> +#ifdef __KERNEL__
> +void sysgenid_bump_generation(void);
> +#endif /* __KERNEL__ */
> +
> +#endif /* _UAPI_LINUX_SYSGENID_H */
> +
> -- 
> 2.7.4
> 
> 
> 
> 
> Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
  2021-02-24  9:19     ` Michael S. Tsirkin
@ 2021-02-24 13:45       ` Alexander Graf
  -1 siblings, 0 replies; 23+ messages in thread
From: Alexander Graf @ 2021-02-24 13:45 UTC (permalink / raw)
  To: Michael S. Tsirkin, Adrian Catangiu
  Cc: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390, gregkh,
	rdunlap, arnd, ebiederm, rppt, 0x7f454c46, borntraeger, Jason,
	jannh, w, colmmacc, luto, tytso, ebiggers, dwmw, bonzini, sblbir,
	raduweis, corbet, mhocko, rafael, pavel, mpe, areber, ovzxemul,
	avagin, ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar,
	ghammer


On 24.02.21 10:19, Michael S. Tsirkin wrote:
> 
> On Wed, Feb 24, 2021 at 10:47:31AM +0200, Adrian Catangiu wrote:
>> - Background and problem
>>
>> The System Generation ID feature is required in virtualized or
>> containerized environments by applications that work with local copies
>> or caches of world-unique data such as random values, uuids,
>> monotonically increasing counters, etc.
>> Such applications can be negatively affected by VM or container
>> snapshotting when the VM or container is either cloned or returned to
>> an earlier point in time.
>>
>> Furthermore, simply finding out about a system generation change is
>> only the starting point of a process to renew internal states of
>> possibly multiple applications across the system. This process requires
>> a standard interface that applications can rely on and through which
>> orchestration can be easily done.
>>
>> - Solution
>>
>> The System Generation ID is meant to help in these scenarios by
>> providing a monotonically increasing u32 counter that changes each time
>> the VM or container is restored from a snapshot.
>>
>> The `sysgenid` driver exposes a monotonic incremental System Generation
>> u32 counter via a char-dev filesystem interface accessible
>> through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
>> counter update notifications, as well as counter retrieval and
>> confirmation mechanisms.
>> The counter starts from zero when the driver is initialized and
>> monotonically increments every time the system generation changes.
>>
>> Userspace applications or libraries can (a)synchronously consume the
>> system generation counter through the provided filesystem interface, to
>> make any necessary internal adjustments following a system generation
>> update.
>>
>> The provided filesystem interface operations can be used to build a
>> system level safe workflow that guest software can follow to protect
>> itself from negative system snapshot effects.
>>
>> The `sysgenid` driver exports the `void sysgenid_bump_generation()`
>> symbol which can be used by backend drivers to drive system generation
>> changes based on hardware events.
>> System generation changes can also be driven by userspace software
>> through a dedicated driver ioctl.
>>
>> **Please note**, SysGenID alone does not guarantee complete snapshot
>> safety to applications using it. A certain workflow needs to be
>> followed at the system level, in order to make the system
>> snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
>> section in the included documentation.
>>
>> Signed-off-by: Adrian Catangiu <acatan@amazon.com>
>> ---
>>   Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
>>   Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>>   MAINTAINERS                                        |   8 +
>>   drivers/misc/Kconfig                               |  15 +
>>   drivers/misc/Makefile                              |   1 +
>>   drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
>>   include/uapi/linux/sysgenid.h                      |  18 ++
>>   7 files changed, 594 insertions(+)
>>   create mode 100644 Documentation/misc-devices/sysgenid.rst
>>   create mode 100644 drivers/misc/sysgenid.c
>>   create mode 100644 include/uapi/linux/sysgenid.h
>>

[...]

>> +``ioctl()``:
>> +  The driver also adds support for waiting on open file descriptors
>> +  that haven't acknowledged a generation counter update, as well as a
>> +  mechanism for userspace to *trigger* a generation update:
>> +
>> +  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
>> +    status for current file descriptor. When watcher tracking is
>> +    enabled, the driver tracks this file descriptor as an independent
>> +    *watcher*. The driver keeps accounting of how many watchers have
>> +    confirmed the latest Sys-Gen-Id counter and how many of them are
>> +    *outdated*; an outdated watcher is a *tracked* open file descriptor
>> +    that has lived through a Sys-Gen-Id change but has not yet confirmed
>> +    the new generation counter.
>> +    Software that wants to be waited on by the system while it adjusts
>> +    to generation changes, should turn tracking on. The sysgenid driver
>> +    then keeps track of it and can block system-level adjustment process
>> +    until the software has finished adjusting and confirmed it through a
>> +    ``write()``.
>> +    Tracking is disabled by default and file descriptors need to
>> +    explicitly opt-in using this IOCTL.
>> +  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
>> +    tracked watchers or, if a ``timeout`` argument is provided, until
>> +    the timeout expires.
>> +    If the current caller is *outdated* or a generation change happens
>> +    while waiting (thus making current caller *outdated*), the ioctl
>> +    returns ``-EINTR`` to signal the user to handle event and retry.
>> +  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
>> +    It takes a ``minimum-generation`` argument which represents the
>> +    minimum value the generation counter will be set to. For example if
>> +    current generation is ``5`` and ``SYSGENID_TRIGGER_GEN_UPDATE(8)``
>> +    is called, the generation counter will increment to ``8``.
> 
> And what if it's 9?

Then it becomes 10. The argument is only a lower bound: it names the
smallest generation the system is matching against, and the counter
still has to change.
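
To make the rule concrete, here is a tiny userspace model of how I read
the documented semantics. It is only an illustration: next_generation()
is a made-up name, and the driver's real _bump_generation() is not
reproduced here.

```c
#include <stdint.h>

/*
 * Hypothetical model of SYSGENID_TRIGGER_GEN_UPDATE(min_gen), not driver
 * code: the counter always moves, and it lands on min_gen when min_gen
 * is ahead of the current value.
 */
static uint32_t next_generation(uint32_t cur, uint32_t min_gen)
{
	if (cur < min_gen)
		return min_gen;	/* current 5, min_gen 8: jump to 8 */
	return cur + 1;		/* current 9, min_gen 8: plain increment to 10 */
}
```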

The only thing I have a slight concern over here is an overflow. What if 
my generation id is 0x7fffffff? For starters, it'd probably be better to 
treat the counter as ulong so it matches the atomic_t, no?

But then you would still have the same situation, just with a wrap to 0 
instead of a wrap to negative. I guess the answer is "users of this API 
will not get a guarantee that the counters are monotonically increasing. 
They have to check for != instead of < or >".
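
In code, that means a consumer compares for inequality only. A minimal
sketch, with helper names of my own choosing:

```c
#include <stdbool.h>
#include <stdint.h>

/* Robust: any difference from the cached value means the generation changed. */
static bool gen_changed(uint32_t cached, uint32_t current)
{
	return cached != current;
}

/* Fragile: an ordered comparison misses the step from 0xffffffff to 0. */
static bool gen_changed_ordered(uint32_t cached, uint32_t current)
{
	return current > cached;
}
```

With cached = 0xffffffff and current = 0, only the first helper reports
a change.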

> 
>> +    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
>> +    or CAP_SYS_ADMIN capabilities.
>> +
>> +``mmap()``:
>> +  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
>> +  in size. The first 4 bytes of the mapped page will contain an
>> +  up-to-date u32 copy of the system generation counter.
>> +  The mapped memory can be used as a low-latency generation counter
>> +  probe mechanism in critical sections.
>> +  The mmap() interface is targeted at libraries or code that needs to
>> +  check for generation changes in-line, where an event loop is not
>> +  available or read()/write() syscalls are too expensive.
>> +  In such cases, logic can be added in-line with the sensitive code to
>> +  check and trigger on-demand/just-in-time readjustments when changes
>> +  are detected on the memory mapped generation counter.
>> +  Users of this interface that plan to lazily adjust should not enable
>> +  watcher tracking, since waiting on them doesn't make sense.
>> +
>> +``close()``:
>> +  Removes the file descriptor as a system generation counter *watcher*.
>> +
>> +Snapshot Safety Prerequisites
>> +=============================
>> +
>> +If VM, container or other system-level snapshots happen asynchronously,
>> +at arbitrary times during an active workload there is no practical way
>> +to ensure that in-flight local copies or caches of world-unique data
>> +such as random values, secrets, UUIDs, etc are properly scrubbed and
>> +regenerated.
>> +The challenge stems from the fact that the categorization of data as
>> +snapshot-sensitive is only known to the software working with it, and
>> +this software has no logical control over the moment in time when an
>> +external system snapshot occurs.
>> +
>> +Let's take an OpenSSL session token for example. Even if the library
>> +code is made 100% snapshot-safe, meaning the library guarantees that
>> +the session token is unique (any snapshot that happened during the
>> +library call did not duplicate or leak the token), the token is still
>> +vulnerable to snapshot events while it transits the various layers of
>> +the library caller, then the various layers of the OS before leaving
>> +the system.
>> +
>> +To catch a secret while it's in-flight, we'd have to validate system
>> +generation at every layer, every step of the way. Even if that would
>> +be deemed the right solution, it would be a long road and a whole
>> +universe to patch before we get there.
>> +
>> +Bottom line is we don't have a way to track all of these in-flight
>> +secrets and dynamically scrub them from existence with snapshot
>> +events happening arbitrarily.
> 
> Above should try harder to explain what the things that need to be
> scrubbed are, and why. For example, I personally don't really know what
> the OpenSSL session token example is and what makes it vulnerable. I guess
> snapshots can attack each other?
> 
> 
> 
> 
> Here's a simple example of a workflow that submits transactions
> to a database and wants to avoid duplicate transactions.
> This does not require overseer magic. It does however require
> a correct genid from hypervisor, so no mmap tricks work.
> 
> 
> 
>          int genid, oldgenid;
>          read(&genid);
> start:
>          oldgenid = genid;
>          transid = submit transaction
>          read(&genid);
>          if (genid != oldgenid) {
>                          revert transaction (transid);
>                          goto start;
>          }

I'm not sure I fully follow. For starters, if this is a VM local 
database, I don't think you'd care about the genid. If it's a remote 
database, your connection would get dropped already at the point when 
you clone/resume, because TCP and your connection state machine will get 
really confused when you suddenly have a different IP address or two 
consumers of the same stream :).

But for the sake of the argument, let's assume you can have a 
connectionless database connection that maintains its own connection 
uniqueness logic. That database connector would need to understand how 
to abort the connection (and thus the transaction!) when the generation 
changes. And that's logic you would do with the read/write/notify 
mechanism. So your main loop would check for reads on the genid fd and 
after sending a connection termination, notify the overlord that it's 
safe to use the VM now.

The OpenSSL case (with mmap) is for libraries that are stateless and
cannot guarantee that they receive a genid notification event in time.

Since you asked, this is mainly important for the PRNG. Imagine an https 
server. You create a snapshot. You resume from that snapshot. OpenSSL is 
fully initialized with a user space PRNG randomness pool that it 
considers safe to consume. However, every resumed copy starts from the
same pool state, so your first connection after resume will be 100%
predictable randomness-wise.

The mmap mechanism allows the PRNG to reseed after a genid change. 
Because we don't have an event mechanism for this code path, that can 
happen minutes after the resume. But that's ok, we "just" have to ensure 
that nobody is consuming secret data at the point of the snapshot.
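
As a sketch of where that check would sit: the generator below keeps a
pointer to the first 4 bytes of the mapped page (which a caller would
get from a PROT_READ, MAP_SHARED mmap() of /dev/sysgenid) and compares
it against the generation it was seeded under. All names are invented
for the example, and xorshift64 is just a stand-in; a real library would
reseed its own DRBG at that point.

```c
#include <stdint.h>
#include <sys/random.h>

/* Toy generator, NOT cryptographically secure; it only shows the check. */
struct snap_prng {
	const volatile uint32_t *gen;	/* first 4 bytes of the mapped page */
	uint32_t seen_gen;		/* generation the state was seeded under */
	uint64_t state;
	unsigned int reseeds;
};

static int snap_prng_reseed(struct snap_prng *p)
{
	uint64_t seed;

	/* Record the generation first: if it moves again, the next call reseeds again. */
	p->seen_gen = *p->gen;
	if (getrandom(&seed, sizeof(seed), 0) != (ssize_t)sizeof(seed))
		return -1;
	p->state = seed | 1;	/* xorshift state must be non-zero */
	p->reseeds++;
	return 0;
}

static int snap_prng_init(struct snap_prng *p, const volatile uint32_t *gen)
{
	p->gen = gen;
	p->reseeds = 0;
	return snap_prng_reseed(p);
}

static int snap_prng_next(struct snap_prng *p, uint64_t *out)
{
	/* In-line probe: a plain memory load, no syscall. */
	if (*p->gen != p->seen_gen && snap_prng_reseed(p) < 0)
		return -1;

	p->state ^= p->state << 13;
	p->state ^= p->state >> 7;
	p->state ^= p->state << 17;
	*out = p->state;
	return 0;
}
```

Note that a snapshot can still land between the check and the use of
the output, which is exactly why the quiescing requirement stays.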

> 
> 
> 
> 
> 
> 
>> +Simplifying assumption - safety prerequisite
>> +--------------------------------------------
>> +
>> +**Control the snapshot flow**, disallow snapshots coming at arbitrary
>> +moments in the workload lifetime.
>> +
>> +Use a system-level overseer entity that quiesces the system before
>> +snapshot, and post-snapshot-resume oversees that software components
>> +have readjusted to new environment, to the new generation. Only after,
>> +will the overseer un-quiesce the system and allow active workloads.
>> +
>> +Software components can choose whether they want to be tracked and
>> +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
>> +IOCTL.
>> +
>> +The sysgenid framework standardizes the API for system software to
>> +find out about needing to readjust and at the same time provides a
>> +mechanism for the overseer entity to wait for everyone to be done, the
>> +system to have readjusted, so it can un-quiesce.
>> +
>> +Example snapshot-safe workflow
>> +------------------------------
>> +
>> +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
>> +   how this is achieved is very workload-specific, but the general
>> +   description is to get all software to an expected state where their
>> +   event loops dry up and they are effectively quiesced.
> 
> If you have the ability to do this by communicating with
> all processes, e.g. through a unix domain socket,
> why do you need the rest of the stuff in the kernel?
> Quiescing is a harder problem than waking up.

That depends. Think of a typical VM workload. Let's take the web server 
example again. You can preboot the full VM and snapshot it as is. As 
long as you don't allow any incoming connections, you can guarantee that 
the system is "quiesced" well enough for the snapshot.

This is really what this bullet point is about. The point is that you're 
not consuming randomness you can't reseed asynchronously (see the above 
OpenSSL PRNG example).


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
  2021-02-24 13:45       ` Alexander Graf
@ 2021-02-24 22:41         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2021-02-24 22:41 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Adrian Catangiu, linux-doc, linux-kernel, qemu-devel, kvm,
	linux-s390, gregkh, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mhocko, rafael, pavel,
	mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer

On Wed, Feb 24, 2021 at 02:45:03PM +0100, Alexander Graf wrote:
> > Above should try harder to explain what the things that need to be
> > scrubbed are, and why. For example, I personally don't really know what
> > the OpenSSL session token example is and what makes it vulnerable. I guess
> > snapshots can attack each other?
> > 
> > 
> > 
> > 
> > Here's a simple example of a workflow that submits transactions
> > to a database and wants to avoid duplicate transactions.
> > This does not require overseer magic. It does however require
> > a correct genid from hypervisor, so no mmap tricks work.
> > 
> > 
> > 
> >          int genid, oldgenid;
> >          read(&genid);
> > start:
> >          oldgenid = genid;
> >          transid = submit transaction
> >          read(&genid);
> >          if (genid != oldgenid) {
> >                          revert transaction (transid);
> >                          goto start;
> >          }
> 
> I'm not sure I fully follow. For starters, if this is a VM local database, I
> don't think you'd care about the genid. If it's a remote database, your
> connection would get dropped already at the point when you clone/resume,
> because TCP and your connection state machine will get really confused when
> you suddenly have a different IP address or two consumers of the same stream
> :).
>
> But for the sake of the argument, let's assume you can have a connectionless
> database connection that maintains its own connection uniqueness logic.

Right. E.g. not uncommon with REST APIs. They survive disconnect easily
and use cookies or such.

> That
> database connector would need to understand how to abort the connection (and
> thus the transaction!) when the generation changes.

The point is that instead of all that, you discover the transaction is
a duplicate and revert it.
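
For illustration, the retry loop quoted above could be written out in C
roughly as follows. This is only a sketch: read_genid, submit and revert
are hypothetical hooks named for this example (they are not part of the
patch set or of any database client), and the fake_* helpers merely
simulate a generation change so that the loop can be exercised without a
device.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical hooks, named for illustration only; none of them comes
 * from the patch set or from a real database client.
 */
struct txn_ops {
	uint32_t (*read_genid)(void *ctx);      /* current generation */
	int (*submit)(void *ctx);               /* returns a transaction id */
	void (*revert)(void *ctx, int txn_id);  /* undo a submitted transaction */
};

/*
 * Submit a transaction and keep it only if the generation did not change
 * while it was in flight. Returns the transaction id, or -1 when every
 * attempt straddled a generation change.
 */
static int submit_checked(const struct txn_ops *ops, void *ctx, int max_attempts)
{
	assert(ops && max_attempts > 0);

	uint32_t before = ops->read_genid(ctx);

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		int txn_id = ops->submit(ctx);
		uint32_t after = ops->read_genid(ctx);

		if (after == before)
			return txn_id;          /* no generation change in between */

		ops->revert(ctx, txn_id);       /* possibly duplicated: undo, retry */
		before = after;
	}
	return -1;
}

/* Simulated environment so the sketch can run without a real device. */
struct fake_env {
	uint32_t genid;
	int bump_on_submit;   /* how many upcoming submits bump the genid */
	int submitted;
	int reverted;
};

static uint32_t fake_read_genid(void *ctx)
{
	return ((struct fake_env *)ctx)->genid;
}

static int fake_submit(void *ctx)
{
	struct fake_env *env = ctx;

	if (env->bump_on_submit > 0) {
		env->bump_on_submit--;
		env->genid++;               /* pretend a restore happened mid-flight */
	}
	return ++env->submitted;
}

static void fake_revert(void *ctx, int txn_id)
{
	(void)txn_id;
	((struct fake_env *)ctx)->reverted++;
}

static const struct txn_ops fake_ops = { fake_read_genid, fake_submit, fake_revert };
```

Calling submit_checked(&fake_ops, &env, 3) with env.bump_on_submit = 1
reverts the first attempt and keeps the second. Bounding the retries is
a choice made for this sketch; the quoted pseudocode loops
unconditionally.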


> And that's logic you
> would do with the read/write/notify mechanism. So your main loop would check
> for reads on the genid fd and after sending a connection termination, notify
> the overlord that it's safe to use the VM now.
> 
> The OpenSSL case (with mmap) is for libraries that are stateless and can not
> guarantee that they receive a genid notification event timely.
> 
> Since you asked, this is mainly important for the PRNG. Imagine an https
> server. You create a snapshot. You resume from that snapshot. OpenSSL is
> fully initialized with a user space PRNG randomness pool that it considers
> safe to consume. However, that means your first connection after resume will
> be 100% predictable randomness wise.

I wonder whether something similar is possible here. I.e. use the secret
to encrypt stuff but check the gen ID before actually sending data.
If it changed re-encrypt. Hmm?

> 
> The mmap mechanism allows the PRNG to reseed after a genid change. Because
> we don't have an event mechanism for this code path, that can happen minutes
> after the resume. But that's ok, we "just" have to ensure that nobody is
> consuming secret data at the point of the snapshot.


Something I am still not clear on is whether it's really important to
skip the system call here. If not, I think it's prudent to just stick
to read for now; there's a slightly lower chance that it will get
misused. mmap, which gives you a laggy gen id value, really seems like
it would be hard to use correctly.


> > 
> > 
> > 
> > 
> > 
> > 
> > > +Simplifyng assumption - safety prerequisite
> > > +-------------------------------------------
> > > +
> > > +**Control the snapshot flow**, disallow snapshots coming at arbitrary
> > > +moments in the workload lifetime.
> > > +
> > > +Use a system-level overseer entity that quiesces the system before
> > > +snapshot, and post-snapshot-resume oversees that software components
> > > +have readjusted to new environment, to the new generation. Only after,
> > > +will the overseer un-quiesce the system and allow active workloads.
> > > +
> > > +Software components can choose whether they want to be tracked and
> > > +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
> > > +IOCTL.
> > > +
> > > +The sysgenid framework standardizes the API for system software to
> > > +find out about needing to readjust and at the same time provides a
> > > +mechanism for the overseer entity to wait for everyone to be done, the
> > > +system to have readjusted, so it can un-quiesce.
> > > +
> > > +Example snapshot-safe workflow
> > > +------------------------------
> > > +
> > > +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
> > > +   how this is achieved is very workload-specific, but the general
> > > +   description is to get all software to an expected state where their
> > > +   event loops dry up and they are effectively quiesced.
> > 
> > If you have ability to do this by communicating with
> > all processes e.g. through a unix domain socket,
> > why do you need the rest of the stuff in the kernel?
> > Quescing is a harder problem than waking up.
> 
> That depends. Think of a typical VM workload. Let's take the web server
> example again. You can preboot the full VM and snapshot it as is. As long as
> you don't allow any incoming connections, you can guarantee that the system
> is "quiesced" well enough for the snapshot.

Well you can use a firewall or such to block incoming packets,
but I am not at all sure that means e.g. all socket buffers
are empty.


> This is really what this bullet point is about. The point is that you're not
> consuming randomness you can't reseed asynchronously (see the above OpenSSL
> PRNG example).
> 
> 
> Alex
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
  2021-02-24 22:41         ` Michael S. Tsirkin
  (?)
@ 2021-02-24 23:22         ` Alexander Graf
  -1 siblings, 0 replies; 23+ messages in thread
From: Alexander Graf @ 2021-02-24 23:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Adrian Catangiu, linux-doc, linux-kernel, qemu-devel, kvm,
	linux-s390, gregkh, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, colmmacc, luto, tytso, ebiggers,
	dwmw, bonzini, sblbir, raduweis, corbet, mhocko, rafael, pavel,
	mpe, areber, ovzxemul, avagin, ptikhomirov, gil, asmehra,
	dgunigun, vijaysun, oridgar, ghammer



On 24.02.21 23:41, Michael S. Tsirkin wrote:
> 
> On Wed, Feb 24, 2021 at 02:45:03PM +0100, Alexander Graf wrote:
>>> Above should try harder to explan what are the things that need to be
>>> scrubbed and why. For example, I personally don't really know what is
>>> the OpenSSL session token example and what makes it vulnerable. I guess
>>> snapshots can attack each other?
>>>
>>>
>>>
>>>
>>> Here's a simple example of a workflow that submits transactions
>>> to a database and wants to avoid duplicate transactions.
>>> This does not require overseer magic. It does however require
>>> a correct genid from hypervisor, so no mmap tricks work.
>>>
>>>
>>>
>>>           int genid, oldgenid;
>>>           read(&genid);
>>> start:
>>>           oldgenid = genid;
>>>           transid = submit transaction
>>>           read(&genid);
>>>           if (genid != oldgenid) {
>>>                           revert transaction (transid);
>>>                           goto start:
>>>           }
>>
>> I'm not sure I fully follow. For starters, if this is a VM local database, I
>> don't think you'd care about the genid. If it's a remote database, your
>> connection would get dropped already at the point when you clone/resume,
>> because TCP and your connection state machine will get really confused when
>> you suddenly have a different IP address or two consumers of the same stream
>> :).
>>
>> But for the sake of the argument, let's assume you can have a connectionless
>> database connection that maintains its own connection uniqueness logic.
> 
> Right. E.g. not uncommon with REST APIs. They survive disconnect easily
> and use cookies or such.
> 
>> That
>> database connector would need to understand how to abort the connection (and
>> thus the transaction!) when the generation changes.
> 
> the point is that instead of all that you discover transaction as
> a duplicate and revert it.
> 
> 
>> And that's logic you
>> would do with the read/write/notify mechanism. So your main loop would check
>> for reads on the genid fd and after sending a connection termination, notify
>> the overlord that it's safe to use the VM now.
>>
>> The OpenSSL case (with mmap) is for libraries that are stateless and can not
>> guarantee that they receive a genid notification event timely.
>>
>> Since you asked, this is mainly important for the PRNG. Imagine an https
>> server. You create a snapshot. You resume from that snapshot. OpenSSL is
>> fully initialized with a user space PRNG randomness pool that it considers
>> safe to consume. However, that means your first connection after resume will
>> be 100% predictable randomness wise.
> 
> I wonder whether something similar is possible here. I.e. use the secret
> to encrypt stuff but check the gen ID before actually sending data.
> If it changed re-encrypt. Hmm?

I don't see why you would though. Once you control the application
level, just use the event based API. That's the much easier one to use.
The mmap one is really just there to cover cases where you don't own the
main event loop, but can't spend the syscall overhead on every
invocation to check if the genid changed.
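
To make the mmap case concrete, here is a rough sketch of a library
PRNG fast path that compares a mapped generation counter on every call
and reseeds when it changes. It assumes the mapped page exposes a u32
counter; the snap_prng names and demo_fresh_seed (a deterministic
stand-in for real entropy such as getrandom()) are made up for this
example, and the xorshift step is a toy generator, not something to use
for cryptography.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for real entropy such as getrandom(); deterministic for the demo. */
static uint64_t demo_fresh_seed(void)
{
	static uint64_t calls;

	return 0x9E3779B97F4A7C15ULL * ++calls;   /* non-zero for small counts */
}

struct snap_prng {
	const volatile uint32_t *gen;  /* would point into the mmap()ed counter page */
	uint32_t seen_gen;             /* generation the current state was seeded under */
	uint64_t state;
	unsigned int reseeds;
};

static void snap_prng_init(struct snap_prng *p, const volatile uint32_t *gen)
{
	assert(p && gen);
	p->gen = gen;
	p->seen_gen = *gen;
	p->state = demo_fresh_seed();
	p->reseeds = 0;
}

static uint64_t snap_prng_next(struct snap_prng *p)
{
	uint32_t now = *p->gen;        /* plain memory load, no syscall */

	if (now != p->seen_gen) {      /* generation changed: drop the old state */
		p->state = demo_fresh_seed();
		p->seen_gen = now;
		p->reseeds++;
	}

	/* xorshift64 step: a toy generator, NOT cryptographically secure */
	uint64_t x = p->state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	p->state = x;
	return x;
}
```

Two copies of the same struct snap_prng (which is what a clone amounts
to) return identical values until the counter changes, and diverge once
each has reseeded. The check is still racy in the sense discussed in
this thread: a snapshot taken between the load and the use of the
output goes unnoticed.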

> 
>>
>> The mmap mechanism allows the PRNG to reseed after a genid change. Because
>> we don't have an event mechanism for this code path, that can happen minutes
>> after the resume. But that's ok, we "just" have to ensure that nobody is
>> consuming secret data at the point of the snapshot.
> 
> 
> Something I am still not clear on is whether it's really important to
> skip the system call here. If not I think it's prudent to just stick
> to read for now, I think there's a slightly lower chance that
> it will get misused. mmap which gives you a laggy gen id value
> really seems like it would be hard to use correctly.

The read is not any less racy than the mmap. The real "safety" of the 
read interface comes from the acknowledge path. And that path requires 
you to be part of the event loop.
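
As a sketch of what "being part of the event loop" could look like:
the helper below polls a file descriptor, reads the new generation,
lets the application readjust, and writes the value back as the
acknowledgement. The read/write semantics are an assumption drawn from
this discussion, and handle_genid_event and record_gen are hypothetical
names; the code works on any fd that behaves this way (a socketpair is
enough to try it).

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

typedef void (*readjust_fn)(uint32_t new_gen, void *ctx);

/* Demo callback: just remembers which generation it was told about. */
static void record_gen(uint32_t new_gen, void *ctx)
{
	*(uint32_t *)ctx = new_gen;
}

/*
 * One event-loop step. Returns 1 if a generation change was handled and
 * acknowledged, 0 if nothing was pending, -1 on error.
 */
static int handle_genid_event(int fd, int timeout_ms, readjust_fn readjust, void *ctx)
{
	assert(readjust);

	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	if (n == 0 || !(pfd.revents & POLLIN))
		return 0;

	uint32_t gen;

	if (read(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -1;

	readjust(gen, ctx);            /* reseed, drop sessions, and so on */

	/* Acknowledge only after readjusting, so a waiter sees us as done. */
	if (write(fd, &gen, sizeof(gen)) != (ssize_t)sizeof(gen))
		return -1;
	return 1;
}
```

Acknowledging only after the readjust callback returns is the point of
the exercise: whoever waits for acknowledgements can then treat an
acknowledged watcher as already adapted to the new generation.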

> 
> 
>>>
>>>
>>>
>>>
>>>
>>>
>>>> +Simplifyng assumption - safety prerequisite
>>>> +-------------------------------------------
>>>> +
>>>> +**Control the snapshot flow**, disallow snapshots coming at arbitrary
>>>> +moments in the workload lifetime.
>>>> +
>>>> +Use a system-level overseer entity that quiesces the system before
>>>> +snapshot, and post-snapshot-resume oversees that software components
>>>> +have readjusted to new environment, to the new generation. Only after,
>>>> +will the overseer un-quiesce the system and allow active workloads.
>>>> +
>>>> +Software components can choose whether they want to be tracked and
>>>> +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
>>>> +IOCTL.
>>>> +
>>>> +The sysgenid framework standardizes the API for system software to
>>>> +find out about needing to readjust and at the same time provides a
>>>> +mechanism for the overseer entity to wait for everyone to be done, the
>>>> +system to have readjusted, so it can un-quiesce.
>>>> +
>>>> +Example snapshot-safe workflow
>>>> +------------------------------
>>>> +
>>>> +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
>>>> +   how this is achieved is very workload-specific, but the general
>>>> +   description is to get all software to an expected state where their
>>>> +   event loops dry up and they are effectively quiesced.
>>>
>>> If you have ability to do this by communicating with
>>> all processes e.g. through a unix domain socket,
>>> why do you need the rest of the stuff in the kernel?
>>> Quescing is a harder problem than waking up.
>>
>> That depends. Think of a typical VM workload. Let's take the web server
>> example again. You can preboot the full VM and snapshot it as is. As long as
>> you don't allow any incoming connections, you can guarantee that the system
>> is "quiesced" well enough for the snapshot.
> 
> Well you can use a firewall or such to block incoming packets,
> but I am not at all sure that means e.g. all socket buffers
> are empty.

If it's a fresh VM that only started the web server and did nothing 
else, there shouldn't be anything in its socket buffers :).

I agree that it won't allow us to cover 100% of all cases automatically 
and seamlessly. I can't think of any solution that does - if you can 
think of something I'm all ears. But this API at least gives us a path 
to slowly move the ecosystem to a point where applications and libraries 
can enable themselves to become vm/container clone aware. Today we don't 
even give them the opportunity to self adjust.


Alex


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 0/2] System Generation ID driver and VMGENID backend
  2021-02-24  9:05   ` Michael S. Tsirkin
  (?)
@ 2021-03-04 20:08   ` Catangiu, Adrian Costin
  -1 siblings, 0 replies; 23+ messages in thread
From: Catangiu, Adrian Costin @ 2021-03-04 20:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-doc, linux-kernel, qemu-devel, kvm, linux-s390, gregkh,
	Graf (AWS),
	Alexander, rdunlap, arnd, ebiederm, rppt, 0x7f454c46,
	borntraeger, Jason, jannh, w, MacCarthaigh, Colm, luto, tytso,
	ebiggers, Woodhouse, David, bonzini, Singh, Balbir, Weiss, Radu,
	corbet, mhocko, rafael, pavel, mpe, areber, ovzxemul, avagin,
	ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar, ghammer

Hi Michael,

On 24/02/2021, 11:06, "Michael S. Tsirkin" <mst@redhat.com> wrote:

    On Wed, Feb 24, 2021 at 10:47:30AM +0200, Adrian Catangiu wrote:
    > This feature is aimed at virtualized or containerized environments
    > where VM or container snapshotting duplicates memory state, which is a
    > challenge for applications that want to generate unique data such as
    > request IDs, UUIDs, and cryptographic nonces.
    >
    > The patch set introduces a mechanism that provides a userspace
    > interface for applications and libraries to be made aware of uniqueness
    > breaking events such as VM or container snapshotting, and allow them to
    > react and adapt to such events.
    >
    > Solving the uniqueness problem strongly enough for cryptographic
    > purposes requires a mechanism which can deterministically reseed
    > userspace PRNGs with new entropy at restore time. This mechanism must
    > also support the high-throughput and low-latency use-cases that led
    > programmers to pick a userspace PRNG in the first place; be usable by
    > both application code and libraries; allow transparent retrofitting
    > behind existing popular PRNG interfaces without changing application
    > code; it must be efficient, especially on snapshot restore; and be
    > simple enough for wide adoption.
    >
    > The first patch in the set implements a device driver which exposes a
    > the /dev/sysgenid char device to userspace. Its associated filesystem
    > operations operations can be used to build a system level safe workflow
    > that guest software can follow to protect itself from negative system
    > snapshot effects.
    >
    > The second patch in the set adds a VmGenId driver which makes use of
    > the ACPI vmgenid device to drive SysGenId and to reseed kernel entropy
    > following VM snapshots.
    >
    > **Please note**, SysGenID alone does not guarantee complete snapshot
    > safety to applications using it. A certain workflow needs to be
    > followed at the system level, in order to make the system
    > snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
    > section in the included SysGenID documentation.
    >
    > ---
    >
    > v6 -> v7:
    >   - remove sysgenid uevent

    How about we drop mmap too?

    There's simply no way I can see to make it safe, and
    no implementation is worse than a racy one imho.

    Yea there's some documentation explaining how it is not
    supposed to be used but it will *seem* to work for people
    and we will be stuck trying to maintain it.

    Let's see if userspace using this often enough to make the
    system call

As Colm explained in his reply, the mmap is the only option to consume
this within the strict latency constraints of PRNGs and SSL libs, so what if
instead, we remove the IRQ race by removing vmgenid as an in-kernel
sysgenid backend/driver?

We could just drop the vmgenid driver for now and only drive sysgenid
from userspace using the fs interface. Doing so will remove the IRQ race
which comes from the vmgenid backend, and will keep the SysGenID kernel
interface safe and consistent, with a race-free mmap().

What do you think?

    > v5 -> v6:
    >
    >   - sysgenid: watcher tracking disabled by default
    >   - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each
    >     file descriptor to set whether they should be tracked as watchers
    >   - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE
    >   - rework all documentation to clearly capture all prerequisites for
    >     achieving snapshot safety when using the provided mechanism
    >   - sysgenid documentation: replace individual filesystem operations
    >     examples with a higher level example showcasing system-level
    >     snapshot-safe workflow
    >
    > v4 -> v5:
    >
    >   - sysgenid: generation changes are also exported through uevents
    >   - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
    >   - document sysgenid ioctl major/minor numbers
    >
    > v3 -> v4:
    >
    >   - split functionality in two separate kernel modules:
    >     1. drivers/misc/sysgenid.c which provides the generic userspace
    >        interface and mechanisms
    >     2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
    >        kernel entropy and acts as a driving backend for the generic
    >        sysgenid
    >   - rename /dev/vmgenid -> /dev/sysgenid
    >   - rename uapi header file vmgenid.h -> sysgenid.h
    >   - rename ioctls VMGENID_* -> SYSGENID_*
    >   - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
    >   - fix races in documentation examples
    >
    > v2 -> v3:
    >
    >   - separate the core driver logic and interface, from the ACPI device.
    >     The ACPI vmgenid device is now one possible backend
    >   - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
    >   - add locking to avoid races between fs ops handlers and hw irq
    >     driven generation updates
    >   - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
    >     outdated or a generation change happens while waiting (thus making
    >     current caller outdated), the ioctl returns -EINTR to signal the
    >     user to handle event and retry. Fixes blocking on oneself
    >   - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
    >     CAP_CHECKPOINT_RESTORE capability, through which software can force
    >     generation bump
    >
    > v1 -> v2:
    >
    >   - expose to userspace a monotonically increasing u32 Vm Gen Counter
    >     instead of the hw VmGen UUID
    >   - since the hw/hypervisor-provided 128-bit UUID is not public
    >     anymore, add it to the kernel RNG as device randomness
    >   - insert driver page containing Vm Gen Counter in the user vma in
    >     the driver's mmap handler instead of using a fault handler
    >   - turn driver into a misc device driver to auto-create /dev/vmgenid
    >   - change ioctl arg to avoid leaking kernel structs to userspace
    >   - update documentation
    >
    > Adrian Catangiu (2):
    >   drivers/misc: sysgenid: add system generation id driver
    >   drivers/virt: vmgenid: add vm generation id driver
    >
    >  Documentation/misc-devices/sysgenid.rst            | 229 +++++++++++++++
    >  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
    >  Documentation/virt/vmgenid.rst                     |  36 +++
    >  MAINTAINERS                                        |  15 +
    >  drivers/misc/Kconfig                               |  15 +
    >  drivers/misc/Makefile                              |   1 +
    >  drivers/misc/sysgenid.c                            | 322 +++++++++++++++++++++
    >  drivers/virt/Kconfig                               |  13 +
    >  drivers/virt/Makefile                              |   1 +
    >  drivers/virt/vmgenid.c                             | 153 ++++++++++
    >  include/uapi/linux/sysgenid.h                      |  18 ++
    >  11 files changed, 804 insertions(+)
    >  create mode 100644 Documentation/misc-devices/sysgenid.rst
    >  create mode 100644 Documentation/virt/vmgenid.rst
    >  create mode 100644 drivers/misc/sysgenid.c
    >  create mode 100644 drivers/virt/vmgenid.c
    >  create mode 100644 include/uapi/linux/sysgenid.h
    >
    > --
    > 2.7.4
    >
    >
    >
    >
    > Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 2/2] drivers/virt: vmgenid: add vm generation id driver
  2021-02-24  8:47   ` Adrian Catangiu
@ 2022-02-22 21:24     ` Jason A. Donenfeld
  -1 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-02-22 21:24 UTC (permalink / raw)
  To: Adrian Catangiu
  Cc: open list:DOCUMENTATION, LKML, QEMU Developers, KVM list,
	linux-s390, Greg Kroah-Hartman, graf, Randy Dunlap,
	Arnd Bergmann, Eric W. Biederman, Mike Rapoport, 0x7f454c46,
	borntraeger, Jann Horn, Willy Tarreau, Colm MacCarthaigh,
	Andrew Lutomirski, Theodore Ts'o, Eric Biggers, Woodhouse,
	David, bonzini, Singh, Balbir, Weiss, Radu, Jonathan Corbet,
	Michael S. Tsirkin, Michal Hocko, Rafael J. Wysocki,
	Pavel Machek, Michael Ellerman, areber, ovzxemul, avagin,
	ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar, ghammer

Hi Adrian,

This thread seems to be long dead, but I couldn't figure out what
happened to the ideas in it. I'm specifically interested in this part:

On Wed, Feb 24, 2021 at 9:48 AM Adrian Catangiu <acatan@amazon.com> wrote:
> +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
> +{
> +       uuid_t old_uuid;
> +
> +       if (!device || acpi_driver_data(device) != &vmgenid_data) {
> +               pr_err("VMGENID notify with unexpected driver private data\n");
> +               return;
> +       }
> +
> +       /* update VM Generation UUID */
> +       old_uuid = vmgenid_data.uuid;
> +       memcpy_fromio(&vmgenid_data.uuid, vmgenid_data.uuid_iomap, sizeof(uuid_t));
> +
> +       if (memcmp(&old_uuid, &vmgenid_data.uuid, sizeof(uuid_t))) {
> +               /* HW uuid updated */
> +               sysgenid_bump_generation();
> +               add_device_randomness(&vmgenid_data.uuid, sizeof(uuid_t));
> +       }
> +}

As Jann mentioned in an earlier email, we probably want this to
immediately reseed the crng, not just dump it into
add_device_randomness alone. But either way, the general idea seems
interesting to me. As far as I can tell, QEMU still supports this. Was
it not deemed to be sufficiently interesting?
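
A userspace model of that ordering, purely as a sketch: on a changed
identifier, reseed first and only then publish the new generation. This
is not kernel code, and genid_notify and demo_force_reseed are
hypothetical names invented for the example; the reseed hook only
counts how often it ran.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GENID_LEN 16

struct genid_state {
	uint8_t id[GENID_LEN];     /* last identifier seen */
	uint32_t generation;       /* bumped on every observed change */
	unsigned int reseeds;      /* how often the reseed hook ran */
};

/* Hypothetical hook: stands in for "reseed the RNG right now". */
static void demo_force_reseed(struct genid_state *st, const uint8_t *id, size_t len)
{
	(void)id;
	(void)len;
	st->reseeds++;
}

/* Returns 1 when the identifier changed, 0 for a spurious notification. */
static int genid_notify(struct genid_state *st, const uint8_t new_id[GENID_LEN])
{
	assert(st && new_id);

	if (memcmp(st->id, new_id, GENID_LEN) == 0)
		return 0;

	memcpy(st->id, new_id, GENID_LEN);

	/* Reseed before publishing the new generation to any watcher. */
	demo_force_reseed(st, st->id, GENID_LEN);
	st->generation++;
	return 1;
}
```

Reseeding before the bump is a design choice of this sketch, so that
anything woken by the new generation already draws from fresh state.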

Thanks,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 2/2] drivers/virt: vmgenid: add vm generation id driver
  2022-02-22 21:24     ` Jason A. Donenfeld
@ 2022-02-22 22:17       ` Jason A. Donenfeld
  -1 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-02-22 22:17 UTC (permalink / raw)
  To: adrian
  Cc: open list:DOCUMENTATION, LKML, QEMU Developers, KVM list,
	linux-s390, Greg Kroah-Hartman, graf, Randy Dunlap,
	Arnd Bergmann, Eric W. Biederman, Mike Rapoport, 0x7f454c46,
	borntraeger, Jann Horn, Willy Tarreau, Colm MacCarthaigh,
	Andrew Lutomirski, Theodore Ts'o, Eric Biggers, Woodhouse,
	David, bonzini, Singh, Balbir, Weiss, Radu, Jonathan Corbet,
	Michael S. Tsirkin, Michal Hocko, Rafael J. Wysocki,
	Pavel Machek, Michael Ellerman, areber, ovzxemul, avagin,
	ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar, ghammer,
	Adrian Catangiu

Hey again,

On Tue, Feb 22, 2022 at 10:24 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> This thread seems to be long dead, but I couldn't figure out what
> happened to the ideas in it. I'm specifically interested in this part:
>
> On Wed, Feb 24, 2021 at 9:48 AM Adrian Catangiu <acatan@amazon.com> wrote:
> > +static void vmgenid_acpi_notify(struct acpi_device *device, u32 event)
> > +{
> > +       uuid_t old_uuid;
> > +
> > +       if (!device || acpi_driver_data(device) != &vmgenid_data) {
> > +               pr_err("VMGENID notify with unexpected driver private data\n");
> > +               return;
> > +       }
> > +
> > +       /* update VM Generation UUID */
> > +       old_uuid = vmgenid_data.uuid;
> > +       memcpy_fromio(&vmgenid_data.uuid, vmgenid_data.uuid_iomap, sizeof(uuid_t));
> > +
> > +       if (memcmp(&old_uuid, &vmgenid_data.uuid, sizeof(uuid_t))) {
> > +               /* HW uuid updated */
> > +               sysgenid_bump_generation();
> > +               add_device_randomness(&vmgenid_data.uuid, sizeof(uuid_t));
> > +       }
> > +}
>
> As Jann mentioned in an earlier email, we probably want this to
> immediately reseed the crng, not just dump it into
> add_device_randomness alone. But either way, the general idea seems
> interesting to me. As far as I can tell, QEMU still supports this. Was
> it not deemed to be sufficiently interesting?
>
> Thanks,
> Jason

Well, I cleaned up this v7 and refactored it into something along the
lines of what I'm thinking. I don't yet know enough about this general
problem space to propose the patch, and I haven't tested it either, but
in case you're curious, the result lives at
https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/commit/?h=jd/vmgenid
if you (or somebody else) feel inclined to pick this up.

Looking forward to learning more from you in general, though, about
what the deal is with the VM gen ID, and if this is a real thing or
not.

Regards,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 2/2] drivers/virt: vmgenid: add vm generation id driver
  2022-02-22 22:17       ` Jason A. Donenfeld
@ 2022-02-23 13:21         ` Jason A. Donenfeld
  -1 siblings, 0 replies; 23+ messages in thread
From: Jason A. Donenfeld @ 2022-02-23 13:21 UTC (permalink / raw)
  To: adrian
  Cc: open list:DOCUMENTATION, LKML, QEMU Developers, KVM list,
	linux-s390, Greg Kroah-Hartman, graf, Randy Dunlap,
	Arnd Bergmann, Eric W. Biederman, Mike Rapoport, 0x7f454c46,
	borntraeger, Jann Horn, Willy Tarreau, Colm MacCarthaigh,
	Andrew Lutomirski, Theodore Ts'o, Eric Biggers, Woodhouse,
	David, bonzini, Singh, Balbir, Weiss, Radu, Jonathan Corbet,
	Michael S. Tsirkin, Michal Hocko, Rafael J. Wysocki,
	Pavel Machek, Michael Ellerman, areber, ovzxemul, avagin,
	ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar, ghammer,
	Adrian Catangiu

On Tue, Feb 22, 2022 at 11:17 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Well I cleaned up this v7 and refactored it into something along the
> lines of what I'm thinking. I don't yet know enough about this general
> problem space to propose the patch and I haven't tested it either

A little further along, there's now this series:
https://lore.kernel.org/lkml/20220223131231.403386-1-Jason@zx2c4.com/T/
We can resume discussion there.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver
@ 2021-02-24 23:00 MacCarthaigh, Colm
  0 siblings, 0 replies; 23+ messages in thread
From: MacCarthaigh, Colm @ 2021-02-24 23:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, Graf (AWS), Alexander
  Cc: Catangiu, Adrian Costin, linux-doc, linux-kernel, qemu-devel,
	kvm, linux-s390, gregkh, rdunlap, arnd, ebiederm, rppt,
	0x7f454c46, borntraeger, Jason, jannh, w, luto, tytso, ebiggers,
	Woodhouse, David, bonzini, Singh, Balbir, Weiss, Radu, corbet,
	mhocko, rafael, pavel, mpe, areber, ovzxemul, avagin,
	ptikhomirov, gil, asmehra, dgunigun, vijaysun, oridgar, ghammer



On 2/24/21, 2:44 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > The mmap mechanism allows the PRNG to reseed after a genid change. Because
    > we don't have an event mechanism for this code path, that can happen minutes
    > after the resume. But that's ok, we "just" have to ensure that nobody is
    > consuming secret data at the point of the snapshot.


    Something I am still not clear on is whether it's really important to
    skip the system call here. If not I think it's prudent to just stick
    to read for now, I think there's a slightly lower chance that
    it will get misused. mmap which gives you a laggy gen id value
    really seems like it would be hard to use correctly.

It's not uncommon for these user-space PRNGs to be used quite a lot in very performance-critical paths. If you negotiate a TLS session that uses an explicit IV, the RNG is being called for every TLS record sent. Same for IPsec, depending on the cipher suite. Every TLS hello message has 28-32 bytes of data from the RNG, or if you've got ECDSA as your signature algorithm, it's inline again. Using RSA_PSS? Same again. Many post-quantum algorithms are even more voraciously entropy-hungry. We examine the compiled instructions for ours by hand to check it's all as tight as it can be.

To give more of an idea, several crypto libraries took out the getpid() guards they had for fork detection in the RNGs, though VDSO could have helped there and I'm not sure they would have needed to if VDSO were more widely used at the time.  I don't think we'd get a patch into OpenSSL/libcrypto that involves a full syscall. VDSO might be ok, but even that's not going to have the speed that a single memory lookup can do with the mmap/madvise approach ... since we already have to use WIPEONFORK.
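
As a rough illustration of why the mapped counter is cheap, the fast path can be sketched as a single load and compare in front of the generator. This is a minimal sketch under assumptions, not the proposed driver interface: genid_page is an ordinary variable standing in for the read-only mapping, user_prng and its helpers are hypothetical names, toy_entropy() replaces a real call such as getrandom(), and xorshift64 is used only because it is short (it is not a cryptographic generator).

```c
#include <stdint.h>

/* Stand-in for the mapped generation counter; assumed to be bumped on restore. */
static volatile uint32_t genid_page;

/* Deterministic toy entropy; a real library would ask the kernel instead. */
static uint64_t toy_entropy(void)
{
	static uint64_t calls;

	return ++calls * 0x9e3779b97f4a7c15ULL; /* odd multiplier, never 0 here */
}

struct user_prng {
	const volatile uint32_t *genid; /* where the generation counter lives */
	uint32_t seen_gen;              /* generation observed at last reseed */
	uint64_t state;                 /* xorshift64 state, must be nonzero */
	unsigned int reseeds;           /* bookkeeping for the example only */
};

static void prng_reseed(struct user_prng *p)
{
	/* Read the generation first so a later bump is caught by the next call. */
	p->seen_gen = *p->genid;
	p->state = toy_entropy();
	p->reseeds++;
}

static void prng_init(struct user_prng *p, const volatile uint32_t *genid)
{
	p->genid = genid;
	p->reseeds = 0;
	prng_reseed(p);
}

static uint64_t prng_next(struct user_prng *p)
{
	uint64_t x;

	/* Fast path: one memory load and one compare, no system call. */
	if (*p->genid != p->seen_gen)
		prng_reseed(p);

	x = p->state;
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	p->state = x;
	return x;
}
```

Copying a struct user_prng models a snapshot: the two copies produce the same stream until the counter changes, after which each one reseeds on its next call and the streams diverge. The check only helps at call boundaries, which is why the quiescence point below still matters.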

In practice I don't think it will be that hard to use correctly; snapshots and restores of this nature really have to happen only when the activity is quiescent. If operations are in-flight, it's not easy to reason about the potential multi-restore problems at all and it only makes sense to think about transactional correctness at the level of all transactions that may have been in-flight. The mmap solution is more about integrating with existing library APIs and semantics than it is about somehow solving that at the kernel level. That part has to be solved at the system level.

- 
Colm


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-02-23 13:38 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-24  8:47 [PATCH v7 0/2] System Generation ID driver and VMGENID backend Adrian Catangiu
2021-02-24  8:47 ` Adrian Catangiu
2021-02-24  8:47 ` [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver Adrian Catangiu
2021-02-24  8:47   ` Adrian Catangiu
2021-02-24  9:19   ` Michael S. Tsirkin
2021-02-24  9:19     ` Michael S. Tsirkin
2021-02-24 13:45     ` Alexander Graf
2021-02-24 13:45       ` Alexander Graf
2021-02-24 22:41       ` Michael S. Tsirkin
2021-02-24 22:41         ` Michael S. Tsirkin
2021-02-24 23:22         ` Alexander Graf
2021-02-24  8:47 ` [PATCH v7 2/2] drivers/virt: vmgenid: add vm " Adrian Catangiu
2021-02-24  8:47   ` Adrian Catangiu
2022-02-22 21:24   ` Jason A. Donenfeld
2022-02-22 21:24     ` Jason A. Donenfeld
2022-02-22 22:17     ` Jason A. Donenfeld
2022-02-22 22:17       ` Jason A. Donenfeld
2022-02-23 13:21       ` Jason A. Donenfeld
2022-02-23 13:21         ` Jason A. Donenfeld
2021-02-24  9:05 ` [PATCH v7 0/2] System Generation ID driver and VMGENID backend Michael S. Tsirkin
2021-02-24  9:05   ` Michael S. Tsirkin
2021-03-04 20:08   ` Catangiu, Adrian Costin
2021-02-24 23:00 [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver MacCarthaigh, Colm
