Draft NVDIMM proposal

* Draft NVDIMM proposal
@ 2018-05-09 17:29 George Dunlap
  2018-05-09 17:35 ` George Dunlap
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2018-05-09 17:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Yi Zhang, Jan Beulich, Roger Pau Monne

Below is an initial draft of an NVDIMM proposal.  I'll submit a patch to
include it in the tree at some point, but I thought for initial
discussion it would be easier if it were copied in-line.

I've done a fair amount of investigation, but it's quite likely I've
made mistakes.  Please send me corrections where necessary.

-George

---
% NVDIMMs and Xen
% George Dunlap
% Revision 0.1

# NVDIMM overview

It's very difficult, from the various specs, to actually get a
complete enough picture if what's going on to make a good design.
This section is meant as an overview of the current hardware,
firmware, and Linux interfaces sufficient to inform a discussion of
the issues in designing a Xen interface for NVDIMMs.

## DIMMs, Namespaces, and access methods

An NVDIMM is a DIMM (_dual in-line memory module_) -- a physical form
factor) that contains _non-volatile RAM_ (NVRAM).  Individual bytes of
memory on a DIMM are specified by a _DIMM physical address_ or DPA.
Each DIMM is attached to an NVDIMM controller.

Memory on the DIMMs is divided up into _namespaces_.  The word
"namespace" is rather misleading though; a namespace in this context
is not actually a space of names (contrast, for example "C++
namespaces"); rather, it's more like a SCSI LUN, or a volume, or a
partition on a drive: a set of data which is meant to be viewed and
accessed as a unit.  (The name was apparently carried over from NVMe
devices, which were precursors of the NVDIMM spec.)

The NVDIMM controller allows two ways to access the DIMM.  One is
mapped 1-1 in _system physical address space_ (SPA), much like normal
RAM.  This method of access is called _PMEM_.  The other method is
similar to that of a PCI device: you have a control and status
register which control an 8k aperture window into the DIMM.  This
method access is called _PBLK_.

In the case of PMEM, as in the case of DRAM, addresses from the SPA
are interleaved across a set of DIMMs (an _interleave set_) for
performance reasons.  A specific PMEM namespace will be a single
contiguous DPA range across all DIMMs in its interleave set.  For
example, you might have a namespace for DPAs `0-0x50000000` on DIMMs 0
and 1; and another namespace for DPAs `0x80000000-0xa0000000` on DIMMs
0, 1, 2, and 3.

In the case of PBLK, a namespace always resides on a single DIMM.
However, that namespace can be made up of multiple discontiguous
chunks of space on that DIMM.  For instance, in our example above, we
might have a namespace om DIMM 0 consisting of DPAs
`0x50000000-0x60000000`, `0x80000000-0x90000000`, and
`0xa0000000-0xf0000000`.

The interleaving of PMEM has implications for the speed and
reliability of the namespace: Much like RAID 0, it maximizes speed,
but it means that if any one DIMM fails, the data from the entire
namespace is corrupted.  PBLK makes it slightly less straightforward
to access, but it allows OS software to apply RAID-like logic to
balance redundancy and speed.

Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM;
for large systems without 5-level paging, this is actually becoming a
limitation.  Using PBLK allows existing 4-level paged systems to
access an arbitrary amount of NVDIMM.

## Namespaces, labels, and the label area

A namespace is a mapping from the SPA and MMIO space into the DIMM.

The firmware and/or operating system can talk to the NVDIMM controller
to set up mappings from SPA and MMIO space into the DIMM.  Because the
memory and PCI devices are separate, it would be possible for buggy
firmware or NVDIMM controller drivers to misconfigure things such that
the same DPA is exposed in multiple places; if so, the results are
undefined.

Namespaces are constructed out of "labels".  Each DIMM has a Label
Storage Area, which is persistent but logically separate from the
device-addressable areas on the DIMM.  A label on a DIMM describes a
single contiguous region of DPA on that DIMM.  A PMEM namespace is
made up of one label from each of the DIMMs which make its interleave
set; a PBLK namespace is made up of one label for each chunk of range.

In our examples above, the first PMEM namespace would be made of two
labels (one on DIMM 0 and one on DIMM 1, each describind DPA
`0-0x50000000`), and the second namespace would be made of four labels
(one on DIMM 0, one on DIMM 1, and so on).  Similarly, in the PBLK
example, the namespace would consist of three labels; one describing
`0x50000000-0x60000000`, one describing `0x80000000-0x90000000`, and
so on.

The namespace definition includes not only information about the DPAs
which make up the namespace and how they fit together; it also
includes a UUID for the namespace (to allow it to be identified
uniquely), a 64-character "name" field for a human-friendly
description, and Type and Address Abstraction GUIDs to inform the
operating system how the data inside the namespace should be
interpreted.  Additionally, it can have an `ROLABEL` flag, which
indicates to the OS that "device drivers and manageability software
should refuse to make changes to the namespace labels", because
"attempting to make configuration changes that affect the namespace
labels will fail (i.e. because the VM guest is not in a position to
make the change correctly)".

See the [UEFI Specification][uefi-spec], section 13.19, "NVDIMM Label
Protocol", for more information.

[uefi-spec]:
http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf

## NVDIMMs and ACPI

The [ACPI Specification][acpi-spec] breaks down information in two ways.

The first is about physical devices (see section 9.20, "NVDIMM
Devices").  The NVDIMM controller is called the _NVDIMM Root Device_.
There will generally be only a single NVDIMM root device on a system.

Individual NVDIMMs are referred to by the spec as _NVDIMM Devices_.
Each separate DIMM will have its own device listed as being under the
Root Device.  Each DIMM will have an _NVDIMM Device Handle_ which
describes the physical DIMM (its location within the memory channel,
the channel number within the memory controller, the memory controller
ID within the socket, and so on).

The second is about the data on those devices, and how the operating
system can access it.  This information is exposed in the NFIT table
(see section 5.2.25).

Because namespace labels allow NVDIMMs to be partitioned in fairly
arbitrary ways, exposing information about how the operating system
can access it is a bit complicated.  It consists of several tables,
whose information must be correlated to make sense out of it.

These tables include:
 1. A table of DPA ranges on individual NVDIMM devices
 2. A table of SPA ranges where PMEM regions are mapped, along with
    interleave sets
 3. Tables for control and data addresses for PBLK regions

NVRAM on a given NVDIMM device will be broken down into one or more
_regions_.  These regions are enumerated in the NVDIMM Region Mapping
Structure.  Each entry in this table contains the NVDIMM Device Handle
for the device the region is in, as well as the DPA range for the
region (called "NVDIMM Physical Address Base" and "NVDIMM Region Size"
in the spec).  Regions which are part of a PMEM namespace will have
references into SPA tables and interleave set tables; regions which
are part of PBLK namespaces will have references into control region
and block data window region structures.

[acpi-spec]:
http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf

## Namespaces and the OS

At boot time, the firmware will read the label regions from the NVDIMM
device and set up the memory controllers appropriately.  It will then
construct a table describing the resulting regions in a table called
an NFIT table, and expose that table via ACPI.

To use a namespace, an operating system needs at a minimum two pieces
of information: The UUID and/or Name of the namespace, and the SPA
range where that namespace is mapped; and ideally also the Type and
Abstraction Type to know how to interpret the data inside.

Unfortunately, the information needed to understand namespaces is
somewhat disjoint.  The namespace labels themselves contain the UUID,
Name, Type, and Abstraction Type, but don't contain any information
about SPA or block control / status registers and windows.  The NFIT
table contains a list of SPA Range Structures, which list the
NVDIMM-related SPA ranges and their Type GUID; as well as a table
containing individual DPA ranges, which specifies which SPAs they
correspond to.  But the NFIT does not contain the UUID or other
identifying information from the Namespace labels.  In order to
actually discover that namespace with UUID _X_ is mapped at SPA _Y-Z_,
an operating system must:

1. Read the label areas of all NVDIMMs and discover the DPA range and
   Interleave Set for namespace _X_
2. Read the Region Mapping Structures from the NFIT table, and find
   out which structures match the DPA ranges for namespace _X_
3. Find the System Physical Address Range Structure Index associated
   with the Region Mapping
4. Look up the SPA Range Structure in the NFIT table using the SPA
   Range Structure Index
5. Read the SPA range _Y-Z_

An OS driver can modify the namespaces by modifying the Label Storage
Areas of the corresponding DIMMs.  The NFIT table describes how the OS
can access the Label Storage Areas.  Label Storage Areas may be
"isolated", in which case the area would be accessed via
device-specific AML methods (DSM), or they may be exposed directly
using a well-known location.  AML methods to access the label areas
are "dumb": they are essentially a memcpy() which copies into or out
of a given {DIMM, Label Area Offest} address.  No checking for
validity of reads and writes is done, and simply modifying the labels
does not change the mapping immediately -- this must be done either by
the OS driver reprogramming the NVDIMM memory controller, or by
rebooting and allowing the firmware to it.

Modifying labels is tricky, due to an issue that will be somewhat of a
recurring theme when discussing NVDIMMs: The necessity of assuming
that, at any given point in time, power may be suddenly cut, and the
system needing to be able to recover sensible data in such a
circumstance.  The [UEFI Specification][uefi-spec] chapter on the
NVDIMM label protocol specifies how the label area is to be modified
such that a consistent "view" is always available; and how firmware
and the operating system should respond consistently to labels which
appear corrupt.

## NVDIMMs and filesystems

Along the same line, most filesystems are written with the assumption
that a given write to a block device will either finish completely, or
be entirely reverted.  Since access to NVDIMMs (even in PBLK mode) are
essentially `memcpy`s, writes may well be interrupted halfway through,
resulting in _sector tearing_.  In order to help with this, the UEFI
spec defines method of reading and writing NVRAM which is capable of
emulating sector-atomic write semantics via a _block translation
layer_ (BTT) ([UEFI spec][uefi-spec], chapter 6, "Block Translation
Table (BTT) Layout").  Namespaces accessed via this discipline will
have a _BTT info block_ at the beginning of the namespace (similar to
a superblock on a traditional hard disk).  Additionally, the
AddressAbstraction GUID in the namespace label(s) should be set to
`EFI_BTT_ABSTRACTION_GUID`.

## Linux

Linux has a _direct access_ (DAX) filesystem mount mode for block
devices which are "memory-like" ^[kernel-dax].  If both the filesystem
and the underlying device support DAX, and the `dax` mount option is
enabled, then when a file on that filesystem is `mmap`ed, the page
cache is bypassed and the underlying storage is mapped directly into
the user process. (?)

[kernel-dax]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Linux has a tool called `ndctl` to manage NVDIMM namespaces.  From the
documentation it looks fairly well abstracted: you don't typically
specify individual DPAs when creating PBLK or PMEM regions: you
specify the type you want and the size and it works out the layout
details (?).

The `ndctl` tool allows you to make PMEM namespaces in one of four
modes: `raw`, `sector`, `fsdax` (or `memory`), and `devdax` (or,
confusingly, `dax`).

The `raw`, `sector`, and `fsdax` modes all result in a block device in
the pattern of `/dev/pmemN[.M]`, in which a filesystem can be stored.
`devdax` results in a character device in the pattern of
`/dev/daxN[.M]`.

It's not clear from the documentation exactly what `raw` mode is or
when it would be safe to use it.

`sector` mode implements `BTT`; it is thus safe against sector
tearing, but does not support mapping files in DAX mode.  The
namespace can be either PMEM or PBLK (?).  As described above, the
first block of the namespace will be a BTT info block.

`fsdax` and `devdax` mode are both designed to make it possible for
user processes to have direct mapping of NVRAM.  As such, both are
only suitable for PMEM namespaces (?).  Both also need to have kernel
page structures allocated for each page of NVRAM; this amounts to 64
bytes for every 4k of NVRAM.  Memory for these page structures can
either be allocated out of normal "system" memory, or inside the PMEM
namespace itself.

In both cases, an "info block", very similar to the BTT info block, is
written to the beginning of the namespace when created.  This info
block specifies whether the page structures come from system memory or
from the namespace itself.  If from the namespace itself, it contains
information about what parts of the namespace have been set aside for
Linux to use for this purpose.

Linux has also defined "Type GUIDs" for these two types of namespace
to be stored in the namespace label, although these are not yet in the
ACPI spec.

Documentation seems to indicate that both `pmem` and `dax` devices can
be further subdivided (by mentioning `/dev/pmemN.M` and
`/dev/daxN.M`), but don't mention specifically how.  `pmem` devices,
being block devices, can presumuably be partitioned like a block
device can. `dax` devices may have something similar, or may have
their own subdivision mechanism.  The rest of this document will
assume that this is the case.

# Xen considerations

## RAM and MMIO in Xen

Xen generally has two types of things that can go into a pagetable or
p2m.  The first is RAM or "system memory".  RAM has a page struct,
which allows it to be accounted for on a page-by-page basis: Assigned
to a specific domain, reference counted, and so on.

The second is MMIO.  MMIO areas do not have page structures, and thus
cannot be accounted on a page-by-page basis.  Xen knows about PCI
devices and the associated MMIO ranges, and makes sure that PV
pagetables or HVM p2m tables only contain MMIO mappings for devices
which have been assigned to a guest.

## Page structures

To begin with, Xen, like Linux, needs page structs for NVDIMM
memory.  Without page structs, we don't have reference counts; which
means there's no safe way, for instance, for a guest to ask a PV
device to write into NVRAM owned by a guest; and no real way to be
confident that the same memory hadn't been mapped multiple times.

Page structures in Xen are 32 bytes for non-BIGMEM systems (<5 TiB),
and 40 bytes for BIGMEM systems.

### Page structure allocation

There are three potential places we could store page structs:

 1. **System memory** Allocated from the host RAM

 2. **Inside the namespace** Like Linux, there could be memory set
   aside inside the namespace set aside specifically for mapping that
   namespace.  This could be 2a) As a user-visible separate partition,
   or 2b) allocated by `ndctl` from the namespace "superblock".  As
   the page frame areas of the namespace can be discontiguous (?), it
   would be possible to enable or disable this extra space on an
   existing namespace, to allow users with existing vNVDIMM images to
   switch to or from Xen.

 3. **A different namespace** NVRAM could be set aside for use by
   arbitrary namespaces.  This could be a 3a) specially-selected
   partition from a normal namespace, or it could be 3b) a namespace
   specifically designed to be used for Xen (perhaps with its own Type
   GUID).

2b has the advantage that we should be able to unilaterally allocate a
Type GUID and start using it for that purpose.  It also has the
advantage that it should be somewhat easier for someone with existing
vNVDIMM images to switch into (or away from) using Xen.  It has the
disadvantage of being less transparent to the user.

3b has the advantage of being invisible to the user once being set up.
It has the slight disadvantage of having more gatekeepers to get
through; and if those gatekeepers aren't happy with enabling or
disabling extra frametable space for Xen after creation (or if I've
misunderstood and such functionality isn't straightforward to
implement) then it will be difficult for people with existing images
to switch to Xen.

### Dealing with changing frame tables

Another potential issue to consider is the monolithic nature of the
current frame table.  At the moment, to find a page struct given an
mfn, you use the mfn as an index into a single large array.

I think we can assume that NVDIMM SPA ranges will be separate from
normal system RAM.  There's no reason the frame table couldn't be
"sparse": i.e., only the sections of it that actually contain valid
pages need to have ram backing them.

However, if we pursue a solution like Linux, where each namespace
contains memory set aside to use for its own pagetables, we may have a
situation where boundary between two namespaces falls in the middle of
a frame table page; in that case, from where should such a frame table
page be allocated?

A simple answer would be to use system RAM to "cover the gap": There
would only ever need to be a single page per boundary.

## Page tracking for domain 0

When domain 0 adds or removes entries from its pagetables, it does not
explicitly store the memory type (i.e., whether RAM or MMIO); Xen
infers this from its knowledge of where RAM is and is not.  Below we
will explore design choices that involve domain 0 telling Xen about
NVDIMM namespaces, SPAs, and what it can use for page structures.  In
such a scenario, NVRAM pages essentially transition from being MMIO
(before Xen knows about them) to being RAM (after Xen knows about
them), which in turn has implications for any mappings which domain 0
has in its pagetables.

## PVH and QEMU

A number of solutions have suggested using QEMU to provide emulated
NVDIMM support to guests.  This is a workable solution for HVM guests,
but for PVH guests we would like to avoid introducing a device model
if at all possible.

## FS DAX and DMA in Linux

There is [an issue][linux-fs-dax-dma-issue] with DAX and filesystems,
in that filesystems (even those claiming to support DAX) may want to
rearrange the block<->file mapping "under the feet" of running
processes with mapped files.  Unfortunately, this is more tricky with
DAX than with a page cache, and as of [early 2018][linux-fs-dax-dma-2]
was essentially incompatible with virtualization. ("I think we need to
enforce this in the host kernel. I.e. do not allow file backed DAX
pages to be mapped in EPT entries unless / until we have a solution to
the DMA synchronization problem.")

More needs to be discussed and investigated here; but for the time
being, mapping a file in a DAX filesystem into a guest's p2m is
probably not going to be possible.

[linux-fs-dax-dma-issue]:
https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[linux-fs-dax-dma-2]:
https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07347.html

# Target functionality

The above sets the stage, but to actually determine on an architecture
we have to decide what kind of final functionality we're looking for.
The functionality falls into two broad areas: Functionality from the
host administrator's point of view (accessed from domain 0), and
functionality from the guest administrator's point of view.

## Domain 0 functionality

For the purposes of this section, I shall be distinguishing between
"native Linux" functionality and "domain 0" functionality.  By "native
Linux" functionality I mean functionality which is available when
Linux is running on bare metal -- `ndctl`, `/dev/pmem`, `/dev/dax`,
and so on.  By "dom0 functionality" I mean functionality which is
available in domain 0 when Linux is running under Xen.

 1. **Disjoint functionality** Have dom0 and native Linux
   functionality completely separate: namespaces created when booted
   on native Linux would not be accessible when booted under domain 0,
   and vice versa.  Some Xen-specific tool similar to `ndctl` would
   need to be developed for accessing functionality.

 2. **Shared data but no dom0 functionality** Another option would be
   to have Xen and Linux have shared access to the same namespaces,
   but dom0 essentially have no direct access to the NVDIMM.  Xen
   would read the NFIT, parse namespaces, and expose those namespaces
   to dom0 like any other guest; but dom0 would not be able to create
   or modify namespaces.  To manage namespaces, an administrator would
   need to boot into native Linux, modify the namespaces, and then
   reboot into Xen again.

 3. **Dom0 fully functional, Manual Xen frame table** Another level of
   functionality would be to make it possible for dom0 to have full
   parity with native Linux in terms of using `ndctl` to manage
   namespaces, but to require the host administrator to manually set
   aside NVRAM for Xen to use for frame tables.

 4. **Dom0 fully functional, automatic Xen frame table** This is like
   the above, but with the Xen frame table space automatically
   managed, similar to Linux's: You'd simply specify that you wanted
   the Xen frametable somehow when you create the namespace, and from
   then on forget about it.

Number 1 should be avoided if at all possible, in my opinion.

Given that the NFIT table doesn't currently have namespace UUIDs or
other key pieces of information to fully understand the namespaces, it
seems like #2 would likely not be able to be made functional enough.

Number 3 should be achievable under our control.  Obviously #4 would
be ideal, but might depend on getting cooperation from the Linux
NVDIMM maintainers to be able to set aside Xen frame table memory in
addition to Linux frame table memory.

## Guest functionality

  1. **No remapping** The guest can take the PMEM device as-is.  It's
    mapped by the toolstack at a specific place in _guest physical
    address_ (GPA) space and cannot be moved.  There is no controller
    emulation (which would allow remapping) and minimal label area
    functionality.

  2. **Full controller access for PMEM**.  The guest has full
    controller access for PMEM: it can carve up namespaces, change
    mappings in GPA space, and so on.

  3. **Full controller access for both PMEM and PBLK**.  A guest has
    full controller access, and can carve up its NVRAM into arbitrary
    PMEM or PBLK regions, as it wants.

Numbers 2 and 3 would of course be nice-to-have, but would almost
certainly involve having a QEMU qprocess to emulate them.  Since we'd
like to have PVH use NVDIMMs, we should at least make #1 an option.

# Proposed design / roadmap

Initially, dom0 accesses the NVRAM as normal, using static ACPI tables
and the DSM methods; mappings are treated by Xen during this phase as
MMIO.

Once dom0 is ready to pass parts of a namespace through to a guest, it
makes a hypercall to tell Xen about the namespace.  It includes any
regions of the namespace which Xen may use for 'scratch'; it also
includes a flag to indicate whether this 'scratch' space may be used
for frame tables from other namespaces.

Frame tables are then created for this SPA range.  They will be
allocated from, in this order: 1) designated 'scratch' range from
within this namespace 2) designated 'scratch' range from other
namespaces which has been marked as sharable 3) system RAM.

Xen will either verify that dom0 has no existing mappings, or promote
the mappings to full pages (taking appropriate reference counts for
mappings).  Dom0 must ensure that this namespace is not unmapped,
modified, or relocated until it asks Xen to unmap it.

For Xen frame tables, to begin with, set aside a partition inside a
namespace to be used by Xen.  Pass this in to Xen when activating the
namespace; this could be either 2a or 3a from "Page structure
allocation".  After that, we could decide which of the two more
streamlined approaches (2b or 3b) to pursue.

At this point, dom0 can pass parts of the mapped namespace into
guests.  Unfortunately, passing files on a fsdax filesystem is
probably not safe; but we can pass in full dev-dax or fsdax
partitions.

From a guest perspective, I propose we provide static NFIT only, no
access to labels to begin with.  This can be generated in hvmloader
and/or the toolstack acpi code.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread