* Draft NVDIMM proposal
From: George Dunlap @ 2018-05-09 17:29 UTC
To: xen-devel; +Cc: Andrew Cooper, Yi Zhang, Jan Beulich, Roger Pau Monne

Below is an initial draft of an NVDIMM proposal. I'll submit a patch to include it in the tree at some point, but I thought for initial discussion it would be easier if it were copied in-line.

I've done a fair amount of investigation, but it's quite likely I've made mistakes. Please send me corrections where necessary.

-George

---
% NVDIMMs and Xen
% George Dunlap
% Revision 0.1

# NVDIMM overview

It's very difficult, from the various specs, to actually get a complete enough picture of what's going on to make a good design. This section is meant as an overview of the current hardware, firmware, and Linux interfaces sufficient to inform a discussion of the issues in designing a Xen interface for NVDIMMs.

## DIMMs, Namespaces, and access methods

An NVDIMM is a DIMM (_dual in-line memory module_, a physical form factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of memory on a DIMM are specified by a _DIMM physical address_ or DPA. Each DIMM is attached to an NVDIMM controller.

Memory on the DIMMs is divided up into _namespaces_. The word "namespace" is rather misleading though; a namespace in this context is not actually a space of names (contrast, for example, "C++ namespaces"); rather, it's more like a SCSI LUN, or a volume, or a partition on a drive: a set of data which is meant to be viewed and accessed as a unit. (The name was apparently carried over from NVMe devices, which were precursors of the NVDIMM spec.)

The NVDIMM controller allows two ways to access the DIMM. One is mapped 1-1 in _system physical address space_ (SPA), much like normal RAM. This method of access is called _PMEM_.
The other method is similar to that of a PCI device: you have control and status registers which control an 8k aperture window into the DIMM. This method of access is called _PBLK_.

In the case of PMEM, as in the case of DRAM, addresses from the SPA are interleaved across a set of DIMMs (an _interleave set_) for performance reasons. A specific PMEM namespace will be a single contiguous DPA range across all DIMMs in its interleave set. For example, you might have a namespace for DPAs `0-0x50000000` on DIMMs 0 and 1; and another namespace for DPAs `0x80000000-0xa0000000` on DIMMs 0, 1, 2, and 3.

In the case of PBLK, a namespace always resides on a single DIMM. However, that namespace can be made up of multiple discontiguous chunks of space on that DIMM. For instance, in our example above, we might have a namespace on DIMM 0 consisting of DPAs `0x50000000-0x60000000`, `0x80000000-0x90000000`, and `0xa0000000-0xf0000000`.

The interleaving of PMEM has implications for the speed and reliability of the namespace: Much like RAID 0, it maximizes speed, but it means that if any one DIMM fails, the data from the entire namespace is corrupted. PBLK is slightly less straightforward to access, but it allows OS software to apply RAID-like logic to balance redundancy and speed.

Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM; for large systems without 5-level paging, this is actually becoming a limitation. Using PBLK allows existing 4-level paged systems to access an arbitrary amount of NVDIMM.

## Namespaces, labels, and the label area

A namespace is a mapping from the SPA and MMIO space into the DIMM.

The firmware and/or operating system can talk to the NVDIMM controller to set up mappings from SPA and MMIO space into the DIMM. Because the memory and PCI devices are separate, it would be possible for buggy firmware or NVDIMM controller drivers to misconfigure things such that the same DPA is exposed in multiple places; if so, the results are undefined.
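
To make the last point concrete, here is a minimal Python sketch of the kind of sanity check a driver or tool might run to spot a DPA range that is exposed more than once on the same DIMM. The tuple layout, the namespace names, and the `find_dpa_overlaps` helper are assumptions made for illustration; they are not taken from any spec or existing tool. Ranges are treated as half-open, `[start, end)`.

```python
from collections import defaultdict


def find_dpa_overlaps(regions):
    """Return (dimm, ns_a, ns_b) for every pair of regions that overlap.

    `regions` is an iterable of (namespace, dimm, start, end) tuples, where
    [start, end) is a DPA range on that DIMM (hypothetical layout).
    """
    by_dimm = defaultdict(list)
    for ns, dimm, start, end in regions:
        by_dimm[dimm].append((start, end, ns))

    overlaps = []
    for dimm, ranges in sorted(by_dimm.items()):
        ranges.sort()  # order by start address
        for i, (_, end_a, ns_a) in enumerate(ranges):
            for start_b, _, ns_b in ranges[i + 1:]:
                if start_b >= end_a:
                    break  # sorted by start: nothing later can overlap
                overlaps.append((dimm, ns_a, ns_b))
    return overlaps


# Two PMEM namespaces and one PBLK chunk, loosely following the text.
regions = [("pmem-a", d, 0x0, 0x50000000) for d in (0, 1)]
regions += [("pmem-b", d, 0x80000000, 0xa0000000) for d in (0, 1, 2, 3)]
regions.append(("pblk-c", 0, 0x50000000, 0x60000000))
clean = find_dpa_overlaps(regions)

# A deliberately misconfigured (made-up) region straddling two namespaces.
broken = find_dpa_overlaps(regions + [("bad", 0, 0x4fff0000, 0x50010000)])
```

With the layout above, `clean` is empty, while `broken` reports that the made-up region collides with both of its neighbours on DIMM 0.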

Namespaces are constructed out of "labels". Each DIMM has a Label Storage Area, which is persistent but logically separate from the device-addressable areas on the DIMM. A label on a DIMM describes a single contiguous region of DPA on that DIMM. A PMEM namespace is made up of one label from each of the DIMMs which make up its interleave set; a PBLK namespace is made up of one label for each contiguous chunk of DPA.

In our examples above, the first PMEM namespace would be made of two labels (one on DIMM 0 and one on DIMM 1, each describing DPA `0-0x50000000`), and the second namespace would be made of four labels (one on DIMM 0, one on DIMM 1, and so on). Similarly, in the PBLK example, the namespace would consist of three labels; one describing `0x50000000-0x60000000`, one describing `0x80000000-0x90000000`, and so on.

The namespace definition includes not only information about the DPAs which make up the namespace and how they fit together; it also includes a UUID for the namespace (to allow it to be identified uniquely), a 64-character "name" field for a human-friendly description, and Type and Address Abstraction GUIDs to inform the operating system how the data inside the namespace should be interpreted. Additionally, it can have an `ROLABEL` flag, which indicates to the OS that "device drivers and manageability software should refuse to make changes to the namespace labels", because "attempting to make configuration changes that affect the namespace labels will fail (i.e. because the VM guest is not in a position to make the change correctly)".

See the [UEFI Specification][uefi-spec], section 13.19, "NVDIMM Label Protocol", for more information.

[uefi-spec]: http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf

## NVDIMMs and ACPI

The [ACPI Specification][acpi-spec] breaks down information in two ways.

The first is about physical devices (see section 9.20, "NVDIMM Devices"). The NVDIMM controller is called the _NVDIMM Root Device_.
There will generally be only a single NVDIMM root device on a system.

Individual NVDIMMs are referred to by the spec as _NVDIMM Devices_. Each separate DIMM will have its own device listed as being under the Root Device. Each DIMM will have an _NVDIMM Device Handle_ which describes the physical DIMM (its location within the memory channel, the channel number within the memory controller, the memory controller ID within the socket, and so on).

The second is about the data on those devices, and how the operating system can access it. This information is exposed in the NFIT table (see section 5.2.25).

Because namespace labels allow NVDIMMs to be partitioned in fairly arbitrary ways, exposing information about how the operating system can access them is a bit complicated. The information is spread across several tables, which must be correlated to make sense of it.

These tables include:

1. A table of DPA ranges on individual NVDIMM devices
2. A table of SPA ranges where PMEM regions are mapped, along with interleave sets
3. Tables for control and data addresses for PBLK regions

NVRAM on a given NVDIMM device will be broken down into one or more _regions_. These regions are enumerated in the NVDIMM Region Mapping Structure. Each entry in this table contains the NVDIMM Device Handle for the device the region is in, as well as the DPA range for the region (called "NVDIMM Physical Address Base" and "NVDIMM Region Size" in the spec). Regions which are part of a PMEM namespace will have references into SPA tables and interleave set tables; regions which are part of PBLK namespaces will have references into control region and block data window region structures.

[acpi-spec]: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf

## Namespaces and the OS

At boot time, the firmware will read the label regions from the NVDIMM device and set up the memory controllers appropriately.
It will then construct a table describing the resulting regions, called the NFIT table, and expose that table via ACPI.

To use a namespace, an operating system needs at a minimum two pieces of information: The UUID and/or Name of the namespace, and the SPA range where that namespace is mapped; and ideally also the Type and Abstraction Type to know how to interpret the data inside.

Unfortunately, the information needed to understand namespaces is somewhat disjoint. The namespace labels themselves contain the UUID, Name, Type, and Abstraction Type, but don't contain any information about SPA or block control / status registers and windows. The NFIT table contains a list of SPA Range Structures, which list the NVDIMM-related SPA ranges and their Type GUID; as well as a table containing individual DPA ranges, which specifies which SPAs they correspond to. But the NFIT does not contain the UUID or other identifying information from the Namespace labels. In order to actually discover that the namespace with UUID _X_ is mapped at SPA _Y-Z_, an operating system must:

1. Read the label areas of all NVDIMMs and discover the DPA range and Interleave Set for namespace _X_
2. Read the Region Mapping Structures from the NFIT table, and find out which structures match the DPA ranges for namespace _X_
3. Find the System Physical Address Range Structure Index associated with the Region Mapping
4. Look up the SPA Range Structure in the NFIT table using the SPA Range Structure Index
5. Read the SPA range _Y-Z_

An OS driver can modify the namespaces by modifying the Label Storage Areas of the corresponding DIMMs. The NFIT table describes how the OS can access the Label Storage Areas. Label Storage Areas may be "isolated", in which case the area would be accessed via device-specific AML methods (DSM), or they may be exposed directly using a well-known location.
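
As a rough illustration of the five-step UUID-to-SPA lookup listed earlier in this section, here is a Python sketch over a heavily simplified model. The dictionary field names (`uuid`, `dimm`, `dpa_base`, `size`, `spa_index`) and the `resolve_spa` helper are assumptions made for the example, not the actual label or NFIT layouts; the sketch also returns the whole SPA Range Structure and ignores interleave details and any offset of the namespace within that range.

```python
def resolve_spa(uuid, labels, region_mappings, spa_ranges):
    """Return the (start, end) SPA range backing namespace `uuid`."""
    # Step 1: collect the labels (one per DIMM for PMEM) carrying this UUID.
    ns_labels = [label for label in labels if label["uuid"] == uuid]
    if not ns_labels:
        raise LookupError(f"no labels found for namespace {uuid}")

    spa_indexes = set()
    for label in ns_labels:
        lo, hi = label["dpa_base"], label["dpa_base"] + label["size"]
        # Step 2: find the region mapping on the same DIMM that contains
        # the label's DPA range.
        for region in region_mappings:
            r_lo = region["dpa_base"]
            r_hi = r_lo + region["size"]
            if region["dimm"] == label["dimm"] and r_lo <= lo and hi <= r_hi:
                # Step 3: note which SPA Range Structure it points at.
                spa_indexes.add(region["spa_index"])
                break
        else:
            raise LookupError(f"no region mapping for DIMM {label['dimm']}")

    if len(spa_indexes) != 1:
        raise LookupError("labels do not agree on one SPA Range Structure")

    # Steps 4 and 5: look up that structure and read its range.
    base, length = spa_ranges[spa_indexes.pop()]
    return base, base + length


# Namespace "X" interleaved over DIMMs 0 and 1 (made-up numbers).
labels = [{"uuid": "X", "dimm": d, "dpa_base": 0x0, "size": 0x50000000}
          for d in (0, 1)]
region_mappings = [{"dimm": d, "dpa_base": 0x0, "size": 0x50000000,
                    "spa_index": 1} for d in (0, 1)]
spa_ranges = {1: (0x100000000, 0xa0000000)}  # index -> (base, length)

start, end = resolve_spa("X", labels, region_mappings, spa_ranges)
print(f"{start:#x}-{end:#x}")  # prints 0x100000000-0x1a0000000
```

The point of the sketch is that neither data source is enough on its own: the UUID only appears in the labels, and the SPA range only appears in the NFIT.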
AML methods to access the label areas are "dumb": they are essentially a memcpy() which copies into or out of a given {DIMM, Label Area Offset} address. No checking for validity of reads and writes is done, and simply modifying the labels does not change the mapping immediately -- this must be done either by the OS driver reprogramming the NVDIMM memory controller, or by rebooting and allowing the firmware to do it.

Modifying labels is tricky, due to an issue that will be somewhat of a recurring theme when discussing NVDIMMs: The necessity of assuming that, at any given point in time, power may be suddenly cut, and the system needing to be able to recover sensible data in such a circumstance. The [UEFI Specification][uefi-spec] chapter on the NVDIMM label protocol specifies how the label area is to be modified such that a consistent "view" is always available; and how firmware and the operating system should respond consistently to labels which appear corrupt.

## NVDIMMs and filesystems

Along the same line, most filesystems are written with the assumption that a given write to a block device will either finish completely, or be entirely reverted. Since accesses to NVDIMMs (even in PBLK mode) are essentially `memcpy`s, writes may well be interrupted halfway through, resulting in _sector tearing_.

In order to help with this, the UEFI spec defines a method of reading and writing NVRAM which is capable of emulating sector-atomic write semantics via a _block translation table_ (BTT) ([UEFI spec][uefi-spec], chapter 6, "Block Translation Table (BTT) Layout"). Namespaces accessed via this discipline will have a _BTT info block_ at the beginning of the namespace (similar to a superblock on a traditional hard disk). Additionally, the AddressAbstraction GUID in the namespace label(s) should be set to `EFI_BTT_ABSTRACTION_GUID`.

## Linux

Linux has a _direct access_ (DAX) filesystem mount mode for block devices which are "memory-like" ^[kernel-dax].
If both the filesystem and the underlying device support DAX, and the `dax` mount option is enabled, then when a file on that filesystem is `mmap`ed, the page cache is bypassed and the underlying storage is mapped directly into the user process. (?)

[kernel-dax]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Linux has a tool called `ndctl` to manage NVDIMM namespaces. From the documentation it looks fairly well abstracted: you don't typically specify individual DPAs when creating PBLK or PMEM regions: you specify the type you want and the size and it works out the layout details (?).

The `ndctl` tool allows you to make PMEM namespaces in one of four modes: `raw`, `sector`, `fsdax` (or `memory`), and `devdax` (or, confusingly, `dax`).

The `raw`, `sector`, and `fsdax` modes all result in a block device in the pattern of `/dev/pmemN[.M]`, in which a filesystem can be stored. `devdax` results in a character device in the pattern of `/dev/daxN[.M]`.

It's not clear from the documentation exactly what `raw` mode is or when it would be safe to use it.

`sector` mode implements `BTT`; it is thus safe against sector tearing, but does not support mapping files in DAX mode. The namespace can be either PMEM or PBLK (?). As described above, the first block of the namespace will be a BTT info block.

`fsdax` and `devdax` mode are both designed to make it possible for user processes to have direct mapping of NVRAM. As such, both are only suitable for PMEM namespaces (?). Both also need to have kernel page structures allocated for each page of NVRAM; this amounts to 64 bytes for every 4k of NVRAM. Memory for these page structures can either be allocated out of normal "system" memory, or inside the PMEM namespace itself.

In both cases, an "info block", very similar to the BTT info block, is written to the beginning of the namespace when created. This info block specifies whether the page structures come from system memory or from the namespace itself.
If from the namespace itself, it contains information about what parts of the namespace have been set aside for Linux to use for this purpose.

Linux has also defined "Type GUIDs" for these two types of namespace to be stored in the namespace label, although these are not yet in the ACPI spec.

Documentation seems to indicate that both `pmem` and `dax` devices can be further subdivided (by mentioning `/dev/pmemN.M` and `/dev/daxN.M`), but doesn't mention specifically how. `pmem` devices, being block devices, can presumably be partitioned like a block device can. `dax` devices may have something similar, or may have their own subdivision mechanism. The rest of this document will assume that this is the case.

# Xen considerations

## RAM and MMIO in Xen

Xen generally has two types of things that can go into a pagetable or p2m. The first is RAM or "system memory". RAM has a page struct, which allows it to be accounted for on a page-by-page basis: Assigned to a specific domain, reference counted, and so on.

The second is MMIO. MMIO areas do not have page structures, and thus cannot be accounted on a page-by-page basis. Xen knows about PCI devices and the associated MMIO ranges, and makes sure that PV pagetables or HVM p2m tables only contain MMIO mappings for devices which have been assigned to a guest.

## Page structures

To begin with, Xen, like Linux, needs page structs for NVDIMM memory. Without page structs, we don't have reference counts; which means there's no safe way, for instance, for a guest to ask a PV device to write into NVRAM owned by a guest; and no real way to be confident that the same memory hadn't been mapped multiple times.

Page structures in Xen are 32 bytes for non-BIGMEM systems (<5 TiB), and 40 bytes for BIGMEM systems.

### Page structure allocation

There are three potential places we could store page structs:

1. **System memory** Allocated from the host RAM

2. **Inside the namespace** Like Linux, there could be memory set aside inside the namespace specifically for mapping that namespace. This could be 2a) As a user-visible separate partition, or 2b) allocated by `ndctl` from the namespace "superblock". As the page frame areas of the namespace can be discontiguous (?), it would be possible to enable or disable this extra space on an existing namespace, to allow users with existing vNVDIMM images to switch to or from Xen.

3. **A different namespace** NVRAM could be set aside for use by arbitrary namespaces. This could be a 3a) specially-selected partition from a normal namespace, or it could be 3b) a namespace specifically designed to be used for Xen (perhaps with its own Type GUID).

2b has the advantage that we should be able to unilaterally allocate a Type GUID and start using it for that purpose. It also has the advantage that it should be somewhat easier for someone with existing vNVDIMM images to switch into (or away from) using Xen. It has the disadvantage of being less transparent to the user.

3b has the advantage of being invisible to the user once being set up. It has the slight disadvantage of having more gatekeepers to get through; and if those gatekeepers aren't happy with enabling or disabling extra frametable space for Xen after creation (or if I've misunderstood and such functionality isn't straightforward to implement) then it will be difficult for people with existing images to switch to Xen.

### Dealing with changing frame tables

Another potential issue to consider is the monolithic nature of the current frame table. At the moment, to find a page struct given an mfn, you use the mfn as an index into a single large array.

I think we can assume that NVDIMM SPA ranges will be separate from normal system RAM. There's no reason the frame table couldn't be "sparse": i.e., only the sections of it that actually contain valid pages need to have RAM backing them.
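
To put rough numbers on the sparse frame table idea, here is a small Python sketch that works out which pages of an mfn-indexed frame table would need backing for a given SPA range. It assumes 4k pages and the 32-byte page structure quoted earlier; the `frame_table_pages` helper and the example SPA placement are made up for illustration and are not Xen code.

```python
PAGE_SIZE = 4096  # bytes per page frame


def frame_table_pages(spa_start, spa_end, struct_size=32):
    """Return the frame-table page indexes [first, end) that hold the page
    structs for the MFNs covering the SPA range [spa_start, spa_end)."""
    first_mfn = spa_start // PAGE_SIZE
    end_mfn = (spa_end + PAGE_SIZE - 1) // PAGE_SIZE  # round up
    # The frame table is one array indexed by mfn, so a struct's byte
    # offset is simply mfn * struct_size.
    first = (first_mfn * struct_size) // PAGE_SIZE
    end = (end_mfn * struct_size + PAGE_SIZE - 1) // PAGE_SIZE
    return first, end


# A hypothetical 1 TiB namespace mapped at SPA 4 TiB.
first, end = frame_table_pages(4 << 40, 5 << 40)
backing_bytes = (end - first) * PAGE_SIZE  # 8 GiB of frame table
```

In this example only frame-table pages `first` through `end - 1` need memory behind them; the rest of the array between system RAM and the namespace could stay unpopulated.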

However, if we pursue a solution like Linux, where each namespace contains memory set aside to use for its own pagetables, we may have a situation where the boundary between two namespaces falls in the middle of a frame table page; in that case, from where should such a frame table page be allocated? A simple answer would be to use system RAM to "cover the gap": There would only ever need to be a single page per boundary.

## Page tracking for domain 0

When domain 0 adds or removes entries from its pagetables, it does not explicitly store the memory type (i.e., whether RAM or MMIO); Xen infers this from its knowledge of where RAM is and is not. Below we will explore design choices that involve domain 0 telling Xen about NVDIMM namespaces, SPAs, and what it can use for page structures. In such a scenario, NVRAM pages essentially transition from being MMIO (before Xen knows about them) to being RAM (after Xen knows about them), which in turn has implications for any mappings which domain 0 has in its pagetables.

## PVH and QEMU

A number of solutions have suggested using QEMU to provide emulated NVDIMM support to guests. This is a workable solution for HVM guests, but for PVH guests we would like to avoid introducing a device model if at all possible.

## FS DAX and DMA in Linux

There is [an issue][linux-fs-dax-dma-issue] with DAX and filesystems, in that filesystems (even those claiming to support DAX) may want to rearrange the block<->file mapping "under the feet" of running processes with mapped files. Unfortunately, this is more tricky with DAX than with a page cache, and as of [early 2018][linux-fs-dax-dma-2] was essentially incompatible with virtualization. ("I think we need to enforce this in the host kernel. I.e.
do not allow file backed DAX pages to be mapped in EPT entries unless / until we have a solution to the DMA synchronization problem.")

More needs to be discussed and investigated here; but for the time being, mapping a file in a DAX filesystem into a guest's p2m is probably not going to be possible.

[linux-fs-dax-dma-issue]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[linux-fs-dax-dma-2]: https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07347.html

# Target functionality

The above sets the stage, but to actually decide on an architecture we have to determine what kind of final functionality we're looking for. The functionality falls into two broad areas: Functionality from the host administrator's point of view (accessed from domain 0), and functionality from the guest administrator's point of view.

## Domain 0 functionality

For the purposes of this section, I shall be distinguishing between "native Linux" functionality and "domain 0" functionality. By "native Linux" functionality I mean functionality which is available when Linux is running on bare metal -- `ndctl`, `/dev/pmem`, `/dev/dax`, and so on. By "dom0 functionality" I mean functionality which is available in domain 0 when Linux is running under Xen.

1. **Disjoint functionality** Have dom0 and native Linux functionality completely separate: namespaces created when booted on native Linux would not be accessible when booted under domain 0, and vice versa. Some Xen-specific tool similar to `ndctl` would need to be developed for accessing functionality.

2. **Shared data but no dom0 functionality** Another option would be to have Xen and Linux have shared access to the same namespaces, but dom0 essentially have no direct access to the NVDIMM. Xen would read the NFIT, parse namespaces, and expose those namespaces to dom0 like any other guest; but dom0 would not be able to create or modify namespaces.
To manage namespaces, an administrator would need to boot into native Linux, modify the namespaces, and then reboot into Xen again.

3. **Dom0 fully functional, Manual Xen frame table** Another level of functionality would be to make it possible for dom0 to have full parity with native Linux in terms of using `ndctl` to manage namespaces, but to require the host administrator to manually set aside NVRAM for Xen to use for frame tables.

4. **Dom0 fully functional, automatic Xen frame table** This is like the above, but with the Xen frame table space automatically managed, similar to Linux's: You'd simply specify that you wanted the Xen frametable somehow when you create the namespace, and from then on forget about it.

Number 1 should be avoided if at all possible, in my opinion.

Given that the NFIT table doesn't currently have namespace UUIDs or other key pieces of information to fully understand the namespaces, it seems like #2 would likely not be able to be made functional enough.

Number 3 should be achievable under our control. Obviously #4 would be ideal, but might depend on getting cooperation from the Linux NVDIMM maintainers to be able to set aside Xen frame table memory in addition to Linux frame table memory.

## Guest functionality

1. **No remapping** The guest can take the PMEM device as-is. It's mapped by the toolstack at a specific place in _guest physical address_ (GPA) space and cannot be moved. There is no controller emulation (which would allow remapping) and minimal label area functionality.

2. **Full controller access for PMEM**. The guest has full controller access for PMEM: it can carve up namespaces, change mappings in GPA space, and so on.

3. **Full controller access for both PMEM and PBLK**. A guest has full controller access, and can carve up its NVRAM into arbitrary PMEM or PBLK regions, as it wants.

Numbers 2 and 3 would of course be nice-to-have, but would almost certainly involve having a QEMU process to emulate them.
Since we'd like to have PVH use NVDIMMs, we should at least make #1 an option.

# Proposed design / roadmap

Initially, dom0 accesses the NVRAM as normal, using static ACPI tables and the DSM methods; mappings are treated by Xen during this phase as MMIO.

Once dom0 is ready to pass parts of a namespace through to a guest, it makes a hypercall to tell Xen about the namespace. It includes any regions of the namespace which Xen may use for 'scratch'; it also includes a flag to indicate whether this 'scratch' space may be used for frame tables from other namespaces.

Frame tables are then created for this SPA range. They will be allocated from, in this order:

1. designated 'scratch' range from within this namespace
2. designated 'scratch' range from other namespaces which has been marked as sharable
3. system RAM

Xen will either verify that dom0 has no existing mappings, or promote the mappings to full pages (taking appropriate reference counts for mappings). Dom0 must ensure that this namespace is not unmapped, modified, or relocated until it asks Xen to unmap it.

For Xen frame tables, to begin with, set aside a partition inside a namespace to be used by Xen. Pass this in to Xen when activating the namespace; this could be either 2a or 3a from "Page structure allocation". After that, we could decide which of the two more streamlined approaches (2b or 3b) to pursue.

At this point, dom0 can pass parts of the mapped namespace into guests. Unfortunately, passing files on a fsdax filesystem is probably not safe; but we can pass in full dev-dax or fsdax partitions.

From a guest perspective, I propose we provide static NFIT only, no access to labels to begin with. This can be generated in hvmloader and/or the toolstack acpi code.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

* Re: Draft NVDIMM proposal
From: George Dunlap @ 2018-05-09 17:35 UTC
To: xen-devel
Cc: Andrew Cooper, Dan Williams, Yi Zhang, Jan Beulich, Roger Pau Monne

Dan,

I understand that you're the NVDIMM maintainer for Linux. I've been working with your colleagues to try to sort out an architecture to allow NVRAM to be passed to guests under the Xen hypervisor.

If you have time, I'd appreciate it if you could skim through at least the first section of the document below ("NVDIMM overview"), concerning NVDIMM devices and Linux, to see if I've made any mistakes.

If you're up for it, additional early feedback on the proposed Xen architecture, from a Linux perspective, would be awesome as well.

Thanks,
-George

On 05/09/2018 06:29 PM, George Dunlap wrote:
> Below is an initial draft of an NVDIMM proposal. I'll submit a patch to
> include it in the tree at some point, but I thought for initial
> discussion it would be easier if it were copied in-line.
>
> I've done a fair amount of investigation, but it's quite likely I've
> made mistakes. Please send me corrections where necessary.
>
> -George
>
> ---
> % NVDIMMs and Xen
> % George Dunlap
> % Revision 0.1
>
> # NVDIMM overview
>
> It's very difficult, from the various specs, to actually get a
> complete enough picture if what's going on to make a good design.
> This section is meant as an overview of the current hardware,
> firmware, and Linux interfaces sufficient to inform a discussion of
> the issues in designing a Xen interface for NVDIMMs.
>
> ## DIMMs, Namespaces, and access methods
>
> An NVDIMM is a DIMM (_dual in-line memory module_) -- a physical form
> factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of
> memory on a DIMM are specified by a _DIMM physical address_ or DPA.
> Each DIMM is attached to an NVDIMM controller. > > Memory on the DIMMs is divided up into _namespaces_. The word > "namespace" is rather misleading though; a namespace in this context > is not actually a space of names (contrast, for example "C++ > namespaces"); rather, it's more like a SCSI LUN, or a volume, or a > partition on a drive: a set of data which is meant to be viewed and > accessed as a unit. (The name was apparently carried over from NVMe > devices, which were precursors of the NVDIMM spec.) > > The NVDIMM controller allows two ways to access the DIMM. One is > mapped 1-1 in _system physical address space_ (SPA), much like normal > RAM. This method of access is called _PMEM_. The other method is > similar to that of a PCI device: you have a control and status > register which control an 8k aperture window into the DIMM. This > method access is called _PBLK_. > > In the case of PMEM, as in the case of DRAM, addresses from the SPA > are interleaved across a set of DIMMs (an _interleave set_) for > performance reasons. A specific PMEM namespace will be a single > contiguous DPA range across all DIMMs in its interleave set. For > example, you might have a namespace for DPAs `0-0x50000000` on DIMMs 0 > and 1; and another namespace for DPAs `0x80000000-0xa0000000` on DIMMs > 0, 1, 2, and 3. > > In the case of PBLK, a namespace always resides on a single DIMM. > However, that namespace can be made up of multiple discontiguous > chunks of space on that DIMM. For instance, in our example above, we > might have a namespace om DIMM 0 consisting of DPAs > `0x50000000-0x60000000`, `0x80000000-0x90000000`, and > `0xa0000000-0xf0000000`. > > The interleaving of PMEM has implications for the speed and > reliability of the namespace: Much like RAID 0, it maximizes speed, > but it means that if any one DIMM fails, the data from the entire > namespace is corrupted. 
PBLK makes it slightly less straightforward > to access, but it allows OS software to apply RAID-like logic to > balance redundancy and speed. > > Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM; > for large systems without 5-level paging, this is actually becoming a > limitation. Using PBLK allows existing 4-level paged systems to > access an arbitrary amount of NVDIMM. > > ## Namespaces, labels, and the label area > > A namespace is a mapping from the SPA and MMIO space into the DIMM. > > The firmware and/or operating system can talk to the NVDIMM controller > to set up mappings from SPA and MMIO space into the DIMM. Because the > memory and PCI devices are separate, it would be possible for buggy > firmware or NVDIMM controller drivers to misconfigure things such that > the same DPA is exposed in multiple places; if so, the results are > undefined. > > Namespaces are constructed out of "labels". Each DIMM has a Label > Storage Area, which is persistent but logically separate from the > device-addressable areas on the DIMM. A label on a DIMM describes a > single contiguous region of DPA on that DIMM. A PMEM namespace is > made up of one label from each of the DIMMs which make its interleave > set; a PBLK namespace is made up of one label for each chunk of range. > > In our examples above, the first PMEM namespace would be made of two > labels (one on DIMM 0 and one on DIMM 1, each describind DPA > `0-0x50000000`), and the second namespace would be made of four labels > (one on DIMM 0, one on DIMM 1, and so on). Similarly, in the PBLK > example, the namespace would consist of three labels; one describing > `0x50000000-0x60000000`, one describing `0x80000000-0x90000000`, and > so on. 
> > The namespace definition includes not only information about the DPAs > which make up the namespace and how they fit together; it also > includes a UUID for the namespace (to allow it to be identified > uniquely), a 64-character "name" field for a human-friendly > description, and Type and Address Abstraction GUIDs to inform the > operating system how the data inside the namespace should be > interpreted. Additionally, it can have an `ROLABEL` flag, which > indicates to the OS that "device drivers and manageability software > should refuse to make changes to the namespace labels", because > "attempting to make configuration changes that affect the namespace > labels will fail (i.e. because the VM guest is not in a position to > make the change correctly)". > > See the [UEFI Specification][uefi-spec], section 13.19, "NVDIMM Label > Protocol", for more information. > > [uefi-spec]: > http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf > > ## NVDIMMs and ACPI > > The [ACPI Specification][acpi-spec] breaks down information in two ways. > > The first is about physical devices (see section 9.20, "NVDIMM > Devices"). The NVDIMM controller is called the _NVDIMM Root Device_. > There will generally be only a single NVDIMM root device on a system. > > Individual NVDIMMs are referred to by the spec as _NVDIMM Devices_. > Each separate DIMM will have its own device listed as being under the > Root Device. Each DIMM will have an _NVDIMM Device Handle_ which > describes the physical DIMM (its location within the memory channel, > the channel number within the memory controller, the memory controller > ID within the socket, and so on). > > The second is about the data on those devices, and how the operating > system can access it. This information is exposed in the NFIT table > (see section 5.2.25). 
>
> Because namespace labels allow NVDIMMs to be partitioned in fairly
> arbitrary ways, exposing information about how the operating system
> can access it is a bit complicated. It consists of several tables,
> whose information must be correlated to make sense out of it.
>
> These tables include:
> 1. A table of DPA ranges on individual NVDIMM devices
> 2. A table of SPA ranges where PMEM regions are mapped, along with
>    interleave sets
> 3. Tables for control and data addresses for PBLK regions
>
> NVRAM on a given NVDIMM device will be broken down into one or more
> _regions_. These regions are enumerated in the NVDIMM Region Mapping
> Structure. Each entry in this table contains the NVDIMM Device Handle
> for the device the region is in, as well as the DPA range for the
> region (called "NVDIMM Physical Address Base" and "NVDIMM Region Size"
> in the spec). Regions which are part of a PMEM namespace will have
> references into SPA tables and interleave set tables; regions which
> are part of PBLK namespaces will have references into control region
> and block data window region structures.
>
> [acpi-spec]:
> http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
>
> ## Namespaces and the OS
>
> At boot time, the firmware will read the label regions from the NVDIMM
> device and set up the memory controllers appropriately. It will then
> construct a table describing the resulting regions in a table called
> an NFIT table, and expose that table via ACPI.
>
> To use a namespace, an operating system needs at a minimum two pieces
> of information: The UUID and/or Name of the namespace, and the SPA
> range where that namespace is mapped; and ideally also the Type and
> Abstraction Type to know how to interpret the data inside.
>
> Unfortunately, the information needed to understand namespaces is
> somewhat disjoint.
The namespace labels themselves contain the UUID,
> Name, Type, and Abstraction Type, but don't contain any information
> about SPA or block control / status registers and windows. The NFIT
> table contains a list of SPA Range Structures, which list the
> NVDIMM-related SPA ranges and their Type GUID; as well as a table
> containing individual DPA ranges, which specifies which SPAs they
> correspond to. But the NFIT does not contain the UUID or other
> identifying information from the Namespace labels. In order to
> actually discover that namespace with UUID _X_ is mapped at SPA _Y-Z_,
> an operating system must:
>
> 1. Read the label areas of all NVDIMMs and discover the DPA range and
>    Interleave Set for namespace _X_
> 2. Read the Region Mapping Structures from the NFIT table, and find
>    out which structures match the DPA ranges for namespace _X_
> 3. Find the System Physical Address Range Structure Index associated
>    with the Region Mapping
> 4. Look up the SPA Range Structure in the NFIT table using the SPA
>    Range Structure Index
> 5. Read the SPA range _Y-Z_
>
> An OS driver can modify the namespaces by modifying the Label Storage
> Areas of the corresponding DIMMs. The NFIT table describes how the OS
> can access the Label Storage Areas. Label Storage Areas may be
> "isolated", in which case the area would be accessed via
> device-specific AML methods (DSM), or they may be exposed directly
> using a well-known location. AML methods to access the label areas
> are "dumb": they are essentially a memcpy() which copies into or out
> of a given {DIMM, Label Area Offset} address. No checking for
> validity of reads and writes is done, and simply modifying the labels
> does not change the mapping immediately -- this must be done either by
> the OS driver reprogramming the NVDIMM memory controller, or by
> rebooting and allowing the firmware to do it.
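
To make the five lookup steps above concrete, here is a minimal Python sketch over toy data. Everything in it is an assumption for illustration: the class and field names (`Label`, `RegionMapping`, `SpaRange`, `find_spa_for_namespace`) are hypothetical and far simpler than the real label and NFIT layouts, "match" in step 2 is modelled as plain containment, and the result is the SPA range of the enclosing region rather than a byte-exact range for the namespace.

```python
from dataclasses import dataclass

# Toy stand-ins for the structures discussed above. All names are
# hypothetical; the real label and NFIT formats carry many more fields.


@dataclass(frozen=True)
class Label:
    uuid: str       # namespace UUID stored in the label
    dimm: int       # DIMM whose Label Storage Area holds this label
    dpa_base: int   # start of the DPA range the label describes
    dpa_size: int   # length of that DPA range


@dataclass(frozen=True)
class RegionMapping:
    dimm: int       # stands in for the NVDIMM Device Handle
    dpa_base: int   # "NVDIMM Physical Address Base"
    dpa_size: int   # "NVDIMM Region Size"
    spa_index: int  # SPA Range Structure Index


@dataclass(frozen=True)
class SpaRange:
    index: int
    base: int
    length: int


def find_spa_for_namespace(uuid, labels, region_mappings, spa_ranges):
    """Return the set of (start, end) SPA ranges reachable from `uuid`."""
    # Step 1: collect the labels (and so the DPA ranges) of the namespace.
    wanted = [label for label in labels if label.uuid == uuid]
    spa_by_index = {spa.index: spa for spa in spa_ranges}
    found = set()
    for label in wanted:
        label_end = label.dpa_base + label.dpa_size
        for rm in region_mappings:
            # Step 2: find the region mapping covering this DPA range.
            covers = (rm.dimm == label.dimm
                      and rm.dpa_base <= label.dpa_base
                      and label_end <= rm.dpa_base + rm.dpa_size)
            if not covers:
                continue
            # Steps 3 and 4: follow the index to the SPA Range Structure.
            spa = spa_by_index[rm.spa_index]
            # Step 5: read off the SPA range.
            found.add((spa.base, spa.base + spa.length))
    return found
```

For the two-DIMM PMEM example, both labels lead to the same SPA Range Structure, so the function returns a single range; an unknown UUID yields an empty set.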
>
> Modifying labels is tricky, due to an issue that will be somewhat of a
> recurring theme when discussing NVDIMMs: The necessity of assuming
> that, at any given point in time, power may be suddenly cut, and the
> system needing to be able to recover sensible data in such a
> circumstance. The [UEFI Specification][uefi-spec] chapter on the
> NVDIMM label protocol specifies how the label area is to be modified
> such that a consistent "view" is always available; and how firmware
> and the operating system should respond consistently to labels which
> appear corrupt.
>
> ## NVDIMMs and filesystems
>
> Along the same lines, most filesystems are written with the assumption
> that a given write to a block device will either finish completely, or
> be entirely reverted. Since accesses to NVDIMMs (even in PBLK mode) are
> essentially `memcpy`s, writes may well be interrupted halfway through,
> resulting in _sector tearing_. In order to help with this, the UEFI
> spec defines a method of reading and writing NVRAM which is capable of
> emulating sector-atomic write semantics via a _block translation
> table_ (BTT) ([UEFI spec][uefi-spec], chapter 6, "Block Translation
> Table (BTT) Layout"). Namespaces accessed via this discipline will
> have a _BTT info block_ at the beginning of the namespace (similar to
> a superblock on a traditional hard disk). Additionally, the
> AddressAbstraction GUID in the namespace label(s) should be set to
> `EFI_BTT_ABSTRACTION_GUID`.
>
> ## Linux
>
> Linux has a _direct access_ (DAX) filesystem mount mode for block
> devices which are "memory-like" ^[kernel-dax]. If both the filesystem
> and the underlying device support DAX, and the `dax` mount option is
> enabled, then when a file on that filesystem is `mmap`ed, the page
> cache is bypassed and the underlying storage is mapped directly into
> the user process. (?)
>
> [kernel-dax]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt
>
> Linux has a tool called `ndctl` to manage NVDIMM namespaces. From the
> documentation it looks fairly well abstracted: you don't typically
> specify individual DPAs when creating PBLK or PMEM regions: you
> specify the type you want and the size and it works out the layout
> details (?).
>
> The `ndctl` tool allows you to make PMEM namespaces in one of four
> modes: `raw`, `sector`, `fsdax` (or `memory`), and `devdax` (or,
> confusingly, `dax`).
>
> The `raw`, `sector`, and `fsdax` modes all result in a block device in
> the pattern of `/dev/pmemN[.M]`, in which a filesystem can be stored.
> `devdax` results in a character device in the pattern of
> `/dev/daxN[.M]`.
>
> It's not clear from the documentation exactly what `raw` mode is or
> when it would be safe to use it.
>
> `sector` mode implements `BTT`; it is thus safe against sector
> tearing, but does not support mapping files in DAX mode. The
> namespace can be either PMEM or PBLK (?). As described above, the
> first block of the namespace will be a BTT info block.
>
> `fsdax` and `devdax` modes are both designed to make it possible for
> user processes to have direct mapping of NVRAM. As such, both are
> only suitable for PMEM namespaces (?). Both also need to have kernel
> page structures allocated for each page of NVRAM; this amounts to 64
> bytes for every 4k of NVRAM. Memory for these page structures can
> either be allocated out of normal "system" memory, or inside the PMEM
> namespace itself.
>
> In both cases, an "info block", very similar to the BTT info block, is
> written to the beginning of the namespace when created. This info
> block specifies whether the page structures come from system memory or
> from the namespace itself. If from the namespace itself, it contains
> information about what parts of the namespace have been set aside for
> Linux to use for this purpose.
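
To put the "64 bytes for every 4k" figure above in perspective, here is a small worked calculation in Python. The function name `page_struct_overhead` and the constants are only illustrative; the two sizes are the ones quoted in the text, and the sketch ignores any alignment or info-block padding a real implementation would add.

```python
PAGE_SIZE = 4096        # 4k pages, as quoted in the text
STRUCT_PAGE_SIZE = 64   # bytes of page structure per page, as quoted

GiB = 1 << 30
TiB = 1 << 40


def page_struct_overhead(nvram_bytes):
    """Bytes of page structures needed to cover `nvram_bytes` of NVRAM."""
    pages = nvram_bytes // PAGE_SIZE
    return pages * STRUCT_PAGE_SIZE


# 1 TiB of NVRAM is 2**28 pages, so it needs 2**34 bytes of page structures.
print(page_struct_overhead(TiB) // GiB)  # prints 16
```

That is a fixed ratio of 1/64 (about 1.6%), which suggests why storing the page structures inside the namespace becomes attractive once the NVRAM is much larger than system RAM.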
>
> Linux has also defined "Type GUIDs" for these two types of namespace
> to be stored in the namespace label, although these are not yet in the
> ACPI spec.
>
> Documentation seems to indicate that both `pmem` and `dax` devices can
> be further subdivided (by mentioning `/dev/pmemN.M` and
> `/dev/daxN.M`), but doesn't mention specifically how. `pmem` devices,
> being block devices, can presumably be partitioned like a block
> device can. `dax` devices may have something similar, or may have
> their own subdivision mechanism. The rest of this document will
> assume that this is the case.
>
> # Xen considerations
>
> ## RAM and MMIO in Xen
>
> Xen generally has two types of things that can go into a pagetable or
> p2m. The first is RAM or "system memory". RAM has a page struct,
> which allows it to be accounted for on a page-by-page basis: Assigned
> to a specific domain, reference counted, and so on.
>
> The second is MMIO. MMIO areas do not have page structures, and thus
> cannot be accounted on a page-by-page basis. Xen knows about PCI
> devices and the associated MMIO ranges, and makes sure that PV
> pagetables or HVM p2m tables only contain MMIO mappings for devices
> which have been assigned to a guest.
>
> ## Page structures
>
> To begin with, Xen, like Linux, needs page structs for NVDIMM
> memory. Without page structs, we don't have reference counts; which
> means there's no safe way, for instance, for a guest to ask a PV
> device to write into NVRAM owned by a guest; and no real way to be
> confident that the same memory hadn't been mapped multiple times.
>
> Page structures in Xen are 32 bytes for non-BIGMEM systems (<5 TiB),
> and 40 bytes for BIGMEM systems.
>
> ### Page structure allocation
>
> There are three potential places we could store page structs:
>
> 1. **System memory** Allocated from the host RAM
>
> 2.
**Inside the namespace** Like Linux, there could be memory set
>    aside inside the namespace specifically for mapping that
>    namespace. This could be 2a) as a user-visible separate partition,
>    or 2b) allocated by `ndctl` from the namespace "superblock". As
>    the page frame areas of the namespace can be discontiguous (?), it
>    would be possible to enable or disable this extra space on an
>    existing namespace, to allow users with existing vNVDIMM images to
>    switch to or from Xen.
>
> 3. **A different namespace** NVRAM could be set aside for use by
>    arbitrary namespaces. This could be 3a) a specially-selected
>    partition from a normal namespace, or it could be 3b) a namespace
>    specifically designed to be used for Xen (perhaps with its own Type
>    GUID).
>
> 2b has the advantage that we should be able to unilaterally allocate a
> Type GUID and start using it for that purpose. It also has the
> advantage that it should be somewhat easier for someone with existing
> vNVDIMM images to switch into (or away from) using Xen. It has the
> disadvantage of being less transparent to the user.
>
> 3b has the advantage of being invisible to the user once set up.
> It has the slight disadvantage of having more gatekeepers to get
> through; and if those gatekeepers aren't happy with enabling or
> disabling extra frametable space for Xen after creation (or if I've
> misunderstood and such functionality isn't straightforward to
> implement) then it will be difficult for people with existing images
> to switch to Xen.
>
> ### Dealing with changing frame tables
>
> Another potential issue to consider is the monolithic nature of the
> current frame table. At the moment, to find a page struct given an
> mfn, you use the mfn as an index into a single large array.
>
> I think we can assume that NVDIMM SPA ranges will be separate from
> normal system RAM.
There's no reason the frame table couldn't be
> "sparse": i.e., only the sections of it that actually contain valid
> pages need to have RAM backing them.
>
> However, if we pursue a solution like Linux, where each namespace
> contains memory set aside to use for its own pagetables, we may have a
> situation where the boundary between two namespaces falls in the
> middle of a frame table page; in that case, from where should such a
> frame table page be allocated?
>
> A simple answer would be to use system RAM to "cover the gap": There
> would only ever need to be a single page per boundary.
>
> ## Page tracking for domain 0
>
> When domain 0 adds or removes entries from its pagetables, it does not
> explicitly store the memory type (i.e., whether RAM or MMIO); Xen
> infers this from its knowledge of where RAM is and is not. Below we
> will explore design choices that involve domain 0 telling Xen about
> NVDIMM namespaces, SPAs, and what it can use for page structures. In
> such a scenario, NVRAM pages essentially transition from being MMIO
> (before Xen knows about them) to being RAM (after Xen knows about
> them), which in turn has implications for any mappings which domain 0
> has in its pagetables.
>
> ## PVH and QEMU
>
> A number of solutions have suggested using QEMU to provide emulated
> NVDIMM support to guests. This is a workable solution for HVM guests,
> but for PVH guests we would like to avoid introducing a device model
> if at all possible.
>
> ## FS DAX and DMA in Linux
>
> There is [an issue][linux-fs-dax-dma-issue] with DAX and filesystems,
> in that filesystems (even those claiming to support DAX) may want to
> rearrange the block<->file mapping "under the feet" of running
> processes with mapped files. Unfortunately, this is more tricky with
> DAX than with a page cache, and as of [early 2018][linux-fs-dax-dma-2]
> was essentially incompatible with virtualization. ("I think we need to
> enforce this in the host kernel. I.e.
do not allow file backed DAX
> pages to be mapped in EPT entries unless / until we have a solution to
> the DMA synchronization problem.")
>
> More needs to be discussed and investigated here; but for the time
> being, mapping a file in a DAX filesystem into a guest's p2m is
> probably not going to be possible.
>
> [linux-fs-dax-dma-issue]:
> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
> [linux-fs-dax-dma-2]:
> https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07347.html
>
> # Target functionality
>
> The above sets the stage, but to actually decide on an architecture
> we have to decide what kind of final functionality we're looking for.
> The functionality falls into two broad areas: Functionality from the
> host administrator's point of view (accessed from domain 0), and
> functionality from the guest administrator's point of view.
>
> ## Domain 0 functionality
>
> For the purposes of this section, I shall be distinguishing between
> "native Linux" functionality and "domain 0" functionality. By "native
> Linux" functionality I mean functionality which is available when
> Linux is running on bare metal -- `ndctl`, `/dev/pmem`, `/dev/dax`,
> and so on. By "dom0 functionality" I mean functionality which is
> available in domain 0 when Linux is running under Xen.
>
> 1. **Disjoint functionality** Have dom0 and native Linux
>    functionality completely separate: namespaces created when booted
>    on native Linux would not be accessible when booted under domain 0,
>    and vice versa. Some Xen-specific tool similar to `ndctl` would
>    need to be developed for accessing functionality.
>
> 2. **Shared data but no dom0 functionality** Another option would be
>    to have Xen and Linux have shared access to the same namespaces,
>    but dom0 essentially have no direct access to the NVDIMM.
Xen
>    would read the NFIT, parse namespaces, and expose those namespaces
>    to dom0 like any other guest; but dom0 would not be able to create
>    or modify namespaces. To manage namespaces, an administrator would
>    need to boot into native Linux, modify the namespaces, and then
>    reboot into Xen again.
>
> 3. **Dom0 fully functional, Manual Xen frame table** Another level of
>    functionality would be to make it possible for dom0 to have full
>    parity with native Linux in terms of using `ndctl` to manage
>    namespaces, but to require the host administrator to manually set
>    aside NVRAM for Xen to use for frame tables.
>
> 4. **Dom0 fully functional, automatic Xen frame table** This is like
>    the above, but with the Xen frame table space automatically
>    managed, similar to Linux's: You'd simply specify that you wanted
>    the Xen frametable somehow when you create the namespace, and from
>    then on forget about it.
>
> Number 1 should be avoided if at all possible, in my opinion.
>
> Given that the NFIT table doesn't currently have namespace UUIDs or
> other key pieces of information to fully understand the namespaces, it
> seems like #2 would likely not be able to be made functional enough.
>
> Number 3 should be achievable under our control. Obviously #4 would
> be ideal, but might depend on getting cooperation from the Linux
> NVDIMM maintainers to be able to set aside Xen frame table memory in
> addition to Linux frame table memory.
>
> ## Guest functionality
>
> 1. **No remapping** The guest can take the PMEM device as-is. It's
>    mapped by the toolstack at a specific place in _guest physical
>    address_ (GPA) space and cannot be moved. There is no controller
>    emulation (which would allow remapping) and minimal label area
>    functionality.
>
> 2. **Full controller access for PMEM**. The guest has full
>    controller access for PMEM: it can carve up namespaces, change
>    mappings in GPA space, and so on.
>
> 3. **Full controller access for both PMEM and PBLK**.
A guest has
>    full controller access, and can carve up its NVRAM into arbitrary
>    PMEM or PBLK regions, as it wants.
>
> Numbers 2 and 3 would of course be nice-to-have, but would almost
> certainly involve having a QEMU process to emulate them. Since we'd
> like to have PVH use NVDIMMs, we should at least make #1 an option.
>
> # Proposed design / roadmap
>
> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables
> and the DSM methods; mappings are treated by Xen during this phase as
> MMIO.
>
> Once dom0 is ready to pass parts of a namespace through to a guest, it
> makes a hypercall to tell Xen about the namespace. It includes any
> regions of the namespace which Xen may use for 'scratch'; it also
> includes a flag to indicate whether this 'scratch' space may be used
> for frame tables from other namespaces.
>
> Frame tables are then created for this SPA range. They will be
> allocated from, in this order: 1) designated 'scratch' range from
> within this namespace, 2) designated 'scratch' range from other
> namespaces which has been marked as sharable, 3) system RAM.
>
> Xen will either verify that dom0 has no existing mappings, or promote
> the mappings to full pages (taking appropriate reference counts for
> mappings). Dom0 must ensure that this namespace is not unmapped,
> modified, or relocated until it asks Xen to unmap it.
>
> For Xen frame tables, to begin with, set aside a partition inside a
> namespace to be used by Xen. Pass this in to Xen when activating the
> namespace; this could be either 2a or 3a from "Page structure
> allocation". After that, we could decide which of the two more
> streamlined approaches (2b or 3b) to pursue.
>
> At this point, dom0 can pass parts of the mapped namespace into
> guests. Unfortunately, passing files on a fsdax filesystem is
> probably not safe; but we can pass in full dev-dax or fsdax
> partitions.
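
The frame table allocation order proposed above (own scratch, then sharable scratch from other namespaces, then system RAM) can be sketched as a simple priority search. This is only an illustration under stated assumptions: `Scratch` and `pick_frame_table_source` are hypothetical names, sizes are plain byte counts, and the sketch assumes a frame table is never split across several sources, which the proposal does not specify.

```python
from dataclasses import dataclass


@dataclass
class Scratch:
    """A 'scratch' range that dom0 has told Xen about (hypothetical model)."""
    namespace: str    # namespace the scratch range lives in
    free_bytes: int   # space still available in it
    sharable: bool    # may it hold frame tables for other namespaces?


def pick_frame_table_source(namespace, needed, scratches, system_ram_free):
    """Return (kind, namespace) naming where the frame table would live."""
    # 1) Designated scratch range from within this namespace.
    for s in scratches:
        if s.namespace == namespace and s.free_bytes >= needed:
            return ("scratch", s.namespace)
    # 2) Scratch from other namespaces, only if marked as sharable.
    for s in scratches:
        if s.namespace != namespace and s.sharable and s.free_bytes >= needed:
            return ("scratch", s.namespace)
    # 3) Fall back to system RAM.
    if system_ram_free >= needed:
        return ("system-ram", None)
    raise MemoryError("no space available for the frame table")
```

A namespace with enough scratch of its own never touches another namespace's scratch, and a non-sharable scratch range is skipped for everyone except its owner.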
>
> From a guest perspective, I propose we provide static NFIT only, no
> access to labels to begin with. This can be generated in hvmloader
> and/or the toolstack acpi code.
* Re: Draft NVDIMM proposal
  2018-05-09 17:35 ` George Dunlap
@ 2018-05-11 16:33   ` Dan Williams
From: Dan Williams @ 2018-05-11 16:33 UTC
To: George Dunlap
Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne

[ adding linux-nvdimm ]

Great write up! Some comments below...

On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote:
> Dan,
>
> I understand that you're the NVDIMM maintainer for Linux. I've been
> working with your colleagues to try to sort out an architecture to allow
> NVRAM to be passed to guests under the Xen hypervisor.
>
> If you have time, I'd appreciate it if you could skim through at least
> the first section of the document below ("NVDIMM Overview"), concerning
> NVDIMM devices and Linux, to see if I've made any mistakes.
>
> If you're up for it, additional early feedback on the proposed Xen
> architecture, from a Linux perspective, would be awesome as well.
>
> Thanks,
> -George
>
> On 05/09/2018 06:29 PM, George Dunlap wrote:
>> Below is an initial draft of an NVDIMM proposal. I'll submit a patch to
>> include it in the tree at some point, but I thought for initial
>> discussion it would be easier if it were copied in-line.
>>
>> I've done a fair amount of investigation, but it's quite likely I've
>> made mistakes. Please send me corrections where necessary.
>>
>> -George
>>
>> ---
>> % NVDIMMs and Xen
>> % George Dunlap
>> % Revision 0.1
>>
>> # NVDIMM overview
>>
>> It's very difficult, from the various specs, to actually get a
>> complete enough picture of what's going on to make a good design.
>> This section is meant as an overview of the current hardware,
>> firmware, and Linux interfaces sufficient to inform a discussion of
>> the issues in designing a Xen interface for NVDIMMs.
>>
>> ## DIMMs, Namespaces, and access methods
>>
>> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form
>> factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of
>> memory on a DIMM are specified by a _DIMM physical address_ or DPA.
>> Each DIMM is attached to an NVDIMM controller.
>>
>> Memory on the DIMMs is divided up into _namespaces_. The word
>> "namespace" is rather misleading though; a namespace in this context
>> is not actually a space of names (contrast, for example "C++
>> namespaces"); rather, it's more like a SCSI LUN, or a volume, or a
>> partition on a drive: a set of data which is meant to be viewed and
>> accessed as a unit. (The name was apparently carried over from NVMe
>> devices, which were precursors of the NVDIMM spec.)

Unlike NVMe, an NVDIMM itself has no concept of namespaces. Some DIMMs
provide a "label area" which is an out-of-band non-volatile memory
area where the OS can store whatever it likes. The UEFI 2.7
specification defines a data format for the definition of namespaces
on top of persistent memory ranges advertised to the OS via the ACPI
NFIT structure.

There is no obligation for an NVDIMM to provide a label area, and as
far as I know all NVDIMMs on the market today do not provide a label
area. That said, QEMU has the ability to associate a virtual label
area with its virtual NVDIMM representation.

>> The NVDIMM controller allows two ways to access the DIMM. One is
>> mapped 1-1 in _system physical address space_ (SPA), much like normal
>> RAM. This method of access is called _PMEM_. The other method is
>> similar to that of a PCI device: you have a control and status
>> register which control an 8k aperture window into the DIMM. This
>> method of access is called _PBLK_.
>>
>> In the case of PMEM, as in the case of DRAM, addresses from the SPA
>> are interleaved across a set of DIMMs (an _interleave set_) for
>> performance reasons.
A specific PMEM namespace will be a single
>> contiguous DPA range across all DIMMs in its interleave set. For
>> example, you might have a namespace for DPAs `0-0x50000000` on DIMMs 0
>> and 1; and another namespace for DPAs `0x80000000-0xa0000000` on DIMMs
>> 0, 1, 2, and 3.
>>
>> In the case of PBLK, a namespace always resides on a single DIMM.
>> However, that namespace can be made up of multiple discontiguous
>> chunks of space on that DIMM. For instance, in our example above, we
>> might have a namespace on DIMM 0 consisting of DPAs
>> `0x50000000-0x60000000`, `0x80000000-0x90000000`, and
>> `0xa0000000-0xf0000000`.
>>
>> The interleaving of PMEM has implications for the speed and
>> reliability of the namespace: Much like RAID 0, it maximizes speed,
>> but it means that if any one DIMM fails, the data from the entire
>> namespace is corrupted. PBLK makes it slightly less straightforward
>> to access, but it allows OS software to apply RAID-like logic to
>> balance redundancy and speed.
>>
>> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM;
>> for large systems without 5-level paging, this is actually becoming a
>> limitation. Using PBLK allows existing 4-level paged systems to
>> access an arbitrary amount of NVDIMM.
>>
>> ## Namespaces, labels, and the label area
>>
>> A namespace is a mapping from the SPA and MMIO space into the DIMM.
>>
>> The firmware and/or operating system can talk to the NVDIMM controller
>> to set up mappings from SPA and MMIO space into the DIMM. Because the
>> memory and PCI devices are separate, it would be possible for buggy
>> firmware or NVDIMM controller drivers to misconfigure things such that
>> the same DPA is exposed in multiple places; if so, the results are
>> undefined.
>>
>> Namespaces are constructed out of "labels". Each DIMM has a Label
>> Storage Area, which is persistent but logically separate from the
>> device-addressable areas on the DIMM.
A label on a DIMM describes a
>> single contiguous region of DPA on that DIMM. A PMEM namespace is
>> made up of one label from each of the DIMMs which make up its
>> interleave set; a PBLK namespace is made up of one label for each
>> chunk of range.
>>
>> In our examples above, the first PMEM namespace would be made of two
>> labels (one on DIMM 0 and one on DIMM 1, each describing DPA
>> `0-0x50000000`), and the second namespace would be made of four labels
>> (one on DIMM 0, one on DIMM 1, and so on). Similarly, in the PBLK
>> example, the namespace would consist of three labels: one describing
>> `0x50000000-0x60000000`, one describing `0x80000000-0x90000000`, and
>> so on.
>>
>> The namespace definition includes not only information about the DPAs
>> which make up the namespace and how they fit together; it also
>> includes a UUID for the namespace (to allow it to be identified
>> uniquely), a 64-character "name" field for a human-friendly
>> description, and Type and Address Abstraction GUIDs to inform the
>> operating system how the data inside the namespace should be
>> interpreted. Additionally, it can have an `ROLABEL` flag, which
>> indicates to the OS that "device drivers and manageability software
>> should refuse to make changes to the namespace labels", because
>> "attempting to make configuration changes that affect the namespace
>> labels will fail (i.e. because the VM guest is not in a position to
>> make the change correctly)".
>>
>> See the [UEFI Specification][uefi-spec], section 13.19, "NVDIMM Label
>> Protocol", for more information.
>>
>> [uefi-spec]:
>> http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf
>>
>> ## NVDIMMs and ACPI
>>
>> The [ACPI Specification][acpi-spec] breaks down information in two ways.
>>
>> The first is about physical devices (see section 9.20, "NVDIMM
>> Devices"). The NVDIMM controller is called the _NVDIMM Root Device_.
>> There will generally be only a single NVDIMM root device on a system.
>>
>> Individual NVDIMMs are referred to by the spec as _NVDIMM Devices_.
>> Each separate DIMM will have its own device listed as being under the
>> Root Device. Each DIMM will have an _NVDIMM Device Handle_ which
>> describes the physical DIMM (its location within the memory channel,
>> the channel number within the memory controller, the memory controller
>> ID within the socket, and so on).
>>
>> The second is about the data on those devices, and how the operating
>> system can access it. This information is exposed in the NFIT table
>> (see section 5.2.25).
>>
>> Because namespace labels allow NVDIMMs to be partitioned in fairly
>> arbitrary ways, exposing information about how the operating system
>> can access it is a bit complicated. It consists of several tables,
>> whose information must be correlated to make sense out of it.
>>
>> These tables include:
>> 1. A table of DPA ranges on individual NVDIMM devices
>> 2. A table of SPA ranges where PMEM regions are mapped, along with
>>    interleave sets
>> 3. Tables for control and data addresses for PBLK regions
>>
>> NVRAM on a given NVDIMM device will be broken down into one or more
>> _regions_. These regions are enumerated in the NVDIMM Region Mapping
>> Structure. Each entry in this table contains the NVDIMM Device Handle
>> for the device the region is in, as well as the DPA range for the
>> region (called "NVDIMM Physical Address Base" and "NVDIMM Region Size"
>> in the spec). Regions which are part of a PMEM namespace will have
>> references into SPA tables and interleave set tables; regions which
>> are part of PBLK namespaces will have references into control region
>> and block data window region structures.
>>
>> [acpi-spec]:
>> http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
>>
>> ## Namespaces and the OS
>>
>> At boot time, the firmware will read the label regions from the NVDIMM
>> device and set up the memory controllers appropriately. It will then
>> construct a table describing the resulting regions in a table called
>> an NFIT table, and expose that table via ACPI.

Labels are not involved in the creation of the NFIT. The NFIT only
defines PMEM ranges and interleave sets, the rest is left to the OS.

>> To use a namespace, an operating system needs at a minimum two pieces
>> of information: The UUID and/or Name of the namespace, and the SPA
>> range where that namespace is mapped; and ideally also the Type and
>> Abstraction Type to know how to interpret the data inside.

Not necessarily, no. Linux supports "label-less" mode where it exposes
the raw capacity of a region in a 1:1 mapped namespace without a label.
This is how Linux supports "legacy" NVDIMMs that do not support
labels.

>> Unfortunately, the information needed to understand namespaces is
>> somewhat disjoint. The namespace labels themselves contain the UUID,
>> Name, Type, and Abstraction Type, but don't contain any information
>> about SPA or block control / status registers and windows. The NFIT
>> table contains a list of SPA Range Structures, which list the
>> NVDIMM-related SPA ranges and their Type GUID; as well as a table
>> containing individual DPA ranges, which specifies which SPAs they
>> correspond to. But the NFIT does not contain the UUID or other
>> identifying information from the Namespace labels. In order to
>> actually discover that namespace with UUID _X_ is mapped at SPA _Y-Z_,
>> an operating system must:
>>
>> 1. Read the label areas of all NVDIMMs and discover the DPA range and
>>    Interleave Set for namespace _X_
>> 2. Read the Region Mapping Structures from the NFIT table, and find
>>    out which structures match the DPA ranges for namespace _X_
>> 3.
Find the System Physical Address Range Structure Index associated
>>    with the Region Mapping
>> 4. Look up the SPA Range Structure in the NFIT table using the SPA
>>    Range Structure Index
>> 5. Read the SPA range _Y-Z_

I'm not sure I'm grokking your distinction between 2, 3, 4? In any
event we do the DIMM to SPA association first before reading labels.
The OS calculates a so-called "Interleave Set Cookie" from the NFIT
information to compare against a similar value stored in the labels.
This lets the OS determine that the Interleave Set composition has not
changed from when the labels were initially written. An Interleave Set
Cookie mismatch indicates the labels are stale, corrupted, or that the
physical composition of the Interleave Set has changed.

>> An OS driver can modify the namespaces by modifying the Label Storage
>> Areas of the corresponding DIMMs. The NFIT table describes how the OS
>> can access the Label Storage Areas. Label Storage Areas may be
>> "isolated", in which case the area would be accessed via
>> device-specific AML methods (DSM), or they may be exposed directly
>> using a well-known location. AML methods to access the label areas
>> are "dumb": they are essentially a memcpy() which copies into or out
>> of a given {DIMM, Label Area Offset} address. No checking for
>> validity of reads and writes is done, and simply modifying the labels
>> does not change the mapping immediately -- this must be done either by
>> the OS driver reprogramming the NVDIMM memory controller, or by
>> rebooting and allowing the firmware to do it.

There are checksums in the Namespace definition to account for label
validity. Starting with ACPI 6.2, DSMs for labels are deprecated in
favor of the new / named methods for label access _LSI, _LSR, and
_LSW.
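
As a rough illustration of the Interleave Set Cookie comparison mentioned above, here is a minimal Python sketch. It is not the real cookie algorithm: the inputs (a DIMM serial number and a region offset per member), the use of SHA-256 truncated to 64 bits as a stand-in checksum, and the names `interleave_set_cookie` and `labels_are_current` are all assumptions made for this example. The only point is the shape of the check: derive a value from the current set composition, and compare it with the value recorded when the labels were written.

```python
import hashlib


def interleave_set_cookie(members):
    """Derive a 64-bit value from (dimm_serial, region_offset) pairs.

    Stand-in only: the real cookie is defined by the spec, not by this hash.
    """
    digest = hashlib.sha256()
    # Sort so the result depends on the set's composition, not on the
    # order in which the members happened to be enumerated.
    for serial, offset in sorted(members):
        digest.update(serial.to_bytes(8, "little"))
        digest.update(offset.to_bytes(8, "little"))
    return int.from_bytes(digest.digest()[:8], "little")


def labels_are_current(nfit_members, cookie_in_label):
    """True if the labels were written for the current set composition."""
    return interleave_set_cookie(nfit_members) == cookie_in_label
```

With this model, swapping one DIMM for another (a different serial number) changes the computed value, so labels written for the old set would be treated as stale.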
>> Modifying labels is tricky, due to an issue that will be somewhat of a >> recurring theme when discussing NVDIMMs: The necessity of assuming >> that, at any given point in time, power may be suddenly cut, and the >> system needing to be able to recover sensible data in such a >> circumstance. The [UEFI Specification][uefi-spec] chapter on the >> NVDIMM label protocol specifies how the label area is to be modified >> such that a consistent "view" is always available; and how firmware >> and the operating system should respond consistently to labels which >> appear corrupt. Not that tricky :-). >> ## NVDIMMs and filesystems >> >> Along the same line, most filesystems are written with the assumption >> that a given write to a block device will either finish completely, or >> be entirely reverted. Since accesses to NVDIMMs (even in PBLK mode) are >> essentially `memcpy`s, writes may well be interrupted halfway through, >> resulting in _sector tearing_. In order to help with this, the UEFI >> spec defines a method of reading and writing NVRAM which is capable of >> emulating sector-atomic write semantics via a _block translation >> table_ (BTT) ([UEFI spec][uefi-spec], chapter 6, "Block Translation >> Table (BTT) Layout"). Namespaces accessed via this discipline will >> have a _BTT info block_ at the beginning of the namespace (similar to >> a superblock on a traditional hard disk). Additionally, the >> AddressAbstraction GUID in the namespace label(s) should be set to >> `EFI_BTT_ABSTRACTION_GUID`. >> >> ## Linux >> >> Linux has a _direct access_ (DAX) filesystem mount mode for block >> devices which are "memory-like" ^[kernel-dax]. If both the filesystem >> and the underlying device support DAX, and the `dax` mount option is >> enabled, then when a file on that filesystem is `mmap`ed, the page >> cache is bypassed and the underlying storage is mapped directly into >> the user process. (?)
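To make the sector-tearing point in the quoted text concrete, here is a toy model of how sector-atomic writes can be emulated on top of memcpy-like storage. It is a sketch only and an assumption on my part, not the on-media BTT layout from the UEFI spec: `AtomicSectorStore` is a hypothetical name, the "media" is a Python list, and the real format has considerably more machinery. The idea it illustrates is that new data goes to a spare block first, and a single map update then commits it.

```python
class AtomicSectorStore:
    """Toy model: logical sectors are redirected through a map, so a write
    never overwrites live data in place."""

    def __init__(self, nsectors: int, sector_size: int = 512) -> None:
        self.sector_size = sector_size
        # One spare physical block beyond the logical sectors.
        self.blocks = [bytes(sector_size) for _ in range(nsectors + 1)]
        self.map = list(range(nsectors))  # logical sector -> physical block
        self.free = nsectors              # index of the spare block

    def write(self, lba: int, data: bytes, crash_before_commit: bool = False) -> None:
        if len(data) != self.sector_size:
            raise ValueError("data must be exactly one sector")
        # Step 1: fill the spare block; the live block is untouched.
        self.blocks[self.free] = bytes(data)
        if crash_before_commit:
            return  # simulated power loss: the map still points at old data
        # Step 2: one map update commits the write; the old block
        # becomes the new spare.
        old = self.map[lba]
        self.map[lba] = self.free
        self.free = old

    def read(self, lba: int) -> bytes:
        return self.blocks[self.map[lba]]
```

A write that is interrupted before the map update leaves the previous sector contents fully readable, which is the property filesystems expect from a block device.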
>> >> [kernel-dax]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt >> >> Linux has a tool called `ndctl` to manage NVDIMM namespaces. From the >> documentation it looks fairly well abstracted: you don't typically >> specify individual DPAs when creating PBLK or PMEM regions: you >> specify the type you want and the size and it works out the layout >> details (?). >> >> The `ndctl` tool allows you to make PMEM namespaces in one of four >> modes: `raw`, `sector`, `fsdax` (or `memory`), and `devdax` (or, >> confusingly, `dax`). Yes, apologies for the confusion. Going forward from ndctl-v60 we have deprecated 'memory' and 'dax'. >> The `raw`, `sector`, and `fsdax` modes all result in a block device in >> the pattern of `/dev/pmemN[.M]`, in which a filesystem can be stored. >> `devdax` results in a character device in the pattern of >> `/dev/daxN[.M]`. >> >> It's not clear from the documentation exactly what `raw` mode is or >> when it would be safe to use it. We'll add some documentation to the man page, but 'raw' mode is effectively just a ramdisk. No dax support. >> >> `sector` mode implements `BTT`; it is thus safe against sector >> tearing, but does not support mapping files in DAX mode. The >> namespace can be either PMEM or PBLK (?). As described above, the >> first block of the namespace will be a BTT info block. The info block is not exposed in the user accessible data space as this comment seems to imply. It's similar to a partition table: it's on-media metadata that specifies an encapsulation. >> `fsdax` and `devdax` mode are both designed to make it possible for >> user processes to have direct mapping of NVRAM. As such, both are >> only suitable for PMEM namespaces (?). Both also need to have kernel >> page structures allocated for each page of NVRAM; this amounts to 64 >> bytes for every 4k of NVRAM. Memory for these page structures can >> either be allocated out of normal "system" memory, or inside the PMEM >> namespace itself.
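As a quick worked example of the "64 bytes for every 4k" figure quoted above, the overhead is 1/64 of the namespace, i.e. roughly 1.6%. The helper below is just arithmetic; `page_struct_overhead` is a hypothetical name, and the 1 TiB namespace size is an assumption chosen for illustration.

```python
PAGE_SIZE = 4096  # bytes per page ("4k")


def page_struct_overhead(namespace_bytes: int, struct_bytes: int = 64,
                         page_size: int = PAGE_SIZE) -> int:
    """Bytes of page structures needed to describe a namespace."""
    pages = namespace_bytes // page_size
    return pages * struct_bytes


TIB = 1 << 40
GIB = 1 << 30

# 1 TiB / 4 KiB = 2**28 pages; 2**28 * 64 bytes = 2**34 bytes = 16 GiB.
overhead = page_struct_overhead(TIB)
```

So a 1 TiB namespace needs 16 GiB of page structures, which is why it matters whether that memory comes out of system RAM or out of the namespace itself.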
>> >> In both cases, an "info block", very similar to the BTT info block, is >> written to the beginning of the namespace when created. This info >> block specifies whether the page structures come from system memory or >> from the namespace itself. If from the namespace itself, it contains >> information about what parts of the namespace have been set aside for >> Linux to use for this purpose. >> >> Linux has also defined "Type GUIDs" for these two types of namespace >> to be stored in the namespace label, although these are not yet in the >> ACPI spec. They never will be. One of the motivations for GUIDs is that an OS can define private ones without needing to go back and standardize them. Only GUIDs that are needed for inter-OS / pre-OS compatibility would need to be defined in ACPI, and there is no expectation that other OSes understand Linux's format for reserving page structure space. >> Documentation seems to indicate that both `pmem` and `dax` devices can >> be further subdivided (by mentioning `/dev/pmemN.M` and >> `/dev/daxN.M`), but doesn't mention specifically how. `pmem` devices, >> being block devices, can presumably be partitioned like a block >> device can. `dax` devices may have something similar, or may have >> their own subdivision mechanism. The rest of this document will >> assume that this is the case. You can create multiple namespaces in a given region. Subsequent namespaces after the first get the .1, .2, .3 etc suffix. >> >> # Xen considerations >> >> ## RAM and MMIO in Xen >> >> Xen generally has two types of things that can go into a pagetable or >> p2m. The first is RAM or "system memory". RAM has a page struct, >> which allows it to be accounted for on a page-by-page basis: Assigned >> to a specific domain, reference counted, and so on. >> >> The second is MMIO. MMIO areas do not have page structures, and thus >> cannot be accounted on a page-by-page basis.
Xen knows about PCI >> devices and the associated MMIO ranges, and makes sure that PV >> pagetables or HVM p2m tables only contain MMIO mappings for devices >> which have been assigned to a guest. >> >> ## Page structures >> >> To begin with, Xen, like Linux, needs page structs for NVDIMM >> memory. Without page structs, we don't have reference counts; which >> means there's no safe way, for instance, for a guest to ask a PV >> device to write into NVRAM owned by a guest; and no real way to be >> confident that the same memory hadn't been mapped multiple times. >> >> Page structures in Xen are 32 bytes for non-BIGMEM systems (<5 TiB), >> and 40 bytes for BIGMEM systems. >> >> ### Page structure allocation >> >> There are three potential places we could store page structs: >> >> 1. **System memory** Allocated from the host RAM >> >> 2. **Inside the namespace** Like Linux, there could be memory set >> aside inside the namespace specifically for mapping that >> namespace. This could be 2a) As a user-visible separate partition, >> or 2b) allocated by `ndctl` from the namespace "superblock". As >> the page frame areas of the namespace can be discontiguous (?), it >> would be possible to enable or disable this extra space on an >> existing namespace, to allow users with existing vNVDIMM images to >> switch to or from Xen. I think a Xen mode namespace makes sense. If I understand correctly it would also need to house the struct page array at the same time in case dom0 needs to do a get_user_pages() operation when assigning pmem to a guest? The other consideration is how to sub-divide that namespace for handing it out to guests. We are currently working through problems with virtualization and device-assignment when the guest is given memory for a dax mapped file on a filesystem in dax mode. Given that the filesystem can do physical layout rearrangement at will, a dax mapped file is not suitable to give to a guest.
For now we require a devdax mode namespace for mapping pmem to a guest so that we do not collide with filesystem block map mutations. I assume this "xen-mode" namespace would be something like devdax + mfn array. >> >> 3. **A different namespace** NVRAM could be set aside for use by >> arbitrary namespaces. This could be a 3a) specially-selected >> partition from a normal namespace, or it could be 3b) a namespace >> specifically designed to be used for Xen (perhaps with its own Type >> GUID). >> >> 2b has the advantage that we should be able to unilaterally allocate a >> Type GUID and start using it for that purpose. It also has the >> advantage that it should be somewhat easier for someone with existing >> vNVDIMM images to switch into (or away from) using Xen. It has the >> disadvantage of being less transparent to the user. >> >> 3b has the advantage of being invisible to the user once being set up. >> It has the slight disadvantage of having more gatekeepers to get >> through; and if those gatekeepers aren't happy with enabling or >> disabling extra frametable space for Xen after creation (or if I've >> misunderstood and such functionality isn't straightforward to >> implement) then it will be difficult for people with existing images >> to switch to Xen. >> >> ### Dealing with changing frame tables >> >> Another potential issue to consider is the monolithic nature of the >> current frame table. At the moment, to find a page struct given an >> mfn, you use the mfn as an index into a single large array. >> >> I think we can assume that NVDIMM SPA ranges will be separate from >> normal system RAM. There's no reason the frame table couldn't be >> "sparse": i.e., only the sections of it that actually contain valid >> pages need to have ram backing them. 
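To put numbers on the sparse frame table idea in the quoted text, the following sketch computes which frame-table pages would need backing for a given SPA range. It assumes what the draft describes, namely that the mfn is used directly as an index into one large array of page structs; `frame_table_pages` is a hypothetical helper, and the 32-byte and 40-byte struct sizes are the ones quoted above.

```python
PAGE_SIZE = 4096  # bytes per page


def frame_table_pages(spa_start: int, spa_end: int, struct_bytes: int = 32):
    """Return (first, last) frame-table page index needed for [spa_start, spa_end).

    Assumes the frame table is a flat array of page structs indexed by mfn.
    """
    first_mfn = spa_start // PAGE_SIZE
    last_mfn = (spa_end - 1) // PAGE_SIZE
    first_byte = first_mfn * struct_bytes          # offset of first page struct
    last_byte = (last_mfn + 1) * struct_bytes - 1  # last byte of last page struct
    return first_byte // PAGE_SIZE, last_byte // PAGE_SIZE


GIB = 1 << 30

# A 1 GiB range at SPA 4 GiB with 32-byte page structs.
first, last = frame_table_pages(4 * GIB, 5 * GIB)
backing_bytes = (last - first + 1) * PAGE_SIZE
```

For that example the range needs frame-table pages 8192 through 10239, i.e. 8 MiB of backing for 1 GiB of NVRAM; everything outside those pages could stay unpopulated. With 40-byte page structs the same range needs pages 10240 through 12799.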
>> >> However, if we pursue a solution like Linux's, where each namespace >> contains memory set aside to use for its own pagetables, we may have a >> situation where the boundary between two namespaces falls in the middle of >> a frame table page; in that case, from where should such a frame table >> page be allocated? It's already the case that the minimum alignment for multiple namespaces is 128MB given the current "section size" assumptions of the core mm. Can we make a similar alignment restriction for Xen to eliminate this problem? >> A simple answer would be to use system RAM to "cover the gap": There >> would only ever need to be a single page per boundary. >> >> ## Page tracking for domain 0 >> >> When domain 0 adds or removes entries from its pagetables, it does not >> explicitly store the memory type (i.e., whether RAM or MMIO); Xen >> infers this from its knowledge of where RAM is and is not. Below we >> will explore design choices that involve domain 0 telling Xen about >> NVDIMM namespaces, SPAs, and what it can use for page structures. In >> such a scenario, NVRAM pages essentially transition from being MMIO >> (before Xen knows about them) to being RAM (after Xen knows about >> them), which in turn has implications for any mappings which domain 0 >> has in its pagetables. >> >> ## PVH and QEMU >> >> A number of solutions have suggested using QEMU to provide emulated >> NVDIMM support to guests. This is a workable solution for HVM guests, >> but for PVH guests we would like to avoid introducing a device model >> if at all possible. >> >> ## FS DAX and DMA in Linux >> >> There is [an issue][linux-fs-dax-dma-issue] with DAX and filesystems, >> in that filesystems (even those claiming to support DAX) may want to >> rearrange the block<->file mapping "under the feet" of running >> processes with mapped files.
Unfortunately, this is more tricky with >> DAX than with a page cache, and as of [early 2018][linux-fs-dax-dma-2] >> was essentially incompatible with virtualization. ("I think we need to >> enforce this in the host kernel. I.e. do not allow file backed DAX >> pages to be mapped in EPT entries unless / until we have a solution to >> the DMA synchronization problem.") >> >> More needs to be discussed and investigated here; but for the time >> being, mapping a file in a DAX filesystem into a guest's p2m is >> probably not going to be possible. Ah, you have the fsdax issue captured here, great. >> >> [linux-fs-dax-dma-issue]: >> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html >> [linux-fs-dax-dma-2]: >> https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07347.html >> >> # Target functionality >> >> The above sets the stage, but to actually decide on an architecture >> we have to decide what kind of final functionality we're looking for. >> The functionality falls into two broad areas: Functionality from the >> host administrator's point of view (accessed from domain 0), and >> functionality from the guest administrator's point of view. >> >> ## Domain 0 functionality >> >> For the purposes of this section, I shall be distinguishing between >> "native Linux" functionality and "domain 0" functionality. By "native >> Linux" functionality I mean functionality which is available when >> Linux is running on bare metal -- `ndctl`, `/dev/pmem`, `/dev/dax`, >> and so on. By "dom0 functionality" I mean functionality which is >> available in domain 0 when Linux is running under Xen. >> >> 1. **Disjoint functionality** Have dom0 and native Linux >> functionality completely separate: namespaces created when booted >> on native Linux would not be accessible when booted under domain 0, >> and vice versa. Some Xen-specific tool similar to `ndctl` would >> need to be developed for accessing functionality.
I'm open to teaching ndctl about Xen needs if that helps. >> >> 2. **Shared data but no dom0 functionality** Another option would be >> to have Xen and Linux have shared access to the same namespaces, >> but dom0 essentially have no direct access to the NVDIMM. Xen >> would read the NFIT, parse namespaces, and expose those namespaces >> to dom0 like any other guest; but dom0 would not be able to create >> or modify namespaces. To manage namespaces, an administrator would >> need to boot into native Linux, modify the namespaces, and then >> reboot into Xen again. Ugh. >> >> 3. **Dom0 fully functional, Manual Xen frame table** Another level of >> functionality would be to make it possible for dom0 to have full >> parity with native Linux in terms of using `ndctl` to manage >> namespaces, but to require the host administrator to manually set >> aside NVRAM for Xen to use for frame tables. >> >> 4. **Dom0 fully functional, automatic Xen frame table** This is like >> the above, but with the Xen frame table space automatically >> managed, similar to Linux's: You'd simply specify that you wanted >> the Xen frametable somehow when you create the namespace, and from >> then on forget about it. >> >> Number 1 should be avoided if at all possible, in my opinion. >> >> Given that the NFIT table doesn't currently have namespace UUIDs or >> other key pieces of information to fully understand the namespaces, it >> seems like #2 would likely not be able to be made functional enough. >> >> Number 3 should be achievable under our control. Obviously #4 would >> be ideal, but might depend on getting cooperation from the Linux >> NVDIMM maintainers to be able to set aside Xen frame table memory in >> addition to Linux frame table memory. "xen-mode" namespace? >> >> ## Guest functionality >> >> 1. **No remapping** The guest can take the PMEM device as-is. It's >> mapped by the toolstack at a specific place in _guest physical >> address_ (GPA) space and cannot be moved. 
There is no controller >> emulation (which would allow remapping) and minimal label area >> functionality. >> >> 2. **Full controller access for PMEM**. The guest has full >> controller access for PMEM: it can carve up namespaces, change >> mappings in GPA space, and so on. In its own virtual label area, right? >> 3. **Full controller access for both PMEM and PBLK**. A guest has >> full controller access, and can carve up its NVRAM into arbitrary >> PMEM or PBLK regions, as it wants. I'd forget about giving PBLK to guests, just use standard virtio or equivalent to route requests to the dom0 driver. Unless the PBLK control registers are mapped on 4K boundaries there's no safe way to give individual guests their own direct PBLK access. >> >> Numbers 2 and 3 would of course be nice-to-have, but would almost >> certainly involve having a QEMU process to emulate them. Since we'd >> like to have PVH use NVDIMMs, we should at least make #1 an option. >> >> # Proposed design / roadmap >> >> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables >> and the DSM methods; mappings are treated by Xen during this phase as >> MMIO. >> >> Once dom0 is ready to pass parts of a namespace through to a guest, it >> makes a hypercall to tell Xen about the namespace. It includes any >> regions of the namespace which Xen may use for 'scratch'; it also >> includes a flag to indicate whether this 'scratch' space may be used >> for frame tables from other namespaces. >> >> Frame tables are then created for this SPA range. They will be >> allocated from, in this order: 1) designated 'scratch' range from >> within this namespace 2) designated 'scratch' range from other >> namespaces which has been marked as sharable 3) system RAM. >> >> Xen will either verify that dom0 has no existing mappings, or promote >> the mappings to full pages (taking appropriate reference counts for >> mappings).
Dom0 must ensure that this namespace is not unmapped, >> modified, or relocated until it asks Xen to unmap it. >> >> For Xen frame tables, to begin with, set aside a partition inside a >> namespace to be used by Xen. Pass this in to Xen when activating the >> namespace; this could be either 2a or 3a from "Page structure >> allocation". After that, we could decide which of the two more >> streamlined approaches (2b or 3b) to pursue. >> >> At this point, dom0 can pass parts of the mapped namespace into >> guests. Unfortunately, passing files on a fsdax filesystem is >> probably not safe; but we can pass in full dev-dax or fsdax >> partitions. >> >> From a guest perspective, I propose we provide static NFIT only, no >> access to labels to begin with. This can be generated in hvmloader >> and/or the toolstack acpi code. I'm ignorant of Xen internals, but can you not reuse the existing QEMU emulation for labels and NFIT? Thanks for this thorough write up, it's always nice to see the tradeoffs.
* Re: Draft NVDIMM proposal 2018-05-09 17:35 ` George Dunlap 2018-05-11 16:33 ` Dan Williams @ 2018-05-11 16:33 ` Dan Williams 2018-05-15 10:05 ` Roger Pau Monné ` (3 more replies) 1 sibling, 4 replies; 24+ messages in thread From: Dan Williams @ 2018-05-11 16:33 UTC (permalink / raw) To: George Dunlap Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne [ adding linux-nvdimm ] Great write up! Some comments below... On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote: > Dan, > > I understand that you're the NVDIMM maintainer for Linux. I've been > working with your colleagues to try to sort out an architecture to allow > NVRAM to be passed to guests under the Xen hypervisor. > > If you have time, I'd appreciate it if you could skim through at least > the first section of the document below ("NVDIMM Overview"), concerning > NVDIMM devices and Linux, to see if I've made any mistakes. > > If you're up for it, additional early feedback on the proposed Xen > architecture, from a Linux perspective, would be awesome as well. > > Thanks, > -George > > On 05/09/2018 06:29 PM, George Dunlap wrote: >> Below is an initial draft of an NVDIMM proposal. I'll submit a patch to >> include it in the tree at some point, but I thought for initial >> discussion it would be easier if it were copied in-line. >> >> I've done a fair amount of investigation, but it's quite likely I've >> made mistakes. Please send me corrections where necessary. >> >> -George >> >> --- >> % NVDIMMs and Xen >> % George Dunlap >> % Revision 0.1 >> >> # NVDIMM overview >> >> It's very difficult, from the various specs, to actually get a >> complete enough picture of what's going on to make a good design. >> This section is meant as an overview of the current hardware, >> firmware, and Linux interfaces sufficient to inform a discussion of >> the issues in designing a Xen interface for NVDIMMs.
>> >> ## DIMMs, Namespaces, and access methods >> >> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form >> factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of >> memory on a DIMM are specified by a _DIMM physical address_ or DPA. >> Each DIMM is attached to an NVDIMM controller. >> >> Memory on the DIMMs is divided up into _namespaces_. The word >> "namespace" is rather misleading though; a namespace in this context >> is not actually a space of names (contrast, for example "C++ >> namespaces"); rather, it's more like a SCSI LUN, or a volume, or a >> partition on a drive: a set of data which is meant to be viewed and >> accessed as a unit. (The name was apparently carried over from NVMe >> devices, which were precursors of the NVDIMM spec.) Unlike NVMe, an NVDIMM itself has no concept of namespaces. Some DIMMs provide a "label area" which is an out-of-band non-volatile memory area where the OS can store whatever it likes. The UEFI 2.7 specification defines a data format for the definition of namespaces on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure. There is no obligation for an NVDIMM to provide a label area, and as far as I know all NVDIMMs on the market today do not provide a label area. That said, QEMU has the ability to associate a virtual label area with its virtual NVDIMM representation. >> The NVDIMM controller allows two ways to access the DIMM. One is >> mapped 1-1 in _system physical address space_ (SPA), much like normal >> RAM. This method of access is called _PMEM_. The other method is >> similar to that of a PCI device: you have a control and status >> register which control an 8k aperture window into the DIMM. This >> method of access is called _PBLK_. >> >> In the case of PMEM, as in the case of DRAM, addresses from the SPA >> are interleaved across a set of DIMMs (an _interleave set_) for >> performance reasons.
A specific PMEM namespace will be a single >> contiguous DPA range across all DIMMs in its interleave set. For >> example, you might have a namespace for DPAs `0-0x50000000` on DIMMs 0 >> and 1; and another namespace for DPAs `0x80000000-0xa0000000` on DIMMs >> 0, 1, 2, and 3. >> >> In the case of PBLK, a namespace always resides on a single DIMM. >> However, that namespace can be made up of multiple discontiguous >> chunks of space on that DIMM. For instance, in our example above, we >> might have a namespace on DIMM 0 consisting of DPAs >> `0x50000000-0x60000000`, `0x80000000-0x90000000`, and >> `0xa0000000-0xf0000000`. >> >> The interleaving of PMEM has implications for the speed and >> reliability of the namespace: Much like RAID 0, it maximizes speed, >> but it means that if any one DIMM fails, the data from the entire >> namespace is corrupted. PBLK makes it slightly less straightforward >> to access, but it allows OS software to apply RAID-like logic to >> balance redundancy and speed. >> >> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM; >> for large systems without 5-level paging, this is actually becoming a >> limitation. Using PBLK allows existing 4-level paged systems to >> access an arbitrary amount of NVDIMM. >> >> ## Namespaces, labels, and the label area >> >> A namespace is a mapping from the SPA and MMIO space into the DIMM. >> >> The firmware and/or operating system can talk to the NVDIMM controller >> to set up mappings from SPA and MMIO space into the DIMM. Because the >> memory and PCI devices are separate, it would be possible for buggy >> firmware or NVDIMM controller drivers to misconfigure things such that >> the same DPA is exposed in multiple places; if so, the results are >> undefined. >> >> Namespaces are constructed out of "labels". Each DIMM has a Label >> Storage Area, which is persistent but logically separate from the >> device-addressable areas on the DIMM.
A label on a DIMM describes a >> single contiguous region of DPA on that DIMM. A PMEM namespace is >> made up of one label from each of the DIMMs which make up its interleave >> set; a PBLK namespace is made up of one label for each chunk of range. >> >> In our examples above, the first PMEM namespace would be made of two >> labels (one on DIMM 0 and one on DIMM 1, each describing DPA >> `0-0x50000000`), and the second namespace would be made of four labels >> (one on DIMM 0, one on DIMM 1, and so on). Similarly, in the PBLK >> example, the namespace would consist of three labels; one describing >> `0x50000000-0x60000000`, one describing `0x80000000-0x90000000`, and >> so on. >> >> The namespace definition includes not only information about the DPAs >> which make up the namespace and how they fit together; it also >> includes a UUID for the namespace (to allow it to be identified >> uniquely), a 64-character "name" field for a human-friendly >> description, and Type and Address Abstraction GUIDs to inform the >> operating system how the data inside the namespace should be >> interpreted. Additionally, it can have an `ROLABEL` flag, which >> indicates to the OS that "device drivers and manageability software >> should refuse to make changes to the namespace labels", because >> "attempting to make configuration changes that affect the namespace >> labels will fail (i.e. because the VM guest is not in a position to >> make the change correctly)". >> >> See the [UEFI Specification][uefi-spec], section 13.19, "NVDIMM Label >> Protocol", for more information. >> >> [uefi-spec]: >> http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf >> >> ## NVDIMMs and ACPI >> >> The [ACPI Specification][acpi-spec] breaks down information in two ways. >> >> The first is about physical devices (see section 9.20, "NVDIMM >> Devices"). The NVDIMM controller is called the _NVDIMM Root Device_.
>> There will generally be only a single NVDIMM root device on a system. >> >> Individual NVDIMMs are referred to by the spec as _NVDIMM Devices_. >> Each separate DIMM will have its own device listed as being under the >> Root Device. Each DIMM will have an _NVDIMM Device Handle_ which >> describes the physical DIMM (its location within the memory channel, >> the channel number within the memory controller, the memory controller >> ID within the socket, and so on). >> >> The second is about the data on those devices, and how the operating >> system can access it. This information is exposed in the NFIT table >> (see section 5.2.25). >> >> Because namespace labels allow NVDIMMs to be partitioned in fairly >> arbitrary ways, exposing information about how the operating system >> can access it is a bit complicated. It consists of several tables, >> whose information must be correlated to make sense out of it. >> >> These tables include: >> 1. A table of DPA ranges on individual NVDIMM devices >> 2. A table of SPA ranges where PMEM regions are mapped, along with >> interleave sets >> 3. Tables for control and data addresses for PBLK regions >> >> NVRAM on a given NVDIMM device will be broken down into one or more >> _regions_. These regions are enumerated in the NVDIMM Region Mapping >> Structure. Each entry in this table contains the NVDIMM Device Handle >> for the device the region is in, as well as the DPA range for the >> region (called "NVDIMM Physical Address Base" and "NVDIMM Region Size" >> in the spec). Regions which are part of a PMEM namespace will have >> references into SPA tables and interleave set tables; regions which >> are part of PBLK namespaces will have references into control region >> and block data window region structures. 
>> >> [acpi-spec]: >> http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf >> >> ## Namespaces and the OS >> >> At boot time, the firmware will read the label regions from the NVDIMM >> device and set up the memory controllers appropriately. It will then >> construct a table describing the resulting regions in a table called >> an NFIT table, and expose that table via ACPI. Labels are not involved in the creation of the NFIT. The NFIT only defines PMEM ranges and interleave sets, the rest is left to the OS. >> To use a namespace, an operating system needs at a minimum two pieces >> of information: The UUID and/or Name of the namespace, and the SPA >> range where that namespace is mapped; and ideally also the Type and >> Abstraction Type to know how to interpret the data inside. Not necessarily, no. Linux supports "label-less" mode where it exposes the raw capacity of a region in 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels. >> Unfortunately, the information needed to understand namespaces is >> somewhat disjoint. The namespace labels themselves contain the UUID, >> Name, Type, and Abstraction Type, but don't contain any information >> about SPA or block control / status registers and windows. The NFIT >> table contains a list of SPA Range Structures, which list the >> NVDIMM-related SPA ranges and their Type GUID; as well as a table >> containing individual DPA ranges, which specifies which SPAs they >> correspond to. But the NFIT does not contain the UUID or other >> identifying information from the Namespace labels. In order to >> actually discover that namespace with UUID _X_ is mapped at SPA _Y-Z_, >> an operating system must: >> >> 1. Read the label areas of all NVDIMMs and discover the DPA range and >> Interleave Set for namespace _X_ >> 2. Read the Region Mapping Structures from the NFIT table, and find >> out which structures match the DPA ranges for namespace _X_ >> 3. 
Find the System Physical Address Range Structure Index associated >> with the Region Mapping >> 4. Look up the SPA Range Structure in the NFIT table using the SPA >> Range Structure Index >> 5. Read the SPA range _Y-Z_ I'm not sure I'm grokking your distinction between 2, 3, 4? In any event we do the DIMM to SPA association first before reading labels. The OS calculates a so-called "Interleave Set Cookie" from the NFIT information to compare against a similar value stored in the labels. This lets the OS determine that the Interleave Set composition has not changed from when the labels were initially written. An Interleave Set Cookie mismatch indicates the labels are stale, corrupted, or that the physical composition of the Interleave Set has changed. >> An OS driver can modify the namespaces by modifying the Label Storage >> Areas of the corresponding DIMMs. The NFIT table describes how the OS >> can access the Label Storage Areas. Label Storage Areas may be >> "isolated", in which case the area would be accessed via >> device-specific AML methods (DSM), or they may be exposed directly >> using a well-known location. AML methods to access the label areas >> are "dumb": they are essentially a memcpy() which copies into or out >> of a given {DIMM, Label Area Offset} address. No checking for >> validity of reads and writes is done, and simply modifying the labels >> does not change the mapping immediately -- this must be done either by >> the OS driver reprogramming the NVDIMM memory controller, or by >> rebooting and allowing the firmware to do it. There are checksums in the Namespace definition to account for label validity. Starting with ACPI 6.2 DSMs for labels are deprecated in favor of the new named methods for label access _LSI, _LSR, and _LSW.
>> Modifying labels is tricky, due to an issue that will be somewhat of a >> recurring theme when discussing NVDIMMs: The necessity of assuming >> that, at any given point in time, power may be suddenly cut, and the >> system needing to be able to recover sensible data in such a >> circumstance. The [UEFI Specification][uefi-spec] chapter on the >> NVDIMM label protocol specifies how the label area is to be modified >> such that a consistent "view" is always available; and how firmware >> and the operating system should respond consistently to labels which >> appear corrupt. Not that tricky :-). >> ## NVDIMMs and filesystems >> >> Along the same line, most filesystems are written with the assumption >> that a given write to a block device will either finish completely, or >> be entirely reverted. Since accesses to NVDIMMs (even in PBLK mode) are >> essentially `memcpy`s, writes may well be interrupted halfway through, >> resulting in _sector tearing_. In order to help with this, the UEFI >> spec defines a method of reading and writing NVRAM which is capable of >> emulating sector-atomic write semantics via a _block translation >> table_ (BTT) ([UEFI spec][uefi-spec], chapter 6, "Block Translation >> Table (BTT) Layout"). Namespaces accessed via this discipline will >> have a _BTT info block_ at the beginning of the namespace (similar to >> a superblock on a traditional hard disk). Additionally, the >> AddressAbstraction GUID in the namespace label(s) should be set to >> `EFI_BTT_ABSTRACTION_GUID`. >> >> ## Linux >> >> Linux has a _direct access_ (DAX) filesystem mount mode for block >> devices which are "memory-like" ^[kernel-dax]. If both the filesystem >> and the underlying device support DAX, and the `dax` mount option is >> enabled, then when a file on that filesystem is `mmap`ed, the page >> cache is bypassed and the underlying storage is mapped directly into >> the user process. (?)
>>
>> [kernel-dax]: https://www.kernel.org/doc/Documentation/filesystems/dax.txt
>>
>> Linux has a tool called `ndctl` to manage NVDIMM namespaces. From the
>> documentation it looks fairly well abstracted: you don't typically
>> specify individual DPAs when creating PBLK or PMEM regions: you
>> specify the type you want and the size and it works out the layout
>> details (?).
>>
>> The `ndctl` tool allows you to make PMEM namespaces in one of four
>> modes: `raw`, `sector`, `fsdax` (or `memory`), and `devdax` (or,
>> confusingly, `dax`).

Yes, apologies for the confusion. Going forward from ndctl-v60 we have
deprecated 'memory' and 'dax'.

>> The `raw`, `sector`, and `fsdax` modes all result in a block device in
>> the pattern of `/dev/pmemN[.M]`, in which a filesystem can be stored.
>> `devdax` results in a character device in the pattern of
>> `/dev/daxN[.M]`.
>>
>> It's not clear from the documentation exactly what `raw` mode is or
>> when it would be safe to use it.

We'll add some documentation to the man page, but 'raw' mode is
effectively just a ramdisk. No DAX support.

>>
>> `sector` mode implements `BTT`; it is thus safe against sector
>> tearing, but does not support mapping files in DAX mode. The
>> namespace can be either PMEM or PBLK (?). As described above, the
>> first block of the namespace will be a BTT info block.

The info block is not exposed in the user-accessible data space, as
this comment seems to imply. It's similar to a partition table: it's
on-media metadata that specifies an encapsulation.

>> `fsdax` and `devdax` mode are both designed to make it possible for
>> user processes to have direct mapping of NVRAM. As such, both are
>> only suitable for PMEM namespaces (?). Both also need to have kernel
>> page structures allocated for each page of NVRAM; this amounts to 64
>> bytes for every 4k of NVRAM. Memory for these page structures can
>> either be allocated out of normal "system" memory, or inside the PMEM
>> namespace itself.
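
To put the "64 bytes for every 4k" figure above in perspective, here is a small back-of-the-envelope sketch. The function name and the example sizes are made up for illustration; only the 64-byte and 4k numbers come from the text.

```python
PAGE_SIZE = 4096         # 4k pages, as in the text above
LINUX_STRUCT_PAGE = 64   # bytes of page structure per page, as in the text above

GIB = 1 << 30
TIB = 1 << 40


def page_struct_overhead(nvram_bytes: int,
                         struct_size: int = LINUX_STRUCT_PAGE,
                         page_size: int = PAGE_SIZE) -> int:
    """Bytes of page structures needed to describe nvram_bytes of NVRAM."""
    pages = -(-nvram_bytes // page_size)  # round up to whole pages
    return pages * struct_size


if __name__ == "__main__":
    overhead = page_struct_overhead(1 * TIB)
    print(f"1 TiB of NVRAM needs {overhead // GIB} GiB of page structures")
```

The ratio is 64/4096, i.e. 1/64 (about 1.6%) of the namespace. For large namespaces that is a significant amount of system RAM, which is presumably why storing the structures inside the namespace is offered as an option.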
>>
>> In both cases, an "info block", very similar to the BTT info block, is
>> written to the beginning of the namespace when created. This info
>> block specifies whether the page structures come from system memory or
>> from the namespace itself. If from the namespace itself, it contains
>> information about what parts of the namespace have been set aside for
>> Linux to use for this purpose.
>>
>> Linux has also defined "Type GUIDs" for these two types of namespace
>> to be stored in the namespace label, although these are not yet in the
>> ACPI spec.

They never will be. One of the motivations for GUIDs is that an OS can
define private ones without needing to go back and standardize them.
Only GUIDs that are needed for inter-OS / pre-OS compatibility would
need to be defined in ACPI, and there is no expectation that other
OSes understand Linux's format for reserving page structure space.

>> Documentation seems to indicate that both `pmem` and `dax` devices can
>> be further subdivided (by mentioning `/dev/pmemN.M` and
>> `/dev/daxN.M`), but doesn't mention specifically how. `pmem` devices,
>> being block devices, can presumably be partitioned like a block
>> device can. `dax` devices may have something similar, or may have
>> their own subdivision mechanism. The rest of this document will
>> assume that this is the case.

You can create multiple namespaces in a given region. Subsequent
namespaces after the first get the .1, .2, .3, etc. suffix.

>>
>> # Xen considerations
>>
>> ## RAM and MMIO in Xen
>>
>> Xen generally has two types of things that can go into a pagetable or
>> p2m. The first is RAM or "system memory". RAM has a page struct,
>> which allows it to be accounted for on a page-by-page basis: Assigned
>> to a specific domain, reference counted, and so on.
>>
>> The second is MMIO. MMIO areas do not have page structures, and thus
>> cannot be accounted on a page-by-page basis.
>> Xen knows about PCI devices and the associated MMIO ranges, and makes
>> sure that PV pagetables or HVM p2m tables only contain MMIO mappings
>> for devices which have been assigned to a guest.
>>
>> ## Page structures
>>
>> To begin with, Xen, like Linux, needs page structs for NVDIMM
>> memory. Without page structs, we don't have reference counts; which
>> means there's no safe way, for instance, for a guest to ask a PV
>> device to write into NVRAM owned by a guest; and no real way to be
>> confident that the same memory hadn't been mapped multiple times.
>>
>> Page structures in Xen are 32 bytes for non-BIGMEM systems (<5 TiB),
>> and 40 bytes for BIGMEM systems.
>>
>> ### Page structure allocation
>>
>> There are three potential places we could store page structs:
>>
>> 1. **System memory** Allocated from the host RAM
>>
>> 2. **Inside the namespace** Like Linux, there could be memory set
>>    aside inside the namespace specifically for mapping that
>>    namespace. This could be 2a) As a user-visible separate partition,
>>    or 2b) allocated by `ndctl` from the namespace "superblock". As
>>    the page frame areas of the namespace can be discontiguous (?), it
>>    would be possible to enable or disable this extra space on an
>>    existing namespace, to allow users with existing vNVDIMM images to
>>    switch to or from Xen.

I think a Xen mode namespace makes sense. If I understand correctly it
would also need to house the struct page array at the same time, in
case dom0 needs to do a get_user_pages() operation when assigning pmem
to a guest?

The other consideration is how to sub-divide that namespace for
handing it out to guests. We are currently working through problems
with virtualization and device-assignment when the guest is given
memory for a dax mapped file on a filesystem in dax mode. Given that
the filesystem can do physical layout rearrangement at will, it is not
suitable to give to a guest.
For now we require a devdax mode namespace for mapping pmem to a guest so that we do not collide with filesystem block map mutations. I assume this "xen-mode" namespace would be something like devdax + mfn array. >> >> 3. **A different namespace** NVRAM could be set aside for use by >> arbitrary namespaces. This could be a 3a) specially-selected >> partition from a normal namespace, or it could be 3b) a namespace >> specifically designed to be used for Xen (perhaps with its own Type >> GUID). >> >> 2b has the advantage that we should be able to unilaterally allocate a >> Type GUID and start using it for that purpose. It also has the >> advantage that it should be somewhat easier for someone with existing >> vNVDIMM images to switch into (or away from) using Xen. It has the >> disadvantage of being less transparent to the user. >> >> 3b has the advantage of being invisible to the user once being set up. >> It has the slight disadvantage of having more gatekeepers to get >> through; and if those gatekeepers aren't happy with enabling or >> disabling extra frametable space for Xen after creation (or if I've >> misunderstood and such functionality isn't straightforward to >> implement) then it will be difficult for people with existing images >> to switch to Xen. >> >> ### Dealing with changing frame tables >> >> Another potential issue to consider is the monolithic nature of the >> current frame table. At the moment, to find a page struct given an >> mfn, you use the mfn as an index into a single large array. >> >> I think we can assume that NVDIMM SPA ranges will be separate from >> normal system RAM. There's no reason the frame table couldn't be >> "sparse": i.e., only the sections of it that actually contain valid >> pages need to have ram backing them. 
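
As a rough illustration of the "sparse" frame table idea above, the sketch below works out which frame-table pages would need backing for a given SPA range, and whether two SPA ranges would end up sharing a frame-table page. It assumes 4k pages and the 32-byte (non-BIGMEM) page structure size mentioned earlier; the function names and the example addresses are hypothetical, and this is not how Xen's frame table code is actually organised.

```python
PAGE_SIZE = 4096
XEN_PAGE_STRUCT = 32  # bytes, non-BIGMEM figure quoted earlier in the draft

# One 4k frame-table page holds 128 page structs, i.e. describes 512 KiB.
STRUCTS_PER_FT_PAGE = PAGE_SIZE // XEN_PAGE_STRUCT


def frame_table_pages(spa_start: int, spa_end: int) -> range:
    """Frame-table page indexes needed to describe the SPA range [start, end)."""
    first_mfn = spa_start // PAGE_SIZE
    last_mfn = (spa_end - 1) // PAGE_SIZE
    return range(first_mfn // STRUCTS_PER_FT_PAGE,
                 last_mfn // STRUCTS_PER_FT_PAGE + 1)


def shared_frame_table_pages(range_a, range_b) -> list:
    """Frame-table pages that would describe MFNs from both SPA ranges."""
    a = frame_table_pages(*range_a)
    b = frame_table_pages(*range_b)
    return list(range(max(a.start, b.start), min(a.stop, b.stop)))


if __name__ == "__main__":
    # Two adjacent ranges whose boundary is only 256 KiB aligned.
    ns_a = (0x1_0000_0000, 0x1_0004_0000)
    ns_b = (0x1_0004_0000, 0x1_0010_0000)
    print(shared_frame_table_pages(ns_a, ns_b))
```

Under these assumptions, any two ranges whose common boundary is aligned to 512 KiB (128 pages of 4k each) never share a frame-table page, so each range's page structures can come from a different backing store.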
>>
>> However, if we pursue a solution like Linux, where each namespace
>> contains memory set aside to use for its own pagetables, we may have a
>> situation where the boundary between two namespaces falls in the
>> middle of a frame table page; in that case, from where should such a
>> frame table page be allocated?

It's already the case that the minimum alignment for multiple
namespaces is 128MB, given the current "section size" assumptions of
the core mm. Can we make a similar alignment restriction for Xen to
eliminate this problem?

>> A simple answer would be to use system RAM to "cover the gap": There
>> would only ever need to be a single page per boundary.
>>
>> ## Page tracking for domain 0
>>
>> When domain 0 adds or removes entries from its pagetables, it does not
>> explicitly store the memory type (i.e., whether RAM or MMIO); Xen
>> infers this from its knowledge of where RAM is and is not. Below we
>> will explore design choices that involve domain 0 telling Xen about
>> NVDIMM namespaces, SPAs, and what it can use for page structures. In
>> such a scenario, NVRAM pages essentially transition from being MMIO
>> (before Xen knows about them) to being RAM (after Xen knows about
>> them), which in turn has implications for any mappings which domain 0
>> has in its pagetables.
>>
>> ## PVH and QEMU
>>
>> A number of solutions have suggested using QEMU to provide emulated
>> NVDIMM support to guests. This is a workable solution for HVM guests,
>> but for PVH guests we would like to avoid introducing a device model
>> if at all possible.
>>
>> ## FS DAX and DMA in Linux
>>
>> There is [an issue][linux-fs-dax-dma-issue] with DAX and filesystems,
>> in that filesystems (even those claiming to support DAX) may want to
>> rearrange the block<->file mapping "under the feet" of running
>> processes with mapped files.
Unfortunately, this is more tricky with >> DAX than with a page cache, and as of [early 2018][linux-fs-dax-dma-2] >> was essentially incompatible with virtualization. ("I think we need to >> enforce this in the host kernel. I.e. do not allow file backed DAX >> pages to be mapped in EPT entries unless / until we have a solution to >> the DMA synchronization problem.") >> >> More needs to be discussed and investigated here; but for the time >> being, mapping a file in a DAX filesystem into a guest's p2m is >> probably not going to be possible. Ah, you have the fsdax issue captured here, great. >> >> [linux-fs-dax-dma-issue]: >> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html >> [linux-fs-dax-dma-2]: >> https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07347.html >> >> # Target functionality >> >> The above sets the stage, but to actually determine on an architecture >> we have to decide what kind of final functionality we're looking for. >> The functionality falls into two broad areas: Functionality from the >> host administrator's point of view (accessed from domain 0), and >> functionality from the guest administrator's point of view. >> >> ## Domain 0 functionality >> >> For the purposes of this section, I shall be distinguishing between >> "native Linux" functionality and "domain 0" functionality. By "native >> Linux" functionality I mean functionality which is available when >> Linux is running on bare metal -- `ndctl`, `/dev/pmem`, `/dev/dax`, >> and so on. By "dom0 functionality" I mean functionality which is >> available in domain 0 when Linux is running under Xen. >> >> 1. **Disjoint functionality** Have dom0 and native Linux >> functionality completely separate: namespaces created when booted >> on native Linux would not be accessible when booted under domain 0, >> and vice versa. Some Xen-specific tool similar to `ndctl` would >> need to be developed for accessing functionality. 
I'm open to teaching ndctl about Xen needs if that helps. >> >> 2. **Shared data but no dom0 functionality** Another option would be >> to have Xen and Linux have shared access to the same namespaces, >> but dom0 essentially have no direct access to the NVDIMM. Xen >> would read the NFIT, parse namespaces, and expose those namespaces >> to dom0 like any other guest; but dom0 would not be able to create >> or modify namespaces. To manage namespaces, an administrator would >> need to boot into native Linux, modify the namespaces, and then >> reboot into Xen again. Ugh. >> >> 3. **Dom0 fully functional, Manual Xen frame table** Another level of >> functionality would be to make it possible for dom0 to have full >> parity with native Linux in terms of using `ndctl` to manage >> namespaces, but to require the host administrator to manually set >> aside NVRAM for Xen to use for frame tables. >> >> 4. **Dom0 fully functional, automatic Xen frame table** This is like >> the above, but with the Xen frame table space automatically >> managed, similar to Linux's: You'd simply specify that you wanted >> the Xen frametable somehow when you create the namespace, and from >> then on forget about it. >> >> Number 1 should be avoided if at all possible, in my opinion. >> >> Given that the NFIT table doesn't currently have namespace UUIDs or >> other key pieces of information to fully understand the namespaces, it >> seems like #2 would likely not be able to be made functional enough. >> >> Number 3 should be achievable under our control. Obviously #4 would >> be ideal, but might depend on getting cooperation from the Linux >> NVDIMM maintainers to be able to set aside Xen frame table memory in >> addition to Linux frame table memory. "xen-mode" namespace? >> >> ## Guest functionality >> >> 1. **No remapping** The guest can take the PMEM device as-is. It's >> mapped by the toolstack at a specific place in _guest physical >> address_ (GPA) space and cannot be moved. 
>>    There is no controller emulation (which would allow remapping) and
>>    minimal label area functionality.
>>
>> 2. **Full controller access for PMEM**. The guest has full
>>    controller access for PMEM: it can carve up namespaces, change
>>    mappings in GPA space, and so on.

In its own virtual label area, right?

>> 3. **Full controller access for both PMEM and PBLK**. A guest has
>>    full controller access, and can carve up its NVRAM into arbitrary
>>    PMEM or PBLK regions, as it wants.

I'd forget about giving PBLK to guests; just use standard virtio or
equivalent to route requests to the dom0 driver. Unless the PBLK
control registers are mapped on 4K boundaries, there's no safe way to
give individual guests their own direct PBLK access.

>>
>> Numbers 2 and 3 would of course be nice-to-have, but would almost
>> certainly involve having a QEMU process to emulate them. Since we'd
>> like to have PVH use NVDIMMs, we should at least make #1 an option.
>>
>> # Proposed design / roadmap
>>
>> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables
>> and the DSM methods; mappings are treated by Xen during this phase as
>> MMIO.
>>
>> Once dom0 is ready to pass parts of a namespace through to a guest, it
>> makes a hypercall to tell Xen about the namespace. It includes any
>> regions of the namespace which Xen may use for 'scratch'; it also
>> includes a flag to indicate whether this 'scratch' space may be used
>> for frame tables from other namespaces.
>>
>> Frame tables are then created for this SPA range. They will be
>> allocated from, in this order: 1) designated 'scratch' range from
>> within this namespace 2) designated 'scratch' range from other
>> namespaces which has been marked as sharable 3) system RAM.
>>
>> Xen will either verify that dom0 has no existing mappings, or promote
>> the mappings to full pages (taking appropriate reference counts for
>> mappings).
Dom0 must ensure that this namespace is not unmapped, >> modified, or relocated until it asks Xen to unmap it. >> >> For Xen frame tables, to begin with, set aside a partition inside a >> namespace to be used by Xen. Pass this in to Xen when activating the >> namespace; this could be either 2a or 3a from "Page structure >> allocation". After that, we could decide which of the two more >> streamlined approaches (2b or 3b) to pursue. >> >> At this point, dom0 can pass parts of the mapped namespace into >> guests. Unfortunately, passing files on a fsdax filesystem is >> probably not safe; but we can pass in full dev-dax or fsdax >> partitions. >> >> From a guest perspective, I propose we provide static NFIT only, no >> access to labels to begin with. This can be generated in hvmloader >> and/or the toolstack acpi code. I'm ignorant of Xen internals, but can you not reuse the existing QEMU emulation for labels and NFIT? Thanks for this thorough write up, it's always nice to see the tradeoffs. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Draft NVDIMM proposal 2018-05-11 16:33 ` Dan Williams @ 2018-05-15 10:05 ` Roger Pau Monné 2018-05-15 10:05 ` Roger Pau Monné ` (2 subsequent siblings) 3 siblings, 0 replies; 24+ messages in thread From: Roger Pau Monné @ 2018-05-15 10:05 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Andrew Cooper, George Dunlap, xen-devel, Jan Beulich, Yi Zhang Just some replies/questions to some of the points raised below. On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote: > [ adding linux-nvdimm ] > > Great write up! Some comments below... > > On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote: > >> To use a namespace, an operating system needs at a minimum two pieces > >> of information: The UUID and/or Name of the namespace, and the SPA > >> range where that namespace is mapped; and ideally also the Type and > >> Abstraction Type to know how to interpret the data inside. > > Not necessarily, no. Linux supports "label-less" mode where it exposes > the raw capacity of a region in 1:1 mapped namespace without a label. > This is how Linux supports "legacy" NVDIMMs that do not support > labels. In that case, how does Linux know which area of the NVDIMM it should use to store the page structures? > >> `fsdax` and `devdax` mode are both designed to make it possible for > >> user processes to have direct mapping of NVRAM. As such, both are > >> only suitable for PMEM namespaces (?). Both also need to have kernel > >> page structures allocated for each page of NVRAM; this amounts to 64 > >> bytes for every 4k of NVRAM. Memory for these page structures can > >> either be allocated out of normal "system" memory, or inside the PMEM > >> namespace itself. > >> > >> In both cases, an "info block", very similar to the BTT info block, is > >> written to the beginning of the namespace when created. This info > >> block specifies whether the page structures come from system memory or > >> from the namespace itself. 
If from the namespace itself, it contains > >> information about what parts of the namespace have been set aside for > >> Linux to use for this purpose. > >> > >> Linux has also defined "Type GUIDs" for these two types of namespace > >> to be stored in the namespace label, although these are not yet in the > >> ACPI spec. > > They never will be. One of the motivations for GUIDs is that an OS can > define private ones without needing to go back and standardize them. > Only GUIDs that are needed to inter-OS / pre-OS compatibility would > need to be defined in ACPI, and there is no expectation that other > OSes understand Linux's format for reserving page structure space. Maybe it would be helpful to somehow mark those areas as "non-persistent" storage, so that other OSes know they can use this space for temporary data that doesn't need to survive across reboots? > >> # Proposed design / roadmap > >> > >> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables > >> and the DSM methods; mappings are treated by Xen during this phase as > >> MMIO. > >> > >> Once dom0 is ready to pass parts of a namespace through to a guest, it > >> makes a hypercall to tell Xen about the namespace. It includes any > >> regions of the namespace which Xen may use for 'scratch'; it also > >> includes a flag to indicate whether this 'scratch' space may be used > >> for frame tables from other namespaces. > >> > >> Frame tables are then created for this SPA range. They will be > >> allocated from, in this order: 1) designated 'scratch' range from > >> within this namespace 2) designated 'scratch' range from other > >> namespaces which has been marked as sharable 3) system RAM. > >> > >> Xen will either verify that dom0 has no existing mappings, or promote > >> the mappings to full pages (taking appropriate reference counts for > >> mappings). Dom0 must ensure that this namespace is not unmapped, > >> modified, or relocated until it asks Xen to unmap it. 
> >> > >> For Xen frame tables, to begin with, set aside a partition inside a > >> namespace to be used by Xen. Pass this in to Xen when activating the > >> namespace; this could be either 2a or 3a from "Page structure > >> allocation". After that, we could decide which of the two more > >> streamlined approaches (2b or 3b) to pursue. > >> > >> At this point, dom0 can pass parts of the mapped namespace into > >> guests. Unfortunately, passing files on a fsdax filesystem is > >> probably not safe; but we can pass in full dev-dax or fsdax > >> partitions. > >> > >> From a guest perspective, I propose we provide static NFIT only, no > >> access to labels to begin with. This can be generated in hvmloader > >> and/or the toolstack acpi code. > > I'm ignorant of Xen internals, but can you not reuse the existing QEMU > emulation for labels and NFIT? We only use QEMU for HVM guests, which would still leave PVH guests without NVDIMM support. Ideally we would like to use the same solution for both HVM and PVH, which means QEMU cannot be part of that solution. Thanks, Roger. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Draft NVDIMM proposal 2018-05-15 10:05 ` Roger Pau Monné @ 2018-05-15 10:12 ` George Dunlap 2018-05-15 10:12 ` George Dunlap 1 sibling, 0 replies; 24+ messages in thread From: George Dunlap @ 2018-05-15 10:12 UTC (permalink / raw) To: Roger Pau Monne Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Dan Williams, Yi Zhang > On May 15, 2018, at 11:05 AM, Roger Pau Monne <roger.pau@citrix.com> wrote: > > Just some replies/questions to some of the points raised below. > > On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote: >> [ adding linux-nvdimm ] >> >> Great write up! Some comments below... >> >> On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote: >>>> To use a namespace, an operating system needs at a minimum two pieces >>>> of information: The UUID and/or Name of the namespace, and the SPA >>>> range where that namespace is mapped; and ideally also the Type and >>>> Abstraction Type to know how to interpret the data inside. >> >> Not necessarily, no. Linux supports "label-less" mode where it exposes >> the raw capacity of a region in 1:1 mapped namespace without a label. >> This is how Linux supports "legacy" NVDIMMs that do not support >> labels. > > In that case, how does Linux know which area of the NVDIMM it should > use to store the page structures? The answer to that is right here: >>>> `fsdax` and `devdax` mode are both designed to make it possible for >>>> user processes to have direct mapping of NVRAM. As such, both are >>>> only suitable for PMEM namespaces (?). Both also need to have kernel >>>> page structures allocated for each page of NVRAM; this amounts to 64 >>>> bytes for every 4k of NVRAM. Memory for these page structures can >>>> either be allocated out of normal "system" memory, or inside the PMEM >>>> namespace itself. >>>> >>>> In both cases, an "info block", very similar to the BTT info block, is >>>> written to the beginning of the namespace when created. 
This info >>>> block specifies whether the page structures come from system memory or >>>> from the namespace itself. If from the namespace itself, it contains >>>> information about what parts of the namespace have been set aside for >>>> Linux to use for this purpose. That is, each fsdax / devdax namespace has a superblock that, in part, defines what parts are used for Linux and what parts are used for data. Or to put it a different way: Linux decides which parts of a namespace to use for page structures, and writes it down in the metadata starting in the first page of the namespace. >>>> >>>> Linux has also defined "Type GUIDs" for these two types of namespace >>>> to be stored in the namespace label, although these are not yet in the >>>> ACPI spec. >> >> They never will be. One of the motivations for GUIDs is that an OS can >> define private ones without needing to go back and standardize them. >> Only GUIDs that are needed to inter-OS / pre-OS compatibility would >> need to be defined in ACPI, and there is no expectation that other >> OSes understand Linux's format for reserving page structure space. > > Maybe it would be helpful to somehow mark those areas as > "non-persistent" storage, so that other OSes know they can use this > space for temporary data that doesn't need to survive across reboots? In theory there’s no reason another OS couldn’t learn Linux’s format, discover where the blocks were, and use those blocks for its own purposes while Linux wasn’t running. But that won’t help Xen, as we want to use those blocks while Linux *is* running. -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Draft NVDIMM proposal
From: Jan Beulich @ 2018-05-15 12:26 UTC
To: george.dunlap, Roger Pau Monne
Cc: Andrew Cooper, dan.j.williams, linux-nvdimm, xen-devel, yi.z.zhang

>>> On 15.05.18 at 12:12, <George.Dunlap@citrix.com> wrote:
>> On May 15, 2018, at 11:05 AM, Roger Pau Monne <roger.pau@citrix.com> wrote:
>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>>> [ adding linux-nvdimm ]
>>>
>>> Great write up! Some comments below...
>>>
>>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote:
>>>>> To use a namespace, an operating system needs at a minimum two pieces
>>>>> of information: The UUID and/or Name of the namespace, and the SPA
>>>>> range where that namespace is mapped; and ideally also the Type and
>>>>> Abstraction Type to know how to interpret the data inside.
>>>
>>> Not necessarily, no. Linux supports "label-less" mode where it exposes
>>> the raw capacity of a region in 1:1 mapped namespace without a label.
>>> This is how Linux supports "legacy" NVDIMMs that do not support
>>> labels.
>>
>> In that case, how does Linux know which area of the NVDIMM it should
>> use to store the page structures?
>
> The answer to that is right here:
>
>>>>> `fsdax` and `devdax` mode are both designed to make it possible for
>>>>> user processes to have direct mapping of NVRAM.  As such, both are
>>>>> only suitable for PMEM namespaces (?).  Both also need to have kernel
>>>>> page structures allocated for each page of NVRAM; this amounts to 64
>>>>> bytes for every 4k of NVRAM.  Memory for these page structures can
>>>>> either be allocated out of normal "system" memory, or inside the PMEM
>>>>> namespace itself.
>>>>>
>>>>> In both cases, an "info block", very similar to the BTT info block, is
>>>>> written to the beginning of the namespace when created.  This info
>>>>> block specifies whether the page structures come from system memory or
>>>>> from the namespace itself.  If from the namespace itself, it contains
>>>>> information about what parts of the namespace have been set aside for
>>>>> Linux to use for this purpose.
>
> That is, each fsdax / devdax namespace has a superblock that, in part,
> defines what parts are used for Linux and what parts are used for data.  Or
> to put it a different way: Linux decides which parts of a namespace to use
> for page structures, and writes it down in the metadata starting in the first
> page of the namespace.

And that metadata layout is agreed upon between all OS vendors?

>>>>> Linux has also defined "Type GUIDs" for these two types of namespace
>>>>> to be stored in the namespace label, although these are not yet in the
>>>>> ACPI spec.
>>>
>>> They never will be. One of the motivations for GUIDs is that an OS can
>>> define private ones without needing to go back and standardize them.
>>> Only GUIDs that are needed to inter-OS / pre-OS compatibility would
>>> need to be defined in ACPI, and there is no expectation that other
>>> OSes understand Linux's format for reserving page structure space.
>>
>> Maybe it would be helpful to somehow mark those areas as
>> "non-persistent" storage, so that other OSes know they can use this
>> space for temporary data that doesn't need to survive across reboots?
>
> In theory there’s no reason another OS couldn’t learn Linux’s format,
> discover where the blocks were, and use those blocks for its own purposes
> while Linux wasn’t running.

This looks to imply "no" to my question above, in which case I wonder how
we would use (part of) the space when the "other" owner is e.g. Windows.

Jan
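For a sense of scale, the "64 bytes for every 4k of NVRAM" figure quoted above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: the function name and constants are made up for illustration, and it ignores any alignment or info-block overhead that Linux may add on top.

```python
PAGE_SIZE = 4096        # bytes of NVRAM covered by one page structure (4k)
PAGE_STRUCT_SIZE = 64   # bytes per page structure, as quoted in the thread


def page_struct_overhead(namespace_bytes: int) -> int:
    """Bytes of page structures needed to cover a namespace of this size.

    Hypothetical helper: rounds up to whole pages and ignores any extra
    alignment or metadata a real kernel might reserve.
    """
    pages = -(-namespace_bytes // PAGE_SIZE)  # ceiling division
    return pages * PAGE_STRUCT_SIZE


if __name__ == "__main__":
    one_tib = 1 << 40
    overhead = page_struct_overhead(one_tib)
    # 64 / 4096 is 1/64 of the namespace, i.e. about 1.6%.
    print(f"1 TiB namespace -> {overhead / (1 << 30):.0f} GiB of page structures")
```

At that ratio the page structures for a multi-terabyte namespace run to tens of gigabytes, which is presumably why the option of placing them inside the PMEM namespace itself matters.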
* Re: Draft NVDIMM proposal
From: George Dunlap @ 2018-05-15 13:05 UTC
To: Jan Beulich
Cc: linux-nvdimm, Andrew Cooper, xen-devel, Roger Pau Monne, yi.z.zhang

> On May 15, 2018, at 1:26 PM, Jan Beulich <JBeulich@suse.com> wrote:
>
>>>> On 15.05.18 at 12:12, <George.Dunlap@citrix.com> wrote:
>>> On May 15, 2018, at 11:05 AM, Roger Pau Monne <roger.pau@citrix.com> wrote:
>>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>>>> [ adding linux-nvdimm ]
>>>>
>>>> Great write up! Some comments below...
>>>>
>>>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap <george.dunlap@citrix.com> wrote:
>>>>>> To use a namespace, an operating system needs at a minimum two pieces
>>>>>> of information: The UUID and/or Name of the namespace, and the SPA
>>>>>> range where that namespace is mapped; and ideally also the Type and
>>>>>> Abstraction Type to know how to interpret the data inside.
>>>>
>>>> Not necessarily, no. Linux supports "label-less" mode where it exposes
>>>> the raw capacity of a region in 1:1 mapped namespace without a label.
>>>> This is how Linux supports "legacy" NVDIMMs that do not support
>>>> labels.
>>>
>>> In that case, how does Linux know which area of the NVDIMM it should
>>> use to store the page structures?
>>
>> The answer to that is right here:
>>
>>>>>> `fsdax` and `devdax` mode are both designed to make it possible for
>>>>>> user processes to have direct mapping of NVRAM.  As such, both are
>>>>>> only suitable for PMEM namespaces (?).  Both also need to have kernel
>>>>>> page structures allocated for each page of NVRAM; this amounts to 64
>>>>>> bytes for every 4k of NVRAM.  Memory for these page structures can
>>>>>> either be allocated out of normal "system" memory, or inside the PMEM
>>>>>> namespace itself.
>>>>>>
>>>>>> In both cases, an "info block", very similar to the BTT info block, is
>>>>>> written to the beginning of the namespace when created.  This info
>>>>>> block specifies whether the page structures come from system memory or
>>>>>> from the namespace itself.  If from the namespace itself, it contains
>>>>>> information about what parts of the namespace have been set aside for
>>>>>> Linux to use for this purpose.
>>
>> That is, each fsdax / devdax namespace has a superblock that, in part,
>> defines what parts are used for Linux and what parts are used for data.  Or
>> to put it a different way: Linux decides which parts of a namespace to use
>> for page structures, and writes it down in the metadata starting in the first
>> page of the namespace.
>
> And that metadata layout is agreed upon between all OS vendors?
>
>>>>>> Linux has also defined "Type GUIDs" for these two types of namespace
>>>>>> to be stored in the namespace label, although these are not yet in the
>>>>>> ACPI spec.
>>>>
>>>> They never will be. One of the motivations for GUIDs is that an OS can
>>>> define private ones without needing to go back and standardize them.
>>>> Only GUIDs that are needed to inter-OS / pre-OS compatibility would
>>>> need to be defined in ACPI, and there is no expectation that other
>>>> OSes understand Linux's format for reserving page structure space.
>>>
>>> Maybe it would be helpful to somehow mark those areas as
>>> "non-persistent" storage, so that other OSes know they can use this
>>> space for temporary data that doesn't need to survive across reboots?
>>
>> In theory there’s no reason another OS couldn’t learn Linux’s format,
>> discover where the blocks were, and use those blocks for its own purposes
>> while Linux wasn’t running.
>
> This looks to imply "no" to my question above, in which case I wonder how
> we would use (part of) the space when the "other" owner is e.g. Windows.

So in classic DOS partition tables, you have partition types; and
various operating systems just sort of “claimed” numbers for themselves
(e.g., NTFS, Linux Swap, &c).  But the DOS partition table number space
is actually quite small.  So in namespaces, you have a similar concept,
except that it’s called a “type GUID”, and it’s massively long -- long
enough that anyone who wants to make a new type can simply generate one
randomly and be pretty confident that nobody else is using that one.

So if the labels contain a TGUID you understand, you use it, just like
you would a partition that you understand.  If it contains GUIDs you
don’t understand, you’d better leave it alone.

 -George
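The "use it if you recognise the type GUID, otherwise leave it alone" rule can be sketched in a few lines. Everything named here is hypothetical: the GUID is generated at random purely for illustration (it is not a GUID that Linux, Xen, or the UEFI spec defines), and the label is reduced to just its type GUID field.

```python
import uuid

# Hypothetical: a type GUID some OS has "claimed" by generating it at random.
# A version-4 UUID carries 122 random bits, versus the single byte (256
# values) available for the partition type in a DOS partition table entry.
EXAMPLE_PRIVATE_TGUID = uuid.uuid4()

# Type GUIDs this (imaginary) OS knows how to interpret.
KNOWN_TYPE_GUIDS = {
    EXAMPLE_PRIVATE_TGUID: "example-private-format",
}


def classify_namespace(type_guid: uuid.UUID) -> str:
    """Return how to treat a namespace, given the type GUID from its label."""
    # Unknown GUID: someone else's format, so do not touch the contents.
    return KNOWN_TYPE_GUIDS.get(type_guid, "leave-alone")


if __name__ == "__main__":
    print(classify_namespace(EXAMPLE_PRIVATE_TGUID))
    print(classify_namespace(uuid.uuid4()))  # a GUID nobody here has claimed
```

The design point is the same one made for partition types: no registry is needed, because a randomly generated 128-bit value is vanishingly unlikely to collide with one chosen by anyone else.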
* Re: Draft NVDIMM proposal
From: Dan Williams @ 2018-05-15 17:33 UTC
To: Jan Beulich
Cc: linux-nvdimm, Andrew Cooper, George Dunlap, xen-devel, Roger Pau Monne, Zhang, Yi Z

On Tue, May 15, 2018 at 5:26 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 15.05.18 at 12:12, <George.Dunlap@citrix.com> wrote:
[..]
>> That is, each fsdax / devdax namespace has a superblock that, in part,
>> defines what parts are used for Linux and what parts are used for data.  Or
>> to put it a different way: Linux decides which parts of a namespace to use
>> for page structures, and writes it down in the metadata starting in the first
>> page of the namespace.
>
> And that metadata layout is agreed upon between all OS vendors?

The only agreed upon metadata layouts across all OS vendors are the
ones that are specified in UEFI. We typically only need inter-OS and
UEFI compatibility for booting and other pre-OS accesses. For Linux,
"raw" and "sector" mode namespaces defined by namespace labels are
inter-OS compatible, while "fsdax", "devdax", and so called "label-less"
configurations are not.
* Re: Draft NVDIMM proposal
From: George Dunlap @ 2018-05-15 14:19 UTC
To: Dan Williams
Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne

On 05/11/2018 05:33 PM, Dan Williams wrote:
> [ adding linux-nvdimm ]
>
> Great write up! Some comments below...

Thanks for the quick response!

It seems I still have some fundamental misconceptions about what's going
on, so I'd better start with that. :-)

Here's the part that I'm having a hard time getting.

If actual data on the NVDIMMs is a noun, and the act of writing is a
verb, then the SPA and interleave sets are adverbs: they define *how*
the write happens.  When the processor says, "Write to address X", the
memory controller converts address X into a <dimm number, dimm-physical
address> tuple to actually write the data.

So, who decides what this SPA range and interleave set is?  Can the
operating system change these interleave sets and mappings, or change
data from PMEM to BLK, and if so, how?

If you read through section 13.19 of the UEFI manual, it seems to imply
that this is determined by the label area -- that each DIMM has a
separate label area describing regions local to that DIMM; and that if
you have 4 DIMMs you'll have 4 label areas, and each label area will
have a label describing the DPA region on that DIMM which corresponds to
the interleave set.  And somehow someone sets up the interleave sets and
SPA based on what's written there.

Which would mean that an operating system could change how the
interleave sets work by rewriting the various labels on the DIMMs; for
instance, changing a single 4-way set spanning the entirety of 4 DIMMs,
to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning
half of 2 DIMMs each.

But then you say:

> Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs
> provide a "label area" which is an out-of-band non-volatile memory
> area where the OS can store whatever it likes. The UEFI 2.7
> specification defines a data format for the definition of namespaces
> on top of persistent memory ranges advertised to the OS via the ACPI
> NFIT structure.

OK, so that sounds like no, that's not what happens.  So where do the
SPA range and interleave sets come from?

Random guess: The BIOS / firmware makes it up.  Either it's hard-coded,
or there's some menu in the BIOS you can use to change things around;
but once it hits the operating system, that's it -- the mapping of SPA
range onto interleave sets onto DIMMs is, from the operating system's
point of view, fixed.

And so (here's another guess) -- when you're talking about namespaces
and label areas, you're talking about namespaces stored *within a
pre-existing SPA range*.  You use the same format as described in the
UEFI spec, but ignore all the stuff about interleave sets and whatever,
and use system physical addresses relative to the SPA range rather than
DPAs.

Is that right?

But then there's things like this:

> There is no obligation for an NVDIMM to provide a label area, and as
> far as I know all NVDIMMs on the market today do not provide a label
> area.
[snip]
> Linux supports "label-less" mode where it exposes
> the raw capacity of a region in 1:1 mapped namespace without a label.
> This is how Linux supports "legacy" NVDIMMs that do not support
> labels.

So are "all NVDIMMs on the market today" then classed as "legacy"
NVDIMMs because they don't support labels?  And if labels are simply the
NVDIMM equivalent of a partition table, then what does it mean to
"support" or "not support" labels?

And then there's this:

> In any
> event we do the DIMM to SPA association first before reading labels.
> The OS calculates a so called "Interleave Set Cookie" from the NFIT
> information to compare against a similar value stored in the labels.
> This lets the OS determine that the Interleave Set composition has not
> changed from when the labels were initially written. An Interleave Set
> Cookie mismatch indicates the labels are stale, corrupted, or that the
> physical composition of the Interleave Set has changed.

So wait, the SPA and interleave sets can actually change?  And the
labels which the OS reads actually are per-DIMM, and do control somehow
how the DPA ranges of individual DIMMs are mapped into interleave sets
and exposed as SPAs?  (And perhaps, can be changed by the operating
system?)

And:

> There are checksums in the Namespace definition to account label
> validity. Starting with ACPI 6.2 DSMs for labels are deprecated in
> favor of the new / named methods for label access _LSI, _LSR, and
> _LSW.

Does this mean the methods will use checksums to verify writes to the
label area, and refuse writes which create invalid labels?

If all of the above is true, then in what way can it be said that
"NVDIMM has no concept of namespaces", that an OS can "store whatever it
likes" in the label area, and that UEFI namespaces are "on top of
persistent memory ranges advertised to the OS via the ACPI NFIT
structure"?

I'm sorry if this is obvious, but I am exactly as confused as I was
before I started writing this. :-)

This is all pretty foundational.  Xen can read static ACPI tables, but
it can't do AML.  So to do a proper design for Xen, we need to know:

1. If Xen can find out, without Linux's help, what namespaces exist and
   if there is one it can use for its own purposes

2. If the SPA regions can change at runtime.

If SPA regions don't change after boot, and if Xen can find its own
Xen-specific namespace to use for the frame tables by reading the NFIT
table, then that significantly reduces the amount of interaction it
needs with Linux.

If SPA regions *can* change after boot, and if Xen must rely on Linux to
read labels and find out what it can safely use for frame tables, then
it makes things significantly more involved.  Not impossible by any
means, but a lot more complicated.

Hope all that makes sense -- thanks again for your help.

 -George
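To make the "address X becomes a <dimm number, dimm-physical address> tuple" step concrete, here is a deliberately simplified model of interleaving. It is an assumption-laden sketch, not how any particular memory controller works: it assumes a plain round-robin interleave with a fixed line size, the same DPA base on every DIMM in the set, and made-up function and parameter names. Real interleave geometry is whatever the firmware describes in the NFIT.

```python
from typing import Tuple


def spa_to_dimm(spa: int, spa_base: int, dpa_base: int,
                n_dimms: int, line_size: int = 4096) -> Tuple[int, int]:
    """Map a system physical address to (dimm index, DIMM physical address).

    Simplified model: consecutive `line_size` chunks of the SPA range are
    dealt out round-robin across `n_dimms` DIMMs, and each DIMM's share
    starts at the same `dpa_base`.  All parameters are illustrative.
    """
    offset = spa - spa_base
    if offset < 0:
        raise ValueError("address is below the start of the SPA range")
    line, within = divmod(offset, line_size)
    dimm = line % n_dimms                       # which DIMM holds this line
    dpa = dpa_base + (line // n_dimms) * line_size + within
    return dimm, dpa


if __name__ == "__main__":
    base = 0x1_0000_0000
    for addr in (base, base + 4096, base + 4 * 4096 + 5):
        dimm, dpa = spa_to_dimm(addr, spa_base=base, dpa_base=0, n_dimms=4)
        print(f"SPA {addr:#x} -> DIMM {dimm}, DPA {dpa:#x}")
```

In this model the questions raised above amount to asking who chooses `spa_base`, `n_dimms`, and the DPA range, and whether those choices can change after boot.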
* Re: Draft NVDIMM proposal 2018-05-11 16:33 ` Dan Williams ` (2 preceding siblings ...) 2018-05-15 14:19 ` George Dunlap @ 2018-05-15 14:19 ` George Dunlap 2018-05-15 18:06 ` Dan Williams 2018-05-15 18:06 ` Dan Williams 3 siblings, 2 replies; 24+ messages in thread From: George Dunlap @ 2018-05-15 14:19 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne On 05/11/2018 05:33 PM, Dan Williams wrote: > [ adding linux-nvdimm ] > > Great write up! Some comments below... Thanks for the quick response! It seems I still have some fundamental misconceptions about what's going on, so I'd better start with that. :-) Here's the part that I'm having a hard time getting. If actual data on the NVDIMMs is a noun, and the act of writing is a verb, then the SPA and interleave sets are adverbs: they define *how* the write happens. When the processor says, "Write to address X", the memory controller converts address X into a <dimm number, dimm-physical address> tuple to actually write the data. So, who decides what this SPA range and interleave set is? Can the operating system change these interleave sets and mappings, or change data from PMEM to BLK, and is so, how? If you read through section 13.19 of the UEFI manual, it seems to imply that this is determined by the label area -- that each DIMM has a separate label area describing regions local to that DIMM; and that if you have 4 DIMMs you'll have 4 label areas, and each label area will have a label describing the DPA region on that DIMM which corresponds to the interleave set. And somehow someone sets up the interleave sets and SPA based on what's written there. Which would mean that an operating system could change how the interleave sets work by rewriting the various labels on the DIMMs; for instance, changing a single 4-way set spanning the entirety of 4 DIMMs, to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning half of 2 DIMMs each. 
But then you say: > Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs > provide a "label area" which is an out-of-band non-volatile memory > area where the OS can store whatever it likes. The UEFI 2.7 > specification defines a data format for the definition of namespaces > on top of persistent memory ranges advertised to the OS via the ACPI > NFIT structure. OK, so that sounds like no, that's that what happens. So where do the SPA range and interleave sets come from? Random guess: The BIOS / firmware makes it up. Either it's hard-coded, or there's some menu in the BIOS you can use to change things around; but once it hits the operating system, that's it -- the mapping of SPA range onto interleave sets onto DIMMs is, from the operating system's point of view, fixed. And so (here's another guess) -- when you're talking about namespaces and label areas, you're talking about namespaces stored *within a pre-existing SPA range*. You use the same format as described in the UEFI spec, but ignore all the stuff about interleave sets and whatever, and use system physical addresses relative to the SPA range rather than DPAs. Is that right? But then there's things like this: > There is no obligation for an NVDIMM to provide a label area, and as > far as I know all NVDIMMs on the market today do not provide a label > area. [snip] > Linux supports "label-less" mode where it exposes > the raw capacity of a region in 1:1 mapped namespace without a label. > This is how Linux supports "legacy" NVDIMMs that do not support > labels. So are "all NVDIMMs on the market today" then classed as "legacy" NVDIMMs because they don't support labels? And if labels are simply the NVDIMM equivalent of a partition table, then what does it mena to "support" or "not support" labels? And then there's this: > In any > event we do the DIMM to SPA association first before reading labels. 
> The OS calculates a so called "Interleave Set Cookie" from the NFIT > information to compare against a similar value stored in the labels. > This lets the OS determine that the Interleave Set composition has not > changed from when the labels were initially written. An Interleave Set > Cookie mismatch indicates the labels are stale, corrupted, or that the > physical composition of the Interleave Set has changed. So wait, the SPA and interleave sets can actually change? And the labels which the OS reads actually are per-DIMM, and do control somehow how the DPA ranges of individual DIMMs are mapped into interleave sets and exposed as SPAs? (And perhaps, can be changed by the operating system?) And: > There are checksums in the Namespace definition to account label > validity. Starting with ACPI 6.2 DSMs for labels are deprecated in > favor of the new / named methods for label access _LSI, _LSR, and > _LSW. Does this mean the methods will use checksums to verify writes to the label area, and refuse writes which create invalid labels? If all of the above is true, then in what way can it be said that "NVDIMM has no concept of namespaces", that an OS can "store whatever it likes" in the label area, and that UEFI namespaces are "on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure"? I'm sorry if this is obvious, but I am exactly as confused as I was before I started writing this. :-) This is all pretty foundational. Xen can read static ACPI tables, but it can't do AML. So to do a proper design for Xen, we need to know: 1. If Xen can find out, without Linux's help, what namespaces exist and if there is one it can use for its own purposes 2. If the SPA regions can change at runtime. If SPA regions don't change after boot, and if Xen can find its own Xen-specific namespace to use for the frame tables by reading the NFIT table, then that significantly reduces the amount of interaction it needs with Linux. 
If SPA regions *can* change after boot, and if Xen must rely on Linux to read labels and find out what it can safely use for frame tables, then it makes things significantly more involved. Not impossible by any means, but a lot more complicated. Hope all that makes sense -- thanks again for your help. -George _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
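For illustration only, the translation described in this message (the memory controller turning "address X" into a <dimm number, dimm-physical address> tuple) can be sketched as a plain round-robin interleave. The function name, the 4 KiB interleave granularity, and the layout are assumptions made up for this example; real platforms use their own schemes, which the firmware reports through the NFIT.

```python
from typing import NamedTuple


class DimmAddress(NamedTuple):
    dimm: int  # index of the DIMM within the interleave set
    dpa: int   # DIMM physical address on that DIMM


def spa_to_dpa(spa: int, spa_base: int, ways: int,
               dpa_base: int = 0, line_size: int = 4096) -> DimmAddress:
    """Translate a system physical address into (DIMM, DPA).

    Hypothetical scheme: consecutive `line_size` chunks of the SPA range
    rotate across the `ways` DIMMs of the interleave set.
    """
    offset = spa - spa_base
    if offset < 0:
        raise ValueError("SPA is below the start of the range")
    line, within = divmod(offset, line_size)
    dimm = line % ways                                   # which DIMM gets this chunk
    dpa = dpa_base + (line // ways) * line_size + within  # position on that DIMM
    return DimmAddress(dimm, dpa)
```

Under these assumptions, neighbouring 4 KiB chunks of one SPA range land on different DIMMs, which is why losing a single DIMM damages the whole PMEM range.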
* Re: Draft NVDIMM proposal 2018-05-15 14:19 ` George Dunlap @ 2018-05-15 18:06 ` Dan Williams 2018-05-15 18:33 ` Andrew Cooper ` (3 more replies) 2018-05-15 18:06 ` Dan Williams 1 sibling, 4 replies; 24+ messages in thread From: Dan Williams @ 2018-05-15 18:06 UTC (permalink / raw) To: George Dunlap Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote: > On 05/11/2018 05:33 PM, Dan Williams wrote: >> [ adding linux-nvdimm ] >> >> Great write up! Some comments below... > > Thanks for the quick response! > > It seems I still have some fundamental misconceptions about what's going > on, so I'd better start with that. :-) > > Here's the part that I'm having a hard time getting. > > If actual data on the NVDIMMs is a noun, and the act of writing is a > verb, then the SPA and interleave sets are adverbs: they define *how* > the write happens. When the processor says, "Write to address X", the > memory controller converts address X into a <dimm number, dimm-physical > address> tuple to actually write the data. > > So, who decides what this SPA range and interleave set is? Can the > operating system change these interleave sets and mappings, or change > data from PMEM to BLK, and if so, how? The interleave-set to SPA range association and delineation of capacity between PMEM and BLK access modes is currently out-of-scope for ACPI. The BIOS reports the configuration to the OS via the NFIT, but the configuration is currently written by vendor specific tooling. Longer term it would be great for this mechanism to become standardized and available to the OS, but for now it requires platform specific tooling to change the DIMM interleave configuration.
> If you read through section 13.19 of the UEFI manual, it seems to imply > that this is determined by the label area -- that each DIMM has a > separate label area describing regions local to that DIMM; and that if > you have 4 DIMMs you'll have 4 label areas, and each label area will > have a label describing the DPA region on that DIMM which corresponds to > the interleave set. And somehow someone sets up the interleave sets and > SPA based on what's written there. > > Which would mean that an operating system could change how the > interleave sets work by rewriting the various labels on the DIMMs; for > instance, changing a single 4-way set spanning the entirety of 4 DIMMs, > to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning > half of 2 DIMMs each. If a DIMM supports both the PMEM and BLK mechanisms for accessing the same DPA, then the label resolves the ambiguity and tells the OS to enforce one access mechanism per DPA at a time. Otherwise the OS has no ability to affect the interleave-set configuration; it's all initialized by platform BIOS/firmware before the OS boots. > > But then you say: > >> Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs >> provide a "label area" which is an out-of-band non-volatile memory >> area where the OS can store whatever it likes. The UEFI 2.7 >> specification defines a data format for the definition of namespaces >> on top of persistent memory ranges advertised to the OS via the ACPI >> NFIT structure. > > OK, so that sounds like no, that's not what happens. So where do the > SPA range and interleave sets come from? > > Random guess: The BIOS / firmware makes it up. Either it's hard-coded, > or there's some menu in the BIOS you can use to change things around; > but once it hits the operating system, that's it -- the mapping of SPA > range onto interleave sets onto DIMMs is, from the operating system's > point of view, fixed. Correct.
> And so (here's another guess) -- when you're talking about namespaces > and label areas, you're talking about namespaces stored *within a > pre-existing SPA range*. You use the same format as described in the > UEFI spec, but ignore all the stuff about interleave sets and whatever, > and use system physical addresses relative to the SPA range rather than > DPAs. Well, we don't ignore it because we need to validate in the driver that the interleave set configuration matches a checksum that we generated when the namespace was first instantiated on the interleave set. However, you are right, for accesses at run time all we care about is the SPA for PMEM accesses. > > Is that right? > > But then there's things like this: > >> There is no obligation for an NVDIMM to provide a label area, and as >> far as I know all NVDIMMs on the market today do not provide a label >> area. > [snip] >> Linux supports "label-less" mode where it exposes >> the raw capacity of a region in 1:1 mapped namespace without a label. >> This is how Linux supports "legacy" NVDIMMs that do not support >> labels. > > So are "all NVDIMMs on the market today" then classed as "legacy" > NVDIMMs because they don't support labels? And if labels are simply the > NVDIMM equivalent of a partition table, then what does it mean to > "support" or "not support" labels? Yes, the term "legacy" has been thrown around for NVDIMMs that do not support labels. The way this support is determined is whether the platform publishes the _LSI, _LSR, and _LSW methods in ACPI (see: 6.5.10 NVDIMM Label Methods in ACPI 6.2a). I.e. each DIMM is represented by an ACPI device object, and we query those objects for these named methods. When the methods are missing *or* there is no initialized namespace index block found on the DIMMs, Linux will fall back to the "label-less" mode. > > And then there's this: > >> In any >> event we do the DIMM to SPA association first before reading labels.
>> The OS calculates a so called "Interleave Set Cookie" from the NFIT >> information to compare against a similar value stored in the labels. >> This lets the OS determine that the Interleave Set composition has not >> changed from when the labels were initially written. An Interleave Set >> Cookie mismatch indicates the labels are stale, corrupted, or that the >> physical composition of the Interleave Set has changed. > > So wait, the SPA and interleave sets can actually change? And the > labels which the OS reads actually are per-DIMM, and do control somehow > how the DPA ranges of individual DIMMs are mapped into interleave sets > and exposed as SPAs? (And perhaps, can be changed by the operating system?) They can change, but only under the control of the BIOS. All changes to the interleave set configuration need a reboot because the memory controller needs to be set up differently at system-init time. > > And: > >> There are checksums in the Namespace definition to account label >> validity. Starting with ACPI 6.2 DSMs for labels are deprecated in >> favor of the new / named methods for label access _LSI, _LSR, and >> _LSW. > > Does this mean the methods will use checksums to verify writes to the > label area, and refuse writes which create invalid labels? No, the checksum I'm referring to is the interleave set cookie (see: "SetCookie" in the UEFI 2.7 specification). It validates that the interleave set backing the SPA has not changed configuration since the last boot. > > If all of the above is true, then in what way can it be said that > "NVDIMM has no concept of namespaces", that an OS can "store whatever it > likes" in the label area, and that UEFI namespaces are "on top of > persistent memory ranges advertised to the OS via the ACPI NFIT structure"? The NVDIMM just provides storage area for the OS to write opaque data that just happens to conform to the UEFI Namespace label format. 
The interleave-set configuration is stored in yet another out-of-band location on the DIMM or on some platform-specific storage location and is consulted / restored by the BIOS each boot. The NFIT is the output from the platform specific physical mappings of the DIMMs, and Namespaces are logical volumes built on top of those hard-defined NFIT boundaries. > > I'm sorry if this is obvious, but I am exactly as confused as I was > before I started writing this. :-) > > This is all pretty foundational. Xen can read static ACPI tables, but > it can't do AML. So to do a proper design for Xen, we need to know: Oooh, ok, no AML in Xen... > 1. If Xen can find out, without Linux's help, what namespaces exist and > if there is one it can use for its own purposes Yeah, no, not without calling AML methods. > 2. If the SPA regions can change at runtime. Nope, these are statically defined and can only change at reboot, if at all. A likely scenario is that an OEM ships the DIMMs already configured in an interleave-set and, barring component failure, nothing changes for the life of the platform. > If SPA regions don't change after boot, and if Xen can find its own > Xen-specific namespace to use for the frame tables by reading the NFIT > table, then that significantly reduces the amount of interaction it > needs with Linux. > > If SPA regions *can* change after boot, and if Xen must rely on Linux to > read labels and find out what it can safely use for frame tables, then > it makes things significantly more involved. Not impossible by any > means, but a lot more complicated. > > Hope all that makes sense -- thanks again for your help. I think it does, but it seems namespaces are out of reach for Xen without some agent / enabling that can execute the necessary AML methods.
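The label handling described in this reply (fall back to "label-less" mode when the label methods or the index block are missing, and treat a cookie mismatch as stale labels) can be summarised in a rough sketch. `DimmInfo`, its fields, and `namespace_mode` are hypothetical names invented for the example, not libnvdimm structures, and the cookie is taken as an already-computed integer rather than derived the way the UEFI specification prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimmInfo:
    has_label_methods: bool       # platform publishes _LSI, _LSR and _LSW
    has_valid_index_block: bool   # an initialized namespace index block was found
    label_cookie: Optional[int]   # interleave set cookie stored in the labels


def namespace_mode(dimm: DimmInfo, nfit_cookie: int) -> str:
    """Decide how the OS should treat the labels on one DIMM.

    `nfit_cookie` stands for the value the OS derives from the NFIT
    for the interleave set this DIMM belongs to.
    """
    if not dimm.has_label_methods or not dimm.has_valid_index_block:
        # No usable labels: expose the raw region as one 1:1 namespace.
        return "label-less"
    if dimm.label_cookie != nfit_cookie:
        # Labels are stale, corrupted, or the interleave set changed.
        return "stale-labels"
    return "labelled"
```

Keeping the cookie as an input leaves the actual checksum computation out of the sketch, since only the comparison matters for the decision.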
* Re: Draft NVDIMM proposal 2018-05-15 18:06 ` Dan Williams @ 2018-05-15 18:33 ` Andrew Cooper 2018-05-15 18:33 ` Andrew Cooper ` (2 subsequent siblings) 3 siblings, 0 replies; 24+ messages in thread From: Andrew Cooper @ 2018-05-15 18:33 UTC (permalink / raw) To: Dan Williams, George Dunlap Cc: linux-nvdimm, Roger Pau Monne, Jan Beulich, Yi Zhang, xen-devel On 15/05/18 19:06, Dan Williams wrote: > On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote: >> On 05/11/2018 05:33 PM, Dan Williams wrote: >> >> This is all pretty foundational. Xen can read static ACPI tables, but >> it can't do AML. So to do a proper design for Xen, we need to know: > Oooh, ok, no AML in Xen... > >> 1. If Xen can find out, without Linux's help, what namespaces exist and >> if there is one it can use for its own purposes > Yeah, no, not without calling AML methods. One particularly thorny issue with Xen's architecture is the ownership of the ACPI OSPM, and the fact that there can only be one in the system. Dom0 has to be the OSPM in practice, as we don't want to port most of the Linux drivers and infrastructure into the hypervisor. If we knew a priori that certain AML methods had no side effects, then we could in principle execute them from the hypervisor, but this is an undecidable problem in general. As a result, everything involving AML requires dom0 to decipher the information and pass it to Xen at boot. ~Andrew
* Re: Draft NVDIMM proposal 2018-05-15 18:06 ` Dan Williams 2018-05-15 18:33 ` Andrew Cooper 2018-05-15 18:33 ` Andrew Cooper @ 2018-05-17 14:52 ` George Dunlap 2018-05-23 0:31 ` Dan Williams 2018-05-23 0:31 ` Dan Williams 2018-05-17 14:52 ` George Dunlap 3 siblings, 2 replies; 24+ messages in thread From: George Dunlap @ 2018-05-17 14:52 UTC (permalink / raw) To: Dan Williams Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne On 05/15/2018 07:06 PM, Dan Williams wrote: > On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote: >> So, who decides what this SPA range and interleave set is? Can the >> operating system change these interleave sets and mappings, or change >> data from PMEM to BLK, and if so, how? > > The interleave-set to SPA range association and delineation of > capacity between PMEM and BLK access modes is currently out-of-scope for > ACPI. The BIOS reports the configuration to the OS via the NFIT, but > the configuration is currently written by vendor specific tooling. > Longer term it would be great for this mechanism to become > standardized and available to the OS, but for now it requires platform > specific tooling to change the DIMM interleave configuration. OK -- I was sort of assuming that different hardware would have different drivers in Linux that ndctl knew how to drive (just like any other hardware with vendor-specific interfaces); but it sounds a bit more like at the moment it's binary blobs either in the BIOS/firmware, or a vendor-supplied tool. >> And so (here's another guess) -- when you're talking about namespaces >> and label areas, you're talking about namespaces stored *within a >> pre-existing SPA range*. You use the same format as described in the >> UEFI spec, but ignore all the stuff about interleave sets and whatever, >> and use system physical addresses relative to the SPA range rather than >> DPAs.
> > Well, we don't ignore it because we need to validate in the driver > that the interleave set configuration matches a checksum that we > generated when the namespace was first instantiated on the interleave > set. However, you are right, for accesses at run time all we care > about is the SPA for PMEM accesses. [snip] > They can change, but only under the control of the BIOS. All changes > to the interleave set configuration need a reboot because the memory > controller needs to be set up differently at system-init time. [snip] > No, the checksum I'm referring to is the interleave set cookie (see: > "SetCookie" in the UEFI 2.7 specification). It validates that the > interleave set backing the SPA has not changed configuration since the > last boot. [snip] > The NVDIMM just provides storage area for the OS to write opaque data > that just happens to conform to the UEFI Namespace label format. The > interleave-set configuration is stored in yet another out-of-band > location on the DIMM or on some platform-specific storage location and > is consulted / restored by the BIOS each boot. The NFIT is the output > from the platform specific physical mappings of the DIMMs, and > Namespaces are logical volumes built on top of those hard-defined NFIT > boundaries. OK, so what I'm hearing is: The label area isn't "within a pre-existing SPA range" as I was guessing (i.e., similar to a partition table residing within a disk); it is the per-DIMM label area as described by UEFI spec. But, the interleave set data in the label area doesn't *control* the hardware -- the NVDIMM controller / bios / firmware don't read it or do anything based on what's in it. 
Rather, the interleave set data in the label area is there to *record*, for the operating system's benefit, what the hardware configuration was when the labels were created, so that if it changes, the OS knows that the label area is invalid; it must either refrain from touching the NVRAM (if it wants to preserve the data), or write a new label area. The OS can also use labels to partition a single SPA range into several namespaces. It can't change the interleaving, but it can specify that [0-A) is one namespace, [A-B) is another namespace, &c; and these namespaces will naturally map into the SPA range advertised in the NFIT. And if a controller allows the same memory to be used either as PMEM or PBLK, it can write which *should* be used for which, and then can avoid accessing the same underlying NVRAM in two different ways (which will yield unpredictable results). That makes sense. >> If SPA regions don't change after boot, and if Xen can find its own >> Xen-specific namespace to use for the frame tables by reading the NFIT >> table, then that significantly reduces the amount of interaction it >> needs with Linux. >> >> If SPA regions *can* change after boot, and if Xen must rely on Linux to >> read labels and find out what it can safely use for frame tables, then >> it makes things significantly more involved. Not impossible by any >> means, but a lot more complicated. >> >> Hope all that makes sense -- thanks again for your help. > > I think it does, but it seems namespaces are out of reach for Xen > without some agent / enabling that can execute the necessary AML > methods. Sure, we're pretty much used to that. :-) We'll have Linux read the label area and tell Xen what it needs to know. But: * Xen can know the SPA ranges of all potential NVDIMMs before dom0 starts. So it can tell, for instance, if a page mapped by dom0 is inside an NVDIMM range, even if dom0 hasn't yet told it anything. 
* Linux doesn't actually need to map these NVDIMMs to read the label area and the NFIT and know where the PMEM namespaces live in system memory. With that sorted out, let me go back and see whether it makes sense to respond to your original response, or to write up a new design doc and send it out. Thanks for your help! -George
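Two points from this message lend themselves to a small sketch: checking whether a page lies inside an NVDIMM SPA range known from the NFIT, and carving one SPA range into several namespaces such as [0-A) and [A-B). The function names, the 4 KiB page size, and the tuple representation of ranges are assumptions made for the example; this is not Xen or Linux code.

```python
from typing import List, Tuple

PAGE_SHIFT = 12  # assumes 4 KiB pages

Range = Tuple[int, int]  # half-open interval [start, end) of system physical addresses


def page_in_nvdimm(pfn: int, spa_ranges: List[Range]) -> bool:
    """Return True if the page frame starts inside any NVDIMM SPA range."""
    addr = pfn << PAGE_SHIFT
    return any(start <= addr < end for start, end in spa_ranges)


def partition_range(start: int, end: int, cuts: List[int]) -> List[Range]:
    """Split [start, end) into namespaces at offsets relative to `start`.

    The interleaving is untouched; only the logical boundaries move.
    """
    points = [start] + [start + c for c in sorted(cuts)] + [end]
    if any(a >= b for a, b in zip(points, points[1:])):
        raise ValueError("cut points must lie strictly inside the range")
    return list(zip(points, points[1:]))
```

The range list would come from the static NFIT, so a check like the first one needs no help from dom0, whereas the partition boundaries live in the labels and would have to be supplied by Linux.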
* Re: Draft NVDIMM proposal 2018-05-17 14:52 ` George Dunlap @ 2018-05-23 0:31 ` Dan Williams 2018-05-23 0:31 ` Dan Williams 1 sibling, 0 replies; 24+ messages in thread From: Dan Williams @ 2018-05-23 0:31 UTC (permalink / raw) To: George Dunlap Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne On Thu, May 17, 2018 at 7:52 AM, George Dunlap <george.dunlap@citrix.com> wrote: > On 05/15/2018 07:06 PM, Dan Williams wrote: >> On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote: >>> So, who decides what this SPA range and interleave set is? Can the >>> operating system change these interleave sets and mappings, or change >>> data from PMEM to BLK, and if so, how? >> >> The interleave-set to SPA range association and delineation of >> capacity between PMEM and BLK access modes is currently out-of-scope for >> ACPI. The BIOS reports the configuration to the OS via the NFIT, but >> the configuration is currently written by vendor specific tooling. >> Longer term it would be great for this mechanism to become >> standardized and available to the OS, but for now it requires platform >> specific tooling to change the DIMM interleave configuration. > > OK -- I was sort of assuming that different hardware would have > different drivers in Linux that ndctl knew how to drive (just like any > other hardware with vendor-specific interfaces); That way potentially lies madness, at least for me as a Linux sub-system maintainer. There is no value for the kernel to help enable vendors to do the same thing in slightly different ways. libnvdimm + nfit is 100% an open standards driver and the hope is to be able to deprecate non-public vendor-specific support over time, and consolidate work-alike support from vendor specs into ACPI.
The public standards that the kernel enables are: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/ https://msdn.microsoft.com/library/windows/hardware/mt604741 > but it sounds a bit > more like at the moment it's binary blobs either in the BIOS/firmware, > or a vendor-supplied tool. Only for the functionality, like interleave set configuration, that is not defined in those standards. Even then the impact is only userspace tooling, not the kernel. Also, we are seeing that functionality bleed into the standards over time. For example label methods used to only exist in the Intel DSM document, but have now been standardized in ACPI 6.2. Firmware update which was a private interface has now graduated to the public Intel DSM document. Hopefully more and more functionality transitions into an ACPI definition over time. Any common functionality in those Intel, HPE, and MSFT command formats is comprehended / abstracted by the ndctl tool. > >>> And so (here's another guess) -- when you're talking about namespaces >>> and label areas, you're talking about namespaces stored *within a >>> pre-existing SPA range*. You use the same format as described in the >>> UEFI spec, but ignore all the stuff about interleave sets and whatever, >>> and use system physical addresses relative to the SPA range rather than >>> DPAs. >> >> Well, we don't ignore it because we need to validate in the driver >> that the interleave set configuration matches a checksum that we >> generated when the namespace was first instantiated on the interleave >> set. However, you are right, for accesses at run time all we care >> about is the SPA for PMEM accesses. > [snip] >> They can change, but only under the control of the BIOS. 
All changes >> to the interleave set configuration need a reboot because the memory >> controller needs to be set up differently at system-init time. > [snip] >> No, the checksum I'm referring to is the interleave set cookie (see: >> "SetCookie" in the UEFI 2.7 specification). It validates that the >> interleave set backing the SPA has not changed configuration since the >> last boot. > [snip] >> The NVDIMM just provides storage area for the OS to write opaque data >> that just happens to conform to the UEFI Namespace label format. The >> interleave-set configuration is stored in yet another out-of-band >> location on the DIMM or on some platform-specific storage location and >> is consulted / restored by the BIOS each boot. The NFIT is the output >> from the platform specific physical mappings of the DIMMs, and >> Namespaces are logical volumes built on top of those hard-defined NFIT >> boundaries. > > OK, so what I'm hearing is: > > The label area isn't "within a pre-existing SPA range" as I was guessing > (i.e., similar to a partition table residing within a disk); it is the > per-DIMM label area as described by UEFI spec. > > But, the interleave set data in the label area doesn't *control* the > hardware -- the NVDIMM controller / bios / firmware don't read it or do > anything based on what's in it. Rather, the interleave set data in the > label area is there to *record*, for the operating system's benefit, > what the hardware configuration was when the labels were created, so > that if it changes, the OS knows that the label area is invalid; it must > either refrain from touching the NVRAM (if it wants to preserve the > data), or write a new label area. > > The OS can also use labels to partition a single SPA range into several > namespaces. It can't change the interleaving, but it can specify that > [0-A) is one namespace, [A-B) is another namespace, &c; and these > namespaces will naturally map into the SPA range advertised in the NFIT. 
> > And if a controller allows the same memory to be used either as PMEM or > PBLK, it can write which *should* be used for which, and then can avoid > accessing the same underlying NVRAM in two different ways (which will > yield unpredictable results). > > That makes sense. You got it. > >>> If SPA regions don't change after boot, and if Xen can find its own >>> Xen-specific namespace to use for the frame tables by reading the NFIT >>> table, then that significantly reduces the amount of interaction it >>> needs with Linux. >>> >>> If SPA regions *can* change after boot, and if Xen must rely on Linux to >>> read labels and find out what it can safely use for frame tables, then >>> it makes things significantly more involved. Not impossible by any >>> means, but a lot more complicated. >>> >>> Hope all that makes sense -- thanks again for your help. >> >> I think it does, but it seems namespaces are out of reach for Xen >> without some agent / enabling that can execute the necessary AML >> methods. > > Sure, we're pretty much used to that. :-) We'll have Linux read the > label area and tell Xen what it needs to know. But: > > * Xen can know the SPA ranges of all potential NVDIMMs before dom0 > starts. So it can tell, for instance, if a page mapped by dom0 is > inside an NVDIMM range, even if dom0 hasn't yet told it anything. > > * Linux doesn't actually need to map these NVDIMMs to read the label > area and the NFIT and know where the PMEM namespaces live in system memory. Theoretically we could support a mode where dom0 Linux just parses namespaces, but never enables namespaces. That would be additional enabling on top of what we have today. It would be similar to what we do for "locked" DIMMs. > With that sorted out, let me go back and see whether it makes sense to > respond to your original response, or to write up a new design doc and > send it out. > > Thanks for your help! No problem. I had typed up this response earlier, but neglected to hit send. 
That is now remedied.
* Re: Draft NVDIMM proposal 2018-05-17 14:52 ` George Dunlap 2018-05-23 0:31 ` Dan Williams @ 2018-05-23 0:31 ` Dan Williams 1 sibling, 0 replies; 24+ messages in thread From: Dan Williams @ 2018-05-23 0:31 UTC (permalink / raw) To: George Dunlap Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne On Thu, May 17, 2018 at 7:52 AM, George Dunlap <george.dunlap@citrix.com> wrote: > On 05/15/2018 07:06 PM, Dan Williams wrote: >> On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote: >>> So, who decides what this SPA range and interleave set is? Can the >>> operating system change these interleave sets and mappings, or change >>> data from PMEM to BLK, and is so, how? >> >> The interleave-set to SPA range association and delineation of >> capacity between PMEM and BLK access modes is current out-of-scope for >> ACPI. The BIOS reports the configuration to the OS via the NFIT, but >> the configuration is currently written by vendor specific tooling. >> Longer term it would be great for this mechanism to become >> standardized and available to the OS, but for now it requires platform >> specific tooling to change the DIMM interleave configuration. > > OK -- I was sort of assuming that different hardware would have > different drivers in Linux that ndctl knew how to drive (just like any > other hardware with vendor-specific interfaces); That way potentially lies madness, at least for me as a Linux sub-system maintainer. There is no value for the kernel to help enable vendors to do the same thing slightly differently ways. libnvdimm + nfit is 100% an open standards driver and the hope is to be able to deprecate non-public vendor-specific support over time, and consolidate work-alike support from vendor specs into ACPI. 
The public standards that the kernel enables are: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/ https://msdn.microsoft.com/library/windows/hardware/mt604741 > but it sounds a bit > more like at the moment it's binary blobs either in the BIOS/firmware, > or a vendor-supplied tool. Only for the functionality, like interleave set configuration, that is not defined in those standards. Even then the impact is only userspace tooling, not the kernel. Also, we are seeing that functionality bleed into the standards over time. For example label methods used to only exist in the Intel DSM document, but have now been standardized in ACPI 6.2. Firmware update which was a private interface has now graduated to the public Intel DSM document. Hopefully more and more functionality transitions into an ACPI definition over time. Any common functionality in those Intel, HPE, and MSFT command formats is comprehended / abstracted by the ndctl tool. > >>> And so (here's another guess) -- when you're talking about namespaces >>> and label areas, you're talking about namespaces stored *within a >>> pre-existing SPA range*. You use the same format as described in the >>> UEFI spec, but ignore all the stuff about interleave sets and whatever, >>> and use system physical addresses relative to the SPA range rather than >>> DPAs. >> >> Well, we don't ignore it because we need to validate in the driver >> that the interleave set configuration matches a checksum that we >> generated when the namespace was first instantiated on the interleave >> set. However, you are right, for accesses at run time all we care >> about is the SPA for PMEM accesses. > [snip] >> They can change, but only under the control of the BIOS. 
>> All changes
>> to the interleave set configuration need a reboot because the memory
>> controller needs to be set up differently at system-init time.
> [snip]
>> No, the checksum I'm referring to is the interleave set cookie (see:
>> "SetCookie" in the UEFI 2.7 specification). It validates that the
>> interleave set backing the SPA has not changed configuration since the
>> last boot.
> [snip]
>> The NVDIMM just provides storage area for the OS to write opaque data
>> that just happens to conform to the UEFI Namespace label format. The
>> interleave-set configuration is stored in yet another out-of-band
>> location on the DIMM or on some platform-specific storage location and
>> is consulted / restored by the BIOS each boot. The NFIT is the output
>> from the platform specific physical mappings of the DIMMs, and
>> Namespaces are logical volumes built on top of those hard-defined NFIT
>> boundaries.
>
> OK, so what I'm hearing is:
>
> The label area isn't "within a pre-existing SPA range" as I was guessing
> (i.e., similar to a partition table residing within a disk); it is the
> per-DIMM label area as described by UEFI spec.
>
> But, the interleave set data in the label area doesn't *control* the
> hardware -- the NVDIMM controller / bios / firmware don't read it or do
> anything based on what's in it. Rather, the interleave set data in the
> label area is there to *record*, for the operating system's benefit,
> what the hardware configuration was when the labels were created, so
> that if it changes, the OS knows that the label area is invalid; it must
> either refrain from touching the NVRAM (if it wants to preserve the
> data), or write a new label area.
>
> The OS can also use labels to partition a single SPA range into several
> namespaces. It can't change the interleaving, but it can specify that
> [0-A) is one namespace, [A-B) is another namespace, &c; and these
> namespaces will naturally map into the SPA range advertised in the NFIT.
>
> And if a controller allows the same memory to be used either as PMEM or
> PBLK, it can write which *should* be used for which, and then can avoid
> accessing the same underlying NVRAM in two different ways (which will
> yield unpredictable results).
>
> That makes sense.

You got it.

>
>>> If SPA regions don't change after boot, and if Xen can find its own
>>> Xen-specific namespace to use for the frame tables by reading the NFIT
>>> table, then that significantly reduces the amount of interaction it
>>> needs with Linux.
>>>
>>> If SPA regions *can* change after boot, and if Xen must rely on Linux to
>>> read labels and find out what it can safely use for frame tables, then
>>> it makes things significantly more involved. Not impossible by any
>>> means, but a lot more complicated.
>>>
>>> Hope all that makes sense -- thanks again for your help.
>>
>> I think it does, but it seems namespaces are out of reach for Xen
>> without some agent / enabling that can execute the necessary AML
>> methods.
>
> Sure, we're pretty much used to that. :-) We'll have Linux read the
> label area and tell Xen what it needs to know. But:
>
> * Xen can know the SPA ranges of all potential NVDIMMs before dom0
> starts. So it can tell, for instance, if a page mapped by dom0 is
> inside an NVDIMM range, even if dom0 hasn't yet told it anything.
>
> * Linux doesn't actually need to map these NVDIMMs to read the label
> area and the NFIT and know where the PMEM namespaces live in system memory.

Theoretically we could support a mode where dom0 Linux just parses
namespaces, but never enables namespaces. That would be additional
enabling on top of what we have today. It would be similar to what we
do for "locked" DIMMs.

> With that sorted out, let me go back and see whether it makes sense to
> respond to your original response, or to write up a new design doc and
> send it out.
>
> Thanks for your help!

No problem. I had typed up this response earlier, but neglected to hit send.
That is now remedied.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

* Re: Draft NVDIMM proposal
From: George Dunlap @ 2018-05-17 14:52 UTC
To: Dan Williams
Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne

On 05/15/2018 07:06 PM, Dan Williams wrote:
> On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote:
>> So, who decides what this SPA range and interleave set is? Can the
>> operating system change these interleave sets and mappings, or change
>> data from PMEM to BLK, and if so, how?
>
> The interleave-set to SPA range association and delineation of
> capacity between PMEM and BLK access modes is currently out-of-scope for
> ACPI. The BIOS reports the configuration to the OS via the NFIT, but
> the configuration is currently written by vendor specific tooling.
> Longer term it would be great for this mechanism to become
> standardized and available to the OS, but for now it requires platform
> specific tooling to change the DIMM interleave configuration.

OK -- I was sort of assuming that different hardware would have
different drivers in Linux that ndctl knew how to drive (just like any
other hardware with vendor-specific interfaces); but it sounds a bit
more like at the moment it's binary blobs either in the BIOS/firmware,
or a vendor-supplied tool.

>> And so (here's another guess) -- when you're talking about namespaces
>> and label areas, you're talking about namespaces stored *within a
>> pre-existing SPA range*. You use the same format as described in the
>> UEFI spec, but ignore all the stuff about interleave sets and whatever,
>> and use system physical addresses relative to the SPA range rather than
>> DPAs.
>
> Well, we don't ignore it because we need to validate in the driver
> that the interleave set configuration matches a checksum that we
> generated when the namespace was first instantiated on the interleave
> set. However, you are right, for accesses at run time all we care
> about is the SPA for PMEM accesses.

[snip]

> They can change, but only under the control of the BIOS. All changes
> to the interleave set configuration need a reboot because the memory
> controller needs to be set up differently at system-init time.

[snip]

> No, the checksum I'm referring to is the interleave set cookie (see:
> "SetCookie" in the UEFI 2.7 specification). It validates that the
> interleave set backing the SPA has not changed configuration since the
> last boot.

[snip]

> The NVDIMM just provides storage area for the OS to write opaque data
> that just happens to conform to the UEFI Namespace label format. The
> interleave-set configuration is stored in yet another out-of-band
> location on the DIMM or on some platform-specific storage location and
> is consulted / restored by the BIOS each boot. The NFIT is the output
> from the platform specific physical mappings of the DIMMs, and
> Namespaces are logical volumes built on top of those hard-defined NFIT
> boundaries.

OK, so what I'm hearing is:

The label area isn't "within a pre-existing SPA range" as I was guessing
(i.e., similar to a partition table residing within a disk); it is the
per-DIMM label area as described by UEFI spec.

But, the interleave set data in the label area doesn't *control* the
hardware -- the NVDIMM controller / bios / firmware don't read it or do
anything based on what's in it.
Rather, the interleave set data in the
label area is there to *record*, for the operating system's benefit,
what the hardware configuration was when the labels were created, so
that if it changes, the OS knows that the label area is invalid; it must
either refrain from touching the NVRAM (if it wants to preserve the
data), or write a new label area.

The OS can also use labels to partition a single SPA range into several
namespaces. It can't change the interleaving, but it can specify that
[0-A) is one namespace, [A-B) is another namespace, &c; and these
namespaces will naturally map into the SPA range advertised in the NFIT.

And if a controller allows the same memory to be used either as PMEM or
PBLK, it can write which *should* be used for which, and then can avoid
accessing the same underlying NVRAM in two different ways (which will
yield unpredictable results).

That makes sense.

>> If SPA regions don't change after boot, and if Xen can find its own
>> Xen-specific namespace to use for the frame tables by reading the NFIT
>> table, then that significantly reduces the amount of interaction it
>> needs with Linux.
>>
>> If SPA regions *can* change after boot, and if Xen must rely on Linux to
>> read labels and find out what it can safely use for frame tables, then
>> it makes things significantly more involved. Not impossible by any
>> means, but a lot more complicated.
>>
>> Hope all that makes sense -- thanks again for your help.
>
> I think it does, but it seems namespaces are out of reach for Xen
> without some agent / enabling that can execute the necessary AML
> methods.

Sure, we're pretty much used to that. :-) We'll have Linux read the
label area and tell Xen what it needs to know. But:

* Xen can know the SPA ranges of all potential NVDIMMs before dom0
starts. So it can tell, for instance, if a page mapped by dom0 is
inside an NVDIMM range, even if dom0 hasn't yet told it anything.

* Linux doesn't actually need to map these NVDIMMs to read the label
area and the NFIT and know where the PMEM namespaces live in system memory.

With that sorted out, let me go back and see whether it makes sense to
respond to your original response, or to write up a new design doc and
send it out.

Thanks for your help!

-George

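The point that Xen can classify a page from the NFIT-advertised SPA ranges alone, before dom0 has told it anything, can be sketched as a simple range lookup. This is only an illustration in Python, not Xen code: `SpaRange`, `build_index` and `in_nvdimm_range` are hypothetical names, the ranges are assumed not to overlap, and in practice they would come from parsing the static NFIT table.

```python
from bisect import bisect_right
from typing import NamedTuple


class SpaRange(NamedTuple):
    """One NVDIMM SPA range (hypothetical representation of an NFIT entry)."""
    start: int    # first system physical address in the range
    length: int   # size of the range in bytes


def build_index(ranges):
    """Sort the ranges once so that lookups can use binary search."""
    ordered = sorted(ranges)
    return ordered, [r.start for r in ordered]


def in_nvdimm_range(addr, index):
    """Return True if addr falls inside any of the indexed SPA ranges."""
    ordered, starts = index
    i = bisect_right(starts, addr) - 1   # last range starting at or below addr
    return i >= 0 and addr < ordered[i].start + ordered[i].length
```

Building the index once at boot fits the observation that SPA ranges do not change until the next reboot.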
* Re: Draft NVDIMM proposal
From: Dan Williams @ 2018-05-15 18:06 UTC
To: George Dunlap
Cc: linux-nvdimm, Andrew Cooper, xen-devel, Jan Beulich, Yi Zhang, Roger Pau Monne

On Tue, May 15, 2018 at 7:19 AM, George Dunlap <george.dunlap@citrix.com> wrote:
> On 05/11/2018 05:33 PM, Dan Williams wrote:
>> [ adding linux-nvdimm ]
>>
>> Great write up! Some comments below...
>
> Thanks for the quick response!
>
> It seems I still have some fundamental misconceptions about what's going
> on, so I'd better start with that. :-)
>
> Here's the part that I'm having a hard time getting.
>
> If actual data on the NVDIMMs is a noun, and the act of writing is a
> verb, then the SPA and interleave sets are adverbs: they define *how*
> the write happens. When the processor says, "Write to address X", the
> memory controller converts address X into a <dimm number, dimm-physical
> address> tuple to actually write the data.
>
> So, who decides what this SPA range and interleave set is? Can the
> operating system change these interleave sets and mappings, or change
> data from PMEM to BLK, and if so, how?

The interleave-set to SPA range association and delineation of
capacity between PMEM and BLK access modes is currently out-of-scope for
ACPI. The BIOS reports the configuration to the OS via the NFIT, but
the configuration is currently written by vendor specific tooling.
Longer term it would be great for this mechanism to become
standardized and available to the OS, but for now it requires platform
specific tooling to change the DIMM interleave configuration.

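To make the "address X into a <dimm number, dimm-physical address> tuple" conversion concrete, here is a minimal sketch in Python. It assumes a simple round-robin interleave with a fixed granularity, which is a toy model only: the function name `decode_spa` and the parameters `spa_base`, `ways`, `granularity` and `dpa_base` are hypothetical, and the real mapping is whatever the platform firmware programs into the memory controller.

```python
def decode_spa(spa, spa_base, ways, granularity=4096, dpa_base=0):
    """Map a system physical address to (dimm_index, dpa) in a toy interleave set.

    Assumes consecutive `granularity`-sized chunks of the SPA range are
    handed out to the DIMMs in round-robin order (an assumption, not a spec).
    """
    offset = spa - spa_base
    if offset < 0:
        raise ValueError("address below the start of the SPA range")
    chunk, within = divmod(offset, granularity)   # which chunk, and byte within it
    stripe, dimm_index = divmod(chunk, ways)      # which stripe, and which DIMM
    dpa = dpa_base + stripe * granularity + within
    return dimm_index, dpa
```

With a 4-way set, the second 4 KiB chunk of the range lands at the start of DIMM 1, and the fifth chunk wraps around to the second stripe on DIMM 0.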
> If you read through section 13.19 of the UEFI manual, it seems to imply
> that this is determined by the label area -- that each DIMM has a
> separate label area describing regions local to that DIMM; and that if
> you have 4 DIMMs you'll have 4 label areas, and each label area will
> have a label describing the DPA region on that DIMM which corresponds to
> the interleave set. And somehow someone sets up the interleave sets and
> SPA based on what's written there.
>
> Which would mean that an operating system could change how the
> interleave sets work by rewriting the various labels on the DIMMs; for
> instance, changing a single 4-way set spanning the entirety of 4 DIMMs,
> to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning
> half of 2 DIMMs each.

If a DIMM supports both the PMEM and BLK mechanisms for accessing the
same DPA, then the label breaks the disambiguation and tells the OS to
enforce one access mechanism per DPA at a time. Otherwise the OS has
no ability to affect the interleave-set configuration, it's all
initialized by platform BIOS/firmware before the OS boots.

>
> But then you say:
>
>> Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs
>> provide a "label area" which is an out-of-band non-volatile memory
>> area where the OS can store whatever it likes. The UEFI 2.7
>> specification defines a data format for the definition of namespaces
>> on top of persistent memory ranges advertised to the OS via the ACPI
>> NFIT structure.
>
> OK, so that sounds like no, that's not what happens. So where do the
> SPA range and interleave sets come from?
>
> Random guess: The BIOS / firmware makes it up. Either it's hard-coded,
> or there's some menu in the BIOS you can use to change things around;
> but once it hits the operating system, that's it -- the mapping of SPA
> range onto interleave sets onto DIMMs is, from the operating system's
> point of view, fixed.

Correct.

> And so (here's another guess) -- when you're talking about namespaces
> and label areas, you're talking about namespaces stored *within a
> pre-existing SPA range*. You use the same format as described in the
> UEFI spec, but ignore all the stuff about interleave sets and whatever,
> and use system physical addresses relative to the SPA range rather than
> DPAs.

Well, we don't ignore it because we need to validate in the driver
that the interleave set configuration matches a checksum that we
generated when the namespace was first instantiated on the interleave
set. However, you are right, for accesses at run time all we care
about is the SPA for PMEM accesses.

>
> Is that right?
>
> But then there's things like this:
>
>> There is no obligation for an NVDIMM to provide a label area, and as
>> far as I know all NVDIMMs on the market today do not provide a label
>> area.
> [snip]
>> Linux supports "label-less" mode where it exposes
>> the raw capacity of a region in 1:1 mapped namespace without a label.
>> This is how Linux supports "legacy" NVDIMMs that do not support
>> labels.
>
> So are "all NVDIMMs on the market today" then classed as "legacy"
> NVDIMMs because they don't support labels? And if labels are simply the
> NVDIMM equivalent of a partition table, then what does it mean to
> "support" or "not support" labels?

Yes, the term "legacy" has been thrown around for NVDIMMs that do not
support labels. The way this support is determined is whether the
platform publishes the _LSI, _LSR, and _LSW methods in ACPI (see:
6.5.10 NVDIMM Label Methods in ACPI 6.2a). I.e. each DIMM is
represented by an ACPI device object, and we query those objects for
these named methods. When the methods are missing *or* there is no
initialized namespace index block found on the DIMMs, Linux will fall
back to the "label-less" mode.

>
> And then there's this:
>
>> In any
>> event we do the DIMM to SPA association first before reading labels.
>> The OS calculates a so called "Interleave Set Cookie" from the NFIT
>> information to compare against a similar value stored in the labels.
>> This lets the OS determine that the Interleave Set composition has not
>> changed from when the labels were initially written. An Interleave Set
>> Cookie mismatch indicates the labels are stale, corrupted, or that the
>> physical composition of the Interleave Set has changed.
>
> So wait, the SPA and interleave sets can actually change? And the
> labels which the OS reads actually are per-DIMM, and do control somehow
> how the DPA ranges of individual DIMMs are mapped into interleave sets
> and exposed as SPAs? (And perhaps, can be changed by the operating system?)

They can change, but only under the control of the BIOS. All changes
to the interleave set configuration need a reboot because the memory
controller needs to be set up differently at system-init time.

>
> And:
>
>> There are checksums in the Namespace definition to account for label
>> validity. Starting with ACPI 6.2 DSMs for labels are deprecated in
>> favor of the new / named methods for label access _LSI, _LSR, and
>> _LSW.
>
> Does this mean the methods will use checksums to verify writes to the
> label area, and refuse writes which create invalid labels?

No, the checksum I'm referring to is the interleave set cookie (see:
"SetCookie" in the UEFI 2.7 specification). It validates that the
interleave set backing the SPA has not changed configuration since the
last boot.

>
> If all of the above is true, then in what way can it be said that
> "NVDIMM has no concept of namespaces", that an OS can "store whatever it
> likes" in the label area, and that UEFI namespaces are "on top of
> persistent memory ranges advertised to the OS via the ACPI NFIT structure"?

The NVDIMM just provides storage area for the OS to write opaque data
that just happens to conform to the UEFI Namespace label format.
The interleave-set configuration is stored in yet another out-of-band
location on the DIMM or on some platform-specific storage location and
is consulted / restored by the BIOS each boot. The NFIT is the output
from the platform specific physical mappings of the DIMMs, and
Namespaces are logical volumes built on top of those hard-defined NFIT
boundaries.

>
> I'm sorry if this is obvious, but I am exactly as confused as I was
> before I started writing this. :-)
>
> This is all pretty foundational. Xen can read static ACPI tables, but
> it can't do AML. So to do a proper design for Xen, we need to know:

Oooh, ok, no AML in Xen...

> 1. If Xen can find out, without Linux's help, what namespaces exist and
> if there is one it can use for its own purposes

Yeah, no, not without calling AML methods.

> 2. If the SPA regions can change at runtime.

Nope, these are statically defined and can only change at reboot, if
at all. A likely scenario is that an OEM ships the DIMMs already
configured in an interleave-set and, barring component failure,
nothing changes for the life of the platform.

> If SPA regions don't change after boot, and if Xen can find its own
> Xen-specific namespace to use for the frame tables by reading the NFIT
> table, then that significantly reduces the amount of interaction it
> needs with Linux.
>
> If SPA regions *can* change after boot, and if Xen must rely on Linux to
> read labels and find out what it can safely use for frame tables, then
> it makes things significantly more involved. Not impossible by any
> means, but a lot more complicated.
>
> Hope all that makes sense -- thanks again for your help.

I think it does, but it seems namespaces are out of reach for Xen
without some agent / enabling that can execute the necessary AML
methods.

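The interleave set cookie check discussed in this message can be sketched as follows, in Python. This is an illustration under stated assumptions, not the actual algorithm: the real cookie is defined by the UEFI specification ("SetCookie") and derived from NFIT information, whereas here a Fletcher-style 64-bit checksum over hypothetical (region_offset, serial_number) pairs stands in for it. The names `fletcher64`, `interleave_set_cookie` and `labels_usable` are likewise made up for the example.

```python
import struct


def fletcher64(data):
    """Fletcher-style checksum over little-endian 32-bit words (stand-in only)."""
    lo = hi = 0
    for (word,) in struct.iter_unpack("<I", data):
        lo = (lo + word) & 0xFFFFFFFF
        hi = (hi + lo) & 0xFFFFFFFF
    return (hi << 32) | lo


def interleave_set_cookie(members):
    """Compute a cookie from (region_offset, serial_number) pairs, one per DIMM.

    The record layout is an assumption made for this sketch: a 64-bit
    offset, a 32-bit serial number, and 4 bytes of padding per DIMM.
    Members are sorted so the result does not depend on discovery order.
    """
    blob = b"".join(struct.pack("<QI4x", off, serial)
                    for off, serial in sorted(members))
    return fletcher64(blob)


def labels_usable(members_from_nfit, cookie_in_label):
    """Labels are trusted only if the set still matches what they recorded."""
    return interleave_set_cookie(members_from_nfit) == cookie_in_label
```

A mismatch here corresponds to the case where the labels are stale or corrupted, or where the physical composition of the interleave set has changed since the labels were written.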