Linux-NVME Archive on lore.kernel.org
* [PATCH v2 00/22] add Object Storage Media Pool (mpool)
@ 2020-10-12 16:27 Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
                   ` (22 more replies)
  0 siblings, 23 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This patch series introduces the mpool object storage media pool driver.
Mpool implements a simple transactional object store on top of block
storage devices.

Mpool was developed for the Heterogeneous-Memory Storage Engine (HSE)
project, which is a high-performance key-value storage engine designed
for SSDs. HSE stores its data exclusively in mpool.

Mpool is readily applicable to other storage systems built on immutable
objects. For example, it suits the many databases that store records in
immutable SSTables organized as an LSM-tree or similar data structure.

We developed mpool for HSE storage, versus using a file system or raw
block device, for several reasons.

A primary motivator was the need for a storage model that maps naturally
to conventional block storage devices, as well as to emerging device
interfaces we plan to support in the future, such as
* NVMe Zoned Namespaces (ZNS)
* NVMe Streams
* Persistent memory accessed via CXL or similar technologies

Another motivator was the need for a storage model that readily supports
multiple classes of storage devices or media in a single storage pool,
such as
* QLC SSDs for storing the bulk of objects, and
* 3DXP SSDs or persistent memory for storing objects requiring
  low-latency access

The mpool object storage model meets these needs. It also provides
other features that benefit storage systems built on immutable objects,
including
* Facilities to memory-map a specified collection of objects into a
  linear address space
* Concurrent access to object data, both direct and memory-mapped, to
  greatly reduce page cache pollution from background operations such
  as LSM-tree compaction
* Proactive eviction of object data from the page cache, based on
  object-level metrics, to avoid excessive memory pressure and its
  associated performance impacts
* High concurrency and short code paths for efficient access to
  low-latency storage devices

HSE takes advantage of all these mpool features to achieve high
throughput with low tail-latencies.

Mpool is implemented as a character device driver where
* /dev/mpoolctl is the control file (minor number 0) supporting mpool
  management ioctls
* /dev/mpool/<mpool-name> are mpool files (minor numbers >0), one per
  mpool, supporting object management ioctls

CLI/UAPI access to /dev/mpoolctl and /dev/mpool/<mpool-name> is
controlled by their UID, GID, and mode bits. To provide a familiar look
and feel, the mpool management model and CLI are intentionally aligned
with those of LVM to the degree practical.

An mpool is created with a block storage device specified for its
required capacity media class, and optionally a second block storage
device specified for its staging media class. We recommend virtual
block devices (such as LVM logical volumes) to aggregate the performance
and capacity of multiple physical block devices, to enable sharing of
physical block devices between mpools (or for other uses), and to
support extending the size of a block device used for an mpool media
class. The libblkid library recognizes mpool formatted block devices as
of util-linux v2.32.

Mpool implements a transactional object store with two simple object
abstractions: mblocks and mlogs.

Mblock objects are containers comprising a linear sequence of bytes that
can be written exactly once, are immutable after writing, and can be
read in whole or in part as needed until deleted. Mblocks in a media
class are currently fixed size, which is configured when an mpool is
created, though the amount of data written to mblocks will differ.

Mlog objects are containers for record logging. Records of arbitrary
size can be appended to an mlog until it is full. Once full, an mlog
must be erased before additional records can be appended. Mlog records
can be read sequentially from the beginning at any time. Mlogs in a
media class are always a multiple of the mblock size for that media
class.

Mblock and mlog writes avoid the page cache. Mblocks are written,
committed, and made immutable before they can be read either directly
(avoiding the page cache) or mmaped. Mlogs are always read and updated
directly (avoiding the page cache) and cannot be mmaped.

Mpool also provides the metadata container (MDC) APIs that clients can
use to simplify storing and maintaining metadata. These MDC APIs are
helper functions built on a pair of mlogs per MDC.

The mpool Wiki contains full details on the
* Management model in the "Configure mpools" section
* Object model in the "Develop mpool Applications" section
* Kernel module architecture in the "Explore mpool Internals" section,
  which provides context for reviewing this patch series

See https://github.com/hse-project/mpool/wiki

The mpool UAPI and kernel module (not the patchset) are available on
GitHub at:

https://github.com/hse-project/mpool

https://github.com/hse-project/mpool-kmod

The HSE key-value storage engine is available on GitHub at:

https://github.com/hse-project/hse

Changes in v2:

- Fixes build errors/warnings reported by the kernel test robot on ARCH=m68k
Reported-by: kernel test robot <lkp@intel.com>

- Addresses review comments from Randy and Hillf:
  * Updates ioctl-number.rst file with mpool driver's ioctl code
  * Fixes issues in the usage of printk_timed_ratelimit()

- Fixes a readahead issue found by internal testing

Nabeel M Mohamed (22):
  mpool: add utility routines and ioctl definitions
  mpool: add in-memory struct definitions
  mpool: add on-media struct definitions
  mpool: add pool drive component which handles mpool IO using the block
    layer API
  mpool: add space map component which manages free space on mpool
    devices
  mpool: add on-media pack, unpack and upgrade routines
  mpool: add superblock management routines
  mpool: add pool metadata routines to manage object lifecycle and IO
  mpool: add mblock lifecycle management and IO routines
  mpool: add mlog IO utility routines
  mpool: add mlog lifecycle management and IO routines
  mpool: add metadata container or mlog-pair framework
  mpool: add utility routines for mpool lifecycle management
  mpool: add pool metadata routines to create persistent mpools
  mpool: add mpool lifecycle management routines
  mpool: add mpool control plane utility routines
  mpool: add mpool lifecycle management ioctls
  mpool: add object lifecycle management ioctls
  mpool: add support to mmap arbitrary collection of mblocks
  mpool: add support to proactively evict cached mblock data from the
    page-cache
  mpool: add documentation
  mpool: add Kconfig and Makefile

 .../userspace-api/ioctl/ioctl-number.rst      |    3 +-
 drivers/Kconfig                               |    2 +
 drivers/Makefile                              |    1 +
 drivers/mpool/Kconfig                         |   28 +
 drivers/mpool/Makefile                        |   11 +
 drivers/mpool/assert.h                        |   25 +
 drivers/mpool/init.c                          |  126 +
 drivers/mpool/init.h                          |   17 +
 drivers/mpool/mblock.c                        |  432 +++
 drivers/mpool/mblock.h                        |  161 +
 drivers/mpool/mcache.c                        | 1072 +++++++
 drivers/mpool/mcache.h                        |  104 +
 drivers/mpool/mclass.c                        |  103 +
 drivers/mpool/mclass.h                        |  137 +
 drivers/mpool/mdc.c                           |  486 +++
 drivers/mpool/mdc.h                           |  106 +
 drivers/mpool/mlog.c                          | 1667 ++++++++++
 drivers/mpool/mlog.h                          |  212 ++
 drivers/mpool/mlog_utils.c                    | 1352 ++++++++
 drivers/mpool/mlog_utils.h                    |   63 +
 drivers/mpool/mp.c                            | 1086 +++++++
 drivers/mpool/mp.h                            |  231 ++
 drivers/mpool/mpcore.c                        |  987 ++++++
 drivers/mpool/mpcore.h                        |  354 +++
 drivers/mpool/mpctl.c                         | 2747 +++++++++++++++++
 drivers/mpool/mpctl.h                         |   58 +
 drivers/mpool/mpool-locking.rst               |   90 +
 drivers/mpool/mpool_ioctl.h                   |  636 ++++
 drivers/mpool/mpool_printk.h                  |   43 +
 drivers/mpool/omf.c                           | 1316 ++++++++
 drivers/mpool/omf.h                           |  593 ++++
 drivers/mpool/omf_if.h                        |  381 +++
 drivers/mpool/params.h                        |  116 +
 drivers/mpool/pd.c                            |  424 +++
 drivers/mpool/pd.h                            |  202 ++
 drivers/mpool/pmd.c                           | 2046 ++++++++++++
 drivers/mpool/pmd.h                           |  379 +++
 drivers/mpool/pmd_obj.c                       | 1569 ++++++++++
 drivers/mpool/pmd_obj.h                       |  499 +++
 drivers/mpool/reaper.c                        |  686 ++++
 drivers/mpool/reaper.h                        |   71 +
 drivers/mpool/sb.c                            |  625 ++++
 drivers/mpool/sb.h                            |  162 +
 drivers/mpool/smap.c                          | 1031 +++++++
 drivers/mpool/smap.h                          |  334 ++
 drivers/mpool/sysfs.c                         |   48 +
 drivers/mpool/sysfs.h                         |   48 +
 drivers/mpool/upgrade.c                       |  138 +
 drivers/mpool/upgrade.h                       |  128 +
 drivers/mpool/uuid.h                          |   59 +
 50 files changed, 23194 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mpool/Kconfig
 create mode 100644 drivers/mpool/Makefile
 create mode 100644 drivers/mpool/assert.h
 create mode 100644 drivers/mpool/init.c
 create mode 100644 drivers/mpool/init.h
 create mode 100644 drivers/mpool/mblock.c
 create mode 100644 drivers/mpool/mblock.h
 create mode 100644 drivers/mpool/mcache.c
 create mode 100644 drivers/mpool/mcache.h
 create mode 100644 drivers/mpool/mclass.c
 create mode 100644 drivers/mpool/mclass.h
 create mode 100644 drivers/mpool/mdc.c
 create mode 100644 drivers/mpool/mdc.h
 create mode 100644 drivers/mpool/mlog.c
 create mode 100644 drivers/mpool/mlog.h
 create mode 100644 drivers/mpool/mlog_utils.c
 create mode 100644 drivers/mpool/mlog_utils.h
 create mode 100644 drivers/mpool/mp.c
 create mode 100644 drivers/mpool/mp.h
 create mode 100644 drivers/mpool/mpcore.c
 create mode 100644 drivers/mpool/mpcore.h
 create mode 100644 drivers/mpool/mpctl.c
 create mode 100644 drivers/mpool/mpctl.h
 create mode 100644 drivers/mpool/mpool-locking.rst
 create mode 100644 drivers/mpool/mpool_ioctl.h
 create mode 100644 drivers/mpool/mpool_printk.h
 create mode 100644 drivers/mpool/omf.c
 create mode 100644 drivers/mpool/omf.h
 create mode 100644 drivers/mpool/omf_if.h
 create mode 100644 drivers/mpool/params.h
 create mode 100644 drivers/mpool/pd.c
 create mode 100644 drivers/mpool/pd.h
 create mode 100644 drivers/mpool/pmd.c
 create mode 100644 drivers/mpool/pmd.h
 create mode 100644 drivers/mpool/pmd_obj.c
 create mode 100644 drivers/mpool/pmd_obj.h
 create mode 100644 drivers/mpool/reaper.c
 create mode 100644 drivers/mpool/reaper.h
 create mode 100644 drivers/mpool/sb.c
 create mode 100644 drivers/mpool/sb.h
 create mode 100644 drivers/mpool/smap.c
 create mode 100644 drivers/mpool/smap.h
 create mode 100644 drivers/mpool/sysfs.c
 create mode 100644 drivers/mpool/sysfs.h
 create mode 100644 drivers/mpool/upgrade.c
 create mode 100644 drivers/mpool/upgrade.h
 create mode 100644 drivers/mpool/uuid.h

-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH v2 01/22] mpool: add utility routines and ioctl definitions
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:45   ` Randy Dunlap
  2020-10-12 16:27 ` [PATCH v2 02/22] mpool: add in-memory struct definitions Nabeel M Mohamed
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds structures used by mpool ioctls and utility routines
for logging, UUID management etc.

The mpool ioctls can be categorized as follows:

1. IOCTLs issued to the mpool control device (/dev/mpoolctl)
   - Mpool life cycle management (MPIOC_MP_*)

2. IOCTLs issued to the mpool device (/dev/mpool/<mpool-name>)
   - Mpool parameters (MPIOC_PARAMS_*)
   - Mpool properties (MPIOC_PROP_*)
   - Mpool media class management (MPIOC_MP_MCLASS_*)
   - Device management (MPIOC_DRV_*)
   - Mblock object life cycle management and IO (MPIOC_MB_*)
   - Mlog object life cycle management and IO (MPIOC_MLOG_*)
   - Mblock cache management (MPIOC_VMA_*)

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/assert.h       |  25 ++
 drivers/mpool/init.c         |  22 ++
 drivers/mpool/mpool_ioctl.h  | 636 +++++++++++++++++++++++++++++++++++
 drivers/mpool/mpool_printk.h |  43 +++
 drivers/mpool/uuid.h         |  59 ++++
 5 files changed, 785 insertions(+)
 create mode 100644 drivers/mpool/assert.h
 create mode 100644 drivers/mpool/init.c
 create mode 100644 drivers/mpool/mpool_ioctl.h
 create mode 100644 drivers/mpool/mpool_printk.h
 create mode 100644 drivers/mpool/uuid.h

diff --git a/drivers/mpool/assert.h b/drivers/mpool/assert.h
new file mode 100644
index 000000000000..a2081e71ec93
--- /dev/null
+++ b/drivers/mpool/assert.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_ASSERT_H
+#define MPOOL_ASSERT_H
+
+#include <linux/bug.h>
+
+#ifdef CONFIG_MPOOL_ASSERT
+__cold __noreturn
+static inline void assertfail(const char *expr, const char *file, int line)
+{
+	pr_err("mpool assertion failed: %s in %s:%d\n", expr, file, line);
+	BUG();
+}
+
+#define ASSERT(_expr)   (likely(_expr) ? (void)0 : assertfail(#_expr, __FILE__, __LINE__))
+
+#else
+#define ASSERT(_expr)   (void)(_expr)
+#endif
+
+#endif /* MPOOL_ASSERT_H */
diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
new file mode 100644
index 000000000000..0493fb5b1157
--- /dev/null
+++ b/drivers/mpool/init.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/module.h>
+
+static int __init mpool_init(void)
+{
+	return 0;
+}
+
+static void __exit mpool_exit(void)
+{
+}
+
+module_init(mpool_init);
+module_exit(mpool_exit);
+
+MODULE_DESCRIPTION("Object Storage Media Pool (mpool)");
+MODULE_AUTHOR("Micron Technology, Inc.");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/mpool/mpool_ioctl.h b/drivers/mpool/mpool_ioctl.h
new file mode 100644
index 000000000000..599da0618a09
--- /dev/null
+++ b/drivers/mpool/mpool_ioctl.h
@@ -0,0 +1,636 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_IOCTL_H
+#define MPOOL_IOCTL_H
+
+#include <linux/uuid.h>
+#include <linux/uio.h>
+
+#ifndef __user
+#define __user
+#endif
+
+/*
+ * Maximum name lengths including NUL terminator.  Note that the maximum
+ * mpool name length is baked into libblkid and may not be changed here.
+ */
+#define MPOOL_NAMESZ_MAX            32
+#define MPOOL_LABELSZ_MAX           64
+#define PD_NAMESZ_MAX               32
+
+#define MPC_DEV_SUBDIR              "mpool"
+#define MPC_DEV_CTLNAME             MPC_DEV_SUBDIR "ctl"
+#define MPC_DEV_CTLPATH             "/dev/" MPC_DEV_CTLNAME
+
+#define MPOOL_LABEL_INVALID         ""
+#define MPOOL_LABEL_DEFAULT         "raw"
+
+#define MPOOL_RA_PAGES_INVALID      U32_MAX
+#define MPOOL_RA_PAGES_MAX          ((128 * 1024) / PAGE_SIZE)
+
+#define MPOOL_MCLASS_INVALID        MP_MED_INVALID
+#define MPOOL_MCLASS_DEFAULT        MP_MED_CAPACITY
+
+#define MPOOL_SPARES_INVALID        U8_MAX
+#define MPOOL_SPARES_DEFAULT        5
+
+#define MPOOL_ROOT_LOG_CAP          (8 * 1024 * 1024)
+
+#define MPOOL_MBSIZE_MB_DEFAULT     32
+
+#define MPOOL_MDCNUM_DEFAULT        16
+
+
+/**
+ * mp_mgmt_flags - Mpool Management Flags
+ * @MP_FLAGS_FORCE:
+ * @MP_FLAGS_PERMIT_META_CONV: permit mpool metadata conversion. That is, allow the
+ *	mpool activate to write back the mpool metadata to the latest version
+ *	used by the binary activating the mpool.
+ * @MP_FLAGS_RESIZE: Resize mpool
+ */
+enum mp_mgmt_flags {
+	MP_FLAGS_FORCE,
+	MP_FLAGS_PERMIT_META_CONV,
+	MP_FLAGS_RESIZE,
+};
+
+/**
+ * enum mp_media_classp - Media classes
+ *
+ * @MP_MED_STAGING:  Initial data ingest, hot data storage, or similar.
+ * @MP_MED_CAPACITY: Primary data storage, cold data, or similar.
+ */
+enum mp_media_classp {
+	MP_MED_STAGING   = 0,
+	MP_MED_CAPACITY  = 1,
+};
+
+#define MP_MED_BASE        MP_MED_STAGING
+#define MP_MED_NUMBER      (MP_MED_CAPACITY + 1)
+#define MP_MED_INVALID     U8_MAX
+
+/**
+ * struct mpool_devprops -
+ * @pdp_devid:   UUID for drive
+ * @pdp_mclassp: enum mp_media_classp
+ * @pdp_status:  enum pd_status
+ * @pdp_total:   raw capacity of drive
+ * @pdp_avail:   available capacity (total - bad zones) of drive
+ * @pdp_spare:   spare capacity of drive
+ * @pdp_fspare:  free spare capacity of drive
+ * @pdp_usable:  usable capacity of drive
+ * @pdp_fusable: free usable capacity of drive
+ * @pdp_used:    used capacity of drive:
+ */
+struct mpool_devprops {
+	uuid_le    pdp_devid;
+	uint8_t    pdp_mclassp;
+	uint8_t    pdp_status;
+	uint8_t    pdp_rsvd1[6];
+	uint64_t   pdp_total;
+	uint64_t   pdp_avail;
+	uint64_t   pdp_spare;
+	uint64_t   pdp_fspare;
+	uint64_t   pdp_usable;
+	uint64_t   pdp_fusable;
+	uint64_t   pdp_used;
+	uint64_t   pdp_rsvd2;
+};
+
+/**
+ * struct mpool_params -
+ * @mp_poolid:          UUID of mpool
+ * @mp_type:            user-specified identifier
+ * @mp_uid:
+ * @mp_gid:
+ * @mp_mode:
+ * @mp_stat:            overall mpool status (enum mpool_status)
+ * @mp_mdc_captgt:      user MDC capacity
+ * @mp_oidv:            user MDC OIDs
+ * @mp_ra_pages_max:    max VMA map readahead pages
+ * @mp_vma_size_max:    max VMA map size (log2)
+ * @mp_mblocksz:        mblock size by media class (MiB)
+ * @mp_utype:           user-defined type
+ * @mp_label:           user specified label
+ * @mp_name:            mpool name (2x for planned expansion)
+ */
+struct mpool_params {
+	uuid_le     mp_poolid;
+	uid_t       mp_uid;
+	gid_t       mp_gid;
+	mode_t      mp_mode;
+	uint8_t     mp_stat;
+	uint8_t     mp_spare_cap;
+	uint8_t     mp_spare_stg;
+	uint8_t     mp_rsvd0;
+	uint64_t    mp_mdc_captgt;
+	uint64_t    mp_oidv[2];
+	uint32_t    mp_ra_pages_max;
+	uint32_t    mp_vma_size_max;
+	uint32_t    mp_mblocksz[MP_MED_NUMBER];
+	uint16_t    mp_mdc0cap;
+	uint16_t    mp_mdcncap;
+	uint16_t    mp_mdcnum;
+	uint16_t    mp_rsvd1;
+	uint32_t    mp_rsvd2;
+	uint64_t    mp_rsvd3;
+	uint64_t    mp_rsvd4;
+	uuid_le     mp_utype;
+	char        mp_label[MPOOL_LABELSZ_MAX];
+	char        mp_name[MPOOL_NAMESZ_MAX * 2];
+};
+
+/**
+ * struct mpool_usage - in bytes
+ * @mpu_total:   raw capacity for all drives
+ * @mpu_usable:  usable capacity for all drives
+ * @mpu_fusable: free usable capacity for all drives
+ * @mpu_used:    used capacity for all drives; possible for
+ *               used > usable when fusable=0; see smap
+ *               module for details
+ * @mpu_spare:   total spare space
+ * @mpu_fspare:  free spare space
+ *
+ * @mpu_mblock_alen: mblock allocated length
+ * @mpu_mblock_wlen: mblock written length
+ * @mpu_mlog_alen:   mlog allocated length
+ * @mpu_mblock_cnt:  number of active mblocks
+ * @mpu_mlog_cnt:    number of active mlogs
+ */
+struct mpool_usage {
+	uint64_t   mpu_total;
+	uint64_t   mpu_usable;
+	uint64_t   mpu_fusable;
+	uint64_t   mpu_used;
+	uint64_t   mpu_spare;
+	uint64_t   mpu_fspare;
+
+	uint64_t   mpu_alen;
+	uint64_t   mpu_wlen;
+	uint64_t   mpu_mblock_alen;
+	uint64_t   mpu_mblock_wlen;
+	uint64_t   mpu_mlog_alen;
+	uint32_t   mpu_mblock_cnt;
+	uint32_t   mpu_mlog_cnt;
+};
+
+/**
+ * struct mpool_mclass_xprops -
+ * @mc_devtype: type of devices in the media class
+ *                  (enum pd_devtype)
+ * @mc_mclass: media class (enum mp_media_classp)
+ * @mc_sectorsz: sector size
+ * @mc_spare: percent spare zones for drives
+ * @mc_uacnt: UNAVAIL status drive count
+ * @mc_zonepg: pages per zone
+ * @mc_features: feature bitmask
+ * @mc_usage: usage statistics
+ */
+struct mpool_mclass_xprops {
+	uint8_t                    mc_devtype;
+	uint8_t                    mc_mclass;
+	uint8_t                    mc_sectorsz;
+	uint8_t                    mc_rsvd1;
+	uint32_t                   mc_spare;
+	uint16_t                   mc_uacnt;
+	uint16_t                   mc_rsvd2;
+	uint32_t                   mc_zonepg;
+	uint64_t                   mc_features;
+	uint64_t                   mc_rsvd3;
+	struct mpool_usage         mc_usage;
+};
+
+/**
+ * struct mpool_mclass_props -
+ *
+ * @mc_mblocksz:   mblock size in MiB
+ * @mc_rsvd:       reserved struct field (for future use)
+ * @mc_total:      total space in the media class (mc_usable + mc_spare)
+ * @mc_usable:     usable space in bytes
+ * @mc_used:       bytes allocated from usable space
+ * @mc_spare:      spare space in bytes
+ * @mc_spare_used: bytes allocated from spare space
+ */
+struct mpool_mclass_props {
+	uint32_t   mc_mblocksz;
+	uint32_t   mc_rsvd;
+	uint64_t   mc_total;
+	uint64_t   mc_usable;
+	uint64_t   mc_used;
+	uint64_t   mc_spare;
+	uint64_t   mc_spare_used;
+};
+
+/**
+ * struct mpool_xprops - Extended mpool properties
+ * @ppx_params: mpool configuration parameters
+ * @ppx_drive_spares: percent spare zones for drives in each media class
+ * @ppx_uacnt:  UNAVAIL status drive count in each media class
+ */
+struct mpool_xprops {
+	struct mpool_params     ppx_params;
+	uint8_t                 ppx_rsvd[MP_MED_NUMBER];
+	uint8_t                 ppx_drive_spares[MP_MED_NUMBER];
+	uint16_t                ppx_uacnt[MP_MED_NUMBER];
+	uint32_t                ppx_pd_mclassv[MP_MED_NUMBER];
+	char                    ppx_pd_namev[MP_MED_NUMBER][PD_NAMESZ_MAX];
+};
+
+
+/*
+ * struct mblock_props -
+ *
+ * @mpr_objid:        mblock identifier
+ * @mpr_alloc_cap:    allocated capacity in bytes
+ * @mpr_write_len:    written user-data in bytes
+ * @mpr_optimal_wrsz: optimal write size(in bytes) for all but the last incremental mblock write
+ * @mpr_mclassp:      media class
+ * @mpr_iscommitted:  Is this mblock committed?
+ */
+struct mblock_props {
+	uint64_t                mpr_objid;
+	uint32_t                mpr_alloc_cap;
+	uint32_t                mpr_write_len;
+	uint32_t                mpr_optimal_wrsz;
+	uint32_t                mpr_mclassp; /* enum mp_media_classp */
+	uint8_t                 mpr_iscommitted;
+	uint8_t                 mpr_rsvd1[7];
+	uint64_t                mpr_rsvd2;
+};
+
+struct mblock_props_ex {
+	struct mblock_props     mbx_props;
+	uint8_t                 mbx_zonecnt;      /* zone count per strip */
+	uint8_t                 mbx_rsvd1[7];
+	uint64_t                mbx_rsvd2;
+};
+
+
+/*
+ * enum mlog_open_flags -
+ * @MLOG_OF_COMPACT_SEM: Enforce compaction semantics
+ * @MLOG_OF_SKIP_SER:    Appends and reads are guaranteed to be serialized
+ *                       outside of the mlog API
+ */
+enum mlog_open_flags {
+	MLOG_OF_COMPACT_SEM = 0x1,
+	MLOG_OF_SKIP_SER    = 0x2,
+};
+
+/*
+ * NOTE:
+ * + a value of 0 for targets (*tgt) means no specific target and the
+ *   allocator is free to choose based on media class configuration
+ */
+struct mlog_capacity {
+	uint64_t    lcp_captgt;       /* capacity target for mlog in bytes */
+	uint8_t     lcp_spare;        /* true if alloc mlog from spare space */
+	uint8_t     lcp_rsvd1[7];
+};
+
+/*
+ * struct mlog_props -
+ *
+ * @lpr_uuid:        UUID or mlog magic
+ * @lpr_objid:       mlog identifier
+ * @lpr_alloc_cap:   maximum capacity in bytes
+ * @lpr_gen:         generation no. (user mlogs)
+ * @lpr_mclassp:     media class
+ * @lpr_iscommitted: Is this mlog committed?
+ */
+struct mlog_props {
+	uuid_le     lpr_uuid;
+	uint64_t    lpr_objid;
+	uint64_t    lpr_alloc_cap;
+	uint64_t    lpr_gen;
+	uint8_t     lpr_mclassp;
+	uint8_t     lpr_iscommitted;
+	uint8_t     lpr_rsvd1[6];
+	uint64_t    lpr_rsvd2;
+};
+
+/*
+ * struct mlog_props_ex -
+ *
+ * @lpx_props:
+ * @lpx_totsec:   total number of sectors
+ * @lpx_zonecnt:   zone count per strip
+ * @lpx_state:    mlog layout state
+ * @lpx_secshift: sector shift
+ */
+struct mlog_props_ex {
+	struct mlog_props   lpx_props;
+	uint32_t            lpx_totsec;
+	uint32_t            lpx_zonecnt;
+	uint8_t             lpx_state;
+	uint8_t             lpx_secshift;
+	uint8_t             lpx_rsvd1[6];
+	uint64_t            lpx_rsvd2;
+};
+
+
+/**
+ * enum mdc_open_flags -
+ * @MDC_OF_SKIP_SER: appends and reads are guaranteed to be serialized
+ *                   outside of the MDC API
+ */
+enum mdc_open_flags {
+	MDC_OF_SKIP_SER  = 0x1,
+};
+
+/**
+ * struct mdc_capacity -
+ * @mdt_captgt: capacity target for mlog in bytes
+ * @mpt_spare:  true if alloc MDC from spare space
+ */
+struct mdc_capacity {
+	uint64_t   mdt_captgt;
+	bool       mdt_spare;
+};
+
+/**
+ * struct mdc_props -
+ * @mdc_objid1:
+ * @mdc_objid2:
+ * @mdc_alloc_cap:
+ * @mdc_mclassp:
+ */
+struct mdc_props {
+	uint64_t               mdc_objid1;
+	uint64_t               mdc_objid2;
+	uint64_t               mdc_alloc_cap;
+	enum mp_media_classp   mdc_mclassp;
+};
+
+
+/**
+ * enum mpc_vma_advice -
+ * @MPC_VMA_COLD:
+ * @MPC_VMA_WARM:
+ * @MPC_VMA_HOT:
+ * @MPC_VMA_PINNED:
+ */
+enum mpc_vma_advice {
+	MPC_VMA_COLD = 0,
+	MPC_VMA_WARM,
+	MPC_VMA_HOT,
+	MPC_VMA_PINNED
+};
+
+
+/**
+ * struct pd_znparam - zone parameter arg used in compute/set API functions
+ * @dvb_zonepg:     zone size in PAGE_SIZE units.
+ * @dvb_zonetot:    total number of zones
+ */
+struct pd_znparam {
+	uint32_t   dvb_zonepg;
+	uint32_t   dvb_zonetot;
+	uint64_t   dvb_rsvd1;
+};
+
+#define PD_DEV_ID_LEN              64
+
+/**
+ * struct pd_prop - PD properties
+ * @pdp_didstr:         drive id string (model)
+ * @pdp_devtype:	device type (enum pd_devtype)
+ * @pdp_phys_if:	physical interface of the drive
+ *			Determined by the device discovery.
+ *			(device_phys_if)
+ * @pdp_mclassp:        performance characteristic of the media class
+ *			Determined by the user, not by the device discovery.
+ *			(enum mp_media_classp)
+ * @pdp_cmdopt:         enum pd_cmd_opt. Features of the PD.
+ * @pdp_zparam:	zone parameters
+ * @pdp_discard_granularity: specified by
+ *	/sys/block/<disk>/queue/discard_granularity
+ * @pdp_sectorsz:	Sector size, exponent base 2
+ * @pdp_optiosz:        Optimal IO size
+ * @pdp_devsz:		device size in bytes
+ *
+ * Note: in order to avoid passing enums across user-kernel boundary
+ * declare the following as uint8_t
+ * pdp_devtype: enum pd_devtype
+ * pdp_devstate: enum pd_state
+ * pdp_phys_if: enum device_phys_if
+ * pdp_mclassp: enum mp_media_classp
+ */
+struct pd_prop {
+	char		        pdp_didstr[PD_DEV_ID_LEN];
+	uint8_t                 pdp_devtype;
+	uint8_t                 pdp_devstate;
+	uint8_t                 pdp_phys_if;
+	uint8_t                 pdp_mclassp;
+	bool                    pdp_fua;
+	uint64_t                pdp_cmdopt;
+
+	struct pd_znparam       pdp_zparam;
+	uint32_t                pdp_discard_granularity;
+	uint32_t                pdp_sectorsz;
+	uint32_t                pdp_optiosz;
+	uint32_t                pdp_rsvd2;
+	uint64_t	        pdp_devsz;
+	uint64_t	        pdp_rsvd3;
+};
+
+
+struct mpioc_mpool {
+	struct mpool_params     mp_params;
+	uint32_t                mp_flags;       /* mp_mgmt_flags */
+	uint32_t                mp_dpathc;      /* Count of device paths */
+	uint32_t                mp_dpathssz;    /* Length of mp_dpaths */
+	uint32_t                mp_rsvd1;
+	uint64_t                mp_rsvd2;
+	char __user            *mp_dpaths;      /* Newline separated paths */
+	struct pd_prop __user  *mp_pd_prop;     /* mp_dpathc elements */
+};
+
+/**
+ * struct mpioc_params -
+ * @mps_params:
+ */
+struct mpioc_params {
+	struct mpool_params     mps_params;
+};
+
+struct mpioc_mclass {
+	struct mpool_mclass_xprops __user  *mcl_xprops;
+	uint32_t                            mcl_cnt;
+	uint32_t                            mcl_rsvd1;
+};
+
+struct mpioc_drive {
+	uint32_t	        drv_flags;   /* mp_mgmt_flags */
+	uint32_t	        drv_rsvd1;
+	uint32_t                drv_dpathc;  /* Count of device paths */
+	uint32_t                drv_dpathssz;/* Length of mp_dpaths */
+	struct pd_prop __user  *drv_pd_prop; /* mp_dpathc elements */
+	char __user            *drv_dpaths;  /* Newline separated device paths*/
+};
+
+enum mpioc_list_cmd {
+	MPIOC_LIST_CMD_INVALID     = 0,
+	MPIOC_LIST_CMD_PROP_GET    = 1,       /* Used by mpool get command */
+	MPIOC_LIST_CMD_PROP_LIST   = 2,       /* Used by mpool list command */
+	MPIOC_LIST_CMD_LAST = MPIOC_LIST_CMD_PROP_LIST,
+};
+
+struct mpioc_list {
+	uint32_t        ls_cmd;     /* enum mpioc_list_cmd */
+	uint32_t        ls_listc;
+	void __user    *ls_listv;
+};
+
+struct mpioc_prop {
+	struct mpool_xprops         pr_xprops;
+	struct mpool_usage          pr_usage;
+	struct mpool_mclass_xprops  pr_mcxv[MP_MED_NUMBER];
+	uint32_t                    pr_mcxc;
+	uint32_t                    pr_rsvd1;
+	uint64_t                    pr_rsvd2;
+};
+
+struct mpioc_devprops {
+	char                    dpr_pdname[PD_NAMESZ_MAX];
+	struct mpool_devprops   dpr_devprops;
+};
+
+/**
+ * struct mpioc_mblock:
+ * @mb_objid:   mblock unique ID (permanent)
+ * @mb_offset:  mblock read offset (ephemeral)
+ * @mb_props:
+ *
+ * @mb_spare:
+ * @mb_mclassp: enum mp_media_classp, declared as uint8_t
+ */
+struct mpioc_mblock {
+	uint64_t                mb_objid;
+	int64_t                 mb_offset;
+	struct mblock_props_ex  mb_props;
+	uint8_t                 mb_spare;
+	uint8_t                 mb_mclassp;
+	uint16_t                mb_rsvd1;
+	uint32_t                mb_rsvd2;
+	uint64_t                mb_rsvd3;
+};
+
+struct mpioc_mblock_id {
+	uint64_t    mi_objid;
+};
+
+#define MPIOC_KIOV_MAX          (1024)
+
+struct mpioc_mblock_rw {
+	uint64_t                    mb_objid;
+	int64_t                     mb_offset;
+	uint32_t                    mb_rsvd2;
+	uint16_t                    mb_rsvd3;
+	uint16_t                    mb_iov_cnt;
+	const struct iovec __user  *mb_iov;
+};
+
+struct mpioc_mlog {
+	uint64_t                ml_objid;
+	uint64_t                ml_rsvd;
+	struct mlog_props_ex    ml_props;
+	struct mlog_capacity    ml_cap;
+	uint8_t                 ml_mclassp; /* enum mp_media_classp */
+	uint8_t                 ml_rsvd1[7];
+	uint64_t                ml_rsvd2;
+};
+
+struct mpioc_mlog_id {
+	uint64_t    mi_objid;
+	uint64_t    mi_gen;
+	uint8_t     mi_state;
+	uint8_t     mi_rsvd1[7];
+};
+
+struct mpioc_mlog_io {
+	uint64_t                mi_objid;
+	int64_t                 mi_off;
+	uint8_t                 mi_op;
+	uint8_t                 mi_rsvd1[5];
+	uint16_t                mi_iovc;
+	struct iovec __user    *mi_iov;
+	uint64_t                mi_rsvd2;
+};
+
+struct mpioc_vma {
+	uint32_t            im_advice;
+	uint32_t            im_mbidc;
+	uint64_t __user    *im_mbidv;
+	uint64_t            im_bktsz;
+	int64_t             im_offset;
+	uint64_t            im_len;
+	uint64_t            im_vssp;
+	uint64_t            im_rssp;
+	uint64_t            im_rsvd;
+};
+
+union mpioc_union {
+	struct mpioc_mpool      mpu_mpool;
+	struct mpioc_drive      mpu_drive;
+	struct mpioc_params     mpu_params;
+	struct mpioc_mclass     mpu_mclass;
+	struct mpioc_list       mpu_list;
+	struct mpioc_prop       mpu_prop;
+	struct mpioc_devprops   mpu_devprops;
+	struct mpioc_mlog       mpu_mlog;
+	struct mpioc_mlog_id    mpu_mlog_id;
+	struct mpioc_mlog_io    mpu_mlog_io;
+	struct mpioc_mblock     mpu_mblock;
+	struct mpioc_mblock_id  mpu_mblock_id;
+	struct mpioc_mblock_rw  mpu_mblock_rw;
+	struct mpioc_vma        mpu_vma;
+};
+
+#define MPIOC_MAGIC             ('2')
+
+#define MPIOC_MP_CREATE         _IOWR(MPIOC_MAGIC, 1, struct mpioc_mpool)
+#define MPIOC_MP_DESTROY        _IOW(MPIOC_MAGIC, 2, struct mpioc_mpool)
+#define MPIOC_MP_ACTIVATE       _IOWR(MPIOC_MAGIC, 5, struct mpioc_mpool)
+#define MPIOC_MP_DEACTIVATE     _IOW(MPIOC_MAGIC, 6, struct mpioc_mpool)
+#define MPIOC_MP_RENAME         _IOWR(MPIOC_MAGIC, 7, struct mpioc_mpool)
+
+#define MPIOC_PARAMS_GET        _IOWR(MPIOC_MAGIC, 10, struct mpioc_params)
+#define MPIOC_PARAMS_SET        _IOWR(MPIOC_MAGIC, 11, struct mpioc_params)
+#define MPIOC_MP_MCLASS_GET     _IOWR(MPIOC_MAGIC, 12, struct mpioc_mclass)
+
+#define MPIOC_DRV_ADD           _IOWR(MPIOC_MAGIC, 15, struct mpioc_drive)
+#define MPIOC_DRV_SPARES        _IOWR(MPIOC_MAGIC, 16, struct mpioc_drive)
+
+#define MPIOC_PROP_GET          _IOWR(MPIOC_MAGIC, 20, struct mpioc_list)
+#define MPIOC_PROP_SET          _IOWR(MPIOC_MAGIC, 21, struct mpioc_list)
+#define MPIOC_DEVPROPS_GET      _IOWR(MPIOC_MAGIC, 22, struct mpioc_devprops)
+
+#define MPIOC_MLOG_ALLOC        _IOWR(MPIOC_MAGIC, 30, struct mpioc_mlog)
+#define MPIOC_MLOG_COMMIT       _IOWR(MPIOC_MAGIC, 32, struct mpioc_mlog_id)
+#define MPIOC_MLOG_ABORT        _IOW(MPIOC_MAGIC, 33, struct mpioc_mlog_id)
+#define MPIOC_MLOG_DELETE       _IOW(MPIOC_MAGIC, 34, struct mpioc_mlog_id)
+#define MPIOC_MLOG_FIND         _IOWR(MPIOC_MAGIC, 37, struct mpioc_mlog)
+#define MPIOC_MLOG_READ         _IOW(MPIOC_MAGIC, 40, struct mpioc_mlog_io)
+#define MPIOC_MLOG_WRITE        _IOW(MPIOC_MAGIC, 41, struct mpioc_mlog_io)
+#define MPIOC_MLOG_PROPS        _IOWR(MPIOC_MAGIC, 42, struct mpioc_mlog)
+#define MPIOC_MLOG_ERASE        _IOWR(MPIOC_MAGIC, 43, struct mpioc_mlog_id)
+
+#define MPIOC_MB_ALLOC          _IOWR(MPIOC_MAGIC, 50, struct mpioc_mblock)
+#define MPIOC_MB_ABORT          _IOW(MPIOC_MAGIC, 52, struct mpioc_mblock_id)
+#define MPIOC_MB_COMMIT         _IOW(MPIOC_MAGIC, 53, struct mpioc_mblock_id)
+#define MPIOC_MB_DELETE         _IOW(MPIOC_MAGIC, 54, struct mpioc_mblock_id)
+#define MPIOC_MB_FIND           _IOWR(MPIOC_MAGIC, 56, struct mpioc_mblock)
+#define MPIOC_MB_READ           _IOW(MPIOC_MAGIC, 60, struct mpioc_mblock_rw)
+#define MPIOC_MB_WRITE          _IOW(MPIOC_MAGIC, 61, struct mpioc_mblock_rw)
+
+#define MPIOC_VMA_CREATE        _IOWR(MPIOC_MAGIC, 70, struct mpioc_vma)
+#define MPIOC_VMA_DESTROY       _IOW(MPIOC_MAGIC, 71, struct mpioc_vma)
+#define MPIOC_VMA_PURGE         _IOW(MPIOC_MAGIC, 72, struct mpioc_vma)
+#define MPIOC_VMA_VRSS          _IOWR(MPIOC_MAGIC, 73, struct mpioc_vma)
+
+#endif
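The MPIOC_* command values above follow the standard Linux ioctl encoding: a magic character ('2'), a per-command number, and the size of the argument struct. A minimal userspace sketch of how such a command decodes — using a stand-in struct, since the driver's own types are not reproduced here:

```c
#include <sys/ioctl.h>  /* _IOWR, _IOC_TYPE, _IOC_NR, _IOC_SIZE on Linux */

/* Hypothetical stand-in for a driver argument struct (illustration only). */
struct demo_params {
	unsigned long p_rsvd[4];
};

/* Encoded the same way as MPIOC_PARAMS_GET above: magic '2', number 10. */
#define DEMO_MAGIC       ('2')
#define DEMO_PARAMS_GET  _IOWR(DEMO_MAGIC, 10, struct demo_params)

/* Helpers that pull the three components back out of a command word. */
unsigned demo_cmd_type(unsigned long cmd) { return _IOC_TYPE(cmd); }
unsigned demo_cmd_nr(unsigned long cmd)   { return _IOC_NR(cmd); }
unsigned demo_cmd_size(unsigned long cmd) { return _IOC_SIZE(cmd); }
```

The size field is why the reserved padding members above matter: changing a struct's size changes the command value and breaks the ABI.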
diff --git a/drivers/mpool/mpool_printk.h b/drivers/mpool/mpool_printk.h
new file mode 100644
index 000000000000..280a8e064115
--- /dev/null
+++ b/drivers/mpool/mpool_printk.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_PRINTK_H
+#define MPOOL_PRINTK_H
+
+#include <linux/printk.h>
+
+static unsigned long mp_pr_rl_state __maybe_unused;
+
+/* TODO: Use dev_crit(), dev_err(), ... */
+
+#define mp_pr_crit(_fmt, _err, ...)				\
+	pr_crit("%s: " _fmt ": errno %d", __func__, ## __VA_ARGS__, (_err))
+
+#define mp_pr_err(_fmt, _err, ...)				\
+	pr_err("%s: " _fmt ": errno %d", __func__, ## __VA_ARGS__, (_err))
+
+#define mp_pr_warn(_fmt, ...)					\
+	pr_warn("%s: " _fmt, __func__, ## __VA_ARGS__)
+
+#define mp_pr_notice(_fmt, ...)					\
+	pr_notice("%s: " _fmt, __func__, ## __VA_ARGS__)
+
+#define mp_pr_info(_fmt, ...)					\
+	pr_info("%s: " _fmt, __func__, ## __VA_ARGS__)
+
+#define mp_pr_debug(_fmt, _err, ...)				\
+	pr_debug("%s: " _fmt ": errno %d", __func__, ## __VA_ARGS__,  (_err))
+
+/* Rate limited version of mp_pr_err(). */
+#define mp_pr_rl(_fmt, _err, ...)				\
+do {								\
+	if (printk_timed_ratelimit(&mp_pr_rl_state, 333)) {	\
+		pr_err("%s: " _fmt ": errno %d",		\
+		       __func__, ## __VA_ARGS__, (_err));	\
+	}							\
+} while (0)
+
+#endif /* MPOOL_PRINTK_H */
diff --git a/drivers/mpool/uuid.h b/drivers/mpool/uuid.h
new file mode 100644
index 000000000000..28e53be68662
--- /dev/null
+++ b/drivers/mpool/uuid.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_UUID_H
+#define MPOOL_UUID_H
+
+#define MPOOL_UUID_SIZE        16
+#define MPOOL_UUID_STRING_LEN  36
+
+#include <linux/kernel.h>
+#include <linux/uuid.h>
+
+struct mpool_uuid {
+	unsigned char uuid[MPOOL_UUID_SIZE];
+};
+
+/* mpool_uuid uses the LE version in the kernel */
+static inline void mpool_generate_uuid(struct mpool_uuid *uuid)
+{
+	generate_random_guid(uuid->uuid);
+}
+
+static inline void mpool_uuid_copy(struct mpool_uuid *u_dst, const struct mpool_uuid *u_src)
+{
+	memcpy(u_dst->uuid, u_src->uuid, MPOOL_UUID_SIZE);
+}
+
+static inline int mpool_uuid_compare(const struct mpool_uuid *uuid1, const struct mpool_uuid *uuid2)
+{
+	return memcmp(uuid1, uuid2, MPOOL_UUID_SIZE);
+}
+
+static inline void mpool_uuid_clear(struct mpool_uuid *uuid)
+{
+	memset(uuid->uuid, 0, MPOOL_UUID_SIZE);
+}
+
+static inline int mpool_uuid_is_null(const struct mpool_uuid *uuid)
+{
+	const struct mpool_uuid zero = { };
+
+	return !memcmp(&zero, uuid, sizeof(zero));
+}
+
+static inline void mpool_unparse_uuid(const struct mpool_uuid *uuid, char *dst)
+{
+	const unsigned char *u = uuid->uuid;
+
+	snprintf(dst, MPOOL_UUID_STRING_LEN + 1,
+		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
+		 u[0], u[1], u[2], u[3],
+		 u[4], u[5], u[6], u[7],
+		 u[8], u[9], u[10], u[11],
+		 u[12], u[13], u[14], u[15]);
+}
+
+#endif /* MPOOL_UUID_H */
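mpool_unparse_uuid() above renders the 16 raw bytes in the conventional 8-4-4-4-12 form. The same formatting can be exercised standalone; this userspace sketch mirrors the header's snprintf call:

```c
#include <stdio.h>

#define MPOOL_UUID_SIZE        16
#define MPOOL_UUID_STRING_LEN  36

/* Format 16 raw UUID bytes as "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx".
 * dst must have room for MPOOL_UUID_STRING_LEN + 1 bytes (incl. NUL).
 */
void unparse_uuid(const unsigned char *u, char *dst)
{
	snprintf(dst, MPOOL_UUID_STRING_LEN + 1,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
		 u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
		 u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}
```

Note the +1 in the size argument: MPOOL_UUID_STRING_LEN counts only the 36 visible characters, so callers must size their buffer for the terminating NUL as well.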
-- 
2.17.2



* [PATCH v2 02/22] mpool: add in-memory struct definitions
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 03/22] mpool: add on-media " Nabeel M Mohamed
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Add headers containing the basic in-memory structures used by mpool.

- mclass.h: media classes
- mlog.h: mlog objects
- mp.h, mpcore.h: mpool objects
- params.h: mpool parameters
- pd.h: pool drive interface
- pmd.h, pmd_obj.h: Metadata manager
- sb.h: superblock interface
- smap.h: space map interface

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mclass.h  | 137 +++++++++++
 drivers/mpool/mlog.h    | 212 +++++++++++++++++
 drivers/mpool/mp.h      | 231 +++++++++++++++++++
 drivers/mpool/mpcore.h  | 354 ++++++++++++++++++++++++++++
 drivers/mpool/params.h  | 116 ++++++++++
 drivers/mpool/pd.h      | 202 ++++++++++++++++
 drivers/mpool/pmd.h     | 379 ++++++++++++++++++++++++++++++
 drivers/mpool/pmd_obj.h | 499 ++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/sb.h      | 162 +++++++++++++
 drivers/mpool/smap.h    | 334 +++++++++++++++++++++++++++
 10 files changed, 2626 insertions(+)
 create mode 100644 drivers/mpool/mclass.h
 create mode 100644 drivers/mpool/mlog.h
 create mode 100644 drivers/mpool/mp.h
 create mode 100644 drivers/mpool/mpcore.h
 create mode 100644 drivers/mpool/params.h
 create mode 100644 drivers/mpool/pd.h
 create mode 100644 drivers/mpool/pmd.h
 create mode 100644 drivers/mpool/pmd_obj.h
 create mode 100644 drivers/mpool/sb.h
 create mode 100644 drivers/mpool/smap.h

diff --git a/drivers/mpool/mclass.h b/drivers/mpool/mclass.h
new file mode 100644
index 000000000000..2ecdcd08de9f
--- /dev/null
+++ b/drivers/mpool/mclass.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MCLASS_H
+#define MPOOL_MCLASS_H
+
+#include "mpool_ioctl.h"
+
+struct omf_devparm_descriptor;
+struct mpool_descriptor;
+struct mpcore_params;
+
+/*
+ * This file contains the media class structures definitions and prototypes
+ * private to mpool core.
+ */
+
+/**
+ * struct mc_parms - media class parameters
+ * @mcp_classp:    class performance characteristics, enum mp_media_classp
+ * @mcp_zonepg:    zone size in number of zone pages
+ * @mcp_sectorsz:  2^sectorsz is the logical sector size
+ * @mcp_devtype:   device type, enum pd_devtype
+ * @mcp_features:  OR'd bits from enum mp_mc_features
+ *
+ * Two PDs can't be placed in the same media class if they have different
+ * mc_parms.
+ */
+struct mc_parms {
+	u8  mcp_classp;
+	u32 mcp_zonepg;
+	u8  mcp_sectorsz;
+	u8  mcp_devtype;
+	u64 mcp_features;
+};
+
+/**
+ * struct mc_smap_parms - media class space map parameters
+ * @mcsp_spzone: percent spare zones for drives.
+ * @mcsp_rgnc: no. of space map zones for drives in each media class
+ * @mcsp_align: space map zone alignment for drives in each media class
+ */
+struct mc_smap_parms {
+	u8		mcsp_spzone;
+	u8		mcsp_rgnc;
+	u8		mcsp_align;
+};
+
+/**
+ * struct media_class - define a media class
+ * @mc_parms:  parameters that define the media class; contents differ per class
+ * @mc_sparms: space map params for this media class
+ * @mc_pdmc:   active pdv entries grouped by media class array
+ * @mc_uacnt:  UNAVAIL status drive count in each media class
+ *
+ * Locking:
+ *    Protected by mp.pds_pdvlock.
+ */
+struct media_class {
+	struct mc_parms        mc_parms;
+	struct mc_smap_parms   mc_sparms;
+	s8                     mc_pdmc;
+	u8                     mc_uacnt;
+};
+
+/**
+ * mc_pd_prop2mc_parms() -  Convert PD properties into media class parameters.
+ * @pd_prop: input, pd properties.
+ * @mc_parms: output, media class parameters.
+ *
+ * Typically used before a lookup (mc_lookup_from_mc_parms()) to know in
+ * which media class a PD belongs to.
+ */
+void mc_pd_prop2mc_parms(struct pd_prop *pd_prop, struct mc_parms *mc_parms);
+
+/**
+ * mc_omf_devparm2mc_parms() - convert a omf_devparm_descriptor into an mc_parms.
+ * @omf_devparm: input
+ * @mc_parms: output
+ */
+void mc_omf_devparm2mc_parms(struct omf_devparm_descriptor *omf_devparm, struct mc_parms *mc_parms);
+
+/**
+ * mc_parms2omf_devparm() - convert a mc_parms in a omf_devparm_descriptor
+ * @mc_parms: input
+ * @omf_devparm: output
+ */
+void mc_parms2omf_devparm(struct mc_parms *mc_parms, struct omf_devparm_descriptor *omf_devparm);
+
+/**
+ * mc_cmp_omf_devparm() - check if two omf_devparm_descriptor structures
+ *	correspond to the same media class.
+ * @omfd1:
+ * @omfd2:
+ *
+ * Returns 0 if in same media class.
+ */
+int mc_cmp_omf_devparm(struct omf_devparm_descriptor *omfd1, struct omf_devparm_descriptor *omfd2);
+
+/**
+ * mc_init_class() - initialize a media class
+ * @mc:
+ * @mc_parms: parameters of the media class
+ * @mcsp:     smap parameters for mc
+ */
+void mc_init_class(struct media_class *mc, struct mc_parms *mc_parms, struct mc_smap_parms *mcsp);
+
+/**
+ * mc_set_spzone() - set the percent spare on the media class mclass.
+ * @mc:
+ * @spzone:
+ *
+ * Return: 0, or -ENOENT if the specified mclass doesn't exist.
+ */
+int mc_set_spzone(struct media_class *mc, u8 spzone);
+
+/**
+ * mclass_isvalid() - Return true if the media class is valid.
+ * @mclass:
+ */
+static inline bool mclass_isvalid(enum mp_media_classp mclass)
+{
+	return (mclass >= 0 && mclass < MP_MED_NUMBER);
+}
+
+/**
+ * mc_smap_parms_get() - get space map params for the specified media class.
+ * @mc:
+ * @params:
+ * @mcsp: (output)
+ */
+int mc_smap_parms_get(struct media_class *mc, struct mpcore_params *params,
+		      struct mc_smap_parms *mcsp);
+
+#endif /* MPOOL_MCLASS_H */
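The comment on struct mc_parms states that two PDs can only share a media class when their mc_parms match exactly. A small standalone sketch of that identity check — the struct layout is copied from above, but the compare helper is illustrative rather than the driver's own:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint32_t u32;
typedef uint64_t u64;

/* Same layout as struct mc_parms above. */
struct mc_parms {
	u8  mcp_classp;
	u32 mcp_zonepg;
	u8  mcp_sectorsz;
	u8  mcp_devtype;
	u64 mcp_features;
};

/* Two PDs belong to the same media class iff every parameter matches.
 * Comparing field by field avoids depending on struct padding contents,
 * which a raw memcmp() over the struct would.
 */
bool same_media_class(const struct mc_parms *a, const struct mc_parms *b)
{
	return a->mcp_classp   == b->mcp_classp   &&
	       a->mcp_zonepg   == b->mcp_zonepg   &&
	       a->mcp_sectorsz == b->mcp_sectorsz &&
	       a->mcp_devtype  == b->mcp_devtype  &&
	       a->mcp_features == b->mcp_features;
}
```

Any single differing parameter — a different sector size, zone geometry, or feature bit — is enough to force the PD into a separate class.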
diff --git a/drivers/mpool/mlog.h b/drivers/mpool/mlog.h
new file mode 100644
index 000000000000..0de816335d55
--- /dev/null
+++ b/drivers/mpool/mlog.h
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Defines functions for writing, reading, and managing the lifecycle of mlogs.
+ */
+
+#ifndef MPOOL_MLOG_H
+#define MPOOL_MLOG_H
+
+#include <linux/uio.h>
+
+#include "mpool_ioctl.h"
+
+#define MB       (1024 * 1024)
+struct pmd_layout;
+struct mpool_descriptor;
+struct mlog_descriptor;
+
+
+/**
+ * struct mlog_read_iter -
+ * @lri_layout: Layout of log being read
+ * @lri_soff:   Sector offset of next log block to read from
+ * @lri_gen:    Log generation number at iterator initialization
+ * @lri_roff:   Next offset in log block soff to read from
+ * @lri_rbidx:  Read buffer page index currently reading from
+ * @lri_sidx:   Log block index in lri_rbidx
+ * @lri_valid:  1 if iterator is valid; 0 otherwise
+ */
+struct mlog_read_iter {
+	struct pmd_layout  *lri_layout;
+	off_t               lri_soff;
+	u64                 lri_gen;
+	u16                 lri_roff;
+	u16                 lri_rbidx;
+	u16                 lri_sidx;
+	u8                  lri_valid;
+};
+
+/**
+ * struct mlog_fsetparms -
+ *
+ * @mfp_totsec: Total number of log blocks in mlog
+ * @mfp_secpga: Is sector size page-aligned?
+ * @mfp_lpgsz:  Size of each page in read/append buffer
+ * @mfp_nlpgmb: No. of log pages in 1 MiB buffer
+ * @mfp_sectsz: Sector size obtained from PD prop
+ * @mfp_nsecmb: No. of sectors/log blocks in 1 MiB buffer
+ * @mfp_nseclpg: No. of sectors/log blocks per log page
+ */
+struct mlog_fsetparms {
+	u32    mfp_totsec;
+	bool   mfp_secpga;
+	u32    mfp_lpgsz;
+	u16    mfp_nlpgmb;
+	u16    mfp_sectsz;
+	u16    mfp_nsecmb;
+	u16    mfp_nseclpg;
+};
+
+/**
+ * struct mlog_stat - mlog open status (referenced by associated struct pmd_layout)
+ * @lst_citr:    Current mlog read iterator
+ * @lst_mfp:     Mlog flush set parameters
+ * @lst_abuf:    Append buffer, max 1 MiB size
+ * @lst_rbuf:    Read buffer, max 1 MiB size - immutable
+ * @lst_rsoff:   LB offset of the 1st log block in lst_rbuf
+ * @lst_rseoff:  LB offset of the last log block in lst_rbuf
+ * @lst_asoff:   LB offset of the 1st log block in CFS
+ * @lst_wsoff:   Offset of the accumulating log block
+ * @lst_abdirty: true, if append buffer is dirty
+ * @lst_pfsetid: Prev. fSetID of the first log block in CFS
+ * @lst_cfsetid: Current fSetID of the CFS
+ * @lst_cfssoff: Offset within the 1st log block from where CFS starts
+ * @lst_aoff:    Next byte offset[0, sectsz) to fill in the current log block
+ * @lst_abidx:   Index of current filling page in lst_abuf
+ * @lst_csem:    enforce compaction semantics if true
+ * @lst_cstart:  valid compaction start marker in log?
+ * @lst_cend:    valid compaction end marker in log?
+ */
+struct mlog_stat {
+	struct mlog_read_iter  lst_citr;
+	struct mlog_fsetparms  lst_mfp;
+	char  **lst_abuf;
+	char  **lst_rbuf;
+	off_t   lst_rsoff;
+	off_t   lst_rseoff;
+	off_t   lst_asoff;
+	off_t   lst_wsoff;
+	bool    lst_abdirty;
+	u32     lst_pfsetid;
+	u32     lst_cfsetid;
+	u16     lst_cfssoff;
+	u16     lst_aoff;
+	u16     lst_abidx;
+	u8      lst_csem;
+	u8      lst_cstart;
+	u8      lst_cend;
+};
+
+#define MLOG_TOTSEC(lstat)  ((lstat)->lst_mfp.mfp_totsec)
+#define MLOG_LPGSZ(lstat)   ((lstat)->lst_mfp.mfp_lpgsz)
+#define MLOG_NLPGMB(lstat)  ((lstat)->lst_mfp.mfp_nlpgmb)
+#define MLOG_SECSZ(lstat)   ((lstat)->lst_mfp.mfp_sectsz)
+#define MLOG_NSECMB(lstat)  ((lstat)->lst_mfp.mfp_nsecmb)
+#define MLOG_NSECLPG(lstat) ((lstat)->lst_mfp.mfp_nseclpg)
+
+#define IS_SECPGA(lstat)    ((lstat)->lst_mfp.mfp_secpga)
+
+/*
+ * mlog API functions
+ */
+
+/*
+ * Error codes: all mlog fns can return one or more of:
+ * -EINVAL = invalid fn args
+ * -ENOENT = log not open or logid not found
+ * -EFBIG = log full
+ * -EMSGSIZE = cstart w/o cend indicating a crash during compaction
+ * -ENODATA = malformed or corrupted log
+ * -EIO = unable to read/write log on media
+ * -ENOMEM = insufficient room in copy-out buffer
+ * -EBUSY = log is in erasing state; wait or retry erase
+ */
+
+int mlog_alloc(struct mpool_descriptor *mp, struct mlog_capacity *capreq,
+	       enum mp_media_classp mclassp, struct mlog_props *prop,
+	       struct mlog_descriptor **mlh);
+
+int mlog_realloc(struct mpool_descriptor *mp, u64 objid, struct mlog_capacity *capreq,
+		 enum mp_media_classp mclassp, struct mlog_props *prop,
+		 struct mlog_descriptor **mlh);
+
+int mlog_find_get(struct mpool_descriptor *mp, u64 objid, int which,
+		  struct mlog_props *prop, struct mlog_descriptor **mlh);
+
+void mlog_put(struct mlog_descriptor *layout);
+
+void mlog_lookup_rootids(u64 *id1, u64 *id2);
+
+int mlog_commit(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_abort(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_delete(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+/**
+ * mlog_open() - Open committed log, validate contents, and return its generation number
+ * @mp:
+ * @mlh:
+ * @flags:
+ * @gen: output
+ *
+ * If log is already open just returns gen; if csem is true enforces compaction
+ * semantics so that open fails if valid cstart/cend markers are not present.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_open(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u8 flags, u64 *gen);
+
+int mlog_close(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_gen(struct mlog_descriptor *mlh, u64 *gen);
+
+int mlog_empty(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, bool *empty);
+
+int mlog_erase(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u64 mingen);
+
+int mlog_append_cstart(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_append_cend(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_append_data(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		     char *buf, u64 buflen, int sync);
+
+int mlog_read_data_init(struct mlog_descriptor *mlh);
+
+/**
+ * mlog_read_data_next() -
+ * @mp:
+ * @mlh:
+ * @buf:
+ * @buflen:
+ * @rdlen:
+ *
+ * Returns:
+ *   If -EOVERFLOW is returned, then "buf" is too small to
+ *   hold the read data. Can be retried with a bigger receive buffer whose
+ *   size is returned in rdlen.
+ */
+int mlog_read_data_next(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+			char *buf, u64 buflen, u64 *rdlen);
+
+int mlog_get_props_ex(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		      struct mlog_props_ex *prop);
+
+void mlog_precompact_alsz(struct mpool_descriptor *mp, struct mlog_descriptor *mlh);
+
+int mlog_rw_raw(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		const struct kvec *iov, int iovcnt, u64 boff, u8 rw);
+
+void mlogutil_closeall(struct mpool_descriptor *mp);
+
+bool mlog_objid(u64 objid);
+
+struct pmd_layout *mlog2layout(struct mlog_descriptor *mlh);
+
+struct mlog_descriptor *layout2mlog(struct pmd_layout *layout);
+
+#endif /* MPOOL_MLOG_H */
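mlog_read_data_next() documents a grow-and-retry contract: on -EOVERFLOW the required buffer size comes back in rdlen and the call can simply be repeated. A userspace sketch of that calling pattern, with a trivial mock standing in for the driver call (the mock, the record length, and the helper names are all illustrative):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_RECLEN 100  /* pretend the next log record is 100 bytes */

/* Mock of mlog_read_data_next(): fails with -EOVERFLOW and reports the
 * needed size in *rdlen when the caller's buffer is too small.
 */
int mock_read_next(char *buf, uint64_t buflen, uint64_t *rdlen)
{
	if (buflen < MOCK_RECLEN) {
		*rdlen = MOCK_RECLEN;
		return -EOVERFLOW;
	}
	memset(buf, 'x', MOCK_RECLEN);
	*rdlen = MOCK_RECLEN;
	return 0;
}

/* Grow-and-retry read loop; returns bytes read or a negative errno.
 * On -EOVERFLOW the buffer is resized to the reported length and the
 * read is attempted once more, exactly as the kernel-doc describes.
 */
int64_t read_record(char **bufp, uint64_t *buflenp)
{
	uint64_t rdlen = 0;
	int rc = mock_read_next(*bufp, *buflenp, &rdlen);

	if (rc == -EOVERFLOW) {
		char *nbuf = realloc(*bufp, rdlen);

		if (!nbuf)
			return -ENOMEM;
		*bufp = nbuf;
		*buflenp = rdlen;
		rc = mock_read_next(*bufp, *buflenp, &rdlen);
	}
	return rc ? rc : (int64_t)rdlen;
}
```

Starting from a deliberately undersized 16-byte buffer, one retry suffices because the first failed call reports the exact length needed.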
diff --git a/drivers/mpool/mp.h b/drivers/mpool/mp.h
new file mode 100644
index 000000000000..e1570f8c8d0c
--- /dev/null
+++ b/drivers/mpool/mp.h
@@ -0,0 +1,231 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MP_H
+#define MPOOL_MP_H
+
+#include "mpool_ioctl.h"
+#include "uuid.h"
+#include "params.h"
+
+struct mpool_descriptor;
+
+#define MPOOL_OP_READ  0
+#define MPOOL_OP_WRITE 1
+#define PD_DEV_ID_PDUNAVAILABLE "DID_PDUNAVAILABLE"
+
+#define MPOOL_DRIVES_MAX       MP_MED_NUMBER
+#define MP_MED_ALL             MP_MED_NUMBER
+
+/* Object types */
+enum mp_obj_type {
+	MP_OBJ_UNDEF  = 0,
+	MP_OBJ_MBLOCK = 1,
+	MP_OBJ_MLOG   = 2,
+};
+
+/**
+ * struct mpool_config -
+ * @mc_oid1:
+ * @mc_oid2:
+ * @mc_uid:
+ * @mc_gid:
+ * @mc_mode:
+ * @mc_captgt:
+ * @mc_ra_pages_max:
+ * @mc_vma_size_max:
+ * @mc_utype:           user-defined type
+ * @mc_label:           user-defined label
+ */
+struct mpool_config {
+	u64                     mc_oid1;
+	u64                     mc_oid2;
+	uid_t                   mc_uid;
+	gid_t                   mc_gid;
+	mode_t                  mc_mode;
+	u32                     mc_rsvd0;
+	u64                     mc_captgt;
+	u32                     mc_ra_pages_max;
+	u32                     mc_vma_size_max;
+	u32                     mc_rsvd1;
+	u32                     mc_rsvd2;
+	u64                     mc_rsvd3;
+	u64                     mc_rsvd4;
+	uuid_le                 mc_utype;
+	char                    mc_label[MPOOL_LABELSZ_MAX];
+};
+
+/*
+ * mpool API functions
+ */
+
+/**
+ * mpool_create() - Create an mpool
+ * @name:
+ * @flags: enum mp_mgmt_flags
+ * @dpaths:
+ * @pd_prop: PDs properties obtained by mpool_create() caller.
+ * @params:  mpcore parameters
+ * @mlog_cap:
+ *
+ * Create an mpool from dcnt drive paths dpaths; store mpool metadata as
+ * specified by mdparm.
+ *
+ * Return:
+ * %0 if successful, -errno otherwise.
+ * -ENODEV if an insufficient number of drives meet mdparm.
+ */
+int mpool_create(const char *name, u32 flags, char **dpaths, struct pd_prop *pd_prop,
+		 struct mpcore_params *params, u64 mlog_cap);
+
+/**
+ * mpool_activate() - Activate an mpool
+ * @dcnt:
+ * @dpaths:
+ * @pd_prop: properties of the PDs. dcnt elements.
+ * @mlog_cap:
+ * @params:   mpcore parameters
+ * @flags:
+ * @mpp: *mpp is set to NULL if error
+ *
+ * Activate mpool on dcnt drive paths dpaths; if force flag is set tolerate
+ * unavailable drives up to redundancy limit; if successful *mpp is a handle
+ * for the mpool.
+ *
+ * Return:
+ * %0 if successful, -errno otherwise
+ * ENODEV if too many drives unavailable or failed,
+ * ENXIO if device previously removed from mpool and is no longer a member
+ */
+int mpool_activate(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u64 mlog_cap,
+		   struct mpcore_params *params, u32 flags, struct mpool_descriptor **mpp);
+
+
+/**
+ * mpool_deactivate() - Deactivate an mpool.
+ * @mp: mpool descriptor
+ *
+ * Deactivate mpool; caller must ensure no other thread can access mp; mp is
+ * invalid after call.
+ */
+int mpool_deactivate(struct mpool_descriptor *mp);
+
+/**
+ * mpool_destroy() - Destroy an mpool
+ * @dcnt:
+ * @dpaths:
+ * @pd_prop: PD properties.
+ * @flags:
+ *
+ * Destroy mpool on dcnt drive paths dpaths;
+ *
+ * Return:
+ * %0 if successful, -errno otherwise
+ */
+int mpool_destroy(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u32 flags);
+
+/**
+ * mpool_rename() - Rename mpool to mp_newname
+ * @dcnt:
+ * @dpaths:
+ * @pd_prop: PD properties.
+ * @flags:
+ * @mp_newname:
+ *
+ * Return:
+ * %0 if successful, -errno otherwise
+ */
+int mpool_rename(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u32 flags,
+		 const char *mp_newname);
+
+/**
+ * mpool_drive_add() - Add new drive dpath to mpool.
+ * @mp:
+ * @dpath:
+ * @pd_prop: PD properties.
+ *
+ * Return: %0 if successful; -errno otherwise...
+ */
+int mpool_drive_add(struct mpool_descriptor *mp, char *dpath, struct pd_prop *pd_prop);
+
+/**
+ * mpool_drive_spares() - Set percent spare zones to spzone for drives in media class mclassp.
+ * @mp:
+ * @mclassp:
+ * @spzone:
+ *
+ * Return: 0 if successful, -errno otherwise...
+ */
+int mpool_drive_spares(struct mpool_descriptor *mp, enum mp_media_classp mclassp, u8 spzone);
+
+/**
+ * mpool_mclass_get_cnt() - Get a count of media classes with drives in this mpool
+ * @mp:
+ * @cnt:
+ */
+void mpool_mclass_get_cnt(struct mpool_descriptor *mp, u32 *cnt);
+
+/**
+ * mpool_mclass_get() - Get information on the media classes in this mpool
+ * @mp:
+ * @mcxc:
+ * @mcxv:
+ *
+ * Return: 0 if successful, -errno otherwise...
+ */
+int mpool_mclass_get(struct mpool_descriptor *mp, u32 *mcxc, struct mpool_mclass_xprops *mcxv);
+
+/**
+ * mpool_get_xprops() - Retrieve extended mpool properties
+ * @mp:
+ * @xprops:
+ */
+void mpool_get_xprops(struct mpool_descriptor *mp, struct mpool_xprops *xprops);
+
+/**
+ * mpool_get_devprops_by_name() - Fill in dprop for active drive with name pdname
+ * @mp:
+ * @pdname:
+ * @dprop:
+ *
+ * Return: %0 if success, -errno otherwise...
+ * -ENOENT if device with specified name cannot be found
+ */
+int mpool_get_devprops_by_name(struct mpool_descriptor *mp, char *pdname,
+			       struct mpool_devprops *dprop);
+
+/**
+ * mpool_get_usage() - Fill in usage with mpool space usage for the media class mclassp
+ * @mp:
+ * @mclassp:
+ * @usage:
+ *
+ * If mclassp is MCLASS_ALL, report on entire pool (all media classes).
+ */
+void mpool_get_usage(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+		     struct mpool_usage *usage);
+
+/**
+ * mpool_config_store() - store a config record in MDC0
+ * @mp:
+ * @cfg:
+ */
+int mpool_config_store(struct mpool_descriptor *mp, const struct mpool_config *cfg);
+
+/**
+ * mpool_config_fetch() - fetch the current mpool config
+ * @mp:
+ * @cfg:
+ */
+int mpool_config_fetch(struct mpool_descriptor *mp, struct mpool_config *cfg);
+
+#endif /* MPOOL_MP_H */
diff --git a/drivers/mpool/mpcore.h b/drivers/mpool/mpcore.h
new file mode 100644
index 000000000000..904763d49814
--- /dev/null
+++ b/drivers/mpool/mpcore.h
@@ -0,0 +1,354 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MPCORE_H
+#define MPOOL_MPCORE_H
+
+#include <linux/rbtree.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+
+#include "uuid.h"
+
+#include "mp.h"
+#include "pd.h"
+#include "smap.h"
+#include "mclass.h"
+#include "pmd.h"
+#include "params.h"
+
+extern struct rb_root mpool_pools;
+
+struct pmd_layout;
+
+/**
+ * enum mpool_status -
+ * @MPOOL_STAT_UNDEF:
+ * @MPOOL_STAT_OPTIMAL:
+ * @MPOOL_STAT_FAULTED:
+ */
+enum mpool_status {
+	MPOOL_STAT_UNDEF    = 0,
+	MPOOL_STAT_OPTIMAL  = 1,
+	MPOOL_STAT_FAULTED  = 2,
+	MPOOL_STAT_LAST = MPOOL_STAT_FAULTED,
+};
+
+_Static_assert((MPOOL_STAT_LAST < 256), "enum mpool_status must fit in u8");
+
+/**
+ * struct mpool_dev_info - Pool drive state, status, and params
+ * @pdi_status:   enum pd_status value: drive status (get/set via accessors)
+ * @pdi_parm:     drive parms
+ * @pdi_ds:       drive space allocation info
+ * @pdi_rmbktv:   per allocation zone space map buckets
+ * @pdi_devid:    UUID for this drive
+ *
+ * Pool drive state, status, and params
+ *
+ * LOCKING:
+ *    devid, mclass: constant; no locking required
+ *    parm: constant EXCEPT in rare change of status from UNAVAIL; see below
+ *    status: usage does not require locking, but MUST get/set via accessors
+ *    state: protected by pdvlock in enclosing mpool_descriptor
+ *    ds: protected by ds.dalock defined in smap module
+ *
+ * parm fields are constant except in a rare change of status from UNAVAIL,
+ * during which a subset of the fields are modified.  See the pd module for
+ * details on how this is handled w/o requiring locking.
+ */
+struct mpool_dev_info {
+	atomic_t                pdi_status; /* Barriers or acq/rel required */
+	struct pd_dev_parm      pdi_parm;
+	struct smap_dev_alloc   pdi_ds;
+	struct rmbkt           *pdi_rmbktv;
+	struct mpool_uuid       pdi_devid;
+};
+
+/* Shortcuts */
+#define pdi_didstr    pdi_parm.dpr_prop.pdp_didstr
+#define pdi_zonepg    pdi_parm.dpr_prop.pdp_zparam.dvb_zonepg
+#define pdi_zonetot   pdi_parm.dpr_prop.pdp_zparam.dvb_zonetot
+#define pdi_devtype   pdi_parm.dpr_prop.pdp_devtype
+#define pdi_cmdopt    pdi_parm.dpr_prop.pdp_cmdopt
+#define pdi_mclass    pdi_parm.dpr_prop.pdp_mclassp
+#define pdi_devsz     pdi_parm.dpr_prop.pdp_devsz
+#define pdi_sectorsz  pdi_parm.dpr_prop.pdp_sectorsz
+#define pdi_optiosz   pdi_parm.dpr_prop.pdp_optiosz
+#define pdi_fua       pdi_parm.dpr_prop.pdp_fua
+#define pdi_prop      pdi_parm.dpr_prop
+#define pdi_name      pdi_parm.dpr_name
+
+/**
+ * struct uuid_to_mpdesc_rb -
+ * @utm_node:
+ * @utm_uuid_le:
+ * @utm_md:
+ */
+struct uuid_to_mpdesc_rb {
+	struct rb_node              utm_node;
+	struct mpool_uuid           utm_uuid_le;
+	struct mpool_descriptor    *utm_md;
+};
+
+/**
+ * struct mpdesc_mdparm - parameters used for the MDCs of the mpool.
+ * @md_mclass:  media class used for the mpool metadata
+ */
+struct mpdesc_mdparm {
+	u8     md_mclass;
+};
+
+/**
+ * struct pre_compact_ctrl - used to start/stop/control precompaction
+ * @pco_dwork:
+ * @pco_mp:
+ * @pco_nmtoc: next MDC to compact
+ *
+ * Each time pmd_precompact_cb() runs it will consider the next MDC
+ * for compaction.
+ */
+struct pre_compact_ctrl {
+	struct delayed_work	 pco_dwork;
+	struct mpool_descriptor *pco_mp;
+	atomic_t		 pco_nmtoc;
+};
+
+/**
+ * struct mpool_descriptor - Media pool descriptor
+ * @pds_pdvlock:  drive membership/state lock
+ * @pds_pdv:      per drive info array
+ * @pds_oml_lock: open mlog index lock
+ * @pds_oml_root: rbtree of open mlog layouts, indexed by objid;
+ *                node type: objid_to_layout_rb
+ * @pds_poolid:   UUID of pool
+ * @pds_mdparm:   mclass id of mclass used for mdc layouts
+ * @pds_cfg:      mpool config
+ * @pds_pdvcnt:   cnt of valid pdv entries
+ * @pds_mc:       table of media classes
+ * @pds_node:     for linking this object into an rbtree
+ * @pds_params:   Per mpool parameters
+ * @pds_workq:    Workqueue per mpool.
+ * @pds_sbmdc0:   Used to store in RAM the MDC0 metadata. Loaded at activate
+ *                time, changed when MDC0 is compacted.
+ * @pds_mda:      metadata container array (this thing is huge!)
+ *
+ * LOCKING:
+ *    poolid, ospagesz, mdparm: constant; no locking required
+ *    mda: protected by internal locks as documented in pmd module
+ *    pds_oml_root: protected by pds_oml_lock
+ *    pdv: see note
+ *    pds_mc: protected by pds_pdvlock
+ *	Update of pds_mc[].mc_sparams.mc_spzone must also be enclosed
+ *	with mpool_s_lock to serialize the spzone updates, because they include
+ *	an append of an MDC0 record on top of updating mc_spzone.
+ *    all other fields: protected by pds_pdvlock (as is pds_pdv[x].state)
+ *    pds_sbmdc0: Used to store in RAM the MDC0 metadata. Loaded when mpool
+ *	activated, no lock needed at that time (single) threaded.
+ *	Then changed during MDC0 compaction. At that time it is protected by
+ *	MDC0 compact lock.
+ *
+ * NOTE:
+ *    pds_pdvcnt only ever increases so that pds_pdv[x], x < pdvcnt, can be
+ *    accessed without locking, other than as required by the struct
+ *    mpool_dev_info.
+ *    mc_spzone is written and read only by mpool functions that are serialized
+ *    via mpool_s_lock.
+ */
+struct mpool_descriptor {
+	struct rw_semaphore         pds_pdvlock;
+
+	____cacheline_aligned
+	struct mpool_dev_info       pds_pdv[MPOOL_DRIVES_MAX];
+
+	____cacheline_aligned
+	struct mutex                pds_oml_lock;
+	struct rb_root              pds_oml_root;
+
+	/* Read-mostly fields... */
+	____cacheline_aligned
+	u16                         pds_pdvcnt;
+	struct mpdesc_mdparm        pds_mdparm;
+	struct workqueue_struct    *pds_workq;
+	struct workqueue_struct    *pds_erase_wq;
+	struct workqueue_struct    *pds_precompact_wq;
+
+	struct media_class          pds_mc[MP_MED_NUMBER];
+	struct mpcore_params        pds_params;
+	struct omf_sb_descriptor    pds_sbmdc0;
+	struct pre_compact_ctrl     pds_pco;
+	struct smap_usage_work      pds_smap_usage_work;
+
+	/* Rarely used fields... */
+	struct mpool_config         pds_cfg;
+	struct rb_node              pds_node;
+	struct mpool_uuid           pds_poolid;
+	char                        pds_name[MPOOL_NAMESZ_MAX];
+
+	/* pds_mda is enormous (91K) */
+	struct pmd_mda_info         pds_mda;
+};
+
+/**
+ * mpool_desc_unavail_add() - Add unavailable drive to mpool descriptor.
+ * @mp:
+ * @omf_devparm:
+ *
+ * Add unavailable drive to mpool descriptor; caller must guarantee that
+ * devparm.devid is not already there.
+ * As part of adding the drive to the mpool descriptor, the drive is added
+ * in its media class.
+ *
+ * Return: 0 if successful, -errno (-EINVAL or -ENOMEM) otherwise
+ */
+int mpool_desc_unavail_add(struct mpool_descriptor *mp, struct omf_devparm_descriptor *devparm);
+
+/**
+ * mpool_desc_pdmc_add() - Add a device in its media class.
+ * @mp:
+ * @pdh:
+ * @omf_devparm:
+ * @check_only: if true, the call doesn't change any state; it only checks
+ *	whether the PD could be added to a media class.
+ *
+ * If the media class doesn't exist yet, it is created here.
+ *
+ * This function has two inputs related to the PD it is acting on:
+ * "pdh" and "omf_devparm".
+ *
+ * If omf_devparm is NULL, the media class in which the PD must be placed
+ * is derived from mp->pds_pdv[pdh].pdi_parm.dpr_prop. In that case the PD
+ * properties (.dpr_prop) must be up to date and correct when entering this
+ * function. omf_devparm is NULL when the device is available, meaning that
+ * discovery was able to update .dpr_prop.
+ *
+ * If omf_devparm is not NULL, the media class in which the PD must be
+ * placed is derived from omf_devparm. This is used when unavailable PDs
+ * are placed in their media class: because the PD is unavailable, discovery
+ * could not determine the PD properties, so mp->pds_pdv[pdh].pdi_parm.dpr_prop
+ * has not been updated. In that case we can't use .dpr_prop to place the PD
+ * in its class; instead we use what comes from the persistent metadata (the
+ * PD state record in MDC0), aka omf_devparm.
+ * mp->pds_pdv[pdh].pdi_parm.dpr_prop will be updated if/when the PD becomes
+ * available again.
+ *
+ * Restrictions on placing PDs in media classes
+ * --------------------------------------------
+ * This function enforces the following restrictions:
+ * a) In an mpool, for a given mclassp (enum mp_media_classp), there is
+ *    at most one media class.
+ * b) All drives of a media class must be checksummed or none; no mix allowed.
+ * c) The STAGING and CAPACITY classes must be both checksummed or both not
+ *    checksummed.
+ *
+ * Locking:
+ * -------
+ *	Should be called with mp.pds_pdvlock held for writing, except when
+ *	the mpool is single-threaded (during activate, for example).
+ */
+int
+mpool_desc_pdmc_add(
+	struct mpool_descriptor		*mp,
+	u16				 pdh,
+	struct omf_devparm_descriptor	*omf_devparm,
+	bool				 check_only);
+
+int uuid_to_mpdesc_insert(struct rb_root *root, struct mpool_descriptor *data);
+
+int
+mpool_dev_sbwrite(
+	struct mpool_descriptor    *mp,
+	struct mpool_dev_info      *pd,
+	struct omf_sb_descriptor   *sbmdc0);
+
+int
+mpool_mdc0_sb2obj(
+	struct mpool_descriptor    *mp,
+	struct omf_sb_descriptor   *sb,
+	struct pmd_layout         **l1,
+	struct pmd_layout         **l2);
+
+int mpool_desc_init_newpool(struct mpool_descriptor *mp, u32 flags);
+
+int
+mpool_dev_init_all(
+	struct mpool_dev_info  *pdv,
+	u64                     dcnt,
+	char                  **dpaths,
+	struct pd_prop	       *pd_prop);
+
+void mpool_mdc_cap_init(struct mpool_descriptor *mp, struct mpool_dev_info *pd);
+
+int
+mpool_desc_init_sb(
+	struct mpool_descriptor    *mp,
+	struct omf_sb_descriptor   *sbmdc0,
+	u32                         flags,
+	bool                       *mc_resize);
+
+int mpool_dev_sbwrite_newpool(struct mpool_descriptor *mp, struct omf_sb_descriptor *sbmdc0);
+
+int check_for_dups(char **listv, int cnt, int *dup, int *offset);
+
+void fill_in_devprops(struct mpool_descriptor *mp, u64 pdh, struct mpool_devprops *dprop);
+
+int mpool_create_rmlogs(struct mpool_descriptor *mp, u64 mlog_cap);
+
+struct mpool_descriptor *mpool_desc_alloc(void);
+
+void mpool_desc_free(struct mpool_descriptor *mp);
+
+int mpool_dev_check_new(struct mpool_descriptor *mp, struct mpool_dev_info *pd);
+
+static inline enum pd_status mpool_pd_status_get(struct mpool_dev_info *pd)
+{
+	enum pd_status  val;
+
+	/* Acquire semantics used so that no reads will be re-ordered from
+	 * before to after this read.
+	 */
+	val = atomic_read_acquire(&pd->pdi_status);
+
+	return val;
+}
+
+static inline void mpool_pd_status_set(struct mpool_dev_info *pd, enum pd_status status)
+{
+	/* All prior writes must be visible prior to the status change */
+	smp_wmb();
+	atomic_set(&pd->pdi_status, status);
+}
+
+/**
+ * mpool_get_mpname() - Get the mpool name
+ * @mp:     mpool descriptor of the mpool
+ * @mpname: buffer to copy the mpool name into
+ * @mplen:  buffer length
+ *
+ * Return:
+ * %0 if successful, -EINVAL otherwise
+ */
+static inline int mpool_get_mpname(struct mpool_descriptor *mp, char *mpname, size_t mplen)
+{
+	if (!mp || !mpname)
+		return -EINVAL;
+
+	strlcpy(mpname, mp->pds_name, mplen);
+
+	return 0;
+}
+
+
+#endif /* MPOOL_MPCORE_H */
diff --git a/drivers/mpool/params.h b/drivers/mpool/params.h
new file mode 100644
index 000000000000..5d1f40857a2a
--- /dev/null
+++ b/drivers/mpool/params.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_PARAMS_H
+#define MPOOL_PARAMS_H
+
+#define MPOOL_MDC_SET_SZ                16
+
+/* Mpool metadata container compaction retries; keep relatively small */
+#define MPOOL_MDC_COMPACT_RETRY_DEFAULT 5
+
+/*
+ * Space map allocation zones per drive; bounds the number of concurrent
+ * object allocations
+ */
+#define MPOOL_SMAP_RGNCNT_DEFAULT       4
+
+/*
+ * Space map alignment in number of zones.
+ */
+#define MPOOL_SMAP_ZONEALIGN_DEFAULT    1
+
+/*
+ * Number of concurrent jobs for loading user MDC 1~N
+ */
+#define MPOOL_OBJ_LOAD_JOBS_DEFAULT     8
+
+/*
+ * Defaults for MDC1/255 pre-compaction.
+ */
+#define MPOOL_PCO_PCTFULL               70
+#define MPOOL_PCO_PCTGARBAGE            20
+#define MPOOL_PCO_NBNOALLOC              2
+#define MPOOL_PCO_PERIOD                 5
+#define MPOOL_PCO_FILLBIAS	      1000
+#define MPOOL_PD_USAGE_PERIOD        60000
+#define MPOOL_CREATE_MDC_PCTFULL  (MPOOL_PCO_PCTFULL - MPOOL_PCO_PCTGARBAGE)
+#define MPOOL_CREATE_MDC_PCTGRBG   MPOOL_PCO_PCTGARBAGE
+
+
+/**
+ * struct mpcore_params - mpool core parameters. Not exported to public API.
+ * @mp_mdc0cap: MDC0 capacity,  *ONLY* for testing purposes
+ * @mp_mdcncap: MDCN capacity,  *ONLY* for testing purposes
+ * @mp_mdcnum:  number of MDCs, *ONLY* for testing purposes
+ * @mp_smaprgnc:
+ * @mp_smapalign:
+ * @mp_spare:
+ * @mp_objloadjobs: number of concurrent MDC loading jobs
+ *
+ * The parameters below starting with "pco" are used for pre-compaction
+ * of MDC1/255.
+ * @mp_pcopctfull:  % (0-100) of fill of the MDCi active mlog that must be
+ *	reached before a pre-compaction is attempted.
+ * @mp_pcopctgarbage:  % (0-100) of garbage in the MDCi active mlog that must
+ *	be reached before a pre-compaction is attempted.
+ * @mp_pconbnoalloc: number of MDCs from which no object is allocated.
+ *	If 0, background pre-compaction is disabled.
+ * @mp_pcoperiod: in seconds, the period at which a background thread checks
+ *	whether an MDC needs compaction.
+ * @mp_pcofillbias: if the next mpool MDC has fewer objects than
+ *	(current MDC objects - pcofillbias), then allocate an object
+ *	from the next MDC instead of from the current one.
+ *	This bias favors object allocation from less-filled MDCs (in terms
+ *	of the number of committed objects).
+ *	The larger the number, the smaller the bias.
+ * @mp_crtmdcpctfull: percent-full threshold across all MDCs; used in
+ *      combination with @mp_crtmdcpctgrbg as a trigger to create new MDCs
+ * @mp_crtmdcpctgrbg: percent-garbage threshold; used in combination with
+ *      @mp_crtmdcpctfull as a trigger to create new MDCs
+ * @mp_mpusageperiod: period at which a background thread checks mpool space
+ *      usage, in milliseconds
+ */
+struct mpcore_params {
+	u64    mp_mdcnum;
+	u64    mp_mdc0cap;
+	u64    mp_mdcncap;
+	u64    mp_smaprgnc;
+	u64    mp_smapalign;
+	u64    mp_spare;
+	u64    mp_objloadjobs;
+	u64    mp_pcopctfull;
+	u64    mp_pcopctgarbage;
+	u64    mp_pconbnoalloc;
+	u64    mp_pcoperiod;
+	u64    mp_pcofillbias;
+	u64    mp_crtmdcpctfull;
+	u64    mp_crtmdcpctgrbg;
+	u64    mp_mpusageperiod;
+};
+
+/**
+ * mpcore_params_defaults() -
+ */
+static inline void mpcore_params_defaults(struct mpcore_params *params)
+{
+	params->mp_mdcnum          = MPOOL_MDCNUM_DEFAULT;
+	params->mp_mdc0cap         = 0;
+	params->mp_mdcncap         = 0;
+	params->mp_smaprgnc        = MPOOL_SMAP_RGNCNT_DEFAULT;
+	params->mp_smapalign       = MPOOL_SMAP_ZONEALIGN_DEFAULT;
+	params->mp_spare           = MPOOL_SPARES_DEFAULT;
+	params->mp_pcopctfull	   = MPOOL_PCO_PCTFULL;
+	params->mp_pcopctgarbage   = MPOOL_PCO_PCTGARBAGE;
+	params->mp_pconbnoalloc    = MPOOL_PCO_NBNOALLOC;
+	params->mp_pcoperiod       = MPOOL_PCO_PERIOD;
+	params->mp_pcofillbias     = MPOOL_PCO_FILLBIAS;
+	params->mp_crtmdcpctfull   = MPOOL_CREATE_MDC_PCTFULL;
+	params->mp_crtmdcpctgrbg   = MPOOL_CREATE_MDC_PCTGRBG;
+	params->mp_mpusageperiod   = MPOOL_PD_USAGE_PERIOD;
+	params->mp_objloadjobs     = MPOOL_OBJ_LOAD_JOBS_DEFAULT;
+}
+
+#endif /* MPOOL_PARAMS_H */
diff --git a/drivers/mpool/pd.h b/drivers/mpool/pd.h
new file mode 100644
index 000000000000..c8faefc7cf11
--- /dev/null
+++ b/drivers/mpool/pd.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_PD_H
+#define MPOOL_PD_H
+
+#include <linux/uio.h>
+
+#include "uuid.h"
+#include "mpool_ioctl.h"
+
+/* Returns PD length in bytes. */
+#define PD_LEN(_pd_prop) ((_pd_prop)->pdp_devsz)
+
+/* Returns PD sector size (exponent, power of 2) */
+#define PD_SECTORSZ(_pd_prop) ((_pd_prop)->pdp_sectorsz)
+
+/* Return PD sector size mask */
+#define PD_SECTORMASK(_pd_prop) ((uint64_t)(1 << PD_SECTORSZ(_pd_prop)) - 1)
+
+struct omf_devparm_descriptor;
+
+/**
+ * struct pd_dev_parm -
+ * @dpr_prop:		drive properties including zone parameters
+ * @dpr_dev_private:    private info for implementation
+ * @dpr_name:           device name
+ */
+struct pd_dev_parm {
+	struct pd_prop	         dpr_prop;
+	void		        *dpr_dev_private;
+	char                     dpr_name[PD_NAMESZ_MAX];
+};
+
+/* Shortcuts */
+#define dpr_zonepg        dpr_prop.pdp_zparam.dvb_zonepg
+#define dpr_zonetot       dpr_prop.pdp_zparam.dvb_zonetot
+#define dpr_devsz         dpr_prop.pdp_devsz
+#define dpr_didstr        dpr_prop.pdp_didstr
+#define dpr_mediachar     dpr_prop.pdp_mediachar
+#define dpr_cmdopt        dpr_prop.pdp_cmdopt
+#define dpr_optiosz       dpr_prop.pdp_optiosz
+
+/**
+ * enum pd_status - Transient drive status.
+ * @PD_STAT_UNDEF:       undefined; should never occur
+ * @PD_STAT_ONLINE:      drive is responding to I/O requests
+ * @PD_STAT_SUSPECT:     drive is failing some I/O requests
+ * @PD_STAT_OFFLINE:     drive declared non-responsive to I/O requests
+ * @PD_STAT_UNAVAIL:     drive path not provided or open failed when mpool was opened
+ *
+ * Transient drive status; these values are stored in an atomic_t
+ * variable
+ */
+enum pd_status {
+	PD_STAT_UNDEF      = 0,
+	PD_STAT_ONLINE     = 1,
+	PD_STAT_SUSPECT    = 2,
+	PD_STAT_OFFLINE    = 3,
+	PD_STAT_UNAVAIL    = 4
+};
+
+_Static_assert((PD_STAT_UNAVAIL < 256), "enum pd_status must fit in uint8_t");
+
+/**
+ * enum pd_cmd_opt - drive command options
+ * @PD_CMD_DISCARD:	     the device has TRIM/UNMAP command.
+ * @PD_CMD_SECTOR_UPDATABLE: the device can be read/written with sector granularity.
+ * @PD_CMD_DIF_ENABLED:      T10 DIF is used on this device.
+ * @PD_CMD_SED_ENABLED:      Self encrypting enabled
+ * @PD_CMD_DISCARD_ZERO:     the device supports discard_zero
+ * @PD_CMD_RDONLY:           activate mpool with PDs in RDONLY mode,
+ *                           write/discard commands are No-OPs.
+ * Defined as a bit vector so options can be combined.
+ * Fields holding such a vector should be uint64_t.
+ *
+ * TODO: we need to find a way to detect if SED is enabled on a device
+ */
+enum pd_cmd_opt {
+	PD_CMD_NONE             = 0,
+	PD_CMD_DISCARD          = 0x1,
+	PD_CMD_SECTOR_UPDATABLE = 0x2,
+	PD_CMD_DIF_ENABLED      = 0x4,
+	PD_CMD_SED_ENABLED      = 0x8,
+	PD_CMD_DISCARD_ZERO     = 0x10,
+	PD_CMD_RDONLY           = 0x20,
+};
+
+/**
+ * enum pd_devtype - Device types
+ * @PD_DEV_TYPE_BLOCK_STREAM: Block device implementing streams.
+ * @PD_DEV_TYPE_BLOCK_STD:    Standard (non-streams) device (SSD, HDD).
+ * @PD_DEV_TYPE_FILE:	      File in user space for UT.
+ * @PD_DEV_TYPE_MEM:	      Memory semantic device, e.g. NVDIMM direct access (raw or dax mode)
+ * @PD_DEV_TYPE_ZONE:	      zone-like device, e.g., open channel SSD and SMR HDD (using ZBC/ZAC)
+ * @PD_DEV_TYPE_BLOCK_NVDIMM: Standard (non-streams) NVDIMM in sector mode.
+ */
+enum pd_devtype {
+	PD_DEV_TYPE_BLOCK_STREAM = 1,
+	PD_DEV_TYPE_BLOCK_STD,
+	PD_DEV_TYPE_FILE,
+	PD_DEV_TYPE_MEM,
+	PD_DEV_TYPE_ZONE,
+	PD_DEV_TYPE_BLOCK_NVDIMM,
+	PD_DEV_TYPE_LAST = PD_DEV_TYPE_BLOCK_NVDIMM,
+};
+
+_Static_assert((PD_DEV_TYPE_LAST < 256), "enum pd_devtype must fit in uint8_t");
+
+/**
+ * enum pd_state - Device states
+ * @PD_DEV_STATE_AVAIL:       Device is available
+ * @PD_DEV_STATE_UNAVAIL:     Device is unavailable
+ */
+enum pd_state {
+	PD_DEV_STATE_UNDEFINED = 0,
+	PD_DEV_STATE_AVAIL = 1,
+	PD_DEV_STATE_UNAVAIL = 2,
+	PD_DEV_STATE_LAST = PD_DEV_STATE_UNAVAIL,
+};
+
+_Static_assert((PD_DEV_STATE_LAST < 256), "enum pd_state must fit in uint8_t");
+
+/*
+ * pd API functions -- device-type independent dparm ops
+ */
+
+/*
+ * Error codes: All pd functions can return one or more of:
+ *
+ * -EINVAL    invalid fn args
+ * -EBADSLT   attempt to read or write a bad zone on a zone device
+ * -EIO       all other errors
+ */
+
+int pd_dev_open(const char *path, struct pd_dev_parm *dparm, struct pd_prop *pd_prop);
+int pd_dev_close(struct pd_dev_parm *dparm);
+int pd_dev_flush(struct pd_dev_parm *dparm);
+
+/**
+ * pd_zone_erase() -
+ * @dparm:
+ * @zaddr:
+ * @zonecnt:
+ * @reads_erased: whether the data can be read post DISCARD
+ *
+ * Return:
+ */
+int pd_zone_erase(struct pd_dev_parm *dparm, u64 zaddr, u32 zonecnt, bool reads_erased);
+
+/*
+ * pd API functions - device dependent operations
+ */
+
+/**
+ * pd_zone_pwritev() -
+ * @dparm:
+ * @iov:
+ * @iovcnt:
+ * @zaddr:
+ * @boff: offset in bytes from the start of "zaddr".
+ * @opflags:
+ *
+ * Return:
+ */
+int pd_zone_pwritev(struct pd_dev_parm *dparm, const struct kvec *iov,
+		    int iovcnt, u64 zaddr, loff_t boff, int opflags);
+
+/**
+ * pd_zone_pwritev_sync() -
+ * @dparm:
+ * @iov:
+ * @iovcnt:
+ * @zaddr:
+ * @boff: Offset in bytes from the start of zaddr.
+ *
+ * Return:
+ */
+int pd_zone_pwritev_sync(struct pd_dev_parm *dparm, const struct kvec *iov,
+			 int iovcnt, u64 zaddr, loff_t boff);
+
+/**
+ * pd_zone_preadv() -
+ * @dparm:
+ * @iov:
+ * @iovcnt:
+ * @zaddr: target zone for this I/O
+ * @boff:    byte offset into the target zone
+ *
+ * Return:
+ */
+int pd_zone_preadv(struct pd_dev_parm *dparm, const struct kvec *iov,
+		   int iovcnt, u64 zaddr, loff_t boff);
+
+void pd_dev_set_unavail(struct pd_dev_parm *dparm, struct omf_devparm_descriptor *omf_devparm);
+
+int pd_init(void) __cold;
+void pd_exit(void) __cold;
+
+#endif /* MPOOL_PD_H */
diff --git a/drivers/mpool/pmd.h b/drivers/mpool/pmd.h
new file mode 100644
index 000000000000..5fd6ca020fd1
--- /dev/null
+++ b/drivers/mpool/pmd.h
@@ -0,0 +1,379 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_PMD_H
+#define MPOOL_PMD_H
+
+#include <linux/atomic.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+
+#include "mpool_ioctl.h"
+#include "omf_if.h"
+#include "pmd_obj.h"
+
+/**
+ * DOC: Module info.
+ *
+ * Pool metadata (pmd) module.
+ *
+ * Implements functions for mpool metadata management.
+ *
+ */
+
+struct mpool_descriptor;
+struct mpool_dev_info;
+struct mp_mdc;
+struct pmd_layout;
+struct mpool_config;
+
+/**
+ * DOC: Object lifecycle
+ *
+ * +) all mblock/mlog objects are owned by mpool layer users, excepting
+ *     mdc mlogs
+ * +) users are responsible for object lifecycle mgmt and must not violate it;
+ *    e.g. by using an object handle (layout pointer) after deleting that
+ *    object
+ * +) the mpool layer never independently aborts or deletes user objects
+ */
+
+/**
+ * DOC: Object ids
+ * Object ids for mblocks and mlogs are a u64 of the form:
+ * <uniquifier (52 bits), type (4 bits), slot # (8 bits)>
+ *
+ */
+
+/**
+ * DOC: NOTES
+ * + metadata for a given object is stored in the mdc specified by slot #
+ * + uniquifiers are only guaranteed unique for a given slot #
+ * + metadata for all mdc (except mdc 0) are stored in mdc 0
+ * + mdc 0 is a distinguished container whose metadata is stored in superblocks
+ * + mdc 0 only stores object metadata for mdc 1-255
+ * + mdc N is implemented via mlogs with objids (2N, MLOG, 0) & (2N+1, MLOG, 0)
+ * + mdc 0 mlog objids are (0, MLOG, 0) and (1, MLOG, 0) where a slot # of 0
+ *   indicates the mlog metadata is stored in mdc 0 whereas it is actually in
+ *   superblocks; see comments in pmd_mdc0_init() for how we exploit this.
+ */
+
+/**
+ * struct pre_compact_ctrs - object record counters, used for pre-compaction of MDC1/255.
+ * @pcc_cr:   count of object create records
+ * @pcc_up:   count of object update records
+ * @pcc_del:  count of object delete records. If the object is scheduled for
+ *	deletion in the background, the counter is incremented (while the
+ *	delete record has not been written yet).
+ * @pcc_er:   count of object erase records
+ * @pcc_cobj: count of committed objects (and not deleted).
+ * @pcc_cap:  in bytes, size of each mlog of the MDC
+ * @pcc_len:  in bytes, fill level of the active mlog
+ *
+ * One such structure per mpool MDC.
+ *
+ * Locking:
+ *	Updates are serialized by the MDC compact lock.
+ *	The reads by the pre-compaction thread are done without holding any
+ *	lock. This is why atomic variables are used.
+ *	However, because the variables are integers, the atomic read translates
+ *	into a simple load and the set translates into a simple store.
+ *
+ * The counters pcc_up, pcc_del, pcc_er are cleared at each compaction.
+ *
+ * Relaxed access is appropriate for all of these atomics
+ */
+struct pre_compact_ctrs {
+	atomic_t   pcc_cr;
+	atomic_t   pcc_up;
+	atomic_t   pcc_del;
+	atomic_t   pcc_er;
+	atomic_t   pcc_cobj;
+	atomic64_t pcc_cap;
+	atomic64_t pcc_len;
+};
+
+/**
+ * struct credit_info - mdc selector info
+ * @ci_credit:      available credit
+ * @ci_free:        available free space
+ * @ci_slot:        MDC slot number
+ *
+ * Contains information about available credit and a balance. Available
+ * credit is based on the rate at which records can be written to each
+ * MDC such that all MDCs fill at the same time.
+ */
+struct credit_info  {
+	u64                 ci_credit;
+	u64                 ci_free;
+	u8                  ci_slot;
+};
+
+/**
+ * struct pmd_mdc_stats - per MDC space usage stats
+ * @pms_mblock_alen: mblock alloc len
+ * @pms_mblock_wlen: mblock write len
+ * @pms_mlog_alen: mlog alloc len
+ * @pms_mblock_cnt: mblock count
+ * @pms_mlog_cnt: mlog count
+ */
+struct pmd_mdc_stats {
+	u64    pms_mblock_alen;
+	u64    pms_mblock_wlen;
+	u64    pms_mlog_alen;
+	u32    pms_mblock_cnt;
+	u32    pms_mlog_cnt;
+};
+
+/**
+ * struct pmd_mdc_info - Metadata container (mdc) info.
+ * @mmi_compactlock: compaction lock
+ * @mmi_uc_lock:     uncommitted objects tree lock
+ * @mmi_uc_root:     uncommitted objects tree root
+ * @mmi_co_lock:     committed objects tree lock
+ * @mmi_co_root:     committed objects tree root
+ * @mmi_uqlock:      uniquifier lock
+ * @mmi_luniq:       uniquifier of last object assigned to container
+ * @mmi_mdc:         MDC implementing container
+ * @mmi_recbuf:      buffer for (un)packing log records
+ * @mmi_lckpt:       last objid checkpointed
+ * @mmi_stats:       per-MDC usage stats
+ * @mmi_stats_lock:  lock for protecting mmi_stats
+ * @mmi_pco_cnt:     counters used by the pre compaction of MDC1/255.
+ * @mmi_mdcver:      version of the MDC content on media when the mpool was
+ *                   activated. This may not be the current version on media
+ *                   if an MDC metadata conversion took place during activate.
+ * @mmi_credit:      MDC credit info
+ *
+ * LOCKING:
+ * + mmi_luniq: protected by uqlock
+ * + mmi_mdc, recbuf, lckpt: protected by compactlock
+ * + mmi_co_root: protected by co_lock
+ * + mmi_uc_root: protected by uc_lock
+ * + mmi_stats: protected by mmi_stats_lock
+ * + mmi_pco_counters: updates serialized by mmi_compactlock
+ *
+ * NOTE:
+ *  + for mdc0 mmi_luniq is the slot # of the last mdc created
+ *  + logging to a mdc cannot execute concurrent with compacting
+ *    that mdc;
+ *    mmi_compactlock is used to enforce this
+ *  + compacting a mdc requires freezing both the list of committed
+ *    objects in that mdc and the metadata for those objects;
+ *    compactlock facilitates this in a way that avoids locking each
+ *    object during compaction; as a result object metadata updates
+ *    are serialized, but even without mdc compaction this would be
+ *    the case because all such metadata updates must be logged to
+ *    the object's mdc and mdc logging is inherently serial
+ *  + see struct pmd_layout comments for specifics on how
+ *    compactlock is used to freeze metadata for committed objects
+ */
+struct pmd_mdc_info {
+	struct mutex            mmi_compactlock;
+	char                   *mmi_recbuf;
+	u64                     mmi_lckpt;
+	struct mp_mdc          *mmi_mdc;
+
+	____cacheline_aligned
+	struct mutex            mmi_uc_lock;
+	struct rb_root          mmi_uc_root;
+
+	____cacheline_aligned
+	struct rw_semaphore     mmi_co_lock;
+	struct rb_root          mmi_co_root;
+
+	____cacheline_aligned
+	struct mutex            mmi_uqlock;
+	u64                     mmi_luniq;
+
+	____cacheline_aligned
+	struct credit_info      mmi_credit;
+	struct omf_mdcver       mmi_mdcver;
+
+	____cacheline_aligned
+	struct mutex            mmi_stats_lock;
+	struct pmd_mdc_stats    mmi_stats;
+
+	struct pre_compact_ctrs mmi_pco_cnt;
+};
+
+/**
+ * struct pmd_mdc_selector - Object containing MDC slots for allocation
+ * @mds_tbl_idx:      idx of the MDC slot selector in the mds_tbl
+ * @mds_tbl:          slot table used for MDC selection
+ * @mds_mdc:          scratch pad for sorting mdc by free size
+ *
+ * LOCKING:
+ *  + mdi_slotvlock lock will be taken to protect this object.
+ *
+ */
+struct pmd_mdc_selector {
+	atomic_t    mds_tbl_idx;
+	u8          mds_tbl[MDC_TBL_SZ];
+	void       *mds_smdc[MDC_SLOTS];
+};
+
+/**
+ * struct pmd_mda_info - Metadata container array (mda).
+ * @mdi_slotvlock:   it is assumed that this spinlock is NOT taken from interrupt context
+ * @mdi_slotvcnt:    number of active slotv entries
+ * @mdi_slotv:       per mdc info
+ * @mdi_sel:         MDC allocation selector
+ *
+ * LOCKING:
+ *  + mdi_slotvcnt: protected by mdi_slotvlock
+ *
+ * NOTE:
+ *  + mdi_slotvcnt only ever increases so mdi_slotv[x], x < mdi_slotvcnt, is
+ *    always active
+ *  + all mdi_slotv[] entries are initialized whether or not active so they
+ *    can all be accessed w/o locking except as required by pmd_mdc_info struct
+ */
+struct pmd_mda_info {
+	spinlock_t              mdi_slotvlock;
+	u16                     mdi_slotvcnt;
+
+	struct pmd_mdc_info     mdi_slotv[MDC_SLOTS];
+	struct pmd_mdc_selector mdi_sel;
+};
+
+/**
+ * struct pmd_obj_load_work - work struct for loading MDC 1~N
+ * @olw_work:     work struct
+ * @olw_mp:
+ * @olw_progress: Progress index. It is an (atomic_t *) so that multiple
+ *                pmd_obj_load_work structs can point to a single atomic_t
+ *                for grabbing the next MDC number to be processed.
+ * @olw_err:
+ */
+struct pmd_obj_load_work {
+	struct work_struct          olw_work;
+	struct mpool_descriptor    *olw_mp;
+	atomic_t                   *olw_progress; /* relaxed is correct */
+	atomic_t                   *olw_err;
+};
+
+/**
+ * pmd_mpool_activate() - Load all metadata for mpool mp.
+ * @mp:
+ * @mdc01:
+ * @mdc02:
+ * @create:
+ *
+ * Load all metadata for mpool mp; create flag indicates if is a new pool;
+ * caller must ensure no other thread accesses mp until activation is complete.
+ * Note: the pmd module owns mdc01/mdc02 memory management whether activation
+ * succeeds or fails.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_mpool_activate(struct mpool_descriptor *mp, struct pmd_layout *mdc01,
+		       struct pmd_layout *mdc02, int create);
+
+/**
+ * pmd_mpool_deactivate() - Deactivate mpool mp.
+ * @mp:
+ *
+ * Free all metadata for mpool mp excepting mp itself; caller must ensure
+ * no other thread can access mp during deactivation.
+ */
+void pmd_mpool_deactivate(struct mpool_descriptor *mp);
+
+/**
+ * pmd_mdc_alloc() - Add a metadata container to mpool.
+ * @mp:
+ * @mincap:
+ * @iter: the role of this parameter is to get the active mlogs of the mpool
+ *	MDCs uniformly spread across the mpool devices.
+ *	When pmd_mdc_alloc() is called in a loop to allocate several mpool MDCs,
+ *	iter should be incremented at each subsequent call.
+ *
+ * Add a metadata container (mdc) to mpool with a minimum capacity of mincap
+ * bytes.  Once added an mdc can never be deleted.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_mdc_alloc(struct mpool_descriptor *mp, u64 mincap, u32 iter);
+
+/**
+ * pmd_mdc_cap() - Get metadata container (mdc) capacity stats.
+ * @mp:
+ * @mdcmax:
+ * @mdccap:
+ * @mdc0cap:
+ *
+ * Get metadata container (mdc) stats: count, aggregate capacity ex-mdc0 and
+ * mdc0 cap
+ */
+void pmd_mdc_cap(struct mpool_descriptor *mp, u64 *mdcmax, u64 *mdccap, u64 *mdc0cap);
+
+/**
+ * pmd_prop_mcconfig() -
+ * @mp:
+ * @pd:
+ * @compacting: if true, called by a compaction.
+ *
+ * Persist state (new or update) for drive pd; caller must hold mp.pdvlock
+ * if pd is an in-use member of mp.pdv.
+ *
+ * Locking: caller must hold MDC0 compact lock.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_prop_mcconfig(struct mpool_descriptor *mp, struct mpool_dev_info *pd, bool compacting);
+
+/**
+ * pmd_prop_mcspare() -
+ * @mp:
+ * @mclassp:
+ * @spzone:
+ * @compacting: if true, called by a compaction.
+ *
+ * Persist spare zone info for drives in media class (new or update).
+ *
+ * Locking: caller must hold MDC0 compact lock.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_prop_mcspare(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+		     u8 spzone, bool compacting);
+
+int pmd_prop_mpconfig(struct mpool_descriptor *mp, const struct mpool_config *cfg, bool compacting);
+
+/**
+ * pmd_precompact_start() - start MDC1/255 precompaction
+ * @mp:
+ */
+void pmd_precompact_start(struct mpool_descriptor *mp);
+
+/**
+ * pmd_precompact_stop() - stop MDC1/255 precompaction
+ * @mp:
+ */
+void pmd_precompact_stop(struct mpool_descriptor *mp);
+
+/**
+ * pmd_mdc_addrec_version() - add a version record to an mpool MDC.
+ * @mp:
+ * @cslot:
+ */
+int pmd_mdc_addrec_version(struct mpool_descriptor *mp, u8 cslot);
+
+int pmd_log_delete(struct mpool_descriptor *mp, u64 objid);
+
+int pmd_log_create(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+int pmd_log_erase(struct mpool_descriptor *mp, u64 objid, u64 gen);
+
+int pmd_log_idckpt(struct mpool_descriptor *mp, u64 objid);
+
+#define PMD_MDC0_COMPACTLOCK(_mp) \
+	pmd_mdc_lock(&((_mp)->pds_mda.mdi_slotv[0].mmi_compactlock), 0)
+
+#define PMD_MDC0_COMPACTUNLOCK(_mp) \
+	pmd_mdc_unlock(&((_mp)->pds_mda.mdi_slotv[0].mmi_compactlock))
+
+#endif /* MPOOL_PMD_H */
diff --git a/drivers/mpool/pmd_obj.h b/drivers/mpool/pmd_obj.h
new file mode 100644
index 000000000000..7cf5dea80f9d
--- /dev/null
+++ b/drivers/mpool/pmd_obj.h
@@ -0,0 +1,499 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_PMD_OBJ_H
+#define MPOOL_PMD_OBJ_H
+
+#include <linux/sort.h>
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/rwsem.h>
+#include <linux/workqueue.h>
+
+#include "uuid.h"
+#include "mpool_ioctl.h"
+#include "omf_if.h"
+#include "mlog.h"
+
+struct mpool_descriptor;
+struct pmd_mdc_info;
+
+/*
+ * objid uniquifier checkpoint interval; used to avoid reissuing an outstanding
+ * objid after a crash; supports pmd_{mblock|mlog}_realloc()
+ */
+#define OBJID_UNIQ_POW2 8
+#define OBJID_UNIQ_DELTA (1 << OBJID_UNIQ_POW2)
+
+/* MDC_SLOTS is 256 [0,255] to fit in 8-bit slot field in objid.
+ */
+#define MDC_SLOTS           256
+#define MDC_TBL_SZ          (MDC_SLOTS * 4)
+
+#define UROOT_OBJID_LOG1 logid_make(0, 1)
+#define UROOT_OBJID_LOG2 logid_make(1, 1)
+#define UROOT_OBJID_MAX  1
+
+#define MDC0_OBJID_LOG1 logid_make(0, 0)
+#define MDC0_OBJID_LOG2 logid_make(1, 0)
+
+/**
+ * enum pmd_lock_class -
+ * @PMD_NONE:
+ * @PMD_OBJ_CLIENT:
+ *      For layout rwlock,
+ *              - Object id contains a non-zero slot number
+ * @PMD_MDC_NORMAL:
+ *      For layout rwlock,
+ *              - Object id contains a zero slot number AND
+ *              - Object id is neither of the well-known MDC-0 objids
+ *      For pmd_mdc_info.* locks,
+ *              - Array index of pmd_mda_info.slov[] is > 0.
+ * @PMD_MDC_ZERO:
+ *      For layout rwlock,
+ *              - Object id contains a zero slot number AND
+ *              - Object id is either of the well-known MDC-0 objids
+ *      For pmd_mdc_info.* locks,
+ *              - Array index of pmd_mda_info.slov[] is == 0.
+ *
+ * NOTE:
+ * - Object layout rw locks must be acquired before any MDC locks.
+ * - MDC-0 locks of a given class are below MDC-1/255 locks of those same
+ *   classes.
+ */
+enum pmd_lock_class {
+	PMD_NONE       = 0,
+	PMD_OBJ_CLIENT = 1,
+	PMD_MDC_NORMAL = 2,
+	PMD_MDC_ZERO   = 3,
+};
+
+/**
+ * enum pmd_obj_op -
+ * @PMD_OBJ_LOAD:
+ * @PMD_OBJ_ALLOC:
+ * @PMD_OBJ_COMMIT:
+ * @PMD_OBJ_ABORT:
+ * @PMD_OBJ_DELETE:
+ */
+enum pmd_obj_op {
+	PMD_OBJ_LOAD     = 1,
+	PMD_OBJ_ALLOC    = 2,
+	PMD_OBJ_COMMIT   = 3,
+	PMD_OBJ_ABORT    = 4,
+	PMD_OBJ_DELETE   = 5,
+};
+
+/**
+ * enum pmd_layout_state - object state flags
+ * @PMD_LYT_COMMITTED: object is committed to media
+ * @PMD_LYT_REMOVED:   object logically removed (aborted or deleted)
+ */
+enum pmd_layout_state {
+	PMD_LYT_COMMITTED  = 0x01,
+	PMD_LYT_REMOVED    = 0x02,
+};
+
+/**
+ * struct pmd_layout_mlpriv - mlog private data for pmd_layout
+ * @mlp_uuid:       unique ID per mlog
+ * @mlp_lstat:      mlog status
+ * @mlp_nodeoml:    "open mlog" rbtree linkage
+ */
+struct pmd_layout_mlpriv {
+	struct mpool_uuid   mlp_uuid;
+	struct rb_node      mlp_nodeoml;
+	struct mlog_stat    mlp_lstat;
+};
+
+/**
+ * union pmd_layout_priv - pmd_layout object type specific private data
+ * @mlpriv: mlog private data
+ */
+union pmd_layout_priv {
+	struct pmd_layout_mlpriv    mlpriv;
+};
+
+/**
+ * struct pmd_layout - object layout (in-memory version)
+ * @eld_nodemdc: rbtree node for uncommitted and committed objects
+ * @eld_objid:   object ID associated with layout
+ * @eld_mblen:   Amount of data written in the mblock in bytes (0 for mlogs)
+ * @eld_state:   enum pmd_layout_state
+ * @eld_flags:   enum mlog_open_flags for mlogs
+ * @eld_gen:     object generation
+ * @eld_ld:
+ * @eld_ref:     user ref count from alloc/get/put
+ * @eld_rwlock:  implements pmd_obj_*lock() for this layout
+ * @eld_priv:    object type specific private data (mlog only)
+ *
+ * LOCKING:
+ * + objid: constant; no locking required
+ * + lstat: lstat and *lstat are protected by pmd_obj_*lock()
+ * + all other fields: see notes
+ *
+ * NOTE:
+ * + committed object fields (other): to update hold pmd_obj_wrlock()
+ *   AND
+ *   compactlock for object's mdc; to read hold pmd_obj_*lock()
+ *   See the comments associated with struct pmd_mdc_info for
+ *   further details.
+ *
+ * eld_priv[] contains exactly one element if the object type
+ * is an mlog, otherwise it contains exactly zero elements.
+ */
+struct pmd_layout {
+	struct rb_node                  eld_nodemdc;
+	u64                             eld_objid;
+	u32                             eld_mblen;
+	u8                              eld_state;
+	u8                              eld_flags;
+	u64                             eld_gen;
+	struct omf_layout_descriptor    eld_ld;
+
+	/* The above fields are read-mostly, while the
+	 * following two fields mutate frequently.
+	 */
+	struct kref                     eld_ref;
+	struct rw_semaphore             eld_rwlock;
+
+	union pmd_layout_priv           eld_priv[];
+};
+
+/* Shortcuts for mlog private data...
+ */
+#define eld_mlpriv      eld_priv->mlpriv
+#define eld_uuid        eld_mlpriv.mlp_uuid
+#define eld_lstat       eld_mlpriv.mlp_lstat
+#define eld_nodeoml     eld_mlpriv.mlp_nodeoml
+
+/**
+ * struct pmd_obj_capacity -
+ * @moc_captgt:  capacity target for object in bytes
+ * @moc_spare:   true if allocating the object from spare space
+ */
+struct pmd_obj_capacity {
+	u64    moc_captgt;
+	bool   moc_spare;
+};
+
+/**
+ * struct pmd_obj_erase_work - workqueue job struct for object erase and free
+ * @oef_mp:             mpool
+ * @oef_layout:         object layout
+ * @oef_cache:          kmem cache to free work (or NULL)
+ * @oef_wqstruct:	workq struct
+ */
+struct pmd_obj_erase_work {
+	struct mpool_descriptor    *oef_mp;
+	struct pmd_layout          *oef_layout;
+	struct kmem_cache          *oef_cache;
+	struct work_struct          oef_wqstruct;
+};
+
+/**
+ * struct mdc_csm_info - mdc credit set member info
+ * @m_slot:      mdc slot number
+ * @m_credit:    available credit
+ */
+struct mdc_csm_info {
+	u8   m_slot;
+	u16  m_credit;
+};
+
+/**
+ * struct mdc_credit_set - mdc credit set
+ * @cs_idx:      index of current credit set member
+ * @cs_num_csm:  number of credit set members in this credit set
+ * @csm:         array of credit set members
+ */
+struct mdc_credit_set {
+	u8                    cs_idx;
+	u8                    cs_num_csm;
+	struct mdc_csm_info   csm[MPOOL_MDC_SET_SZ];
+};
+
+/**
+ * pmd_obj_alloc() - Allocate an object.
+ * @mp:
+ * @otype:
+ * @ocap:
+ * @mclassp: media class
+ * @layoutp:
+ *
+ * Allocate object of type otype with parameters and capacity as specified
+ * by ocap on drives in media class mclassp; if successful returns the
+ * object layout.
+ *
+ * Note:
+ * Object is not persistent until committed; allocation can be aborted.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_obj_alloc(struct mpool_descriptor *mp, enum obj_type_omf otype,
+		  struct pmd_obj_capacity *ocap, enum mp_media_classp mclassp,
+		  struct pmd_layout **layoutp);
+
+
+/**
+ * pmd_obj_realloc() - Re-allocate an object.
+ * @mp:
+ * @objid:
+ * @ocap:
+ * @mclassp: media class
+ * @layoutp:
+ *
+ * Allocate object with specified objid to support crash recovery; otherwise
+ * is equivalent to pmd_obj_alloc(); if successful returns object layout.
+ *
+ * Note:
+ * Object is not persistent until committed; allocation can be aborted.
+ *
+ * Return: %0 if successful; -errno otherwise
+ */
+int pmd_obj_realloc(struct mpool_descriptor *mp, u64 objid, struct pmd_obj_capacity *ocap,
+		    enum mp_media_classp mclassp, struct pmd_layout **layoutp);
+
+
+/**
+ * pmd_obj_commit() - Commit an object.
+ * @mp:
+ * @layout:
+ *
+ * Make allocated object persistent; if the commit fails the object remains
+ * uncommitted, so the caller can retry the commit or abort; an object cannot
+ * be committed while in erasing or aborting state; caller MUST NOT hold
+ * pmd_obj_*lock() on layout.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_obj_commit(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+/**
+ * pmd_obj_abort() - Discard un-committed object.
+ * @mp:
+ * @layout:
+ *
+ * Discard uncommitted object; caller MUST NOT hold pmd_obj_*lock() on
+ * layout; if successful layout is invalid after call.
+ *
+ * Return: %0 if successful; -errno otherwise
+ */
+int pmd_obj_abort(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+/**
+ * pmd_obj_delete() - Delete committed object.
+ * @mp:
+ * @layout:
+ *
+ * Delete committed object; caller MUST NOT hold pmd_obj_*lock() on layout;
+ * if successful layout is invalid.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_obj_delete(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+/**
+ * pmd_obj_erase() - Log erase for object and set state flag and generation number
+ * @mp:
+ * @layout:
+ * @gen:
+ *
+ * Object must be in committed state; caller MUST hold pmd_obj_wrlock() on layout.
+ *
+ * Return: %0 if successful, -errno otherwise
+ */
+int pmd_obj_erase(struct mpool_descriptor *mp, struct pmd_layout *layout, u64 gen);
+
+/**
+ * pmd_obj_find_get() - Get a reference for a layout for objid.
+ * @mp:
+ * @objid:
+ * @which:
+ *
+ * Get layout for object with specified objid; returns NULL if not found.
+ *
+ * Return: pointer to layout if successful, NULL otherwise
+ */
+struct pmd_layout *pmd_obj_find_get(struct mpool_descriptor *mp, u64 objid, int which);
+
+/**
+ * pmd_obj_rdlock() - Read-lock object layout with appropriate nesting level.
+ * @layout:
+ */
+void pmd_obj_rdlock(struct pmd_layout *layout);
+
+/**
+ * pmd_obj_rdunlock() - Release read lock on object layout.
+ * @layout:
+ */
+void pmd_obj_rdunlock(struct pmd_layout *layout);
+
+/**
+ * pmd_obj_wrlock() - Write-lock object layout with appropriate nesting level.
+ * @layout:
+ */
+void pmd_obj_wrlock(struct pmd_layout *layout);
+
+/**
+ * pmd_obj_wrunlock() - Release write lock on object layout.
+ * @layout:
+ */
+void pmd_obj_wrunlock(struct pmd_layout *layout);
+
+/**
+ * pmd_update_credit() - update available credit and set up the mdc selector table
+ * @mp: mpool object
+ *
+ * Lock: no lock required
+ *
+ * Used to initialize credit when new MDCs are added, and to add those MDCs
+ * to the available credit list.
+ */
+void pmd_update_credit(struct mpool_descriptor *mp);
+
+/**
+ * pmd_mpool_usage() - calculate per-mpool space usage
+ * @mp:
+ * @usage:
+ */
+void pmd_mpool_usage(struct mpool_descriptor *mp, struct mpool_usage *usage);
+
+/**
+ * pmd_precompact_alsz() - Inform MDC1/255 pre-compacting about the active
+ *	mlog of an mpool MDCi, 0 < i <= 255. The size and the amount used
+ *	are passed in ("alsz" stands for active mlog size).
+ * @mp:
+ * @objid: objid of the active mlog of the mpool MDCi
+ * @len: how much of the active mlog is used, in bytes
+ * @cap: size of the active mlog, in bytes
+ */
+void pmd_precompact_alsz(struct mpool_descriptor *mp, u64 objid, u64 len, u64 cap);
+
+/**
+ * pmd_layout_alloc() - create and initialize a pmd_layout
+ * @uuid:   object uuid
+ * @objid:  mblock/mlog object ID
+ * @gen:    generation number
+ * @mblen:  mblock written length
+ * @zcnt:   number of zones in a strip
+ *
+ * Alloc and init object layout; non-arg fields and all strip descriptor
+ * fields are set to 0/UNDEF/NONE; no auxiliary object info is allocated.
+ *
+ * Return: NULL if allocation fails.
+ */
+struct pmd_layout *pmd_layout_alloc(struct mpool_uuid *uuid, u64 objid,
+				    u64 gen, u64 mblen, u32 zcnt);
+
+/**
+ * pmd_layout_release() - free pmd_layout and internal elements
+ * @refp: kref embedded in the pmd_layout (eld_ref)
+ *
+ * Deallocate all memory associated with object layout.
+ *
+ * Return: void
+ */
+void pmd_layout_release(struct kref *refp);
+
+int pmd_layout_rw(struct mpool_descriptor *mp, struct pmd_layout *layout,
+		  const struct kvec *iov, int iovcnt, u64 boff, int flags, u8 rw);
+
+struct mpool_dev_info *pmd_layout_pd_get(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+u64 pmd_layout_cap_get(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+int pmd_layout_erase(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+int pmd_obj_alloc_cmn(struct mpool_descriptor *mp, u64 objid, enum obj_type_omf otype,
+		      struct pmd_obj_capacity *ocap, enum mp_media_classp mclass,
+		      int realloc, bool needref, struct pmd_layout **layoutp);
+
+void pmd_update_obj_stats(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			  struct pmd_mdc_info *cinfo, enum pmd_obj_op op);
+
+void pmd_co_rlock(struct pmd_mdc_info *cinfo, u8 slot);
+void pmd_co_runlock(struct pmd_mdc_info *cinfo);
+
+struct pmd_layout *pmd_co_find(struct pmd_mdc_info *cinfo, u64 objid);
+struct pmd_layout *pmd_co_insert(struct pmd_mdc_info *cinfo, struct pmd_layout *layout);
+struct pmd_layout *pmd_co_remove(struct pmd_mdc_info *cinfo, struct pmd_layout *layout);
+
+int pmd_smap_insert(struct mpool_descriptor *mp, struct pmd_layout *layout);
+
+int pmd_init(void) __cold;
+void pmd_exit(void) __cold;
+
+static inline bool objtype_user(enum obj_type_omf otype)
+{
+	return (otype == OMF_OBJ_MBLOCK || otype == OMF_OBJ_MLOG);
+}
+
+static inline u64 objid_make(u64 uniq, enum obj_type_omf otype, u8 cslot)
+{
+	return ((uniq << 12) | ((otype & 0xF) << 8) | (cslot & 0xFF));
+}
+
+static inline u64 objid_uniq(u64 objid)
+{
+	return (objid >> 12);
+}
+
+static inline u8 objid_slot(u64 objid)
+{
+	return (objid & 0xFF);
+}
+
+static inline bool objid_ckpt(u64 objid)
+{
+	return !(objid_uniq(objid) & (OBJID_UNIQ_DELTA - 1));
+}
+
+static inline u64 logid_make(u64 uniq, u8 cslot)
+{
+	return objid_make(uniq, OMF_OBJ_MLOG, cslot);
+}
+
+static inline bool objid_mdc0log(u64 objid)
+{
+	return ((objid == MDC0_OBJID_LOG1) || (objid == MDC0_OBJID_LOG2));
+}
+
+static inline enum obj_type_omf pmd_objid_type(u64 objid)
+{
+	enum obj_type_omf otype = objid_type(objid);
+
+	return objtype_valid(otype) ? otype : OMF_OBJ_UNDEF;
+}
+
+/* True if objid is an mpool user object (versus mpool metadata object). */
+static inline bool pmd_objid_isuser(u64 objid)
+{
+	return objtype_user(objid_type(objid)) && objid_slot(objid);
+}
+
+static inline void pmd_obj_put(struct pmd_layout *layout)
+{
+	kref_put(&layout->eld_ref, pmd_layout_release);
+}
+
+/* General mdc locking (has external callers...) */
+static inline void pmd_mdc_lock(struct mutex *lock, u8 slot)
+{
+	mutex_lock_nested(lock, slot > 0 ? PMD_MDC_NORMAL : PMD_MDC_ZERO);
+}
+
+static inline void pmd_mdc_unlock(struct mutex *lock)
+{
+	mutex_unlock(lock);
+}
+
+#endif /* MPOOL_PMD_OBJ_H */
diff --git a/drivers/mpool/sb.h b/drivers/mpool/sb.h
new file mode 100644
index 000000000000..673a5f742f7c
--- /dev/null
+++ b/drivers/mpool/sb.h
@@ -0,0 +1,162 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_SB_PRIV_H
+#define MPOOL_SB_PRIV_H
+
+#include "mpool_ioctl.h"
+
+struct pd_dev_parm;
+struct omf_sb_descriptor;
+struct pd_prop;
+
+/*
+ * Drives have 2 superblocks.
+ * + sb0 at byte offset 0
+ * + sb1 at byte offset SB_AREA_SZ
+ *
+ * Read: sb0 is the authoritative copy, other copies are not used.
+ * Updates: sb0 is updated first; if successful sb1 is updated
+ */
+/* Number of superblocks per physical device. */
+#define SB_SB_COUNT        2
+
+/*
+ * Size in bytes of the area occupied by a superblock. The superblock itself
+ * may be smaller, but always starts at the beginning of its area.
+ */
+#define SB_AREA_SZ        (4096ULL)
+
+/*
+ * Size in bytes of an area located just after the superblock areas.
+ * Not used in 1.0. Later can be used for MDC0 metadata and/or voting sets.
+ */
+#define MDC0MD_AREA_SZ    (4096ULL)
+
+/*
+ * sb API functions
+ */
+
+/**
+ * sb_magic_check() - check for sb magic value
+ * @dparm: struct pd_dev_parm *
+ *
+ * Determine if the mpool magic value exists in at least one place where
+ * expected on drive pd.  Does NOT imply drive has a valid superblock.
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Return: 1 if found, 0 if not found, -(errno) if error reading
+ */
+int sb_magic_check(struct pd_dev_parm *dparm);
+
+/**
+ * sb_write_new() - write superblock to new drive
+ * @dparm: struct pd_dev_parm *
+ * @sb: struct omf_sb_descriptor *
+ *
+ * Write superblock sb to new (non-pool) drive
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int sb_write_new(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb);
+
+/**
+ * sb_write_update() - update superblock
+ * @dparm: "dparm" info is not used to fill up the super block, only "sb" content is used.
+ * @sb: "sb" content is written in the super block.
+ *
+ * Update superblock on pool drive
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int sb_write_update(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb);
+
+/**
+ * sb_erase() - erase superblock
+ * @dparm: struct pd_dev_parm *
+ *
+ * Erase superblock on drive pd.
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int sb_erase(struct pd_dev_parm *dparm);
+
+/**
+ * sb_read() - read superblock
+ * @dparm: struct pd_dev_parm *
+ * @sb: struct omf_sb_descriptor *
+ * @omf_ver: omf sb version
+ * @force:
+ *
+ * Read superblock from drive pd; make repairs as necessary.
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int sb_read(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb, u16 *omf_ver, bool force);
+
+/**
+ * sbutil_mdc0_clear() - clear mdc0 of superblock
+ * @sb: struct omf_sb_descriptor *
+ *
+ * Clear (set to zeros) mdc0 portion of sb.
+ *
+ * Return: void
+ */
+void sbutil_mdc0_clear(struct omf_sb_descriptor *sb);
+
+/**
+ * sbutil_mdc0_isclear() - Test if mdc0 is clear
+ * @sb: struct omf_sb_descriptor *
+ *
+ * Return: 1 if mdc0 portion of sb is clear.
+ */
+int sbutil_mdc0_isclear(struct omf_sb_descriptor *sb);
+
+/**
+ * sbutil_mdc0_copy() - copy mdc0 from one superblock to another
+ * @tgtsb: struct omf_sb_descriptor *
+ * @srcsb: struct omf_sb_descriptor *
+ *
+ * Copy mdc0 portion of srcsb to tgtsb.
+ *
+ * Return: void
+ */
+void sbutil_mdc0_copy(struct omf_sb_descriptor *tgtsb, struct omf_sb_descriptor *srcsb);
+
+/**
+ * sbutil_mdc0_isvalid() - validate mdc0 of a superblock
+ * @sb: struct omf_sb_descriptor *
+ *
+ * Validate mdc0 portion of sb and extract mdparm.
+ * Return: 1 if valid and mdparm set; 0 otherwise.
+ */
+int sbutil_mdc0_isvalid(struct omf_sb_descriptor *sb);
+
+/**
+ * sb_zones_for_sbs() - compute how many zones are needed to contain the superblocks.
+ * @pd_prop:
+ */
+static inline u32 sb_zones_for_sbs(struct pd_prop *pd_prop)
+{
+	u32 zonebyte;
+
+	zonebyte = pd_prop->pdp_zparam.dvb_zonepg << PAGE_SHIFT;
+
+	return (2 * (SB_AREA_SZ + MDC0MD_AREA_SZ) + (zonebyte - 1)) / zonebyte;
+}
+
+int sb_init(void) __cold;
+void sb_exit(void) __cold;
+
+#endif /* MPOOL_SB_PRIV_H */
diff --git a/drivers/mpool/smap.h b/drivers/mpool/smap.h
new file mode 100644
index 000000000000..b9b72d3182c6
--- /dev/null
+++ b/drivers/mpool/smap.h
@@ -0,0 +1,334 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_SMAP_H
+#define MPOOL_SMAP_H
+
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
+#include <linux/workqueue.h>
+
+#include "mpool_ioctl.h"
+
+/* Forward Decls */
+struct mpool_usage;
+struct mpool_devprops;
+struct mc_smap_parms;
+struct mpool_descriptor;
+
+/*
+ * Common defs
+ */
+
+/**
+ * struct rmbkt - region map bucket
+ */
+struct rmbkt {
+	struct mutex    pdi_rmlock;
+	struct rb_root  pdi_rmroot;
+} ____cacheline_aligned;
+
+/**
+ * struct smap_zone -
+ * @smz_node:
+ * @smz_key:
+ * @smz_value:
+ */
+struct smap_zone {
+	struct rb_node  smz_node;
+	u64             smz_key;
+	u64             smz_value;
+};
+
+/**
+ * enum smap_space_type - space allocation policy flag
+ * @SMAP_SPC_UNDEF:
+ * @SMAP_SPC_USABLE_ONLY:    allocate from usable space only
+ * @SMAP_SPC_USABLE_2_SPARE: allocate from usable space first then spare
+ *                          if needed
+ * @SMAP_SPC_SPARE_ONLY:     allocate from spare space only
+ * @SMAP_SPC_SPARE_2_USABLE: allocate from spare space first then usable
+ *                          if needed
+ */
+enum smap_space_type {
+	SMAP_SPC_UNDEF           = 0,
+	SMAP_SPC_USABLE_ONLY     = 1,
+	SMAP_SPC_USABLE_2_SPARE  = 2,
+	SMAP_SPC_SPARE_ONLY      = 3,
+	SMAP_SPC_SPARE_2_USABLE  = 4
+};
+
+static inline int saptype_valid(enum smap_space_type saptype)
+{
+	return (saptype >= SMAP_SPC_USABLE_ONLY &&
+		saptype <= SMAP_SPC_SPARE_2_USABLE);
+}
+
+/*
+ * drive allocation info
+ *
+ * LOCKING:
+ * + rgnsz, rgnladdr: constants; no locking required
+ * + all other fields: protected by dalock
+ */
+
+/**
+ * struct smap_dev_alloc -
+ * @sda_dalock:
+ * @sda_rgnsz:    number of zones per rgn, excepting last
+ * @sda_rgnladdr: address of first zone in last rgn
+ * @sda_rgnalloc: rgn last alloced from
+ * @sda_zoneeff:    total zones (zonetot) minus bad zones
+ * @sda_utgt:      target max usable zones to allocate
+ * @sda_uact:      actual usable zones allocated
+ * @sda_stgt:      target max spare zones to allocate
+ * @sda_sact:     actual spare zones allocated
+ *
+ * NOTE:
+ * + must maintain invariant that sact <= stgt
+ * + however it is possible for uact > utgt due to changing % spare
+ *   zones or zone failures.  this condition corrects when
+ *   sufficient space is freed or if % spare zones is changed
+ *   (again).
+ *
+ * Capacity pools and calcs:
+ * + total zones = zonetot
+ * + avail zones = zoneeff
+ * + usable zones = utgt which is (zoneeff * (1 - spzone/100))
+ * + free usable zones = max(0, utgt - uact); max handles uact > utgt
+ * + used zones = uact; possible for used > usable (uact > utgt)
+ * + spare zones = stgt which is (zoneeff - utgt)
+ * + free spare zones = (stgt - sact); guaranteed that sact <= stgt
+ */
+struct smap_dev_alloc {
+	spinlock_t sda_dalock;
+	u32        sda_rgnsz;
+	u32        sda_rgnladdr;
+	u32        sda_rgnalloc;
+	u32        sda_zoneeff;
+	u32        sda_utgt;
+	u32        sda_uact;
+	u32        sda_stgt;
+	u32        sda_sact;
+};
+
+struct smap_dev_znstats {
+	u64    sdv_total;
+	u64    sdv_avail;
+	u64    sdv_usable;
+	u64    sdv_fusable;
+	u64    sdv_spare;
+	u64    sdv_fspare;
+	u64    sdv_used;
+};
+
+/**
+ * smap_usage_work - delayed work struct for checking mpool free usable space usage
+ * @smapu_wstruct:
+ * @smapu_mp:
+ * @smapu_freepct: free space %
+ */
+struct smap_usage_work {
+	struct delayed_work             smapu_wstruct;
+	struct mpool_descriptor        *smapu_mp;
+	int                             smapu_freepct;
+};
+
+/*
+ * smap API functions
+ */
+
+/*
+ * Return: all smap fns can return -errno with the following errno values
+ * on failure:
+ * + -EINVAL = invalid fn args
+ * + -ENOSPC = unable to allocate requested space
+ * + -ENOMEM = insufficient memory to complete operation
+ */
+
+/*
+ * smap API usage notes:
+ * + During mpool activation call smap_insert() for all existing objects
+ *   before calling smap_alloc() or smap_free().
+ */
+
+/**
+ * smap_mpool_init() - initialize the smaps for an initialized mpool_descriptor
+ * @mp: struct mpool_descriptor *
+ *
+ * smap_mpool_init must be called once per mpool as it is being activated.
+ *
+ * Init space maps for all drives in mpool that are empty except for
+ * superblocks; caller must ensure no other thread can access mp.
+ *
+ * TODO: Traversing smap rbtrees may need fix, since there may be unsafe
+ * erases within loops.
+ *
+ * Return:
+ * 0 if successful, -errno with the following errno values on failure:
+ * -EINVAL if spare zone percentage is > 100%,
+ * -EINVAL if rgn count is 0, or
+ * -EINVAL if zonecnt on one of the drives is < rgn count
+ * -ENOMEM if there is no memory available
+ */
+int smap_mpool_init(struct mpool_descriptor *mp);
+
+/**
+ * smap_mpool_free() - free smap structures in a mpool_descriptor
+ * @mp: struct mpool_descriptor *
+ *
+ * Free space maps for all drives in mpool; caller must ensure no other
+ * thread can access mp.
+ *
+ * Return: void
+ */
+void smap_mpool_free(struct mpool_descriptor *mp);
+
+/**
+ * smap_mpool_usage() - present stats of smap usage
+ * @mp: struct mpool_descriptor *
+ * @mclass: media class or MP_MED_ALL for all classes
+ * @usage: struct mpool_usage *
+ *
+ * Fill in stats with space usage for media class; if MP_MED_ALL
+ * report on all media classes; caller must hold mp.pdvlock.
+ *
+ * Locking: the caller should hold the pds_pdvlock at least in read to
+ *	    be protected against media classes updates.
+ */
+void smap_mpool_usage(struct mpool_descriptor *mp, u8 mclass, struct mpool_usage *usage);
+
+/**
+ * smap_drive_spares() - Set percentage of zones to set aside as spares
+ * @mp: struct mpool_descriptor *
+ * @mclassp: media class
+ * @spzone: percentage of zones to use as spares
+ *
+ * Set percent spare zones to spzone for drives in media class mclass;
+ * caller must hold mp.pdvlock.
+ *
+ * Locking: the caller should hold the pds_pdvlock at least in read to
+ *	    be protected against media classes updates.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int smap_drive_spares(struct mpool_descriptor *mp, enum mp_media_classp mclassp, u8 spzone);
+
+/**
+ * smap_drive_usage() - Fill in a given drive's portion of dprop struct.
+ * @mp:    struct mpool_descriptor *
+ * @pdh:   drive number within the mpool_descriptor
+ * @dprop: struct mpool_devprops *, structure to fill in
+ *
+ * Fill in usage portion of dprop for drive pdh; caller must hold mp.pdvlock
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int smap_drive_usage(struct mpool_descriptor *mp, u16 pdh, struct mpool_devprops *dprop);
+
+/**
+ * smap_drive_init() - Initialize a specific drive within a mpool_descriptor
+ * @mp:    struct mpool_descriptor *
+ * @mcsp:  smap parameters
+ * @pdh:   u16, drive number within the mpool_descriptor
+ *
+ * Init space map for pool drive pdh that is empty except for superblocks
+ * with a percent spare zones of spzone; caller must ensure pdh is not in use.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int smap_drive_init(struct mpool_descriptor *mp, struct mc_smap_parms *mcsp, u16 pdh);
+
+/**
+ * smap_drive_free() - Release resources for a specific drive
+ * @mp:  struct mpool_descriptor *
+ * @pdh: u16, drive number within the mpool_descriptor
+ *
+ * Free space map for pool drive pdh including partial (failed) inits;
+ * caller must ensure pdh is not in use.
+ *
+ * Return: void
+ */
+void smap_drive_free(struct mpool_descriptor *mp, u16 pdh);
+
+/**
+ * smap_insert() - Inject an entry to an smap for existing object
+ * @mp: struct mpool_descriptor *
+ * @pdh: drive number within the mpool_descriptor
+ * @zoneaddr: starting zone for entry
+ * @zonecnt: number of zones in entry
+ *
+ * Add entry to space map for an existing object with a strip on drive pdh
+ * starting at zoneaddr and continuing for zonecnt blocks.
+ *
+ * Used, in part, for superblocks.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int smap_insert(struct mpool_descriptor *mp, u16 pdh, u64 zoneaddr, u32 zonecnt);
+
+/**
+ * smap_alloc() - Allocate a new contiguous zone range on a specific drive
+ * @mp: struct mpool_descriptor
+ * @pdh: u16, drive number within the mpool_descriptor
+ * @zonecnt: u64, the number of zones requested
+ * @sapolicy: enum smap_space_type, usable only, spare only, etc.
+ * @zoneaddr: u64 *, the starting zone for the allocated range
+ * @align: no. of zones (must be a power-of-2)
+ *
+ * Attempt to allocate zonecnt contiguous zones on drive pdh
+ * in accordance with space allocation policy sapolicy.
+ *
+ * Return: 0 if successful; -errno otherwise
+ */
+int smap_alloc(struct mpool_descriptor *mp, u16 pdh, u64 zonecnt,
+	       enum smap_space_type sapolicy, u64 *zoneaddr, u64 align);
+
+/**
+ * smap_free() - Free a previously allocated range of zones in the smap
+ * @mp: struct mpool_descriptor *
+ * @pdh: u16, number of the disk within the mpool_descriptor
+ * @zoneaddr: u64, starting zone for the range to free
+ * @zonecnt: u16, the number of zones in the range
+ *
+ * Free currently allocated space starting at zoneaddr
+ * and continuing for zonecnt blocks.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int smap_free(struct mpool_descriptor *mp, u16 pdh, u64 zoneaddr, u16 zonecnt);
+
+/*
+ * smap internal functions
+ */
+
+/**
+ * smap_mclass_usage() - Get the media class usage for a given mclass.
+ * @mp:
+ * @mclass: if MP_MED_ALL, return the sum of the stats for all media classes,
+ *	else the stats for only one media class.
+ * @usage: output
+ *
+ * Locking: the caller should hold the pds_pdvlock at least in read to
+ *	    be protected against media classes updates.
+ */
+void smap_mclass_usage(struct mpool_descriptor *mp, u8 mclass, struct mpool_usage *usage);
+
+/**
+ * smap_log_mpool_usage() - check drive mpool free usable space %, and log a message if needed
+ * @ws:
+ */
+void smap_log_mpool_usage(struct work_struct *ws);
+
+/**
+ * smap_wait_usage_done() - wait for the periodic job logging pd free usable space % to complete
+ * @mp:
+ */
+void smap_wait_usage_done(struct mpool_descriptor *mp);
+
+int smap_init(void) __cold;
+void smap_exit(void) __cold;
+
+#endif /* MPOOL_SMAP_H */
-- 
2.17.2



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 03/22] mpool: add on-media struct definitions
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 02/22] mpool: add in-memory struct definitions Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 04/22] mpool: add pool drive component which handles mpool IO using the block layer API Nabeel M Mohamed
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds headers containing the following on-media formats:
- Mpool superblock
- Object management records: create, update, delete, and erase
- Mpool configuration record
- Media class config and spare record
- OID checkpoint and version record
- Mlog page header and framing records

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/omf.h     | 593 ++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/omf_if.h  | 381 ++++++++++++++++++++++++++
 drivers/mpool/upgrade.h | 128 +++++++++
 3 files changed, 1102 insertions(+)
 create mode 100644 drivers/mpool/omf.h
 create mode 100644 drivers/mpool/omf_if.h
 create mode 100644 drivers/mpool/upgrade.h

diff --git a/drivers/mpool/omf.h b/drivers/mpool/omf.h
new file mode 100644
index 000000000000..c750573720dd
--- /dev/null
+++ b/drivers/mpool/omf.h
@@ -0,0 +1,593 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Pool on-drive format (omf) module.
+ *
+ * Defines:
+ * + on-drive format for mpool superblocks
+ * + on-drive formats for mlogs, mblocks, and metadata containers (mdc)
+ * + utility functions for working with these on-drive formats
+ * That includes structures and enums used by the on-drive format.
+ *
+ * All mpool metadata is versioned and stored on media in little-endian format.
+ *
+ * Naming conventions:
+ * -------------------
+ * The name of the structures ends with _omf
+ * The name of the structure members start with a "p" that means "packed".
+ */
+
+#ifndef MPOOL_OMF_H
+#define MPOOL_OMF_H
+
+#include <linux/bug.h>
+#include <asm/byteorder.h>
+
+/*
+ * The following two macros exist solely to enable the OMF_SETGET macros to
+ * work on 8 bit members as well as 16, 32 and 64 bit members.
+ */
+#define le8_to_cpu(x)  (x)
+#define cpu_to_le8(x)  (x)
+
+
+/* Helper macro to define set/get methods for 8, 16, 32 or 64 bit scalar OMF struct members. */
+#define OMF_SETGET(type, member, bits) \
+	OMF_SETGET2(type, member, bits, member)
+
+#define OMF_SETGET2(type, member, bits, name)				\
+	static __always_inline u##bits omf_##name(const type * s)	\
+	{								\
+		BUILD_BUG_ON(sizeof(((type *)0)->member)*8 != (bits));	\
+		return le##bits##_to_cpu(s->member);			\
+	}								\
+	static __always_inline void omf_set_##name(type *s, u##bits val)\
+	{								\
+		s->member = cpu_to_le##bits(val);			\
+	}
+
+/* Helper macro to define set/get methods for character strings embedded in OMF structures. */
+#define OMF_SETGET_CHBUF(type, member) \
+	OMF_SETGET_CHBUF2(type, member, member)
+
+#define OMF_SETGET_CHBUF2(type, member, name)				\
+	static inline void omf_set_##name(type *s, const void *p, size_t plen) \
+	{								\
+		size_t len = sizeof(((type *)0)->member);		\
+		memcpy(s->member, p, len < plen ? len : plen);		\
+	}								\
+	static inline void omf_##name(const type *s, void *p, size_t plen)\
+	{								\
+		size_t len = sizeof(((type *)0)->member);		\
+		memcpy(p, s->member, len < plen ? len : plen);		\
+	}
+
+
+/* MPOOL_NAMESZ_MAX should match OMF_MPOOL_NAME_LEN */
+#define OMF_MPOOL_NAME_LEN 32
+
+/* MPOOL_UUID_SIZE should match OMF_UUID_PACKLEN */
+#define OMF_UUID_PACKLEN 16
+
+/**
+ * enum mc_features_omf - Drive features that participate in media class
+ *	                  definition. These values are OR'ed into a 64-bit field.
+ */
+enum mc_features_omf {
+	OMF_MC_FEAT_MLOG_TGT   = 0x1,
+	OMF_MC_FEAT_MBLOCK_TGT = 0x2,
+	OMF_MC_FEAT_CHECKSUM   = 0x4,
+};
+
+
+/**
+ * enum devtype_omf - physical device (PD) types
+ * @OMF_PD_DEV_TYPE_BLOCK_STREAM: block device implementing streams
+ * @OMF_PD_DEV_TYPE_BLOCK_STD:    standard (non-streams) device (SSD, HDD)
+ * @OMF_PD_DEV_TYPE_FILE:         file in user space, for unit testing
+ * @OMF_PD_DEV_TYPE_MEM:          memory-semantic device, such as NVDIMM
+ *                                direct access (raw or dax mode)
+ * @OMF_PD_DEV_TYPE_ZONE:         zone-like device, such as an open-channel SSD
+ *                                (OC-SSD) or SMR HDD (using ZBC/ZAC)
+ * @OMF_PD_DEV_TYPE_BLOCK_NVDIMM: standard (non-streams) NVDIMM in sector mode
+ */
+enum devtype_omf {
+	OMF_PD_DEV_TYPE_BLOCK_STREAM	= 1,
+	OMF_PD_DEV_TYPE_BLOCK_STD	= 2,
+	OMF_PD_DEV_TYPE_FILE		= 3,
+	OMF_PD_DEV_TYPE_MEM		= 4,
+	OMF_PD_DEV_TYPE_ZONE		= 5,
+	OMF_PD_DEV_TYPE_BLOCK_NVDIMM    = 6,
+};
+
+
+/**
+ * struct layout_descriptor_omf - Layout descriptor version 1.
+ * @pol_zcnt: number of zones
+ * @pol_zaddr: zone start addr
+ *
+ * Introduced with binary version 1.0.0.0.
+ * "pol_" = packed omf layout
+ */
+struct layout_descriptor_omf {
+	__le32 pol_zcnt;
+	__le64 pol_zaddr;
+} __packed;
+
+/* Define set/get methods for layout_descriptor_omf */
+OMF_SETGET(struct layout_descriptor_omf, pol_zcnt, 32)
+OMF_SETGET(struct layout_descriptor_omf, pol_zaddr, 64)
+#define OMF_LAYOUT_DESC_PACKLEN (sizeof(struct layout_descriptor_omf))
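For readers unfamiliar with the pattern, OMF_SETGET(struct layout_descriptor_omf, pol_zcnt, 32) generates a pair of accessors roughly like the following. This is a userspace sketch, not driver code: the kernel types and byte-order helpers are stubbed out, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for kernel types; identity byte-order helpers (LE host assumed). */
typedef uint32_t u32;
typedef uint32_t __le32;
typedef uint64_t __le64;
#define le32_to_cpu(x) (x)
#define cpu_to_le32(x) (x)

struct layout_descriptor_omf {
	__le32 pol_zcnt;	/* number of zones */
	__le64 pol_zaddr;	/* zone start addr */
} __attribute__((packed));

/* Hand-expanded result of OMF_SETGET(struct layout_descriptor_omf, pol_zcnt, 32): */
static inline u32 omf_pol_zcnt(const struct layout_descriptor_omf *s)
{
	return le32_to_cpu(s->pol_zcnt);
}

static inline void omf_set_pol_zcnt(struct layout_descriptor_omf *s, u32 val)
{
	s->pol_zcnt = cpu_to_le32(val);
}
```

Callers never touch the __le32 member directly; all access goes through the generated pair, which keeps the byte swapping in one place.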
+
+
+/**
+ * struct devparm_descriptor_omf - packed omf devparm descriptor
+ * @podp_mclassp:   enum mp_media_classp
+ * @podp_devtype:   PD type (enum devtype_omf)
+ * @podp_sectorsz:  2^podp_sectorsz = sector size
+ * @podp_devid:     UUID for drive
+ * @podp_zonepg:    zone size in number of zone pages
+ * @podp_zonetot:   total number of zones
+ * @podp_devsz:     size of partition in bytes
+ * @podp_features:  features, OR'ed bits of enum mc_features_omf
+ *
+ * The fields mclassp, devtype, sectorsz, and zonepg uniquely identify the media class of the PD.
+ * All drives in a media class must have the same values in these fields.
+ */
+struct devparm_descriptor_omf {
+	u8     podp_mclassp;
+	u8     podp_devtype;
+	u8     podp_sectorsz;
+	u8     podp_devid[OMF_UUID_PACKLEN];
+	u8     podp_pad[5];
+	__le32 podp_zonepg;
+	__le32 podp_zonetot;
+	__le64 podp_devsz;
+	__le64 podp_features;
+} __packed;
+
+/* Define set/get methods for devparm_descriptor_omf */
+OMF_SETGET(struct devparm_descriptor_omf, podp_mclassp, 8)
+OMF_SETGET(struct devparm_descriptor_omf, podp_devtype, 8)
+OMF_SETGET(struct devparm_descriptor_omf, podp_sectorsz, 8)
+OMF_SETGET_CHBUF(struct devparm_descriptor_omf, podp_devid)
+OMF_SETGET(struct devparm_descriptor_omf, podp_zonepg, 32)
+OMF_SETGET(struct devparm_descriptor_omf, podp_zonetot, 32)
+OMF_SETGET(struct devparm_descriptor_omf, podp_devsz, 64)
+OMF_SETGET(struct devparm_descriptor_omf, podp_features, 64)
+#define OMF_DEVPARM_DESC_PACKLEN (sizeof(struct devparm_descriptor_omf))
+
+
+/*
+ * mlog structure:
+ * + An mlog comprises a consecutive sequence of log blocks,
+ *   where each log block is a single page within a zone
+ * + A log block comprises a header and a consecutive sequence of records
+ * + A record is a typed blob
+ *
+ * Log block headers must be versioned. Log block records do not
+ * require version numbers because they are typed and new types can
+ * always be added.
+ */
+
+/*
+ * Log block format -- version 1
+ *
+ * log block := header record+ eolb? trailer?
+ *
+ * header := struct omf_logblock_header where vers=1
+ *
+ * record := lrd byte*
+ *
+ * lrd := struct omf_logrec_descriptor with value
+ *   (<record length>, <chunk length>, enum logrec_type_omf value)
+ *
+ * eolb (end of log block marker) := struct omf_logrec_descriptor with value
+ *   (0, 0, enum logrec_type_omf.EOLB/0)
+ *
+ * trailer := zero bytes from end of last log block record to end of log block
+ *
+ * Note: OMF_LOGREC_CEND must remain the maximum value of enum logrec_type_omf.
+ */
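To make the grammar above concrete, walking the records of a single log block might look like the sketch below. This is a userspace illustration on a little-endian host; the real driver unpacks descriptors through omf_logrec_desc_unpack_letoh() and validates the header first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OMF_LOGREC_EOLB 0	/* rtype 0 marks the start of the trailer */

/* In-memory mirror of logrec_descriptor_omf (8 bytes, no padding holes). */
struct logrec_desc {
	uint32_t tlen;	/* logical length of the full data record */
	uint16_t rlen;	/* chunk length in this log record */
	uint8_t  rtype;
	uint8_t  pad;
};

/* Count the records in one log block: skip the header, then hop from
 * descriptor to descriptor until EOLB or the end of the block. */
static int count_records(const uint8_t *lb, size_t lbsz, size_t hdrsz)
{
	size_t off = hdrsz;
	int n = 0;

	while (off + sizeof(struct logrec_desc) <= lbsz) {
		struct logrec_desc lrd;

		memcpy(&lrd, lb + off, sizeof(lrd));
		if (lrd.rtype == OMF_LOGREC_EOLB)
			break;	/* zero-filled trailer follows */
		off += sizeof(lrd) + lrd.rlen;
		n++;
	}
	return n;
}
```

Because the trailer is all zero bytes, a partially filled log block parses cleanly: the first zeroed descriptor reads as EOLB and terminates the walk.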
+
+/**
+ * enum logrec_type_omf -
+ * @OMF_LOGREC_EOLB:      end of log block marker (start of trailer)
+ * @OMF_LOGREC_DATAFULL:  data record; contains all specified data
+ * @OMF_LOGREC_DATAFIRST: data record; contains first part of specified data
+ * @OMF_LOGREC_DATAMID:   data record; contains interior part of data
+ * @OMF_LOGREC_DATALAST:  data record; contains final part of specified data
+ * @OMF_LOGREC_CSTART:    compaction start marker
+ * @OMF_LOGREC_CEND:      compaction end marker
+ *
+ * A log record type of 0 signifies EOLB. This is really the start of the
+ * trailer but this simplifies parsing for partially filled log blocks.
+ * DATAFIRST, -MID, -LAST types are used for chunking logical data records.
+ */
+enum logrec_type_omf {
+	OMF_LOGREC_EOLB      = 0,
+	OMF_LOGREC_DATAFULL  = 1,
+	OMF_LOGREC_DATAFIRST = 2,
+	OMF_LOGREC_DATAMID   = 3,
+	OMF_LOGREC_DATALAST  = 4,
+	OMF_LOGREC_CSTART    = 5,
+	OMF_LOGREC_CEND      = 6,
+};
+
+
+/**
+ * struct logrec_descriptor_omf - packed omf logrec descriptor
+ * @polr_tlen:  logical length of data record (all chunks)
+ * @polr_rlen:  length of data chunk in this log record
+ * @polr_rtype: enum logrec_type_omf value
+ */
+struct logrec_descriptor_omf {
+	__le32 polr_tlen;
+	__le16 polr_rlen;
+	u8     polr_rtype;
+	u8     polr_pad;
+} __packed;
+
+/* Define set/get methods for logrec_descriptor_omf */
+OMF_SETGET(struct logrec_descriptor_omf, polr_tlen, 32)
+OMF_SETGET(struct logrec_descriptor_omf, polr_rlen, 16)
+OMF_SETGET(struct logrec_descriptor_omf, polr_rtype, 8)
+#define OMF_LOGREC_DESC_PACKLEN (sizeof(struct logrec_descriptor_omf))
+#define OMF_LOGREC_DESC_RLENMAX 65535
+
+
+#define OMF_LOGBLOCK_VERS    1
+
+/**
+ * struct logblock_header_omf - packed omf logblock header for all versions
+ * @polh_vers:    log block hdr version, offset 0 in all vers
+ * @polh_magic:   unique magic per mlog
+ * @polh_pfsetid: flush set ID of the previous log block
+ * @polh_cfsetid: flush set ID this log block belongs to
+ * @polh_gen:     generation number
+ */
+struct logblock_header_omf {
+	__le16 polh_vers;
+	u8     polh_magic[OMF_UUID_PACKLEN];
+	u8     polh_pad[6];
+	__le32 polh_pfsetid;
+	__le32 polh_cfsetid;
+	__le64 polh_gen;
+} __packed;
+
+/* Define set/get methods for logblock_header_omf */
+OMF_SETGET(struct logblock_header_omf, polh_vers, 16)
+OMF_SETGET_CHBUF(struct logblock_header_omf, polh_magic)
+OMF_SETGET(struct logblock_header_omf, polh_pfsetid, 32)
+OMF_SETGET(struct logblock_header_omf, polh_cfsetid, 32)
+OMF_SETGET(struct logblock_header_omf, polh_gen, 64)
+/* On-media log block header length */
+#define OMF_LOGBLOCK_HDR_PACKLEN (sizeof(struct logblock_header_omf))
+
+
+/*
+ * Metadata container (mdc) mlog data record formats.
+ *
+ * NOTE: mdc records are typed and as such do not need a version number as new
+ * types can always be added as required.
+ */
+/**
+ * enum mdcrec_type_omf - MDC record types
+ * @OMF_MDR_UNDEF:    undefined; should never occur
+ * @OMF_MDR_OCREATE:  object create
+ * @OMF_MDR_OUPDATE:  object update
+ * @OMF_MDR_ODELETE:  object delete
+ * @OMF_MDR_OIDCKPT:  object id checkpoint
+ * @OMF_MDR_OERASE:   object erase; also logs the mlog gen number
+ * @OMF_MDR_MCCONFIG: media class config
+ * @OMF_MDR_MCSPARE:  media class spare zones set
+ * @OMF_MDR_VERSION:  MDC content version
+ * @OMF_MDR_MPCONFIG: mpool config record
+ */
+enum mdcrec_type_omf {
+	OMF_MDR_UNDEF       = 0,
+	OMF_MDR_OCREATE     = 1,
+	OMF_MDR_OUPDATE     = 2,
+	OMF_MDR_ODELETE     = 3,
+	OMF_MDR_OIDCKPT     = 4,
+	OMF_MDR_OERASE      = 5,
+	OMF_MDR_MCCONFIG    = 6,
+	OMF_MDR_MCSPARE     = 7,
+	OMF_MDR_VERSION     = 8,
+	OMF_MDR_MPCONFIG    = 9,
+	OMF_MDR_MAX         = 10,
+};
+
+/**
+ * struct mdcver_omf - packed mdc version, version of an mpool MDC content.
+ * @pv_rtype:      OMF_MDR_VERSION
+ * @pv_mdcv_major: to compare with MAJOR in binary version.
+ * @pv_mdcv_minor: to compare with MINOR in binary version.
+ * @pv_mdcv_patch: to compare with PATCH in binary version.
+ * @pv_mdcv_dev:   used during development cycle when the above
+ *                 numbers don't change.
+ *
+ * This is not the version of the message framing used for the MDC. This is
+ * the version of the binary that introduced that version of the MDC content.
+ */
+struct mdcver_omf {
+	u8     pv_rtype;
+	u8     pv_pad;
+	__le16 pv_mdcv_major;
+	__le16 pv_mdcv_minor;
+	__le16 pv_mdcv_patch;
+	__le16 pv_mdcv_dev;
+} __packed;
+
+/* Define set/get methods for mdcver_omf */
+OMF_SETGET(struct mdcver_omf, pv_rtype, 8)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_major, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_minor, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_patch, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_dev,   16)
+
+
+/**
+ * struct mdcrec_data_odelete_omf - packed data record odelete
+ * @pdro_rtype: mdrec_type_omf: OMF_MDR_ODELETE or OMF_MDR_OIDCKPT
+ * @pdro_objid: object identifier
+ */
+struct mdcrec_data_odelete_omf {
+	u8     pdro_rtype;
+	u8     pdro_pad[7];
+	__le64 pdro_objid;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_odelete_omf */
+OMF_SETGET(struct mdcrec_data_odelete_omf, pdro_rtype, 8)
+OMF_SETGET(struct mdcrec_data_odelete_omf, pdro_objid, 64)
+
+
+/**
+ * struct mdcrec_data_oerase_omf - packed data record oerase
+ * @pdrt_rtype: mdrec_type_omf: OMF_MDR_OERASE
+ * @pdrt_objid: object identifier
+ * @pdrt_gen:   object generation number
+ */
+struct mdcrec_data_oerase_omf {
+	u8     pdrt_rtype;
+	u8     pdrt_pad[7];
+	__le64 pdrt_objid;
+	__le64 pdrt_gen;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_oerase_omf */
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_rtype, 8)
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_objid, 64)
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_gen, 64)
+#define OMF_MDCREC_OERASE_PACKLEN (sizeof(struct mdcrec_data_oerase_omf))
+
+
+/**
+ * struct mdcrec_data_mcconfig_omf - packed data record mclass config
+ * @pdrs_rtype: mdrec_type_omf: OMF_MDR_MCCONFIG
+ * @pdrs_parm:  packed device parameters (struct devparm_descriptor_omf)
+ */
+struct mdcrec_data_mcconfig_omf {
+	u8                             pdrs_rtype;
+	u8                             pdrs_pad[7];
+	struct devparm_descriptor_omf  pdrs_parm;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_mcconfig_omf */
+OMF_SETGET(struct mdcrec_data_mcconfig_omf, pdrs_rtype, 8)
+#define OMF_MDCREC_MCCONFIG_PACKLEN (sizeof(struct mdcrec_data_mcconfig_omf))
+
+
+/**
+ * struct mdcrec_data_mcspare_omf - packed data record mcspare
+ * @pdra_rtype:   mdrec_type_omf: OMF_MDR_MCSPARE
+ * @pdra_mclassp: enum mp_media_classp
+ * @pdra_spzone:   percent spare zones for drives in media class
+ */
+struct mdcrec_data_mcspare_omf {
+	u8     pdra_rtype;
+	u8     pdra_mclassp;
+	u8     pdra_spzone;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_mcspare_omf */
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_rtype, 8)
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_mclassp, 8)
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_spzone, 8)
+#define OMF_MDCREC_CLS_SPARE_PACKLEN (sizeof(struct mdcrec_data_mcspare_omf))
+
+
+/**
+ * struct mdcrec_data_ocreate_omf - packed data record ocreate
+ * @pdrc_rtype:     mdrec_type_omf: OMF_MDR_OCREATE or OMF_MDR_OUPDATE
+ * @pdrc_mclass:    media class
+ * @pdrc_ld:        packed layout descriptor
+ * @pdrc_objid:     object identifier
+ * @pdrc_gen:       object generation number
+ * @pdrc_mblen:     amount of data written in the mblock; 0 for mlogs
+ * @pdrc_uuid:      used only for mlogs; must be at the end of this struct
+ */
+struct mdcrec_data_ocreate_omf {
+	u8                             pdrc_rtype;
+	u8                             pdrc_mclass;
+	u8                             pdrc_pad[2];
+	struct layout_descriptor_omf   pdrc_ld;
+	__le64                         pdrc_objid;
+	__le64                         pdrc_gen;
+	__le64                         pdrc_mblen;
+	u8                             pdrc_uuid[];
+} __packed;
+
+/* Define set/get methods for mdcrec_data_ocreate_omf */
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_rtype, 8)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_mclass, 8)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_objid, 64)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_gen, 64)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_mblen, 64)
+#define OMF_MDCREC_OBJCMN_PACKLEN (sizeof(struct mdcrec_data_ocreate_omf) + \
+				   OMF_UUID_PACKLEN)
+
+
+/**
+ * struct mdcrec_data_mpconfig_omf - packed data mpool config
+ * @pdmc_rtype:         mdrec_type_omf: OMF_MDR_MPCONFIG
+ * @pdmc_oid1:
+ * @pdmc_oid2:
+ * @pdmc_uid:
+ * @pdmc_gid:
+ * @pdmc_mode:
+ * @pdmc_mclassp:
+ * @pdmc_captgt:
+ * @pdmc_ra_pages_max:
+ * @pdmc_vma_size_max:
+ * @pdmc_utype:         user-defined type (uuid)
+ * @pdmc_label:         user-defined label (ascii)
+ */
+struct mdcrec_data_mpconfig_omf {
+	u8      pdmc_rtype;
+	u8      pdmc_pad[7];
+	__le64  pdmc_oid1;
+	__le64  pdmc_oid2;
+	__le32  pdmc_uid;
+	__le32  pdmc_gid;
+	__le32  pdmc_mode;
+	__le32  pdmc_rsvd0;
+	__le64  pdmc_captgt;
+	__le32  pdmc_ra_pages_max;
+	__le32  pdmc_vma_size_max;
+	__le32  pdmc_rsvd1;
+	__le32  pdmc_rsvd2;
+	__le64  pdmc_rsvd3;
+	__le64  pdmc_rsvd4;
+	u8      pdmc_utype[16];
+	u8      pdmc_label[MPOOL_LABELSZ_MAX];
+} __packed;
+
+/* Define set/get methods for mdcrec_data_mpconfig_omf */
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rtype, 8)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_oid1, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_oid2, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_uid, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_gid, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_mode, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd0, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_captgt, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_ra_pages_max, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_vma_size_max, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd1, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd2, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd3, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd4, 64)
+OMF_SETGET_CHBUF(struct mdcrec_data_mpconfig_omf, pdmc_utype)
+OMF_SETGET_CHBUF(struct mdcrec_data_mpconfig_omf, pdmc_label)
+#define OMF_MDCREC_MPCONFIG_PACKLEN (sizeof(struct mdcrec_data_mpconfig_omf))
+
+
+/*
+ * Object types embedded in opaque uint64 object ids by the pmd module.
+ * This encoding is also present in the object ids stored in the
+ * data records on media.
+ *
+ * The obj_type field is 4 bits. There are two valid obj types.
+ */
+enum obj_type_omf {
+	OMF_OBJ_UNDEF       = 0,
+	OMF_OBJ_MBLOCK      = 1,
+	OMF_OBJ_MLOG        = 2,
+};
+
+/**
+ * enum sb_descriptor_ver_omf - mpool superblock version
+ * @OMF_SB_DESC_UNDEF: value not on media
+ * @OMF_SB_DESC_V1:    superblock descriptor format version 1
+ */
+enum sb_descriptor_ver_omf {
+	OMF_SB_DESC_UNDEF        = 0,
+	OMF_SB_DESC_V1           = 1,
+};
+#define OMF_SB_DESC_VER_LAST   OMF_SB_DESC_V1
+
+
+/**
+ * struct sb_descriptor_omf - packed super block, super block descriptor format version 1.
+ * @psb_magic:  mpool magic value; offset 0 in all vers
+ * @psb_name:   mpool name
+ * @psb_poolid: UUID of pool this drive belongs to
+ * @psb_vers:   sb format version; offset 56
+ * @psb_gen:    sb generation number on this drive
+ * @psb_cksum1: checksum of all fields above
+ * @psb_parm:   parameters for this drive
+ * @psb_cksum2: checksum of psb_parm
+ * @psb_mdc01gen:   mdc0 log1 generation number
+ * @psb_mdc01uuid:  mdc0 log1 UUID
+ * @psb_mdc01devid: mdc0 log1 device UUID
+ * @psb_mdc01desc:  mdc0 log1 layout
+ * @psb_mdc02gen:   mdc0 log2 generation number
+ * @psb_mdc02uuid:  mdc0 log2 UUID
+ * @psb_mdc02devid: mdc0 log2 device UUID
+ * @psb_mdc02desc:  mdc0 log2 layout
+ * @psb_mdc0dev:    drive param for mdc0 strip
+ *
+ * Note: the fields up to and including psb_cksum1 are known to libblkid
+ * and cannot be changed without causing havoc. Fields from psb_magic through
+ * psb_cksum1 are at the same offset in all versions.
+ */
+struct sb_descriptor_omf {
+	__le64                         psb_magic;
+	u8                             psb_name[OMF_MPOOL_NAME_LEN];
+	u8                             psb_poolid[OMF_UUID_PACKLEN];
+	__le16                         psb_vers;
+	__le32                         psb_gen;
+	u8                             psb_cksum1[4];
+
+	u8                             psb_pad1[6];
+	struct devparm_descriptor_omf  psb_parm;
+	u8                             psb_cksum2[4];
+
+	u8                             psb_pad2[4];
+	__le64                         psb_mdc01gen;
+	u8                             psb_mdc01uuid[OMF_UUID_PACKLEN];
+	u8                             psb_mdc01devid[OMF_UUID_PACKLEN];
+	struct layout_descriptor_omf   psb_mdc01desc;
+
+	u8                             psb_pad3[4];
+	__le64                         psb_mdc02gen;
+	u8                             psb_mdc02uuid[OMF_UUID_PACKLEN];
+	u8                             psb_mdc02devid[OMF_UUID_PACKLEN];
+	struct layout_descriptor_omf   psb_mdc02desc;
+
+	u8                             psb_pad4[4];
+	struct devparm_descriptor_omf  psb_mdc0dev;
+} __packed;
+
+OMF_SETGET(struct sb_descriptor_omf, psb_magic, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_name)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_poolid)
+OMF_SETGET(struct sb_descriptor_omf, psb_vers, 16)
+OMF_SETGET(struct sb_descriptor_omf, psb_gen, 32)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_cksum1)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_cksum2)
+OMF_SETGET(struct sb_descriptor_omf, psb_mdc01gen, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc01uuid)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc01devid)
+OMF_SETGET(struct sb_descriptor_omf, psb_mdc02gen, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc02uuid)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc02devid)
+#define OMF_SB_DESC_PACKLEN (sizeof(struct sb_descriptor_omf))
+
+/*
+ * Maximum packed length of any MDC record. Among the object-related records,
+ * OCREATE/OUPDATE (rtype + objid + gen + layout desc + uuid) is the largest.
+ */
+#define OMF_MDCREC_PACKLEN_MAX max(OMF_MDCREC_OBJCMN_PACKLEN,            \
+				   max(OMF_MDCREC_MCCONFIG_PACKLEN,      \
+				       max(OMF_MDCREC_CLS_SPARE_PACKLEN, \
+					   OMF_MDCREC_MPCONFIG_PACKLEN)))
+
+#endif /* MPOOL_OMF_H */
diff --git a/drivers/mpool/omf_if.h b/drivers/mpool/omf_if.h
new file mode 100644
index 000000000000..5f11a03ef500
--- /dev/null
+++ b/drivers/mpool/omf_if.h
@@ -0,0 +1,381 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_OMF_IF_H
+#define MPOOL_OMF_IF_H
+
+#include "uuid.h"
+#include "mpool_ioctl.h"
+
+#include "mp.h"
+#include "omf.h"
+
+struct mpool_descriptor;
+struct pmd_layout;
+
+/*
+ * Common defs: versioned via version number field of enclosing structs
+ */
+
+/**
+ * struct omf_layout_descriptor - version 1 layout descriptor
+ * @ol_zaddr:
+ * @ol_zcnt: number of zones
+ * @ol_pdh:
+ */
+struct omf_layout_descriptor {
+	u64    ol_zaddr;
+	u32    ol_zcnt;
+	u16    ol_pdh;
+};
+
+/**
+ * struct omf_devparm_descriptor - version 1 devparm descriptor
+ * @odp_devid:    UUID for drive
+ * @odp_devsz:    size, in bytes, of the volume/device
+ * @odp_zonetot:  total number of zones
+ * @odp_zonepg:   zone size in number of zone pages
+ * @odp_mclassp:  enum mp_media_classp
+ * @odp_devtype:  PD type. Enum pd_devtype
+ * @odp_sectorsz: 2^odp_sectorsz = sector size
+ * @odp_features: features, OR'ed bits of enum mp_mc_features
+ *
+ * The fields zonepg, mclassp, devtype, sectorsz, and features uniquely identify
+ * the media class of the PD.
+ * All drives in a media class must have the same values in these fields.
+ */
+struct omf_devparm_descriptor {
+	struct mpool_uuid  odp_devid;
+	u64                odp_devsz;
+	u32                odp_zonetot;
+
+	u32                odp_zonepg;
+	u8                 odp_mclassp;
+	u8                 odp_devtype;
+	u8                 odp_sectorsz;
+	u64                odp_features;
+};
+
+/*
+ * Superblock (sb) -- version 1
+ *
+ * Note: the 8 bytes are reversed so the little-endian on-media layout reads in correct ASCII order
+ */
+#define OMF_SB_MAGIC  0x7665446c6f6f706dULL  /* ASCII mpoolDev - no null */
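The comment above can be checked directly: stored little-endian, the low-to-high bytes of the magic spell out the ASCII string. A quick host-side sketch (little-endian host assumed; the helper name is illustrative, not driver code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OMF_SB_MAGIC  0x7665446c6f6f706dULL	/* ASCII mpoolDev - no null */

/* True if the 8 magic bytes, as laid out in memory on a little-endian
 * host, read as "mpoolDev". */
static int magic_is_mpooldev(uint64_t magic)
{
	char bytes[8];

	memcpy(bytes, &magic, sizeof(bytes));
	return memcmp(bytes, "mpoolDev", sizeof(bytes)) == 0;
}
```

The low byte 0x6d is 'm', the high byte 0x76 is 'v', which is why the constant looks reversed when written as a hex literal.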
+
+/**
+ * struct omf_sb_descriptor - version 1 superblock descriptor
+ * @osb_magic:  mpool magic value
+ * @osb_name:   mpool name, contains a terminating 0 byte
+ * @osb_cktype: enum mp_cksum_type value
+ * @osb_vers:   sb format version
+ * @osb_poolid: UUID of pool this drive belongs to
+ * @osb_gen:    sb generation number on this drive
+ * @osb_parm:   parameters for this drive
+ * @osb_mdc01gen:   mdc0 log1 generation number
+ * @osb_mdc01uuid:
+ * @osb_mdc01devid:
+ * @osb_mdc01desc:  mdc0 log1 layout
+ * @osb_mdc02gen:   mdc0 log2 generation number
+ * @osb_mdc02uuid:
+ * @osb_mdc02devid:
+ * @osb_mdc02desc:  mdc0 log2 layout
+ * @osb_mdc0dev:   drive param for mdc0
+ */
+struct omf_sb_descriptor {
+	u64                            osb_magic;
+	u8                             osb_name[MPOOL_NAMESZ_MAX];
+	u8                             osb_cktype;
+	u16                            osb_vers;
+	struct mpool_uuid              osb_poolid;
+	u32                            osb_gen;
+	struct omf_devparm_descriptor  osb_parm;
+
+	u64                            osb_mdc01gen;
+	struct mpool_uuid              osb_mdc01uuid;
+	struct mpool_uuid              osb_mdc01devid;
+	struct omf_layout_descriptor   osb_mdc01desc;
+
+	u64                            osb_mdc02gen;
+	struct mpool_uuid              osb_mdc02uuid;
+	struct mpool_uuid              osb_mdc02devid;
+	struct omf_layout_descriptor   osb_mdc02desc;
+
+	struct omf_devparm_descriptor  osb_mdc0dev;
+};
+
+/**
+ * struct omf_logrec_descriptor - in-memory log record descriptor
+ * @olr_tlen:  logical length of data record (all chunks)
+ * @olr_rlen:  length of data chunk in this log record
+ * @olr_rtype: enum logrec_type_omf value
+ */
+struct omf_logrec_descriptor {
+	u32    olr_tlen;
+	u16    olr_rlen;
+	u8     olr_rtype;
+};
+
+/**
+ * struct omf_logblock_header - in-memory log block header
+ * @olh_magic:   unique ID per mlog
+ * @olh_pfsetid: flush set ID of the previous log block
+ * @olh_cfsetid: flush set ID this log block belongs to
+ * @olh_gen:     generation number
+ * @olh_vers:    log block format version
+ */
+struct omf_logblock_header {
+	struct mpool_uuid  olh_magic;
+	u32                olh_pfsetid;
+	u32                olh_cfsetid;
+	u64                olh_gen;
+	u16                olh_vers;
+};
+
+/**
+ * struct omf_mdcver - version of an mpool MDC content.
+ * @mdcver:
+ *
+ * mdcver[0]: major version number
+ * mdcver[1]: minor version number
+ * mdcver[2]: patch version number
+ * mdcver[3]: development version number. Used during development cycle when
+ *            the above numbers don't change.
+ *
+ * This is not the version of the message framing used for the MDC.
+ * This is the version of the binary that introduced that version of the
+ * MDC content.
+ */
+struct omf_mdcver {
+	u16    mdcver[4];
+};
+
+#define mdcv_major    mdcver[0]
+#define mdcv_minor    mdcver[1]
+#define mdcv_patch    mdcver[2]
+#define mdcv_dev      mdcver[3]
+
+/**
+ * struct omf_mdcrec_data -
+ * @omd_version:  OMF_MDR_VERSION record
+ * @omd_objid:  object identifier
+ * @omd_gen:    object generation number
+ * @omd_layout:
+ * @omd_mblen:  Length of written data in object
+ * @omd_old:
+ * @omd_uuid:
+ * @omd_parm:
+ * @omd_mclassp: mp_media_classp
+ * @omd_spzone:   percent spare zones for drives in media class
+ * @omd_cfg:
+ * @omd_rtype: enum mdcrec_type_omf value
+ *
+ * object-related rtypes:
+ * ODELETE, OIDCKPT: objid field only; others ignored
+ * OERASE: objid and gen fields only; others ignored
+ * OCREATE, OUPDATE: layout field only; others ignored
+ */
+struct omf_mdcrec_data {
+	union ustruct {
+		struct omf_mdcver omd_version;
+
+		struct object {
+			u64                             omd_objid;
+			u64                             omd_gen;
+			struct pmd_layout              *omd_layout;
+			u64                             omd_mblen;
+			struct omf_layout_descriptor    omd_old;
+			struct mpool_uuid               omd_uuid;
+			u8                              omd_mclass;
+		} obj;
+
+		struct drive_state {
+			struct omf_devparm_descriptor  omd_parm;
+		} dev;
+
+		struct media_cls_spare {
+			u8 omd_mclassp;
+			u8 omd_spzone;
+		} mcs;
+
+		struct mpool_config    omd_cfg;
+	} u;
+
+	u8             omd_rtype;
+};
+
+/**
+ * objid_type() - Return the type field from an objid
+ * @objid:
+ */
+static inline int objid_type(u64 objid)
+{
+	return ((objid & 0xF00) >> 8);
+}
+
+static inline bool objtype_valid(enum obj_type_omf otype)
+{
+	return otype == OMF_OBJ_MBLOCK || otype == OMF_OBJ_MLOG;
+}
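As a quick illustration of the two helpers above, here is a userspace restatement (the _demo names are illustrative; the constants mirror enum obj_type_omf):

```c
#include <assert.h>
#include <stdint.h>

enum obj_type { OBJ_UNDEF = 0, OBJ_MBLOCK = 1, OBJ_MLOG = 2 };

/* The 4-bit type field sits in bits 8..11 of the opaque 64-bit object id. */
static inline int objid_type_demo(uint64_t objid)
{
	return (objid & 0xF00) >> 8;
}

static inline int objtype_valid_demo(int otype)
{
	return otype == OBJ_MBLOCK || otype == OBJ_MLOG;
}
```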
+
+/*
+ * omf API functions -- exported functions for working with omf structures
+ */
+
+/**
+ * omf_sb_pack_htole() - pack superblock
+ * @sb: struct omf_sb_descriptor *
+ * @outbuf: char *
+ *
+ * Pack superblock into outbuf little-endian computing specified checksum.
+ *
+ * Return: 0 if successful, -EINVAL otherwise
+ */
+int omf_sb_pack_htole(struct omf_sb_descriptor *sb, char *outbuf);
+
+/**
+ * omf_sb_unpack_letoh() - unpack superblock
+ * @sb: struct omf_sb_descriptor *
+ * @inbuf: char *
+ * @omf_ver: on-media-format superblock version
+ *
+ * Unpack little-endian superblock from inbuf into sb verifying checksum.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int omf_sb_unpack_letoh(struct omf_sb_descriptor *sb, const char *inbuf, u16 *omf_ver);
+
+/**
+ * omf_sb_has_magic_le() - Determine if buffer has superblock magic value
+ * @inbuf: char *
+ *
+ * Determine if little-endian buffer inbuf has superblock magic value
+ * where expected; does NOT imply inbuf is a valid superblock.
+ *
+ * Return: 1 if true; 0 otherwise
+ */
+bool omf_sb_has_magic_le(const char *inbuf);
+
+/**
+ * omf_logblock_header_pack_htole() - pack log block header
+ * @lbh: struct omf_logblock_header *
+ * @lbuf: char *
+ *
+ * Pack header into little-endian log block buffer lbuf, excluding the checksum.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int omf_logblock_header_pack_htole(struct omf_logblock_header *lbh, char *lbuf);
+
+/**
+ * omf_logblock_header_len_le() - Determine header length of log block
+ * @lbuf: char *
+ *
+ * Check little-endian log block in lbuf to determine header length.
+ *
+ * Return: bytes in packed header; -EINVAL if invalid header vers
+ */
+int omf_logblock_header_len_le(char *lbuf);
+
+/**
+ * omf_logblock_header_unpack_letoh() - unpack log block header
+ * @lbh: struct omf_logblock_header *
+ * @inbuf: char *
+ *
+ * Unpack little-endian log block header from inbuf into lbh; does not
+ * verify checksum.
+ *
+ * Return: 0 if successful, -EINVAL if invalid log block header vers
+ */
+int omf_logblock_header_unpack_letoh(struct omf_logblock_header *lbh, const char *inbuf);
+
+/**
+ * omf_logrec_desc_pack_htole() - pack log record descriptor
+ * @lrd: struct omf_logrec_descriptor *
+ * @outbuf: char *
+ *
+ * Pack log record descriptor into outbuf little-endian.
+ *
+ * Return: 0 if successful, -EINVAL if invalid log rec type
+ */
+int omf_logrec_desc_pack_htole(struct omf_logrec_descriptor *lrd, char *outbuf);
+
+/**
+ * omf_logrec_desc_unpack_letoh() - unpack log record descriptor
+ * @lrd: struct omf_logrec_descriptor *
+ * @inbuf: char *
+ *
+ * Unpack little-endian log record descriptor from inbuf into lrd.
+ */
+void omf_logrec_desc_unpack_letoh(struct omf_logrec_descriptor *lrd, const char *inbuf);
+
+/**
+ * omf_mdcrec_pack_htole() - pack mdc record
+ * @mp: struct mpool_descriptor *
+ * @cdr: struct omf_mdcrec_data *
+ * @outbuf: char *
+ *
+ * Pack mdc record into outbuf little-endian.
+ * NOTE: Assumes outbuf has enough space for the layout structure.
+ *
+ * Return: bytes packed if successful, -EINVAL otherwise
+ */
+int omf_mdcrec_pack_htole(struct mpool_descriptor *mp, struct omf_mdcrec_data *cdr, char *outbuf);
+
+/**
+ * omf_mdcrec_unpack_letoh() - unpack mdc record
+ * @mdcver: mdc content version of the mdc from which this data comes.
+ *          NULL means latest MDC content version known by this binary.
+ * @mp:     struct mpool_descriptor *
+ * @cdr:    struct omf_mdcrec_data *
+ * @inbuf:  char *
+ *
+ * Unpack little-endian mdc record from inbuf into cdr.
+ *
+ * Return: 0 if successful, -errno on error
+ */
+int omf_mdcrec_unpack_letoh(struct omf_mdcver *mdcver, struct mpool_descriptor *mp,
+			    struct omf_mdcrec_data *cdr, const char *inbuf);
+
+/**
+ * omf_mdcrec_isobj_le() - determine if mdc record is object-related
+ * @inbuf: char *
+ *
+ * Return true if little-endian mdc record in inbuf is object-related.
+ */
+int omf_mdcrec_isobj_le(const char *inbuf);
+
+/**
+ * omf_mdcver_unpack_letoh() - Unpack le mdc version record from inbuf.
+ * @cdr:   filled in with the unpacked version record
+ * @inbuf: packed little-endian version record
+ */
+void omf_mdcver_unpack_letoh(struct omf_mdcrec_data *cdr, const char *inbuf);
+
+/**
+ * omf_mdcrec_unpack_type_letoh() - extract the record type from a packed MDC record.
+ * @inbuf: packed MDC record.
+ */
+u8 omf_mdcrec_unpack_type_letoh(const char *inbuf);
+
+/**
+ * logrec_type_datarec() - data record or not
+ * @rtype:
+ *
+ * Return: true if the log record type is related to a data record.
+ */
+bool logrec_type_datarec(enum logrec_type_omf rtype);
+
+/**
+ * omf_sbver_to_mdcver() - Returns the matching mdc version for a given superblock version
+ * @sbver: superblock version
+ */
+struct omf_mdcver *omf_sbver_to_mdcver(enum sb_descriptor_ver_omf sbver);
+
+int omf_init(void) __cold;
+void omf_exit(void) __cold;
+
+#endif /* MPOOL_OMF_IF_H */
diff --git a/drivers/mpool/upgrade.h b/drivers/mpool/upgrade.h
new file mode 100644
index 000000000000..3b3748c47a3e
--- /dev/null
+++ b/drivers/mpool/upgrade.h
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * Defines structures for upgrading MPOOL meta data
+ */
+
+#ifndef MPOOL_UPGRADE_H
+#define MPOOL_UPGRADE_H
+
+#include "omf_if.h"
+
+/*
+ * Size of version converted to string.
+ * 4 * (5 bytes for a u16) + 3 * (1 byte for the '.') + 1 byte for \0
+ */
+#define MAX_MDCVERSTR          24
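The sizing above can be sanity-checked in userspace: the worst case is four u16s at their maximum ("65535") joined by three dots, plus the terminating NUL. A sketch (the helper name is illustrative; the kernel's own formatting lives in omfu_mdcver_to_str()):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_MDCVERSTR 24	/* 4*5 digits + 3 dots + NUL */

/* Format major.minor.patch.dev into buf; returns the would-be length. */
static int mdcver_to_str(const uint16_t v[4], char *buf, size_t sz)
{
	return snprintf(buf, sz, "%u.%u.%u.%u", v[0], v[1], v[2], v[3]);
}
```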
+
+/*
+ * Naming conventions:
+ *
+ * omf structures:
+ * ---------------
+ * The old structure names end with _omf_v<version number>.
+ * For example: layout_descriptor_omf_v1
+ * The current/latest structure names end simply with _omf.
+ * For example: layout_descriptor_omf
+ *
+ * Conversion functions:
+ * ---------------------
+ * They are named like:
+ * omf_convert_<blabla>_<maj>_<min>_<patch>_<dev>to<maj>_<min>_<patch>_<dev>()
+ *
+ * For example: omf_convert_sb_1_0_0_0to1_0_0_1()
+ *
+ * They are not named like omf_convert_<blabla>_v1tov2() because sometimes the
+ * input and output structures are exactly the same and the conversion is
+ * related to some subtle interpretation of structure field[s] content.
+ *
+ * Unpack functions:
+ * -----------------
+ * They are named like:
+ * omf_<blabla>_unpack_letoh_v<version number>()
+ * <version number> being the version of the structure.
+ *
+ * For example: omf_layout_unpack_letoh_v1()
+ * Note that for the latest/current version of the structure we cannot
+ * name the unpack function omf_<blabla>_unpack_letoh() because that would
+ * introduce a name conflict with the top unpack function that calls
+ * omf_unpack_letoh_and_convert().
+ *
+ * For example for layout we have:
+ * omf_layout_unpack_letoh_v1() unpacks layout_descriptor_omf_v1
+ * omf_layout_unpack_letoh_v2() unpacks layout_descriptor_omf
+ * omf_layout_unpack_letoh() calls one of the two above.
+ */
+
+/**
+ * struct upgrade_history -
+ * @uh_size:    size of the current version in-memory structure
+ * @uh_unpack:  unpacking function from on-media format to in-memory format
+ * @uh_conv:    conversion function from previous version to current version,
+ *              set to NULL for the first version
+ * @uh_sbver:   corresponding superblock version since which the change has
+ *              been introduced. If this structure is not used by the
+ *              superblock, set uh_sbver = OMF_SB_DESC_UNDEF.
+ * @uh_mdcver:  corresponding MDC version since which the change has been
+ *              introduced
+ *
+ * Every time we update a nested structure in the superblock or MDC, we
+ * record this information about the update so that we can keep the
+ * structure's update history.
+ */
+struct upgrade_history {
+	size_t                      uh_size;
+	int (*uh_unpack)(void *out, const char *inbuf);
+	int (*uh_conv)(const void *pre, void *cur);
+	enum sb_descriptor_ver_omf  uh_sbver;
+	struct omf_mdcver          uh_mdcver;
+};
+
+/**
+ * omfu_mdcver_cur() - Return the latest mpool MDC content version understood by this binary
+ */
+struct omf_mdcver *omfu_mdcver_cur(void);
+
+/**
+ * omfu_mdcver_comment() - Return the comment for the given mpool MDC content version.
+ * @mdcver: mpool MDC content version
+ */
+const char *omfu_mdcver_comment(struct omf_mdcver *mdcver);
+
+/**
+ * omfu_mdcver_to_str() - convert a version into a string.
+ * @mdcver: version to convert
+ * @buf:    buffer in which to place the conversion.
+ * @sz:     size of "buf" in bytes.
+ *
+ * Returns "buf"
+ */
+char *omfu_mdcver_to_str(struct omf_mdcver *mdcver, char *buf, size_t sz);
+
+/**
+ * omfu_mdcver_cmp() - compare two versions a and b
+ * @a:  first version
+ * @op: compare operator (C syntax), can be "<", "<=", ">", ">=", "==".
+ * @b:  second version
+ *
+ * Return (a op b)
+ */
+bool omfu_mdcver_cmp(struct omf_mdcver *a, char *op, struct omf_mdcver *b);
+
+/**
+ * omfu_mdcver_cmp2() - compare two versions
+ * @a:     first version
+ * @op:    compare operator (C syntax), can be "<", "<=", ">", ">=", "==".
+ * @major: major, minor, patch and dev, which together compose the second version
+ * @minor:
+ * @patch:
+ * @dev:
+ *
+ * Return (a op b)
+ */
+bool omfu_mdcver_cmp2(struct omf_mdcver *a, char *op, u16 major, u16 minor, u16 patch, u16 dev);
+
+#endif /* MPOOL_UPGRADE_H */
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 04/22] mpool: add pool drive component which handles mpool IO using the block layer API
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (2 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 03/22] mpool: add on-media " Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 05/22] mpool: add space map component which manages free space on mpool devices Nabeel M Mohamed
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

The pool drive (pd) component interfaces with the block layer to
read, write, flush, and discard mpool objects.

The underlying block device(s) are opened during mpool activation
and remain open until deactivated.

Read/write IO to an mpool device is chunked by the PD layer.
Chunking interleaves IO from different streams, providing better
QoS.  The size of each chunk (or BIO size) is determined by the
module parameter 'chunk_size_kb'.  All the chunks from a single
r/w request are issued asynchronously to the block layer using
BIO chaining.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c |  31 +++-
 drivers/mpool/init.h |  12 ++
 drivers/mpool/pd.c   | 424 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 466 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mpool/init.h
 create mode 100644 drivers/mpool/pd.c

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 0493fb5b1157..294cf3cbbaa7 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -5,13 +5,42 @@
 
 #include <linux/module.h>
 
+#include "mpool_printk.h"
+
+#include "pd.h"
+
+/*
+ * Module params...
+ */
+unsigned int rsvd_bios_max __read_mostly = 16;
+module_param(rsvd_bios_max, uint, 0444);
+MODULE_PARM_DESC(rsvd_bios_max, "max reserved bios in mpool bioset");
+
+int chunk_size_kb __read_mostly = 128;
+module_param(chunk_size_kb, uint, 0644);
+MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
+
+static void mpool_exit_impl(void)
+{
+	pd_exit();
+}
+
 static __init int mpool_init(void)
 {
-	return 0;
+	int rc;
+
+	rc = pd_init();
+	if (rc) {
+		mp_pr_err("pd init failed", rc);
+		mpool_exit_impl();
+	}
+
+	return rc;
 }
 
 static __exit void mpool_exit(void)
 {
+	mpool_exit_impl();
 }
 
 module_init(mpool_init);
diff --git a/drivers/mpool/init.h b/drivers/mpool/init.h
new file mode 100644
index 000000000000..e02a9672e727
--- /dev/null
+++ b/drivers/mpool/init.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_INIT_H
+#define MPOOL_INIT_H
+
+extern unsigned int rsvd_bios_max;
+extern int chunk_size_kb;
+
+#endif /* MPOOL_INIT_H */
diff --git a/drivers/mpool/pd.c b/drivers/mpool/pd.c
new file mode 100644
index 000000000000..f13c7704efad
--- /dev/null
+++ b/drivers/mpool/pd.c
@@ -0,0 +1,424 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Pool drive module with backing block devices.
+ *
+ * Defines functions for probing, reading, and writing drives in an mpool.
+ * IO is done using kernel BIO facilities.
+ */
+
+#define _LARGEFILE64_SOURCE
+
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk_types.h>
+
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "init.h"
+#include "omf_if.h"
+#include "pd.h"
+
+#ifndef SECTOR_SHIFT
+#define SECTOR_SHIFT   9
+#endif
+
+static struct bio_set mpool_bioset;
+
+static const fmode_t    pd_bio_fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+static char            *pd_bio_holder = "mpool";
+
+int pd_dev_open(const char *path, struct pd_dev_parm *dparm, struct pd_prop *pd_prop)
+{
+	struct block_device *bdev;
+
+	bdev = blkdev_get_by_path(path, pd_bio_fmode, pd_bio_holder);
+	if (IS_ERR(bdev))
+		return PTR_ERR(bdev);
+
+	dparm->dpr_dev_private = bdev;
+	dparm->dpr_prop = *pd_prop;
+
+	if ((pd_prop->pdp_devtype != PD_DEV_TYPE_BLOCK_STD) &&
+	    (pd_prop->pdp_devtype != PD_DEV_TYPE_BLOCK_NVDIMM)) {
+		int rc = -EINVAL;
+
+		mp_pr_err("unsupported PD type %d", rc, pd_prop->pdp_devtype);
+		return rc;
+	}
+
+	return 0;
+}
+
+int pd_dev_close(struct pd_dev_parm *dparm)
+{
+	struct block_device *bdev = dparm->dpr_dev_private;
+
+	if (bdev) {
+		dparm->dpr_dev_private = NULL;
+		sync_blockdev(bdev);
+		invalidate_bdev(bdev);
+		blkdev_put(bdev, pd_bio_fmode);
+	}
+
+	return bdev ? 0 : -EINVAL;
+}
+
+int pd_dev_flush(struct pd_dev_parm *dparm)
+{
+	struct block_device *bdev;
+	int rc;
+
+	bdev = dparm->dpr_dev_private;
+	if (!bdev) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s not registered", rc, dparm->dpr_name);
+		return rc;
+	}
+
+	rc = blkdev_issue_flush(bdev, GFP_NOIO);
+	if (rc)
+		mp_pr_err("bdev %s, flush failed", rc, dparm->dpr_name);
+
+	return rc;
+}
+
+/**
+ * pd_bio_discard() - issue discard command to erase a byte-aligned region
+ * @dparm: drive parameters
+ * @off:   byte offset of the region on the device
+ * @len:   length of the region in bytes
+ */
+static int pd_bio_discard(struct pd_dev_parm *dparm, loff_t off, size_t len)
+{
+	struct block_device *bdev;
+	int rc;
+
+	bdev = dparm->dpr_dev_private;
+	if (!bdev) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s not registered", rc, dparm->dpr_name);
+		return rc;
+	}
+
+	/* Validate I/O offset is sector-aligned */
+	if (off & PD_SECTORMASK(&dparm->dpr_prop)) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s, offset 0x%lx not multiple of sec size %u",
+			  rc, dparm->dpr_name, (ulong)off, (1 << PD_SECTORSZ(&dparm->dpr_prop)));
+		return rc;
+	}
+
+	if (off > PD_LEN(&dparm->dpr_prop)) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s, offset 0x%lx past end 0x%lx",
+			  rc, dparm->dpr_name, (ulong)off, (ulong)PD_LEN(&dparm->dpr_prop));
+		return rc;
+	}
+
+	rc = blkdev_issue_discard(bdev, off >> SECTOR_SHIFT, len >> SECTOR_SHIFT, GFP_NOIO, 0);
+	if (rc)
+		mp_pr_err("bdev %s, offset 0x%lx len 0x%lx, discard failure",
+			  rc, dparm->dpr_name, (ulong)off, (ulong)len);
+
+	return rc;
+}
+
+/**
+ * pd_zone_erase() - issue write-zeros or discard commands to erase PD
+ * @dparm:        drive parameters
+ * @zaddr:        starting zone
+ * @zonecnt:      number of zones to erase; 0 means all zones from
+ *                @zaddr through the end of the device
+ * @reads_erased: whether the erased zones will be read afterwards
+ */
+int pd_zone_erase(struct pd_dev_parm *dparm, u64 zaddr, u32 zonecnt, bool reads_erased)
+{
+	int rc = 0;
+	u64 cmdopt;
+
+	/* Validate args against zone param */
+	if (zaddr >= dparm->dpr_zonetot)
+		return -EINVAL;
+
+	if (zonecnt == 0)
+		zonecnt = dparm->dpr_zonetot - zaddr;
+
+	if (zonecnt > (dparm->dpr_zonetot - zaddr))
+		return -EINVAL;
+
+	if (zonecnt == 0)
+		return 0;
+
+	/*
+	 * When both DIF and SED are enabled, a read from a discarded block
+	 * would fail, so we can't discard blocks if both DIF and SED are
+	 * enabled AND we need to read blocks after erase.
+	 */
+	cmdopt = dparm->dpr_cmdopt;
+	if ((cmdopt & PD_CMD_DISCARD) &&
+	    !(reads_erased && (cmdopt & PD_CMD_DIF_ENABLED) && (cmdopt & PD_CMD_SED_ENABLED))) {
+		size_t zlen;
+
+		zlen = dparm->dpr_zonepg << PAGE_SHIFT;
+		rc = pd_bio_discard(dparm, zaddr * zlen, zonecnt * zlen);
+	}
+
+	return rc;
+}
+
+static void pd_bio_init(struct bio *bio, struct block_device *bdev, int rw, loff_t off, int flags)
+{
+	bio_set_op_attrs(bio, rw, flags);
+	bio->bi_iter.bi_sector = off >> SECTOR_SHIFT;
+	bio_set_dev(bio, bdev);
+}
+
+static struct bio *pd_bio_chain(struct bio *target, unsigned int nr_pages, gfp_t gfp)
+{
+	struct bio *new;
+
+	new = bio_alloc_bioset(gfp, nr_pages, &mpool_bioset);
+
+	if (!target)
+		return new;
+
+	if (new) {
+		bio_chain(target, new);
+		submit_bio(target);
+	} else {
+		submit_bio_wait(target);
+		bio_put(target);
+	}
+
+	return new;
+}
+
+/**
+ * pd_bio_rw() -
+ * @dparm:   drive parameters
+ * @iov:     kvec list describing the I/O buffers
+ * @iovcnt:  number of entries in @iov
+ * @off:     offset in bytes on disk
+ * @rw:      REQ_OP_READ or REQ_OP_WRITE
+ * @opflags: additional request flags (e.g. REQ_FUA)
+ *
+ * pd_bio_rw() expects a list of kvecs wherein each base ptr is sector
+ * aligned and each length is a multiple of the sector size.
+ *
+ * If the IO is bigger than 1MiB (BIO_MAX_PAGES pages) or chunk_size_kb,
+ * it is split into several IOs.
+ */
+static int pd_bio_rw(struct pd_dev_parm *dparm, const struct kvec *iov,
+		     int iovcnt, loff_t off, int rw, int opflags)
+{
+	struct block_device *bdev;
+	struct page *page;
+	struct bio *bio;
+	uintptr_t iov_base;
+	u64 sector_mask;
+	u32 tot_pages, tot_len, len, iov_len, left, iolimit;
+	int i, cc, rc = 0;
+
+	if (iovcnt < 1)
+		return 0;
+
+	bdev = dparm->dpr_dev_private;
+	if (!bdev) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s not registered", rc, dparm->dpr_name);
+		return rc;
+	}
+
+	sector_mask = PD_SECTORMASK(&dparm->dpr_prop);
+	if (off & sector_mask) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s, %s offset 0x%lx not multiple of sector size %u",
+			  rc, dparm->dpr_name, (rw == REQ_OP_READ) ? "read" : "write",
+			  (ulong)off, (1 << PD_SECTORSZ(&dparm->dpr_prop)));
+		return rc;
+	}
+
+	if (off > PD_LEN(&dparm->dpr_prop)) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s, %s offset 0x%lx past device end 0x%lx",
+			  rc, dparm->dpr_name, (rw == REQ_OP_READ) ? "read" : "write",
+			  (ulong)off, (ulong)PD_LEN(&dparm->dpr_prop));
+		return rc;
+	}
+
+	tot_pages = 0;
+	tot_len = 0;
+	for (i = 0; i < iovcnt; i++) {
+		if (!PAGE_ALIGNED((uintptr_t)iov[i].iov_base) || (iov[i].iov_len & sector_mask)) {
+			rc = -EINVAL;
+			mp_pr_err("bdev %s, %s off 0x%lx, misaligned kvec, base 0x%lx, len 0x%lx",
+				  rc, dparm->dpr_name, (rw == REQ_OP_READ) ? "read" : "write",
+				  (ulong)off, (ulong)iov[i].iov_base, (ulong)iov[i].iov_len);
+			return rc;
+		}
+
+		iov_len = iov[i].iov_len;
+		tot_len += iov_len;
+		while (iov_len > 0) {
+			len = min_t(size_t, PAGE_SIZE, iov_len);
+			iov_len -= len;
+			tot_pages++;
+		}
+	}
+
+	if (off + tot_len > PD_LEN(&dparm->dpr_prop)) {
+		rc = -EINVAL;
+		mp_pr_err("bdev %s, %s I/O end past device end 0x%lx, 0x%lx:0x%x",
+			  rc, dparm->dpr_name, (rw == REQ_OP_READ) ? "read" : "write",
+			  (ulong)PD_LEN(&dparm->dpr_prop), (ulong)off, tot_len);
+		return rc;
+	}
+
+	if (tot_len == 0)
+		return 0;
+
+	/* IO size for each bio is determined by the chunk size. */
+	iolimit = chunk_size_kb >> (PAGE_SHIFT - 10);
+	iolimit = clamp_t(u32, iolimit, 32, BIO_MAX_PAGES);
+
+	left = 0;
+	bio = NULL;
+
+	for (i = 0; i < iovcnt; i++) {
+		iov_base = (uintptr_t)iov[i].iov_base;
+		iov_len = iov[i].iov_len;
+
+		while (iov_len > 0) {
+			if (left == 0) {
+				left = min_t(size_t, tot_pages, iolimit);
+
+				bio = pd_bio_chain(bio, left, GFP_NOIO);
+				if (!bio)
+					return -ENOMEM;
+
+				pd_bio_init(bio, bdev, rw, off, opflags);
+			}
+
+			len = min_t(size_t, PAGE_SIZE, iov_len);
+			page = virt_to_page(iov_base);
+			cc = -1;
+
+			if (page)
+				cc = bio_add_page(bio, page, len, 0);
+
+			if (cc != len) {
+				if (cc == 0 && bio->bi_vcnt > 0) {
+					left = 0;
+					continue;
+				}
+
+				bio_io_error(bio);
+				bio_put(bio);
+				return -ENOTRECOVERABLE;
+			}
+
+			iov_len -= len;
+			iov_base += len;
+			off += len;
+			left--;
+			tot_pages--;
+		}
+	}
+
+	ASSERT(bio);
+	ASSERT(tot_pages == 0);
+
+	rc = submit_bio_wait(bio);
+	bio_put(bio);
+
+	return rc;
+}
+
+int pd_zone_pwritev(struct pd_dev_parm *dparm, const struct kvec *iov,
+		    int iovcnt, u64 zaddr, loff_t boff, int opflags)
+{
+	loff_t woff;
+
+	woff = ((u64)dparm->dpr_zonepg << PAGE_SHIFT) * zaddr + boff;
+
+	return pd_bio_rw(dparm, iov, iovcnt, woff, REQ_OP_WRITE, opflags);
+}
+
+int pd_zone_pwritev_sync(struct pd_dev_parm *dparm, const struct kvec *iov,
+			 int iovcnt, u64 zaddr, loff_t boff)
+{
+	struct block_device *bdev;
+	int rc;
+
+	rc = pd_zone_pwritev(dparm, iov, iovcnt, zaddr, boff, REQ_FUA);
+	if (rc)
+		return rc;
+
+	/*
+	 * This sync & invalidate of the bdev ensures that data written from
+	 * the kernel is immediately visible to user space.
+	 */
+	bdev = dparm->dpr_dev_private;
+	if (bdev) {
+		sync_blockdev(bdev);
+		invalidate_bdev(bdev);
+	}
+
+	return 0;
+}
+
+int pd_zone_preadv(struct pd_dev_parm *dparm, const struct kvec *iov,
+		   int iovcnt, u64 zaddr, loff_t boff)
+{
+	loff_t roff;
+
+	roff = ((u64)dparm->dpr_zonepg << PAGE_SHIFT) * zaddr + boff;
+
+	return pd_bio_rw(dparm, iov, iovcnt, roff, REQ_OP_READ, 0);
+}
+
+void pd_dev_set_unavail(struct pd_dev_parm *dparm, struct omf_devparm_descriptor *omf_devparm)
+{
+	struct pd_prop *pd_prop = &(dparm->dpr_prop);
+
+	/*
+	 * Fill in dparm for unavailable drive; sets zone parm and other
+	 * PD properties we keep in metadata; no ops vector because we need
+	 * the device to be available to know it (the discovery gets it).
+	 */
+	strncpy(dparm->dpr_prop.pdp_didstr, PD_DEV_ID_PDUNAVAILABLE, PD_DEV_ID_LEN);
+	pd_prop->pdp_devstate = PD_DEV_STATE_UNAVAIL;
+	pd_prop->pdp_cmdopt = PD_CMD_NONE;
+
+	pd_prop->pdp_zparam.dvb_zonepg  = omf_devparm->odp_zonepg;
+	pd_prop->pdp_zparam.dvb_zonetot = omf_devparm->odp_zonetot;
+	pd_prop->pdp_mclassp  = omf_devparm->odp_mclassp;
+	pd_prop->pdp_phys_if  = 0;
+	pd_prop->pdp_sectorsz = omf_devparm->odp_sectorsz;
+	pd_prop->pdp_devsz    = omf_devparm->odp_devsz;
+}
+
+
+int pd_init(void)
+{
+	int rc;
+
+	chunk_size_kb = clamp_t(uint, chunk_size_kb, 128, 1024);
+
+	rsvd_bios_max = clamp_t(uint, rsvd_bios_max, 1, 1024);
+
+	rc = bioset_init(&mpool_bioset, rsvd_bios_max, 0, BIOSET_NEED_BVECS);
+	if (rc)
+		mp_pr_err("mpool bioset init failed", rc);
+
+	return rc;
+}
+
+void pd_exit(void)
+{
+	bioset_exit(&mpool_bioset);
+}
-- 
2.17.2



* [PATCH v2 05/22] mpool: add space map component which manages free space on mpool devices
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (3 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 04/22] mpool: add pool drive component which handles mpool IO using the block layer API Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 06/22] mpool: add on-media pack, unpack and upgrade routines Nabeel M Mohamed
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

The smap layer implements a free space map for each media class
volume in an active mpool.

Free space maps are maintained in memory only. When an mpool is
activated, the free space map is reconstructed from the object
metadata read from media. This approach has the following
advantages:
- Objects can be allocated or freed without any space map device IO
- No overhead of tracking both the allocated and free space,
  and keeping them synchronized

The LBA space of a media class volume is subdivided into regions.
Allocation requests for a volume are distributed across these
regions. There's a separate space map per region which is
protected by a region mutex 'pdi_rmlock'.

Each region is further subdivided into zones. The zone size is
determined at mpool create time, and it defaults to 32MiB.
The free space in each region is represented by a rbtree,
where the key is a zone number and the value is the length of
the free space specified as a zone count.

A configurable percentage of the total zones on a given volume
are marked as spare zones, and the rest are marked as usable
zones. The smap supports different allocation policies which
determine which zone type is used to satisfy an allocation
request - usable or spare or usable then spare or
spare then usable.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c   |   17 +-
 drivers/mpool/mclass.c |  103 ++++
 drivers/mpool/smap.c   | 1031 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1150 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mpool/mclass.c
 create mode 100644 drivers/mpool/smap.c

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 294cf3cbbaa7..031408815b48 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -8,6 +8,7 @@
 #include "mpool_printk.h"
 
 #include "pd.h"
+#include "smap.h"
 
 /*
  * Module params...
@@ -22,16 +23,30 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 
 static void mpool_exit_impl(void)
 {
+	smap_exit();
 	pd_exit();
 }
 
 static __init int mpool_init(void)
 {
+	const char *errmsg = NULL;
 	int rc;
 
 	rc = pd_init();
 	if (rc) {
-		mp_pr_err("pd init failed", rc);
+		errmsg = "pd init failed";
+		goto errout;
+	}
+
+	rc = smap_init();
+	if (rc) {
+		errmsg = "smap init failed";
+		goto errout;
+	}
+
+errout:
+	if (rc) {
+		mp_pr_err("%s", rc, errmsg);
 		mpool_exit_impl();
 	}
 
diff --git a/drivers/mpool/mclass.c b/drivers/mpool/mclass.c
new file mode 100644
index 000000000000..a81ee5ee9468
--- /dev/null
+++ b/drivers/mpool/mclass.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * This file contains the media class accessor functions.
+ */
+
+#include <linux/errno.h>
+
+#include "omf_if.h"
+#include "pd.h"
+#include "params.h"
+#include "mclass.h"
+
+void mc_init_class(struct media_class *mc, struct mc_parms *mc_parms, struct mc_smap_parms *mcsp)
+{
+	memcpy(&(mc->mc_parms), mc_parms, sizeof(*mc_parms));
+	mc->mc_uacnt = 0;
+	mc->mc_sparms = *mcsp;
+}
+
+void mc_omf_devparm2mc_parms(struct omf_devparm_descriptor *omf_devparm, struct mc_parms *mc_parms)
+{
+	/* Zeroes mc_parms because memcmp() may be used on it later. */
+	memset(mc_parms, 0, sizeof(*mc_parms));
+	mc_parms->mcp_classp   = omf_devparm->odp_mclassp;
+	mc_parms->mcp_zonepg   = omf_devparm->odp_zonepg;
+	mc_parms->mcp_sectorsz = omf_devparm->odp_sectorsz;
+	mc_parms->mcp_devtype  = omf_devparm->odp_devtype;
+	mc_parms->mcp_features = omf_devparm->odp_features;
+}
+
+void mc_parms2omf_devparm(struct mc_parms *mc_parms, struct omf_devparm_descriptor *omf_devparm)
+{
+	omf_devparm->odp_mclassp  = mc_parms->mcp_classp;
+	omf_devparm->odp_zonepg   = mc_parms->mcp_zonepg;
+	omf_devparm->odp_sectorsz = mc_parms->mcp_sectorsz;
+	omf_devparm->odp_devtype  = mc_parms->mcp_devtype;
+	omf_devparm->odp_features = mc_parms->mcp_features;
+}
+
+int mc_cmp_omf_devparm(struct omf_devparm_descriptor *omfd1, struct omf_devparm_descriptor *omfd2)
+{
+	struct mc_parms mc_parms1;
+	struct mc_parms mc_parms2;
+
+	mc_omf_devparm2mc_parms(omfd1, &mc_parms1);
+	mc_omf_devparm2mc_parms(omfd2, &mc_parms2);
+
+	return memcmp(&mc_parms1, &mc_parms2, sizeof(mc_parms1));
+}
+
+void mc_pd_prop2mc_parms(struct pd_prop *pd_prop, struct mc_parms *mc_parms)
+{
+	/* Zeroes mc_parms because memcmp() may be used on it later. */
+	memset(mc_parms, 0, sizeof(*mc_parms));
+	mc_parms->mcp_classp	= pd_prop->pdp_mclassp;
+	mc_parms->mcp_zonepg	= pd_prop->pdp_zparam.dvb_zonepg;
+	mc_parms->mcp_sectorsz	= PD_SECTORSZ(pd_prop);
+	mc_parms->mcp_devtype	= pd_prop->pdp_devtype;
+	mc_parms->mcp_features	= OMF_MC_FEAT_MBLOCK_TGT;
+
+	if (pd_prop->pdp_cmdopt & PD_CMD_SECTOR_UPDATABLE)
+		mc_parms->mcp_features |= OMF_MC_FEAT_MLOG_TGT;
+	if (pd_prop->pdp_cmdopt & PD_CMD_DIF_ENABLED)
+		mc_parms->mcp_features |= OMF_MC_FEAT_CHECKSUM;
+}
+
+int mc_set_spzone(struct media_class *mc, u8 spzone)
+{
+	if (!mc)
+		return -EINVAL;
+
+	if (mc->mc_pdmc < 0)
+		return -ENOENT;
+
+	mc->mc_sparms.mcsp_spzone = spzone;
+
+	return 0;
+}
+
+static void mc_smap_parms_get_internal(struct mpcore_params *params, struct mc_smap_parms *mcsp)
+{
+	mcsp->mcsp_spzone = params->mp_spare;
+	mcsp->mcsp_rgnc = params->mp_smaprgnc;
+	mcsp->mcsp_align = params->mp_smapalign;
+}
+
+int mc_smap_parms_get(struct media_class *mc, struct mpcore_params *params,
+		      struct mc_smap_parms *mcsp)
+{
+	if (!mc || !mcsp)
+		return -EINVAL;
+
+	if (mc->mc_pdmc >= 0)
+		*mcsp = mc->mc_sparms;
+	else
+		mc_smap_parms_get_internal(params, mcsp);
+
+	return 0;
+}
diff --git a/drivers/mpool/smap.c b/drivers/mpool/smap.c
new file mode 100644
index 000000000000..a62aaa2f0113
--- /dev/null
+++ b/drivers/mpool/smap.c
@@ -0,0 +1,1031 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Space map module.
+ *
+ * Implements space maps for managing free space on drives.
+ */
+
+#include <linux/log2.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "pd.h"
+#include "sb.h"
+#include "mclass.h"
+#include "smap.h"
+#include "mpcore.h"
+
+static struct kmem_cache  *smap_zone_cache __read_mostly;
+
+static int smap_drive_alloc(struct mpool_descriptor *mp, struct mc_smap_parms *mcsp, u16 pdh);
+static int smap_drive_sballoc(struct mpool_descriptor *mp, u16 pdh);
+
+/*
+ * smap API functions
+ */
+
+static struct smap_zone *smap_zone_find(struct rb_root *root, u64 key)
+{
+	struct rb_node *node = root->rb_node;
+	struct smap_zone *elem;
+
+	while (node) {
+		elem = rb_entry(node, typeof(*elem), smz_node);
+
+		if (key < elem->smz_key)
+			node = node->rb_left;
+		else if (key > elem->smz_key)
+			node = node->rb_right;
+		else
+			return elem;
+	}
+
+	return NULL;
+}
+
+static int smap_zone_insert(struct rb_root *root, struct smap_zone *item)
+{
+	struct rb_node **pos = &root->rb_node, *parent = NULL;
+	struct smap_zone *this;
+
+	/* Figure out where to put new node */
+	while (*pos) {
+		this = rb_entry(*pos, typeof(*this), smz_node);
+		parent = *pos;
+
+		if (item->smz_key < this->smz_key)
+			pos = &(*pos)->rb_left;
+		else if (item->smz_key > this->smz_key)
+			pos = &(*pos)->rb_right;
+		else
+			return false;
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&item->smz_node, parent, pos);
+	rb_insert_color(&item->smz_node, root);
+
+	return true;
+}
+
+int smap_mpool_init(struct mpool_descriptor *mp)
+{
+	struct mpool_dev_info *pd = NULL;
+	struct media_class *mc;
+	u64 pdh = 0;
+	int rc = 0;
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+		struct mc_smap_parms   mcsp;
+
+		pd = &mp->pds_pdv[pdh];
+		mc = &mp->pds_mc[pd->pdi_mclass];
+		rc = mc_smap_parms_get(&mp->pds_mc[mc->mc_parms.mcp_classp],
+				       &mp->pds_params, &mcsp);
+		if (rc)
+			break;
+
+		rc = smap_drive_init(mp, &mcsp, pdh);
+		if (rc) {
+			mp_pr_err("smap(%s, %s): drive init failed",
+				  rc, mp->pds_name, pd->pdi_name);
+			break;
+		}
+	}
+
+	if (rc)
+		smap_mpool_free(mp);
+
+	return rc;
+}
+
+void smap_mpool_free(struct mpool_descriptor *mp)
+{
+	u64 pdh = 0;
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++)
+		smap_drive_free(mp, pdh);
+}
+
+void smap_mpool_usage(struct mpool_descriptor *mp, u8 mclass, struct mpool_usage *usage)
+{
+	if (mclass == MP_MED_ALL) {
+		u32 i;
+
+		for (i = 0; i < MP_MED_NUMBER; i++)
+			smap_mclass_usage(mp, i, usage);
+	} else {
+		smap_mclass_usage(mp, mclass, usage);
+	}
+}
+
+int smap_drive_spares(struct mpool_descriptor *mp, enum mp_media_classp mclassp, u8 spzone)
+{
+	struct mpool_dev_info *pd = NULL;
+	struct media_class *mc;
+	int rc;
+	u8 i;
+
+	if (!mclass_isvalid(mclassp) || spzone > 100) {
+		rc = -EINVAL;
+		mp_pr_err("smap mpool %s: smap drive spares failed mclassp %d spzone %u",
+			  rc, mp->pds_name, mclassp, spzone);
+		return rc;
+	}
+
+	/* Loop on all classes matching mclassp. */
+	for (i = 0; i < MP_MED_NUMBER; i++) {
+		mc = &mp->pds_mc[i];
+		if (mc->mc_parms.mcp_classp != mclassp || mc->mc_pdmc < 0)
+			continue;
+
+		pd = &mp->pds_pdv[mc->mc_pdmc];
+
+		spin_lock(&pd->pdi_ds.sda_dalock);
+		/* Adjust utgt but not uact; possible for uact > utgt due to spzone change. */
+		pd->pdi_ds.sda_utgt = (pd->pdi_ds.sda_zoneeff * (100 - spzone)) / 100;
+		/* Adjust stgt and sact maintaining invariant that sact <= stgt */
+		pd->pdi_ds.sda_stgt = pd->pdi_ds.sda_zoneeff - pd->pdi_ds.sda_utgt;
+		if (pd->pdi_ds.sda_sact > pd->pdi_ds.sda_stgt) {
+			pd->pdi_ds.sda_uact += (pd->pdi_ds.sda_sact - pd->pdi_ds.sda_stgt);
+			pd->pdi_ds.sda_sact = pd->pdi_ds.sda_stgt;
+		}
+		spin_unlock(&pd->pdi_ds.sda_dalock);
+
+	}
+	return 0;
+}
+
+/*
+ * Compute zone stats for drive pd per comments in smap_dev_alloc.
+ */
+static void smap_calc_znstats(struct mpool_dev_info *pd, struct smap_dev_znstats *zones)
+{
+	zones->sdv_total = pd->pdi_parm.dpr_zonetot;
+	zones->sdv_avail = pd->pdi_ds.sda_zoneeff;
+	zones->sdv_usable = pd->pdi_ds.sda_utgt;
+
+	if (pd->pdi_ds.sda_utgt > pd->pdi_ds.sda_uact)
+		zones->sdv_fusable = pd->pdi_ds.sda_utgt - pd->pdi_ds.sda_uact;
+	else
+		zones->sdv_fusable = 0;
+
+	zones->sdv_spare = pd->pdi_ds.sda_stgt;
+	zones->sdv_fspare = pd->pdi_ds.sda_stgt - pd->pdi_ds.sda_sact;
+	zones->sdv_used = pd->pdi_ds.sda_uact;
+}
+
+int smap_drive_usage(struct mpool_descriptor *mp, u16 pdh, struct mpool_devprops *dprop)
+{
+	struct mpool_dev_info *pd = &mp->pds_pdv[pdh];
+	struct smap_dev_znstats zones;
+	u32 zonepg = 0;
+
+	zonepg = pd->pdi_parm.dpr_zonepg;
+
+	spin_lock(&pd->pdi_ds.sda_dalock);
+	smap_calc_znstats(pd, &zones);
+	spin_unlock(&pd->pdi_ds.sda_dalock);
+
+	dprop->pdp_total = (zones.sdv_total * zonepg) << PAGE_SHIFT;
+	dprop->pdp_avail = (zones.sdv_avail * zonepg) << PAGE_SHIFT;
+	dprop->pdp_spare = (zones.sdv_spare * zonepg) << PAGE_SHIFT;
+	dprop->pdp_fspare = (zones.sdv_fspare * zonepg) << PAGE_SHIFT;
+	dprop->pdp_usable = (zones.sdv_usable * zonepg) << PAGE_SHIFT;
+	dprop->pdp_fusable = (zones.sdv_fusable * zonepg) << PAGE_SHIFT;
+	dprop->pdp_used = (zones.sdv_used * zonepg) << PAGE_SHIFT;
+
+	return 0;
+}
+
+int smap_drive_init(struct mpool_descriptor *mp, struct mc_smap_parms *mcsp, u16 pdh)
+{
+	struct mpool_dev_info *pd __maybe_unused;
+	int rc;
+
+	pd = &mp->pds_pdv[pdh];
+
+	if ((mcsp->mcsp_spzone > 100) || !(mcsp->mcsp_rgnc > 0)) {
+		rc = -EINVAL;
+		mp_pr_err("smap(%s, %s): drive init failed, spzone %u rcnt %lu", rc, mp->pds_name,
+			  pd->pdi_name, mcsp->mcsp_spzone, (ulong)mcsp->mcsp_rgnc);
+		return rc;
+	}
+
+	rc = smap_drive_alloc(mp, mcsp, pdh);
+	if (!rc) {
+		rc = smap_drive_sballoc(mp, pdh);
+		if (rc)
+			mp_pr_err("smap(%s, %s): sb alloc failed", rc, mp->pds_name, pd->pdi_name);
+	} else {
+		mp_pr_err("smap(%s, %s): drive alloc failed", rc, mp->pds_name, pd->pdi_name);
+	}
+
+	if (rc)
+		smap_drive_free(mp, pdh);
+
+	return rc;
+}
+
+void smap_drive_free(struct mpool_descriptor *mp, u16 pdh)
+{
+	struct mpool_dev_info *pd = &mp->pds_pdv[pdh];
+	u8 rgn = 0;
+
+	if (pd->pdi_rmbktv) {
+		struct media_class     *mc;
+		struct mc_smap_parms    mcsp;
+
+		mc = &mp->pds_mc[pd->pdi_mclass];
+		(void)mc_smap_parms_get(&mp->pds_mc[mc->mc_parms.mcp_classp],
+					&mp->pds_params, &mcsp);
+
+		for (rgn = 0; rgn < mcsp.mcsp_rgnc; rgn++) {
+			struct smap_zone   *zone, *tmp;
+			struct rb_root     *root;
+
+			root = &pd->pdi_rmbktv[rgn].pdi_rmroot;
+
+			rbtree_postorder_for_each_entry_safe(zone, tmp, root, smz_node) {
+				kmem_cache_free(smap_zone_cache, zone);
+			}
+		}
+
+		kfree(pd->pdi_rmbktv);
+		pd->pdi_rmbktv = NULL;
+	}
+
+	pd->pdi_ds.sda_rgnsz = 0;
+	pd->pdi_ds.sda_rgnladdr = 0;
+	pd->pdi_ds.sda_rgnalloc = 0;
+	pd->pdi_ds.sda_zoneeff = 0;
+	pd->pdi_ds.sda_utgt = 0;
+	pd->pdi_ds.sda_uact = 0;
+}
+
+static bool smap_alloccheck(struct mpool_dev_info *pd, u64 zonecnt, enum smap_space_type sapolicy)
+{
+	struct smap_dev_alloc *ds;
+	bool alloced = false;
+	u64 zoneextra;
+
+	ds = &pd->pdi_ds;
+
+	spin_lock(&ds->sda_dalock);
+
+	switch (sapolicy) {
+
+	case SMAP_SPC_USABLE_ONLY:
+		if ((ds->sda_uact + zonecnt) > ds->sda_utgt)
+			break;
+
+		ds->sda_uact = ds->sda_uact + zonecnt;
+		alloced = true;
+		break;
+
+	case SMAP_SPC_SPARE_ONLY:
+		if ((ds->sda_sact + zonecnt) > ds->sda_stgt)
+			break;
+
+		ds->sda_sact = ds->sda_sact + zonecnt;
+		alloced = true;
+		break;
+
+	case SMAP_SPC_USABLE_2_SPARE:
+		if ((ds->sda_uact + ds->sda_sact + zonecnt) > ds->sda_zoneeff)
+			break;
+
+		if ((ds->sda_uact + zonecnt) <= ds->sda_utgt) {
+			ds->sda_uact = ds->sda_uact + zonecnt;
+		} else {
+			zoneextra = (ds->sda_uact + zonecnt) - ds->sda_utgt;
+			ds->sda_uact = ds->sda_utgt;
+			ds->sda_sact = ds->sda_sact + zoneextra;
+		}
+		alloced = true;
+		break;
+
+	case SMAP_SPC_SPARE_2_USABLE:
+		if ((ds->sda_sact + ds->sda_uact + zonecnt) > ds->sda_zoneeff)
+			break;
+
+		if ((ds->sda_sact + zonecnt) <= ds->sda_stgt) {
+			ds->sda_sact = ds->sda_sact + zonecnt;
+		} else {
+			zoneextra = (ds->sda_sact + zonecnt) - ds->sda_stgt;
+			ds->sda_sact = ds->sda_stgt;
+			ds->sda_uact = ds->sda_uact + zoneextra;
+		}
+		alloced = true;
+		break;
+
+	default:
+		break;
+	}
+
+	spin_unlock(&ds->sda_dalock);
+
+	return alloced;
+}
+
+int smap_alloc(struct mpool_descriptor *mp, u16 pdh, u64 zonecnt,
+	       enum smap_space_type sapolicy, u64 *zoneaddr, u64 align)
+{
+	struct mc_smap_parms mcsp;
+	struct mpool_dev_info *pd;
+	struct smap_dev_alloc *ds;
+	struct smap_zone *elem = NULL;
+	struct rb_root *rmap = NULL;
+	struct mutex *rmlock = NULL;
+	struct media_class *mc;
+	u64 fsoff = 0, fslen = 0, ualen = 0;
+	u8 rgn = 0, rgnc;
+	s8 rgnleft;
+	bool res;
+	int rc;
+
+	*zoneaddr = 0;
+	pd = &mp->pds_pdv[pdh];
+
+	if (!zonecnt || !saptype_valid(sapolicy))
+		return -EINVAL;
+
+	ASSERT(is_power_of_2(align));
+
+	ds = &pd->pdi_ds;
+	mc = &mp->pds_mc[pd->pdi_mclass];
+	rc = mc_smap_parms_get(&mp->pds_mc[mc->mc_parms.mcp_classp], &mp->pds_params, &mcsp);
+	if (rc)
+		return rc;
+
+	rgnc = mcsp.mcsp_rgnc;
+
+	/*
+	 * We do not update the last-allocated rgn beyond this point, as
+	 * doing so would incur a search penalty when all regions except
+	 * one are highly fragmented, i.e., the last-allocated rgn would
+	 * never change in that case.
+	 */
+	spin_lock(&ds->sda_dalock);
+	ds->sda_rgnalloc = (ds->sda_rgnalloc + 1) % rgnc;
+	rgn = ds->sda_rgnalloc;
+	spin_unlock(&ds->sda_dalock);
+
+	rgnleft = rgnc;
+
+	/* Search per-rgn space maps for contiguous region. */
+	while (rgnleft--) {
+		struct rb_node *node;
+
+		rmlock = &pd->pdi_rmbktv[rgn].pdi_rmlock;
+		rmap = &pd->pdi_rmbktv[rgn].pdi_rmroot;
+
+		mutex_lock(rmlock);
+
+		for (node = rb_first(rmap); node; node = rb_next(node)) {
+			elem  = rb_entry(node, struct smap_zone, smz_node);
+			fsoff = elem->smz_key;
+			fslen = elem->smz_value;
+
+			if (zonecnt > fslen)
+				continue;
+
+			if (IS_ALIGNED(fsoff, align)) {
+				ualen = 0;
+				break;
+			}
+
+			ualen = ALIGN(fsoff, align) - fsoff;
+			if (ualen + zonecnt > fslen)
+				continue;
+
+			break;
+		}
+
+		if (node)
+			break;
+
+		mutex_unlock(rmlock);
+
+		rgn = (rgn + 1) % rgnc;
+	}
+
+	if (rgnleft < 0)
+		return -ENOSPC;
+
+	/* Alloc from this free space if permitted. First fit. */
+	res = smap_alloccheck(pd, zonecnt, sapolicy);
+	if (!res) {
+		mutex_unlock(rmlock);
+		return -ENOSPC;
+	}
+
+	fsoff = fsoff + ualen;
+	fslen = fslen - ualen;
+
+	*zoneaddr = fsoff;
+	rb_erase(&elem->smz_node, rmap);
+
+	if (zonecnt < fslen) {
+		/* Re-use elem */
+		elem->smz_key   = fsoff + zonecnt;
+		elem->smz_value = fslen - zonecnt;
+		smap_zone_insert(rmap, elem);
+		elem = NULL;
+	}
+
+	if (ualen) {
+		if (!elem) {
+			elem = kmem_cache_alloc(smap_zone_cache, GFP_ATOMIC);
+			if (!elem) {
+				mutex_unlock(rmlock);
+				return -ENOMEM;
+			}
+		}
+
+		elem->smz_key   = fsoff - ualen;
+		elem->smz_value = ualen;
+		smap_zone_insert(rmap, elem);
+		elem = NULL;
+	}
+
+	mutex_unlock(rmlock);
+
+	if (elem)
+		kmem_cache_free(smap_zone_cache, elem);
+
+	return 0;
+}
+
+/*
+ * smap internal functions
+ */
+
+/*
+ * Initialize an empty space map for drive pdh, reserving spzone percent of its zones as spare.
+ * Returns: 0 if successful, -errno otherwise...
+ */
+static int smap_drive_alloc(struct mpool_descriptor *mp, struct mc_smap_parms *mcsp, u16 pdh)
+{
+	struct mpool_dev_info *pd = &mp->pds_pdv[pdh];
+	struct smap_zone *urb_elem = NULL;
+	struct smap_zone *found_ue = NULL;
+	u32 rgnsz = 0;
+	u8 rgn = 0;
+	u8 rgn2 = 0;
+	u8 rgnc;
+	int rc;
+
+	rgnc  = mcsp->mcsp_rgnc;
+	rgnsz = pd->pdi_parm.dpr_zonetot / rgnc;
+	if (!rgnsz) {
+		rc = -EINVAL;
+		mp_pr_err("smap(%s, %s): drive alloc failed, invalid rgn size",
+			  rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	/* Allocate and init per channel space maps and associated locks */
+	pd->pdi_rmbktv = kcalloc(rgnc, sizeof(*pd->pdi_rmbktv), GFP_KERNEL);
+	if (!pd->pdi_rmbktv) {
+		rc = -ENOMEM;
+		mp_pr_err("smap(%s, %s): rmbktv alloc failed", rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	/* Define all space on all channels as being free (drive empty) */
+	for (rgn = 0; rgn < rgnc; rgn++) {
+		mutex_init(&pd->pdi_rmbktv[rgn].pdi_rmlock);
+
+		urb_elem = kmem_cache_alloc(smap_zone_cache, GFP_KERNEL);
+		if (!urb_elem) {
+			struct rb_root *rmroot;
+
+			for (rgn2 = 0; rgn2 < rgn; rgn2++) {
+				rmroot = &pd->pdi_rmbktv[rgn2].pdi_rmroot;
+
+				found_ue = smap_zone_find(rmroot, 0);
+				if (found_ue) {
+					rb_erase(&found_ue->smz_node, rmroot);
+					kmem_cache_free(smap_zone_cache, found_ue);
+				}
+			}
+
+			kfree(pd->pdi_rmbktv);
+			pd->pdi_rmbktv = NULL;
+
+			rc = -ENOMEM;
+			mp_pr_err("smap(%s, %s): rb node alloc failed, rgn %u",
+				  rc, mp->pds_name, pd->pdi_name, rgn);
+			return rc;
+		}
+
+		urb_elem->smz_key = rgn * rgnsz;
+		if (rgn < rgnc - 1)
+			urb_elem->smz_value = rgnsz;
+		else
+			urb_elem->smz_value = pd->pdi_parm.dpr_zonetot - (rgn * rgnsz);
+		smap_zone_insert(&pd->pdi_rmbktv[rgn].pdi_rmroot, urb_elem);
+	}
+
+	spin_lock_init(&pd->pdi_ds.sda_dalock);
+	pd->pdi_ds.sda_rgnalloc = 0;
+	pd->pdi_ds.sda_rgnsz = rgnsz;
+	pd->pdi_ds.sda_rgnladdr = (rgnc - 1) * rgnsz;
+	pd->pdi_ds.sda_zoneeff = pd->pdi_parm.dpr_zonetot;
+	pd->pdi_ds.sda_utgt = (pd->pdi_ds.sda_zoneeff * (100 - mcsp->mcsp_spzone)) / 100;
+	pd->pdi_ds.sda_uact = 0;
+	pd->pdi_ds.sda_stgt = pd->pdi_ds.sda_zoneeff - pd->pdi_ds.sda_utgt;
+	pd->pdi_ds.sda_sact = 0;
+
+	return 0;
+}
+
+/*
+ * Add entry to space map covering superblocks on drive pdh.
+ * Returns: 0 if successful, -errno otherwise...
+ */
+static int smap_drive_sballoc(struct mpool_descriptor *mp, u16 pdh)
+{
+	struct mpool_dev_info *pd = &mp->pds_pdv[pdh];
+	int rc;
+	u32 cnt;
+
+	cnt = sb_zones_for_sbs(&(pd->pdi_prop));
+	if (cnt < 1) {
+		rc = -ESPIPE;
+		mp_pr_err("smap(%s, %s): identifying sb failed", rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	rc = smap_insert(mp, pdh, 0, cnt);
+	if (rc)
+		mp_pr_err("smap(%s, %s): insert failed, cnt %u",
+			  rc, mp->pds_name, pd->pdi_name, cnt);
+
+	return rc;
+}
+
+void smap_mclass_usage(struct mpool_descriptor *mp, u8 mclass, struct mpool_usage *usage)
+{
+	struct smap_dev_znstats zones;
+	struct mpool_dev_info *pd;
+	struct media_class *mc;
+	u32 zonepg = 0;
+
+	mc = &mp->pds_mc[mclass];
+	if (mc->mc_pdmc < 0)
+		return;
+
+	pd = &mp->pds_pdv[mc->mc_pdmc];
+	zonepg = pd->pdi_zonepg;
+
+	spin_lock(&pd->pdi_ds.sda_dalock);
+	smap_calc_znstats(pd, &zones);
+	spin_unlock(&pd->pdi_ds.sda_dalock);
+
+	usage->mpu_total  += ((zones.sdv_total * zonepg) << PAGE_SHIFT);
+	usage->mpu_usable += ((zones.sdv_usable * zonepg) << PAGE_SHIFT);
+	usage->mpu_used   += ((zones.sdv_used * zonepg) << PAGE_SHIFT);
+	usage->mpu_spare  += ((zones.sdv_spare * zonepg) << PAGE_SHIFT);
+	usage->mpu_fspare += ((zones.sdv_fspare * zonepg) << PAGE_SHIFT);
+	usage->mpu_fusable += ((zones.sdv_fusable * zonepg) << PAGE_SHIFT);
+}
+
+static u32 smap_addr2rgn(struct mpool_descriptor *mp, struct mpool_dev_info *pd, u64 zoneaddr)
+{
+	struct mc_smap_parms   mcsp;
+
+	mc_smap_parms_get(&mp->pds_mc[pd->pdi_mclass], &mp->pds_params, &mcsp);
+
+	if (zoneaddr >= pd->pdi_ds.sda_rgnladdr)
+		return mcsp.mcsp_rgnc - 1;
+
+	return zoneaddr / pd->pdi_ds.sda_rgnsz;
+}
+
+/*
+ * Add entry to space map in rgn starting at zoneaddr
+ * and continuing for zonecnt blocks.
+ *
+ *   Returns: 0 if successful, -errno otherwise...
+ */
+static int smap_insert_byrgn(struct mpool_dev_info *pd, u32 rgn, u64 zoneaddr, u16 zonecnt)
+{
+	const char *msg __maybe_unused;
+	struct smap_zone *elem = NULL;
+	struct rb_root *rmap;
+	struct rb_node *node;
+	u64 fsoff, fslen;
+	int rc;
+
+	fsoff = fslen = 0;
+	rc = 0;
+	msg = NULL;
+
+	mutex_lock(&pd->pdi_rmbktv[rgn].pdi_rmlock);
+	rmap = &pd->pdi_rmbktv[rgn].pdi_rmroot;
+
+	node = rmap->rb_node;
+	if (!node) {
+		msg = "invalid rgn map";
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	/* Use binary search to find the insertion point in the tree. */
+	while (node) {
+		elem = rb_entry(node, struct smap_zone, smz_node);
+
+		if (zoneaddr < elem->smz_key)
+			node = node->rb_left;
+		else if (zoneaddr > elem->smz_key + elem->smz_value)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	fsoff = elem->smz_key;
+	fslen = elem->smz_value;
+
+	/* Bail out if we're past zoneaddr in space map w/o finding the required chunk. */
+	if (zoneaddr < fsoff) {
+		elem = NULL;
+		msg = "requested range not free";
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	/* The allocation must fit entirely within this chunk or it fails. */
+	if (zoneaddr + zonecnt > fsoff + fslen) {
+		elem = NULL;
+		msg = "requested range does not fit";
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	rb_erase(&elem->smz_node, rmap);
+
+	if (zoneaddr > fsoff) {
+		elem->smz_key = fsoff;
+		elem->smz_value = zoneaddr - fsoff;
+		smap_zone_insert(rmap, elem);
+		elem = NULL;
+	}
+	if (zoneaddr + zonecnt < fsoff + fslen) {
+		if (!elem)
+			elem = kmem_cache_alloc(smap_zone_cache, GFP_KERNEL);
+		if (!elem) {
+			msg = "chunk alloc failed";
+			rc = -ENOMEM;
+			goto errout;
+		}
+
+		elem->smz_key = zoneaddr + zonecnt;
+		elem->smz_value = (fsoff + fslen) - (zoneaddr + zonecnt);
+		smap_zone_insert(rmap, elem);
+		elem = NULL;
+	}
+
+	/* Insert consumes usable only; possible for uact > utgt.*/
+	spin_lock(&pd->pdi_ds.sda_dalock);
+	pd->pdi_ds.sda_uact = pd->pdi_ds.sda_uact + zonecnt;
+	spin_unlock(&pd->pdi_ds.sda_dalock);
+
+errout:
+	mutex_unlock(&pd->pdi_rmbktv[rgn].pdi_rmlock);
+
+	if (elem != NULL) {
+		/* Was an exact match */
+		ASSERT((zoneaddr == fsoff) && (zonecnt == fslen));
+		kmem_cache_free(smap_zone_cache, elem);
+	}
+
+	if (rc)
+		mp_pr_err("smap pd %s: %s, zoneaddr %lu zonecnt %u fsoff %lu fslen %lu",
+			  rc, pd->pdi_name, msg ? msg : "(no detail)",
+			  (ulong)zoneaddr, zonecnt, (ulong)fsoff, (ulong)fslen);
+
+	return rc;
+}
+
+int smap_insert(struct mpool_descriptor *mp, u16 pdh, u64 zoneaddr, u32 zonecnt)
+{
+	struct mpool_dev_info *pd = &mp->pds_pdv[pdh];
+	u32 rstart = 0, rend = 0;
+	u64 raddr = 0, rcnt = 0;
+	u64 zoneadded = 0;
+	int rgn = 0;
+	int rc = 0;
+
+	if (zoneaddr >= pd->pdi_parm.dpr_zonetot ||
+	    (zoneaddr + zonecnt) > pd->pdi_parm.dpr_zonetot) {
+		rc = -EINVAL;
+		mp_pr_err("smap(%s, %s): insert failed, zoneaddr %lu zonecnt %u zonetot %u",
+			  rc, mp->pds_name, pd->pdi_name, (ulong)zoneaddr,
+			  zonecnt, pd->pdi_parm.dpr_zonetot);
+		return rc;
+	}
+
+	/*
+	 * smap_alloc() never crosses regions.  However, a previous
+	 * instantiation of this mpool might have used a different rgn
+	 * count, so we must handle inserts that cross regions.
+	 */
+	rstart = smap_addr2rgn(mp, pd, zoneaddr);
+	rend = smap_addr2rgn(mp, pd, zoneaddr + zonecnt - 1);
+	zoneadded = 0;
+
+	for (rgn = rstart; rgn < rend + 1; rgn++) {
+		/* Compute zone address and count for this rgn */
+		if (rgn == rstart)
+			raddr = zoneaddr;
+		else
+			raddr = (u64)rgn * pd->pdi_ds.sda_rgnsz;
+
+		if (rgn < rend)
+			rcnt = ((u64)(rgn + 1) * pd->pdi_ds.sda_rgnsz) - raddr;
+		else
+			rcnt = zonecnt - zoneadded;
+
+		rc = smap_insert_byrgn(pd, rgn, raddr, rcnt);
+		if (rc) {
+			mp_pr_err("smap(%s, %s): insert byrgn failed, rgn %d raddr %lu rcnt %lu",
+				  rc, mp->pds_name, pd->pdi_name, rgn, (ulong)raddr, (ulong)rcnt);
+			break;
+		}
+		zoneadded = zoneadded + rcnt;
+	}
+
+	return rc;
+}
+
+/**
+ * smap_free_byrgn() - free the specified range of zones
+ * @pd:       physical device object
+ * @rgn:      allocation rgn specifier
+ * @zoneaddr: offset into the space map
+ * @zonecnt:  length of range to be freed
+ *
+ * Free the given range of zones (i.e., [%zoneaddr, %zoneaddr + %zonecnt))
+ * back to the indicated space map.  Always coalesces ranges in the space
+ * map that abut the range to be freed so as to minimize fragmentation.
+ *
+ * Return: 0 if successful, -errno otherwise...
+ */
+static int smap_free_byrgn(struct mpool_dev_info *pd, u32 rgn, u64 zoneaddr, u32 zonecnt)
+{
+	const char *msg __maybe_unused;
+	struct smap_zone *left, *right;
+	struct smap_zone *new, *old;
+	struct rb_root *rmap;
+	struct rb_node *node;
+	u32 orig_zonecnt = zonecnt;
+	int rc = 0;
+
+	new = old = left = right = NULL;
+	msg = NULL;
+
+	mutex_lock(&pd->pdi_rmbktv[rgn].pdi_rmlock);
+	rmap = &pd->pdi_rmbktv[rgn].pdi_rmroot;
+
+	node = rmap->rb_node;
+
+	/* Use binary search to find chunks to the left and/or right of the range being freed. */
+	while (node) {
+		struct smap_zone *this;
+
+		this = rb_entry(node, struct smap_zone, smz_node);
+
+		if (zoneaddr + zonecnt <= this->smz_key) {
+			right = this;
+			node = node->rb_left;
+		} else if (zoneaddr >= this->smz_key + this->smz_value) {
+			left = this;
+			node = node->rb_right;
+		} else {
+			msg = "chunk overlapping";
+			rc = -EINVAL;
+			goto unlock;
+		}
+	}
+
+	/* If the request abuts the chunk to the right then coalesce them. */
+	if (right) {
+		if (zoneaddr + zonecnt == right->smz_key) {
+			zonecnt += right->smz_value;
+			rb_erase(&right->smz_node, rmap);
+
+			new = right;  /* re-use right node */
+		}
+	}
+
+	/* If the request abuts the chunk to the left then coalesce them. */
+	if (left) {
+		if (left->smz_key + left->smz_value == zoneaddr) {
+			zoneaddr = left->smz_key;
+			zonecnt += left->smz_value;
+			rb_erase(&left->smz_node, rmap);
+
+			old = new;  /* free new/left outside the critsec */
+			new = left; /* re-use left node */
+		}
+	}
+
+	/*
+	 * If the request did not abut either the current or the previous
+	 * chunk (i.e., new == NULL) then we must create a new chunk node
+	 * and insert it into the smap.  Otherwise, we'll re-use one of
+	 * the abutting chunk nodes (i.e., left or right).
+	 *
+	 * Note: If we have to call kmalloc and it fails (unlikely) then
+	 * this chunk will be lost only for the current session.  It will
+	 * be recovered once the mpool is closed and re-opened.
+	 */
+	if (!new) {
+		new = kmem_cache_alloc(smap_zone_cache, GFP_ATOMIC);
+		if (!new) {
+			msg = "chunk alloc failed";
+			rc = -ENOMEM;
+			goto unlock;
+		}
+	}
+
+	new->smz_key = zoneaddr;
+	new->smz_value = zonecnt;
+
+	if (!smap_zone_insert(rmap, new)) {
+		kmem_cache_free(smap_zone_cache, new);
+		msg = "chunk insert failed";
+		rc = -ENOTRECOVERABLE;
+		goto unlock;
+	}
+
+	/* Freed space goes to spare first then usable. */
+	zonecnt = orig_zonecnt;
+
+	spin_lock(&pd->pdi_ds.sda_dalock);
+	if (pd->pdi_ds.sda_sact > 0) {
+		if (pd->pdi_ds.sda_sact > zonecnt) {
+			pd->pdi_ds.sda_sact -= zonecnt;
+			zonecnt = 0;
+		} else {
+			zonecnt -= pd->pdi_ds.sda_sact;
+			pd->pdi_ds.sda_sact = 0;
+		}
+	}
+
+	pd->pdi_ds.sda_uact -= zonecnt;
+	spin_unlock(&pd->pdi_ds.sda_dalock);
+
+unlock:
+	mutex_unlock(&pd->pdi_rmbktv[rgn].pdi_rmlock);
+
+	if (old)
+		kmem_cache_free(smap_zone_cache, old);
+
+	if (rc)
+		mp_pr_err("smap pd %s: %s, free byrgn failed, rgn %u zoneaddr %lu zonecnt %u",
+			  rc, pd->pdi_name, msg ? msg : "(no detail)",
+			  rgn, (ulong)zoneaddr, zonecnt);
+
+	return rc;
+}
+
+int smap_free(struct mpool_descriptor *mp, u16 pdh, u64 zoneaddr, u16 zonecnt)
+{
+	struct mpool_dev_info *pd = NULL;
+	u32 rstart = 0, rend = 0;
+	u64 raddr = 0, rcnt = 0;
+	u64 zonefreed = 0;
+	u32 rgn = 0;
+	int rc = 0;
+
+	pd = &mp->pds_pdv[pdh];
+
+	if (zoneaddr >= pd->pdi_parm.dpr_zonetot || zoneaddr + zonecnt > pd->pdi_parm.dpr_zonetot) {
+		rc = -EINVAL;
+		mp_pr_err("smap(%s, %s): free failed, zoneaddr %lu zonecnt %u zonetot: %u",
+			  rc, mp->pds_name, pd->pdi_name, (ulong)zoneaddr,
+			  zonecnt, pd->pdi_parm.dpr_zonetot);
+		return rc;
+	}
+
+	if (!zonecnt)
+		return 0; /* Nothing to free */
+
+	/*
+	 * smap_alloc() never crosses regions.  However, a previous
+	 * instantiation of this mpool might have used a different rgn
+	 * count, so we must handle frees that cross regions.
+	 */
+
+	rstart = smap_addr2rgn(mp, pd, zoneaddr);
+	rend = smap_addr2rgn(mp, pd, zoneaddr + zonecnt - 1);
+
+	for (rgn = rstart; rgn < rend + 1; rgn++) {
+		/* Compute zone address and count for this rgn */
+		if (rgn == rstart)
+			raddr = zoneaddr;
+		else
+			raddr = (u64)rgn * pd->pdi_ds.sda_rgnsz;
+
+		if (rgn < rend)
+			rcnt = ((u64)(rgn + 1) * pd->pdi_ds.sda_rgnsz) - raddr;
+		else
+			rcnt = zonecnt - zonefreed;
+
+		rc = smap_free_byrgn(pd, rgn, raddr, rcnt);
+		if (rc) {
+			mp_pr_err("smap(%s, %s): free byrgn failed, rgn %d raddr %lu, rcnt %lu",
+				  rc, mp->pds_name, pd->pdi_name, rgn, (ulong)raddr, (ulong)rcnt);
+			break;
+		}
+		zonefreed = zonefreed + rcnt;
+	}
+
+	return rc;
+}
+
+void smap_wait_usage_done(struct mpool_descriptor *mp)
+{
+	struct smap_usage_work *usagew = &mp->pds_smap_usage_work;
+
+	cancel_delayed_work_sync(&usagew->smapu_wstruct);
+}
+
+#define SMAP_FREEPCT_DELTA      5
+#define SMAP_FREEPCT_LOG_THLD   50
+
+void smap_log_mpool_usage(struct work_struct *ws)
+{
+	struct smap_usage_work *smapu;
+	struct mpool_descriptor *mp;
+	struct mpool_usage usage;
+	int last, cur, delta;
+
+	smapu = container_of(ws, struct smap_usage_work, smapu_wstruct.work);
+	mp = smapu->smapu_mp;
+
+	/* Get the current mpool space usage stats */
+	smap_mpool_usage(mp, MP_MED_ALL, &usage);
+
+	if (usage.mpu_usable == 0) {
+		mp_pr_err("smap mpool %s: zero usable space", -EINVAL, mp->pds_name);
+		return;
+	}
+	/*
+	 * Calculate the change in the ratio of free usable space to
+	 * total usable space since the last time a message was logged.
+	 */
+	last = smapu->smapu_freepct;
+	cur = usage.mpu_fusable * 100 / usage.mpu_usable;
+	delta = cur - last;
+
+	/*
+	 * Log a message if delta >= 5% && free usable space % < 50%
+	 */
+	if ((abs(delta) >= SMAP_FREEPCT_DELTA) && (cur < SMAP_FREEPCT_LOG_THLD)) {
+		smapu->smapu_freepct = cur;
+		if (last == 0)
+			mp_pr_info("smap mpool %s, free space %d%%",
+				   mp->pds_name, smapu->smapu_freepct);
+		else
+			mp_pr_info("smap mpool %s, free space %s from %d%% to %d%%",
+				   mp->pds_name, (delta > 0) ? "increases" : "decreases",
+				   last, smapu->smapu_freepct);
+	}
+
+	/* Schedule the next run of smap_log_mpool_usage() */
+	queue_delayed_work(mp->pds_workq, &smapu->smapu_wstruct,
+			   msecs_to_jiffies(mp->pds_params.mp_mpusageperiod));
+}
+
+int smap_init(void)
+{
+	int rc = 0;
+
+	smap_zone_cache = kmem_cache_create("mpool_smap_zone", sizeof(struct smap_zone),
+					    0, SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!smap_zone_cache) {
+		rc = -ENOMEM;
+		mp_pr_err("kmem_cache_create(smap_zone, %zu) failed",
+			  rc, sizeof(struct smap_zone));
+	}
+
+	return rc;
+}
+
+void smap_exit(void)
+{
+	kmem_cache_destroy(smap_zone_cache);
+	smap_zone_cache = NULL;
+}
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH v2 06/22] mpool: add on-media pack, unpack and upgrade routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (4 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 05/22] mpool: add space map component which manages free space on mpool devices Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 07/22] mpool: add superblock management routines Nabeel M Mohamed
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds utilities to translate structures to and from their
on-media format. All mpool metadata is stored on media in
little-endian format.

The metadata records are versioned and carry a record type. This
allows the record format to change over time, new record types to
be added, and old record types to be deprecated.

All structs, enums, constants, etc. representing the on-media format
end with the suffix "_omf". The functions that serialize in-memory
structures to the on-media format are named with the prefix "omf_"
and the suffix "_pack_htole". The corresponding deserialization
functions are named with the suffix "_unpack_letoh".

The mpool metadata records are upgraded at mpool activation time.
The newer module code reads the metadata records created by an
older mpool version, converts them into the current in-memory
format and then serializes the in-memory format to the current
on-media format.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c    |    8 +
 drivers/mpool/omf.c     | 1318 +++++++++++++++++++++++++++++++++++++++
 drivers/mpool/upgrade.c |  138 ++++
 3 files changed, 1464 insertions(+)
 create mode 100644 drivers/mpool/omf.c
 create mode 100644 drivers/mpool/upgrade.c

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 031408815b48..70f907ccc28a 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -7,6 +7,7 @@
 
 #include "mpool_printk.h"
 
+#include "omf_if.h"
 #include "pd.h"
 #include "smap.h"
 
@@ -24,6 +25,7 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 static void mpool_exit_impl(void)
 {
 	smap_exit();
+	omf_exit();
 	pd_exit();
 }
 
@@ -38,6 +40,12 @@ static __init int mpool_init(void)
 		goto errout;
 	}
 
+	rc = omf_init();
+	if (rc) {
+		errmsg = "omf init failed";
+		goto errout;
+	}
+
 	rc = smap_init();
 	if (rc) {
 		errmsg = "smap init failed";
diff --git a/drivers/mpool/omf.c b/drivers/mpool/omf.c
new file mode 100644
index 000000000000..8e612d35b19a
--- /dev/null
+++ b/drivers/mpool/omf.c
@@ -0,0 +1,1318 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Pool on-drive format (omf) module.
+ *
+ * Defines:
+ * + on-drive format for mpool superblocks
+ * + on-drive formats for mlogs, mblocks, and metadata containers (mdc)
+ * + utility functions for working with these on-drive formats
+ *
+ * All mpool metadata is versioned and stored on media in little-endian format.
+ */
+
+#include <linux/slab.h>
+#include <crypto/hash.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "upgrade.h"
+#include "pmd_obj.h"
+#include "mpcore.h"
+
+static struct crypto_shash *mpool_tfm;
+
+enum unpack_only {
+	UNPACKONLY    = 0,
+	UNPACKCONVERT = 1,
+};
+
+/*
+ * Forward declarations.
+ */
+static int omf_layout_unpack_letoh_v1(void *out, const char *inbuf);
+static int omf_dparm_unpack_letoh_v1(void *out, const char *inbuf);
+static int omf_mdcrec_mcspare_unpack_letoh_v1(void *out, const char *inbuf);
+static int omf_sb_unpack_letoh_v1(void *out, const char *inbuf);
+static int omf_pmd_layout_unpack_letoh_v1(void *out, const char *inbuf);
+
+/*
+ * layout_descriptor_table: track changes in OMF and in-memory layout descriptor
+ */
+static struct upgrade_history layout_descriptor_table[] = {
+	{
+		sizeof(struct omf_layout_descriptor),
+		omf_layout_unpack_letoh_v1,
+		NULL,
+		OMF_SB_DESC_V1,
+		{ {1, 0, 0, 0} }
+	},
+};
+
+/*
+ * devparm_descriptor_table: track changes in dev parm descriptor
+ */
+static struct upgrade_history devparm_descriptor_table[] = {
+	{
+		sizeof(struct omf_devparm_descriptor),
+		omf_dparm_unpack_letoh_v1,
+		NULL,
+		OMF_SB_DESC_V1,
+		{ {1, 0, 0, 0} }
+	},
+};
+
+/*
+ * mdcrec_data_mcspare_table: track changes in spare % record.
+ */
+static struct upgrade_history mdcrec_data_mcspare_table[] = {
+	{
+		sizeof(struct omf_mdcrec_data),
+		omf_mdcrec_mcspare_unpack_letoh_v1,
+		NULL,
+		OMF_SB_DESC_UNDEF,
+		{ {1, 0, 0, 0} },
+	},
+};
+
+/*
+ * sb_descriptor_table: track changes in mpool superblock descriptor
+ */
+static struct upgrade_history sb_descriptor_table[] = {
+	{
+		sizeof(struct omf_sb_descriptor),
+		omf_sb_unpack_letoh_v1,
+		NULL,
+		OMF_SB_DESC_V1,
+		{ {1, 0, 0, 0} }
+	},
+};
+
+/*
+ * mdcrec_data_ocreate_table: track changes in OCREATE mdc record.
+ */
+static struct upgrade_history mdcrec_data_ocreate_table[] = {
+	{
+		sizeof(struct omf_mdcrec_data),
+		omf_pmd_layout_unpack_letoh_v1,
+		NULL,
+		OMF_SB_DESC_UNDEF,
+		{ {1, 0, 0, 0} }
+	},
+};
+
+
+/*
+ * Generic routines
+ */
+
+/**
+ * omf_find_upgrade_hist() - Find the upgrade history entry for the given sb or mdc version
+ * @uhtab:   upgrade history table
+ * @tabsz:   NELEM of upgrade_table
+ * @sbver:   superblock version
+ * @mdcver:  mdc version
+ *
+ * Given a superblock version or an mpool MDC content version, find the
+ * corresponding upgrade history entry which matches the given sb or mdc
+ * version.  That is the entry with the highest version such that
+ * entry version <= the version passed in.
+ *
+ * Note that the caller of this routine can pass in either a valid
+ * superblock version or a valid mdc version.  If a valid superblock
+ * version is passed in, mdcver must be set to NULL.  If an mdc version
+ * is passed in, sbver must be set to 0.
+ *
+ * For example, suppose a structure "struct abc", which is part of the
+ * mpool superblock or MDC, is updated three times: at superblock
+ * versions 1, 3 and 5.  Each time, an entry is added to the upgrade
+ * history table for this structure.  The table looks like:
+ *
+ * struct upgrade_history abc_hist[] =
+ * {{sizeof(struct abc_v1), abc_unpack_v1, NULL, OMF_SB_DESC_V1, NULL},
+ *  {sizeof(struct abc_v2), abc_unpack_v2, NULL, OMF_SB_DESC_V3, NULL},
+ *  {sizeof(struct abc_v3), abc_unpack_v3, NULL, OMF_SB_DESC_V5, NULL}}
+ *
+ * If the caller needs the upgrade history entry that matches sb
+ * version 3 (OMF_SB_DESC_V3), this routine finds the exact match and
+ * returns &abc_hist[1].
+ *
+ * If the caller needs the entry that matches sb version 4
+ * (OMF_SB_DESC_V4), since this structure did not change in sb version
+ * 4, this routine finds the prior entry matching sb version 3 and
+ * again returns &abc_hist[1].
+ */
+static struct upgrade_history *
+omf_find_upgrade_hist(struct upgrade_history *uhtab, size_t tabsz,
+		      enum sb_descriptor_ver_omf sbver, struct omf_mdcver *mdcver)
+{
+	struct upgrade_history *cur = NULL;
+	int beg = 0, end = tabsz, mid;
+
+	while (beg < end) {
+		mid = (beg + end) / 2;
+		cur = &uhtab[mid];
+		if (mdcver) {
+			ASSERT(sbver == 0);
+
+			if (omfu_mdcver_cmp(mdcver, "==", &cur->uh_mdcver))
+				return cur;
+			else if (omfu_mdcver_cmp(mdcver, ">", &cur->uh_mdcver))
+				beg = mid + 1;
+			else
+				end = mid;
+		} else {
+			ASSERT(sbver <= OMF_SB_DESC_VER_LAST);
+
+			if (sbver == cur->uh_sbver)
+				return cur;
+			else if (sbver > cur->uh_sbver)
+				beg = mid + 1;
+			else
+				end = mid;
+		}
+	}
+
+	if (end == 0)
+		return NULL; /* not found */
+
+	return &uhtab[end - 1];
+}
+
+/**
+ * omf_upgrade_convert_only() - Convert sb/mdc from v1 to v2 (v1 <= v2)
+ * @out:   v2 in-memory metadata structure
+ * @in:    v1 in-memory metadata structure
+ * @uhtab: upgrade history table for this structure
+ * @tabsz: NELEM(uhtab)
+ * @sbv1:  superblock version converting from
+ * @sbv2:  superblock version converting to
+ * @mdcv1: mdc version converting from
+ * @mdcv2: mdc version converting to
+ *
+ * Convert a nested metadata structure in mpool superblock or
+ * MDC from v1 to v2 (v1 <= v2)
+ *
+ * Note that callers can pass in either mdc beg/end ver (mdcv1/mdcv2),
+ * or superblock beg/end versions (sbv1/sbv2).  Set both mdcv1 and
+ * mdcv2 to NULL if the caller wants to use superblock versions.
+ */
+static __maybe_unused int
+omf_upgrade_convert_only(void *out, const void *in, struct upgrade_history *uhtab,
+			 size_t tabsz, enum sb_descriptor_ver_omf sbv1,
+			 enum sb_descriptor_ver_omf sbv2,
+			 struct omf_mdcver *mdcv1, struct omf_mdcver *mdcv2)
+{
+	struct upgrade_history *v1, *v2, *cur;
+	void *new, *old;
+	size_t newsz;
+
+	v1 = omf_find_upgrade_hist(uhtab, tabsz, sbv1, mdcv1);
+	ASSERT(v1);
+
+	v2 = omf_find_upgrade_hist(uhtab, tabsz, sbv2, mdcv2);
+	ASSERT(v2);
+	ASSERT(v1 <= v2);
+
+	if (v1 == v2)
+		/* No need to do conversion */
+		return 0;
+
+	if (v2 == v1 + 1) {
+		/*
+		 * Single-step conversion; no need to allocate/free
+		 * buffers for intermediate conversion states.
+		 */
+		if (v2->uh_conv != NULL)
+			v2->uh_conv(in, out);
+		return 0;
+	}
+
+	/*
+	 * Make a local copy of the input buffer so that we never free
+	 * the caller's buffer in the loop below.
+	 */
+	old = kmalloc(v1->uh_size, GFP_KERNEL);
+	if (!old)
+		return -ENOMEM;
+	memcpy(old, in, v1->uh_size);
+
+	new = old;
+	newsz = v1->uh_size;
+
+	for (cur = v1 + 1; cur <= v2; cur++) {
+		if (!cur->uh_conv)
+			continue;
+		new = kzalloc(cur->uh_size, GFP_KERNEL);
+		if (!new) {
+			kfree(old);
+			return -ENOMEM;
+		}
+		newsz = cur->uh_size;
+		cur->uh_conv(old, new);
+		kfree(old);
+		old = new;
+	}
+
+	memcpy(out, new, newsz);
+	kfree(new);
+
+	return 0;
+}
+
+/**
+ * omf_upgrade_unpack_only() - unpack OMF meta data
+ * @out:     output buffer for in-memory structure
+ * @inbuf:   OMF structure
+ * @uhtab:   upgrade history table
+ * @tabsz:   NELEM of uhtab
+ * @sbver:   superblock version
+ * @mdcver:  mpool MDC content version
+ */
+static int omf_upgrade_unpack_only(void *out, const char *inbuf, struct upgrade_history *uhtab,
+				   size_t tabsz, enum sb_descriptor_ver_omf sbver,
+				   struct omf_mdcver *mdcver)
+{
+	struct upgrade_history *res;
+
+	res = omf_find_upgrade_hist(uhtab, tabsz, sbver, mdcver);
+
+	return res->uh_unpack(out, inbuf);
+}
+
+/**
+ * omf_unpack_letoh_and_convert() - Unpack OMF meta data and convert it to the latest version.
+ * @out:    in-memory structure
+ * @outsz:  size of in-memory structure
+ * @inbuf:  OMF structure
+ * @uhtab:  upgrade history table
+ * @tabsz:  number of elements in uhtab
+ * @sbver:  superblock version
+ * @mdcver: mdc version. if set to NULL, use sbver to find the corresponding
+ *          nested structure upgrade table
+ */
+static int omf_unpack_letoh_and_convert(void *out, size_t outsz, const char *inbuf,
+					struct upgrade_history *uhtab,
+					size_t tabsz, enum sb_descriptor_ver_omf sbver,
+					struct omf_mdcver *mdcver)
+{
+	struct upgrade_history *cur, *omf;
+	void *old, *new;
+	size_t newsz;
+	int rc;
+
+	omf = omf_find_upgrade_hist(uhtab, tabsz, sbver, mdcver);
+	ASSERT(omf);
+
+	if (omf == &uhtab[tabsz - 1]) {
+		/*
+		 * Current version is the latest version.
+		 * Don't need to do any conversion
+		 */
+		return omf->uh_unpack(out, inbuf);
+	}
+
+	old = kzalloc(omf->uh_size, GFP_KERNEL);
+	if (!old)
+		return -ENOMEM;
+
+	rc = omf->uh_unpack(old, inbuf);
+	if (rc) {
+		kfree(old);
+		return rc;
+	}
+
+	new = old;
+	newsz = omf->uh_size;
+
+	for (cur = omf + 1; cur <= &uhtab[tabsz - 1]; cur++) {
+		if (!cur->uh_conv)
+			continue;
+		new = kzalloc(cur->uh_size, GFP_KERNEL);
+		if (!new) {
+			kfree(old);
+			return -ENOMEM;
+		}
+		newsz = cur->uh_size;
+		cur->uh_conv(old, new);
+		kfree(old);
+		old = new;
+	}
+
+	ASSERT(newsz == outsz);
+
+	memcpy(out, new, newsz);
+	kfree(new);
+
+	return 0;
+}
+
+_Static_assert(MPOOL_UUID_SIZE == OMF_UUID_PACKLEN, "mpool uuid sz != omf uuid packlen");
+_Static_assert(OMF_MPOOL_NAME_LEN == MPOOL_NAMESZ_MAX, "omf mpool name len != mpool namesz max");
+
+/*
+ * devparm_descriptor
+ */
+static void omf_dparm_pack_htole(struct omf_devparm_descriptor *dp, char *outbuf)
+{
+	struct devparm_descriptor_omf *dp_omf;
+
+	dp_omf = (struct devparm_descriptor_omf *)outbuf;
+	omf_set_podp_devid(dp_omf, dp->odp_devid.uuid, MPOOL_UUID_SIZE);
+	omf_set_podp_devsz(dp_omf, dp->odp_devsz);
+	omf_set_podp_zonetot(dp_omf, dp->odp_zonetot);
+	omf_set_podp_zonepg(dp_omf, dp->odp_zonepg);
+	omf_set_podp_mclassp(dp_omf, dp->odp_mclassp);
+	/* Translate pd_devtype into devtype_omf */
+	omf_set_podp_devtype(dp_omf, dp->odp_devtype);
+	omf_set_podp_sectorsz(dp_omf, dp->odp_sectorsz);
+	omf_set_podp_features(dp_omf, dp->odp_features);
+}
+
+/**
+ * omf_dparm_unpack_letoh_v1() - unpack version 1 omf devparm descriptor
+ *                               into in-memory format
+ * @out: in-memory format
+ * @inbuf: omf format
+ */
+static int omf_dparm_unpack_letoh_v1(void *out, const char *inbuf)
+{
+	struct devparm_descriptor_omf *dp_omf;
+	struct omf_devparm_descriptor *dp;
+
+	dp_omf = (struct devparm_descriptor_omf *)inbuf;
+	dp = (struct omf_devparm_descriptor *)out;
+
+	omf_podp_devid(dp_omf, dp->odp_devid.uuid, MPOOL_UUID_SIZE);
+	dp->odp_devsz    = omf_podp_devsz(dp_omf);
+	dp->odp_zonetot  = omf_podp_zonetot(dp_omf);
+	dp->odp_zonepg   = omf_podp_zonepg(dp_omf);
+	dp->odp_mclassp  = omf_podp_mclassp(dp_omf);
+	/* Translate devtype_omf into mp_devtype */
+	dp->odp_devtype  = omf_podp_devtype(dp_omf);
+	dp->odp_sectorsz = omf_podp_sectorsz(dp_omf);
+	dp->odp_features = omf_podp_features(dp_omf);
+
+	return 0;
+}
+
+static int omf_dparm_unpack_letoh(struct omf_devparm_descriptor *dp, const char *inbuf,
+				  enum sb_descriptor_ver_omf sbver,
+				  struct omf_mdcver *mdcver,
+				  enum unpack_only unpackonly)
+{
+	const size_t sz = ARRAY_SIZE(devparm_descriptor_table);
+	int rc;
+
+	if (unpackonly == UNPACKONLY)
+		rc = omf_upgrade_unpack_only(dp, inbuf, devparm_descriptor_table,
+					     sz, sbver, mdcver);
+	else
+		rc = omf_unpack_letoh_and_convert(dp, sizeof(*dp), inbuf, devparm_descriptor_table,
+						  sz, sbver, mdcver);
+
+	return rc;
+}
+
+
+/*
+ * layout_descriptor
+ */
+static void omf_layout_pack_htole(const struct omf_layout_descriptor *ld, char *outbuf)
+{
+	struct layout_descriptor_omf *ld_omf;
+
+	ld_omf = (struct layout_descriptor_omf *)outbuf;
+	omf_set_pol_zcnt(ld_omf, ld->ol_zcnt);
+	omf_set_pol_zaddr(ld_omf, ld->ol_zaddr);
+}
+
+/**
+ * omf_layout_unpack_letoh_v1() - unpack omf layout descriptor version 1
+ * @out: in-memory layout descriptor
+ * @in: on-media layout descriptor
+ */
+int omf_layout_unpack_letoh_v1(void *out, const char *inbuf)
+{
+	struct omf_layout_descriptor *ld;
+	struct layout_descriptor_omf *ld_omf;
+
+	ld = (struct omf_layout_descriptor *)out;
+	ld_omf = (struct layout_descriptor_omf *)inbuf;
+
+	ld->ol_zcnt = omf_pol_zcnt(ld_omf);
+	ld->ol_zaddr = omf_pol_zaddr(ld_omf);
+
+	return 0;
+}
+
+static int omf_layout_unpack_letoh(struct omf_layout_descriptor *ld, const char *inbuf,
+				   enum sb_descriptor_ver_omf sbver,
+				   struct omf_mdcver *mdcver,
+				   enum unpack_only unpackonly)
+{
+	const size_t sz = ARRAY_SIZE(layout_descriptor_table);
+	int rc;
+
+	if (unpackonly == UNPACKONLY)
+		rc = omf_upgrade_unpack_only(ld, inbuf, layout_descriptor_table,
+					     sz, sbver, mdcver);
+	else
+		rc = omf_unpack_letoh_and_convert(ld, sizeof(*ld), inbuf, layout_descriptor_table,
+						  sz, sbver, mdcver);
+
+	return rc;
+}
+
+/*
+ * pmd_layout
+ */
+static int omf_pmd_layout_pack_htole(const struct mpool_descriptor *mp, u8 rtype,
+				     struct pmd_layout *ecl, char *outbuf)
+{
+	struct mdcrec_data_ocreate_omf *ocre_omf;
+	int data_rec_sz;
+
+	if (rtype != OMF_MDR_OCREATE && rtype != OMF_MDR_OUPDATE) {
+		mp_pr_warn("mpool %s, wrong rec type %u packing layout", mp->pds_name, rtype);
+		return -EINVAL;
+	}
+
+	data_rec_sz = sizeof(*ocre_omf);
+
+	ocre_omf = (struct mdcrec_data_ocreate_omf *)outbuf;
+	omf_set_pdrc_rtype(ocre_omf, rtype);
+	omf_set_pdrc_mclass(ocre_omf, mp->pds_pdv[ecl->eld_ld.ol_pdh].pdi_mclass);
+	omf_set_pdrc_objid(ocre_omf, ecl->eld_objid);
+	omf_set_pdrc_gen(ocre_omf, ecl->eld_gen);
+	omf_set_pdrc_mblen(ocre_omf, ecl->eld_mblen);
+
+	if (objid_type(ecl->eld_objid) == OMF_OBJ_MLOG) {
+		memcpy(ocre_omf->pdrc_uuid, ecl->eld_uuid.uuid, OMF_UUID_PACKLEN);
+		data_rec_sz += OMF_UUID_PACKLEN;
+	}
+
+	omf_layout_pack_htole(&(ecl->eld_ld), (char *)&(ocre_omf->pdrc_ld));
+
+	return data_rec_sz;
+}
+
+/**
+ * omf_pmd_layout_unpack_letoh_v1() - Unpack little-endian mdc obj record and optional obj layout
+ * @out:   in-memory record (struct omf_mdcrec_data)
+ * @inbuf: on-media (omf) format
+ *
+ * For version 1 of OMF_MDR_OCREATE record (struct layout_descriptor_omf)
+ *
+ * Return:
+ *   0 if successful
+ *   -EINVAL if invalid record type or format
+ *   -ENOMEM if cannot alloc memory for metadata conversion
+ */
+static int omf_pmd_layout_unpack_letoh_v1(void *out, const char *inbuf)
+{
+	struct mdcrec_data_ocreate_omf *ocre_omf;
+	struct omf_mdcrec_data *cdr = out;
+	int rc;
+
+	ocre_omf = (struct mdcrec_data_ocreate_omf *)inbuf;
+
+	cdr->omd_rtype = omf_pdrc_rtype(ocre_omf);
+	if (cdr->omd_rtype != OMF_MDR_OCREATE && cdr->omd_rtype != OMF_MDR_OUPDATE) {
+		rc = -EINVAL;
+		mp_pr_err("Unpacking layout failed, wrong record type %d", rc, cdr->omd_rtype);
+		return rc;
+	}
+
+	cdr->u.obj.omd_mclass = omf_pdrc_mclass(ocre_omf);
+	cdr->u.obj.omd_objid = omf_pdrc_objid(ocre_omf);
+	cdr->u.obj.omd_gen   = omf_pdrc_gen(ocre_omf);
+	cdr->u.obj.omd_mblen = omf_pdrc_mblen(ocre_omf);
+
+	if (objid_type(cdr->u.obj.omd_objid) == OMF_OBJ_MLOG)
+		memcpy(cdr->u.obj.omd_uuid.uuid, ocre_omf->pdrc_uuid, OMF_UUID_PACKLEN);
+
+	rc = omf_layout_unpack_letoh(&cdr->u.obj.omd_old, (char *)&(ocre_omf->pdrc_ld),
+				     OMF_SB_DESC_V1, NULL, UNPACKONLY);
+
+	return rc;
+}
+
+
+/**
+ * omf_pmd_layout_unpack_letoh() - Unpack little-endian mdc obj record and optional obj layout
+ * @mp:     mpool descriptor
+ * @mdcver: version of the mpool MDC content being unpacked.
+ * @cdr:    output record
+ * @inbuf:  on-media (omf) format
+ *
+ * Allocate object layout.
+ * For version 1 of OMF_MDR_OCREATE record (struct layout_descriptor_omf)
+ *
+ * Return:
+ *   0 if successful
+ *   -EINVAL if invalid record type or format
+ *   -ENOMEM if cannot alloc memory to return an object layout
+ *   -ENOENT if cannot convert a devid to a device handle (pdh)
+ */
+static int omf_pmd_layout_unpack_letoh(struct mpool_descriptor *mp, struct omf_mdcver *mdcver,
+				       struct omf_mdcrec_data *cdr, const char *inbuf)
+{
+	struct pmd_layout *ecl = NULL;
+	int rc, i;
+
+	rc = omf_unpack_letoh_and_convert(cdr, sizeof(*cdr), inbuf, mdcrec_data_ocreate_table,
+					  ARRAY_SIZE(mdcrec_data_ocreate_table),
+					  OMF_SB_DESC_UNDEF, mdcver);
+	if (rc) {
+		char buf[MAX_MDCVERSTR];
+
+		omfu_mdcver_to_str(mdcver, buf, sizeof(buf));
+		mp_pr_err("mpool %s, unpacking layout failed for mdc content version %s",
+			  rc, mp->pds_name, buf);
+		return rc;
+	}
+
+#ifdef COMP_PMD_ENABLED
+	ecl = pmd_layout_alloc(&cdr->u.obj.omd_uuid, cdr->u.obj.omd_objid, cdr->u.obj.omd_gen,
+			       cdr->u.obj.omd_mblen, cdr->u.obj.omd_old.ol_zcnt);
+#endif
+	if (!ecl) {
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, unpacking layout failed, could not allocate layout structure",
+			  rc, mp->pds_name);
+		return rc;
+	}
+
+	ecl->eld_ld.ol_zaddr = cdr->u.obj.omd_old.ol_zaddr;
+
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		if (mp->pds_pdv[i].pdi_mclass == cdr->u.obj.omd_mclass) {
+			ecl->eld_ld.ol_pdh = i;
+			break;
+		}
+	}
+
+	if (i >= mp->pds_pdvcnt) {
+		pmd_obj_put(ecl);
+
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, unpacking layout failed, mclass %u not in mpool",
+			  rc, mp->pds_name, cdr->u.obj.omd_mclass);
+		return rc;
+	}
+
+	cdr->u.obj.omd_layout = ecl;
+
+	return rc;
+}
+
+
+/*
+ * sb_descriptor
+ */
+
+/**
+ * omf_cksum_crc32c_le() - compute checksum
+ * @dbuf: data buf
+ * @dlen: data length
+ * @obuf: output buf
+ *
+ * Compute 4-byte checksum of type CRC32C for data buffer dbuf with length dlen
+ * and store in obuf little-endian; CRC32C is the only crypto algorithm we
+ * currently support.
+ *
+ * Return: 0 if successful, -ENOMEM or an error from crypto_shash_digest() otherwise.
+ */
+static int omf_cksum_crc32c_le(const char *dbuf, u64 dlen, u8 *obuf)
+{
+	struct shash_desc *desc;
+	size_t descsz;
+	int rc;
+
+	memset(obuf, 0, 4);
+
+	descsz = sizeof(*desc) + crypto_shash_descsize(mpool_tfm);
+	desc = kzalloc(descsz, GFP_KERNEL);
+	if (!desc)
+		return -ENOMEM;
+	desc->tfm = mpool_tfm;
+
+	rc = crypto_shash_digest(desc, (u8 *)dbuf, dlen, obuf);
+
+	kfree(desc);
+
+	return rc;
+}
+
+struct omf_mdcver *omf_sbver_to_mdcver(enum sb_descriptor_ver_omf sbver)
+{
+	struct upgrade_history *uhtab;
+
+	uhtab = omf_find_upgrade_hist(sb_descriptor_table, ARRAY_SIZE(sb_descriptor_table),
+				      sbver, NULL);
+	if (uhtab) {
+		ASSERT(uhtab->uh_sbver == sbver);
+		return &uhtab->uh_mdcver;
+	}
+
+	return NULL;
+}
+
+int omf_sb_pack_htole(struct omf_sb_descriptor *sb, char *outbuf)
+{
+	struct sb_descriptor_omf *sb_omf;
+	u8 cksum[4];
+	int rc;
+
+	if (sb->osb_vers != OMF_SB_DESC_VER_LAST)
+		return -EINVAL; /* Not a valid header version */
+
+	sb_omf = (struct sb_descriptor_omf *)outbuf;
+
+	/* Pack drive-specific info */
+	omf_set_psb_magic(sb_omf, sb->osb_magic);
+	omf_set_psb_name(sb_omf, sb->osb_name, MPOOL_NAMESZ_MAX);
+	omf_set_psb_poolid(sb_omf, sb->osb_poolid.uuid, MPOOL_UUID_SIZE);
+	omf_set_psb_vers(sb_omf, sb->osb_vers);
+	omf_set_psb_gen(sb_omf, sb->osb_gen);
+
+	omf_dparm_pack_htole(&(sb->osb_parm), (char *)&(sb_omf->psb_parm));
+
+	omf_set_psb_mdc01gen(sb_omf, sb->osb_mdc01gen);
+	omf_set_psb_mdc01uuid(sb_omf, sb->osb_mdc01uuid.uuid, MPOOL_UUID_SIZE);
+	omf_layout_pack_htole(&(sb->osb_mdc01desc), (char *)&(sb_omf->psb_mdc01desc));
+	omf_set_psb_mdc01devid(sb_omf, sb->osb_mdc01devid.uuid, MPOOL_UUID_SIZE);
+
+	omf_set_psb_mdc02gen(sb_omf, sb->osb_mdc02gen);
+	omf_set_psb_mdc02uuid(sb_omf, sb->osb_mdc02uuid.uuid, MPOOL_UUID_SIZE);
+	omf_layout_pack_htole(&(sb->osb_mdc02desc), (char *)&(sb_omf->psb_mdc02desc));
+	omf_set_psb_mdc02devid(sb_omf, sb->osb_mdc02devid.uuid, MPOOL_UUID_SIZE);
+
+	outbuf = (char *)&sb_omf->psb_mdc0dev;
+	omf_dparm_pack_htole(&sb->osb_mdc0dev, outbuf);
+
+	/* Add CKSUM1 */
+	rc = omf_cksum_crc32c_le((char *)sb_omf, offsetof(struct sb_descriptor_omf, psb_cksum1),
+				 cksum);
+	if (rc)
+		return -EINVAL;
+
+	omf_set_psb_cksum1(sb_omf, cksum, 4);
+
+	/* Add CKSUM2 */
+	rc = omf_cksum_crc32c_le((char *)&sb_omf->psb_parm, sizeof(sb_omf->psb_parm), cksum);
+	if (rc)
+		return -EINVAL;
+
+	omf_set_psb_cksum2(sb_omf, cksum, 4);
+
+	return 0;
+}
+
+/**
+ * omf_sb_unpack_letoh_v1() - unpack version 1 omf sb descriptor into in-memory format
+ * @out: in-memory format
+ * @inbuf: omf format
+ */
+int omf_sb_unpack_letoh_v1(void *out, const char *inbuf)
+{
+	struct sb_descriptor_omf *sb_omf;
+	struct omf_sb_descriptor *sb;
+	u8 cksum[4], omf_cksum[4];
+	int rc;
+
+	sb_omf = (struct sb_descriptor_omf *)inbuf;
+	sb = (struct omf_sb_descriptor *)out;
+
+	/* Verify CKSUM2 */
+	rc = omf_cksum_crc32c_le((char *) &(sb_omf->psb_parm), sizeof(sb_omf->psb_parm), cksum);
+	omf_psb_cksum2(sb_omf, omf_cksum, 4);
+
+	if (rc || memcmp(cksum, omf_cksum, 4))
+		return -EINVAL;
+
+
+	sb->osb_magic = omf_psb_magic(sb_omf);
+
+	omf_psb_name(sb_omf, sb->osb_name, MPOOL_NAMESZ_MAX);
+
+	sb->osb_vers = omf_psb_vers(sb_omf);
+	ASSERT(sb->osb_vers == OMF_SB_DESC_V1);
+
+	omf_psb_poolid(sb_omf, sb->osb_poolid.uuid, MPOOL_UUID_SIZE);
+
+	sb->osb_gen = omf_psb_gen(sb_omf);
+	omf_dparm_unpack_letoh(&(sb->osb_parm), (char *)&(sb_omf->psb_parm),
+			       OMF_SB_DESC_V1, NULL, UNPACKONLY);
+
+	sb->osb_mdc01gen  = omf_psb_mdc01gen(sb_omf);
+	omf_psb_mdc01uuid(sb_omf, sb->osb_mdc01uuid.uuid, MPOOL_UUID_SIZE);
+	omf_layout_unpack_letoh(&(sb->osb_mdc01desc), (char *)&(sb_omf->psb_mdc01desc),
+				OMF_SB_DESC_V1, NULL, UNPACKONLY);
+	omf_psb_mdc01devid(sb_omf, sb->osb_mdc01devid.uuid, MPOOL_UUID_SIZE);
+
+	sb->osb_mdc02gen = omf_psb_mdc02gen(sb_omf);
+	omf_psb_mdc02uuid(sb_omf, sb->osb_mdc02uuid.uuid, MPOOL_UUID_SIZE);
+	omf_layout_unpack_letoh(&(sb->osb_mdc02desc), (char *)&(sb_omf->psb_mdc02desc),
+				OMF_SB_DESC_V1, NULL, UNPACKONLY);
+	omf_psb_mdc02devid(sb_omf, sb->osb_mdc02devid.uuid, MPOOL_UUID_SIZE);
+
+	inbuf = (char *)&sb_omf->psb_mdc0dev;
+	omf_dparm_unpack_letoh(&sb->osb_mdc0dev, inbuf, OMF_SB_DESC_V1, NULL, UNPACKONLY);
+
+	return 0;
+}
+
+int omf_sb_unpack_letoh(struct omf_sb_descriptor *sb, const char *inbuf, u16 *omf_ver)
+{
+	struct sb_descriptor_omf *sb_omf;
+	u8 cksum[4], omf_cksum[4];
+	u64 magic = 0;
+	int rc;
+
+	sb_omf = (struct sb_descriptor_omf *)inbuf;
+
+	magic = omf_psb_magic(sb_omf);
+
+	*omf_ver = OMF_SB_DESC_UNDEF;
+
+	if (magic != OMF_SB_MAGIC)
+		return -EBADF;
+
+	/* Verify CKSUM1 */
+	rc = omf_cksum_crc32c_le(inbuf, offsetof(struct sb_descriptor_omf, psb_cksum1), cksum);
+	omf_psb_cksum1(sb_omf, omf_cksum, 4);
+	if (rc || memcmp(cksum, omf_cksum, 4))
+		return -EINVAL;
+
+	*omf_ver = omf_psb_vers(sb_omf);
+
+	if (*omf_ver > OMF_SB_DESC_VER_LAST) {
+		rc = -EPROTONOSUPPORT;
+		mp_pr_err("Unsupported sb version %d", rc, *omf_ver);
+		return rc;
+	}
+
+	rc = omf_unpack_letoh_and_convert(sb, sizeof(*sb), inbuf, sb_descriptor_table,
+					  ARRAY_SIZE(sb_descriptor_table), *omf_ver, NULL);
+	if (rc)
+		mp_pr_err("Unpacking superblock failed for version %u", rc, *omf_ver);
+
+	return rc;
+}
+
+bool omf_sb_has_magic_le(const char *inbuf)
+{
+	struct sb_descriptor_omf *sb_omf;
+	u64 magic;
+
+	sb_omf = (struct sb_descriptor_omf *)inbuf;
+	magic  = omf_psb_magic(sb_omf);
+
+	return magic == OMF_SB_MAGIC;
+}
+
+
+/*
+ * mdcrec_objcmn
+ */
+
+/**
+ * omf_mdcrec_objcmn_pack_htole() - pack mdc obj record
+ * @mp:     mpool descriptor
+ * @cdr:    mdc record to pack
+ * @outbuf: output buffer, little-endian
+ *
+ * Pack mdc obj record and optional obj layout into outbuf little-endian.
+ *
+ * Return: bytes packed if successful, -EINVAL otherwise
+ */
+static u64 omf_mdcrec_objcmn_pack_htole(struct mpool_descriptor *mp,
+					struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	struct pmd_layout *layout = cdr->u.obj.omd_layout;
+	struct mdcrec_data_odelete_omf *odel_omf;
+	struct mdcrec_data_oerase_omf *oera_omf;
+	s64 bytes = 0;
+
+	switch (cdr->omd_rtype) {
+	case OMF_MDR_ODELETE:
+	case OMF_MDR_OIDCKPT:
+		odel_omf = (struct mdcrec_data_odelete_omf *)outbuf;
+		omf_set_pdro_rtype(odel_omf, cdr->omd_rtype);
+		omf_set_pdro_objid(odel_omf, cdr->u.obj.omd_objid);
+		return sizeof(*odel_omf);
+
+	case OMF_MDR_OERASE:
+		oera_omf = (struct mdcrec_data_oerase_omf *)outbuf;
+		omf_set_pdrt_rtype(oera_omf, cdr->omd_rtype);
+		omf_set_pdrt_objid(oera_omf, cdr->u.obj.omd_objid);
+		omf_set_pdrt_gen(oera_omf, cdr->u.obj.omd_gen);
+		return sizeof(*oera_omf);
+
+	default:
+		break;
+	}
+
+	if (cdr->omd_rtype != OMF_MDR_OCREATE && cdr->omd_rtype != OMF_MDR_OUPDATE) {
+		mp_pr_warn("mpool %s, packing object, unknown rec type %d",
+			   mp->pds_name, cdr->omd_rtype);
+		return -EINVAL;
+	}
+
+	/* OCREATE or OUPDATE: pack object in provided layout descriptor */
+	if (!layout) {
+		mp_pr_warn("mpool %s, invalid layout", mp->pds_name);
+		return -EINVAL;
+	}
+
+	bytes = omf_pmd_layout_pack_htole(mp, cdr->omd_rtype, layout, outbuf);
+	if (bytes < 0)
+		return -EINVAL;
+
+	return bytes;
+}
+
+/**
+ * omf_mdcrec_objcmn_unpack_letoh() - Unpack little-endian mdc record and optional obj layout
+ * @mp:     mpool descriptor
+ * @mdcver: version of the mpool MDC content being unpacked
+ * @cdr:    output record
+ * @inbuf:  on-media (omf) format
+ *
+ * Return:
+ *   0 if successful
+ *   -EINVAL if invalid record type or format
+ *   -ENOMEM if cannot alloc memory to return an object layout
+ *   -ENOENT if cannot convert a devid to a device handle (pdh)
+ */
+static int omf_mdcrec_objcmn_unpack_letoh(struct mpool_descriptor *mp, struct omf_mdcver *mdcver,
+					  struct omf_mdcrec_data *cdr, const char *inbuf)
+{
+	struct mdcrec_data_odelete_omf *odel_omf;
+	struct mdcrec_data_oerase_omf *oera_omf;
+	enum mdcrec_type_omf rtype;
+	int rc = 0;
+
+	/*
+	 * The data record type is always the first field of all the
+	 * data records.
+	 */
+	rtype = omf_pdro_rtype((struct mdcrec_data_odelete_omf *)inbuf);
+
+	switch (rtype) {
+	case OMF_MDR_ODELETE:
+	case OMF_MDR_OIDCKPT:
+		odel_omf = (struct mdcrec_data_odelete_omf *)inbuf;
+		cdr->omd_rtype = omf_pdro_rtype(odel_omf);
+		cdr->u.obj.omd_objid = omf_pdro_objid(odel_omf);
+		break;
+
+	case OMF_MDR_OERASE:
+		oera_omf = (struct mdcrec_data_oerase_omf *)inbuf;
+		cdr->omd_rtype = omf_pdrt_rtype(oera_omf);
+		cdr->u.obj.omd_objid = omf_pdrt_objid(oera_omf);
+		cdr->u.obj.omd_gen = omf_pdrt_gen(oera_omf);
+		break;
+
+	case OMF_MDR_OCREATE:
+	case OMF_MDR_OUPDATE:
+		rc = omf_pmd_layout_unpack_letoh(mp, mdcver, cdr, inbuf);
+		break;
+
+	default:
+		mp_pr_warn("mpool %s, invalid rtype %d", mp->pds_name, rtype);
+		return -EINVAL;
+	}
+
+	return rc;
+}
+
+
+/*
+ * mdcrec_mcconfig
+ */
+
+/**
+ * omf_mdcrec_mcconfig_pack_htole() - Pack mdc mclass config record into outbuf little-endian.
+ * @cdr:    mdc record to pack
+ * @outbuf: output buffer, little-endian
+ *
+ * Return: bytes packed.
+ */
+static u64 omf_mdcrec_mcconfig_pack_htole(struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	struct mdcrec_data_mcconfig_omf *mc_omf;
+
+	mc_omf = (struct mdcrec_data_mcconfig_omf *)outbuf;
+	omf_set_pdrs_rtype(mc_omf, cdr->omd_rtype);
+	omf_dparm_pack_htole(&(cdr->u.dev.omd_parm), (char *)&(mc_omf->pdrs_parm));
+
+	return sizeof(*mc_omf);
+}
+
+/**
+ * omf_mdcrec_mcconfig_unpack_letoh() - Unpack little-endian mdc mcconfig record from inbuf.
+ * @mdcver: version of the mpool MDC content being unpacked
+ * @cdr:    output record
+ * @inbuf:  on-media (omf) format
+ */
+static int omf_mdcrec_mcconfig_unpack_letoh(struct omf_mdcver *mdcver, struct omf_mdcrec_data *cdr,
+					    const char *inbuf)
+{
+	struct mdcrec_data_mcconfig_omf *mc_omf;
+
+	mc_omf = (struct mdcrec_data_mcconfig_omf *)inbuf;
+
+	cdr->omd_rtype = omf_pdrs_rtype(mc_omf);
+
+	return omf_dparm_unpack_letoh(&(cdr->u.dev.omd_parm), (char *)&(mc_omf->pdrs_parm),
+				      OMF_SB_DESC_UNDEF, mdcver, UNPACKCONVERT);
+}
+
+
+/*
+ * mdcrec_version
+ */
+
+/**
+ * omf_mdcver_pack_htole() - Pack mdc content version record into outbuf little-endian.
+ * @cdr:    mdc record to pack
+ * @outbuf: output buffer, little-endian
+ *
+ * Return: bytes packed.
+ */
+static u64 omf_mdcver_pack_htole(struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	struct mdcver_omf *pv_omf = (struct mdcver_omf *)outbuf;
+
+	omf_set_pv_rtype(pv_omf, cdr->omd_rtype);
+	omf_set_pv_mdcv_major(pv_omf, cdr->u.omd_version.mdcv_major);
+	omf_set_pv_mdcv_minor(pv_omf, cdr->u.omd_version.mdcv_minor);
+	omf_set_pv_mdcv_patch(pv_omf, cdr->u.omd_version.mdcv_patch);
+	omf_set_pv_mdcv_dev(pv_omf, cdr->u.omd_version.mdcv_dev);
+
+	return sizeof(*pv_omf);
+}
+
+void omf_mdcver_unpack_letoh(struct omf_mdcrec_data *cdr, const char *inbuf)
+{
+	struct mdcver_omf *pv_omf = (struct mdcver_omf *)inbuf;
+
+	cdr->omd_rtype = omf_pv_rtype(pv_omf);
+	cdr->u.omd_version.mdcv_major = omf_pv_mdcv_major(pv_omf);
+	cdr->u.omd_version.mdcv_minor = omf_pv_mdcv_minor(pv_omf);
+	cdr->u.omd_version.mdcv_patch = omf_pv_mdcv_patch(pv_omf);
+	cdr->u.omd_version.mdcv_dev   = omf_pv_mdcv_dev(pv_omf);
+}
+
+
+/*
+ * mdcrec_mcspare
+ */
+static u64 omf_mdcrec_mcspare_pack_htole(struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	struct mdcrec_data_mcspare_omf *mcs_omf;
+
+	mcs_omf = (struct mdcrec_data_mcspare_omf *)outbuf;
+	omf_set_pdra_rtype(mcs_omf, cdr->omd_rtype);
+	omf_set_pdra_mclassp(mcs_omf, cdr->u.mcs.omd_mclassp);
+	omf_set_pdra_spzone(mcs_omf, cdr->u.mcs.omd_spzone);
+
+	return sizeof(*mcs_omf);
+}
+
+/**
+ * omf_mdcrec_mcspare_unpack_letoh_v1() - Unpack little-endian mdc media class spare record
+ * @out:   output record (struct omf_mdcrec_data)
+ * @inbuf: on-media (omf) format
+ */
+static int omf_mdcrec_mcspare_unpack_letoh_v1(void *out, const char *inbuf)
+{
+	struct mdcrec_data_mcspare_omf *mcs_omf;
+	struct omf_mdcrec_data *cdr = out;
+
+	mcs_omf = (struct mdcrec_data_mcspare_omf *)inbuf;
+
+	cdr->omd_rtype = omf_pdra_rtype(mcs_omf);
+	cdr->u.mcs.omd_mclassp = omf_pdra_mclassp(mcs_omf);
+	cdr->u.mcs.omd_spzone = omf_pdra_spzone(mcs_omf);
+
+	return 0;
+}
+
+/**
+ * omf_mdcrec_mcspare_unpack_letoh() - Unpack little-endian mdc media class spare record
+ * @cdr:    output record
+ * @inbuf:  on-media (omf) format
+ * @sbver:  superblock descriptor version
+ * @mdcver: version of the mpool MDC content being unpacked
+ */
+static int omf_mdcrec_mcspare_unpack_letoh(struct omf_mdcrec_data *cdr, const char *inbuf,
+					   enum sb_descriptor_ver_omf sbver,
+					   struct omf_mdcver *mdcver)
+{
+	return omf_unpack_letoh_and_convert(cdr, sizeof(*cdr), inbuf, mdcrec_data_mcspare_table,
+					    ARRAY_SIZE(mdcrec_data_mcspare_table), sbver, mdcver);
+}
+
+
+/*
+ * mdcrec_mpconfig
+ */
+
+/**
+ * omf_mdcrec_mpconfig_pack_htole() - Pack an mpool config record
+ * @cdr:    mdc record to pack
+ * @outbuf: output buffer, little-endian
+ *
+ * Return: bytes packed.
+ */
+static u64 omf_mdcrec_mpconfig_pack_htole(struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	struct mdcrec_data_mpconfig_omf *cfg_omf;
+	struct mpool_config *cfg;
+
+	cfg = &cdr->u.omd_cfg;
+
+	cfg_omf = (struct mdcrec_data_mpconfig_omf *)outbuf;
+	omf_set_pdmc_rtype(cfg_omf, cdr->omd_rtype);
+	omf_set_pdmc_oid1(cfg_omf, cfg->mc_oid1);
+	omf_set_pdmc_oid2(cfg_omf, cfg->mc_oid2);
+	omf_set_pdmc_uid(cfg_omf, cfg->mc_uid);
+	omf_set_pdmc_gid(cfg_omf, cfg->mc_gid);
+	omf_set_pdmc_mode(cfg_omf, cfg->mc_mode);
+	omf_set_pdmc_rsvd0(cfg_omf, cfg->mc_rsvd0);
+	omf_set_pdmc_captgt(cfg_omf, cfg->mc_captgt);
+	omf_set_pdmc_ra_pages_max(cfg_omf, cfg->mc_ra_pages_max);
+	omf_set_pdmc_vma_size_max(cfg_omf, cfg->mc_vma_size_max);
+	omf_set_pdmc_rsvd1(cfg_omf, cfg->mc_rsvd1);
+	omf_set_pdmc_rsvd2(cfg_omf, cfg->mc_rsvd2);
+	omf_set_pdmc_rsvd3(cfg_omf, cfg->mc_rsvd3);
+	omf_set_pdmc_rsvd4(cfg_omf, cfg->mc_rsvd4);
+	omf_set_pdmc_utype(cfg_omf, &cfg->mc_utype, sizeof(cfg->mc_utype));
+	omf_set_pdmc_label(cfg_omf, cfg->mc_label, sizeof(cfg->mc_label));
+
+	return sizeof(*cfg_omf);
+}
+
+/**
+ * omf_mdcrec_mpconfig_unpack_letoh() - Unpack an mpool config record
+ * @cdr:   output record
+ * @inbuf: on-media (omf) format
+ */
+static void omf_mdcrec_mpconfig_unpack_letoh(struct omf_mdcrec_data *cdr, const char *inbuf)
+{
+	struct mdcrec_data_mpconfig_omf *cfg_omf;
+	struct mpool_config *cfg;
+
+	cfg = &cdr->u.omd_cfg;
+
+	cfg_omf = (struct mdcrec_data_mpconfig_omf *)inbuf;
+	cdr->omd_rtype = omf_pdmc_rtype(cfg_omf);
+	cfg->mc_oid1 = omf_pdmc_oid1(cfg_omf);
+	cfg->mc_oid2 = omf_pdmc_oid2(cfg_omf);
+	cfg->mc_uid = omf_pdmc_uid(cfg_omf);
+	cfg->mc_gid = omf_pdmc_gid(cfg_omf);
+	cfg->mc_mode = omf_pdmc_mode(cfg_omf);
+	cfg->mc_rsvd0 = omf_pdmc_rsvd0(cfg_omf);
+	cfg->mc_captgt = omf_pdmc_captgt(cfg_omf);
+	cfg->mc_ra_pages_max = omf_pdmc_ra_pages_max(cfg_omf);
+	cfg->mc_vma_size_max = omf_pdmc_vma_size_max(cfg_omf);
+	cfg->mc_rsvd1 = omf_pdmc_rsvd1(cfg_omf);
+	cfg->mc_rsvd2 = omf_pdmc_rsvd2(cfg_omf);
+	cfg->mc_rsvd3 = omf_pdmc_rsvd3(cfg_omf);
+	cfg->mc_rsvd4 = omf_pdmc_rsvd4(cfg_omf);
+	omf_pdmc_utype(cfg_omf, &cfg->mc_utype, sizeof(cfg->mc_utype));
+	omf_pdmc_label(cfg_omf, cfg->mc_label, sizeof(cfg->mc_label));
+}
+
+/**
+ * mdcrec_type_objcmn() - Determine if the data record type corresponds to an object.
+ * @rtype: record type
+ *
+ * Return: true if the type is of an object data record.
+ */
+static bool mdcrec_type_objcmn(enum mdcrec_type_omf rtype)
+{
+	return (rtype == OMF_MDR_OCREATE || rtype == OMF_MDR_OUPDATE || rtype == OMF_MDR_ODELETE ||
+		rtype == OMF_MDR_OIDCKPT || rtype == OMF_MDR_OERASE);
+}
+
+int omf_mdcrec_isobj_le(const char *inbuf)
+{
+	const u8 rtype = inbuf[0]; /* rtype is byte so no endian conversion */
+
+	return mdcrec_type_objcmn(rtype);
+}
+
+
+/*
+ * mdcrec
+ */
+int omf_mdcrec_pack_htole(struct mpool_descriptor *mp, struct omf_mdcrec_data *cdr, char *outbuf)
+{
+	u8 rtype = (u8)cdr->omd_rtype;
+
+	if (mdcrec_type_objcmn(rtype))
+		return omf_mdcrec_objcmn_pack_htole(mp, cdr, outbuf);
+	else if (rtype == OMF_MDR_VERSION)
+		return omf_mdcver_pack_htole(cdr, outbuf);
+	else if (rtype == OMF_MDR_MCCONFIG)
+		return omf_mdcrec_mcconfig_pack_htole(cdr, outbuf);
+	else if (rtype == OMF_MDR_MCSPARE)
+		return omf_mdcrec_mcspare_pack_htole(cdr, outbuf);
+	else if (rtype == OMF_MDR_MPCONFIG)
+		return omf_mdcrec_mpconfig_pack_htole(cdr, outbuf);
+
+	mp_pr_warn("mpool %s, invalid record type %u in mdc log", mp->pds_name, rtype);
+
+	return -EINVAL;
+}
+
+int omf_mdcrec_unpack_letoh(struct omf_mdcver *mdcver, struct mpool_descriptor *mp,
+			    struct omf_mdcrec_data *cdr, const char *inbuf)
+{
+	u8 rtype = (u8)*inbuf;
+
+	/* rtype is byte so no endian conversion */
+
+	if (mdcrec_type_objcmn(rtype))
+		return omf_mdcrec_objcmn_unpack_letoh(mp, mdcver, cdr, inbuf);
+	else if (rtype == OMF_MDR_VERSION)
+		omf_mdcver_unpack_letoh(cdr, inbuf);
+	else if (rtype == OMF_MDR_MCCONFIG)
+		omf_mdcrec_mcconfig_unpack_letoh(mdcver, cdr, inbuf);
+	else if (rtype == OMF_MDR_MCSPARE)
+		omf_mdcrec_mcspare_unpack_letoh(cdr, inbuf, OMF_SB_DESC_UNDEF, mdcver);
+	else if (rtype == OMF_MDR_MPCONFIG)
+		omf_mdcrec_mpconfig_unpack_letoh(cdr, inbuf);
+	else {
+		mp_pr_warn("mpool %s, unknown record type %u in mdc log", mp->pds_name, rtype);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+u8 omf_mdcrec_unpack_type_letoh(const char *inbuf)
+{
+	/* rtype is byte so no endian conversion */
+	return (u8)*inbuf;
+}
+
+
+/*
+ * logblock_header
+ */
+
+int omf_logblock_header_pack_htole(struct omf_logblock_header *lbh, char *outbuf)
+{
+	struct logblock_header_omf *lbh_omf;
+
+	lbh_omf = (struct logblock_header_omf *)outbuf;
+
+	if (lbh->olh_vers != OMF_LOGBLOCK_VERS)
+		return -EINVAL;
+
+	omf_set_polh_vers(lbh_omf, lbh->olh_vers);
+	omf_set_polh_magic(lbh_omf, lbh->olh_magic.uuid, MPOOL_UUID_SIZE);
+	omf_set_polh_gen(lbh_omf, lbh->olh_gen);
+	omf_set_polh_pfsetid(lbh_omf, lbh->olh_pfsetid);
+	omf_set_polh_cfsetid(lbh_omf, lbh->olh_cfsetid);
+
+	return 0;
+}
+
+int omf_logblock_header_unpack_letoh(struct omf_logblock_header *lbh, const char *inbuf)
+{
+	struct logblock_header_omf *lbh_omf;
+
+	lbh_omf = (struct logblock_header_omf *)inbuf;
+
+	lbh->olh_vers    = omf_polh_vers(lbh_omf);
+	omf_polh_magic(lbh_omf, lbh->olh_magic.uuid, MPOOL_UUID_SIZE);
+	lbh->olh_gen     = omf_polh_gen(lbh_omf);
+	lbh->olh_pfsetid = omf_polh_pfsetid(lbh_omf);
+	lbh->olh_cfsetid = omf_polh_cfsetid(lbh_omf);
+
+	return 0;
+}
+
+int omf_logblock_header_len_le(char *lbuf)
+{
+	struct logblock_header_omf *lbh_omf;
+
+	lbh_omf = (struct logblock_header_omf *)lbuf;
+
+	if (omf_polh_vers(lbh_omf) == OMF_LOGBLOCK_VERS)
+		return OMF_LOGBLOCK_HDR_PACKLEN;
+
+	return -EINVAL;
+}
+
+
+/*
+ * logrec_descriptor
+ */
+static bool logrec_type_valid(enum logrec_type_omf rtype)
+{
+	return rtype <= OMF_LOGREC_CEND;
+}
+
+bool logrec_type_datarec(enum logrec_type_omf rtype)
+{
+	return rtype && rtype <= OMF_LOGREC_DATALAST;
+}
+
+int omf_logrec_desc_pack_htole(struct omf_logrec_descriptor *lrd, char *outbuf)
+{
+	struct logrec_descriptor_omf *lrd_omf;
+
+	if (!logrec_type_valid(lrd->olr_rtype))
+		return -EINVAL;
+
+	lrd_omf = (struct logrec_descriptor_omf *)outbuf;
+	omf_set_polr_tlen(lrd_omf, lrd->olr_tlen);
+	omf_set_polr_rlen(lrd_omf, lrd->olr_rlen);
+	omf_set_polr_rtype(lrd_omf, lrd->olr_rtype);
+
+	return 0;
+}
+
+void omf_logrec_desc_unpack_letoh(struct omf_logrec_descriptor *lrd, const char *inbuf)
+{
+	struct logrec_descriptor_omf *lrd_omf;
+
+	lrd_omf = (struct logrec_descriptor_omf *)inbuf;
+	lrd->olr_tlen  = omf_polr_tlen(lrd_omf);
+	lrd->olr_rlen  = omf_polr_rlen(lrd_omf);
+	lrd->olr_rtype = omf_polr_rtype(lrd_omf);
+}
+
+int omf_init(void)
+{
+	const char *algo = "crc32c";
+	int rc = 0;
+
+	mpool_tfm = crypto_alloc_shash(algo, 0, 0);
+	if (IS_ERR(mpool_tfm)) {
+		rc = PTR_ERR(mpool_tfm);
+		mpool_tfm = NULL;
+		mp_pr_err("crypto_alloc_shash(%s) failed", rc, algo);
+	}
+
+	return rc;
+}
+
+void omf_exit(void)
+{
+	if (mpool_tfm)
+		crypto_free_shash(mpool_tfm);
+}
diff --git a/drivers/mpool/upgrade.c b/drivers/mpool/upgrade.c
new file mode 100644
index 000000000000..1b6b692e58f4
--- /dev/null
+++ b/drivers/mpool/upgrade.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Pool metadata upgrade module.
+ *
+ * Defines functions used to upgrade the mpool metadata.
+ *
+ */
+
+#include "omf_if.h"
+#include "upgrade.h"
+
+/*
+ * Latest mpool MDC content version understood by this binary.
+ * Also version used to write MDC content by this binary.
+ */
+#define MDCVER_MAJOR       1
+#define MDCVER_MINOR       0
+#define MDCVER_PATCH       0
+#define MDCVER_DEV         0
+
+/**
+ * struct mdcver_info - mpool MDC content version and its information.
+ * @mi_mdcver:  version of a mpool MDC content. It is the version of the
+ *              first binary that introduced that content semantic/format.
+ * @mi_types:   types used by this release (when writing MDC0-N content)
+ * @mi_ntypes:  number of types used by this release.
+ * @mi_comment: comment about that version
+ *
+ * Such a structure instance is added each time the mpool MDCs content
+ * semantic/format changes (making it incompatible with earlier binary
+ * versions).
+ */
+struct mdcver_info {
+	struct omf_mdcver   mi_mdcver;
+	uint8_t            *mi_types;
+	uint8_t             mi_ntypes;
+	const char         *mi_comment;
+};
+
+/*
+ * mpool MDC types used when MDC content is written at version 1.0.0.0.
+ */
+static uint8_t mdcver_1_0_0_0_types[] = {
+	OMF_MDR_OCREATE, OMF_MDR_OUPDATE, OMF_MDR_ODELETE, OMF_MDR_OIDCKPT,
+	OMF_MDR_OERASE, OMF_MDR_MCCONFIG, OMF_MDR_MCSPARE, OMF_MDR_VERSION,
+	OMF_MDR_MPCONFIG};
+
+
+/*
+ * mdcver_info mdcvtab[] - table of versions of mpool MDCs content.
+ *
+ * Each time MDC content semantic/format changes (making it incompatible
+ * with earlier binary versions) an entry is added in this table.
+ * The entry at the end of the array (highest index) is the version placed
+ * in the mpool MDC version record written to media when this binary writes
+ * the mpool MDCs.
+ * This entry is also the last mpool MDC content format/semantic that this
+ * binary understands.
+ *
+ * Example:
+ * - Initial binary 1.0.0.0 generates first ever MDCs content.
+ *   There is one entry in the table with its mi_mdcver being 1.0.0.0.
+ * - binary 1.0.0.1 is released and changes the mpool MDC content semantic (for
+ *   example, changes the meaning of the media class enum). This release adds
+ *   the entry 1.0.0.1 in this table.
+ * - binary 1.0.1.0 is released and doesn't change the MDC content semantic/format;
+ *   MDC content generated by the 1.0.1.0 binary is still compatible with a
+ *   1.0.0.1 binary reading it.
+ *   No entry is added in the table.
+ * - binary 2.0.0.0 is released and it changes MDCs content semantic.
+ *   A third entry is added in the table with its mi_mdcver being 2.0.0.0.
+ */
+static struct mdcver_info mdcvtab[] = {
+	{{ {MDCVER_MAJOR, MDCVER_MINOR, MDCVER_PATCH, MDCVER_DEV} },
+	mdcver_1_0_0_0_types, sizeof(mdcver_1_0_0_0_types),
+	"Initial mpool MDCs content"},
+};
+
+struct omf_mdcver *omfu_mdcver_cur(void)
+{
+	return &mdcvtab[ARRAY_SIZE(mdcvtab) - 1].mi_mdcver;
+}
+
+const char *omfu_mdcver_comment(struct omf_mdcver *mdcver)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mdcvtab); i++)
+		if (omfu_mdcver_cmp(mdcver, "==", &mdcvtab[i].mi_mdcver))
+			return mdcvtab[i].mi_comment;
+
+	return NULL;
+}
+
+char *omfu_mdcver_to_str(struct omf_mdcver *mdcver, char *buf, size_t sz)
+{
+	snprintf(buf, sz, "%u.%u.%u.%u", mdcver->mdcv_major,
+		 mdcver->mdcv_minor, mdcver->mdcv_patch, mdcver->mdcv_dev);
+
+	return buf;
+}
+
+bool omfu_mdcver_cmp(struct omf_mdcver *a, char *op, struct omf_mdcver *b)
+{
+	size_t cnt = ARRAY_SIZE(a->mdcver);
+	int res = 0, i;
+
+	for (i = 0; i < cnt; i++) {
+		if (a->mdcver[i] != b->mdcver[i]) {
+			res = (a->mdcver[i] > b->mdcver[i]) ? 1 : -1;
+			break;
+		}
+	}
+
+	if (((op[1] == '=') && (res == 0)) || ((op[0] == '>') && (res > 0)) ||
+	    ((op[0] == '<') && (res < 0)))
+		return true;
+
+	return false;
+}
+
+bool omfu_mdcver_cmp2(struct omf_mdcver *a, char *op, u16 major, u16 minor, u16 patch, u16 dev)
+{
+	struct omf_mdcver b;
+
+	b.mdcv_major = major;
+	b.mdcv_minor = minor;
+	b.mdcv_patch = patch;
+	b.mdcv_dev   = dev;
+
+	return omfu_mdcver_cmp(a, op, &b);
+}
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 07/22] mpool: add superblock management routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (5 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 06/22] mpool: add on-media pack, unpack and upgrade routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 08/22] mpool: add pool metadata routines to manage object lifecycle and IO Nabeel M Mohamed
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Provides utilities to initialize, read, update, and erase mpool
superblocks.

Mpool stores two copies of superblocks, one at offset 0 and the
other at offset 8K in zone 0 of each media class volume. SB0 is
the authoritative copy and SB1 is used for recovery in the event
of corruption.

The superblock comprises the metadata required to uniquely identify
a media class volume, the name and UUID of the mpool to which this
volume belongs, the superblock version, checksum, and device
properties. The superblock on the capacity media class volume also
includes metadata for accessing the metadata container 0 (MDC-0).
MDC-0 is introduced in a future patch.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c |   8 +
 drivers/mpool/sb.c   | 625 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 633 insertions(+)
 create mode 100644 drivers/mpool/sb.c

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 70f907ccc28a..261ce67e94dd 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -10,6 +10,7 @@
 #include "omf_if.h"
 #include "pd.h"
 #include "smap.h"
+#include "sb.h"
 
 /*
  * Module params...
@@ -25,6 +26,7 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 static void mpool_exit_impl(void)
 {
 	smap_exit();
+	sb_exit();
 	omf_exit();
 	pd_exit();
 }
@@ -46,6 +48,12 @@ static __init int mpool_init(void)
 		goto errout;
 	}
 
+	rc = sb_init();
+	if (rc) {
+		errmsg = "sb init failed";
+		goto errout;
+	}
+
 	rc = smap_init();
 	if (rc) {
 		errmsg = "smap init failed";
diff --git a/drivers/mpool/sb.c b/drivers/mpool/sb.c
new file mode 100644
index 000000000000..c161eff2bc0d
--- /dev/null
+++ b/drivers/mpool/sb.c
@@ -0,0 +1,625 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Superblock module.
+ *
+ * Defines functions for managing per drive superblocks.
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/uio.h>
+
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "mpool_ioctl.h"
+#include "pd.h"
+#include "omf_if.h"
+#include "sb.h"
+#include "mclass.h"
+
+/* Cleared out sb */
+static struct omf_sb_descriptor SBCLEAR;
+
+/*
+ * Drives have 2 superblocks.
+ * + sb0 at byte offset 0
+ * + sb1 at byte offset SB_AREA_SZ + MDC0MD_AREA_SZ
+ *
+ * Read: sb0 is the authoritative copy, other copies are not used.
+ * Updates: sb0 is updated first; if successful sb1 is updated
+ */
+
+/*
+ * sb internal functions
+ */
+
+/**
+ * sb_prop_valid() - Validate the PD properties needed to read or erase the superblocks.
+ *
+ * When the superblocks are read, the zone parameters may not be known
+ * yet; they may be obtained from the superblocks themselves.
+ *
+ * Returns: true if we have enough to read the superblocks.
+ */
+static bool sb_prop_valid(struct pd_dev_parm *dparm)
+{
+	struct pd_prop *pd_prop = &dparm->dpr_prop;
+
+	if (SB_AREA_SZ < OMF_SB_DESC_PACKLEN) {
+		/* Guarantee that the SB area is large enough to hold an SB */
+		mp_pr_err("sb(%s): structure too big %lu %lu",
+			  -EINVAL, dparm->dpr_name, (ulong)SB_AREA_SZ, OMF_SB_DESC_PACKLEN);
+		return false;
+	}
+
+	if ((pd_prop->pdp_devtype != PD_DEV_TYPE_BLOCK_STD) &&
+	    (pd_prop->pdp_devtype != PD_DEV_TYPE_BLOCK_NVDIMM) &&
+	    (pd_prop->pdp_devtype != PD_DEV_TYPE_FILE)) {
+		mp_pr_err("sb(%s): unknown device type %d",
+			  -EINVAL, dparm->dpr_name, pd_prop->pdp_devtype);
+		return false;
+	}
+
+	if (PD_LEN(pd_prop) == 0) {
+		mp_pr_err("sb(%s): unknown device size", -EINVAL, dparm->dpr_name);
+		return false;
+	}
+
+	return true;
+}
+
+static u64 sb_idx2woff(u32 idx)
+{
+	return (u64)idx * (SB_AREA_SZ + MDC0MD_AREA_SZ);
+}
+
+/**
+ * sb_parm_valid() - Validate parameters passed to an SB API function
+ * @dparm: device parameters
+ *
+ * When this function is called it is assumed that the zone parameters of the
+ * PD are already known.
+ *
+ * Part of the validation enforces the rule from the comment above that there
+ * must be at least one zone beyond those consumed by the SB_SB_COUNT
+ * superblocks.
+ *
+ * Returns: true if drive pd meets criteria for sb, false otherwise.
+ */
+static bool sb_parm_valid(struct pd_dev_parm *dparm)
+{
+	struct pd_prop *pd_prop = &dparm->dpr_prop;
+	u32 cnt;
+
+	if (SB_AREA_SZ < OMF_SB_DESC_PACKLEN) {
+		/* Guarantee that the SB area is large enough to hold an SB */
+		return false;
+	}
+
+	if (pd_prop->pdp_zparam.dvb_zonepg == 0) {
+		/* Zone size can't be 0. */
+		return false;
+	}
+
+	cnt = sb_zones_for_sbs(pd_prop);
+	if (cnt < 1) {
+		/* At least one zone is needed to hold SB0 and SB1. */
+		return false;
+	}
+
+	if (dparm->dpr_zonetot < (cnt + 1)) {
+		/* Guarantee that there is at least one zone not consumed by SBs. */
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Write packed superblock in outbuf to sb copy number idx on drive pd.
+ * Returns: 0 if successful; -errno otherwise...
+ */
+static int sb_write_sbx(struct pd_dev_parm *dparm, char *outbuf, u32 idx)
+{
+	const struct kvec iov = { outbuf, SB_AREA_SZ };
+	u64 woff;
+	int rc;
+
+	woff = sb_idx2woff(idx);
+
+	rc = pd_zone_pwritev_sync(dparm, &iov, 1, 0, woff);
+	if (rc) {
+		mp_pr_err("sb(%s, %d): write failed, woff %lu",
+			  rc, dparm->dpr_name, idx, (ulong)woff);
+		return rc;
+	}
+
+	return 0;
+}
+
+/*
+ * Read packed superblock into inbuf from sb copy number idx.
+ * Returns: 0 if successful; -errno otherwise...
+ *
+ */
+static int sb_read_sbx(struct pd_dev_parm *dparm, char *inbuf, u32 idx)
+{
+	const struct kvec  iov = { inbuf, SB_AREA_SZ };
+	u64 woff;
+	int rc;
+
+	woff = sb_idx2woff(idx);
+
+	rc = pd_zone_preadv(dparm, &iov, 1, 0, woff);
+	if (rc) {
+		mp_pr_err("sb(%s, %d): read failed, woff %lu",
+			  rc, dparm->dpr_name, idx, (ulong)woff);
+		return rc;
+	}
+
+	return 0;
+}
+
+/*
+ * sb API functions
+ */
+
+_Static_assert(SB_AREA_SZ >= OMF_SB_DESC_PACKLEN, "sb_area_sz < omf_sb_desc_packlen");
+
+/*
+ * Determine if the mpool magic value exists in at least one place where
+ * expected on drive pd.  Does NOT imply drive has a valid superblock.
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Returns: 1 if found, 0 if not found, -(errno) if error reading
+ *
+ */
+int sb_magic_check(struct pd_dev_parm *dparm)
+{
+	int rval = 0, i;
+	char *inbuf;
+	int rc;
+
+	if (!sb_prop_valid(dparm)) {
+		rc = -EINVAL;
+		mp_pr_err("sb(%s): invalid param, zonepg %u zonetot %u",
+			  rc, dparm->dpr_name, dparm->dpr_zonepg, dparm->dpr_zonetot);
+		return rc;
+	}
+
+	inbuf = kmalloc_large(SB_AREA_SZ + 1, GFP_KERNEL);
+	if (!inbuf) {
+		rc = -ENOMEM;
+		mp_pr_err("sb(%s) magic check: buffer alloc failed", rc, dparm->dpr_name);
+		return rc;
+	}
+
+	for (i = 0; i < SB_SB_COUNT; i++) {
+		const struct kvec iov = { inbuf, SB_AREA_SZ };
+		u64 woff = sb_idx2woff(i);
+
+		memset(inbuf, 0, SB_AREA_SZ);
+
+		rc = pd_zone_preadv(dparm, &iov, 1, 0, woff);
+		if (rc) {
+			rval = rc;
+			mp_pr_err("sb(%s, %d) magic: read failed, woff %lu",
+				  rc, dparm->dpr_name, i, (ulong)woff);
+		} else if (omf_sb_has_magic_le(inbuf)) {
+			kfree(inbuf);
+			return 1;
+		}
+	}
+
+	kfree(inbuf);
+	return rval;
+}
+
+/*
+ * Write superblock sb to new (non-pool) drive
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Returns: 0 if successful; -errno otherwise...
+ *
+ */
+int sb_write_new(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb)
+{
+	char *outbuf;
+	int rc, i;
+
+	if (!sb_parm_valid(dparm)) {
+		rc = -EINVAL;
+		mp_pr_err("sb(%s) invalid param, zonepg %u zonetot %u",
+			  rc, dparm->dpr_name, dparm->dpr_zonepg, dparm->dpr_zonetot);
+		return rc;
+	}
+
+	outbuf = kmalloc_large(SB_AREA_SZ + 1, GFP_KERNEL);
+	if (!outbuf)
+		return -ENOMEM;
+
+	memset(outbuf, 0, SB_AREA_SZ);
+
+	rc = omf_sb_pack_htole(sb, outbuf);
+	if (rc) {
+		mp_pr_err("sb(%s) packing failed", rc, dparm->dpr_name);
+		kfree(outbuf);
+		return rc;
+	}
+
+	/*
+	 * Since pd is not yet a pool member, only succeed if all sb
+	 * copies are written.
+	 */
+	for (i = 0; i < SB_SB_COUNT; i++) {
+		rc = sb_write_sbx(dparm, outbuf, i);
+		if (rc) {
+			mp_pr_err("sb(%s, %d): write sbx failed", rc, dparm->dpr_name, i);
+			break;
+		}
+	}
+
+	kfree(outbuf);
+	return rc;
+}
+
+/*
+ * Update superblock on pool drive
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Returns: 0 if successful; -errno otherwise..
+ *
+ */
+int sb_write_update(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb)
+{
+	char *outbuf;
+	int rc, i;
+
+	if (!sb_parm_valid(dparm)) {
+		rc = -EINVAL;
+		mp_pr_err("sb(%s) invalid param, zonepg %u zonetot %u partlen %lu",
+			  rc, dparm->dpr_name, dparm->dpr_zonepg, dparm->dpr_zonetot,
+			  (ulong)PD_LEN(&dparm->dpr_prop));
+		return rc;
+	}
+
+	outbuf = kmalloc_large(SB_AREA_SZ + 1, GFP_KERNEL);
+	if (!outbuf)
+		return -ENOMEM;
+
+	memset(outbuf, 0, SB_AREA_SZ);
+
+	rc = omf_sb_pack_htole(sb, outbuf);
+	if (rc) {
+		mp_pr_err("sb(%s) packing failed", rc, dparm->dpr_name);
+		kfree(outbuf);
+		return rc;
+	}
+
+	/* Update sb0 first and then sb1 if that is successful */
+	for (i = 0; i < SB_SB_COUNT; i++) {
+		rc = sb_write_sbx(dparm, outbuf, i);
+		if (rc) {
+			mp_pr_err("sb(%s, %d) sbx write failed", rc, dparm->dpr_name, i);
+			if (i == 0)
+				break;
+			rc = 0;
+		}
+	}
+
+	kfree(outbuf);
+
+	return rc;
+}
+
+/*
+ * Erase superblock on drive pd.
+ *
+ * Note: only pd properties must be set.
+ *
+ * Returns: 0 if successful; -errno otherwise...
+ *
+ */
+int sb_erase(struct pd_dev_parm *dparm)
+{
+	int rc = 0, i;
+	char *buf;
+
+	if (!sb_prop_valid(dparm)) {
+		rc = -EINVAL;
+		mp_pr_err("sb(%s) invalid param, zonepg %u zonetot %u", rc, dparm->dpr_name,
+			  dparm->dpr_zonepg, dparm->dpr_zonetot);
+		return rc;
+	}
+
+	buf = kmalloc_large(SB_AREA_SZ + 1, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	memset(buf, 0, SB_AREA_SZ);
+
+	for (i = 0; i < SB_SB_COUNT; i++) {
+		const struct kvec iov = { buf, SB_AREA_SZ };
+		u64 woff = sb_idx2woff(i);
+
+		rc = pd_zone_pwritev_sync(dparm, &iov, 1, 0, woff);
+		if (rc)
+			mp_pr_err("sb(%s, %d): erase failed", rc, dparm->dpr_name, i);
+	}
+
+	kfree(buf);
+
+	return rc;
+}
+
+static int sb_reconcile(struct omf_sb_descriptor *sb, struct pd_dev_parm *dparm, bool force)
+{
+	struct omf_devparm_descriptor *sb_parm = &sb->osb_parm;
+	struct pd_prop *pd_prop = &dparm->dpr_prop;
+	struct mc_parms mc_parms;
+	int rc;
+
+	pd_prop->pdp_mclassp = sb_parm->odp_mclassp;
+	pd_prop->pdp_zparam.dvb_zonepg = sb_parm->odp_zonepg;
+	pd_prop->pdp_zparam.dvb_zonetot = sb_parm->odp_zonetot;
+
+	if (force)
+		return 0;
+
+	if (pd_prop->pdp_devsz < sb_parm->odp_devsz) {
+		rc = -EINVAL;
+
+		mp_pr_err("sb(%s): devsz(%lu) > discovered (%lu)",
+			  rc, dparm->dpr_name, (ulong)sb_parm->odp_devsz,
+			  (ulong)pd_prop->pdp_devsz);
+		return rc;
+	}
+
+	if (PD_SECTORSZ(pd_prop) != sb_parm->odp_sectorsz) {
+		rc = -EINVAL;
+
+		mp_pr_err("sb(%s) sector size(%u) mismatches discovered(%u)",
+			  rc, dparm->dpr_name, sb_parm->odp_sectorsz, PD_SECTORSZ(pd_prop));
+		return rc;
+	}
+
+	if (pd_prop->pdp_devtype != sb_parm->odp_devtype) {
+		rc = -EINVAL;
+
+		mp_pr_err("sb(%s), pd type(%u) mismatches discovered(%u)",
+			  rc, dparm->dpr_name, sb_parm->odp_devtype, pd_prop->pdp_devtype);
+		return rc;
+	}
+
+	mc_pd_prop2mc_parms(pd_prop, &mc_parms);
+	if (mc_parms.mcp_features != sb_parm->odp_features) {
+		rc = -EINVAL;
+
+		mp_pr_err("sb(%s), pd features(%lu) mismatches discovered(%lu)",
+			  rc, dparm->dpr_name, (ulong)sb_parm->odp_features,
+			  (ulong)mc_parms.mcp_features);
+		return rc;
+	}
+
+	return 0;
+}
+
+/*
+ * Read superblock from drive pd.
+ *
+ * Note: only pd.status and pd.parm must be set; no other pd fields accessed.
+ *
+ * Returns: 0 if successful; -errno otherwise...
+ *
+ */
+int sb_read(struct pd_dev_parm *dparm, struct omf_sb_descriptor *sb, u16 *omf_ver, bool force)
+{
+	struct omf_sb_descriptor *sbtmp;
+	int rc = 0, i;
+	char *buf;
+
+	if (!sb_prop_valid(dparm)) {
+		rc = -EINVAL;
+		mp_pr_err("sb(%s) invalid parameter, zonepg %u zonetot %u",
+			  rc, dparm->dpr_name, dparm->dpr_zonepg, dparm->dpr_zonetot);
+		return rc;
+	}
+
+	sbtmp = kzalloc(sizeof(*sbtmp), GFP_KERNEL);
+	if (!sbtmp)
+		return -ENOMEM;
+
+	buf = kmalloc_large(SB_AREA_SZ + 1, GFP_KERNEL);
+	if (!buf) {
+		kfree(sbtmp);
+		return -ENOMEM;
+	}
+
+	/*
+	 * In 1.0, voting and SB generation numbers across the drive SBs are
+	 * not used. There is one authoritative replica, SB0.
+	 * SB1 is only used for debugging.
+	 */
+	for (i = 0; i < SB_SB_COUNT; i++) {
+		memset(buf, 0, SB_AREA_SZ);
+
+		rc = sb_read_sbx(dparm, buf, i);
+		if (rc)
+			mp_pr_err("sb(%s, %d) read sbx failed", rc, dparm->dpr_name, i);
+		else {
+			rc = omf_sb_unpack_letoh(sbtmp, buf, omf_ver);
+			if (rc)
+				mp_pr_err("sb(%s, %d) bad magic/version/cksum",
+					  rc, dparm->dpr_name, i);
+			else if (i == 0)
+				/* Deep copy to main struct  */
+				*sb = *sbtmp;
+		}
+		if (rc && (i == 0)) {
+			/*
+			 * SB0 has the authoritative replica of
+			 * MDC0 metadata. We need it.
+			 */
+			goto exit;
+		}
+	}
+
+	/*
+	 * Check that superblock SB0 is consistent and
+	 * update the PD properties from it.
+	 */
+	rc = sb_reconcile(sb, dparm, force);
+
+exit:
+	kfree(sbtmp);
+	kfree(buf);
+	return rc;
+}
+
+/*
+ * Clear (set to zeros) mdc0 portion of sb.
+ */
+void sbutil_mdc0_clear(struct omf_sb_descriptor *sb)
+{
+	sb->osb_mdc01gen = 0;
+	sb->osb_mdc01desc.ol_zcnt = 0;
+	mpool_uuid_clear(&sb->osb_mdc01uuid);
+
+	mpool_uuid_clear(&sb->osb_mdc01devid);
+	sb->osb_mdc01desc.ol_zaddr = 0;
+
+	sb->osb_mdc02gen = 0;
+	sb->osb_mdc02desc.ol_zcnt = 0;
+	mpool_uuid_clear(&sb->osb_mdc02uuid);
+
+	mpool_uuid_clear(&sb->osb_mdc02devid);
+	sb->osb_mdc02desc.ol_zaddr = 0;
+
+	mpool_uuid_clear(&sb->osb_mdc0dev.odp_devid);
+	sb->osb_mdc0dev.odp_zonetot = 0;
+	sb->osb_mdc0dev.odp_zonepg = 0;
+	sb->osb_mdc0dev.odp_mclassp = 0;
+	sb->osb_mdc0dev.odp_devtype = 0;
+	sb->osb_mdc0dev.odp_sectorsz = 0;
+	sb->osb_mdc0dev.odp_features = 0;
+}
+
+/*
+ * Copy mdc0 portion of srcsb to tgtsb.
+ */
+void sbutil_mdc0_copy(struct omf_sb_descriptor *tgtsb, struct omf_sb_descriptor *srcsb)
+{
+	tgtsb->osb_mdc01gen = srcsb->osb_mdc01gen;
+	mpool_uuid_copy(&tgtsb->osb_mdc01uuid, &srcsb->osb_mdc01uuid);
+	mpool_uuid_copy(&tgtsb->osb_mdc01devid, &srcsb->osb_mdc01devid);
+	tgtsb->osb_mdc01desc.ol_zcnt = srcsb->osb_mdc01desc.ol_zcnt;
+	tgtsb->osb_mdc01desc.ol_zaddr = srcsb->osb_mdc01desc.ol_zaddr;
+
+	tgtsb->osb_mdc02gen = srcsb->osb_mdc02gen;
+	mpool_uuid_copy(&tgtsb->osb_mdc02uuid, &srcsb->osb_mdc02uuid);
+	mpool_uuid_copy(&tgtsb->osb_mdc02devid, &srcsb->osb_mdc02devid);
+	tgtsb->osb_mdc02desc.ol_zcnt = srcsb->osb_mdc02desc.ol_zcnt;
+	tgtsb->osb_mdc02desc.ol_zaddr = srcsb->osb_mdc02desc.ol_zaddr;
+
+	mpool_uuid_copy(&tgtsb->osb_mdc0dev.odp_devid, &srcsb->osb_mdc0dev.odp_devid);
+	tgtsb->osb_mdc0dev.odp_devsz    = srcsb->osb_mdc0dev.odp_devsz;
+	tgtsb->osb_mdc0dev.odp_zonetot  = srcsb->osb_mdc0dev.odp_zonetot;
+	tgtsb->osb_mdc0dev.odp_zonepg   = srcsb->osb_mdc0dev.odp_zonepg;
+	tgtsb->osb_mdc0dev.odp_mclassp  = srcsb->osb_mdc0dev.odp_mclassp;
+	tgtsb->osb_mdc0dev.odp_devtype  = srcsb->osb_mdc0dev.odp_devtype;
+	tgtsb->osb_mdc0dev.odp_sectorsz = srcsb->osb_mdc0dev.odp_sectorsz;
+	tgtsb->osb_mdc0dev.odp_features = srcsb->osb_mdc0dev.odp_features;
+}
+
+/*
+ * Compare mdc0 portions of sb1 and sb2.
+ */
+static int sbutil_mdc0_eq(struct omf_sb_descriptor *sb1, struct omf_sb_descriptor *sb2)
+{
+	if (sb1->osb_mdc01gen != sb2->osb_mdc01gen ||
+	    sb1->osb_mdc01desc.ol_zcnt != sb2->osb_mdc01desc.ol_zcnt)
+		return 0;
+
+	if (mpool_uuid_compare(&sb1->osb_mdc01devid, &sb2->osb_mdc01devid) ||
+	    sb1->osb_mdc01desc.ol_zaddr != sb2->osb_mdc01desc.ol_zaddr)
+		return 0;
+
+	if (sb1->osb_mdc02gen != sb2->osb_mdc02gen ||
+	    sb1->osb_mdc02desc.ol_zcnt != sb2->osb_mdc02desc.ol_zcnt)
+		return 0;
+
+	if (mpool_uuid_compare(&sb1->osb_mdc02devid, &sb2->osb_mdc02devid) ||
+	    sb1->osb_mdc02desc.ol_zaddr != sb2->osb_mdc02desc.ol_zaddr)
+		return 0;
+
+	if (mpool_uuid_compare(&sb1->osb_mdc0dev.odp_devid, &sb2->osb_mdc0dev.odp_devid) ||
+	    sb1->osb_mdc0dev.odp_zonetot != sb2->osb_mdc0dev.odp_zonetot ||
+	    mc_cmp_omf_devparm(&sb1->osb_mdc0dev, &sb2->osb_mdc0dev))
+		return 0;
+
+	return 1;
+}
+
+/**
+ * sbutil_mdc0_isclear() - returns 1 if there is no MDC0 metadata in the
+ *	                   mdc0 portion of the super block.
+ * @sb: superblock descriptor to check
+ *
+ * Some fields in the MDC0 portion of "sb" may not be 0 even if there is no
+ * MDC0 metadata present. This is due to metadata upgrade: an upgrade may
+ * have to place a specific (non-zero) value in a field that did not exist
+ * in a previous metadata version to indicate that the value is invalid.
+ */
+int sbutil_mdc0_isclear(struct omf_sb_descriptor *sb)
+{
+	return sbutil_mdc0_eq(&SBCLEAR, sb);
+}
+
+/*
+ * Validate mdc0 portion of sb
+ * Returns: 1 if valid; 0 otherwise.
+ */
+int sbutil_mdc0_isvalid(struct omf_sb_descriptor *sb)
+{
+	/* Basic consistency validation; can make more extensive as needed */
+
+	if (mpool_uuid_compare(&sb->osb_mdc01devid, &sb->osb_mdc02devid) ||
+	    mpool_uuid_compare(&sb->osb_mdc01devid, &sb->osb_mdc0dev.odp_devid))
+		return 0;
+
+	if (mpool_uuid_is_null(&sb->osb_mdc01devid))
+		return 0;
+
+	if (mpool_uuid_is_null(&sb->osb_parm.odp_devid))
+		return 0;
+
+	/* Confirm this drive is supposed to contain this mdc0 info */
+	if (mpool_uuid_compare(&sb->osb_mdc01devid, &sb->osb_parm.odp_devid))
+		return 0;
+
+	/* Found this drive in mdc0 strip list; confirm param and ownership */
+	if (mc_cmp_omf_devparm(&sb->osb_parm, &sb->osb_mdc0dev))
+		return 0;
+
+	return (sb->osb_mdc01desc.ol_zcnt == sb->osb_mdc02desc.ol_zcnt);
+}
+
+int sb_init(void)
+{
+	sbutil_mdc0_clear(&SBCLEAR);
+
+	return 0;
+}
+
+void sb_exit(void)
+{
+}
-- 
2.17.2



* [PATCH v2 08/22] mpool: add pool metadata routines to manage object lifecycle and IO
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (6 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 07/22] mpool: add superblock management routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 09/22] mpool: add mblock lifecycle management and IO routines Nabeel M Mohamed
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Metadata manager interface to allocate, commit, abort, erase, read,
write, and destroy objects.

Metadata containers (MDC-1 through MDC-255) store the metadata for
accessing client allocated objects (mblocks and mlogs). An object
identifier for an mblock or mlog encodes the MDC storing its
metadata for faster lookup.
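As a rough illustration of how an object ID can carry its MDC slot, consider the hypothetical encoding below. The actual bit layout is defined by the mpool headers (objid_slot() and friends), so the field widths here are assumptions, not the driver's real format:

```c
#include <stdint.h>

/*
 * Hypothetical object ID layout for illustration only: the low 8 bits
 * hold the MDC slot (1-255) and the remaining bits hold a unique
 * object number. The real encoding lives in the mpool headers.
 */
static uint64_t objid_make(uint64_t uniq, uint8_t slot)
{
	return (uniq << 8) | slot;
}

/* Recover the MDC slot that stores this object's metadata. */
static uint8_t objid_slot(uint64_t objid)
{
	return objid & 0xffu;
}
```

Encoding the slot in the ID lets a lookup go straight to the owning MDC without consulting a separate mapping.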

The layout descriptor (struct pmd_layout) is the in-memory
representation of an object. It is a refcounted structure storing
the object ID, state, device ID, zone start address and the number
of consecutive zones, and an rw semaphore protecting the layout
fields.

An object is in uncommitted state on allocation with its storage
reserved from the free space map. An uncommitted object lives only
in memory, i.e., no metadata records are logged. An object's
metadata is persisted in the metadata container only at commit
time. The uncommitted and committed objects are tracked using
separate rbtrees stored in each metadata container. Support for
object persistence is added in a future patch.
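The find and insert-or-report-collision contract of those trees can be sketched outside the kernel with a plain binary search tree keyed on the object ID; the driver itself uses <linux/rbtree.h> red-black trees (pmd_layout_find() and pmd_layout_insert() in this patch), but the semantics are the same:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a per-MDC object tree node. */
struct layout {
	uint64_t objid;
	struct layout *left, *right;
};

/* Return the node with the given object ID, or NULL if absent. */
static struct layout *layout_find(struct layout *root, uint64_t objid)
{
	while (root) {
		if (objid < root->objid)
			root = root->left;
		else if (objid > root->objid)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/*
 * Insert item; return NULL on success, or the colliding node if an
 * object with the same ID is already in the tree (as pmd_layout_insert
 * does).
 */
static struct layout *layout_insert(struct layout **rootp, struct layout *item)
{
	while (*rootp) {
		if (item->objid < (*rootp)->objid)
			rootp = &(*rootp)->left;
		else if (item->objid > (*rootp)->objid)
			rootp = &(*rootp)->right;
		else
			return *rootp;
	}
	item->left = item->right = NULL;
	*rootp = item;
	return NULL;
}
```

Returning the colliding node instead of an error code lets the caller decide whether a duplicate ID is a bug or an expected idempotent retry.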

The metadata manager interfaces with the PD layer to discard the
zones at object abort/delete, flush the device cache at object
commit and issue read/write requests to the block layer.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c    |    8 +
 drivers/mpool/omf.c     |    2 -
 drivers/mpool/pmd_obj.c | 1577 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 1585 insertions(+), 2 deletions(-)
 create mode 100644 drivers/mpool/pmd_obj.c

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 261ce67e94dd..eb1217f63746 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -10,6 +10,7 @@
 #include "omf_if.h"
 #include "pd.h"
 #include "smap.h"
+#include "pmd_obj.h"
 #include "sb.h"
 
 /*
@@ -25,6 +26,7 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 
 static void mpool_exit_impl(void)
 {
+	pmd_exit();
 	smap_exit();
 	sb_exit();
 	omf_exit();
@@ -60,6 +62,12 @@ static __init int mpool_init(void)
 		goto errout;
 	}
 
+	rc = pmd_init();
+	if (rc) {
+		errmsg = "pmd init failed";
+		goto errout;
+	}
+
 errout:
 	if (rc) {
 		mp_pr_err("%s", rc, errmsg);
diff --git a/drivers/mpool/omf.c b/drivers/mpool/omf.c
index 8e612d35b19a..9f83fb20c4a8 100644
--- a/drivers/mpool/omf.c
+++ b/drivers/mpool/omf.c
@@ -580,10 +580,8 @@ static int omf_pmd_layout_unpack_letoh(struct mpool_descriptor *mp, struct omf_m
 		return rc;
 	}
 
-#ifdef COMP_PMD_ENABLED
 	ecl = pmd_layout_alloc(&cdr->u.obj.omd_uuid, cdr->u.obj.omd_objid, cdr->u.obj.omd_gen,
 			       cdr->u.obj.omd_mblen, cdr->u.obj.omd_old.ol_zcnt);
-#endif
 	if (!ecl) {
 		rc = -ENOMEM;
 		mp_pr_err("mpool %s, unpacking layout failed, could not allocate layout structure",
diff --git a/drivers/mpool/pmd_obj.c b/drivers/mpool/pmd_obj.c
new file mode 100644
index 000000000000..8966fc0abd0e
--- /dev/null
+++ b/drivers/mpool/pmd_obj.c
@@ -0,0 +1,1577 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Pool metadata (pmd) module.
+ *
+ * Defines functions for probing, reading, and writing drives in an mpool.
+ *
+ */
+
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/atomic.h>
+#include <linux/delay.h>
+
+#include "mpool_printk.h"
+#include "uuid.h"
+#include "assert.h"
+
+#include "pd.h"
+#include "omf_if.h"
+#include "sb.h"
+#include "mclass.h"
+#include "smap.h"
+#include "mpcore.h"
+#include "pmd.h"
+
+static struct kmem_cache *pmd_obj_erase_work_cache __read_mostly;
+static struct kmem_cache *pmd_layout_priv_cache __read_mostly;
+static struct kmem_cache *pmd_layout_cache __read_mostly;
+
+static int pmd_mdc0_meta_update(struct mpool_descriptor *mp, struct pmd_layout *layout);
+static struct pmd_layout *pmd_layout_find(struct rb_root *root, u64 key);
+static struct pmd_layout *pmd_layout_insert(struct rb_root *root, struct pmd_layout *item);
+
+/* Committed object tree operations... */
+void pmd_co_rlock(struct pmd_mdc_info *cinfo, u8 slot)
+{
+	down_read_nested(&cinfo->mmi_co_lock, slot > 0 ? PMD_MDC_NORMAL : PMD_MDC_ZERO);
+}
+
+void pmd_co_runlock(struct pmd_mdc_info *cinfo)
+{
+	up_read(&cinfo->mmi_co_lock);
+}
+
+static void pmd_co_wlock(struct pmd_mdc_info *cinfo, u8 slot)
+{
+	down_write_nested(&cinfo->mmi_co_lock, slot > 0 ? PMD_MDC_NORMAL : PMD_MDC_ZERO);
+}
+
+static void pmd_co_wunlock(struct pmd_mdc_info *cinfo)
+{
+	up_write(&cinfo->mmi_co_lock);
+}
+
+struct pmd_layout *pmd_co_find(struct pmd_mdc_info *cinfo, u64 objid)
+{
+	return pmd_layout_find(&cinfo->mmi_co_root, objid);
+}
+
+struct pmd_layout *pmd_co_insert(struct pmd_mdc_info *cinfo, struct pmd_layout *layout)
+{
+	return pmd_layout_insert(&cinfo->mmi_co_root, layout);
+}
+
+struct pmd_layout *pmd_co_remove(struct pmd_mdc_info *cinfo, struct pmd_layout *layout)
+{
+	struct pmd_layout *found;
+
+	found = pmd_co_find(cinfo, layout->eld_objid);
+	if (found)
+		rb_erase(&found->eld_nodemdc, &cinfo->mmi_co_root);
+
+	return found;
+}
+
+/* Uncommitted object tree operations... */
+static void pmd_uc_lock(struct pmd_mdc_info *cinfo, u8 slot)
+{
+	mutex_lock_nested(&cinfo->mmi_uc_lock, slot > 0 ? PMD_MDC_NORMAL : PMD_MDC_ZERO);
+}
+
+static void pmd_uc_unlock(struct pmd_mdc_info *cinfo)
+{
+	mutex_unlock(&cinfo->mmi_uc_lock);
+}
+
+static struct pmd_layout *pmd_uc_find(struct pmd_mdc_info *cinfo, u64 objid)
+{
+	return pmd_layout_find(&cinfo->mmi_uc_root, objid);
+}
+
+static struct pmd_layout *pmd_uc_insert(struct pmd_mdc_info *cinfo, struct pmd_layout *layout)
+{
+	return pmd_layout_insert(&cinfo->mmi_uc_root, layout);
+}
+
+static struct pmd_layout *pmd_uc_remove(struct pmd_mdc_info *cinfo, struct pmd_layout *layout)
+{
+	struct pmd_layout *found;
+
+	found = pmd_uc_find(cinfo, layout->eld_objid);
+	if (found)
+		rb_erase(&found->eld_nodemdc, &cinfo->mmi_uc_root);
+
+	return found;
+}
+
+/*
+ * General object operations for both internal and external callers...
+ *
+ * See pmd.h for the various nesting levels for a locking class.
+ */
+void pmd_obj_rdlock(struct pmd_layout *layout)
+{
+	enum pmd_lock_class lc __maybe_unused = PMD_MDC_NORMAL;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	if (objid_slot(layout->eld_objid))
+		lc = PMD_OBJ_CLIENT;
+	else if (objid_mdc0log(layout->eld_objid))
+		lc = PMD_MDC_ZERO;
+#endif
+
+	down_read_nested(&layout->eld_rwlock, lc);
+}
+
+void pmd_obj_rdunlock(struct pmd_layout *layout)
+{
+	up_read(&layout->eld_rwlock);
+}
+
+void pmd_obj_wrlock(struct pmd_layout *layout)
+{
+	enum pmd_lock_class lc __maybe_unused = PMD_MDC_NORMAL;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	if (objid_slot(layout->eld_objid))
+		lc = PMD_OBJ_CLIENT;
+	else if (objid_mdc0log(layout->eld_objid))
+		lc = PMD_MDC_ZERO;
+#endif
+
+	down_write_nested(&layout->eld_rwlock, lc);
+}
+
+void pmd_obj_wrunlock(struct pmd_layout *layout)
+{
+	up_write(&layout->eld_rwlock);
+}
+
+/*
+ * Alloc and init object layout; non-arg fields and all strip descriptor
+ * fields are set to 0/UNDEF/NONE; no auxiliary object info is allocated.
+ *
+ * Returns NULL if allocation fails.
+ */
+struct pmd_layout *pmd_layout_alloc(struct mpool_uuid *uuid, u64 objid,
+				    u64 gen, u64 mblen, u32 zcnt)
+{
+	struct kmem_cache *cache = pmd_layout_cache;
+	struct pmd_layout *layout;
+
+	if (pmd_objid_type(objid) == OMF_OBJ_MLOG)
+		cache = pmd_layout_priv_cache;
+
+	layout = kmem_cache_zalloc(cache, GFP_KERNEL);
+	if (!layout)
+		return NULL;
+
+	layout->eld_objid     = objid;
+	layout->eld_gen       = gen;
+	layout->eld_mblen     = mblen;
+	layout->eld_ld.ol_zcnt = zcnt;
+	kref_init(&layout->eld_ref);
+	init_rwsem(&layout->eld_rwlock);
+
+	if (pmd_objid_type(objid) == OMF_OBJ_MLOG)
+		mpool_uuid_copy(&layout->eld_uuid, uuid);
+
+	return layout;
+}
+
+/*
+ * Deallocate all memory associated with object layout.
+ */
+void pmd_layout_release(struct kref *refp)
+{
+	struct kmem_cache *cache = pmd_layout_cache;
+	struct pmd_layout *layout;
+
+	layout = container_of(refp, typeof(*layout), eld_ref);
+
+	ASSERT(layout->eld_objid > 0);
+	ASSERT(kref_read(&layout->eld_ref) == 0);
+
+	if (pmd_objid_type(layout->eld_objid) == OMF_OBJ_MLOG)
+		cache = pmd_layout_priv_cache;
+
+	layout->eld_objid = 0;
+
+	kmem_cache_free(cache, layout);
+}
+
+static struct pmd_layout *pmd_layout_find(struct rb_root *root, u64 key)
+{
+	struct rb_node *node = root->rb_node;
+	struct pmd_layout *this;
+
+	while (node) {
+		this = rb_entry(node, typeof(*this), eld_nodemdc);
+
+		if (key < this->eld_objid)
+			node = node->rb_left;
+		else if (key > this->eld_objid)
+			node = node->rb_right;
+		else
+			return this;
+	}
+
+	return NULL;
+}
+
+static struct pmd_layout *pmd_layout_insert(struct rb_root *root, struct pmd_layout *item)
+{
+	struct rb_node **pos = &root->rb_node, *parent = NULL;
+	struct pmd_layout *this;
+
+	/*
+	 * Figure out where to insert given layout, or return the colliding
+	 * layout if there's already a layout in the tree with the given ID.
+	 */
+	while (*pos) {
+		this = rb_entry(*pos, typeof(*this), eld_nodemdc);
+		parent = *pos;
+
+		if (item->eld_objid < this->eld_objid)
+			pos = &(*pos)->rb_left;
+		else if (item->eld_objid > this->eld_objid)
+			pos = &(*pos)->rb_right;
+		else
+			return this;
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&item->eld_nodemdc, parent, pos);
+	rb_insert_color(&item->eld_nodemdc, root);
+
+	return NULL;
+}
+
+static void pmd_layout_unprovision(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	int rc;
+	u16 pdh;
+
+	pdh = layout->eld_ld.ol_pdh;
+
+	/* smap_free() should never fail */
+
+	rc = smap_free(mp, pdh, layout->eld_ld.ol_zaddr, layout->eld_ld.ol_zcnt);
+	if (rc)
+		mp_pr_err("releasing %s drive %s space for layout failed, objid 0x%lx",
+			  rc, mp->pds_name, mp->pds_pdv[pdh].pdi_name, (ulong)layout->eld_objid);
+
+	/* Drop birth reference... */
+	pmd_obj_put(layout);
+}
+
+static void pmd_layout_calculate(struct mpool_descriptor *mp, struct pmd_obj_capacity *ocap,
+				 struct media_class *mc, u64 *zcnt)
+{
+	u32 zonepg;
+
+	if (!ocap->moc_captgt) {
+		/* Obj capacity not specified; use one zone. */
+		*zcnt = 1;
+		return;
+	}
+
+	zonepg = mp->pds_pdv[mc->mc_pdmc].pdi_parm.dpr_zonepg;
+	*zcnt = 1 + ((ocap->moc_captgt - 1) / (zonepg << PAGE_SHIFT));
+}
+
+/**
+ * pmd_layout_provision() - provision storage for the given layout
+ * @mp:     mpool descriptor
+ * @ocap:   object capacity requirements
+ * @layout: layout to provision
+ * @mc:     media class
+ * @zcnt:   number of zones to allocate
+ */
+static int pmd_layout_provision(struct mpool_descriptor *mp, struct pmd_obj_capacity *ocap,
+				struct pmd_layout *layout, struct media_class *mc, u64 zcnt)
+{
+	enum smap_space_type spctype;
+	struct mc_smap_parms mcsp;
+	u64 zoneaddr, align;
+	u8 pdh;
+	int rc;
+
+	spctype = SMAP_SPC_USABLE_ONLY;
+	if (ocap->moc_spare)
+		spctype = SMAP_SPC_SPARE_2_USABLE;
+
+	/* To reduce/eliminate fragmentation, make sure the alignment is a power of 2. */
+	rc = mc_smap_parms_get(&mp->pds_mc[mc->mc_parms.mcp_classp], &mp->pds_params, &mcsp);
+	if (rc)
+		return rc;
+
+	align = min_t(u64, zcnt, mcsp.mcsp_align);
+	align = roundup_pow_of_two(align);
+
+	pdh = mc->mc_pdmc;
+	rc = smap_alloc(mp, pdh, zcnt, spctype, &zoneaddr, align);
+	if (rc)
+		return rc;
+
+	layout->eld_ld.ol_pdh = pdh;
+	layout->eld_ld.ol_zaddr = zoneaddr;
+
+	return 0;
+}
+
+int pmd_layout_rw(struct mpool_descriptor *mp, struct pmd_layout *layout,
+		  const struct kvec *iov, int iovcnt, u64 boff, int flags, u8 rw)
+{
+	struct mpool_dev_info *pd;
+	u64 zaddr;
+	int rc;
+
+	if (!mp || !layout || !iov)
+		return -EINVAL;
+
+	if (rw != MPOOL_OP_READ && rw != MPOOL_OP_WRITE)
+		return -EINVAL;
+
+	pd = &mp->pds_pdv[layout->eld_ld.ol_pdh];
+	if (mpool_pd_status_get(pd) != PD_STAT_ONLINE)
+		return -EIO;
+
+	if (iovcnt == 0)
+		return 0;
+
+	zaddr = layout->eld_ld.ol_zaddr;
+	if (rw == MPOOL_OP_READ)
+		rc = pd_zone_preadv(&pd->pdi_parm, iov, iovcnt, zaddr, boff);
+	else
+		rc = pd_zone_pwritev(&pd->pdi_parm, iov, iovcnt, zaddr, boff, flags);
+
+	if (rc)
+		mpool_pd_status_set(pd, PD_STAT_OFFLINE);
+
+	return rc;
+}
+
+int pmd_layout_erase(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct mpool_dev_info *pd;
+	int rc;
+
+	if (!mp || !layout)
+		return -EINVAL;
+
+	pd = &mp->pds_pdv[layout->eld_ld.ol_pdh];
+	if (mpool_pd_status_get(pd) != PD_STAT_ONLINE)
+		return -EIO;
+
+	rc = pd_zone_erase(&pd->pdi_parm, layout->eld_ld.ol_zaddr, layout->eld_ld.ol_zcnt,
+			   pmd_objid_type(layout->eld_objid) == OMF_OBJ_MLOG);
+	if (rc)
+		mpool_pd_status_set(pd, PD_STAT_OFFLINE);
+
+	return rc;
+}
+
+u64 pmd_layout_cap_get(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	enum obj_type_omf otype = pmd_objid_type(layout->eld_objid);
+	u32 zonepg;
+
+	switch (otype) {
+	case OMF_OBJ_MBLOCK:
+	case OMF_OBJ_MLOG:
+		zonepg = mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_parm.dpr_zonepg;
+		return ((u64)zonepg * layout->eld_ld.ol_zcnt) << PAGE_SHIFT;
+
+	case OMF_OBJ_UNDEF:
+		break;
+	}
+
+	mp_pr_warn("mpool %s objid 0x%lx, undefined object type %d",
+		   mp->pds_name, (ulong)layout->eld_objid, otype);
+
+	return 0;
+}
+
+struct mpool_dev_info *pmd_layout_pd_get(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	return &mp->pds_pdv[layout->eld_ld.ol_pdh];
+}
+
+int pmd_smap_insert(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	int rc;
+	u16 pdh;
+
+	pdh = layout->eld_ld.ol_pdh;
+
+	/* Insert should never fail */
+
+	rc = smap_insert(mp, pdh, layout->eld_ld.ol_zaddr, layout->eld_ld.ol_zcnt);
+	if (rc)
+		mp_pr_err("mpool %s, allocating drive %s space for layout failed, objid 0x%lx",
+			  rc, mp->pds_name, mp->pds_pdv[pdh].pdi_name, (ulong)layout->eld_objid);
+
+	return rc;
+}
+
+struct pmd_layout *pmd_obj_find_get(struct mpool_descriptor *mp, u64 objid, int which)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *found;
+	u8 cslot;
+
+	if (!objtype_user(objid_type(objid)))
+		return NULL;
+
+	cslot = objid_slot(objid);
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	found = NULL;
+
+	/*
+	 * which < 0  - search uncommitted tree only
+	 * which > 0  - search committed tree only
+	 * which == 0 - search both trees
+	 */
+	if (which <= 0) {
+		pmd_uc_lock(cinfo, cslot);
+		found = pmd_uc_find(cinfo, objid);
+		if (found)
+			kref_get(&found->eld_ref);
+		pmd_uc_unlock(cinfo);
+	}
+
+	if (!found && which >= 0) {
+		pmd_co_rlock(cinfo, cslot);
+		found = pmd_co_find(cinfo, objid);
+		if (found)
+			kref_get(&found->eld_ref);
+		pmd_co_runlock(cinfo);
+	}
+
+	return found;
+}
+
+int pmd_obj_alloc(struct mpool_descriptor *mp, enum obj_type_omf otype,
+		  struct pmd_obj_capacity *ocap, enum mp_media_classp mclassp,
+		  struct pmd_layout **layoutp)
+{
+	return pmd_obj_alloc_cmn(mp, 0, otype, ocap, mclassp, 0, true, layoutp);
+}
+
+int pmd_obj_realloc(struct mpool_descriptor *mp, u64 objid, struct pmd_obj_capacity *ocap,
+		    enum mp_media_classp mclassp, struct pmd_layout **layoutp)
+{
+	if (!pmd_objid_isuser(objid)) {
+		*layoutp = NULL;
+		return -EINVAL;
+	}
+
+	return pmd_obj_alloc_cmn(mp, objid, objid_type(objid), ocap, mclassp, 1, true, layoutp);
+}
+
+int pmd_obj_commit(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *found;
+	int rc = 0;
+	u8 cslot;
+
+	if (!objtype_user(objid_type(layout->eld_objid)))
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+	if (layout->eld_state & PMD_LYT_COMMITTED) {
+		pmd_obj_wrunlock(layout);
+		return 0;
+	}
+
+	/*
+	 * Must log the create record before marking the object committed to
+	 * guarantee it will exist after a crash. Must hold cinfo.compactlock
+	 * while logging the create, updating layout.state, and adding to the
+	 * list of committed objects, to prevent a race with mdc compaction.
+	 */
+	cslot = objid_slot(layout->eld_objid);
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+	pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+
+#ifdef OBJ_PERSISTENCE_ENABLED
+	rc = pmd_log_create(mp, layout);
+#endif
+	if (!rc) {
+		pmd_uc_lock(cinfo, cslot);
+		found = pmd_uc_remove(cinfo, layout);
+		pmd_uc_unlock(cinfo);
+
+		pmd_co_wlock(cinfo, cslot);
+		found = pmd_co_insert(cinfo, layout);
+		if (!found)
+			layout->eld_state |= PMD_LYT_COMMITTED;
+		pmd_co_wunlock(cinfo);
+
+		if (found) {
+			rc = -EEXIST;
+
+			/*
+			 * If objid exists in the committed object list this is
+			 * a SERIOUS bug that should never happen, so log a
+			 * critical message. We are stuck in this case because
+			 * we just logged a second create for an existing
+			 * object. If mdc compaction runs before a restart this
+			 * extraneous create record will be eliminated;
+			 * otherwise pmd_objs_load() will see the conflict and
+			 * fail the next mpool activation. We could make
+			 * pmd_objs_load() tolerate this, but for now it is
+			 * better to get an activation failure so that it's
+			 * obvious this bug occurred. The best we can do is put
+			 * the layout back in the uncommitted object list so the
+			 * caller can abort after getting the commit failure.
+			 */
+			mp_pr_crit("mpool %s, obj 0x%lx collided during commit",
+				   rc, mp->pds_name, (ulong)layout->eld_objid);
+
+			/* Put the object back in the uncommitted objects tree */
+			pmd_uc_lock(cinfo, cslot);
+			pmd_uc_insert(cinfo, layout);
+			pmd_uc_unlock(cinfo);
+		} else {
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_cr);
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_cobj);
+		}
+	}
+
+	pmd_mdc_unlock(&cinfo->mmi_compactlock);
+	pmd_obj_wrunlock(layout);
+
+	if (!rc)
+		pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_COMMIT);
+
+	return rc;
+}
+
+static void pmd_obj_erase_cb(struct work_struct *work)
+{
+	struct pmd_obj_erase_work *oef;
+	struct mpool_descriptor *mp;
+	struct pmd_layout *layout;
+
+	oef = container_of(work, struct pmd_obj_erase_work, oef_wqstruct);
+	mp = oef->oef_mp;
+	layout = oef->oef_layout;
+
+	pmd_layout_erase(mp, layout);
+
+	if (oef->oef_cache)
+		kmem_cache_free(oef->oef_cache, oef);
+
+	pmd_layout_unprovision(mp, layout);
+}
+
+static void pmd_obj_erase_start(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct pmd_obj_erase_work oefbuf, *oef;
+	bool async = true;
+
+	oef = kmem_cache_zalloc(pmd_obj_erase_work_cache, GFP_KERNEL);
+	if (!oef) {
+		oef = &oefbuf;
+		async = false;
+	}
+
+	/* If async, oef will be freed in pmd_obj_erase_cb() */
+	oef->oef_mp = mp;
+	oef->oef_layout = layout;
+	oef->oef_cache = async ? pmd_obj_erase_work_cache : NULL;
+	INIT_WORK(&oef->oef_wqstruct, pmd_obj_erase_cb);
+
+	queue_work(mp->pds_erase_wq, &oef->oef_wqstruct);
+
+	if (!async)
+		flush_work(&oef->oef_wqstruct);
+}
+
+int pmd_obj_abort(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *found;
+	long refcnt;
+	u8 cslot;
+
+	if (!objtype_user(objid_type(layout->eld_objid)))
+		return -EINVAL;
+
+	cslot = objid_slot(layout->eld_objid);
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	found = NULL;
+
+	pmd_obj_wrlock(layout);
+
+	pmd_uc_lock(cinfo, cslot);
+	refcnt = kref_read(&layout->eld_ref);
+	if (refcnt == 2) {
+		found = pmd_uc_remove(cinfo, layout);
+		if (found)
+			found->eld_state |= PMD_LYT_REMOVED;
+	}
+	pmd_uc_unlock(cinfo);
+
+	pmd_obj_wrunlock(layout);
+
+	if (!found)
+		return (refcnt > 2) ? -EBUSY : -EINVAL;
+
+	pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_ABORT);
+	pmd_obj_erase_start(mp, layout);
+
+	/* Drop caller's reference... */
+	pmd_obj_put(layout);
+
+	return 0;
+}
+
+int pmd_obj_delete(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *found;
+	long refcnt;
+	u64 objid;
+	u8 cslot;
+	int rc = 0;
+
+	if (!objtype_user(objid_type(layout->eld_objid)))
+		return -EINVAL;
+
+	objid = layout->eld_objid;
+	cslot = objid_slot(objid);
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	found = NULL;
+
+	/*
+	 * Must log the delete record before removing the object, for crash
+	 * recovery. Must hold cinfo.compactlock while logging the delete
+	 * record and removing the object from the list of committed objects,
+	 * to prevent a race with MDC compaction.
+	 */
+	pmd_obj_wrlock(layout);
+	pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+
+	refcnt = kref_read(&layout->eld_ref);
+	if (refcnt != 2) {
+		pmd_mdc_unlock(&cinfo->mmi_compactlock);
+		pmd_obj_wrunlock(layout);
+
+		return (refcnt > 2) ? -EBUSY : -EINVAL;
+	}
+
+#ifdef OBJ_PERSISTENCE_ENABLED
+	rc = pmd_log_delete(mp, objid);
+#endif
+	if (!rc) {
+		pmd_co_wlock(cinfo, cslot);
+		found = pmd_co_remove(cinfo, layout);
+		if (found)
+			found->eld_state |= PMD_LYT_REMOVED;
+		pmd_co_wunlock(cinfo);
+	}
+
+	pmd_mdc_unlock(&cinfo->mmi_compactlock);
+	pmd_obj_wrunlock(layout);
+
+	if (!found) {
+		mp_pr_rl("mpool %s, objid 0x%lx, pmd_log_delete failed",
+			 rc, mp->pds_name, (ulong)objid);
+		return rc;
+	}
+
+	atomic_inc(&cinfo->mmi_pco_cnt.pcc_del);
+	atomic_dec(&cinfo->mmi_pco_cnt.pcc_cobj);
+	pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_DELETE);
+	pmd_obj_erase_start(mp, layout);
+
+	/* Drop caller's reference... */
+	pmd_obj_put(layout);
+
+	return 0;
+}
+
+int pmd_obj_erase(struct mpool_descriptor *mp, struct pmd_layout *layout, u64 gen)
+{
+	u64 objid = layout->eld_objid;
+	int rc = 0;
+
+	if ((pmd_objid_type(objid) != OMF_OBJ_MLOG) ||
+	     (!(layout->eld_state & PMD_LYT_COMMITTED)) ||
+	     (layout->eld_state & PMD_LYT_REMOVED) || (gen <= layout->eld_gen)) {
+		mp_pr_warn("mpool %s, object erase failed to start, objid 0x%lx state 0x%x gen %lu",
+			   mp->pds_name, (ulong)objid, layout->eld_state, (ulong)gen);
+
+		return -EINVAL;
+	}
+
+	/*
+	 * Must log the higher gen number for the old active mlog before
+	 * updating object state (layout->eld_gen of the old active mlog).
+	 * This guarantees that an activation after a crash will know which
+	 * is the new active mlog.
+	 */
+
+	if (objid_mdc0log(objid)) {
+		/* Compact lock is held by the caller */
+
+		/*
+		 * Change MDC0 metadata image in RAM
+		 */
+		if (objid == MDC0_OBJID_LOG1)
+			mp->pds_sbmdc0.osb_mdc01gen = gen;
+		else
+			mp->pds_sbmdc0.osb_mdc02gen = gen;
+
+		/*
+		 * Write the updated MDC0 metadata in the super blocks of the
+		 * drives holding MDC0 metadata.
+		 * Note: for 1.0, there is only one drive.
+		 */
+		rc = pmd_mdc0_meta_update(mp, layout);
+		if (!rc)
+			/*
+			 * Update in-memory eld_gen, only if on-media
+			 * gen gets successfully updated
+			 */
+			layout->eld_gen = gen;
+	} else {
+		struct pmd_mdc_info *cinfo;
+		u8                   cslot;
+
+		/*
+		 * Take the MDC0 (or mlog MDCi for user MDC) compact lock to
+		 * avoid a race with MDC0 (or mlog MDCi) compaction.
+		 */
+		cslot = objid_slot(layout->eld_objid);
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+
+#ifdef OBJ_PERSISTENCE_ENABLED
+		rc = pmd_log_erase(mp, layout->eld_objid, gen);
+#endif
+		if (!rc) {
+			layout->eld_gen = gen;
+			if (cslot)
+				atomic_inc(&cinfo->mmi_pco_cnt.pcc_er);
+		}
+		pmd_mdc_unlock(&cinfo->mmi_compactlock);
+	}
+
+	return rc;
+}
+
+/**
+ * pmd_alloc_idgen() - generate an id for an allocated object.
+ * @mp:    mpool descriptor
+ * @otype: object type
+ * @objid: output
+ *
+ * Does a round robin over MDC1-MDC255, avoiding the ones that are
+ * candidates for pre-compaction.
+ *
+ * The round robin is biased toward the MDCs with the smaller number of
+ * objects, in order to recover from rare but very large allocation bursts.
+ * During an allocation, the MDCs that are candidates for pre-compaction are
+ * avoided. If the allocation is a big burst, the result is that these MDCs
+ * end up with far fewer objects than the others. If a relatively constant
+ * allocation rate follows the burst, the object deficit of the MDCs avoided
+ * during the burst would never be recovered without the bias; with it, all
+ * MDCs end up again with about the same number of objects after a while.
+ */
+static int pmd_alloc_idgen(struct mpool_descriptor *mp, enum obj_type_omf otype, u64 *objid)
+{
+	struct pmd_mdc_info *cinfo = NULL;
+	int rc = 0;
+	u8 cslot;
+	u32 tidx;
+
+	if (mp->pds_mda.mdi_slotvcnt < 2) {
+		/* No mdc available to assign object to; cannot use mdc0 */
+		rc = -ENOSPC;
+		mp_pr_err("mpool %s, no MDCi with i>0", rc, mp->pds_name);
+		*objid = 0;
+		return rc;
+	}
+
+	/* Get next mdc for allocation */
+	tidx = atomic_inc_return(&mp->pds_mda.mdi_sel.mds_tbl_idx) % MDC_TBL_SZ;
+	ASSERT(tidx < MDC_TBL_SZ);
+
+	cslot = mp->pds_mda.mdi_sel.mds_tbl[tidx];
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+	pmd_mdc_lock(&cinfo->mmi_uqlock, cslot);
+	*objid = objid_make(cinfo->mmi_luniq + 1, otype, cslot);
+	if (objid_ckpt(*objid)) {
+
+		/*
+		 * Must checkpoint objid before assigning it to an object
+		 * to guarantee that objid will not be reissued after a crash.
+		 * Must hold cinfo.compactlock while logging the checkpoint to
+		 * the mdc, to prevent a race with mdc compaction.
+		 */
+		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+#ifdef OBJ_PERSISTENCE_ENABLED
+		rc = pmd_log_idckpt(mp, *objid);
+#endif
+		if (!rc)
+			cinfo->mmi_lckpt = *objid;
+		pmd_mdc_unlock(&cinfo->mmi_compactlock);
+	}
+
+	if (!rc)
+		cinfo->mmi_luniq = cinfo->mmi_luniq + 1;
+	pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+	if (rc) {
+		mp_pr_rl("mpool %s, checkpoint append for objid 0x%lx failed",
+			 rc, mp->pds_name, (ulong)*objid);
+		*objid = 0;
+		return rc;
+	}
+
+	return 0;
+}
+
+static int pmd_realloc_idvalidate(struct mpool_descriptor *mp, u64 objid)
+{
+	struct pmd_mdc_info *cinfo = NULL;
+	u8 cslot = objid_slot(objid);
+	u64 uniq = objid_uniq(objid);
+	int rc = 0;
+
+	/* We never realloc objects in mdc0 */
+	if (!cslot) {
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, can't re-allocate an object 0x%lx associated with MDC0",
+			  rc, mp->pds_name, (ulong)objid);
+		return rc;
+	}
+
+	spin_lock(&mp->pds_mda.mdi_slotvlock);
+	if (cslot >= mp->pds_mda.mdi_slotvcnt)
+		rc = -EINVAL;
+	spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+	if (rc) {
+		mp_pr_err("mpool %s, realloc failed, slot number %u is too big %u 0x%lx",
+			  rc, mp->pds_name, cslot, mp->pds_mda.mdi_slotvcnt, (ulong)objid);
+	} else {
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+		pmd_mdc_lock(&cinfo->mmi_uqlock, cslot);
+		if (uniq > cinfo->mmi_luniq)
+			rc = -EINVAL;
+		pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+		if (rc) {
+			mp_pr_err("mpool %s, realloc failed, unique id %lu too big %lu 0x%lx",
+				  rc, mp->pds_name, (ulong)uniq,
+				  (ulong)cinfo->mmi_luniq, (ulong)objid);
+		}
+	}
+
+	return rc;
+}
+
+/**
+ * pmd_alloc_argcheck() - validate object allocation arguments
+ * @mp:      Mpool descriptor
+ * @objid:   Object ID
+ * @otype:   Object type
+ * @mclassp: Media class
+ */
+static int pmd_alloc_argcheck(struct mpool_descriptor *mp, u64 objid,
+			      enum obj_type_omf otype, enum mp_media_classp mclassp)
+{
+	int rc = -EINVAL;
+
+	if (!mp)
+		return rc;
+
+	if (!objtype_user(otype) || !mclass_isvalid(mclassp)) {
+		mp_pr_err("mpool %s, unknown object type or media class %d %d",
+			  rc, mp->pds_name, otype, mclassp);
+		return rc;
+	}
+
+	if (objid && objid_type(objid) != otype) {
+		mp_pr_err("mpool %s, object type mismatch %d %d",
+			  rc, mp->pds_name, objid_type(objid), otype);
+		return rc;
+	}
+
+	return 0;
+}
+
+int pmd_obj_alloc_cmn(struct mpool_descriptor *mp, u64 objid, enum obj_type_omf otype,
+		      struct pmd_obj_capacity *ocap, enum mp_media_classp mclass,
+		      int realloc, bool needref, struct pmd_layout **layoutp)
+{
+	struct pmd_mdc_info *cinfo;
+	struct media_class *mc;
+	struct pmd_layout *layout;
+	struct mpool_uuid uuid;
+	int retries, flush, rc;
+	u64 zcnt = 0;
+	u8  cslot;
+
+	*layoutp = NULL;
+
+	rc = pmd_alloc_argcheck(mp, objid, otype, mclass);
+	if (rc)
+		return rc;
+
+	if (!objid) {
+		/*
+		 * alloc: generate objid, checkpointing as needed to support
+		 * realloc of uncommitted objects after a crash and to
+		 * guarantee objids are never reused
+		 */
+		rc = pmd_alloc_idgen(mp, otype, &objid);
+	} else if (realloc) {
+		/* realloc: validate objid */
+		rc = pmd_realloc_idvalidate(mp, objid);
+	}
+	if (rc)
+		return rc;
+
+	if (otype == OMF_OBJ_MLOG)
+		mpool_generate_uuid(&uuid);
+
+	/*
+	 * Retry for roughly 128 to 256 ms in total (1024 retries of
+	 * 128-256 us each), with a flush every 1/8th of the retries.
+	 * This is a workaround for the async mblock trim problem.
+	 */
+	retries = 1024;
+	flush   = retries >> 3;
+
+retry:
+	down_read(&mp->pds_pdvlock);
+
+	mc = &mp->pds_mc[mclass];
+	if (mc->mc_pdmc < 0) {
+		up_read(&mp->pds_pdvlock);
+		return -ENOENT;
+	}
+
+	/* Calculate the height (zcnt) of layout. */
+	pmd_layout_calculate(mp, ocap, mc, &zcnt);
+
+	layout = pmd_layout_alloc(&uuid, objid, 0, 0, zcnt);
+	if (!layout) {
+		up_read(&mp->pds_pdvlock);
+		return -ENOMEM;
+	}
+
+	/* Try to allocate zones from drives in media class */
+	rc = pmd_layout_provision(mp, ocap, layout, mc, zcnt);
+	up_read(&mp->pds_pdvlock);
+
+	if (rc) {
+		pmd_obj_put(layout);
+
+		/* TODO: Retry only if mperasewq is busy... */
+		if (retries-- > 0) {
+			usleep_range(128, 256);
+
+			if (flush && (retries % flush == 0))
+				flush_workqueue(mp->pds_erase_wq);
+
+			goto retry;
+		}
+
+		mp_pr_rl("mpool %s, layout alloc failed: objid 0x%lx %lu %u",
+			 rc, mp->pds_name, (ulong)objid, (ulong)zcnt, otype);
+
+		return rc;
+	}
+
+	cslot = objid_slot(objid);
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+	pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_ALLOC);
+
+	if (needref)
+		kref_get(&layout->eld_ref);
+
+	/*
+	 * If realloc, we MUST confirm (while holding the uncommitted obj
+	 * tree lock) that objid is not in the committed obj tree in order
+	 * to protect against an invalid *_realloc() call.
+	 */
+	pmd_uc_lock(cinfo, cslot);
+	if (realloc) {
+		pmd_co_rlock(cinfo, cslot);
+		if (pmd_co_find(cinfo, objid))
+			rc = -EEXIST;
+		pmd_co_runlock(cinfo);
+	}
+
+	/*
+	 * For both alloc and realloc, confirm that objid is not in the
+	 * uncommitted obj tree and insert it.  Note that a reallocated
+	 * objid can collide, but a generated objid should never collide.
+	 */
+	if (!rc && pmd_uc_insert(cinfo, layout))
+		rc = -EEXIST;
+	pmd_uc_unlock(cinfo);
+
+	if (rc) {
+		mp_pr_err("mpool %s, %sallocated obj 0x%lx should not be in the %scommitted tree",
+			  rc, mp->pds_name, realloc ? "re-" : "",
+			  (ulong)objid, realloc ? "" : "un");
+
+		if (needref)
+			pmd_obj_put(layout);
+
+		/*
+		 * Since object insertion failed, we need to undo the
+		 * per-mdc stats update we did earlier in this routine
+		 */
+		pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_ABORT);
+		pmd_layout_unprovision(mp, layout);
+		layout = NULL;
+	}
+
+	*layoutp = layout;
+
+	return rc;
+}
+
+void pmd_mpool_usage(struct mpool_descriptor *mp, struct mpool_usage *usage)
+{
+	int sidx;
+	u16 slotvcnt;
+
+	/*
+	 * Get a local copy of the MDC count (slotvcnt), then drop the lock.
+	 * It's okay if another MDC is added concurrently, since pds_ds_info
+	 * is always stale by design.
+	 */
+	spin_lock(&mp->pds_mda.mdi_slotvlock);
+	slotvcnt = mp->pds_mda.mdi_slotvcnt;
+	spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+	for (sidx = 1; sidx < slotvcnt; sidx++) {
+		struct pmd_mdc_stats *pms;
+		struct pmd_mdc_info *cinfo;
+
+		cinfo = &mp->pds_mda.mdi_slotv[sidx];
+		pms   = &cinfo->mmi_stats;
+
+		mutex_lock(&cinfo->mmi_stats_lock);
+		usage->mpu_mblock_alen += pms->pms_mblock_alen;
+		usage->mpu_mblock_wlen += pms->pms_mblock_wlen;
+		usage->mpu_mlog_alen   += pms->pms_mlog_alen;
+		usage->mpu_mblock_cnt  += pms->pms_mblock_cnt;
+		usage->mpu_mlog_cnt    += pms->pms_mlog_cnt;
+		mutex_unlock(&cinfo->mmi_stats_lock);
+	}
+
+	if (slotvcnt < 2)
+		return;
+
+	usage->mpu_alen = (usage->mpu_mblock_alen + usage->mpu_mlog_alen);
+	usage->mpu_wlen = (usage->mpu_mblock_wlen + usage->mpu_mlog_alen);
+}
+
+/**
+ * pmd_mdc0_meta_update() - update the MDC0 metadata on media.
+ * @mp:     mpool descriptor
+ * @layout: used to know on which drives to write the MDC0 metadata.
+ *
+ * For now write the whole superblock, although only the MDC0 metadata
+ * needs to be updated; the rest of the superblock doesn't change.
+ *
+ * In 1.0 the MDC0 metadata is replicated in the 4 superblocks of the drive.
+ * In case of failure, the SBs of the same drive may end up with different
+ * values for the MDC0 metadata.
+ * To address this, voting could be used along with the SB gen number
+ * psb_gen. But for 1.0 a simpler approach is taken: the SB gen number is
+ * not used and SB0 is the authoritative replica. The other 3 replicas of
+ * the MDC0 metadata are not used when the mpool activates.
+ */
+static int pmd_mdc0_meta_update(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct omf_sb_descriptor *sb;
+	struct mpool_dev_info *pd;
+	struct mc_parms mc_parms;
+	int rc;
+
+	pd = &(mp->pds_pdv[layout->eld_ld.ol_pdh]);
+	if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+		rc = -EIO;
+		mp_pr_err("%s: pd %s unavailable or offline, status %d",
+			  rc, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+		return rc;
+	}
+
+	sb = kzalloc(sizeof(*sb), GFP_KERNEL);
+	if (!sb)
+		return -ENOMEM;
+
+	/*
+	 * Set superblock values common to all drives in the pool
+	 * (new or extant)
+	 */
+	sb->osb_magic = OMF_SB_MAGIC;
+	strlcpy((char *) sb->osb_name, mp->pds_name, sizeof(sb->osb_name));
+	sb->osb_vers = OMF_SB_DESC_VER_LAST;
+	mpool_uuid_copy(&sb->osb_poolid, &mp->pds_poolid);
+	sb->osb_gen = 1;
+
+	/* Set superblock values specific to this drive */
+	mpool_uuid_copy(&sb->osb_parm.odp_devid, &pd->pdi_devid);
+	sb->osb_parm.odp_devsz = pd->pdi_parm.dpr_devsz;
+	sb->osb_parm.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+	mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+	mc_parms2omf_devparm(&mc_parms, &sb->osb_parm);
+
+	sbutil_mdc0_copy(sb, &mp->pds_sbmdc0);
+
+	mp_pr_debug("MDC0 compaction gen1 %lu gen2 %lu",
+		    0, (ulong)sb->osb_mdc01gen, (ulong)sb->osb_mdc02gen);
+
+	/*
+	 * sb_write_update() succeeds if at least SB0 is written. It is
+	 * not a problem to have SB1 not written because the authoritative
+	 * MDC0 metadata replica is the one in SB0.
+	 */
+	rc = sb_write_update(&pd->pdi_parm, sb);
+	if (rc)
+		mp_pr_err("compacting %s MDC0, writing superblock on drive %s failed",
+			  rc, mp->pds_name, pd->pdi_name);
+
+	kfree(sb);
+	return rc;
+}
+
+/**
+ * pmd_update_obj_stats() - update per-MDC space usage
+ * @mp:     mpool descriptor
+ * @layout: object layout
+ * @cinfo:  per-MDC info
+ * @op:     object opcode
+ */
+void pmd_update_obj_stats(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			  struct pmd_mdc_info *cinfo, enum pmd_obj_op op)
+{
+	struct pmd_mdc_stats *pms;
+	enum obj_type_omf otype;
+	u64 cap;
+
+	otype = pmd_objid_type(layout->eld_objid);
+
+	mutex_lock(&cinfo->mmi_stats_lock);
+	pms = &cinfo->mmi_stats;
+
+	/* Update space usage and mblock/mlog count */
+	switch (op) {
+	case PMD_OBJ_LOAD:
+		if (otype == OMF_OBJ_MBLOCK)
+			pms->pms_mblock_wlen += layout->eld_mblen;
+		fallthrough;
+
+	case PMD_OBJ_ALLOC:
+		cap = pmd_layout_cap_get(mp, layout);
+		if (otype == OMF_OBJ_MLOG) {
+			pms->pms_mlog_cnt++;
+			pms->pms_mlog_alen += cap;
+		} else if (otype == OMF_OBJ_MBLOCK) {
+			pms->pms_mblock_cnt++;
+			pms->pms_mblock_alen += cap;
+		}
+		break;
+
+	case PMD_OBJ_COMMIT:
+		if (otype == OMF_OBJ_MBLOCK)
+			pms->pms_mblock_wlen += layout->eld_mblen;
+		break;
+
+	case PMD_OBJ_DELETE:
+		if (otype == OMF_OBJ_MBLOCK)
+			pms->pms_mblock_wlen -= layout->eld_mblen;
+		fallthrough;
+
+	case PMD_OBJ_ABORT:
+		cap = pmd_layout_cap_get(mp, layout);
+		if (otype == OMF_OBJ_MLOG) {
+			pms->pms_mlog_cnt--;
+			pms->pms_mlog_alen -= cap;
+		} else if (otype == OMF_OBJ_MBLOCK) {
+			pms->pms_mblock_cnt--;
+			pms->pms_mblock_alen -= cap;
+		}
+		break;
+
+	default:
+		ASSERT(0);
+		break;
+	}
+
+	mutex_unlock(&cinfo->mmi_stats_lock);
+}
+
+/**
+ * pmd_compare_free_space() - compare free space between MDCs
+ * @f:  First  MDC
+ * @s:  Second MDC
+ *
+ * Arrange MDCs in descending order of free space
+ */
+static int pmd_compare_free_space(const void *first, const void *second)
+{
+	const struct pmd_mdc_info *f = *(const struct pmd_mdc_info **)first;
+	const struct pmd_mdc_info *s = *(const struct pmd_mdc_info **)second;
+
+	/* Return < 0: first member sorts ahead of second */
+	if (f->mmi_credit.ci_free > s->mmi_credit.ci_free)
+		return -1;
+
+	/* Return > 0: first member sorts after second */
+	if (f->mmi_credit.ci_free < s->mmi_credit.ci_free)
+		return 1;
+
+	return 0;
+}
+
+/**
+ * pmd_update_mds_tbl() - updates mds_tbl with MDC slot numbers
+ * @mp:       mpool descriptor
+ * @num_mdc:  number of entries in @slotnum
+ * @slotnum:  array of slot numbers
+ *
+ * This function creates an array of mdc slot and credit sets by interleaving
+ * MDC slots. Interleaving maximizes the interval at which a given slot
+ * appears in the mds_tbl.
+ *
+ * The first set in the array is the reference set; it has only 1 member and
+ * carries the maximum assigned credit. Subsequent sets are formed to match
+ * the reference set and may contain one or more members, such that the total
+ * credit of each set matches the reference set. The last set may have less
+ * credit than the reference set.
+ *
+ * Locking: no locks need to be held when calling this function.
+ */
+static void pmd_update_mds_tbl(struct mpool_descriptor *mp, u8 num_mdc, u8 *slotnum)
+{
+	struct mdc_credit_set *cset, *cs;
+	struct pmd_mdc_info *cinfo;
+	u16 refcredit, neededcredit, tidx, totalcredit = 0;
+	u8 csidx, csmidx, num_cset, i;
+
+	cset = kcalloc(num_mdc, sizeof(*cset), GFP_KERNEL);
+	if (!cset)
+		return;
+
+	cinfo = &mp->pds_mda.mdi_slotv[slotnum[0]];
+	refcredit = cinfo->mmi_credit.ci_credit;
+
+	csidx = 0; /* creditset index */
+	i     = 0; /* slotnum index   */
+	while (i < num_mdc) {
+		cs = &cset[csidx++];
+		neededcredit = refcredit;
+
+		csmidx = 0;
+		/* Setup members of the credit set */
+		while (csmidx < MPOOL_MDC_SET_SZ  && i < num_mdc) {
+			/* slot 0 should never be there */
+			ASSERT(slotnum[i] != 0);
+
+			cinfo = &mp->pds_mda.mdi_slotv[slotnum[i]];
+			cs->cs_num_csm = csmidx + 1;
+			cs->csm[csmidx].m_slot = slotnum[i];
+
+			if (neededcredit <= cinfo->mmi_credit.ci_credit) {
+				/*
+				 * More than required credit is available,
+				 * leftover will be assigned to the next set.
+				 */
+				cs->csm[csmidx].m_credit    += neededcredit;
+				cinfo->mmi_credit.ci_credit -= neededcredit;
+				totalcredit += neededcredit; /* Debug */
+				neededcredit = 0;
+
+				/* Move to the next mdc only if its credit is exhausted */
+				if (cinfo->mmi_credit.ci_credit == 0)
+					i++;
+				break;
+			}
+
+			/*
+			 * Available credit is < needed, assign all
+			 * the available credit and move to the next
+			 * mdc slot.
+			 */
+			cs->csm[csmidx].m_credit += cinfo->mmi_credit.ci_credit;
+			neededcredit -= cinfo->mmi_credit.ci_credit;
+			totalcredit  += cinfo->mmi_credit.ci_credit;
+			cinfo->mmi_credit.ci_credit = 0;
+
+			/* Move to the next mdcslot and set member */
+			i++;
+			csmidx++;
+		}
+	}
+
+	ASSERT(totalcredit == MDC_TBL_SZ);
+	num_cset = csidx;
+
+	tidx  = 0;
+	csidx = 0;
+	while (tidx < MDC_TBL_SZ) {
+		cs = &cset[csidx];
+		if (cs->cs_idx < cs->cs_num_csm) {
+			csmidx = cs->cs_idx;
+			if (cs->csm[csmidx].m_credit) {
+				cs->csm[csmidx].m_credit--;
+				mp->pds_mda.mdi_sel.mds_tbl[tidx] = cs->csm[csmidx].m_slot;
+				totalcredit--;
+
+				if (cs->csm[csmidx].m_credit == 0)
+					cs->cs_idx += 1;
+
+				tidx++;
+			}
+		}
+		/* Loop over the sets */
+		csidx = (csidx + 1) % num_cset;
+	}
+
+	ASSERT(totalcredit == 0);
+
+	kfree(cset);
+}
+
+/**
+ * pmd_update_credit() - updates MDC credits when new MDCs are created
+ * @mp:  mpool descriptor
+ *
+ * Credits are assigned as a ratio among MDCs such that the MDC with the
+ * least free space fills up at the same rate as the others.
+ *
+ * Locking: no locks need to be held when calling this function.
+ */
+void pmd_update_credit(struct mpool_descriptor *mp)
+{
+	struct pre_compact_ctrs *pco_cnt;
+	struct pmd_mdc_info *cinfo;
+	u64 cap, used, free, nmtoc;
+	u16 credit, cslot;
+	u8 sidx, nidx, num_mdc;
+	u8 *slotnum;
+	void **sarray = mp->pds_mda.mdi_sel.mds_smdc;
+	u32 nbnoalloc = (u32)mp->pds_params.mp_pconbnoalloc;
+
+	if (mp->pds_mda.mdi_slotvcnt < 2) {
+		mp_pr_warn("Not enough MDCn %u", mp->pds_mda.mdi_slotvcnt - 1);
+		return;
+	}
+
+	slotnum = kcalloc(MDC_SLOTS, sizeof(*slotnum), GFP_KERNEL);
+	if (!slotnum)
+		return;
+
+	nmtoc = atomic_read(&mp->pds_pco.pco_nmtoc);
+	nmtoc = nmtoc % (mp->pds_mda.mdi_slotvcnt - 1) + 1;
+
+	/*
+	 * slotvcnt includes MDC0 and the MDCn that are in the pre-compaction
+	 * list, which should be excluded. If there are fewer than
+	 * (nbnoalloc + 2) MDCs, exclusion is not possible. 2 is added to
+	 * account for MDC0 and the MDC pointed to by pco_nmtoc.
+	 *
+	 * The MDC that is in the pre-compacting state and the two MDCs that
+	 * follow it are excluded from allocation. This is done to prevent
+	 * stalls/delays for a sync that follows an allocation, as both
+	 * take a compaction lock.
+	 */
+	if (mp->pds_mda.mdi_slotvcnt < (nbnoalloc + 2)) {
+		num_mdc = mp->pds_mda.mdi_slotvcnt - 1;
+		cslot  = 1;
+		mp_pr_debug("MDCn cnt %u, cannot skip %u num_mdc %u",
+			    0, mp->pds_mda.mdi_slotvcnt - 1, (u32)nmtoc, num_mdc);
+	} else {
+		num_mdc = mp->pds_mda.mdi_slotvcnt - (nbnoalloc + 2);
+		cslot = (nmtoc + nbnoalloc) % (mp->pds_mda.mdi_slotvcnt - 1);
+	}
+
+	/* Walk through all MDCs and exclude those that are almost full */
+	for (nidx = 0, sidx = 0; nidx < num_mdc; nidx++) {
+		cslot = cslot % (mp->pds_mda.mdi_slotvcnt - 1) + 1;
+
+		if (cslot == 0)
+			cslot = 1;
+
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+		pco_cnt = &(cinfo->mmi_pco_cnt);
+
+		cap  = atomic64_read(&pco_cnt->pcc_cap);
+		used = atomic64_read(&pco_cnt->pcc_len);
+
+		if ((cap - used) < (cap / 400)) {
+			/* Consider < .25% free space as full */
+			mp_pr_warn("MDC slot %u almost full", cslot);
+			continue;
+		}
+		sarray[sidx++] = cinfo;
+		cinfo->mmi_credit.ci_free = cap - used;
+	}
+
+	/* Sort the array with decreasing order of space */
+	sort((void *)sarray, sidx, sizeof(sarray[0]), pmd_compare_free_space, NULL);
+	num_mdc = sidx;
+
+	/* Calculate total free space across the chosen MDC set */
+	for (sidx = 0, free = 0; sidx < num_mdc; sidx++) {
+		cinfo = sarray[sidx];
+		free += cinfo->mmi_credit.ci_free;
+		slotnum[sidx] = cinfo->mmi_credit.ci_slot;
+	}
+
+	/*
+	 * Assign credit to MDCs in the MDC set. Credit is relative and
+	 * will not exceed the total slots in mds_tbl
+	 */
+	for (sidx = 0, credit = 0; sidx < num_mdc; sidx++) {
+		cinfo = &mp->pds_mda.mdi_slotv[slotnum[sidx]];
+		cinfo->mmi_credit.ci_credit = (MDC_TBL_SZ * cinfo->mmi_credit.ci_free) / free;
+		credit += cinfo->mmi_credit.ci_credit;
+	}
+
+	ASSERT(credit <= MDC_TBL_SZ);
+
+	/*
+	 * If the credit total is not equal to the table size, assign the
+	 * remaining credits so the table can be filled all the way.
+	 */
+	if (credit < MDC_TBL_SZ) {
+		credit = MDC_TBL_SZ - credit;
+
+		sidx = 0;
+		while (credit > 0) {
+			sidx = (sidx % num_mdc);
+			cinfo = &mp->pds_mda.mdi_slotv[slotnum[sidx]];
+			cinfo->mmi_credit.ci_credit += 1;
+			sidx++;
+			credit--;
+		}
+	}
+
+	pmd_update_mds_tbl(mp, num_mdc, slotnum);
+
+	kfree(slotnum);
+}
+
+/*
+ * pmd_mlogid2cslot() - Given an mlog object ID belonging to one of the mpool
+ *	core MDCs (MDCi with i > 0), return i.
+ *	Given a client-created object ID (mblock or mlog), return -1.
+ * @mlogid:
+ */
+static int pmd_mlogid2cslot(u64 mlogid)
+{
+	u64 uniq;
+
+	if (pmd_objid_type(mlogid) != OMF_OBJ_MLOG)
+		return -1;
+	if (objid_slot(mlogid))
+		return -1;
+	uniq = objid_uniq(mlogid);
+	if (uniq > (2 * MDC_SLOTS) - 1)
+		return -1;
+
+	return uniq / 2;
+}
+
+void pmd_precompact_alsz(struct mpool_descriptor *mp, u64 objid, u64 len, u64 cap)
+{
+	struct pre_compact_ctrs *pco_cnt;
+	struct pmd_mdc_info *cinfo;
+	int ret;
+	u8 cslot;
+
+	ret = pmd_mlogid2cslot(objid);
+	if (ret <= 0)
+		return;
+
+	cslot = ret;
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	pco_cnt = &(cinfo->mmi_pco_cnt);
+	atomic64_set(&pco_cnt->pcc_len, len);
+	atomic64_set(&pco_cnt->pcc_cap, cap);
+}
+
+int pmd_init(void)
+{
+	int rc = 0;
+
+	/* Initialize the slab caches. */
+	pmd_layout_cache = kmem_cache_create("mpool_pmd_layout", sizeof(struct pmd_layout),
+					     0, SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!pmd_layout_cache) {
+		rc = -ENOMEM;
+		mp_pr_err("kmem_cache_create(pmd_layout, %zu) failed",
+			  rc, sizeof(struct pmd_layout));
+		goto errout;
+	}
+
+	pmd_layout_priv_cache = kmem_cache_create("mpool_pmd_layout_priv",
+				sizeof(struct pmd_layout) + sizeof(union pmd_layout_priv),
+				0, SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!pmd_layout_priv_cache) {
+		rc = -ENOMEM;
+		mp_pr_err("kmem_cache_create(pmd priv, %zu) failed",
+			  rc, sizeof(union pmd_layout_priv));
+		goto errout;
+	}
+
+	pmd_obj_erase_work_cache = kmem_cache_create("mpool_pmd_obj_erase_work",
+						     sizeof(struct pmd_obj_erase_work),
+						     0, SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!pmd_obj_erase_work_cache) {
+		rc = -ENOMEM;
+		mp_pr_err("kmem_cache_create(pmd_obj_erase, %zu) failed",
+			  rc, sizeof(struct pmd_obj_erase_work));
+		goto errout;
+	}
+
+errout:
+	if (rc)
+		pmd_exit();
+
+	return rc;
+}
+
+void pmd_exit(void)
+{
+	kmem_cache_destroy(pmd_obj_erase_work_cache);
+	kmem_cache_destroy(pmd_layout_priv_cache);
+	kmem_cache_destroy(pmd_layout_cache);
+
+	pmd_obj_erase_work_cache = NULL;
+	pmd_layout_priv_cache = NULL;
+	pmd_layout_cache = NULL;
+}
-- 
2.17.2
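The credit-weighted round-robin selection described in the pmd_update_credit() and pmd_update_mds_tbl() comments above can be illustrated with a small standalone sketch. This is userspace C; the table size, helper names, and the simplified fill loop are illustrative only, not the driver's code:

```c
/*
 * Standalone sketch of credit-weighted MDC selection: assign each MDC a
 * credit proportional to its free space, then build a selection table by
 * interleaving slots one credit at a time.
 */
#include <assert.h>
#include <string.h>

#define TBL_SZ 1024 /* stands in for MDC_TBL_SZ */

/* Assign credits proportional to free space, then top up round-robin so
 * the credits sum exactly to TBL_SZ.
 */
static void assign_credits(const unsigned long *freev, int n, unsigned int *creditv)
{
	unsigned long total = 0;
	unsigned int sum = 0;
	int i;

	for (i = 0; i < n; i++)
		total += freev[i];

	for (i = 0; i < n; i++) {
		creditv[i] = (TBL_SZ * freev[i]) / total;
		sum += creditv[i];
	}

	for (i = 0; sum < TBL_SZ; i = (i + 1) % n, sum++)
		creditv[i]++;
}

/* Fill the selection table by cycling over the MDCs and spending one
 * credit per visit, so consecutive allocations land on different MDCs.
 */
static void fill_tbl(const unsigned int *creditv, int n, unsigned char *tbl)
{
	unsigned int left[16]; /* sketch assumes n <= 16 */
	int tidx = 0, i = 0;

	memcpy(left, creditv, n * sizeof(left[0]));
	while (tidx < TBL_SZ) {
		if (left[i]) {
			left[i]--;
			tbl[tidx++] = (unsigned char)(i + 1); /* slots are 1-based */
		}
		i = (i + 1) % n;
	}
}
```

With free space of 600/300/100 units across three MDCs, each slot ends up in the 1024-entry table in proportion to its free space, which is exactly the bias the kernel-doc comment describes.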


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 09/22] mpool: add mblock lifecycle management and IO routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (7 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 08/22] mpool: add pool metadata routines to manage object lifecycle and IO Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 10/22] mpool: add mlog IO utility routines Nabeel M Mohamed
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Implements the mblock lifecycle management functions: allocate,
commit, abort, read, write, destroy, etc.

Mblocks are containers comprising a linear sequence of bytes that
can be written exactly once, are immutable after writing, and can
be read in whole or in part as needed until deleted.

Mpool currently supports only fixed size mblocks whose size is the
same as the zone size and is established at mpool create.

The mblock API uses the metadata manager interface to reserve storage
space at allocation time, to store metadata for the mblock in its
associated MDC-K when committing it, to record end-of-life for the
mblock in its associated MDC-K when deleting it, and to read and write
mblock data.
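The write-once lifecycle described above can be captured as a small state model. The sketch below is illustrative user-space C, not the driver API: the `mb_model_*` names are invented, and the error returns only loosely mirror the driver's behavior (`-EALREADY` for a write after commit, `-EAGAIN` for a read before commit).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the mblock state rules: writes are legal only
 * before commit, reads only after, and deletion ends the lifecycle. */
enum mb_state { MB_ALLOCATED, MB_COMMITTED, MB_DELETED };

struct mb_model {
    enum mb_state state;
    size_t written;   /* bytes written so far (eld_mblen analogue) */
    size_t cap;       /* allocated capacity */
};

/* Append-only write: allowed only while uncommitted and within capacity. */
static int mb_model_write(struct mb_model *mb, size_t len)
{
    if (mb->state != MB_ALLOCATED)
        return -1;                    /* driver returns -EALREADY */
    if (mb->written + len > mb->cap)
        return -1;                    /* past end of allocated space */
    mb->written += len;
    return 0;
}

/* Reads are legal only once committed, and only within the written data. */
static int mb_model_read(const struct mb_model *mb, size_t boff, size_t len)
{
    if (mb->state != MB_COMMITTED)
        return -1;                    /* driver returns -EAGAIN */
    if (boff + len > mb->written)
        return -1;                    /* past end of written data */
    return 0;
}

static int mb_model_commit(struct mb_model *mb)
{
    if (mb->state != MB_ALLOCATED)
        return -1;
    mb->state = MB_COMMITTED;
    return 0;
}
```

The model makes the immutability guarantee concrete: once `mb_model_commit()` succeeds, every further write is rejected while reads of the written range succeed.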

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mblock.c | 432 +++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/mblock.h | 161 +++++++++++++++
 2 files changed, 593 insertions(+)
 create mode 100644 drivers/mpool/mblock.c
 create mode 100644 drivers/mpool/mblock.h

diff --git a/drivers/mpool/mblock.c b/drivers/mpool/mblock.c
new file mode 100644
index 000000000000..10c47ec74ff6
--- /dev/null
+++ b/drivers/mpool/mblock.c
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Mblock module.
+ *
+ * Defines functions for writing, reading, and managing the lifecycle
+ * of mblocks.
+ *
+ */
+
+#include <linux/vmalloc.h>
+#include <linux/blk_types.h>
+#include <linux/mm.h>
+
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "pd.h"
+#include "pmd_obj.h"
+#include "mpcore.h"
+#include "mblock.h"
+
+/**
+ * mblock2layout() - convert opaque mblock handle to pmd_layout
+ *
+ * This function converts the opaque handle (mblock_descriptor) used by
+ * clients to the internal representation (pmd_layout).  The
+ * conversion is a simple cast, followed by a sanity check to verify the
+ * layout object is an mblock object.  If the validation fails, a NULL
+ * pointer is returned.
+ */
+static struct pmd_layout *mblock2layout(struct mblock_descriptor *mbh)
+{
+	struct pmd_layout *layout = (void *)mbh;
+
+	if (!layout)
+		return NULL;
+
+	ASSERT(layout->eld_objid > 0);
+	ASSERT(kref_read(&layout->eld_ref) >= 2);
+
+	return mblock_objid(layout->eld_objid) ? layout : NULL;
+}
+
+static u32 mblock_optimal_iosz_get(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct mpool_dev_info *pd = pmd_layout_pd_get(mp, layout);
+
+	return pd->pdi_optiosz;
+}
+
+/**
+ * layout2mblock() - convert pmd_layout to opaque mblock_descriptor
+ *
+ * This function converts the internally used pmd_layout to
+ * the externally used opaque mblock_descriptor.
+ */
+static struct mblock_descriptor *layout2mblock(struct pmd_layout *layout)
+{
+	return (struct mblock_descriptor *)layout;
+}
+
+static void mblock_getprops_cmn(struct mpool_descriptor *mp, struct pmd_layout *layout,
+				struct mblock_props *prop)
+{
+	struct mpool_dev_info *pd;
+
+	pd = pmd_layout_pd_get(mp, layout);
+
+	prop->mpr_objid = layout->eld_objid;
+	prop->mpr_alloc_cap = pmd_layout_cap_get(mp, layout);
+	prop->mpr_write_len = layout->eld_mblen;
+	prop->mpr_optimal_wrsz = mblock_optimal_iosz_get(mp, layout);
+	prop->mpr_mclassp = pd->pdi_mclass;
+	prop->mpr_iscommitted = layout->eld_state & PMD_LYT_COMMITTED;
+}
+
+static int mblock_alloc_cmn(struct mpool_descriptor *mp, u64 objid,
+			    enum mp_media_classp mclassp, bool spare,
+			    struct mblock_props *prop, struct mblock_descriptor **mbh)
+{
+	struct pmd_obj_capacity ocap = { .moc_spare = spare };
+	struct pmd_layout *layout = NULL;
+	int rc;
+
+	if (!mp)
+		return -EINVAL;
+
+	*mbh = NULL;
+
+	if (!objid) {
+		rc = pmd_obj_alloc(mp, OMF_OBJ_MBLOCK, &ocap, mclassp, &layout);
+		if (rc)
+			return rc;
+	} else {
+		rc = pmd_obj_realloc(mp, objid, &ocap, mclassp, &layout);
+		if (rc) {
+			if (rc != -ENOENT)
+				mp_pr_err("mpool %s, re-allocating mblock 0x%lx failed",
+					  rc, mp->pds_name, (ulong)objid);
+			return rc;
+		}
+	}
+
+	if (!layout)
+		return -ENOTRECOVERABLE;
+
+	if (prop) {
+		pmd_obj_rdlock(layout);
+		mblock_getprops_cmn(mp, layout, prop);
+		pmd_obj_rdunlock(layout);
+	}
+
+	*mbh = layout2mblock(layout);
+
+	return 0;
+}
+
+int mblock_alloc(struct mpool_descriptor *mp, enum mp_media_classp mclassp, bool spare,
+		 struct mblock_descriptor **mbh, struct mblock_props *prop)
+{
+	return mblock_alloc_cmn(mp, 0, mclassp, spare, prop, mbh);
+}
+
+int mblock_find_get(struct mpool_descriptor *mp, u64 objid, int which,
+		    struct mblock_props *prop, struct mblock_descriptor **mbh)
+{
+	struct pmd_layout *layout;
+
+	*mbh = NULL;
+
+	if (!mblock_objid(objid))
+		return -EINVAL;
+
+	layout = pmd_obj_find_get(mp, objid, which);
+	if (!layout)
+		return -ENOENT;
+
+	if (prop) {
+		pmd_obj_rdlock(layout);
+		mblock_getprops_cmn(mp, layout, prop);
+		pmd_obj_rdunlock(layout);
+	}
+
+	*mbh = layout2mblock(layout);
+
+	return 0;
+}
+
+void mblock_put(struct mblock_descriptor *mbh)
+{
+	struct pmd_layout *layout;
+
+	layout = mblock2layout(mbh);
+	if (layout)
+		pmd_obj_put(layout);
+}
+
+/*
+ * Helper to log a rate-limited warning when an mblock handle fails
+ * layout validation; used by many functions below:
+ */
+#define mp_pr_layout_not_found(_mp, _mbh)				\
+do {									\
+	static unsigned long state;					\
+	uint dly = msecs_to_jiffies(1000);				\
+									\
+	if (printk_timed_ratelimit(&state, dly)) {			\
+		mp_pr_warn("mpool %s, layout not found: mbh %p",	\
+			   (_mp)->pds_name, (_mbh));			\
+		dump_stack();						\
+	}								\
+} while (0)
+
+int mblock_commit(struct mpool_descriptor *mp, struct mblock_descriptor *mbh)
+{
+	struct mpool_dev_info *pd;
+	struct pmd_layout *layout;
+	int rc;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	pd = pmd_layout_pd_get(mp, layout);
+	if (!pd->pdi_fua) {
+		rc = pd_dev_flush(&pd->pdi_parm);
+		if (rc)
+			return rc;
+	}
+
+	/* Commit will fail with EBUSY if aborting flag set. */
+	rc = pmd_obj_commit(mp, layout);
+	if (rc) {
+		mp_pr_rl("mpool %s, committing mblock 0x%lx failed",
+			 rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	return 0;
+}
+
+int mblock_abort(struct mpool_descriptor *mp, struct mblock_descriptor *mbh)
+{
+	struct pmd_layout *layout;
+	int rc;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	rc = pmd_obj_abort(mp, layout);
+	if (rc) {
+		mp_pr_err("mpool %s, aborting mblock 0x%lx failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	return 0;
+}
+
+int mblock_delete(struct mpool_descriptor *mp, struct mblock_descriptor *mbh)
+{
+	struct pmd_layout *layout;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	return pmd_obj_delete(mp, layout);
+}
+
+/**
+ * mblock_rw_argcheck() - Validate mblock_write() and mblock_read().
+ * @mp:      Mpool descriptor
+ * @layout:  Layout of the mblock
+ * @boff:    Byte offset into the layout.  Must be equal to layout->eld_mblen for write
+ * @rw:      MPOOL_OP_READ or MPOOL_OP_WRITE
+ * @len:     number of bytes in iov list
+ *
+ * Returns: 0 if successful, -errno otherwise
+ *
+ * Note: be aware that there are checks in this function that prevent illegal
+ * arguments in lower level functions (lower level functions should assert the
+ * requirements but not otherwise check them)
+ */
+static int mblock_rw_argcheck(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			      loff_t boff, int rw, size_t len)
+{
+	u64 opt_iosz;
+	u32 mblock_cap;
+	int rc;
+
+	mblock_cap = pmd_layout_cap_get(mp, layout);
+	opt_iosz = mblock_optimal_iosz_get(mp, layout);
+
+	if (rw == MPOOL_OP_READ) {
+		/* boff must be a multiple of the OS page size */
+		if (!PAGE_ALIGNED(boff)) {
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, read offset 0x%lx is not multiple of OS page size",
+				  rc, mp->pds_name, (ulong) boff);
+			return rc;
+		}
+
+		/* Check that boff is not past end of mblock capacity. */
+		if (mblock_cap <= boff) {
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, read offset 0x%lx >= mblock capacity 0x%x",
+				  rc, mp->pds_name, (ulong)boff, mblock_cap);
+			return rc;
+		}
+
+		/*
+		 * Check that the request does not extend past the data
+		 * written.  Don't record an error if this appears to
+		 * be an mcache readahead request.
+		 *
+		 * TODO: Use (len != MCACHE_RA_PAGES_MAX)
+		 */
+		if (boff + len > layout->eld_mblen)
+			return -EINVAL;
+	} else {
+		/* Write boff required to match eld_mblen */
+		if (boff != layout->eld_mblen) {
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, write boff 0x%lx != eld_mblen %d",
+				  rc, mp->pds_name, (ulong)boff, layout->eld_mblen);
+			return rc;
+		}
+
+		/* Writes must be optimal iosz aligned */
+		if (boff % opt_iosz) {
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, write not optimal iosz aligned, offset 0x%lx",
+				  rc, mp->pds_name, (ulong)boff);
+			return rc;
+		}
+
+		/* Check for write past end of allocated space (!) */
+		if ((len + boff) > mblock_cap) {
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, write len %lu + boff %lu > mblock_cap %lu",
+				  rc, mp->pds_name, (ulong)len, (ulong)boff, (ulong)mblock_cap);
+			return rc;
+		}
+	}
+
+	return 0;
+}
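The checks above reduce to a handful of arithmetic rules. The following standalone sketch restates them in user-space C with hypothetical `model_*` names, assuming a 4 KiB page size; it is an illustration of the rules, not the driver code:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed OS page size for this sketch */

/* Read-side rules from mblock_rw_argcheck(): boff page-aligned,
 * boff within capacity, and boff + len within the bytes written. */
static int model_read_argcheck(uint64_t boff, uint64_t len,
                               uint64_t cap, uint64_t mblen)
{
    if (boff % MODEL_PAGE_SIZE)
        return -1;               /* offset not page aligned */
    if (boff >= cap)
        return -1;               /* offset past mblock capacity */
    if (boff + len > mblen)
        return -1;               /* request extends past written data */
    return 0;
}

/* Write-side rules: boff must equal the bytes already written
 * (append-only), be optimal-iosz aligned, and fit the capacity. */
static int model_write_argcheck(uint64_t boff, uint64_t len,
                                uint64_t cap, uint64_t mblen,
                                uint64_t opt_iosz)
{
    if (boff != mblen)
        return -1;               /* not appending at the write cursor */
    if (boff % opt_iosz)
        return -1;               /* not optimal-write-size aligned */
    if (boff + len > cap)
        return -1;               /* write past end of allocated space */
    return 0;
}
```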
+
+int mblock_write(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+		 const struct kvec *iov, int iovcnt, size_t len)
+{
+	struct pmd_layout *layout;
+	loff_t boff;
+	u8 state;
+	int rc;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	rc = mblock_rw_argcheck(mp, layout, layout->eld_mblen, MPOOL_OP_WRITE, len);
+	if (rc) {
+		mp_pr_debug("mblock write argcheck failed", rc);
+		return rc;
+	}
+
+	if (len == 0)
+		return 0;
+
+	boff = layout->eld_mblen;
+
+	ASSERT(PAGE_ALIGNED(len));
+	ASSERT(PAGE_ALIGNED(boff));
+	ASSERT(iovcnt == (len >> PAGE_SHIFT));
+
+	pmd_obj_wrlock(layout);
+	state = layout->eld_state;
+	if (!(state & PMD_LYT_COMMITTED)) {
+		struct mpool_dev_info *pd = pmd_layout_pd_get(mp, layout);
+		int flags = 0;
+
+		if (pd->pdi_fua)
+			flags = REQ_FUA;
+
+		rc = pmd_layout_rw(mp, layout, iov, iovcnt, boff, flags, MPOOL_OP_WRITE);
+		if (!rc)
+			layout->eld_mblen += len;
+	}
+	pmd_obj_wrunlock(layout);
+
+	return !(state & PMD_LYT_COMMITTED) ? rc : -EALREADY;
+}
+
+int mblock_read(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+		const struct kvec *iov, int iovcnt, loff_t boff, size_t len)
+{
+	struct pmd_layout *layout;
+	u8 state;
+	int rc;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	rc = mblock_rw_argcheck(mp, layout, boff, MPOOL_OP_READ, len);
+	if (rc) {
+		mp_pr_debug("mblock read argcheck failed", rc);
+		return rc;
+	}
+
+	if (len == 0)
+		return 0;
+
+	ASSERT(PAGE_ALIGNED(len));
+	ASSERT(PAGE_ALIGNED(boff));
+	ASSERT(iovcnt == (len >> PAGE_SHIFT));
+
+	/*
+	 * Read lock the mblock layout; mblock reads can proceed concurrently;
+	 * Mblock writes are serialized but concurrent with reads
+	 */
+	pmd_obj_rdlock(layout);
+	state = layout->eld_state;
+	if (state & PMD_LYT_COMMITTED)
+		rc = pmd_layout_rw(mp, layout, iov, iovcnt, boff, 0, MPOOL_OP_READ);
+	pmd_obj_rdunlock(layout);
+
+	return (state & PMD_LYT_COMMITTED) ? rc : -EAGAIN;
+}
+
+int mblock_get_props_ex(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+			struct mblock_props_ex *prop)
+{
+	struct pmd_layout *layout;
+
+	layout = mblock2layout(mbh);
+	if (!layout) {
+		mp_pr_layout_not_found(mp, mbh);
+		return -EINVAL;
+	}
+
+	pmd_obj_rdlock(layout);
+	prop->mbx_zonecnt = layout->eld_ld.ol_zcnt;
+	mblock_getprops_cmn(mp, layout, &prop->mbx_props);
+	pmd_obj_rdunlock(layout);
+
+	return 0;
+}
+
+bool mblock_objid(u64 objid)
+{
+	return objid && (pmd_objid_type(objid) == OMF_OBJ_MBLOCK);
+}
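The `mblock_objid()` check above combines a non-zero test with a type test. The real `pmd_objid_type()` is defined in pmd_obj.h, which is not part of this patch, so the sketch below assumes a purely hypothetical encoding (type in the low 4 bits) just to illustrate the "non-zero id of the right type" pattern:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object id encoding for illustration only; the real
 * layout is defined by pmd_objid_type() in pmd_obj.h. */
#define MODEL_OBJ_MBLOCK 0x2u
#define MODEL_OBJ_MLOG   0x3u

static unsigned model_objid_type(uint64_t objid)
{
    return objid & 0xfu;        /* assumed: type lives in the low nibble */
}

/* Mirror of mblock_objid(): non-zero id whose type field says mblock. */
static int model_mblock_objid(uint64_t objid)
{
    return objid && model_objid_type(objid) == MODEL_OBJ_MBLOCK;
}
```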
diff --git a/drivers/mpool/mblock.h b/drivers/mpool/mblock.h
new file mode 100644
index 000000000000..2a5435875a92
--- /dev/null
+++ b/drivers/mpool/mblock.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Defines functions for writing, reading, and managing the lifecycle of mblocks.
+ *
+ */
+
+#ifndef MPOOL_MBLOCK_H
+#define MPOOL_MBLOCK_H
+
+#include <linux/uio.h>
+
+#include "mpool_ioctl.h"
+/*
+ * Opaque handles for clients
+ */
+struct mpool_descriptor;
+struct mblock_descriptor;
+struct mpool_obj_layout;
+
+/*
+ * mblock API functions
+ */
+
+/**
+ * mblock_alloc() - Allocate an mblock.
+ * @mp:         mpool descriptor
+ * @mclassp:    media class
+ * @spare:      allocate from spare space
+ * @mbh:        mblock handle returned
+ * @prop:       mblock properties returned
+ *
+ * Allocate an mblock on drives in media class mclassp; if successful,
+ * mbh is a handle for the mblock and prop contains its properties.
+ *
+ * Note: an mblock is not persistent until committed; allocation can be
+ * aborted.
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_alloc(struct mpool_descriptor *mp, enum mp_media_classp mclassp, bool spare,
+		 struct mblock_descriptor **mbh, struct mblock_props *prop);
+
+/**
+ * mblock_find_get() - Get handle and properties for existing mblock with specified objid.
+ * @mp:     mpool descriptor
+ * @objid:  mblock object ID
+ * @which:  lookup mode, passed through to pmd_obj_find_get()
+ * @prop:   mblock properties returned
+ * @mbh:    mblock handle returned
+ *
+ * If successful, the caller holds a ref on the mblock (which must be put eventually).
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_find_get(struct mpool_descriptor *mp, u64 objid, int which,
+		    struct mblock_props *prop, struct mblock_descriptor **mbh);
+
+/**
+ * mblock_put() - Put (release) a ref on an mblock
+ * @mbh: mblock handle
+ *
+ * Put a ref on a known mblock.
+ */
+void mblock_put(struct mblock_descriptor *mbh);
+
+/**
+ * mblock_commit() - Make an allocated mblock persistent
+ * @mp:  mpool descriptor
+ * @mbh: mblock handle
+ *
+ * If the commit fails, the mblock still exists in an uncommitted state,
+ * so the caller can retry the commit or abort it (except as noted below).
+ *
+ * Return: %0 if successful, -errno otherwise...
+ * -EBUSY if the abort flag is set; the mblock must then be aborted
+ */
+int mblock_commit(struct mpool_descriptor *mp, struct mblock_descriptor *mbh);
+
+/**
+ * mblock_abort() - Discard an uncommitted mblock
+ * @mp:  mpool descriptor
+ * @mbh: mblock handle
+ *
+ * If successful, mbh is invalid after this call.
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_abort(struct mpool_descriptor *mp, struct mblock_descriptor *mbh);
+
+/**
+ * mblock_delete() - Delete a committed mblock
+ * @mp:  mpool descriptor
+ * @mbh: mblock handle
+ *
+ * If successful, mbh is invalid after this call.
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_delete(struct mpool_descriptor *mp, struct mblock_descriptor *mbh);
+
+/**
+ * mblock_write() - Write iov to mblock
+ * @mp:     mpool descriptor
+ * @mbh:    mblock handle
+ * @iov:    iovec array
+ * @iovcnt: iovec count
+ * @len:    number of bytes in the iov list
+ *
+ * Mblocks can be written until they are committed, or until they are
+ * full.  If a caller needs to issue more than one write call to the same
+ * mblock, all but the last write call must be optimal-write-size aligned.
+ * The mpr_optimal_wrsz field in struct mblock_props gives the optimal
+ * write size.
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_write(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+		 const struct kvec *iov, int iovcnt, size_t len);
+
+/**
+ * mblock_read() - Read data from a committed mblock into iov
+ * @mp:     mpool descriptor
+ * @mbh:    mblock handle
+ * @iov:    iovec array
+ * @iovcnt: iovec count
+ * @boff:   byte offset into the mblock
+ * @len:    number of bytes to read
+ *
+ * Read data from a committed mblock into iov starting at byte offset
+ * boff; boff and the iov buffers must be a multiple of the OS page size.
+ *
+ * If the read fails, mblock_get_props() can be called to confirm that
+ * the mblock was written.
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_read(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+		const struct kvec *iov, int iovcnt, loff_t boff, size_t len);
+
+/**
+ * mblock_get_props_ex() - Return extended mblock properties in prop
+ * @mp:   mpool descriptor
+ * @mbh:  mblock handle
+ * @prop: extended mblock properties returned
+ *
+ * Return: %0 if successful, -errno otherwise...
+ */
+int mblock_get_props_ex(struct mpool_descriptor *mp, struct mblock_descriptor *mbh,
+			struct mblock_props_ex *prop);
+
+bool mblock_objid(u64 objid);
+
+#endif /* MPOOL_MBLOCK_H */
-- 
2.17.2




* [PATCH v2 10/22] mpool: add mlog IO utility routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (8 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 09/22] mpool: add mblock lifecycle management and IO routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 11/22] mpool: add mlog lifecycle management and IO routines Nabeel M Mohamed
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Adds buffer management routines used by the mlog IO path.

Mlog objects are containers for record logging. An mlog is
structured as a series of consecutive log blocks where each
log block is exactly one sector in size. Records of
arbitrary size can be appended to an mlog until it's full.

Mlog uses a flush set algorithm for implementing multi-sector
append and read buffers. A flush set is one or more sector-
size memory buffers containing consecutive log block data
with newly appended records written together in a single
logical I/O. The flush set algorithm maintains flush set
identifiers in the headers of the log blocks comprising the
append buffer and handles failures when writing these log
blocks.

The flush set grows up to the append buffer size providing
better throughput for async appends. Once the flush set is
full or when the client issues a synchronous append or mlog
flush, the flush set is written to media and the buffer pages
are freed/prepared for the next append depending on the I/O
outcome.
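The flush set identifier scheme described above can be sketched as a toy model. The field names mirror the `mlog_stat` members (`lst_pfsetid`/`lst_cfsetid`, initialized to 0 and 1 in `mlog_stat_init_common()`), but the behavior here is an illustrative assumption, not the driver's exact recovery algorithm:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: every log block in the open flush set carries cfsetid;
 * a successful flush publishes that id (pfsetid) and opens a new set. */
struct model_fset {
    uint32_t pfsetid;   /* highest flush set id known durable */
    uint32_t cfsetid;   /* id stamped into blocks of the open flush set */
};

static void model_fset_init(struct model_fset *fs)
{
    fs->pfsetid = 0;    /* nothing durable yet */
    fs->cfsetid = 1;    /* first flush set is id 1 */
}

/* Id to stamp into a log block header appended to the open flush set. */
static uint32_t model_fset_stamp(const struct model_fset *fs)
{
    return fs->cfsetid;
}

/* A successful flush makes the open set durable and starts a new one. */
static void model_fset_flush(struct model_fset *fs)
{
    fs->pfsetid = fs->cfsetid++;
}
```

A recovery pass can then compare the flush set id found in a log block header against the last published id to decide whether the block belongs to a completed flush.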

The read buffer is also provisioned like the append buffer. Read
buffer management is optimized for the expected use case: open a
log, read it once to load persisted state into memory, and
thereafter only append while open. The log blocks are loaded into
the read buffer up to, but not including, the offset of the first
log block in the append buffer.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mlog_utils.c | 1352 ++++++++++++++++++++++++++++++++++++
 drivers/mpool/mlog_utils.h |   63 ++
 2 files changed, 1415 insertions(+)
 create mode 100644 drivers/mpool/mlog_utils.c
 create mode 100644 drivers/mpool/mlog_utils.h

diff --git a/drivers/mpool/mlog_utils.c b/drivers/mpool/mlog_utils.c
new file mode 100644
index 000000000000..3fae46b5a1c3
--- /dev/null
+++ b/drivers/mpool/mlog_utils.c
@@ -0,0 +1,1352 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/mm.h>
+#include <linux/log2.h>
+#include <linux/blk_types.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <asm/page.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "pmd_obj.h"
+#include "mpcore.h"
+#include "mlog.h"
+#include "mlog_utils.h"
+
+#define mlpriv2layout(_ptr) \
+	((struct pmd_layout *)((char *)(_ptr) - offsetof(struct pmd_layout, eld_priv)))
+
+bool mlog_objid(u64 objid)
+{
+	return objid && pmd_objid_type(objid) == OMF_OBJ_MLOG;
+}
+
+/**
+ * mlog2layout() - convert opaque mlog handle to pmd_layout
+ *
+ * This function converts the opaque handle (mlog_descriptor) used by
+ * clients to the internal representation (pmd_layout).  The
+ * conversion is a simple cast, followed by a sanity check to verify the
+ * layout object is an mlog object.  If the validation fails, a NULL
+ * pointer is returned.
+ */
+struct pmd_layout *mlog2layout(struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout = (void *)mlh;
+
+	return mlog_objid(layout->eld_objid) ? layout : NULL;
+}
+
+/**
+ * layout2mlog() - convert pmd_layout to opaque mlog_descriptor
+ *
+ * This function converts the internally used pmd_layout to
+ * the externally used opaque mlog_descriptor.
+ */
+struct mlog_descriptor *layout2mlog(struct pmd_layout *layout)
+{
+	return (struct mlog_descriptor *)layout;
+}
+
+static struct pmd_layout_mlpriv *oml_layout_find(struct mpool_descriptor *mp, u64 key)
+{
+	struct pmd_layout_mlpriv *this;
+	struct pmd_layout *layout;
+	struct rb_node *node;
+
+	node = mp->pds_oml_root.rb_node;
+	while (node) {
+		this = rb_entry(node, typeof(*this), mlp_nodeoml);
+		layout = mlpriv2layout(this);
+
+		if (key < layout->eld_objid)
+			node = node->rb_left;
+		else if (key > layout->eld_objid)
+			node = node->rb_right;
+		else
+			return this;
+	}
+
+	return NULL;
+}
+
+struct pmd_layout_mlpriv *oml_layout_insert(struct mpool_descriptor *mp,
+					    struct pmd_layout_mlpriv *item)
+{
+	struct pmd_layout_mlpriv *this;
+	struct pmd_layout *layout;
+	struct rb_node **pos, *parent;
+	struct rb_root *root;
+	u64 key;
+
+	root = &mp->pds_oml_root;
+	pos = &root->rb_node;
+	parent = NULL;
+
+	key = mlpriv2layout(item)->eld_objid;
+
+	while (*pos) {
+		this = rb_entry(*pos, typeof(*this), mlp_nodeoml);
+		layout = mlpriv2layout(this);
+
+		parent = *pos;
+		if (key < layout->eld_objid)
+			pos = &(*pos)->rb_left;
+		else if (key > layout->eld_objid)
+			pos = &(*pos)->rb_right;
+		else
+			return this;
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&item->mlp_nodeoml, parent, pos);
+	rb_insert_color(&item->mlp_nodeoml, root);
+
+	return NULL;
+}
+
+struct pmd_layout_mlpriv *oml_layout_remove(struct mpool_descriptor *mp, u64 key)
+{
+	struct pmd_layout_mlpriv *found;
+
+	found = oml_layout_find(mp, key);
+	if (found)
+		rb_erase(&found->mlp_nodeoml, &mp->pds_oml_root);
+
+	return found;
+}
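The rbtree helpers above follow the usual kernel pattern: find walks the tree by objid, and insert returns NULL on success or the existing node on a key collision. That contract can be modeled without the kernel rbtree; the flat array below is a stand-in for the tree (hypothetical `model_*` names, not the driver code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX 8

struct model_node { uint64_t key; };

/* A flat array stands in for the kernel rbtree; lookup is linear here,
 * whereas the rbtree gives O(log n), but the contract is the same. */
struct model_tree {
    struct model_node *slots[MODEL_MAX];
    int n;
};

static struct model_node *model_find(struct model_tree *t, uint64_t key)
{
    int i;

    for (i = 0; i < t->n; i++)
        if (t->slots[i]->key == key)
            return t->slots[i];
    return NULL;
}

/* Mirror of the oml_layout_insert() contract: NULL on success,
 * the already-present node on a duplicate key. */
static struct model_node *model_insert(struct model_tree *t,
                                       struct model_node *item)
{
    struct model_node *dup = model_find(t, item->key);

    if (dup)
        return dup;          /* collision: caller sees the existing node */
    t->slots[t->n++] = item;
    return NULL;             /* success */
}
```

Returning the colliding node (rather than an error code) lets the caller decide whether a duplicate objid is a bug or an expected race.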
+
+/**
+ * mlog_free_abuf() - Free log pages in the append buffer, range:[start, end].
+ * @lstat: mlog_stat
+ * @start: start log page index, inclusive
+ * @end:   end log page index, inclusive
+ */
+void mlog_free_abuf(struct mlog_stat *lstat, int start, int end)
+{
+	int i;
+
+	for (i = start; i <= end; i++) {
+		if (lstat->lst_abuf[i]) {
+			free_page((unsigned long)lstat->lst_abuf[i]);
+			lstat->lst_abuf[i] = NULL;
+		}
+	}
+}
+
+/**
+ * mlog_free_rbuf() - Free log pages in the read buffer, range:[start, end].
+ * @lstat: mlog_stat
+ * @start: start log page index, inclusive
+ * @end:   end log page index, inclusive
+ */
+void mlog_free_rbuf(struct mlog_stat *lstat, int start, int end)
+{
+	int i;
+
+	for (i = start; i <= end; i++) {
+		if (lstat->lst_rbuf[i]) {
+			free_page((unsigned long)lstat->lst_rbuf[i]);
+			lstat->lst_rbuf[i] = NULL;
+		}
+	}
+}
+
+/**
+ * mlog_init_fsetparms() - Initialize frequently used mlog & flush set parameters.
+ * @mp:     mpool descriptor
+ * @mlh:    mlog descriptor
+ * @mfp:    fset parameters (output)
+ */
+static void mlog_init_fsetparms(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+				struct mlog_fsetparms *mfp)
+{
+	struct pmd_layout *layout;
+	struct pd_prop *pdp;
+	u8 secshift;
+	u16 sectsz;
+
+	layout = mlog2layout(mlh);
+	ASSERT(layout);
+
+	pdp = &mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_prop;
+	secshift = PD_SECTORSZ(pdp);
+	mfp->mfp_totsec = pmd_layout_cap_get(mp, layout) >> secshift;
+
+	sectsz = 1 << secshift;
+	ASSERT((sectsz == PAGE_SIZE) || (sectsz == 512));
+
+	mfp->mfp_sectsz  = sectsz;
+	mfp->mfp_lpgsz   = PAGE_SIZE;
+	mfp->mfp_secpga  = IS_ALIGNED(mfp->mfp_sectsz, mfp->mfp_lpgsz);
+	mfp->mfp_nlpgmb  = MB >> PAGE_SHIFT;
+	mfp->mfp_nsecmb  = MB >> secshift;
+	mfp->mfp_nseclpg = mfp->mfp_lpgsz >> secshift;
+}
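The derived parameters above are all shift arithmetic. A user-space mirror of the computation (assuming 4 KiB pages and MB = 1 MiB, and using a hypothetical `model_` prefix) makes the relationships easy to check:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u                    /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE  (1u << MODEL_PAGE_SHIFT)
#define MODEL_MB         (1u << 20)

struct model_fsetparms {
    uint32_t totsec;    /* total sectors in the mlog */
    uint16_t sectsz;    /* sector size in bytes */
    uint16_t nlpgmb;    /* log pages per MiB */
    uint16_t nsecmb;    /* sectors per MiB */
    uint16_t nseclpg;   /* sectors per log page */
};

/* Mirror of the shift arithmetic in mlog_init_fsetparms(). */
static void model_init_fsetparms(struct model_fsetparms *mfp,
                                 uint64_t cap, uint8_t secshift)
{
    mfp->totsec  = cap >> secshift;             /* capacity in sectors */
    mfp->sectsz  = 1u << secshift;
    mfp->nlpgmb  = MODEL_MB >> MODEL_PAGE_SHIFT;
    mfp->nsecmb  = MODEL_MB >> secshift;
    mfp->nseclpg = MODEL_PAGE_SIZE >> secshift;
}
```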
+
+/**
+ * mlog_extract_fsetparms() - Helper to extract flush set parameters.
+ * @lstat:   mlog stat
+ * @sectsz:  sector size
+ * @totsec:  total number of sectors in the mlog
+ * @nsecmb:  number of sectors in 1 MiB
+ * @nseclpg: number of sectors in a log page
+ */
+void
+mlog_extract_fsetparms(struct mlog_stat *lstat, u16 *sectsz, u32 *totsec, u16 *nsecmb, u16 *nseclpg)
+{
+	if (sectsz)
+		*sectsz = MLOG_SECSZ(lstat);
+	if (totsec)
+		*totsec = MLOG_TOTSEC(lstat);
+	if (nsecmb)
+		*nsecmb = MLOG_NSECMB(lstat);
+	if (nseclpg)
+		*nseclpg = MLOG_NSECLPG(lstat);
+}
+
+/**
+ * mlog_stat_free() - Deallocate log stat struct for mlog layout (if any)
+ * @layout: mlog layout
+ */
+void mlog_stat_free(struct pmd_layout *layout)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+
+	if (!lstat->lst_abuf)
+		return;
+
+	mlog_free_rbuf(lstat, 0, MLOG_NLPGMB(lstat) - 1);
+	mlog_free_abuf(lstat, 0, MLOG_NLPGMB(lstat) - 1);
+
+	kfree(lstat->lst_abuf);
+	lstat->lst_abuf = NULL;
+}
+
+/**
+ * mlog_read_iter_init() - Initialize read iterator
+ *
+ * @layout: mlog layout
+ * @lstat:  mlog stat
+ * @lri:    mlog read iterator
+ */
+void
+mlog_read_iter_init(struct pmd_layout *layout, struct mlog_stat *lstat, struct mlog_read_iter *lri)
+{
+	lri->lri_layout = layout;
+	lri->lri_gen    = layout->eld_gen;
+	lri->lri_soff   = 0;
+	lri->lri_roff   = 0;
+	lri->lri_valid  = 1;
+	lri->lri_rbidx  = 0;
+	lri->lri_sidx   = 0;
+
+	lstat->lst_rsoff  = -1;
+	lstat->lst_rseoff = -1;
+}
+
+/**
+ * mlog_stat_init_common() - Initialize mlog_stat fields.
+ * @layout: mlog layout
+ * @lstat: mlog_stat
+ */
+void mlog_stat_init_common(struct pmd_layout *layout, struct mlog_stat *lstat)
+{
+	struct mlog_read_iter *lri;
+
+	lstat->lst_pfsetid = 0;
+	lstat->lst_cfsetid = 1;
+	lstat->lst_abidx   = 0;
+	lstat->lst_asoff   = -1;
+	lstat->lst_cfssoff = OMF_LOGBLOCK_HDR_PACKLEN;
+	lstat->lst_aoff    = OMF_LOGBLOCK_HDR_PACKLEN;
+	lstat->lst_abdirty = false;
+	lstat->lst_wsoff   = 0;
+	lstat->lst_cstart  = 0;
+	lstat->lst_cend    = 0;
+
+	lri = &lstat->lst_citr;
+	mlog_read_iter_init(layout, lstat, lri);
+}
+
+/**
+ * mlog_rw_raw() - Called by mpctl kernel for mlog IO.
+ * @mp:     mpool descriptor
+ * @mlh:    mlog descriptor
+ * @iov:    iovec
+ * @iovcnt: iov cnt
+ * @boff:   IO offset
+ * @rw:     MPOOL_OP_READ or MPOOL_OP_WRITE
+ *
+ * The scatter-gather buffer must contain
+ * framed mlog data (this is done in user space for user space mlogs).
+ */
+int mlog_rw_raw(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		const struct kvec *iov, int iovcnt, u64 boff, u8 rw)
+{
+	struct pmd_layout *layout;
+	int flags;
+	int rc;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	flags = (rw == MPOOL_OP_WRITE) ? REQ_FUA : 0;
+
+	pmd_obj_wrlock(layout);
+	rc = pmd_layout_rw(mp, layout, iov, iovcnt, boff, flags, rw);
+	pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_rw() - Read from or write to an mlog.
+ * @mp:       mpool descriptor
+ * @mlh:      mlog descriptor
+ * @iov:      iovec
+ * @iovcnt:   iov cnt
+ * @boff:     IO offset
+ * @rw:       MPOOL_OP_READ or MPOOL_OP_WRITE
+ * @skip_ser: client guarantees serialization
+ */
+static int mlog_rw(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		   struct kvec *iov, int iovcnt, u64 boff, u8 rw, bool skip_ser)
+{
+	struct pmd_layout *layout;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	if (skip_ser) {
+		int flags = (rw == MPOOL_OP_WRITE) ? REQ_FUA : 0;
+
+		return pmd_layout_rw(mp, layout, iov, iovcnt, boff, flags, rw);
+	}
+
+	return mlog_rw_raw(mp, mlh, iov, iovcnt, boff, rw);
+}
+
+/**
+ * mlog_stat_init() - Allocate and init log stat struct for mlog layout.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_stat_init(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, bool csem)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_fsetparms mfp;
+	struct mlog_stat *lstat;
+	size_t bufsz;
+	int rc;
+
+	if (!layout)
+		return -EINVAL;
+
+	lstat = &layout->eld_lstat;
+
+	mlog_stat_init_common(layout, lstat);
+	mlog_init_fsetparms(mp, mlh, &mfp);
+
+	bufsz = mfp.mfp_nlpgmb * sizeof(char *) * 2;
+
+	lstat->lst_abuf = kzalloc(bufsz, GFP_KERNEL);
+	if (!lstat->lst_abuf) {
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, allocating mlog 0x%lx status failed %zu",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, bufsz);
+		return rc;
+	}
+
+	lstat->lst_rbuf = lstat->lst_abuf + mfp.mfp_nlpgmb;
+	lstat->lst_mfp  = mfp;
+	lstat->lst_csem = csem;
+
+	return 0;
+}
+
+/**
+ * mlog_setup_buf() - Build an iovec list to read into an mlog read buffer, or write from
+ * an mlog append buffer.
+ * @lstat:   mlog_stat
+ * @riov:    iovec (output)
+ * @iovcnt:  number of iovecs
+ * @l_iolen: IO length for the last log page in the buffer
+ * @op:      MPOOL_OP_READ or MPOOL_OP_WRITE
+ *
+ * In the read case, the read buffer pages will be allocated if not already populated.
+ */
+static int
+mlog_setup_buf(struct mlog_stat *lstat, struct kvec **riov, u16 iovcnt, u32 l_iolen, u8 op)
+{
+	struct kvec *iov = *riov;
+	u32 len = MLOG_LPGSZ(lstat);
+	bool alloc_iov = false;
+	u16 i;
+	char *buf;
+
+	ASSERT(len == PAGE_SIZE);
+	ASSERT(l_iolen <= PAGE_SIZE);
+
+	if (!iov) {
+		ASSERT((iovcnt * sizeof(*iov)) <= PAGE_SIZE);
+
+		iov = kcalloc(iovcnt, sizeof(*iov), GFP_KERNEL);
+		if (!iov)
+			return -ENOMEM;
+
+		alloc_iov = true;
+		*riov = iov;
+	}
+
+	for (i = 0; i < iovcnt; i++, iov++) {
+		buf = ((op == MPOOL_OP_READ) ? lstat->lst_rbuf[i] : lstat->lst_abuf[i]);
+
+		/* iov_len for the last log page in read/write buffer. */
+		if (i == iovcnt - 1 && l_iolen != 0)
+			len = l_iolen;
+
+		ASSERT(IS_ALIGNED(len, MLOG_SECSZ(lstat)));
+
+		if (op == MPOOL_OP_WRITE && buf) {
+			iov->iov_base = buf;
+			iov->iov_len  = len;
+			continue;
+		}
+
+		/*
+		 * Pages for the append buffer are allocated in
+		 * mlog_append_*(), so we shouldn't be here for MPOOL_OP_WRITE.
+		 */
+		ASSERT(op == MPOOL_OP_READ);
+
+		/*
+		 * If the read buffer contains stale log pages from a prior
+		 * iterator, reuse them. No need to zero these pages for
+		 * the same reason provided in the following comment.
+		 */
+		if (buf) {
+			iov->iov_base = buf;
+			iov->iov_len  = len;
+			continue;
+		}
+
+		/*
+		 * No need to zero the read buffer as we never read more than
+		 * what's needed and do not consume beyond what's read.
+		 */
+		buf = (char *)__get_free_page(GFP_KERNEL);
+		if (!buf) {
+			mlog_free_rbuf(lstat, 0, i - 1);
+			if (alloc_iov) {
+				kfree(iov);
+				*riov = NULL;
+			}
+
+			return -ENOMEM;
+		}
+
+		/*
+		 * Must be a page-aligned buffer so that it can be used
+		 * in bio_add_page().
+		 */
+		ASSERT(PAGE_ALIGNED(buf));
+
+		lstat->lst_rbuf[i] = iov->iov_base = buf;
+		iov->iov_len = len;
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_populate_abuf() - Makes append offset page-aligned and performs the
+ * read operation in the read-modify-write cycle.
+ * @mp:       mpool descriptor
+ * @layout:   layout descriptor
+ * @soff:     sector/LB offset
+ * @buf:      buffer to populate. Size of this buffer must be MLOG_LPGSZ(lstat).
+ * @skip_ser: client guarantees serialization
+ *
+ * This ensures that IO requests to the device are always 4K-aligned.
+ * The read-modify-write cycle happens *only* if the first append after mlog
+ * open lands on a non-page-aligned sector offset. For subsequent appends,
+ * the read-modify-write cycle doesn't happen, as the 4K-aligned version of
+ * the flush set algorithm ensures 4K alignment of sector offsets at the
+ * start of each log page.
+ */
+static int mlog_populate_abuf(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			      off_t *soff, char *buf, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	struct kvec iov;
+	u16 sectsz, iovcnt, leading;
+	off_t off;
+	u32 leadb;
+	int rc;
+
+	sectsz = MLOG_SECSZ(lstat);
+
+	/* Find the leading number of sectors to make it page-aligned. */
+	leading = ((*soff * sectsz) & ~PAGE_MASK) >> ilog2(sectsz);
+	if (leading == 0)
+		return 0; /* Nothing to do */
+
+	*soff = *soff - leading;
+	leadb = leading * sectsz;
+
+	iovcnt       = 1;
+	iov.iov_base = buf;
+	iov.iov_len  = MLOG_LPGSZ(lstat);
+
+	off = *soff * sectsz;
+	ASSERT(IS_ALIGNED(off, MLOG_LPGSZ(lstat)));
+
+	rc = mlog_rw(mp, layout2mlog(layout), &iov, iovcnt, off, MPOOL_OP_READ, skip_ser);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx, read IO failed, iovcnt: %u, off: 0x%lx",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, iovcnt, off);
+		return rc;
+	}
+
+	memset(&buf[leadb], 0, MLOG_LPGSZ(lstat) - leadb);
+
+	return 0;
+}
+
+/**
+ * mlog_populate_rbuf() - Fill the read buffer after aligning the read offset to page boundary.
+ * @mp:       mpool descriptor
+ * @layout:   layout descriptor
+ * @nsec:     number of sectors to populate
+ * @soff:     start sector/LB offset
+ * @skip_ser: client guarantees serialization
+ *
+ * Having the read offsets page-aligned avoids unnecessary
+ * complexity at the pd layer.
+ *
+ * In the worst case, for 512 byte sectors, we would end up reading 7
+ * additional sectors, which is acceptable. There won't be any overhead for
+ * 4 KiB sectors as they are naturally page-aligned.
+ *
+ * Caller must hold the write lock on the layout.
+ */
+int mlog_populate_rbuf(struct mpool_descriptor *mp, struct pmd_layout *layout,
+		       u16 *nsec, off_t *soff, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	struct kvec *iov = NULL;
+	u16 maxsec, sectsz, iovcnt, nseclpg, leading;
+	off_t off;
+	u32 l_iolen;
+	int rc;
+
+	mlog_extract_fsetparms(lstat, &sectsz, NULL, &maxsec, &nseclpg);
+
+	/* Find the leading number of sectors to make it page-aligned. */
+	leading = ((*soff * sectsz) & ~PAGE_MASK) >> ilog2(sectsz);
+	*soff   = *soff - leading;
+	*nsec  += leading;
+
+	*nsec   = min_t(u32, maxsec, *nsec);
+	iovcnt  = (*nsec + nseclpg - 1) / nseclpg;
+	l_iolen = MLOG_LPGSZ(lstat);
+
+	rc = mlog_setup_buf(lstat, &iov, iovcnt, l_iolen, MPOOL_OP_READ);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx setup failed, iovcnt: %u, last iolen: %u",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, iovcnt, l_iolen);
+		return rc;
+	}
+
+	off = *soff * sectsz;
+	ASSERT(IS_ALIGNED(off, MLOG_LPGSZ(lstat)));
+
+	rc = mlog_rw(mp, layout2mlog(layout), iov, iovcnt, off, MPOOL_OP_READ, skip_ser);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx populate rbuf, IO failed iovcnt: %u, off: 0x%lx",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, iovcnt, off);
+
+		mlog_free_rbuf(lstat, 0, MLOG_NLPGMB(lstat) - 1);
+		kfree(iov);
+		return rc;
+	}
+
+	/*
+	 * If there are any unused buffers beyond iovcnt, free them. This is
+	 * likely to happen when multiple threads read from the same mlog
+	 * simultaneously, each using its own iterator.
+	 */
+	mlog_free_rbuf(lstat, iovcnt, MLOG_NLPGMB(lstat) - 1);
+
+	kfree(iov);
+
+	return 0;
+}
+
+/**
+ * mlog_alloc_abufpg() - Allocate a log page at append buffer index 'abidx'.
+ * @mp:       mpool descriptor
+ * @layout:   layout descriptor
+ * @abidx:    append buffer index at which to allocate the log page
+ * @skip_ser: client guarantees serialization
+ *
+ * If the sector size is 512B AND 4K-alignment is forced AND the append offset
+ * at buffer index '0' is not 4K-aligned, then call mlog_populate_abuf().
+ */
+static int mlog_alloc_abufpg(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			     u16 abidx, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	char *abuf;
+
+	ASSERT(MLOG_LPGSZ(lstat) == PAGE_SIZE);
+
+	abuf = (char *)get_zeroed_page(GFP_KERNEL);
+	if (!abuf)
+		return -ENOMEM;
+
+	lstat->lst_abuf[abidx] = abuf;
+
+	if (abidx == 0) {
+		off_t asoff, wsoff;
+		u16 aoff, sectsz;
+		int rc;
+
+		/* This path is taken *only* for the first append following an mlog_open(). */
+		sectsz = MLOG_SECSZ(lstat);
+		wsoff  = lstat->lst_wsoff;
+		aoff   = lstat->lst_aoff;
+
+		if (IS_SECPGA(lstat) || (IS_ALIGNED(wsoff * sectsz, MLOG_LPGSZ(lstat)))) {
+			/* This is the common path */
+			lstat->lst_asoff = wsoff;
+			return 0;
+		}
+
+		/*
+		 * This path is taken *only* if:
+		 * - the log block size is 512B, AND
+		 * - lst_wsoff is not page-aligned, which is possible for the
+		 *   first append following mlog_open().
+		 */
+		asoff = wsoff;
+		rc = mlog_populate_abuf(mp, layout, &asoff, abuf, skip_ser);
+		if (rc) {
+			mp_pr_err("mpool %s, mlog 0x%lx, making write offset %ld 4K-aligned failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid, wsoff);
+			mlog_free_abuf(lstat, abidx, abidx);
+			return rc;
+		}
+
+		ASSERT(asoff <= wsoff);
+		ASSERT(IS_ALIGNED(asoff * sectsz, MLOG_LPGSZ(lstat)));
+
+		lstat->lst_cfssoff = ((wsoff - asoff) * sectsz) + aoff;
+		lstat->lst_asoff   = asoff;
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_flush_abuf() - Set up iovec and flush the append buffer to media.
+ * @mp:       mpool descriptor
+ * @layout:   layout descriptor
+ * @skip_ser: client guarantees serialization
+ */
+static int mlog_flush_abuf(struct mpool_descriptor *mp, struct pmd_layout *layout, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	struct kvec *iov = NULL;
+	u16 abidx, sectsz, nseclpg;
+	off_t off;
+	u32 l_iolen;
+	int rc;
+
+	mlog_extract_fsetparms(lstat, &sectsz, NULL, NULL, &nseclpg);
+
+	abidx   = lstat->lst_abidx;
+	l_iolen = MLOG_LPGSZ(lstat);
+
+	rc = mlog_setup_buf(lstat, &iov, abidx + 1, l_iolen, MPOOL_OP_WRITE);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx flush, buf setup failed, iovcnt: %u, iolen: %u",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, abidx + 1, l_iolen);
+		return rc;
+	}
+
+	off = lstat->lst_asoff * sectsz;
+
+	ASSERT((IS_ALIGNED(off, MLOG_LPGSZ(lstat))) ||
+		(IS_SECPGA(lstat) && IS_ALIGNED(off, MLOG_SECSZ(lstat))));
+
+	rc = mlog_rw(mp, layout2mlog(layout), iov, abidx + 1, off, MPOOL_OP_WRITE, skip_ser);
+	if (rc)
+		mp_pr_err("mpool %s, mlog 0x%lx flush append buf, IO failed iovcnt %u, off 0x%lx",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, abidx + 1, off);
+
+	kfree(iov);
+
+	return rc;
+}
+
+/**
+ * mlog_flush_posthdlr_4ka() - Handles both successful and failed flush for
+ *                             512B sectors with 4K alignment.
+ * @layout: layout descriptor
+ * @fsucc:  flush status
+ */
+static void mlog_flush_posthdlr_4ka(struct pmd_layout *layout, bool fsucc)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	u16 abidx, sectsz, asidx;
+	off_t asoff, wsoff;
+	char *abuf;
+	u32 nsecwr;
+
+	sectsz = MLOG_SECSZ(lstat);
+	abidx = lstat->lst_abidx;
+	asoff = lstat->lst_asoff;
+	wsoff = lstat->lst_wsoff;
+
+	asidx  = wsoff - ((MLOG_NSECLPG(lstat) * abidx) + asoff);
+
+	/* Set the current filling log page index to 0. */
+	lstat->lst_abidx = 0;
+	abuf = lstat->lst_abuf[0];
+
+	if (!fsucc) {
+		u32 cfssoff;
+
+		/*
+		 * Last CFS flush or header packing failed.
+		 * Retain the pfsetid of the first log block.
+		 */
+		cfssoff = lstat->lst_cfssoff;
+		memset(&abuf[cfssoff], 0, MLOG_LPGSZ(lstat) - cfssoff);
+		asidx = (cfssoff >> ilog2(sectsz));
+		lstat->lst_aoff  = cfssoff - (asidx * sectsz);
+		lstat->lst_wsoff = asoff + asidx;
+
+		goto exit2;
+	}
+
+	/* Last CFS flush succeeded. */
+	if (abidx != 0) {
+		/* Reorganize buffers if the active log page is not at index 0. */
+		abuf = lstat->lst_abuf[abidx];
+		lstat->lst_abuf[abidx] = NULL;
+	}
+
+	nsecwr = wsoff - (asoff + (lstat->lst_cfssoff >> ilog2(sectsz)));
+	asoff  = wsoff - asidx;
+
+	/* The last logblock of the just-written CFS is not full. */
+	if (sectsz - lstat->lst_aoff >= OMF_LOGREC_DESC_PACKLEN) {
+		if (nsecwr != 0)
+			/* Set pfsetid to the cfsetid of just-written CFS. */
+			lstat->lst_pfsetid  = lstat->lst_cfsetid;
+
+		goto exit1;
+	}
+
+	/* The last logblock of the just-written CFS is full. */
+	lstat->lst_aoff = OMF_LOGBLOCK_HDR_PACKLEN;
+	++wsoff;
+	if ((wsoff - asoff) == MLOG_NSECLPG(lstat)) {
+		memset(&abuf[0], 0, MLOG_LPGSZ(lstat));
+		asoff = wsoff;
+	}
+	/* Set pfsetid to the cfsetid of just-written CFS. */
+	lstat->lst_pfsetid  = lstat->lst_cfsetid;
+
+exit1:
+	asidx              = wsoff - asoff;
+	lstat->lst_cfssoff = (asidx * sectsz) + lstat->lst_aoff;
+	lstat->lst_asoff   = asoff;
+	lstat->lst_wsoff   = wsoff;
+
+exit2:
+	/* Increment cfsetid in all cases. */
+	++lstat->lst_cfsetid;
+
+	lstat->lst_abuf[0] = abuf;
+}
+
+/**
+ * mlog_flush_posthdlr() - Handles both successful and failed flush for
+ * 512B and 4K sectors with native alignment, i.e., 512B and 4K respectively.
+ *
+ * @layout: layout descriptor
+ * @fsucc:  flush status
+ */
+static void mlog_flush_posthdlr(struct pmd_layout *layout, bool fsucc)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	u16 abidx, sectsz, asidx;
+	off_t asoff, lpgoff;
+	char *abuf;
+
+	sectsz = MLOG_SECSZ(lstat);
+	abidx  = lstat->lst_abidx;
+	asoff  = lstat->lst_asoff;
+
+	asidx  = lstat->lst_wsoff - ((MLOG_NSECLPG(lstat) * abidx) + asoff);
+	lpgoff = asidx * sectsz;
+
+	/* Set the current filling log page index to 0. */
+	lstat->lst_abidx = 0;
+	abuf = lstat->lst_abuf[0];
+
+	if (!fsucc) {
+		u32 cfssoff;
+
+		/*
+		 * Last CFS flush or header packing failed.
+		 * Retain the pfsetid of the first log block.
+		 */
+		cfssoff = lstat->lst_cfssoff;
+		memset(&abuf[cfssoff], 0, MLOG_LPGSZ(lstat) - cfssoff);
+		lstat->lst_aoff  = cfssoff;
+		lstat->lst_wsoff = asoff;
+
+		goto exit2;
+	}
+
+	/* Last CFS flush succeeded. */
+	if (abidx != 0) {
+		/* Reorganize buffers if the active log page is not at index 0. */
+		abuf = lstat->lst_abuf[abidx];
+		lstat->lst_abuf[abidx] = NULL;
+	}
+
+	/* The last logblock of the just-written CFS is not full. */
+	if (sectsz - lstat->lst_aoff >= OMF_LOGREC_DESC_PACKLEN) {
+		/*
+		 * If the last logblock in the just-written CFS is the
+		 * first in the append buffer at abidx.
+		 */
+		if (lpgoff == 0) {
+			if (abidx != 0)
+				lstat->lst_pfsetid = lstat->lst_cfsetid;
+
+			goto exit1;
+		}
+
+		memcpy(&abuf[0], &abuf[lpgoff], sectsz);
+		memset(&abuf[sectsz], 0, lpgoff - sectsz + lstat->lst_aoff);
+	} else { /* The last logblock of the just-written CFS is full. */
+		memset(&abuf[0], 0, lpgoff + sectsz);
+		lstat->lst_aoff = OMF_LOGBLOCK_HDR_PACKLEN;
+		++lstat->lst_wsoff;
+	}
+	/* Set pfsetid to the cfsetid of just-written CFS. */
+	lstat->lst_pfsetid  = lstat->lst_cfsetid;
+
+exit1:
+	lstat->lst_cfssoff = lstat->lst_aoff;
+	lstat->lst_asoff   = lstat->lst_wsoff;
+
+exit2:
+	/* Increment cfsetid in all cases. */
+	++lstat->lst_cfsetid;
+
+	lstat->lst_abuf[0] = abuf;
+}
+
+/**
+ * mlog_logblocks_hdrpack() -  Pack log block header in all log blocks in the append buffer.
+ * @layout: object layout
+ *
+ * Called prior to CFS flush
+ */
+static int mlog_logblocks_hdrpack(struct pmd_layout *layout)
+{
+	struct omf_logblock_header lbh;
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	off_t lpgoff;
+	u32 pfsetid, cfsetid;
+	u16 sectsz, nseclpg;
+	u16 idx, abidx;
+	u16 sec, start;
+
+	sectsz  = MLOG_SECSZ(lstat);
+	nseclpg = MLOG_NSECLPG(lstat);
+	abidx   = lstat->lst_abidx;
+	pfsetid = lstat->lst_pfsetid;
+	cfsetid = lstat->lst_cfsetid;
+
+	lbh.olh_vers = OMF_LOGBLOCK_VERS;
+
+	for (idx = 0; idx <= abidx; idx++) {
+		start = 0;
+
+		if (!IS_SECPGA(lstat) && idx == 0)
+			start = (lstat->lst_cfssoff >> ilog2(sectsz));
+
+		if (idx == abidx)
+			nseclpg = lstat->lst_wsoff - (nseclpg * abidx + lstat->lst_asoff) + 1;
+
+		for (sec = start; sec < nseclpg; sec++) {
+			int rc;
+
+			lbh.olh_pfsetid = pfsetid;
+			lbh.olh_cfsetid = cfsetid;
+			mpool_uuid_copy(&lbh.olh_magic, &layout->eld_uuid);
+			lbh.olh_gen = layout->eld_gen;
+			lpgoff = sec * sectsz;
+
+			/* Pack the log block header. */
+			rc = omf_logblock_header_pack_htole(&lbh, &lstat->lst_abuf[idx][lpgoff]);
+			if (rc) {
+				mp_pr_err("mlog packing lbh failed, log pg idx %u, vers %u",
+					  rc, idx, lbh.olh_vers);
+
+				return rc;
+			}
+
+			/* If there's more than one sector to flush, pfsetid is set to cfsetid. */
+			pfsetid = cfsetid;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_logblocks_flush() - Flush CFS and handle both successful and failed flush.
+ * @mp:       mpool descriptor
+ * @layout:   layout descriptor
+ * @skip_ser: client guarantees serialization
+ */
+int mlog_logblocks_flush(struct mpool_descriptor *mp, struct pmd_layout *layout, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	int start, end, rc;
+	bool fsucc = true;
+	u16 abidx;
+
+	abidx = lstat->lst_abidx;
+
+	/* Pack log block header in all the log blocks. */
+	rc = mlog_logblocks_hdrpack(layout);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx packing header failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+
+	} else {
+		rc = mlog_flush_abuf(mp, layout, skip_ser);
+		if (rc)
+			mp_pr_err("mpool %s, mlog 0x%lx log block flush failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+	}
+
+	if (rc) {
+		/* If flush failed, free all log pages except the first one. */
+		start = 1;
+		end   = abidx;
+		fsucc = false;
+	} else {
+		/* If flush succeeded, free all log pages except the last one.*/
+		start = 0;
+		end   = abidx - 1;
+
+		/*
+		 * Inform pre-compaction of the size of the active mlog and
+		 * how much is used.
+		 */
+		pmd_precompact_alsz(mp, layout->eld_objid, lstat->lst_wsoff * MLOG_SECSZ(lstat),
+				    lstat->lst_mfp.mfp_totsec * MLOG_SECSZ(lstat));
+	}
+	mlog_free_abuf(lstat, start, end);
+
+	if (!IS_SECPGA(lstat))
+		mlog_flush_posthdlr_4ka(layout, fsucc);
+	else
+		mlog_flush_posthdlr(layout, fsucc);
+
+	return rc;
+}
+
+/**
+ * mlog_append_dmax() - Compute the max data record size that can be appended.
+ * @layout: layout descriptor
+ *
+ * Returns: max data record size that can be appended to the log, in bytes;
+ * -1 if there is no room even for a 0-byte data record due to the record
+ * descriptor length.
+ */
+s64 mlog_append_dmax(struct pmd_layout *layout)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	u64 lbmax, lbrest;
+	u32 sectsz, datalb;
+
+	sectsz = MLOG_SECSZ(lstat);
+	datalb = MLOG_TOTSEC(lstat);
+
+	if (lstat->lst_wsoff >= datalb) {
+		/* Mlog already full */
+		return -1;
+	}
+
+	lbmax  = (sectsz - OMF_LOGBLOCK_HDR_PACKLEN - OMF_LOGREC_DESC_PACKLEN);
+	lbrest = (datalb - lstat->lst_wsoff - 1) * lbmax;
+
+	if ((sectsz - lstat->lst_aoff) < OMF_LOGREC_DESC_PACKLEN) {
+		/* Current log block cannot hold even a record descriptor */
+		if (lbrest)
+			return lbrest;
+
+		return -1;
+	}
+
+	/*
+	 * Can start in current log block and spill over to others (if any)
+	 */
+	return sectsz - lstat->lst_aoff - OMF_LOGREC_DESC_PACKLEN + lbrest;
+}
+
+/**
+ * mlog_update_append_idx() - Check if active log block is full and update the append offsets
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_update_append_idx(struct mpool_descriptor *mp, struct pmd_layout *layout, bool skip_ser)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	u16 sectsz, nseclpg, abidx, asidx;
+	int rc;
+
+	sectsz  = MLOG_SECSZ(lstat);
+	nseclpg = MLOG_NSECLPG(lstat);
+
+	if (sectsz - lstat->lst_aoff < OMF_LOGREC_DESC_PACKLEN) {
+		/* If the log block is full, move to the next log block in the buffer. */
+		abidx = lstat->lst_abidx;
+		asidx = lstat->lst_wsoff - ((nseclpg * abidx) + lstat->lst_asoff);
+		if (asidx == nseclpg - 1)
+			++lstat->lst_abidx;
+		++lstat->lst_wsoff;
+		lstat->lst_aoff = OMF_LOGBLOCK_HDR_PACKLEN;
+	}
+
+	abidx = lstat->lst_abidx;
+	if (!lstat->lst_abuf[abidx]) {
+		/* Allocate a log page at 'abidx' */
+		rc = mlog_alloc_abufpg(mp, layout, abidx, skip_ser);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_logblocks_load_media() - Read log blocks from media, up to a maximum of 1 MiB.
+ * @mp:    mpool descriptor
+ * @lri:   read iterator
+ * @inbuf: buffer to load into (output)
+ */
+static int mlog_logblocks_load_media(struct mpool_descriptor *mp, struct mlog_read_iter *lri,
+				     char **inbuf)
+{
+	struct pmd_layout *layout = lri->lri_layout;
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	u16 maxsec, nsecs, sectsz;
+	bool skip_ser = false;
+	off_t rsoff;
+	int remsec, rc;
+
+	mlog_extract_fsetparms(lstat, &sectsz, NULL, &maxsec, NULL);
+
+	/*
+	 * The read and append buffer must never overlap. So, the read buffer
+	 * can only hold sector offsets in the range [0, lstat->lst_asoff - 1].
+	 */
+	if (lstat->lst_asoff < 0)
+		remsec = lstat->lst_wsoff;
+	else
+		remsec = lstat->lst_asoff;
+
+	if (remsec == 0) {
+		rc = -ENOTRECOVERABLE;
+		mp_pr_err("mpool %s, objid 0x%lx, mlog read cannot be served from read buffer",
+			  rc, mp->pds_name, (ulong)lri->lri_layout->eld_objid);
+		return rc;
+	}
+
+	lri->lri_rbidx = 0;
+	lri->lri_sidx = 0;
+
+	rsoff = lri->lri_soff;
+	remsec -= rsoff;
+	ASSERT(remsec > 0);
+
+	nsecs = min_t(u32, maxsec, remsec);
+
+	if (layout->eld_flags & MLOG_OF_SKIP_SER)
+		skip_ser = true;
+
+	rc = mlog_populate_rbuf(mp, lri->lri_layout, &nsecs, &rsoff, skip_ser);
+	if (rc) {
+		mp_pr_err("mpool %s, objid 0x%lx, mlog read failed, nsecs: %u, rsoff: 0x%lx",
+			  rc, mp->pds_name, (ulong)lri->lri_layout->eld_objid, nsecs, rsoff);
+
+		lstat->lst_rsoff = lstat->lst_rseoff = -1;
+
+		return rc;
+	}
+
+	/*
+	 * 'nsecs' and 'rsoff' can be changed by mlog_populate_rbuf, if the
+	 * read offset is not page-aligned. Adjust lri_sidx and lst_rsoff
+	 * accordingly.
+	 */
+	lri->lri_sidx     = lri->lri_soff - rsoff;
+	lstat->lst_rsoff  = rsoff;
+	lstat->lst_rseoff = rsoff + nsecs - 1;
+
+	*inbuf = lstat->lst_rbuf[lri->lri_rbidx];
+	*inbuf += lri->lri_sidx * sectsz;
+
+	return 0;
+}
+
+/**
+ * mlog_logblock_load_internal() - Read log blocks from either the read buffer or media.
+ * @mp:    mpool descriptor
+ * @lri:   read iterator
+ * @inbuf: buffer to load into (output)
+ */
+static int mlog_logblock_load_internal(struct mpool_descriptor *mp, struct mlog_read_iter *lri,
+				       char **inbuf)
+{
+	struct mlog_stat *lstat;
+	off_t rsoff, rseoff, soff;
+	u16 nsecs, rbidx, rsidx;
+	u16 nlpgs, nseclpg;
+	int rc;
+
+	lstat = &lri->lri_layout->eld_lstat;
+
+	nseclpg = MLOG_NSECLPG(lstat);
+	rbidx   = lri->lri_rbidx;
+	rsidx   = lri->lri_sidx;
+	soff    = lri->lri_soff;
+	rsoff   = lstat->lst_rsoff;
+	rseoff  = lstat->lst_rseoff;
+
+	if (rsoff < 0)
+		goto media_read;
+
+	/*
+	 * If the read offset doesn't fall within the read buffer range,
+	 * then media read.
+	 */
+	if ((soff < rsoff) || (soff > rseoff))
+		goto media_read;
+
+	do {
+		/* If this is not the start of a log block. */
+		if (lri->lri_roff != 0)
+			break;
+
+		/* Check if there's unconsumed data in rbuf. */
+		nsecs = rseoff - rsoff + 1;
+		nlpgs = (nsecs + nseclpg - 1) / nseclpg;
+
+		/* No. of sectors in the last log page. */
+		if (rbidx == nlpgs - 1) {
+			nseclpg = nsecs % nseclpg;
+			nseclpg = nseclpg > 0 ? nseclpg : MLOG_NSECLPG(lstat);
+		}
+		/* Remaining sectors in the active log page? */
+		if (rsidx < nseclpg - 1) {
+			++rsidx;
+			break;
+		}
+		/* Remaining log pages in the read buffer? */
+		if (rbidx >= nlpgs - 1)
+			goto media_read;
+
+		/* Free the active log page and move to next one. */
+		mlog_free_rbuf(lstat, rbidx, rbidx);
+		++rbidx;
+		rsidx = 0;
+
+		break;
+	} while (0);
+
+	/* Serve data from the read buffer. */
+	*inbuf  = lstat->lst_rbuf[rbidx];
+	*inbuf += rsidx * MLOG_SECSZ(lstat);
+
+	lri->lri_rbidx = rbidx;
+	lri->lri_sidx  = rsidx;
+
+	return 0;
+
+media_read:
+	rc = mlog_logblocks_load_media(mp, lri, inbuf);
+	if (rc) {
+		mp_pr_err("mpool %s, objid 0x%lx, mlog new read failed",
+			  rc, mp->pds_name, (ulong)lri->lri_layout->eld_objid);
+
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_logblock_load() - Load log block referenced by lri into lstat.
+ *
+ * Load log block referenced by lri into lstat, update lri if first read
+ * from this log block, and return a pointer to the log block and a flag
+ * indicating if lri references first record in log block.
+ *
+ * Note: lri can reference the log block currently accumulating in lstat
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -ENOMSG - if at end of log -- NB: requires an API change to signal without
+ */
+int mlog_logblock_load(struct mpool_descriptor *mp, struct mlog_read_iter *lri,
+		       char **buf, bool *first)
+{
+	struct mlog_stat *lstat = NULL;
+	int lbhlen = 0;
+	int rc = 0;
+
+	*buf = NULL;
+	*first = false;
+	lstat  = &lri->lri_layout->eld_lstat;
+
+	if (!lri->lri_valid || lri->lri_soff > lstat->lst_wsoff) {
+		/* lri is invalid; prior checks should prevent this */
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, invalid offset %u %ld %ld",
+			  rc, mp->pds_name, lri->lri_valid, lri->lri_soff, lstat->lst_wsoff);
+	} else if ((lri->lri_soff == lstat->lst_wsoff) || (lstat->lst_asoff > -1 &&
+			lri->lri_soff >= lstat->lst_asoff &&
+			lri->lri_soff <= lstat->lst_wsoff)) {
+		u16 abidx, sectsz, asidx, nseclpg;
+
+		/*
+		 * lri refers to the currently accumulating log block
+		 * in lstat
+		 */
+		if (!lri->lri_roff)
+			/* First read with handle from this log block. */
+			lri->lri_roff = OMF_LOGBLOCK_HDR_PACKLEN;
+
+		if (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff > lstat->lst_aoff) {
+			/* lri is invalid; prior checks should prevent this */
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, invalid next offset %u %u",
+				  rc, mp->pds_name, lri->lri_roff, lstat->lst_aoff);
+			goto out;
+		} else if (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff == lstat->lst_aoff) {
+			/* Hit end of log */
+			rc = -ENOMSG;
+			goto out;
+		} else if (lri->lri_roff == OMF_LOGBLOCK_HDR_PACKLEN)
+			*first = true;
+
+		sectsz  = MLOG_SECSZ(lstat);
+		nseclpg = MLOG_NSECLPG(lstat);
+
+		abidx = (lri->lri_soff - lstat->lst_asoff) / nseclpg;
+		asidx = lri->lri_soff - ((nseclpg * abidx) + lstat->lst_asoff);
+
+		*buf = &lstat->lst_abuf[abidx][asidx * sectsz];
+	} else {
+		/* lri refers to an existing log block; fetch it if not cached. */
+		rc = mlog_logblock_load_internal(mp, lri, buf);
+		if (!rc) {
+			/*
+			 * NOTE: log block header length must be based
+			 * on version since not guaranteed to be the latest
+			 */
+			lbhlen = omf_logblock_header_len_le(*buf);
+
+			if (lbhlen < 0) {
+				rc = -ENODATA;
+				mp_pr_err("mpool %s, getting header length failed %ld",
+					  rc, mp->pds_name, (long)lbhlen);
+			} else {
+				if (!lri->lri_roff)
+					/* First read with handle from this log block. */
+					lri->lri_roff = lbhlen;
+
+				if (lri->lri_roff == lbhlen)
+					*first = true;
+			}
+		}
+	}
+
+out:
+	if (rc) {
+		*buf = NULL;
+		*first = false;
+	}
+
+	return rc;
+}
+
+/**
+ * mlogutil_closeall() - Close all open user (non-mdc) mlogs in mpool and release resources.
+ *
+ * This is an mpool deactivation utility and not part of the mlog user API.
+ */
+void mlogutil_closeall(struct mpool_descriptor *mp)
+{
+	struct pmd_layout_mlpriv *this, *tmp;
+	struct pmd_layout *layout;
+
+	oml_layout_lock(mp);
+
+	rbtree_postorder_for_each_entry_safe(
+		this, tmp, &mp->pds_oml_root, mlp_nodeoml) {
+
+		layout = mlpriv2layout(this);
+
+		if (pmd_objid_type(layout->eld_objid) != OMF_OBJ_MLOG) {
+			mp_pr_warn("mpool %s, non-mlog object 0x%lx in open mlog layout tree",
+				   mp->pds_name, (ulong)layout->eld_objid);
+			continue;
+		}
+
+		if (!pmd_objid_isuser(layout->eld_objid))
+			continue;
+
+		/* Remove layout from open list and discard log data. */
+		rb_erase(&this->mlp_nodeoml, &mp->pds_oml_root);
+		mlog_stat_free(layout);
+	}
+
+	oml_layout_unlock(mp);
+}
+
+/**
+ * mlog_getprops_cmn() - Retrieve basic mlog properties from layout.
+ * @mp:
+ * @layout:
+ * @prop:
+ */
+void
+mlog_getprops_cmn(struct mpool_descriptor *mp, struct pmd_layout *layout, struct mlog_props *prop)
+{
+	memcpy(prop->lpr_uuid.b, layout->eld_uuid.uuid, MPOOL_UUID_SIZE);
+	prop->lpr_objid       = layout->eld_objid;
+	prop->lpr_alloc_cap   = pmd_layout_cap_get(mp, layout);
+	prop->lpr_gen         = layout->eld_gen;
+	prop->lpr_iscommitted = layout->eld_state & PMD_LYT_COMMITTED;
+	prop->lpr_mclassp     = mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_mclass;
+}
diff --git a/drivers/mpool/mlog_utils.h b/drivers/mpool/mlog_utils.h
new file mode 100644
index 000000000000..560548069c11
--- /dev/null
+++ b/drivers/mpool/mlog_utils.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * Defines functions for writing, reading, and managing the lifecycle of mlogs.
+ */
+
+#ifndef MPOOL_MLOG_UTILS_H
+#define MPOOL_MLOG_UTILS_H
+
+#include <linux/uio.h>
+#include <linux/mutex.h>
+
+struct pmd_layout;
+struct pmd_layout_mlpriv;
+struct mpool_descriptor;
+struct mlog_descriptor;
+struct mlog_read_iter;
+struct mlog_stat;
+struct mlog_props;
+
+/* "open mlog" rbtree operations... */
+#define oml_layout_lock(_mp)        mutex_lock(&(_mp)->pds_oml_lock)
+#define oml_layout_unlock(_mp)      mutex_unlock(&(_mp)->pds_oml_lock)
+
+struct pmd_layout_mlpriv *
+oml_layout_insert(struct mpool_descriptor *mp, struct pmd_layout_mlpriv *item);
+
+struct pmd_layout_mlpriv *oml_layout_remove(struct mpool_descriptor *mp, u64 key);
+
+void mlog_free_abuf(struct mlog_stat *lstat, int start, int end);
+
+void mlog_free_rbuf(struct mlog_stat *lstat, int start, int end);
+
+void mlog_extract_fsetparms(struct mlog_stat *lstat, u16 *sectsz, u32 *totsec,
+			    u16 *nsecmb, u16 *nseclpg);
+
+int mlog_stat_init(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, bool csem);
+
+void mlog_stat_free(struct pmd_layout *layout);
+
+void mlog_read_iter_init(struct pmd_layout *layout, struct mlog_stat *lstat,
+			 struct mlog_read_iter *lri);
+
+void mlog_stat_init_common(struct pmd_layout *layout, struct mlog_stat *lstat);
+
+int mlog_populate_rbuf(struct mpool_descriptor *mp, struct pmd_layout *layout,
+		       u16 *nsec, off_t *soff, bool skip_ser);
+
+void mlog_getprops_cmn(struct mpool_descriptor *mp, struct pmd_layout *layout,
+		       struct mlog_props *prop);
+
+int mlog_logblocks_flush(struct mpool_descriptor *mp, struct pmd_layout *layout, bool skip_ser);
+
+s64 mlog_append_dmax(struct pmd_layout *layout);
+
+int mlog_update_append_idx(struct mpool_descriptor *mp, struct pmd_layout *layout, bool skip_ser);
+
+int mlog_logblock_load(struct mpool_descriptor *mp, struct mlog_read_iter *lri,
+		       char **buf, bool *first);
+
+#endif /* MPOOL_MLOG_UTILS_H */
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH v2 11/22] mpool: add mlog lifecycle management and IO routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (9 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 10/22] mpool: add mlog IO utility routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 12/22] mpool: add metadata container or mlog-pair framework Nabeel M Mohamed
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This implements the mlog lifecycle management functions:
allocate, commit, abort, destroy, append, read, etc.

Mlog objects are containers for record logging. Mlogs can be
appended with arbitrary sized records and once full, an mlog
must be erased before additional records can be appended.
Mlog records can be read sequentially from the beginning at
any time. Mlogs in a media class are always a multiple of
the mblock size for that media class.

The mlog APIs implement a pattern whereby an mlog is allocated
and then committed or aborted. An mlog is not persistent or
accessible until committed, and a system failure prior to
commit results in the same logical mpool state as if the mlog
had never been allocated. An mlog allocation returns an OID
that is used to commit, append, flush, erase, or read as needed,
and delete the mlog.

At mlog open, the read buffer is fully loaded and parsed to
identify the end-of-log and the next flush set ID, to detect
media corruption, to detect bad record formatting, and to
optionally enforce compaction semantics. At mlog close, the
dirty data is flushed and all memory resources are freed.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mlog.c | 1667 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1667 insertions(+)
 create mode 100644 drivers/mpool/mlog.c

diff --git a/drivers/mpool/mlog.c b/drivers/mpool/mlog.c
new file mode 100644
index 000000000000..6ccca00735c1
--- /dev/null
+++ b/drivers/mpool/mlog.c
@@ -0,0 +1,1667 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/mm.h>
+#include <linux/log2.h>
+#include <linux/blk_types.h>
+#include <asm/page.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "omf_if.h"
+#include "mpcore.h"
+#include "mlog_utils.h"
+
+/**
+ * mlog_alloc_cmn() - Allocate mlog with specified parameters using new or specified objid.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+static int mlog_alloc_cmn(struct mpool_descriptor *mp, u64 objid,
+			  struct mlog_capacity *capreq, enum mp_media_classp mclassp,
+			  struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+	struct pmd_obj_capacity ocap;
+	struct pmd_layout *layout;
+	int rc;
+
+	layout = NULL;
+	*mlh = NULL;
+
+	ocap.moc_captgt = capreq->lcp_captgt;
+	ocap.moc_spare  = capreq->lcp_spare;
+
+	if (!objid) {
+		rc = pmd_obj_alloc(mp, OMF_OBJ_MLOG, &ocap, mclassp, &layout);
+		if (rc || !layout) {
+			if (rc != -ENOENT)
+				mp_pr_err("mpool %s, allocating mlog failed", rc, mp->pds_name);
+		}
+	} else {
+		rc = pmd_obj_realloc(mp, objid, &ocap, mclassp, &layout);
+		if (rc || !layout) {
+			if (rc != -ENOENT)
+				mp_pr_err("mpool %s, re-allocating mlog 0x%lx failed",
+					  rc, mp->pds_name, (ulong)objid);
+		}
+	}
+	if (rc)
+		return rc;
+
+	/*
+	 * Mlogs are rarely created and usually committed immediately, so
+	 * erase in-line; the mlog is not yet committed, so pmd_obj_erase()
+	 * is not needed to make this atomic.
+	 */
+	pmd_obj_wrlock(layout);
+	rc = pmd_layout_erase(mp, layout);
+	if (!rc)
+		mlog_getprops_cmn(mp, layout, prop);
+	pmd_obj_wrunlock(layout);
+
+	if (rc) {
+		pmd_obj_abort(mp, layout);
+		mp_pr_err("mpool %s, mlog 0x%lx alloc, erase failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	*mlh = layout2mlog(layout);
+
+	return 0;
+}
+
+/**
+ * mlog_alloc() - Allocate mlog with the capacity params specified in capreq.
+ *
+ * Allocate an mlog with the capacity params specified in capreq on drives
+ * in media class mclassp.
+ * If successful, mlh is a handle for the mlog and prop contains its properties.
+ *
+ * Note: mlog is not persistent until committed; allocation can be aborted.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_alloc(struct mpool_descriptor *mp, struct mlog_capacity *capreq,
+	       enum mp_media_classp mclassp, struct mlog_props *prop,
+	       struct mlog_descriptor **mlh)
+{
+	return mlog_alloc_cmn(mp, 0, capreq, mclassp, prop, mlh);
+}
+
+/**
+ * mlog_realloc() - Allocate mlog with specified objid to support crash recovery.
+ *
+ * Allocate an mlog with the specified objid to support crash recovery;
+ * otherwise it is equivalent to mlog_alloc().
+ *
+ * Returns: 0 if successful, -errno otherwise
+ * One of the possible errno values:
+ * -EEXIST - if an mlog with objid already exists
+ */
+int mlog_realloc(struct mpool_descriptor *mp, u64 objid,
+		 struct mlog_capacity *capreq, enum mp_media_classp mclassp,
+		 struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+	if (!mlog_objid(objid))
+		return -EINVAL;
+
+	return mlog_alloc_cmn(mp, objid, capreq, mclassp, prop, mlh);
+}
+
+/**
+ * mlog_find_get() - Get handle and properties for existing mlog with specified objid.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_find_get(struct mpool_descriptor *mp, u64 objid, int which,
+		  struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+	struct pmd_layout *layout;
+
+	*mlh = NULL;
+
+	if (!mlog_objid(objid))
+		return -EINVAL;
+
+	layout = pmd_obj_find_get(mp, objid, which);
+	if (!layout)
+		return -ENOENT;
+
+	if (prop) {
+		pmd_obj_rdlock(layout);
+		mlog_getprops_cmn(mp, layout, prop);
+		pmd_obj_rdunlock(layout);
+	}
+
+	*mlh = layout2mlog(layout);
+
+	return 0;
+}
+
+/**
+ * mlog_put() - Put a reference for mlog with specified objid.
+ */
+void mlog_put(struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout;
+
+	layout = mlog2layout(mlh);
+	if (layout)
+		pmd_obj_put(layout);
+}
+
+/**
+ * mlog_lookup_rootids() - Return OIDs of the mpctl root MDC mlogs.
+ * @id1: (output) OID of one of the mpctl root MDC mlogs
+ * @id2: (output) OID of the other mpctl root MDC mlog
+ */
+void mlog_lookup_rootids(u64 *id1, u64 *id2)
+{
+	if (id1)
+		*id1 = UROOT_OBJID_LOG1;
+
+	if (id2)
+		*id2 = UROOT_OBJID_LOG2;
+}
+
+/**
+ * mlog_commit() - Make allocated mlog persistent.
+ *
+ * If it fails, the mlog still exists in an uncommitted state, so the
+ * commit can be retried or the allocation aborted.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_commit(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	return pmd_obj_commit(mp, layout);
+}
+
+/**
+ * mlog_abort() - Discard uncommitted mlog; if successful, mlh is invalid after the call.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_abort(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	return pmd_obj_abort(mp, layout);
+}
+
+/**
+ * mlog_delete() - Delete committed mlog.
+ *
+ * If successful, mlh is invalid after the call; if it fails, the mlog is closed.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_delete(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	/* Remove from open list and discard buffered log data */
+	pmd_obj_wrlock(layout);
+	oml_layout_lock(mp);
+	oml_layout_remove(mp, layout->eld_objid);
+	oml_layout_unlock(mp);
+
+	mlog_stat_free(layout);
+	pmd_obj_wrunlock(layout);
+
+	return pmd_obj_delete(mp, layout);
+}
+
+/**
+ * mlog_logrecs_validate() - Validate records in lstat.rbuf relative to lstat state.
+ *
+ * Validate the records in lstat.rbuf relative to the lstat state, where
+ * midrec indicates whether the previous log block ended mid data record;
+ * updates lstat to reflect any valid markers found.
+ *
+ * Returns:
+ *   0 if successful; -errno otherwise
+ *
+ *   In the output param, i.e., midrec, we store:
+ *   1 if log records are valid and ended mid data record
+ *   0 if log records are valid and did NOT end mid data record
+ */
+static int mlog_logrecs_validate(struct mlog_stat *lstat, int *midrec, u16 rbidx, u16 lbidx)
+{
+	struct omf_logrec_descriptor lrd;
+	u64 recnum = 0;
+	int recoff;
+	int rc = 0;
+	char *rbuf;
+	u16 sectsz = 0;
+
+	sectsz = MLOG_SECSZ(lstat);
+	rbuf   = lstat->lst_rbuf[rbidx] + lbidx * sectsz;
+
+	recoff = omf_logblock_header_len_le(rbuf);
+	if (recoff < 0)
+		return -ENODATA;
+
+	while (sectsz - recoff >= OMF_LOGREC_DESC_PACKLEN) {
+		omf_logrec_desc_unpack_letoh(&lrd, &rbuf[recoff]);
+
+		if (lrd.olr_rtype == OMF_LOGREC_CSTART) {
+			if (!lstat->lst_csem || lstat->lst_rsoff || recnum) {
+				rc = -ENODATA;
+
+				/* No compaction or not first rec in first log block */
+				mp_pr_err("no compact marker nor first rec %u %ld %u %u %lu",
+					  rc, lstat->lst_csem, lstat->lst_rsoff,
+					  rbidx, lbidx, (ulong)recnum);
+				return rc;
+			}
+			lstat->lst_cstart = 1;
+			*midrec = 0;
+		} else if (lrd.olr_rtype == OMF_LOGREC_CEND) {
+			if (!lstat->lst_csem || !lstat->lst_cstart || lstat->lst_cend || *midrec) {
+				rc = -ENODATA;
+
+				/*
+				 * No compaction or cend before cstart or more than one cend
+				 * or cend mid-record.
+				 */
+				mp_pr_err("inconsistent compaction recs %u %u %u %d", rc,
+					  lstat->lst_csem, lstat->lst_cstart, lstat->lst_cend,
+					  *midrec);
+				return rc;
+			}
+			lstat->lst_cend = 1;
+		} else if (lrd.olr_rtype == OMF_LOGREC_EOLB) {
+			if (*midrec || !recnum) {
+				/* EOLB mid-record or first record. */
+				rc = -ENODATA;
+				mp_pr_err("end of log block marker at wrong place %d %lu",
+					  rc, *midrec, (ulong)recnum);
+				return rc;
+			}
+			/* No more records in log buffer */
+			break;
+		} else if (lrd.olr_rtype == OMF_LOGREC_DATAFULL) {
+			if (*midrec && recnum) {
+				rc = -ENODATA;
+
+				/*
+				 * Can occur mid data record only if this is
+				 * the first record in the log block, which
+				 * indicates a partial data record at the end
+				 * of the last log block; that is a valid
+				 * failure mode. Otherwise it is a logging
+				 * error.
+				 */
+				mp_pr_err("data full marker at wrong place %d %lu",
+					  rc, *midrec, (ulong)recnum);
+				return rc;
+			}
+			*midrec = 0;
+		} else if (lrd.olr_rtype == OMF_LOGREC_DATAFIRST) {
+			if (*midrec && recnum) {
+				rc = -ENODATA;
+
+				/* See comment for DATAFULL */
+				mp_pr_err("data first marker at wrong place %d %lu",
+					  rc, *midrec, (ulong)recnum);
+				return rc;
+			}
+			*midrec = 1;
+		} else if (lrd.olr_rtype == OMF_LOGREC_DATAMID) {
+			if (!*midrec) {
+				rc = -ENODATA;
+
+				/* Must occur mid data record. */
+				mp_pr_err("data mid marker at wrong place %d %lu",
+					  rc, *midrec, (ulong)recnum);
+				return rc;
+			}
+		} else if (lrd.olr_rtype == OMF_LOGREC_DATALAST) {
+			if (!(*midrec)) {
+				rc = -ENODATA;
+
+				/* Must occur mid data record */
+				mp_pr_err("data last marker at wrong place %d %lu",
+					  rc, *midrec, (ulong)recnum);
+				return rc;
+			}
+			*midrec = 0;
+		} else {
+			rc = -ENODATA;
+			mp_pr_err("unknown record type %d %lu", rc, lrd.olr_rtype, (ulong)recnum);
+			return rc;
+		}
+
+		recnum = recnum + 1;
+		recoff = recoff + OMF_LOGREC_DESC_PACKLEN + lrd.olr_rlen;
+	}
+
+	return rc;
+}
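[Editor's note, not part of the patch: the data-record checks above reduce to a small state machine on *midrec. A user-space sketch, with hypothetical names mirroring the OMF_LOGREC_DATA* types and preserving the recnum exception for records spilled from the previous log block:]

```c
#include <assert.h>

/* Hypothetical user-space sketch of the data-record marker state machine
 * in mlog_logrecs_validate(); not part of the patch.
 */
enum rec_type { R_DATAFULL, R_DATAFIRST, R_DATAMID, R_DATALAST };

/*
 * Apply one marker: *midrec tracks "inside a multi-record datum" and recnum
 * is the record's index within its log block.  FULL/FIRST may appear
 * mid-record only as the first record of a block (a datum spilled from the
 * previous block); MID/LAST must appear mid-record.  Returns 0 if legal,
 * -1 otherwise (the driver returns -ENODATA).
 */
static int rec_step(enum rec_type t, int recnum, int *midrec)
{
	switch (t) {
	case R_DATAFULL:
		if (*midrec && recnum)
			return -1;
		*midrec = 0;
		return 0;
	case R_DATAFIRST:
		if (*midrec && recnum)
			return -1;
		*midrec = 1;
		return 0;
	case R_DATAMID:
		return *midrec ? 0 : -1;
	case R_DATALAST:
		if (!*midrec)
			return -1;
		*midrec = 0;
		return 0;
	}
	return -1;
}
```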
+
+static inline void max_cfsetid(struct omf_logblock_header *lbh,
+			       struct pmd_layout *layout, u32 *fsetid)
+{
+	if (!mpool_uuid_compare(&lbh->olh_magic, &layout->eld_uuid) &&
+	    (lbh->olh_gen == layout->eld_gen))
+		*fsetid  = max_t(u32, *fsetid, lbh->olh_cfsetid);
+}
+
+/**
+ * mlog_logpage_validate() - Validate log records at log page index 'rbidx' in the read buffer.
+ * @mlh:        mlog_descriptor
+ * @lstat:      mlog_stat
+ * @rbidx:      log page index in the read buffer to validate
+ * @nseclpg:    number of sectors in the log page @rbidx
+ * @midrec:     refer to mlog_logrecs_validate
+ * @leol_found: true, if LEOL found. false, if LEOL not found/log full (output)
+ * @fsetidmax:  maximum flush set ID found in the log (output)
+ * @pfsetid:    previous flush set ID, if LEOL found (output)
+ */
+static int mlog_logpage_validate(struct mlog_descriptor *mlh, struct mlog_stat *lstat,
+				 u16 rbidx, u16 nseclpg, int *midrec,
+				 bool *leol_found, u32 *fsetidmax, u32 *pfsetid)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	char *rbuf;
+	u16 lbidx;
+	u16 sectsz;
+
+	sectsz = MLOG_SECSZ(lstat);
+	rbuf   = lstat->lst_rbuf[rbidx];
+
+	/* Loop through nseclpg sectors in the log page @rbidx. */
+	for (lbidx = 0; lbidx < nseclpg; lbidx++) {
+		struct omf_logblock_header lbh;
+		int rc;
+
+		memset(&lbh, 0, sizeof(lbh));
+
+		(void)omf_logblock_header_unpack_letoh(&lbh, rbuf);
+
+		/*
+		 * If LEOL is already found, then this loop determines
+		 * fsetidmax, i.e., scans through the sectors to determine
+		 * any stale flush set id from a prior failed CFS flush.
+		 */
+		if (*leol_found) {
+			max_cfsetid(&lbh, layout, fsetidmax);
+			rbuf += sectsz;
+			continue;
+		}
+
+		/*
+		 * Check for LEOL based on prev and cur flush set ID.
+		 * If LEOL is detected, then no need to validate this and
+		 * the log blocks that follow.
+		 *
+		 * We issue DISCARD commands to erase mlogs. However, the data
+		 * read from a discarded block is non-deterministic: it could be
+		 * all 0s, all 1s or the last written data.
+		 *
+		 * We could read the following five types of data from an mlog:
+		 * 1) Garbage
+		 * 2) Stale logs with different log block gen
+		 * 3) Stale logs with different flushset ID
+		 * 4) Stale logs with different magic (UUID)
+		 * 5) Valid logs
+		 */
+		if (mpool_uuid_compare(&lbh.olh_magic, &layout->eld_uuid) ||
+		    (lbh.olh_gen != layout->eld_gen) || (lbh.olh_pfsetid != *fsetidmax)) {
+			*leol_found = true;
+			*pfsetid    = *fsetidmax;
+			rbuf       += sectsz;
+			max_cfsetid(&lbh, layout, fsetidmax);
+			continue;
+		}
+
+		*fsetidmax = lbh.olh_cfsetid;
+
+		/* Validate the log block at lbidx. */
+		rc = mlog_logrecs_validate(lstat, midrec, rbidx, lbidx);
+		if (rc) {
+			mp_pr_err("mlog %p, midrec %d, log pg idx %u, sector idx %u",
+				  rc, mlh, *midrec, rbidx, lbidx);
+
+			return rc;
+		}
+
+		++lstat->lst_wsoff;
+		rbuf += sectsz;
+	}
+
+	return 0;
+}
+
+/**
+ * mlog_read_and_validate() - Read and validate mlog records
+ * @mp:     mpool descriptor
+ * @layout: layout descriptor
+ * @lempty: is the log empty? (output)
+ *
+ * Called by mlog_open() to read and validate log records in the mlog.
+ * In addition, determine the previous and current flush
+ * set ID to be used by the next flush.
+ *
+ * Note: this function reads the entire mlog. Doing so allows us to confirm that
+ * the mlog's contents are completely legit, and also to recognize the case
+ * where a compaction started but failed to complete (CSTART with no CEND) -
+ * for which the recovery is to use the other mlog of the mlpair.
+ * If the mlog is huge, or if there are a bazillion of them, this could be an
+ * issue to revisit in future performance or functionality optimizations.
+ *
+ * Transactional logs are expensive; this does some "extra" reading at open
+ * time, with some serious benefits.
+ *
+ * Caller must hold the write lock on the layout, which protects the mutation
+ * of the read buffer.
+ */
+static int mlog_read_and_validate(struct mpool_descriptor *mp,
+				  struct pmd_layout *layout, bool *lempty)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	off_t leol_off = 0, rsoff;
+	int midrec = 0, remsec;
+	bool leol_found = false;
+	bool fsetid_loop = false;
+	bool skip_ser = false;
+	u32 fsetidmax = 0;
+	u32 pfsetid = 0;
+	u16 maxsec, nsecs;
+	u16 nlpgs, nseclpg;
+	int rc = 0;
+
+	remsec = MLOG_TOTSEC(lstat);
+	maxsec = MLOG_NSECMB(lstat);
+	rsoff  = lstat->lst_wsoff;
+
+	while (remsec > 0) {
+		u16 rbidx;
+
+		nseclpg = MLOG_NSECLPG(lstat);
+		nsecs   = min_t(u32, maxsec, remsec);
+
+		rc = mlog_populate_rbuf(mp, layout, &nsecs, &rsoff, skip_ser);
+		if (rc) {
+			mp_pr_err("mpool %s, mlog 0x%lx validate failed, nsecs: %u, rsoff: 0x%lx",
+				  rc, mp->pds_name, (ulong)layout->eld_objid, nsecs, rsoff);
+
+			goto exit;
+		}
+
+		nlpgs = (nsecs + nseclpg - 1) / nseclpg;
+		lstat->lst_rsoff = rsoff;
+
+		/* Validate the read buffer, one log page at a time. */
+		for (rbidx = 0; rbidx < nlpgs; rbidx++) {
+
+			/* No. of sectors in the last log page. */
+			if (rbidx == nlpgs - 1) {
+				nseclpg = nsecs % nseclpg;
+				nseclpg = nseclpg > 0 ? nseclpg : MLOG_NSECLPG(lstat);
+			}
+
+			/* Validate the log block(s) in the log page @rbidx. */
+			rc = mlog_logpage_validate(layout2mlog(layout), lstat, rbidx, nseclpg,
+						   &midrec, &leol_found, &fsetidmax, &pfsetid);
+			if (rc) {
+				mp_pr_err("mpool %s, mlog 0x%lx rbuf validate failed, leol: %d, fsetidmax: %u, pfsetid: %u",
+					  rc, mp->pds_name, (ulong)layout->eld_objid, leol_found,
+					  fsetidmax, pfsetid);
+
+				mlog_free_rbuf(lstat, rbidx, nlpgs - 1);
+				goto exit;
+			}
+
+			mlog_free_rbuf(lstat, rbidx, rbidx);
+
+			/*
+			 * If LEOL is found, then note down the LEOL offset
+			 * and kick off the scan to identify any stale flush
+			 * set id from a prior failed flush. If there's one,
+			 * then the next flush set ID must be set one greater
+			 * than the stale fsetid.
+			 */
+			if (leol_found && !fsetid_loop) {
+				leol_off    = lstat->lst_wsoff;
+				fsetid_loop = true;
+			}
+		}
+
+		remsec -= nsecs;
+		if (remsec == 0)
+			break;
+		ASSERT(remsec > 0);
+
+		if (fsetid_loop) {
+			u16    compsec;
+			off_t  endoff;
+			/*
+			 * To determine the new flush set ID, we need to
+			 * scan only through the next min(MLOG_NSECMB, remsec)
+			 * sectors. This is because of the max flush size being
+			 * 1 MB and hence a failed flush wouldn't have touched
+			 * any sectors beyond 1 MB from LEOL.
+			 */
+			endoff  = rsoff + nsecs - 1;
+			compsec = endoff - leol_off + 1;
+			remsec  = min_t(u32, remsec, maxsec - compsec);
+			ASSERT(remsec >= 0);
+
+			rsoff = endoff + 1;
+		} else {
+			rsoff = lstat->lst_wsoff;
+		}
+	}
+
+	/* LEOL wouldn't have been set for a full log. */
+	if (!leol_found)
+		pfsetid = fsetidmax;
+
+	if (pfsetid != 0)
+		*lempty = false;
+
+	lstat->lst_pfsetid = pfsetid;
+	lstat->lst_cfsetid = fsetidmax + 1;
+
+exit:
+	lstat->lst_rsoff = -1;
+
+	return rc;
+}
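[Editor's note, not part of the patch: the read loop above splits each populated read buffer of nsecs sectors into log pages of nseclpg sectors, where only the last page may be partial; `nsecs % nseclpg` gives its sector count, falling back to a full page when the chunk is an exact multiple. A user-space sketch of that geometry:]

```c
#include <assert.h>

/* Hypothetical user-space sketch of the read-buffer geometry in
 * mlog_read_and_validate(); not part of the patch.  nsecs sectors are
 * split into log pages of nseclpg sectors each; every page is full
 * except possibly the last.
 */
static unsigned int num_pages(unsigned int nsecs, unsigned int nseclpg)
{
	return (nsecs + nseclpg - 1) / nseclpg;	/* ceiling division */
}

static unsigned int last_page_sectors(unsigned int nsecs, unsigned int nseclpg)
{
	unsigned int rem = nsecs % nseclpg;

	/* an exact multiple means the last page is full */
	return rem ? rem : nseclpg;
}
```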
+
+int mlog_open(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u8 flags, u64 *gen)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	bool lempty, csem, skip_ser;
+	int rc = 0;
+
+	lempty = csem = skip_ser = false;
+	lstat = NULL;
+	*gen = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+
+	flags &= MLOG_OF_SKIP_SER | MLOG_OF_COMPACT_SEM;
+
+	if (flags & MLOG_OF_COMPACT_SEM)
+		csem = true;
+
+	if (flags & MLOG_OF_SKIP_SER)
+		skip_ser = true;
+
+	lstat = &layout->eld_lstat;
+
+	if (lstat->lst_abuf) {
+		/* Mlog already open */
+		if (csem && !lstat->lst_csem) {
+			pmd_obj_wrunlock(layout);
+
+			/* Re-open has inconsistent csem flag */
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, re-opening of mlog 0x%lx, inconsistent csem %u %u",
+				  rc, mp->pds_name, (ulong)layout->eld_objid,
+				  csem, lstat->lst_csem);
+		} else if (skip_ser && !(layout->eld_flags & MLOG_OF_SKIP_SER)) {
+			pmd_obj_wrunlock(layout);
+
+			/* Re-open has inconsistent serialization flag */
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, re-opening of mlog 0x%lx, inconsistent ser %u %u",
+				  rc, mp->pds_name, (ulong)layout->eld_objid, skip_ser,
+				  layout->eld_flags & MLOG_OF_SKIP_SER);
+		} else {
+			*gen = layout->eld_gen;
+			pmd_obj_wrunlock(layout);
+		}
+		return rc;
+	}
+
+	if (!(layout->eld_state & PMD_LYT_COMMITTED)) {
+		*gen = 0;
+		pmd_obj_wrunlock(layout);
+
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, mlog 0x%lx, not committed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	if (skip_ser)
+		layout->eld_flags |= MLOG_OF_SKIP_SER;
+
+	rc = mlog_stat_init(mp, mlh, csem);
+	if (rc) {
+		*gen = 0;
+		pmd_obj_wrunlock(layout);
+
+		mp_pr_err("mpool %s, mlog 0x%lx, mlog status initialization failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	lempty = true;
+
+	rc = mlog_read_and_validate(mp, layout, &lempty);
+	if (rc) {
+		mlog_stat_free(layout);
+		pmd_obj_wrunlock(layout);
+
+		mp_pr_err("mpool %s, mlog 0x%lx, mlog content validation failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	} else if (!lempty && csem) {
+		if (!lstat->lst_cstart) {
+			mlog_stat_free(layout);
+			pmd_obj_wrunlock(layout);
+
+			rc = -ENODATA;
+			mp_pr_err("mpool %s, mlog 0x%lx, compaction start missing",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+			return rc;
+		} else if (!lstat->lst_cend) {
+			mlog_stat_free(layout);
+			pmd_obj_wrunlock(layout);
+
+			/* Incomplete compaction */
+			rc = -EMSGSIZE;
+			mp_pr_err("mpool %s, mlog 0x%lx, incomplete compaction",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+			return rc;
+		}
+	}
+
+	*gen = layout->eld_gen;
+
+	/* TODO: Verify that the insert succeeded... */
+	oml_layout_lock(mp);
+	oml_layout_insert(mp, &layout->eld_mlpriv);
+	oml_layout_unlock(mp);
+
+	pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_close() - Flush and close log and release resources; no op if log is not open.
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_close(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	bool skip_ser = false;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	/*
+	 * Inform pre-compaction that there is no need to try to compact
+	 * an mpool MDC that would contain this mlog because it is closed.
+	 */
+	pmd_precompact_alsz(mp, layout->eld_objid, 0, 0);
+
+	pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (!lstat->lst_abuf) {
+		pmd_obj_wrunlock(layout);
+
+		return 0; /* Log already closed */
+	}
+
+	/* Flush log if potentially dirty and remove layout from open list */
+	if (lstat->lst_abdirty) {
+		rc = mlog_logblocks_flush(mp, layout, skip_ser);
+		lstat->lst_abdirty = false;
+		if (rc)
+			mp_pr_err("mpool %s, mlog 0x%lx close, log block flush failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+	}
+
+	oml_layout_lock(mp);
+	oml_layout_remove(mp, layout->eld_objid);
+	oml_layout_unlock(mp);
+
+	mlog_stat_free(layout);
+
+	/* Reset Mlog flags */
+	layout->eld_flags &= (~MLOG_OF_SKIP_SER);
+
+	pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_gen() - Get generation number for log; log can be open or closed.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_gen(struct mlog_descriptor *mlh, u64 *gen)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+
+	*gen = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_rdlock(layout);
+	*gen = layout->eld_gen;
+	pmd_obj_rdunlock(layout);
+
+	return 0;
+}
+
+/**
+ * mlog_empty() - Determine if log is empty; log must be open.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_empty(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, bool *empty)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	int rc = 0;
+
+	*empty = false;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_rdlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (lstat->lst_abuf) {
+		if (!lstat->lst_wsoff &&
+		    lstat->lst_aoff == OMF_LOGBLOCK_HDR_PACKLEN)
+			*empty = true;
+	} else {
+		rc = -ENOENT;
+	}
+
+	pmd_obj_rdunlock(layout);
+
+	if (rc)
+		mp_pr_err("mpool %s, mlog 0x%lx empty: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+
+	return rc;
+}
+
+/**
+ * mlog_len() - Return the raw mlog bytes consumed; log must be open.
+ *
+ * Need to account for both metadata and user bytes while computing the log length.
+ */
+static int mlog_len(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u64 *len)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_rdlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (lstat->lst_abuf)
+		*len = ((u64) lstat->lst_wsoff * MLOG_SECSZ(lstat)) + lstat->lst_aoff;
+	else
+		rc = -ENOENT;
+
+	pmd_obj_rdunlock(layout);
+
+	if (rc)
+		mp_pr_err("mpool %s, mlog 0x%lx bytes consumed: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+
+	return rc;
+}
+
+/**
+ * mlog_erase() - Erase log setting generation number to max(current gen + 1, mingen).
+ *
+ * Log can be open or closed, but must be committed; the operation is
+ * idempotent and can be retried if it fails.
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_erase(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u64 mingen)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat = NULL;
+	u64 newgen = 0;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+
+	/* Must be committed to log erase start/end markers */
+	if (!(layout->eld_state & PMD_LYT_COMMITTED)) {
+		pmd_obj_wrunlock(layout);
+
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, erasing mlog 0x%lx, mlog not committed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	newgen = max(layout->eld_gen + 1, mingen);
+
+	/* If successful updates state and gen in layout */
+	rc = pmd_obj_erase(mp, layout, newgen);
+	if (rc) {
+		pmd_obj_wrunlock(layout);
+
+		mp_pr_err("mpool %s, erasing mlog 0x%lx, logging erase start failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	rc = pmd_layout_erase(mp, layout);
+	if (rc) {
+		/*
+		 * Log the failure as a debugging message, but ignore the
+		 * failure, since discarding blocks here is only advisory
+		 */
+		mp_pr_debug("mpool %s, erasing mlog 0x%lx, erase failed ",
+			    rc, mp->pds_name, (ulong)layout->eld_objid);
+		rc = 0;
+	}
+
+	/* If successful updates state in layout */
+	lstat = &layout->eld_lstat;
+	if (lstat->lst_abuf) {
+		/* Log is open so need to update lstat info */
+		mlog_free_abuf(lstat, 0, lstat->lst_abidx);
+		mlog_free_rbuf(lstat, 0, MLOG_NLPGMB(lstat) - 1);
+
+		mlog_stat_init_common(layout, lstat);
+	}
+
+	pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_append_marker() - Append a marker (log rec with zero-length data field) of type mtype.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+static int mlog_append_marker(struct mpool_descriptor *mp, struct pmd_layout *layout,
+			      enum logrec_type_omf mtype)
+{
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	struct omf_logrec_descriptor lrd;
+	u16 sectsz, abidx, aoff;
+	u16 asidx, nseclpg;
+	bool skip_ser = false;
+	char *abuf;
+	off_t lpgoff;
+	int rc;
+
+	sectsz  = MLOG_SECSZ(lstat);
+	nseclpg = MLOG_NSECLPG(lstat);
+
+	if (mlog_append_dmax(layout) == -1) {
+		/* Mlog is already full, flush whatever we can */
+		if (lstat->lst_abdirty) {
+			(void)mlog_logblocks_flush(mp, layout, skip_ser);
+			lstat->lst_abdirty = false;
+		}
+
+		return -EFBIG;
+	}
+
+	rc = mlog_update_append_idx(mp, layout, skip_ser);
+	if (rc)
+		return rc;
+
+	abidx  = lstat->lst_abidx;
+	abuf   = lstat->lst_abuf[abidx];
+	asidx  = lstat->lst_wsoff - ((nseclpg * abidx) + lstat->lst_asoff);
+	lpgoff = asidx * sectsz;
+	aoff   = lstat->lst_aoff;
+
+	lrd.olr_tlen  = 0;
+	lrd.olr_rlen  = 0;
+	lrd.olr_rtype = mtype;
+
+	ASSERT(abuf != NULL);
+
+	rc = omf_logrec_desc_pack_htole(&lrd, &abuf[lpgoff + aoff]);
+	if (!rc) {
+		lstat->lst_aoff = aoff + OMF_LOGREC_DESC_PACKLEN;
+
+		rc = mlog_logblocks_flush(mp, layout, skip_ser);
+		lstat->lst_abdirty = false;
+		if (rc)
+			mp_pr_err("mpool %s, mlog 0x%lx log block flush failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+	} else {
+		mp_pr_err("mpool %s, mlog 0x%lx log record descriptor packing failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+	}
+
+	return rc;
+}
+
+/**
+ * mlog_append_cstart() - Append compaction start marker; log must be open with csem flag true.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+int mlog_append_cstart(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (!lstat->lst_abuf) {
+		pmd_obj_wrunlock(layout);
+
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, in mlog 0x%lx, inconsistency: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	if (!lstat->lst_csem || lstat->lst_cstart) {
+		pmd_obj_wrunlock(layout);
+
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, in mlog 0x%lx, inconsistent state %u %u",
+			  rc, mp->pds_name,
+			  (ulong)layout->eld_objid, lstat->lst_csem, lstat->lst_cstart);
+		return rc;
+	}
+
+	rc = mlog_append_marker(mp, layout, OMF_LOGREC_CSTART);
+	if (rc) {
+		pmd_obj_wrunlock(layout);
+
+		mp_pr_err("mpool %s, in mlog 0x%lx, marker append failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	lstat->lst_cstart = 1;
+	pmd_obj_wrunlock(layout);
+
+	return 0;
+}
+
+/**
+ * mlog_append_cend() - Append compaction end marker; log must be open with csem flag true.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+int mlog_append_cend(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (!lstat->lst_abuf) {
+		pmd_obj_wrunlock(layout);
+
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	if (!lstat->lst_csem || !lstat->lst_cstart || lstat->lst_cend) {
+		pmd_obj_wrunlock(layout);
+
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, mlog 0x%lx, inconsistent state %u %u %u",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, lstat->lst_csem,
+			  lstat->lst_cstart, lstat->lst_cend);
+		return rc;
+	}
+
+	rc = mlog_append_marker(mp, layout, OMF_LOGREC_CEND);
+	if (rc) {
+		pmd_obj_wrunlock(layout);
+
+		mp_pr_err("mpool %s, mlog 0x%lx, marker append failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+		return rc;
+	}
+
+	lstat->lst_cend = 1;
+	pmd_obj_wrunlock(layout);
+
+	return 0;
+}
+
+/**
+ * memcpy_from_iov() - Moves contents from an iovec to one or more destination buffers.
+ * @iov    : One or more source buffers in the form of an iovec
+ * @buf    : Destination buffer
+ * @buflen : Number of bytes to copy (the minimum of the source and destination lengths)
+ * @nextidx: The next index in iov if the copy requires multiple invocations
+ *           of memcpy_from_iov.
+ *
+ * No bounds check is done on iov. The caller is expected to pass the
+ * minimum of the source and destination lengths as buflen.
+ */
+static void memcpy_from_iov(struct kvec *iov, char *buf, size_t buflen, int *nextidx)
+{
+	int i = *nextidx, cp;
+
+	if ((buflen > 0) && (iov[i].iov_len == 0))
+		i++;
+
+	while (buflen > 0) {
+		cp = (buflen < iov[i].iov_len) ? buflen : iov[i].iov_len;
+
+		if (iov[i].iov_base)
+			memcpy(buf, iov[i].iov_base, cp);
+
+		iov[i].iov_len  -= cp;
+		iov[i].iov_base += cp;
+		buflen          -= cp;
+		buf             += cp;
+
+		if (iov[i].iov_len == 0)
+			i++;
+	}
+
+	*nextidx = i;
+}
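[Editor's note, not part of the patch: the gather copy above is pure logic and can be exercised in user space. A sketch, with struct kvec redefined locally and portable pointer arithmetic in place of the kernel's void-pointer arithmetic:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space sketch of memcpy_from_iov(); not part of the
 * patch.  Drains successive iovec segments into a flat destination and
 * remembers where to resume via *nextidx.  As in the driver, no bounds
 * check is done on iov; buflen must be the minimum of the source and
 * destination lengths.
 */
struct kvec {
	void *iov_base;
	size_t iov_len;
};

static void gather_from_iov(struct kvec *iov, char *buf, size_t buflen,
			    int *nextidx)
{
	int i = *nextidx;

	/* skip a single exhausted leading segment, as the driver does */
	if (buflen > 0 && iov[i].iov_len == 0)
		i++;

	while (buflen > 0) {
		size_t cp = buflen < iov[i].iov_len ? buflen : iov[i].iov_len;

		if (iov[i].iov_base)
			memcpy(buf, iov[i].iov_base, cp);

		iov[i].iov_len -= cp;
		iov[i].iov_base = (char *)iov[i].iov_base + cp;
		buflen -= cp;
		buf += cp;

		if (iov[i].iov_len == 0)
			i++;
	}

	*nextidx = i;
}
```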
+
+/**
+ * mlog_append_data_internal() - Append data record with buflen data bytes from buf.
+ * @mp:       mpool descriptor
+ * @mlh:      mlog descriptor
+ * @iov:      iovec containing user data
+ * @buflen:   length of the user buffer
+ * @sync:     if true, then we do not return until data is on media
+ * @skip_ser: client guarantees serialization
+ *
+ * Log must be open; if the log was opened with csem true, then a
+ * compaction start marker must be in place.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+static int mlog_append_data_internal(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+				     struct kvec *iov, u64 buflen, int sync, bool skip_ser)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat = &layout->eld_lstat;
+	struct omf_logrec_descriptor lrd;
+	int rc = 0, dfirst, cpidx;
+	u32 datasec;
+	u64 bufoff, rlenmax;
+	u16 aoff, abidx, asidx;
+	u16 nseclpg, sectsz;
+	off_t lpgoff;
+	char *abuf;
+
+	mlog_extract_fsetparms(lstat, &sectsz, &datasec, NULL, &nseclpg);
+
+	bufoff = 0;
+	dfirst = 1;
+	cpidx  = 0;
+
+	lrd.olr_tlen = buflen;
+
+	while (true) {
+		if ((bufoff != buflen) && (mlog_append_dmax(layout) == -1)) {
+			/*
+			 * Mlog is full and there's more to write;
+			 * mlog_append_dmax() should prevent this, but it lied.
+			 */
+			mp_pr_warn("mpool %s, mlog 0x%lx append, mlog free space incorrect",
+				   mp->pds_name, (ulong)layout->eld_objid);
+
+			return -EFBIG;
+		}
+
+		rc = mlog_update_append_idx(mp, layout, skip_ser);
+		if (rc)
+			return rc;
+
+		abidx  = lstat->lst_abidx;
+		abuf   = lstat->lst_abuf[abidx];
+		asidx  = lstat->lst_wsoff - ((nseclpg * abidx) + lstat->lst_asoff);
+		lpgoff = asidx * sectsz;
+		aoff   = lstat->lst_aoff;
+
+		ASSERT(abuf != NULL);
+
+		rlenmax = min((u64)(sectsz - aoff - OMF_LOGREC_DESC_PACKLEN),
+			      (u64)OMF_LOGREC_DESC_RLENMAX);
+
+		if (buflen - bufoff <= rlenmax) {
+			lrd.olr_rlen = buflen - bufoff;
+			if (dfirst)
+				lrd.olr_rtype = OMF_LOGREC_DATAFULL;
+			else
+				lrd.olr_rtype = OMF_LOGREC_DATALAST;
+		} else {
+			lrd.olr_rlen = rlenmax;
+			if (dfirst) {
+				lrd.olr_rtype = OMF_LOGREC_DATAFIRST;
+				dfirst = 0;
+			} else {
+				lrd.olr_rtype = OMF_LOGREC_DATAMID;
+			}
+		}
+
+		rc = omf_logrec_desc_pack_htole(&lrd, &abuf[lpgoff + aoff]);
+		if (rc) {
+			mp_pr_err("mpool %s, mlog 0x%lx, log record packing failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+			break;
+		}
+
+		lstat->lst_abdirty = true;
+
+		aoff = aoff + OMF_LOGREC_DESC_PACKLEN;
+		if (lrd.olr_rlen) {
+			memcpy_from_iov(iov, &abuf[lpgoff + aoff], lrd.olr_rlen, &cpidx);
+			aoff   = aoff + lrd.olr_rlen;
+			bufoff = bufoff + lrd.olr_rlen;
+		}
+		lstat->lst_aoff = aoff;
+
+		/*
+		 * Flush log block if sync and no more to write (or)
+		 * if the CFS is full.
+		 */
+		if ((sync && buflen == bufoff) ||
+			(abidx == MLOG_NLPGMB(lstat) - 1 && asidx == nseclpg - 1 &&
+			 sectsz - aoff < OMF_LOGREC_DESC_PACKLEN)) {
+
+			rc = mlog_logblocks_flush(mp, layout, skip_ser);
+			lstat->lst_abdirty = false;
+			if (rc) {
+				mp_pr_err("mpool %s, mlog 0x%lx, log block flush failed",
+					  rc, mp->pds_name, (ulong)layout->eld_objid);
+				break;
+			}
+		}
+
+		ASSERT(rc == 0);
+
+		if (bufoff == buflen)
+			break;
+	}
+
+	return rc;
+}
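[Editor's note, not part of the patch: the framing decision above is: a payload that fits in one record is a DATAFULL; otherwise it is split into a DATAFIRST, zero or more DATAMIDs, and a DATALAST. A user-space sketch assuming a fixed rlenmax per record (the real code recomputes rlenmax from the remaining sector space on each iteration):]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space sketch of the record framing in
 * mlog_append_data_internal(); not part of the patch.
 */
enum rec { R_FULL, R_FIRST, R_MID, R_LAST };

/*
 * Split a buflen-byte payload into records of at most rlenmax data bytes
 * each, writing the record types to out (capacity cap).  Returns the
 * number of records emitted.
 */
static int frame(size_t buflen, size_t rlenmax, enum rec *out, int cap)
{
	size_t bufoff = 0;
	int n = 0, dfirst = 1;

	while (n < cap) {
		if (buflen - bufoff <= rlenmax) {
			/* remainder fits: FULL if first record, else LAST */
			out[n++] = dfirst ? R_FULL : R_LAST;
			return n;
		}
		out[n++] = dfirst ? R_FIRST : R_MID;
		dfirst = 0;
		bufoff += rlenmax;
	}
	return n;
}
```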
+
+static int mlog_append_datav(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+			     struct kvec *iov, u64 buflen, int sync)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	s64 dmax = 0;
+	bool skip_ser = false;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	if (layout->eld_flags & MLOG_OF_SKIP_SER)
+		skip_ser = true;
+
+	if (!skip_ser)
+		pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (!lstat->lst_abuf) {
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+	} else if (lstat->lst_csem && !lstat->lst_cstart) {
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, mlog 0x%lx, inconsistent state %u %u", rc, mp->pds_name,
+			  (ulong)layout->eld_objid, lstat->lst_csem, lstat->lst_cstart);
+	} else {
+		dmax = mlog_append_dmax(layout);
+		if (dmax < 0 || buflen > dmax) {
+			rc = -EFBIG;
+			mp_pr_debug("mpool %s, mlog 0x%lx mlog full %ld",
+				    rc, mp->pds_name, (ulong)layout->eld_objid, (long)dmax);
+
+			/* Flush whatever we can. */
+			if (lstat->lst_abdirty) {
+				(void)mlog_logblocks_flush(mp, layout, skip_ser);
+				lstat->lst_abdirty = false;
+			}
+		}
+	}
+
+	if (rc) {
+		if (!skip_ser)
+			pmd_obj_wrunlock(layout);
+		return rc;
+	}
+
+	rc = mlog_append_data_internal(mp, mlh, iov, buflen, sync, skip_ser);
+	if (rc) {
+		mp_pr_err("mpool %s, mlog 0x%lx append failed",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+
+		/* Flush whatever we can. */
+		if (lstat->lst_abdirty) {
+			(void)mlog_logblocks_flush(mp, layout, skip_ser);
+			lstat->lst_abdirty = false;
+		}
+	}
+
+	if (!skip_ser)
+		pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+int mlog_append_data(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+		     char *buf, u64 buflen, int sync)
+{
+	struct kvec iov;
+
+	iov.iov_base = buf;
+	iov.iov_len  = buflen;
+
+	return mlog_append_datav(mp, mlh, &iov, buflen, sync);
+}
+
+/**
+ * mlog_read_data_init() - Initialize iterator for reading data records from log.
+ *
+ * Log must be open; skips non-data records (markers).
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_read_data_init(struct mlog_descriptor *mlh)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+	struct mlog_stat *lstat;
+	struct mlog_read_iter *lri;
+	int rc = 0;
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (!lstat->lst_abuf) {
+		rc = -ENOENT;
+	} else {
+		lri = &lstat->lst_citr;
+
+		mlog_read_iter_init(layout, lstat, lri);
+	}
+
+	pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_read_data_next_impl() - Read the next data record from an mlog.
+ * @mp:     mpool descriptor
+ * @mlh:    mlog descriptor
+ * @skip:   advance the iterator without copying the record data out
+ * @buf:    receive buffer
+ * @buflen: length of the receive buffer in bytes
+ * @rdlen:  number of bytes read (output)
+ *
+ * Return:
+ *   -EOVERFLOW: the caller must retry with a larger receive buffer,
+ *   the length of an adequate receive buffer is returned in "rdlen".
+ */
+static int mlog_read_data_next_impl(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+				    bool skip, char *buf, u64 buflen, u64 *rdlen)
+{
+	struct omf_logrec_descriptor lrd;
+	struct mlog_read_iter *lri = NULL;
+	struct pmd_layout *layout;
+	struct mlog_stat *lstat;
+
+	u64 bufoff = 0, midrec = 0;
+	bool recfirst = false;
+	bool skip_ser = false;
+	char *inbuf = NULL;
+	u32 sectsz = 0;
+	int rc = 0;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	if (!mlog_objid(layout->eld_objid))
+		return -EINVAL;
+
+	if (layout->eld_flags & MLOG_OF_SKIP_SER)
+		skip_ser = true;
+	/*
+	 * Need write lock because loading log block to read updates lstat.
+	 * Currently have no use case requiring support for concurrent readers.
+	 */
+	if (!skip_ser)
+		pmd_obj_wrlock(layout);
+
+	lstat = &layout->eld_lstat;
+	if (lstat->lst_abuf) {
+		sectsz = MLOG_SECSZ(lstat);
+		lri    = &lstat->lst_citr;
+
+		if (!lri->lri_valid) {
+			if (!skip_ser)
+				pmd_obj_wrunlock(layout);
+
+			rc = -EINVAL;
+			mp_pr_err("mpool %s, mlog 0x%lx, invalid iterator",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+			return rc;
+		}
+	}
+
+	if (!lstat || !lri) {
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+			  rc, mp->pds_name, (ulong)layout->eld_objid);
+	} else if (lri->lri_gen != layout->eld_gen ||
+		   lri->lri_soff > lstat->lst_wsoff ||
+		   (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff > lstat->lst_aoff) ||
+		   lri->lri_roff > sectsz) {
+
+		rc = -EINVAL;
+		mp_pr_err("mpool %s, mlog 0x%lx, invalid args gen %lu %lu offsets %ld %ld %u %u %u",
+			  rc, mp->pds_name, (ulong)layout->eld_objid, (ulong)lri->lri_gen,
+			  (ulong)layout->eld_gen, lri->lri_soff, lstat->lst_wsoff, lri->lri_roff,
+			  lstat->lst_aoff, sectsz);
+	} else if (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff == lstat->lst_aoff) {
+		/* Hit end of log - do not count as an error */
+		rc = -ENOMSG;
+	}
+
+	if (rc) {
+		if (!skip_ser)
+			pmd_obj_wrunlock(layout);
+		if (rc == -ENOMSG) {
+			rc = 0;
+			if (rdlen)
+				*rdlen = 0;
+		}
+
+		return rc;
+	}
+
+	bufoff = 0;
+	midrec = 0;
+
+	while (true) {
+		/* Get log block referenced by lri which can be accumulating buffer */
+		rc = mlog_logblock_load(mp, lri, &inbuf, &recfirst);
+		if (rc) {
+			if (rc == -ENOMSG) {
+				if (!skip_ser)
+					pmd_obj_wrunlock(layout);
+				rc = 0;
+				if (rdlen)
+					*rdlen = 0;
+
+				return rc;
+			}
+
+			mp_pr_err("mpool %s, mlog 0x%lx, getting log block failed",
+				  rc, mp->pds_name, (ulong)layout->eld_objid);
+			break;
+		}
+
+		if ((sectsz - lri->lri_roff) < OMF_LOGREC_DESC_PACKLEN) {
+			/* No more records in current log block */
+			if (lri->lri_soff < lstat->lst_wsoff) {
+
+				/* Move to next log block */
+				lri->lri_soff = lri->lri_soff + 1;
+				lri->lri_roff = 0;
+				continue;
+			} else {
+				/*
+				 * hit end of log; return EOF even in case
+				 * of a partial data record which is a valid
+				 * failure mode and must be ignored
+				 */
+				if (bufoff)
+					rc = -ENODATA;
+
+				bufoff = 0;	/* Force EOF on partials! */
+				break;
+			}
+		}
+
+		/* Parse next record in log block */
+		omf_logrec_desc_unpack_letoh(&lrd, &inbuf[lri->lri_roff]);
+
+		if (logrec_type_datarec(lrd.olr_rtype)) {
+			/* Data record */
+			if (lrd.olr_rtype == OMF_LOGREC_DATAFULL ||
+			    lrd.olr_rtype == OMF_LOGREC_DATAFIRST) {
+				if (midrec && !recfirst) {
+					rc = -ENODATA;
+
+					/*
+					 * This can occur mid data record only if it is the
+					 * first record in the log block, indicating a partial
+					 * data record at the end of the previous block, which
+					 * is a valid failure mode. Otherwise it is a logging
+					 * error.
+					 */
+					mp_pr_err("mpool %s, mlog 0x%lx, inconsistent 1 data rec",
+						  rc, mp->pds_name, (ulong)layout->eld_objid);
+					break;
+				}
+				/*
+				 * Reset copy-out; set midrec which is needed for DATAFIRST
+				 */
+				bufoff = 0;
+				midrec = 1;
+			} else if (lrd.olr_rtype == OMF_LOGREC_DATAMID ||
+				   lrd.olr_rtype == OMF_LOGREC_DATALAST) {
+				if (!midrec) {
+					rc = -ENODATA;
+
+					/* Must occur mid data record. */
+					mp_pr_err("mpool %s, mlog 0x%lx, inconsistent 2 data rec",
+						  rc, mp->pds_name, (ulong)layout->eld_objid);
+					break;
+				}
+			}
+
+			/*
+			 * This is inside a loop, but it is invariant;
+			 * (and it cannot be done until after the unpack)
+			 *
+			 * Return the necessary length to caller.
+			 */
+			if (buflen < lrd.olr_tlen) {
+				if (rdlen)
+					*rdlen = lrd.olr_tlen;
+
+				rc = -EOVERFLOW;
+				break;
+			}
+
+			/* Copy-out data */
+			lri->lri_roff = lri->lri_roff + OMF_LOGREC_DESC_PACKLEN;
+
+			if (!skip)
+				memcpy(&buf[bufoff], &inbuf[lri->lri_roff], lrd.olr_rlen);
+
+			lri->lri_roff = lri->lri_roff + lrd.olr_rlen;
+			bufoff = bufoff + lrd.olr_rlen;
+
+			if (lrd.olr_rtype == OMF_LOGREC_DATAFULL ||
+			    lrd.olr_rtype == OMF_LOGREC_DATALAST)
+				break;
+		} else {
+			/*
+			 * Non-data record; just skip it, unless we are mid
+			 * data record, which is a logging error.
+			 */
+			if (midrec) {
+				rc = -ENODATA;
+				mp_pr_err("mpool %s, mlog 0x%lx, inconsistent non-data record",
+					  rc, mp->pds_name, (ulong)layout->eld_objid);
+				break;
+			}
+			if (lrd.olr_rtype == OMF_LOGREC_EOLB)
+				lri->lri_roff = sectsz;
+			else
+				lri->lri_roff = lri->lri_roff + OMF_LOGREC_DESC_PACKLEN +
+					lrd.olr_rlen;
+		}
+	}
+	if (!rc && rdlen)
+		*rdlen = bufoff;
+	else if (rc != -EOVERFLOW && rc != -ENOMEM)
+		/* Iterator remains valid only on -EOVERFLOW or -ENOMEM */
+		lri->lri_valid = 0;
+
+	if (!skip_ser)
+		pmd_obj_wrunlock(layout);
+
+	return rc;
+}
+
+/**
+ * mlog_read_data_next() - Read the next data record into buffer buf of length buflen bytes.
+ *
+ * Log must be open; skips non-data records (markers).
+ *
+ * The iterator lri must be re-initialized if this returns any error
+ * other than -EOVERFLOW or -ENOMEM.
+ *
+ * Returns:
+ *   0 on success; the following errno values on failure:
+ *   -EOVERFLOW if buflen is insufficient to hold the data record; the
+ *   caller can retry with a larger buffer
+ *   -errno otherwise
+ *
+ *   On success, the number of bytes read is returned in the output
+ *   param "rdlen" (can be 0 if a zero-length data record was appended)
+ */
+int mlog_read_data_next(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+			char *buf, u64 buflen, u64 *rdlen)
+{
+	return mlog_read_data_next_impl(mp, mlh, false, buf, buflen, rdlen);
+}
+
+/**
+ * mlog_get_props() - Return basic mlog properties in prop.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+static int mlog_get_props(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+			  struct mlog_props *prop)
+{
+	struct pmd_layout *layout = mlog2layout(mlh);
+
+	if (!layout)
+		return -EINVAL;
+
+	pmd_obj_rdlock(layout);
+	mlog_getprops_cmn(mp, layout, prop);
+	pmd_obj_rdunlock(layout);
+
+	return 0;
+}
+
+/**
+ * mlog_get_props_ex() - Return extended mlog properties in prop.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_get_props_ex(struct mpool_descriptor *mp, struct mlog_descriptor  *mlh,
+		      struct mlog_props_ex *prop)
+{
+	struct pmd_layout *layout;
+	struct pd_prop *pdp;
+
+	layout = mlog2layout(mlh);
+	if (!layout)
+		return -EINVAL;
+
+	pdp = &mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_prop;
+
+	pmd_obj_rdlock(layout);
+	mlog_getprops_cmn(mp, layout, &prop->lpx_props);
+	prop->lpx_zonecnt  = layout->eld_ld.ol_zcnt;
+	prop->lpx_state    = layout->eld_state;
+	prop->lpx_secshift = PD_SECTORSZ(pdp);
+	prop->lpx_totsec   = pmd_layout_cap_get(mp, layout) >> prop->lpx_secshift;
+	pmd_obj_rdunlock(layout);
+
+	return 0;
+}
+
+void mlog_precompact_alsz(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+	struct mlog_props prop;
+	u64 len;
+	int rc;
+
+	rc = mlog_get_props(mp, mlh, &prop);
+	if (rc)
+		return;
+
+	rc = mlog_len(mp, mlh, &len);
+	if (rc)
+		return;
+
+	pmd_precompact_alsz(mp, prop.lpr_objid, len, prop.lpr_alloc_cap);
+}
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 12/22] mpool: add metadata container or mlog-pair framework
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (10 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 11/22] mpool: add mlog lifecycle management and IO routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 13/22] mpool: add utility routines for mpool lifecycle management Nabeel M Mohamed
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Metadata containers are used for storing and maintaining metadata.

The MDC APIs are implemented as helper functions built on a pair of
mlogs per MDC. They embody the concept of compaction to deal with
one of the mlogs in the pair filling up; what it means to compact
is use-case dependent.

The MDC APIs make it easy for a client to:

- Append metadata update records to the active mlog of an MDC
  until it is full (or exceeds some client-specific threshold)
- Flag the start of a compaction which marks the other mlog of
  the MDC as active
- Re-serialize its metadata by appending it to the (newly)
  active mlog of the MDC
- Flag the end of the compaction
- Continue appending metadata update records to the MDC until
  the above process repeats

The MDC API functions handle all failures, including crash
recovery, by using special markers recognized by the mlog
implementation.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mdc.c | 486 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/mdc.h | 106 ++++++++++
 2 files changed, 592 insertions(+)
 create mode 100644 drivers/mpool/mdc.c
 create mode 100644 drivers/mpool/mdc.h

diff --git a/drivers/mpool/mdc.c b/drivers/mpool/mdc.c
new file mode 100644
index 000000000000..288e6ee4670b
--- /dev/null
+++ b/drivers/mpool/mdc.c
@@ -0,0 +1,486 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "mpool_printk.h"
+#include "mpool_ioctl.h"
+#include "mpcore.h"
+#include "mp.h"
+#include "mlog.h"
+#include "mdc.h"
+
+#define mdc_logerr(_mpname, _msg, _mlh, _objid, _gen1, _gen2, _err)     \
+	mp_pr_err("mpool %s, mdc open, %s "			        \
+		  "mlog %p objid 0x%lx gen1 %lu gen2 %lu",		\
+		  (_err), (_mpname), (_msg),		                \
+		  (_mlh), (ulong)(_objid), (ulong)(_gen1),		\
+		  (ulong)(_gen2))
+
+#define OP_COMMIT      0
+#define OP_DELETE      1
+
+/**
+ * mdc_acquire() - Validate mdc handle and acquire mdc_lock
+ * @mdc: MDC handle
+ * @rw:  read/append?
+ */
+static inline int mdc_acquire(struct mp_mdc *mdc, bool rw)
+{
+	if (!mdc || mdc->mdc_magic != MPC_MDC_MAGIC || !mdc->mdc_valid)
+		return -EINVAL;
+
+	if (rw && (mdc->mdc_flags & MDC_OF_SKIP_SER))
+		return 0;
+
+	mutex_lock(&mdc->mdc_lock);
+
+	/* Validate again after acquiring lock */
+	if (!mdc->mdc_valid) {
+		mutex_unlock(&mdc->mdc_lock);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * mdc_release() - Release mdc_lock
+ * @mdc: MDC handle
+ * @rw:  read/append?
+ */
+static inline void mdc_release(struct mp_mdc *mdc, bool rw)
+{
+	if (rw && (mdc->mdc_flags & MDC_OF_SKIP_SER))
+		return;
+
+	mutex_unlock(&mdc->mdc_lock);
+}
+
+/**
+ * mdc_invalidate() - Invalidates MDC handle by resetting the magic
+ * @mdc: MDC handle
+ */
+static inline void mdc_invalidate(struct mp_mdc *mdc)
+{
+	mdc->mdc_magic = MPC_NO_MAGIC;
+}
+
+/**
+ * mdc_get_mpname() - Get mpool name from mpool descriptor
+ * @mp:     mpool descriptor
+ * @mpname: buffer to store the mpool name (output)
+ * @mplen:  buffer len
+ */
+static int mdc_get_mpname(struct mpool_descriptor *mp, char *mpname, size_t mplen)
+{
+	if (!mp || !mpname)
+		return -EINVAL;
+
+	return mpool_get_mpname(mp, mpname, mplen);
+}
+
+/**
+ * mdc_find_get() - Wrapper around get for mlog pair.
+ */
+static void mdc_find_get(struct mpool_descriptor *mp, u64 *logid, bool do_put,
+			 struct mlog_props *props, struct mlog_descriptor **mlh, int *ferr)
+{
+	int i;
+
+	for (i = 0; i < 2; ++i)
+		ferr[i] = mlog_find_get(mp, logid[i], 0, &props[i], &mlh[i]);
+
+	if (do_put && ((ferr[0] && !ferr[1]) || (ferr[1] && !ferr[0]))) {
+		if (ferr[0])
+			mlog_put(mlh[1]);
+		else
+			mlog_put(mlh[0]);
+	}
+}
+
+/**
+ * mdc_put() - Wrapper around put for mlog pair.
+ */
+static void mdc_put(struct mlog_descriptor *mlh1, struct mlog_descriptor *mlh2)
+{
+	mlog_put(mlh1);
+	mlog_put(mlh2);
+}
+
+int mp_mdc_open(struct mpool_descriptor *mp, u64 logid1, u64 logid2, u8 flags,
+		struct mp_mdc **mdc_out)
+{
+	struct mlog_descriptor *mlh[2];
+	struct mlog_props *props = NULL;
+	struct mp_mdc *mdc;
+
+	int     err = 0, err1 = 0, err2 = 0;
+	int     ferr[2] = {0};
+	u64     gen1 = 0, gen2 = 0;
+	bool    empty = false;
+	u8      mlflags = 0;
+	u64     id[2];
+	char   *mpname;
+
+	if (!mp || !mdc_out)
+		return -EINVAL;
+
+	mdc = kzalloc(sizeof(*mdc), GFP_KERNEL);
+	if (!mdc)
+		return -ENOMEM;
+
+	mdc->mdc_valid = 0;
+	mdc->mdc_mp    = mp;
+	mdc_get_mpname(mp, mdc->mdc_mpname, sizeof(mdc->mdc_mpname));
+
+	mpname = mdc->mdc_mpname;
+
+	if (logid1 == logid2) {
+		err = -EINVAL;
+		goto exit;
+	}
+
+	props = kcalloc(2, sizeof(*props), GFP_KERNEL);
+	if (!props) {
+		err = -ENOMEM;
+		goto exit;
+	}
+
+	/*
+	 * This mdc_find_get can go away once mp_mdc_open is modified to
+	 * operate on handles.
+	 */
+	id[0] = logid1;
+	id[1] = logid2;
+	mdc_find_get(mp, id, true, props, mlh, ferr);
+	if (ferr[0] || ferr[1]) {
+		err = ferr[0] ? : ferr[1];
+		goto exit;
+	}
+	mdc->mdc_logh1 = mlh[0];
+	mdc->mdc_logh2 = mlh[1];
+
+	if (flags & MDC_OF_SKIP_SER)
+		mlflags |= MLOG_OF_SKIP_SER;
+
+	mlflags |= MLOG_OF_COMPACT_SEM;
+
+	err1 = mlog_open(mp, mdc->mdc_logh1, mlflags, &gen1);
+	err2 = mlog_open(mp, mdc->mdc_logh2, mlflags, &gen2);
+
+	if (err1 && err1 != -EMSGSIZE && err1 != -EBUSY) {
+		err = err1;
+	} else if (err2 && err2 != -EMSGSIZE && err2 != -EBUSY) {
+		err = err2;
+	} else if ((err1 && err2) || (!err1 && !err2 && gen1 && gen1 == gen2)) {
+
+		err = -EINVAL;
+
+		/* Bad pair; both have failed erases/compactions or equal non-0 gens. */
+		mp_pr_err("mpool %s, mdc open, bad mlog handle, mlog1 %p logid1 0x%lx errno %d gen1 %lu, mlog2 %p logid2 0x%lx errno %d gen2 %lu",
+			err, mpname, mdc->mdc_logh1, (ulong)logid1, err1, (ulong)gen1,
+			mdc->mdc_logh2, (ulong)logid2, err2, (ulong)gen2);
+	} else {
+		/* Active log is valid log with smallest gen */
+		if (err1 || (!err2 && gen2 < gen1)) {
+			mdc->mdc_alogh = mdc->mdc_logh2;
+			if (!err1) {
+				err = mlog_empty(mp, mdc->mdc_logh1, &empty);
+				if (err)
+					mdc_logerr(mpname, "mlog1 empty check failed",
+						   mdc->mdc_logh1, logid1, gen1, gen2, err);
+			}
+			if (!err && (err1 || !empty)) {
+				err = mlog_erase(mp, mdc->mdc_logh1, gen2 + 1);
+				if (!err) {
+					err = mlog_open(mp, mdc->mdc_logh1, mlflags, &gen1);
+					if (err)
+						mdc_logerr(mpname, "mlog1 open failed",
+							   mdc->mdc_logh1, logid1, gen1, gen2, err);
+				} else {
+					mdc_logerr(mpname, "mlog1 erase failed", mdc->mdc_logh1,
+						   logid1, gen1, gen2, err);
+				}
+			}
+		} else {
+			mdc->mdc_alogh = mdc->mdc_logh1;
+			if (!err2) {
+				err = mlog_empty(mp, mdc->mdc_logh2, &empty);
+				if (err)
+					mdc_logerr(mpname, "mlog2 empty check failed",
+						   mdc->mdc_logh2, logid2, gen1, gen2, err);
+			}
+			if (!err && (err2 || gen2 == gen1 || !empty)) {
+				err = mlog_erase(mp, mdc->mdc_logh2, gen1 + 1);
+				if (!err) {
+					err = mlog_open(mp, mdc->mdc_logh2, mlflags, &gen2);
+					if (err)
+						mdc_logerr(mpname, "mlog2 open failed",
+							   mdc->mdc_logh2, logid2, gen1, gen2, err);
+				} else {
+					mdc_logerr(mpname, "mlog2 erase failed", mdc->mdc_logh2,
+						   logid2, gen1, gen2, err);
+				}
+			}
+		}
+
+		if (!err) {
+			err = mlog_empty(mp, mdc->mdc_alogh, &empty);
+			if (!err && empty) {
+				/*
+				 * First use of log pair so need to add
+				 * cstart/cend recs; above handles case of
+				 * failure between adding cstart and cend
+				 */
+				err = mlog_append_cstart(mp, mdc->mdc_alogh);
+				if (!err) {
+					err = mlog_append_cend(mp, mdc->mdc_alogh);
+					if (err)
+						mdc_logerr(mpname,
+							   "adding cend to active mlog failed",
+							   mdc->mdc_alogh,
+							   mdc->mdc_alogh == mdc->mdc_logh1 ?
+							   logid1 : logid2, gen1, gen2, err);
+				} else {
+					mdc_logerr(mpname, "adding cstart to active mlog failed",
+						   mdc->mdc_alogh,
+						   mdc->mdc_alogh == mdc->mdc_logh1 ?
+						   logid1 : logid2, gen1, gen2, err);
+				}
+
+			} else if (err) {
+				mdc_logerr(mpname, "active mlog empty check failed",
+					   mdc->mdc_alogh, mdc->mdc_alogh == mdc->mdc_logh1 ?
+					   logid1 : logid2, gen1, gen2, err);
+			}
+		}
+	}
+
+	if (!err) {
+		/*
+		 * Inform pre-compaction of the size of the active
+		 * mlog and how much is used. This is applicable
+		 * only for mpool core's internal MDCs.
+		 */
+		mlog_precompact_alsz(mp, mdc->mdc_alogh);
+
+		mdc->mdc_valid = 1;
+		mdc->mdc_magic = MPC_MDC_MAGIC;
+		mdc->mdc_flags = flags;
+		mutex_init(&mdc->mdc_lock);
+
+		*mdc_out = mdc;
+	} else {
+		err1 = mlog_close(mp, mdc->mdc_logh1);
+		err2 = mlog_close(mp, mdc->mdc_logh2);
+
+		mdc_put(mdc->mdc_logh1, mdc->mdc_logh2);
+	}
+
+exit:
+	if (err)
+		kfree(mdc);
+
+	kfree(props);
+
+	return err;
+}
+
+int mp_mdc_cstart(struct mp_mdc *mdc)
+{
+	struct mlog_descriptor *tgth = NULL;
+	struct mpool_descriptor *mp;
+	bool rw = false;
+	int rc;
+
+	if (!mdc)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	mp = mdc->mdc_mp;
+
+	if (mdc->mdc_alogh == mdc->mdc_logh1)
+		tgth = mdc->mdc_logh2;
+	else
+		tgth = mdc->mdc_logh1;
+
+	rc = mlog_append_cstart(mp, tgth);
+	if (rc) {
+		mdc_release(mdc, rw);
+
+		mp_pr_err("mpool %s, mdc %p cstart failed, mlog %p",
+			  rc, mdc->mdc_mpname, mdc, tgth);
+
+		(void)mp_mdc_close(mdc);
+
+		return rc;
+	}
+
+	mdc->mdc_alogh = tgth;
+	mdc_release(mdc, rw);
+
+	return 0;
+}
+
+int mp_mdc_cend(struct mp_mdc *mdc)
+{
+	struct mlog_descriptor *srch = NULL;
+	struct mlog_descriptor *tgth = NULL;
+	struct mpool_descriptor *mp;
+	u64 gentgt = 0;
+	bool rw = false;
+	int rc;
+
+	if (!mdc)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	mp = mdc->mdc_mp;
+
+	if (mdc->mdc_alogh == mdc->mdc_logh1) {
+		tgth = mdc->mdc_logh1;
+		srch = mdc->mdc_logh2;
+	} else {
+		tgth = mdc->mdc_logh2;
+		srch = mdc->mdc_logh1;
+	}
+
+	rc = mlog_append_cend(mp, tgth);
+	if (!rc) {
+		rc = mlog_gen(tgth, &gentgt);
+		if (!rc)
+			rc = mlog_erase(mp, srch, gentgt + 1);
+	}
+
+	if (rc) {
+		mdc_release(mdc, rw);
+
+		mp_pr_err("mpool %s, mdc %p cend failed, mlog %p",
+			  rc, mdc->mdc_mpname, mdc, tgth);
+
+		mp_mdc_close(mdc);
+
+		return rc;
+	}
+
+	mdc_release(mdc, rw);
+
+	return rc;
+}
+
+int mp_mdc_close(struct mp_mdc *mdc)
+{
+	struct mpool_descriptor *mp;
+	int rval = 0, rc;
+	bool rw = false;
+
+	if (!mdc)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	mp = mdc->mdc_mp;
+
+	mdc->mdc_valid = 0;
+
+	rc = mlog_close(mp, mdc->mdc_logh1);
+	if (rc) {
+		mp_pr_err("mpool %s, mdc %p close failed, mlog1 %p",
+			  rc, mdc->mdc_mpname, mdc, mdc->mdc_logh1);
+		rval = rc;
+	}
+
+	rc = mlog_close(mp, mdc->mdc_logh2);
+	if (rc) {
+		mp_pr_err("mpool %s, mdc %p close failed, mlog2 %p",
+			  rc, mdc->mdc_mpname, mdc, mdc->mdc_logh2);
+		rval = rc;
+	}
+
+	mdc_put(mdc->mdc_logh1, mdc->mdc_logh2);
+
+	mdc_invalidate(mdc);
+	mdc_release(mdc, false);
+
+	kfree(mdc);
+
+	return rval;
+}
+
+int mp_mdc_rewind(struct mp_mdc *mdc)
+{
+	bool rw = false;
+	int rc;
+
+	if (!mdc)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	rc = mlog_read_data_init(mdc->mdc_alogh);
+	if (rc)
+		mp_pr_err("mpool %s, mdc %p rewind failed, mlog %p",
+			  rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh);
+
+	mdc_release(mdc, rw);
+
+	return rc;
+}
+
+int mp_mdc_read(struct mp_mdc *mdc, void *data, size_t len, size_t *rdlen)
+{
+	bool rw = true;
+	int rc;
+
+	if (!mdc || !data)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	rc = mlog_read_data_next(mdc->mdc_mp, mdc->mdc_alogh, data, (u64)len, (u64 *)rdlen);
+	if (rc && rc != -EOVERFLOW)
+		mp_pr_err("mpool %s, mdc %p read failed, mlog %p len %lu",
+			  rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh, (ulong)len);
+
+	mdc_release(mdc, rw);
+
+	return rc;
+}
+
+int mp_mdc_append(struct mp_mdc *mdc, void *data, ssize_t len, bool sync)
+{
+	bool rw = true;
+	int rc;
+
+	if (!mdc || !data)
+		return -EINVAL;
+
+	rc = mdc_acquire(mdc, rw);
+	if (rc)
+		return rc;
+
+	rc = mlog_append_data(mdc->mdc_mp, mdc->mdc_alogh, data, (u64)len, sync);
+	if (rc)
+		mp_pr_rl("mpool %s, mdc %p append failed, mlog %p, len %lu sync %d",
+			 rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh, (ulong)len, sync);
+
+	mdc_release(mdc, rw);
+
+	return rc;
+}
diff --git a/drivers/mpool/mdc.h b/drivers/mpool/mdc.h
new file mode 100644
index 000000000000..7ab1de261eff
--- /dev/null
+++ b/drivers/mpool/mdc.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MDC_PRIV_H
+#define MPOOL_MDC_PRIV_H
+
+#include <linux/mutex.h>
+
+#define MPC_MDC_MAGIC           0xFEEDFEED
+#define MPC_NO_MAGIC            0xFADEFADE
+
+struct mpool_descriptor;
+struct mlog_descriptor;
+
+/**
+ * struct mp_mdc - MDC handle
+ * @mdc_mp:     mpool handle
+ * @mdc_logh1:  mlog 1 handle
+ * @mdc_logh2:  mlog 2 handle
+ * @mdc_alogh:  active mlog handle
+ * @mdc_lock:   mdc mutex
+ * @mdc_mpname: mpool name
+ * @mdc_valid:  is the handle valid?
+ * @mdc_magic:  MDC handle magic
+ * @mdc_flags:	MDC flags
+ */
+struct mp_mdc {
+	struct mpool_descriptor    *mdc_mp;
+	struct mlog_descriptor     *mdc_logh1;
+	struct mlog_descriptor     *mdc_logh2;
+	struct mlog_descriptor     *mdc_alogh;
+	struct mutex                mdc_lock;
+	char                        mdc_mpname[MPOOL_NAMESZ_MAX];
+	int                         mdc_valid;
+	int                         mdc_magic;
+	u8                          mdc_flags;
+};
+
+/* MDC (Metadata Container) APIs */
+
+/**
+ * mp_mdc_open() - Open MDC by OIDs
+ * @mp:       mpool handle
+ * @logid1:   Mlog ID 1
+ * @logid2:   Mlog ID 2
+ * @flags:    MDC Open flags (enum mdc_open_flags)
+ * @mdc_out:  MDC handle
+ */
+int
+mp_mdc_open(struct mpool_descriptor *mp, u64 logid1, u64 logid2, u8 flags, struct mp_mdc **mdc_out);
+
+/**
+ * mp_mdc_close() - Close MDC
+ * @mdc:      MDC handle
+ */
+int mp_mdc_close(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_rewind() - Rewind MDC to first record
+ * @mdc:      MDC handle
+ */
+int mp_mdc_rewind(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_read() - Read next record from MDC
+ * @mdc:      MDC handle
+ * @data:     buffer to receive data
+ * @len:      length of supplied buffer
+ * @rdlen:    number of bytes read
+ *
+ * Return:
+ *   If the return value is -EOVERFLOW, then the receive buffer "data"
+ *   is too small and must be resized according to the value returned
+ *   in "rdlen".
+ */
+int mp_mdc_read(struct mp_mdc *mdc, void *data, size_t len, size_t *rdlen);
+
+/**
+ * mp_mdc_append() - append record to MDC
+ * @mdc:      MDC handle
+ * @data:     data to write
+ * @len:      length of data
+ * @sync:     flag to defer return until IO is complete
+ */
+int mp_mdc_append(struct mp_mdc *mdc, void *data, ssize_t len, bool sync);
+
+/**
+ * mp_mdc_cstart() - Initiate MDC compaction
+ * @mdc:      MDC handle
+ *
+ * Swap the active (ostensibly full) and inactive (empty) mlogs, then
+ * append a compaction start marker to the newly active mlog.
+ */
+int mp_mdc_cstart(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_cend() - End MDC compaction
+ * @mdc:      MDC handle
+ *
+ * Append a compaction end marker to the active mlog
+ */
+int mp_mdc_cend(struct mp_mdc *mdc);
+
+#endif /* MPOOL_MDC_PRIV_H */
-- 
2.17.2



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 13/22] mpool: add utility routines for mpool lifecycle management
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (11 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 12/22] mpool: add metadata container or mlog-pair framework Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 14/22] mpool: add pool metadata routines to create persistent mpools Nabeel M Mohamed
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds utility routines to:

- Create and initialize a media class with an mpool volume
- Initialize and validate superblocks on all media class
  volumes
- Open and initialize all media class volumes
- Allocate metadata container 0 (MDC0) and update the
  superblock on capacity media class volume with metadata for
  accessing MDC0
- Create and initialize root MDC
- Initialize the mpool descriptor and track the mapping between an
  mpool UUID and its descriptor in an rbtree

When an mpool is created, a pair of mlogs are instantiated with
well-known OIDs comprising the root MDC of the mpool. The root
MDC provides a location for mpool clients to store whatever
metadata they need for start-up.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mpcore.c | 987 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 987 insertions(+)
 create mode 100644 drivers/mpool/mpcore.c

diff --git a/drivers/mpool/mpcore.c b/drivers/mpool/mpcore.c
new file mode 100644
index 000000000000..246baedcdcec
--- /dev/null
+++ b/drivers/mpool/mpcore.c
@@ -0,0 +1,987 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * Media pool (mpool) manager module.
+ *
+ * Defines functions to create and maintain mpools comprising multiple drives
+ * in multiple media classes used for storing mblocks and mlogs.
+ */
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/kref.h>
+#include <linux/rbtree.h>
+
+#include "mpool_ioctl.h"
+
+#include "mpool_printk.h"
+#include "assert.h"
+#include "uuid.h"
+
+#include "mp.h"
+#include "omf.h"
+#include "omf_if.h"
+#include "pd.h"
+#include "smap.h"
+#include "mclass.h"
+#include "pmd_obj.h"
+#include "mpcore.h"
+#include "sb.h"
+#include "upgrade.h"
+
+struct omf_devparm_descriptor;
+struct mpool_descriptor;
+
+/* Rbtree mapping mpool UUID to mpool descriptor node: uuid_to_mpdesc_rb */
+struct rb_root mpool_pools = { NULL };
+
+int uuid_to_mpdesc_insert(struct rb_root *root, struct mpool_descriptor *data)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/* Figure out where to put new node */
+	while (*new) {
+		struct mpool_descriptor *this = rb_entry(*new, struct mpool_descriptor, pds_node);
+
+		int result = mpool_uuid_compare(&data->pds_poolid, &this->pds_poolid);
+
+		parent = *new;
+		if (result < 0)
+			new = &((*new)->rb_left);
+		else if (result > 0)
+			new = &((*new)->rb_right);
+		else
+			return false;
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&data->pds_node, parent, new);
+	rb_insert_color(&data->pds_node, root);
+
+	return true;
+}
+
+static struct mpool_descriptor *
+uuid_to_mpdesc_search(struct rb_root *root, struct mpool_uuid *key_uuid)
+{
+	struct rb_node *node = root->rb_node;
+
+	while (node) {
+		struct mpool_descriptor *data = rb_entry(node, struct mpool_descriptor, pds_node);
+
+		int  result = mpool_uuid_compare(key_uuid, &data->pds_poolid);
+
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
+		else
+			return data;
+	}
+	return NULL;
+}
+
+int mpool_dev_sbwrite(struct mpool_descriptor *mp, struct mpool_dev_info *pd,
+		      struct omf_sb_descriptor *sbmdc0)
+{
+	struct omf_sb_descriptor *sb = NULL;
+	struct mc_parms mc_parms;
+	int rc;
+
+	if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+		rc = -EIO;
+		mp_pr_err("%s:%s unavailable or offline, status %d",
+			  rc, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+		return rc;
+	}
+
+	sb = kzalloc(sizeof(struct omf_sb_descriptor), GFP_KERNEL);
+	if (!sb) {
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, writing superblock on drive %s, alloc of superblock descriptor failed %lu",
+			  rc, mp->pds_name, pd->pdi_name, sizeof(struct omf_sb_descriptor));
+		return rc;
+	}
+
+	/*
+	 * Set superblock values common to all drives in the pool
+	 * (new or extant).
+	 */
+	sb->osb_magic = OMF_SB_MAGIC;
+	strlcpy((char *) sb->osb_name, mp->pds_name, sizeof(sb->osb_name));
+	sb->osb_vers = OMF_SB_DESC_VER_LAST;
+	mpool_uuid_copy(&sb->osb_poolid, &mp->pds_poolid);
+	sb->osb_gen = 1;
+
+	/* Set superblock values specific to this drive */
+	mpool_uuid_copy(&sb->osb_parm.odp_devid, &pd->pdi_devid);
+	sb->osb_parm.odp_devsz = pd->pdi_parm.dpr_devsz;
+	sb->osb_parm.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+	mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+	mc_parms2omf_devparm(&mc_parms, &sb->osb_parm);
+
+	if (sbmdc0)
+		sbutil_mdc0_copy(sb, sbmdc0);
+	else
+		sbutil_mdc0_clear(sb);
+
+	rc = sb_write_new(&pd->pdi_parm, sb);
+	if (rc) {
+		mp_pr_err("mpool %s, writing superblock on drive %s, write failed",
+			  rc, mp->pds_name, pd->pdi_name);
+	}
+
+	kfree(sb);
+	return rc;
+}
+
+/**
+ * mpool_mdc0_alloc() - Allocate space for the two MDC0 mlogs
+ * @mp: mpool descriptor
+ * @sb: superblock descriptor to update with the position of MDC0
+ *
+ * In the context of an mpool create, allocate space for the two MDC0 mlogs
+ *	and update the sb structure with the position of MDC0.
+ *
+ * Note: this function assumes that the media classes have already been
+ *	created.
+ */
+static int mpool_mdc0_alloc(struct mpool_descriptor *mp, struct omf_sb_descriptor *sb)
+{
+	struct mpool_dev_info *pd;
+	struct media_class *mc;
+	struct mpool_uuid uuid;
+	u64 zcnt, zonelen;
+	u32 cnt;
+	int rc;
+
+	sbutil_mdc0_clear(sb);
+
+	ASSERT(mp->pds_mdparm.md_mclass < MP_MED_NUMBER);
+
+	mc = &mp->pds_mc[mp->pds_mdparm.md_mclass];
+	if (mc->mc_pdmc < 0) {
+		rc = -ENOSPC;
+		mp_pr_err("%s: sb update memory image MDC0 information, not enough drives",
+			  rc, mp->pds_name);
+		return rc;
+	}
+
+	pd = &mp->pds_pdv[mc->mc_pdmc];
+
+	zonelen = (u64)pd->pdi_parm.dpr_zonepg << PAGE_SHIFT;
+	zcnt = 1 + ((mp->pds_params.mp_mdc0cap - 1) / zonelen);
+
+	cnt = sb_zones_for_sbs(&(pd->pdi_prop));
+	if (cnt < 1) {
+		rc = -EINVAL;
+		mp_pr_err("%s: sb MDC0, getting sb range failed for drive %s %u",
+			  rc, mp->pds_name, pd->pdi_name, cnt);
+		return rc;
+	}
+
+	if ((pd->pdi_zonetot - cnt) < zcnt * 2) {
+		rc = -ENOSPC;
+		mp_pr_err("%s: sb MDC0, no room for MDC0 on drive %s %lu %u %lu",
+			  rc, mp->pds_name, pd->pdi_name,
+			  (ulong)pd->pdi_zonetot, cnt, (ulong)zcnt);
+		return rc;
+	}
+
+	/* MDC0 log1/2 are allocated on the first 2 * zcnt zones. */
+	rc = pd_zone_erase(&pd->pdi_parm, cnt, zcnt * 2, true);
+	if (rc) {
+		mp_pr_err("%s: sb MDC0, erase failed on %s %u %lu",
+			  rc, mp->pds_name, pd->pdi_name, cnt, (ulong)zcnt);
+		return rc;
+	}
+
+	/*
+	 * Fill in common mdc0 log1/2 and drive info.
+	 */
+	sb->osb_mdc01gen = 1;
+	sb->osb_mdc01desc.ol_zcnt = zcnt;
+	mpool_generate_uuid(&uuid);
+	mpool_uuid_copy(&sb->osb_mdc01uuid, &uuid);
+
+	sb->osb_mdc02gen = 2;
+	sb->osb_mdc02desc.ol_zcnt = zcnt;
+	mpool_generate_uuid(&uuid);
+	mpool_uuid_copy(&sb->osb_mdc02uuid, &uuid);
+
+	mpool_uuid_copy(&sb->osb_mdc01devid, &pd->pdi_devid);
+	sb->osb_mdc01desc.ol_zaddr = cnt;
+
+	mpool_uuid_copy(&sb->osb_mdc02devid, &pd->pdi_devid);
+	sb->osb_mdc02desc.ol_zaddr = cnt + zcnt;
+
+	mpool_uuid_copy(&sb->osb_mdc0dev.odp_devid, &pd->pdi_devid);
+	sb->osb_mdc0dev.odp_devsz = pd->pdi_parm.dpr_devsz;
+	sb->osb_mdc0dev.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+	mc_parms2omf_devparm(&mc->mc_parms, &sb->osb_mdc0dev);
+
+	return 0;
+}
+
+int mpool_dev_sbwrite_newpool(struct mpool_descriptor *mp, struct omf_sb_descriptor *sbmdc0)
+{
+	struct mpool_dev_info *pd = NULL;
+	u64 pdh = 0;
+	int rc;
+
+	/* Alloc mdc0 and generate mdc0 info for superblocks */
+	rc = mpool_mdc0_alloc(mp, sbmdc0);
+	if (rc) {
+		mp_pr_err("%s: MDC0 allocation failed", rc, mp->pds_name);
+		return rc;
+	}
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+		pd = &mp->pds_pdv[pdh];
+
+		if (pd->pdi_mclass == mp->pds_mdparm.md_mclass)
+			rc = mpool_dev_sbwrite(mp, pd, sbmdc0);
+		else
+			rc = mpool_dev_sbwrite(mp, pd, NULL);
+		if (rc) {
+			mp_pr_err("%s: sb write %s failed, %d %d", rc, mp->pds_name,
+				  pd->pdi_name, pd->pdi_mclass, mp->pds_mdparm.md_mclass);
+			break;
+		}
+	}
+
+	return rc;
+}
+
+int mpool_mdc0_sb2obj(struct mpool_descriptor *mp, struct omf_sb_descriptor *sb,
+		      struct pmd_layout **l1, struct pmd_layout **l2)
+{
+	int rc, i;
+
+	/* MDC0 mlog1 layout */
+	*l1 = pmd_layout_alloc(&sb->osb_mdc01uuid, MDC0_OBJID_LOG1, sb->osb_mdc01gen, 0,
+			       sb->osb_mdc01desc.ol_zcnt);
+	if (!*l1) {
+		*l1 = *l2 = NULL;
+
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, MDC0 mlog1 allocation failed", rc, mp->pds_name);
+		return rc;
+	}
+
+	(*l1)->eld_state = PMD_LYT_COMMITTED;
+
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid, &sb->osb_mdc01devid) == 0) {
+			(*l1)->eld_ld.ol_pdh = i;
+			(*l1)->eld_ld.ol_zaddr = sb->osb_mdc01desc.ol_zaddr;
+			break;
+		}
+	}
+
+	if (i >= mp->pds_pdvcnt) {
+		char uuid_str[40];
+
+		/* Should never happen */
+		pmd_obj_put(*l1);
+		*l1 = *l2 = NULL;
+
+		mpool_unparse_uuid(&sb->osb_mdc01devid, uuid_str);
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, allocating MDC0 mlog1, can't find handle for pd uuid %s",
+			  rc, mp->pds_name, uuid_str);
+
+		return rc;
+	}
+
+	/* MDC0 mlog2 layout */
+	*l2 = pmd_layout_alloc(&sb->osb_mdc02uuid, MDC0_OBJID_LOG2, sb->osb_mdc02gen, 0,
+			       sb->osb_mdc02desc.ol_zcnt);
+	if (!*l2) {
+		pmd_obj_put(*l1);
+
+		*l1 = *l2 = NULL;
+
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, MDC0 mlog2 allocation failed", rc, mp->pds_name);
+		return rc;
+	}
+
+	(*l2)->eld_state = PMD_LYT_COMMITTED;
+
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid, &sb->osb_mdc02devid) == 0) {
+			(*l2)->eld_ld.ol_pdh = i;
+			(*l2)->eld_ld.ol_zaddr = sb->osb_mdc02desc.ol_zaddr;
+			break;
+		}
+	}
+
+	if (i >= mp->pds_pdvcnt) {
+		char uuid_str[40];
+
+		/* Should never happen */
+		pmd_obj_put(*l1);
+		pmd_obj_put(*l2);
+		*l1 = *l2 = NULL;
+
+		mpool_unparse_uuid(&sb->osb_mdc02devid, uuid_str);
+		rc = -ENOENT;
+		mp_pr_err("mpool %s, allocating MDC0 mlog2, can't find handle for pd uuid %s",
+			  rc, mp->pds_name, uuid_str);
+
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * mpool_dev_check_new() - Check whether a drive is ready to be added to an mpool.
+ * @mp: mpool descriptor
+ * @pd: drive to check
+ */
+int mpool_dev_check_new(struct mpool_descriptor *mp, struct mpool_dev_info *pd)
+{
+	int rval, rc;
+
+	if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+		rc = -EIO;
+		mp_pr_err("%s:%s unavailable or offline, status %d",
+			  rc, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+		return rc;
+	}
+
+	/* Confirm drive does not contain mpool magic value */
+	rval = sb_magic_check(&pd->pdi_parm);
+	if (rval) {
+		if (rval < 0) {
+			rc = rval;
+			mp_pr_err("%s:%s read sb magic failed", rc, mp->pds_name, pd->pdi_name);
+			return rc;
+		}
+
+		rc = -EBUSY;
+		mp_pr_err("%s:%s sb magic already exists", rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	return 0;
+}
+
+int mpool_desc_pdmc_add(struct mpool_descriptor *mp, u16 pdh,
+			struct omf_devparm_descriptor *omf_devparm, bool check_only)
+{
+	struct mpool_dev_info *pd = NULL;
+	struct media_class *mc;
+	struct mc_parms mc_parms;
+	int rc;
+
+	pd = &mp->pds_pdv[pdh];
+	if (omf_devparm == NULL)
+		mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+	else
+		mc_omf_devparm2mc_parms(omf_devparm, &mc_parms);
+
+	if (!mclass_isvalid(mc_parms.mcp_classp)) {
+		rc = -EINVAL;
+		mp_pr_err("%s: media class %u of %s is undefined", rc, mp->pds_name,
+			  mc_parms.mcp_classp, pd->pdi_name);
+		return rc;
+	}
+
+	/*
+	 * Devices that do not support updatable sectors can't be included
+	 * in an mpool. Do not check if in the context of an unavailable PD
+	 * during activate, because it is impossible to determine the PD
+	 * properties.
+	 */
+	if ((omf_devparm == NULL) && !(pd->pdi_cmdopt & PD_CMD_SECTOR_UPDATABLE)) {
+		rc = -EINVAL;
+		mp_pr_err("%s: device %s sectors not updatable", rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	mc = &mp->pds_mc[mc_parms.mcp_classp];
+	if (mc->mc_pdmc < 0) {
+		struct mc_smap_parms mcsp;
+
+		/*
+		 * No media class corresponding to the PD class yet, create one.
+		 */
+		rc = mc_smap_parms_get(&mp->pds_mc[mc_parms.mcp_classp], &mp->pds_params, &mcsp);
+		if (rc)
+			return rc;
+
+		if (!check_only)
+			mc_init_class(mc, &mc_parms, &mcsp);
+	} else {
+		rc = -EINVAL;
+		mp_pr_err("%s: add %s, only 1 device allowed per media class",
+			  rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	if (check_only)
+		return 0;
+
+	mc->mc_pdmc = pdh;
+
+	return 0;
+}
+
+/**
+ * mpool_desc_init_newpool() - Create the media classes and add all the mpool PDs
+ * @mp: mpool descriptor
+ * @flags: enum mp_mgmt_flags
+ *
+ * Called on mpool create.
+ * Create the media classes and add each mpool PD to its media class.
+ * Update the metadata media class in mp->pds_mdparm.
+ *
+ * Note: the PD properties (pd->pdi_parm.dpr_prop) must be updated
+ * and correct when entering this function.
+ */
+int mpool_desc_init_newpool(struct mpool_descriptor *mp, u32 flags)
+{
+	u64 pdh = 0;
+	int rc;
+
+	if (!(flags & (1 << MP_FLAGS_FORCE))) {
+		rc = mpool_dev_check_new(mp, &mp->pds_pdv[pdh]);
+		if (rc)
+			return rc;
+	}
+
+	/*
+	 * Add the drive to its media class, creating the class if this
+	 * is the first drive of the class.
+	 */
+	rc = mpool_desc_pdmc_add(mp, pdh, NULL, false);
+	if (rc) {
+		struct mpool_dev_info *pd __maybe_unused;
+
+		pd = &mp->pds_pdv[pdh];
+
+		mp_pr_err("mpool %s, mpool desc init, adding drive %s in a media class failed",
+			  rc, mp->pds_name, pd->pdi_name);
+		return rc;
+	}
+
+	mp->pds_mdparm.md_mclass = mp->pds_pdv[pdh].pdi_mclass;
+
+	return 0;
+}
+
+int mpool_dev_init_all(struct mpool_dev_info *pdv, u64 dcnt, char **dpaths,
+		       struct pd_prop *pd_prop)
+{
+	char *pdname;
+	int idx, rc;
+
+	if (dcnt == 0)
+		return -EINVAL;
+
+	for (rc = 0, idx = 0; idx < dcnt; idx++, pd_prop++) {
+		rc = pd_dev_open(dpaths[idx], &pdv[idx].pdi_parm, pd_prop);
+		if (rc) {
+			mp_pr_err("opening device %s failed", rc, dpaths[idx]);
+			break;
+		}
+
+		pdname = strrchr(dpaths[idx], '/');
+		pdname = pdname ? pdname + 1 : dpaths[idx];
+		strlcpy(pdv[idx].pdi_name, pdname, sizeof(pdv[idx].pdi_name));
+
+		mpool_pd_status_set(&pdv[idx], PD_STAT_ONLINE);
+	}
+
+	while (rc && idx-- > 0)
+		pd_dev_close(&pdv[idx].pdi_parm);
+
+	return rc;
+}
+
+void mpool_mdc_cap_init(struct mpool_descriptor *mp, struct mpool_dev_info *pd)
+{
+	u64 zonesz, defmbsz;
+
+	zonesz = (pd->pdi_zonepg << PAGE_SHIFT) >> 20;
+	defmbsz = MPOOL_MBSIZE_MB_DEFAULT;
+
+	if (mp->pds_params.mp_mdc0cap == 0) {
+		mp->pds_params.mp_mdc0cap = max_t(u64, defmbsz, zonesz);
+		mp->pds_params.mp_mdc0cap <<= 20;
+	}
+
+	if (mp->pds_params.mp_mdcncap == 0) {
+		mp->pds_params.mp_mdcncap = max_t(u64, zonesz, (256 / zonesz));
+		mp->pds_params.mp_mdcncap <<= 20;
+	}
+}
+
+/**
+ * mpool_desc_init_sb() - Read the superblocks of the PDs.
+ * @mp: mpool descriptor
+ * @sbmdc0: output. MDC0 information stored in the superblocks.
+ * @flags: activation flags
+ * @mc_resize: output. Records the media classes that were resized.
+ *
+ * Adjust the discovered PD properties stored in pd->pdi_parm.dpr_prop with
+ * the PD parameters from the superblock. Some discovered PD properties are
+ * defaults (like zone size) and need to be adjusted to what the PD actually
+ * uses.
+ */
+int mpool_desc_init_sb(struct mpool_descriptor *mp, struct omf_sb_descriptor *sbmdc0,
+		       u32 flags, bool *mc_resize)
+{
+	struct omf_sb_descriptor *sb = NULL;
+	struct mpool_dev_info *pd = NULL;
+	u16 omf_ver = OMF_SB_DESC_UNDEF;
+	bool mdc0found = false;
+	bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+	u8 pdh = 0;
+	int rc;
+
+	sb = kzalloc(sizeof(*sb), GFP_KERNEL);
+	if (!sb) {
+		rc = -ENOMEM;
+		mp_pr_err("sb desc alloc failed %lu", rc, (ulong)sizeof(*sb));
+		return rc;
+	}
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+		struct omf_devparm_descriptor *dparm;
+		bool resize = false;
+		int i;
+
+		pd = &mp->pds_pdv[pdh];
+		if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+			rc = -EIO;
+			mp_pr_err("pd %s unavailable or offline, status %d",
+				  rc, pd->pdi_name, mpool_pd_status_get(pd));
+			kfree(sb);
+			return rc;
+		}
+
+		/*
+		 * Read superblock; init and validate pool drive info
+		 * from device parameters stored in the super block.
+		 */
+		rc = sb_read(&pd->pdi_parm, sb, &omf_ver, force);
+		if (rc) {
+			mp_pr_err("sb read from %s failed", rc, pd->pdi_name);
+			kfree(sb);
+			return rc;
+		}
+
+		if (!pdh) {
+			size_t n __maybe_unused;
+
+			/*
+			 * First drive; confirm pool not open; set pool-wide
+			 * properties
+			 */
+			if (uuid_to_mpdesc_search(&mpool_pools, &sb->osb_poolid)) {
+				char *uuid_str;
+
+				uuid_str = kmalloc(MPOOL_UUID_STRING_LEN + 1, GFP_KERNEL);
+				if (uuid_str)
+					mpool_unparse_uuid(&sb->osb_poolid, uuid_str);
+
+				rc = -EBUSY;
+				mp_pr_err("%s: mpool already activated, id %s, pd name %s",
+					  rc, sb->osb_name, uuid_str, pd->pdi_name);
+				kfree(sb);
+				kfree(uuid_str);
+				return rc;
+			}
+			mpool_uuid_copy(&mp->pds_poolid, &sb->osb_poolid);
+
+			n = strlcpy(mp->pds_name, (char *)sb->osb_name, sizeof(mp->pds_name));
+			ASSERT(n < sizeof(mp->pds_name));
+		} else {
+			/* Second or later drive; validate pool-wide properties */
+			if (mpool_uuid_compare(&sb->osb_poolid, &mp->pds_poolid) != 0) {
+				char *uuid_str1, *uuid_str2 = NULL;
+
+				uuid_str1 = kmalloc(2 * (MPOOL_UUID_STRING_LEN + 1), GFP_KERNEL);
+				if (uuid_str1) {
+					uuid_str2 = uuid_str1 + MPOOL_UUID_STRING_LEN + 1;
+					mpool_unparse_uuid(&sb->osb_poolid, uuid_str1);
+					mpool_unparse_uuid(&mp->pds_poolid, uuid_str2);
+				}
+
+				rc = -EINVAL;
+				mp_pr_err("%s: pd %s, mpool id %s different from prior id %s",
+					  rc, mp->pds_name, pd->pdi_name, uuid_str1, uuid_str2);
+				kfree(sb);
+				kfree(uuid_str1);
+				return rc;
+			}
+		}
+
+		dparm = &sb->osb_parm;
+		if (!force && pd->pdi_devsz > dparm->odp_devsz) {
+			mp_pr_info("%s: pd %s, discovered size %lu > on-media size %lu",
+				mp->pds_name, pd->pdi_name,
+				(ulong)pd->pdi_devsz, (ulong)dparm->odp_devsz);
+
+			if ((flags & (1 << MP_FLAGS_RESIZE)) == 0) {
+				pd->pdi_devsz = dparm->odp_devsz;
+			} else {
+				dparm->odp_devsz = pd->pdi_devsz;
+				dparm->odp_zonetot = pd->pdi_devsz / (pd->pdi_zonepg << PAGE_SHIFT);
+
+				pd->pdi_zonetot = dparm->odp_zonetot;
+				resize = true;
+			}
+		}
+
+		/* Validate mdc0 info in superblock if present */
+		if (!sbutil_mdc0_isclear(sb)) {
+			if (!force && !sbutil_mdc0_isvalid(sb)) {
+				rc = -EINVAL;
+				mp_pr_err("%s: pd %s, invalid sb MDC0",
+					  rc, mp->pds_name, pd->pdi_name);
+				kfree(sb);
+				return rc;
+			}
+
+			dparm = &sb->osb_mdc0dev;
+			if (resize) {
+				ASSERT(pd->pdi_devsz > dparm->odp_devsz);
+
+				dparm->odp_devsz = pd->pdi_devsz;
+				dparm->odp_zonetot = pd->pdi_devsz / (pd->pdi_zonepg << PAGE_SHIFT);
+			}
+
+			sbutil_mdc0_copy(sbmdc0, sb);
+			mdc0found = true;
+		}
+
+		/* Set drive info confirming devid is unique and zone parms match */
+		for (i = 0; i < pdh; i++) {
+			if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid,
+					       &sb->osb_parm.odp_devid) == 0) {
+				char *uuid_str;
+
+				uuid_str = kmalloc(MPOOL_UUID_STRING_LEN + 1, GFP_KERNEL);
+				if (uuid_str)
+					mpool_unparse_uuid(&sb->osb_parm.odp_devid, uuid_str);
+				rc = -EINVAL;
+				mp_pr_err("%s: pd %s, duplicate devices, uuid %s",
+					  rc, mp->pds_name, pd->pdi_name, uuid_str);
+				kfree(uuid_str);
+				kfree(sb);
+				return rc;
+			}
+		}
+
+		if (omf_ver > OMF_SB_DESC_VER_LAST) {
+			rc = -EOPNOTSUPP;
+			mp_pr_err("%s: unsupported sb version %d", rc, mp->pds_name, omf_ver);
+			kfree(sb);
+			return rc;
+		} else if (!force && (omf_ver < OMF_SB_DESC_VER_LAST || resize)) {
+			if ((flags & (1 << MP_FLAGS_PERMIT_META_CONV)) == 0) {
+				struct omf_mdcver *mdcver;
+				char *buf1, *buf2 = NULL;
+
+				/*
+				 * We must get permission from the user to
+				 * update the mpool metadata.
+				 */
+				mdcver = omf_sbver_to_mdcver(omf_ver);
+				ASSERT(mdcver != NULL);
+
+				buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+				if (buf1) {
+					buf2 = buf1 + MAX_MDCVERSTR;
+					omfu_mdcver_to_str(mdcver, buf1, MAX_MDCVERSTR);
+					omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+				}
+
+				rc = -EPERM;
+				mp_pr_err("%s: reqd sb upgrade from version %s (%s) to %s (%s)",
+					  rc, mp->pds_name,
+					  buf1, omfu_mdcver_comment(mdcver) ?: "",
+					  buf2, omfu_mdcver_comment(omfu_mdcver_cur()));
+				kfree(buf1);
+				kfree(sb);
+				return rc;
+			}
+
+			/* We need to overwrite the old version superblock on the device */
+			rc = sb_write_update(&pd->pdi_parm, sb);
+			if (rc) {
+				mp_pr_err("%s: pd %s, failed to convert or overwrite mpool sb",
+					  rc, mp->pds_name, pd->pdi_name);
+				kfree(sb);
+				return rc;
+			}
+
+			if (!resize)
+				mp_pr_info("%s: pd %s, Convert mpool sb, oldv %d newv %d",
+					   mp->pds_name, pd->pdi_name, omf_ver, sb->osb_vers);
+		}
+
+		mpool_uuid_copy(&pd->pdi_devid, &sb->osb_parm.odp_devid);
+
+		/* Add drive in its media class. Create the media class if not yet created. */
+		rc = mpool_desc_pdmc_add(mp, pdh, NULL, false);
+		if (rc) {
+			mp_pr_err("%s: pd %s, adding drive in a media class failed",
+				  rc, mp->pds_name, pd->pdi_name);
+
+			kfree(sb);
+			return rc;
+		}
+
+		/*
+		 * Record the media class used by the MDC0 metadata.
+		 */
+		if (mdc0found)
+			mp->pds_mdparm.md_mclass = pd->pdi_mclass;
+
+		if (resize && mc_resize)
+			mc_resize[pd->pdi_mclass] = resize;
+	}
+
+	if (!mdc0found) {
+		rc = -EINVAL;
+		mp_pr_err("%s: MDC0 not found", rc, mp->pds_name);
+		kfree(sb);
+		return rc;
+	}
+
+	kfree(sb);
+
+	return 0;
+}
+
+static int comp_func(const void *c1, const void *c2)
+{
+	return strcmp(*(char **)c1, *(char **)c2);
+}
+
+int check_for_dups(char **listv, int cnt, int *dup, int *offset)
+{
+	const char **sortedv;
+	const char *prev;
+	int rc, i;
+
+	*dup = 0;
+	*offset = -1;
+
+	if (cnt < 2)
+		return 0;
+
+	sortedv = kcalloc(cnt + 1, sizeof(char *), GFP_KERNEL);
+	if (!sortedv) {
+		rc = -ENOMEM;
+		mp_pr_err("kcalloc failed for %d paths, first path %s", rc, cnt, *listv);
+		return rc;
+	}
+
+	/* Make a shallow copy */
+	for (i = 0; i < cnt; i++)
+		sortedv[i] = listv[i];
+
+	sortedv[i] = NULL;
+
+	sort(sortedv, cnt, sizeof(char *), comp_func, NULL);
+
+	prev = sortedv[0];
+	for (i = 1; i < cnt; i++) {
+		if (strcmp(sortedv[i], prev) == 0) {
+			mp_pr_info("path %s is duplicated", prev);
+			*dup = 1;
+			break;
+		}
+
+		prev = sortedv[i];
+	}
+
+	/* Find offset, prev points to first dup */
+	if (*dup) {
+		for (i = 0; i < cnt; i++) {
+			if (prev == listv[i]) {
+				*offset = i;
+				break;
+			}
+		}
+	}
+
+	kfree(sortedv);
+	return 0;
+}
+
+void fill_in_devprops(struct mpool_descriptor *mp, u64 pdh, struct mpool_devprops *dprop)
+{
+	struct mpool_dev_info *pd;
+	struct media_class *mc;
+	int rc;
+
+	pd = &mp->pds_pdv[pdh];
+	memcpy(dprop->pdp_devid.b, pd->pdi_devid.uuid, MPOOL_UUID_SIZE);
+
+	mc = &mp->pds_mc[pd->pdi_mclass];
+	dprop->pdp_mclassp = mc->mc_parms.mcp_classp;
+	dprop->pdp_status  = mpool_pd_status_get(pd);
+
+	rc = smap_drive_usage(mp, pdh, dprop);
+	if (rc) {
+		mp_pr_err("mpool %s, can't get drive usage, media class %d",
+			  rc, mp->pds_name, dprop->pdp_mclassp);
+	}
+}
+
+int mpool_desc_unavail_add(struct mpool_descriptor *mp, struct omf_devparm_descriptor *omf_devparm)
+{
+	struct mpool_dev_info *pd = NULL;
+	char uuid_str[40];
+	int rc;
+
+	mpool_unparse_uuid(&omf_devparm->odp_devid, uuid_str);
+
+	mp_pr_warn("Activating mpool %s, adding unavailable drive %s", mp->pds_name, uuid_str);
+
+	if (mp->pds_pdvcnt >= MPOOL_DRIVES_MAX) {
+		rc = -EINVAL;
+		mp_pr_err("Activating mpool %s, adding an unavailable drive, too many drives",
+			  rc, mp->pds_name);
+		return rc;
+	}
+
+	pd = &mp->pds_pdv[mp->pds_pdvcnt];
+
+	mpool_uuid_copy(&pd->pdi_devid, &omf_devparm->odp_devid);
+
+	/* Update the PD properties from the metadata record. */
+	mpool_pd_status_set(pd, PD_STAT_UNAVAIL);
+	pd_dev_set_unavail(&pd->pdi_parm, omf_devparm);
+
+	/* Add the PD in its media class. */
+	rc = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt, omf_devparm, false);
+	if (rc)
+		return rc;
+
+	mp->pds_pdvcnt++;
+
+	return 0;
+}
+
+int mpool_create_rmlogs(struct mpool_descriptor *mp, u64 mlog_cap)
+{
+	struct mlog_descriptor *ml_desc;
+	struct mlog_capacity mlcap = {
+		.lcp_captgt = mlog_cap,
+	};
+	struct mlog_props mlprops;
+	u64 root_mlog_id[2];
+	int rc, i;
+
+	mlog_lookup_rootids(&root_mlog_id[0], &root_mlog_id[1]);
+
+	for (i = 0; i < 2; ++i) {
+		rc = mlog_find_get(mp, root_mlog_id[i], 1, NULL, &ml_desc);
+		if (!rc) {
+			mlog_put(ml_desc);
+			continue;
+		}
+
+		if (rc != -ENOENT) {
+			mp_pr_err("mpool %s, root mlog find 0x%lx failed",
+				  rc, mp->pds_name, (ulong)root_mlog_id[i]);
+			return rc;
+		}
+
+		rc = mlog_realloc(mp, root_mlog_id[i], &mlcap,
+				  MP_MED_CAPACITY, &mlprops, &ml_desc);
+		if (rc) {
+			mp_pr_err("mpool %s, root mlog realloc 0x%lx failed",
+				  rc, mp->pds_name, (ulong)root_mlog_id[i]);
+			return rc;
+		}
+
+		if (mlprops.lpr_objid != root_mlog_id[i]) {
+			mlog_put(ml_desc);
+			rc = -ENOENT;
+			mp_pr_err("mpool %s, root mlog mismatch 0x%lx 0x%lx", rc,
+				  mp->pds_name, (ulong)root_mlog_id[i], (ulong)mlprops.lpr_objid);
+			return rc;
+		}
+
+		rc = mlog_commit(mp, ml_desc);
+		if (rc) {
+			if (mlog_abort(mp, ml_desc))
+				mlog_put(ml_desc);
+
+			mp_pr_err("mpool %s, root mlog commit 0x%lx failed",
+				  rc, mp->pds_name, (ulong)root_mlog_id[i]);
+			return rc;
+		}
+
+		mlog_put(ml_desc);
+	}
+
+	return rc;
+}
+
+struct mpool_descriptor *mpool_desc_alloc(void)
+{
+	struct mpool_descriptor *mp;
+	int i;
+
+	mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+	if (!mp)
+		return NULL;
+
+	init_rwsem(&mp->pds_pdvlock);
+
+	mutex_init(&mp->pds_oml_lock);
+	mp->pds_oml_root = RB_ROOT;
+
+	mp->pds_mdparm.md_mclass = MP_MED_INVALID;
+
+	mpcore_params_defaults(&mp->pds_params);
+
+	for (i = 0; i < MP_MED_NUMBER; i++)
+		mp->pds_mc[i].mc_pdmc = -1;
+
+	return mp;
+}
+
+/*
+ * Remove mp from mpool_pools, close all devices, and free mp.
+ */
+void mpool_desc_free(struct mpool_descriptor *mp)
+{
+	struct mpool_descriptor *found_mp = NULL;
+	struct mpool_uuid uuid_zero;
+	int i;
+
+	mpool_uuid_clear(&uuid_zero);
+
+	/*
+	 * Handle case where poolid and devid not in mappings
+	 * which can happen when cleaning up from failed create/open.
+	 */
+	found_mp = uuid_to_mpdesc_search(&mpool_pools, &mp->pds_poolid);
+	if (found_mp)
+		rb_erase(&found_mp->pds_node, &mpool_pools);
+
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		if (mpool_pd_status_get(&mp->pds_pdv[i]) != PD_STAT_UNAVAIL)
+			pd_dev_close(&mp->pds_pdv[i].pdi_parm);
+	}
+
+	kfree(mp);
+}
-- 
2.17.2



* [PATCH v2 14/22] mpool: add pool metadata routines to create persistent mpools
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (12 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 13/22] mpool: add utility routines for mpool lifecycle management Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 15/22] mpool: add mpool lifecycle management routines Nabeel M Mohamed
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

Mpool metadata is stored in metadata containers (MDC). An
mpool can have a maximum of 256 MDCs, MDC-0 through MDC-255.
The following metadata manager functionality is added here
for object persistence:

- Initialize and validate MDC0
- Allocate and initialize MDC 1-N. An mpool is created with
  16 MDCs to provide the requisite concurrency.
- Dynamically scale up the number of MDCs when running low
  on space and the garbage is below a certain threshold
  across all MDCs
- Deserialize metadata records from MDC 0-N at mpool activation
  and set up the corresponding in-memory structures
- Pre-compact MDC-K based on its usage, when the garbage in
  MDC-K is above a certain threshold. A pre-compacting MDC is
  not chosen for object allocation.

MDC0 is a distinguished container that stores both the metadata
for accessing MDC-1 through MDC-255 and all mpool properties.
MDC-1 through MDC-255 store the metadata for accessing client
allocated mblocks and mlogs. Metadata for accessing the mlogs
comprising MDC-0 is in the superblock for the capacity media
class.

In the context of MDC-1/255, compacting MDC-K is simply
serializing the in-memory metadata for accessing the still-live
client objects associated with MDC-K. In the context of MDC-0,
compacting is simply serializing the in-memory mpool properties
and in-memory metadata for accessing MDC-1/255.

An instance of struct pmd_mdc_info is created for each MDC in
an mpool. This struct hosts both the uncommitted and committed
object trees and a lock protecting each of the two trees.
Compacting an MDC requires freezing both the list of committed
objects in that MDC and the metadata for those objects,
which is facilitated by the compact lock in each MDC instance.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/pmd.c     | 2046 +++++++++++++++++++++++++++++++++++++++
 drivers/mpool/pmd_obj.c |    8 -
 2 files changed, 2046 insertions(+), 8 deletions(-)
 create mode 100644 drivers/mpool/pmd.c

diff --git a/drivers/mpool/pmd.c b/drivers/mpool/pmd.c
new file mode 100644
index 000000000000..07e08b5eed43
--- /dev/null
+++ b/drivers/mpool/pmd.c
@@ -0,0 +1,2046 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Pool metadata (pmd) module.
+ *
+ * Defines functions for probing, reading, and writing drives in an mpool.
+ *
+ */
+
+#include <linux/workqueue.h>
+#include <linux/atomic.h>
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "mpool_ioctl.h"
+#include "mdc.h"
+#include "upgrade.h"
+#include "smap.h"
+#include "omf_if.h"
+#include "mpcore.h"
+#include "pmd.h"
+
+static DEFINE_MUTEX(pmd_s_lock);
+
+#define pmd_co_foreach(_cinfo, _node) \
+	for ((_node) = rb_first(&(_cinfo)->mmi_co_root); (_node); (_node) = rb_next((_node)))
+
+static int pmd_mdc0_validate(struct mpool_descriptor *mp, int activation);
+
+static void pmd_mda_init(struct mpool_descriptor *mp)
+{
+	int i;
+
+	spin_lock_init(&mp->pds_mda.mdi_slotvlock);
+	mp->pds_mda.mdi_slotvcnt = 0;
+
+	for (i = 0; i < MDC_SLOTS; ++i) {
+		struct pmd_mdc_info *pmi = mp->pds_mda.mdi_slotv + i;
+
+		mutex_init(&pmi->mmi_compactlock);
+		mutex_init(&pmi->mmi_uc_lock);
+		pmi->mmi_uc_root = RB_ROOT;
+		init_rwsem(&pmi->mmi_co_lock);
+		pmi->mmi_co_root = RB_ROOT;
+		mutex_init(&pmi->mmi_uqlock);
+		pmi->mmi_luniq = 0;
+		pmi->mmi_recbuf = NULL;
+		pmi->mmi_lckpt = objid_make(0, OMF_OBJ_UNDEF, i);
+		memset(&pmi->mmi_stats, 0, sizeof(pmi->mmi_stats));
+
+		/*
+		 * Initial mpool metadata content version.
+		 */
+		pmi->mmi_mdcver.mdcv_major = 1;
+		pmi->mmi_mdcver.mdcv_minor = 0;
+		pmi->mmi_mdcver.mdcv_patch = 0;
+		pmi->mmi_mdcver.mdcv_dev   = 0;
+
+		pmi->mmi_credit.ci_slot = i;
+
+		mutex_init(&pmi->mmi_stats_lock);
+	}
+
+	mp->pds_mda.mdi_slotv[1].mmi_luniq = UROOT_OBJID_MAX;
+	mp->pds_mda.mdi_sel.mds_tbl_idx.counter = 0;
+}
+
+static void pmd_mda_free(struct mpool_descriptor *mp)
+{
+	int sidx;
+
+	/*
+	 * Close MDC0 last because closing other MDC logs can result in
+	 * MDC0 updates.
+	 */
+	for (sidx = mp->pds_mda.mdi_slotvcnt - 1; sidx > -1; sidx--) {
+		struct pmd_layout      *layout, *tmp;
+		struct pmd_mdc_info    *cinfo;
+
+		cinfo = &mp->pds_mda.mdi_slotv[sidx];
+
+		mp_mdc_close(cinfo->mmi_mdc);
+		kfree(cinfo->mmi_recbuf);
+		cinfo->mmi_recbuf = NULL;
+
+		/* Release committed objects... */
+		rbtree_postorder_for_each_entry_safe(
+			layout, tmp, &cinfo->mmi_co_root, eld_nodemdc) {
+
+			pmd_obj_put(layout);
+		}
+
+		/* Release uncommitted objects... */
+		rbtree_postorder_for_each_entry_safe(
+			layout, tmp, &cinfo->mmi_uc_root, eld_nodemdc) {
+
+			pmd_obj_put(layout);
+		}
+	}
+}
+
+static int pmd_mdc0_init(struct mpool_descriptor *mp, struct pmd_layout *mdc01,
+		     struct pmd_layout *mdc02)
+{
+	struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[0];
+	int rc;
+
+	cinfo->mmi_recbuf = kzalloc(OMF_MDCREC_PACKLEN_MAX, GFP_KERNEL);
+	if (!cinfo->mmi_recbuf) {
+		rc = -ENOMEM;
+		mp_pr_err("mpool %s, log rec buffer alloc %zu failed",
+			  rc, mp->pds_name, OMF_MDCREC_PACKLEN_MAX);
+		return rc;
+	}
+
+	/*
+	 * We put the MDC0 mlog layouts in MDC 0 because MDC0 mlog objids have
+	 * a slot # of 0, so the rest of the code expects to find the layout
+	 * there. This allows the majority of the code to treat MDC0 mlog
+	 * metadata exactly the same as for MDCN (and user mlogs), even though
+	 * MDC0 metadata is actually stored in superblocks. However, there are
+	 * a few places that need to recognize that MDC0 mlogs are special,
+	 * including pmd_mdc_compact() and pmd_obj_erase().
+	 */
+
+	mp->pds_mda.mdi_slotvcnt = 1;
+	pmd_co_insert(cinfo, mdc01);
+	pmd_co_insert(cinfo, mdc02);
+
+	rc = mp_mdc_open(mp, mdc01->eld_objid, mdc02->eld_objid, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+	if (rc) {
+		mp_pr_err("mpool %s, MDC0 open failed", rc, mp->pds_name);
+
+		pmd_co_remove(cinfo, mdc01);
+		pmd_co_remove(cinfo, mdc02);
+
+		kfree(cinfo->mmi_recbuf);
+		cinfo->mmi_recbuf = NULL;
+
+		mp->pds_mda.mdi_slotvcnt = 0;
+	}
+
+	return rc;
+}
+
+/**
+ * pmd_mdc0_validate() - Validate MDC0 and clean up stray MDC mlogs.
+ * @mp: mpool descriptor
+ * @activation: set if called in the context of mpool activation
+ *
+ * Called during mpool activation and MDC alloc, because a failed
+ * MDC alloc can leave extraneous MDC mlog objects which, if found,
+ * we attempt to clean up here. When called during activation we may
+ * need to adjust mp.mda; this is not so when called from MDC alloc,
+ * and in fact decreasing slotvcnt post-activation would violate a
+ * key invariant.
+ */
+static int pmd_mdc0_validate(struct mpool_descriptor *mp, int activation)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *layout;
+	struct rb_node *node;
+	int err = 0, err1, err2, i;
+	u64 mdcn, mdcmax = 0;
+	u64 logid1, logid2;
+	u16 slotvcnt;
+	u8 *lcnt;
+
+	/*
+	 * Activation is single-threaded and mdc alloc is serialized
+	 * so the number of active mdc (slotvcnt) will not change.
+	 */
+	spin_lock(&mp->pds_mda.mdi_slotvlock);
+	slotvcnt = mp->pds_mda.mdi_slotvcnt;
+	spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+	if (!slotvcnt) {
+		/* Must be at least mdc0 */
+		err = -EINVAL;
+		mp_pr_err("mpool %s, no MDC0", err, mp->pds_name);
+		return err;
+	}
+
+	cinfo = &mp->pds_mda.mdi_slotv[0];
+
+	lcnt = kcalloc(MDC_SLOTS, sizeof(*lcnt), GFP_KERNEL);
+	if (!lcnt) {
+		err = -ENOMEM;
+		mp_pr_err("mpool %s, lcnt alloc failed", err, mp->pds_name);
+		return err;
+	}
+
+	pmd_co_rlock(cinfo, 0);
+
+	pmd_co_foreach(cinfo, node) {
+		layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+		mdcn = objid_uniq(layout->eld_objid) >> 1;
+		if (mdcn < MDC_SLOTS) {
+			lcnt[mdcn]++;
+			mdcmax = max(mdcmax, mdcn);
+		}
+		if (mdcn >= MDC_SLOTS || lcnt[mdcn] > 2 ||
+		    objid_type(layout->eld_objid) != OMF_OBJ_MLOG ||
+		    objid_slot(layout->eld_objid)) {
+			err = -EINVAL;
+			mp_pr_err("mpool %s, MDC0 number of MDCs %lu %u or bad otype, objid 0x%lx",
+				  err, mp->pds_name, (ulong)mdcn,
+				  lcnt[mdcn], (ulong)layout->eld_objid);
+			break;
+		}
+	}
+
+	pmd_co_runlock(cinfo);
+
+	if (err)
+		goto exit;
+
+	if (!mdcmax) {
+		/*
+		 * Trivial case of MDC0 only; no MDC alloc failure to
+		 * clean up.
+		 */
+		if (lcnt[0] != 2 || slotvcnt != 1) {
+			err = -EINVAL;
+			mp_pr_err("mpool %s, inconsistent number of MDCs or slots %d %d",
+				  err, mp->pds_name, lcnt[0], slotvcnt);
+		}
+
+		goto exit;
+	}
+
+	if ((mdcmax != (slotvcnt - 1)) && mdcmax != slotvcnt) {
+		err = -EINVAL;
+
+		/*
+		 * mdcmax is normally slotvcnt-1; can be slotvcnt if
+		 * mdc alloc failed
+		 */
+		mp_pr_err("mpool %s, inconsistent max number of MDCs %lu %u",
+			  err, mp->pds_name, (ulong)mdcmax, slotvcnt);
+		goto exit;
+	}
+
+	/* Both logs must always exist below mdcmax */
+	for (i = 0; i < mdcmax; i++) {
+		if (lcnt[i] != 2) {
+			err = -ENOENT;
+			mp_pr_err("mpool %s, MDC0 missing mlogs %lu %d %u",
+				  err, mp->pds_name, (ulong)mdcmax, i, lcnt[i]);
+			goto exit;
+		}
+	}
+
+	/* Clean-up from failed mdc alloc if needed */
+	if (lcnt[mdcmax] != 2 || mdcmax == slotvcnt) {
+		/* Note: if activation then mdcmax == slotvcnt-1 always */
+		err1 = 0;
+		err2 = 0;
+		logid1 = logid_make(2 * mdcmax, 0);
+		logid2 = logid_make(2 * mdcmax + 1, 0);
+
+		layout = pmd_obj_find_get(mp, logid1, 1);
+		if (layout) {
+			err1 = pmd_obj_delete(mp, layout);
+			if (err1)
+				mp_pr_err("mpool %s, MDC0 %d, can't delete mlog %lu %lu %u %u",
+					  err1, mp->pds_name, activation, (ulong)logid1,
+					  (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+		}
+
+		layout = pmd_obj_find_get(mp, logid2, 1);
+		if (layout) {
+			err2 = pmd_obj_delete(mp, layout);
+			if (err2)
+				mp_pr_err("mpool %s, MDC0 %d, can't delete mlog %lu %lu %u %u",
+					  err2, mp->pds_name, activation, (ulong)logid2,
+					  (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+		}
+
+		if (activation) {
+			/*
+			 * Mpool activation can ignore mdc alloc clean-up
+			 * failures; single-threaded; don't need slotvlock
+			 * or uqlock to adjust mda
+			 */
+			cinfo->mmi_luniq = mdcmax - 1;
+			mp->pds_mda.mdi_slotvcnt = mdcmax;
+			mp_pr_warn("mpool %s, MDC0 alloc recovery: uniq %llu slotvcnt %d",
+				   mp->pds_name, (unsigned long long)cinfo->mmi_luniq,
+				   mp->pds_mda.mdi_slotvcnt);
+		} else {
+			/* MDC alloc cannot tolerate clean-up failures */
+			if (err1)
+				err = err1;
+			else if (err2)
+				err = err2;
+
+			if (err)
+				mp_pr_err("mpool %s, MDC0 alloc recovery, cleanup failed %lu %u %u",
+					  err, mp->pds_name, (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+			else
+				mp_pr_warn("mpool %s, MDC0 alloc recovery", mp->pds_name);
+
+		}
+	}
+
+exit:
+	kfree(lcnt);
+
+	return err;
+}
+
+int pmd_mdc_alloc(struct mpool_descriptor *mp, u64 mincap, u32 iter)
+{
+	struct pmd_obj_capacity ocap;
+	enum mp_media_classp mclassp;
+	struct pmd_mdc_info *cinfo, *cinew;
+	struct pmd_layout *layout1, *layout2;
+	const char *msg = "(no detail)";
+	u64 mdcslot, logid1, logid2;
+	bool reverse = false;
+	u32 pdcnt;
+	int err;
+
+	/* Serialize to prevent gaps in the MDC slot space in the event of failure. */
+	mutex_lock(&pmd_s_lock);
+
+	/*
+	 * Recover a previously failed MDC alloc if needed; cannot continue
+	 * if that fails.
+	 * Note: there is an unlikely corner case where we logically delete an
+	 * mlog from a previously failed MDC alloc but a background op is
+	 * preventing its full removal; this will show up later in this
+	 * function as a failed alloc.
+	 */
+	err = pmd_mdc0_validate(mp, 0);
+	if (err) {
+		mutex_unlock(&pmd_s_lock);
+
+		mp_pr_err("mpool %s, allocating an MDC, inconsistent MDC0", err, mp->pds_name);
+		return err;
+	}
+
+	/* MDC0 exists by definition; created as part of mpool creation */
+	cinfo = &mp->pds_mda.mdi_slotv[0];
+
+	pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+	mdcslot = cinfo->mmi_luniq;
+	pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+	if (mdcslot >= MDC_SLOTS - 1) {
+		mutex_unlock(&pmd_s_lock);
+
+		err = -ENOSPC;
+		mp_pr_err("mpool %s, allocating an MDC, too many %lu",
+			  err, mp->pds_name, (ulong)mdcslot);
+		return err;
+	}
+	mdcslot = mdcslot + 1;
+
+	/*
+	 * Alloc rec buf for new mdc slot; not visible so don't need to
+	 * lock fields.
+	 */
+	cinew = &mp->pds_mda.mdi_slotv[mdcslot];
+	cinew->mmi_recbuf = kzalloc(OMF_MDCREC_PACKLEN_MAX, GFP_KERNEL);
+	if (!cinew->mmi_recbuf) {
+		mutex_unlock(&pmd_s_lock);
+
+		mp_pr_warn("mpool %s, MDC%lu pack/unpack buf alloc failed %lu",
+			   mp->pds_name, (ulong)mdcslot, (ulong)OMF_MDCREC_PACKLEN_MAX);
+		return -ENOMEM;
+	}
+	cinew->mmi_credit.ci_slot = mdcslot;
+
+	mclassp = MP_MED_CAPACITY;
+	pdcnt = 1;
+
+	/*
+	 * Create new mdcs with same parameters and on same media class
+	 * as mdc0.
+	 */
+	ocap.moc_captgt = mincap;
+	ocap.moc_spare  = false;
+
+	logid1 = logid_make(2 * mdcslot, 0);
+	logid2 = logid_make(2 * mdcslot + 1, 0);
+
+	if (!(pdcnt & 0x1) && ((iter * 2 / pdcnt) & 0x1)) {
+		/*
+		 * Reverse the allocation order.
+		 * The goal is to have active mlogs on all the mpool PDs.
+		 * If 2 PDs, no parity, no reserve, the active mlogs
+		 * will be on PDs 0,1,0,1,0,1,0,1 etc
+		 * instead of 0,0,0,0,0 etc without reversing.
+		 * No need to reverse if the number of PDs is odd.
+		 */
+		reverse = true;
+	}
+
+	/*
+	 * Each mlog must meet mincap since only one is active at a
+	 * time.
+	 */
+	layout1 = NULL;
+	err = pmd_obj_alloc_cmn(mp, reverse ? logid2 : logid1, OMF_OBJ_MLOG,
+				&ocap, mclassp, 0, false, &layout1);
+	if (err) {
+		if (err != -ENOENT)
+			msg = "allocation of first mlog failed";
+		goto exit;
+	}
+
+	layout2 = NULL;
+	err = pmd_obj_alloc_cmn(mp, reverse ? logid1 : logid2, OMF_OBJ_MLOG,
+				&ocap, mclassp, 0, false, &layout2);
+	if (err) {
+		pmd_obj_abort(mp, layout1);
+		if (err != -ENOENT)
+			msg = "allocation of second mlog failed";
+		goto exit;
+	}
+
+	/*
+	 * Must erase before commit to guarantee the new MDC logs start
+	 * empty; the mlogs are not yet committed, so pmd_obj_erase()
+	 * is not needed for atomicity.
+	 */
+	pmd_obj_wrlock(layout1);
+	err = pmd_layout_erase(mp, layout1);
+	pmd_obj_wrunlock(layout1);
+
+	if (err) {
+		msg = "erase of first mlog failed";
+	} else {
+		pmd_obj_wrlock(layout2);
+		err = pmd_layout_erase(mp, layout2);
+		pmd_obj_wrunlock(layout2);
+
+		if (err)
+			msg = "erase of second mlog failed";
+	}
+	if (err) {
+		pmd_obj_abort(mp, layout1);
+		pmd_obj_abort(mp, layout2);
+		goto exit;
+	}
+
+	/*
+	 * Don't need to commit logid1 and logid2 atomically; MDC0
+	 * validation deletes non-paired MDC logs to handle failing part
+	 * way through this process.
+	 */
+	err = pmd_obj_commit(mp, layout1);
+	if (err) {
+		pmd_obj_abort(mp, layout1);
+		pmd_obj_abort(mp, layout2);
+		msg = "commit of first mlog failed";
+		goto exit;
+	}
+
+	err = pmd_obj_commit(mp, layout2);
+	if (err) {
+		pmd_obj_delete(mp, layout1);
+		pmd_obj_abort(mp, layout2);
+		msg = "commit of second mlog failed";
+		goto exit;
+	}
+
+	/*
+	 * Finalize new mdc slot before making visible; don't need to
+	 * lock fields.
+	 */
+	err = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinew->mmi_mdc);
+	if (err) {
+		msg = "mdc open failed";
+
+		/*
+		 * Failed open so just delete logid1/2; don't need to
+		 * delete atomically since MDC0 validation will clean up
+		 * any detritus.
+		 */
+		pmd_obj_delete(mp, layout1);
+		pmd_obj_delete(mp, layout2);
+		goto exit;
+	}
+
+	/* Append the version record. */
+	if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+		err = pmd_mdc_addrec_version(mp, mdcslot);
+		if (err) {
+			msg = "error adding the version record";
+			/*
+			 * A missing version record in an MDC will trigger
+			 * an MDC compaction if activation is attempted
+			 * later with this empty MDC. The compaction will
+			 * add the version record to that empty MDC.
+			 * Same error handling as above.
+			 */
+			pmd_obj_delete(mp, layout1);
+			pmd_obj_delete(mp, layout2);
+			goto exit;
+		}
+	}
+
+	/* Make new mdc visible */
+	pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+
+	spin_lock(&mp->pds_mda.mdi_slotvlock);
+	cinfo->mmi_luniq = mdcslot;
+	mp->pds_mda.mdi_slotvcnt = mdcslot + 1;
+	spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+	pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+exit:
+	if (err) {
+		kfree(cinew->mmi_recbuf);
+		cinew->mmi_recbuf = NULL;
+	}
+
+	mutex_unlock(&pmd_s_lock);
+
+	mp_pr_debug("new mdc logid1 %llu logid2 %llu",
+		    0, (unsigned long long)logid1, (unsigned long long)logid2);
+
+	if (err)
+		mp_pr_err("mpool %s, MDC%lu: %s", err, mp->pds_name, (ulong)mdcslot, msg);
+	else
+		mp_pr_debug("mpool %s, delta slotvcnt from %u to %llu", 0, mp->pds_name,
+			    mp->pds_mda.mdi_slotvcnt, (unsigned long long)mdcslot + 1);
+
+	return err;
+}
+
+/**
+ * pmd_mdc_alloc_set() - allocate a set of MDCs
+ * @mp: mpool descriptor
+ *
+ * Creates MDCs in multiples of MPOOL_MDC_SET_SZ. If a prior allocation
+ * failed, allocate only enough MDCs to restore an even multiple of
+ * MPOOL_MDC_SET_SZ.
+ *
+ * Locking: no lock should be held when calling this function.
+ */
+static void pmd_mdc_alloc_set(struct mpool_descriptor *mp)
+{
+	u8 mdc_cnt, sidx;
+	int rc;
+
+	/*
+	 * MDCs are created in multiples of MPOOL_MDC_SET_SZ. However, if
+	 * a past allocation failed there may not be an even multiple of
+	 * MDCs; in that case create the remaining MDCs to get back to an
+	 * even multiple.
+	 */
+	mdc_cnt = MPOOL_MDC_SET_SZ - ((mp->pds_mda.mdi_slotvcnt - 1) % MPOOL_MDC_SET_SZ);
+
+	mdc_cnt = min(mdc_cnt, (u8)(MDC_SLOTS - (mp->pds_mda.mdi_slotvcnt)));
+
+	for (sidx = 1; sidx <= mdc_cnt; sidx++) {
+		rc = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap, 0);
+		if (rc) {
+			mp_pr_err("mpool %s, only %u of %u MDCs created",
+				  rc, mp->pds_name, sidx - 1, mdc_cnt);
+
+			/*
+			 * Ignore MDCn creation failure; attempt to create
+			 * any remaining MDCs the next time new MDCs are
+			 * required.
+			 */
+			rc = 0;
+			break;
+		}
+	}
+}
+
+/**
+ * pmd_cmp_drv_mdc0() - compare the drive info read from the MDC0 drive list
+ *	to what is obtained from the drive itself or from the configuration.
+ * @mp: mpool descriptor
+ * @pdh: handle (index) of the drive in pds_pdv[]
+ * @omd: device parameters read from the MDC0 drive list record
+ *
+ * The drive is in the list passed to mpool open, or is an UNAVAIL MDC0 drive.
+ */
+static int pmd_cmp_drv_mdc0(struct mpool_descriptor *mp, u8 pdh,
+			    struct omf_devparm_descriptor *omd)
+{
+	const char *msg __maybe_unused;
+	struct mc_parms mcp_mdc0list, mcp_pd;
+	struct mpool_dev_info *pd;
+
+	pd = &mp->pds_pdv[pdh];
+
+	mc_pd_prop2mc_parms(&(pd->pdi_parm.dpr_prop), &mcp_pd);
+	mc_omf_devparm2mc_parms(omd, &mcp_mdc0list);
+
+	if (!memcmp(&mcp_pd, &mcp_mdc0list, sizeof(mcp_pd)))
+		return 0;
+
+	if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL)
+		msg = "UNAVAIL mdc0 drive parms don't match those in drive list record";
+	else
+		msg = "mismatch between MDC0 drive list record and drive parms";
+
+	mp_pr_warn("mpool %s, %s for %s, mclassp %d %d zonepg %u %u sectorsz %u %u devtype %u %u features %lu %lu",
+		   mp->pds_name, msg, pd->pdi_name, mcp_pd.mcp_classp, mcp_mdc0list.mcp_classp,
+		   mcp_pd.mcp_zonepg, mcp_mdc0list.mcp_zonepg, mcp_pd.mcp_sectorsz,
+		   mcp_mdc0list.mcp_sectorsz, mcp_pd.mcp_devtype, mcp_mdc0list.mcp_devtype,
+		   (ulong)mcp_pd.mcp_features, (ulong)mcp_mdc0list.mcp_features);
+
+	return -EINVAL;
+}
+
+static const char *msg_unavail1 __maybe_unused =
+	"defunct and unavailable drive still belongs to the mpool";
+
+static const char *msg_unavail2 __maybe_unused =
+	"defunct and available drive still belongs to the mpool";
+
+static int pmd_props_load(struct mpool_descriptor *mp)
+{
+	struct omf_devparm_descriptor netdev[MP_MED_NUMBER] = { };
+	struct omf_mdcrec_data *cdr;
+	enum mp_media_classp mclassp;
+	struct pmd_mdc_info *cinfo;
+	struct media_class *mc;
+	bool zombie[MPOOL_DRIVES_MAX];
+	int spzone[MP_MED_NUMBER], i;
+	size_t rlen = 0;
+	u64 pdh, buflen;
+	int err;
+
+	cinfo = &mp->pds_mda.mdi_slotv[0];
+	buflen = OMF_MDCREC_PACKLEN_MAX;
+
+	/* Note: single-threaded here, so no locks are needed. */
+
+	/* Set mpool properties to defaults; overwritten by property records (if any). */
+	for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++)
+		spzone[mclassp] = -1;
+
+	/*
+	 * Read MDC0 to capture the net of drives, content version and other
+	 * properties; ignore object records.
+	 */
+	err = mp_mdc_rewind(cinfo->mmi_mdc);
+	if (err) {
+		mp_pr_err("mpool %s, MDC0 init for read properties failed", err, mp->pds_name);
+		return err;
+	}
+
+	cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+	if (!cdr) {
+		err = -ENOMEM;
+		mp_pr_err("mpool %s, cdr alloc failed", err, mp->pds_name);
+		return err;
+	}
+
+	while (true) {
+		err = mp_mdc_read(cinfo->mmi_mdc, cinfo->mmi_recbuf, buflen, &rlen);
+		if (err) {
+			mp_pr_err("mpool %s, MDC0 read next failed %lu",
+				  err, mp->pds_name, (ulong)rlen);
+			break;
+		}
+		if (rlen == 0)
+			break; /* Hit end of log */
+
+		/*
+		 * Skip object-related mdcrec in MDC0; not ready to unpack
+		 * these yet.
+		 */
+		if (omf_mdcrec_isobj_le(cinfo->mmi_recbuf))
+			continue;
+
+		err = omf_mdcrec_unpack_letoh(&(cinfo->mmi_mdcver), mp, cdr, cinfo->mmi_recbuf);
+		if (err) {
+			mp_pr_err("mpool %s, MDC0 property unpack failed", err, mp->pds_name);
+			break;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_MCCONFIG) {
+			struct omf_devparm_descriptor *src;
+
+			src = &cdr->u.dev.omd_parm;
+			ASSERT(src->odp_mclassp < MP_MED_NUMBER);
+
+			memcpy(&netdev[src->odp_mclassp], src, sizeof(*src));
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_MCSPARE) {
+			mclassp = cdr->u.mcs.omd_mclassp;
+			if (mclass_isvalid(mclassp)) {
+				spzone[mclassp] = cdr->u.mcs.omd_spzone;
+			} else {
+				err = -EINVAL;
+
+				/* Should never happen */
+				mp_pr_err("mpool %s, MDC0 mclass spare record, invalid mclassp %u",
+					  err, mp->pds_name, mclassp);
+				break;
+			}
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_VERSION) {
+			cinfo->mmi_mdcver = cdr->u.omd_version;
+			if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, ">", omfu_mdcver_cur())) {
+				char *buf1, *buf2 = NULL;
+
+				buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+				if (buf1) {
+					buf2 = buf1 + MAX_MDCVERSTR;
+					/* buf1/buf2 are pointers; pass the real buffer size */
+					omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, MAX_MDCVERSTR);
+					omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+				}
+
+				err = -EOPNOTSUPP;
+				mp_pr_err("mpool %s, MDC0 version %s, binary version %s",
+					  err, mp->pds_name, buf1, buf2);
+				kfree(buf1);
+				break;
+			}
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_MPCONFIG)
+			mp->pds_cfg = cdr->u.omd_cfg;
+	}
+
+	if (err) {
+		kfree(cdr);
+		return err;
+	}
+
+	/* Reconcile net drive list with those in mpool descriptor */
+	for (i = 0; i < mp->pds_pdvcnt; i++)
+		zombie[i] = true;
+
+	for (i = 0; i < MP_MED_NUMBER; i++) {
+		struct omf_devparm_descriptor *omd;
+		int    j;
+
+		omd = &netdev[i];
+
+		if (mpool_uuid_is_null(&omd->odp_devid))
+			continue;
+
+		j = mp->pds_pdvcnt;
+		while (j--) {
+			if (mpool_uuid_compare(&mp->pds_pdv[j].pdi_devid, &omd->odp_devid) == 0)
+				break;
+		}
+
+		if (j >= 0) {
+			zombie[j] = false;
+			err = pmd_cmp_drv_mdc0(mp, j, omd);
+			if (err)
+				break;
+		} else {
+			err = mpool_desc_unavail_add(mp, omd);
+			if (err)
+				break;
+			zombie[mp->pds_pdvcnt - 1] = false;
+		}
+	}
+
+	/* Check for zombie drives and recompute uacnt[] */
+	if (!err) {
+		for (i = 0; i < MP_MED_NUMBER; i++) {
+			mc = &mp->pds_mc[i];
+			mc->mc_uacnt = 0;
+		}
+
+		for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+			struct mpool_dev_info  *pd;
+
+			mc = &mp->pds_mc[mp->pds_pdv[pdh].pdi_mclass];
+			pd = &mp->pds_pdv[pdh];
+			if (zombie[pdh]) {
+				char uuid_str[40];
+
+				mpool_unparse_uuid(&pd->pdi_devid, uuid_str);
+				err = -ENXIO;
+
+				if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL)
+					mp_pr_err("mpool %s, drive %s %s %s", err, mp->pds_name,
+						   uuid_str, pd->pdi_name, msg_unavail1);
+				else
+					mp_pr_err("mpool %s, drive %s %s %s", err, mp->pds_name,
+						  uuid_str, pd->pdi_name, msg_unavail2);
+				break;
+			} else if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL) {
+				mc->mc_uacnt += 1;
+			}
+		}
+	}
+
+	/*
+	 * Now that all the mpool PDs have been added to their media classes,
+	 * every class of the mpool exists and the spare percentages can be
+	 * updated.
+	 */
+	if (!err) {
+		for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++) {
+			if (spzone[mclassp] >= 0) {
+				err = mc_set_spzone(&mp->pds_mc[mclassp], spzone[mclassp]);
+				/*
+				 * Should never happen: a class with
+				 * perf level mclassp and at least 1 PD
+				 * should exist.
+				 */
+				if (err)
+					break;
+			}
+		}
+		if (err)
+			mp_pr_err("mpool %s, can't set spare %u because the class %u has no PD",
+				  err, mp->pds_name, spzone[mclassp], mclassp);
+	}
+
+	kfree(cdr);
+
+	return err;
+}
+
+static int pmd_objs_load(struct mpool_descriptor *mp, u8 cslot)
+{
+	struct omf_mdcrec_data *cdr = NULL;
+	struct pmd_mdc_info *cinfo;
+	struct rb_node *node;
+	u64 argv[2] = { 0 };
+	const char *msg;
+	size_t recbufsz;
+	char *recbuf;
+	u64 mdcmax;
+	int err;
+
+	/* Note: single threaded here so don't need any locks */
+
+	recbufsz = OMF_MDCREC_PACKLEN_MAX;
+	msg = "(no detail)";
+	mdcmax = 0;
+
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+	/* Initialize mdc if not mdc0. */
+	if (cslot) {
+		u64 logid1 = logid_make(2 * cslot, 0);
+		u64 logid2 = logid_make(2 * cslot + 1, 0);
+
+		/* Freed in pmd_mda_free() */
+		cinfo->mmi_recbuf = kmalloc(recbufsz, GFP_KERNEL);
+		if (!cinfo->mmi_recbuf) {
+			msg = "MDC recbuf alloc failed";
+			err = -ENOMEM;
+			goto errout;
+		}
+
+		err = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+		if (err) {
+			msg = "mdc open failed";
+			goto errout;
+		}
+	}
+
+	/* Read mdc and capture net result of object data records. */
+	err = mp_mdc_rewind(cinfo->mmi_mdc);
+	if (err) {
+		msg = "mdc rewind failed";
+		goto errout;
+	}
+
+	/* Cache the record buffer pointer to simplify the ensuing code. */
+	recbuf = cinfo->mmi_recbuf;
+
+	cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+	if (!cdr) {
+		msg = "cdr alloc failed";
+		err = -ENOMEM;
+		goto errout;
+	}
+
+	while (true) {
+		struct pmd_layout *layout, *found;
+		size_t rlen = 0;
+		u64 objid;
+
+		err = mp_mdc_read(cinfo->mmi_mdc, recbuf, recbufsz, &rlen);
+		if (err) {
+			msg = "mdc read data failed";
+			break;
+		}
+		if (rlen == 0)
+			break; /* Hit end of log */
+
+		/*
+		 * Version record, if present, must be first.
+		 */
+		if (omf_mdcrec_unpack_type_letoh(recbuf) == OMF_MDR_VERSION) {
+			omf_mdcver_unpack_letoh(cdr, recbuf);
+			cinfo->mmi_mdcver = cdr->u.omd_version;
+
+			if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, ">", omfu_mdcver_cur())) {
+				char *buf1, *buf2 = NULL;
+
+				buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+				if (buf1) {
+					buf2 = buf1 + MAX_MDCVERSTR;
+					/* buf1/buf2 are pointers; pass the real buffer size */
+					omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, MAX_MDCVERSTR);
+					omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+				}
+
+				err = -EOPNOTSUPP;
+				mp_pr_err("mpool %s, MDC%u version %s, binary version %s",
+					  err, mp->pds_name, cslot, buf1, buf2);
+				kfree(buf1);
+				break;
+			}
+			continue;
+		}
+
+		/* Skip non object-related mdcrec in mdc0; i.e., property
+		 * records.
+		 */
+		if (!cslot && !omf_mdcrec_isobj_le(recbuf))
+			continue;
+
+		err = omf_mdcrec_unpack_letoh(&cinfo->mmi_mdcver, mp, cdr, recbuf);
+		if (err) {
+			msg = "mlog record unpack failed";
+			break;
+		}
+
+		objid = cdr->u.obj.omd_objid;
+
+		if (objid_slot(objid) != cslot) {
+			msg = "mlog record wrong slot";
+			err = -EBADSLT;
+			break;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_OCREATE) {
+			layout = cdr->u.obj.omd_layout;
+			layout->eld_state = PMD_LYT_COMMITTED;
+
+			found = pmd_co_insert(cinfo, layout);
+			if (found) {
+				msg = "OCREATE duplicate object ID";
+				pmd_obj_put(layout);
+				err = -EEXIST;
+				break;
+			}
+
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_cr);
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_cobj);
+
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_ODELETE) {
+			found = pmd_co_find(cinfo, objid);
+			if (!found) {
+				msg = "ODELETE object not found";
+				err = -ENOENT;
+				break;
+			}
+
+			pmd_co_remove(cinfo, found);
+			pmd_obj_put(found);
+
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_del);
+			atomic_dec(&cinfo->mmi_pco_cnt.pcc_cobj);
+
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_OIDCKPT) {
+			/*
+			 * objid == mmi_lckpt == 0 is legit. Such records
+			 * are appended by mpool MDC compaction due to a
+			 * mpool metadata upgrade on an empty mpool.
+			 */
+			if ((objid_uniq(objid) || objid_uniq(cinfo->mmi_lckpt))
+				&& (objid_uniq(objid) <= objid_uniq(cinfo->mmi_lckpt))) {
+				msg = "OIDCKPT cdr ckpt %lu <= cinfo ckpt %lu";
+				argv[0] = objid_uniq(objid);
+				argv[1] = objid_uniq(cinfo->mmi_lckpt);
+				err = -EINVAL;
+				break;
+			}
+
+			cinfo->mmi_lckpt = objid;
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_OERASE) {
+			layout = pmd_co_find(cinfo, objid);
+			if (!layout) {
+				msg = "OERASE object not found";
+				err = -ENOENT;
+				break;
+			}
+
+			/* Note: OERASE gen can equal layout gen after a compaction. */
+			if (cdr->u.obj.omd_gen < layout->eld_gen) {
+				msg = "OERASE cdr gen %lu < layout gen %lu";
+				argv[0] = cdr->u.obj.omd_gen;
+				argv[1] = layout->eld_gen;
+				err = -EINVAL;
+				break;
+			}
+
+			layout->eld_gen = cdr->u.obj.omd_gen;
+
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_er);
+			continue;
+		}
+
+		if (cdr->omd_rtype == OMF_MDR_OUPDATE) {
+			layout = cdr->u.obj.omd_layout;
+
+			found = pmd_co_find(cinfo, objid);
+			if (!found) {
+				msg = "OUPDATE object not found";
+				pmd_obj_put(layout);
+				err = -ENOENT;
+				break;
+			}
+
+			pmd_co_remove(cinfo, found);
+			pmd_obj_put(found);
+
+			layout->eld_state = PMD_LYT_COMMITTED;
+			pmd_co_insert(cinfo, layout);
+
+			atomic_inc(&cinfo->mmi_pco_cnt.pcc_up);
+
+			continue;
+		}
+	}
+
+	if (err)
+		goto errout;
+
+	/*
+	 * Add all existing objects to space map.
+	 * Also add/update per-mpool space usage stats
+	 */
+	pmd_co_foreach(cinfo, node) {
+		struct pmd_layout *layout;
+
+		layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+		/* Remember objid and gen in case of error... */
+		cdr->u.obj.omd_objid = layout->eld_objid;
+		cdr->u.obj.omd_gen = layout->eld_gen;
+
+		if (objid_slot(layout->eld_objid) != cslot) {
+			msg = "layout wrong slot";
+			err = -EBADSLT;
+			break;
+		}
+
+		err = pmd_smap_insert(mp, layout);
+		if (err) {
+			msg = "smap insert failed";
+			break;
+		}
+
+		pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_LOAD);
+
+		/* For mdc0 track last logical mdc created. */
+		if (!cslot)
+			mdcmax = max(mdcmax, (objid_uniq(layout->eld_objid) >> 1));
+	}
+
+	if (err)
+		goto errout;
+
+	cdr->u.obj.omd_objid = 0;
+	cdr->u.obj.omd_gen = 0;
+
+	if (!cslot) {
+		/* MDC0: finish initializing mda */
+		cinfo->mmi_luniq = mdcmax;
+		mp->pds_mda.mdi_slotvcnt = mdcmax + 1;
+
+		/* MDC0 only: validate other mdc metadata; may make adjustments to mp.mda. */
+		err = pmd_mdc0_validate(mp, 1);
+		if (err)
+			msg = "MDC0 validation failed";
+	} else {
+		/*
+		 * Other MDCs: set luniq to the guaranteed max value
+		 * previously used and ensure the next objid allocation
+		 * will be checkpointed; supports realloc of
+		 * uncommitted objects after a crash.
+		 */
+		cinfo->mmi_luniq = objid_uniq(cinfo->mmi_lckpt) + OBJID_UNIQ_DELTA - 1;
+	}
+
+errout:
+	if (err) {
+		char *msgbuf;
+
+		msgbuf = kmalloc(64, GFP_KERNEL);
+		if (msgbuf)
+			snprintf(msgbuf, 64, msg, argv[0], argv[1]);
+
+		mp_pr_err("mpool %s, %s: cslot %u, ckpt %lx, %lx/%lu",
+			  err, mp->pds_name, msgbuf, cslot, (ulong)cinfo->mmi_lckpt,
+			  (ulong)cdr->u.obj.omd_objid, (ulong)cdr->u.obj.omd_gen);
+
+		kfree(msgbuf);
+	}
+
+	kfree(cdr);
+
+	return err;
+}
+
+/**
+ * pmd_objs_load_worker() - worker thread for loading user MDC 1~N
+ * @ws: work structure
+ *
+ * Each worker instance will do the following (not counting errors):
+ * * atomically grab an MDC number from olw->olw_progress
+ * * if the MDC number is invalid, exit
+ * * load the objects from that MDC
+ *
+ * If an error occurs in this or any other worker, don't load any more MDCs.
+ */
+static void pmd_objs_load_worker(struct work_struct *ws)
+{
+	struct pmd_obj_load_work *olw;
+	int sidx, rc;
+
+	olw = container_of(ws, struct pmd_obj_load_work, olw_work);
+
+	while (atomic_read(olw->olw_err) == 0) {
+		sidx = atomic_fetch_add(1, olw->olw_progress);
+		if (sidx >= olw->olw_mp->pds_mda.mdi_slotvcnt)
+			break; /* No more MDCs to load */
+
+		rc = pmd_objs_load(olw->olw_mp, sidx);
+		if (rc)
+			atomic_set(olw->olw_err, rc);
+	}
+}
+
+/**
+ * pmd_objs_load_parallel() - load MDC 1~N in parallel
+ * @mp: mpool descriptor
+ *
+ * By loading user MDCs in parallel, we can reduce the mpool activate
+ * time, since the jobs of loading MDC 1~N are independent.
+ * On the other hand, we don't want to start all the jobs at once.
+ * If any one fails, we don't have to start others.
+ */
+static int pmd_objs_load_parallel(struct mpool_descriptor *mp)
+{
+	struct pmd_obj_load_work *olwv;
+	atomic_t err = ATOMIC_INIT(0);
+	atomic_t progress = ATOMIC_INIT(1);
+	uint njobs, inc, cpu, i;
+
+	if (mp->pds_mda.mdi_slotvcnt < 2)
+		return 0; /* No user MDCs allocated */
+
+	njobs = mp->pds_params.mp_objloadjobs;
+	njobs = clamp_t(uint, njobs, 1, mp->pds_mda.mdi_slotvcnt - 1);
+
+	if (mp->pds_mda.mdi_slotvcnt / njobs >= 4 && num_online_cpus() > njobs)
+		njobs *= 2;
+
+	olwv = kcalloc(njobs, sizeof(*olwv), GFP_KERNEL);
+	if (!olwv)
+		return -ENOMEM;
+
+	inc = (num_online_cpus() / njobs) & ~1u;
+	cpu = raw_smp_processor_id();
+
+	/*
+	 * Each of njobs workers will atomically grab MDC numbers from &progress
+	 * and load them, until all valid user MDCs have been loaded.
+	 */
+	for (i = 0; i < njobs; ++i) {
+		INIT_WORK(&olwv[i].olw_work, pmd_objs_load_worker);
+		olwv[i].olw_progress = &progress;
+		olwv[i].olw_err = &err;
+		olwv[i].olw_mp = mp;
+
+		/*
+		 * Try to distribute work across all NUMA nodes.
+		 * queue_work_node() would be preferable, but
+		 * it's not available on older kernels.
+		 */
+		cpu = (cpu + inc) % nr_cpumask_bits;
+		cpu = cpumask_next_wrap(cpu, cpu_online_mask, nr_cpumask_bits, false);
+		queue_work_on(cpu, mp->pds_workq, &olwv[i].olw_work);
+	}
+
+	/* Wait for all worker threads to complete */
+	flush_workqueue(mp->pds_workq);
+
+	kfree(olwv);
+
+	return atomic_read(&err);
+}
+
+static int pmd_mdc_append(struct mpool_descriptor *mp, u8 cslot,
+			  struct omf_mdcrec_data *cdr, int sync)
+{
+	struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	s64 plen;
+
+	plen = omf_mdcrec_pack_htole(mp, cdr, cinfo->mmi_recbuf);
+	if (plen < 0) {
+		mp_pr_warn("mpool %s, MDC%u append failed", mp->pds_name, cslot);
+		return plen;
+	}
+
+	return mp_mdc_append(cinfo->mmi_mdc, cinfo->mmi_recbuf, plen, sync);
+}
+
+/**
+ * pmd_log_all_mdc_cobjs() - write the committed object records to the new
+ *	active mlog.
+ * @mp: mpool descriptor
+ * @cslot: the "i" of MDCi
+ * @cdr: record data buffer
+ * @compacted: output, count of object records appended to the new active mlog
+ * @total: output, total count of committed objects
+ */
+static int pmd_log_all_mdc_cobjs(struct mpool_descriptor *mp, u8 cslot,
+				 struct omf_mdcrec_data *cdr, u32 *compacted, u32 *total)
+{
+	struct pmd_mdc_info *cinfo;
+	struct pmd_layout *layout;
+	struct rb_node *node;
+	int rc;
+
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	rc = 0;
+
+	pmd_co_foreach(cinfo, node) {
+		layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+		if (!objid_mdc0log(layout->eld_objid)) {
+			cdr->omd_rtype = OMF_MDR_OCREATE;
+			cdr->u.obj.omd_layout = layout;
+
+			rc = pmd_mdc_append(mp, cslot, cdr, 0);
+			if (rc) {
+				mp_pr_err("mpool %s, MDC%u log committed obj failed, objid 0x%lx",
+					  rc, mp->pds_name, cslot, (ulong)layout->eld_objid);
+				break;
+			}
+
+			++(*compacted);
+		}
+		++(*total);
+	}
+
+	/* If we broke out early due to an error, count the remaining objects. */
+	for (; node; node = rb_next(node))
+		++(*total);
+
+	return rc;
+}
+
+/**
+ * pmd_log_mdc0_cobjs() - write in the new active mlog (of MDC0) the MDC0
+ *	records that are particular to MDC0.
+ * @mp: mpool descriptor
+ */
+static int pmd_log_mdc0_cobjs(struct mpool_descriptor *mp)
+{
+	struct mpool_dev_info *pd;
+	int rc = 0, i;
+
+	/*
+	 * Log a drive record (OMF_MDR_MCCONFIG) for every drive in pds_pdv[]
+	 * that is not defunct.
+	 */
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		pd = &(mp->pds_pdv[i]);
+		rc = pmd_prop_mcconfig(mp, pd, true);
+		if (rc)
+			return rc;
+	}
+
+	/*
+	 * Log a media class spare record (OMF_MDR_MCSPARE) for every media
+	 * class. The mc count cannot change now because the MDC0 compact
+	 * lock is held, which blocks the addition of PDs to the mpool.
+	 */
+	for (i = 0; i < MP_MED_NUMBER; i++) {
+		struct media_class *mc;
+
+		mc = &mp->pds_mc[i];
+		if (mc->mc_pdmc >= 0) {
+			rc = pmd_prop_mcspare(mp, mc->mc_parms.mcp_classp,
+					       mc->mc_sparms.mcsp_spzone, true);
+			if (rc)
+				return rc;
+		}
+	}
+
+	return pmd_prop_mpconfig(mp, &mp->pds_cfg, true);
+}
+
+/**
+ * pmd_log_non_mdc0_cobjs() - write in the new active mlog (of MDCi i>0) the
+ *	MDCi records that are particular to MDCi (not used by MDC0).
+ * @mp: mpool descriptor
+ * @cslot: the "i" of MDCi
+ * @cdr: record data buffer
+ */
+static int pmd_log_non_mdc0_cobjs(struct mpool_descriptor *mp, u8 cslot,
+				  struct omf_mdcrec_data *cdr)
+{
+	struct pmd_mdc_info *cinfo;
+
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+	/*
+	 * If not MDC0, log the last objid checkpoint to support realloc of
+	 * uncommitted objects after a crash and to guarantee objids are
+	 * never reused.
+	 */
+	cdr->omd_rtype = OMF_MDR_OIDCKPT;
+	cdr->u.obj.omd_objid = cinfo->mmi_lckpt;
+
+	return pmd_mdc_append(mp, cslot, cdr, 0);
+}
+
+/**
+ * pmd_pre_compact_reset() - reset the pre-compaction counters; called on MDCi, i > 0
+ * @cinfo: mdc info
+ * @compacted: number of object create records appended to the new active mlog.
+ *
+ * Locking:
+ *	MDCi compact lock is held by the caller.
+ */
+static void pmd_pre_compact_reset(struct pmd_mdc_info *cinfo, u32 compacted)
+{
+	struct pre_compact_ctrs *pco_cnt;
+
+	pco_cnt = &cinfo->mmi_pco_cnt;
+	ASSERT(pco_cnt->pcc_cobj.counter == compacted);
+
+	atomic_set(&pco_cnt->pcc_cr, compacted);
+	atomic_set(&pco_cnt->pcc_cobj, compacted);
+	atomic_set(&pco_cnt->pcc_up, 0);
+	atomic_set(&pco_cnt->pcc_del, 0);
+	atomic_set(&pco_cnt->pcc_er, 0);
+}
+
+/**
+ * pmd_mdc_compact() - compact an mpool MDCi with i >= 0.
+ * @mp: mpool descriptor
+ * @cslot: the "i" of MDCi
+ *
+ * Locking:
+ * 1) caller must hold MDCi compact lock
+ * 2) MDC compaction freezes the state of all MDC objects [and for MDC0
+ *    also freezes all mpool properties] simply by holding the MDC
+ *    mmi_compactlock mutex. Hence, MDC compaction does not need to
+ *    read-lock individual object layouts or mpool property data
+ *    structures to read them. That is why this function and its callees
+ *    take no locks.
+ *
+ * Note: this function and its callees must call pmd_mdc_append() with no sync
+ *	instead of pmd_mdc_addrec() to avoid triggering a nested compaction of
+ *	the same MDCi.
+ *	The sync/flush is done by the append of cend; no need to sync before that.
+ */
+static int pmd_mdc_compact(struct mpool_descriptor *mp, u8 cslot)
+{
+	struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	u64 logid1 = logid_make(2 * cslot, 0);
+	u64 logid2 = logid_make(2 * cslot + 1, 0);
+	struct omf_mdcrec_data *cdr;
+	int retry = 0, rc = 0;
+
+	cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+	if (!cdr) {
+		rc = -ENOMEM;
+		mp_pr_crit("mpool %s, alloc failure during compact", rc, mp->pds_name);
+		return rc;
+	}
+
+	for (retry = 0; retry < MPOOL_MDC_COMPACT_RETRY_DEFAULT; retry++) {
+		u32 compacted = 0;
+		u32 total = 0;
+
+		if (rc) {
+			rc = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+			if (rc)
+				continue;
+		}
+
+		mp_pr_debug("mpool %s, MDC%u start: mlog1 gen %lu mlog2 gen %lu",
+			    rc, mp->pds_name, cslot,
+			    (ulong)((struct pmd_layout *)cinfo->mmi_mdc->mdc_logh1)->eld_gen,
+			    (ulong)((struct pmd_layout *)cinfo->mmi_mdc->mdc_logh2)->eld_gen);
+
+		rc = mp_mdc_cstart(cinfo->mmi_mdc);
+		if (rc)
+			continue;
+
+		if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+			rc = pmd_mdc_addrec_version(mp, cslot);
+			if (rc) {
+				mp_mdc_close(cinfo->mmi_mdc);
+				continue;
+			}
+		}
+
+		if (cslot)
+			rc = pmd_log_non_mdc0_cobjs(mp, cslot, cdr);
+		else
+			rc = pmd_log_mdc0_cobjs(mp);
+		if (rc)
+			continue;
+
+		rc = pmd_log_all_mdc_cobjs(mp, cslot, cdr, &compacted, &total);
+
+		mp_pr_debug("mpool %s, MDC%u compacted %u of %u objects: retry=%d",
+			    rc, mp->pds_name, cslot, compacted, total, retry);
+
+		/*
		 * Append the compaction end record to the new active
		 * mlog, and flush/sync all the records previously
		 * appended to the new active log by the compaction
		 * above.
+		 */
+		if (!rc)
+			rc = mp_mdc_cend(cinfo->mmi_mdc);
+
+		if (!rc) {
+			if (cslot) {
+				/*
				 * MDCi (i > 0) compacted successfully.
				 * The MDCi compact lock is held.
+				 */
+				pmd_pre_compact_reset(cinfo, compacted);
+			}
+
+			mp_pr_debug("mpool %s, MDC%u end: mlog1 gen %lu mlog2 gen %lu",
+				    rc, mp->pds_name, cslot,
+				    (ulong)((struct pmd_layout *)
+					    cinfo->mmi_mdc->mdc_logh1)->eld_gen,
+				    (ulong)((struct pmd_layout *)
+					    cinfo->mmi_mdc->mdc_logh2)->eld_gen);
+			break;
+		}
+	}
+
+	if (rc)
+		mp_pr_crit("mpool %s, MDC%u compaction failed", rc, mp->pds_name, cslot);
+
+	kfree(cdr);
+
+	return rc;
+}
+
+static int pmd_mdc_addrec(struct mpool_descriptor *mp, u8 cslot, struct omf_mdcrec_data *cdr)
+{
+	int rc;
+
+	rc = pmd_mdc_append(mp, cslot, cdr, 1);
+
+	if (rc == -EFBIG) {
+		rc = pmd_mdc_compact(mp, cslot);
+		if (!rc)
+			rc = pmd_mdc_append(mp, cslot, cdr, 1);
+	}
+
+	if (rc)
+		mp_pr_rl("mpool %s, MDC%u append failed%s", rc, mp->pds_name, cslot,
+			 (rc == -EFBIG) ? " post compaction" : "");
+
+	return rc;
+}
+
+int pmd_mdc_addrec_version(struct mpool_descriptor *mp, u8 cslot)
+{
+	struct omf_mdcrec_data cdr;
+	struct omf_mdcver *ver;
+
+	cdr.omd_rtype = OMF_MDR_VERSION;
+
+	ver = omfu_mdcver_cur();
+	cdr.u.omd_version = *ver;
+
+	return pmd_mdc_addrec(mp, cslot, &cdr);
+}
+
+int pmd_prop_mcconfig(struct mpool_descriptor *mp, struct mpool_dev_info *pd, bool compacting)
+{
+	struct omf_mdcrec_data cdr;
+	struct mc_parms mc_parms;
+
+	cdr.omd_rtype = OMF_MDR_MCCONFIG;
+	mpool_uuid_copy(&cdr.u.dev.omd_parm.odp_devid, &pd->pdi_devid);
+	mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+	mc_parms2omf_devparm(&mc_parms, &cdr.u.dev.omd_parm);
+	cdr.u.dev.omd_parm.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+	cdr.u.dev.omd_parm.odp_devsz = pd->pdi_parm.dpr_devsz;
+
+	/* If compacting, no sync is needed and we must not trigger another compaction. */
+	if (compacting)
+		return pmd_mdc_append(mp, 0, &cdr, 0);
+
+	return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+int pmd_prop_mcspare(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+		     u8 spzone, bool compacting)
+{
+	struct omf_mdcrec_data cdr;
+	int rc;
+
+	if (!mclass_isvalid(mclassp) || spzone > 100) {
+		rc = -EINVAL;
+		mp_pr_err("persisting %s spare zone info, invalid arguments %d %u",
+			  rc, mp->pds_name, mclassp, spzone);
+		return rc;
+	}
+
+	cdr.omd_rtype = OMF_MDR_MCSPARE;
+	cdr.u.mcs.omd_mclassp = mclassp;
+	cdr.u.mcs.omd_spzone = spzone;
+
+	/* If compacting, no sync is needed and we must not trigger another compaction. */
+	if (compacting)
+		return pmd_mdc_append(mp, 0, &cdr, 0);
+
+	return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+int pmd_log_delete(struct mpool_descriptor *mp, u64 objid)
+{
+	struct omf_mdcrec_data cdr;
+
+	cdr.omd_rtype = OMF_MDR_ODELETE;
+	cdr.u.obj.omd_objid = objid;
+
+	return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_log_create(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+	struct omf_mdcrec_data cdr;
+
+	cdr.omd_rtype = OMF_MDR_OCREATE;
+	cdr.u.obj.omd_layout = layout;
+
+	return pmd_mdc_addrec(mp, objid_slot(layout->eld_objid), &cdr);
+}
+
+int pmd_log_erase(struct mpool_descriptor *mp, u64 objid, u64 gen)
+{
+	struct omf_mdcrec_data cdr;
+
+	cdr.omd_rtype = OMF_MDR_OERASE;
+	cdr.u.obj.omd_objid = objid;
+	cdr.u.obj.omd_gen = gen;
+
+	return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_log_idckpt(struct mpool_descriptor *mp, u64 objid)
+{
+	struct omf_mdcrec_data cdr;
+
+	cdr.omd_rtype = OMF_MDR_OIDCKPT;
+	cdr.u.obj.omd_objid = objid;
+
+	return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_prop_mpconfig(struct mpool_descriptor *mp, const struct mpool_config *cfg, bool compacting)
+{
+	struct omf_mdcrec_data cdr = { };
+
+	cdr.omd_rtype = OMF_MDR_MPCONFIG;
+	cdr.u.omd_cfg = *cfg;
+
+	if (compacting)
+		return pmd_mdc_append(mp, 0, &cdr, 0);
+
+	return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+/**
+ * pmd_need_compact() - determine whether MDCi (i = cslot) needs compaction.
+ * @mp:    mpool descriptor
+ * @cslot: MDC slot number
+ *
+ * The MDCi needs compaction if its active mlog is filled above a threshold
+ * and contains enough garbage (records that compaction would eliminate).
+ *
+ * Locking: no lock needs to be held when calling this function.
+ *	Because no lock is held, the result may be stale if a compaction
+ *	of MDCi (with i = cslot) is taking place at the same time.
+ */
+static bool pmd_need_compact(struct mpool_descriptor *mp, u8 cslot, char *msgbuf, size_t msgsz)
+{
+	struct pre_compact_ctrs *pco_cnt;
+	struct pmd_mdc_info *cinfo;
+	u64 rec, cobj, len, cap;
+	u32 garbage, pct;
+
+	ASSERT(cslot > 0);
+
+	cinfo = &mp->pds_mda.mdi_slotv[cslot];
+	pco_cnt = &(cinfo->mmi_pco_cnt);
+
+	cap = atomic64_read(&pco_cnt->pcc_cap);
+	if (cap == 0)
+		return false; /* MDC closed for now. */
+
+	len = atomic64_read(&pco_cnt->pcc_len);
+	rec = atomic_read(&pco_cnt->pcc_cr) + atomic_read(&pco_cnt->pcc_up) +
+		atomic_read(&pco_cnt->pcc_del) + atomic_read(&pco_cnt->pcc_er);
+	cobj = atomic_read(&pco_cnt->pcc_cobj);
+
+	pct = (len * 100) / cap;
+	if (pct < mp->pds_params.mp_pcopctfull)
+		return false; /* Active mlog not filled enough */
+
+	if (rec > cobj) {
+		garbage = (rec - cobj) * 100;
+		garbage /= rec;
+	} else {
+
+		/*
+		 * We may arrive here rarely if the caller doesn't
+		 * hold the compact lock. In that case, the update of
+		 * the counters may be seen out of order or a compaction
+		 * may take place at the same time.
+		 */
+		garbage = 0;
+	}
+
+	if (garbage < mp->pds_params.mp_pcopctgarbage)
+		return false;
+
+	if (msgbuf)
+		snprintf(msgbuf, msgsz,
+			 "bytes used %lu, total %lu, pct %u, records %lu, objects %lu, garbage %u",
+			 (ulong)len, (ulong)cap, pct, (ulong)rec, (ulong)cobj, garbage);
+
+	return true;
+}
+
+/**
+ * pmd_mdc_needed() - determine whether new MDCNs should be created
+ * @mp:  mpool descriptor
+ *
+ * New MDCs are created if the total used capacity across all MDCs is
+ * above a threshold and the reclaimable garbage is below a garbage
+ * threshold (i.e., compaction would not free significant space).
+ *
+ * Locking: no lock needs to be held when calling this function.
+ *
+ * NOTES:
+ * - Skip non-active MDCs.
+ * - Accumulate total capacity, total garbage and total in-use capacity
+ *   across all active MDCs.
+ * - Return true if the total used capacity across all MDCs is above a
+ *   threshold and the garbage is below a threshold that would yield
+ *   significant free space upon compaction.
+ */
+static bool pmd_mdc_needed(struct mpool_descriptor *mp)
+{
+	struct pre_compact_ctrs *pco_cnt;
+	struct pmd_mdc_info *cinfo;
+	u64 cap, tcap, used, garbage, record, rec, cobj;
+	u32 pct, pctg, mdccnt;
+	u16 cslot;
+
+	ASSERT(mp->pds_mda.mdi_slotvcnt <= MDC_SLOTS);
+
+	cap = used = garbage = record = pctg = 0;
+
+	if (mp->pds_mda.mdi_slotvcnt == MDC_SLOTS)
+		return false;
+
+	for (cslot = 1, mdccnt = 0; cslot < mp->pds_mda.mdi_slotvcnt; cslot++) {
+
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+		pco_cnt = &(cinfo->mmi_pco_cnt);
+
+		tcap = atomic64_read(&pco_cnt->pcc_cap);
+		if (tcap == 0) {
+			/*
			 * MDC closed for now; it will not be considered
			 * when deciding whether to create a new MDC.
+			 */
+			mp_pr_warn("MDC %u not open", cslot);
+			continue;
+		}
+		cap += tcap;
+
+		mdccnt++;
+
+		used += atomic64_read(&pco_cnt->pcc_len);
+		rec = atomic_read(&pco_cnt->pcc_cr) + atomic_read(&pco_cnt->pcc_up) +
+			atomic_read(&pco_cnt->pcc_del) + atomic_read(&pco_cnt->pcc_er);
+
+		cobj = atomic_read(&pco_cnt->pcc_cobj);
+
+		if (rec > cobj)
+			garbage += (rec - cobj);
+
+		record += rec;
+	}
+
+	if (mdccnt == 0) {
+		mp_pr_warn("No mpool MDCs available");
+		return false;
+	}
+
+	/* Percentage capacity used across all MDCs */
+	pct  = (used  * 100) / cap;
+
+	/* Percentage garbage available across all MDCs */
+	if (garbage)
+		pctg = (garbage * 100) / record;
+
+	if (pct > mp->pds_params.mp_crtmdcpctfull && pctg < mp->pds_params.mp_crtmdcpctgrbg) {
+		mp_pr_debug("MDCn %u cap %u used %u rec %u grbg %u pct used %u grbg %u Thres %u-%u",
+			    0, mdccnt, (u32)cap, (u32)used, (u32)record, (u32)garbage, pct, pctg,
+			    (u32)mp->pds_params.mp_crtmdcpctfull,
+			    (u32)mp->pds_params.mp_crtmdcpctgrbg);
+		return true;
+	}
+
+	return false;
+}
+
+
+/**
+ * pmd_precompact() - pre-compact an mpool MDC
+ * @work: delayed work struct
+ *
+ * The goal of this work item is to minimize application object commit time.
+ * It pre-compacts MDC1-255 so that MDC1-255 compaction does not occur in
+ * the context of an application object commit.
+ */
+static void pmd_precompact(struct work_struct *work)
+{
+	struct pre_compact_ctrl *pco;
+	struct mpool_descriptor *mp;
+	struct pmd_mdc_info *cinfo;
+	char msgbuf[128];
+	uint nmtoc, delay;
+	bool compact;
+	u8 cslot;
+
+	pco = container_of(work, typeof(*pco), pco_dwork.work);
+	mp = pco->pco_mp;
+
+	nmtoc = atomic_fetch_add(1, &pco->pco_nmtoc);
+
+	/* Only compact MDC1-255, not MDC0. */
+	cslot = (nmtoc % (mp->pds_mda.mdi_slotvcnt - 1)) + 1;
+
+	/*
+	 * Check if the next mpool mdc to compact needs compaction.
+	 *
+	 * Note that this check is done without taking any lock.
+	 * This is safe because the mpool MDCs don't go away as long as
+	 * the mpool is activated, and the mpool can't be deactivated
+	 * before this work item exits.
+	 */
+	compact = pmd_need_compact(mp, cslot, NULL, 0);
+	if (compact) {
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+		/*
+		 * Check a second time while we hold the compact lock
+		 * to avoid doing a useless compaction.
+		 */
+		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+		compact = pmd_need_compact(mp, cslot, msgbuf, sizeof(msgbuf));
+		if (compact)
+			pmd_mdc_compact(mp, cslot);
+		pmd_mdc_unlock(&cinfo->mmi_compactlock);
+
+		if (compact)
+			mp_pr_info("mpool %s, MDC%u %s", mp->pds_name, cslot, msgbuf);
+	}
+
+	/* If running low on MDC space create new MDCs */
+	if (pmd_mdc_needed(mp))
+		pmd_mdc_alloc_set(mp);
+
+	pmd_update_credit(mp);
+
+	delay = clamp_t(uint, mp->pds_params.mp_pcoperiod, 1, 3600);
+
+	queue_delayed_work(mp->pds_workq, &pco->pco_dwork, msecs_to_jiffies(delay * 1000));
+}
+
+void pmd_precompact_start(struct mpool_descriptor *mp)
+{
+	struct pre_compact_ctrl *pco;
+
+	pco = &mp->pds_pco;
+	pco->pco_mp = mp;
+	atomic_set(&pco->pco_nmtoc, 0);
+
+	INIT_DELAYED_WORK(&pco->pco_dwork, pmd_precompact);
+	queue_delayed_work(mp->pds_workq, &pco->pco_dwork, 1);
+}
+
+void pmd_precompact_stop(struct mpool_descriptor *mp)
+{
+	cancel_delayed_work_sync(&mp->pds_pco.pco_dwork);
+}
+
+static int pmd_write_meta_to_latest_version(struct mpool_descriptor *mp, bool permitted)
+{
+	struct pmd_mdc_info *cinfo_converted = NULL, *cinfo;
+	char buf1[MAX_MDCVERSTR] __maybe_unused;
+	char buf2[MAX_MDCVERSTR] __maybe_unused;
+	u32 cslot;
+	int rc;
+
+	/*
+	 * Compact MDC0 first (before MDC1-255 compaction appends in MDC0) to
+	 * avoid having a potential mix of new and old records in MDC0.
+	 */
+	for (cslot = 0; cslot < mp->pds_mda.mdi_slotvcnt; cslot++) {
+		cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+		/*
+		 * At this point the on-media version should be less than or
+		 * equal to the latest version supported by this binary.
+		 * If that is not the case, activation fails earlier.
+		 */
+		if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, "==", omfu_mdcver_cur()))
+			continue;
+
+		omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, sizeof(buf1));
+		omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, sizeof(buf2));
+
+		if (!permitted) {
+			rc = -EPERM;
+			mp_pr_err("mpool %s, MDC%u upgrade needed from version %s to %s",
+				  rc, mp->pds_name, cslot, buf1, buf2);
+			return rc;
+		}
+
+		mp_pr_info("mpool %s, MDC%u upgraded from version %s to %s",
+			   mp->pds_name, cslot, buf1, buf2);
+
+		cinfo_converted = cinfo;
+
+		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+		rc = pmd_mdc_compact(mp, cslot);
+		pmd_mdc_unlock(&cinfo->mmi_compactlock);
+
+		if (rc) {
+			mp_pr_err("mpool %s, failed to compact MDC %u post upgrade from %s to %s",
+				  rc, mp->pds_name, cslot, buf1, buf2);
+			return rc;
+		}
+	}
+
+	if (cinfo_converted != NULL)
+		mp_pr_info("mpool %s, converted MDC from version %s to %s", mp->pds_name,
+			   omfu_mdcver_to_str(&cinfo_converted->mmi_mdcver, buf1, sizeof(buf1)),
+			   omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, sizeof(buf2)));
+
+	return 0;
+}
+
+void pmd_mdc_cap(struct mpool_descriptor *mp, u64 *mdcmax, u64 *mdccap, u64 *mdc0cap)
+{
+	struct pmd_mdc_info *cinfo = NULL;
+	struct pmd_layout *layout = NULL;
+	struct rb_node *node = NULL;
+	u64 mlogsz;
+	u32 zonepg = 0;
+	u16 mdcn = 0;
+
+	if (!mdcmax || !mdccap || !mdc0cap)
+		return;
+
+	/* Serialize to prevent race with pmd_mdc_alloc() */
+	mutex_lock(&pmd_s_lock);
+
+	/*
+	 * Exclude mdc0 from the stats because it is not used for
+	 * mpool user object metadata.
+	 */
+	cinfo = &mp->pds_mda.mdi_slotv[0];
+
+	pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+	*mdcmax = cinfo->mmi_luniq;
+	pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+	/* Take the compactlock to freeze all object layout metadata in mdc0 */
+	pmd_mdc_lock(&cinfo->mmi_compactlock, 0);
+	pmd_co_rlock(cinfo, 0);
+
+	pmd_co_foreach(cinfo, node) {
+		layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+		mdcn = objid_uniq(layout->eld_objid) >> 1;
+
+		if (mdcn > *mdcmax)
+			/* Ignore detritus from failed pmd_mdc_alloc() */
+			continue;
+
+		zonepg = mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_parm.dpr_zonepg;
+		mlogsz = (layout->eld_ld.ol_zcnt * zonepg) << PAGE_SHIFT;
+
+		if (!mdcn)
+			*mdc0cap = *mdc0cap + mlogsz;
+		else
+			*mdccap = *mdccap + mlogsz;
+	}
+
+	pmd_co_runlock(cinfo);
+	pmd_mdc_unlock(&cinfo->mmi_compactlock);
+	mutex_unlock(&pmd_s_lock);
+
+	/* Only count capacity of one mlog in each mdc mlog pair */
+	*mdccap  = *mdccap >> 1;
+	*mdc0cap = *mdc0cap >> 1;
+}
+
+int pmd_mpool_activate(struct mpool_descriptor *mp, struct pmd_layout *mdc01,
+		       struct pmd_layout *mdc02, int create)
+{
+	int rc;
+
+	mp_pr_debug("mdc01: %lu mdc02: %lu", 0, (ulong)mdc01->eld_objid, (ulong)mdc02->eld_objid);
+
+	/* Activation is intense; serialize it when there are multiple mpools */
+	mutex_lock(&pmd_s_lock);
+
+	/* Init metadata array for mpool */
+	pmd_mda_init(mp);
+
+	/* Initialize mdc0 for mpool */
+	rc = pmd_mdc0_init(mp, mdc01, mdc02);
+	if (rc) {
+		/*
+		 * pmd_mda_free() will dealloc mdc01/2 on subsequent
+		 * activation failures
+		 */
+		pmd_obj_put(mdc01);
+		pmd_obj_put(mdc02);
+		goto exit;
+	}
+
+	/* Load mpool properties from mdc0 including drive list and states */
+	if (!create) {
+		rc = pmd_props_load(mp);
+		if (rc)
+			goto exit;
+	}
+
+	/*
+	 * Initialize smaps for all drives in the mpool (now that the
+	 * drive list is finalized).
+	 */
+	rc = smap_mpool_init(mp);
+	if (rc)
+		goto exit;
+
+	/* Load mdc layouts from mdc0 and finalize mda initialization */
+	rc = pmd_objs_load(mp, 0);
+	if (rc)
+		goto exit;
+
+	/* Load user object layouts from all other MDCs */
+	rc = pmd_objs_load_parallel(mp);
+	if (rc) {
+		mp_pr_err("mpool %s, failed to load user MDCs", rc, mp->pds_name);
+		goto exit;
+	}
+
+	/*
+	 * If the format of the mpool metadata read from media during activate
+	 * is not the latest, it is time to write the metadata on media with
+	 * the latest format.
+	 */
+	if (!create) {
+		rc = pmd_write_meta_to_latest_version(mp, true);
+		if (rc) {
+			mp_pr_err("mpool %s, failed to compact MDCs (metadata conversion)",
+				  rc, mp->pds_name);
+			goto exit;
+		}
+	}
+exit:
+	if (rc) {
+		/* Activation failed; cleanup */
+		pmd_mda_free(mp);
+		smap_mpool_free(mp);
+	}
+
+	mutex_unlock(&pmd_s_lock);
+
+	return rc;
+}
+
+void pmd_mpool_deactivate(struct mpool_descriptor *mp)
+{
+	/* Deactivation is intense; serialize it when there are multiple mpools */
+	mutex_lock(&pmd_s_lock);
+
+	/* Close all open user (non-mdc) mlogs */
+	mlogutil_closeall(mp);
+
+	pmd_mda_free(mp);
+	smap_mpool_free(mp);
+
+	mutex_unlock(&pmd_s_lock);
+}
diff --git a/drivers/mpool/pmd_obj.c b/drivers/mpool/pmd_obj.c
index 8966fc0abd0e..18157fecccfb 100644
--- a/drivers/mpool/pmd_obj.c
+++ b/drivers/mpool/pmd_obj.c
@@ -507,9 +507,7 @@ int pmd_obj_commit(struct mpool_descriptor *mp, struct pmd_layout *layout)
 
 	pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
 
-#ifdef OBJ_PERSISTENCE_ENABLED
 	rc = pmd_log_create(mp, layout);
-#endif
 	if (!rc) {
 		pmd_uc_lock(cinfo, cslot);
 		found = pmd_uc_remove(cinfo, layout);
@@ -675,9 +673,7 @@ int pmd_obj_delete(struct mpool_descriptor *mp, struct pmd_layout *layout)
 		return (refcnt > 2) ? -EBUSY : -EINVAL;
 	}
 
-#ifdef OBJ_PERSISTENCE_ENABLED
 	rc = pmd_log_delete(mp, objid);
-#endif
 	if (!rc) {
 		pmd_co_wlock(cinfo, cslot);
 		found = pmd_co_remove(cinfo, layout);
@@ -763,9 +759,7 @@ int pmd_obj_erase(struct mpool_descriptor *mp, struct pmd_layout *layout, u64 ge
 
 		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
 
-#ifdef OBJ_PERSISTENCE_ENABLED
 		rc = pmd_log_erase(mp, layout->eld_objid, gen);
-#endif
 		if (!rc) {
 			layout->eld_gen = gen;
 			if (cslot)
@@ -830,9 +824,7 @@ static int pmd_alloc_idgen(struct mpool_descriptor *mp, enum obj_type_omf otype,
 		 * to prevent a race with mdc compaction.
 		 */
 		pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
-#ifdef OBJ_PERSISTENCE_ENABLED
 		rc = pmd_log_idckpt(mp, *objid);
-#endif
 		if (!rc)
 			cinfo->mmi_lckpt = *objid;
 		pmd_mdc_unlock(&cinfo->mmi_compactlock);
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 15/22] mpool: add mpool lifecycle management routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (13 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 14/22] mpool: add pool metadata routines to create persistent mpools Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 16/22] mpool: add mpool control plane utility routines Nabeel M Mohamed
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds mpool lifecycle management functions: create,
activate, deactivate, destroy, rename, add a new media
class, fetch properties, etc.

An mpool is created with a mandatory capacity media class
volume. A pool drive (PD) instance is initialized for each
media class volume using the attributes pushed from the mpool
user library. The metadata manager interfaces are then
invoked to activate this mpool and allocate the initial set
of metadata containers. The media class attributes, spare
percent, mpool configuration, etc. are persisted in MDC-0.

At mpool activation, the records from MDC-0 containing the
mpool properties and metadata for accessing MDC-1 through
MDC-N are first loaded into memory, initializing all the
necessary in-core structures using the metadata manager and
space map interfaces.  Then the records from MDC-1 through
MDC-N containing the metadata for accessing client mblock
and mlog objects are loaded into memory, again initializing
all the necessary in-core structures using the metadata
manager and space map interfaces.

An mpool is destroyed by erasing the superblock on all its
constituent media class volumes. Renaming an mpool updates
the superblock on all the media class volumes with the new
name.  Adding a new media class volume to an activated mpool
is handled like initializing a volume at mpool create.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mp.c | 1086 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1086 insertions(+)
 create mode 100644 drivers/mpool/mp.c

diff --git a/drivers/mpool/mp.c b/drivers/mpool/mp.c
new file mode 100644
index 000000000000..6b8c51c23fec
--- /dev/null
+++ b/drivers/mpool/mp.c
@@ -0,0 +1,1086 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+/*
+ * Media pool (mpool) manager module.
+ *
+ * Defines functions to create and maintain mpools comprising multiple drives
+ * in multiple media classes used for storing mblocks and mlogs.
+ */
+
+#include <linux/string.h>
+#include <linux/mutex.h>
+#include <crypto/hash.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "sb.h"
+#include "upgrade.h"
+#include "mpcore.h"
+#include "mp.h"
+
+/*
+ * Lock for serializing certain mpool ops where required/desirable; could be
+ * per-mpool in some cases, but there is no meaningful performance benefit for
+ * these rare ops. Also protects mpool_pools and certain mpool_descriptor fields.
+ */
+static DEFINE_MUTEX(mpool_s_lock);
+
+int mpool_create(const char *mpname, u32 flags, char **dpaths, struct pd_prop *pd_prop,
+		 struct mpcore_params *params, u64 mlog_cap)
+{
+	struct omf_sb_descriptor *sbmdc0;
+	struct mpool_descriptor *mp;
+	struct pmd_layout *mdc01, *mdc02;
+	bool active, sbvalid;
+	u16 sidx;
+	int err;
+
+	if (!mpname || !*mpname || !dpaths || !pd_prop)
+		return -EINVAL;
+
+	mdc01 = mdc02 = NULL;
+	active = sbvalid = false;
+
+	mp = mpool_desc_alloc();
+	if (!mp) {
+		err = -ENOMEM;
+		mp_pr_err("mpool %s, alloc desc failed", err, mpname);
+		return err;
+	}
+
+	sbmdc0 = &(mp->pds_sbmdc0);
+	strlcpy((char *)mp->pds_name, mpname, sizeof(mp->pds_name));
+	mpool_generate_uuid(&mp->pds_poolid);
+
+	if (params)
+		mp->pds_params = *params;
+
+	mp->pds_pdvcnt = 0;
+
+	mutex_lock(&mpool_s_lock);
+
+	/*
+	 * Allocate the per-mpool workqueue.
+	 * TODO: Make this per-driver
+	 */
+	mp->pds_erase_wq = alloc_workqueue("mperasewq", WQ_HIGHPRI, 0);
+	if (!mp->pds_erase_wq) {
+		err = -ENOMEM;
+		mp_pr_err("mpool %s, alloc per-mpool wq failed", err, mpname);
+		goto errout;
+	}
+
+	/*
+	 * Set the device parameters from those placed by the discovery
+	 * in pd_prop.
+	 */
+	err = mpool_dev_init_all(mp->pds_pdv, 1, dpaths, pd_prop);
+	if (err) {
+		mp_pr_err("mpool %s, failed to get device parameters", err, mpname);
+		goto errout;
+	}
+
+	mp->pds_pdvcnt = 1;
+
+	mpool_mdc_cap_init(mp, &mp->pds_pdv[0]);
+
+	/* Init the new pool drive's uuid and mclassp */
+	mpool_generate_uuid(&mp->pds_pdv[0].pdi_devid);
+
+	/*
+	 * Init the mpool descriptor from the new drive info. This creates
+	 * the media classes and places the PDs in them, and determines the
+	 * media class used for the metadata.
+	 */
+	err = mpool_desc_init_newpool(mp, flags);
+	if (err) {
+		mp_pr_err("mpool %s, desc init from new drive info failed", err, mpname);
+		goto errout;
+	}
+
+	/*
+	 * Alloc an empty mdc0 and write superblocks to all drives. If we
+	 * crash, drives with superblocks will not be recognized as mpool
+	 * members because there are not yet any drive state records in mdc0.
+	 */
+	sbvalid = true;
+	err = mpool_dev_sbwrite_newpool(mp, sbmdc0);
+	if (err) {
+		mp_pr_err("mpool %s, couldn't write superblocks", err, mpname);
+		goto errout;
+	}
+
+	/* Alloc mdc0 mlog layouts and activate mpool with empty mdc0 */
+	err = mpool_mdc0_sb2obj(mp, sbmdc0, &mdc01, &mdc02);
+	if (err) {
+		mp_pr_err("mpool %s, alloc of MDC0 mlogs failed", err, mpname);
+		goto errout;
+	}
+
+	err = pmd_mpool_activate(mp, mdc01, mdc02, 1);
+	if (err) {
+		mp_pr_err("mpool %s, activation failed", err, mpname);
+		goto errout;
+	}
+
+	active = true;
+
+	/*
+	 * Add the version record (always the first record) in MDC0.
+	 * The version record is used only from version 1.0.0.1 onward.
+	 */
+	if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+		err = pmd_mdc_addrec_version(mp, 0);
+		if (err) {
+			mp_pr_err("mpool %s, writing MDC version record in MDC0 failed",
+				  err, mpname);
+			goto errout;
+		}
+	}
+
+	/*
+	 * Add drive state records to mdc0. If we crash before completing
+	 * this, it will be detected on an attempt to open the same drive
+	 * list. It may be possible to open a subset of the drive list for
+	 * which state records were written without detection, in which
+	 * case the other drives can be added later.
+	 */
+	err = pmd_prop_mcconfig(mp, &mp->pds_pdv[0], false);
+	if (err) {
+		mp_pr_err("mpool %s, add drive state to MDC0 failed", err, mpname);
+		goto errout;
+	}
+
+	/*
+	 * Create MDCs so the user can create mlog/mblock objects. If we
+	 * crash before all the configured MDCs are created, or if creation
+	 * fails, this is detected at activate time and retried.
+	 *
+	 * mp_cmdcn corresponds to the number of MDCNs used for client
+	 * objects, i.e., [1 - mp_cmdcn]
+	 */
+	for (sidx = 1; sidx <= mp->pds_params.mp_mdcnum; sidx++) {
+		err = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap, sidx - 1);
+		if (err) {
+			mp_pr_info("mpool %s, only %u MDCs out of %lu MDCs were created",
+				  mpname, sidx - 1, (ulong)mp->pds_params.mp_mdcnum);
+			/*
+			 * For MDCN creation failure, mask the error and
+			 * continue further with create.
+			 */
+			err = 0;
+			break;
+		}
+	}
+	pmd_update_credit(mp);
+
+	/*
+	 * Attempt root mlog creation only if MDC1 was successfully created.
+	 * If MDC1 doesn't exist, it will be re-created during activate.
+	 */
+	if (sidx > 1) {
+		err = mpool_create_rmlogs(mp, mlog_cap);
+		if (err) {
+			mp_pr_info("mpool %s, root mlog creation failed", mpname);
+			/*
+			 * If root mlog creation fails, mask the error and
+			 * proceed with create. root mlogs will be re-created
+			 * during activate.
+			 */
+			err = 0;
+		}
+	}
+
+	/* Add mp to the list of all open mpools */
+	uuid_to_mpdesc_insert(&mpool_pools, mp);
+
+errout:
+
+	if (mp->pds_erase_wq)
+		destroy_workqueue(mp->pds_erase_wq);
+
+	if (active)
+		pmd_mpool_deactivate(mp);
+
+	if (err && sbvalid) {
+		struct mpool_dev_info *pd;
+		int err1;
+
+		/* Erase super blocks on the drives */
+		pd = &mp->pds_pdv[0];
+		if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+			err1 = -EIO;
+			mp_pr_err("%s:%s unavailable or offline, status %d",
+				  err1, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+		} else {
+			err1 = sb_erase(&pd->pdi_parm);
+			if (err1)
+				mp_pr_info("%s: cleanup, sb erase failed on device %s",
+					   mp->pds_name, pd->pdi_name);
+		}
+	}
+
+	mpool_desc_free(mp);
+
+	mutex_unlock(&mpool_s_lock);
+
+	return err;
+}
+
+int mpool_activate(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u64 mlog_cap,
+		   struct mpcore_params *params, u32 flags, struct mpool_descriptor **mpp)
+{
+	struct omf_sb_descriptor *sbmdc0;
+	struct mpool_descriptor *mp;
+	struct pmd_layout *mdc01 = NULL;
+	struct pmd_layout *mdc02 = NULL;
+	struct media_class *mcmeta;
+	u64 mdcmax, mdcnum, mdcncap, mdc0cap;
+	bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+	bool mc_resize[MP_MED_NUMBER] = { };
+	bool active;
+	int dup, doff, err, i;
+	u8  pdh;
+
+	active = false;
+	*mpp = NULL;
+
+	if (dcnt > MPOOL_DRIVES_MAX) {
+		err = -EINVAL;
+		mp_pr_err("too many drives in input %lu, first drive path %s",
+			  err, (ulong)dcnt, dpaths[0]);
+		return err;
+	}
+
+	/*
+	 * Verify no duplicate drive paths
+	 */
+	err = check_for_dups(dpaths, dcnt, &dup, &doff);
+	if (err) {
+		mp_pr_err("duplicate drive check failed", err);
+		return err;
+	} else if (dup) {
+		err = -EINVAL;
+		mp_pr_err("duplicate drive path %s", err, (doff == -1) ? "" : dpaths[doff]);
+		return err;
+	}
+
+	/* Alloc mpool descriptor and fill in device-independent values */
+	mp = mpool_desc_alloc();
+	if (!mp) {
+		err = -ENOMEM;
+		mp_pr_err("alloc mpool desc failed", err);
+		return err;
+	}
+
+	sbmdc0 = &(mp->pds_sbmdc0);
+
+	mp->pds_pdvcnt = 0;
+
+	if (params)
+		mp->pds_params = *params;
+
+	mutex_lock(&mpool_s_lock);
+
+	mp->pds_workq = alloc_workqueue("mpoolwq", WQ_UNBOUND, 0);
+	if (!mp->pds_workq) {
+		err = -ENOMEM;
+		mp_pr_err("alloc mpoolwq failed, first drive path %s", err, dpaths[0]);
+		goto errout;
+	}
+
+	mp->pds_erase_wq = alloc_workqueue("mperasewq", WQ_HIGHPRI, 0);
+	if (!mp->pds_erase_wq) {
+		err = -ENOMEM;
+		mp_pr_err("alloc mperasewq failed, first drive path %s", err, dpaths[0]);
+		goto errout;
+	}
+
+	/* Get device parm for all drive paths */
+	err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+	if (err) {
+		mp_pr_err("can't get drive device params, first drive path %s", err, dpaths[0]);
+		goto errout;
+	}
+
+	/* Set mp.pdvcnt so dpaths will get closed in cleanup if activate fails. */
+	mp->pds_pdvcnt = dcnt;
+
+	/* Init mpool descriptor from superblocks on drives */
+	err = mpool_desc_init_sb(mp, sbmdc0, flags, mc_resize);
+	if (err) {
+		mp_pr_err("mpool_desc_init_sb failed, first drive path %s", err, dpaths[0]);
+		goto errout;
+	}
+
+	mcmeta = &mp->pds_mc[mp->pds_mdparm.md_mclass];
+	if (mcmeta->mc_pdmc < 0) {
+		err = -ENODEV;
+		mp_pr_err("mpool %s, too many unavailable drives", err, mp->pds_name);
+		goto errout;
+	}
+
+	/* Alloc mdc0 mlog layouts from superblock and activate mpool */
+	err = mpool_mdc0_sb2obj(mp, sbmdc0, &mdc01, &mdc02);
+	if (err) {
+		mp_pr_err("mpool %s, allocation of MDC0 mlogs layouts failed", err, mp->pds_name);
+		goto errout;
+	}
+
+	err = pmd_mpool_activate(mp, mdc01, mdc02, 0);
+	if (err) {
+		mp_pr_err("mpool %s, activation failed", err, mp->pds_name);
+		goto errout;
+	}
+
+	active = true;
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+		struct mpool_dev_info  *pd;
+
+		pd = &mp->pds_pdv[pdh];
+
+		if (mc_resize[pd->pdi_mclass]) {
+			err = pmd_prop_mcconfig(mp, pd, false);
+			if (err) {
+				mp_pr_err("mpool %s, updating MCCONFIG record for resize failed",
+					  err, mp->pds_name);
+				goto errout;
+			}
+		}
+
+		if (pd->pdi_mclass == MP_MED_CAPACITY)
+			mpool_mdc_cap_init(mp, pd);
+	}
+
+	/* Tolerate unavailable drives only if force flag specified */
+	for (i = 0; !force && i < MP_MED_NUMBER; i++) {
+		struct media_class *mc;
+
+		mc = &mp->pds_mc[i];
+		if (mc->mc_uacnt) {
+			err = -ENODEV;
+			mp_pr_err("mpool %s, unavailable drives present", err, mp->pds_name);
+			goto errout;
+		}
+	}
+
+	/*
+	 * Create MDCs if needed so the user can create mlog/mblock objects.
+	 * This is only needed if the configured number of MDCs did not get
+	 * created during mpool create due to a crash or failure.
+	 */
+	mdcmax = mdcncap = mdc0cap = 0;
+	mdcnum = mp->pds_params.mp_mdcnum;
+
+	pmd_mdc_cap(mp, &mdcmax, &mdcncap, &mdc0cap);
+
+	if (mdc0cap)
+		mp->pds_params.mp_mdc0cap = mdc0cap;
+
+	if (mdcncap && mdcmax) {
+		mdcncap = mdcncap / mdcmax;
+		mp->pds_params.mp_mdcncap = mdcncap;
+		mp->pds_params.mp_mdcnum  = mdcmax;
+	}
+
+	if (mdcmax < mdcnum) {
+		mp_pr_info("mpool %s, detected missing MDCs %lu %lu",
+			   mp->pds_name, (ulong)mdcnum, (ulong)mdcmax);
+
+		for (mdcmax++; mdcmax <= mdcnum; mdcmax++) {
+
+			err = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap,
+					    mdcmax);
+			if (!err)
+				continue;
+
+			/* MDC1 creation failure - non-functional mpool */
+			if (mdcmax < 2) {
+				mp_pr_err("mpool %s, MDC1 can't be created", err, mp->pds_name);
+				goto errout;
+			}
+
+			mp_pr_notice("mpool %s, couldn't create %lu MDCs out of %lu MDCs",
+				     mp->pds_name, (ulong)(mdcnum - mdcmax + 1), (ulong)mdcnum);
+
+			/*
+			 * For MDCN (N > 1) creation failure, log a warning,
+			 * mask the error and continue with activate. Mpool
+			 * only needs a minimum of 1 MDC to be functional.
+			 */
+			err = 0;
+
+			break;
+		}
+		mp->pds_params.mp_mdcnum = mdcmax - 1;
+	}
+
+	pmd_update_credit(mp);
+
+	/*
+	 * If we reach here, then MDC1 must exist. Now, make sure that the
+	 * root mlogs also exist and if they don't, re-create them.
+	 */
+	err = mpool_create_rmlogs(mp, mlog_cap);
+	if (err) {
+		/* Root mlogs creation failure - non-functional mpool */
+		mp_pr_err("mpool %s, root mlogs creation failed", err, mp->pds_name);
+		goto errout;
+	}
+
+	/* Add mp to the list of all activated mpools */
+	uuid_to_mpdesc_insert(&mpool_pools, mp);
+
+	/* Start the background thread doing pre-compaction of MDC1/255 */
+	pmd_precompact_start(mp);
+
+errout:
+	if (err) {
+		if (mp->pds_workq)
+			destroy_workqueue(mp->pds_workq);
+		if (mp->pds_erase_wq)
+			destroy_workqueue(mp->pds_erase_wq);
+
+		if (active)
+			pmd_mpool_deactivate(mp);
+
+		mpool_desc_free(mp);
+		mp = NULL;
+	}
+
+	mutex_unlock(&mpool_s_lock);
+
+	*mpp = mp;
+
+	if (!err) {
+		/*
+		 * Start the periodic background job which logs a message
+		 * when an mpool's usable space is close to its limits.
+		 */
+		struct smap_usage_work *usagew;
+
+		usagew = &mp->pds_smap_usage_work;
+
+		INIT_DELAYED_WORK(&usagew->smapu_wstruct, smap_log_mpool_usage);
+		usagew->smapu_mp = mp;
+		smap_log_mpool_usage(&usagew->smapu_wstruct.work);
+	}
+
+	return err;
+}
+
+int mpool_deactivate(struct mpool_descriptor *mp)
+{
+	pmd_precompact_stop(mp);
+	smap_wait_usage_done(mp);
+
+	mutex_lock(&mpool_s_lock);
+	destroy_workqueue(mp->pds_workq);
+	destroy_workqueue(mp->pds_erase_wq);
+
+	pmd_mpool_deactivate(mp);
+
+	mpool_desc_free(mp);
+	mutex_unlock(&mpool_s_lock);
+
+	return 0;
+}
+
+int mpool_destroy(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u32 flags)
+{
+	struct omf_sb_descriptor *sbmdc0;
+	struct mpool_descriptor *mp;
+	int dup, doff;
+	int err, i;
+
+	if (dcnt > MPOOL_DRIVES_MAX) {
+		err = -EINVAL;
+		mp_pr_err("first pd %s, too many drives %lu %d",
+			  err, dpaths[0], (ulong)dcnt, MPOOL_DRIVES_MAX);
+		return err;
+	} else if (dcnt == 0) {
+		return -EINVAL;
+	}
+
+	/*
+	 * Verify no duplicate drive paths
+	 */
+	err = check_for_dups(dpaths, dcnt, &dup, &doff);
+	if (err) {
+		mp_pr_err("check_for_dups failed, dcnt %lu", err, (ulong)dcnt);
+		return err;
+	} else if (dup) {
+		err = -EINVAL;
+		mp_pr_err("duplicate drives found", err);
+		return err;
+	}
+
+	sbmdc0 = kzalloc(sizeof(*sbmdc0), GFP_KERNEL);
+	if (!sbmdc0) {
+		err = -ENOMEM;
+		mp_pr_err("alloc sb %zu failed", err, sizeof(*sbmdc0));
+		return err;
+	}
+
+	mp = mpool_desc_alloc();
+	if (!mp) {
+		err = -ENOMEM;
+		mp_pr_err("alloc mpool desc failed", err);
+		kfree(sbmdc0);
+		return err;
+	}
+
+	mp->pds_pdvcnt = 0;
+
+	mutex_lock(&mpool_s_lock);
+
+	/* Get device parm for all drive paths */
+	err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+	if (err) {
+		mp_pr_err("first pd %s, get device params failed", err, dpaths[0]);
+		goto errout;
+	}
+
+	/* Set pdvcnt so dpaths will get closed in cleanup if open fails. */
+	mp->pds_pdvcnt = dcnt;
+
+	/* Init mpool descriptor from superblocks on drives */
+	err = mpool_desc_init_sb(mp, sbmdc0, flags, NULL);
+	if (err) {
+		mp_pr_err("mpool %s, first pd %s, mpool desc init from sb failed",
+			  err, (mp->pds_name == NULL) ? "" : mp->pds_name, dpaths[0]);
+		goto errout;
+	}
+
+	/* Erase super blocks on the drives */
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		struct mpool_dev_info *pd;
+
+		pd = &mp->pds_pdv[i];
+		if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+			err = -EIO;
+			mp_pr_err("pd %s unavailable or offline, status %d",
+				  err, pd->pdi_name, mpool_pd_status_get(pd));
+		} else {
+			err = sb_erase(&pd->pdi_parm);
+			if (err)
+				mp_pr_err("pd %s, sb erase failed", err, pd->pdi_name);
+		}
+
+		if (err)
+			break;
+	}
+
+errout:
+	mpool_desc_free(mp);
+
+	mutex_unlock(&mpool_s_lock);
+
+	kfree(sbmdc0);
+
+	return err;
+}
+
+int mpool_rename(u64 dcnt, char **dpaths, struct pd_prop *pd_prop,
+		 u32 flags, const char *mp_newname)
+{
+	struct omf_sb_descriptor *sb;
+	struct mpool_descriptor *mp;
+	struct mpool_dev_info *pd = NULL;
+	u16 omf_ver = OMF_SB_DESC_UNDEF;
+	bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+	u8 pdh;
+	int dup, doff;
+	int err = 0;
+
+	if (!mp_newname || dcnt == 0)
+		return -EINVAL;
+
+	if (dcnt > MPOOL_DRIVES_MAX) {
+		err = -EINVAL;
+		mp_pr_err("first pd %s, too many drives %lu %d",
+			  err, dpaths[0], (ulong)dcnt, MPOOL_DRIVES_MAX);
+		return err;
+	}
+
+	/*
+	 * Verify no duplicate drive paths
+	 */
+	err = check_for_dups(dpaths, dcnt, &dup, &doff);
+	if (err) {
+		mp_pr_err("check_for_dups failed, dcnt %lu", err, (ulong)dcnt);
+		return err;
+	} else if (dup) {
+		err = -EINVAL;
+		mp_pr_err("duplicate drives found", err);
+		return err;
+	}
+
+	sb = kzalloc(sizeof(*sb), GFP_KERNEL);
+	if (!sb) {
+		err = -ENOMEM;
+		mp_pr_err("alloc sb %zu failed", err, sizeof(*sb));
+		return err;
+	}
+
+	mp = mpool_desc_alloc();
+	if (!mp) {
+		err = -ENOMEM;
+		mp_pr_err("alloc mpool desc failed", err);
+		kfree(sb);
+		return err;
+	}
+
+	mp->pds_pdvcnt = 0;
+
+	mutex_lock(&mpool_s_lock);
+
+	/* Get device parm for all drive paths */
+	err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+	if (err) {
+		mp_pr_err("first pd %s, get device params failed", err, dpaths[0]);
+		goto errout;
+	}
+
+	/* Set pdvcnt so dpaths will get closed in cleanup if open fails. */
+	mp->pds_pdvcnt = dcnt;
+
+	for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+		pd = &mp->pds_pdv[pdh];
+
+		if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+			err = -EIO;
+			mp_pr_err("pd %s unavailable or offline, status %d",
+				  err, pd->pdi_name, mpool_pd_status_get(pd));
+			goto errout;
+		}
+
+		/*
+		 * Read superblock; init and validate pool drive info
+		 * from device parameters stored in the super block.
+		 */
+		err = sb_read(&pd->pdi_parm, sb, &omf_ver, force);
+		if (err) {
+			mp_pr_err("pd %s, sb read failed", err, pd->pdi_name);
+			goto errout;
+		}
+
+		if (omf_ver != OMF_SB_DESC_VER_LAST) {
+			err = -EOPNOTSUPP;
+			mp_pr_err("pd %s, invalid sb version %d %d",
+				  err, pd->pdi_name, omf_ver, OMF_SB_DESC_VER_LAST);
+			goto errout;
+		}
+
+		if (!strcmp(mp_newname, sb->osb_name))
+			continue;
+
+		strlcpy(sb->osb_name, mp_newname, sizeof(sb->osb_name));
+
+		err = sb_write_update(&pd->pdi_parm, sb);
+		if (err) {
+			mp_pr_err("Failed to rename mpool %s on device %s",
+				  err, mp->pds_name, pd->pdi_name);
+			goto errout;
+		}
+	}
+
+errout:
+	mutex_unlock(&mpool_s_lock);
+
+	mpool_desc_free(mp);
+	kfree(sb);
+
+	return err;
+}
+
+int mpool_drive_add(struct mpool_descriptor *mp, char *dpath, struct pd_prop *pd_prop)
+{
+	struct mpool_dev_info *pd;
+	struct mc_smap_parms mcsp;
+	char *dpathv[1] = { dpath };
+	bool erase = false;
+	bool smap = false;
+	int err;
+
+	/*
+	 * All device list changes are serialized via mpool_s_lock, so we
+	 * don't need to acquire mp.pdvlock until we're ready to update the
+	 * mpool descriptor.
+	 */
+	mutex_lock(&mpool_s_lock);
+
+	if (mp->pds_pdvcnt >= MPOOL_DRIVES_MAX) {
+		mutex_unlock(&mpool_s_lock);
+
+		mp_pr_warn("%s: pd %s, too many drives %u %d",
+			   mp->pds_name, dpath, mp->pds_pdvcnt, MPOOL_DRIVES_MAX);
+		return -EINVAL;
+	}
+
+	/*
+	 * get device parm for dpath; use next slot in mp.pdv which won't
+	 * be visible until we update mp.pdvcnt
+	 */
+	pd = &mp->pds_pdv[mp->pds_pdvcnt];
+
+	/*
+	 * Some leftover state may be present due to a previous attempt to
+	 * add a PD at this position. Clear it.
+	 */
+	memset(pd, 0, sizeof(*pd));
+
+	err = mpool_dev_init_all(pd, 1, dpathv, pd_prop);
+	if (err) {
+		mutex_unlock(&mpool_s_lock);
+
+		mp_pr_err("%s: pd %s, getting drive params failed", err, mp->pds_name, dpath);
+		return err;
+	}
+
+	/* Confirm drive meets all criteria for adding to this mpool */
+	err = mpool_dev_check_new(mp, pd);
+	if (err) {
+		mp_pr_err("%s: pd %s, drive doesn't pass criteria", err, mp->pds_name, dpath);
+		goto errout;
+	}
+
+	/*
+	 * Check that the drive can be added in a media class.
+	 */
+	down_read(&mp->pds_pdvlock);
+	err = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt, NULL, true);
+	up_read(&mp->pds_pdvlock);
+	if (err) {
+		mp_pr_err("%s: pd %s, can't place in any media class", err, mp->pds_name, dpath);
+		goto errout;
+	}
+
+	mpool_generate_uuid(&pd->pdi_devid);
+
+	/* Write mpool superblock to drive */
+	erase = true;
+	err = mpool_dev_sbwrite(mp, pd, NULL);
+	if (err) {
+		mp_pr_err("%s: pd %s, sb write failed", err, mp->pds_name, dpath);
+		goto errout;
+	}
+
+	/* Get percent spare */
+	down_read(&mp->pds_pdvlock);
+	err = mc_smap_parms_get(&mp->pds_mc[pd->pdi_mclass], &mp->pds_params, &mcsp);
+	up_read(&mp->pds_pdvlock);
+	if (err)
+		goto errout;
+
+	/* Alloc space map for drive */
+	err = smap_drive_init(mp, &mcsp, mp->pds_pdvcnt);
+	if (err) {
+		mp_pr_err("%s: pd %s, smap init failed", err, mp->pds_name, dpath);
+		goto errout;
+	}
+	smap = true;
+
+	/*
+	 * Take MDC0 compact lock to prevent race with MDC0 compaction.
+	 * Take it across memory and media update.
+	 */
+	PMD_MDC0_COMPACTLOCK(mp);
+
+	/*
+	 * Add a drive state record to MDC0; if we crash any time prior to
+	 * adding this record, the drive will not be recognized as an mpool
+	 * member on the next open.
+	 */
+	err = pmd_prop_mcconfig(mp, pd, false);
+	if (err) {
+		PMD_MDC0_COMPACTUNLOCK(mp);
+		mp_pr_err("%s: pd %s, adding drive state to MDC0 failed", err, mp->pds_name, dpath);
+		goto errout;
+	}
+
+	/* Make new drive visible in mpool */
+	down_write(&mp->pds_pdvlock);
+	mp->pds_pdvcnt++;
+
+	/*
+	 * Add the PD in its class. That should NOT fail because we already
+	 * checked that the drive can be added in a media class.
+	 */
+	err = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt - 1, NULL, false);
+	if (err)
+		mp->pds_pdvcnt--;
+
+	up_write(&mp->pds_pdvlock);
+	PMD_MDC0_COMPACTUNLOCK(mp);
+
+errout:
+	if (err) {
+		/*
+		 * No pd could have been added at mp->pds_pdvcnt since we
+		 * dropped pds_pdvlock, because mpool_s_lock is still held.
+		 */
+		if (smap)
+			smap_drive_free(mp, mp->pds_pdvcnt);
+
+		/*
+		 * Erase the pd super blocks only if the pd doesn't already
+		 * belong to this mpool or another one.
+		 */
+		if (erase)
+			sb_erase(&pd->pdi_parm);
+
+		pd_dev_close(&pd->pdi_parm);
+	}
+
+	mutex_unlock(&mpool_s_lock);
+
+	return err;
+}
+
+void mpool_mclass_get_cnt(struct mpool_descriptor *mp, u32 *cnt)
+{
+	int i;
+
+	*cnt = 0;
+
+	down_read(&mp->pds_pdvlock);
+	for (i = 0; i < MP_MED_NUMBER; i++) {
+		struct media_class *mc;
+
+		mc = &mp->pds_mc[i];
+		if (mc->mc_pdmc >= 0)
+			(*cnt)++;
+	}
+	up_read(&mp->pds_pdvlock);
+}
+
+int mpool_mclass_get(struct mpool_descriptor *mp, u32 *mcxc, struct mpool_mclass_xprops *mcxv)
+{
+	int i, n;
+
+	if (!mp || !mcxc || !mcxv)
+		return -EINVAL;
+
+	mutex_lock(&mpool_s_lock);
+	down_read(&mp->pds_pdvlock);
+
+	for (n = i = 0; i < MP_MED_NUMBER && n < *mcxc; i++) {
+		struct media_class *mc;
+
+		mc = &mp->pds_mc[i];
+		if (mc->mc_pdmc < 0)
+			continue;
+
+		mcxv->mc_mclass = mc->mc_parms.mcp_classp;
+		mcxv->mc_devtype = mc->mc_parms.mcp_devtype;
+		mcxv->mc_spare = mc->mc_sparms.mcsp_spzone;
+
+		mcxv->mc_zonepg = mc->mc_parms.mcp_zonepg;
+		mcxv->mc_sectorsz = mc->mc_parms.mcp_sectorsz;
+		mcxv->mc_features = mc->mc_parms.mcp_features;
+		mcxv->mc_uacnt = mc->mc_uacnt;
+		smap_mclass_usage(mp, i, &mcxv->mc_usage);
+
+		++mcxv;
+		++n;
+	}
+
+	up_read(&mp->pds_pdvlock);
+	mutex_unlock(&mpool_s_lock);
+
+	*mcxc = n;
+
+	return 0;
+}
+
+int mpool_drive_spares(struct mpool_descriptor *mp, enum mp_media_classp mclassp, u8 drive_spares)
+{
+	struct media_class *mc;
+	int err;
+
+	if (!mclass_isvalid(mclassp) || drive_spares > 100) {
+		err = -EINVAL;
+		mp_pr_err("mpool %s, setting percent %u spare for drives in media class %d failed",
+			  err, mp->pds_name, drive_spares, mclassp);
+		return err;
+	}
+
+	/*
+	 * Do not write the spare record or try updating spare if there are
+	 * no PDs in the specified media class.
+	 */
+	down_read(&mp->pds_pdvlock);
+	mc = &mp->pds_mc[mclassp];
+	up_read(&mp->pds_pdvlock);
+
+	if (mc->mc_pdmc < 0) {
+		err = -ENOENT;
+		goto skip_update;
+	}
+
+	mutex_lock(&mpool_s_lock);
+
+	/*
+	 * Take mdc0 compact lock to prevent race with mdc0 compaction.
+	 * Also make memory and media update to look atomic to compaction.
+	 */
+	PMD_MDC0_COMPACTLOCK(mp);
+
+	/*
+	 * Update the media class spare record in MDC0; there is no effect
+	 * if we crash before the update completes.
+	 */
+	err = pmd_prop_mcspare(mp, mclassp, drive_spares, false);
+	if (err) {
+		mp_pr_err("mpool %s, setting spare %u mclass %d failed, could not record in MDC0",
+			  err, mp->pds_name, drive_spares, mclassp);
+	} else {
+		/* Update spare zone accounting for media class */
+		down_write(&mp->pds_pdvlock);
+
+		err = mc_set_spzone(&mp->pds_mc[mclassp], drive_spares);
+		if (err)
+			mp_pr_err("mpool %s, setting spare %u mclass %d failed",
+				  err, mp->pds_name, drive_spares, mclassp);
+		else
+			/*
+			 * smap accounting update always succeeds when
+			 * mclassp/zone are valid
+			 */
+			smap_drive_spares(mp, mclassp, drive_spares);
+
+		up_write(&mp->pds_pdvlock);
+	}
+
+	PMD_MDC0_COMPACTUNLOCK(mp);
+
+	mutex_unlock(&mpool_s_lock);
+
+skip_update:
+	return err;
+}
+
+void mpool_get_xprops(struct mpool_descriptor *mp, struct mpool_xprops *xprops)
+{
+	struct media_class *mc;
+	int mclassp, i;
+	u16 ftmax;
+
+	mutex_lock(&mpool_s_lock);
+	down_read(&mp->pds_pdvlock);
+
+	memcpy(xprops->ppx_params.mp_poolid.b, mp->pds_poolid.uuid, MPOOL_UUID_SIZE);
+	ftmax = 0;
+
+	for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++) {
+		xprops->ppx_pd_mclassv[mclassp] = MP_MED_INVALID;
+
+		mc = &mp->pds_mc[mclassp];
+		if (mc->mc_pdmc < 0) {
+			xprops->ppx_drive_spares[mclassp] = 0;
+			xprops->ppx_uacnt[mclassp] = 0;
+
+			xprops->ppx_params.mp_mblocksz[mclassp] = 0;
+			continue;
+		}
+
+		xprops->ppx_drive_spares[mclassp] = mc->mc_sparms.mcsp_spzone;
+		xprops->ppx_uacnt[mclassp] = mc->mc_uacnt;
+		ftmax = max((u16)ftmax, (u16)(xprops->ppx_uacnt[mclassp]));
+		xprops->ppx_params.mp_mblocksz[mclassp] =
+			(mc->mc_parms.mcp_zonepg << PAGE_SHIFT) >> 20;
+	}
+
+	for (i = 0; i < mp->pds_pdvcnt; ++i) {
+		mc = &mp->pds_mc[mp->pds_pdv[i].pdi_mclass];
+		if (mc->mc_pdmc < 0)
+			continue;
+
+		xprops->ppx_pd_mclassv[i] = mc->mc_parms.mcp_classp;
+
+		strlcpy(xprops->ppx_pd_namev[i], mp->pds_pdv[i].pdi_name,
+			sizeof(xprops->ppx_pd_namev[i]));
+	}
+
+	up_read(&mp->pds_pdvlock);
+	mutex_unlock(&mpool_s_lock);
+
+	xprops->ppx_params.mp_stat = ftmax ? MPOOL_STAT_FAULTED : MPOOL_STAT_OPTIMAL;
+}
+
+int mpool_get_devprops_by_name(struct mpool_descriptor *mp, char *pdname,
+			       struct mpool_devprops *dprop)
+{
+	int err = -ENOENT;
+	int i;
+
+	down_read(&mp->pds_pdvlock);
+
+	for (i = 0; i < mp->pds_pdvcnt; i++) {
+		if (!strcmp(pdname, mp->pds_pdv[i].pdi_name)) {
+			fill_in_devprops(mp, i, dprop);
+			err = 0;
+			break;
+		}
+	}
+
+	up_read(&mp->pds_pdvlock);
+
+	return err;
+}
+
+void mpool_get_usage(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+		     struct mpool_usage *usage)
+{
+	memset(usage, 0, sizeof(*usage));
+
+	down_read(&mp->pds_pdvlock);
+	if (mclassp != MP_MED_ALL) {
+		struct media_class *mc;
+
+		ASSERT(mclassp < MP_MED_NUMBER);
+
+		mc = &mp->pds_mc[mclassp];
+		if (mc->mc_pdmc < 0) {
+			/* Not an error, this media class is empty. */
+			up_read(&mp->pds_pdvlock);
+			return;
+		}
+	}
+	smap_mpool_usage(mp, mclassp, usage);
+	up_read(&mp->pds_pdvlock);
+
+	if (mclassp == MP_MED_ALL)
+		pmd_mpool_usage(mp, usage);
+}
+
+int mpool_config_store(struct mpool_descriptor *mp, const struct mpool_config *cfg)
+{
+	int err;
+
+	if (!mp || !cfg)
+		return -EINVAL;
+
+	mp->pds_cfg = *cfg;
+
+	err = pmd_prop_mpconfig(mp, cfg, false);
+	if (err)
+		mp_pr_err("mpool %s, logging config record failed", err, mp->pds_name);
+
+	return err;
+}
+
+int mpool_config_fetch(struct mpool_descriptor *mp, struct mpool_config *cfg)
+{
+	if (!mp || !cfg)
+		return -EINVAL;
+
+	*cfg = mp->pds_cfg;
+
+	return 0;
+}
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 16/22] mpool: add mpool control plane utility routines
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (14 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 15/22] mpool: add mpool lifecycle management routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 17/22] mpool: add mpool lifecycle management ioctls Nabeel M Mohamed
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds the mpool control plane infrastructure to manage
mpools.

There is a unit object instance for each device object
created by the mpool driver. A minor number is reserved
for each unit object.

The mpool control device (/dev/mpoolctl) gets a minor
number of 0. An mpool device (/dev/mpool/<mpool_name>)
gets a minor number > 0. Utility routines exist to lookup
an mpool unit given its minor number or name.

All units are born with a reference count of two -
one for the caller and a birth reference that can be released
only by either destroying the unit or unloading the module.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c  |  20 ++
 drivers/mpool/init.h  |   3 +
 drivers/mpool/mpctl.c | 465 ++++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/mpctl.h |  49 +++++
 drivers/mpool/sysfs.c |  48 +++++
 drivers/mpool/sysfs.h |  48 +++++
 6 files changed, 633 insertions(+)
 create mode 100644 drivers/mpool/mpctl.c
 create mode 100644 drivers/mpool/mpctl.h
 create mode 100644 drivers/mpool/sysfs.c
 create mode 100644 drivers/mpool/sysfs.h

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index eb1217f63746..126c6c7142b5 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -12,10 +12,23 @@
 #include "smap.h"
 #include "pmd_obj.h"
 #include "sb.h"
+#include "mpctl.h"
 
 /*
  * Module params...
  */
+unsigned int maxunits __read_mostly = 1024;
+module_param(maxunits, uint, 0444);
+MODULE_PARM_DESC(maxunits, " max mpools");
+
+unsigned int rwsz_max_mb __read_mostly = 32;
+module_param(rwsz_max_mb, uint, 0444);
+MODULE_PARM_DESC(rwsz_max_mb, " max mblock/mlog r/w size (MiB)");
+
+unsigned int rwconc_max __read_mostly = 8;
+module_param(rwconc_max, uint, 0444);
+MODULE_PARM_DESC(rwconc_max, " max mblock/mlog large r/w concurrency");
+
 unsigned int rsvd_bios_max __read_mostly = 16;
 module_param(rsvd_bios_max, uint, 0444);
 MODULE_PARM_DESC(rsvd_bios_max, "max reserved bios in mpool bioset");
@@ -26,6 +39,7 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 
 static void mpool_exit_impl(void)
 {
+	mpctl_exit();
 	pmd_exit();
 	smap_exit();
 	sb_exit();
@@ -68,6 +82,12 @@ static __init int mpool_init(void)
 		goto errout;
 	}
 
+	rc = mpctl_init();
+	if (rc) {
+		errmsg = "mpctl init failed";
+		goto errout;
+	}
+
 errout:
 	if (rc) {
 		mp_pr_err("%s", rc, errmsg);
diff --git a/drivers/mpool/init.h b/drivers/mpool/init.h
index e02a9672e727..3d8f809a5e45 100644
--- a/drivers/mpool/init.h
+++ b/drivers/mpool/init.h
@@ -6,6 +6,9 @@
 #ifndef MPOOL_INIT_H
 #define MPOOL_INIT_H
 
+extern unsigned int maxunits;
+extern unsigned int rwsz_max_mb;
+extern unsigned int rwconc_max;
 extern unsigned int rsvd_bios_max;
 extern int chunk_size_kb;
 
diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
new file mode 100644
index 000000000000..21eb7ac4610b
--- /dev/null
+++ b/drivers/mpool/mpctl.c
@@ -0,0 +1,465 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/cdev.h>
+#include <linux/log2.h>
+#include <linux/idr.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/blkdev.h>
+#include <linux/vmalloc.h>
+#include <linux/memcontrol.h>
+#include <linux/pagemap.h>
+#include <linux/kobject.h>
+#include <linux/mm_inline.h>
+#include <linux/version.h>
+#include <linux/kref.h>
+
+#include <linux/backing-dev.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/migrate.h>
+#include <linux/delay.h>
+#include <linux/ctype.h>
+#include <linux/uio.h>
+
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "mpool_ioctl.h"
+#include "mlog.h"
+#include "mp.h"
+#include "mpctl.h"
+#include "sysfs.h"
+#include "init.h"
+
+
+#define NODEV               MKDEV(0, 0)    /* Non-existent device */
+
+/* mpc pseudo-driver instance data (i.e., all globals live here). */
+struct mpc_softstate {
+	struct mutex        ss_lock;        /* Protects ss_unitmap */
+	struct idr          ss_unitmap;     /* minor-to-unit map */
+
+	____cacheline_aligned
+	struct semaphore    ss_op_sema;     /* Serialize mgmt. ops */
+	dev_t               ss_devno;       /* Control device devno */
+	struct cdev         ss_cdev;
+	struct class       *ss_class;
+	bool                ss_inited;
+};
+
+/* Unit-type specific information. */
+struct mpc_uinfo {
+	const char     *ui_typename;
+	const char     *ui_subdirfmt;
+};
+
+/* One mpc_mpool object per mpool. */
+struct mpc_mpool {
+	struct kref                 mp_ref;
+	struct rw_semaphore         mp_lock;
+	struct mpool_descriptor    *mp_desc;
+	struct mp_mdc              *mp_mdc;
+	uint                        mp_dpathc;
+	char                      **mp_dpathv;
+	char                        mp_name[];
+};
+
+/* The following structures are initialized at the end of this file. */
+static const struct file_operations mpc_fops_default;
+
+static struct mpc_softstate mpc_softstate;
+
+static unsigned int mpc_ctl_uid __read_mostly;
+static unsigned int mpc_ctl_gid __read_mostly = 6;
+static unsigned int mpc_ctl_mode __read_mostly = 0664;
+
+static const struct mpc_uinfo mpc_uinfo_ctl = {
+	.ui_typename = "mpoolctl",
+	.ui_subdirfmt = "%s",
+};
+
+static const struct mpc_uinfo mpc_uinfo_mpool = {
+	.ui_typename = "mpool",
+	.ui_subdirfmt = "mpool/%s",
+};
+
+static inline bool mpc_unit_isctldev(const struct mpc_unit *unit)
+{
+	return (unit->un_uinfo == &mpc_uinfo_ctl);
+}
+
+static inline bool mpc_unit_ismpooldev(const struct mpc_unit *unit)
+{
+	return (unit->un_uinfo == &mpc_uinfo_mpool);
+}
+
+static inline uid_t mpc_current_uid(void)
+{
+	return from_kuid(current_user_ns(), current_uid());
+}
+
+static inline gid_t mpc_current_gid(void)
+{
+	return from_kgid(current_user_ns(), current_gid());
+}
+
+/**
+ * mpc_mpool_release() - release kref handler for mpc_mpool object
+ * @refp:  kref pointer
+ */
+static void mpc_mpool_release(struct kref *refp)
+{
+	struct mpc_mpool *mpool = container_of(refp, struct mpc_mpool, mp_ref);
+	int rc;
+
+	if (mpool->mp_desc) {
+		rc = mpool_deactivate(mpool->mp_desc);
+		if (rc)
+			mp_pr_err("mpool %s deactivate failed", rc, mpool->mp_name);
+	}
+
+	kfree(mpool->mp_dpathv);
+	kfree(mpool);
+
+	module_put(THIS_MODULE);
+}
+
+static void mpc_mpool_put(struct mpc_mpool *mpool)
+{
+	kref_put(&mpool->mp_ref, mpc_mpool_release);
+}
+
+/**
+ * mpc_unit_create() - Create and install a unit object
+ * @path:         device path under "/dev/" to create
+ * @mpool:        mpool ptr
+ * @unitp:        unit ptr
+ *
+ * Create a unit object and install a NULL ptr for it in the units map,
+ * thereby reserving a minor number.  The unit cannot be found by any
+ * of the lookup routines until the NULL ptr is replaced by the actual
+ * ptr to the unit.
+ *
+ * A unit maps an mpool device (e.g., /dev/mpool/foo) to an mpool object
+ * created by mpool_create().
+ *
+ * All units are born with two references, one for the caller and one that
+ * can only be released by destroying the unit or unloading the module.
+ *
+ * Return:  Returns 0 if successful and sets *unitp.
+ *          Returns -errno on error.
+ */
+static int mpc_unit_create(const char *name, struct mpc_mpool *mpool, struct mpc_unit **unitp)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpc_unit *unit;
+	size_t unitsz;
+	int minor;
+
+	if (!ss || !name || !unitp)
+		return -EINVAL;
+
+	unitsz = sizeof(*unit) + strlen(name) + 1;
+
+	unit = kzalloc(unitsz, GFP_KERNEL);
+	if (!unit)
+		return -ENOMEM;
+
+	strcpy(unit->un_name, name);
+
+	sema_init(&unit->un_open_lock, 1);
+	unit->un_open_excl = false;
+	unit->un_open_cnt = 0;
+	unit->un_devno = NODEV;
+	kref_init(&unit->un_ref);
+	unit->un_mpool = mpool;
+
+	mutex_lock(&ss->ss_lock);
+	minor = idr_alloc(&ss->ss_unitmap, NULL, 0, -1, GFP_KERNEL);
+	mutex_unlock(&ss->ss_lock);
+
+	if (minor < 0) {
+		kfree(unit);
+		return minor;
+	}
+
+	kref_get(&unit->un_ref); /* acquire additional ref for the caller */
+
+	unit->un_devno = MKDEV(MAJOR(ss->ss_cdev.dev), minor);
+	*unitp = unit;
+
+	return 0;
+}
+
+/**
+ * mpc_unit_release() - Destroy a unit object created by mpc_unit_create().
+ * @refp: kref embedded in the unit object being released
+ */
+static void mpc_unit_release(struct kref *refp)
+{
+	struct mpc_unit *unit = container_of(refp, struct mpc_unit, un_ref);
+	struct mpc_softstate *ss = &mpc_softstate;
+
+	mutex_lock(&ss->ss_lock);
+	idr_remove(&ss->ss_unitmap, MINOR(unit->un_devno));
+	mutex_unlock(&ss->ss_lock);
+
+	if (unit->un_mpool)
+		mpc_mpool_put(unit->un_mpool);
+
+	if (unit->un_device)
+		device_destroy(ss->ss_class, unit->un_devno);
+
+	kfree(unit);
+}
+
+static void mpc_unit_put(struct mpc_unit *unit)
+{
+	if (unit)
+		kref_put(&unit->un_ref, mpc_unit_release);
+}
+
+/**
+ * mpc_unit_setup() - Create a device unit object and special file
+ * @uinfo:  unit-type specific info
+ * @name:   mpool name
+ * @cfg:    mpool config
+ * @mpool:  mpool ptr (may be NULL for the control device)
+ * @unitp:  unit ptr, set on success
+ *
+ * If successful, this function adopts mpool.  On failure, mpool
+ * remains the responsibility of the caller.
+ *
+ * All units are born with two references, one for the caller and one
+ * that can only be released by destroying the unit or unloading the
+ * module. Note that a NULL unitp is rejected with -EINVAL before any
+ * unit is created.
+ *
+ * Return:  Returns 0 on success, -errno otherwise...
+ */
+static int mpc_unit_setup(const struct mpc_uinfo *uinfo, const char *name,
+			  const struct mpool_config *cfg, struct mpc_mpool *mpool,
+			  struct mpc_unit **unitp)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpc_unit *unit;
+	struct device *device;
+	int rc;
+
+	if (!ss || !uinfo || !name || !name[0] || !cfg || !unitp)
+		return -EINVAL;
+
+	if (cfg->mc_uid == -1 || cfg->mc_gid == -1 || cfg->mc_mode == -1)
+		return -EINVAL;
+
+	if (!capable(CAP_MKNOD))
+		return -EPERM;
+
+	if (cfg->mc_uid != mpc_current_uid() && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if (cfg->mc_gid != mpc_current_gid() && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if (mpool && strcmp(mpool->mp_name, name))
+		return -EINVAL;
+
+	*unitp = NULL;
+	unit = NULL;
+
+	/*
+	 * Try to create a new unit object.  If successful, then all error
+	 * handling beyond this point must route through the errout label
+	 * to ensure the unit is fully destroyed.
+	 */
+	rc = mpc_unit_create(name, mpool, &unit);
+	if (rc)
+		return rc;
+
+	unit->un_uid = cfg->mc_uid;
+	unit->un_gid = cfg->mc_gid;
+	unit->un_mode = cfg->mc_mode;
+
+	unit->un_mdc_captgt = cfg->mc_captgt;
+	memcpy(&unit->un_utype, &cfg->mc_utype, sizeof(unit->un_utype));
+	strlcpy(unit->un_label, cfg->mc_label, sizeof(unit->un_label));
+	unit->un_ds_oidv[0] = cfg->mc_oid1;
+	unit->un_ds_oidv[1] = cfg->mc_oid2;
+	unit->un_ra_pages_max = cfg->mc_ra_pages_max;
+
+	device = device_create(ss->ss_class, NULL, unit->un_devno, unit, uinfo->ui_subdirfmt, name);
+	if (IS_ERR(device)) {
+		rc = PTR_ERR(device);
+		mp_pr_err("device_create %s failed", rc, name);
+		goto errout;
+	}
+
+	unit->un_device = device;
+	unit->un_uinfo = uinfo;
+
+	dev_info(unit->un_device, "minor %u, uid %u, gid %u, mode 0%02o",
+		 MINOR(unit->un_devno), cfg->mc_uid, cfg->mc_gid, cfg->mc_mode);
+
+	*unitp = unit;
+
+errout:
+	if (rc) {
+		/*
+		 * Acquire an additional reference on mpool (if any) so that
+		 * it is not errantly destroyed along with the unit, then
+		 * release both the unit's birth and caller's references,
+		 * which should destroy the unit.
+		 */
+		if (mpool)
+			kref_get(&mpool->mp_ref);
+		mpc_unit_put(unit);
+		mpc_unit_put(unit);
+	}
+
+	return rc;
+}
+
+/**
+ * mpc_uevent() - Hook to intercept and modify uevents before they're posted to udev
+ * @dev:    mpc driver device
+ * @env:
+ *
+ * See man 7 udev for more info.
+ */
+static int mpc_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+	struct mpc_unit *unit = dev_get_drvdata(dev);
+
+	if (unit) {
+		add_uevent_var(env, "DEVMODE=%#o", unit->un_mode);
+		add_uevent_var(env, "DEVUID=%u", unit->un_uid);
+		add_uevent_var(env, "DEVGID=%u", unit->un_gid);
+	}
+
+	return 0;
+}
+
+static int mpc_exit_unit(int minor, void *item, void *arg)
+{
+	mpc_unit_put(item);
+
+	return ITERCB_NEXT;
+}
+
+/**
+ * mpctl_exit() - Tear down and unload the mpool control module.
+ */
+void mpctl_exit(void)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+
+	if (ss->ss_inited) {
+		idr_for_each(&ss->ss_unitmap, mpc_exit_unit, NULL);
+		idr_destroy(&ss->ss_unitmap);
+
+		if (ss->ss_devno != NODEV) {
+			if (ss->ss_class) {
+				if (ss->ss_cdev.ops)
+					cdev_del(&ss->ss_cdev);
+				class_destroy(ss->ss_class);
+			}
+			unregister_chrdev_region(ss->ss_devno, maxunits);
+		}
+
+		ss->ss_inited = false;
+	}
+}
+
+/**
+ * mpctl_init() - Load and initialize the mpool control module.
+ */
+int mpctl_init(void)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpool_config *cfg = NULL;
+	struct mpc_unit *ctlunit;
+	const char *errmsg = NULL;
+	int rc;
+
+	if (ss->ss_inited)
+		return -EBUSY;
+
+	ctlunit = NULL;
+
+	maxunits = clamp_t(uint, maxunits, 8, 8192);
+
+	cdev_init(&ss->ss_cdev, &mpc_fops_default);
+	ss->ss_cdev.owner = THIS_MODULE;
+
+	mutex_init(&ss->ss_lock);
+	idr_init(&ss->ss_unitmap);
+	ss->ss_class = NULL;
+	ss->ss_devno = NODEV;
+	sema_init(&ss->ss_op_sema, 1);
+	ss->ss_inited = true;
+
+	rc = alloc_chrdev_region(&ss->ss_devno, 0, maxunits, "mpool");
+	if (rc) {
+		errmsg = "cannot allocate control device major";
+		ss->ss_devno = NODEV;
+		goto errout;
+	}
+
+	ss->ss_class = class_create(THIS_MODULE, module_name(THIS_MODULE));
+	if (IS_ERR(ss->ss_class)) {
+		errmsg = "class_create() failed";
+		rc = PTR_ERR(ss->ss_class);
+		ss->ss_class = NULL;
+		goto errout;
+	}
+
+	ss->ss_class->dev_uevent = mpc_uevent;
+
+	rc = cdev_add(&ss->ss_cdev, ss->ss_devno, maxunits);
+	if (rc) {
+		errmsg = "cdev_add() failed";
+		ss->ss_cdev.ops = NULL;
+		goto errout;
+	}
+
+	cfg = kzalloc(sizeof(*cfg), GFP_KERNEL);
+	if (!cfg) {
+		errmsg = "cfg alloc failed";
+		rc = -ENOMEM;
+		goto errout;
+	}
+
+	cfg->mc_uid = mpc_ctl_uid;
+	cfg->mc_gid = mpc_ctl_gid;
+	cfg->mc_mode = mpc_ctl_mode;
+
+	rc = mpc_unit_setup(&mpc_uinfo_ctl, MPC_DEV_CTLNAME, cfg, NULL, &ctlunit);
+	if (rc) {
+		errmsg = "cannot create control device";
+		goto errout;
+	}
+
+	mutex_lock(&ss->ss_lock);
+	idr_replace(&ss->ss_unitmap, ctlunit, MINOR(ctlunit->un_devno));
+	mutex_unlock(&ss->ss_lock);
+
+	mpc_unit_put(ctlunit);
+
+errout:
+	if (rc) {
+		mp_pr_err("%s", rc, errmsg);
+		mpctl_exit();
+	}
+
+	kfree(cfg);
+
+	return rc;
+}
diff --git a/drivers/mpool/mpctl.h b/drivers/mpool/mpctl.h
new file mode 100644
index 000000000000..412a6a491c15
--- /dev/null
+++ b/drivers/mpool/mpctl.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MPCTL_H
+#define MPOOL_MPCTL_H
+
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/device.h>
+#include <linux/semaphore.h>
+
+#define ITERCB_NEXT     (0)
+#define ITERCB_DONE     (1)
+
+/* There is one unit object for each device object created by the driver. */
+struct mpc_unit {
+	struct kref                 un_ref;
+	int                         un_open_cnt;    /* Unit open count */
+	struct semaphore            un_open_lock;   /* Protects un_open_* */
+	bool                        un_open_excl;   /* Unit exclusively open */
+	uid_t                       un_uid;
+	gid_t                       un_gid;
+	mode_t                      un_mode;
+	dev_t                       un_devno;
+	const struct mpc_uinfo     *un_uinfo;
+	struct mpc_mpool           *un_mpool;
+	struct address_space       *un_mapping;
+	struct device              *un_device;
+	struct mpc_attr            *un_attr;
+	uint                        un_rawio;       /* log2(max_mblock_size) */
+	u64                         un_ds_oidv[2];
+	u32                         un_ra_pages_max;
+	u64                         un_mdc_captgt;
+	uuid_le                     un_utype;
+	u8                          un_label[MPOOL_LABELSZ_MAX];
+	char                        un_name[];
+};
+
+static inline struct mpc_unit *dev_to_unit(struct device *dev)
+{
+	return dev_get_drvdata(dev);
+}
+
+int mpctl_init(void) __cold;
+void mpctl_exit(void) __cold;
+
+#endif /* MPOOL_MPCTL_H */
diff --git a/drivers/mpool/sysfs.c b/drivers/mpool/sysfs.c
new file mode 100644
index 000000000000..638106a77669
--- /dev/null
+++ b/drivers/mpool/sysfs.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/slab.h>
+
+#include "sysfs.h"
+
+struct mpc_attr *mpc_attr_create(struct device *dev, const char *name, int acnt)
+{
+	struct mpc_attr *attr;
+	int i;
+
+	attr = kzalloc(sizeof(*attr) + acnt * sizeof(*attr->a_dattr) +
+		       (acnt + 1) * sizeof(*attr->a_attrs), GFP_KERNEL);
+	if (!attr)
+		return NULL;
+
+	attr->a_kobj = &dev->kobj;
+
+	attr->a_dattr = (void *)(attr + 1);
+
+	attr->a_attrs = (void *)(attr->a_dattr + acnt);
+	for (i = 0; i < acnt; i++)
+		attr->a_attrs[i] = &attr->a_dattr[i].attr;
+	attr->a_attrs[i] = NULL;
+
+	attr->a_group.attrs = attr->a_attrs;
+	attr->a_group.name = name;
+
+	return attr;
+}
+
+void mpc_attr_destroy(struct mpc_attr *attr)
+{
+	kfree(attr);
+}
+
+int mpc_attr_group_create(struct mpc_attr *attr)
+{
+	return sysfs_create_group(attr->a_kobj, &attr->a_group);
+}
+
+void mpc_attr_group_destroy(struct mpc_attr *attr)
+{
+	sysfs_remove_group(attr->a_kobj, &attr->a_group);
+}
diff --git a/drivers/mpool/sysfs.h b/drivers/mpool/sysfs.h
new file mode 100644
index 000000000000..b161eceec75f
--- /dev/null
+++ b/drivers/mpool/sysfs.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_SYSFS_H
+#define MPOOL_SYSFS_H
+
+#include <linux/device.h>
+#include <linux/sysfs.h>
+
+
+#define MPC_ATTR(_da, _name, _mode)                \
+	(_da)->attr.name = __stringify(_name);     \
+	(_da)->attr.mode = (_mode);                \
+	(_da)->show      = mpc_##_name##_show      \
+
+#define MPC_ATTR_RO(_dattr, _name)                 \
+	do {                                       \
+		__typeof(_dattr) da = (_dattr);    \
+		MPC_ATTR(da, _name, 0444);         \
+		da->store = NULL;                  \
+	} while (0)
+
+#define MPC_ATTR_RW(_dattr, _name)                 \
+	do {                                       \
+		__typeof(_dattr) da = (_dattr);    \
+		MPC_ATTR(da, _name, 0644);         \
+		da->store = mpc_##_name##_store;   \
+	} while (0)
+
+
+struct mpc_attr {
+	struct attribute_group       a_group;
+	struct kobject              *a_kobj;
+	struct device_attribute     *a_dattr;
+	struct attribute           **a_attrs;
+};
+
+struct mpc_attr *mpc_attr_create(struct device *d, const char *name, int acnt);
+
+void mpc_attr_destroy(struct mpc_attr *attr);
+
+int mpc_attr_group_create(struct mpc_attr *attr);
+
+void mpc_attr_group_destroy(struct mpc_attr *attr);
+
+#endif /* MPOOL_SYSFS_H */
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 17/22] mpool: add mpool lifecycle management ioctls
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (15 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 16/22] mpool: add mpool control plane utility routines Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 18/22] mpool: add object " Nabeel M Mohamed
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds the open, release and mpool management ioctls for
the mpool driver.

The create, destroy, activate, deactivate and rename ioctls
are issued to the mpool control device (/dev/mpoolctl),
and the rest are issued to the mpool device
(/dev/mpool/<mpool_name>).

The mpool control device is owned by (root, disk) with
mode 0664. Non-default uid, gid and mode can be assigned to
an mpool device either at create time or post creation using
the params set ioctl.

Both the per-mpool and common parameters are available in
the sysfs device tree path created by the kernel for each mpool
device minor (/sys/devices/virtual/mpool). Mpool parameters
cannot be changed via the sysfs tree at this point.

The mpool management ioctl handlers invoke the mpool lifecycle
management routines to administer mpools. Activating an mpool
creates a unit object that stores key state: a reference to the
device object, a reference to the mpc_mpool instance containing
the per-mpool private data, device properties, ownership and mode
bits, the device open count, flags, and so on. The per-mpool
parameters are persisted in MDC0 at activation.

Deactivating an mpool tears down the unit object and releases
all its associated resources.

An mpool can be renamed only when it's deactivated.  Renaming
an mpool updates the superblock on all its constituent storage
volumes with the new mpool name.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mpctl.c | 1560 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1560 insertions(+)

diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index 21eb7ac4610b..de62f9a5524d 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -80,6 +80,9 @@ static struct mpc_softstate mpc_softstate;
 static unsigned int mpc_ctl_uid __read_mostly;
 static unsigned int mpc_ctl_gid __read_mostly = 6;
 static unsigned int mpc_ctl_mode __read_mostly = 0664;
+static unsigned int mpc_default_uid __read_mostly;
+static unsigned int mpc_default_gid __read_mostly = 6;
+static unsigned int mpc_default_mode __read_mostly = 0660;
 
 static const struct mpc_uinfo mpc_uinfo_ctl = {
 	.ui_typename = "mpoolctl",
@@ -111,6 +114,202 @@ static inline gid_t mpc_current_gid(void)
 	return from_kgid(current_user_ns(), current_gid());
 }
 
+#define MPC_MPOOL_PARAMS_CNT     7
+
+static ssize_t mpc_uid_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_unit(dev)->un_uid);
+}
+
+static ssize_t mpc_gid_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_unit(dev)->un_gid);
+}
+
+static ssize_t mpc_mode_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "0%o\n", dev_to_unit(dev)->un_mode);
+}
+
+static ssize_t mpc_ra_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%u\n", dev_to_unit(dev)->un_ra_pages_max);
+}
+
+static ssize_t mpc_label_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%s\n", dev_to_unit(dev)->un_label);
+}
+
+static ssize_t mpc_type_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	struct mpool_uuid  uuid;
+	char               uuid_str[MPOOL_UUID_STRING_LEN + 1] = { };
+
+	memcpy(uuid.uuid, dev_to_unit(dev)->un_utype.b, MPOOL_UUID_SIZE);
+	mpool_unparse_uuid(&uuid, uuid_str);
+
+	return scnprintf(buf, PAGE_SIZE, "%s\n", uuid_str);
+}
+
+static void mpc_mpool_params_add(struct device_attribute *dattr)
+{
+	MPC_ATTR_RO(dattr++, uid);
+	MPC_ATTR_RO(dattr++, gid);
+	MPC_ATTR_RO(dattr++, mode);
+	MPC_ATTR_RO(dattr++, ra);
+	MPC_ATTR_RO(dattr++, label);
+	MPC_ATTR_RO(dattr,   type);
+}
+
+static int mpc_params_register(struct mpc_unit *unit, int cnt)
+{
+	struct device_attribute *dattr;
+	struct mpc_attr *attr;
+	int rc;
+
+	attr = mpc_attr_create(unit->un_device, "parameters", cnt);
+	if (!attr)
+		return -ENOMEM;
+
+	dattr = attr->a_dattr;
+
+	/* Per-mpool parameters */
+	if (mpc_unit_ismpooldev(unit))
+		mpc_mpool_params_add(dattr);
+
+	rc = mpc_attr_group_create(attr);
+	if (rc) {
+		mpc_attr_destroy(attr);
+		return rc;
+	}
+
+	unit->un_attr = attr;
+
+	return 0;
+}
+
+static void mpc_params_unregister(struct mpc_unit *unit)
+{
+	mpc_attr_group_destroy(unit->un_attr);
+	mpc_attr_destroy(unit->un_attr);
+	unit->un_attr = NULL;
+}
+
+/**
+ * mpc_toascii() - convert string to restricted ASCII
+ * @str: string to convert in place
+ * @sz:  size of the str[] buffer
+ *
+ * Replaces disallowed characters with '_', zeroes out the remainder
+ * of str[] and returns the resulting length.
+ */
+static size_t mpc_toascii(char *str, size_t sz)
+{
+	size_t len = 0;
+	int i;
+
+	if (!str || sz < 1)
+		return 0;
+
+	if (str[0] == '-')
+		str[0] = '_';
+
+	for (i = 0; i < (sz - 1) && str[i]; ++i) {
+		if (isalnum(str[i]) || strchr("_.-", str[i]))
+			continue;
+
+		str[i] = '_';
+	}
+
+	len = i;
+
+	while (i < sz)
+		str[i++] = '\000';
+
+	return len;
+}
+
+static void mpool_params_merge_defaults(struct mpool_params *params)
+{
+	if (params->mp_spare_cap == MPOOL_SPARES_INVALID)
+		params->mp_spare_cap = MPOOL_SPARES_DEFAULT;
+
+	if (params->mp_spare_stg == MPOOL_SPARES_INVALID)
+		params->mp_spare_stg = MPOOL_SPARES_DEFAULT;
+
+	if (params->mp_ra_pages_max == U32_MAX)
+		params->mp_ra_pages_max = MPOOL_RA_PAGES_MAX;
+	params->mp_ra_pages_max = clamp_t(u32, params->mp_ra_pages_max, 0, MPOOL_RA_PAGES_MAX);
+
+	if (params->mp_mode != -1)
+		params->mp_mode &= 0777;
+
+	params->mp_rsvd0 = 0;
+	params->mp_rsvd1 = 0;
+	params->mp_rsvd2 = 0;
+	params->mp_rsvd3 = 0;
+	params->mp_rsvd4 = 0;
+
+	if (!strcmp(params->mp_label, MPOOL_LABEL_INVALID))
+		strcpy(params->mp_label, MPOOL_LABEL_DEFAULT);
+
+	mpc_toascii(params->mp_label, sizeof(params->mp_label));
+}
+
+static void mpool_to_mpcore_params(struct mpool_params *params, struct mpcore_params *mpc_params)
+{
+	u64 mdc0cap, mdcncap;
+	u32 mdcnum;
+
+	mpcore_params_defaults(mpc_params);
+
+	mdc0cap = (u64)params->mp_mdc0cap << 20;
+	mdcncap = (u64)params->mp_mdcncap << 20;
+	mdcnum  = params->mp_mdcnum;
+
+	if (mdc0cap != 0)
+		mpc_params->mp_mdc0cap = mdc0cap;
+
+	if (mdcncap != 0)
+		mpc_params->mp_mdcncap = mdcncap;
+
+	if (mdcnum != 0)
+		mpc_params->mp_mdcnum = mdcnum;
+}
+
+static bool mpool_params_merge_config(struct mpool_params *params, struct mpool_config *cfg)
+{
+	uuid_le uuidnull = { };
+	bool changed = false;
+
+	if (params->mp_uid != -1 && params->mp_uid != cfg->mc_uid) {
+		cfg->mc_uid = params->mp_uid;
+		changed = true;
+	}
+
+	if (params->mp_gid != -1 && params->mp_gid != cfg->mc_gid) {
+		cfg->mc_gid = params->mp_gid;
+		changed = true;
+	}
+
+	if (params->mp_mode != -1 && params->mp_mode != cfg->mc_mode) {
+		cfg->mc_mode = params->mp_mode;
+		changed = true;
+	}
+
+	if (memcmp(&uuidnull, &params->mp_utype, sizeof(uuidnull)) &&
+	    memcmp(&params->mp_utype, &cfg->mc_utype, sizeof(params->mp_utype))) {
+		memcpy(&cfg->mc_utype, &params->mp_utype, sizeof(cfg->mc_utype));
+		changed = true;
+	}
+
+	if (strcmp(params->mp_label, MPOOL_LABEL_DEFAULT) &&
+	    strncmp(params->mp_label, cfg->mc_label, sizeof(params->mp_label))) {
+		strlcpy(cfg->mc_label, params->mp_label, sizeof(cfg->mc_label));
+		changed = true;
+	}
+
+	return changed;
+}
+
 /**
  * mpc_mpool_release() - release kref handler for mpc_mpool object
  * @refp:  kref pointer
@@ -215,6 +414,9 @@ static void mpc_unit_release(struct kref *refp)
 	if (unit->un_mpool)
 		mpc_mpool_put(unit->un_mpool);
 
+	if (unit->un_attr)
+		mpc_params_unregister(unit);
+
 	if (unit->un_device)
 		device_destroy(ss->ss_class, unit->un_devno);
 
@@ -227,6 +429,89 @@ static void mpc_unit_put(struct mpc_unit *unit)
 		kref_put(&unit->un_ref, mpc_unit_release);
 }
 
+/**
+ * mpc_unit_lookup() - Look up a unit by minor number.
+ * @minor:  minor number
+ * @unitp:  unit ptr
+ *
+ * Returns a referenced ptr to the unit (via *unitp) if found,
+ * otherwise it sets *unitp to NULL.
+ */
+static void mpc_unit_lookup(int minor, struct mpc_unit **unitp)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpc_unit *unit;
+
+	*unitp = NULL;
+
+	mutex_lock(&ss->ss_lock);
+	unit = idr_find(&ss->ss_unitmap, minor);
+	if (unit) {
+		kref_get(&unit->un_ref);
+		*unitp = unit;
+	}
+	mutex_unlock(&ss->ss_lock);
+}
+
+/**
+ * mpc_unit_lookup_by_name_itercb() - Test whether a unit matches the given name.
+ * @minor:  unit minor number
+ * @item:   unit ptr
+ * @arg:    argument vector base ptr
+ *
+ * This iterator callback is called by mpc_unit_lookup_by_name()
+ * for each unit in the units table.
+ *
+ * Return: ITERCB_DONE with a referenced unit pointer stored in argv[2]
+ * if the unit matching the given name is found, ITERCB_NEXT otherwise.
+ */
+static int mpc_unit_lookup_by_name_itercb(int minor, void *item, void *arg)
+{
+	struct mpc_unit *unit = item;
+	void **argv = arg;
+	struct mpc_unit *parent = argv[0];
+	const char *name = argv[1];
+
+	if (!unit)
+		return ITERCB_NEXT;
+
+	if (mpc_unit_isctldev(parent) && !mpc_unit_ismpooldev(unit))
+		return ITERCB_NEXT;
+
+	if (parent->un_mpool && unit->un_mpool != parent->un_mpool)
+		return ITERCB_NEXT;
+
+	if (strcmp(unit->un_name, name) == 0) {
+		kref_get(&unit->un_ref);
+		argv[2] = unit;
+		return ITERCB_DONE;
+	}
+
+	return ITERCB_NEXT;
+}
+
+/**
+ * mpc_unit_lookup_by_name() - Look up an mpool unit by name.
+ * @parent: parent unit
+ * @name:   unit name. This is not the mpool name.
+ * @unitp:  unit ptr
+ *
+ * If a unit exists in the system which has the given name and parent
+ * then it is referenced and returned via *unitp.  Otherwise, *unitp
+ * is set to NULL.
+ */
+static void mpc_unit_lookup_by_name(struct mpc_unit *parent, const char *name,
+				    struct mpc_unit **unitp)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	void *argv[] = { parent, (void *)name, NULL };
+
+	mutex_lock(&ss->ss_lock);
+	idr_for_each(&ss->ss_unitmap, mpc_unit_lookup_by_name_itercb, argv);
+	mutex_unlock(&ss->ss_lock);
+
+	*unitp = argv[2];
+}
+
 /**
  * mpc_unit_setup() - Create a device unit object and special file
  * @uinfo:
@@ -327,6 +612,36 @@ static int mpc_unit_setup(const struct mpc_uinfo *uinfo, const char *name,
 	return rc;
 }
 
+
+static int mpc_cf_journal(struct mpc_unit *unit)
+{
+	struct mpool_config cfg = { };
+	struct mpc_mpool *mpool;
+	int rc;
+
+	mpool = unit->un_mpool;
+	if (!mpool)
+		return -EINVAL;
+
+	down_write(&mpool->mp_lock);
+
+	cfg.mc_uid = unit->un_uid;
+	cfg.mc_gid = unit->un_gid;
+	cfg.mc_mode = unit->un_mode;
+	cfg.mc_oid1 = unit->un_ds_oidv[0];
+	cfg.mc_oid2 = unit->un_ds_oidv[1];
+	cfg.mc_captgt = unit->un_mdc_captgt;
+	cfg.mc_ra_pages_max = unit->un_ra_pages_max;
+	memcpy(&cfg.mc_utype, &unit->un_utype, sizeof(cfg.mc_utype));
+	strlcpy(cfg.mc_label, unit->un_label, sizeof(cfg.mc_label));
+
+	rc = mpool_config_store(mpool->mp_desc, &cfg);
+
+	up_write(&mpool->mp_lock);
+
+	return rc;
+}
+
 /**
  * mpc_uevent() - Hook to intercept and modify uevents before they're posted to udev
  * @dev:    mpc driver device
@@ -347,6 +662,1251 @@ static int mpc_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
+/**
+ * mpc_mp_chown() - Change ownership of an mpool.
+ * @unit: mpool unit ptr
+ * @params: mpool params carrying the new uid, gid and mode (-1 = unchanged)
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpc_mp_chown(struct mpc_unit *unit, struct mpool_params *params)
+{
+	mode_t mode;
+	uid_t uid;
+	gid_t gid;
+	int rc = 0;
+
+	if (!mpc_unit_ismpooldev(unit))
+		return -EINVAL;
+
+	uid  = params->mp_uid;
+	gid  = params->mp_gid;
+	mode = params->mp_mode;
+
+	if (mode != -1)
+		mode &= 0777;
+
+	if (uid != -1 && uid != unit->un_uid && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if (gid != -1 && gid != unit->un_gid && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if (mode != -1 && mode != unit->un_mode && !capable(CAP_FOWNER))
+		return -EPERM;
+
+	if (uid != -1)
+		unit->un_uid = uid;
+	if (gid != -1)
+		unit->un_gid = gid;
+	if (mode != -1)
+		unit->un_mode = mode;
+
+	if (uid != -1 || gid != -1 || mode != -1)
+		rc = kobject_uevent(&unit->un_device->kobj, KOBJ_CHANGE);
+
+	return rc;
+}
+
+/**
+ * mpioc_params_get() - get parameters of an activated mpool
+ * @unit:   mpool unit ptr
+ * @get:    mpool params
+ *
+ * MPIOC_PARAMS_GET ioctl handler to get mpool parameters
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_params_get(struct mpc_unit *unit, struct mpioc_params *get)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpool_descriptor *desc;
+	struct mpool_params *params;
+	struct mpool_xprops xprops = { };
+	u8 mclass;
+
+	if (!mpc_unit_ismpooldev(unit))
+		return -EINVAL;
+
+	desc = unit->un_mpool->mp_desc;
+
+	mutex_lock(&ss->ss_lock);
+
+	params = &get->mps_params;
+	memset(params, 0, sizeof(*params));
+	params->mp_uid = unit->un_uid;
+	params->mp_gid = unit->un_gid;
+	params->mp_mode = unit->un_mode;
+	params->mp_mdc_captgt = MPOOL_ROOT_LOG_CAP;
+	params->mp_oidv[0] = unit->un_ds_oidv[0];
+	params->mp_oidv[1] = unit->un_ds_oidv[1];
+	params->mp_ra_pages_max = unit->un_ra_pages_max;
+	memcpy(&params->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
+	strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
+	strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
+
+	/* Get mpool properties.. */
+	mpool_get_xprops(desc, &xprops);
+
+	for (mclass = 0; mclass < MP_MED_NUMBER; mclass++)
+		params->mp_mblocksz[mclass] = xprops.ppx_params.mp_mblocksz[mclass];
+
+	params->mp_spare_cap = xprops.ppx_drive_spares[MP_MED_CAPACITY];
+	params->mp_spare_stg = xprops.ppx_drive_spares[MP_MED_STAGING];
+
+	memcpy(params->mp_poolid.b, xprops.ppx_params.mp_poolid.b, MPOOL_UUID_SIZE);
+
+	mutex_unlock(&ss->ss_lock);
+
+	return 0;
+}
+
+/**
+ * mpioc_params_set() - set parameters of an activated mpool
+ * @unit:   mpool unit ptr
+ * @set:    mpool params
+ *
+ * MPIOC_PARAMS_SET ioctl handler to set mpool parameters
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_params_set(struct mpc_unit *unit, struct mpioc_params *set)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpool_descriptor *mp;
+	struct mpool_params *params;
+	uuid_le uuidnull = { };
+	int rerr = 0, err = 0;
+	bool journal = false;
+
+	if (!mpc_unit_ismpooldev(unit))
+		return -EINVAL;
+
+	params = &set->mps_params;
+
+	mutex_lock(&ss->ss_lock);
+	if (params->mp_uid != -1 || params->mp_gid != -1 || params->mp_mode != -1) {
+		err = mpc_mp_chown(unit, params);
+		if (err) {
+			mutex_unlock(&ss->ss_lock);
+			return err;
+		}
+		journal = true;
+	}
+
+	if (params->mp_label[0]) {
+		mpc_toascii(params->mp_label, sizeof(params->mp_label));
+		strlcpy(unit->un_label, params->mp_label, sizeof(unit->un_label));
+		journal = true;
+	}
+
+	if (memcmp(&uuidnull, &params->mp_utype, sizeof(uuidnull))) {
+		memcpy(&unit->un_utype, &params->mp_utype, sizeof(unit->un_utype));
+		journal = true;
+	}
+
+	if (params->mp_ra_pages_max != U32_MAX) {
+		unit->un_ra_pages_max = clamp_t(u32, params->mp_ra_pages_max,
+						0, MPOOL_RA_PAGES_MAX);
+		journal = true;
+	}
+
+	if (journal)
+		err = mpc_cf_journal(unit);
+	mutex_unlock(&ss->ss_lock);
+
+	if (err) {
+		mp_pr_err("%s: params commit failed", err, unit->un_name);
+		return err;
+	}
+
+	mp = unit->un_mpool->mp_desc;
+
+	if (params->mp_spare_cap != MPOOL_SPARES_INVALID) {
+		err = mpool_drive_spares(mp, MP_MED_CAPACITY, params->mp_spare_cap);
+		if (err && err != -ENOENT)
+			rerr = err;
+	}
+
+	if (params->mp_spare_stg != MPOOL_SPARES_INVALID) {
+		err = mpool_drive_spares(mp, MP_MED_STAGING, params->mp_spare_stg);
+		if (err && err != -ENOENT)
+			rerr = err;
+	}
+
+	return rerr;
+}
+
+/**
+ * mpioc_mp_mclass_get() - get information regarding an mpool's mclasses
+ * @unit:   mpool unit ptr
+ * @mcl:    mclass info struct
+ *
+ * MPIOC_MP_MCLASS_GET ioctl handler to get mclass information
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mp_mclass_get(struct mpc_unit *unit, struct mpioc_mclass *mcl)
+{
+	struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+	struct mpool_mclass_xprops mcxv[MP_MED_NUMBER];
+	uint32_t mcxc = ARRAY_SIZE(mcxv);
+	int rc;
+
+	if (!mcl || !desc)
+		return -EINVAL;
+
+	if (!mcl->mcl_xprops) {
+		mpool_mclass_get_cnt(desc, &mcl->mcl_cnt);
+		return 0;
+	}
+
+	memset(mcxv, 0, sizeof(mcxv));
+
+	rc = mpool_mclass_get(desc, &mcxc, mcxv);
+	if (rc)
+		return rc;
+
+	if (mcxc > mcl->mcl_cnt)
+		mcxc = mcl->mcl_cnt;
+	mcl->mcl_cnt = mcxc;
+
+	rc = copy_to_user(mcl->mcl_xprops, mcxv, sizeof(mcxv[0]) * mcxc);
+
+	return rc ? -EFAULT : 0;
+}
+
+/**
+ * mpioc_devprops_get() - Get device properties
+ * @unit:      mpool unit ptr
+ * @devprops:  device properties parameter block
+ *
+ * Ioctl handler to retrieve properties for the specified device.
+ */
+static int mpioc_devprops_get(struct mpc_unit *unit, struct mpioc_devprops *devprops)
+{
+	int rc = 0;
+
+	if (unit->un_mpool) {
+		struct mpool_descriptor *mp = unit->un_mpool->mp_desc;
+
+		rc = mpool_get_devprops_by_name(mp, devprops->dpr_pdname, &devprops->dpr_devprops);
+	}
+
+	return rc;
+}
+
+/**
+ * mpioc_prop_get() - Get mpool properties.
+ * @unit:   mpool unit ptr
+ * @kprop:  mpool property struct to fill in
+ *
+ * Helper for the MPIOC_PROP_GET ioctl handler; retrieves properties of
+ * the specified mpool.
+ */
+static void mpioc_prop_get(struct mpc_unit *unit, struct mpioc_prop *kprop)
+{
+	struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+	struct mpool_params *params;
+	struct mpool_xprops *xprops;
+
+	memset(kprop, 0, sizeof(*kprop));
+
+	/* Get unit properties.. */
+	params = &kprop->pr_xprops.ppx_params;
+	params->mp_uid = unit->un_uid;
+	params->mp_gid = unit->un_gid;
+	params->mp_mode = unit->un_mode;
+	params->mp_mdc_captgt = unit->un_mdc_captgt;
+	params->mp_oidv[0] = unit->un_ds_oidv[0];
+	params->mp_oidv[1] = unit->un_ds_oidv[1];
+	params->mp_ra_pages_max = unit->un_ra_pages_max;
+	memcpy(&params->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
+	strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
+	strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
+
+	/* Get mpool properties.. */
+	xprops = &kprop->pr_xprops;
+	mpool_get_xprops(desc, xprops);
+	mpool_get_usage(desc, MP_MED_ALL, &kprop->pr_usage);
+
+	params->mp_spare_cap = xprops->ppx_drive_spares[MP_MED_CAPACITY];
+	params->mp_spare_stg = xprops->ppx_drive_spares[MP_MED_STAGING];
+
+	kprop->pr_mcxc = ARRAY_SIZE(kprop->pr_mcxv);
+	mpool_mclass_get(desc, &kprop->pr_mcxc, kprop->pr_mcxv);
+}
+
+/**
+ * mpioc_proplist_get_itercb() - Get properties iterator callback.
+ * @minor:  unit minor number
+ * @item:   unit ptr
+ * @arg:    argument list
+ *
+ * Return: Returns properties for each unit matching the input criteria.
+ */
+static int mpioc_proplist_get_itercb(int minor, void *item, void *arg)
+{
+	struct mpc_unit *unit = item;
+	struct mpioc_prop __user *uprop;
+	struct mpioc_prop kprop;
+	struct mpc_unit *match;
+	struct mpioc_list *ls;
+	void **argv = arg;
+	int *cntp, rc;
+	int *errp;
+
+	if (!unit)
+		return ITERCB_NEXT;
+
+	match = argv[0];
+	ls = argv[1];
+
+	if (mpc_unit_isctldev(match) && !mpc_unit_ismpooldev(unit) &&
+	    ls->ls_cmd != MPIOC_LIST_CMD_PROP_GET)
+		return ITERCB_NEXT;
+
+	if (mpc_unit_ismpooldev(match) && !mpc_unit_ismpooldev(unit) &&
+	    ls->ls_cmd != MPIOC_LIST_CMD_PROP_GET)
+		return ITERCB_NEXT;
+
+	if (mpc_unit_ismpooldev(match) && unit->un_mpool != match->un_mpool)
+		return ITERCB_NEXT;
+
+	cntp = argv[2];
+	errp = argv[3];
+
+	mpioc_prop_get(unit, &kprop);
+
+	uprop = (struct mpioc_prop __user *)ls->ls_listv + *cntp;
+
+	rc = copy_to_user(uprop, &kprop, sizeof(*uprop));
+	if (rc) {
+		*errp = -EFAULT;
+		return ITERCB_DONE;
+	}
+
+	return (++(*cntp) >= ls->ls_listc) ? ITERCB_DONE : ITERCB_NEXT;
+}
+
+/**
+ * mpioc_proplist_get() - Get mpool properties.
+ * @unit:   mpool unit ptr
+ * @ls:     properties parameter block
+ *
+ * MPIOC_PROP_GET ioctl handler to retrieve properties for one
+ * or more mpools.
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_proplist_get(struct mpc_unit *unit, struct mpioc_list *ls)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	int err = 0;
+	int cnt = 0;
+	void *argv[] = { unit, ls, &cnt, &err };
+
+	if (!ls || ls->ls_listc < 1 || ls->ls_cmd == MPIOC_LIST_CMD_INVALID)
+		return -EINVAL;
+
+	mutex_lock(&ss->ss_lock);
+	idr_for_each(&ss->ss_unitmap, mpioc_proplist_get_itercb, argv);
+	mutex_unlock(&ss->ss_lock);
+
+	ls->ls_listc = cnt;
+
+	return err;
+}
+
+/**
+ * mpc_mpool_open() - Open the mpool specified by the given drive paths,
+ *                    and then create an mpool object to track the
+ *                    underlying mpool.
+ * @dpathc: drive count
+ * @dpathv: drive path name vector
+ * @mpoolp: mpool ptr. Set only if success.
+ * @pd_prop: PDs properties
+ *
+ * Return:  Returns 0 if successful and sets *mpoolp.
+ *          Returns -errno on error.
+ */
+static int mpc_mpool_open(uint dpathc, char **dpathv, struct mpc_mpool **mpoolp,
+			  struct pd_prop *pd_prop, struct mpool_params *params, u32 flags)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpcore_params mpc_params;
+	struct mpc_mpool *mpool;
+	size_t mpoolsz, len;
+	int rc;
+
+	if (!ss || !dpathv || !mpoolp || !params)
+		return -EINVAL;
+
+	len = mpc_toascii(params->mp_name, sizeof(params->mp_name));
+	if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+		return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+	mpoolsz = sizeof(*mpool) + len + 1;
+
+	mpool = kzalloc(mpoolsz, GFP_KERNEL);
+	if (!mpool)
+		return -ENOMEM;
+
+	if (!try_module_get(THIS_MODULE)) {
+		kfree(mpool);
+		return -EBUSY;
+	}
+
+	mpool_to_mpcore_params(params, &mpc_params);
+
+	rc = mpool_activate(dpathc, dpathv, pd_prop, MPOOL_ROOT_LOG_CAP,
+			    &mpc_params, flags, &mpool->mp_desc);
+	if (rc) {
+		mp_pr_err("Activating %s failed", rc, params->mp_name);
+		module_put(THIS_MODULE);
+		kfree(mpool);
+		return rc;
+	}
+
+	kref_init(&mpool->mp_ref);
+	init_rwsem(&mpool->mp_lock);
+	mpool->mp_dpathc = dpathc;
+	mpool->mp_dpathv = dpathv;
+	strcpy(mpool->mp_name, params->mp_name);
+
+	*mpoolp = mpool;
+
+	return 0;
+}
+
+/**
+ * mpioc_mp_create() - create an mpool.
+ * @ctl:     control device unit ptr
+ * @mp:      mpool parameter block
+ * @pd_prop: PD properties
+ * @dpathv:  drive path name vector
+ *
+ * MPIOC_MP_CREATE ioctl handler to create an mpool.
+ *
+ * Return:  Returns 0 if the mpool is created, -errno otherwise...
+ */
+static int mpioc_mp_create(struct mpc_unit *ctl, struct mpioc_mpool *mp,
+			   struct pd_prop *pd_prop, char ***dpathv)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpcore_params mpc_params;
+	struct mpool_config cfg = { };
+	struct mpc_mpool *mpool = NULL;
+	struct mpc_unit *unit = NULL;
+	size_t len;
+	mode_t mode;
+	uid_t uid;
+	gid_t gid;
+	int rc;
+
+	if (!ctl || !mp || !pd_prop || !dpathv)
+		return -EINVAL;
+
+	len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+	if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+		return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+	mpool_params_merge_defaults(&mp->mp_params);
+
+	uid  = mp->mp_params.mp_uid;
+	gid  = mp->mp_params.mp_gid;
+	mode = mp->mp_params.mp_mode;
+
+	if (uid == -1)
+		uid = mpc_default_uid;
+	if (gid == -1)
+		gid = mpc_default_gid;
+	if (mode == -1)
+		mode = mpc_default_mode;
+
+	mode &= 0777;
+
+	if (uid != mpc_current_uid() && !capable(CAP_CHOWN)) {
+		rc = -EPERM;
+		mp_pr_err("chown permission denied, uid %d", rc, uid);
+		return rc;
+	}
+
+	if (gid != mpc_current_gid() && !capable(CAP_CHOWN)) {
+		rc = -EPERM;
+		mp_pr_err("chown permission denied, gid %d", rc, gid);
+		return rc;
+	}
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		rc = -EPERM;
+		mp_pr_err("chmod/activate permission denied", rc);
+		return rc;
+	}
+
+	mpool_to_mpcore_params(&mp->mp_params, &mpc_params);
+
+	rc = mpool_create(mp->mp_params.mp_name, mp->mp_flags, *dpathv,
+			  pd_prop, &mpc_params, MPOOL_ROOT_LOG_CAP);
+	if (rc) {
+		mp_pr_err("%s: create failed", rc, mp->mp_params.mp_name);
+		return rc;
+	}
+
+	/*
+	 * Create an mpc_mpool object through which we can (re)open and manage
+	 * the mpool.  If successful, mpc_mpool_open() adopts dpathv.
+	 */
+	mpool_params_merge_defaults(&mp->mp_params);
+
+	rc = mpc_mpool_open(mp->mp_dpathc, *dpathv, &mpool, pd_prop, &mp->mp_params, mp->mp_flags);
+	if (rc) {
+		mp_pr_err("%s: mpc_mpool_open failed", rc, mp->mp_params.mp_name);
+		mpool_destroy(mp->mp_dpathc, *dpathv, pd_prop, mp->mp_flags);
+		return rc;
+	}
+
+	*dpathv = NULL;
+
+	mlog_lookup_rootids(&cfg.mc_oid1, &cfg.mc_oid2);
+	cfg.mc_uid = uid;
+	cfg.mc_gid = gid;
+	cfg.mc_mode = mode;
+	cfg.mc_rsvd0 = mp->mp_params.mp_rsvd0;
+	cfg.mc_captgt = MPOOL_ROOT_LOG_CAP;
+	cfg.mc_ra_pages_max = mp->mp_params.mp_ra_pages_max;
+	cfg.mc_rsvd1 = mp->mp_params.mp_rsvd1;
+	cfg.mc_rsvd2 = mp->mp_params.mp_rsvd2;
+	cfg.mc_rsvd3 = mp->mp_params.mp_rsvd3;
+	cfg.mc_rsvd4 = mp->mp_params.mp_rsvd4;
+	memcpy(&cfg.mc_utype, &mp->mp_params.mp_utype, sizeof(cfg.mc_utype));
+	strlcpy(cfg.mc_label, mp->mp_params.mp_label, sizeof(cfg.mc_label));
+
+	rc = mpool_config_store(mpool->mp_desc, &cfg);
+	if (rc) {
+		mp_pr_err("%s: config store failed", rc, mp->mp_params.mp_name);
+		goto errout;
+	}
+
+	/* A unit is born with two references: a birth reference and one for the caller. */
+	rc = mpc_unit_setup(&mpc_uinfo_mpool, mp->mp_params.mp_name,
+			    &cfg, mpool, &unit);
+	if (rc) {
+		mp_pr_err("%s: unit setup failed", rc, mp->mp_params.mp_name);
+		goto errout;
+	}
+
+	/* Return resolved params to caller. */
+	mp->mp_params.mp_uid = uid;
+	mp->mp_params.mp_gid = gid;
+	mp->mp_params.mp_mode = mode;
+	mp->mp_params.mp_mdc_captgt = cfg.mc_captgt;
+	mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
+	mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
+
+	rc = mpc_params_register(unit, MPC_MPOOL_PARAMS_CNT);
+	if (rc) {
+		mpc_unit_put(unit); /* drop birth ref */
+		goto errout;
+	}
+
+	mutex_lock(&ss->ss_lock);
+	idr_replace(&ss->ss_unitmap, unit, MINOR(unit->un_devno));
+	mutex_unlock(&ss->ss_lock);
+
+	mpool = NULL;
+
+errout:
+	if (mpool) {
+		mpool_deactivate(mpool->mp_desc);
+		mpool->mp_desc = NULL;
+		mpool_destroy(mp->mp_dpathc, mpool->mp_dpathv, pd_prop, mp->mp_flags);
+	}
+
+	/*
+	 * For failures after mpc_unit_setup() (i.e., unit != NULL),
+	 * dropping the final unit ref will release the mpool ref.
+	 */
+	if (unit)
+		mpc_unit_put(unit); /* Drop caller's ref */
+	else if (mpool)
+		mpc_mpool_put(mpool);
+
+	return rc;
+}
+
+/**
+ * mpioc_mp_activate() - activate an mpool.
+ * @ctl:     control device unit ptr
+ * @mp:      mpool parameter block
+ * @pd_prop: PD properties
+ * @dpathv:  drive path name vector
+ *
+ * MPIOC_MP_ACTIVATE ioctl handler to activate an mpool.
+ *
+ * Return:  Returns 0 if the mpool is activated, -errno otherwise...
+ */
+static int mpioc_mp_activate(struct mpc_unit *ctl, struct mpioc_mpool *mp,
+			     struct pd_prop *pd_prop, char ***dpathv)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpool_config cfg;
+	struct mpc_mpool *mpool = NULL;
+	struct mpc_unit *unit = NULL;
+	size_t len;
+	int rc;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!ctl || !mp || !pd_prop || !dpathv)
+		return -EINVAL;
+
+	len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+	if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+		return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+	mpool_params_merge_defaults(&mp->mp_params);
+
+	/*
+	 * Create an mpc_mpool object through which we can (re)open and manage
+	 * the mpool.  If successful, mpc_mpool_open() adopts dpathv.
+	 */
+	rc = mpc_mpool_open(mp->mp_dpathc, *dpathv, &mpool, pd_prop, &mp->mp_params, mp->mp_flags);
+	if (rc) {
+		mp_pr_err("%s: mpc_mpool_open failed", rc, mp->mp_params.mp_name);
+		return rc;
+	}
+
+	*dpathv = NULL; /* Was adopted by successful mpc_mpool_open() */
+
+	rc = mpool_config_fetch(mpool->mp_desc, &cfg);
+	if (rc) {
+		mp_pr_err("%s: config fetch failed", rc, mp->mp_params.mp_name);
+		goto errout;
+	}
+
+	if (mpool_params_merge_config(&mp->mp_params, &cfg))
+		mpool_config_store(mpool->mp_desc, &cfg);
+
+	/* A unit is born with two references:  A birth reference, and one for the caller. */
+	rc = mpc_unit_setup(&mpc_uinfo_mpool, mp->mp_params.mp_name,
+			    &cfg, mpool, &unit);
+	if (rc) {
+		mp_pr_err("%s: unit setup failed", rc, mp->mp_params.mp_name);
+		goto errout;
+	}
+
+	/* Return resolved params to caller. */
+	mp->mp_params.mp_uid = cfg.mc_uid;
+	mp->mp_params.mp_gid = cfg.mc_gid;
+	mp->mp_params.mp_mode = cfg.mc_mode;
+	mp->mp_params.mp_mdc_captgt = cfg.mc_captgt;
+	mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
+	mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
+	mp->mp_params.mp_ra_pages_max = cfg.mc_ra_pages_max;
+	mp->mp_params.mp_vma_size_max = cfg.mc_vma_size_max;
+	memcpy(&mp->mp_params.mp_utype, &cfg.mc_utype, sizeof(mp->mp_params.mp_utype));
+	strlcpy(mp->mp_params.mp_label, cfg.mc_label, sizeof(mp->mp_params.mp_label));
+
+	rc = mpc_params_register(unit, MPC_MPOOL_PARAMS_CNT);
+	if (rc) {
+		mpc_unit_put(unit); /* drop birth ref */
+		goto errout;
+	}
+
+	mutex_lock(&ss->ss_lock);
+	idr_replace(&ss->ss_unitmap, unit, MINOR(unit->un_devno));
+	mutex_unlock(&ss->ss_lock);
+
+	mpool = NULL;
+
+errout:
+	/*
+	 * For failures after mpc_unit_setup() (i.e., mpool != NULL)
+	 * dropping the final unit ref will release the mpool ref.
+	 */
+	if (unit)
+		mpc_unit_put(unit); /* drop caller's ref */
+	else if (mpool)
+		mpc_mpool_put(mpool);
+
+	return rc;
+}
+
+/**
+ * mp_deactivate_impl() - deactivate an mpool.
+ * @ctl:    control device unit ptr
+ * @mp:     mpool parameter block
+ * @locked: true if the caller already holds ss_op_sema
+ *
+ * MPIOC_MP_DEACTIVATE ioctl handler to deactivate an mpool.
+ */
+static int mp_deactivate_impl(struct mpc_unit *ctl, struct mpioc_mpool *mp, bool locked)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpc_unit *unit = NULL;
+	size_t len;
+	int rc;
+
+	if (!ctl || !mp)
+		return -EINVAL;
+
+	if (!mpc_unit_isctldev(ctl))
+		return -ENOTTY;
+
+	len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+	if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+		return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+	if (!locked) {
+		rc = down_interruptible(&ss->ss_op_sema);
+		if (rc)
+			return rc;
+	}
+
+	mpc_unit_lookup_by_name(ctl, mp->mp_params.mp_name, &unit);
+	if (!unit) {
+		rc = -ENXIO;
+		goto errout;
+	}
+
+	/*
+	 * In order to be determined idle, a unit shall not be open
+	 * and shall have a ref count of exactly two (the birth ref
+	 * and the lookup ref from above).
+	 */
+	mutex_lock(&ss->ss_lock);
+	if (unit->un_open_cnt > 0 || kref_read(&unit->un_ref) != 2) {
+		rc = -EBUSY;
+		mp_pr_err("%s: busy, cannot deactivate", rc, unit->un_name);
+	} else {
+		idr_replace(&ss->ss_unitmap, NULL, MINOR(unit->un_devno));
+		rc = 0;
+	}
+	mutex_unlock(&ss->ss_lock);
+
+	if (!rc)
+		mpc_unit_put(unit); /* drop birth ref */
+
+	mpc_unit_put(unit); /* drop lookup ref */
+
+errout:
+	if (!locked)
+		up(&ss->ss_op_sema);
+
+	return rc;
+}
+
+static int mpioc_mp_deactivate(struct mpc_unit *ctl, struct mpioc_mpool *mp)
+{
+	return mp_deactivate_impl(ctl, mp, false);
+}
+
+static int mpioc_mp_cmd(struct mpc_unit *ctl, uint cmd, struct mpioc_mpool *mp)
+{
+	struct mpc_softstate *ss = &mpc_softstate;
+	struct mpc_unit *unit = NULL;
+	struct pd_prop *pd_prop = NULL;
+	char **dpathv = NULL, *dpaths;
+	size_t dpathvsz, pd_prop_sz;
+	const char *action;
+	size_t len;
+	int rc, i;
+
+	if (!ctl || !mp)
+		return -EINVAL;
+
+	if (!mpc_unit_isctldev(ctl))
+		return -EOPNOTSUPP;
+
+	if (mp->mp_dpathc < 1 || mp->mp_dpathc > MPOOL_DRIVES_MAX)
+		return -EDOM;
+
+	len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+	if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+		return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+	switch (cmd) {
+	case MPIOC_MP_CREATE:
+		action = "create";
+		break;
+
+	case MPIOC_MP_DESTROY:
+		action = "destroy";
+		break;
+
+	case MPIOC_MP_ACTIVATE:
+		action = "activate";
+		break;
+
+	case MPIOC_MP_RENAME:
+		action = "rename";
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	if (!mp->mp_pd_prop || !mp->mp_dpaths) {
+		rc = -EINVAL;
+		mp_pr_err("%s: %s, (%d drives), drive names %p or PD props %p invalid",
+			  rc, mp->mp_params.mp_name, action, mp->mp_dpathc,
+			  mp->mp_dpaths, mp->mp_pd_prop);
+
+		return rc;
+	}
+
+	if (mp->mp_dpathssz > (mp->mp_dpathc + 1) * PATH_MAX)
+		return -EINVAL;
+
+	rc = down_interruptible(&ss->ss_op_sema);
+	if (rc)
+		return rc;
+
+	/*
+	 * If mpc_unit_lookup_by_name() succeeds it will have acquired
+	 * a reference on unit.  We release that reference at the
+	 * end of this function.
+	 */
+	mpc_unit_lookup_by_name(ctl, mp->mp_params.mp_name, &unit);
+
+	if (unit && cmd != MPIOC_MP_DESTROY) {
+		if (cmd == MPIOC_MP_ACTIVATE)
+			goto errout;
+		rc = -EEXIST;
+		mp_pr_err("%s: mpool already activated", rc, mp->mp_params.mp_name);
+		goto errout;
+	}
+
+	/*
+	 * The device path names are in one long string separated by
+	 * newlines.  Here we allocate one chunk of memory to hold
+	 * all the device paths and a vector of ptrs to them.
+	 */
+	dpathvsz = mp->mp_dpathc * sizeof(*dpathv) + mp->mp_dpathssz;
+	if (dpathvsz > MPOOL_DRIVES_MAX * (PATH_MAX + sizeof(*dpathv))) {
+		rc = -E2BIG;
+		mp_pr_err("%s: %s, too many member drives %zu",
+			  rc, mp->mp_params.mp_name, action, dpathvsz);
+		goto errout;
+	}
+
+	dpathv = kmalloc(dpathvsz, GFP_KERNEL);
+	if (!dpathv) {
+		rc = -ENOMEM;
+		goto errout;
+	}
+
+	dpaths = (char *)dpathv + mp->mp_dpathc * sizeof(*dpathv);
+
+	rc = copy_from_user(dpaths, mp->mp_dpaths, mp->mp_dpathssz);
+	if (rc) {
+		rc = -EFAULT;
+		goto errout;
+	}
+
+	for (i = 0; i < mp->mp_dpathc; ++i) {
+		dpathv[i] = strsep(&dpaths, "\n");
+		if (!dpathv[i]) {
+			rc = -EINVAL;
+			goto errout;
+		}
+	}
+
+	/* Get the PDs properties from user space buffer. */
+	pd_prop_sz = mp->mp_dpathc * sizeof(*pd_prop);
+	pd_prop = kmalloc(pd_prop_sz, GFP_KERNEL);
+	if (!pd_prop) {
+		rc = -ENOMEM;
+		mp_pr_err("%s: %s, alloc pd prop %zu failed",
+			  rc, mp->mp_params.mp_name, action, pd_prop_sz);
+		goto errout;
+	}
+
+	rc = copy_from_user(pd_prop, mp->mp_pd_prop, pd_prop_sz);
+	if (rc) {
+		rc = -EFAULT;
+		mp_pr_err("%s: %s, copyin pd prop %zu failed",
+			  rc, mp->mp_params.mp_name, action, pd_prop_sz);
+		goto errout;
+	}
+
+	switch (cmd) {
+	case MPIOC_MP_CREATE:
+		rc = mpioc_mp_create(ctl, mp, pd_prop, &dpathv);
+		break;
+
+	case MPIOC_MP_ACTIVATE:
+		rc = mpioc_mp_activate(ctl, mp, pd_prop, &dpathv);
+		break;
+
+	case MPIOC_MP_DESTROY:
+		if (unit) {
+			mpc_unit_put(unit);
+			unit = NULL;
+
+			rc = mp_deactivate_impl(ctl, mp, true);
+			if (rc) {
+				action = "deactivate";
+				break;
+			}
+		}
+		rc = mpool_destroy(mp->mp_dpathc, dpathv, pd_prop, mp->mp_flags);
+		break;
+
+	case MPIOC_MP_RENAME:
+		rc = mpool_rename(mp->mp_dpathc, dpathv, pd_prop, mp->mp_flags,
+				   mp->mp_params.mp_name);
+		break;
+	}
+
+	if (rc)
+		mp_pr_err("%s: %s failed", rc, mp->mp_params.mp_name, action);
+
+errout:
+	mpc_unit_put(unit);
+	up(&ss->ss_op_sema);
+
+	kfree(pd_prop);
+	kfree(dpathv);
+
+	return rc;
+}
+
+/**
+ * mpioc_mp_add() - add a device to an existing mpool
+ * @unit:   mpool unit ptr
+ * @drv:    mpool device parameter block
+ *
+ * MPIOC_MP_ADD ioctl handler to add a drive to an activated mpool
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mp_add(struct mpc_unit *unit, struct mpioc_drive *drv)
+{
+	struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+	size_t pd_prop_sz, dpathvsz;
+	struct pd_prop *pd_prop;
+	char **dpathv, *dpaths;
+	int rc, i;
+
+	/*
+	 * The device path names are in one long string separated by
+	 * newlines.  Here we allocate one chunk of memory to hold
+	 * all the device paths and a vector of ptrs to them.
+	 */
+	dpathvsz = drv->drv_dpathc * sizeof(*dpathv) + drv->drv_dpathssz;
+	if (drv->drv_dpathc > MPOOL_DRIVES_MAX ||
+	    dpathvsz > MPOOL_DRIVES_MAX * (PATH_MAX + sizeof(*dpathv))) {
+		rc = -E2BIG;
+		mp_pr_err("%s: invalid pathc %u, pathsz %zu",
+			  rc, unit->un_name, drv->drv_dpathc, dpathvsz);
+		return rc;
+	}
+
+	dpathv = kmalloc(dpathvsz, GFP_KERNEL);
+	if (!dpathv) {
+		rc = -ENOMEM;
+		mp_pr_err("%s: alloc dpathv %zu failed", rc, unit->un_name, dpathvsz);
+		return rc;
+	}
+
+	dpaths = (char *)dpathv + drv->drv_dpathc * sizeof(*dpathv);
+	rc = copy_from_user(dpaths, drv->drv_dpaths, drv->drv_dpathssz);
+	if (rc) {
+		rc = -EFAULT;
+		mp_pr_err("%s: copyin dpaths %u failed", rc, unit->un_name, drv->drv_dpathssz);
+		kfree(dpathv);
+		return rc;
+	}
+
+	for (i = 0; i < drv->drv_dpathc; ++i) {
+		dpathv[i] = strsep(&dpaths, "\n");
+		if (!dpathv[i] || (strlen(dpathv[i]) > PATH_MAX - 1)) {
+			rc = -EINVAL;
+			mp_pr_err("%s: ill-formed dpathv list", rc, unit->un_name);
+			kfree(dpathv);
+			return rc;
+		}
+	}
+
+	/* Get the PDs properties from user space buffer. */
+	pd_prop_sz = drv->drv_dpathc * sizeof(*pd_prop);
+
+	pd_prop = kmalloc(pd_prop_sz, GFP_KERNEL);
+	if (!pd_prop) {
+		rc = -ENOMEM;
+		mp_pr_err("%s: alloc pd prop %zu failed", rc, unit->un_name, pd_prop_sz);
+		kfree(dpathv);
+		return rc;
+	}
+
+	rc = copy_from_user(pd_prop, drv->drv_pd_prop, pd_prop_sz);
+	if (rc) {
+		rc = -EFAULT;
+		mp_pr_err("%s: copyin pd prop %zu failed", rc, unit->un_name, pd_prop_sz);
+		kfree(pd_prop);
+		kfree(dpathv);
+		return rc;
+	}
+
+	for (i = 0; i < drv->drv_dpathc; ++i) {
+		rc = mpool_drive_add(desc, dpathv[i], &pd_prop[i]);
+		if (rc)
+			break;
+	}
+
+	kfree(pd_prop);
+	kfree(dpathv);
+
+	return rc;
+}
+
+static struct mpc_softstate *mpc_cdev2ss(struct cdev *cdev)
+{
+	if (!cdev || cdev->owner != THIS_MODULE) {
+		mp_pr_crit("module dissociated", -EINVAL);
+		return NULL;
+	}
+
+	return container_of(cdev, struct mpc_softstate, ss_cdev);
+}
+
+/*
+ * MPCTL file operations.
+ */
+
+/**
+ * mpc_open() - Open an mpool device.
+ * @ip: inode ptr
+ * @fp: file ptr
+ *
+ * Return:  Returns 0 on success, -errno otherwise...
+ */
+static int mpc_open(struct inode *ip, struct file *fp)
+{
+	struct mpc_softstate *ss;
+	struct mpc_unit *unit;
+	bool firstopen;
+	int rc = 0;
+
+	ss = mpc_cdev2ss(ip->i_cdev);
+	if (!ss || ss != &mpc_softstate)
+		return -EBADFD;
+
+	/* Acquire a reference on the unit object.  We'll release it in mpc_release(). */
+	mpc_unit_lookup(iminor(fp->f_inode), &unit);
+	if (!unit)
+		return -ENODEV;
+
+	if (down_trylock(&unit->un_open_lock)) {
+		rc = (fp->f_flags & O_NONBLOCK) ? -EWOULDBLOCK :
+			down_interruptible(&unit->un_open_lock);
+
+		if (rc)
+			goto errout;
+	}
+
+	firstopen = (unit->un_open_cnt == 0);
+
+	if (!firstopen) {
+		if (fp->f_mapping != unit->un_mapping)
+			rc = -EBUSY;
+		else if (unit->un_open_excl || (fp->f_flags & O_EXCL))
+			rc = -EBUSY;
+		goto unlock;
+	}
+
+	if (!mpc_unit_ismpooldev(unit)) {
+		unit->un_open_excl = !!(fp->f_flags & O_EXCL);
+		goto unlock; /* control device */
+	}
+
+	/* First open of an mpool unit (not the control device). */
+	if (!fp->f_mapping || fp->f_mapping != ip->i_mapping) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	fp->f_op = &mpc_fops_default;
+
+	unit->un_mapping = fp->f_mapping;
+
+	inode_lock(ip);
+	i_size_write(ip, 1ul << (__BITS_PER_LONG - 1));
+	inode_unlock(ip);
+
+	unit->un_open_excl = !!(fp->f_flags & O_EXCL);
+
+unlock:
+	if (!rc) {
+		fp->private_data = unit;
+		nonseekable_open(ip, fp);
+		++unit->un_open_cnt;
+	}
+	up(&unit->un_open_lock);
+
+errout:
+	if (rc) {
+		if (rc != -EBUSY)
+			mp_pr_err("open %s failed", rc, unit->un_name);
+		mpc_unit_put(unit);
+	}
+
+	return rc;
+}
+
+/**
+ * mpc_release() - Close the specified mpool device.
+ * @ip: inode ptr
+ * @fp: file ptr
+ *
+ * Return:  Returns 0 on success, -errno otherwise...
+ */
+static int mpc_release(struct inode *ip, struct file *fp)
+{
+	struct mpc_unit *unit;
+	bool lastclose;
+
+	unit = fp->private_data;
+	if (!unit)
+		return -EBADFD;
+
+	down(&unit->un_open_lock);
+	lastclose = (--unit->un_open_cnt == 0);
+	if (!lastclose)
+		goto errout;
+
+	if (mpc_unit_ismpooldev(unit))
+		unit->un_mapping = NULL;
+
+	unit->un_open_excl = false;
+
+errout:
+	up(&unit->un_open_lock);
+
+	mpc_unit_put(unit);
+
+	return 0;
+}
+
+/**
+ * mpc_ioctl() - mpc driver ioctl entry point
+ * @fp:     file pointer
+ * @cmd:    an mpool ioctl command (i.e., MPIOC_*)
+ * @arg:    argument buffer (varies by command)
+ *
+ * Perform the specified mpool ioctl command.
+ *
+ * Return:  Returns 0 on success, -errno otherwise...
+ */
+static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
+{
+	char argbuf[256] __aligned(16);
+	struct mpc_unit *unit;
+	size_t argbufsz;
+	void *argp;
+	ulong iosz;
+	int rc;
+
+	if (_IOC_TYPE(cmd) != MPIOC_MAGIC)
+		return -ENOTTY;
+
+	if ((fp->f_flags & O_ACCMODE) == O_RDONLY) {
+		switch (cmd) {
+		case MPIOC_PROP_GET:
+		case MPIOC_DEVPROPS_GET:
+		case MPIOC_MP_MCLASS_GET:
+			break;
+
+		default:
+			return -EINVAL;
+		}
+	}
+
+	unit = fp->private_data;
+	argbufsz = sizeof(argbuf);
+	iosz = _IOC_SIZE(cmd);
+	argp = (void *)arg;
+
+	if (!unit || (iosz > sizeof(union mpioc_union)))
+		return -EINVAL;
+
+	/* Set up argp/argbuf for read/write requests. */
+	if (_IOC_DIR(cmd) & (_IOC_READ | _IOC_WRITE)) {
+		argp = argbuf;
+		if (iosz > argbufsz) {
+			argbufsz = roundup_pow_of_two(iosz);
+
+			argp = kzalloc(argbufsz, GFP_KERNEL);
+			if (!argp)
+				return -ENOMEM;
+		}
+
+		if (_IOC_DIR(cmd) & _IOC_WRITE) {
+			if (copy_from_user(argp, (const void __user *)arg, iosz)) {
+				if (argp != argbuf)
+					kfree(argp);
+				return -EFAULT;
+			}
+		}
+	}
+
+	switch (cmd) {
+	case MPIOC_MP_CREATE:
+	case MPIOC_MP_ACTIVATE:
+	case MPIOC_MP_DESTROY:
+	case MPIOC_MP_RENAME:
+		rc = mpioc_mp_cmd(unit, cmd, argp);
+		break;
+
+	case MPIOC_MP_DEACTIVATE:
+		rc = mpioc_mp_deactivate(unit, argp);
+		break;
+
+	case MPIOC_DRV_ADD:
+		rc = mpioc_mp_add(unit, argp);
+		break;
+
+	case MPIOC_PARAMS_SET:
+		rc = mpioc_params_set(unit, argp);
+		break;
+
+	case MPIOC_PARAMS_GET:
+		rc = mpioc_params_get(unit, argp);
+		break;
+
+	case MPIOC_MP_MCLASS_GET:
+		rc = mpioc_mp_mclass_get(unit, argp);
+		break;
+
+	case MPIOC_PROP_GET:
+		rc = mpioc_proplist_get(unit, argp);
+		break;
+
+	case MPIOC_DEVPROPS_GET:
+		rc = mpioc_devprops_get(unit, argp);
+		break;
+
+	default:
+		rc = -ENOTTY;
+		mp_pr_rl("invalid command %x: dir=%u type=%c nr=%u size=%u",
+			 rc, cmd, _IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd));
+		break;
+	}
+
+	if (!rc && _IOC_DIR(cmd) & _IOC_READ) {
+		if (copy_to_user((void __user *)arg, argp, iosz))
+			rc = -EFAULT;
+	}
+
+	if (argp != argbuf)
+		kfree(argp);
+
+	return rc;
+}
+
+static const struct file_operations mpc_fops_default = {
+	.owner		= THIS_MODULE,
+	.open		= mpc_open,
+	.release	= mpc_release,
+	.unlocked_ioctl	= mpc_ioctl,
+};
+
 static int mpc_exit_unit(int minor, void *item, void *arg)
 {
 	mpc_unit_put(item);
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 18/22] mpool: add object lifecycle management ioctls
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (16 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 17/22] mpool: add mpool lifecycle management ioctls Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 19/22] mpool: add support to mmap arbitrary collection of mblocks Nabeel M Mohamed
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds the mblock and mlog management ioctls: alloc, commit,
abort, destroy, read, write, fetch properties etc.

The mblock and mlog management ioctl handlers are thin wrappers
around the core mblock/mlog lifecycle management and IO routines
introduced in an earlier patch.

The object read/write ioctl handlers utilize vcache, which is a
small cache of iovec objects and page pointers. This cache is
used for large mblock/mlog IO. It acts as an emergency memory
pool for handling IO requests under memory pressure, thereby
reducing tail latencies.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mpctl.c | 670 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 667 insertions(+), 3 deletions(-)

diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index de62f9a5524d..3a231cf982b3 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -33,6 +33,7 @@
 #include "assert.h"
 
 #include "mpool_ioctl.h"
+#include "mblock.h"
 #include "mlog.h"
 #include "mp.h"
 #include "mpctl.h"
@@ -1299,7 +1300,6 @@ static int mpioc_mp_activate(struct mpc_unit *ctl, struct mpioc_mpool *mp,
 	mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
 	mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
 	mp->mp_params.mp_ra_pages_max = cfg.mc_ra_pages_max;
-	mp->mp_params.mp_vma_size_max = cfg.mc_vma_size_max;
 	memcpy(&mp->mp_params.mp_utype, &cfg.mc_utype, sizeof(mp->mp_params.mp_utype));
 	strlcpy(mp->mp_params.mp_label, cfg.mc_label, sizeof(mp->mp_params.mp_label));
 
@@ -1656,6 +1656,596 @@ static int mpioc_mp_add(struct mpc_unit *unit, struct mpioc_drive *drv)
 	return rc;
 }
 
+
+/**
+ * struct vcache - very-large-buffer cache...
+ * @vc_lock: protects @vc_head
+ * @vc_head: head of the free-buffer list
+ * @vc_size: size of each cached buffer
+ */
+struct vcache {
+	spinlock_t  vc_lock;
+	void       *vc_head;
+	size_t      vc_size;
+} ____cacheline_aligned;
+
+static struct vcache mpc_physio_vcache;
+
+static void *mpc_vcache_alloc(struct vcache *vc, size_t sz)
+{
+	void *p;
+
+	if (!vc || sz > vc->vc_size)
+		return NULL;
+
+	spin_lock(&vc->vc_lock);
+	p = vc->vc_head;
+	if (p)
+		vc->vc_head = *(void **)p;
+	spin_unlock(&vc->vc_lock);
+
+	return p;
+}
+
+static void mpc_vcache_free(struct vcache *vc, void *p)
+{
+	if (!vc || !p)
+		return;
+
+	spin_lock(&vc->vc_lock);
+	*(void **)p = vc->vc_head;
+	vc->vc_head = p;
+	spin_unlock(&vc->vc_lock);
+}
+
+static int mpc_vcache_init(struct vcache *vc, size_t sz, size_t n)
+{
+	if (!vc || sz < PAGE_SIZE || n < 1)
+		return -EINVAL;
+
+	spin_lock_init(&vc->vc_lock);
+	vc->vc_head = NULL;
+	vc->vc_size = sz;
+
+	while (n-- > 0)
+		mpc_vcache_free(vc, vmalloc(sz));
+
+	return vc->vc_head ? 0 : -ENOMEM;
+}
+
+static void mpc_vcache_fini(struct vcache *vc)
+{
+	void *p;
+
+	while ((p = mpc_vcache_alloc(vc, PAGE_SIZE)))
+		vfree(p);
+}
+
+/**
+ * mpc_physio() - Generic raw device mblock read/write routine.
+ * @mpd:      mpool descriptor
+ * @desc:     mblock or mlog descriptor
+ * @uiov:     vector of iovecs that describe user-space segments
+ * @uioc:     count of elements in uiov[]
+ * @offset:   offset into the mblock at which to start reading
+ * @objtype:  mblock or mlog
+ * @rw:       READ or WRITE with respect to the media.
+ * @stkbuf:   caller provided scratch space
+ * @stkbufsz: size of stkbuf
+ *
+ * This function creates an array of iovec objects, each of which
+ * maps a portion of the user request into kernel space so that
+ * mpool can directly access the user data.  Note that this is
+ * a zero-copy operation.
+ *
+ * Requires that each user-space segment be page aligned and of an
+ * integral number of pages.
+ *
+ * See http://www.makelinux.net/ldd3/chp-15-sect-3 for more detail.
+ */
+static int mpc_physio(struct mpool_descriptor *mpd, void *desc, struct iovec *uiov,
+		      int uioc, off_t offset, enum mp_obj_type objtype, int rw,
+		      void *stkbuf, size_t stkbufsz)
+{
+	struct kvec *iov_base, *iov;
+	struct iov_iter iter;
+	struct page **pagesv;
+	size_t pagesvsz, pgbase, length;
+	int pagesc, niov, rc, i;
+	ssize_t cc;
+
+	iov = NULL;
+	niov = 0;
+	rc = 0;
+
+	length = iov_length(uiov, uioc);
+
+	if (length < PAGE_SIZE || !IS_ALIGNED(length, PAGE_SIZE))
+		return -EINVAL;
+
+	if (length > (rwsz_max_mb << 20))
+		return -EINVAL;
+
+	/*
+	 * Allocate an array of page pointers for iov_iter_get_pages()
+	 * and an array of iovecs for mblock_read() and mblock_write().
+	 *
+	 * Note: the only way we can calculate the number of required
+	 * iovecs in advance is to assume that we need one per page.
+	 */
+	pagesc = length / PAGE_SIZE;
+	pagesvsz = (sizeof(*pagesv) + sizeof(*iov)) * pagesc;
+
+	/*
+	 * pagesvsz may be big, and it will not be used as the iovec_list
+	 * for the block stack - pd will chunk it up to the underlying
+	 * devices (with another iovec list per pd).
+	 */
+	if (pagesvsz > stkbufsz) {
+		pagesv = NULL;
+
+		if (pagesvsz <= PAGE_SIZE * 2)
+			pagesv = kmalloc(pagesvsz, GFP_NOIO);
+
+		while (!pagesv) {
+			pagesv = mpc_vcache_alloc(&mpc_physio_vcache, pagesvsz);
+			if (!pagesv)
+				usleep_range(750, 1250);
+		}
+	} else {
+		pagesv = stkbuf;
+	}
+
+	if (!pagesv)
+		return -ENOMEM;
+
+	iov_base = (struct kvec *)((char *)pagesv + (sizeof(*pagesv) * pagesc));
+
+	iov_iter_init(&iter, rw, uiov, uioc, length);
+
+	for (i = 0, cc = 0; i < pagesc; i += (cc / PAGE_SIZE)) {
+
+		/* Get struct page vector for the user buffers. */
+		cc = iov_iter_get_pages(&iter, &pagesv[i], length - (i * PAGE_SIZE),
+					pagesc - i, &pgbase);
+		if (cc < 0) {
+			rc = cc;
+			pagesc = i;
+			goto errout;
+		}
+
+		/*
+		 * pgbase is the offset into the 1st iovec - our alignment
+		 * requirements force it to be 0
+		 */
+		if (cc < PAGE_SIZE || pgbase != 0) {
+			rc = -EINVAL;
+			pagesc = i + 1;
+			goto errout;
+		}
+
+		iov_iter_advance(&iter, cc);
+	}
+
+	/* Build an array of iovecs for mpool so that it can directly access the user data. */
+	for (i = 0, iov = iov_base; i < pagesc; ++i, ++iov, ++niov) {
+		iov->iov_len = PAGE_SIZE;
+		iov->iov_base = kmap(pagesv[i]);
+
+		if (!iov->iov_base) {
+			rc = -EINVAL;
+			pagesc = i + 1;
+			goto errout;
+		}
+	}
+
+	switch (objtype) {
+	case MP_OBJ_MBLOCK:
+		if (rw == WRITE)
+			rc = mblock_write(mpd, desc, iov_base, niov, pagesc << PAGE_SHIFT);
+		else
+			rc = mblock_read(mpd, desc, iov_base, niov, offset, pagesc << PAGE_SHIFT);
+		break;
+
+	case MP_OBJ_MLOG:
+		rc = mlog_rw_raw(mpd, desc, iov_base, niov, offset, rw);
+		break;
+
+	default:
+		rc = -EINVAL;
+		goto errout;
+	}
+
+errout:
+	for (i = 0, iov = iov_base; i < pagesc; ++i, ++iov) {
+		if (i < niov)
+			kunmap(pagesv[i]);
+		put_page(pagesv[i]);
+	}
+
+	if (pagesvsz > stkbufsz) {
+		if (pagesvsz > PAGE_SIZE * 2)
+			mpc_vcache_free(&mpc_physio_vcache, pagesv);
+		else
+			kfree(pagesv);
+	}
+
+	return rc;
+}
+
+/**
+ * mpioc_mb_alloc() - Allocate an mblock object.
+ * @unit:   mpool unit ptr
+ * @mb:     mblock parameter block
+ *
+ * MPIOC_MB_ALLOC ioctl handler to allocate a single mblock.
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_alloc(struct mpc_unit *unit, struct mpioc_mblock *mb)
+{
+	struct mblock_descriptor *mblock;
+	struct mpool_descriptor *mpool;
+	struct mblock_props props;
+	int rc;
+
+	if (!unit || !mb || !unit->un_mpool)
+		return -EINVAL;
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mblock_alloc(mpool, mb->mb_mclassp, mb->mb_spare, &mblock, &props);
+	if (rc)
+		return rc;
+
+	mblock_get_props_ex(mpool, mblock, &mb->mb_props);
+	mblock_put(mblock);
+
+	mb->mb_objid  = props.mpr_objid;
+	mb->mb_offset = -1;
+
+	return 0;
+}
+
+/**
+ * mpioc_mb_find() - Find an mblock object by its objid
+ * @unit:   mpool unit ptr
+ * @mb:     mblock parameter block
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_find(struct mpc_unit *unit, struct mpioc_mblock *mb)
+{
+	struct mblock_descriptor *mblock;
+	struct mpool_descriptor *mpool;
+	int rc;
+
+	if (!unit || !mb || !unit->un_mpool)
+		return -EINVAL;
+
+	if (!mblock_objid(mb->mb_objid))
+		return -EINVAL;
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mblock_find_get(mpool, mb->mb_objid, 0, NULL, &mblock);
+	if (rc)
+		return rc;
+
+	(void)mblock_get_props_ex(mpool, mblock, &mb->mb_props);
+
+	mblock_put(mblock);
+
+	mb->mb_offset = -1;
+
+	return 0;
+}
+
+/**
+ * mpioc_mb_abcomdel() - Abort, commit, or delete an mblock.
+ * @unit:   mpool unit ptr
+ * @cmd:    MPIOC_MB_ABORT, MPIOC_MB_COMMIT, or MPIOC_MB_DELETE
+ * @mi:     mblock parameter block
+ *
+ * MPIOC_MB_ACD ioctl handler to either abort, commit, or delete
+ * the specified mblock.
+ *
+ * Return:  Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_abcomdel(struct mpc_unit *unit, uint cmd, struct mpioc_mblock_id *mi)
+{
+	struct mblock_descriptor *mblock;
+	struct mpool_descriptor *mpool;
+	int which, rc;
+	bool drop;
+
+	if (!unit || !mi || !unit->un_mpool)
+		return -EINVAL;
+
+	if (!mblock_objid(mi->mi_objid))
+		return -EINVAL;
+
+	which = (cmd == MPIOC_MB_DELETE) ? 1 : -1;
+	mpool = unit->un_mpool->mp_desc;
+	drop = true;
+
+	rc = mblock_find_get(mpool, mi->mi_objid, which, NULL, &mblock);
+	if (rc)
+		return rc;
+
+	switch (cmd) {
+	case MPIOC_MB_COMMIT:
+		rc = mblock_commit(mpool, mblock);
+		break;
+
+	case MPIOC_MB_ABORT:
+		rc = mblock_abort(mpool, mblock);
+		drop = !!rc;
+		break;
+
+	case MPIOC_MB_DELETE:
+		rc = mblock_delete(mpool, mblock);
+		drop = !!rc;
+		break;
+
+	default:
+		rc = -ENOTTY;
+		break;
+	}
+
+	if (drop)
+		mblock_put(mblock);
+
+	return rc;
+}
+
+/**
+ * mpioc_mb_rw() - read/write mblock ioctl handler
+ * @unit:     mpool unit ptr
+ * @cmd:      MPIOC_MB_READ or MPIOC_MB_WRITE
+ * @mbrw:     mblock read/write parameter block
+ * @stkbuf:   caller provided scratch space
+ * @stkbufsz: size of stkbuf
+ */
+static int mpioc_mb_rw(struct mpc_unit *unit, uint cmd, struct mpioc_mblock_rw *mbrw,
+		       void *stkbuf, size_t stkbufsz)
+{
+	struct mblock_descriptor *mblock;
+	struct mpool_descriptor *mpool;
+	struct iovec *kiov;
+	bool xfree = false;
+	int which, rc;
+	size_t kiovsz;
+
+	if (!unit || !mbrw || !unit->un_mpool)
+		return -EINVAL;
+
+	if (!mblock_objid(mbrw->mb_objid))
+		return -EINVAL;
+
+	/*
+	 * For small iovec counts we simply copyin the array of iovecs
+	 * to local storage (stkbuf).  Otherwise, we must kmalloc a
+	 * buffer into which to perform the copyin.
+	 */
+	if (mbrw->mb_iov_cnt > MPIOC_KIOV_MAX)
+		return -EINVAL;
+
+	kiovsz = mbrw->mb_iov_cnt * sizeof(*kiov);
+
+	if (kiovsz > stkbufsz) {
+		kiov = kmalloc(kiovsz, GFP_KERNEL);
+		if (!kiov)
+			return -ENOMEM;
+
+		xfree = true;
+	} else {
+		kiov = stkbuf;
+		stkbuf += kiovsz;
+		stkbufsz -= kiovsz;
+	}
+
+	which = (cmd == MPIOC_MB_READ) ? 1 : -1;
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mblock_find_get(mpool, mbrw->mb_objid, which, NULL, &mblock);
+	if (rc)
+		goto errout;
+
+	if (copy_from_user(kiov, mbrw->mb_iov, kiovsz)) {
+		rc = -EFAULT;
+	} else {
+		rc = mpc_physio(mpool, mblock, kiov, mbrw->mb_iov_cnt, mbrw->mb_offset,
+				MP_OBJ_MBLOCK, (cmd == MPIOC_MB_READ) ? READ : WRITE,
+				stkbuf, stkbufsz);
+	}
+
+	mblock_put(mblock);
+
+errout:
+	if (xfree)
+		kfree(kiov);
+
+	return rc;
+}
+
+/*
+ * Mpctl mlog ioctl handlers
+ */
+static int mpioc_mlog_alloc(struct mpc_unit *unit, struct mpioc_mlog *ml)
+{
+	struct mpool_descriptor *mpool;
+	struct mlog_descriptor *mlog;
+	struct mlog_props props;
+	int rc;
+
+	if (!unit || !unit->un_mpool || !ml)
+		return -EINVAL;
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mlog_alloc(mpool, &ml->ml_cap, ml->ml_mclassp, &props, &mlog);
+	if (rc)
+		return rc;
+
+	mlog_get_props_ex(mpool, mlog, &ml->ml_props);
+	mlog_put(mlog);
+
+	ml->ml_objid = props.lpr_objid;
+
+	return 0;
+}
+
+static int mpioc_mlog_find(struct mpc_unit *unit, struct mpioc_mlog *ml)
+{
+	struct mpool_descriptor    *mpool;
+	struct mlog_descriptor     *mlog;
+	int rc;
+
+	if (!unit || !unit->un_mpool || !ml || !mlog_objid(ml->ml_objid))
+		return -EINVAL;
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mlog_find_get(mpool, ml->ml_objid, 0, NULL, &mlog);
+	if (!rc) {
+		rc = mlog_get_props_ex(mpool, mlog, &ml->ml_props);
+		mlog_put(mlog);
+	}
+
+	return rc;
+}
+
+static int mpioc_mlog_abcomdel(struct mpc_unit *unit, uint cmd, struct mpioc_mlog_id *mi)
+{
+	struct mpool_descriptor *mpool;
+	struct mlog_descriptor *mlog;
+	struct mlog_props_ex props;
+	int which, rc;
+	bool drop;
+
+	if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+		return -EINVAL;
+
+	which = (cmd == MPIOC_MLOG_DELETE) ? 1 : -1;
+	mpool = unit->un_mpool->mp_desc;
+	drop = true;
+
+	rc = mlog_find_get(mpool, mi->mi_objid, which, NULL, &mlog);
+	if (rc)
+		return rc;
+
+	switch (cmd) {
+	case MPIOC_MLOG_COMMIT:
+		rc = mlog_commit(mpool, mlog);
+		if (!rc) {
+			mlog_get_props_ex(mpool, mlog, &props);
+			mi->mi_gen   = props.lpx_props.lpr_gen;
+			mi->mi_state = props.lpx_state;
+		}
+		break;
+
+	case MPIOC_MLOG_ABORT:
+		rc = mlog_abort(mpool, mlog);
+		drop = !!rc;
+		break;
+
+	case MPIOC_MLOG_DELETE:
+		rc = mlog_delete(mpool, mlog);
+		drop = !!rc;
+		break;
+
+	default:
+		rc = -ENOTTY;
+		break;
+	}
+
+	if (drop)
+		mlog_put(mlog);
+
+	return rc;
+}
+
+static int mpioc_mlog_rw(struct mpc_unit *unit, struct mpioc_mlog_io *mi,
+			 void *stkbuf, size_t stkbufsz)
+{
+	struct mpool_descriptor *mpool;
+	struct mlog_descriptor *mlog;
+	struct iovec *kiov;
+	bool xfree = false;
+	size_t kiovsz;
+	int rc;
+
+	if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+		return -EINVAL;
+
+	/*
+	 * For small iovec counts we simply copyin the array of iovecs
+	 * to local storage (stkbuf). Otherwise, we must kmalloc a
+	 * buffer into which to perform the copyin.
+	 */
+	if (mi->mi_iovc > MPIOC_KIOV_MAX)
+		return -EINVAL;
+
+	kiovsz = mi->mi_iovc * sizeof(*kiov);
+
+	if (kiovsz > stkbufsz) {
+		kiov = kmalloc(kiovsz, GFP_KERNEL);
+		if (!kiov)
+			return -ENOMEM;
+
+		xfree = true;
+	} else {
+		kiov = stkbuf;
+		stkbuf += kiovsz;
+		stkbufsz -= kiovsz;
+	}
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mlog_find_get(mpool, mi->mi_objid, 1, NULL, &mlog);
+	if (rc)
+		goto errout;
+
+	if (copy_from_user(kiov, mi->mi_iov, kiovsz)) {
+		rc = -EFAULT;
+	} else {
+		rc = mpc_physio(mpool, mlog, kiov, mi->mi_iovc, mi->mi_off, MP_OBJ_MLOG,
+				(mi->mi_op == MPOOL_OP_READ) ? READ : WRITE, stkbuf, stkbufsz);
+	}
+
+	mlog_put(mlog);
+
+errout:
+	if (xfree)
+		kfree(kiov);
+
+	return rc;
+}
+
+static int mpioc_mlog_erase(struct mpc_unit *unit, struct mpioc_mlog_id *mi)
+{
+	struct mpool_descriptor *mpool;
+	struct mlog_descriptor *mlog;
+	struct mlog_props_ex props;
+	int rc;
+
+	if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+		return -EINVAL;
+
+	mpool = unit->un_mpool->mp_desc;
+
+	rc = mlog_find_get(mpool, mi->mi_objid, 0, NULL, &mlog);
+	if (rc)
+		return rc;
+
+	rc = mlog_erase(mpool, mlog, mi->mi_gen);
+	if (!rc) {
+		mlog_get_props_ex(mpool, mlog, &props);
+		mi->mi_gen   = props.lpx_props.lpr_gen;
+		mi->mi_state = props.lpx_state;
+	}
+
+	mlog_put(mlog);
+
+	return rc;
+}
+
 static struct mpc_softstate *mpc_cdev2ss(struct cdev *cdev)
 {
 	if (!cdev || cdev->owner != THIS_MODULE) {
@@ -1798,8 +2388,8 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
 {
 	char argbuf[256] __aligned(16);
 	struct mpc_unit *unit;
-	size_t argbufsz;
-	void *argp;
+	size_t argbufsz, stkbufsz;
+	void *argp, *stkbuf;
 	ulong iosz;
 	int rc;
 
@@ -1810,7 +2400,12 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
 		switch (cmd) {
 		case MPIOC_PROP_GET:
 		case MPIOC_DEVPROPS_GET:
+		case MPIOC_MB_FIND:
+		case MPIOC_MB_READ:
 		case MPIOC_MP_MCLASS_GET:
+		case MPIOC_MLOG_FIND:
+		case MPIOC_MLOG_READ:
+		case MPIOC_MLOG_PROPS:
 			break;
 
 		default:
@@ -1882,6 +2477,59 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
 		rc = mpioc_devprops_get(unit, argp);
 		break;
 
+	case MPIOC_MB_ALLOC:
+		rc = mpioc_mb_alloc(unit, argp);
+		break;
+
+	case MPIOC_MB_FIND:
+		rc = mpioc_mb_find(unit, argp);
+		break;
+
+	case MPIOC_MB_COMMIT:
+	case MPIOC_MB_DELETE:
+	case MPIOC_MB_ABORT:
+		rc = mpioc_mb_abcomdel(unit, cmd, argp);
+		break;
+
+	case MPIOC_MB_READ:
+	case MPIOC_MB_WRITE:
+		ASSERT(roundup(iosz, 16) < argbufsz);
+
+		stkbufsz = argbufsz - roundup(iosz, 16);
+		stkbuf = argbuf + roundup(iosz, 16);
+
+		rc = mpioc_mb_rw(unit, cmd, argp, stkbuf, stkbufsz);
+		break;
+
+	case MPIOC_MLOG_ALLOC:
+		rc = mpioc_mlog_alloc(unit, argp);
+		break;
+
+	case MPIOC_MLOG_FIND:
+	case MPIOC_MLOG_PROPS:
+		rc = mpioc_mlog_find(unit, argp);
+		break;
+
+	case MPIOC_MLOG_ABORT:
+	case MPIOC_MLOG_COMMIT:
+	case MPIOC_MLOG_DELETE:
+		rc = mpioc_mlog_abcomdel(unit, cmd, argp);
+		break;
+
+	case MPIOC_MLOG_READ:
+	case MPIOC_MLOG_WRITE:
+		ASSERT(roundup(iosz, 16) < argbufsz);
+
+		stkbufsz = argbufsz - roundup(iosz, 16);
+		stkbuf = argbuf + roundup(iosz, 16);
+
+		rc = mpioc_mlog_rw(unit, argp, stkbuf, stkbufsz);
+		break;
+
+	case MPIOC_MLOG_ERASE:
+		rc = mpioc_mlog_erase(unit, argp);
+		break;
+
 	default:
 		rc = -ENOTTY;
 		mp_pr_rl("invalid command %x: dir=%u type=%c nr=%u size=%u",
@@ -1936,6 +2584,8 @@ void mpctl_exit(void)
 
 		ss->ss_inited = false;
 	}
+
+	mpc_vcache_fini(&mpc_physio_vcache);
 }
 
 /**
@@ -1947,6 +2597,7 @@ int mpctl_init(void)
 	struct mpool_config *cfg = NULL;
 	struct mpc_unit *ctlunit;
 	const char *errmsg = NULL;
+	size_t sz;
 	int rc;
 
 	if (ss->ss_inited)
@@ -1956,6 +2607,19 @@ int mpctl_init(void)
 
 	maxunits = clamp_t(uint, maxunits, 8, 8192);
 
+	rwsz_max_mb = clamp_t(ulong, rwsz_max_mb, 1, 128);
+	rwconc_max = clamp_t(ulong, rwconc_max, 1, 32);
+
+	/* Must be same as mpc_physio() pagesvsz calculation. */
+	sz = (rwsz_max_mb << 20) / PAGE_SIZE;
+	sz *= (sizeof(void *) + sizeof(struct iovec));
+
+	rc = mpc_vcache_init(&mpc_physio_vcache, sz, rwconc_max);
+	if (rc) {
+		errmsg = "vcache init failed";
+		goto errout;
+	}
+
 	cdev_init(&ss->ss_cdev, &mpc_fops_default);
 	ss->ss_cdev.owner = THIS_MODULE;
 
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH v2 19/22] mpool: add support to mmap arbitrary collection of mblocks
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (17 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 18/22] mpool: add object " Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 20/22] mpool: add support to proactively evict cached mblock data from the page-cache Nabeel M Mohamed
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This implements the mmap file operation for the mpool driver.

Mblock and mlog writes avoid the Linux page cache. Mblocks are
written, committed, and made immutable before they can be read
either directly (avoiding the page cache) or memory-mapped.
Mlogs are always read and updated directly and cannot be
memory-mapped.

Mblocks are memory-mapped by creating an mcache map.
The mcache map APIs allow an arbitrary collection of mblocks
(specified as a vector of mblock OIDs) to be mapped linearly
into the virtual address space of an mpool client.

An extended VMA instance (mpc_xvm) is created for each
mcache map. The xvm takes a ref on all the mblocks it maps.
The size of an xvm region is 1G by default and can be tuned
using the module parameter 'xvm_size_max'. When an mcache
map is created, its xvm instance is assigned the next
available device offset range. The device offset -> xvm
mapping is inserted into a region map stored in the mpool's
unit object.

The mmap driver entry point uses the vma page offset to deduce
the device offset of the corresponding mcache map. The xvm
instance is looked up from the region map using device offset
as the key. A reference to this xvm instance is stored in the
vma private data, which enables the page fault handler to find
the xvm for the faulting range in constant time. The offset into
the mcache map can then be used to determine the mblock id and
offset to be read for filling the page.

Readahead is enabled for the mpool device by providing the
fadvise file op and by initializing file_ra_state.ra_pages.
Readahead requests are dispatched to one of the four
readahead workqueues and served by the workqueue threads.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/init.c   |   16 +
 drivers/mpool/init.h   |    2 +
 drivers/mpool/mcache.c | 1029 ++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/mcache.h |   96 ++++
 drivers/mpool/mpctl.c  |   45 +-
 drivers/mpool/mpctl.h  |    3 +
 6 files changed, 1190 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mpool/mcache.c
 create mode 100644 drivers/mpool/mcache.h

diff --git a/drivers/mpool/init.c b/drivers/mpool/init.c
index 126c6c7142b5..b1fe3286773a 100644
--- a/drivers/mpool/init.c
+++ b/drivers/mpool/init.c
@@ -12,11 +12,20 @@
 #include "smap.h"
 #include "pmd_obj.h"
 #include "sb.h"
+#include "mcache.h"
 #include "mpctl.h"
 
 /*
  * Module params...
  */
+unsigned int xvm_max __read_mostly = 1048576 * 128;
+module_param(xvm_max, uint, 0444);
+MODULE_PARM_DESC(xvm_max, " max extended VMA regions");
+
+unsigned int xvm_size_max __read_mostly = 30;
+module_param(xvm_size_max, uint, 0444);
+MODULE_PARM_DESC(xvm_size_max, " max extended VMA size log2");
+
 unsigned int maxunits __read_mostly = 1024;
 module_param(maxunits, uint, 0444);
 MODULE_PARM_DESC(maxunits, " max mpools");
@@ -40,6 +49,7 @@ MODULE_PARM_DESC(chunk_size_kb, "Chunk size (in KiB) for device I/O");
 static void mpool_exit_impl(void)
 {
 	mpctl_exit();
+	mcache_exit();
 	pmd_exit();
 	smap_exit();
 	sb_exit();
@@ -82,6 +92,12 @@ static __init int mpool_init(void)
 		goto errout;
 	}
 
+	rc = mcache_init();
+	if (rc) {
+		errmsg = "mcache init failed";
+		goto errout;
+	}
+
 	rc = mpctl_init();
 	if (rc) {
 		errmsg = "mpctl init failed";
diff --git a/drivers/mpool/init.h b/drivers/mpool/init.h
index 3d8f809a5e45..507d43a55c01 100644
--- a/drivers/mpool/init.h
+++ b/drivers/mpool/init.h
@@ -6,6 +6,8 @@
 #ifndef MPOOL_INIT_H
 #define MPOOL_INIT_H
 
+extern unsigned int xvm_max;
+extern unsigned int xvm_size_max;
 extern unsigned int maxunits;
 extern unsigned int rwsz_max_mb;
 extern unsigned int rwconc_max;
diff --git a/drivers/mpool/mcache.c b/drivers/mpool/mcache.c
new file mode 100644
index 000000000000..07c79615ecf1
--- /dev/null
+++ b/drivers/mpool/mcache.c
@@ -0,0 +1,1029 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#include <linux/module.h>
+#include <linux/memcontrol.h>
+#include <linux/kref.h>
+#include <linux/migrate.h>
+#include <linux/delay.h>
+#include <linux/uio.h>
+#include <linux/fadvise.h>
+#include <linux/prefetch.h>
+
+#include "mpool_ioctl.h"
+
+#include "mp.h"
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "mpctl.h"
+
+#ifndef lru_to_page
+#define lru_to_page(_head)  (list_entry((_head)->prev, struct page, lru))
+#endif
+
+/*
+ * MPC_RA_IOV_MAX - Max pages per call to mblock read by a readahead
+ * request.  Be careful about increasing this as it directly adds
+ * (n * 24) bytes to the stack frame of mpc_readpages_cb().
+ */
+#define MPC_RA_IOV_MAX      (8)
+
+#define NODEV               MKDEV(0, 0)    /* Non-existent device */
+
+/*
+ * Arguments required to initiate an asynchronous call to mblock_read()
+ * and which must also be preserved across that call.
+ *
+ * Note: We could make things more efficient by changing a_pagev[]
+ * to a struct kvec if mblock_read() would guarantee that it will
+ * not alter the given iovec.
+ */
+struct readpage_args {
+	void                       *a_xvm;
+	struct mblock_descriptor   *a_mbdesc;
+	u64                         a_mboffset;
+	int                         a_pagec;
+	struct page                *a_pagev[];
+};
+
+struct readpage_work {
+	struct work_struct      w_work;
+	struct readpage_args    w_args;
+};
+
+static void mpc_xvm_put(struct mpc_xvm *xvm);
+
+static int mpc_readpage_impl(struct page *page, struct mpc_xvm *map);
+
+/* The following structures are initialized at the end of this file. */
+static const struct vm_operations_struct mpc_vops_default;
+const struct address_space_operations mpc_aops_default;
+
+static struct workqueue_struct *mpc_wq_trunc __read_mostly;
+static struct workqueue_struct *mpc_wq_rav[4] __read_mostly;
+
+static size_t mpc_xvm_cachesz[2] __read_mostly;
+static struct kmem_cache *mpc_xvm_cache[2] __read_mostly;
+
+static struct workqueue_struct *mpc_rgn2wq(uint rgn)
+{
+	return mpc_wq_rav[rgn % ARRAY_SIZE(mpc_wq_rav)];
+}
+
+static int mpc_rgnmap_isorphan(int rgn, void *item, void *data)
+{
+	struct mpc_xvm *xvm = item;
+	void **headp = data;
+
+	if (xvm && kref_read(&xvm->xvm_ref) == 1 && !atomic_read(&xvm->xvm_opened)) {
+		idr_replace(&xvm->xvm_rgnmap->rm_root, NULL, rgn);
+		xvm->xvm_next = *headp;
+		*headp = xvm;
+	}
+
+	return ITERCB_NEXT;
+}
+
+void mpc_rgnmap_flush(struct mpc_rgnmap *rm)
+{
+	struct mpc_xvm *head = NULL, *xvm;
+
+	if (!rm)
+		return;
+
+	/* Wait for all mpc_xvm_free_cb() callbacks to complete... */
+	flush_workqueue(mpc_wq_trunc);
+
+	/*
+	 * Build a list of all orphaned XVMs and release their birth
+	 * references (i.e., XVMs that were created but never mmapped).
+	 */
+	mutex_lock(&rm->rm_lock);
+	idr_for_each(&rm->rm_root, mpc_rgnmap_isorphan, &head);
+	mutex_unlock(&rm->rm_lock);
+
+	while ((xvm = head)) {
+		head = xvm->xvm_next;
+		mpc_xvm_put(xvm);
+	}
+}
+
+static struct mpc_xvm *mpc_xvm_lookup(struct mpc_rgnmap *rm, uint key)
+{
+	struct mpc_xvm *xvm;
+
+	mutex_lock(&rm->rm_lock);
+	xvm = idr_find(&rm->rm_root, key);
+	if (xvm && !kref_get_unless_zero(&xvm->xvm_ref))
+		xvm = NULL;
+	mutex_unlock(&rm->rm_lock);
+
+	return xvm;
+}
+
+void mpc_xvm_free(struct mpc_xvm *xvm)
+{
+	struct mpc_rgnmap *rm;
+
+	ASSERT((u32)(uintptr_t)xvm == xvm->xvm_magic);
+
+	rm = xvm->xvm_rgnmap;
+
+	mutex_lock(&rm->rm_lock);
+	idr_remove(&rm->rm_root, xvm->xvm_rgn);
+	mutex_unlock(&rm->rm_lock);
+
+	xvm->xvm_magic = 0xbadcafe;
+	xvm->xvm_rgn = -1;
+
+	kmem_cache_free(xvm->xvm_cache, xvm);
+
+	atomic_dec(&rm->rm_rgncnt);
+}
+
+static void mpc_xvm_free_cb(struct work_struct *work)
+{
+	struct mpc_xvm *xvm = container_of(work, typeof(*xvm), xvm_work);
+
+	mpc_xvm_free(xvm);
+}
+
+static void mpc_xvm_get(struct mpc_xvm *xvm)
+{
+	kref_get(&xvm->xvm_ref);
+}
+
+static void mpc_xvm_release(struct kref *kref)
+{
+	struct mpc_xvm *xvm = container_of(kref, struct mpc_xvm, xvm_ref);
+	struct mpc_rgnmap *rm = xvm->xvm_rgnmap;
+	int i;
+
+	ASSERT((u32)(uintptr_t)xvm == xvm->xvm_magic);
+
+	mutex_lock(&rm->rm_lock);
+	ASSERT(kref_read(kref) == 0);
+	idr_replace(&rm->rm_root, NULL, xvm->xvm_rgn);
+	mutex_unlock(&rm->rm_lock);
+
+	/*
+	 * Wait for all in-progress readaheads to complete
+	 * before we drop our mblock references.
+	 */
+	if (atomic_add_return(WQ_MAX_ACTIVE, &xvm->xvm_rabusy) > WQ_MAX_ACTIVE)
+		flush_workqueue(mpc_rgn2wq(xvm->xvm_rgn));
+
+	for (i = 0; i < xvm->xvm_mbinfoc; ++i)
+		mblock_put(xvm->xvm_mbinfov[i].mbdesc);
+
+	INIT_WORK(&xvm->xvm_work, mpc_xvm_free_cb);
+	queue_work(mpc_wq_trunc, &xvm->xvm_work);
+}
+
+static void mpc_xvm_put(struct mpc_xvm *xvm)
+{
+	kref_put(&xvm->xvm_ref, mpc_xvm_release);
+}
+
+/*
+ * VM operations
+ */
+
+static void mpc_vm_open(struct vm_area_struct *vma)
+{
+	mpc_xvm_get(vma->vm_private_data);
+}
+
+static void mpc_vm_close(struct vm_area_struct *vma)
+{
+	mpc_xvm_put(vma->vm_private_data);
+}
+
+static int mpc_alloc_and_readpage(struct vm_area_struct *vma, pgoff_t offset, gfp_t gfp)
+{
+	struct address_space *mapping;
+	struct file *file;
+	struct page *page;
+	int rc;
+
+	page = __page_cache_alloc(gfp | __GFP_NOWARN);
+	if (!page)
+		return -ENOMEM;
+
+	file    = vma->vm_file;
+	mapping = file->f_mapping;
+
+	rc = add_to_page_cache_lru(page, mapping, offset, gfp & GFP_KERNEL);
+	if (rc == 0)
+		rc = mpc_readpage_impl(page, vma->vm_private_data);
+	else if (rc == -EEXIST)
+		rc = 0;
+
+	put_page(page);
+
+	return rc;
+}
+
+static bool mpc_lock_page_or_retry(struct page *page, struct mm_struct *mm, uint flags)
+{
+	might_sleep();
+
+	if (trylock_page(page))
+		return true;
+
+	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (flags & FAULT_FLAG_RETRY_NOWAIT)
+			return false;
+
+		mmap_read_unlock(mm);
+		/* _killable version is not exported by the kernel. */
+		wait_on_page_locked(page);
+		return false;
+	}
+
+	if (flags & FAULT_FLAG_KILLABLE) {
+		int rc;
+
+		rc = lock_page_killable(page);
+		if (rc) {
+			mmap_read_unlock(mm);
+			return false;
+		}
+	} else {
+		lock_page(page);
+	}
+
+	return true;
+}
+
+static int mpc_handle_page_error(struct page *page, struct vm_area_struct *vma)
+{
+	int rc;
+
+	ClearPageError(page);
+
+	rc = mpc_readpage_impl(page, vma->vm_private_data);
+	if (rc == 0) {
+		wait_on_page_locked(page);
+		if (!PageUptodate(page))
+			rc = -EIO;
+	}
+
+	put_page(page);
+
+	return rc;
+}
+
+static vm_fault_t mpc_vm_fault_impl(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct address_space *mapping;
+	struct inode *inode;
+	struct page *page;
+	vm_fault_t vmfrc;
+	pgoff_t offset;
+	loff_t size;
+
+	mapping = vma->vm_file->f_mapping;
+	inode   = mapping->host;
+	offset  = vmf->pgoff;
+	vmfrc   = 0;
+
+	size = round_up(i_size_read(inode), PAGE_SIZE);
+	if (offset >= (size >> PAGE_SHIFT))
+		return VM_FAULT_SIGBUS;
+
+retry_find:
+	page = find_get_page(mapping, offset);
+	if (!page) {
+		int rc = mpc_alloc_and_readpage(vma, offset, mapping_gfp_mask(mapping));
+
+		if (rc < 0)
+			return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+
+		vmfrc = VM_FAULT_MAJOR;
+		goto retry_find;
+	}
+
+	/* At this point, page is not locked but has a ref. */
+	if (vmfrc == VM_FAULT_MAJOR)
+		count_vm_event(PGMAJFAULT);
+
+	if (!mpc_lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		put_page(page);
+		return vmfrc | VM_FAULT_RETRY;
+	}
+
+	/* At this point, page is locked with a ref. */
+	if (unlikely(page->mapping != mapping)) {
+		unlock_page(page);
+		put_page(page);
+		goto retry_find;
+	}
+
+	VM_BUG_ON_PAGE(page->index != offset, page);
+
+	if (unlikely(!PageUptodate(page))) {
+		int rc = mpc_handle_page_error(page, vma);
+
+		/* At this point, page is not locked and has no ref. */
+		if (rc)
+			return VM_FAULT_SIGBUS;
+		goto retry_find;
+	}
+
+	/* Page is locked with a ref. */
+	vmf->page = page;
+
+	return vmfrc | VM_FAULT_LOCKED;
+}
+
+static vm_fault_t mpc_vm_fault(struct vm_fault *vmf)
+{
+	return mpc_vm_fault_impl(vmf->vma, vmf);
+}
+
+/*
+ * MPCTL address-space operations.
+ */
+
+static int mpc_readpage_impl(struct page *page, struct mpc_xvm *xvm)
+{
+	struct mpc_mbinfo *mbinfo;
+	struct kvec iov[1];
+	off_t offset;
+	uint mbnum;
+	int rc;
+
+	offset  = page->index << PAGE_SHIFT;
+	offset %= (1ul << xvm_size_max);
+
+	mbnum = offset / xvm->xvm_bktsz;
+	if (mbnum >= xvm->xvm_mbinfoc) {
+		unlock_page(page);
+		return -EINVAL;
+	}
+
+	mbinfo = xvm->xvm_mbinfov + mbnum;
+	offset %= xvm->xvm_bktsz;
+
+	if (offset >= mbinfo->mblen) {
+		unlock_page(page);
+		return -EINVAL;
+	}
+
+	iov[0].iov_base = page_address(page);
+	iov[0].iov_len = PAGE_SIZE;
+
+	rc = mblock_read(xvm->xvm_mpdesc, mbinfo->mbdesc, iov, 1, offset, PAGE_SIZE);
+	if (rc) {
+		unlock_page(page);
+		return rc;
+	}
+
+	if (xvm->xvm_hcpagesp)
+		atomic64_inc(xvm->xvm_hcpagesp);
+	atomic64_inc(&xvm->xvm_nrpages);
+
+	SetPagePrivate(page);
+	set_page_private(page, (ulong)xvm);
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+
+#define MPC_RPARGSBUFSZ \
+	(sizeof(struct readpage_args) + MPC_RA_IOV_MAX * sizeof(void *))
+
+/**
+ * mpc_readpages_cb() - mpc_readpages() callback
+ * @work:   w_work.work from struct readpage_work
+ *
+ * The incoming arguments are in the first page (a_pagev[0]) which
+ * we are about to overwrite, so we copy them to the stack.
+ */
+static void mpc_readpages_cb(struct work_struct *work)
+{
+	struct kvec iovbuf[MPC_RA_IOV_MAX];
+	char argsbuf[MPC_RPARGSBUFSZ];
+	struct readpage_args *args = (void *)argsbuf;
+	struct readpage_work *w;
+	struct mpc_xvm *xvm;
+	struct kvec *iov = iovbuf;
+	size_t argssz;
+	int pagec, rc, i;
+
+	w = container_of(work, struct readpage_work, w_work);
+
+	pagec = w->w_args.a_pagec;
+	argssz = sizeof(*args) + sizeof(args->a_pagev[0]) * pagec;
+
+	ASSERT(pagec <= ARRAY_SIZE(iovbuf));
+	ASSERT(argssz <= sizeof(argsbuf));
+
+	memcpy(args, &w->w_args, argssz);
+	w = NULL; /* Do not touch! */
+
+	xvm = args->a_xvm;
+
+	/*
+	 * Synchronize with mpc_xvm_put() to prevent dropping our
+	 * mblock references while there are reads in progress.
+	 */
+	if (atomic_inc_return(&xvm->xvm_rabusy) > WQ_MAX_ACTIVE) {
+		rc = -ENXIO;
+		goto errout;
+	}
+
+	for (i = 0; i < pagec; ++i) {
+		iov[i].iov_base = page_address(args->a_pagev[i]);
+		iov[i].iov_len = PAGE_SIZE;
+	}
+
+	rc = mblock_read(xvm->xvm_mpdesc, args->a_mbdesc, iov,
+			 pagec, args->a_mboffset, pagec << PAGE_SHIFT);
+	if (rc)
+		goto errout;
+
+	if (xvm->xvm_hcpagesp)
+		atomic64_add(pagec, xvm->xvm_hcpagesp);
+	atomic64_add(pagec, &xvm->xvm_nrpages);
+	atomic_dec(&xvm->xvm_rabusy);
+
+	for (i = 0; i < pagec; ++i) {
+		struct page *page = args->a_pagev[i];
+
+		SetPagePrivate(page);
+		set_page_private(page, (ulong)xvm);
+		SetPageUptodate(page);
+
+		unlock_page(page);
+		put_page(page);
+	}
+
+	return;
+
+errout:
+	atomic_dec(&xvm->xvm_rabusy);
+
+	for (i = 0; i < pagec; ++i) {
+		unlock_page(args->a_pagev[i]);
+		put_page(args->a_pagev[i]);
+	}
+}
+
+static int mpc_readpages(struct file *file, struct address_space *mapping,
+			 struct list_head *pages, uint nr_pages)
+{
+	struct workqueue_struct *wq;
+	struct readpage_work *w;
+	struct work_struct *work;
+	struct mpc_mbinfo *mbinfo;
+	struct mpc_unit *unit;
+	struct mpc_xvm *xvm;
+	struct page *page;
+	off_t offset, mbend;
+	uint mbnum, iovmax, i;
+	uint ra_pages_max;
+	ulong index;
+	gfp_t gfp;
+	u32 key;
+	int rc;
+
+	unit = file->private_data;
+
+	ra_pages_max = unit->un_ra_pages_max;
+	if (ra_pages_max < 1)
+		return 0;
+
+	page   = lru_to_page(pages);
+	offset = page->index << PAGE_SHIFT;
+	index  = page->index;
+	work   = NULL;
+	w      = NULL;
+
+	key = offset >> xvm_size_max;
+
+	/*
+	 * The idr value here (xvm) is pinned for the lifetime of the address map.
+	 * Therefore, we can exit the rcu read-side critical section without
+	 * worry that xvm will be destroyed before put_page() has been called
+	 * on each and every page in the given list of pages.
+	 */
+	rcu_read_lock();
+	xvm = idr_find(&unit->un_rgnmap.rm_root, key);
+	rcu_read_unlock();
+
+	if (!xvm)
+		return 0;
+
+	offset %= (1ul << xvm_size_max);
+
+	mbnum = offset / xvm->xvm_bktsz;
+	if (mbnum >= xvm->xvm_mbinfoc)
+		return 0;
+
+	mbinfo = xvm->xvm_mbinfov + mbnum;
+
+	mbend = mbnum * xvm->xvm_bktsz + mbinfo->mblen;
+	iovmax = MPC_RA_IOV_MAX;
+
+	gfp = mapping_gfp_mask(mapping) & GFP_KERNEL;
+	wq = mpc_rgn2wq(xvm->xvm_rgn);
+
+	nr_pages = min_t(uint, nr_pages, ra_pages_max);
+
+	for (i = 0; i < nr_pages; ++i) {
+		page    = lru_to_page(pages);
+		offset  = page->index << PAGE_SHIFT;
+		offset %= (1ul << xvm_size_max);
+
+		/* Don't read past the end of the mblock. */
+		if (offset >= mbend)
+			break;
+
+		/* mblock reads must be logically contiguous. */
+		if (page->index != index && work) {
+			queue_work(wq, work);
+			work = NULL;
+		}
+
+		index = page->index + 1; /* next expected page index */
+
+		prefetchw(&page->flags);
+		list_del(&page->lru);
+
+		rc = add_to_page_cache_lru(page, mapping, page->index, gfp);
+		if (rc) {
+			if (work) {
+				queue_work(wq, work);
+				work = NULL;
+			}
+			put_page(page);
+			continue;
+		}
+
+		if (!work) {
+			w = page_address(page);
+			INIT_WORK(&w->w_work, mpc_readpages_cb);
+			w->w_args.a_xvm = xvm;
+			w->w_args.a_mbdesc = mbinfo->mbdesc;
+			w->w_args.a_mboffset = offset % xvm->xvm_bktsz;
+			w->w_args.a_pagec = 0;
+			work = &w->w_work;
+
+			iovmax = MPC_RA_IOV_MAX;
+			iovmax -= page->index % MPC_RA_IOV_MAX;
+		}
+
+		w->w_args.a_pagev[w->w_args.a_pagec++] = page;
+
+		/*
+		 * Restrict batch size to the number of struct kvecs
+		 * that will fit into a page (minus our header).
+		 */
+		if (w->w_args.a_pagec >= iovmax) {
+			queue_work(wq, work);
+			work = NULL;
+		}
+	}
+
+	if (work)
+		queue_work(wq, work);
+
+	return 0;
+}
+
+/**
+ * mpc_releasepage() - called by the Linux VM when a page is being released.
+ * @page:   page being released
+ * @gfp:    allocation flags (unused)
+ *
+ * This function participates in the tracking of incoming and outgoing pages.
+ */
+static int mpc_releasepage(struct page *page, gfp_t gfp)
+{
+	struct mpc_xvm *xvm;
+
+	if (!PagePrivate(page))
+		return 0;
+
+	xvm = (void *)page_private(page);
+	if (!xvm)
+		return 0;
+
+	ClearPagePrivate(page);
+	set_page_private(page, 0);
+
+	ASSERT((u32)(uintptr_t)xvm == xvm->xvm_magic);
+
+	if (xvm->xvm_hcpagesp)
+		atomic64_dec(xvm->xvm_hcpagesp);
+	atomic64_dec(&xvm->xvm_nrpages);
+
+	return 1;
+}
+
+static void mpc_invalidatepage(struct page *page, uint offset, uint length)
+{
+	mpc_releasepage(page, 0);
+}
+
+/**
+ * mpc_migratepage() - callback for handling page migration.
+ * @mapping: address space of the page
+ * @newpage: destination page
+ * @page:    page being migrated
+ * @mode:    migration mode
+ *
+ * Drivers with private pages supply this callback. It is not certain
+ * whether page migration releases or invalidates the page being migrated;
+ * if it does not, the tracking of incoming and outgoing pages would be
+ * thrown off. To deal with that uncertainty, migration is declined so
+ * long as the page is private and belongs to mpctl.
+ */
+static int mpc_migratepage(struct address_space *mapping, struct page *newpage,
+			   struct page *page, enum migrate_mode mode)
+{
+	if (page_has_private(page) &&
+	    !try_to_release_page(page, GFP_KERNEL))
+		return -EAGAIN;
+
+	ASSERT(PageLocked(page));
+
+	return migrate_page(mapping, newpage, page, mode);
+}
+
+int mpc_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+	struct mpc_unit *unit = fp->private_data;
+	struct mpc_xvm *xvm;
+	off_t off;
+	ulong len;
+	u32 key;
+
+	off = vma->vm_pgoff << PAGE_SHIFT;
+	len = vma->vm_end - vma->vm_start - 1;
+
+	/* Verify that the request does not cross an xvm region boundary. */
+	if ((off >> xvm_size_max) != ((off + len) >> xvm_size_max))
+		return -EINVAL;
+
+	/* Acquire a reference on the region map for this region. */
+	key = off >> xvm_size_max;
+
+	xvm = mpc_xvm_lookup(&unit->un_rgnmap, key);
+	if (!xvm)
+		return -EINVAL;
+
+	/*
+	 * Drop the birth ref on first open so that the final call
+	 * to mpc_vm_close() will cause the vma to be destroyed.
+	 */
+	if (atomic_inc_return(&xvm->xvm_opened) == 1)
+		mpc_xvm_put(xvm);
+
+	vma->vm_ops = &mpc_vops_default;
+
+	vma->vm_flags &= ~(VM_RAND_READ | VM_SEQ_READ);
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+	vma->vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP | VM_NORESERVE);
+	vma->vm_flags |= VM_MAYREAD | VM_READ | VM_RAND_READ;
+
+	vma->vm_private_data = xvm;
+
+	fp->f_ra.ra_pages = unit->un_ra_pages_max;
+	fp->f_mode |= FMODE_RANDOM;
+
+	return 0;
+}
+
+/**
+ * mpc_fadvise() - forward fadvise requests to page cache readahead.
+ *
+ * mpc_fadvise() currently handles only POSIX_FADV_WILLNEED.
+ *
+ * The code path that leads here is: madvise_willneed() -> vfs_fadvise() -> mpc_fadvise()
+ */
+int mpc_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	pgoff_t start, end;
+
+	if (!file)
+		return -EINVAL;
+
+	if (advice != POSIX_FADV_WILLNEED)
+		return -EOPNOTSUPP;
+
+	start = offset >> PAGE_SHIFT;
+	end = (offset + len - 1) >> PAGE_SHIFT;
+
+	if (end < start)
+		return -EINVAL;
+
+	/* To force page cache readahead */
+	spin_lock(&file->f_lock);
+	file->f_mode |= FMODE_RANDOM;
+	spin_unlock(&file->f_lock);
+
+	page_cache_sync_readahead(file->f_mapping, &file->f_ra, file, start, end - start + 1);
+
+	return 0;
+}
+
+/**
+ * mpioc_xvm_create() - create an extended VMA map (AKA mcache map)
+ * @unit:   mpool unit object
+ * @mp:     mpool descriptor
+ * @ioc:    ioctl argument (vector of mblock OIDs, advice, ...)
+ */
+int mpioc_xvm_create(struct mpc_unit *unit, struct mpool_descriptor *mp, struct mpioc_vma *ioc)
+{
+	struct mpc_mbinfo *mbinfov;
+	struct kmem_cache *cache;
+	struct mpc_rgnmap *rm;
+	struct mpc_xvm *xvm;
+	size_t largest, sz;
+	uint mbidc, mult;
+	u64 *mbidv;
+	int rc, i;
+
+	if (!unit || !unit->un_mapping || !ioc)
+		return -EINVAL;
+
+	if (ioc->im_mbidc < 1)
+		return -EINVAL;
+
+	if (ioc->im_advice > MPC_VMA_PINNED)
+		return -EINVAL;
+
+	mult = 1;
+	if (ioc->im_advice == MPC_VMA_WARM)
+		mult = 10;
+	else if (ioc->im_advice == MPC_VMA_HOT)
+		mult = 100;
+
+	mbidc = ioc->im_mbidc;
+
+	sz = sizeof(*xvm) + sizeof(*mbinfov) * mbidc;
+	if (sz > mpc_xvm_cachesz[1])
+		return -EINVAL;
+	else if (sz > mpc_xvm_cachesz[0])
+		cache = mpc_xvm_cache[1];
+	else
+		cache = mpc_xvm_cache[0];
+
+	sz = mbidc * sizeof(mbidv[0]);
+
+	mbidv = kmalloc(sz, GFP_KERNEL);
+	if (!mbidv)
+		return -ENOMEM;
+
+	rc = copy_from_user(mbidv, ioc->im_mbidv, sz);
+	if (rc) {
+		kfree(mbidv);
+		return -EFAULT;
+	}
+
+	xvm = kmem_cache_zalloc(cache, GFP_KERNEL);
+	if (!xvm) {
+		kfree(mbidv);
+		return -ENOMEM;
+	}
+
+	xvm->xvm_magic = (u32)(uintptr_t)xvm;
+	xvm->xvm_mbinfoc = mbidc;
+	xvm->xvm_mpdesc = mp;
+
+	xvm->xvm_mapping = unit->un_mapping;
+	xvm->xvm_rgnmap = &unit->un_rgnmap;
+	xvm->xvm_advice = ioc->im_advice;
+	kref_init(&xvm->xvm_ref);
+	xvm->xvm_cache = cache;
+	atomic_set(&xvm->xvm_opened, 0);
+
+	atomic64_set(&xvm->xvm_nrpages, 0);
+	atomic_set(&xvm->xvm_rabusy, 0);
+
+	largest = 0;
+
+	mbinfov = xvm->xvm_mbinfov;
+
+	for (i = 0; i < mbidc; ++i) {
+		struct mpc_mbinfo *mbinfo = mbinfov + i;
+		struct mblock_props props;
+
+		rc = mblock_find_get(mp, mbidv[i], 1, &props, &mbinfo->mbdesc);
+		if (rc) {
+			mbidc = i;
+			goto errout;
+		}
+
+		mbinfo->mblen = ALIGN(props.mpr_write_len, PAGE_SIZE);
+		mbinfo->mbmult = mult;
+		atomic64_set(&mbinfo->mbatime, 0);
+
+		largest = max_t(size_t, largest, mbinfo->mblen);
+	}
+
+	xvm->xvm_bktsz = roundup_pow_of_two(largest);
+
+	if (xvm->xvm_bktsz * mbidc > (1ul << xvm_size_max)) {
+		rc = -E2BIG;
+		goto errout;
+	}
+
+	rm = &unit->un_rgnmap;
+
+	mutex_lock(&rm->rm_lock);
+	xvm->xvm_rgn = idr_alloc(&rm->rm_root, NULL, 1, -1, GFP_KERNEL);
+	if (xvm->xvm_rgn < 1) {
+		mutex_unlock(&rm->rm_lock);
+
+		rc = xvm->xvm_rgn ?: -EINVAL;
+		goto errout;
+	}
+
+	ioc->im_offset = (ulong)xvm->xvm_rgn << xvm_size_max;
+	ioc->im_bktsz = xvm->xvm_bktsz;
+	ioc->im_len = xvm->xvm_bktsz * mbidc;
+	ioc->im_len = ALIGN(ioc->im_len, (1ul << xvm_size_max));
+
+	atomic_inc(&rm->rm_rgncnt);
+
+	idr_replace(&rm->rm_root, xvm, xvm->xvm_rgn);
+	mutex_unlock(&rm->rm_lock);
+
+errout:
+	if (rc) {
+		for (i = 0; i < mbidc; ++i)
+			mblock_put(mbinfov[i].mbdesc);
+		kmem_cache_free(cache, xvm);
+	}
+
+	kfree(mbidv);
+
+	return rc;
+}
+
+/**
+ * mpioc_xvm_destroy() - destroy an extended VMA
+ * @unit:   mpool unit object
+ * @ioc:    ioctl argument identifying the region by its device offset
+ */
+int mpioc_xvm_destroy(struct mpc_unit *unit, struct mpioc_vma *ioc)
+{
+	struct mpc_rgnmap *rm;
+	struct mpc_xvm *xvm;
+	u64 rgn;
+
+	if (!unit || !ioc)
+		return -EINVAL;
+
+	rgn = ioc->im_offset >> xvm_size_max;
+	rm = &unit->un_rgnmap;
+
+	mutex_lock(&rm->rm_lock);
+	xvm = idr_find(&rm->rm_root, rgn);
+	if (xvm && kref_read(&xvm->xvm_ref) == 1 && !atomic_read(&xvm->xvm_opened))
+		idr_remove(&rm->rm_root, rgn);
+	else
+		xvm = NULL;
+	mutex_unlock(&rm->rm_lock);
+
+	if (xvm)
+		mpc_xvm_put(xvm);
+
+	return 0;
+}
+
+int mpioc_xvm_purge(struct mpc_unit *unit, struct mpioc_vma *ioc)
+{
+	struct mpc_xvm *xvm;
+	u64 rgn;
+
+	if (!unit || !ioc)
+		return -EINVAL;
+
+	rgn = ioc->im_offset >> xvm_size_max;
+
+	xvm = mpc_xvm_lookup(&unit->un_rgnmap, rgn);
+	if (!xvm)
+		return -ENOENT;
+
+	mpc_xvm_put(xvm);
+
+	return 0;
+}
+
+int mpioc_xvm_vrss(struct mpc_unit *unit, struct mpioc_vma *ioc)
+{
+	struct mpc_xvm *xvm;
+	u64 rgn;
+
+	if (!unit || !ioc)
+		return -EINVAL;
+
+	rgn = ioc->im_offset >> xvm_size_max;
+
+	xvm = mpc_xvm_lookup(&unit->un_rgnmap, rgn);
+	if (!xvm)
+		return -ENOENT;
+
+	ioc->im_vssp = mpc_xvm_pglen(xvm);
+	ioc->im_rssp = atomic64_read(&xvm->xvm_nrpages);
+
+	mpc_xvm_put(xvm);
+
+	return 0;
+}
+
+int mcache_init(void)
+{
+	size_t sz;
+	int rc, i;
+
+	xvm_max = clamp_t(uint, xvm_max, 1024, 1u << 30);
+	xvm_size_max = clamp_t(ulong, xvm_size_max, 27, 32);
+
+	sz = sizeof(struct mpc_mbinfo) * 8;
+	mpc_xvm_cachesz[0] = sizeof(struct mpc_xvm) + sz;
+
+	mpc_xvm_cache[0] = kmem_cache_create("mpool_xvm_0", mpc_xvm_cachesz[0], 0,
+					     SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!mpc_xvm_cache[0]) {
+		rc = -ENOMEM;
+		mp_pr_err("mpc xvm cache 0 create failed", rc);
+		return rc;
+	}
+
+	sz = sizeof(struct mpc_mbinfo) * 32;
+	mpc_xvm_cachesz[1] = sizeof(struct mpc_xvm) + sz;
+
+	mpc_xvm_cache[1] = kmem_cache_create("mpool_xvm_1", mpc_xvm_cachesz[1], 0,
+					     SLAB_HWCACHE_ALIGN | SLAB_POISON, NULL);
+	if (!mpc_xvm_cache[1]) {
+		rc = -ENOMEM;
+		mp_pr_err("mpc xvm cache 1 create failed", rc);
+		goto errout;
+	}
+
+	mpc_wq_trunc = alloc_workqueue("mpc_wq_trunc", WQ_UNBOUND, 16);
+	if (!mpc_wq_trunc) {
+		rc = -ENOMEM;
+		mp_pr_err("trunc workqueue alloc failed", rc);
+		goto errout;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(mpc_wq_rav); ++i) {
+		int     maxactive = 16;
+		char    name[16];
+
+		snprintf(name, sizeof(name), "mpc_wq_ra%d", i);
+
+		mpc_wq_rav[i] = alloc_workqueue(name, 0, maxactive);
+		if (!mpc_wq_rav[i]) {
+			rc = -ENOMEM;
+			mp_pr_err("mpctl ra workqueue alloc failed", rc);
+			goto errout;
+		}
+	}
+
+	return 0;
+
+errout:
+	mcache_exit();
+	return rc;
+}
+
+void mcache_exit(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mpc_wq_rav); ++i) {
+		if (mpc_wq_rav[i])
+			destroy_workqueue(mpc_wq_rav[i]);
+		mpc_wq_rav[i] = NULL;
+	}
+
+	if (mpc_wq_trunc)
+		destroy_workqueue(mpc_wq_trunc);
+	kmem_cache_destroy(mpc_xvm_cache[1]);
+	kmem_cache_destroy(mpc_xvm_cache[0]);
+}
+
+static const struct vm_operations_struct mpc_vops_default = {
+	.open           = mpc_vm_open,
+	.close          = mpc_vm_close,
+	.fault          = mpc_vm_fault,
+};
+
+const struct address_space_operations mpc_aops_default = {
+	.readpages      = mpc_readpages,
+	.releasepage    = mpc_releasepage,
+	.invalidatepage = mpc_invalidatepage,
+	.migratepage    = mpc_migratepage,
+};
diff --git a/drivers/mpool/mcache.h b/drivers/mpool/mcache.h
new file mode 100644
index 000000000000..fe6f45a05494
--- /dev/null
+++ b/drivers/mpool/mcache.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_MCACHE_H
+#define MPOOL_MCACHE_H
+
+#include <linux/kref.h>
+
+#include "init.h"
+#include "mblock.h"
+
+struct mpc_unit;
+
+/**
+ * struct mpc_rgnmap - extended vma (xvm) region management
+ * @rm_lock:    protects rm_root
+ * @rm_root:    root of the region map
+ * @rm_rgncnt:  number of active regions
+ *
+ * Note that this is not a ref-counted object; its lifetime
+ * is tied to struct mpc_unit.
+ */
+struct mpc_rgnmap {
+	struct mutex    rm_lock;
+	struct idr      rm_root;
+	atomic_t        rm_rgncnt;
+} ____cacheline_aligned;
+
+
+struct mpc_mbinfo {
+	struct mblock_descriptor   *mbdesc;
+	u32                         mblen;
+	u32                         mbmult;
+	atomic64_t                  mbatime;
+} __aligned(32);
+
+struct mpc_xvm {
+	size_t                      xvm_bktsz;
+	uint                        xvm_mbinfoc;
+	uint                        xvm_rgn;
+	struct kref                 xvm_ref;
+	u32                         xvm_magic;
+	struct mpool_descriptor    *xvm_mpdesc;
+
+	atomic64_t                 *xvm_hcpagesp;
+	struct address_space       *xvm_mapping;
+	struct mpc_rgnmap          *xvm_rgnmap;
+
+	enum mpc_vma_advice         xvm_advice;
+	atomic_t                    xvm_opened;
+	struct kmem_cache          *xvm_cache;
+	struct mpc_xvm             *xvm_next;
+
+	____cacheline_aligned
+	atomic64_t                  xvm_nrpages;
+	atomic_t                    xvm_rabusy;
+	struct work_struct          xvm_work;
+
+	____cacheline_aligned
+	struct mpc_mbinfo           xvm_mbinfov[];
+};
+
+extern const struct address_space_operations mpc_aops_default;
+
+void mpc_rgnmap_flush(struct mpc_rgnmap *rm);
+
+int mpc_mmap(struct file *fp, struct vm_area_struct *vma);
+
+int mpc_fadvise(struct file *file, loff_t offset, loff_t len, int advice);
+
+int mpioc_xvm_create(struct mpc_unit *unit, struct mpool_descriptor *mp, struct mpioc_vma *ioc);
+
+int mpioc_xvm_destroy(struct mpc_unit *unit, struct mpioc_vma *ioc);
+
+int mpioc_xvm_purge(struct mpc_unit *unit, struct mpioc_vma *ioc);
+
+int mpioc_xvm_vrss(struct mpc_unit *unit, struct mpioc_vma *ioc);
+
+void mpc_xvm_free(struct mpc_xvm *xvm);
+
+int mcache_init(void) __cold;
+void mcache_exit(void) __cold;
+
+static inline pgoff_t mpc_xvm_pgoff(struct mpc_xvm *xvm)
+{
+	return ((ulong)xvm->xvm_rgn << xvm_size_max) >> PAGE_SHIFT;
+}
+
+static inline size_t mpc_xvm_pglen(struct mpc_xvm *xvm)
+{
+	return (xvm->xvm_bktsz * xvm->xvm_mbinfoc) >> PAGE_SHIFT;
+}
+
+#endif /* MPOOL_MCACHE_H */
diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index 3a231cf982b3..f11f522ec90c 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -142,6 +142,11 @@ static ssize_t mpc_label_show(struct device *dev, struct device_attribute *da, c
 	return scnprintf(buf, PAGE_SIZE, "%s\n", dev_to_unit(dev)->un_label);
 }
 
+static ssize_t mpc_vma_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%u\n", xvm_size_max);
+}
+
 static ssize_t mpc_type_show(struct device *dev, struct device_attribute *da, char *buf)
 {
 	struct mpool_uuid  uuid;
@@ -160,6 +165,7 @@ static void mpc_mpool_params_add(struct device_attribute *dattr)
 	MPC_ATTR_RO(dattr++, mode);
 	MPC_ATTR_RO(dattr++, ra);
 	MPC_ATTR_RO(dattr++, label);
+	MPC_ATTR_RO(dattr++, vma);
 	MPC_ATTR_RO(dattr,   type);
 }
 
@@ -243,6 +249,8 @@ static void mpool_params_merge_defaults(struct mpool_params *params)
 	if (params->mp_mode != -1)
 		params->mp_mode &= 0777;
 
+	params->mp_vma_size_max = xvm_size_max;
+
 	params->mp_rsvd0 = 0;
 	params->mp_rsvd1 = 0;
 	params->mp_rsvd2 = 0;
@@ -382,6 +390,10 @@ static int mpc_unit_create(const char *name, struct mpc_mpool *mpool, struct mpc
 	kref_init(&unit->un_ref);
 	unit->un_mpool = mpool;
 
+	mutex_init(&unit->un_rgnmap.rm_lock);
+	idr_init(&unit->un_rgnmap.rm_root);
+	atomic_set(&unit->un_rgnmap.rm_rgncnt, 0);
+
 	mutex_lock(&ss->ss_lock);
 	minor = idr_alloc(&ss->ss_unitmap, NULL, 0, -1, GFP_KERNEL);
 	mutex_unlock(&ss->ss_lock);
@@ -421,6 +433,8 @@ static void mpc_unit_release(struct kref *refp)
 	if (unit->un_device)
 		device_destroy(ss->ss_class, unit->un_devno);
 
+	idr_destroy(&unit->un_rgnmap.rm_root);
+
 	kfree(unit);
 }
 
@@ -633,6 +647,7 @@ static int mpc_cf_journal(struct mpc_unit *unit)
 	cfg.mc_oid2 = unit->un_ds_oidv[1];
 	cfg.mc_captgt = unit->un_mdc_captgt;
 	cfg.mc_ra_pages_max = unit->un_ra_pages_max;
+	cfg.mc_vma_size_max = xvm_size_max;
 	memcpy(&cfg.mc_utype, &unit->un_utype, sizeof(cfg.mc_utype));
 	strlcpy(cfg.mc_label, unit->un_label, sizeof(cfg.mc_label));
 
@@ -742,6 +757,7 @@ static int mpioc_params_get(struct mpc_unit *unit, struct mpioc_params *get)
 	params->mp_oidv[0] = unit->un_ds_oidv[0];
 	params->mp_oidv[1] = unit->un_ds_oidv[1];
 	params->mp_ra_pages_max = unit->un_ra_pages_max;
+	params->mp_vma_size_max = xvm_size_max;
 	memcpy(&params->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
 	strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
 	strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
@@ -785,6 +801,8 @@ static int mpioc_params_set(struct mpc_unit *unit, struct mpioc_params *set)
 
 	params = &set->mps_params;
 
+	params->mp_vma_size_max = xvm_size_max;
+
 	mutex_lock(&ss->ss_lock);
 	if (params->mp_uid != -1 || params->mp_gid != -1 || params->mp_mode != -1) {
 		err = mpc_mp_chown(unit, params);
@@ -919,6 +937,7 @@ static void mpioc_prop_get(struct mpc_unit *unit, struct mpioc_prop *kprop)
 	params->mp_oidv[0] = unit->un_ds_oidv[0];
 	params->mp_oidv[1] = unit->un_ds_oidv[1];
 	params->mp_ra_pages_max = unit->un_ra_pages_max;
+	params->mp_vma_size_max = xvm_size_max;
 	memcpy(&params->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
 	strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
 	strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
@@ -1171,6 +1190,7 @@ static int mpioc_mp_create(struct mpc_unit *ctl, struct mpioc_mpool *mp,
 	cfg.mc_rsvd0 = mp->mp_params.mp_rsvd0;
 	cfg.mc_captgt = MPOOL_ROOT_LOG_CAP;
 	cfg.mc_ra_pages_max = mp->mp_params.mp_ra_pages_max;
+	cfg.mc_vma_size_max = mp->mp_params.mp_vma_size_max;
 	cfg.mc_rsvd1 = mp->mp_params.mp_rsvd1;
 	cfg.mc_rsvd2 = mp->mp_params.mp_rsvd2;
 	cfg.mc_rsvd3 = mp->mp_params.mp_rsvd3;
@@ -1300,6 +1320,7 @@ static int mpioc_mp_activate(struct mpc_unit *ctl, struct mpioc_mpool *mp,
 	mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
 	mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
 	mp->mp_params.mp_ra_pages_max = cfg.mc_ra_pages_max;
+	mp->mp_params.mp_vma_size_max = cfg.mc_vma_size_max;
 	memcpy(&mp->mp_params.mp_utype, &cfg.mc_utype, sizeof(mp->mp_params.mp_utype));
 	strlcpy(mp->mp_params.mp_label, cfg.mc_label, sizeof(mp->mp_params.mp_label));
 
@@ -2313,6 +2334,7 @@ static int mpc_open(struct inode *ip, struct file *fp)
 	}
 
 	fp->f_op = &mpc_fops_default;
+	fp->f_mapping->a_ops = &mpc_aops_default;
 
 	unit->un_mapping = fp->f_mapping;
 
@@ -2361,8 +2383,11 @@ static int mpc_release(struct inode *ip, struct file *fp)
 	if (!lastclose)
 		goto errout;
 
-	if (mpc_unit_ismpooldev(unit))
+	if (mpc_unit_ismpooldev(unit)) {
+		mpc_rgnmap_flush(&unit->un_rgnmap);
+
 		unit->un_mapping = NULL;
+	}
 
 	unit->un_open_excl = false;
 
@@ -2530,6 +2555,22 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
 		rc = mpioc_mlog_erase(unit, argp);
 		break;
 
+	case MPIOC_VMA_CREATE:
+		rc = mpioc_xvm_create(unit, unit->un_mpool->mp_desc, argp);
+		break;
+
+	case MPIOC_VMA_DESTROY:
+		rc = mpioc_xvm_destroy(unit, argp);
+		break;
+
+	case MPIOC_VMA_PURGE:
+		rc = mpioc_xvm_purge(unit, argp);
+		break;
+
+	case MPIOC_VMA_VRSS:
+		rc = mpioc_xvm_vrss(unit, argp);
+		break;
+
 	default:
 		rc = -ENOTTY;
 		mp_pr_rl("invalid command %x: dir=%u type=%c nr=%u size=%u",
@@ -2553,6 +2594,8 @@ static const struct file_operations mpc_fops_default = {
 	.open		= mpc_open,
 	.release	= mpc_release,
 	.unlocked_ioctl	= mpc_ioctl,
+	.mmap           = mpc_mmap,
+	.fadvise        = mpc_fadvise,
 };
 
 static int mpc_exit_unit(int minor, void *item, void *arg)
diff --git a/drivers/mpool/mpctl.h b/drivers/mpool/mpctl.h
index 412a6a491c15..b93e44248f03 100644
--- a/drivers/mpool/mpctl.h
+++ b/drivers/mpool/mpctl.h
@@ -11,6 +11,8 @@
 #include <linux/device.h>
 #include <linux/semaphore.h>
 
+#include "mcache.h"
+
 #define ITERCB_NEXT     (0)
 #define ITERCB_DONE     (1)
 
@@ -23,6 +25,7 @@ struct mpc_unit {
 	uid_t                       un_uid;
 	gid_t                       un_gid;
 	mode_t                      un_mode;
+	struct mpc_rgnmap           un_rgnmap;
 	dev_t                       un_devno;
 	const struct mpc_uinfo     *un_uinfo;
 	struct mpc_mpool           *un_mpool;
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* [PATCH v2 20/22] mpool: add support to proactively evict cached mblock data from the page-cache
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (18 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 19/22] mpool: add support to mmap arbitrary collection of mblocks Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:27 ` [PATCH v2 21/22] mpool: add documentation Nabeel M Mohamed
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds a mechanism to track object-level usage metrics and use
those metrics to proactively evict mblock data from the page cache.

The proactive reaping is employed just before the onset of memory
pressure, which greatly improves throughput and reduces tail
latencies for read- and memory-intensive workloads.

The reaper component tracks residency of pages from specified xvms
and weighs the memory used against system free memory. The reaper
begins evicting pages from the specified xvm when the free memory
falls below a predetermined low watermark, stopping when the free
memory rises above a high watermark.

The reaper maintains several lists of xvms and cycles through them
in a round-robin fashion to select pages to evict. Each xvm
comprises one or more contiguous virtual subranges of pages, where
each subrange is delineated by an mblock. Each mblock has an
associated access time which is updated on each page fault to any
page in the mblock.  The reaper leverages the atime to decide
whether or not to evict all the pages in the subrange based upon
the current TTL, where the current TTL grows shorter as the urgency
to evict pages grows stronger.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/mpool/mcache.c |  43 +++
 drivers/mpool/mcache.h |   8 +
 drivers/mpool/mpctl.c  |  15 +
 drivers/mpool/mpctl.h  |   6 +
 drivers/mpool/reaper.c | 686 +++++++++++++++++++++++++++++++++++++++++
 drivers/mpool/reaper.h |  71 +++++
 6 files changed, 829 insertions(+)
 create mode 100644 drivers/mpool/reaper.c
 create mode 100644 drivers/mpool/reaper.h

diff --git a/drivers/mpool/mcache.c b/drivers/mpool/mcache.c
index 07c79615ecf1..1f1b9173a2b4 100644
--- a/drivers/mpool/mcache.c
+++ b/drivers/mpool/mcache.c
@@ -19,6 +19,7 @@
 #include "assert.h"
 
 #include "mpctl.h"
+#include "reaper.h"
 
 #ifndef lru_to_page
 #define lru_to_page(_head)  (list_entry((_head)->prev, struct page, lru))
@@ -64,6 +65,7 @@ const struct address_space_operations mpc_aops_default;
 
 static struct workqueue_struct *mpc_wq_trunc __read_mostly;
 static struct workqueue_struct *mpc_wq_rav[4] __read_mostly;
+struct mpc_reap *mpc_reap __read_mostly;
 
 static size_t mpc_xvm_cachesz[2] __read_mostly;
 static struct kmem_cache *mpc_xvm_cache[2] __read_mostly;
@@ -109,6 +111,10 @@ void mpc_rgnmap_flush(struct mpc_rgnmap *rm)
 		head = xvm->xvm_next;
 		mpc_xvm_put(xvm);
 	}
+
+	/* Wait for reaper to prune its lists... */
+	while (atomic_read(&rm->rm_rgncnt) > 0)
+		usleep_range(100000, 150000);
 }
 
 static struct mpc_xvm *mpc_xvm_lookup(struct mpc_rgnmap *rm, uint key)
@@ -129,6 +135,22 @@ void mpc_xvm_free(struct mpc_xvm *xvm)
 	struct mpc_rgnmap *rm;
 
 	ASSERT((u32)(uintptr_t)xvm == xvm->xvm_magic);
+	ASSERT(atomic_read(&xvm->xvm_reapref) > 0);
+
+again:
+	mpc_reap_xvm_evict(xvm);
+
+	if (atomic_dec_return(&xvm->xvm_reapref) > 0) {
+		atomic_inc(xvm->xvm_freedp);
+		return;
+	}
+
+	if (atomic64_read(&xvm->xvm_nrpages) > 0) {
+		atomic_cmpxchg(&xvm->xvm_evicting, 1, 0);
+		atomic_inc(&xvm->xvm_reapref);
+		usleep_range(10000, 30000);
+		goto again;
+	}
 
 	rm = xvm->xvm_rgnmap;
 
@@ -337,6 +359,8 @@ static vm_fault_t mpc_vm_fault_impl(struct vm_area_struct *vma, struct vm_fault
 	/* Page is locked with a ref. */
 	vmf->page = page;
 
+	mpc_reap_xvm_touch(vma->vm_private_data, page->index);
+
 	return vmfrc | VM_FAULT_LOCKED;
 }
 
@@ -534,6 +558,9 @@ static int mpc_readpages(struct file *file, struct address_space *mapping,
 	gfp = mapping_gfp_mask(mapping) & GFP_KERNEL;
 	wq = mpc_rgn2wq(xvm->xvm_rgn);
 
+	if (mpc_reap_xvm_duress(xvm))
+		nr_pages = min_t(uint, nr_pages, 8);
+
 	nr_pages = min_t(uint, nr_pages, ra_pages_max);
 
 	for (i = 0; i < nr_pages; ++i) {
@@ -603,6 +630,8 @@ static int mpc_readpages(struct file *file, struct address_space *mapping,
  * @gfp:
  *
  * The function is added as part of tracking incoming and outgoing pages.
+ * When the number of pages owned exceeds the limit (if defined), the
+ * reaper is invoked to trim down the usage.
  */
 static int mpc_releasepage(struct page *page, gfp_t gfp)
 {
@@ -699,6 +728,8 @@ int mpc_mmap(struct file *fp, struct vm_area_struct *vma)
 	fp->f_ra.ra_pages = unit->un_ra_pages_max;
 	fp->f_mode |= FMODE_RANDOM;
 
+	mpc_reap_xvm_add(unit->un_ds_reap, xvm);
+
 	return 0;
 }
 
@@ -805,6 +836,9 @@ int mpioc_xvm_create(struct mpc_unit *unit, struct mpool_descriptor *mp, struct
 	xvm->xvm_cache = cache;
 	atomic_set(&xvm->xvm_opened, 0);
 
+	INIT_LIST_HEAD(&xvm->xvm_list);
+	atomic_set(&xvm->xvm_evicting, 0);
+	atomic_set(&xvm->xvm_reapref, 1);
 	atomic64_set(&xvm->xvm_nrpages, 0);
 	atomic_set(&xvm->xvm_rabusy, 0);
 
@@ -914,6 +948,8 @@ int mpioc_xvm_purge(struct mpc_unit *unit, struct mpioc_vma *ioc)
 	if (!xvm)
 		return -ENOENT;
 
+	mpc_reap_xvm_evict(xvm);
+
 	mpc_xvm_put(xvm);
 
 	return 0;
@@ -978,6 +1014,12 @@ int mcache_init(void)
 		goto errout;
 	}
 
+	rc = mpc_reap_create(&mpc_reap);
+	if (rc) {
+		mp_pr_err("reap create failed", rc);
+		goto errout;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(mpc_wq_rav); ++i) {
 		int     maxactive = 16;
 		char    name[16];
@@ -1009,6 +1051,7 @@ void mcache_exit(void)
 		mpc_wq_rav[i] = NULL;
 	}
 
+	mpc_reap_destroy(mpc_reap);
 	if (mpc_wq_trunc)
 		destroy_workqueue(mpc_wq_trunc);
 	kmem_cache_destroy(mpc_xvm_cache[1]);
diff --git a/drivers/mpool/mcache.h b/drivers/mpool/mcache.h
index fe6f45a05494..8ddb407c862b 100644
--- a/drivers/mpool/mcache.h
+++ b/drivers/mpool/mcache.h
@@ -47,12 +47,19 @@ struct mpc_xvm {
 	atomic64_t                 *xvm_hcpagesp;
 	struct address_space       *xvm_mapping;
 	struct mpc_rgnmap          *xvm_rgnmap;
+	struct mpc_reap            *xvm_reap;
 
 	enum mpc_vma_advice         xvm_advice;
 	atomic_t                    xvm_opened;
 	struct kmem_cache          *xvm_cache;
 	struct mpc_xvm             *xvm_next;
 
+	____cacheline_aligned
+	struct list_head            xvm_list;
+	atomic_t                    xvm_evicting;
+	atomic_t                    xvm_reapref;
+	atomic_t                   *xvm_freedp;
+
 	____cacheline_aligned
 	atomic64_t                  xvm_nrpages;
 	atomic_t                    xvm_rabusy;
@@ -62,6 +69,7 @@ struct mpc_xvm {
 	struct mpc_mbinfo           xvm_mbinfov[];
 };
 
+extern struct mpc_reap *mpc_reap;
 extern const struct address_space_operations mpc_aops_default;
 
 void mpc_rgnmap_flush(struct mpc_rgnmap *rm);
diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index f11f522ec90c..e42abffdcc14 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -38,6 +38,7 @@
 #include "mp.h"
 #include "mpctl.h"
 #include "sysfs.h"
+#include "reaper.h"
 #include "init.h"
 
 
@@ -185,6 +186,10 @@ static int mpc_params_register(struct mpc_unit *unit, int cnt)
 	if (mpc_unit_ismpooldev(unit))
 		mpc_mpool_params_add(dattr);
 
+	/* Common parameters */
+	if (mpc_unit_isctldev(unit))
+		mpc_reap_params_add(dattr);
+
 	rc = mpc_attr_group_create(attr);
 	if (rc) {
 		mpc_attr_destroy(attr);
@@ -2337,6 +2342,7 @@ static int mpc_open(struct inode *ip, struct file *fp)
 	fp->f_mapping->a_ops = &mpc_aops_default;
 
 	unit->un_mapping = fp->f_mapping;
+	unit->un_ds_reap = mpc_reap;
 
 	inode_lock(ip);
 	i_size_write(ip, 1ul << (__BITS_PER_LONG - 1));
@@ -2386,6 +2392,7 @@ static int mpc_release(struct inode *ip, struct file *fp)
 	if (mpc_unit_ismpooldev(unit)) {
 		mpc_rgnmap_flush(&unit->un_rgnmap);
 
+		unit->un_ds_reap = NULL;
 		unit->un_mapping = NULL;
 	}
 
@@ -2714,6 +2721,14 @@ int mpctl_init(void)
 		goto errout;
 	}
 
+	/* The reaper component has already been initialized before mpctl. */
+	ctlunit->un_ds_reap = mpc_reap;
+	rc = mpc_params_register(ctlunit, MPC_REAP_PARAMS_CNT);
+	if (rc) {
+		errmsg = "cannot register common parameters";
+		goto errout;
+	}
+
 	mutex_lock(&ss->ss_lock);
 	idr_replace(&ss->ss_unitmap, ctlunit, MINOR(ctlunit->un_devno));
 	mutex_unlock(&ss->ss_lock);
diff --git a/drivers/mpool/mpctl.h b/drivers/mpool/mpctl.h
index b93e44248f03..1a582bf1a474 100644
--- a/drivers/mpool/mpctl.h
+++ b/drivers/mpool/mpctl.h
@@ -30,6 +30,7 @@ struct mpc_unit {
 	const struct mpc_uinfo     *un_uinfo;
 	struct mpc_mpool           *un_mpool;
 	struct address_space       *un_mapping;
+	struct mpc_reap            *un_ds_reap;
 	struct device              *un_device;
 	struct mpc_attr            *un_attr;
 	uint                        un_rawio;       /* log2(max_mblock_size) */
@@ -46,6 +47,11 @@ static inline struct mpc_unit *dev_to_unit(struct device *dev)
 	return dev_get_drvdata(dev);
 }
 
+static inline struct mpc_reap *dev_to_reap(struct device *dev)
+{
+	return dev_to_unit(dev)->un_ds_reap;
+}
+
 int mpctl_init(void) __cold;
 void mpctl_exit(void) __cold;
 
diff --git a/drivers/mpool/reaper.c b/drivers/mpool/reaper.c
new file mode 100644
index 000000000000..364d692a71ee
--- /dev/null
+++ b/drivers/mpool/reaper.c
@@ -0,0 +1,686 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+/*
+ * The reaper subsystem tracks residency of pages from specified VMAs and
+ * weighs the memory used against system free memory.  Should the relative
+ * residency vs free memory fall below a predetermined low watermark the
+ * reaper begins evicting pages from the specified VMAs, stopping when free
+ * memory rises above a high watermark that is slightly higher than the low
+ * watermark.
+ *
+ * The reaper maintains several lists of VMAs and cycles through them in a
+ * round-robin fashion to select pages to evict.  Each VMA comprises
+ * one or more contiguous virtual subranges of pages, where each subrange
+ * is delineated by an mblock (typically no larger than 32M).  Each mblock
+ * has an associated access time which is updated on each page fault to any
+ * page in the mblock.  The reaper leverages the atime to decide whether or
+ * not to evict all the pages in the subrange based upon the current TTL,
+ * where the current TTL grows shorter as the urgency to evict pages grows
+ * stronger.
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/sched/clock.h>
+
+#include "mpool_printk.h"
+#include "assert.h"
+
+#include "sysfs.h"
+#include "mpctl.h"
+#include "reaper.h"
+
+#define REAP_ELEM_MAX       3
+
+/**
+ * struct mpc_reap_elem -
+ * @reap_lock:      lock to protect reap_list
+ * @reap_list:      list of xvm regions to reap
+ * @reap_running:   reaping in progress
+ * @reap_work:      reaper work struct
+ * @reap_reap:      ptr back to the containing struct mpc_reap
+ * @reap_hpages:    total hot pages which are mapped
+ * @reap_nfreed:    number of freed xvm regions awaiting pruning
+ * @reap_wpages:    total warm pages which are mapped
+ * @reap_cpages:    total cold pages which are mapped
+ */
+struct mpc_reap_elem {
+	struct mutex            reap_lock;
+	struct list_head        reap_list;
+
+	____cacheline_aligned
+	atomic_t                reap_running;
+	struct work_struct      reap_work;
+	struct mpc_reap        *reap_reap;
+
+	____cacheline_aligned
+	atomic64_t              reap_hpages;
+	atomic_t                reap_nfreed;
+
+	____cacheline_aligned
+	atomic64_t              reap_wpages;
+
+	____cacheline_aligned
+	atomic64_t              reap_cpages;
+};
+
+/**
+ * struct mpc_reap -
+ * @reap_lwm:     Low water mark
+ * @reap_ttl_cur: Current time-to-live
+ * @reap_wq:      reaper workqueue
+ *
+ * @reap_mempct:  reaper memory usage threshold (percent)
+ * @reap_ttl:     base object time-to-live
+ * @reap_debug:   debug message verbosity
+ *
+ * @reap_eidx:    Pruner element index
+ * @reap_emit:    Pruner debug message control
+ * @reap_dwork:   Pruner delayed work
+ * @reap_elem:    Array of reaper lists (reaper pool)
+ */
+struct mpc_reap {
+	atomic_t                    reap_lwm;
+	atomic_t                    reap_ttl_cur;
+	struct workqueue_struct    *reap_wq;
+
+	____cacheline_aligned
+	unsigned int                reap_mempct;
+	unsigned int                reap_ttl;
+	unsigned int                reap_debug;
+
+	____cacheline_aligned
+	atomic_t                    reap_eidx;
+	atomic_t                    reap_emit;
+	struct delayed_work         reap_dwork;
+
+	struct mpc_reap_elem        reap_elem[REAP_ELEM_MAX];
+};
+
+/**
+ * mpc_reap_meminfo() - Get current system-wide memory usage
+ * @freep:    ptr to return bytes of free memory
+ * @availp:   ptr to return bytes of available memory
+ * @shift:    shift results right by @shift bits
+ *
+ * mpc_reap_meminfo() returns the current free and available memory
+ * sizes (as reported by /proc/meminfo in userland) obtained via
+ * si_meminfo() and si_mem_available() in the kernel.  The resulting
+ * sizes are in bytes, but the caller can supply a non-zero @shift
+ * argument to obtain results in different units (e.g., for MiB
+ * shift=20, for GiB shift=30).
+ *
+ * @freep and/or @availp may be NULL.
+ */
+static void mpc_reap_meminfo(ulong *freep, ulong *availp, uint shift)
+{
+	struct sysinfo si;
+
+	si_meminfo(&si);
+
+	if (freep)
+		*freep = (si.freeram * si.mem_unit) >> shift;
+
+	if (availp)
+		*availp = (si_mem_available() * si.mem_unit) >> shift;
+}
+
+static void mpc_reap_evict_vma(struct mpc_xvm *xvm)
+{
+	struct address_space *mapping = xvm->xvm_mapping;
+	struct mpc_reap *reap = xvm->xvm_reap;
+	pgoff_t off, bktsz, len;
+	u64 ttl, xtime, now;
+	int i;
+
+	bktsz = xvm->xvm_bktsz >> PAGE_SHIFT;
+	off = mpc_xvm_pgoff(xvm);
+
+	ttl = atomic_read(&reap->reap_ttl_cur) * 1000ul;
+	now = local_clock();
+
+	for (i = 0; i < xvm->xvm_mbinfoc; ++i, off += bktsz) {
+		struct mpc_mbinfo *mbinfo = xvm->xvm_mbinfov + i;
+
+		xtime = now - (ttl * mbinfo->mbmult);
+		len = mbinfo->mblen >> PAGE_SHIFT;
+
+		if (atomic64_read(&mbinfo->mbatime) > xtime)
+			continue;
+
+		atomic64_set(&mbinfo->mbatime, U64_MAX);
+
+		invalidate_inode_pages2_range(mapping, off, off + len - 1);
+
+		if (atomic64_read(&xvm->xvm_nrpages) < 32)
+			break;
+
+		if (need_resched())
+			cond_resched();
+
+		ttl = atomic_read(&reap->reap_ttl_cur) * 1000ul;
+		now = local_clock();
+	}
+}
+
+/**
+ * mpc_reap_evict() - Evict "cold" pages from the given XVMs
+ * @process:    A list of one or more XVMs to be reaped
+ */
+static void mpc_reap_evict(struct list_head *process)
+{
+	struct mpc_xvm *xvm, *next;
+
+	list_for_each_entry_safe(xvm, next, process, xvm_list) {
+		if (atomic_read(&xvm->xvm_reap->reap_lwm))
+			mpc_reap_evict_vma(xvm);
+
+		atomic_cmpxchg(&xvm->xvm_evicting, 1, 0);
+	}
+}
+
+/**
+ * mpc_reap_scan() - Scan for pages to purge
+ * @elem:   reap list element to scan
+ */
+static void mpc_reap_scan(struct mpc_reap_elem *elem)
+{
+	struct list_head *list, process;
+	struct mpc_xvm *xvm, *next;
+	u64 nrpages, n;
+
+	INIT_LIST_HEAD(&process);
+
+	mutex_lock(&elem->reap_lock);
+	list = &elem->reap_list;
+	n = 0;
+
+	list_for_each_entry_safe(xvm, next, list, xvm_list) {
+		nrpages = atomic64_read(&xvm->xvm_nrpages);
+
+		if (nrpages < 32)
+			continue;
+
+		if (atomic_read(&xvm->xvm_reapref) == 1)
+			continue;
+
+		if (atomic_cmpxchg(&xvm->xvm_evicting, 0, 1))
+			continue;
+
+		list_del(&xvm->xvm_list);
+		list_add(&xvm->xvm_list, &process);
+
+		if (++n > 4)
+			break;
+	}
+	mutex_unlock(&elem->reap_lock);
+
+	mpc_reap_evict(&process);
+
+	mutex_lock(&elem->reap_lock);
+	list_splice_tail(&process, list);
+	mutex_unlock(&elem->reap_lock);
+
+	usleep_range(300, 700);
+}
+
+static void mpc_reap_run(struct work_struct *work)
+{
+	struct mpc_reap_elem *elem;
+	struct mpc_reap *reap;
+
+	elem = container_of(work, struct mpc_reap_elem, reap_work);
+	reap = elem->reap_reap;
+
+	while (atomic_read(&reap->reap_lwm))
+		mpc_reap_scan(elem);
+
+	atomic_cmpxchg(&elem->reap_running, 1, 0);
+}
+
+/**
+ * mpc_reap_tune() - Dynamic tuning of reap knobs.
+ * @reap:
+ */
+static void mpc_reap_tune(struct mpc_reap *reap)
+{
+	ulong total_pages, hpages, wpages, cpages, mfree;
+	uint freepct, hwm, lwm, ttl, debug, i;
+
+	hpages = wpages = cpages = 0;
+
+	/*
+	 * Take a live snapshot of the current memory usage.  Disable
+	 * preemption so that the result is reasonably accurate.
+	 */
+	preempt_disable();
+	mpc_reap_meminfo(&mfree, NULL, PAGE_SHIFT);
+
+	for (i = 0; i < REAP_ELEM_MAX; ++i) {
+		struct mpc_reap_elem *elem = &reap->reap_elem[i];
+
+		hpages += atomic64_read(&elem->reap_hpages);
+		wpages += atomic64_read(&elem->reap_wpages);
+		cpages += atomic64_read(&elem->reap_cpages);
+	}
+	preempt_enable();
+
+	total_pages = mfree + hpages + wpages + cpages;
+
+	/*
+	 * Determine the current percentage of free memory relative to the
+	 * number of hot+warm+cold pages tracked by the reaper.  freepct,
+	 * lwm, and hwm are scaled to 10000 for finer resolution.
+	 */
+	freepct = ((hpages + wpages + cpages) * 10000) / total_pages;
+	freepct = 10000 - freepct;
+
+	lwm = (100 - reap->reap_mempct) * 100;
+	hwm = (lwm * 10300) / 10000;
+	hwm = min_t(u32, hwm, 9700);
+	ttl = reap->reap_ttl;
+
+	if (freepct >= hwm) {
+		if (atomic_read(&reap->reap_ttl_cur) != ttl)
+			atomic_set(&reap->reap_ttl_cur, ttl);
+		if (atomic_read(&reap->reap_lwm))
+			atomic_set(&reap->reap_lwm, 0);
+	} else if (freepct < lwm || atomic_read(&reap->reap_lwm) > 0) {
+		ulong x = 10000 - (freepct * 10000) / hwm;
+
+		if (atomic_read(&reap->reap_lwm) != x) {
+			atomic_set(&reap->reap_lwm, x);
+
+			x = (ttl * (500ul * 500)) / (x * x);
+			if (x > ttl)
+				x = ttl;
+
+			atomic_set(&reap->reap_ttl_cur, x);
+		}
+	}
+
+	debug = reap->reap_debug;
+	if (!debug || (debug == 1 && freepct > hwm))
+		return;
+
+	if (atomic_inc_return(&reap->reap_emit) % REAP_ELEM_MAX > 0)
+		return;
+
+	mp_pr_info(
+		"%lu %lu, hot %lu, warm %lu, cold %lu, freepct %u, lwm %u, hwm %u, %2u, ttl %u",
+		mfree >> (20 - PAGE_SHIFT), total_pages >> (20 - PAGE_SHIFT),
+		hpages >> (20 - PAGE_SHIFT), wpages >> (20 - PAGE_SHIFT),
+		cpages >> (20 - PAGE_SHIFT), freepct, lwm, hwm,
+		atomic_read(&reap->reap_lwm), atomic_read(&reap->reap_ttl_cur) / 1000);
+}
+
+static void mpc_reap_prune(struct work_struct *work)
+{
+	struct mpc_reap_elem *elem;
+	struct mpc_xvm *xvm, *next;
+	struct mpc_reap *reap;
+	struct list_head freeme;
+	uint nfreed, eidx;
+	ulong delay;
+
+	reap = container_of(work, struct mpc_reap, reap_dwork.work);
+
+	/*
+	 * First, assess the current memory situation.  If free
+	 * memory is below the low watermark then try to start a
+	 * reaper to evict some pages.
+	 */
+	mpc_reap_tune(reap);
+
+	if (atomic_read(&reap->reap_lwm)) {
+		eidx = atomic_read(&reap->reap_eidx) % REAP_ELEM_MAX;
+		elem = reap->reap_elem + eidx;
+
+		if (!atomic_cmpxchg(&elem->reap_running, 0, 1))
+			queue_work(reap->reap_wq, &elem->reap_work);
+	}
+
+	/* Next, advance to the next elem and prune VMAs that have been freed. */
+	eidx = atomic_inc_return(&reap->reap_eidx) % REAP_ELEM_MAX;
+
+	elem = reap->reap_elem + eidx;
+	INIT_LIST_HEAD(&freeme);
+
+	nfreed = atomic_read(&elem->reap_nfreed);
+
+	if (nfreed && mutex_trylock(&elem->reap_lock)) {
+		struct list_head   *list = &elem->reap_list;
+		uint                npruned = 0;
+
+		list_for_each_entry_safe(xvm, next, list, xvm_list) {
+			if (atomic_read(&xvm->xvm_reapref) > 1)
+				continue;
+
+			list_del(&xvm->xvm_list);
+			list_add_tail(&xvm->xvm_list, &freeme);
+
+			if (++npruned >= nfreed)
+				break;
+		}
+		mutex_unlock(&elem->reap_lock);
+
+		list_for_each_entry_safe(xvm, next, &freeme, xvm_list)
+			mpc_xvm_free(xvm);
+
+		atomic_sub(npruned, &elem->reap_nfreed);
+	}
+
+	delay = reap->reap_mempct < 100 ? 1000 / REAP_ELEM_MAX : 1000;
+	delay = msecs_to_jiffies(delay);
+
+	queue_delayed_work(reap->reap_wq, &reap->reap_dwork, delay);
+}
+
+#define REAP_MEMPCT_MIN    5
+#define REAP_MEMPCT_MAX    100
+#define REAP_TTL_MIN       100
+#define REAP_DEBUG_MAX     3
+
+static ssize_t mpc_reap_mempct_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_reap(dev)->reap_mempct);
+}
+
+static ssize_t mpc_reap_mempct_store(struct device *dev, struct device_attribute *da,
+				     const char *buf, size_t count)
+{
+	struct mpc_reap *reap;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buf, 10, &val);
+	if (rc || (val < REAP_MEMPCT_MIN || val > REAP_MEMPCT_MAX))
+		return -EINVAL;
+
+	reap = dev_to_reap(dev);
+	reap->reap_mempct = val;
+
+	return count;
+}
+
+static ssize_t mpc_reap_debug_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_reap(dev)->reap_debug);
+}
+
+static ssize_t mpc_reap_debug_store(struct device *dev, struct device_attribute *da,
+				    const char *buf, size_t count)
+{
+	struct mpc_reap *reap;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buf, 10, &val);
+	if (rc || val > REAP_DEBUG_MAX)
+		return -EINVAL;
+
+	reap = dev_to_reap(dev);
+	reap->reap_debug = val;
+
+	return count;
+}
+
+static ssize_t mpc_reap_ttl_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_reap(dev)->reap_ttl);
+}
+
+static ssize_t mpc_reap_ttl_store(struct device *dev, struct device_attribute *da,
+				  const char *buf, size_t count)
+{
+	struct mpc_reap *reap;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buf, 10, &val);
+	if (rc || val < REAP_TTL_MIN)
+		return -EINVAL;
+
+	reap = dev_to_reap(dev);
+	reap->reap_ttl = val;
+
+	return count;
+}
+
+void mpc_reap_params_add(struct device_attribute *dattr)
+{
+	MPC_ATTR_RW(dattr++, reap_mempct);
+	MPC_ATTR_RW(dattr++, reap_debug);
+	MPC_ATTR_RW(dattr, reap_ttl);
+}
+
+static void mpc_reap_mempct_init(struct mpc_reap *reap)
+{
+	ulong mavail;
+	uint pct = 60;
+
+	mpc_reap_meminfo(NULL, &mavail, 30);
+
+	if (mavail > 256)
+		pct = 10;
+	else if (mavail > 128)
+		pct = 13;
+	else if (mavail > 64)
+		pct = 25;
+	else if (mavail > 32)
+		pct = 40;
+
+	reap->reap_mempct = clamp_t(unsigned int, 100 - pct, 1, 100);
+}
+
+/**
+ * mpc_reap_create() - Allocate and initialize reap data structures
+ * @reapp: ptr to return the initialized reap structure
+ *
+ * Returns -ENOMEM if the allocation fails.
+ */
+int mpc_reap_create(struct mpc_reap **reapp)
+{
+	struct mpc_reap_elem *elem;
+	struct mpc_reap *reap;
+	uint flags, i;
+
+	flags = WQ_UNBOUND | WQ_HIGHPRI | WQ_CPU_INTENSIVE;
+	*reapp = NULL;
+
+	reap = kzalloc(roundup_pow_of_two(sizeof(*reap)), GFP_KERNEL);
+	if (!reap)
+		return -ENOMEM;
+
+	reap->reap_wq = alloc_workqueue("mpc_reap", flags, REAP_ELEM_MAX + 1);
+	if (!reap->reap_wq) {
+		kfree(reap);
+		return -ENOMEM;
+	}
+
+	atomic_set(&reap->reap_lwm, 0);
+	atomic_set(&reap->reap_ttl_cur, 0);
+	atomic_set(&reap->reap_eidx, 0);
+	atomic_set(&reap->reap_emit, 0);
+
+	for (i = 0; i < REAP_ELEM_MAX; ++i) {
+		elem = &reap->reap_elem[i];
+
+		mutex_init(&elem->reap_lock);
+		INIT_LIST_HEAD(&elem->reap_list);
+
+		INIT_WORK(&elem->reap_work, mpc_reap_run);
+		atomic_set(&elem->reap_running, 0);
+		elem->reap_reap = reap;
+
+		atomic64_set(&elem->reap_hpages, 0);
+		atomic64_set(&elem->reap_wpages, 0);
+		atomic64_set(&elem->reap_cpages, 0);
+		atomic_set(&elem->reap_nfreed, 0);
+	}
+
+	reap->reap_ttl   = 10 * 1000 * 1000;
+	reap->reap_debug = 0;
+	mpc_reap_mempct_init(reap);
+
+	INIT_DELAYED_WORK(&reap->reap_dwork, mpc_reap_prune);
+	queue_delayed_work(reap->reap_wq, &reap->reap_dwork, 1);
+
+	*reapp = reap;
+
+	return 0;
+}
+
+void mpc_reap_destroy(struct mpc_reap *reap)
+{
+	struct mpc_reap_elem *elem;
+	int i;
+
+	if (!reap)
+		return;
+
+	cancel_delayed_work_sync(&reap->reap_dwork);
+
+	/*
+	 * There shouldn't be any reapers running at this point,
+	 * but perform a flush/wait for good measure...
+	 */
+	atomic_set(&reap->reap_lwm, 0);
+
+	if (reap->reap_wq)
+		flush_workqueue(reap->reap_wq);
+
+	for (i = 0; i < REAP_ELEM_MAX; ++i) {
+		elem = &reap->reap_elem[i];
+
+		ASSERT(atomic64_read(&elem->reap_hpages) == 0);
+		ASSERT(atomic64_read(&elem->reap_wpages) == 0);
+		ASSERT(atomic64_read(&elem->reap_cpages) == 0);
+		ASSERT(atomic_read(&elem->reap_nfreed) == 0);
+		ASSERT(list_empty(&elem->reap_list));
+
+		mutex_destroy(&elem->reap_lock);
+	}
+
+	if (reap->reap_wq)
+		destroy_workqueue(reap->reap_wq);
+	kfree(reap);
+}
+
+void mpc_reap_xvm_add(struct mpc_reap *reap, struct mpc_xvm *xvm)
+{
+	struct mpc_reap_elem *elem;
+	uint idx;
+
+	if (!reap || !xvm)
+		return;
+
+	if (xvm->xvm_advice == MPC_VMA_PINNED)
+		return;
+
+	/* Acquire a reference on xvm for the reaper... */
+	atomic_inc(&xvm->xvm_reapref);
+	xvm->xvm_reap = reap;
+
+	idx = (get_cycles() >> 1) % REAP_ELEM_MAX;
+
+	elem = &reap->reap_elem[idx];
+
+	mutex_lock(&elem->reap_lock);
+	xvm->xvm_freedp = &elem->reap_nfreed;
+
+	if (xvm->xvm_advice == MPC_VMA_HOT)
+		xvm->xvm_hcpagesp = &elem->reap_hpages;
+	else if (xvm->xvm_advice == MPC_VMA_WARM)
+		xvm->xvm_hcpagesp = &elem->reap_wpages;
+	else
+		xvm->xvm_hcpagesp = &elem->reap_cpages;
+
+	list_add_tail(&xvm->xvm_list, &elem->reap_list);
+	mutex_unlock(&elem->reap_lock);
+}
+
+void mpc_reap_xvm_evict(struct mpc_xvm *xvm)
+{
+	pgoff_t start, end, bktsz;
+
+	if (atomic_cmpxchg(&xvm->xvm_evicting, 0, 1))
+		return;
+
+	start = mpc_xvm_pgoff(xvm);
+	end = mpc_xvm_pglen(xvm) + start;
+	bktsz = xvm->xvm_bktsz >> PAGE_SHIFT;
+
+	if (bktsz < 1024)
+		bktsz = end - start;
+
+	/* Evict in chunks to improve mmap_sem interleaving... */
+	for (; start < end; start += bktsz)
+		invalidate_inode_pages2_range(xvm->xvm_mapping, start, start + bktsz);
+
+	atomic_cmpxchg(&xvm->xvm_evicting, 1, 0);
+}
+
+void mpc_reap_xvm_touch(struct mpc_xvm *xvm, int index)
+{
+	struct mpc_reap *reap;
+	atomic64_t *atimep;
+	uint mbnum, lwm;
+	pgoff_t offset;
+	ulong delay;
+	u64 now;
+
+	reap = xvm->xvm_reap;
+	if (!reap)
+		return;
+
+	offset = (index << PAGE_SHIFT) % (1ul << xvm_size_max);
+	mbnum = offset / xvm->xvm_bktsz;
+
+	atimep = &xvm->xvm_mbinfov[mbnum].mbatime;
+	now = local_clock();
+
+	/*
+	 * Don't update atime too frequently.  If we set atime to
+	 * U64_MAX in mpc_reap_evict_vma() then the addition here
+	 * will roll over and atime will be updated.
+	 */
+	if (atomic64_read(atimep) + (10 * USEC_PER_SEC) < now)
+		atomic64_set(atimep, now);
+
+	/* Sleep a bit if the reaper is having trouble meeting the free memory target. */
+	lwm = atomic_read(&reap->reap_lwm);
+	if (lwm < 3333)
+		return;
+
+	delay = 500000 / (10001 - lwm) - (500000 / 10001);
+	delay = min_t(ulong, delay, 3000);
+
+	usleep_range(delay, delay * 2);
+}
+
+bool mpc_reap_xvm_duress(struct mpc_xvm *xvm)
+{
+	struct mpc_reap *reap;
+	uint lwm;
+
+	if (xvm->xvm_advice == MPC_VMA_HOT)
+		return false;
+
+	reap = xvm->xvm_reap;
+	if (!reap)
+		return false;
+
+	lwm = atomic_read(&reap->reap_lwm);
+	if (lwm < 1500)
+		return false;
+
+	if (lwm > 3000)
+		return true;
+
+	return (xvm->xvm_advice == MPC_VMA_COLD);
+}
diff --git a/drivers/mpool/reaper.h b/drivers/mpool/reaper.h
new file mode 100644
index 000000000000..d3af6aef918d
--- /dev/null
+++ b/drivers/mpool/reaper.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc.  All rights reserved.
+ */
+
+#ifndef MPOOL_REAPER_H
+#define MPOOL_REAPER_H
+
+#define MPC_REAP_PARAMS_CNT    3
+
+struct mpc_reap;
+struct mpc_xvm;
+
+/**
+ * mpc_reap_create() - Allocate and initialize reap data structures
+ * @reapp: Ptr to initialized reap structure.
+ *
+ * Return: -ENOMEM if the allocation fails.
+ */
+int mpc_reap_create(struct mpc_reap **reapp);
+
+/**
+ * mpc_reap_destroy() - Destroy the given reaper
+ * @reap: reaper instance to destroy
+ */
+void mpc_reap_destroy(struct mpc_reap *reap);
+
+/**
+ * mpc_reap_xvm_add() - Add an extended VMA to the reap list
+ * @xvm: extended VMA
+ */
+void mpc_reap_xvm_add(struct mpc_reap *reap, struct mpc_xvm *xvm);
+
+/**
+ * mpc_reap_xvm_evict() - Evict all pages of given extended VMA
+ * @xvm: extended VMA
+ */
+void mpc_reap_xvm_evict(struct mpc_xvm *xvm);
+
+/**
+ * mpc_reap_xvm_touch() - Update extended VMA mblock atime
+ * @xvm:    extended VMA
+ * @index:  valid page index within the extended VMA
+ *
+ * Update the access time stamp of the mblock given by the valid
+ * page @index within the VMA.  Might sleep for some number of
+ * microseconds if the reaper is under duress (i.e., the more
+ * urgent the duress the longer the sleep).
+ *
+ * This function is called only by mpc_vm_fault_impl(), once
+ * for each successful page fault.
+ */
+void mpc_reap_xvm_touch(struct mpc_xvm *xvm, int index);
+
+/**
+ * mpc_reap_xvm_duress() - Check to see if reaper is under duress
+ * @xvm:   extended VMA
+ *
+ * Return: %false if the VMA is marked MPC_VMA_HOT.
+ * Return: %false if the reaper is not enabled or not under duress.
+ * Return: %true or %false otherwise, depending upon the urgency of the
+ * duress and the VMA advice (MPC_VMA_WARM or MPC_VMA_COLD).
+ *
+ * This function is called only by mpc_readpages() to decide whether
+ * or not to reduce the size of a speculative readahead request.
+ */
+bool mpc_reap_xvm_duress(struct mpc_xvm *xvm);
+
+void mpc_reap_params_add(struct device_attribute *dattr);
+
+#endif /* MPOOL_REAPER_H */
-- 
2.17.2


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 21/22] mpool: add documentation
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (19 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 20/22] mpool: add support to proactively evict cached mblock data from the page-cache Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-12 16:53   ` Randy Dunlap
  2020-10-12 16:27 ` [PATCH v2 22/22] mpool: add Kconfig and Makefile Nabeel M Mohamed
  2020-10-15  8:02 ` [PATCH v2 00/22] add Object Storage Media Pool (mpool) Christoph Hellwig
  22 siblings, 1 reply; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds locking hierarchy documentation for mpool and
updates ioctl-number.rst with mpool driver's ioctl code.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |  3 +-
 drivers/mpool/mpool-locking.rst               | 90 +++++++++++++++++++
 2 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 drivers/mpool/mpool-locking.rst

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 2a198838fca9..1928606ff447 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -97,7 +97,8 @@ Code  Seq#    Include File                                           Comments
 '&'   00-07  drivers/firewire/nosy-user.h
 '1'   00-1F  linux/timepps.h                                         PPS kit from Ulrich Windl
                                                                      <ftp://ftp.de.kernel.org/pub/linux/daemons/ntp/PPS/>
-'2'   01-04  linux/i2o.h
+'2'   01-04  linux/i2o.h                                             conflict!
+'2'   00-8F  drivers/mpool/mpool_ioctl.h                             conflict!
 '3'   00-0F  drivers/s390/char/raw3270.h                             conflict!
 '3'   00-1F  linux/suspend_ioctls.h,                                 conflict!
              kernel/power/user.c
diff --git a/drivers/mpool/mpool-locking.rst b/drivers/mpool/mpool-locking.rst
new file mode 100644
index 000000000000..6a5da727f2fb
--- /dev/null
+++ b/drivers/mpool/mpool-locking.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=============
+Mpool Locking
+=============
+
+Hierarchy
+---------
+::
+
+  mpool_s_lock
+  pmd_s_lock
+  eld_rwlock          object layout r/w lock (per layout)
+  pds_oml_lock        "open mlog" rbtree lock
+  mdi_slotvlock
+  mmi_uqlock          unique ID generator lock
+  mmi_compactlock     compaction lock (per MDC)
+  mmi_uc_lock         uncommitted objects rbtree lock (per MDC)
+  mmi_co_lock         committed objects rbtree lock (per MDC)
+  pds_pdvlock
+  pdi_rmlock[]
+  sda_dalock
+
+Nesting
+-------
+
+There are three nesting levels for mblocks, mlogs, and mpcore's own
+metadata containers (MDCs):
+
+1. PMD_OBJ_CLIENT for client mblocks and mlogs.
+2. PMD_MDC_NORMAL for MDC-1/255 and their underlying mlog pairs.
+3. PMD_MDC_ZERO for MDC-0 and its underlying mlog pair.
+
+A thread of execution may obtain at most one instance of a given lock-class
+at each nesting level, and must do so in the order specified above.
+
+The following helper functions determine the nesting level and use the
+appropriate _nested() primitive or lock pool::
+
+  pmd_obj_rdlock() and _rdunlock()
+  pmd_obj_wrlock() and _wrunlock()
+  pmd_mdc_rdlock() and _rdunlock()
+  pmd_mdc_wrlock() and _wrunlock()
+  pmd_mdc_lock() and _unlock()
+
+For additional information on the _nested() primitives, see
+https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt.
+
+MDC Compaction Locking Patterns
+-------------------------------
+
+In addition to obeying the lock hierarchy and lock-class nesting levels, the
+following locking rules must also be followed for object layouts and all
+mpool properties stored in MDC-0 (e.g., the list of mpool drives pds_pdv[]).
+
+Object layouts (struct pmd_layout):
+
+- Readers must read-lock the layout using pmd_obj_rdlock().
+- Updaters must both write-lock the layout using pmd_obj_wrlock() and lock
+  the mmi_compactlock for the object's MDC using pmd_mdc_lock() before
+  first logging the update in that MDC and then updating the layout.
+
+Mpool properties stored in MDC-0:
+
+- Readers must read-lock the data structure(s) associated with the property.
+- Updaters must both write-lock the data structure(s) associated with the
+  property and lock the mmi_compactlock for MDC-0 using pmd_mdc_lock() before
+  first logging the update in MDC-0 and then updating the data structure(s).
+
+This locking pattern achieves the following:
+
+- For objects associated with a given MDC-0/255, layout readers can execute
+  concurrent with compacting that MDC, whereas layout updaters cannot.
+- For mpool properties stored in MDC-0, property readers can execute
+  concurrent with compacting MDC-0, whereas property updaters cannot.
+- To compact a given MDC-0/255, all in-memory and on-media state to be
+  written is frozen by simply locking the mmi_compactlock for that MDC
+  (because updates to the committed objects tree may take place only while
+  holding both the compaction mutex and the mmi_co_lock write lock).
+
+Furthermore, taking the mmi_compactlock does not reduce concurrency for
+object or property updaters because these are inherently serialized by the
+requirement to synchronously append log records in the associated MDC.
+
+Object Layout Reference Counts
+------------------------------
+
+The reference counts for an object layout (eld_ref) are protected
+by mmi_co_lock or mmi_uc_lock of the object's MDC dependiing upon
+which tree it is in at the time of acquisition.
-- 
2.17.2



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 22/22] mpool: add Kconfig and Makefile
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (20 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 21/22] mpool: add documentation Nabeel M Mohamed
@ 2020-10-12 16:27 ` Nabeel M Mohamed
  2020-10-15  8:02 ` [PATCH v2 00/22] add Object Storage Media Pool (mpool) Christoph Hellwig
  22 siblings, 0 replies; 35+ messages in thread
From: Nabeel M Mohamed @ 2020-10-12 16:27 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-nvme, linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker, Nabeel M Mohamed

This adds the Kconfig and Makefile for mpool.

Co-developed-by: Greg Becker <gbecker@micron.com>
Signed-off-by: Greg Becker <gbecker@micron.com>
Co-developed-by: Pierre Labat <plabat@micron.com>
Signed-off-by: Pierre Labat <plabat@micron.com>
Co-developed-by: John Groves <jgroves@micron.com>
Signed-off-by: John Groves <jgroves@micron.com>
Signed-off-by: Nabeel M Mohamed <nmeeramohide@micron.com>
---
 drivers/Kconfig        |  2 ++
 drivers/Makefile       |  1 +
 drivers/mpool/Kconfig  | 28 ++++++++++++++++++++++++++++
 drivers/mpool/Makefile | 11 +++++++++++
 4 files changed, 42 insertions(+)
 create mode 100644 drivers/mpool/Kconfig
 create mode 100644 drivers/mpool/Makefile

diff --git a/drivers/Kconfig b/drivers/Kconfig
index dcecc9f6e33f..547ac47a10eb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -235,4 +235,6 @@ source "drivers/interconnect/Kconfig"
 source "drivers/counter/Kconfig"
 
 source "drivers/most/Kconfig"
+
+source "drivers/mpool/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index c0cd1b9075e3..e2477288e761 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -188,3 +188,4 @@ obj-$(CONFIG_GNSS)		+= gnss/
 obj-$(CONFIG_INTERCONNECT)	+= interconnect/
 obj-$(CONFIG_COUNTER)		+= counter/
 obj-$(CONFIG_MOST)		+= most/
+obj-$(CONFIG_MPOOL)		+= mpool/
diff --git a/drivers/mpool/Kconfig b/drivers/mpool/Kconfig
new file mode 100644
index 000000000000..33380f497473
--- /dev/null
+++ b/drivers/mpool/Kconfig
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Object Storage Media Pool (mpool) configuration
+#
+
+config MPOOL
+	tristate "Object Storage Media Pool"
+	depends on BLOCK
+	default n
+	help
+	  This module implements a simple transactional object store on top of
+	  block storage devices.
+
+	  Mpool provides a high-performance alternative to file systems or
+	  raw block devices for applications that can benefit from its simple
+	  object storage model and unique features.
+
+	  If you want to use mpool, choose M here: the module will be called mpool.
+
+config MPOOL_ASSERT
+	bool "Object Storage Media Pool assert support"
+	depends on MPOOL
+	default n
+	help
+	  Enables runtime assertion checking for mpool.
+
	  This is a developer-only config. If this config is enabled and any of
	  the asserts trigger, it results in a panic.
diff --git a/drivers/mpool/Makefile b/drivers/mpool/Makefile
new file mode 100644
index 000000000000..374bbe5bcfa0
--- /dev/null
+++ b/drivers/mpool/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Makefile for Object Storage Media Pool (mpool)
+#
+
+obj-$(CONFIG_MPOOL) += mpool.o
+
+mpool-y             := init.o pd.o mclass.o smap.o omf.o \
+		       upgrade.o sb.o pmd_obj.o mblock.o  \
+		       mlog_utils.o mlog.o mdc.o mpcore.o pmd.o \
+		       mp.o mpctl.o sysfs.o mcache.o reaper.o
-- 
2.17.2



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 01/22] mpool: add utility routines and ioctl definitions
  2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
@ 2020-10-12 16:45   ` Randy Dunlap
  2020-10-12 16:48     ` Randy Dunlap
  0 siblings, 1 reply; 35+ messages in thread
From: Randy Dunlap @ 2020-10-12 16:45 UTC (permalink / raw)
  To: Nabeel M Mohamed, linux-kernel, linux-block, linux-nvme,
	linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker

On 10/12/20 9:27 AM, Nabeel M Mohamed wrote:
> +#define MPIOC_MAGIC             ('2')

Hi,

That value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.

thanks.
-- 
~Randy



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 01/22] mpool: add utility routines and ioctl definitions
  2020-10-12 16:45   ` Randy Dunlap
@ 2020-10-12 16:48     ` Randy Dunlap
  0 siblings, 0 replies; 35+ messages in thread
From: Randy Dunlap @ 2020-10-12 16:48 UTC (permalink / raw)
  To: Nabeel M Mohamed, linux-kernel, linux-block, linux-nvme,
	linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker

On 10/12/20 9:45 AM, Randy Dunlap wrote:
> On 10/12/20 9:27 AM, Nabeel M Mohamed wrote:
>> +#define MPIOC_MAGIC             ('2')
> 
> Hi,
> 
> That value should be documented in
> Documentation/userspace-api/ioctl/ioctl-number.rst.


Sorry, I see it now.

thanks.

-- 
~Randy



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 21/22] mpool: add documentation
  2020-10-12 16:27 ` [PATCH v2 21/22] mpool: add documentation Nabeel M Mohamed
@ 2020-10-12 16:53   ` Randy Dunlap
  0 siblings, 0 replies; 35+ messages in thread
From: Randy Dunlap @ 2020-10-12 16:53 UTC (permalink / raw)
  To: Nabeel M Mohamed, linux-kernel, linux-block, linux-nvme,
	linux-mm, linux-nvdimm
  Cc: plabat, smoyer, jgroves, gbecker


> diff --git a/drivers/mpool/mpool-locking.rst b/drivers/mpool/mpool-locking.rst
> new file mode 100644
> index 000000000000..6a5da727f2fb
> --- /dev/null
> +++ b/drivers/mpool/mpool-locking.rst
> @@ -0,0 +1,90 @@

> +Object Layout Reference Counts
> +------------------------------
> +
> +The reference counts for an object layout (eld_ref) are protected
> +by mmi_co_lock or mmi_uc_lock of the object's MDC dependiing upon

                                                     depending

> +which tree it is in at the time of acquisition.

-- 
~Randy



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
                   ` (21 preceding siblings ...)
  2020-10-12 16:27 ` [PATCH v2 22/22] mpool: add Kconfig and Makefile Nabeel M Mohamed
@ 2020-10-15  8:02 ` Christoph Hellwig
  2020-10-16 21:58   ` [EXT] " Nabeel Meeramohideen Mohamed (nmeeramohide)
  22 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2020-10-15  8:02 UTC (permalink / raw)
  To: Nabeel M Mohamed
  Cc: smoyer, plabat, linux-nvdimm, jgroves, linux-kernel, linux-nvme,
	linux-block, linux-mm, gbecker

I don't think this belongs into the kernel.  It is a classic case for
infrastructure that should be built in userspace.  If anything is
missing to implement it in userspace with equivalent performance we
need to improve our interfaces, although io_uring should cover pretty
much everything you need.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-15  8:02 ` [PATCH v2 00/22] add Object Storage Media Pool (mpool) Christoph Hellwig
@ 2020-10-16 21:58   ` Nabeel Meeramohideen Mohamed (nmeeramohide)
  2020-10-16 22:11     ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Nabeel Meeramohideen Mohamed (nmeeramohide) @ 2020-10-16 21:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Steve Moyer (smoyer), Pierre Labat (plabat),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, linux-block, linux-mm,
	Greg Becker (gbecker)

On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> I don't think this belongs into the kernel.  It is a classic case for
> infrastructure that should be built in userspace.  If anything is
> missing to implement it in userspace with equivalent performance we
> need to improve our interfaces, although io_uring should cover pretty
> much everything you need.

Hi Christoph,

We previously considered moving the mpool object store code to user-space.
However, by implementing mpool as a device driver, we get several benefits
in terms of scalability, performance, and functionality. In doing so, we relied
only on standard interfaces and did not make any changes to the kernel.

(1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
a collection of logically related objects with a single system call. The objects in
such a collection are created at different times, physically disparate, and may
even reside on different media class volumes.

For our HSE storage engine application, there are commonly 10's to 100's of
objects in a given mcache map, and 75,000 total objects mapped at a given time.

Compared to memory-mapping objects individually, the mcache map facility
scales well because it requires only a single system call and single vm_area_struct
to memory-map a complete collection of objects.

(2) The mcache map reaper mechanism proactively evicts object data from the page
cache based on object-level metrics. This provides significant performance benefit
for many workloads.

For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
against our HSE storage engine using the mpool driver in a 5.9 kernel.
For each workload, we ran with the reaper turned-on and turned-off.

For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
latency for reads by 39% and updates by 99%. For workload C, the reaper increased
throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
improvements are even more dramatic with earlier kernels.

(3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
created using the Zone Append command. This patch set does not support ZNS, but
that work is in progress and we will be demonstrating our HSE storage engine
running on mpool with ZNS drives at FMS 2020.

(4) mpool's immutable object model allows the driver to support concurrent reading
of object data directly and memory-mapped without a performance penalty to verify
coherence. This allows background operations, such as LSM-tree compaction, to
operate efficiently and without polluting the page cache.

(5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
convenient mechanism for controlling access to and managing the multiple storage
volumes, and in the future pmem devices, that may comprise a logical mpool.

Thanks,
Nabeel


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-16 21:58   ` [EXT] " Nabeel Meeramohideen Mohamed (nmeeramohide)
@ 2020-10-16 22:11     ` Dan Williams
  2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
  2020-10-21 14:24       ` Mike Snitzer
  0 siblings, 2 replies; 35+ messages in thread
From: Dan Williams @ 2020-10-16 22:11 UTC (permalink / raw)
  To: Nabeel Meeramohideen Mohamed (nmeeramohide)
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	Pierre Labat (plabat), Greg Becker (gbecker)

On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > I don't think this belongs into the kernel.  It is a classic case for
> > infrastructure that should be built in userspace.  If anything is
> > missing to implement it in userspace with equivalent performance we
> > need to improve our interfaces, although io_uring should cover pretty
> > much everything you need.
>
> Hi Christoph,
>
> We previously considered moving the mpool object store code to user-space.
> However, by implementing mpool as a device driver, we get several benefits
> in terms of scalability, performance, and functionality. In doing so, we relied
> only on standard interfaces and did not make any changes to the kernel.
>
> (1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
> a collection of logically related objects with a single system call. The objects in
> such a collection are created at different times, physically disparate, and may
> even reside on different media class volumes.
>
> For our HSE storage engine application, there are commonly 10's to 100's of
> objects in a given mcache map, and 75,000 total objects mapped at a given time.
>
> Compared to memory-mapping objects individually, the mcache map facility
> scales well because it requires only a single system call and single vm_area_struct
> to memory-map a complete collection of objects.

Why can't that be a batch of mmap calls on io_uring?

> (2) The mcache map reaper mechanism proactively evicts object data from the page
> cache based on object-level metrics. This provides significant performance benefit
> for many workloads.
>
> For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
> against our HSE storage engine using the mpool driver in a 5.9 kernel.
> For each workload, we ran with the reaper turned-on and turned-off.
>
> For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> improvements are even more dramatic with earlier kernels.

What metrics proved useful and can the vanilla page cache / page
reclaim mechanism be augmented with those metrics?

>
> (3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
> created using the Zone Append command. This patch set does not support ZNS, but
> that work is in progress and we will be demonstrating our HSE storage engine
> running on mpool with ZNS drives at FMS 2020.
>
> (4) mpool's immutable object model allows the driver to support concurrent reading
> of object data directly and memory-mapped without a performance penalty to verify
> coherence. This allows background operations, such as LSM-tree compaction, to
> operate efficiently and without polluting the page cache.
>

How is this different than existing background operations / defrag
that filesystems perform today? Where are the opportunities to improve
those operations?

> (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> convenient mechanism for controlling access to and managing the multiple storage
> volumes, and in the future pmem devices, that may comprise a logical mpool.

Christoph and I have talked about replacing the pmem driver's
dependence on device-mapper for pooling. What extensions would be
needed for the existing driver arch?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-16 22:11     ` Dan Williams
@ 2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
  2020-10-20 21:35         ` Dan Williams
  2020-10-21 14:24       ` Mike Snitzer
  1 sibling, 1 reply; 35+ messages in thread
From: Nabeel Meeramohideen Mohamed (nmeeramohide) @ 2020-10-19 22:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	Pierre Labat (plabat), Greg Becker (gbecker)

Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> <hch@infradead.org> wrote:
> > > I don't think this belongs into the kernel.  It is a classic case for
> > > infrastructure that should be built in userspace.  If anything is
> > > missing to implement it in userspace with equivalent performance we
> > > need to improve our interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we relied
> > only on standard interfaces and did not make any changes to the kernel.
> >
> > (1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The objects in
> > such a collection are created at different times, physically disparate, and may
> > even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given
> time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and single
> vm_area_struct
> > to memory-map a complete collection of objects.

> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
system call overhead of memory-mapping individual objects, versus our mcache map
mechanism. However, there is still the scalability issue of having a vm_area_struct
for each object (versus one for each mcache map).

We ran YCSB workload C in two different configurations -
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab -
24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2,
possibly due to the reduced complexity of searching VMAs during page faults.

> > (2) The mcache map reaper mechanism proactively evicts object data from the
> page
> > cache based on object-level metrics. This provides significant performance
> benefit
> > for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned-on and turned-off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing 99.99%
> tail
> > latency for reads by 39% and updates by 99%. For workload C, the reaper
> increased
> > throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> > improvements are even more dramatic with earlier kernels.

> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related immutable objects
with similar lifetimes. It is best suited for storage applications that run queries against
organized collections of immutable objects, such as storage engines and DBs based on
SSTables.

Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it is
left to the application to tag it appropriately. For our HSE storage engine application,
the SSTables in the root/intermediate levels act as a routing table to redirect queries to
an appropriate leaf-level SSTable, so the mcache maps corresponding to the
root/intermediate-level SSTables can be tagged as pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map. Under memory
pressure, the access time is compared to a time-to-live metric that’s set based on the
map’s temperature, how close free memory is to the low and high watermarks, and so on.
If the object was last accessed outside the TTL window, its pages are evicted from the
page cache.

We also apply a few other techniques, such as throttling readahead and adding a delay
in the page fault handler, to avoid overwhelming the page cache during memory pressure.

In the workloads that we run, we have noticed stalls when kswapd does the reclaim,
which impacts throughput and tail latencies as described in our last email. The mcache
reaper runs proactively and can make better reclaim decisions because it is designed to
address a specific class of workloads.

We doubt the same mechanisms could be employed in the vanilla page cache, as it is
designed to work for a wide variety of workloads.

> > (4) mpool's immutable object model allows the driver to support concurrent
> reading
> > of object data directly and memory-mapped without a performance penalty to
> verify
> > coherence. This allows background operations, such as LSM-tree compaction,
> to
> > operate efficiently and without polluting the page cache.

> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven’t measured the benefit of eliminating the coherence check, which isn’t needed
in our case because objects are immutable. However, the open(2) documentation states
that “applications should avoid mixing mmap(2) of files with direct I/O to
the same files”, which is effectively what we are doing when we directly read from an
object that is also in an mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file
> provides a
> > convenient mechanism for controlling access to and managing the multiple
> storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.

> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn’t extend any existing driver architecture to manage multiple storage volumes.

Mpool implements the concept of media classes, where each media class corresponds
to a different storage volume. Clients specify a media class when creating an object in
an mpool. mpool currently supports only two media classes: “capacity” for storing the
bulk of the objects, backed by, for instance, QLC SSDs, and “staging” for storing objects
requiring lower latency/higher throughput, backed by, for instance, 3DXP SSDs.

An mpool is accessed via the /dev/mpool/<mpool-name> device file and the
mpool descriptor attached to this device file instance tracks all its associated media
class volumes. mpool relies on device mapper to provide physical device aggregation
within a media class volume.
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

* Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
@ 2020-10-20 21:35         ` Dan Williams
  2020-10-21 17:10           ` Nabeel Meeramohideen Mohamed (nmeeramohide)
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2020-10-20 21:35 UTC (permalink / raw)
  To: Nabeel Meeramohideen Mohamed (nmeeramohide)
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	Pierre Labat (plabat), Greg Becker (gbecker)

On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> Hi Dan,
>
> On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> > >
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> > <hch@infradead.org> wrote:
> > > > I don't think this belongs into the kernel.  It is a classic case for
> > > > infrastructure that should be built in userspace.  If anything is
> > > > missing to implement it in userspace with equivalent performance we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > >
> > > Hi Christoph,
> > >
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > >
> > > (1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, physically disparate, and may
> > > even reside on different media class volumes.
> > >
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given
> > time.
> > >
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and single
> > vm_area_struct
> > > to memory-map a complete collection of objects.
>
> > Why can't that be a batch of mmap calls on io_uring?
>
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
>
> We ran YCSB workload C in two different configurations -
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
>
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab -
> 24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.
>
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2,
> possibly due to the reduced complexity of searching VMAs during page faults.

So this gets to the meta question that is giving me pause on this
whole proposal:

    What does Linux get from merging mpool?

What you have above is a decent scalability bug report. That type of
pressure to meet new workload needs is how Linux interfaces evolve.
However, rather than evolve those interfaces mpool is a revolutionary
replacement that leaves the bugs intact for everyone that does not
switch over to mpool.

Consider io_uring as an example where the kernel resisted trends
towards userspace I/O engines and instead evolved a solution that
maintained kernel control while also achieving similar performance
levels.

The exercise is useful to identify places where Linux has
deficiencies, but wholesale replacing an entire I/O submission model
is a direction that leaves the old apis to rot.


* Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-16 22:11     ` Dan Williams
  2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
@ 2020-10-21 14:24       ` Mike Snitzer
  2020-10-21 16:24         ` Dan Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Snitzer @ 2020-10-21 14:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	device-mapper development,
	Nabeel Meeramohideen Mohamed (nmeeramohide),
	Pierre Labat (plabat), Greg Becker (gbecker)

Hey Dan,

On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling.

Was this discussion done publicly or private?  If public please share
a pointer to the thread.

I'd really like to understand the problem statement that is leading to
pursuing a pmem native alternative to existing DM.

Thanks,
Mike


* Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-21 14:24       ` Mike Snitzer
@ 2020-10-21 16:24         ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2020-10-21 16:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	device-mapper development,
	Nabeel Meeramohideen Mohamed (nmeeramohide),
	Pierre Labat (plabat), Greg Becker (gbecker)

On Wed, Oct 21, 2020 at 7:24 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> Hey Dan,
>
> On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > > convenient mechanism for controlling access to and managing the multiple storage
> > > volumes, and in the future pmem devices, that may comprise a logical mpool.
> >
> > Christoph and I have talked about replacing the pmem driver's
> > dependence on device-mapper for pooling.
>
> Was this discussion done publicly or private?  If public please share
> a pointer to the thread.
>
> I'd really like to understand the problem statement that is leading to
> pursuing a pmem native alternative to existing DM.
>

IIRC it was during the hallway track at a conference. Some of the
concern is the flexibility to carve physical address space but not
attach a block-device in front of it, and allow pmem/dax-capable
filesystems to mount on something other than a block-device.

DM does fit the bill for block-device concatenation and striping, but
there's some pressure to have a level of provisioning beneath that.

The device-dax facility has already started to grow some physical
address space partitioning capabilities this cycle (see commit
60e93dc097f7, "device-dax: add dis-contiguous resource support"), and
the question becomes: if/when that support needs to extend across
regions, is DM the right tool for that?


* RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-20 21:35         ` Dan Williams
@ 2020-10-21 17:10           ` Nabeel Meeramohideen Mohamed (nmeeramohide)
  2020-10-21 17:48             ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Nabeel Meeramohideen Mohamed (nmeeramohide) @ 2020-10-21 17:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	Pierre Labat (plabat), Greg Becker (gbecker)

On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
>     What does Linux get from merging mpool?
> 

What Linux gets from merging mpool is a generic object store target with some
unique and beneficial features:

- the ability to allocate objects from multiple classes of media
- facilities to memory-map (and unmap) collections of related objects with similar
lifetimes in a single call
- proactive eviction of object data from the page cache which takes into account
these object relationships and lifetimes
- concurrent access to object data directly and memory mapped to eliminate
page cache pollution from background operations
- a management model that is intentionally patterned after LVM so as to feel
familiar to Linux users

The HSE storage engine, which is built on mpool, consistently demonstrates
throughputs and latencies in real-world applications that are multiples better
than common alternatives.  We believe this represents a concrete example of
the benefits of the mpool object store.

That said, we are very open to ideas on how we can improve the mpool
implementation to be better aligned with existing Linux I/O mechanisms.

Thanks,
Nabeel

* Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
  2020-10-21 17:10           ` Nabeel Meeramohideen Mohamed (nmeeramohide)
@ 2020-10-21 17:48             ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2020-10-21 17:48 UTC (permalink / raw)
  To: Nabeel Meeramohideen Mohamed (nmeeramohide)
  Cc: linux-block, Steve Moyer (smoyer),
	linux-nvdimm, John Groves (jgroves),
	linux-kernel, linux-nvme, Christoph Hellwig, linux-mm,
	Pierre Labat (plabat), Greg Becker (gbecker)

On Wed, Oct 21, 2020 at 10:11 AM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> >     What does Linux get from merging mpool?
> >
>
> What Linux gets from merging mpool is a generic object store target with some
> unique and beneficial features:

I'll try to make the point a different way. Mpool points out places
where the existing apis fail to scale. Rather than attempt to fix that
problem it proposes to replace the old apis. However, the old apis are
still there. So now upstream has 2 maintenance burdens when it could
have just had one. So when I ask "what does Linux get" it is in
reference to the fact that Linux gets a compounded maintenance problem
and whether the benefits of mpool outweigh that burden. Historically
Linux has been able to evolve to meet the scaling requirements of new
applications, so I am asking whether you have tried to solve the
application problem by evolving rather than replacing existing
infrastructure? The bar to justify replacing rather than evolving is
high, because that's how core Linux stays relevant.


Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
2020-10-12 16:45   ` Randy Dunlap
2020-10-12 16:48     ` Randy Dunlap
2020-10-12 16:27 ` [PATCH v2 02/22] mpool: add in-memory struct definitions Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 03/22] mpool: add on-media " Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 04/22] mpool: add pool drive component which handles mpool IO using the block layer API Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 05/22] mpool: add space map component which manages free space on mpool devices Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 06/22] mpool: add on-media pack, unpack and upgrade routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 07/22] mpool: add superblock management routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 08/22] mpool: add pool metadata routines to manage object lifecycle and IO Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 09/22] mpool: add mblock lifecycle management and IO routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 10/22] mpool: add mlog IO utility routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 11/22] mpool: add mlog lifecycle management and IO routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 12/22] mpool: add metadata container or mlog-pair framework Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 13/22] mpool: add utility routines for mpool lifecycle management Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 14/22] mpool: add pool metadata routines to create persistent mpools Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 15/22] mpool: add mpool lifecycle management routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 16/22] mpool: add mpool control plane utility routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 17/22] mpool: add mpool lifecycle management ioctls Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 18/22] mpool: add object " Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 19/22] mpool: add support to mmap arbitrary collection of mblocks Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 20/22] mpool: add support to proactively evict cached mblock data from the page-cache Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 21/22] mpool: add documentation Nabeel M Mohamed
2020-10-12 16:53   ` Randy Dunlap
2020-10-12 16:27 ` [PATCH v2 22/22] mpool: add Kconfig and Makefile Nabeel M Mohamed
2020-10-15  8:02 ` [PATCH v2 00/22] add Object Storage Media Pool (mpool) Christoph Hellwig
2020-10-16 21:58   ` [EXT] " Nabeel Meeramohideen Mohamed (nmeeramohide)
2020-10-16 22:11     ` Dan Williams
2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
2020-10-20 21:35         ` Dan Williams
2020-10-21 17:10           ` Nabeel Meeramohideen Mohamed (nmeeramohide)
2020-10-21 17:48             ` Dan Williams
2020-10-21 14:24       ` Mike Snitzer
2020-10-21 16:24         ` Dan Williams
