linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Nabeel Meeramohideen Mohamed (nmeeramohide)"  <nmeeramohide@micron.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"Steve Moyer (smoyer)" <smoyer@micron.com>,
	"Greg Becker (gbecker)" <gbecker@micron.com>,
	"Pierre Labat (plabat)" <plabat@micron.com>,
	"John Groves (jgroves)" <jgroves@micron.com>
Subject: RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
Date: Fri, 16 Oct 2020 21:58:32 +0000	[thread overview]
Message-ID: <SN6PR08MB420880574E0705BBC80EC1A3B3030@SN6PR08MB4208.namprd08.prod.outlook.com> (raw)
In-Reply-To: <20201015080254.GA31136@infradead.org>

On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> I don't think this belongs into the kernel.  It is a classic case for
> infrastructure that should be built in userspace.  If anything is
> missing to implement it in userspace with equivalent performance we
> need to improve out interfaces, although io_uring should cover pretty
> much everything you need.

Hi Christoph,

We previously considered moving the mpool object store code to user-space.
However, by implementing mpool as a device driver, we get several benefits
in terms of scalability, performance, and functionality. In doing so, we relied
only on standard interfaces and did not make any changes to the kernel.

(1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
a collection of logically related objects with a single system call. The objects in
such a collection are created at different times, physically disparate, and may
even reside on different media class volumes.

For our HSE storage engine application, there are commonly 10's to 100's of
objects in a given mcache map, and 75,000 total objects mapped at a given time.

Compared to memory-mapping objects individually, the mcache map facility
scales well because it requires only a single system call and single vm_area_struct
to memory-map a complete collection of objects.

(2) The mcache map reaper mechanism proactively evicts object data from the page
cache based on object-level metrics. This provides significant performance benefit
for many workloads.

For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
against our HSE storage engine using the mpool driver in a 5.9 kernel.
For each workload, we ran with the reaper turned-on and turned-off.

For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
latency for reads by 39% and updates by 99%. For workload C, the reaper increased
throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
improvements are even more dramatic with earlier kernels.

(3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
created using the Zone Append command. This patch set does not support ZNS, but
that work is in progress and we will be demonstrating our HSE storage engine
running on mpool with ZNS drives at FMS 2020.

(4) mpool's immutable object model allows the driver to support concurrent reading
of object data directly and memory-mapped without a performance penalty to verify
coherence. This allows background operations, such as LSM-tree compaction, to
operate efficiently and without polluting the page cache.

(5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
convenient mechanism for controlling access to and managing the multiple storage
volumes, and in the future pmem devices, that may comprise an logical mpool.

Thanks,
Nabeel

  reply	other threads:[~2020-10-16 21:58 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-12 16:27 [PATCH v2 00/22] add Object Storage Media Pool (mpool) Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 01/22] mpool: add utility routines and ioctl definitions Nabeel M Mohamed
2020-10-12 16:45   ` Randy Dunlap
2020-10-12 16:48     ` Randy Dunlap
2020-10-12 16:27 ` [PATCH v2 02/22] mpool: add in-memory struct definitions Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 03/22] mpool: add on-media " Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 04/22] mpool: add pool drive component which handles mpool IO using the block layer API Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 05/22] mpool: add space map component which manages free space on mpool devices Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 06/22] mpool: add on-media pack, unpack and upgrade routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 07/22] mpool: add superblock management routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 08/22] mpool: add pool metadata routines to manage object lifecycle and IO Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 09/22] mpool: add mblock lifecycle management and IO routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 10/22] mpool: add mlog IO utility routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 11/22] mpool: add mlog lifecycle management and IO routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 12/22] mpool: add metadata container or mlog-pair framework Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 13/22] mpool: add utility routines for mpool lifecycle management Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 14/22] mpool: add pool metadata routines to create persistent mpools Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 15/22] mpool: add mpool lifecycle management routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 16/22] mpool: add mpool control plane utility routines Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 17/22] mpool: add mpool lifecycle management ioctls Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 18/22] mpool: add object " Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 19/22] mpool: add support to mmap arbitrary collection of mblocks Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 20/22] mpool: add support to proactively evict cached mblock data from the page-cache Nabeel M Mohamed
2020-10-12 16:27 ` [PATCH v2 21/22] mpool: add documentation Nabeel M Mohamed
2020-10-12 16:53   ` Randy Dunlap
2020-10-12 16:27 ` [PATCH v2 22/22] mpool: add Kconfig and Makefile Nabeel M Mohamed
2020-10-15  8:02 ` [PATCH v2 00/22] add Object Storage Media Pool (mpool) Christoph Hellwig
2020-10-16 21:58   ` Nabeel Meeramohideen Mohamed (nmeeramohide) [this message]
2020-10-16 22:11     ` [EXT] " Dan Williams
2020-10-19 22:30       ` Nabeel Meeramohideen Mohamed (nmeeramohide)
2020-10-20 21:35         ` Dan Williams
2020-10-21 17:10           ` Nabeel Meeramohideen Mohamed (nmeeramohide)
2020-10-21 17:48             ` Dan Williams
2020-10-21 14:24       ` Mike Snitzer
2020-10-21 16:24         ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=SN6PR08MB420880574E0705BBC80EC1A3B3030@SN6PR08MB4208.namprd08.prod.outlook.com \
    --to=nmeeramohide@micron.com \
    --cc=gbecker@micron.com \
    --cc=hch@infradead.org \
    --cc=jgroves@micron.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=plabat@micron.com \
    --cc=smoyer@micron.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).