Linux-Block Archive on
From: "Nabeel Meeramohideen Mohamed (nmeeramohide)"  <>
To: Dan Williams <>
Cc: Christoph Hellwig <>,
	"Steve Moyer (smoyer)" <>,
	"Greg Becker (gbecker)" <>,
	"Pierre Labat (plabat)" <>,
	"John Groves (jgroves)" <>
Subject: RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
Date: Mon, 19 Oct 2020 22:30:02 +0000
Message-ID: <> (raw)
In-Reply-To: <>

Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <> wrote:
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <> wrote:
> > > I don't think this belongs into the kernel.  It is a classic case for
> > > infrastructure that should be built in userspace.  If anything is
> > > missing to implement it in userspace with equivalent performance we
> > > need to improve our interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we relied
> > only on standard interfaces and did not make any changes to the kernel.
> >
> > (1)  mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The objects in
> > such a collection are created at different times, physically disparate, and may
> > even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given
> > time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and single
> > vm_area_struct to memory-map a complete collection of objects.

> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
system call overhead of memory-mapping individual objects, versus our mcache map
mechanism. However, there is still the scalability issue of having a vm_area_struct
for each object (versus one for each mcache map).

We ran YCSB workload C in two different configurations:
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab:
24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2.
We are not sure whether this is due to the reduced complexity of searching VMAs
during page faults.
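
As a rough cross-check of the slab figures above, the following Python sketch
recomputes the ratios and the implied per-object size. It assumes "MB" means
10**6 bytes, which the numbers above do not specify; the variable names are
illustrative only. Under that assumption the memory ratio comes out at about
3.4x, close to the ~3.3x quoted, and both configurations imply roughly 195
bytes per vm_area_struct, so the saving comes from the object count rather
than from smaller objects.

```python
# Figures quoted above for the vm_area_struct slab (YCSB workload C).
# Assumption: "MB" is treated as 10**6 bytes; the unit is not stated above.
MB = 10**6

config1_bytes, config1_objs = 24.8 * MB, 127188  # one mmap per object
config2_bytes, config2_objs = 7.3 * MB, 37482    # one mcache map per collection

memory_ratio = config1_bytes / config2_bytes     # slab memory, config 1 vs config 2
object_ratio = config1_objs / config2_objs       # slab objects, config 1 vs config 2
bytes_per_vma_1 = config1_bytes / config1_objs   # implied size of one slab object
bytes_per_vma_2 = config2_bytes / config2_objs

print(f"slab memory ratio : {memory_ratio:.2f}x")
print(f"slab object ratio : {object_ratio:.2f}x")
print(f"bytes per object  : {bytes_per_vma_1:.0f} (config 1), "
      f"{bytes_per_vma_2:.0f} (config 2)")
```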

> > (2) The mcache map reaper mechanism proactively evicts object data from the
> > page cache based on object-level metrics. This provides significant
> > performance benefit for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix)  and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned-on and turned-off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing 99.99%
> > tail latency for reads by 39% and updates by 99%. For workload C, the reaper
> > increased throughput by 1.84x, while reducing the 99.99% read tail latency
> > by 63%. These improvements are even more dramatic with earlier kernels.

> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related immutable objects
with similar lifetimes. It is best suited for storage applications that run queries against
organized collections of immutable objects, such as storage engines and DBs based on

Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it is
left to the application to tag each map appropriately. For our HSE storage engine
application, the SSTables in the root/intermediate levels act as a routing table to
redirect queries to an appropriate leaf-level SSTable, so the mcache maps corresponding
to the root/intermediate-level SSTables can be tagged as pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map. Under memory
pressure, the access time is compared to a time-to-live (ttl) metric that’s set based on
the map’s temperature, how close free memory is to the low and high watermarks, etc.
If the object was last accessed outside the ttl window, its pages are evicted from the
page cache.
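
The eviction decision described above can be sketched roughly as follows. This
is an illustrative user-space model in Python, not the driver's code: the base
ttl values, the linear pressure formula, the 10% floor, and all names (Temp,
BASE_TTL_SEC, pressure, effective_ttl, should_evict) are assumptions made for
the example.

```python
from enum import Enum
from typing import Optional


class Temp(Enum):
    PINNED = "pinned"
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Hypothetical base time-to-live per temperature, in seconds.
BASE_TTL_SEC = {Temp.HOT: 600.0, Temp.WARM: 120.0, Temp.COLD: 30.0}


def pressure(free: int, low: int, high: int) -> float:
    """Return 0.0 when free >= high watermark, 1.0 when free <= low, linear between."""
    if high <= low:
        raise ValueError("high watermark must exceed low watermark")
    return min(1.0, max(0.0, (high - free) / (high - low)))


def effective_ttl(temp: Temp, free: int, low: int, high: int) -> Optional[float]:
    """Return the ttl in seconds, or None if nothing should be evicted."""
    p = pressure(free, low, high)
    if temp is Temp.PINNED or p == 0.0:
        return None  # pinned maps are never reaped; no pressure means no reaping
    # Assumed policy: full ttl at light pressure, shrinking to 10% at the low watermark.
    return BASE_TTL_SEC[temp] * (1.0 - 0.9 * p)


def should_evict(last_access: float, now: float, temp: Temp,
                 free: int, low: int, high: int) -> bool:
    """True if the object was last accessed outside the ttl window."""
    ttl = effective_ttl(temp, free, low, high)
    return ttl is not None and (now - last_access) > ttl


# An object idle for 100 s, with free memory halfway between the watermarks.
for temp in Temp:
    print(temp.value, should_evict(0.0, 100.0, temp, free=50, low=0, high=100))
```

In this model a colder map loses its idle objects sooner than a hotter one as
free memory approaches the low watermark, which is the behavior the
temperature tag is meant to express.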

We also apply a few other techniques, like throttling readahead and adding a delay
in the page fault handler, to avoid overwhelming the page cache during memory pressure.

In the workloads that we run, we have noticed stalls when kswapd performs reclaim,
which impact throughput and tail latencies as described in our last email. The mcache
reaper runs proactively and can make better reclaim decisions as it is designed to
address a specific class of workloads.

We doubt whether the same mechanisms can be employed in the vanilla page cache as
it is designed to work for a wide variety of workloads.

> > (4) mpool's immutable object model allows the driver to support concurrent
> > reading of object data directly and memory-mapped without a performance
> > penalty to verify coherence. This allows background operations, such as
> > LSM-tree compaction, to operate efficiently and without polluting the page
> > cache.

> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven’t measured the benefit of eliminating the coherence check, which isn’t needed
in our case because objects are immutable. However, the open(2) documentation states
that “applications should avoid mixing mmap(2) of files with direct I/O to
the same files”, which is what we are effectively doing when we directly read from an
object that is also in an mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides
> > a convenient mechanism for controlling access to and managing the multiple
> > storage volumes, and in the future pmem devices, that may comprise a logical
> > mpool.

> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn’t extend any of the existing driver arch to manage multiple storage volumes.

mpool implements the concept of media classes, where each media class corresponds
to a different storage volume. Clients specify a media class when creating an object in
an mpool. mpool currently supports only two media classes: “capacity”, for storing the
bulk of the objects, backed by, for instance, QLC SSDs; and “staging”, for storing objects
requiring lower latency/higher throughput, backed by, for instance, 3DXP SSDs.

An mpool is accessed via the /dev/mpool/<mpool-name> device file and the
mpool descriptor attached to this device file instance tracks all its associated media
class volumes. mpool relies on device mapper to provide physical device aggregation
within a media class volume.
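
The relationship between an mpool, its media class volumes, and its objects
can be modeled with a small Python sketch. This is only a conceptual model:
the class and method names (MediaClass, Mpool, create_object) and the volume
paths are hypothetical and do not correspond to the driver's ioctl interface.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count
from typing import Dict


class MediaClass(Enum):
    CAPACITY = "capacity"  # bulk of the objects, e.g. QLC SSDs
    STAGING = "staging"    # lower latency/higher throughput, e.g. 3DXP SSDs


@dataclass
class Mpool:
    name: str
    # One storage volume per media class; the paths are placeholders.
    volumes: Dict[MediaClass, str]
    objects: Dict[int, MediaClass] = field(default_factory=dict)
    _next_id: count = field(default_factory=lambda: count(1), repr=False)

    @property
    def device_path(self) -> str:
        """Device file through which the mpool is accessed."""
        return f"/dev/mpool/{self.name}"

    def create_object(self, mclass: MediaClass) -> int:
        """Create an object in the given media class and return its id."""
        if mclass not in self.volumes:
            raise ValueError(f"mpool {self.name} has no {mclass.value} volume")
        obj_id = next(self._next_id)
        self.objects[obj_id] = mclass
        return obj_id


pool = Mpool("mp1", {MediaClass.CAPACITY: "/dev/mapper/mp1-capacity"})
first = pool.create_object(MediaClass.CAPACITY)
print(pool.device_path, first, pool.objects[first].value)
```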

