From: "Nabeel Meeramohideen Mohamed (nmeeramohide)" <email@example.com>
To: Dan Williams <firstname.lastname@example.org>
Cc: Christoph Hellwig <email@example.com>,
	"firstname.lastname@example.org" <email@example.com>,
	"firstname.lastname@example.org" <email@example.com>,
	"firstname.lastname@example.org" <email@example.com>,
	"firstname.lastname@example.org" <email@example.com>,
	"firstname.lastname@example.org" <email@example.com>,
	"Steve Moyer (smoyer)" <firstname.lastname@example.org>,
	"Greg Becker (gbecker)" <email@example.com>,
	"Pierre Labat (plabat)" <firstname.lastname@example.org>,
	"John Groves (jgroves)" <email@example.com>
Subject: RE: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
Date: Mon, 19 Oct 2020 22:30:02 +0000
Message-ID: <SN6PR08MB420843C280D54D7B5B76EB78B31E0@SN6PR08MB4208.namprd08.prod.outlook.com>
In-Reply-To: <CAPcyv4j7a0gq++rL--2W33fL4+S0asYjYkvfBfs+hY+3J=c_GA@mail.gmail.com>

Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <firstname.lastname@example.org> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <email@example.com> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> > <firstname.lastname@example.org> wrote:
> > > I don't think this belongs into the kernel. It is a classic case for
> > > infrastructure that should be built in userspace. If anything is
> > > missing to implement it in userspace with equivalent performance we
> > > need to improve out interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we
> > relied only on standard interfaces and did not make any changes to the
> > kernel.
> >
> > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The
> > objects in such a collection are created at different times, physically
> > disparate, and may even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given
> > time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and single
> > vm_area_struct to memory-map a complete collection of objects.
>
> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help
mitigate the system call overhead of memory-mapping individual objects,
versus our mcache map mechanism. However, there is still the scalability
issue of having a vm_area_struct for each object (versus one for each
mcache map).

We ran YCSB workload C in two different configurations:
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct
  slab: 24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482
  objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for
  config 2, though we are not sure whether this is due to the reduced
  complexity of searching VMAs during page faults.

> > (2) The mcache map reaper mechanism proactively evicts object data from
> > the page cache based on object-level metrics. This provides significant
> > performance benefit for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned-on and turned-off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing
> > 99.99% tail latency for reads by 39% and updates by 99%. For workload C,
> > the reaper increased throughput by 1.84x, while reducing the 99.99% read
> > tail latency by 63%. These improvements are even more dramatic with
> > earlier kernels.
>
> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related
immutable objects with similar lifetimes. It is best suited for storage
applications that run queries against organized collections of immutable
objects, such as storage engines and DBs based on SSTables.

Each mcache map is associated with a temperature (pinned, hot, warm, cold),
and it is left to the application to tag each map appropriately. For our
HSE storage engine application, the SSTables in the root/intermediate
levels act as a routing table that redirects queries to an appropriate
leaf-level SSTable, so the mcache maps corresponding to the
root/intermediate-level SSTables can be tagged as pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map.
Under memory pressure, the access time is compared to a time-to-live (ttl)
metric that’s set based on the map’s temperature, how close free memory is
to the low and high watermarks, etc. If the object was last accessed
outside the ttl window, its pages are evicted from the page cache.

We also apply a few other techniques, like throttling readahead and adding
a delay in the page fault handler, so as not to overwhelm the page cache
during memory pressure.

In the workloads that we run, we have noticed stalls when kswapd does the
reclaim, and that impacts throughput and tail latencies as described in our
last email. The mcache reaper runs proactively and can make better reclaim
decisions as it is designed to address a specific class of workloads.
We doubt whether the same mechanisms can be employed in the vanilla page
cache, as it is designed to work for a wide variety of workloads.

> > (4) mpool's immutable object model allows the driver to support
> > concurrent reading of object data directly and memory-mapped without a
> > performance penalty to verify coherence. This allows background
> > operations, such as LSM-tree compaction, to operate efficiently and
> > without polluting the page cache.
>
> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven’t measured the benefit of eliminating the coherence check, which
isn’t needed in our case because objects are immutable. However, the
open(2) documentation states that “applications should avoid mixing mmap(2)
of files with direct I/O to the same files”, which is what we are
effectively doing when we directly read from an object that is also in an
mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file
> > provides a convenient mechanism for controlling access to and managing
> > the multiple storage volumes, and in the future pmem devices, that may
> > comprise an logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn’t extend any of the existing driver arch to manage multiple
storage volumes.

mpool implements the concept of media classes, where each media class
corresponds to a different storage volume. Clients specify a media class
when creating an object in an mpool. mpool currently supports only two
media classes: “capacity”, for storing the bulk of the objects, backed by,
for instance, QLC SSDs; and “staging”, for storing objects requiring lower
latency/higher throughput, backed by, for instance, 3DXP SSDs.
An mpool is accessed via the /dev/mpool/<mpool-name> device file and the mpool descriptor attached to this device file instance tracks all its associated media class volumes. mpool relies on device mapper to provide physical device aggregation within a media class volume.
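
To make the reaper policy described earlier in this reply a little more concrete, here is a minimal user-space model in Python. It is only a sketch under stated assumptions and is not the mpool driver code: the per-temperature base TTL values, the linear scaling between the low and high watermarks, and all of the names (BASE_TTL_SECONDS, pressure_factor, effective_ttl, select_evictions) are hypothetical and chosen purely for illustration. The only parts taken from the description above are that each map has a temperature, that the ttl depends on the temperature and on how close free memory is to the watermarks, and that objects last accessed outside the ttl window are evicted.

```python
from dataclasses import dataclass

# Hypothetical base TTLs in seconds per map temperature.
# None models "pinned": never evicted by this sketch.
BASE_TTL_SECONDS = {"pinned": None, "hot": 600.0, "warm": 120.0, "cold": 30.0}


def pressure_factor(free_pages, low_wmark, high_wmark):
    """Return 1.0 when free memory is at or above the high watermark and
    0.0 when it is at or below the low watermark, linear in between.
    The linear shape is an assumption, not taken from the driver."""
    if high_wmark <= low_wmark:
        raise ValueError("high watermark must exceed low watermark")
    frac = (free_pages - low_wmark) / (high_wmark - low_wmark)
    return min(1.0, max(0.0, frac))


def effective_ttl(temperature, free_pages, low_wmark, high_wmark):
    """Shrink the base TTL as free memory approaches the low watermark."""
    base = BASE_TTL_SECONDS[temperature]
    if base is None:
        return None
    return base * pressure_factor(free_pages, low_wmark, high_wmark)


@dataclass
class MappedObject:
    name: str
    last_access: float  # seconds, same clock as `now`


def select_evictions(objects, temperature, now, free_pages, low_wmark, high_wmark):
    """Return names of objects last accessed outside the TTL window."""
    ttl = effective_ttl(temperature, free_pages, low_wmark, high_wmark)
    if ttl is None:
        return []
    return [o.name for o in objects if now - o.last_access > ttl]
```

In this model, a "warm" map with plenty of free memory keeps objects for the full base TTL, while the same map evicts progressively more recently used objects as free memory falls toward the low watermark. A "pinned" map is never a candidate.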