From: Dan Williams
Date: Tue, 20 Oct 2020 14:35:44 -0700
Subject: Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)
To: "Nabeel Meeramohideen Mohamed (nmeeramohide)"
Cc: Christoph Hellwig, "linux-kernel@vger.kernel.org",
    "linux-block@vger.kernel.org", "linux-nvme@lists.infradead.org",
    "linux-mm@kvack.org", "linux-nvdimm@lists.01.org",
    "Steve Moyer (smoyer)", "Greg Becker (gbecker)",
    "Pierre Labat (plabat)", "John Groves (jgroves)"
References: <20201012162736.65241-1-nmeeramohide@micron.com>
    <20201015080254.GA31136@infradead.org>

On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) wrote:
>
> Hi Dan,
>
> On Friday, October 16, 2020 4:12 PM, Dan Williams wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) wrote:
> > >
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig wrote:
> > > > I don't think this belongs into the kernel. It is a classic case for
> > > > infrastructure that should be built in userspace. If anything is
> > > > missing to implement it in userspace with equivalent performance we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > >
> > > Hi Christoph,
> > >
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > >
> > > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, physically disparate, and may
> > > even reside on different media class volumes.
> > >
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> > >
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and single vm_area_struct
> > > to memory-map a complete collection of objects.
> >
> > Why can't that be a batch of mmap calls on io_uring?
>
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
>
> We ran YCSB workload C in two different configurations -
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
>
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab -
> 24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.
>
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2,
> not sure if it's due to the reduced complexity of searching VMAs during page faults.

So this gets to the meta question that is giving me pause on this
whole proposal:

What does Linux get from merging mpool?

What you have above is a decent scalability bug report. That type of
pressure to meet new workload needs is how Linux interfaces evolve.
However, rather than evolving those interfaces, mpool is a revolutionary
replacement that leaves the bugs intact for everyone who does not
switch over to mpool.

Consider io_uring as an example, where the kernel resisted trends
towards userspace I/O engines and instead evolved a solution that
maintained kernel control while also achieving similar performance
levels. The exercise is useful for identifying places where Linux has
deficiencies, but wholesale replacing an entire I/O submission model
is a direction that leaves the old APIs to rot.
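
For anyone who wants to sanity-check the slab figures quoted above, here
is a rough back-of-the-envelope sketch. It only reuses the four numbers
from the YCSB report; the variable names are made up for illustration,
and it assumes "MB" means 2**20 bytes, which the report does not state.

```python
# Figures quoted from the YCSB workload C comparison in this thread.
MIB = 1024 * 1024  # assumption: "MB" in the report means 2**20 bytes

# Hypothetical labels; only the numbers come from the report.
configs = {
    "config 1 (mmap per object)": {"slab_mb": 24.8, "objects": 127188},
    "config 2 (mcache map)": {"slab_mb": 7.3, "objects": 37482},
}


def bytes_per_object(slab_mb, objects):
    """Average vm_area_struct slab bytes per reported slab object."""
    return slab_mb * MIB / objects


per_object = {name: bytes_per_object(**cfg) for name, cfg in configs.items()}

# Ratio of slab memory and ratio of slab object counts between the configs.
memory_ratio = 24.8 / 7.3
object_ratio = 127188 / 37482

for name, size in per_object.items():
    print(f"{name}: {size:.1f} bytes per slab object")
print(f"slab memory ratio: {memory_ratio:.2f}x")
print(f"slab object ratio: {object_ratio:.2f}x")
```

Under that assumption both configurations come out at roughly 204 bytes
per slab object, and the memory ratio tracks the object-count ratio
(both about 3.4x). In other words, the extra kernel memory in config 1
appears to be explained by the number of VMAs alone, which is what makes
this read as a per-mapping scalability cost rather than something
specific to mpool.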