From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC9AEC4BA18 for ; Wed, 26 Feb 2020 16:47:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6F58524670 for ; Wed, 26 Feb 2020 16:47:01 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=intel-com.20150623.gappssmtp.com header.i=@intel-com.20150623.gappssmtp.com header.b="LLkXP/3Y" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728003AbgBZQq6 (ORCPT ); Wed, 26 Feb 2020 11:46:58 -0500 Received: from mail-oi1-f193.google.com ([209.85.167.193]:40330 "EHLO mail-oi1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727995AbgBZQqy (ORCPT ); Wed, 26 Feb 2020 11:46:54 -0500 Received: by mail-oi1-f193.google.com with SMTP id a142so139645oii.7 for ; Wed, 26 Feb 2020 08:46:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=jfFrdPdZ9FoWsLEItIHfFzP9ymFVjyGzad5UqI1rEBE=; b=LLkXP/3YzNaP8jdt0z93QGulQGLXQX0Trzd+WFJYkXo7XFWuAJ6wZlyGff03QwBHLf 4Plc4q/Qd0QJCdIf1e9GoeaXEwKZlw9ETZjCukpkMrdt9ZR/Sr44zYE50Gip8IOl5r/m JYB1LSlrJ5+0Ka91jrB29f9SbGT869H3/DqAXOp34IH/Yoc3vP+//hshViJUUDDqS/LJ QNCuX04I8VWPdsbprpmw1cyzBOjBzRkd1MECp+iNhZ3/lxaXqckVfJPRynNgiP/ijnQg eiPa77p4MyvubAypJgQy1RxojQg1rNwgRioANNu+yBxI55olFfbCnrSGU/LHQ65MMRHT vO+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=jfFrdPdZ9FoWsLEItIHfFzP9ymFVjyGzad5UqI1rEBE=; b=HBGzGRGPbi78/oITDJyUvcOsCumgde8gYgsTdBB78rzldxDsEJAyUvt0QD++8fwC20 tAOhXDbzO61+s0AWEXMQ3F01vVVx/e1R89d9MwTQ+xQxVHa5nNQ1CM8stUIn7rifJhrz 5/rYUShVhSlVHLhMVlM7yJ1xD2Az3OS/eJgujZ2sx2deBGkX54w5qSG/JPtylSxvnFFT aEjS3LM34MdDdi63zq0zpKr34JKsK8uCcqp5dxIHGGzwP9g7kSazZjZ8j6iL+N5btej+ kzTsOmhDOSFY1AkeO1JF692JNeCRRwh6SN1b12y6lH+CQHABzc7+HKL9ZXIZLAArNnS0 OiQA== X-Gm-Message-State: APjAAAUc9vClDXdE51AG9fxfR2XXdPHzZIyB/9d2/nrFI7K9WepYOUsA 3tXV3yS4y6Qf0oSCG66SEJGqBFy1QcBLXSyvjrhR0g== X-Google-Smtp-Source: APXvYqxqs6rTVc9eevxeNThlba+uZVaIe7nbgX9rqBRXGHLdPOffytT7975QEGYC3CSiSUYJqJj/25GH1DMT/gSZElU= X-Received: by 2002:aca:4c9:: with SMTP id 192mr4075244oie.105.1582735614183; Wed, 26 Feb 2020 08:46:54 -0800 (PST) MIME-Version: 1.0 References: <20200221004134.30599-1-ira.weiny@intel.com> <20200221004134.30599-8-ira.weiny@intel.com> <20200221174449.GB11378@lst.de> <20200221224419.GW10776@dread.disaster.area> <20200224175603.GE7771@lst.de> <20200225000937.GA10776@dread.disaster.area> <20200225173633.GA30843@lst.de> In-Reply-To: From: Dan Williams Date: Wed, 26 Feb 2020 08:46:42 -0800 Message-ID: Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state To: Jonathan Halliday Cc: Jeff Moyer , Christoph Hellwig , Dave Chinner , "Weiny, Ira" , Linux Kernel Mailing List , Alexander Viro , "Darrick J. Wong" , "Theodore Y. Ts'o" , Jan Kara , linux-ext4 , linux-xfs , linux-fsdevel Content-Type: text/plain; charset="UTF-8" Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday wrote: > > > Hi All > > I'm a middleware developer, focused on how Java (JVM) workloads can > benefit from app-direct mode pmem. Initially the target is apps that > need a fast binary log for fault tolerance: the classic database WAL use > case; transaction coordination systems; enterprise message bus > persistence and suchlike. Critically, there are cases where we use log > based storage, i.e. it's not the strict 'read rarely, only on recovery' > model that a classic db may have, but more of a 'append only, read many > times' event stream model. > > Think of the log oriented data storage as having logical segments (let's > implement them as files), of which the most recent is being appended to > (read_write) and the remaining N-1 older segments are full and sealed, > so effectively immutable (read_only) until discarded. The tail segment > needs to be in DAX mode for optimal write performance, as the size of > the append may be sub-block and we don't want the overhead of the kernel > call anyhow. So that's clearly a good fit for putting on a DAX fs mount > and using mmap with MAP_SYNC. > > However, we want fast read access into the segments, to retrieve stored > records. The small access index can be built in volatile RAM (assuming > we're willing to take the startup overhead of a full file scan at > recovery time) but the data itself is big and we don't want to move it > all off pmem. Which means the requirements are now different: we want > the O/S cache to pull hot data into fast volatile RAM for us, which DAX > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather > than app-direct mode, except here we're using the O/S rather than the > hardware memory controller to do the cache management for us. > > Currently this requires closing the full (read_write) file, then copying > it to a non-DAX device and reopening it (read_only) there. Clearly > that's expensive and rather tedious. Instead, I'd like to close the > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode > that will instead go via the O/S cache in the traditional manner. Bonus > points if I can do it over non-overlapping ranges in a file without > closing the DAX mode mmap, since then the segments are entirely logical > instead of needing separate physical files. Hi John, IIRC we chatted about this at PIRL, right? At the time it sounded more like mixed mode dax, i.e. dax writes, but cached reads. To me that's an optimization to optionally use dax for direct-I/O writes, with its existing set of page-cache coherence warts, and not a capability to dynamically switch the dax-mode. mmap+MAP_SYNC seems the wrong interface for this. This writeup mentions bypassing kernel call overhead, but I don't see how a dax-write syscall is cheaper than an mmap syscall plus fault. If direct-I/O to a dax capable file bypasses the block layer, isn't that about the maximum of kernel overhead that can be cut out of this use case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block update-in-place writes not append writes.