From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 91B98C43334 for ; Fri, 17 Jun 2022 14:48:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1382968AbiFQOsY (ORCPT ); Fri, 17 Jun 2022 10:48:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52538 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1382984AbiFQOsV (ORCPT ); Fri, 17 Jun 2022 10:48:21 -0400 Received: from mail-ua1-x933.google.com (mail-ua1-x933.google.com [IPv6:2607:f8b0:4864:20::933]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DE0315639E; Fri, 17 Jun 2022 07:48:20 -0700 (PDT) Received: by mail-ua1-x933.google.com with SMTP id r15so1611152uaf.13; Fri, 17 Jun 2022 07:48:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=TcIHYHY7u1Wr/GV3aoIoPqYOsQGjAK4OcNkD/GU5M1M=; b=Jr2Rvzf2R+Qs10pfSCH3B8xV3k51eSIl1oRD/qlzgUF2B6Lqd1wzm0Qv7YmklySnfo pvgZFkH1pxaB0m7tTsV7h5COnS98IUw0oH6sBYhHswlxRx9IAP1IB2n3xAjWChFhcFZo 10Vc8BMMA2QDUpz5PLEetN7sMcHshYY/bez8W8R1Q/Ytff9pSdpy5E58TFvwQDV9+p0G YHxVrn+y/sb+PCZ5CAjbiJdn/mSooNBm6GpXsEIqOZsAgtHrYJ988TJFfgFmwwk0Edpx lWpiT2MZbpsQU5JSqVG6QBzjn7bp27nxg69J1jHoJqEClHlgcAocv8HQlQeThxXd8gJX YjMw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=TcIHYHY7u1Wr/GV3aoIoPqYOsQGjAK4OcNkD/GU5M1M=; b=2lACKIK3RFZTJfDY2feY2zrfze1brS/1BTj62MXnsjJGiBmWPRNkJzOJblFfmfTexF Qi6PHlVbb7ZAUtsLuD6WViBHuvuglyLSteX+Pxx+C9GC0KxS4JusHmEHqeaIlfxmxHnv SEUxvpWHMytwuN3m+Bd1M8qYp6lwrFhDmaQcab/iFPYDonWm9gCOPnzeNgK1balGuvu2 2s7uZDzeq+PO2q/RT2SHGOlNACwrnC+LSYOrq36/TkLxEfH9Zud0j8HHZJ5bZcYFH851 o0g2kCYf6zWDoUqoXwGtRj+Sc13ixxP4C/8JtQTy94k6zxD08vQpG31uRoqXSkQLg9k4 +Csw== X-Gm-Message-State: AJIora989BKR4S9JCsR1A8bQQdPZkjB7gskqQdDpbfzTXq08MLSAW8w3 th5cuzEccx3iRt22o6hG4UK2pXdS+jmntlNHwRw= X-Google-Smtp-Source: AGRyM1vU628PZCHEasd5i6Kd/bfeF+tHIigArJiScOV/UHBidep3tKwiX3+yAzz0NhvMYJ/DqI8RbukFBG0ny0S6UFA= X-Received: by 2002:a05:6130:21a:b0:37a:8e07:d495 with SMTP id s26-20020a056130021a00b0037a8e07d495mr4616073uac.80.1655477300037; Fri, 17 Jun 2022 07:48:20 -0700 (PDT) MIME-Version: 1.0 References: <20190404165737.30889-1-amir73il@gmail.com> <20190404211730.GD26298@dastard> <20190407232728.GF26298@dastard> <20190408141114.GC15023@quack2.suse.cz> <20190409082605.GA8107@quack2.suse.cz> In-Reply-To: <20190409082605.GA8107@quack2.suse.cz> From: Amir Goldstein Date: Fri, 17 Jun 2022 17:48:08 +0300 Message-ID: Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload To: Jan Kara Cc: Dave Chinner , "Darrick J . Wong" , Christoph Hellwig , Matthew Wilcox , linux-xfs , linux-fsdevel , Ext4 , Lukas Czerner , Theodore Tso , Al Viro Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Tue, Apr 9, 2019 at 11:26 AM Jan Kara wrote: > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara wrote: > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner wrote: > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner wrote: > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > violates. IOWs, while this is fast, it is not safe and so not a > > > > > > > viable approach to solving the problem. > > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > truncate, hole punching etc? > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > in the page fault handler. > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > path. > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > truncate_pagecache_range()? > > > > > > During read(2), pages are not "faulted in". Just look at > > > what generic_file_buffered_read() does. It uses completely separate code to > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > only to accesses from userspace to mmaps. > > > > > > > Oh! thanks for fixing my blind spot. > > So if you agree with Dave that ext4, and who knows what other fs, > > are vulnerable to populating page cache with stale "uptodate" data, > > Not that many filesystems support punching holes but you're right. > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > posix_fadvise(). > > Yes, this is correct AFAICT. > > > Mind you, I recently added an fadvise f_op, so it could be used by > > xfs to synchronize with IOLOCK. > > And yes, this should work. > > > Perhaps a better solution would be for truncate_pagecache_range() > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > > cache. When we have shared pages for files, these pages could be > > deduped. > > No, I wouldn't really mess with sharing pages due to this. It would be hard > to make that scale resonably and would be rather complex. We really need a > proper and reasonably simple synchronization mechanism between operations > removing blocks from inode and operations filling in page cache of the > inode. Page lock was supposed to provide this but doesn't quite work > because hole punching first remove pagecache pages and then go removing all > blocks. > > So I agree with Dave that going for range lock is really the cleanest way > forward here without causing big regressions for mixed rw workloads. I'm > just thinking how to best do that without introducing lot of boilerplate > code into each filesystem. Hi Jan, Dave, Trying to circle back to this after 3 years! Seeing that there is no progress with range locks and that the mixed rw workloads performance issue still very much exists. Is the situation now different than 3 years ago with invalidate_lock? Would my approach of pre-warm page cache before taking IOLOCK be safe if page cache is pre-warmed with invalidate_lock held? For the pNFS leases issue, as I wrote back in pre-COVID era, I intend to opt-out of this optimization with #ifndef CONFIG_EXPORTFS_BLOCK_OPS Thanks, Amir.