Re: Ext4: Buffered random writes performance regression with dioread_nolock enabled

From: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: Ext4: Buffered random writes performance regression with dioread_nolock enabled
Date: Wed, 21 Sep 2022 12:06:12 +0000	[thread overview]
Message-ID: <5446FC9B-92E6-4423-8BF8-AA1C138B178B@amazon.com> (raw)
In-Reply-To: <Yypx6VQRbl3bFP2v@mit.edu>

   > Since 2016, the commit bbd2f78cf63a ("libext2fs: allow the default
    > journal size to go as large as a gigabyte") has been in e2fsprogs
    > v1.43.2 and newer (the current version of e2fsprogs v1.46.5; v1.43.2
    > was released in September 2016, six years ago).  

Thanks Theo for sharing that I will looks into updating the e2fsprogs package I am using to include that commit.

  > P.S.  I'm puzzled by your comment, "we have to note that this should
  >  be only beneficial with extent-based files" --- while this is true,
  > why does this matter?  Unless you're dealing with an ancient file
  >  system that was originally created as ext2 or ext3 and then converted
  > to ext4, *all* ext4 files should be extent-based...

Yes I am using ext4 FS so that shouldn't be relevant in my case here, I just mentioned that because that's also mentioned in ext4 mount options man page as justification on why dioread_lock is the default so we likely need to update the documentations as well. https://www.kernel.org/doc/Documentation/filesystems/ext4.txt 

> Note that dioread_nolock code path is only used for extent-based files.
> Because of the restrictions this options comprises it is off by default (e.g. dioread_lock).

My last question would be is the higher journaling  overhead is justified or at least expected with dioread_nolock enabled? Is that caused by the fact that we need to allocate uninitialized extent before buffer write ?

Thank you.

Hazem

On 21/09/2022, 03:08, "Theodore Ts'o" <tytso@mit.edu> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

    On Mon, Sep 19, 2022 at 03:06:46PM +0000, Mohamed Abuelfotoh, Hazem wrote:
    > Hey Team,
    >
    >
    >   *   I am sending this e-mail to report a performance regression that’s caused by commit 244adf6426(ext4: make dioread_nolock the default) , I am listing the performance regression symptoms below & our analysis for the reported regression.

    Performance regressions are always tricky; dioread_nolock improves on
    some workloads, and can cause regressions for others.  In this
    particular case, the choice to make it the default was to also fix a
    direct I/O vs. writeback race which can result in stale data being
    revealed (which is a security issue).

    That being said...

    1)  as you've noted, this commit has been around since 5.6.

    2)  as you noted,

        Increasing the journal size from ext4 128 MiB to 1GiB will also
        fix the problem .

    Since 2016, the commit bbd2f78cf63a ("libext2fs: allow the default
    journal size to go as large as a gigabyte") has been in e2fsprogs
    v1.43.2 and newer (the current version of e2fsprogs v1.46.5; v1.43.2
    was released in September 2016, six years ago).  Quoting the commit
    description:

        Recent research has shown that for a metadata-heavy workload, a 128 MB
        is journal be a bottleneck on HDD's, and that the optimal journal size
        is proportional to number of unique metadata blocks that can be
        modified (and written into the journal) in a 30 second window.  One
        gigabyte should be sufficient for most workloads, which will be used
        for file systems larger than 128 gigabytes.

    So this should not be a problem in practice, and if there are users
    who are using antedeluvian versions of e2fsprogs, or who have old file
    systems which were created many years ago, it's quite easy to adjust
    the journal size.  For example, to adjust the journal to be 2GiB (2048
    MiB), just run the commands:

       tune2fs -O ^has_journal /dev/sdXX
       tune2fs -O has_journal -J size=2048 /tmp/sdXX

    Hence, I disagree that we should revert commit 244adf6426.  It may be
    that for your workload and your file system configuration, using the
    mount option nodioread_nolock (or dioread_lock), may make sense.  But
    there were also workloads for which using dioread_nolock improved
    benchmark numbers, so the question of which is the better default is
    not at all obvious.

    That being said, I do have plans for a new writeback scheme which will
    replace dioread_nolock *and* dioread_lock, and which will hopefully be
    faster than either approach.

                                                    - Ted

    P.S.  I'm puzzled by your comment, "we have to note that this should
    be only beneficial with extent-based files" --- while this is true,
    why does this matter?  Unless you're dealing with an ancient file
    system that was originally created as ext2 or ext3 and then converted
    to ext4, *all* ext4 files should be extent-based...