linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Date: Wed, 22 Jun 2022 11:34:03 +0200	[thread overview]
Message-ID: <20220622093403.hvsk2zmlw7o37phe@quack3.lan> (raw)
In-Reply-To: <CAOQ4uxgYbVOSDwufPFvbNwsxnzzseNH9guwxZvP43vMmOFqq+Q@mail.gmail.com>

On Wed 22-06-22 12:00:35, Amir Goldstein wrote:
> On Wed, Jun 22, 2022 at 6:23 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jun 21, 2022 at 03:53:33PM +0300, Amir Goldstein wrote:
> > > On Tue, Jun 21, 2022 at 11:59 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 21-06-22 10:49:48, Amir Goldstein wrote:
> > > > > > How exactly do you imagine the synchronization of buffered read against
> > > > > > buffered write would work? Lock all pages for the read range in the page
> > > > > > cache? You'd need to be careful to not bring the machine OOM when someone
> > > > > > asks to read a huge range...
> > > > >
> > > > > I imagine that the atomic r/w synchronisation will remain *exactly* as it is
> > > > > today by taking XFS_IOLOCK_SHARED around generic_file_read_iter(),
> > > > > when reading data into user buffer, but before that, I would like to issue
> > > > > and wait for read of the pages in the range to reduce the probability
> > > > > of doing the read I/O under XFS_IOLOCK_SHARED.
> > > > >
> > > > > The pre-warm of page cache does not need to abide to the atomic read
> > > > > semantics and it is also tolerable if some pages are evicted in between
> > > > > pre-warn and read to user buffer - in the worst case this will result in
> > > > > I/O amplification, but for the common case, it will be a big win for the
> > > > > mixed random r/w performance on xfs.
> > > > >
> > > > > To reduce risk of page cache thrashing we can limit this optimization
> > > > > to a maximum number of page cache pre-warm.
> > > > >
> > > > > The questions are:
> > > > > 1. Does this plan sound reasonable?
> > > >
> > > > Ah, I see now. So essentially the idea is to pull the readahead (which is
> > > > currently happening from filemap_read() -> filemap_get_pages()) out from under
> > > > the i_rwsem. It looks like a fine idea to me.
> > >
> > > Great!
> > > Anyone doesn't like the idea or has another suggestion?
> >
> > I guess I'm still confused.
> >
> > The problem was the the XFS IOLOCK was being held while we waited for
> > readahead to complete.  To fix this, you're planning on waiting for
> > readahead to complete with the invalidate lock held?  I don't see the
> > benefit.
> >
> > I see the invalidate_lock as being roughly equivalent to the IOLOCK,
> > just pulled up to the VFS.  Is that incorrect?
> >
> 
> This question coming from you really shakes my confidence.
> 
> This entire story started from the observation that xfs performance
> of concurrent mixed rw workload is two orders of magnitude worse
> than ext4 on slow disk.
> 
> The reason for the difference was that xfs was taking the IOLOCK
> shared on reads and ext4 did not.
> 
> That had two very different reasons:
> 1. POSIX atomic read/write semantics unique to xfs
> 2. Correctness w.r.t. races with punch hole etc, which lead to the
>     conclusion that all non-xfs filesystems are buggy in that respect
> 
> The solution of pulling IOLOCK to vfs would have solved the bug
> but at the cost of severely regressing the mix rw workload on all fs.
> 
> The point of Jan's work on invalidate_lock was to fix the bug and
> avoid the regression. I hope that worked out, but I did not test
> the mixed rw workload on ext4 after invalidate_lock.

Yes, it did work out :)

> IIUC, ideally, invalidate_lock was supposed to be taken only for
> adding pages to page cache and locking them, but not during IO
> in order to synchronize against truncating pages (punch hole).
> But from this comment in filemap_create_folio() I just learned
> that that is not exactly the case:
> "...Note that we could release invalidate_lock after inserting the
>  folio into the page cache as the locked folio would then be enough
>  to synchronize with hole punching. But..."
> 
> Even so, because invalidate_lock is not taken by writers and reader
> that work on existing pages and because invalidate_lock is not held
> for the entire read/write operation, statistically it should be less
> contended than IOLOCK for some workloads, but I am afraid that
> for the workload I tested (bs=8K and mostly cold page cache) it will
> be contended with current vfs code.

Well, the rules are: When you are adding pages to the page cache you need
to either hold i_rwsem or invalidate_lock at least in shared mode. Places
removing underlying storage from the page cache pages are responsible for
holding both i_rwsem and invalidate_lock in exclusive mode. Writes
generally hold i_rwsem exclusive (or shared for overwriting direct IO),
reads hold invalidate_lock shared. So there will not be contention on
invalidate_lock as such for a mixed rw workload, except for the internal
contention of the shared invalidate_lock holders on the cacheline holding
the invalidate_lock (which is non-negligible as well in heavily parallel
loads but that's a separate story).

> I am going to go find a machine with slow disk to test the random rw
> workload again on both xfs and ext4 pre and post invalidate_lock and
> to try out the pre-warm page cache solution.
> 
> The results could be:
> a) ext4 random rw performance has been degraded by invalidate_lock
> b) pre-warm page cache before taking IOLOCK is going to improve
>     xfs random rw performance
> c) A little bit of both

Well, numbers always beat the theory so I'm all for measuring it but let me
say our kernel performance testing within SUSE didn't show significant hit
being introduced by invalidate_lock for any major filesystem.

I hope this clears out things a bit.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2022-06-22  9:34 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-04 16:57 [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload Amir Goldstein
2019-04-04 21:17 ` Dave Chinner
2019-04-05 14:02   ` Amir Goldstein
2019-04-07 23:27     ` Dave Chinner
2019-04-08  9:02       ` Amir Goldstein
2019-04-08 14:11         ` Jan Kara
2019-04-08 17:41           ` Amir Goldstein
2019-04-09  8:26             ` Jan Kara
2022-06-17 14:48               ` Amir Goldstein
2022-06-17 15:11                 ` Jan Kara
2022-06-18  8:38                   ` Amir Goldstein
2022-06-20  9:11                     ` Jan Kara
2022-06-21  7:49                       ` Amir Goldstein
2022-06-21  8:59                         ` Jan Kara
2022-06-21 12:53                           ` Amir Goldstein
2022-06-22  3:23                             ` Matthew Wilcox
2022-06-22  9:00                               ` Amir Goldstein
2022-06-22  9:34                                 ` Jan Kara [this message]
2022-06-22 16:26                                   ` Amir Goldstein
2022-09-13 14:40                             ` Amir Goldstein
2022-09-14 16:01                               ` Darrick J. Wong
2022-09-14 16:29                                 ` Amir Goldstein
2022-09-14 17:39                                   ` Darrick J. Wong
2022-09-19 23:09                                     ` Dave Chinner
2022-09-20  2:24                                       ` Dave Chinner
2022-09-20  3:08                                         ` Amir Goldstein
2022-09-21 11:20                                           ` Amir Goldstein
2019-04-08 11:03       ` Jan Kara
2019-04-22 10:55         ` Boaz Harrosh
2019-04-08 10:33   ` Jan Kara
2019-04-08 16:37     ` Davidlohr Bueso
2019-04-11  1:11       ` Dave Chinner
2019-04-16 12:22         ` Dave Chinner
2019-04-18  3:10           ` Dave Chinner
2019-04-18 18:21             ` Davidlohr Bueso
2019-04-20 23:54               ` Dave Chinner
2019-05-03  4:17                 ` Dave Chinner
2019-05-03  5:17                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220622093403.hvsk2zmlw7o37phe@quack3.lan \
    --to=jack@suse.cz \
    --cc=amir73il@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).