From: Dave Chinner <david@fromorbit.com>
To: Davidlohr Bueso <dbueso@suse.com>
Cc: Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
"Darrick J . Wong" <darrick.wong@oracle.com>,
Christoph Hellwig <hch@lst.de>,
Matthew Wilcox <willy@infradead.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Date: Tue, 16 Apr 2019 22:22:40 +1000 [thread overview]
Message-ID: <20190416122240.GN29573@dread.disaster.area> (raw)
In-Reply-To: <20190411011117.GC29573@dread.disaster.area>
On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote:
> On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote:
> > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> > > On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > > > FYI, I'm working on a range lock implementation that should both
> > > > solve the performance issue and the reader starvation issue at the
> > > > same time by allowing concurrent buffered reads and writes to
> > > > different file ranges.
> > >
> > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> > > merged because he had no in-tree user at the time (he was more aiming at
> > > converting mmap_sem which is rather difficult). But the generic lock
> > > implementation should be well usable.
> > >
> > > Added Davidlohr to CC.
> > >
> > > Honza
> > >
> > > [1] https://lkml.org/lkml/2017/3/7/22
> >
> > fyi this was the latest version (had some naming updates per peterz).
> >
> > https://lkml.org/lkml/2018/2/4/232
>
> No, I wasn't aware of these because they haven't ever been posted to
> a list I subscribe to and they haven't been merged. I'll go have a
> look at them over the next few days.
Rather disappointing, to tell the truth.

The implementation doesn't scale anywhere near as well as a rwsem
for concurrent IO using shared access (i.e. XFS direct IO). That's
the baseline we need to get close to for range locks to be a viable
prospect.
Fio randrw numbers on a single file on a pmem device on a 16p
machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3:
                 IOPS read/write (direct IO)
fio processes        rwsem           rangelock
     1            78k / 78k          75k / 75k
     2           131k / 131k        123k / 123k
     4           267k / 267k        183k / 183k
     8           372k / 372k        177k / 177k
    16           315k / 315k        135k / 135k
So uncontended, non-overlapping range performance is not
really even in the ballpark right now, unfortunately.
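For reference, a fio job along these lines reproduces the AIO-DIO
workload above (the filename, runtime and job layout here are
illustrative placeholders, not the exact job file I ran):

```ini
; 4kB random read/write AIO-DIO on a single file, iodepth 128,
; one fio process per CPU being tested. Adjust filename/numjobs.
[global]
filename=/mnt/pmem/testfile
ioengine=libaio
direct=1
rw=randrw
bs=4k
iodepth=128
time_based=1
runtime=30
group_reporting=1

[randrw]
numjobs=16
```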
To indicate that range locking is actually working, let's do
buffered read/write, which takes the rwsem exclusive for writes
and so will be bound by that:
                 IOPS read/write (buffered IO)
fio processes        rwsem           rangelock
     1            57k / 57k          64k / 64k
     2            61k / 61k         111k / 111k
     4            61k / 61k         228k / 228k
     8            55k / 55k         195k / 195k
    16            15k / 15k          40k / 40k
So the win for mixed buffered IO is huge, but it comes at the cost
of direct IO performance. We can't make that tradeoff, so if this
implementation cannot be substantially improved we really can't
use it.
Fundamentally, the problem is that the interval tree work is all
done under a spinlock, so by the time we get to 16p, 70% of the CPU
time being burnt is on the spinlock, and only 30% of the CPU time
is actually doing IO work:
+ 70.78% 2.50% [kernel] [k] do_raw_spin_lock
+ 69.72% 0.07% [kernel] [k] _raw_spin_lock_irqsave
- 68.27% 68.10% [kernel] [k] __pv_queued_spin_lock_slowpath
.....
+ 15.61% 0.07% [kernel] [k] range_write_unlock
+ 15.39% 0.06% [kernel] [k] range_read_unlock
+ 15.11% 0.12% [kernel] [k] range_read_lock
+ 15.11% 0.13% [kernel] [k] range_write_lock
.....
+ 12.92% 0.11% [kernel] [k] range_is_locked
FWIW, I'm not convinced about the scalability of the rb/interval
tree, to tell you the truth. We got rid of the rbtree in XFS for
cache indexing because the multi-level pointer chasing was just too
expensive to do under a spinlock - it's just not a cache efficient
structure for random index object storage.
FWIW, I have a basic hack to replace the i_rwsem in XFS with a full
range read or write lock using my XFS range lock implementation, so
it just behaves like a rwsem at this point. It is not in any way
optimised. Numbers for the same AIO-DIO test are:
                 IOPS read/write (direct IO)
processes        rwsem          DB rangelock     XFS rangelock
    1         78k / 78k          75k / 75k         74k / 74k
    2        131k / 131k        123k / 123k       134k / 134k
    4        267k / 267k        183k / 183k       246k / 246k
    8        372k / 372k        177k / 177k       306k / 306k
   16        315k / 315k        135k / 135k       240k / 240k
Performance is much closer to rwsems, but the single lock critical
section still causes serious amounts of cacheline contention on pure
shared-read lock workloads like this.
FWIW, the XFS rangelock numbers match what I've been seeing with
userspace perf tests using unoptimised pthread mutexes: ~600,000
lock/unlock cycles a second over ~100 concurrent non-overlapping
ranges seems to be the limit before futex contention burns down
the house.
As it is, I've done all the work to make XFS use proper range locks
on top of your patch - I'll go back tomorrow and modify the XFS
range locks to use the same API as your lock implementation. I'm
interested to see what falls out when they are used as real range
locks rather than as a glorified rwsem....
Cheers,
Dave.
PS: there's a double lock bug in range_is_locked().....
--
Dave Chinner
david@fromorbit.com