From: Dave Chinner <david@fromorbit.com>
To: Davidlohr Bueso <dbueso@suse.com>
Cc: Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
"Darrick J . Wong" <darrick.wong@oracle.com>,
Christoph Hellwig <hch@lst.de>,
Matthew Wilcox <willy@infradead.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Date: Fri, 3 May 2019 14:17:27 +1000 [thread overview]
Message-ID: <20190503041727.GL29573@dread.disaster.area> (raw)
In-Reply-To: <20190420235412.GY29573@dread.disaster.area>
On Sun, Apr 21, 2019 at 09:54:12AM +1000, Dave Chinner wrote:
> On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > > Now the stuff I've been working on has the same interface as
> > > Davidlohr's patch, so I can swap and change them without thinking
> > > about it. It's still completely unoptimised, but:
> > >
> > >             IOPS read/write (direct IO)
> > > processes   rwsem         DB rangelock   XFS rangelock
> > >  1           78k /  78k    75k /  75k     72k /  72k
> > >  2          131k / 131k   123k / 123k    133k / 133k
> > >  4          267k / 267k   183k / 183k    237k / 237k
> > >  8          372k / 372k   177k / 177k    265k / 265k
> > > 16          315k / 315k   135k / 135k    228k / 228k
> > >
> > > It's still substantially faster than the interval tree code.
....
> > > /me goes off and thinks more about adding optimistic lock coupling
> > > to the XFS iext btree to get rid of the need for tree-wide
> > > locking altogether
> >
> > I was not aware of this code.
>
> It's relatively new, and directly tailored to the needs of caching
> the XFS extent tree - it's not really a generic btree in that its
> record store format is the XFS on-disk extent record. i.e. it
> only stores 54 bits of start offset and 21 bits of length in its
> 16-byte records, and the rest of the space is for the record data.
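As a rough illustration of that sort of packing (the field layout here is
invented for the example - it is not the actual struct xfs_iext_rec
layout), a 16-byte record carrying a 54-bit start offset and a 21-bit
length, with the remaining bits left for record data, might look like:

```c
#include <stdint.h>

/*
 * Illustrative sketch only: pack a 54-bit offset and a 21-bit length
 * into a 16-byte record, leaving the remaining bits for per-record
 * payload. Field placement is invented for the example.
 */
#define REC_OFF_BITS	54
#define REC_LEN_BITS	21

struct packed_rec {
	uint64_t lo;	/* 54-bit offset in low bits, payload above */
	uint64_t hi;	/* 21-bit length in low bits, payload above */
};

static inline void rec_set(struct packed_rec *r, uint64_t off,
			   uint64_t len)
{
	r->lo = (r->lo & ~((1ULL << REC_OFF_BITS) - 1)) | off;
	r->hi = (r->hi & ~((1ULL << REC_LEN_BITS) - 1)) | len;
}

static inline uint64_t rec_off(const struct packed_rec *r)
{
	return r->lo & ((1ULL << REC_OFF_BITS) - 1);
}

static inline uint64_t rec_len(const struct packed_rec *r)
{
	return r->hi & ((1ULL << REC_LEN_BITS) - 1);
}
```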
So now I have a mostly working OLC btree based on this tree, which
is plumbed into xfsprogs userspace along with some testing code. I
think I can now say that the code will actually work, and it
/should/ scale better than a rwsem.
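For reference, the core mechanism of optimistic lock coupling - a
per-node version word whose low bit doubles as the write lock, with
readers sampling the version, doing their lookup, and retrying if a
writer changed the node underneath them - can be sketched like this
(a minimal standalone sketch using C11 atomics, not the actual tree
code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct olc_node {
	_Atomic uint64_t version;	/* low bit set => write-locked */
	/* keys, children, records ... */
};

static inline uint64_t olc_read_begin(struct olc_node *n)
{
	uint64_t v;

	/* spin until no writer holds the node, then return the version */
	while ((v = atomic_load_explicit(&n->version,
					 memory_order_acquire)) & 1)
		;
	return v;
}

static inline bool olc_read_retry(struct olc_node *n, uint64_t v)
{
	/* true if the node changed (or is changing) since read_begin */
	return atomic_load_explicit(&n->version,
				    memory_order_acquire) != v;
}

static inline void olc_write_lock(struct olc_node *n)
{
	uint64_t v;

	for (;;) {
		v = atomic_load_explicit(&n->version,
					 memory_order_relaxed);
		if (!(v & 1) &&
		    atomic_compare_exchange_weak_explicit(&n->version,
				&v, v | 1, memory_order_acquire,
				memory_order_relaxed))
			return;
	}
}

static inline void olc_write_unlock(struct olc_node *n)
{
	/* clearing the lock bit also advances the version */
	atomic_fetch_add_explicit(&n->version, 1, memory_order_release);
}
```

A reader couples down the tree by validating the parent's version
after reading the child pointer; any failed validation restarts the
walk from the root, so only writers ever block each other.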
The userspace test harness I have ran a "thread profile" to
indicate scalability: each thread ran in a different offset range,
locked a hundred ranges, unlocked them, and then looped over
this. The btree is a single level for the first 14
locks, 2-level for up to 210 locks, and 3-level for up to 3150
locks. Hence most of this testing results in the btree being 2-3
levels and so largely removes the global root node lock as a point
of contention. It's "best case" for concurrency for an OLC btree.
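The shape of that harness loop looks roughly like the sketch below.
The range lock itself is a stand-in here (a single global mutex with
no range tracking - the btree is not shown, and the API names are
invented); the point is the per-thread lock-100/unlock-100 pattern
with each thread in its own disjoint offset window:

```c
#include <pthread.h>
#include <stdint.h>

#define NR_RANGES	100
#define RANGE_LEN	16
#define NR_LOOPS	1000

/* stand-in for the real range lock: a single global mutex */
static pthread_mutex_t stand_in_lock = PTHREAD_MUTEX_INITIALIZER;

static void range_lock(uint64_t off, uint64_t len)
{
	(void)off; (void)len;
	pthread_mutex_lock(&stand_in_lock);	/* placeholder */
	pthread_mutex_unlock(&stand_in_lock);
}

static void range_unlock(uint64_t off, uint64_t len)
{
	(void)off; (void)len;
}

struct worker {
	int		id;
	uint64_t	ops;
};

static void *lock_loop(void *arg)
{
	struct worker *w = arg;
	/* disjoint per-thread windows => the "best case" layout */
	uint64_t base = (uint64_t)w->id * NR_RANGES * RANGE_LEN;

	for (int loop = 0; loop < NR_LOOPS; loop++) {
		for (int i = 0; i < NR_RANGES; i++)
			range_lock(base + i * RANGE_LEN, RANGE_LEN);
		for (int i = 0; i < NR_RANGES; i++)
			range_unlock(base + i * RANGE_LEN, RANGE_LEN);
		w->ops += 2 * NR_RANGES;
	}
	return NULL;
}
```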
On a 16p machine:
            Range lock/unlock ops/s
threads     mutex btree   OLC btree
 1            5239442       949487
 2            1014466      1398539
 4             985940      2405275
 8             733195      3288435
16             653429      2809225
When looking at these numbers, remember that the mutex btree kernel
range lock performed a lot better than the interval tree range lock,
and they were only ~30% down on an rwsem. The mutex btree code shows
cache residency effects for the single threaded load, hence it looks
much faster than it is for occasional and multithreaded access.
However, at 2 threads (where hot CPU caches don't affect the
performance), the OLC btree is 40% faster, and at 8 threads it is
4.5x faster than the mutex btree. The OLC btree starts slowing down
at 16 threads, largely because the tree itself doesn't have enough
depth to provide the interior nodes to scale to higher concurrency
levels without contention, but it's still running at 4.5x faster
than the mutex btree....
The best part is when I run worst-case threaded workloads on the
OLC btree. If I run the same 100-lock loops, but this time change
the offsets of each thread so they interleave into adjacent records
in the btree (i.e. every thread touches every leaf), then the
performance is still pretty damn good:
            Range lock/unlock ops/s
threads     Worst case    Best case
 1            1045991       949487
 2            1530212      1398539
 4            1147099      2405275
 8            1602114      3288435
16            1731890      2809225
IOWs, performance is down and somewhat variable around tree
height changes (4 threads straddles the 2-3 level tree height
threshold), but it's still a massive improvement on the mutex_btree
and it's not going backwards as threads are added.
Concept proven.
Next steps are:
- separate the OLC btree from the XFS iext btree
implementation. It will still have a similar interface
(i.e. can't manipulate the btree records directly), but
there's sufficient difference in structure for them to be
separate implementations.
- expand records out to full 64bit extents. The iext tree
memory usage constraints no longer apply, so the record
size can go up a little bit.
- work out whether RCU read locking and kfree_rcu() will
work with the requirement to do memory allocation while
holding rcu_read_lock(). Alternative is an internal
garbage collector mechanism, kinda like I've hacked up to
simulate kfree_rcu() in userspace.
- fix all the little bugs that still exist in the code.
- Think about structural optimisations like parent pointers
to avoid costly path walks to find parents for
modifications.
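The internal garbage-collector alternative can be sketched as a
simple retire list keyed by a reader epoch: nodes are queued rather
than freed immediately, and only reclaimed once no reader from an
old enough epoch can still hold a reference. This is a
single-threaded toy with invented names, not the kernel's
kfree_rcu() and not the actual hack:

```c
#include <stdint.h>
#include <stdlib.h>

struct deferred {
	struct deferred	*next;
	uint64_t	epoch;	/* global epoch at retire time */
	void		*ptr;
};

static struct deferred *defer_list;
static uint64_t current_epoch;
static uint64_t oldest_reader_epoch = UINT64_MAX;

/* queue a node for freeing instead of freeing it immediately */
static void retire_node(void *ptr)
{
	struct deferred *d = malloc(sizeof(*d));

	d->ptr = ptr;
	d->epoch = current_epoch;
	d->next = defer_list;
	defer_list = d;
}

/* free everything retired before the oldest active reader's epoch */
static int gc_run(void)
{
	struct deferred **dp = &defer_list, *d;
	int freed = 0;

	while ((d = *dp) != NULL) {
		if (d->epoch < oldest_reader_epoch) {
			*dp = d->next;
			free(d->ptr);
			free(d);
			freed++;
		} else {
			dp = &d->next;
		}
	}
	return freed;
}
```

The open question above is exactly whether this machinery is needed
at all, or whether RCU can carry it despite the allocation-under-
rcu_read_lock() constraint.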
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com