From: Dave Chinner <david@fromorbit.com>
To: Davidlohr Bueso <dbueso@suse.com>
Cc: Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	Matthew Wilcox <willy@infradead.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Date: Sun, 21 Apr 2019 09:54:12 +1000
Message-ID: <20190420235412.GY29573@dread.disaster.area>
In-Reply-To: <1555611694.18313.12.camel@suse.com>

On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > Now the stuff I've been working on has the same interface as
> > Davidlohr's patch, so I can swap and change them without thinking
> > about it. It's still completely unoptimised, but:
> > 
> > 			IOPS read/write (direct IO)
> > processes	rwsem		DB rangelock	XFS rangelock
> >  1		78k / 78k	75k / 75k	72k / 72k
> >  2		131k / 131k	123k / 123k	133k / 133k
> >  4		267k / 267k	183k / 183k	237k / 237k
> >  8		372k / 372k	177k / 177k	265k / 265k
> >  16		315k / 315k	135k / 135k	228k / 228k
> > 
> > It's still substantially faster than the interval tree code.
> 
> In general, another big difference between range locks and rwsems
> (when comparing them with full ranges) is that the latter will do
> writer optimistic spinning, so saving a context switch under the right
> scenarios provides major wins for rwsems -- I'm not sure if this
> applies to your fio tests, though. And pretty soon readers will also do
> this, hence rwsem will become a try-hard-not-to-sleep lock.

Right, the optimistic spin-on-owner behaviour basically causes both
rwsems and mutexes to behave as spin locks under these fio workloads
until all CPUs are spinning at 100% CPU usage before they even start
to behave like a sleeping lock. Essentially, when I'm using the XFS
range locks as "full range only" rwsem-equivalent locks, they end up
behaving like rwsems in terms of spinning on the tree mutex as the
single read-lock range is shared between all concurrent readers and
so has relatively low lookup and modification cost.
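
To make that concrete, here's a minimal illustrative sketch of why the
shared full-range read record stays cheap - concurrent readers just bump
a count on the one existing record under the tree mutex rather than
inserting and removing their own range records. All the names here
(struct rl_record, rl_read_lock_full, ...) are hypothetical, not from
the patch:

/*
 * Illustration only - not the actual XFS range lock code.
 * A single shared full-range read record means readers never
 * allocate or insert; they only bump a count under the mutex.
 */
#include <pthread.h>
#include <stdint.h>

struct rl_record {
	uint64_t	start;		/* first locked granule */
	uint64_t	len;		/* number of locked granules */
	int		readers;	/* concurrent read holders */
};

struct rl_tree {
	pthread_mutex_t	lock;		/* tree-wide mutex, as described */
	struct rl_record full;		/* the shared full-range record */
};

static void rl_read_lock_full(struct rl_tree *t)
{
	pthread_mutex_lock(&t->lock);
	t->full.readers++;		/* trivial lookup, no allocation */
	pthread_mutex_unlock(&t->lock);
}

static void rl_read_unlock_full(struct rl_tree *t)
{
	pthread_mutex_lock(&t->lock);
	t->full.readers--;
	pthread_mutex_unlock(&t->lock);
}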

If I optimised the record update, it would be even closer to the
rwsem performance, but full range read locks are going to be rare.
Hell, even conflicting lock ranges are going to be rare in the IO
path, because the application is doing something very wrong if it is
doing overlapping concurrent IOs....

> One of the reasons why I was hesitant with Btrees was the fact that
> insertion requires memory allocation, something I wanted to avoid...

Yup, that's an issue, and one of the reasons why I've used a mutex
for the tree lock rather than a spin lock. But in the end, there's
no difference in CPU usage under contention, as they both act as spin
locks in these cases...

If it becomes an issue, I'll add a mempool for the tree nodes and
make sure that it is pre-populated sufficiently before starting a
multi-level tree modification.
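
For reference, a rough sketch of what that could look like with the
kernel mempool API. The names (struct rl_node, rl_node_cache,
RL_MAX_SPLIT_NODES) and the worst-case node count are assumptions of
mine, not the implementation - the point is that pre-populating the
pool bounds the worst-case multi-level split, so the modification
cannot fail on allocation once it has started:

/* Hypothetical sketch of the "mempool for tree nodes" idea. */
#include <linux/errno.h>
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/types.h>

#define RL_MAX_SPLIT_NODES	8	/* assumed worst-case split depth */

struct rl_node {
	struct rl_node	*child[4];	/* placeholder node layout */
	u64		key[4];
};

static struct kmem_cache *rl_node_cache;
static mempool_t *rl_node_pool;

static int rl_node_pool_init(void)
{
	rl_node_cache = kmem_cache_create("rl_node", sizeof(struct rl_node),
					  0, 0, NULL);
	if (!rl_node_cache)
		return -ENOMEM;

	/* pre-populate enough nodes for one worst-case multi-level split */
	rl_node_pool = mempool_create_slab_pool(RL_MAX_SPLIT_NODES,
						rl_node_cache);
	if (!rl_node_pool) {
		kmem_cache_destroy(rl_node_cache);
		return -ENOMEM;
	}
	return 0;
}

/* allocations during the tree modification then cannot fail outright */
static struct rl_node *rl_node_alloc(void)
{
	return mempool_alloc(rl_node_pool, GFP_NOFS);
}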

> per your numbers, sacrificing tree depth was the wrong choice. Thanks
> for sharing these numbers.

Yeah, anything that pointer chases through cachelines and dirties
multiple cachelines on rebalance really doesn't scale for a
fast-lookup structure. For btrees, multi-level splits and merges hurt,
but for the lookup and simple insert/remove cases the cacheline miss
cost is substantially amortised by packing multiple records per
cacheline.
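
As a rough illustration (the record shape echoes the iext records
described below, but the leaf sizing is an assumption, not the real
geometry): with 16-byte records, one 64-byte cacheline miss brings in
four records at once, whereas an rbtree or interval tree costs roughly
one miss per node visited:

/* Illustration only - sizes are assumptions, not the real geometry. */
#include <stdint.h>

struct ext_rec {
	uint64_t	lo;	/* packed start offset + state */
	uint64_t	hi;	/* packed block + length */
};				/* 16 bytes */

#define RECS_PER_LEAF	16

struct ext_leaf {
	/* 256 bytes = 4 cachelines, i.e. 4 records per cacheline */
	struct ext_rec	recs[RECS_PER_LEAF];
};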

> > BTW, if I take away the rwsem serialisation altogether, this
> > test tops out at just under 500k/500k at 8 threads, and at 16
> > threads has started dropping off (~440k/440k). So the rwsem is
> > a scalability limitation at just 8 threads....
> > 
> > /me goes off and thinks more about adding optimistic lock coupling
> > to the XFS iext btree to get rid of the need for tree-wide
> > locking altogether
> 
> I was not aware of this code.

It's relatively new, and directly tailored to the needs of caching
the XFS extent tree - it's not really a generic btree in that its
record format is the XFS on-disk extent record. i.e. it only stores
54 bits of start offset and 21 bits of length in its 16-byte records,
and the rest of the space is for the record data.

As such, I'm not really implementing a generic range lock - it is
highly tailored to the requirements and constraints of XFS
filesystem IO. e.g. using max(block size, PAGE_SIZE) granularity for
range indexing means we can support maximum IO sizes (4GB) with 21
bits of length in the record. And RANGE_LOCK_FULL is implemented as
length = MAXEXTLEN plus a hidden state bit, so it supports exclusive
locking over (start, RANGE_LOCK_FULL), and things like truncate and
EOF zeroing use this to correctly exclude IO from the range they are
working on....
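
A compile-able sketch of that encoding (the macro and function names
are mine, not the patch's, and the exact bit layout is an assumption):

/* Hypothetical encoding sketch - not the actual XFS range lock code. */
#include <stdbool.h>
#include <stdint.h>

#define RL_OFF_BITS	54
#define RL_LEN_BITS	21
#define RL_MAXLEN	((1ULL << RL_LEN_BITS) - 1)	/* MAXEXTLEN analogue */
#define RANGE_LOCK_FULL	(RL_MAXLEN + 1)			/* hidden state bit */

struct rl_range {
	uint64_t	start;	/* in granules, fits in RL_OFF_BITS */
	uint64_t	len;	/* in granules, or RANGE_LOCK_FULL */
};

static bool rl_is_full(const struct rl_range *r)
{
	return r->len & RANGE_LOCK_FULL;
}

/* ranges conflict if either is a full-range lock or they overlap */
static bool rl_conflict(const struct rl_range *a, const struct rl_range *b)
{
	if (rl_is_full(a) || rl_is_full(b))
		return true;
	return a->start < b->start + b->len && b->start < a->start + a->len;
}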

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
