linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Martin Wilck <mwilck@suse.de>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>, NeilBrown <neilb@suse.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
Date: Mon, 20 Aug 2018 22:06:56 +0200	[thread overview]
Message-ID: <7d575334f8c140071980b16ab8b1607db9eb01ef.camel@suse.de> (raw)
In-Reply-To: <20180820200223.GG5468@fieldses.org>

On Mon, 2018-08-20 at 16:02 -0400, J. Bruce Fields wrote:
> On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> > On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > > If you have a many-core machine, and have many threads all
> > > > > wanting to
> > > > > briefly lock a give file (udev is known to do this), you can
> > > > > get
> > > > > quite
> > > > > poor performance.
> > > > > 
> > > > > When one thread releases a lock, it wakes up all other
> > > > > threads
> > > > > that
> > > > > are waiting (classic thundering-herd) - one will get the lock
> > > > > and
> > > > > the
> > > > > others go to sleep.
> > > > > When you have few cores, this is not very noticeable: by the
> > > > > time
> > > > > the
> > > > > 4th or 5th thread gets enough CPU time to try to claim the
> > > > > lock,
> > > > > the
> > > > > earlier threads have claimed it, done what was needed, and
> > > > > released.
> > > > > With 50+ cores, the contention can easily be measured.
> > > > > 
> > > > > This patchset creates a tree of pending lock request in which
> > > > > siblings
> > > > > don't conflict and each lock request does conflict with its
> > > > > parent.
> > > > > When a lock is released, only requests which don't conflict
> > > > > with
> > > > > each
> > > > > other a woken.
> > > > > 
> > > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > > stable even
> > > > > as number of contending process goes to 1000.  Without this
> > > > > patch,
> > > > > locks-per-second drops off steeply after a few 10s of
> > > > > processes.
> > > > > 
> > > > > There is a small cost to this extra complexity.
> > > > > At 20 processes running a particular test on 72 cores, the
> > > > > lock
> > > > > acquisitions per second drops from 1.8 million to 1.4 million
> > > > > with
> > > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > > million
> > > > > while without this patch there are about 700,000.
> > > > > 
> > > > > NeilBrown
> > > > > 
> > > > > ---
> > > > > 
> > > > > NeilBrown (4):
> > > > >       fs/locks: rename some lists and pointers.
> > > > >       fs/locks: allow a lock request to block other requests.
> > > > >       fs/locks: change all *_conflict() functions to return
> > > > > bool.
> > > > >       fs/locks: create a tree of dependent requests.
> > > > > 
> > > > > 
> > > > >  fs/cifs/file.c                  |    2 -
> > > > >  fs/locks.c                      |  142
> > > > > +++++++++++++++++++++++++--------------
> > > > >  include/linux/fs.h              |    5 +
> > > > >  include/trace/events/filelock.h |   16 ++--
> > > > >  4 files changed, 103 insertions(+), 62 deletions(-)
> > > > > 
> > > > 
> > > > Nice work! I looked over this and I think it looks good.
> > > > 
> > > > I made an attempt to fix this issue several years ago, but my
> > > > method
> > > > sucked as it ended up penalizing the unlocking task too much.
> > > > This
> > > > is
> > > > much cleaner and should scale well overall, I think.
> > > 
> > > I think I also took a crack at this at one point while I was at
> > > UM/CITI
> > > and never got anything I was happy with.  Looks like good work!
> > > 
> > > I remember one main obstacle that I felt like I never had a good
> > > benchmark....
> > > 
> > > How did you choose this workload and hardware?  Was it in fact
> > > udev
> > > (booting a large machine?), or was there some other motivation?
> > 
> > Some details can be found here:
> > 
> > https://github.com/systemd/systemd/pull/9551
> > 
> > https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
> > and comments further down. 8667 has been superseded by 9551.
> > 
> > The original problem was that the symlink "/dev/disk/by-
> > partlabel/primary" may be claimed by _many_ devices on big systems
> > under certain distributions, which use older versions of parted for
> > partition creation on GPT disk labels. I've seen systems with
> > literally
> > thousands of contenders for this symlink. 
> > 
> > We found that with current systemd, this can cause a boot-time race
> > where the wrong device is eventually assigned the "best" contender
> > for
> > the symlink (e.g. a partition on multipath member rather than a
> > partition on the multipath map itself). I extended the udev test
> > suite,
> > creating a test that makes this race easily reproducible, at least
> > on
> > systems with many CPUs (the test host I used most had 72 cores).
> > 
> > I created an udev patch that would use systemd's built in fcntl-
> > based
> > locking to avoid this race, but I found that it would massively
> > slow
> > down the system, and found the contention to be in the spin locks
> > in
> > posix_lock_common(). (I therefore added more the systemd patches to
> > make the locking scale better, but that's irrelevant for the
> > kernel-
> > side discussion).
> > 
> > I further created an artificial test just for the scaling of
> > fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> > scaling problem easily, and do some quantitive experiments. My
> > tests
> > didn't use any byte ranges, only "full" locking of 0-byte files.
> 
> Thanks for the explanation!
> 
> I wonder whether there's also anything we could do to keep every
> waiter
> from having to take the same spinlock.

That was my line of thinking initially, but Neil's patches that just
avoid the thundering herd wakeup made both tests scale quite nicely, so
at least for my use case, no further optimizations are required.

Martin

  reply	other threads:[~2018-08-20 23:23 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-08  1:51 [PATCH 0/4] locks: avoid thundering-herd wake-ups NeilBrown
2018-08-08  1:51 ` [PATCH 1/4] fs/locks: rename some lists and pointers NeilBrown
2018-08-08 10:47   ` Jeff Layton
2018-08-08 19:07     ` J. Bruce Fields
2018-08-08  1:51 ` [PATCH 3/4] fs/locks: change all *_conflict() functions to return bool NeilBrown
2018-08-08  1:51 ` [PATCH 2/4] fs/locks: allow a lock request to block other requests NeilBrown
2018-08-08  1:51 ` [PATCH 4/4] fs/locks: create a tree of dependent requests NeilBrown
2018-08-08 16:47 ` [PATCH 0/4] locks: avoid thundering-herd wake-ups Jeff Layton
2018-08-08 18:29   ` J. Bruce Fields
2018-08-09  0:58     ` NeilBrown
2018-08-20 11:02     ` Martin Wilck
2018-08-20 20:02       ` J. Bruce Fields
2018-08-20 20:06         ` Martin Wilck [this message]
2018-08-08 19:54 ` J. Bruce Fields
2018-08-08 20:09   ` J. Bruce Fields
2018-08-08 21:15     ` Frank Filz
2018-08-08 22:34       ` NeilBrown
2018-08-08 21:28     ` J. Bruce Fields
2018-08-08 22:39       ` NeilBrown
2018-08-08 22:50       ` Jeff Layton
2018-08-08 23:34         ` Frank Filz
2018-08-09  2:52           ` NeilBrown
2018-08-09 13:00         ` J. Bruce Fields
2018-08-09 14:49           ` Jeff Layton
2018-08-09 23:56           ` NeilBrown
2018-08-10  1:05             ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7d575334f8c140071980b16ab8b1607db9eb01ef.camel@suse.de \
    --to=mwilck@suse.de \
    --cc=bfields@fieldses.org \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).