linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ian Kent <raven@themaw.net>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	Al Viro <viro@ZenIV.linux.org.uk>,
	Rick Lindsley <ricklind@linux.vnet.ibm.com>,
	David Howells <dhowells@redhat.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
Date: Wed, 17 Jun 2020 15:37:43 +0800	[thread overview]
Message-ID: <159237905950.89469.6559073274338175600.stgit@mickey.themaw.net> (raw)

For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, that are used to
install an operating system to provide services), since these can be made
up of a wide range of resources, CPU, Memory, disks, etc.

It's unclear yet whether the creation of syfs nodes for these memory
devices can be postponed or spread out over a larger amount of time.
That's because the high overhead looks to be due to notifications received
by udev which invokes a systemd program for them and attempts by systemd
folks to improve this have not focused on changing the handling of these
notifications, possibly because of difficulties with doing so. This
remains an avenue of investigation.

Kernel traces show there are many path walks with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action it appears there's only one additional lookup
for each invocation so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault with the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (non-existent
paths) to avoid continual alloc/free cycle of dentries and a read/write
semaphore introduced to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing and thinking about what else these could
be. Discussing it with Rick Lindsay, I suspect improvements will get more
difficult to implement for somewhat less improvement so I think what we
have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian


             reply	other threads:[~2020-06-17  7:37 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-17  7:37 Ian Kent [this message]
2020-06-17  7:37 ` [PATCH v2 1/6] kernfs: switch kernfs to use an rwsem Ian Kent
2020-06-17  7:37 ` [PATCH v2 2/6] kernfs: move revalidate to be near lookup Ian Kent
2020-06-17  7:37 ` [PATCH v2 3/6] kernfs: improve kernfs path resolution Ian Kent
2020-06-17  7:38 ` [PATCH v2 4/6] kernfs: use revision to identify directory node changes Ian Kent
2020-06-17  7:38 ` [PATCH v2 5/6] kernfs: refactor attr locking Ian Kent
2020-06-17  7:38 ` [PATCH v2 6/6] kernfs: make attr_mutex a local kernfs node lock Ian Kent
2020-06-19 15:38 ` [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement Tejun Heo
2020-06-19 20:41   ` Rick Lindsley
2020-06-19 22:23     ` Tejun Heo
2020-06-20  2:44       ` Rick Lindsley
2020-06-22 17:53         ` Tejun Heo
2020-06-22 21:22           ` Rick Lindsley
2020-06-23 23:13             ` Tejun Heo
2020-06-24  9:04               ` Rick Lindsley
2020-06-24  9:27                 ` Greg Kroah-Hartman
2020-06-24 13:19                 ` Tejun Heo
2020-06-25  8:15               ` Ian Kent
2020-06-25  9:43                 ` Greg Kroah-Hartman
2020-06-26  0:19                   ` Ian Kent
2020-06-21  4:55       ` Ian Kent
2020-06-22 17:48         ` Tejun Heo
2020-06-22 18:03           ` Greg Kroah-Hartman
2020-06-22 21:27             ` Rick Lindsley
2020-06-23  5:21               ` Greg Kroah-Hartman
2020-06-23  5:09             ` Ian Kent
2020-06-23  6:02               ` Greg Kroah-Hartman
2020-06-23  8:01                 ` Ian Kent
2020-06-23  8:29                   ` Ian Kent
2020-06-23 11:49                   ` Greg Kroah-Hartman
2020-06-23  9:33                 ` Rick Lindsley
2020-06-23 11:45                   ` Greg Kroah-Hartman
2020-06-23 22:55                     ` Rick Lindsley
2020-06-23 11:51                   ` Ian Kent
2020-06-21  3:21   ` Ian Kent
2020-12-10 16:44 ` Fox Chen
2020-12-11  2:01   ` [PATCH " Ian Kent
2020-12-11  2:17     ` Ian Kent
2020-12-13  3:46       ` Ian Kent
2020-12-14  6:14         ` Fox Chen
2020-12-14 13:30           ` Ian Kent
2020-12-15  8:33             ` Fox Chen
2020-12-15 12:59               ` Ian Kent
2020-12-17  4:46                 ` Ian Kent
2020-12-17  8:54                   ` Fox Chen
2020-12-17 10:09                     ` Ian Kent
2020-12-17 11:09                       ` Ian Kent
2020-12-17 11:48                         ` Ian Kent
2020-12-17 15:14                           ` Tejun Heo
2020-12-18  7:36                             ` Ian Kent
2020-12-18  8:01                               ` Fox Chen
2020-12-18 11:21                                 ` Ian Kent
2020-12-18 13:20                                   ` Fox Chen
2020-12-19  0:53                                     ` Ian Kent
2020-12-19  7:47                                       ` Fox Chen
2020-12-22  2:17                                         ` Ian Kent
2020-12-18 14:59                               ` Tejun Heo
2020-12-19  7:08                                 ` Ian Kent
2020-12-19 16:23                                   ` Tejun Heo
2020-12-19 23:52                                     ` Ian Kent
2020-12-20  1:37                                       ` Ian Kent
2020-12-21  9:28                                       ` Fox Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=159237905950.89469.6559073274338175600.stgit@mickey.themaw.net \
    --to=raven@themaw.net \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=ricklind@linux.vnet.ibm.com \
    --cc=sfr@canb.auug.org.au \
    --cc=tj@kernel.org \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).