From: Rick Lindsley <ricklind@linux.vnet.ibm.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Ian Kent <raven@themaw.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>, Tejun Heo <tj@kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	David Howells <dhowells@redhat.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/4] kernfs: proposed locking and concurrency improvement
Date: Wed, 27 May 2020 05:44:09 -0700
Message-ID: <1d185eb3-8a85-9138-9277-92400ba03e0a@linux.vnet.ibm.com>
In-Reply-To: <20200525061616.GA57080@kroah.com>

On 5/24/20 11:16 PM, Greg Kroah-Hartman wrote:

> Independent of your kernfs changes, why do we really need to represent
> all of this memory with that many different "memory objects"?  What is
> that providing to userspace?
> 
> I remember Ben Herrenschmidt did a lot of work on some of the kernfs and
> other functions to make large-memory systems boot faster to remove some
> of the complexity in our functions, but that too did not look into why
> we needed to create so many objects in the first place.

That was my first choice too.  Unfortunately, I was not consulted on that design decision, and now it's out there.  It is, as you guessed, a hardware "feature": the hardware believes there is value in identifying memory in 256MB chunks.  There are, unfortunately, 2^18 of those (262,144) on a 64TB system, compared with dozens or maybe hundreds of other devices.
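For concreteness, the arithmetic (assuming 256MB blocks across the full 64TB):

    64 TiB / 256 MiB  =  2^46 / 2^28  =  2^18  =  262,144 memory blocks

and each block gets its own sysfs directory, along the lines of
/sys/devices/system/memory/memory0 through .../memory262143, each with its
own kobject, kernfs node and uevent at boot.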

We considered a revamping of the boot process - delaying some devices, reordering operations and such - but deemed that more dangerous to other architectures.  Although this change is driven by a particular architecture, the changes we've identified are architecture independent.  The risk of breaking something else is much lower than if we start reordering boot steps.

> Also, why do you need to create the devices _when_ you create them?  Can
> you wait until after init is up and running to start populating the
> device tree with them?  That way boot can be moving on and disks can be
> spinning up earlier?

I'm not a systemd expert, unfortunately, so I don't know if it needs to happen *right* then or not.  I do know that upon successful boot, a ps reveals many systemd children still reporting in.  It's not that we're waiting on everybody; the contention delays the discovery of key devices like disks, and *that* trips timeouts in the systemd/udev rules.  Any workaround that merely dodges the problem tends to get worse again as the numbers grow: we first hit this at 32TB and papered over it with larger timeouts and udev options, only to have those same adjustments stop working consistently at 64TB.
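For reference, these are the sorts of knobs we were turning; the values below are only illustrative, not what we actually shipped:

    # /etc/udev/udev.conf - let workers run longer before udevd kills them
    # (the default is 180 seconds)
    event_timeout=600

    # or equivalently on the kernel command line
    udev.event_timeout=600

Raising those buys time, but the right value keeps moving with the size of the machine, which is exactly the treadmill we want to get off.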

There are two times we do coldplugs - once in the initramfs, and then again after we switch over to the actual root.  I did try omitting memory devices from that second coldplug.  Much faster!  So why is the second coldplug necessary at all?  Are there architectures that need it?  I've not found anyone who can answer that, so going that route presents us with a different big risk.
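To be explicit about what I mean by a coldplug, it's the usual uevent replay, roughly:

    udevadm trigger --type=subsystems --action=add
    udevadm trigger --type=devices --action=add
    udevadm settle

and the experiment above amounted to filtering memory out of that second replay, something like:

    udevadm trigger --type=devices --action=add --subsystem-nomatch=memory

I'm quoting those options from memory, so treat them as a sketch rather than the exact invocation our init scripts use.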

Rick



Thread overview: 17+ messages
2020-05-25  5:46 [PATCH 0/4] kernfs: proposed locking and concurrency improvement Ian Kent
2020-05-25  5:47 ` [PATCH 1/4] kernfs: switch kernfs to use an rwsem Ian Kent
2020-06-06 15:52   ` [kernfs] ea7c5fc39a: stress-ng.stream.ops_per_sec 11827.2% improvement kernel test robot
2020-06-06 18:18     ` Greg Kroah-Hartman
2020-06-07  1:13       ` Ian Kent
2020-06-11  2:06         ` kernel test robot
2020-06-11  2:20           ` Rick Lindsley
2020-06-11  3:02           ` Ian Kent
2020-06-07  8:40   ` [PATCH 1/4] kernfs: switch kernfs to use an rwsem Ian Kent
2020-06-08  9:58     ` Ian Kent
2020-05-25  5:47 ` [PATCH 2/4] kernfs: move revalidate to be near lookup Ian Kent
2020-05-25  5:47 ` [PATCH 3/4] kernfs: improve kernfs path resolution Ian Kent
2020-05-25  5:47 ` [PATCH 4/4] kernfs: use revision to identify directory node changes Ian Kent
2020-05-25  6:16 ` [PATCH 0/4] kernfs: proposed locking and concurrency improvement Greg Kroah-Hartman
2020-05-25  7:23   ` Ian Kent
2020-05-25  7:31     ` Greg Kroah-Hartman
2020-05-27 12:44   ` Rick Lindsley [this message]
