From: Daniel Jordan <>
Cc: Dan Williams <>,
	Dave Hansen <>,
	Tim Chen <>,
	Mike Kravetz <>,
	Herbert Xu <>,
	Steffen Klassert <>,
	Tejun Heo <>, Peter Zijlstra <>,
	Alex Williamson <>,
	Daniel Jordan <>
Subject: [LSF/MM/BPF TOPIC] kernel multithreading with padata
Date: Wed, 12 Feb 2020 17:47:31 -0500	[thread overview]
Message-ID: <> (raw)

padata has been undergoing some surgery over the last year[0] and now seems
ready for another enhancement: splitting up and multithreading CPU-intensive
kernel work.

Quoting from an earlier series[1], the problem I'm trying to solve is

  A single CPU can spend an excessive amount of time in the kernel operating
  on large amounts of data.  Often these situations arise during initialization-
  and destruction-related tasks, where the data involved scales with system
  size.  These long-running jobs can slow startup and shutdown of applications
  and the system itself while extra CPUs sit idle.

Here are the current consumers:

 - struct page init (boot, hotplug, pmem)
 - VFIO page pinning (kvm guest init)
 - fallocating a hugetlb file (database shared memory init)

On a large-memory server, DRAM page init is ~23% of kernel boot (3.5s/15.2s),
and it takes over a minute to start a VFIO-enabled kvm guest or fallocate a
hugetlb file when either occupies a significant fraction of memory.  This work
results in 7-20x speedups and is currently increasing the uptime of our
production systems.
Future areas include munmap/exit, umount, and __ib_umem_release.  Some of these
need coarse locks broken up for multithreading (zone->lock, lru_lock).

Positive outcomes for the session would be...

 - Finding a strategy for capping the maximum number of threads in a job.

 - Agreeing on a way for the job's threads to respect resource controls.

   In the past few weeks I've been thinking about whether remote charging
   in the CPU controller is feasible (RFD to come); I'm also considering
   creating workqueue workers directly in cgroup-specific pools instead, and
   have previously proposed migrating workers in and out of cgroups[2].
   There's also memory policy and sched_setaffinity() to think about.

 - Checking the overall design of this thing with the mm community, given that
   current users are all mm-related.

 - Getting advice from others (hallway track) on why some pmem devices
   perform better than others under multithreading.

This work-in-progress branch shows what it looks like now.

    git:// padata-mt-wip-v0.2;a=shortlog;h=refs/heads/padata-mt-wip-v0.2


Thread overview: 3+ messages
2020-02-12 22:47 Daniel Jordan [this message]
2020-02-12 23:31 ` [LSF/MM/BPF TOPIC] kernel multithreading with padata Jason Gunthorpe
2020-02-13 16:13   ` Daniel Jordan
