[RFC 00/16] padata, vfio, sched: Multithreaded VFIO page pinning

From: Daniel Jordan @ 2022-01-06  0:46 UTC
  To: Alexander Duyck, Alex Williamson, Andrew Morton, Ben Segall,
	Cornelia Huck, Dan Williams, Dave Hansen, Dietmar Eggemann,
	Herbert Xu, Ingo Molnar, Jason Gunthorpe, Johannes Weiner,
	Josh Triplett, Michal Hocko, Nico Pache, Pasha Tatashin,
	Peter Zijlstra, Steffen Klassert, Steve Sistare, Tejun Heo,
	Tim Chen, Vincent Guittot
  Cc: linux-mm, kvm, linux-kernel, linux-crypto, Daniel Jordan

Here's phase two of padata multithreaded jobs, which multithreads VFIO page
pinning and lays the groundwork for other padata users.  It's RFC because there
are still pieces missing and testing to do, and because of the last two
patches, which I'm hoping scheduler and cgroup folks can weigh in on.  Any and
all feedback is welcome.

---

Assigning a VFIO device to a guest requires pinning each and every page of the
guest's memory, which gets expensive for large guests even if the memory has
already been faulted in and cleared with something like qemu prealloc.

Some recent optimizations[0][1] have brought the cost down, but it's still a
significant bottleneck for guest initialization time.  Parallelize with padata
to take proper advantage of memory bandwidth, yielding up to 12x speedups for
VFIO page pinning and 10x speedups for overall qemu guest initialization.
Detailed performance results are in patch 8.

Phase one[4] of multithreaded jobs made deferred struct page init use all the
CPUs on x86.  That's a special case because it happens during boot when the
machine is waiting on page init to finish and there are generally no resource
controls to violate.

Page pinning, on the other hand, can be done by a user task (the "main thread"
in a job), so helper threads should honor the main thread's resource controls
that are relevant for pinning (CPU, memory) and give priority to other tasks on
the system.  This RFC has some but not all of the pieces to do that.

After this phase, it shouldn't take many lines to parallelize other
memory-proportional paths like struct page init for memory hotplug, munmap(),
hugetlb_fallocate(), and __ib_umem_release().

The first half of this series (more or less) has been running in our kernels
for about three years.

Changelog
---------

This addresses some comments on two earlier projects, ktask[2] and
cgroup-aware workqueues[3].

 - Fix undoing a partially completed chunk in the thread function, and use a
   larger minimum chunk size (Alex Williamson)

 - Helper threads should honor the main thread's settings and resource controls,
   and shouldn't disturb other tasks (Michal Hocko, Pavel Machek)

 - Design comments, lockdep awareness (Peter Zijlstra, Jason Gunthorpe)

 - Implement remote charging in the CPU controller (Tejun Heo)

Series Rundown
--------------

     1  padata: Remove __init from multithreading functions
     2  padata: Return first error from a job
     3  padata: Add undo support
     4  padata: Detect deadlocks between main and helper threads

Get ready to parallelize.  In particular, pinning can fail, so make jobs
undo-able.
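
To illustrate the undo idea, here is a rough user-space model, not code from
the series.  It assumes a job is split into fixed-size chunks, that the thread
function cleans up its own partially completed chunk before returning an
error, and that the framework then calls an undo function over the chunks that
did complete.  All names (model_job, model_pin, model_run_job, fail_at) are
made up for the example, and chunks run serially here, whereas real helpers
finish chunks in parallel and out of order.

```c
#include <errno.h>
#include <stddef.h>

#define NPAGES 64

static int pinned[NPAGES];        /* 1 if the pretend page is pinned */
static size_t fail_at = NPAGES;   /* page index at which pinning fails */

/* Hypothetical job description: a range, a chunk size, and two callbacks. */
struct model_job {
	size_t start, size, chunk;
	int (*thread_fn)(size_t start, size_t end);
	void (*undo_fn)(size_t start, size_t end);
};

static void model_unpin(size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		pinned[i] = 0;
}

/* Pin [start, end).  On failure, unpin what this call pinned so far. */
static int model_pin(size_t start, size_t end)
{
	for (size_t i = start; i < end; i++) {
		if (i == fail_at) {
			model_unpin(start, i);
			return -ENOMEM;
		}
		pinned[i] = 1;
	}
	return 0;
}

/* Run chunks in order; on the first error, undo all completed chunks. */
static int model_run_job(const struct model_job *job)
{
	size_t end = job->start + job->size;

	if (job->chunk == 0)
		return -EINVAL;

	for (size_t pos = job->start; pos < end; pos += job->chunk) {
		size_t chunk_end = pos + job->chunk < end ? pos + job->chunk : end;
		int err = job->thread_fn(pos, chunk_end);

		if (err) {
			job->undo_fn(job->start, pos);
			return err;
		}
	}
	return 0;
}

static size_t model_count_pinned(void)
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		n += pinned[i];
	return n;
}
```

With fail_at set inside the third chunk, model_run_job() returns the error and
leaves nothing pinned: the failing chunk is cleaned up by the thread function
and the two earlier chunks by the undo callback.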

     5  vfio/type1: Pass mm to vfio_pin_pages_remote()
     6  vfio/type1: Refactor dma map removal
     7  vfio/type1: Parallelize vfio_pin_map_dma()
     8  vfio/type1: Cache locked_vm to ease mmap_lock contention

Do the parallelization itself.

     9  padata: Use kthreads in do_multithreaded
    10  padata: Helpers should respect main thread's CPU affinity
    11  padata: Cap helpers started to online CPUs
    12  sched, padata: Bound max threads with max_cfs_bandwidth_cpus()

Put caps on the number of helpers started according to the main thread's CPU
affinity, the system's online CPU count, and the main thread's CFS bandwidth
settings.
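
As a rough illustration of how these caps could combine, here is a user-space
sketch rather than the code in the patches.  It assumes the CFS bandwidth
bound is quota divided by period, rounded up, and that a value of 0 means "no
limit".  The names model_cfs_bandwidth_cpus(), model_max_threads(), and
QUOTA_UNLIMITED are invented for the example, and the rounding is a guess.

```c
#include <stdint.h>

#define QUOTA_UNLIMITED (-1LL)	/* hypothetical marker: no CFS quota set */

/*
 * Assumed model of the CFS bandwidth bound: how many CPUs the task group
 * can keep busy for a full period.  Returns 0 when there is no limit.
 */
static unsigned int model_cfs_bandwidth_cpus(int64_t quota_us,
					     int64_t period_us)
{
	if (quota_us == QUOTA_UNLIMITED || period_us <= 0)
		return 0;
	return (unsigned int)((quota_us + period_us - 1) / period_us);
}

/* Minimum of two caps, where 0 means "this cap does not apply". */
static unsigned int min_nonzero(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Apply the affinity, online CPU, and CFS bandwidth caps in turn. */
static unsigned int model_max_threads(unsigned int requested,
				      unsigned int affinity_cpus,
				      unsigned int online_cpus,
				      int64_t quota_us, int64_t period_us)
{
	unsigned int n = requested;

	n = min_nonzero(n, affinity_cpus);
	n = min_nonzero(n, online_cpus);
	n = min_nonzero(n, model_cfs_bandwidth_cpus(quota_us, period_us));
	return n ? n : 1;	/* always allow at least one thread */
}
```

For example, a job asking for 16 threads from a main thread allowed on 8 of 64
online CPUs, in a group with a 250ms quota per 100ms period, would be capped at
3 threads under these assumptions.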

    13  padata: Run helper threads at MAX_NICE
    14  padata: Nice helper threads one by one to prevent starvation

Prevent helpers from taking CPU away unfairly from other tasks for the sake of
an optimized kernel code path.

    15  sched/fair: Account kthread runtime debt for CFS bandwidth
    16  sched/fair: Consider kthread debt in cputime

A prototype for remote charging in CFS bandwidth and cpu.stat, described more
in the next section.  It's debatable whether these last two are required for
this series.  Patch 12 caps the number of helper threads started according to
the max effective CPUs allowed by the quota and period of the main thread's
task group.  In practice, I think this hits the sweet spot between complexity
and respecting CFS bandwidth limits so that patch 15 might just be dropped.
For instance, when running qemu with a vfio device, the restriction from patch
12 was enough to avoid the helpers breaching CFS bandwidth limits.  That leaves
patch 16, which on its own seems overkill for all the hunks it would require
from patch 15, so it could be dropped too.

Patch 12 isn't airtight, though, since other tasks running in the task group
alongside the main thread and helpers could still result in overage.  So,
patches 15-16 give an idea of what absolutely correct accounting in the CPU
controller might look like in case there are real situations that want it.

Remote Charging in the CPU Controller
-------------------------------------

CPU-intensive kthreads aren't generally accounted in the CPU controller, so
they escape settings such as weight and bandwidth when they do work on behalf
of a task group.

This problem arises with multithreaded jobs, but is also an issue in other
places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
accounted to the cgroup that the memory belongs to, and similarly CPU activity
from net rx should be accounted to the task groups that correspond to the
packets being received.  There are also vague complaints from Android[6].

Each use case has its own requirements[7].  In padata and reclaim, the task
group to account to is known ahead of time, but net rx has to spend cycles
processing a packet before its destination task group is known, so any solution
should be able to work without knowing the task group in advance.  Furthermore,
the CPU controller shouldn't throttle reclaim or net rx in real time since both
are doing high priority work.  These make approaches that run kthreads directly
in a task group, like cgroup-aware workqueues[8] or a kernel path for
CLONE_INTO_CGROUP, infeasible.  Running kthreads directly in cgroups also has a
downside for padata because helpers' MAX_NICE priority is "shadowed" by the
priority of the group entities they're running under.

The proposed solution of remote charging can accrue debt to a task group to be
paid off or forgiven later, addressing all these issues.  A kthread calls the
interface

    void cpu_cgroup_remote_begin(struct task_struct *p,
                                 struct cgroup_subsys_state *css);

to begin remote charging to @css, causing @p's current sum_exec_runtime to be
updated and saved.  The @css arg isn't required and can be removed later to
facilitate the unknown cgroup case mentioned above.  Then the kthread calls
another interface

    void cpu_cgroup_remote_charge(struct task_struct *p,
                                  struct cgroup_subsys_state *css);

to account the sum_exec_runtime that @p has used since the first call.
Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
debt that's only used when the debt exceeds the quota in the current period.
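
The begin/charge pairing can be modeled in user space as below.  This is a toy
sketch, not the implementation in patches 15-16: the structures are stand-ins
for the kernel's, the css argument is dropped, and the way debt is paid off at
the start of the next period is purely an assumption made for illustration.
Only the debt field name comes from the description above; everything else
(model_task, model_cfs_bandwidth, model_period_refill, and so on) is
hypothetical.

```c
#include <stdint.h>

/* Toy stand-in for the bandwidth state of one task group. */
struct model_cfs_bandwidth {
	uint64_t quota;		/* runtime granted per period (ns) */
	uint64_t runtime;	/* runtime left in this period (ns) */
	uint64_t debt;		/* remote charges not yet paid (ns) */
};

/* Toy stand-in for a kthread doing work on behalf of the group. */
struct model_task {
	uint64_t sum_exec_runtime;	/* total CPU time used (ns) */
	uint64_t remote_start;		/* snapshot taken at begin */
};

/* Start remote charging: remember how much the task has run so far. */
static void model_remote_begin(struct model_task *p)
{
	p->remote_start = p->sum_exec_runtime;
}

/* Charge the runtime used since begin (or since the last charge). */
static void model_remote_charge(struct model_task *p,
				struct model_cfs_bandwidth *b)
{
	uint64_t delta = p->sum_exec_runtime - p->remote_start;

	p->remote_start = p->sum_exec_runtime;
	if (delta <= b->runtime) {
		b->runtime -= delta;
	} else {
		/* Assumed policy: what the period can't cover becomes debt. */
		b->debt += delta - b->runtime;
		b->runtime = 0;
	}
}

/* Assumed policy: a new period pays down debt before granting runtime. */
static void model_period_refill(struct model_cfs_bandwidth *b)
{
	uint64_t pay = b->debt < b->quota ? b->debt : b->quota;

	b->debt -= pay;
	b->runtime = b->quota - pay;
}
```

In this model the kthread itself is never throttled.  A charge larger than the
remaining runtime simply leaves the group with less to spend in the following
period.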

Weight-based control isn't implemented for now since padata helpers run at
MAX_NICE and so always yield to anything higher priority, meaning they would
rarely compete with other task groups.

[ We have another use case for remote charging: implementing CFS bandwidth
  control across multiple machines.  This is an entirely different topic
  that deserves its own thread. ]

TODO
----

 - Honor these other resource controls:
    - Memory controller limits for helpers via active_memcg.  I *think* this
      will turn out to be necessary despite helpers using the main thread's mm,
      but I need to look into it more.
    - cpuset.mems
    - NUMA memory policy

 - Make helpers aware of signals sent to the main thread

 - Test test test

Series based on 5.14.  I had to downgrade from 5.15 because of an Intel IOMMU
bug that's since been fixed.

thanks,
Daniel

[0] https://lore.kernel.org/linux-mm/20210128182632.24562-1-joao.m.martins@oracle.com
[1] https://lore.kernel.org/lkml/20210219161305.36522-1-daniel.m.jordan@oracle.com/
[2] https://x-lore.kernel.org/all/20181105165558.11698-1-daniel.m.jordan@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190605133650.28545-1-daniel.m.jordan@oracle.com/
[4] https://x-lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
[5] https://x-lore.kernel.org/all/20200219181219.54356-1-hannes@cmpxchg.org/
[6] https://x-lore.kernel.org/all/20210407013856.GC21941@codeaurora.org/
[7] https://x-lore.kernel.org/all/20200219214112.4kt573kyzbvmbvn3@ca-dmjordan1.us.oracle.com/
[8] https://x-lore.kernel.org/all/20190605133650.28545-1-daniel.m.jordan@oracle.com/

Daniel Jordan (16):
  padata: Remove __init from multithreading functions
  padata: Return first error from a job
  padata: Add undo support
  padata: Detect deadlocks between main and helper threads
  vfio/type1: Pass mm to vfio_pin_pages_remote()
  vfio/type1: Refactor dma map removal
  vfio/type1: Parallelize vfio_pin_map_dma()
  vfio/type1: Cache locked_vm to ease mmap_lock contention
  padata: Use kthreads in do_multithreaded
  padata: Helpers should respect main thread's CPU affinity
  padata: Cap helpers started to online CPUs
  sched, padata: Bound max threads with max_cfs_bandwidth_cpus()
  padata: Run helper threads at MAX_NICE
  padata: Nice helper threads one by one to prevent starvation
  sched/fair: Account kthread runtime debt for CFS bandwidth
  sched/fair: Consider kthread debt in cputime

 drivers/vfio/Kconfig            |   1 +
 drivers/vfio/vfio_iommu_type1.c | 170 ++++++++++++++---
 include/linux/padata.h          |  31 +++-
 include/linux/sched.h           |   2 +
 include/linux/sched/cgroup.h    |  37 ++++
 kernel/padata.c                 | 311 +++++++++++++++++++++++++-------
 kernel/sched/core.c             |  58 ++++++
 kernel/sched/fair.c             |  99 +++++++++-
 kernel/sched/sched.h            |   5 +
 mm/page_alloc.c                 |   4 +-
 10 files changed, 620 insertions(+), 98 deletions(-)
 create mode 100644 include/linux/sched/cgroup.h

base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
-- 
2.34.1
