mm-commits.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: adobriyan@gmail.com, akpm@linux-foundation.org,
	konrad.wilk@oracle.com, linux-mm@kvack.org,
	mm-commits@vger.kernel.org, stephen.s.brennan@oracle.com,
	torvalds@linux-foundation.org, viro@zeniv.linux.org.uk,
	willy@infradead.org
Subject: [patch 13/87] proc: allow pid_revalidate() during LOOKUP_RCU
Date: Mon, 08 Nov 2021 18:32:05 -0800	[thread overview]
Message-ID: <20211109023205.T0hV91JFt%akpm@linux-foundation.org> (raw)
In-Reply-To: <20211108183057.809e428e841088b657a975ec@linux-foundation.org>

From: Stephen Brennan <stephen.s.brennan@oracle.com>
Subject: proc: allow pid_revalidate() during LOOKUP_RCU

Problem Description:

When running running ~128 parallel instances of "TZ=/etc/localtime ps -fe
>/dev/null" on a 128CPU machine, the %sys utilization reaches 97%, and
perf shows the following code path as being responsible for heavy
contention on the d_lockref spinlock:

      walk_component()
        lookup_fast()
          d_revalidate()
            pid_revalidate() // returns -ECHILD
          unlazy_child()
            lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention

The reason is that pid_revalidate() is triggering a drop from RCU to ref
path walk mode.  All concurrent path lookups thus try to grab a reference
to the dentry for /proc/, before re-executing pid_revalidate() and then
stepping into the /proc/$pid directory.  Thus there is huge spinlock
contention.  This patch allows pid_revalidate() to execute in RCU mode,
meaning that the path lookup can successfully enter the /proc/$pid
directory while still in RCU mode.  Later on, the path lookup may still
drop into ref mode, but the contention will be much reduced at this point.

By applying this patch, %sys utilization falls to around 85% under the
same workload, and the number of ps processes executed per unit time
increases by 3x-4x.  Although this particular workload is a bit contrived,
we have seen some large collections of eager monitoring scripts which
produced similarly high %sys time due to contention in the /proc
directory.

As a result this patch, Al noted that several procfs methods which were
only called in ref-walk mode could now be called from RCU mode.  To ensure
that this patch is safe, I audited all the inode get_link and permission()
implementations, as well as dentry d_revalidate() implementations, in
fs/proc.  The purpose here is to ensure that they either are safe to call
in RCU (i.e.  don't sleep) or correctly bail out of RCU mode if they don't
support it.  My analysis shows that all at-risk procfs methods are safe to
call under RCU, and thus this patch is safe.

Procfs RCU-walk Analysis:

This analysis is up-to-date with 5.15-rc3.  When called under RCU mode,
these functions have arguments as follows:

* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
  from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.

For the following functions, either they are trivially RCU safe, or they
explicitly bail at the beginning of the function when they run:

proc_ns_get_link       (bails out)
proc_get_link          (RCU safe)
proc_pid_get_link      (bails out)
map_files_d_revalidate (bails out)
map_misc_d_revalidate  (bails out)
proc_net_d_revalidate  (RCU safe)
proc_sys_revalidate    (bails out, also not under /proc/$pid)
tid_fd_revalidate      (bails out)
proc_sys_permission    (not under /proc/$pid)

The remainder of the functions require a bit more detail:

* proc_fd_permission: RCU safe. All of the body of this function is
  under rcu_read_lock(), except generic_permission() which declares
  itself RCU safe in its documentation string.
* proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
  and otherwise looks safe. The same is true of proc_thread_self_get_link.
* proc_map_files_get_link: calls ns_capable, which calls capable(), and
  thus calls into the audit code (see note #1 below). The remainder is
  just a call to the trivially safe proc_pid_get_link().
* proc_pid_permission: calls ptrace_may_access(), which appears RCU
  safe, although it does call into the "security_ptrace_access_check()"
  hook, which looks safe under smack and selinux. Just the audit code is
  of concern. Also uses get_task_struct() and put_task_struct(), see
  note #2 below.
* proc_tid_comm_permission: Appears safe, though calls put_task_struct
  (see note #2 below).

Note #1:
  Most of the concern of RCU safety has centered around the audit code.
  However, since b17ec22fb339 ("selinux: slow_avc_audit has become
  non-blocking"), it's safe to call this code under RCU. So all of the
  above are safe by my estimation.

Note #2: get_task_struct() and put_task_struct():
  The majority of get_task_struct() is under RCU read lock, and in any
  case it is a simple increment. But put_task_struct() is complex, given
  that it could at some point free the task struct, and this process has
  many steps which I couldn't manually verify. However, several other
  places call put_task_struct() under RCU, so it appears safe to use
  here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296)

Patch description:

pid_revalidate() drops from RCU into REF lookup mode.  When many threads
are resolving paths within /proc in parallel, this can result in heavy
spinlock contention on d_lockref as each thread tries to grab a reference
to the /proc dentry (and drop it shortly thereafter).

Investigation indicates that it is not necessary to drop RCU in
pid_revalidate(), as no RCU data is modified and the function never
sleeps.  So, remove the LOOKUP_RCU check.

Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

--- a/fs/proc/base.c~proc-allow-pid_revalidate-during-lookup_rcu
+++ a/fs/proc/base.c
@@ -1979,19 +1979,21 @@ static int pid_revalidate(struct dentry
 {
 	struct inode *inode;
 	struct task_struct *task;
+	int ret = 0;
 
-	if (flags & LOOKUP_RCU)
-		return -ECHILD;
-
-	inode = d_inode(dentry);
-	task = get_proc_task(inode);
+	rcu_read_lock();
+	inode = d_inode_rcu(dentry);
+	if (!inode)
+		goto out;
+	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 
 	if (task) {
 		pid_update_inode(task, inode);
-		put_task_struct(task);
-		return 1;
+		ret = 1;
 	}
-	return 0;
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static inline bool proc_inode_is_dead(struct inode *inode)
_

  parent reply	other threads:[~2021-11-09  2:32 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-09  2:30 incoming Andrew Morton
2021-11-09  2:31 ` [patch 01/87] vfs: keep inodes with page cache off the inode shrinker LRU Andrew Morton
2021-11-09  2:31 ` [patch 02/87] mm,hugetlb: remove mlock ulimit for SHM_HUGETLB Andrew Morton
2021-11-09  2:31 ` [patch 03/87] procfs: do not list TID 0 in /proc/<pid>/task Andrew Morton
2021-11-09  2:31 ` [patch 04/87] x86/xen: update xen_oldmem_pfn_is_ram() documentation Andrew Morton
2021-11-09  2:31 ` [patch 05/87] x86/xen: simplify xen_oldmem_pfn_is_ram() Andrew Morton
2021-11-09  2:31 ` [patch 06/87] x86/xen: print a warning when HVMOP_get_mem_type fails Andrew Morton
2021-11-09  2:31 ` [patch 07/87] proc/vmcore: let pfn_is_ram() return a bool Andrew Morton
2021-11-09  2:31 ` [patch 08/87] proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks Andrew Morton
2021-11-09  3:59   ` Dave Young
2021-11-09  6:40     ` David Hildenbrand
2021-11-10  7:22   ` Baoquan He
2021-11-10  8:10     ` David Hildenbrand
2021-11-10 11:11       ` Dave Young
2021-11-10 11:21         ` David Hildenbrand
2021-11-10 11:28           ` Dave Young
2021-11-10 12:05             ` David Hildenbrand
2021-11-09  2:31 ` [patch 09/87] virtio-mem: factor out hotplug specifics from virtio_mem_init() into virtio_mem_init_hotplug() Andrew Morton
2021-11-09  2:31 ` [patch 10/87] virtio-mem: factor out hotplug specifics from virtio_mem_probe() " Andrew Morton
2021-11-09  2:31 ` [patch 11/87] virtio-mem: factor out hotplug specifics from virtio_mem_remove() into virtio_mem_deinit_hotplug() Andrew Morton
2021-11-09  2:32 ` [patch 12/87] virtio-mem: kdump mode to sanitize /proc/vmcore access Andrew Morton
2021-11-09  2:32 ` Andrew Morton [this message]
2021-11-09  2:32 ` [patch 14/87] kernel.h: drop unneeded <linux/kernel.h> inclusion from other headers Andrew Morton
2021-11-09  2:32 ` [patch 15/87] kernel.h: split out container_of() and typeof_member() macros Andrew Morton
2021-11-09  2:32 ` [patch 16/87] include/kunit/test.h: replace kernel.h with the necessary inclusions Andrew Morton
2021-11-09  2:32 ` [patch 17/87] include/linux/list.h: " Andrew Morton
2021-11-09  2:32 ` [patch 18/87] include/linux/llist.h: " Andrew Morton
2021-11-09  2:32 ` [patch 19/87] include/linux/plist.h: " Andrew Morton
2021-11-09  2:32 ` [patch 20/87] include/media/media-entity.h: " Andrew Morton
2021-11-09  2:32 ` [patch 21/87] include/linux/delay.h: " Andrew Morton
2021-11-09  2:32 ` [patch 22/87] include/linux/sbitmap.h: " Andrew Morton
2021-11-09  2:32 ` [patch 23/87] include/linux/radix-tree.h: " Andrew Morton
2021-11-09  2:32 ` [patch 24/87] include/linux/generic-radix-tree.h: " Andrew Morton
2021-11-09  2:32 ` [patch 25/87] kernel.h: split out instruction pointer accessors Andrew Morton
2021-11-09  2:32 ` [patch 26/87] linux/container_of.h: switch to static_assert Andrew Morton
2021-11-09  2:32 ` [patch 27/87] mailmap: update email address for Colin King Andrew Morton
2021-11-09  2:32 ` [patch 28/87] MAINTAINERS: add "exec & binfmt" section with myself and Eric Andrew Morton
2021-11-09  2:32 ` [patch 29/87] MAINTAINERS: rectify entry for ARM/TOSHIBA VISCONTI ARCHITECTURE Andrew Morton
2021-11-09  2:32 ` [patch 30/87] MAINTAINERS: rectify entry for HIKEY960 ONBOARD USB GPIO HUB DRIVER Andrew Morton
2021-11-09  2:33 ` [patch 31/87] MAINTAINERS: rectify entry for INTEL KEEM BAY DRM DRIVER Andrew Morton
2021-11-09  2:33 ` [patch 32/87] MAINTAINERS: rectify entry for ALLWINNER HARDWARE SPINLOCK SUPPORT Andrew Morton
2021-11-09  2:33 ` [patch 33/87] lib, stackdepot: check stackdepot handle before accessing slabs Andrew Morton
2021-11-09  2:33 ` [patch 34/87] lib, stackdepot: add helper to print stack entries Andrew Morton
2021-11-09  2:33 ` [patch 35/87] lib, stackdepot: add helper to print stack entries into buffer Andrew Morton
2021-11-09  2:33 ` [patch 36/87] include/linux/string_helpers.h: add linux/string.h for strlen() Andrew Morton
2021-11-09  2:33 ` [patch 37/87] lib: uninline simple_strntoull() as well Andrew Morton
2021-11-09  2:33 ` [patch 38/87] mm/scatterlist: replace the !preemptible warning in sg_miter_stop() Andrew Morton
2021-11-09  2:33 ` [patch 39/87] const_structs.checkpatch: add a few sound ops structs Andrew Morton
2021-11-09  2:33 ` [patch 40/87] checkpatch: improve EXPORT_SYMBOL test for EXPORT_SYMBOL_NS uses Andrew Morton
2021-11-09  2:33 ` [patch 41/87] checkpatch: get default codespell dictionary path from package location Andrew Morton
2021-11-09  2:33 ` [patch 42/87] binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE Andrew Morton
2021-11-09  2:33 ` [patch 43/87] ELF: simplify STACK_ALLOC macro Andrew Morton
2021-11-09  2:33 ` [patch 44/87] kallsyms: remove arch specific text and data check Andrew Morton
2021-11-09  2:33 ` [patch 45/87] kallsyms: fix address-checks for kernel related range Andrew Morton
2021-11-09  2:33 ` [patch 46/87] sections: move and rename core_kernel_data() to is_kernel_core_data() Andrew Morton
2021-11-09  2:33 ` [patch 47/87] sections: move is_kernel_inittext() into sections.h Andrew Morton
2021-11-09  2:33 ` [patch 48/87] x86: mm: rename __is_kernel_text() to is_x86_32_kernel_text() Andrew Morton
2021-11-09  2:34 ` [patch 49/87] sections: provide internal __is_kernel() and __is_kernel_text() helper Andrew Morton
2021-11-09  2:34 ` [patch 50/87] mm: kasan: use is_kernel() helper Andrew Morton
2021-11-09  2:34 ` [patch 51/87] extable: use is_kernel_text() helper Andrew Morton
2021-11-09  2:34 ` [patch 52/87] powerpc/mm: use core_kernel_text() helper Andrew Morton
2021-11-09  2:34 ` [patch 53/87] microblaze: use is_kernel_text() helper Andrew Morton
2021-11-09  2:34 ` [patch 54/87] alpha: " Andrew Morton
2021-11-09  2:34 ` [patch 55/87] ramfs: fix mount source show for ramfs Andrew Morton
2021-11-09  2:34 ` [patch 56/87] init: make unknown command line param message clearer Andrew Morton
2021-11-09  2:34 ` [patch 57/87] coda: avoid NULL pointer dereference from a bad inode Andrew Morton
2021-11-09  2:34 ` [patch 58/87] coda: check for async upcall request using local state Andrew Morton
2021-11-09  2:34 ` [patch 59/87] coda: remove err which no one care Andrew Morton
2021-11-09  2:34 ` [patch 60/87] coda: avoid flagging NULL inodes Andrew Morton
2021-11-09  2:34 ` [patch 61/87] coda: avoid hidden code duplication in rename Andrew Morton
2021-11-09  2:34 ` [patch 62/87] coda: avoid doing bad things on inode type changes during revalidation Andrew Morton
2021-11-09  2:34 ` [patch 63/87] coda: convert from atomic_t to refcount_t on coda_vm_ops->refcnt Andrew Morton
2021-11-09  2:34 ` [patch 64/87] coda: use vmemdup_user to replace the open code Andrew Morton
2021-11-09  2:34 ` [patch 65/87] coda: bump module version to 7.2 Andrew Morton
2021-11-09  2:34 ` [patch 66/87] nilfs2: replace snprintf in show functions with sysfs_emit Andrew Morton
2021-11-09  2:35 ` [patch 67/87] nilfs2: remove filenames from file comments Andrew Morton
2021-11-09  2:35 ` [patch 68/87] hfs/hfsplus: use WARN_ON for sanity check Andrew Morton
2021-11-09  2:35 ` [patch 69/87] crash_dump: fix boolreturn.cocci warning Andrew Morton
2021-11-09  2:35 ` [patch 70/87] crash_dump: remove duplicate include in crash_dump.h Andrew Morton
2021-11-09  2:35 ` [patch 71/87] signal: remove duplicate include in signal.h Andrew Morton
2021-11-09  2:35 ` [patch 72/87] seq_file: move seq_escape() to a header Andrew Morton
2021-11-09  2:35 ` [patch 73/87] seq_file: fix passing wrong private data Andrew Morton
2021-11-09  2:35 ` [patch 74/87] kernel/fork.c: unshare(): use swap() to make code cleaner Andrew Morton
2021-11-09  2:35 ` [patch 75/87] sysv: use BUILD_BUG_ON instead of runtime check Andrew Morton
2021-11-09  2:35 ` [patch 76/87] Documentation/kcov: include types.h in the example Andrew Morton
2021-11-09  2:35 ` [patch 77/87] Documentation/kcov: define `ip' " Andrew Morton
2021-11-09  2:35 ` [patch 78/87] kcov: allocate per-CPU memory on the relevant node Andrew Morton
2021-11-09  2:35 ` [patch 79/87] kcov: avoid enable+disable interrupts if !in_task() Andrew Morton
2021-11-09  2:35 ` [patch 80/87] kcov: replace local_irq_save() with a local_lock_t Andrew Morton
2021-11-09  2:35 ` [patch 81/87] scripts/gdb: handle split debug for vmlinux Andrew Morton
2021-11-09  2:35 ` [patch 82/87] kernel/resource: clean up and optimize iomem_is_exclusive() Andrew Morton
2021-11-09  2:35 ` [patch 83/87] kernel/resource: disallow access to exclusive system RAM regions Andrew Morton
2021-11-09  2:35 ` [patch 84/87] virtio-mem: disallow mapping virtio-mem memory via /dev/mem Andrew Morton
2021-11-09  2:35 ` [patch 85/87] selftests/kselftest/runner/run_one(): allow running non-executable files Andrew Morton
2021-11-09  2:35 ` [patch 86/87] ipc: check checkpoint_restore_ns_capable() to modify C/R proc files Andrew Morton
2021-11-09  2:36 ` [patch 87/87] ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211109023205.T0hV91JFt%akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=adobriyan@gmail.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=stephen.s.brennan@oracle.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).