From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: akpm@linux-foundation.org, peterz@infradead.org, oleg@redhat.com,
	viro@zeniv.linux.org.uk, mingo@kernel.org,
	paulmck@linux.vnet.ibm.com, keescook@chromium.org,
	riel@redhat.com, mhocko@suse.com, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, marcos.souza.org@gmail.com,
	hoeun.ryu@gmail.com, pasha.tatashin@oracle.com,
	gs051095@gmail.com, ebiederm@xmission.com, dhowells@redhat.com,
	rppt@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
	ktkhai@virtuozzo.com
Subject: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()
Date: Thu, 26 Apr 2018 14:01:07 +0300
Message-ID: <152474046779.29458.5294808258041953930.stgit@localhost.localdomain>
In-Reply-To: <152473763015.29458.1131542311542381803.stgit@localhost.localdomain>

This patch completes the series and makes mm_update_next_owner()
iterate over the task list under RCU instead of tasklist_lock.
This is possible because of the rules of mm inheritance: an mm may be
propagated only to a child, while only a kernel thread can obtain
someone else's mm via use_mm().

Also, all new tasks are added to the tail of the tasks list or the
threads list. The only exception is transfer_pid() in de_thread(), where
the group leader is replaced by another thread. But transfer_pid() is
called only on a successful exec, where a new mm is allocated, so it
is of no interest to mm_update_next_owner().
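
The tail-insertion argument above can be illustrated with a small
single-threaded sketch. It is not kernel code: struct node, struct list
and the helper names are hypothetical stand-ins for the tasks list, and
it deliberately ignores memory ordering (the real code relies on
list_add_tail_rcu() and the barriers described in this message). It only
shows why a walker that is already partway through a list still reaches
nodes appended at the tail, but misses nodes inserted at the head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal node; stands in for a task on the tasks list. */
struct node {
	int id;
	struct node *next;
};

struct list {
	struct node *head;
	struct node *tail;
};

/* Append at the tail, the way new tasks are linked. */
void add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (l->tail)
		l->tail->next = n;
	else
		l->head = n;
	l->tail = n;
}

/* Insert at the head, shown only for contrast. */
void add_head(struct list *l, struct node *n)
{
	n->next = l->head;
	l->head = n;
	if (!l->tail)
		l->tail = n;
}

/* Continue a walk from pos and report whether a node with this id is met. */
int seen_from(const struct node *pos, int id)
{
	for (; pos; pos = pos->next)
		if (pos->id == id)
			return 1;
	return 0;
}

/* A walker paused at the first node still finds a node appended later. */
int tail_insert_is_seen(void)
{
	struct node a = { 1, NULL }, b = { 2, NULL }, c = { 3, NULL };
	struct list l = { NULL, NULL };
	const struct node *walker;

	add_tail(&l, &a);
	add_tail(&l, &b);
	walker = l.head;	/* walk has started, paused at a */
	add_tail(&l, &c);	/* new node appears behind the tail */
	return seen_from(walker, 3);
}

/* The same walker never meets a node inserted in front of it. */
int head_insert_is_seen(void)
{
	struct node a = { 1, NULL }, b = { 2, NULL }, c = { 3, NULL };
	struct list l = { NULL, NULL };
	const struct node *walker;

	add_tail(&l, &a);
	add_tail(&l, &b);
	walker = l.head;	/* walk has started, paused at a */
	add_head(&l, &c);	/* new node lands before the walker */
	return seen_from(walker, 3);
}
```

Calling tail_insert_is_seen() and head_insert_is_seen() from a test
program shows the difference between the two insertion points.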

This patch uses alloc_pid() as a memory barrier, which is possible
because it contains two or more spin_lock()/spin_unlock() pairs.
A single pair does not imply a barrier, while two pairs do.

There are three barriers:

1)for_each_process(g)            copy_process()
                                   p->mm = mm
    smp_rmb();                     smp_wmb() implied by alloc_pid()
    if (g->flags & PF_KTHREAD)     list_add_tail_rcu(&p->tasks, &init_task.tasks)

2)for_each_thread(g, c)          copy_process()
                                   p->mm = mm
    smp_rmb();                     smp_wmb() implied by alloc_pid()
    tmp = READ_ONCE(c->mm)         list_add_tail_rcu(&p->thread_node, ...)

3)for_each_thread(g, c)          copy_process()
                                   list_add_tail_rcu(&p->thread_node, ...)
    p->mm != NULL check          do_exit()
    smp_rmb()                      smp_mb();
    get next thread in loop      p->mm = NULL
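
The writer/reader pairing in cases 1) and 2) above follows the usual
"initialize, barrier, publish" pattern. Below is a userspace analogue
using C11 fences and pthreads, offered only as a sketch: struct
fake_task, published and run_publish_demo() are hypothetical names, and
C11 release/acquire fences are not the same primitives as the kernel's
smp_wmb()/smp_rmb(); they merely play the corresponding roles here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; mm is a plain, non-atomic field. */
struct fake_task {
	int mm;
};

static struct fake_task task;
/* Stands in for the list link through which the task becomes reachable. */
static _Atomic(struct fake_task *) published;

static void *writer(void *arg)
{
	(void)arg;
	task.mm = 42;					/* p->mm = mm */
	atomic_thread_fence(memory_order_release);	/* role of the write barrier */
	atomic_store_explicit(&published, &task,
			      memory_order_relaxed);	/* role of list_add_tail_rcu() */
	return NULL;
}

static void *reader(void *arg)
{
	int *out = arg;
	struct fake_task *t;

	/* Spin until the task is reachable, like meeting it during iteration. */
	while (!(t = atomic_load_explicit(&published, memory_order_relaxed)))
		;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	*out = t->mm;					/* must observe the initialized mm */
	return NULL;
}

/* Runs one writer and one reader; returns the mm value the reader saw. */
int run_publish_demo(void)
{
	pthread_t w, r;
	int seen = 0;
	int rc;

	task.mm = 0;
	atomic_store(&published, NULL);

	rc = pthread_create(&r, NULL, reader, &seen);
	assert(rc == 0);
	rc = pthread_create(&w, NULL, writer, NULL);
	assert(rc == 0);
	(void)rc;

	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return seen;
}
```

Because the reader's acquire fence follows a load that observed the
store made after the writer's release fence, the reader is guaranteed to
see the initialized field. Dropping either fence removes that guarantee.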


This patch may be useful on machines running many processes.
I regularly observe mm_update_next_owner() executing on one of the CPUs
in crash dumps (unrelated to this function) on big machines. Even
if iterating over the task list looks like an unlikely situation, its
frequency grows with the number of containers/processes.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/exit.c |   39 +++++++++++++++++++++++++++++++++++----
 kernel/fork.c |    1 +
 kernel/pid.c  |    5 ++++-
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 40f734ed1193..7ce4cdf96a64 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -406,6 +406,8 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 void mm_update_next_owner(struct mm_struct *mm)
 {
 	struct task_struct *c, *g, *p = current;
+	struct mm_struct *tmp;
+	struct list_head *n;
 
 retry:
 	/*
@@ -440,21 +442,49 @@ void mm_update_next_owner(struct mm_struct *mm)
 		if (c->mm == mm)
 			goto new_owner;
 	}
+	read_unlock(&tasklist_lock);
 
 	/*
 	 * Search through everything else, we should not get here often.
 	 */
+	rcu_read_lock();
 	for_each_process(g) {
+		/*
+		 * g->signal, g->mm and g->flags initialization of a just
+		 * created task must not reorder with linking the task to
+		 * tasks list. Pairs with smp_mb() implied by alloc_pid().
+		 */
+		smp_rmb();
 		if (g->flags & PF_KTHREAD)
 			continue;
 		for_each_thread(g, c) {
-			if (c->mm == mm)
-				goto new_owner;
-			if (c->mm)
+			/*
+			 * Make visible mm of iterated thread.
+			 * Pairs with smp_mb() implied by alloc_pid().
+			 */
+			if (c != g)
+				smp_rmb();
+			tmp = READ_ONCE(c->mm);
+			if (tmp == mm)
+				goto new_owner_nolock;
+			if (likely(tmp))
 				break;
+			n = READ_ONCE(c->thread_node.next);
+			/*
+			 * All mm are NULL, so iterated threads already exited.
+			 * Make sure we see their children.
+			 * Pairs with smp_mb() in do_exit().
+			 */
+			if (n == &g->signal->thread_head)
+				smp_rmb();
 		}
+		/*
+		 * Children of exited thread group are visible due to the above
+		 * smp_rmb(). Threads with mm != NULL can't create a child with
+		 * the mm we're looking for. So, no additional smp_rmb() needed.
+		 */
 	}
-	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 	/*
 	 * We found no owner yet mm_users > 1: this implies that we are
 	 * most likely racing with swapoff (try_to_unuse()) or /proc or
@@ -466,6 +496,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 new_owner:
 	rcu_read_lock();
 	read_unlock(&tasklist_lock);
+new_owner_nolock:
 	BUG_ON(c == p);
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..2032d4657546 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1805,6 +1805,7 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
+		/* On successful return, this function implies smp_mb() */
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
diff --git a/kernel/pid.c b/kernel/pid.c
index 157fe4b19971..cb96473aa058 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -155,7 +155,10 @@ void free_pid(struct pid *pid)
 
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
-
+/*
+ * This function contains at least two sequential spin_lock()/spin_unlock()
+ * pairs, and together they imply a full memory barrier.
+ */
 struct pid *alloc_pid(struct pid_namespace *ns)
 {
 	struct pid *pid;
