linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/3] prlimit and set/getpriority tasklist_lock optimizations
@ 2022-01-05 21:28 Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 1/3] setpriority: only grab the tasklist_lock for PRIO_PGRP Barret Rhoden
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Barret Rhoden @ 2022-01-05 21:28 UTC
  To: ebiederm
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

The tasklist_lock popped up as a scalability bottleneck on some testing
workloads.  The readlocks in do_prlimit and set/getpriority are not
necessary in all cases.

Based on a cycles profile, it looked like ~87% of the time was spent in
the kernel, ~42% of which was just trying to get *some* spinlock
(queued_spin_lock_slowpath, not necessarily the tasklist_lock).

The big offenders (with rough percentages in cycles of the overall trace):

- do_wait 11%
- setpriority 8% (this patchset)
- kill 8%
- do_exit 5%
- clone 3%
- prlimit64 2%   (this patchset)
- getrlimit 1%   (this patchset)

I can't easily test this patchset on the original workload for various
reasons.  Instead, I used the microbenchmark below to at least verify
there was some improvement.  This patchset had a 28% speedup (12% from
baseline to set/getprio, then another 14% for prlimit).

One interesting thing is that my libc's getrlimit() was calling
prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
had no effect - it essentially optimized the older syscalls only.  I
didn't do that in this patchset, but figured I'd mention it since it was
an option from the previous patch's discussion.

v1: https://lore.kernel.org/lkml/20211213220401.1039578-1-brho@google.com/

#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	pid_t child;
	struct rlimit rlim[1];

	/* Six forks: 2^6 = 64 processes, each running the loop below. */
	fork(); fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 5000; i++) {
		child = fork();
		if (child < 0)
			exit(1);
		if (child > 0) {
			/* Parent: let the child spin for 1ms, then kill and reap it. */
			usleep(1000);
			kill(child, SIGTERM);
			waitpid(child, NULL, 0);
		} else {
			/* Child: hammer the syscalls under test until SIGTERM. */
			for (;;) {
				setpriority(PRIO_PROCESS, 0,
					    getpriority(PRIO_PROCESS, 0));
				getrlimit(RLIMIT_CPU, rlim);
			}
		}
	}

	return 0;
}


Barret Rhoden (3):
  setpriority: only grab the tasklist_lock for PRIO_PGRP
  prlimit: make do_prlimit() static
  prlimit: do not grab the tasklist_lock

 include/linux/posix-timers.h   |   2 +-
 include/linux/resource.h       |   2 -
 kernel/sys.c                   | 134 ++++++++++++++++++---------------
 kernel/time/posix-cpu-timers.c |  12 ++-
 4 files changed, 83 insertions(+), 67 deletions(-)

-- 
2.34.1.448.ga2b2bfdf31-goog



* [PATCH v2 1/3] setpriority: only grab the tasklist_lock for PRIO_PGRP
  2022-01-05 21:28 [PATCH v2 0/3] prlimit and set/getpriority tasklist_lock optimizations Barret Rhoden
@ 2022-01-05 21:28 ` Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 2/3] prlimit: make do_prlimit() static Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 3/3] prlimit: do not grab the tasklist_lock Barret Rhoden
  2 siblings, 0 replies; 6+ messages in thread
From: Barret Rhoden @ 2022-01-05 21:28 UTC
  To: ebiederm
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

The tasklist_lock is necessary only for PRIO_PGRP for both setpriority()
and getpriority().

Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
for workloads that also must grab the tasklist_lock for waiting,
killing, and cloning.

This change resulted in a 12% speedup on a microbenchmark where parents
kill and wait on their children, and children getpriority, setpriority,
and getrlimit.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 kernel/sys.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..558e52fa5bbd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -220,7 +220,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		niceval = MAX_NICE;
 
 	rcu_read_lock();
-	read_lock(&tasklist_lock);
 	switch (which) {
 	case PRIO_PROCESS:
 		if (who)
@@ -231,6 +230,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 			error = set_one_prio(p, niceval, error);
 		break;
 	case PRIO_PGRP:
+		read_lock(&tasklist_lock);
 		if (who)
 			pgrp = find_vpid(who);
 		else
@@ -238,6 +238,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
 			error = set_one_prio(p, niceval, error);
 		} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+		read_unlock(&tasklist_lock);
 		break;
 	case PRIO_USER:
 		uid = make_kuid(cred->user_ns, who);
@@ -258,7 +259,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		break;
 	}
 out_unlock:
-	read_unlock(&tasklist_lock);
 	rcu_read_unlock();
 out:
 	return error;
@@ -283,7 +283,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		return -EINVAL;
 
 	rcu_read_lock();
-	read_lock(&tasklist_lock);
 	switch (which) {
 	case PRIO_PROCESS:
 		if (who)
@@ -297,6 +296,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		}
 		break;
 	case PRIO_PGRP:
+		read_lock(&tasklist_lock);
 		if (who)
 			pgrp = find_vpid(who);
 		else
@@ -306,6 +306,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 			if (niceval > retval)
 				retval = niceval;
 		} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+		read_unlock(&tasklist_lock);
 		break;
 	case PRIO_USER:
 		uid = make_kuid(cred->user_ns, who);
@@ -329,7 +330,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		break;
 	}
 out_unlock:
-	read_unlock(&tasklist_lock);
 	rcu_read_unlock();
 
 	return retval;
-- 
2.34.1.448.ga2b2bfdf31-goog



* [PATCH v2 2/3] prlimit: make do_prlimit() static
  2022-01-05 21:28 [PATCH v2 0/3] prlimit and set/getpriority tasklist_lock optimizations Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 1/3] setpriority: only grab the tasklist_lock for PRIO_PGRP Barret Rhoden
@ 2022-01-05 21:28 ` Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 3/3] prlimit: do not grab the tasklist_lock Barret Rhoden
  2 siblings, 0 replies; 6+ messages in thread
From: Barret Rhoden @ 2022-01-05 21:28 UTC
  To: ebiederm
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

There are no other callers in the kernel.

Fixed up a comment format and whitespace issue when moving do_prlimit()
higher in sys.c.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 include/linux/resource.h |   2 -
 kernel/sys.c             | 116 ++++++++++++++++++++-------------------
 2 files changed, 59 insertions(+), 59 deletions(-)

diff --git a/include/linux/resource.h b/include/linux/resource.h
index bdf491cbcab7..4fdbc0c3f315 100644
--- a/include/linux/resource.h
+++ b/include/linux/resource.h
@@ -8,7 +8,5 @@
 struct task_struct;
 
 void getrusage(struct task_struct *p, int who, struct rusage *ru);
-int do_prlimit(struct task_struct *tsk, unsigned int resource,
-		struct rlimit *new_rlim, struct rlimit *old_rlim);
 
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index 558e52fa5bbd..fb2a5e7c0589 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1415,6 +1415,65 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 	return errno;
 }
 
+/* make sure you are allowed to change @tsk limits before calling this */
+static int do_prlimit(struct task_struct *tsk, unsigned int resource,
+		      struct rlimit *new_rlim, struct rlimit *old_rlim)
+{
+	struct rlimit *rlim;
+	int retval = 0;
+
+	if (resource >= RLIM_NLIMITS)
+		return -EINVAL;
+	if (new_rlim) {
+		if (new_rlim->rlim_cur > new_rlim->rlim_max)
+			return -EINVAL;
+		if (resource == RLIMIT_NOFILE &&
+				new_rlim->rlim_max > sysctl_nr_open)
+			return -EPERM;
+	}
+
+	/* protect tsk->signal and tsk->sighand from disappearing */
+	read_lock(&tasklist_lock);
+	if (!tsk->sighand) {
+		retval = -ESRCH;
+		goto out;
+	}
+
+	rlim = tsk->signal->rlim + resource;
+	task_lock(tsk->group_leader);
+	if (new_rlim) {
+		/*
+		 * Keep the capable check against init_user_ns until cgroups can
+		 * contain all limits.
+		 */
+		if (new_rlim->rlim_max > rlim->rlim_max &&
+				!capable(CAP_SYS_RESOURCE))
+			retval = -EPERM;
+		if (!retval)
+			retval = security_task_setrlimit(tsk, resource, new_rlim);
+	}
+	if (!retval) {
+		if (old_rlim)
+			*old_rlim = *rlim;
+		if (new_rlim)
+			*rlim = *new_rlim;
+	}
+	task_unlock(tsk->group_leader);
+
+	/*
+	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not
+	 * infinite. In case of RLIM_INFINITY the posix CPU timer code
+	 * ignores the rlimit.
+	 */
+	if (!retval && new_rlim && resource == RLIMIT_CPU &&
+	    new_rlim->rlim_cur != RLIM_INFINITY &&
+	    IS_ENABLED(CONFIG_POSIX_TIMERS))
+		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
+out:
+	read_unlock(&tasklist_lock);
+	return retval;
+}
+
 SYSCALL_DEFINE2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim)
 {
 	struct rlimit value;
@@ -1558,63 +1617,6 @@ static void rlim64_to_rlim(const struct rlimit64 *rlim64, struct rlimit *rlim)
 		rlim->rlim_max = (unsigned long)rlim64->rlim_max;
 }
 
-/* make sure you are allowed to change @tsk limits before calling this */
-int do_prlimit(struct task_struct *tsk, unsigned int resource,
-		struct rlimit *new_rlim, struct rlimit *old_rlim)
-{
-	struct rlimit *rlim;
-	int retval = 0;
-
-	if (resource >= RLIM_NLIMITS)
-		return -EINVAL;
-	if (new_rlim) {
-		if (new_rlim->rlim_cur > new_rlim->rlim_max)
-			return -EINVAL;
-		if (resource == RLIMIT_NOFILE &&
-				new_rlim->rlim_max > sysctl_nr_open)
-			return -EPERM;
-	}
-
-	/* protect tsk->signal and tsk->sighand from disappearing */
-	read_lock(&tasklist_lock);
-	if (!tsk->sighand) {
-		retval = -ESRCH;
-		goto out;
-	}
-
-	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
-	if (new_rlim) {
-		/* Keep the capable check against init_user_ns until
-		   cgroups can contain all limits */
-		if (new_rlim->rlim_max > rlim->rlim_max &&
-				!capable(CAP_SYS_RESOURCE))
-			retval = -EPERM;
-		if (!retval)
-			retval = security_task_setrlimit(tsk, resource, new_rlim);
-	}
-	if (!retval) {
-		if (old_rlim)
-			*old_rlim = *rlim;
-		if (new_rlim)
-			*rlim = *new_rlim;
-	}
-	task_unlock(tsk->group_leader);
-
-	/*
-	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not
-	 * infinite. In case of RLIM_INFINITY the posix CPU timer code
-	 * ignores the rlimit.
-	 */
-	 if (!retval && new_rlim && resource == RLIMIT_CPU &&
-	     new_rlim->rlim_cur != RLIM_INFINITY &&
-	     IS_ENABLED(CONFIG_POSIX_TIMERS))
-		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
-out:
-	read_unlock(&tasklist_lock);
-	return retval;
-}
-
 /* rcu lock must be held */
 static int check_prlimit_permission(struct task_struct *task,
 				    unsigned int flags)
-- 
2.34.1.448.ga2b2bfdf31-goog



* [PATCH v2 3/3] prlimit: do not grab the tasklist_lock
  2022-01-05 21:28 [PATCH v2 0/3] prlimit and set/getpriority tasklist_lock optimizations Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 1/3] setpriority: only grab the tasklist_lock for PRIO_PGRP Barret Rhoden
  2022-01-05 21:28 ` [PATCH v2 2/3] prlimit: make do_prlimit() static Barret Rhoden
@ 2022-01-05 21:28 ` Barret Rhoden
  2022-01-05 22:05   ` Eric W. Biederman
  2 siblings, 1 reply; 6+ messages in thread
From: Barret Rhoden @ 2022-01-05 21:28 UTC
  To: ebiederm
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
for workloads that also must grab the tasklist_lock for waiting,
killing, and cloning.

The tasklist_lock was grabbed to protect tsk->sighand from disappearing
(becoming NULL).  tsk->signal was already protected by holding a
reference to tsk.

update_rlimit_cpu() assumed tsk->sighand != NULL.  With this commit, it
attempts to lock_task_sighand().  However, this means that
update_rlimit_cpu() can fail.  This only happens when a task is exiting.
Note that during exec, sighand may *change*, but it will not be NULL.

Prior to this commit, do_prlimit() ensured that update_rlimit_cpu()
would not fail by read-locking the tasklist_lock and checking that
tsk->sighand != NULL.

If update_rlimit_cpu() fails, there may be other tasks that are not
exiting that share tsk->signal.  We need to run update_rlimit_cpu() on
one of them.   We can't "back out" the new rlim - once we unlocked
task_lock(group_leader), the rlim is essentially changed.

The only other caller of update_rlimit_cpu() is
selinux_bprm_committing_creds().  It has tsk == current, so
update_rlimit_cpu() cannot fail (current->sighand cannot disappear
until current exits).

This change resulted in a 14% speedup on a microbenchmark where parents
kill and wait on their children, and children getpriority, setpriority,
and getrlimit.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 include/linux/posix-timers.h   |  2 +-
 kernel/sys.c                   | 32 +++++++++++++++++++++-----------
 kernel/time/posix-cpu-timers.c | 12 +++++++++---
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 5bbcd280bfd2..9cf126c3b27f 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -253,7 +253,7 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
 			   u64 *newval, u64 *oldval);
 
-void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
+int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
 
 void posixtimer_rearm(struct kernel_siginfo *info);
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index fb2a5e7c0589..073ae9db192f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1432,13 +1432,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 			return -EPERM;
 	}
 
-	/* protect tsk->signal and tsk->sighand from disappearing */
-	read_lock(&tasklist_lock);
-	if (!tsk->sighand) {
-		retval = -ESRCH;
-		goto out;
-	}
-
+	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
 	task_lock(tsk->group_leader);
 	if (new_rlim) {
@@ -1467,10 +1461,26 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 	 */
 	if (!retval && new_rlim && resource == RLIMIT_CPU &&
 	    new_rlim->rlim_cur != RLIM_INFINITY &&
-	    IS_ENABLED(CONFIG_POSIX_TIMERS))
-		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
-out:
-	read_unlock(&tasklist_lock);
+	    IS_ENABLED(CONFIG_POSIX_TIMERS)) {
+		if (update_rlimit_cpu(tsk, new_rlim->rlim_cur)) {
+			/*
+			 * update_rlimit_cpu can fail if the task is exiting.
+			 * We already set the task group's rlim, so we need to
+			 * update_rlimit_cpu for some other task in the process.
+			 * If all of the tasks are exiting, then we don't need
+			 * to update_rlimit_cpu.
+			 */
+			struct task_struct *t_i;
+
+			rcu_read_lock();
+			for_each_thread(tsk, t_i) {
+				if (!update_rlimit_cpu(t_i, new_rlim->rlim_cur))
+					break;
+			}
+			rcu_read_unlock();
+		}
+	}
+
 	return retval;
 }
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 96b4e7810426..e13e628509fb 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -34,14 +34,20 @@ void posix_cputimers_group_init(struct posix_cputimers *pct, u64 cpu_limit)
  * tsk->signal->posix_cputimers.bases[clock].nextevt expiration cache if
  * necessary. Needs siglock protection since other code may update the
  * expiration cache as well.
+ *
+ * Returns 0 on success, -ESRCH on failure.  Can fail if the task is exiting and
+ * we cannot lock_task_sighand.  Cannot fail if task is current.
  */
-void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
+int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
 {
 	u64 nsecs = rlim_new * NSEC_PER_SEC;
+	unsigned long irq_fl;
 
-	spin_lock_irq(&task->sighand->siglock);
+	if (!lock_task_sighand(task, &irq_fl))
+		return -ESRCH;
 	set_process_cpu_timer(task, CPUCLOCK_PROF, &nsecs, NULL);
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &irq_fl);
+	return 0;
 }
 
 /*
-- 
2.34.1.448.ga2b2bfdf31-goog



* Re: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock
  2022-01-05 21:28 ` [PATCH v2 3/3] prlimit: do not grab the tasklist_lock Barret Rhoden
@ 2022-01-05 22:05   ` Eric W. Biederman
  2022-01-06 16:43     ` Barret Rhoden
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2022-01-05 22:05 UTC
  To: Barret Rhoden
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

Barret Rhoden <brho@google.com> writes:

> Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
> for workloads that also must grab the tasklist_lock for waiting,
> killing, and cloning.
>
> The tasklist_lock was grabbed to protect tsk->sighand from disappearing
> (becoming NULL).  tsk->signal was already protected by holding a
> reference to tsk.
>
> update_rlimit_cpu() assumed tsk->sighand != NULL.  With this commit, it
> attempts to lock_task_sighand().  However, this means that
> update_rlimit_cpu() can fail.  This only happens when a task is exiting.
> Note that during exec, sighand may *change*, but it will not be NULL.
>
> Prior to this commit, do_prlimit() ensured that update_rlimit_cpu()
> would not fail by read-locking the tasklist_lock and checking that
> tsk->sighand != NULL.
>
> If update_rlimit_cpu() fails, there may be other tasks that are not
> exiting that share tsk->signal.  We need to run update_rlimit_cpu() on
> one of them.   We can't "back out" the new rlim - once we unlocked
> task_lock(group_leader), the rlim is essentially changed.
>
> The only other caller of update_rlimit_cpu() is
> selinux_bprm_committing_creds().  It has tsk == current, so
> update_rlimit_cpu() cannot fail (current->sighand cannot disappear
> until current exits).
>
> This change resulted in a 14% speedup on a microbenchmark where parents
> kill and wait on their children, and children getpriority, setpriority,
> and getrlimit.
>
> Signed-off-by: Barret Rhoden <brho@google.com>
> ---
>  include/linux/posix-timers.h   |  2 +-
>  kernel/sys.c                   | 32 +++++++++++++++++++++-----------
>  kernel/time/posix-cpu-timers.c | 12 +++++++++---
>  3 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
> index 5bbcd280bfd2..9cf126c3b27f 100644
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -253,7 +253,7 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
>  void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
>  			   u64 *newval, u64 *oldval);
>  
> -void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
> +int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
>  
>  void posixtimer_rearm(struct kernel_siginfo *info);
>  #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index fb2a5e7c0589..073ae9db192f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1432,13 +1432,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  			return -EPERM;
>  	}
>  
> -	/* protect tsk->signal and tsk->sighand from disappearing */
> -	read_lock(&tasklist_lock);
> -	if (!tsk->sighand) {
> -		retval = -ESRCH;
> -		goto out;
> -	}
> -
> +	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
>  	rlim = tsk->signal->rlim + resource;
>  	task_lock(tsk->group_leader);
>  	if (new_rlim) {
> @@ -1467,10 +1461,26 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  	 */
>  	if (!retval && new_rlim && resource == RLIMIT_CPU &&
>  	    new_rlim->rlim_cur != RLIM_INFINITY &&
> -	    IS_ENABLED(CONFIG_POSIX_TIMERS))
> -		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
> -out:
> -	read_unlock(&tasklist_lock);
> +	    IS_ENABLED(CONFIG_POSIX_TIMERS)) {
> +		if (update_rlimit_cpu(tsk, new_rlim->rlim_cur)) {
> +			/*
> +			 * update_rlimit_cpu can fail if the task is exiting.
> +			 * We already set the task group's rlim, so we need to
> +			 * update_rlimit_cpu for some other task in the process.
> +			 * If all of the tasks are exiting, then we don't need
> +			 * to update_rlimit_cpu.
> +			 */
> +			struct task_struct *t_i;
> +
> +			rcu_read_lock();
> +			for_each_thread(tsk, t_i) {
> +				if (!update_rlimit_cpu(t_i, new_rlim->rlim_cur))
> +					break;
> +			}
> +			rcu_read_unlock();
> +		}

I look at this and I ask can't we do this better?

Because you are right that if the thread you landed on is exiting this
is a problem.  It is only a problem for prlimit64, as all of the rest
of the calls to do_prlimit happen from current so you know they are not
exiting.

I think the simple solution is just:
	update_rlimit_cpu(tsk->group_leader)

As the group leader is guaranteed to be the last thread of the thread
group to be processed in release_task, and thus the last thread with a
sighand.  Nothing needs to be done if it does not have a sighand.

How does that sound?

> +	}
> +
>  	return retval;
>  }

Eric


* Re: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock
  2022-01-05 22:05   ` Eric W. Biederman
@ 2022-01-06 16:43     ` Barret Rhoden
  0 siblings, 0 replies; 6+ messages in thread
From: Barret Rhoden @ 2022-01-06 16:43 UTC
  To: Eric W. Biederman
  Cc: Christian Brauner, Andrew Morton, Alexey Gladkov, William Cohen,
	Viresh Kumar, Alexey Dobriyan, Chris Hyser, Peter Collingbourne,
	Xiaofeng Cao, David Hildenbrand, Cyrill Gorcunov, linux-kernel

On 1/5/22 17:05, Eric W. Biederman wrote:
> 
> I think the simple solution is just:
> 	update_rlimit_cpu(tsk->group_leader)
> 
> As the group leader is guaranteed to be the last thread of the thread
> group to be processed in release_task, and thus the last thread with a
> sighand.  Nothing needs to be done if it does not have a sighand.

Ah, good to know.  I didn't know the group_leader stuck around as a 
zombie.  That makes life easier.  I'll make the change and repost the 
patches.

Thanks,

Barret

