* [PATCH -next v2 0/2] kernel/sys: reduce tasklist_lock usage get/set priorities @ 2020-05-12 0:03 Davidlohr Bueso 2020-05-12 0:03 ` [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) Davidlohr Bueso 2020-05-12 0:03 ` [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) Davidlohr Bueso 0 siblings, 2 replies; 9+ messages in thread From: Davidlohr Bueso @ 2020-05-12 0:03 UTC (permalink / raw) To: akpm; +Cc: peterz, oleg, paulmck, tglx, linux-kernel, dave Hi, The following are two patches that deal with removing the tasklist_lock entirely in getpriority(2), and reducing the scope of it in setpriority(2). This also aligns somewhat to what we already do with io priorities - although there both set and get rely entirely on rcu. Details in the changelog but this passes ltp tests. Please consider for v5.8. Changes from v1: - split get and set into two patches. - improved changelog. Thanks! Davidlohr Bueso (2): kernel/sys: only rely on rcu for getpriority(2) kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) kernel/sys.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) -- 2.26.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 0:03 [PATCH -next v2 0/2] kernel/sys: reduce tasklist_lock usage get/set priorities Davidlohr Bueso @ 2020-05-12 0:03 ` Davidlohr Bueso 2020-05-12 15:09 ` Oleg Nesterov 2020-05-12 0:03 ` [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) Davidlohr Bueso 1 sibling, 1 reply; 9+ messages in thread From: Davidlohr Bueso @ 2020-05-12 0:03 UTC (permalink / raw) To: akpm; +Cc: peterz, oleg, paulmck, tglx, linux-kernel, dave, Davidlohr Bueso Currently the tasklist_lock is shared mainly in order to observe the list atomically for the PRIO_PGRP and PRIO_USER cases, as the actual lookups are already rcu-safe, providing a stable task pointer. By removing the lock, we can race with: (i) fork (insertion), but this is benign as the child's nice is inherited and the actual task is not observable by the user yet either. The return semantics do not differ. and; (ii) exit (deletion), this window is small but if a task is deleted with the highest nice and it is not observed this would cause a change in return semantics. To further reduce the window we ignore any tasks that are PF_EXITING in the 'old' version of the list. The case for PRIO_PROCESS does not need the lock at all as it only looks up the pointer. The following raw microbenchmark improvements on a 40-core box were seen running the stressng-get workload, which pathologically pounds on various syscalls that get information from the kernel. Increasing thread counts of course shows more wins, albeit probably not something that would be seen in a real workload. 5.7.0-rc3 5.7.0-rc3 getpriority-v1 Hmean get-1 3443.65 ( 0.00%) 3314.08 * -3.76%* Hmean get-2 7809.99 ( 0.00%) 8547.60 * 9.44%* Hmean get-4 15498.01 ( 0.00%) 17396.85 * 12.25%* Hmean get-8 28001.37 ( 0.00%) 31137.53 * 11.20%* Hmean get-16 31460.88 ( 0.00%) 40284.35 * 28.05%* Hmean get-32 30036.64 ( 0.00%) 40657.88 * 35.36%* Hmean get-64 31429.86 ( 0.00%) 41021.73 * 30.52%* Hmean get-80 31804.13 ( 0.00%) 39188.55 * 23.22%* Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- kernel/sys.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index d325f3ab624a..0b72184f5e3e 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -277,7 +277,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) return -EINVAL; rcu_read_lock(); - read_lock(&tasklist_lock); switch (which) { case PRIO_PROCESS: if (who) @@ -296,6 +295,9 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) else pgrp = task_pgrp(current); do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { + if (unlikely(p->flags & PF_EXITING)) + continue; + niceval = nice_to_rlimit(task_nice(p)); if (niceval > retval) retval = niceval; @@ -313,6 +315,9 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) } do_each_thread(g, p) { if (uid_eq(task_uid(p), uid) && task_pid_vnr(p)) { + if (unlikely(p->flags & PF_EXITING)) + continue; + niceval = nice_to_rlimit(task_nice(p)); if (niceval > retval) retval = niceval; @@ -323,7 +328,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) break; } out_unlock: - read_unlock(&tasklist_lock); rcu_read_unlock(); return retval; -- 2.26.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 0:03 ` [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) Davidlohr Bueso @ 2020-05-12 15:09 ` Oleg Nesterov 2020-05-12 16:09 ` Davidlohr Bueso 0 siblings, 1 reply; 9+ messages in thread From: Oleg Nesterov @ 2020-05-12 15:09 UTC (permalink / raw) To: Davidlohr Bueso Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On 05/11, Davidlohr Bueso wrote: > > Currently the tasklist_lock is shared mainly in order to observe > the list atomically for the PRIO_PGRP and PRIO_USER cases, as > the actual lookups are already rcu-safe, not really... do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID) which moves the task from one hlist to another. Yes, it is safe in that task_struct can't go away. But still this is not right because do_each_pid_task() can scan the wrong (2nd) hlist. > (ii) exit (deletion), this window is small but if a task is > deleted with the highest nice and it is not observed this would > cause a change in return semantics. To further reduce the window > we ignore any tasks that are PF_EXITING in the 'old' version of > the list. can't understand... could you explain in details why do you think this PF_EXITING check makes any sense? Oleg. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 15:09 ` Oleg Nesterov @ 2020-05-12 16:09 ` Davidlohr Bueso 2020-05-12 16:41 ` Oleg Nesterov 0 siblings, 1 reply; 9+ messages in thread From: Davidlohr Bueso @ 2020-05-12 16:09 UTC (permalink / raw) To: Oleg Nesterov; +Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On Tue, 12 May 2020, Oleg Nesterov wrote: >On 05/11, Davidlohr Bueso wrote: >> >> Currently the tasklist_lock is shared mainly in order to observe >> the list atomically for the PRIO_PGRP and PRIO_USER cases, as >> the actual lookups are already rcu-safe, > >not really... > >do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID) >which moves the task from one hlist to another. Yes, it is safe in >that task_struct can't go away. But still this is not right because >do_each_pid_task() can scan the wrong (2nd) hlist. Hmm I didn't think about this case, I guess this is also busted in ioprio_get(2) then. > >> (ii) exit (deletion), this window is small but if a task is >> deleted with the highest nice and it is not observed this would >> cause a change in return semantics. To further reduce the window >> we ignore any tasks that are PF_EXITING in the 'old' version of >> the list. > >can't understand... > >could you explain in details why do you think this PF_EXITING check >makes any sense? My logic was that if the task with the highest prio exited while we were iterating the list, it would not be necessarily seen with rcu and the syscall would return the highest prio of a task that exited; and checking against PF_EXITING was a way to ignore such scenarios as we were going to race with it anyway. At this point it seems that we can just remove the lock for the PRIO_PROCESS case. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 16:09 ` Davidlohr Bueso @ 2020-05-12 16:41 ` Oleg Nesterov 2020-05-12 16:58 ` Davidlohr Bueso 0 siblings, 1 reply; 9+ messages in thread From: Oleg Nesterov @ 2020-05-12 16:41 UTC (permalink / raw) To: Davidlohr Bueso Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On 05/12, Davidlohr Bueso wrote: > > On Tue, 12 May 2020, Oleg Nesterov wrote: > > >do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID) > >which moves the task from one hlist to another. Yes, it is safe in > >that task_struct can't go away. But still this is not right because > >do_each_pid_task() can scan the wrong (2nd) hlist. > > Hmm I didn't think about this case, I guess this is also busted in > ioprio_get(2) then. agreed... > > > >could you explain in details why do you think this PF_EXITING check > >makes any sense? > > My logic was that if the task with the highest prio exited while we > were iterating the list, it would not be necessarily seen with rcu > and the syscall would return the highest prio of a task that exited; > and checking against PF_EXITING was a way to ignore such scenarios > as we were going to race with it anyway. Sorry, still can't understand. The PF_EXITING flag is not protected by tasklist_lock or rcu_lock. OK, if nothing else. Suppose that a prgp has a single process P, this proces has already exited but its parent didn't do wait(). Currently getpriority() returns task_nice(P). With the PF_EXITING check it will return -ESRCH. Hmm? Oleg. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 16:41 ` Oleg Nesterov @ 2020-05-12 16:58 ` Davidlohr Bueso 2020-05-12 18:16 ` Oleg Nesterov 0 siblings, 1 reply; 9+ messages in thread From: Davidlohr Bueso @ 2020-05-12 16:58 UTC (permalink / raw) To: Oleg Nesterov; +Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On Tue, 12 May 2020, Oleg Nesterov wrote: >On 05/12, Davidlohr Bueso wrote: >> >> On Tue, 12 May 2020, Oleg Nesterov wrote: >> >> >do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID) >> >which moves the task from one hlist to another. Yes, it is safe in >> >that task_struct can't go away. But still this is not right because >> >do_each_pid_task() can scan the wrong (2nd) hlist. >> >> Hmm I didn't think about this case, I guess this is also busted in >> ioprio_get(2) then. > >agreed... > >> > >> >could you explain in details why do you think this PF_EXITING check >> >makes any sense? >> >> My logic was that if the task with the highest prio exited while we >> were iterating the list, it would not be necessarily seen with rcu >> and the syscall would return the highest prio of a task that exited; >> and checking against PF_EXITING was a way to ignore such scenarios >> as we were going to race with it anyway. > >Sorry, still can't understand. The PF_EXITING flag is not protected by >tasklist_lock or rcu_lock. Sorry for not making my idea clear, perhaps it's complete garbage. Right, but setting the flag is an indication that the tasklist_lock will be taken and removed from the list, and therefore we could optimistically avoid considering that task altogether instead of relying on the old copy of the list. It's not perfect, but it does reduce the window in which getpriority() can return a stale value(?). At least this is how I justify it. Otoh this also opens a window in where the lockless version can ignore highest prio task when the locked version would otherwise consider it. So it might not be worth it. > >OK, if nothing else. Suppose that a prgp has a single process P, this >proces has already exited but its parent didn't do wait(). > >Currently getpriority() returns task_nice(P). With the PF_EXITING check >it will return -ESRCH. Hmm? Yes, that would need fixing but you don't seem to be buying the idea in the first place. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) 2020-05-12 16:58 ` Davidlohr Bueso @ 2020-05-12 18:16 ` Oleg Nesterov 0 siblings, 0 replies; 9+ messages in thread From: Oleg Nesterov @ 2020-05-12 18:16 UTC (permalink / raw) To: Davidlohr Bueso Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On 05/12, Davidlohr Bueso wrote: > > Right, but setting the flag is an indication that the tasklist_lock > will be taken Yes, > and removed from the list, Well no. If this task is not a group leader, or if it is traced this won't happen "soon", this will happen when parent or debugger call wait(). But this doesn't matter. Lets suppose that the task is always removed from the list right after it sets PF_EXITING. Now, > and therefore we could > optimistically avoid considering that task altogether Why?? This is what I can't understand. If sys_getpriority() sees PF_EXITING, we can pretend it was called before this task has exited, or even right before this flag was set. Why should we skip this task? And otoh, this check can not help in any case, PF_EXITING can be set right after the check. So I still think this check can only add the unnecessary confusion, even if we forget about change in behaviour. And finally, whatever I missed, I do not understand how this connects to "avoid the tasklist_lock". Whether we want it or not does not depend on the locking, at all. Oleg. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) 2020-05-12 0:03 [PATCH -next v2 0/2] kernel/sys: reduce tasklist_lock usage get/set priorities Davidlohr Bueso 2020-05-12 0:03 ` [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) Davidlohr Bueso @ 2020-05-12 0:03 ` Davidlohr Bueso 2020-05-12 16:10 ` Oleg Nesterov 1 sibling, 1 reply; 9+ messages in thread From: Davidlohr Bueso @ 2020-05-12 0:03 UTC (permalink / raw) To: akpm; +Cc: peterz, oleg, paulmck, tglx, linux-kernel, dave, Davidlohr Bueso This option does not iterate the tasklist and we already have an rcu aware stable pointer. Also, the nice value is not serialized by this lock. Reduce the scope of this lock to just PRIO_PGRP and PRIO_USER - which need to to set the priorities atomically to the list of tasks, at least vs fork(). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> --- kernel/sys.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index 0b72184f5e3e..f9f87775d6d2 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -214,16 +214,19 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) niceval = MAX_NICE; rcu_read_lock(); - read_lock(&tasklist_lock); - switch (which) { - case PRIO_PROCESS: + + if (which == PRIO_PROCESS) { if (who) p = find_task_by_vpid(who); else p = current; if (p) error = set_one_prio(p, niceval, error); - break; + goto out_rcu; + } + + read_lock(&tasklist_lock); + switch (which) { case PRIO_PGRP: if (who) pgrp = find_vpid(who); @@ -253,6 +256,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) } out_unlock: read_unlock(&tasklist_lock); +out_rcu: rcu_read_unlock(); out: return error; -- 2.26.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) 2020-05-12 0:03 ` [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) Davidlohr Bueso @ 2020-05-12 16:10 ` Oleg Nesterov 0 siblings, 0 replies; 9+ messages in thread From: Oleg Nesterov @ 2020-05-12 16:10 UTC (permalink / raw) To: Davidlohr Bueso Cc: akpm, peterz, paulmck, tglx, linux-kernel, Davidlohr Bueso On 05/11, Davidlohr Bueso wrote: > > This option does not iterate the tasklist and we already have > an rcu aware stable pointer. Also, the nice value is not serialized > by this lock. Reduce the scope of this lock to just PRIO_PGRP > and PRIO_USER - which need to to set the priorities atomically > to the list of tasks, at least vs fork(). looks correct, but probably the PRIO_USER case can avoid tasklist too? It can't help to avoid the race with setuid(). (PRIO_PGRP needs tasklist_lock (see my previous email) but afaics it can race with fork anyway, it can miss the new child which was not added to the list yet, I hope we do not care). So I'd suggest a single patch instead of 1-2, but I still don't understand your PF_EXITING check in 1/2. Oleg. --- x/kernel/sys.c +++ x/kernel/sys.c @@ -214,7 +214,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) niceval = MAX_NICE; rcu_read_lock(); - read_lock(&tasklist_lock); switch (which) { case PRIO_PROCESS: if (who) @@ -229,9 +228,11 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) pgrp = find_vpid(who); else pgrp = task_pgrp(current); + read_lock(&tasklist_lock); do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { error = set_one_prio(p, niceval, error); } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); + read_unlock(&tasklist_lock); break; case PRIO_USER: uid = make_kuid(cred->user_ns, who); @@ -243,16 +244,15 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) if (!user) goto out_unlock; /* No processes for this user */ } - do_each_thread(g, p) { + for_each_process_thread(g, p) { if (uid_eq(task_uid(p), uid) && task_pid_vnr(p)) error = set_one_prio(p, niceval, error); - } while_each_thread(g, p); + } if (!uid_eq(uid, cred->uid)) free_uid(user); /* For find_user() */ break; } out_unlock: - read_unlock(&tasklist_lock); rcu_read_unlock(); out: return error; @@ -277,7 +277,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) return -EINVAL; rcu_read_lock(); - read_lock(&tasklist_lock); switch (which) { case PRIO_PROCESS: if (who) @@ -295,11 +294,13 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) pgrp = find_vpid(who); else pgrp = task_pgrp(current); + read_lock(&tasklist_lock); do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { niceval = nice_to_rlimit(task_nice(p)); if (niceval > retval) retval = niceval; } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); + read_unlock(&tasklist_lock); break; case PRIO_USER: uid = make_kuid(cred->user_ns, who); @@ -311,19 +312,18 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) if (!user) goto out_unlock; /* No processes for this user */ } - do_each_thread(g, p) { + for_each_process_thread(g, p) { if (uid_eq(task_uid(p), uid) && task_pid_vnr(p)) { niceval = nice_to_rlimit(task_nice(p)); if (niceval > retval) retval = niceval; } - } while_each_thread(g, p); + } if (!uid_eq(uid, cred->uid)) free_uid(user); /* for find_user() */ break; } out_unlock: - read_unlock(&tasklist_lock); rcu_read_unlock(); return retval; ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-05-12 18:16 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-12 0:03 [PATCH -next v2 0/2] kernel/sys: reduce tasklist_lock usage get/set priorities Davidlohr Bueso 2020-05-12 0:03 ` [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2) Davidlohr Bueso 2020-05-12 15:09 ` Oleg Nesterov 2020-05-12 16:09 ` Davidlohr Bueso 2020-05-12 16:41 ` Oleg Nesterov 2020-05-12 16:58 ` Davidlohr Bueso 2020-05-12 18:16 ` Oleg Nesterov 2020-05-12 0:03 ` [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS) Davidlohr Bueso 2020-05-12 16:10 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).