* kmod: avoid propagating PF_NO_SETAFFINITY into userspace child @ 2013-11-14 1:51 zhang.yi20 2013-11-14 5:23 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: zhang.yi20 @ 2013-11-14 1:51 UTC (permalink / raw) To: linux-kernel; +Cc: Tejun Heo, Tetsuo Handa, Oleg Nesterov The kernel worker thread has the PF_NO_SETAFFINITY flag, and it is propagated into the userspace child. Clearing this flag in usersapce child to enable its migrating capability. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> --- linux3-12/kernel/kmod.c 2013-11-14 09:06:58.991781656 +0000 +++ linux3-12/kernel/kmod.c 2013-11-14 09:08:47.511781621 +0000 @@ -217,6 +217,7 @@ static int ____call_usermodehelper(void * Avoid propagating that into the userspace child. */ set_user_nice(current, 0); + current->flags &= ~PF_NO_SETAFFINITY; retval = -ENOMEM; new = prepare_kernel_cred(current); ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-14 1:51 kmod: avoid propagating PF_NO_SETAFFINITY into userspace child zhang.yi20 @ 2013-11-14 5:23 ` Tejun Heo 2013-11-14 11:40 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-14 5:23 UTC (permalink / raw) To: zhang.yi20; +Cc: linux-kernel, Tetsuo Handa, Oleg Nesterov Hello, On Thu, Nov 14, 2013 at 09:51:47AM +0800, zhang.yi20@zte.com.cn wrote: > The kernel worker thread has the PF_NO_SETAFFINITY flag, and it is propagated > into the userspace child. Clearing this flag in usersapce child to enable its > migrating capability. > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> > > --- linux3-12/kernel/kmod.c 2013-11-14 09:06:58.991781656 +0000 > +++ linux3-12/kernel/kmod.c 2013-11-14 09:08:47.511781621 +0000 > @@ -217,6 +217,7 @@ static int ____call_usermodehelper(void > * Avoid propagating that into the userspace child. > */ > set_user_nice(current, 0); > + current->flags &= ~PF_NO_SETAFFINITY; I'm a bit confused. kernel_thread() doesn't use workqueue or kthread_bind(), so the thread shouldn't have PF_NO_SETAFFINITY set. Have you actually observed this happening? Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-14 5:23 ` Tejun Heo @ 2013-11-14 11:40 ` Oleg Nesterov 2013-11-14 11:55 ` [PATCH 0/1]: (Was: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child) Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-14 11:40 UTC (permalink / raw) To: Tejun Heo; +Cc: zhang.yi20, linux-kernel, Tetsuo Handa On 11/14, Tejun Heo wrote: > > Hello, > > On Thu, Nov 14, 2013 at 09:51:47AM +0800, zhang.yi20@zte.com.cn wrote: > > The kernel worker thread has the PF_NO_SETAFFINITY flag, and it is propagated > > into the userspace child. Clearing this flag in usersapce child to enable its > > migrating capability. > > > > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> > > > > --- linux3-12/kernel/kmod.c 2013-11-14 09:06:58.991781656 +0000 > > +++ linux3-12/kernel/kmod.c 2013-11-14 09:08:47.511781621 +0000 > > @@ -217,6 +217,7 @@ static int ____call_usermodehelper(void > > * Avoid propagating that into the userspace child. > > */ > > set_user_nice(current, 0); > > + current->flags &= ~PF_NO_SETAFFINITY; > > I'm a bit confused. kernel_thread() doesn't use workqueue or > kthread_bind(), Yes, but kernel_thread() is called by the worker thread which has PF_NO_SETAFFINITY, this flag is copied to child->flags. Looks like Zhang is right... But I'd suggest to change flush_old_exec() instead (see "current->flags &= ..."). Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 0/1]: (Was: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child) 2013-11-14 11:40 ` Oleg Nesterov @ 2013-11-14 11:55 ` Oleg Nesterov 2013-11-14 11:56 ` [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-14 11:55 UTC (permalink / raw) To: Tejun Heo; +Cc: zhang.yi20, linux-kernel, Tetsuo Handa On 11/14, Oleg Nesterov wrote: > > Yes, but kernel_thread() is called by the worker thread which has > PF_NO_SETAFFINITY, this flag is copied to child->flags. Hmm, and it seems that create_worker() needs a minor fix, please see the patch. > Looks like Zhang is right... But I'd suggest to change flush_old_exec() > instead (see "current->flags &= ..."). Yes. But... May be we should (also?) change copy_flags() to clear PF_NO_SETAFFINITY? Probably not, this can lead to the subtle behaviour changes. But I am wondering if there is any kthread() which relies on PF_NO_SETAFFINITY copied by kernel_thread(). Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY 2013-11-14 11:55 ` [PATCH 0/1]: (Was: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child) Oleg Nesterov @ 2013-11-14 11:56 ` Oleg Nesterov 2013-11-22 23:13 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-14 11:56 UTC (permalink / raw) To: Tejun Heo; +Cc: zhang.yi20, linux-kernel, Tetsuo Handa Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed() in create_worker(). Otherwise userland can change ->cpus_allowed in between. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/workqueue.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 987293d..775af74 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1736,16 +1736,16 @@ static struct worker *create_worker(struct worker_pool *pool) if (IS_ERR(worker->task)) goto fail; + set_user_nice(worker->task, pool->attrs->nice); + + /* prevent userland from meddling with cpumask of workqueue workers */ + worker->task->flags |= PF_NO_SETAFFINITY; /* * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any * online CPUs. It'll be re-applied when any of the CPUs come up. */ - set_user_nice(worker->task, pool->attrs->nice); set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); - /* prevent userland from meddling with cpumask of workqueue workers */ - worker->task->flags |= PF_NO_SETAFFINITY; - /* * The caller is responsible for ensuring %POOL_DISASSOCIATED * remains stable across this function. See the comments above the -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY 2013-11-14 11:56 ` [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY Oleg Nesterov @ 2013-11-22 23:13 ` Tejun Heo [not found] ` <OF36E72FA9.51146BE3-ON48257C2E.0008BC6D-48257C2E.0008FF9C@zte.com.cn> 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-22 23:13 UTC (permalink / raw) To: Oleg Nesterov; +Cc: zhang.yi20, lkml, Tetsuo Handa On Thu, Nov 14, 2013 at 6:56 AM, Oleg Nesterov <oleg@redhat.com> wrote: > Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed() > in create_worker(). Otherwise userland can change ->cpus_allowed > in between. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Applied to wq/for-3.13-fixes. Thanks! -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
[parent not found: <OF36E72FA9.51146BE3-ON48257C2E.0008BC6D-48257C2E.0008FF9C@zte.com.cn>]
* Re: 答复: Re: [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY [not found] ` <OF36E72FA9.51146BE3-ON48257C2E.0008BC6D-48257C2E.0008FF9C@zte.com.cn> @ 2013-11-25 12:14 ` Oleg Nesterov 2013-11-26 2:10 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child zhang.yi20 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-25 12:14 UTC (permalink / raw) To: zhang.yi20; +Cc: Tejun Heo, htejun, lkml, Tetsuo Handa On 11/25, zhang.yi20@zte.com.cn wrote: > > htejun@gmail.com wrote on 2013/11/23 07:13:43: > > > > > Re: [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY > > > > On Thu, Nov 14, 2013 at 6:56 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed() > > > in create_worker(). Otherwise userland can change ->cpus_allowed > > > in between. > > > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > > > Applied to wq/for-3.13-fixes. > > > > Thanks! > > > > -- > > tejun > > How about the first patch? Let me quote myself: Looks like Zhang is right... But I'd suggest to change flush_old_exec() instead (see "current->flags &= ..."). Do you agree? If yes, could you make v2? flush_old_exec() already clears the unwanted PF_ bits, I do not think we should add another place. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-25 12:14 ` 答复: " Oleg Nesterov @ 2013-11-26 2:10 ` zhang.yi20 2013-11-26 18:04 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: zhang.yi20 @ 2013-11-26 2:10 UTC (permalink / raw) To: Oleg Nesterov; +Cc: htejun, lkml, Tetsuo Handa, Tejun Heo Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be a kernel worker thread which has PF_NO_SETAFFINITY set. Clearing this flag in usersapce child to enable its migrating capability. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> --- linux-3.12.old/fs/exec.c 2013-11-26 08:53:12.175811856 +0000 +++ linux-3.12/fs/exec.c 2013-11-26 09:26:53.575999604 +0000 @@ -1091,7 +1091,8 @@ int flush_old_exec(struct linux_binprm * set_fs(USER_DS); current->flags &= - ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE); + ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE + | PF_NO_SETAFFINITY); flush_thread(); current->personality &= ~bprm->per_clear; Thanks. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-26 2:10 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child zhang.yi20 @ 2013-11-26 18:04 ` Oleg Nesterov 2013-11-27 2:07 ` zhang.yi20 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-26 18:04 UTC (permalink / raw) To: zhang.yi20; +Cc: htejun, lkml, Tetsuo Handa, Tejun Heo On 11/26, zhang.yi20@zte.com.cn wrote: > > Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be > a kernel worker thread which has PF_NO_SETAFFINITY set. ... and this worker thread can do kernel_thread() to create the child. > Clearing this flag in usersapce child to enable its migrating capability. > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> > > --- linux-3.12.old/fs/exec.c 2013-11-26 08:53:12.175811856 +0000 > +++ linux-3.12/fs/exec.c 2013-11-26 09:26:53.575999604 +0000 > @@ -1091,7 +1091,8 @@ int flush_old_exec(struct linux_binprm * > > set_fs(USER_DS); > current->flags &= > - ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE); > + ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE > + | PF_NO_SETAFFINITY); Thanks, Acked-by: Oleg Nesterov <oleg@redhat.com> Can't resist... perhaps current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE | PF_NO_SETAFFINITY); looks a bit better, but this is minor and I won't insist. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-26 18:04 ` Oleg Nesterov @ 2013-11-27 2:07 ` zhang.yi20 2013-11-27 13:20 ` Oleg Nesterov 2013-11-27 18:31 ` Tejun Heo 0 siblings, 2 replies; 69+ messages in thread From: zhang.yi20 @ 2013-11-27 2:07 UTC (permalink / raw) To: Oleg Nesterov; +Cc: htejun, lkml, Tetsuo Handa, Tejun Heo Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread can do kernel_thread() to create the child. Clearing this flag in usersapce child to enable its migrating capability. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> --- linux-3.12.old/fs/exec.c 2013-11-26 08:53:12.175811856 +0000 +++ linux-3.12/fs/exec.c 2013-11-27 09:36:56.231972168 +0000 @@ -1090,8 +1090,8 @@ int flush_old_exec(struct linux_binprm * bprm->mm = NULL; /* We're using it now */ set_fs(USER_DS); - current->flags &= - ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE); + current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | + PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); current->personality &= ~bprm->per_clear; ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-27 2:07 ` zhang.yi20 @ 2013-11-27 13:20 ` Oleg Nesterov 2013-11-27 18:31 ` Tejun Heo 1 sibling, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-27 13:20 UTC (permalink / raw) To: zhang.yi20; +Cc: htejun, lkml, Tetsuo Handa, Tejun Heo On 11/27, zhang.yi20@zte.com.cn wrote: > > Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be > a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread > can do kernel_thread() to create the child. > Clearing this flag in usersapce child to enable its migrating capability. > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Thanks! Acked-by: Oleg Nesterov <oleg@redhat.com> > --- linux-3.12.old/fs/exec.c 2013-11-26 08:53:12.175811856 +0000 > +++ linux-3.12/fs/exec.c 2013-11-27 09:36:56.231972168 +0000 > @@ -1090,8 +1090,8 @@ int flush_old_exec(struct linux_binprm * > bprm->mm = NULL; /* We're using it now */ > > set_fs(USER_DS); > - current->flags &= > - ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE); > + current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | > + PF_NOFREEZE | PF_NO_SETAFFINITY); > flush_thread(); > current->personality &= ~bprm->per_clear; > > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-27 2:07 ` zhang.yi20 2013-11-27 13:20 ` Oleg Nesterov @ 2013-11-27 18:31 ` Tejun Heo 2013-11-28 9:13 ` Peter Zijlstra 1 sibling, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-27 18:31 UTC (permalink / raw) To: zhang.yi20; +Cc: Oleg Nesterov, lkml, Tetsuo Handa, Ingo Molnar, Peter Zijlstra On Wed, Nov 27, 2013 at 10:07:03AM +0800, zhang.yi20@zte.com.cn wrote: > Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be > a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread > can do kernel_thread() to create the child. > Clearing this flag in usersapce child to enable its migrating capability. > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Acked-by: Tejun Heo <tj@kernel.org> cc'ing Ingo and Peter. Ingo, I think this one doesn't really suit the workqueue tree. Can you please pick this one up w/ Oleg's ack added and stable cc'd? The original patch is http://article.gmane.org/gmane.linux.kernel/1602429/raw Thanks! -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-27 18:31 ` Tejun Heo @ 2013-11-28 9:13 ` Peter Zijlstra 2013-11-28 11:45 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 9:13 UTC (permalink / raw) To: Tejun Heo; +Cc: zhang.yi20, Oleg Nesterov, lkml, Tetsuo Handa, Ingo Molnar On Wed, Nov 27, 2013 at 01:31:17PM -0500, Tejun Heo wrote: > On Wed, Nov 27, 2013 at 10:07:03AM +0800, zhang.yi20@zte.com.cn wrote: > > Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be > > a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread > > can do kernel_thread() to create the child. > > Clearing this flag in usersapce child to enable its migrating capability. > > > > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> > > Acked-by: Tejun Heo <tj@kernel.org> > > cc'ing Ingo and Peter. Ingo, I think this one doesn't really suit the > workqueue tree. Can you please pick this one up w/ Oleg's ack added > and stable cc'd? The original patch is > > http://article.gmane.org/gmane.linux.kernel/1602429/raw So I don't get the problem; aren't all usermode helper thingies spawned by the khelper task, which doesn't have PG_NO_SETAFFINITY set? So how come this is a problem? The Changelog is not explaining anything much -- so no I will not take this patch. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 9:13 ` Peter Zijlstra @ 2013-11-28 11:45 ` Oleg Nesterov 2013-11-28 12:17 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 11:45 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Wed, Nov 27, 2013 at 01:31:17PM -0500, Tejun Heo wrote: > > On Wed, Nov 27, 2013 at 10:07:03AM +0800, zhang.yi20@zte.com.cn wrote: > > > Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be > > > a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread > > > can do kernel_thread() to create the child. > > > Clearing this flag in usersapce child to enable its migrating capability. > > > > > > > > > Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> > > > > Acked-by: Tejun Heo <tj@kernel.org> > > > > cc'ing Ingo and Peter. Ingo, I think this one doesn't really suit the > > workqueue tree. Can you please pick this one up w/ Oleg's ack added > > and stable cc'd? The original patch is > > > > http://article.gmane.org/gmane.linux.kernel/1602429/raw > > So I don't get the problem; aren't all usermode helper thingies spawned > by the khelper task, which doesn't have PG_NO_SETAFFINITY set? It has? khelper is a workqueue thread, this flag is set by create_worker(). And it does kernel_thread() (not kthread_create()) so the child gets this flag too. > The Changelog is not explaining anything much -- so no I will not take > this patch. Well, perhaps the changelog can be more verbose and clear... Zhang, I'm afraid you have to send v3 ;) Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 11:45 ` Oleg Nesterov @ 2013-11-28 12:17 ` Peter Zijlstra 2013-11-28 13:31 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 12:17 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 12:45:42PM +0100, Oleg Nesterov wrote: > It has? khelper is a workqueue thread, this flag is set by create_worker(). > > And it does kernel_thread() (not kthread_create()) so the child gets this > flag too. Urgh, but that's still completely wrong. khelper is a singlethread workqueue, those should be unbound and therefore should not have this flag set at all. In fact, I know people want to set affinity on khelper to ensure all the usermode helper tasks get spawned on the specific system-cpus and leave the rest of the system unperturbed. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 12:17 ` Peter Zijlstra @ 2013-11-28 13:31 ` Oleg Nesterov 2013-11-28 13:39 ` Peter Zijlstra 2013-11-28 13:41 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child Peter Zijlstra 0 siblings, 2 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 13:31 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 12:45:42PM +0100, Oleg Nesterov wrote: > > It has? khelper is a workqueue thread, this flag is set by create_worker(). > > > > And it does kernel_thread() (not kthread_create()) so the child gets this > > flag too. > > Urgh, but that's still completely wrong. khelper is a singlethread > workqueue, those should be unbound and therefore should not have this > flag set at all. Well. This is debatable, but I leave this to you and Tejun ;) But: > In fact, I know people want to set affinity on khelper This is not that simple. Note that khelper itself is the rescuer thread, it doesn't not process the works. There are other kworker/u* threads which run the work queued on khelper_wq. There is a pool of threads. > to ensure all the > usermode helper tasks get spawned on the specific system-cpus and leave > the rest of the system unperturbed. I _guess_ usermodehelper_init() should use WQ_SYSFS then, and in this case the user can write to wq_cpumask_store somewhere in /sys/. Not sure I read this code correctly though, I didn't check. But PF_NO_SETAFFINITY should be cleared on exec anyway, so I still think the patch does the right thing. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:31 ` Oleg Nesterov @ 2013-11-28 13:39 ` Peter Zijlstra 2013-11-28 14:13 ` Tejun Heo 2013-11-28 14:23 ` Oleg Nesterov 2013-11-28 13:41 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child Peter Zijlstra 1 sibling, 2 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 13:39 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 02:31:52PM +0100, Oleg Nesterov wrote: > On 11/28, Peter Zijlstra wrote: > > > > On Thu, Nov 28, 2013 at 12:45:42PM +0100, Oleg Nesterov wrote: > > > It has? khelper is a workqueue thread, this flag is set by create_worker(). > > > > > > And it does kernel_thread() (not kthread_create()) so the child gets this > > > flag too. > > > > Urgh, but that's still completely wrong. khelper is a singlethread > > workqueue, those should be unbound and therefore should not have this > > flag set at all. > > Well. This is debatable, but I leave this to you and Tejun ;) How can that be debatable? I don't see a single argument in favour of doing that; its plain ridiculous. > > In fact, I know people want to set affinity on khelper > > This is not that simple. Note that khelper itself is the rescuer thread, > it doesn't not process the works. There are other kworker/u* threads which > run the work queued on khelper_wq. There is a pool of threads. That's just fucked. WTF does singlethreaded mean then? A single parent process for all usermode helpers makes so much sense; not doing it is just weird. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:39 ` Peter Zijlstra @ 2013-11-28 14:13 ` Tejun Heo 2013-11-28 14:31 ` Peter Zijlstra 2013-11-28 14:38 ` Peter Zijlstra 2013-11-28 14:23 ` Oleg Nesterov 1 sibling, 2 replies; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:13 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hello, > > > Urgh, but that's still completely wrong. khelper is a singlethread > > > workqueue, those should be unbound and therefore should not have this > > > flag set at all. > > > > Well. This is debatable, but I leave this to you and Tejun ;) > > How can that be debatable? I don't see a single argument in favour of > doing that; its plain ridiculous. As the flag name suggests, it prevents userland from changing affinity of the worker threads which we need whether the worker is confined to a cpu, NUMA node, random subset of CPUs or not at all. Please note that there's no one-to-one mapping between a worker and any given workqueue. workqueue can't allow userland to set affinity on random workers. They're shared resources. > > > In fact, I know people want to set affinity on khelper > > > > This is not that simple. Note that khelper itself is the rescuer thread, > > it doesn't not process the works. There are other kworker/u* threads which > > run the work queued on khelper_wq. There is a pool of threads. > > That's just fucked. WTF does singlethreaded mean then? As explained in the other reply, it's a carried-over name and will be replaced by alloc_ordered_workqueue(). Note that if khelper_wq is converted to use alloc_ordered_workqueue(), the task named "khelper" wouldn't even exist as there's no reason for the wq to have a rescuer. > A single parent process for all usermode helpers makes so much sense; > not doing it is just weird. If we're gonna allow userland to play with the parent attributes, yeah, that'd make sense. I'm not sure whether that's an interface that we'd want to commit to tho? Do we really want to tell userland "there will always be a kernel task khelper and if you change that one's attributes all processes forked from it will inherit those attributes no matter what they are." I think we'd want something more specific cause that's a lot of commitment to things that we haven't carefully thought about. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:13 ` Tejun Heo @ 2013-11-28 14:31 ` Peter Zijlstra 2013-11-28 14:38 ` Tejun Heo 2013-11-28 14:43 ` Peter Zijlstra 2013-11-28 14:38 ` Peter Zijlstra 1 sibling, 2 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:31 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:13:29AM -0500, Tejun Heo wrote: > Hello, > > > > > Urgh, but that's still completely wrong. khelper is a singlethread > > > > workqueue, those should be unbound and therefore should not have this > > > > flag set at all. > > > > > > Well. This is debatable, but I leave this to you and Tejun ;) > > > > How can that be debatable? I don't see a single argument in favour of > > doing that; its plain ridiculous. > > As the flag name suggests, it prevents userland from changing affinity > of the worker threads which we need whether the worker is confined to > a cpu, NUMA node, random subset of CPUs or not at all. Please note > that there's no one-to-one mapping between a worker and any given > workqueue. workqueue can't allow userland to set affinity on random > workers. They're shared resources. But the WQ_UNBOUND thingies should be just that and should thus not have the NO_SETAFFINITY flag set because there is no valid reason to have it set. Regardless of whether the threads are shared between unbound workqueues or not. Sure, we absolutely must set it for per-cpu workqueues (and their workers) otherwise we cannot guarantee correctness. Same for per-node if we have that. But absolutely not for unbound. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:31 ` Peter Zijlstra @ 2013-11-28 14:38 ` Tejun Heo 2013-11-28 14:47 ` Peter Zijlstra 2013-11-28 14:55 ` Peter Zijlstra 2013-11-28 14:43 ` Peter Zijlstra 1 sibling, 2 replies; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:38 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hey, Peter. On Thu, Nov 28, 2013 at 03:31:45PM +0100, Peter Zijlstra wrote: > But the WQ_UNBOUND thingies should be just that and should thus not have > the NO_SETAFFINITY flag set because there is no valid reason to have it > set. > > Regardless of whether the threads are shared between unbound workqueues > or not. Hah? No, we do not want to allow userland to be able to set affinities on any workqueue workers, period. That's just inviting people to do weirdest things and then reporting things like "crypt jobs on some of our 500 machines end up stuck on a single cpu once in a while" which will eventually be tracked down to some weird shell script setting affinity on workers doing something else. We really want to insulate workers and pool operation from userland. e.g. unbound workqueues now default to per-NUMA affinity unless explicitly told not to, which leads to better overall behavior for most workloads. We do wanna keep those details from userland so that they can be improved in the future too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:38 ` Tejun Heo @ 2013-11-28 14:47 ` Peter Zijlstra 2013-11-28 14:51 ` Tejun Heo 2013-11-28 14:55 ` Peter Zijlstra 1 sibling, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:47 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:38:48AM -0500, Tejun Heo wrote: > Hey, Peter. > > On Thu, Nov 28, 2013 at 03:31:45PM +0100, Peter Zijlstra wrote: > > But the WQ_UNBOUND thingies should be just that and should thus not have > > the NO_SETAFFINITY flag set because there is no valid reason to have it > > set. > > > > Regardless of whether the threads are shared between unbound workqueues > > or not. > > Hah? No, we do not want to allow userland to be able to set > affinities on any workqueue workers, period. That's just inviting > people to do weirdest things and then reporting things like "crypt > jobs on some of our 500 machines end up stuck on a single cpu once in > a while" which will eventually be tracked down to some weird shell > script setting affinity on workers doing something else. > > We really want to insulate workers and pool operation from userland. > e.g. unbound workqueues now default to per-NUMA affinity unless > explicitly told not to, which leads to better overall behavior for > most workloads. We do wanna keep those details from userland so that > they can be improved in the future too. Bah, windows mentality of we know better. If they do stupid they get stupid; the only thing we should be concerned about is correctness, they shouldn't be able to crash the system. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:47 ` Peter Zijlstra @ 2013-11-28 14:51 ` Tejun Heo 0 siblings, 0 replies; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:51 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:47:14PM +0100, Peter Zijlstra wrote: > Bah, windows mentality of we know better. > > If they do stupid they get stupid; the only thing we should be concerned > about is correctness, they shouldn't be able to crash the system. It isn't about that we know better but more that we do not want to leak implementation details to userland. "Whatever isn't crashing is okay" doesn't work because we aren't supposed to change behaviors which are depended upon from userland. Unless what's exposed is actively managed, we end up being locked into specific unintended characteristics of the current implementation. It hurts both the kernel and userland. We end up having to hack around silly constraints in the future and userland gets surprised when such hacks inevitably falls apart in corner cases. So, no, it's not about which part knows better or windows mentaility. It's about having proper separation between implementation details and exposed interface. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:38 ` Tejun Heo 2013-11-28 14:47 ` Peter Zijlstra @ 2013-11-28 14:55 ` Peter Zijlstra 2013-11-28 14:57 ` Tejun Heo 1 sibling, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:55 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:38:48AM -0500, Tejun Heo wrote: > Hah? No, we do not want to allow userland to be able to set > affinities on any workqueue workers, period. That's just inviting > people to do weirdest things and then reporting things like "crypt > jobs on some of our 500 machines end up stuck on a single cpu once in > a while" which will eventually be tracked down to some weird shell > script setting affinity on workers doing something else. Because that is exactly what some users desire? Some RT and HPC people want all those crypt jobs to run on a few designated cpus and not wreck their compute jobs. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:55 ` Peter Zijlstra @ 2013-11-28 14:57 ` Tejun Heo 2013-11-28 14:59 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:57 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:55:05PM +0100, Peter Zijlstra wrote: > Because that is exactly what some users desire? Some RT and HPC people > want all those crypt jobs to run on a few designated cpus and not wreck > their compute jobs. And you can't do that reliably from userland. Those things are created dynamically. workqueue provides per-workqueue attributes to enable those use cases. It doesn't have system-wide default attributes right now but adding that shouldn't be too difficult. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:57 ` Tejun Heo @ 2013-11-28 14:59 ` Peter Zijlstra 2013-11-28 15:07 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:59 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:57:23AM -0500, Tejun Heo wrote: > On Thu, Nov 28, 2013 at 03:55:05PM +0100, Peter Zijlstra wrote: > > Because that is exactly what some users desire? Some RT and HPC people > > want all those crypt jobs to run on a few designated cpus and not wreck > > their compute jobs. > > And you can't do that reliably from userland. Those things are > created dynamically. workqueue provides per-workqueue attributes to > enable those use cases. It doesn't have system-wide default > attributes right now but adding that shouldn't be too difficult. Like a parent process, right? So people can use the existing task interfaces. Or are we going to tell people; here, there's cgroups for managing your tasks. Oh and here is this crippled interface X for if you want to manage these few tasks, and interface Y for managing these other few tasks. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:59 ` Peter Zijlstra @ 2013-11-28 15:07 ` Tejun Heo 2013-11-28 15:17 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 15:07 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:59:48PM +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2013 at 09:57:23AM -0500, Tejun Heo wrote: > > On Thu, Nov 28, 2013 at 03:55:05PM +0100, Peter Zijlstra wrote: > > > Because that is exactly what some users desire? Some RT and HPC people > > > want all those crypt jobs to run on a few designated cpus and not wreck > > > their compute jobs. > > > > And you can't do that reliably from userland. Those things are > > created dynamically. workqueue provides per-workqueue attributes to > > enable those use cases. It doesn't have system-wide default > > attributes right now but adding that shouldn't be too difficult. > > Like a parent process, right? So people can use the existing task > interfaces. > > Or are we going to tell people; here, there's cgroups for managing your > tasks. Oh and here is this crippled interface X for if you want to > manage these few tasks, and interface Y for managing these other few > tasks. Changing attributes on the parent doesn't get propagated to its children, so I don't think that'd be a terribly good interface for workqueue. You'll end up with workers with mixed attributes and regardless, workqueue management logic needs to know which workqueue has which set of attributes to decide how the workers are shared. Overrloading all that over task mgmt interface would be horrible. For transient processes like usermode helpers, single parent could work. I don't necessarily think exposing that is a good idea but if that's gonna happen, that's not gonna be part of workqueue. Just create a dedicated kthread for it. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:07 ` Tejun Heo @ 2013-11-28 15:17 ` Peter Zijlstra 2013-11-28 15:39 ` Tejun Heo 2013-11-28 15:47 ` Oleg Nesterov 0 siblings, 2 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 15:17 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 10:07:22AM -0500, Tejun Heo wrote: > Changing attributes on the parent doesn't get propagated to its > children, so I don't think that'd be a terribly good interface for > workqueue. The idea was that if there's a single parent, its easy to find all the children using the task hierarchy. So flip the parent to the new attributes, then iterate the children and flip them. That's a single pass race-free approach. New children will already have the desired attributes, dead children we don't care about. So there's three useful parts to having a single parent task: - its a task so you can change the entire task attribute set; current and future. - new children will automatically get the desired attributes. - all children are easily identified by virtual of being children of said parent process. Currently its hard to find all usermode helpers or all workqueue tasks. I know people who are poking at /proc/$pid/stat to inspect task_struct::flags while gleaning the PF_ flags from sched.h and run this in a polling loop to catch new entries. Surely that's a bad idea ;-) > You'll end up with workers with mixed attributes and > regardless, workqueue management logic needs to know which workqueue > has which set of attributes to decide how the workers are shared. > Overrloading all that over task mgmt interface would be horrible. Well, mixed attributes is you own responsibility. I'm all for letting people shoot themselves in the foot as long we don't crash. The huge disadvantage to creating special interfaces is that you can only capture a small part of the task attributes; and worse, you create a special limited interface for a special few tasks. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:17 ` Peter Zijlstra @ 2013-11-28 15:39 ` Tejun Heo 2013-11-28 16:33 ` Peter Zijlstra 2013-11-28 15:47 ` Oleg Nesterov 1 sibling, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 15:39 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hey, On Thu, Nov 28, 2013 at 04:17:04PM +0100, Peter Zijlstra wrote: > So there's three useful parts to having a single parent task: > > - its a task so you can change the entire task attribute set; current > and future. Using task as interface could be okay but I'd still go for explicitly specifying what gets inherited and expand them gradually; otherwise, we end up exposing broken stuff unintentionally. cpuset did this with bound workers and the capability was removed retro-actively, which is not a happy situation. > - new children will automatically get the desired attributes. > > - all children are easily identified by virtual of being children of > said parent process. That'd mean that we'd have to have a dummy target task for attributes for each workqueue and hooks for workqueue to get notified of attribute changes. Unless we're gonna go back to per-workqueue workers, we can't have a single parent per workqueue and all its workers as children of it. Different workqueue configure different set of attributes. Not all !percpu workers are equal and each workqueue serves as an attribute domain. We *could* do all that and it proably won't require walking the children from userland as each attribute change would surmount to finding or creating a matching worker pool, but it doesn't look attractive to me. > Well, mixed attributes is you own responsibility. I'm all for letting > people shoot themselves in the foot as long we don't crash. Again, I'm worried about exposing unintended characteristics of implementation and being locked into it. Regardless of interface, I think it's important to control what can be depended upon from userland if we're gonna keep up "no userland visible behavior will break" thing. > The huge disadvantage to creating special interfaces is that you can > only capture a small part of the task attributes; and worse, you create > a special limited interface for a special few tasks. Yeah, that's the disadvantage but I don't think the single parent per workqueue model is gonna work. FWIW, workqueue implements standardized sysfs interface so that each user doesn't end up with custom interface (writeback was growing one and got switched to the workqueue standard one). workqueue is shared pools of workers keyed by specific worker attributes. There evidently are restrictions coming from its nature and no matter what we do workqueue needs to be taught to distinguish each attribute. I think workqueue-wide interface is an acceptable compromise especially considering that there are attributes which can't be represented by a single task such as max_active and automatic NUMA binding, which means we need workqueue-specific interface anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:39 ` Tejun Heo @ 2013-11-28 16:33 ` Peter Zijlstra 2013-11-29 14:33 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 16:33 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 10:39:06AM -0500, Tejun Heo wrote: > Hey, > > On Thu, Nov 28, 2013 at 04:17:04PM +0100, Peter Zijlstra wrote: > > So there's three useful parts to having a single parent task: > > > > - its a task so you can change the entire task attribute set; current > > and future. > > Using task as interface could be okay but I'd still go for explicitly > specifying what gets inherited and expand them gradually; otherwise, > we end up exposing broken stuff unintentionally. cpuset did this with > bound workers and the capability was removed retro-actively, which is > not a happy situation. I can work with that. We'd need way to inhibit setting certain attributes, but that can be worked out -- its all in-kernel anyway. > > - new children will automatically get the desired attributes. > > > > - all children are easily identified by virtual of being children of > > said parent process. > > That'd mean that we'd have to have a dummy target task for attributes > for each workqueue and hooks for workqueue to get notified of > attribute changes. Unless we're gonna go back to per-workqueue > workers, we can't have a single parent per workqueue and all its > workers as children of it. Different workqueue configure different > set of attributes. Not all !percpu workers are equal and each > workqueue serves as an attribute domain. > > We *could* do all that and it proably won't require walking the > children from userland as each attribute change would surmount to > finding or creating a matching worker pool, but it doesn't look > attractive to me. I'm not sure we need a single parent per workqueue; certainly the case I get asked most frequently about doesn't care, they only want to contain _all_ unbound workers. I don't see a problem with later splitting out other workqueues if there's a good use-case for those. I'm not even sure we need to split out the userspace helpers per-se; again, they fall in the all-unbound category and I don't think I've seen people ask for specific control of those over other unbound workers -- although conceptually it does make some sense to split them out. > > Well, mixed attributes is you own responsibility. I'm all for letting > > people shoot themselves in the foot as long we don't crash. > > Again, I'm worried about exposing unintended characteristics of > implementation and being locked into it. Regardless of interface, I > think it's important to control what can be depended upon from > userland if we're gonna keep up "no userland visible behavior will > break" thing. I appreciate your caution, but we shouldn't overdo the thing and dis-allow everything. > > The huge disadvantage to creating special interfaces is that you can > > only capture a small part of the task attributes; and worse, you create > > a special limited interface for a special few tasks. > > Yeah, that's the disadvantage but I don't think the single parent per > workqueue model is gonna work. I never proposed a parent per workqueue. The most I proposed was a single parent for all unbound workers and a parent for all usermode helpers. > automatic > NUMA binding, which means we need workqueue-specific interface anyway. I'm curious; why is there workqueue numa stuff? NUMA doesn't have the correctness issues per-cpu has -- per-cpu is fundamentally special in that there's no concurrency. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 16:33 ` Peter Zijlstra @ 2013-11-29 14:33 ` Tejun Heo 0 siblings, 0 replies; 69+ messages in thread From: Tejun Heo @ 2013-11-29 14:33 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hey, On Thu, Nov 28, 2013 at 05:33:23PM +0100, Peter Zijlstra wrote: > I'm not sure we need a single parent per workqueue; certainly the case I > get asked most frequently about doesn't care, they only want to contain > _all_ unbound workers. > > I don't see a problem with later splitting out other workqueues if > there's a good use-case for those. > > I'm not even sure we need to split out the userspace helpers per-se; > again, they fall in the all-unbound category and I don't think I've seen > people ask for specific control of those over other unbound workers -- > although conceptually it does make some sense to split them out. There are workqueues with custom attributes, so, if we want to have default settings for all workers, we'll have to implement nested configuration where global knobs control the default settings with individual settings overriding them. > > Again, I'm worried about exposing unintended characteristics of > > implementation and being locked into it. Regardless of interface, I > > think it's important to control what can be depended upon from > > userland if we're gonna keep up "no userland visible behavior will > > break" thing. > > I appreciate your caution, but we shouldn't overdo the thing and > dis-allow everything. Yeah, going extreme on either extreme would suck but I'm not suggesting disallowing everything, just that we need to vet what gets exposed. > I never proposed a parent per workqueue. The most I proposed was a > single parent for all unbound workers and a parent for all usermode > helpers. It'd be weird to have one mechanism to control default attributes and a different one to control per-workqueue attributes (again, we need a custom mechanism to control attributes outside task scope anyway). If we're actually gonna do this, I'd much prefer reusing the existing workqueue mechanism than introducing something new which wouldn't be able to replace the existing one. > > automatic > > NUMA binding, which means we need workqueue-specific interface anyway. > > I'm curious; why is there workqueue numa stuff? NUMA doesn't have the > correctness issues per-cpu has -- per-cpu is fundamentally special in > that there's no concurrency. NUMA doesn't have correctness issues but it does have significant performance impacts. There are work itmes which may consume considerable amount of cpu cycles, e.g. crypt, writeback, io completion. Binding them to cpu limits how much they can scale and using plain unbound deteriorates NUMA affinity. Even for unbound work items which aren't necessary cpu cycle / memory access heavy, sticking to NUMA affinity leads to generally better behavior. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:17 ` Peter Zijlstra 2013-11-28 15:39 ` Tejun Heo @ 2013-11-28 15:47 ` Oleg Nesterov 2013-11-28 16:07 ` Oleg Nesterov 1 sibling, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 15:47 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > So there's three useful parts to having a single parent task: > > - its a task so you can change the entire task attribute set; current > and future. You can't change, say, its fs->root. OK, OK, this is almost off-topic, I am just trying to say that (perhaps) in the longer term we might want even more than just a dedicated kernel thread. > - new children will automatically get the desired attributes. yes, > - all children are easily identified by virtual of being children of > said parent process. Well, UMH_WAIT_PROC won't work in this case, or it will be "single-threaded" too. OK, this is probably solvable, but nontrivial. In this case "khelper" should act as a /sbin/init for user-mode helpers. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:47 ` Oleg Nesterov @ 2013-11-28 16:07 ` Oleg Nesterov 0 siblings, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 16:07 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Oleg Nesterov wrote: > > On 11/28, Peter Zijlstra wrote: > > > > So there's three useful parts to having a single parent task: > > > > - its a task so you can change the entire task attribute set; current > > and future. > > You can't change, say, its fs->root. > > OK, OK, this is almost off-topic, I am just trying to say that > (perhaps) in the longer term we might want even more than just > a dedicated kernel thread. > > > - new children will automatically get the desired attributes. > > yes, > > > - all children are easily identified by virtual of being children of > > said parent process. > > Well, UMH_WAIT_PROC won't work in this case, or it will be "single-threaded" > too. OK, this is probably solvable, but nontrivial. In this case "khelper" > should act as a /sbin/init for user-mode helpers. But I'm afraid I got lost... I interpreted "having a single parent task" as "parent of all usermode helpers". If you actually meant that every wq thread should have a parent which represents the workqueue, well, I disagree but I leave this to you and Tejun again ;) Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:31 ` Peter Zijlstra 2013-11-28 14:38 ` Tejun Heo @ 2013-11-28 14:43 ` Peter Zijlstra 2013-11-28 14:53 ` Tejun Heo 1 sibling, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:43 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:31:45PM +0100, Peter Zijlstra wrote: > Sure, we absolutely must set it for per-cpu workqueues (and their > workers) otherwise we cannot guarantee correctness. Same for per-node if > we have that. On that, the per-node thing is debatable. There's no correctness issues with per-node stuff as we have with per-cpu storage. And if there's no correctness implications we should not force things. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:43 ` Peter Zijlstra @ 2013-11-28 14:53 ` Tejun Heo 2013-11-28 14:57 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:53 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:43:59PM +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2013 at 03:31:45PM +0100, Peter Zijlstra wrote: > > Sure, we absolutely must set it for per-cpu workqueues (and their > > workers) otherwise we cannot guarantee correctness. Same for per-node if > > we have that. > > On that, the per-node thing is debatable. There's no correctness issues > with per-node stuff as we have with per-cpu storage. > > And if there's no correctness implications we should not force things. That's true iff you confine the "correctness" to not crashing. That's an extremely narrow definition tho and most will argue against that. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:53 ` Tejun Heo @ 2013-11-28 14:57 ` Peter Zijlstra 2013-11-28 15:02 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:57 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:53:25AM -0500, Tejun Heo wrote: > On Thu, Nov 28, 2013 at 03:43:59PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 03:31:45PM +0100, Peter Zijlstra wrote: > > > Sure, we absolutely must set it for per-cpu workqueues (and their > > > workers) otherwise we cannot guarantee correctness. Same for per-node if > > > we have that. > > > > On that, the per-node thing is debatable. There's no correctness issues > > with per-node stuff as we have with per-cpu storage. > > > > And if there's no correctness implications we should not force things. > > That's true iff you confine the "correctness" to not crashing. That's > an extremely narrow definition tho and most will argue against that. Do you have a better definition that doesn't get in the way of people wanting to do actual work? So far I just see you breaking existing setups because you don't want to support things that work perfectly well. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:57 ` Peter Zijlstra @ 2013-11-28 15:02 ` Tejun Heo 2013-11-28 15:07 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 15:02 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 03:57:55PM +0100, Peter Zijlstra wrote: > Do you have a better definition that doesn't get in the way of people > wanting to do actual work? As written multiple times now, workqueue does have per-workqueue attributes and if you really want a dedicated worker thread, use a dedicated kthread. > So far I just see you breaking existing setups because you don't want to > support things that work perfectly well. It doesn't work as explained multiple times in this thread. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:02 ` Tejun Heo @ 2013-11-28 15:07 ` Peter Zijlstra 2013-11-28 15:10 ` Tejun Heo 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 15:07 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 10:02:10AM -0500, Tejun Heo wrote: > > So far I just see you breaking existing setups because you don't want to > > support things that work perfectly well. > > It doesn't work as explained multiple times in this thread. It used to.. just not on recent kernels. You know 'enterprise' latency. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:07 ` Peter Zijlstra @ 2013-11-28 15:10 ` Tejun Heo 2013-11-28 15:18 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 15:10 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 04:07:04PM +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2013 at 10:02:10AM -0500, Tejun Heo wrote: > > > So far I just see you breaking existing setups because you don't want to > > > support things that work perfectly well. > > > > It doesn't work as explained multiple times in this thread. > > It used to.. just not on recent kernels. You know 'enterprise' latency. If you're talking about khelfper and wanna restore it, it really should be broken out into a separate kthread. It doesn't make any sense to implement that in the workqueue framework. Why would you implement a dedicated task inside a worker pool implementation which makes use of the said tasks? There's even kthread_work interface which pretty much provides workqueue-equivalent interface on top of a single task for cases like this. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:10 ` Tejun Heo @ 2013-11-28 15:18 ` Peter Zijlstra 0 siblings, 0 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 15:18 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 10:10:35AM -0500, Tejun Heo wrote: > On Thu, Nov 28, 2013 at 04:07:04PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 10:02:10AM -0500, Tejun Heo wrote: > > > > So far I just see you breaking existing setups because you don't want to > > > > support things that work perfectly well. > > > > > > It doesn't work as explained multiple times in this thread. > > > > It used to.. just not on recent kernels. You know 'enterprise' latency. > > If you're talking about khelfper and wanna restore it, it really > should be broken out into a separate kthread. It doesn't make any > sense to implement that in the workqueue framework. Why would you > implement a dedicated task inside a worker pool implementation which > makes use of the said tasks? There's even kthread_work interface > which pretty much provides workqueue-equivalent interface on top of a > single task for cases like this. That would only solve one of my problems. People want to contain the unbound workqueues too. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:13 ` Tejun Heo 2013-11-28 14:31 ` Peter Zijlstra @ 2013-11-28 14:38 ` Peter Zijlstra 2013-11-28 14:45 ` Tejun Heo 2013-11-28 15:34 ` Oleg Nesterov 1 sibling, 2 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:38 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:13:29AM -0500, Tejun Heo wrote: > > A single parent process for all usermode helpers makes so much sense; > > not doing it is just weird. > > If we're gonna allow userland to play with the parent attributes, > yeah, that'd make sense. I'm not sure whether that's an interface > that we'd want to commit to tho? Do we really want to tell userland > "there will always be a kernel task khelper and if you change that > one's attributes all processes forked from it will inherit those > attributes no matter what they are." I think we'd want something more > specific cause that's a lot of commitment to things that we haven't > carefully thought about. It seems like a perfectly fine interface to me. And much preferable to creating yet another weird interface to manage tasks. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:38 ` Peter Zijlstra @ 2013-11-28 14:45 ` Tejun Heo 2013-11-28 14:53 ` Peter Zijlstra 2013-11-28 15:34 ` Oleg Nesterov 1 sibling, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:45 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hello, On Thu, Nov 28, 2013 at 03:38:57PM +0100, Peter Zijlstra wrote: > It seems like a perfectly fine interface to me. And much preferable to > creating yet another weird interface to manage tasks. The problem with that is it isn't an interface where the user specifies what's desired but more of just an exposed facet of implementation details and locks us into either keeping that single parent model for the eternity or doing a shitty hack like extracing attributes from that task if we ever need to change the implementation for whatever reason. In general, it's a much better idea to have an interface where the feature supported is represented explicitly and finitely. If this is something which is actually necessary, I'd vote for having an explicit interface for the desired feature, whatever that may be. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:45 ` Tejun Heo @ 2013-11-28 14:53 ` Peter Zijlstra 0 siblings, 0 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 14:53 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 09:45:02AM -0500, Tejun Heo wrote: > Hello, > > On Thu, Nov 28, 2013 at 03:38:57PM +0100, Peter Zijlstra wrote: > > It seems like a perfectly fine interface to me. And much preferable to > > creating yet another weird interface to manage tasks. > > The problem with that is it isn't an interface where the user > specifies what's desired but more of just an exposed facet of > implementation details and locks us into either keeping that single > parent model for the eternity or doing a shitty hack like extracing > attributes from that task if we ever need to change the implementation > for whatever reason. In general, it's a much better idea to have an > interface where the feature supported is represented explicitly and > finitely. If this is something which is actually necessary, I'd vote > for having an explicit interface for the desired feature, whatever > that may be. The desired feature is to contain all current and future userspace helpers. Having the single parent allows using either cgroups or direct affinity constraints, depending on whatever the user likes. Related is the desire to contain all current and future unbound workqueue threads. Not being able to set affinity (or cpusets for that matter) on unbound workqueue threads actively breaks these things. While the current use-cases I'm aware of treat these two groups as one, I'm not sure we want to merge them. The problem with yet another interface is that we'll likely expose just the affinity, which would preclude using cgroups for this task. Having single parents (even virtual as you propose, where we harvest context attributes) allows us to easily find the extant child tasks and set the desired attributes for future children with the full set of task interfaces we've currently got. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:38 ` Peter Zijlstra 2013-11-28 14:45 ` Tejun Heo @ 2013-11-28 15:34 ` Oleg Nesterov 2013-11-28 15:40 ` Peter Zijlstra 1 sibling, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 15:34 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 09:13:29AM -0500, Tejun Heo wrote: > > > A single parent process for all usermode helpers makes so much sense; > > > not doing it is just weird. > > > > If we're gonna allow userland to play with the parent attributes, > > yeah, that'd make sense. I'm not sure whether that's an interface > > that we'd want to commit to tho? Do we really want to tell userland > > "there will always be a kernel task khelper and if you change that > > one's attributes all processes forked from it will inherit those > > attributes no matter what they are." I think we'd want something more > > specific cause that's a lot of commitment to things that we haven't > > carefully thought about. > > It seems like a perfectly fine interface to me. And much preferable to > creating yet another weird interface to manage tasks. OK. I am not sure, but perhaps this makes sense. (Although this means that we will always have the problem with the recursive UMH_WAIT_* requests). In this case khelper should be turned into kthread_worker, this looks simple. But note that in the longer term we might want even more. We probably want a non-daemonized thread controlled by the user-space. And even more, this thread should be per-namespace (this needs a lot more discussion). But whataver we do later, I believe that the patch from Zhang should be applied now. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:34 ` Oleg Nesterov @ 2013-11-28 15:40 ` Peter Zijlstra 2013-11-28 16:20 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 15:40 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 04:34:43PM +0100, Oleg Nesterov wrote: > But note that in the longer term we might want even more. We probably > want a non-daemonized thread controlled by the user-space. And even > more, this thread should be per-namespace (this needs a lot more > discussion). Which namespace? PID namespace I presume where we can have a 'new' init task and everything. I'm not sure, are any of these things (workqueues, userspace helpers) pid namespace aware? If not it doesn't seem to make sense to expose this to nested PID namespaces and would be something special for the root namespace. > But whataver we do later, I believe that the patch from Zhang should > be applied now. Probably, although with a much adjusted changelog. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:40 ` Peter Zijlstra @ 2013-11-28 16:20 ` Oleg Nesterov 2013-11-28 16:38 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 16:20 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 04:34:43PM +0100, Oleg Nesterov wrote: > > But note that in the longer term we might want even more. We probably > > want a non-daemonized thread controlled by the user-space. And even > > more, this thread should be per-namespace (this needs a lot more > > discussion). > > Which namespace? PID namespace I presume where we can have a 'new' init > task and everything. > > I'm not sure, are any of these things (workqueues, userspace helpers) > pid namespace aware? If not it doesn't seem to make sense to expose this > to nested PID namespaces and would be something special for the root > namespace. Not sure I understand correctly. But yes, of course, it is not that call_usermodehelper() should be namespace-friendly "unconditionally". We need another API (although perhaps we can simply add UMH_NAMESPACE flag, this doesn't matter). Just for example, the piped core handler. Currently it is hardly useful in containers. > > But whataver we do later, I believe that the patch from Zhang should > > be applied now. > > Probably, although with a much adjusted changelog. Zhang, please make v3 ;) Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 16:20 ` Oleg Nesterov @ 2013-11-28 16:38 ` Peter Zijlstra 2013-11-28 18:13 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 16:38 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 05:20:25PM +0100, Oleg Nesterov wrote: > On 11/28, Peter Zijlstra wrote: > > > > On Thu, Nov 28, 2013 at 04:34:43PM +0100, Oleg Nesterov wrote: > > > But note that in the longer term we might want even more. We probably > > > want a non-daemonized thread controlled by the user-space. And even > > > more, this thread should be per-namespace (this needs a lot more > > > discussion). > > > > Which namespace? PID namespace I presume where we can have a 'new' init > > task and everything. > > > > I'm not sure, are any of these things (workqueues, userspace helpers) > > pid namespace aware? If not it doesn't seem to make sense to expose this > > to nested PID namespaces and would be something special for the root > > namespace. > > Not sure I understand correctly. But yes, of course, it is not that > call_usermodehelper() should be namespace-friendly "unconditionally". > > We need another API (although perhaps we can simply add UMH_NAMESPACE > flag, this doesn't matter). > > Just for example, the piped core handler. Currently it is hardly useful > in containers. I'm afraid I'm not much familiar with the entire namespace thing other than broad concepts. But if there's specific per-pid-namespace functionality for usermode-helpers, then yes it makes sense to have per-pid-namespace parents. So in specific, you say that piping a core file into a usermode helper is currently busted in pid-namespaces and that fixing that would indeed introduce such pid-namespace awareness to the usermode-helper stuff? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 16:38 ` Peter Zijlstra @ 2013-11-28 18:13 ` Oleg Nesterov 0 siblings, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 18:13 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > So in specific, you say that piping a core file into a usermode helper > is currently busted in pid-namespaces and that fixing that would indeed > introduce such pid-namespace awareness to the usermode-helper stuff? Perhaps yes. People want to run the core handler in the namespace of the crashed task. And (perhaps) even dumping to the file should use per-namespace ->root/etc. (and of course core_pattern[] should not be global, but this is another story). And there are other potential users. Say, nfsd user-mode helpers wants to have the "right" mnt (at least) namespace. But again, this all needs a lot of discussions. Even the API is not actually clear. As for implementation, we could probably even force ->child_reaper (the subnamespace's init) to fork/exec the usermode helper in this case. Or a namespace should explicitly create a thread which should do this. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:39 ` Peter Zijlstra 2013-11-28 14:13 ` Tejun Heo @ 2013-11-28 14:23 ` Oleg Nesterov 2013-11-28 14:31 ` Tejun Heo 1 sibling, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 14:23 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 02:31:52PM +0100, Oleg Nesterov wrote: > > On 11/28, Peter Zijlstra wrote: > > > > > > On Thu, Nov 28, 2013 at 12:45:42PM +0100, Oleg Nesterov wrote: > > > > It has? khelper is a workqueue thread, this flag is set by create_worker(). > > > > > > > > And it does kernel_thread() (not kthread_create()) so the child gets this > > > > flag too. > > > > > > Urgh, but that's still completely wrong. khelper is a singlethread > > > workqueue, those should be unbound and therefore should not have this > > > flag set at all. > > > > Well. This is debatable, but I leave this to you and Tejun ;) > > How can that be debatable? I don't see a single argument in favour of > doing that; its plain ridiculous. Let me repeat, I do not pretend I understand the current implementation in details. However, I'd like to say that I do not think think this is ridiculous. I think this is clever. > > > In fact, I know people want to set affinity on khelper > > > > This is not that simple. Note that khelper itself is the rescuer thread, > > it doesn't not process the works. There are other kworker/u* threads which > > run the work queued on khelper_wq. There is a pool of threads. > > That's just fucked. WTF does singlethreaded mean then? Yes, the naming is misleading. But it was always misleading. "singlethreaded" meant a single thread, yes, but this just reflected the implementation details. What it actually meant is: not bound to any cpu, and the works can't race with each other. create_singlethread_workqueue() still has the same semantics due to WQ_UNBOUND && max_active == 1. So in this sense (max_active == 1) it is still single threaded, just (iiuc, Tejun can correct me) it does not guarantee that the kernel thread which actually runs the work will be always the same. > A single parent process for all usermode helpers makes so much sense; This was never true, at least in UMH_WAIT_PROC case. > not doing it is just weird. Why? Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:23 ` Oleg Nesterov @ 2013-11-28 14:31 ` Tejun Heo 2013-11-28 15:00 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:31 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hello, On Thu, Nov 28, 2013 at 03:23:59PM +0100, Oleg Nesterov wrote: > Yes, the naming is misleading. But it was always misleading. > "singlethreaded" meant a single thread, yes, but this just reflected the > implementation details. What it actually meant is: not bound to any cpu, > and the works can't race with each other. Way back when it was actually a singlethread, the naming was kinda okay. Now it's a deprecated interface. I just didn't finish convering the existing ones to alloc_ordered_workqueue(). > create_singlethread_workqueue() still has the same semantics due to > WQ_UNBOUND && max_active == 1. So in this sense (max_active == 1) it > is still single threaded, just (iiuc, Tejun can correct me) it does > not guarantee that the kernel thread which actually runs the work will > be always the same. So, the thing with create_singlethread_workqueue() is that the original implementation guaranteed dedicated execution resource, which means that it requires WQ_RESCUER in the new (well, newer) implementation. Also, singlethread workqueue used to be used not because singlethreadedness was necessary but just to avoid creating all the per-cpu workers, which is no longer necessary and actually is less efficient. So, converting create_singlethread_workqueue() instances to the new interface usually involves making the two determinations. * Is WQ_RESCUER actually necessary? If not, WQ_RESCUER will be dropped and the task bearing the name of the workqueue will no longer exist. * Is ordered execution necessary? If not, it can be converted to alloc_workqueue() or just to use system_wq. khelper is special as its attributes get inherited to its children, so, yeah, we probably wanna keep that one's cpumask set to all. Hmmm... I think it's already broken. So, on NUMA machines, unbound workqueues will be segmented to each NUMA node by default. If khelper is gonna keep using workqueue, it should set turn on no_numa attribute explicitly. I'll prep a patch. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 14:31 ` Tejun Heo @ 2013-11-28 15:00 ` Oleg Nesterov 2013-11-28 15:02 ` Peter Zijlstra 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 15:00 UTC (permalink / raw) To: Tejun Heo; +Cc: Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Tejun Heo wrote: > > * Is WQ_RESCUER actually necessary? If not, WQ_RESCUER will be > dropped and the task bearing the name of the workqueue will no > longer exist. WQ_MEM_RECLAIM, I guess. Probably not... > * Is ordered execution necessary? If not, it can be converted to > alloc_workqueue() or just to use system_wq. I think no. This is the reason for kmod_thread_locker hack. > khelper is special as its attributes get inherited to its children, > so, yeah, we probably wanna keep that one's cpumask set to all. And btw. Note ____call_usermodehelper()->set_cpus_allowed_ptr(cpu_all_mask). Even if we change the affinity of the "khelper" worker threads this won't restrict the user-space helpers. I think this set_cpus_allowed_ptr() should die in any case? Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 15:00 ` Oleg Nesterov @ 2013-11-28 15:02 ` Peter Zijlstra 2013-11-28 19:33 ` [PATCH 0/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 15:02 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 04:00:55PM +0100, Oleg Nesterov wrote: > And btw. Note ____call_usermodehelper()->set_cpus_allowed_ptr(cpu_all_mask). > > Even if we change the affinity of the "khelper" worker threads this > won't restrict the user-space helpers. > > I think this set_cpus_allowed_ptr() should die in any case? Yes. A while back someone proposed adding a magic cpumask to tweak in order to contain the usermode helpers. Obviously I thought that was a bad idea. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 0/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-11-28 15:02 ` Peter Zijlstra @ 2013-11-28 19:33 ` Oleg Nesterov 2013-11-28 19:33 ` [PATCH 1/1] " Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 19:33 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 04:00:55PM +0100, Oleg Nesterov wrote: > > And btw. Note ____call_usermodehelper()->set_cpus_allowed_ptr(cpu_all_mask). > > > > Even if we change the affinity of the "khelper" worker threads this > > won't restrict the user-space helpers. > > > > I think this set_cpus_allowed_ptr() should die in any case? > > Yes. A while back someone proposed adding a magic cpumask to tweak in > order to contain the usermode helpers. > > Obviously I thought that was a bad idea. OK. let me send the trivial patch then, please take it or ack for Andrew. Assumung that you and Tejun agree with this change, of course. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-11-28 19:33 ` [PATCH 0/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() Oleg Nesterov @ 2013-11-28 19:33 ` Oleg Nesterov 2013-11-29 13:44 ` Tejun Heo 2013-12-05 14:21 ` Frederic Weisbecker 0 siblings, 2 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 19:33 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar ____call_usermodehelper() does set_cpus_allowed_ptr(cpu_all_mask), this (and the comment) is misleading. We no longer have keventd_wq, and kmod.c switched to khelper_wq a long ago. And more importantly, "unlike our parent" is no longer true too, this thread was created by WQ_UNBOUND worker thread which has the full ->cpus_allowed mask, so this set_cpus_allowed_ptr() is simply unnecessary. Perhaps we will change this later, so that userspace can control the affinity of the usermode helper tasks, but this is yet another reason to remove this set_cpus_allowed_ptr(). To some degree this also applies to set_user_nice(), but this patch only updates the comment. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/kmod.c | 8 ++------ 1 files changed, 2 insertions(+), 6 deletions(-) diff --git a/kernel/kmod.c b/kernel/kmod.c index b086006..2fe4544 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -208,13 +208,9 @@ static int ____call_usermodehelper(void *data) spin_lock_irq(¤t->sighand->siglock); flush_signal_handlers(current, 1); spin_unlock_irq(¤t->sighand->siglock); - - /* We can run anywhere, unlike our parent keventd(). */ - set_cpus_allowed_ptr(current, cpu_all_mask); - /* - * Our parent is keventd, which runs with elevated scheduling priority. - * Avoid propagating that into the userspace child. + * Our parent is a workqueue thread, which can run with elevated + * scheduling priority. Avoid propagating that into the userspace. */ set_user_nice(current, 0); -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-11-28 19:33 ` [PATCH 1/1] " Oleg Nesterov @ 2013-11-29 13:44 ` Tejun Heo 2013-11-29 16:49 ` Oleg Nesterov 2013-12-05 14:21 ` Frederic Weisbecker 1 sibling, 1 reply; 69+ messages in thread From: Tejun Heo @ 2013-11-29 13:44 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hey, Oleg. On Thu, Nov 28, 2013 at 08:33:49PM +0100, Oleg Nesterov wrote: > @@ -208,13 +208,9 @@ static int ____call_usermodehelper(void *data) > spin_lock_irq(¤t->sighand->siglock); > flush_signal_handlers(current, 1); > spin_unlock_irq(¤t->sighand->siglock); > - > - /* We can run anywhere, unlike our parent keventd(). */ > - set_cpus_allowed_ptr(current, cpu_all_mask); I don't think this is correct on numa machines where unbound workers are NUMA-node affine by default. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-11-29 13:44 ` Tejun Heo @ 2013-11-29 16:49 ` Oleg Nesterov 0 siblings, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-29 16:49 UTC (permalink / raw) To: Tejun Heo; +Cc: Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hi Tejun, On 11/29, Tejun Heo wrote: > > Hey, Oleg. > > On Thu, Nov 28, 2013 at 08:33:49PM +0100, Oleg Nesterov wrote: > > @@ -208,13 +208,9 @@ static int ____call_usermodehelper(void *data) > > spin_lock_irq(¤t->sighand->siglock); > > flush_signal_handlers(current, 1); > > spin_unlock_irq(¤t->sighand->siglock); > > - > > - /* We can run anywhere, unlike our parent keventd(). */ > > - set_cpus_allowed_ptr(current, cpu_all_mask); > > I don't think this is correct on numa machines where unbound workers > are NUMA-node affine by default. Aaah, I see. And you even mentioned this before, but I didn't pay attention. Thank! Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-11-28 19:33 ` [PATCH 1/1] " Oleg Nesterov 2013-11-29 13:44 ` Tejun Heo @ 2013-12-05 14:21 ` Frederic Weisbecker 2013-12-05 14:37 ` Oleg Nesterov 1 sibling, 1 reply; 69+ messages in thread From: Frederic Weisbecker @ 2013-12-05 14:21 UTC (permalink / raw) To: Oleg Nesterov, Christoph Lameter Cc: Peter Zijlstra, Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar 2013/11/28 Oleg Nesterov <oleg@redhat.com>: > ____call_usermodehelper() does set_cpus_allowed_ptr(cpu_all_mask), > this (and the comment) is misleading. We no longer have keventd_wq, > and kmod.c switched to khelper_wq a long ago. > > And more importantly, "unlike our parent" is no longer true too, > this thread was created by WQ_UNBOUND worker thread which has the > full ->cpus_allowed mask, so this set_cpus_allowed_ptr() is simply > unnecessary. > > Perhaps we will change this later, so that userspace can control > the affinity of the usermode helper tasks, but this is yet another > reason to remove this set_cpus_allowed_ptr(). > > To some degree this also applies to set_user_nice(), but this > patch only updates the comment. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> I'm adding Christophe in Cc because he is interested in tweaking the affinity of call_usermodehelper for cpu isolation. This welcome cleanup confirms that we want to take the direction of being able to change the affinity of workqueue themselves instead of just call_usermodehelper() alone. > --- > kernel/kmod.c | 8 ++------ > 1 files changed, 2 insertions(+), 6 deletions(-) > > diff --git a/kernel/kmod.c b/kernel/kmod.c > index b086006..2fe4544 100644 > --- a/kernel/kmod.c > +++ b/kernel/kmod.c > @@ -208,13 +208,9 @@ static int ____call_usermodehelper(void *data) > spin_lock_irq(¤t->sighand->siglock); > flush_signal_handlers(current, 1); > spin_unlock_irq(¤t->sighand->siglock); > - > - /* We can run anywhere, unlike our parent keventd(). */ > - set_cpus_allowed_ptr(current, cpu_all_mask); > - > /* > - * Our parent is keventd, which runs with elevated scheduling priority. > - * Avoid propagating that into the userspace child. > + * Our parent is a workqueue thread, which can run with elevated > + * scheduling priority. Avoid propagating that into the userspace. > */ > set_user_nice(current, 0); > > -- > 1.5.5.1 > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 14:21 ` Frederic Weisbecker @ 2013-12-05 14:37 ` Oleg Nesterov 2013-12-05 14:39 ` Tejun Heo 2013-12-05 14:42 ` Frederic Weisbecker 0 siblings, 2 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-12-05 14:37 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Peter Zijlstra, Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 12/05, Frederic Weisbecker wrote: > > 2013/11/28 Oleg Nesterov <oleg@redhat.com>: > > ____call_usermodehelper() does set_cpus_allowed_ptr(cpu_all_mask), > > this (and the comment) is misleading. We no longer have keventd_wq, > > and kmod.c switched to khelper_wq a long ago. > > > > And more importantly, "unlike our parent" is no longer true too, > > this thread was created by WQ_UNBOUND worker thread which has the > > full ->cpus_allowed mask, so this set_cpus_allowed_ptr() is simply > > unnecessary. > > > > Perhaps we will change this later, so that userspace can control > > the affinity of the usermode helper tasks, but this is yet another > > reason to remove this set_cpus_allowed_ptr(). > > > > To some degree this also applies to set_user_nice(), but this > > patch only updates the comment. > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > I'm adding Christophe in Cc because he is interested in tweaking the > affinity of call_usermodehelper for cpu isolation. This welcome > cleanup confirms that we want to take the direction of being able to > change the affinity of workqueue themselves instead of just > call_usermodehelper() alone. OK, but I'd like to remind just in case, as Tejun pointed out this patch is wrong ;) And "change the affinity of workqueue themselves" is not simple, but we can make khelper_wq WQ_SYSFS. > > kernel/kmod.c | 8 ++------ > > 1 files changed, 2 insertions(+), 6 deletions(-) > > > > diff --git a/kernel/kmod.c b/kernel/kmod.c > > index b086006..2fe4544 100644 > > --- a/kernel/kmod.c > > +++ b/kernel/kmod.c > > @@ -208,13 +208,9 @@ static int ____call_usermodehelper(void *data) > > spin_lock_irq(¤t->sighand->siglock); > > flush_signal_handlers(current, 1); > > spin_unlock_irq(¤t->sighand->siglock); > > - > > - /* We can run anywhere, unlike our parent keventd(). */ > > - set_cpus_allowed_ptr(current, cpu_all_mask); > > - > > /* > > - * Our parent is keventd, which runs with elevated scheduling priority. > > - * Avoid propagating that into the userspace child. > > + * Our parent is a workqueue thread, which can run with elevated > > + * scheduling priority. Avoid propagating that into the userspace. > > */ > > set_user_nice(current, 0); > > > > -- > > 1.5.5.1 > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 14:37 ` Oleg Nesterov @ 2013-12-05 14:39 ` Tejun Heo 2013-12-05 15:30 ` Oleg Nesterov 2013-12-05 16:26 ` Frederic Weisbecker 2013-12-05 14:42 ` Frederic Weisbecker 1 sibling, 2 replies; 69+ messages in thread From: Tejun Heo @ 2013-12-05 14:39 UTC (permalink / raw) To: Oleg Nesterov Cc: Frederic Weisbecker, Christoph Lameter, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hello, On Thu, Dec 05, 2013 at 03:37:45PM +0100, Oleg Nesterov wrote: > > I'm adding Christophe in Cc because he is interested in tweaking the > > affinity of call_usermodehelper for cpu isolation. This welcome > > cleanup confirms that we want to take the direction of being able to > > change the affinity of workqueue themselves instead of just > > call_usermodehelper() alone. > > OK, but I'd like to remind just in case, as Tejun pointed out this > patch is wrong ;) > > And "change the affinity of workqueue themselves" is not simple, but > we can make khelper_wq WQ_SYSFS. Maybe workqueue should implement and expose default attributes which are inherited by all workqueues unless they're explicitly overridden? The use case here is not really about isolating certain subgroup of workers but rather being able to control the default behavior, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 14:39 ` Tejun Heo @ 2013-12-05 15:30 ` Oleg Nesterov 2013-12-05 19:13 ` Christoph Lameter 2013-12-05 16:26 ` Frederic Weisbecker 1 sibling, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-12-05 15:30 UTC (permalink / raw) To: Tejun Heo Cc: Frederic Weisbecker, Christoph Lameter, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hi, On 12/05, Tejun Heo wrote: > Hello, > > On Thu, Dec 05, 2013 at 03:37:45PM +0100, Oleg Nesterov wrote: > > > > OK, but I'd like to remind just in case, as Tejun pointed out this > > patch is wrong ;) > > > > And "change the affinity of workqueue themselves" is not simple, but > > we can make khelper_wq WQ_SYSFS. > > Maybe workqueue should implement and expose default attributes which > are inherited by all workqueues unless they're explicitly overridden? > The use case here is not really about isolating certain subgroup of > workers but rather being able to control the default behavior, right? Perhaps, I simply do not know what the users want. Either way, if khelper_wq will be controlled by sysfs, then this set_cpus_allowed_ptr() should probably die. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 15:30 ` Oleg Nesterov @ 2013-12-05 19:13 ` Christoph Lameter 2013-12-06 14:53 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Christoph Lameter @ 2013-12-05 19:13 UTC (permalink / raw) To: Oleg Nesterov Cc: Tejun Heo, Frederic Weisbecker, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, 5 Dec 2013, Oleg Nesterov wrote: > Perhaps, I simply do not know what the users want. We would like to control OS spawning of threads and restrict them to a limited set of cpus so that the other processors can do latency sensitive work without being impacted by creation of kernel threads etc. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 19:13 ` Christoph Lameter @ 2013-12-06 14:53 ` Oleg Nesterov 2013-12-06 15:37 ` Christoph Lameter 0 siblings, 1 reply; 69+ messages in thread From: Oleg Nesterov @ 2013-12-06 14:53 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Frederic Weisbecker, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 12/05, Christoph Lameter wrote: > > On Thu, 5 Dec 2013, Oleg Nesterov wrote: > > > Perhaps, I simply do not know what the users want. > > We would like to control OS spawning of threads and restrict them to a > limited set of cpus so that the other processors can do latency sensitive > work without being impacted by creation of kernel threads etc. This probably means that Tejun's "default attributes which are inherited by all workqueues" suggestion was right. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-06 14:53 ` Oleg Nesterov @ 2013-12-06 15:37 ` Christoph Lameter 2013-12-06 15:56 ` Oleg Nesterov 0 siblings, 1 reply; 69+ messages in thread From: Christoph Lameter @ 2013-12-06 15:37 UTC (permalink / raw) To: Oleg Nesterov Cc: Tejun Heo, Frederic Weisbecker, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Fri, 6 Dec 2013, Oleg Nesterov wrote: > This probably means that Tejun's "default attributes which are inherited > by all workqueues" suggestion was right. There are workqueues that are required to run on specific nodes / processors. Those cannot inherit these attributes and would need to be used regardless of the attributes if the kernel code demands it. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-06 15:37 ` Christoph Lameter @ 2013-12-06 15:56 ` Oleg Nesterov 0 siblings, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-12-06 15:56 UTC (permalink / raw) To: Christoph Lameter Cc: Tejun Heo, Frederic Weisbecker, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 12/06, Christoph Lameter wrote: > > On Fri, 6 Dec 2013, Oleg Nesterov wrote: > > > This probably means that Tejun's "default attributes which are inherited > > by all workqueues" suggestion was right. > > There are workqueues that are required to run on specific nodes / > processors. Those cannot inherit these attributes and would need to be > used regardless of the attributes if the kernel code demands it. Yes, yes, Tejun actually said "inherited by all workqueues unless they're explicitly overridden", I stripped his words too much. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 14:39 ` Tejun Heo 2013-12-05 15:30 ` Oleg Nesterov @ 2013-12-05 16:26 ` Frederic Weisbecker 1 sibling, 0 replies; 69+ messages in thread From: Frederic Weisbecker @ 2013-12-05 16:26 UTC (permalink / raw) To: Tejun Heo Cc: Oleg Nesterov, Christoph Lameter, Peter Zijlstra, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Dec 05, 2013 at 09:39:03AM -0500, Tejun Heo wrote: > Hello, > > On Thu, Dec 05, 2013 at 03:37:45PM +0100, Oleg Nesterov wrote: > > > I'm adding Christophe in Cc because he is interested in tweaking the > > > affinity of call_usermodehelper for cpu isolation. This welcome > > > cleanup confirms that we want to take the direction of being able to > > > change the affinity of workqueue themselves instead of just > > > call_usermodehelper() alone. > > > > OK, but I'd like to remind just in case, as Tejun pointed out this > > patch is wrong ;) > > > > And "change the affinity of workqueue themselves" is not simple, but > > we can make khelper_wq WQ_SYSFS. > > Maybe workqueue should implement and expose default attributes which > are inherited by all workqueues unless they're explicitly overridden? > The use case here is not really about isolating certain subgroup of > workers but rather being able to control the default behavior, right? So the need I know of is to perform CPU isolation on a given set of CPUs. Their are two known usecases: 1) Optimize the use of the CPU and avoid being disturbed with interrupts, other tasks, etc... These disturbances trash the cache and steal cputime. We want to maximize the use of the CPU. So that's the HPC case. 2) Guarantee that the CPU won't be disturbed while some high priority tasks are running. So we have a critical task that has something to do under a given time deadline or some bounded latency. That's the real time case. Case 1) can deal with a few rare noise, but case 2) can't. In both cases I think that people are happy with having one CPU that handles all the housekeeping: handle timekeeping, unbound timers, workqueues, RCU callbacks and random stats maintainance. But this housekeeping could possibly be divided in the future with some per NUMA node affine housekeepers, who knows? So I think it may be nice to have some flexibility in how we affine these workqueues, hence a user-driven action taken through exisiting interfaces like sysfs sounds like a good candidate. > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() 2013-12-05 14:37 ` Oleg Nesterov 2013-12-05 14:39 ` Tejun Heo @ 2013-12-05 14:42 ` Frederic Weisbecker 1 sibling, 0 replies; 69+ messages in thread From: Frederic Weisbecker @ 2013-12-05 14:42 UTC (permalink / raw) To: Oleg Nesterov Cc: Christoph Lameter, Peter Zijlstra, Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Dec 05, 2013 at 03:37:45PM +0100, Oleg Nesterov wrote: > On 12/05, Frederic Weisbecker wrote: > > > > 2013/11/28 Oleg Nesterov <oleg@redhat.com>: > > > ____call_usermodehelper() does set_cpus_allowed_ptr(cpu_all_mask), > > > this (and the comment) is misleading. We no longer have keventd_wq, > > > and kmod.c switched to khelper_wq a long ago. > > > > > > And more importantly, "unlike our parent" is no longer true too, > > > this thread was created by WQ_UNBOUND worker thread which has the > > > full ->cpus_allowed mask, so this set_cpus_allowed_ptr() is simply > > > unnecessary. > > > > > > Perhaps we will change this later, so that userspace can control > > > the affinity of the usermode helper tasks, but this is yet another > > > reason to remove this set_cpus_allowed_ptr(). > > > > > > To some degree this also applies to set_user_nice(), but this > > > patch only updates the comment. > > > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > > > I'm adding Christophe in Cc because he is interested in tweaking the > > affinity of call_usermodehelper for cpu isolation. This welcome > > cleanup confirms that we want to take the direction of being able to > > change the affinity of workqueue themselves instead of just > > call_usermodehelper() alone. > > OK, but I'd like to remind just in case, as Tejun pointed out this > patch is wrong ;) Ok, I see that's because unboud workqueues are actually bound to NUMA nodes. Doesn't that eventually work for call_usermodehelper() as well? Or may be there are some reasons for it to keep a global affinity? > > And "change the affinity of workqueue themselves" is not simple, but > we can make khelper_wq WQ_SYSFS. Right it can be either user driven like through sysfs or some adaptive decision taken by the kernel on top of full dynticks CPUs, etc... But the sysfs interface based solution looks better from a quick POV. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:31 ` Oleg Nesterov 2013-11-28 13:39 ` Peter Zijlstra @ 2013-11-28 13:41 ` Peter Zijlstra 2013-11-28 14:05 ` Tejun Heo 2013-11-28 14:34 ` Oleg Nesterov 1 sibling, 2 replies; 69+ messages in thread From: Peter Zijlstra @ 2013-11-28 13:41 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On Thu, Nov 28, 2013 at 02:31:52PM +0100, Oleg Nesterov wrote: > I _guess_ usermodehelper_init() should use WQ_SYSFS then, and in this case > the user can write to wq_cpumask_store somewhere in /sys/. WTF is that and why are we creating alternative affinity interfaces when sched_setaffinity() is a prefectly fine one? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:41 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child Peter Zijlstra @ 2013-11-28 14:05 ` Tejun Heo 2013-11-28 14:34 ` Oleg Nesterov 1 sibling, 0 replies; 69+ messages in thread From: Tejun Heo @ 2013-11-28 14:05 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Oleg Nesterov, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar Hello, Peter. On Thu, Nov 28, 2013 at 02:41:24PM +0100, Peter Zijlstra wrote: > On Thu, Nov 28, 2013 at 02:31:52PM +0100, Oleg Nesterov wrote: > > I _guess_ usermodehelper_init() should use WQ_SYSFS then, and in this case > > the user can write to wq_cpumask_store somewhere in /sys/. > > WTF is that and why are we creating alternative affinity interfaces when > sched_setaffinity() is a prefectly fine one? Hmmm? Because workqueue is a shared worker pool implementation. The attributes are per-workqueue and if two workqueues have a single set of attributes, they share the same execution resources. There's no one-to-one relationship between a worker thread and a workqueue; otherwise, we end up with a ton of tasks sitting around doing nothing, so you can't individually do sched_setaffinity() on a task and expect it to work. The singlethread name just exists for compatibility and there's new interface named alloc_ordered_workqueue() and it creates a workqueue which execute work items one-by-one in issue order. For singlethread / ordered workqueues with rescuer, it could work to always execute that work item on the rescuer as we're reserving an execution resource anyway, but that'd unnecessarily increase cache footprint from actively using more tasks than necessary and also increase code complexity. Thanks. -- tejun ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child 2013-11-28 13:41 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child Peter Zijlstra 2013-11-28 14:05 ` Tejun Heo @ 2013-11-28 14:34 ` Oleg Nesterov 1 sibling, 0 replies; 69+ messages in thread From: Oleg Nesterov @ 2013-11-28 14:34 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Tejun Heo, zhang.yi20, lkml, Tetsuo Handa, Ingo Molnar On 11/28, Peter Zijlstra wrote: > > On Thu, Nov 28, 2013 at 02:31:52PM +0100, Oleg Nesterov wrote: > > I _guess_ usermodehelper_init() should use WQ_SYSFS then, and in this case > > the user can write to wq_cpumask_store somewhere in /sys/. > > WTF is that and why are we creating alternative affinity interfaces when > sched_setaffinity() is a prefectly fine one? Because there is no a simple workqueue/thread connection, I guess. And I do not understand why do you dislike this. For example. Please note that with the new design we can even kill khelper_wq and the ugly kmod_thread_locker hack (just in case, I am not saying that the patch which added kmod_thread_locker was ugly ;). We can just use one of the system_ WQ_UNBOUND workqueues which has the "large enough" max_active. Oleg. ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child @ 2013-11-29 9:19 zhang.yi20 0 siblings, 0 replies; 69+ messages in thread From: zhang.yi20 @ 2013-11-29 9:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, lkml, Ingo Molnar, Oleg Nesterov, Tetsuo Handa, pang.xunlei Khelper is a workqueue thread which has PF_NO_SETAFFINITY set. When a usermode process is spawned by khelper through __call_usermodehelper() which will invoke kernel_thread() (not kthread_create()) to fork the new process, we will clearly see that it will inherit the PF_NO_SETAFFINITY flag. Consequently, this usermod process will lose some capabilities like sched_setaffinity(), moving between cgroups, etc. Clearing this flag in flush_old_exec() to solve this problem. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Signed-off-by: Pang Xunlei <pang.xunlei@zte.com.cn> --- linux-3.12.old/fs/exec.c 2013-11-26 08:53:12.175811856 +0000 +++ linux-3.12/fs/exec.c 2013-11-27 09:36:56.231972168 +0000 @@ -1090,8 +1090,8 @@ int flush_old_exec(struct linux_binprm * bprm->mm = NULL; /* We're using it now */ set_fs(USER_DS); - current->flags &= - ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE); + current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | + PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); current->personality &= ~bprm->per_clear; ^ permalink raw reply [flat|nested] 69+ messages in thread
end of thread, other threads:[~2013-12-06 15:55 UTC | newest] Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-11-14 1:51 kmod: avoid propagating PF_NO_SETAFFINITY into userspace child zhang.yi20 2013-11-14 5:23 ` Tejun Heo 2013-11-14 11:40 ` Oleg Nesterov 2013-11-14 11:55 ` [PATCH 0/1]: (Was: kmod: avoid propagating PF_NO_SETAFFINITY into userspace child) Oleg Nesterov 2013-11-14 11:56 ` [PATCH 1/1] workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY Oleg Nesterov 2013-11-22 23:13 ` Tejun Heo [not found] ` <OF36E72FA9.51146BE3-ON48257C2E.0008BC6D-48257C2E.0008FF9C@zte.com.cn> 2013-11-25 12:14 ` 答复: " Oleg Nesterov 2013-11-26 2:10 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child zhang.yi20 2013-11-26 18:04 ` Oleg Nesterov 2013-11-27 2:07 ` zhang.yi20 2013-11-27 13:20 ` Oleg Nesterov 2013-11-27 18:31 ` Tejun Heo 2013-11-28 9:13 ` Peter Zijlstra 2013-11-28 11:45 ` Oleg Nesterov 2013-11-28 12:17 ` Peter Zijlstra 2013-11-28 13:31 ` Oleg Nesterov 2013-11-28 13:39 ` Peter Zijlstra 2013-11-28 14:13 ` Tejun Heo 2013-11-28 14:31 ` Peter Zijlstra 2013-11-28 14:38 ` Tejun Heo 2013-11-28 14:47 ` Peter Zijlstra 2013-11-28 14:51 ` Tejun Heo 2013-11-28 14:55 ` Peter Zijlstra 2013-11-28 14:57 ` Tejun Heo 2013-11-28 14:59 ` Peter Zijlstra 2013-11-28 15:07 ` Tejun Heo 2013-11-28 15:17 ` Peter Zijlstra 2013-11-28 15:39 ` Tejun Heo 2013-11-28 16:33 ` Peter Zijlstra 2013-11-29 14:33 ` Tejun Heo 2013-11-28 15:47 ` Oleg Nesterov 2013-11-28 16:07 ` Oleg Nesterov 2013-11-28 14:43 ` Peter Zijlstra 2013-11-28 14:53 ` Tejun Heo 2013-11-28 14:57 ` Peter Zijlstra 2013-11-28 15:02 ` Tejun Heo 2013-11-28 15:07 ` Peter Zijlstra 2013-11-28 15:10 ` Tejun Heo 2013-11-28 15:18 ` Peter Zijlstra 2013-11-28 14:38 ` Peter Zijlstra 2013-11-28 14:45 ` Tejun Heo 2013-11-28 14:53 ` Peter Zijlstra 2013-11-28 15:34 ` Oleg Nesterov 2013-11-28 15:40 ` Peter Zijlstra 2013-11-28 16:20 ` Oleg Nesterov 2013-11-28 16:38 ` Peter Zijlstra 2013-11-28 18:13 ` Oleg Nesterov 2013-11-28 14:23 ` Oleg Nesterov 2013-11-28 14:31 ` Tejun Heo 2013-11-28 15:00 ` Oleg Nesterov 2013-11-28 15:02 ` Peter Zijlstra 2013-11-28 19:33 ` [PATCH 0/1] usermodehelper: kill ____call_usermodehelper()->set_cpus_allowed_ptr() Oleg Nesterov 2013-11-28 19:33 ` [PATCH 1/1] " Oleg Nesterov 2013-11-29 13:44 ` Tejun Heo 2013-11-29 16:49 ` Oleg Nesterov 2013-12-05 14:21 ` Frederic Weisbecker 2013-12-05 14:37 ` Oleg Nesterov 2013-12-05 14:39 ` Tejun Heo 2013-12-05 15:30 ` Oleg Nesterov 2013-12-05 19:13 ` Christoph Lameter 2013-12-06 14:53 ` Oleg Nesterov 2013-12-06 15:37 ` Christoph Lameter 2013-12-06 15:56 ` Oleg Nesterov 2013-12-05 16:26 ` Frederic Weisbecker 2013-12-05 14:42 ` Frederic Weisbecker 2013-11-28 13:41 ` [PATCH]: exec: avoid propagating PF_NO_SETAFFINITY into userspace child Peter Zijlstra 2013-11-28 14:05 ` Tejun Heo 2013-11-28 14:34 ` Oleg Nesterov 2013-11-29 9:19 zhang.yi20
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).