From: Juri Lelli <juri.lelli@redhat.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, luca.abeni@santannapisa.it,
	claudio@evidence.eu.com, tommaso.cucinotta@santannapisa.it,
	bristot@redhat.com, mathieu.poirier@linaro.org,
	lizefan@huawei.com, cgroups@vger.kernel.org
Subject: Re: [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler()
Date: Fri, 15 Jun 2018 09:01:32 +0200
Message-ID: <20180615070132.GA30871@localhost.localdomain>
In-Reply-To: <20180614161142.69f186a6@gandalf.local.home>

On 14/06/18 16:11, Steven Rostedt wrote:
> On Wed, 13 Jun 2018 14:17:10 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > +/**
> > + * cpuset_lock - Grab the cpuset_mutex from another subsystem
> > + */
> > +int cpuset_lock(void)
> > +{
> > +	return mutex_trylock(&cpuset_mutex);
> > +}
> > +
> > +/**
> > + * cpuset_unlock - Release the cpuset_mutex from another subsystem
> > + */
> > +void cpuset_unlock(void)
> > +{
> > +	mutex_unlock(&cpuset_mutex);
> > +}
> > +
> >  /**
> >   * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
> >   * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ca788f74259d..a5b0c6c25b44 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4218,6 +4218,14 @@ static int __sched_setscheduler(struct task_struct *p,
> >  		if (attr->sched_flags & SCHED_FLAG_SUGOV)
> >  			return -EINVAL;
> >  
> > +		/*
> > +		 * Make sure we don't race with the cpuset subsystem where root
> > +		 * domains can be rebuilt or modified while operations like DL
> > +		 * admission checks are carried out.
> > +		 */
> > +		if (!cpuset_lock())
> > +			return -EBUSY;
> > +
> 
> How important is this for being a trylock? Reason I am asking is that
> it is making my deadline test fail to get the resources. What my test
> does is to create a bunch of threads, and they try to get the deadline
> resources as they are created. This happens in parallel. Thus, when one
> gets the cpuset_lock, the others are hitting this and returning -EBUSY.
> At first I thought I was over committing somehow but after adding a
> trace_printk() in cpuset_lock(), and this fail case it became obvious
> to what was happening:
> 
>    deadline_test-1376  [004]   107.179693: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017fceeed0, flags: 0x00000000
>    deadline_test-1376  [004]   107.179697: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179763: sys_exit_futex:       0x0
>    deadline_test-1377  [003]   107.179767: sys_enter_sched_setattr: pid: 0x00000000, uattr: 0x7f017f4eded0, flags: 0x00000000
>    deadline_test-1377  [003]   107.179771: bputs:                cpuset_lock: cpuset_mutex trylock
>    deadline_test-1377  [003]   107.179771: bputs:                __sched_setscheduler: cpuset_lock
>    deadline_test-1377  [003]   107.179773: sys_exit_sched_setattr: 0xfffffffffffffff0

Right, not nice.

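(To illustrate the pattern for anyone following along: the failing case is
essentially a handful of threads calling sched_setattr() for SCHED_DEADLINE
in parallel. A rough userspace sketch of that pattern, not Steven's actual
test, could look like the below; build with "gcc -pthread", and note that
SYS_sched_setattr needs reasonably recent kernel headers.)

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

/* No glibc wrapper for sched_setattr(), so spell out the attr layout here. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static void *worker(void *arg)
{
	struct sched_attr attr;

	(void)arg;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  =  10 * 1000 * 1000;	/*  10ms */
	attr.sched_deadline = 100 * 1000 * 1000;	/* 100ms */
	attr.sched_period   = 100 * 1000 * 1000;	/* 100ms */

	/* With the trylock above, concurrent callers can fail with EBUSY. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return NULL;
}

int main(void)
{
	pthread_t t[8];
	int i;

	for (i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(t[i], NULL);

	return 0;
}
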
So, even if we can rework sched_setattr() and do_sched_setscheduler() to
end up calling __sched_setscheduler() outside of RCU read-side critical
sections (as Peter suggested; a rough sketch of the idea is below the
splat), it seems we might still get into trouble, e.g.:

[   39.890366] ======================================================
[   39.892741] WARNING: possible circular locking dependency detected
[   39.894723] 4.17.0-rc6-before+ #169 Not tainted
[   39.896380] ------------------------------------------------------
[   39.898369] bash/2206 is trying to acquire lock:
[   39.899862] 0000000083b596bc (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x29/0x660
[   39.902931]
[   39.902931] but task is already holding lock:
[   39.904790] 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.907263]
[   39.907263] which lock already depends on the new lock.
[   39.907263]
[   39.909936]
[   39.909936] the existing dependency chain (in reverse order) is:
[   39.912479]
[   39.912479] -> #2 (cpuset_mutex){+.+.}:
[   39.914249]        __sched_setscheduler+0xc6/0x880
[   39.915779]        _sched_setscheduler+0x77/0xa0
[   39.917270]        __kthread_create_on_node+0x122/0x1a0
[   39.918948]        kthread_create_on_node+0x46/0x70
[   39.920512]        init_rescuer.part.8+0x49/0xa0
[   39.921688]        workqueue_init+0x1ec/0x2d5
[   39.922659]        kernel_init_freeable+0x23b/0x5b9
[   39.923756]        kernel_init+0xa/0x110
[   39.924656]        ret_from_fork+0x3a/0x50
[   39.925570]
[   39.925570] -> #1 (wq_pool_mutex){+.+.}:
[   39.926591]        apply_workqueue_attrs+0x20/0x50
[   39.927554]        __alloc_workqueue_key+0x1c7/0x490
[   39.928700]        workqueue_init_early+0x257/0x33f
[   39.929685]        start_kernel+0x2fe/0x531
[   39.930530]        secondary_startup_64+0xa5/0xb0
[   39.931336]
[   39.931336] -> #0 (cpu_hotplug_lock.rw_sem){++++}:
[   39.932195]        cpus_read_lock+0x39/0xa0
[   39.932772]        rebuild_sched_domains_locked+0x29/0x660
[   39.933513]        update_flag+0x1fd/0x210
[   39.934075]        cpuset_write_u64+0xe7/0x140
[   39.934708]        cgroup_file_write+0x174/0x230
[   39.935350]        kernfs_fop_write+0x113/0x1a0
[   39.935991]        __vfs_write+0x36/0x180
[   39.936473]        vfs_write+0xc5/0x1b0
[   39.936991]        ksys_write+0x55/0xc0
[   39.937504]        do_syscall_64+0x73/0x4d0
[   39.938061]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.938803]
[   39.938803] other info that might help us debug this:
[   39.938803]
[   39.939708] Chain exists of:
[   39.939708]   cpu_hotplug_lock.rw_sem --> wq_pool_mutex --> cpuset_mutex
[   39.939708]
[   39.940960]  Possible unsafe locking scenario:
[   39.940960]
[   39.941607]        CPU0                    CPU1
[   39.942056]        ----                    ----
[   39.942523]   lock(cpuset_mutex);
[   39.942858]                                lock(wq_pool_mutex);
[   39.943436]                                lock(cpuset_mutex);
[   39.944028]   lock(cpu_hotplug_lock.rw_sem);
[   39.944449]
[   39.944449]  *** DEADLOCK ***
[   39.944449]
[   39.945113] 4 locks held by bash/2206:
[   39.945485]  #0: 000000003cc231db (sb_writers#10){.+.+}, at: vfs_write+0x193/0x1b0
[   39.946246]  #1: 00000000fdc2c059 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[   39.946980]  #2: 00000000e12e3a5d (kn->count#11){.+.+}, at: kernfs_fop_write+0xeb/0x1a0
[   39.947736]  #3: 000000004fc7a567 (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[   39.948463]
[   39.948463] stack backtrace:
[   39.948871] CPU: 0 PID: 2206 Comm: bash Not tainted 4.17.0-rc6-before+ #169
[   39.949489] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   39.950271] Call Trace:
[   39.950502]  dump_stack+0x67/0x9b
[   39.950810]  print_circular_bug.isra.24+0x1c8/0x2b0
[   39.951252]  __lock_acquire+0x1813/0x18e0
[   39.951618]  ? __lock_acquire+0x1219/0x18e0
[   39.952016]  ? lock_acquire+0x8e/0x240
[   39.952357]  lock_acquire+0x8e/0x240
[   39.952683]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953179]  cpus_read_lock+0x39/0xa0
[   39.953516]  ? rebuild_sched_domains_locked+0x29/0x660
[   39.953981]  rebuild_sched_domains_locked+0x29/0x660
[   39.954431]  ? mark_held_locks+0x56/0x80
[   39.954790]  ? _raw_spin_unlock_irq+0x29/0x50
[   39.955186]  update_flag+0x1fd/0x210
[   39.955541]  cpuset_write_u64+0xe7/0x140
[   39.955900]  cgroup_file_write+0x174/0x230
[   39.956274]  kernfs_fop_write+0x113/0x1a0
[   39.956640]  __vfs_write+0x36/0x180
[   39.956963]  ? rcu_read_lock_sched_held+0x74/0x80
[   39.957482]  ? rcu_sync_lockdep_assert+0x2e/0x60
[   39.957969]  ? __sb_start_write+0x154/0x1f0
[   39.958415]  ? __sb_start_write+0x16a/0x1f0
[   39.958877]  vfs_write+0xc5/0x1b0
[   39.959310]  ksys_write+0x55/0xc0
[   39.959696]  do_syscall_64+0x73/0x4d0
[   39.960054]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   39.960533]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   39.961071] RIP: 0033:0x7f534866d780
[   39.961477] RSP: 002b:00007fff54dfa398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   39.962380] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f534866d780
[   39.963166] RDX: 0000000000000002 RSI: 0000000000d97408 RDI: 0000000000000001
[   39.964008] RBP: 0000000000d97408 R08: 000000000000000a R09: 00007f5348f6d700
[   39.964780] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f53489237a0
[   39.965588] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000

This happens the moment one disables sched_load_balance at the root cpuset
level.
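
FWIW, the rework mentioned above would go along the lines of the below
(untested sketch of do_sched_setscheduler(); sched_setattr() would need the
same treatment): pin the task with a reference and drop the RCU read-side
critical section before doing the actual change, so that cpuset_mutex could
then be taken with a plain, sleeping mutex_lock() instead of the trylock.

	/* pid, policy and lparam as in today's do_sched_setscheduler() */
	struct task_struct *p;
	int retval;

	rcu_read_lock();
	retval = -ESRCH;
	p = find_process_by_pid(pid);
	if (likely(p))
		get_task_struct(p);	/* keep the task alive after unlock */
	rcu_read_unlock();

	if (likely(p)) {
		/* Not in an RCU read-side section anymore, sleeping is OK. */
		retval = sched_setscheduler(p, policy, &lparam);
		put_task_struct(p);
	}

	return retval;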

I'll try harder to find alternatives, but suggestions are welcome! :-)

Best,

- Juri

Thread overview: 23+ messages
2018-06-13 12:17 [PATCH v4 0/5] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2018-06-13 12:17 ` [PATCH v4 1/5] sched/topology: Add check to backup comment about hotplug lock Juri Lelli
2018-06-14 13:33   ` Steven Rostedt
2018-06-14 13:42     ` Juri Lelli
2018-06-14 13:47       ` Steven Rostedt
2018-06-14 13:50         ` Juri Lelli
2018-06-14 13:58           ` Quentin Perret
2018-06-14 14:11             ` Juri Lelli
2018-06-14 14:18               ` Quentin Perret
2018-06-14 14:30                 ` Juri Lelli
2018-06-15  8:39                   ` Quentin Perret
2018-06-13 12:17 ` [PATCH v4 2/5] sched/topology: Adding function partition_sched_domains_locked() Juri Lelli
2018-06-14 13:35   ` Steven Rostedt
2018-06-14 13:47     ` Juri Lelli
2018-06-13 12:17 ` [PATCH v4 3/5] sched/core: Streamlining calls to task_rq_unlock() Juri Lelli
2018-06-14 13:42   ` Steven Rostedt
2018-06-13 12:17 ` [PATCH v4 4/5] sched/core: Prevent race condition between cpuset and __sched_setscheduler() Juri Lelli
2018-06-14 13:45   ` Steven Rostedt
2018-06-14 13:51     ` Juri Lelli
2018-06-14 20:11   ` Steven Rostedt
2018-06-15  7:01     ` Juri Lelli [this message]
2018-06-15 13:07       ` Juri Lelli
2018-06-13 12:17 ` [PATCH v4 5/5] cpuset: Rebuild root domain deadline accounting information Juri Lelli
