linux-kernel.vger.kernel.org archive mirror
* [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
@ 2012-10-25  9:41 Jun'ichi Nomura
  2012-10-26  1:42 ` Jun'ichi Nomura
  2012-10-26 20:21 ` Vivek Goyal
  0 siblings, 2 replies; 14+ messages in thread
From: Jun'ichi Nomura @ 2012-10-25  9:41 UTC (permalink / raw)
  To: linux-kernel, device-mapper development
  Cc: Tejun Heo, Vivek Goyal, Jens Axboe, Alasdair G Kergon

[PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
add_disk() eventually calls blk_queue_bypass_end().
This change invokes the following warning when multipath is used.

  BUG: scheduling while atomic: multipath/2460/0x00000002
  1 lock held by multipath/2460:
   #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
  Modules linked in: ...
  Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
  Call Trace:
   [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
   [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
   [<ffffffff814291e6>] schedule+0x64/0x66
   [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
   [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
   [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
   [<ffffffff814289e3>] wait_for_common+0x9d/0xee
   [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
   [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
   [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
   [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
   [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
   [<ffffffff8106fb18>] ? complete+0x21/0x53
   [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
   [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
   [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
   [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
   [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
   [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
   [<ffffffff811d8c41>] elevator_init+0xe1/0x115
   [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
   [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
   [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
   [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
   [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
   [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
   [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
   [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
   [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
   [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
   [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
   [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

The warning means that, during queue initialization, blk_queue_bypass_start()
calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.

dm device initialization basically includes the following 3 steps:
  1. create ioctl: allocates the queue and calls add_disk()
  2. table load ioctl: determines the device type and initializes the
     queue if it is request-based
  3. resume ioctl: the device becomes functional

So it is better to have dm's queue stay in bypass mode until the
initialization completes in the table load ioctl.
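
As an illustration of the lifecycle above, the bypass accounting with this
patch would look roughly like this (a sketch based on the description above;
the depth values are assumptions, not something measured):

  create ioctl:
    blk_alloc_queue()              queue starts its life bypassed (depth 1)
    add_disk()
      blk_queue_bypass_end()       depth 0 (moved here by 749fefe677)
    blk_queue_bypass_start()       depth 1 again; this is an initial,
                                   draining call, hence the extra cost
                                   measured below

  table load ioctl (request-based):
    dm_setup_md_queue()
      blk_init_allocated_queue()
        elevator_init()
          blkcg_activate_policy()
            blk_queue_bypass_start()   depth 2, nested, no draining
            blk_queue_bypass_end()     depth 1
      blk_queue_bypass_end()           depth 0, queue leaves bypass mode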

The effect of additional blk_queue_bypass_start():

  3.7-rc2 (plain)
    # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
                                    dmsetup remove a; done

    real  0m15.434s
    user  0m0.423s
    sys   0m7.052s

  3.7-rc2 (with this patch)
    # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
                                    dmsetup remove a; done
    real  0m19.766s
    user  0m0.442s
    sys   0m6.861s

If this additional cost is not negligible, we need a variant of add_disk()
that does not end bypassing.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair G Kergon <agk@redhat.com>

---
 drivers/md/dm.c        |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 02db918..ad02761 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1869,6 +1869,8 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
 	add_disk(md->disk);
+	/* Until md type is determined, put the queue in bypass mode */
+	blk_queue_bypass_start(md->queue);
 	format_dev_t(md->name, MKDEV(_major, minor));
 
 	md->wq = alloc_workqueue("kdmflush",
@@ -2172,6 +2174,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 		return 1;
 
 	/* Fully initialize the queue */
+	WARN_ON(!blk_queue_bypass(md->queue));
 	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
 	if (!q)
 		return 0;
@@ -2198,6 +2201,7 @@ int dm_setup_md_queue(struct mapped_device *md)
 		return -EINVAL;
 	}
 
+	blk_queue_bypass_end(md->queue);
 	return 0;
 }


* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-25  9:41 [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized Jun'ichi Nomura
@ 2012-10-26  1:42 ` Jun'ichi Nomura
  2012-10-26 20:21 ` Vivek Goyal
  1 sibling, 0 replies; 14+ messages in thread
From: Jun'ichi Nomura @ 2012-10-26  1:42 UTC (permalink / raw)
  To: linux-kernel, device-mapper development
  Cc: Tejun Heo, Vivek Goyal, Jens Axboe, Alasdair G Kergon

On 10/25/12 18:41, Jun'ichi Nomura wrote:
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
...
> The warning means that, during queue initialization, blk_queue_bypass_start()
> calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.
> 
> dm device initialization basically includes the following 3 steps:
>   1. create ioctl: allocates the queue and calls add_disk()
>   2. table load ioctl: determines the device type and initializes the
>      queue if it is request-based
>   3. resume ioctl: the device becomes functional
>
> So it is better to have dm's queue stay in bypass mode until the
> initialization completes in the table load ioctl.
> 
> The effect of additional blk_queue_bypass_start():
> 
>   3.7-rc2 (plain)
>     # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>                                     dmsetup remove a; done
> 
>     real  0m15.434s
>     user  0m0.423s
>     sys   0m7.052s
> 
>   3.7-rc2 (with this patch)
>     # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
>                                     dmsetup remove a; done
>     real  0m19.766s
>     user  0m0.442s
>     sys   0m6.861s
> 
> If this additional cost is not negligible, we need a variant of add_disk()
> that does not end bypassing.

Or call blk_queue_bypass_start() before add_disk():

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ad02761..d14639b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1868,9 +1868,9 @@ static struct mapped_device *alloc_dev(int minor)
 	md->disk->queue = md->queue;
 	md->disk->private_data = md;
 	sprintf(md->disk->disk_name, "dm-%d", minor);
-	add_disk(md->disk);
 	/* Until md type is determined, put the queue in bypass mode */
 	blk_queue_bypass_start(md->queue);
+	add_disk(md->disk);
 	format_dev_t(md->name, MKDEV(_major, minor));
 
 	md->wq = alloc_workqueue("kdmflush",
---

If the patch is modified as above, we can fix the issue
without incurring the additional cost on dm device creation.

# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
                                dmsetup remove a; done

real	0m15.684s
user	0m0.404s
sys	0m7.181s
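
The reason the extra cost disappears with this ordering (illustrative;
the depth values are assumptions):

  create ioctl, with the reordered patch:
    blk_alloc_queue()            queue starts its life bypassed (depth 1)
    blk_queue_bypass_start()     depth 2: a nested call, so no draining
                                 and no synchronize_rcu()
    add_disk()
      blk_queue_bypass_end()     depth 1, still bypassed until table load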

-- 
Jun'ichi Nomura, NEC Corporation



* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-25  9:41 [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized Jun'ichi Nomura
  2012-10-26  1:42 ` Jun'ichi Nomura
@ 2012-10-26 20:21 ` Vivek Goyal
  2012-10-29 10:15   ` Jun'ichi Nomura
  1 sibling, 1 reply; 14+ messages in thread
From: Vivek Goyal @ 2012-10-26 20:21 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: linux-kernel, device-mapper development, Tejun Heo, Jens Axboe,
	Alasdair G Kergon

On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> 
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
> 
>   BUG: scheduling while atomic: multipath/2460/0x00000002
>   1 lock held by multipath/2460:
>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>   Modules linked in: ...
>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>   Call Trace:
>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>    [<ffffffff814291e6>] schedule+0x64/0x66
>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> 
> The warning means that, during queue initialization, blk_queue_bypass_start()
> calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.

md->type_lock is a mutex, isn't it? I thought we are allowed to block
and schedule out under mutex?

add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
So we already have code which can block/wait under md->type_lock. I am
not sure why should we get this warning under a mutex.

Thanks
Vivek


* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-26 20:21 ` Vivek Goyal
@ 2012-10-29 10:15   ` Jun'ichi Nomura
  2012-10-29 16:38     ` Vivek Goyal
  0 siblings, 1 reply; 14+ messages in thread
From: Jun'ichi Nomura @ 2012-10-29 10:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, device-mapper development, Tejun Heo, Jens Axboe,
	Alasdair G Kergon

On 10/27/12 05:21, Vivek Goyal wrote:
> On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
>> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>>
>> With 749fefe677 ("block: lift the initial queue bypass mode on
>> blk_register_queue() instead of blk_init_allocated_queue()"),
>> add_disk() eventually calls blk_queue_bypass_end().
>> This change invokes the following warning when multipath is used.
>>
>>   BUG: scheduling while atomic: multipath/2460/0x00000002
>>   1 lock held by multipath/2460:
>>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>>   Modules linked in: ...
>>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>>   Call Trace:
>>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>>    [<ffffffff814291e6>] schedule+0x64/0x66
>>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
>>
>> The warning means that, during queue initialization, blk_queue_bypass_start()
>> calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.
> 
> md->type_lock is a mutex, isn't it? I thought we are allowed to block
> and schedule out under mutex?

Hm, you are right. It's a mutex.
The warning occurs only if I turned on CONFIG_PREEMPT=y.

> add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> So we already have code which can block/wait under md->type_lock. I am
> not sure why should we get this warning under a mutex.

add_disk() is called without md->type_lock.

Call flow is like this:

dm_create
  alloc_dev
    blk_alloc_queue
    alloc_disk
    add_disk
      blk_queue_bypass_end [with 3.7-rc2]

table_load
  dm_lock_md_type [takes md->type_lock]
  dm_setup_md_queue
    blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
      elevator_init
        blkcg_activate_policy
          blk_queue_bypass_start <-- THIS triggers the warning
          blk_queue_bypass_end
      blk_queue_bypass_end [with 3.6]
  dm_unlock_md_type

With 3.6, the blk_queue_bypass_start() in blkcg_activate_policy() was a
nested call that did nothing.
With 3.7-rc2, it becomes the initial call and does the actual draining.
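
For reference, blk_queue_bypass_start() in this era looks roughly like the
following (a paraphrased sketch, not a verbatim copy of block/blk-core.c):
only the outermost caller, taking bypass_depth from 0 to 1, drains the
queue and calls synchronize_rcu(); nested callers just bump the counter.

void blk_queue_bypass_start(struct request_queue *q)
{
	bool drain;

	spin_lock_irq(q->queue_lock);
	drain = !q->bypass_depth++;	/* true only for the outermost caller */
	queue_flag_set(QUEUE_FLAG_BYPASS, q);
	spin_unlock_irq(q->queue_lock);

	if (drain) {
		blk_drain_queue(q, false);
		/* make sure blk_queue_bypass() is seen by RCU readers */
		synchronize_rcu();
	}
}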

-- 
Jun'ichi Nomura, NEC Corporation


* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-29 10:15   ` Jun'ichi Nomura
@ 2012-10-29 16:38     ` Vivek Goyal
  2012-10-29 16:45       ` Peter Zijlstra
  2012-10-29 16:55       ` [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized Paul E. McKenney
  0 siblings, 2 replies; 14+ messages in thread
From: Vivek Goyal @ 2012-10-29 16:38 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: linux-kernel, device-mapper development, Tejun Heo, Jens Axboe,
	Alasdair G Kergon, Paul E. McKenney, Peter Zijlstra

On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> On 10/27/12 05:21, Vivek Goyal wrote:
> > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> >>
> >> With 749fefe677 ("block: lift the initial queue bypass mode on
> >> blk_register_queue() instead of blk_init_allocated_queue()"),
> >> add_disk() eventually calls blk_queue_bypass_end().
> >> This change invokes the following warning when multipath is used.
> >>
> >>   BUG: scheduling while atomic: multipath/2460/0x00000002
> >>   1 lock held by multipath/2460:
> >>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> >>   Modules linked in: ...
> >>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
> >>   Call Trace:
> >>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
> >>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
> >>    [<ffffffff814291e6>] schedule+0x64/0x66
> >>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
> >>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
> >>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
> >>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
> >>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
> >>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
> >>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
> >>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
> >>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
> >>    [<ffffffff8106fb18>] ? complete+0x21/0x53
> >>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
> >>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
> >>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
> >>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
> >>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
> >>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
> >>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
> >>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
> >>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
> >>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
> >>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
> >>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
> >>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
> >>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
> >>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
> >>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
> >>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
> >>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> >>
> >> The warning means that, during queue initialization, blk_queue_bypass_start()
> >> calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.
> > 
> > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > and schedule out under mutex?
> 
> Hm, you are right. It's a mutex.
> The warning occurs only if I turned on CONFIG_PREEMPT=y.

Ok, so the question is what's wrong with calling synchronize_rcu() inside
a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
peterz.

> 
> > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > So we already have code which can block/wait under md->type_lock. I am
> > not sure why should we get this warning under a mutex.
> 
> add_disk() is called without md->type_lock.
> 
> Call flow is like this:
> 
> dm_create
>   alloc_dev
>     blk_alloc_queue
>     alloc_disk
>     add_disk
>       blk_queue_bypass_end [with 3.7-rc2]
> 
> table_load
>   dm_lock_md_type [takes md->type_lock]
>   dm_setup_md_queue
>     blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
>       elevator_init
>         blkcg_activate_policy
>           blk_queue_bypass_start <-- THIS triggers the warning
>           blk_queue_bypass_end
>       blk_queue_bypass_end [with 3.6]
>   dm_unlock_md_type
> 
> With 3.6, the blk_queue_bypass_start() in blkcg_activate_policy() was a
> nested call that did nothing.
> With 3.7-rc2, it becomes the initial call and does the actual draining.

Ok. Once we know what's wrong, we should be able to figure out the 
right solution. Artificially putting queue one level deep in bypass
to avoid calling synchronize_rcu() sounds bad.

Thanks
Vivek

> 
> -- 
> Jun'ichi Nomura, NEC Corporation


* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-29 16:38     ` Vivek Goyal
@ 2012-10-29 16:45       ` Peter Zijlstra
  2012-10-29 17:13         ` Vivek Goyal
  2012-10-29 16:55       ` [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-10-29 16:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jun'ichi Nomura, linux-kernel, device-mapper development,
	Tejun Heo, Jens Axboe, Alasdair G Kergon, Paul E. McKenney

On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> peterz.

int blkcg_activate_policy(struct request_queue *q,
			  ...)
{
	...

        preloaded = !radix_tree_preload(GFP_KERNEL);

        blk_queue_bypass_start(q);

where:

int radix_tree_preload(gfp_t gfp_mask)
{
        struct radix_tree_preload *rtp;
        struct radix_tree_node *node;
        int ret = -ENOMEM;

        preempt_disable();

Seems obvious why it explodes..
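
In other words (a minimal sketch of the constraint, not the actual blkcg
code): on success radix_tree_preload() returns with preemption disabled,
and it stays disabled until radix_tree_preload_end(), so nothing that may
sleep is allowed in between.

	bool preloaded;

	/* preempt_disable() is left in effect on success */
	preloaded = !radix_tree_preload(GFP_KERNEL);

	/*
	 * Atomic context from here on (if the preload succeeded): a
	 * sleeping call such as synchronize_rcu() inside
	 * blk_queue_bypass_start() triggers "scheduling while atomic"
	 * with CONFIG_PREEMPT=y.
	 */

	if (preloaded)
		radix_tree_preload_end();	/* preempt_enable() */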


* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-29 16:38     ` Vivek Goyal
  2012-10-29 16:45       ` Peter Zijlstra
@ 2012-10-29 16:55       ` Paul E. McKenney
  1 sibling, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2012-10-29 16:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jun'ichi Nomura, linux-kernel, device-mapper development,
	Tejun Heo, Jens Axboe, Alasdair G Kergon, Peter Zijlstra

On Mon, Oct 29, 2012 at 12:38:45PM -0400, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> > On 10/27/12 05:21, Vivek Goyal wrote:
> > > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> > >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> > >>
> > >> With 749fefe677 ("block: lift the initial queue bypass mode on
> > >> blk_register_queue() instead of blk_init_allocated_queue()"),
> > >> add_disk() eventually calls blk_queue_bypass_end().
> > >> This change invokes the following warning when multipath is used.
> > >>
> > >>   BUG: scheduling while atomic: multipath/2460/0x00000002

Looks like preemption has been disabled, nested two deep, based
on the /0x00000002, which dumps out preempt_count().

So is someone invoking this with preemption disabled?

> > >>   1 lock held by multipath/2460:
> > >>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> > >>   Modules linked in: ...
> > >>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
> > >>   Call Trace:
> > >>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
> > >>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
> > >>    [<ffffffff814291e6>] schedule+0x64/0x66
> > >>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
> > >>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
> > >>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
> > >>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
> > >>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
> > >>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
> > >>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
> > >>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
> > >>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
> > >>    [<ffffffff8106fb18>] ? complete+0x21/0x53
> > >>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
> > >>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
> > >>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
> > >>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
> > >>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
> > >>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
> > >>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
> > >>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
> > >>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
> > >>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
> > >>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
> > >>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
> > >>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
> > >>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
> > >>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
> > >>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
> > >>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
> > >>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > >>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> > >>
> > >> The warning means that, during queue initialization, blk_queue_bypass_start()
> > >> calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.
> > > 
> > > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > > and schedule out under mutex?
> > 
> > Hm, you are right. It's a mutex.
> > The warning occurs only if I turned on CONFIG_PREEMPT=y.
> 
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> peterz.

Nothing is wrong with calling synchronize_rcu() inside a mutex; this has
been used and does work.  Last I tried it, anyway.  ;-)

> > > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > > So we already have code which can block/wait under md->type_lock. I am
> > > not sure why should we get this warning under a mutex.
> > 
> > add_disk() is called without md->type_lock.
> > 
> > Call flow is like this:
> > 
> > dm_create
> >   alloc_dev
> >     blk_alloc_queue
> >     alloc_disk
> >     add_disk
> >       blk_queue_bypass_end [with 3.7-rc2]
> > 
> > table_load
> >   dm_lock_md_type [takes md->type_lock]
> >   dm_setup_md_queue
> >     blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
> >       elevator_init
> >         blkcg_activate_policy
> >           blk_queue_bypass_start <-- THIS triggers the warning
> >           blk_queue_bypass_end
> >       blk_queue_bypass_end [with 3.6]
> >   dm_unlock_md_type
> > 
> > With 3.6, the blk_queue_bypass_start() in blkcg_activate_policy() was a
> > nested call that did nothing.
> > With 3.7-rc2, it becomes the initial call and does the actual draining.
> 
> Ok. Once we know what's wrong, we should be able to figure out the 
> right solution. Artificially putting queue one level deep in bypass
> to avoid calling synchronize_rcu() sounds bad.

Also should be unnecessary.  I suggest finding out where preemption is
disabled. Scattering checks for preempt_count()!=0 might help locate it.
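
For example, such a check might look like the following (a debugging aid
only, not something to merge):

	/* scatter these along the suspect path to find where preemption
	 * was left disabled */
	WARN_ONCE(preempt_count() != 0,
		  "preempt_count=0x%x\n", preempt_count());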

							Thanx, Paul



* Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized
  2012-10-29 16:45       ` Peter Zijlstra
@ 2012-10-29 17:13         ` Vivek Goyal
  2012-10-30  2:25           ` [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Jun'ichi Nomura
                             ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Vivek Goyal @ 2012-10-29 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jun'ichi Nomura, linux-kernel, device-mapper development,
	Tejun Heo, Jens Axboe, Alasdair G Kergon, Paul E. McKenney

On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> > Ok, so the question is what's wrong with calling synchronize_rcu() inside
> > a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney  and
> > peterz.
> 
> int blkcg_activate_policy(struct request_queue *q,
> 
> {
> 
> 	...
> 
>         preloaded = !radix_tree_preload(GFP_KERNEL);
> 
>         blk_queue_bypass_start(q);
> 
> 
> 
> 
> where:
> 
> int radix_tree_preload(gfp_t gfp_mask)
> {
>         struct radix_tree_preload *rtp;
>         struct radix_tree_node *node;
>         int ret = -ENOMEM;
>         
>         preempt_disable();
> 
> 
> Seems obvious why it explodes..

Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
radix_tree_preload() should fix it. Junichi, can you please give it
a try.

Thanks
Vivek


* [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2012-10-29 17:13         ` Vivek Goyal
@ 2012-10-30  2:25           ` Jun'ichi Nomura
  2012-10-30 13:21             ` Vivek Goyal
  2013-01-08  7:31           ` [PATCH repost] " Jun'ichi Nomura
  2013-02-26  4:53           ` Jun'ichi Nomura
  2 siblings, 1 reply; 14+ messages in thread
From: Jun'ichi Nomura @ 2012-10-30  2:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-kernel, device-mapper development,
	Tejun Heo, Jens Axboe, Alasdair G Kergon, Paul E. McKenney

On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>>         struct radix_tree_preload *rtp;
>>         struct radix_tree_node *node;
>>         int ret = -ENOMEM;
>>         
>>         preempt_disable();
>>
>>
>> Seems obvious why it explodes..
> 
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it
> a try.

Thank you! It just works.
This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
the following warning appears when multipath is used with
CONFIG_PREEMPT=y.

This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

  BUG: scheduling while atomic: multipath/2460/0x00000002
  1 lock held by multipath/2460:
   #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
  Modules linked in: ...
  Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
  Call Trace:
   [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
   [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
   [<ffffffff814291e6>] schedule+0x64/0x66
   [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
   [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
   [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
   [<ffffffff814289e3>] wait_for_common+0x9d/0xee
   [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
   [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
   [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
   [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
   [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
   [<ffffffff8106fb18>] ? complete+0x21/0x53
   [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
   [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
   [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
   [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
   [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
   [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
   [<ffffffff811d8c41>] elevator_init+0xe1/0x115
   [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
   [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
   [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
   [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
   [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
   [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
   [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
   [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
   [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
   [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
   [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
   [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
 


* Re: [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2012-10-30  2:25           ` [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Jun'ichi Nomura
@ 2012-10-30 13:21             ` Vivek Goyal
  0 siblings, 0 replies; 14+ messages in thread
From: Vivek Goyal @ 2012-10-30 13:21 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Peter Zijlstra, linux-kernel, device-mapper development,
	Tejun Heo, Jens Axboe, Alasdair G Kergon, Paul E. McKenney

On Tue, Oct 30, 2012 at 11:25:47AM +0900, Jun'ichi Nomura wrote:

[..]
> This patch moves blk_queue_bypass_start() before radix_tree_preload()
> to avoid the sleeping call while preemption is disabled.
> 
>   BUG: scheduling while atomic: multipath/2460/0x00000002
>   1 lock held by multipath/2460:
>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>   Modules linked in: ...
>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>   Call Trace:
>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>    [<ffffffff814291e6>] schedule+0x64/0x66
>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> 
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Alasdair G Kergon <agk@redhat.com>
> ---

Thanks Junichi. This patch looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

>  block/blk-cgroup.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d0b7703..954f4be 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
>  	if (!blkg)
>  		return -ENOMEM;
>  
> -	preloaded = !radix_tree_preload(GFP_KERNEL);
> -
>  	blk_queue_bypass_start(q);
>  
> +	preloaded = !radix_tree_preload(GFP_KERNEL);
> +
>  	/* make sure the root blkg exists and count the existing blkgs */
>  	spin_lock_irq(q->queue_lock);
>  


* [PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2012-10-29 17:13         ` Vivek Goyal
  2012-10-30  2:25           ` [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Jun'ichi Nomura
@ 2013-01-08  7:31           ` Jun'ichi Nomura
  2013-01-09 15:52             ` Vivek Goyal
  2013-01-09 15:55             ` Tejun Heo
  2013-02-26  4:53           ` Jun'ichi Nomura
  2 siblings, 2 replies; 14+ messages in thread
From: Jun'ichi Nomura @ 2013-01-08  7:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Peter Zijlstra, linux-kernel,
	device-mapper development, Tejun Heo, Paul E. McKenney,
	Alasdair G Kergon

With 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
on blk_register_queue() instead of blk_init_allocated_queue()"),
the following warning appears when multipath is used with CONFIG_PREEMPT=y.

This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

  BUG: scheduling while atomic: multipath/2460/0x00000002
  1 lock held by multipath/2460:
   #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
  Modules linked in: ...
  Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
  Call Trace:
   [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
   [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
   [<ffffffff814291e6>] schedule+0x64/0x66
   [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
   [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
   [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
   [<ffffffff814289e3>] wait_for_common+0x9d/0xee
   [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
   [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
   [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
   [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
   [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
   [<ffffffff8106fb18>] ? complete+0x21/0x53
   [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
   [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
   [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
   [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
   [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
   [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
   [<ffffffff811d8c41>] elevator_init+0xe1/0x115
   [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
   [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
   [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
   [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
   [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
   [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
   [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
   [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
   [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
   [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
   [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
   [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);
 


* Re: [PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2013-01-08  7:31           ` [PATCH repost] " Jun'ichi Nomura
@ 2013-01-09 15:52             ` Vivek Goyal
  2013-01-09 15:55             ` Tejun Heo
  1 sibling, 0 replies; 14+ messages in thread
From: Vivek Goyal @ 2013-01-09 15:52 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Jens Axboe, Peter Zijlstra, linux-kernel,
	device-mapper development, Tejun Heo, Paul E. McKenney,
	Alasdair G Kergon

On Tue, Jan 08, 2013 at 04:31:30PM +0900, Jun'ichi Nomura wrote:
> With 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
> on blk_register_queue() instead of blk_init_allocated_queue()"),
> the following warning appears when multipath is used with CONFIG_PREEMPT=y.
> 
> This patch moves blk_queue_bypass_start() before radix_tree_preload()
> to avoid the sleeping call while preemption is disabled.

Ok, radix_tree_preload() disables preemption and blk_queue_bypass_start()
calls synchronize_rcu(), which in turn leads to schedule(), hence the
warning.

We also call __blkg_lookup_create() with preemption disabled and this
can do blkg allocation. But that allocation is currently GFP_ATOMIC, so
there is no sleeping or scheduling there. So it should be fine.

So the fix looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek 
> 
>   BUG: scheduling while atomic: multipath/2460/0x00000002
>   1 lock held by multipath/2460:
>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>   Modules linked in: ...
>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>   Call Trace:
>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>    [<ffffffff814291e6>] schedule+0x64/0x66
>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> 
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> Acked-by: Vivek Goyal <vgoyal@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Alasdair G Kergon <agk@redhat.com>
> ---
>  block/blk-cgroup.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index b8858fb..53628e4 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
>  	if (!blkg)
>  		return -ENOMEM;
>  
> -	preloaded = !radix_tree_preload(GFP_KERNEL);
> -
>  	blk_queue_bypass_start(q);
>  
> +	preloaded = !radix_tree_preload(GFP_KERNEL);
> +
>  	/* make sure the root blkg exists and count the existing blkgs */
>  	spin_lock_irq(q->queue_lock);
>  


* Re: [PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2013-01-08  7:31           ` [PATCH repost] " Jun'ichi Nomura
  2013-01-09 15:52             ` Vivek Goyal
@ 2013-01-09 15:55             ` Tejun Heo
  1 sibling, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2013-01-09 15:55 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Jens Axboe, Vivek Goyal, Peter Zijlstra, linux-kernel,
	device-mapper development, Paul E. McKenney, Alasdair G Kergon

On Tue, Jan 08, 2013 at 04:31:30PM +0900, Jun'ichi Nomura wrote:
> With 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
> on blk_register_queue() instead of blk_init_allocated_queue()"),
> the following warning appears when multipath is used with CONFIG_PREEMPT=y.
> 
> This patch moves blk_queue_bypass_start() before radix_tree_preload()
> to avoid the sleeping call while preemption is disabled.
> 
>   BUG: scheduling while atomic: multipath/2460/0x00000002
>   1 lock held by multipath/2460:
>    #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>   Modules linked in: ...
>   Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
>   Call Trace:
>    [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
>    [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
>    [<ffffffff814291e6>] schedule+0x64/0x66
>    [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
>    [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
>    [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
>    [<ffffffff814289e3>] wait_for_common+0x9d/0xee
>    [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
>    [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
>    [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
>    [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
>    [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
>    [<ffffffff8106fb18>] ? complete+0x21/0x53
>    [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
>    [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
>    [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
>    [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
>    [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
>    [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
>    [<ffffffff811d8c41>] elevator_init+0xe1/0x115
>    [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
>    [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
>    [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
>    [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
>    [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
>    [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
>    [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
>    [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
>    [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
>    [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
>    [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>    [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
> 
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> Acked-by: Vivek Goyal <vgoyal@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Alasdair G Kergon <agk@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* [PATCH repost] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  2012-10-29 17:13         ` Vivek Goyal
  2012-10-30  2:25           ` [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Jun'ichi Nomura
  2013-01-08  7:31           ` [PATCH repost] " Jun'ichi Nomura
@ 2013-02-26  4:53           ` Jun'ichi Nomura
  2 siblings, 0 replies; 14+ messages in thread
From: Jun'ichi Nomura @ 2013-02-26  4:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Peter Zijlstra, linux-kernel,
	device-mapper development, Tejun Heo, Paul E. McKenney,
	Alasdair G Kergon

Hello Jens,

please consider picking up this patch.
Without this patch, the warning below splats every time a multipath
device is created.
I got acks from Vivek and Tejun when I posted this for v3.7,
and the same patch still applies to v3.8.


Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
on blk_register_queue() instead of blk_init_allocated_queue()"),
the following warning appears when multipath is used with CONFIG_PREEMPT=y.

This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

  BUG: scheduling while atomic: multipath/2460/0x00000002
  1 lock held by multipath/2460:
   #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
  Modules linked in: ...
  Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
  Call Trace:
   [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
   [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
   [<ffffffff814291e6>] schedule+0x64/0x66
   [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
   [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
   [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
   [<ffffffff814289e3>] wait_for_common+0x9d/0xee
   [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
   [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
   [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
   [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
   [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
   [<ffffffff8106fb18>] ? complete+0x21/0x53
   [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
   [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
   [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
   [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
   [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
   [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
   [<ffffffff811d8c41>] elevator_init+0xe1/0x115
   [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
   [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
   [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
   [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
   [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
   [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
   [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
   [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
   [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
   [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
   [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
   [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 block/blk-cgroup.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..53628e4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -790,10 +790,10 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (!blkg)
 		return -ENOMEM;
 
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
 	blk_queue_bypass_start(q);
 
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
 	/* make sure the root blkg exists and count the existing blkgs */
 	spin_lock_irq(q->queue_lock);

