linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Kernel crash in icq_free_icq_rcu
@ 2012-01-17 20:18 Vivek Goyal
  2012-01-17 20:19 ` Jens Axboe
  2012-01-17 21:48 ` Tejun Heo
  0 siblings, 2 replies; 30+ messages in thread
From: Vivek Goyal @ 2012-01-17 20:18 UTC (permalink / raw)
  To: linux kernel mailing list, Tejun Heo; +Cc: Jens Axboe

Hi Tejun,

With latest linus kernel, I see following crash. I was running some
scripts which create cgroups and launch fio jobs in those groups. In
a separate window I wrote a script to change the IO scheduler on the
device in a loop while IO was happening on the device. After few
seconds I see following. So far I tried it twice and reproduced it
both the times in first few seconds.

Thanks
Vivek


[   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] PGD 13abda067 PUD 137d52067 PMD 0 
[   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   94.218004] CPU 0 
[   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
[   94.218004] 
[   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
[   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
[   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
[   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
[   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
[   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
[   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
[   94.218004] Stack:
[   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
[   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
[   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
[   94.218004] Call Trace:
[   94.218004]  <IRQ> 
[   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
[   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
[   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
[   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
[   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
[   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
[   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
[   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
[   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
[   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
[   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
[   94.218004]  <EOI> 
[   94.218004]  [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
[   94.218004]  [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
[   94.218004]  [<ffffffff810011e6>] cpu_idle+0x96/0xe0
[   94.218004]  [<ffffffff818064df>] rest_init+0x133/0x144
[   94.218004]  [<ffffffff81806425>] ? rest_init+0x79/0x144
[   94.218004]  [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
[   94.218004]  [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
[   94.218004]  [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
[   94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74 
[   94.218004] RIP  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004]  RSP <ffff88013fc03de0>
[   94.218004] CR2: 000000000000001c

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:18 Kernel crash in icq_free_icq_rcu Vivek Goyal
@ 2012-01-17 20:19 ` Jens Axboe
  2012-01-17 20:40   ` Vivek Goyal
  2012-01-17 21:48 ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2012-01-17 20:19 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux kernel mailing list, Tejun Heo

On 2012-01-17 21:18, Vivek Goyal wrote:
> Hi Tejun,
> 
> With latest linus kernel, I see following crash. I was running some
> scripts which create cgroups and launch fio jobs in those groups. In
> a separate window I wrote a script to change the IO scheduler on the
> device in a loop while IO was happening on the device. After few
> seconds I see following. So far I tried it twice and reproduced it
> both the times in first few seconds.
> 
> Thanks
> Vivek
> 
> 
> [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0 
> [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [   94.218004] CPU 0 
> [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
> [   94.218004] 
> [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
> [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
> [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
> [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
> [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
> [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
> [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
> [   94.218004] Stack:
> [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
> [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
> [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
> [   94.218004] Call Trace:
> [   94.218004]  <IRQ> 
> [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
> [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
> [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
> [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
> [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
> [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
> [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
> [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
> [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
> [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
> [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
> [   94.218004]  <EOI> 
> [   94.218004]  [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
> [   94.218004]  [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
> [   94.218004]  [<ffffffff810011e6>] cpu_idle+0x96/0xe0
> [   94.218004]  [<ffffffff818064df>] rest_init+0x133/0x144
> [   94.218004]  [<ffffffff81806425>] ? rest_init+0x79/0x144
> [   94.218004]  [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
> [   94.218004]  [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
> [   94.218004]  [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
> [   94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74 
> [   94.218004] RIP  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> [   94.218004]  RSP <ffff88013fc03de0>
> [   94.218004] CR2: 000000000000001c

Can you try this?


diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 163263d..ee55019 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_queue *old_cfqq = cfqd->active_queue;
-
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
 
 	/*
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
+	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;
 
+	cfq_slice_expired(cfqd, 1);
+
 	/*
 	 * Put the new queue at the front of the of the current list,
 	 * so we know that it will be selected next.

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:19 ` Jens Axboe
@ 2012-01-17 20:40   ` Vivek Goyal
  2012-01-17 20:42     ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2012-01-17 20:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux kernel mailing list, Tejun Heo

On Tue, Jan 17, 2012 at 09:19:28PM +0100, Jens Axboe wrote:
> On 2012-01-17 21:18, Vivek Goyal wrote:
> > Hi Tejun,
> > 
> > With latest linus kernel, I see following crash. I was running some
> > scripts which create cgroups and launch fio jobs in those groups. In
> > a separate window I wrote a script to change the IO scheduler on the
> > device in a loop while IO was happening on the device. After few
> > seconds I see following. So far I tried it twice and reproduced it
> > both the times in first few seconds.
> > 
> > Thanks
> > Vivek
> > 
> > 
> > [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> > [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> > [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0 
> > [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> > [   94.218004] CPU 0 
> > [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
> > [   94.218004] 
> > [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> > [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> > [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
> > [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
> > [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
> > [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
> > [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
> > [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> > [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
> > [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
> > [   94.218004] Stack:
> > [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
> > [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
> > [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
> > [   94.218004] Call Trace:
> > [   94.218004]  <IRQ> 
> > [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
> > [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
> > [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
> > [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
> > [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
> > [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
> > [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
> > [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
> > [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
> > [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
> > [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
> > [   94.218004]  <EOI> 
> > [   94.218004]  [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
> > [   94.218004]  [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
> > [   94.218004]  [<ffffffff810011e6>] cpu_idle+0x96/0xe0
> > [   94.218004]  [<ffffffff818064df>] rest_init+0x133/0x144
> > [   94.218004]  [<ffffffff81806425>] ? rest_init+0x79/0x144
> > [   94.218004]  [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
> > [   94.218004]  [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
> > [   94.218004]  [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
> > [   94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74 
> > [   94.218004] RIP  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> > [   94.218004]  RSP <ffff88013fc03de0>
> > [   94.218004] CR2: 000000000000001c
> 
> Can you try this?

Nope, this does not help either. Can reproduce the issue with below
patch applied.

> 
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 163263d..ee55019 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>   */
>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
> -
>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> -	cfq_slice_expired(cfqd, 1);
>  
>  	/*
>  	 * workload type is changed, don't save slice, otherwise preempt
>  	 * doesn't happen
>  	 */
> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
> +	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
>  		cfqq->cfqg->saved_workload_slice = 0;
>  
> +	cfq_slice_expired(cfqd, 1);
> +

cfq_slice_expired() will overwrite the value of
cfqq->cfqg->saved_workload_slice. So we need to set it to zero after
cfq_slice_expired.

I was thinking of just saving the workload type of cfqq before
cfq_slice_expired() so that we don't access old cfqq after
cfq_slice_expired(). But then I noticed that we don't drop a cfqq
reference in cfq_slice_expired(). So not sure how cfq_slice_expired()
can lead to freeing up of queue. It should happen only when process
has exited and last request on the queue if finished. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:40   ` Vivek Goyal
@ 2012-01-17 20:42     ` Jens Axboe
  2012-01-17 20:58       ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2012-01-17 20:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux kernel mailing list, Tejun Heo

On 2012-01-17 21:40, Vivek Goyal wrote:
> On Tue, Jan 17, 2012 at 09:19:28PM +0100, Jens Axboe wrote:
>> On 2012-01-17 21:18, Vivek Goyal wrote:
>>> Hi Tejun,
>>>
>>> With latest linus kernel, I see following crash. I was running some
>>> scripts which create cgroups and launch fio jobs in those groups. In
>>> a separate window I wrote a script to change the IO scheduler on the
>>> device in a loop while IO was happening on the device. After few
>>> seconds I see following. So far I tried it twice and reproduced it
>>> both the times in first few seconds.
>>>
>>> Thanks
>>> Vivek
>>>
>>>
>>> [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>>> [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
>>> [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0 
>>> [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [   94.218004] CPU 0 
>>> [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
>>> [   94.218004] 
>>> [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
>>> [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
>>> [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
>>> [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
>>> [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
>>> [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
>>> [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>> [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
>>> [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
>>> [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
>>> [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
>>> [   94.218004] Stack:
>>> [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
>>> [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
>>> [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
>>> [   94.218004] Call Trace:
>>> [   94.218004]  <IRQ> 
>>> [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
>>> [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
>>> [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
>>> [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
>>> [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
>>> [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
>>> [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
>>> [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
>>> [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
>>> [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
>>> [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
>>> [   94.218004]  <EOI> 
>>> [   94.218004]  [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
>>> [   94.218004]  [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
>>> [   94.218004]  [<ffffffff810011e6>] cpu_idle+0x96/0xe0
>>> [   94.218004]  [<ffffffff818064df>] rest_init+0x133/0x144
>>> [   94.218004]  [<ffffffff81806425>] ? rest_init+0x79/0x144
>>> [   94.218004]  [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
>>> [   94.218004]  [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
>>> [   94.218004]  [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
>>> [   94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74 
>>> [   94.218004] RIP  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
>>> [   94.218004]  RSP <ffff88013fc03de0>
>>> [   94.218004] CR2: 000000000000001c
>>
>> Can you try this?
> 
> Nope, this does not help either. Can reproduce the issue with below
> patch applied.
> 
>>
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 163263d..ee55019 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>>   */
>>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>  {
>> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
>> -
>>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
>> -	cfq_slice_expired(cfqd, 1);
>>  
>>  	/*
>>  	 * workload type is changed, don't save slice, otherwise preempt
>>  	 * doesn't happen
>>  	 */
>> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
>> +	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
>>  		cfqq->cfqg->saved_workload_slice = 0;
>>  
>> +	cfq_slice_expired(cfqd, 1);
>> +
> 
> cfq_slice_expired() will overwrite the value of
> cfqq->cfqg->saved_workload_slice. So we need to set it to zero after
> cfq_slice_expired.

Good point, lets just fix that up afterwards, the use-after-free needs
to go asap.

> I was thinking of just saving the workload type of cfqq before
> cfq_slice_expired() so that we don't access old cfqq after
> cfq_slice_expired(). But then I noticed that we don't drop a cfqq
> reference in cfq_slice_expired(). So not sure how cfq_slice_expired()
> can lead to freeing up of queue. It should happen only when process
> has exited and last request on the queue if finished. 

It does, it drops a ref to the cic which in turn drops the active async
and sync queues.

Out for tonight, will pick this up in the morning.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:42     ` Jens Axboe
@ 2012-01-17 20:58       ` Vivek Goyal
  2012-01-17 21:01         ` Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2012-01-17 20:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux kernel mailing list, Tejun Heo, Chris Mason

On Tue, Jan 17, 2012 at 09:42:40PM +0100, Jens Axboe wrote:
> On 2012-01-17 21:40, Vivek Goyal wrote:
> > On Tue, Jan 17, 2012 at 09:19:28PM +0100, Jens Axboe wrote:
> >> On 2012-01-17 21:18, Vivek Goyal wrote:
> >>> Hi Tejun,
> >>>
> >>> With latest linus kernel, I see following crash. I was running some
> >>> scripts which create cgroups and launch fio jobs in those groups. In
> >>> a separate window I wrote a script to change the IO scheduler on the
> >>> device in a loop while IO was happening on the device. After few
> >>> seconds I see following. So far I tried it twice and reproduced it
> >>> both the times in first few seconds.
> >>>
> >>> Thanks
> >>> Vivek
> >>>
> >>>
> >>> [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> >>> [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> >>> [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0 
> >>> [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> >>> [   94.218004] CPU 0 
> >>> [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
> >>> [   94.218004] 
> >>> [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> >>> [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> >>> [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
> >>> [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
> >>> [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
> >>> [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
> >>> [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >>> [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
> >>> [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> >>> [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
> >>> [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>> [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
> >>> [   94.218004] Stack:
> >>> [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
> >>> [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
> >>> [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
> >>> [   94.218004] Call Trace:
> >>> [   94.218004]  <IRQ> 
> >>> [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
> >>> [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
> >>> [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
> >>> [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
> >>> [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
> >>> [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
> >>> [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
> >>> [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
> >>> [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
> >>> [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
> >>> [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
> >>> [   94.218004]  <EOI> 
> >>> [   94.218004]  [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
> >>> [   94.218004]  [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
> >>> [   94.218004]  [<ffffffff810011e6>] cpu_idle+0x96/0xe0
> >>> [   94.218004]  [<ffffffff818064df>] rest_init+0x133/0x144
> >>> [   94.218004]  [<ffffffff81806425>] ? rest_init+0x79/0x144
> >>> [   94.218004]  [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
> >>> [   94.218004]  [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
> >>> [   94.218004]  [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
> >>> [   94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74 
> >>> [   94.218004] RIP  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> >>> [   94.218004]  RSP <ffff88013fc03de0>
> >>> [   94.218004] CR2: 000000000000001c
> >>
> >> Can you try this?
> > 
> > Nope, this does not help either. Can reproduce the issue with below
> > patch applied.
> > 
> >>
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index 163263d..ee55019 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> >>   */
> >>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >>  {
> >> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
> >> -
> >>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> >> -	cfq_slice_expired(cfqd, 1);
> >>  
> >>  	/*
> >>  	 * workload type is changed, don't save slice, otherwise preempt
> >>  	 * doesn't happen
> >>  	 */
> >> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
> >> +	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
> >>  		cfqq->cfqg->saved_workload_slice = 0;
> >>  
> >> +	cfq_slice_expired(cfqd, 1);
> >> +
> > 
> > cfq_slice_expired() will overwrite the value of
> > cfqq->cfqg->saved_workload_slice. So we need to set it to zero after
> > cfq_slice_expired.
> 
> Good point, lets just fix that up afterwards, the use-after-free needs
> to go asap.
> 
> > I was thinking of just saving the workload type of cfqq before
> > cfq_slice_expired() so that we don't access old cfqq after
> > cfq_slice_expired(). But then I noticed that we don't drop a cfqq
> > reference in cfq_slice_expired(). So not sure how cfq_slice_expired()
> > can lead to freeing up of queue. It should happen only when process
> > has exited and last request on the queue if finished. 
> 
> It does, it drops a ref to the cic which in turn drops the active async
> and sync queues.

Ok, I see it now. Thanks.

I modified your patch a bit. It does not seem to solve my problem but
might help with Chris Mason's boot issue.

Chris, can you please give it a try.

Thanks
Vivek

---
 block/cfq-iosched.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2012-01-18 02:49:33.000000000 -0500
+++ linux-2.6/block/cfq-iosched.c	2012-01-18 02:50:31.000000000 -0500
@@ -3117,7 +3117,7 @@ cfq_should_preempt(struct cfq_data *cfqd
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_queue *old_cfqq = cfqd->active_queue;
+	enum wl_type_t old_cfqq_wl_type = cfqq_type(cfqd->active_queue);
 
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
@@ -3126,7 +3126,7 @@ static void cfq_preempt_queue(struct cfq
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
+	if (old_cfqq_wl_type != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;
 
 	/*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:58       ` Vivek Goyal
@ 2012-01-17 21:01         ` Vivek Goyal
  0 siblings, 0 replies; 30+ messages in thread
From: Vivek Goyal @ 2012-01-17 21:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux kernel mailing list, Tejun Heo, Chris Mason

On Tue, Jan 17, 2012 at 03:58:16PM -0500, Vivek Goyal wrote:

[..]

> > >> Can you try this?
> > > 
> > > Nope, this does not help either. Can reproduce the issue with below
> > > patch applied.
> > > 
> > >>
> > >>
> > >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > >> index 163263d..ee55019 100644
> > >> --- a/block/cfq-iosched.c
> > >> +++ b/block/cfq-iosched.c
> > >> @@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> > >>   */
> > >>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > >>  {
> > >> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
> > >> -
> > >>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> > >> -	cfq_slice_expired(cfqd, 1);
> > >>  
> > >>  	/*
> > >>  	 * workload type is changed, don't save slice, otherwise preempt
> > >>  	 * doesn't happen
> > >>  	 */
> > >> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
> > >> +	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
> > >>  		cfqq->cfqg->saved_workload_slice = 0;
> > >>  
> > >> +	cfq_slice_expired(cfqd, 1);
> > >> +
> > > 
> > > cfq_slice_expired() will overwrite the value of
> > > cfqq->cfqg->saved_workload_slice. So we need to set it to zero after
> > > cfq_slice_expired.
> > 
> > Good point, lets just fix that up afterwards, the use-after-free needs
> > to go asap.
> > 
> > > I was thinking of just saving the workload type of cfqq before
> > > cfq_slice_expired() so that we don't access old cfqq after
> > > cfq_slice_expired(). But then I noticed that we don't drop a cfqq
> > > reference in cfq_slice_expired(). So not sure how cfq_slice_expired()
> > > can lead to freeing up of queue. It should happen only when process
> > > has exited and last request on the queue if finished. 
> > 
> > It does, it drops a ref to the cic which in turn drops the active async
> > and sync queues.
> 
> Ok, I see it now. Thanks.
> 
> I modified your patch a bit. It does not seem to solve my problem but
> might help with Chris Mason's boot issue.
> 
> Chris, can you please give it a try.

Oops, old mail id of chris. Fixing it now.

Thanks
Vivek

> 
> Thanks
> Vivek
> 
> ---
>  block/cfq-iosched.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c	2012-01-18 02:49:33.000000000 -0500
> +++ linux-2.6/block/cfq-iosched.c	2012-01-18 02:50:31.000000000 -0500
> @@ -3117,7 +3117,7 @@ cfq_should_preempt(struct cfq_data *cfqd
>   */
>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
> +	enum wl_type_t old_cfqq_wl_type = cfqq_type(cfqd->active_queue);
>  
>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
>  	cfq_slice_expired(cfqd, 1);
> @@ -3126,7 +3126,7 @@ static void cfq_preempt_queue(struct cfq
>  	 * workload type is changed, don't save slice, otherwise preempt
>  	 * doesn't happen
>  	 */
> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
> +	if (old_cfqq_wl_type != cfqq_type(cfqq))
>  		cfqq->cfqg->saved_workload_slice = 0;
>  
>  	/*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 20:18 Kernel crash in icq_free_icq_rcu Vivek Goyal
  2012-01-17 20:19 ` Jens Axboe
@ 2012-01-17 21:48 ` Tejun Heo
  2012-01-17 22:07   ` Vivek Goyal
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-17 21:48 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux kernel mailing list, Jens Axboe

On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
> Hi Tejun,
> 
> With latest linus kernel, I see following crash. I was running some
> scripts which create cgroups and launch fio jobs in those groups. In
> a separate window I wrote a script to change the IO scheduler on the
> device in a loop while IO was happening on the device. After few
> seconds I see following. So far I tried it twice and reproduced it
> both the times in first few seconds.

Does the following patch make any difference?

diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8..b5a94ed 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,7 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	return 0;
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 21:48 ` Tejun Heo
@ 2012-01-17 22:07   ` Vivek Goyal
  2012-01-18  1:01     ` Shaohua Li
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2012-01-17 22:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux kernel mailing list, Jens Axboe

On Tue, Jan 17, 2012 at 01:48:34PM -0800, Tejun Heo wrote:
> On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
> > Hi Tejun,
> > 
> > With latest linus kernel, I see following crash. I was running some
> > scripts which create cgroups and launch fio jobs in those groups. In
> > a separate window I wrote a script to change the IO scheduler on the
> > device in a loop while IO was happening on the device. After few
> > seconds I see following. So far I tried it twice and reproduced it
> > both the times in first few seconds.
> 
> Does the following patch make any difference?

Nope. Still can reproduce the issue.

Thanks
Vivek

> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 91e18f8..b5a94ed 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -105,6 +105,7 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
>  	if (bio_integrity(bio) != blk_integrity_rq(rq))
>  		return 0;
>  
> +	return 0;
>  	if (!elv_iosched_allow_merge(rq, bio))
>  		return 0;
>  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-17 22:07   ` Vivek Goyal
@ 2012-01-18  1:01     ` Shaohua Li
  2012-01-18  1:03       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  1:01 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, linux kernel mailing list, Jens Axboe

2012/1/18 Vivek Goyal <vgoyal@redhat.com>:
> On Tue, Jan 17, 2012 at 01:48:34PM -0800, Tejun Heo wrote:
>> On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
>> > Hi Tejun,
>> >
>> > With latest linus kernel, I see following crash. I was running some
>> > scripts which create cgroups and launch fio jobs in those groups. In
>> > a separate window I wrote a script to change the IO scheduler on the
>> > device in a loop while IO was happening on the device. After few
>> > seconds I see following. So far I tried it twice and reproduced it
>> > both the times in first few seconds.
>>
>> Does the following patch make any difference?
>
> Nope. Still can reproduce the issue.
I had similar issue too. Looks ioc_create_icq finds an icq (in ioc_lookup_icq)
the icq is in rcu free (it has __rcu_icq_cache set)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:01     ` Shaohua Li
@ 2012-01-18  1:03       ` Tejun Heo
  2012-01-18  1:05         ` Shaohua Li
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-18  1:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

Hello,

On Wed, Jan 18, 2012 at 09:01:00AM +0800, Shaohua Li wrote:
> 2012/1/18 Vivek Goyal <vgoyal@redhat.com>:
> >> Does the following patch make any difference?
> >
> > Nope. Still can reproduce the issue.
>
> I had similar issue too. Looks ioc_create_icq finds an icq (in ioc_lookup_icq)
> the icq is in rcu free (it has __rcu_icq_cache set)

Vivek is seeing the problem while switching elevators.  Are you too?
Or is it during normal operation?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:03       ` Tejun Heo
@ 2012-01-18  1:05         ` Shaohua Li
  2012-01-18  1:11           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  1:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Tejun Heo <tj@kernel.org>:
> Hello,
>
> On Wed, Jan 18, 2012 at 09:01:00AM +0800, Shaohua Li wrote:
>> 2012/1/18 Vivek Goyal <vgoyal@redhat.com>:
>> >> Does the following patch make any difference?
>> >
>> > Nope. Still can reproduce the issue.
>>
>> I had similar issue too. Looks ioc_create_icq finds an icq (in ioc_lookup_icq)
>> the icq is in rcu free (it has __rcu_icq_cache set)
>
> Vivek is seeing the problem while switching elevators.  Are you too?
> Or is it during normal operation?
same here. I had some problems when I debug my ioscheduler, but
eventually found even switching cfq and noop can trigger oops.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:05         ` Shaohua Li
@ 2012-01-18  1:11           ` Tejun Heo
  2012-01-18  1:30             ` Shaohua Li
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-18  1:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > Vivek is seeing the problem while switching elevators.  Are you too?
> > Or is it during normal operation?
> same here. I had some problems when I debug my ioscheduler, but
> eventually found even switching cfq and noop can trigger oops.

Hmmm... maybe quiescing isn't working as expected and kmem cache is
being destroyed with live icq's.  I'll try to reproduce it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:11           ` Tejun Heo
@ 2012-01-18  1:30             ` Shaohua Li
  2012-01-18  2:26               ` Shaohua Li
  2012-01-18  6:03               ` Shaohua Li
  0 siblings, 2 replies; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  1:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
> On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > > Vivek is seeing the problem while switching elevators.  Are you too?
> > > Or is it during normal operation?
> > same here. I had some problems when I debug my ioscheduler, but
> > eventually found even switching cfq and noop can trigger oops.
> 
> Hmmm... maybe quiescing isn't working as expected and kmem cache is
> being destroyed with live icq's.  I'll try to reproduce it.
this debug patch seems to fix for me.

diff --git a/block/blk-core.c b/block/blk-core.c
index e6c05a9..c6a8ef5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -872,11 +872,11 @@ retry:
 	spin_unlock_irq(q->queue_lock);
 
 	/* create icq if missing */
-	if (unlikely(et->icq_cache && !icq))
+	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
 		icq = ioc_create_icq(q, gfp_mask);
 
 	/* rqs are guaranteed to have icq on elv_set_request() if requested */
-	if (likely(!et->icq_cache || icq))
+	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
 		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
 
 	if (unlikely(!rq)) {



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:30             ` Shaohua Li
@ 2012-01-18  2:26               ` Shaohua Li
  2012-01-18  4:23                 ` Shaohua Li
  2012-01-18  6:03               ` Shaohua Li
  1 sibling, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  2:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Shaohua Li <shaohua.li@intel.com>:
> On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
>> On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
>> > > Vivek is seeing the problem while switching elevators.  Are you too?
>> > > Or is it during normal operation?
>> > same here. I had some problems when I debug my ioscheduler, but
>> > eventually found even switching cfq and noop can trigger oops.
>>
>> Hmmm... maybe quiescing isn't working as expected and kmem cache is
>> being destroyed with live icq's.  I'll try to reproduce it.
> this debug patch seems to fix for me.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e6c05a9..c6a8ef5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -872,11 +872,11 @@ retry:
>        spin_unlock_irq(q->queue_lock);
>
>        /* create icq if missing */
> -       if (unlikely(et->icq_cache && !icq))
> +       if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>                icq = ioc_create_icq(q, gfp_mask);
>
>        /* rqs are guaranteed to have icq on elv_set_request() if requested */
> -       if (likely(!et->icq_cache || icq))
> +       if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>                rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>
>        if (unlikely(!rq)) {
this passed my test, but I didn't get reason why it can help ...
blk_alloc_request doesn't use icq if REQ_ELVPRIV isn't set, so the code
has problem for sure.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  2:26               ` Shaohua Li
@ 2012-01-18  4:23                 ` Shaohua Li
  0 siblings, 0 replies; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  4:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Shaohua Li <shaohua.li@intel.com>:
> 2012/1/18 Shaohua Li <shaohua.li@intel.com>:
>> On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
>>> On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
>>> > > Vivek is seeing the problem while switching elevators.  Are you too?
>>> > > Or is it during normal operation?
>>> > same here. I had some problems when I debug my ioscheduler, but
>>> > eventually found even switching cfq and noop can trigger oops.
>>>
>>> Hmmm... maybe quiescing isn't working as expected and kmem cache is
>>> being destroyed with live icq's.  I'll try to reproduce it.
>> this debug patch seems to fix for me.
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index e6c05a9..c6a8ef5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -872,11 +872,11 @@ retry:
>>        spin_unlock_irq(q->queue_lock);
>>
>>        /* create icq if missing */
>> -       if (unlikely(et->icq_cache && !icq))
>> +       if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>>                icq = ioc_create_icq(q, gfp_mask);
>>
>>        /* rqs are guaranteed to have icq on elv_set_request() if requested */
>> -       if (likely(!et->icq_cache || icq))
>> +       if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>>                rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>>
>>        if (unlikely(!rq)) {
> this passed my test, but I didn't get reason why it can help ...
> blk_alloc_request doesn't use icq if REQ_ELVPRIV isn't set, so the code
> has problem for sure.
Ok, I got the reason here:
CPU 0                                                             CPU1
get_request
   ioc_create_icq
       allocate icq with cfq ioscheduler

 switch to noop
       insert icq to ioc

 switch to cfq

         ioc_clear_queue
in ioc_clear_queue, ioc has icq in its list, but current elevator is noop,
so ioc_exit_icq will get a NULL et->icq_cache

I hit a kmem_cache_alloc_node oops too. because the et in ioc_create_icq
might not be the et of get_request (we drop lock before calling ioc_create_icq)

The put_io_context has a workqueue to run ioc_exit_icq, I suspect
there is problem too.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  1:30             ` Shaohua Li
  2012-01-18  2:26               ` Shaohua Li
@ 2012-01-18  6:03               ` Shaohua Li
  2012-01-18 13:51                 ` Vivek Goyal
  2012-01-18 16:07                 ` Tejun Heo
  1 sibling, 2 replies; 30+ messages in thread
From: Shaohua Li @ 2012-01-18  6:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

On Wed, 2012-01-18 at 09:30 +0800, Shaohua Li wrote:
> On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
> > On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > > > Vivek is seeing the problem while switching elevators.  Are you too?
> > > > Or is it during normal operation?
> > > same here. I had some problems when I debug my ioscheduler, but
> > > eventually found even switching cfq and noop can trigger oops.
> > 
> > Hmmm... maybe quiescing isn't working as expected and kmem cache is
> > being destroyed with live icq's.  I'll try to reproduce it.
> this debug patch seems to fix for me.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e6c05a9..c6a8ef5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -872,11 +872,11 @@ retry:
>  	spin_unlock_irq(q->queue_lock);
>  
>  	/* create icq if missing */
> -	if (unlikely(et->icq_cache && !icq))
> +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>  		icq = ioc_create_icq(q, gfp_mask);
>  
>  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> -	if (likely(!et->icq_cache || icq))
> +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>  
>  	if (unlikely(!rq)) {
> 
Here is the formal patch. I hope it fixes all the problems related to
icq_cache, but please open an eye :)


Subject: block: fix NULL icq_cache reference

CPU0:					CPU1:
					switch from cfq to noop
					   elv_quiesce_start
C: get_request
A:   ioc_create_icq
      alloc icq with cfq
					   B: do elevator switch
					       ioc_clear_queue
					   elv_quiesce_end
      insert icq to ioc
					switch from noop to cfq
					    elv_quiesce_start
					    do elevator switch
					       ioc_clear_queue
					    elv_quiesce_end
CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
in the second ioc_clear_queue, the ioc has icq in its list, but current
elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.

In above cases, if A runs after B, ioc_create_icq will have a NULL
et->icq_cache, this will trigger another crash.

Note, get_request caches et without lock hold. Between C and A, a elevator
switch can start. But we will have elvpriv++, elv_quiesce_start will drain
all requests first. So this will not trigger crash.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
+++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
@@ -872,11 +872,11 @@ retry:
 	spin_unlock_irq(q->queue_lock);
 
 	/* create icq if missing */
-	if (unlikely(et->icq_cache && !icq))
+	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
 		icq = ioc_create_icq(q, gfp_mask);
 
 	/* rqs are guaranteed to have icq on elv_set_request() if requested */
-	if (likely(!et->icq_cache || icq))
+	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
 		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
 
 	if (unlikely(!rq)) {



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  6:03               ` Shaohua Li
@ 2012-01-18 13:51                 ` Vivek Goyal
  2012-01-18 14:20                   ` Vivek Goyal
  2012-01-18 16:07                 ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2012-01-18 13:51 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux kernel mailing list, Jens Axboe

On Wed, Jan 18, 2012 at 02:03:22PM +0800, Shaohua Li wrote:
> On Wed, 2012-01-18 at 09:30 +0800, Shaohua Li wrote:
> > On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
> > > On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > > > > Vivek is seeing the problem while switching elevators.  Are you too?
> > > > > Or is it during normal operation?
> > > > same here. I had some problems when I debug my ioscheduler, but
> > > > eventually found even switching cfq and noop can trigger oops.
> > > 
> > > Hmmm... maybe quiescing isn't working as expected and kmem cache is
> > > being destroyed with live icq's.  I'll try to reproduce it.
> > this debug patch seems to fix for me.
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index e6c05a9..c6a8ef5 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -872,11 +872,11 @@ retry:
> >  	spin_unlock_irq(q->queue_lock);
> >  
> >  	/* create icq if missing */
> > -	if (unlikely(et->icq_cache && !icq))
> > +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
> >  		icq = ioc_create_icq(q, gfp_mask);
> >  
> >  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> > -	if (likely(!et->icq_cache || icq))
> > +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
> >  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
> >  
> >  	if (unlikely(!rq)) {
> > 
> Here is the formal patch. I hope it fixes all the problems related to
> icq_cache, but please open an eye :)
> 
> 
> Subject: block: fix NULL icq_cache reference
> 
> CPU0:					CPU1:
> 					switch from cfq to noop
> 					   elv_quiesce_start
> C: get_request
> A:   ioc_create_icq
>       alloc icq with cfq
> 					   B: do elevator switch
> 					       ioc_clear_queue
> 					   elv_quiesce_end
>       insert icq to ioc
> 					switch from noop to cfq
> 					    elv_quiesce_start
> 					    do elevator switch
> 					       ioc_clear_queue
> 					    elv_quiesce_end
> CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
> in the second ioc_clear_queue, the ioc has icq in its list, but current
> elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.
> 
> In above cases, if A runs after B, ioc_create_icq will have a NULL
> et->icq_cache, this will trigger another crash.
> 
> Note, get_request caches et without lock hold. Between C and A, a elevator
> switch can start. But we will have elvpriv++, elv_quiesce_start will drain
> all requests first. So this will not trigger crash.
> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/blk-core.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux/block/blk-core.c
> ===================================================================
> --- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
> +++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
> @@ -872,11 +872,11 @@ retry:
>  	spin_unlock_irq(q->queue_lock);
>  
>  	/* create icq if missing */
> -	if (unlikely(et->icq_cache && !icq))
> +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>  		icq = ioc_create_icq(q, gfp_mask);
>  
>  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> -	if (likely(!et->icq_cache || icq))
> +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);

Not allocating icq if request is never going to go to elevator as elevator
switch was happening makes sense to me.

I tried this patch. It went little further and crashed at a different
place. I think this seems to be separate merging issue Tejun is trying
to track down.

[  167.704257] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[  167.705002] IP: [<ffffffff8130a389>] cfq_allow_merge+0x59/0xb0
[  167.705002] PGD 13204e067 PUD 13224b067 PMD 0 
[  167.705002] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  167.705002] CPU 2 
[  167.705002] Modules linked in: [last unloaded: scsi_wait_scan]
[  167.705002] 
[  167.705002] Pid: 4653, comm: flush-8:16 Not tainted 3.2.0+ #21
Hewlett-Packard HP xw6600 Workstation/0A9Ch
[  167.705002] RIP: 0010:[<ffffffff8130a389>]  [<ffffffff8130a389>]
cfq_allow_merge+0x59/0xb0
[  167.705002] RSP: 0018:ffff8801323a74e0  EFLAGS: 00010246
[  167.705002] RAX: 0000000000000000 RBX: ffff880139500f78 RCX:
ffff880138753060
[  167.705002] RDX: 0000000000000001 RSI: ffff88013157b800 RDI:
ffff88013897cf18
[  167.705002] RBP: ffff8801323a7500 R08: ffff880138440a88 R09:
0000000000000000
[  167.705002] R10: 0000000000000003 R11: 0000000000000000 R12:
ffff88013210ac00
[  167.705002] R13: ffff88013210ac00 R14: ffff8801323a7ae0 R15:
ffff88013210ac00
[  167.705002] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000)
knlGS:0000000000000000
[  167.705002] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  167.705002] CR2: 0000000000000008 CR3: 0000000137e45000 CR4:
00000000000006e0
[  167.705002] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  167.705002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  167.705002] Process flush-8:16 (pid: 4653, threadinfo ffff8801323a6000,
task ffff880138753060)
[  167.705002] Stack:
[  167.705002]  ffff8801323a7510 ffff880139500f78 ffff88013210ac00
ffff880138440a88
[  167.705002]  ffff8801323a7510 ffffffff812eaf06 ffff8801323a7530
ffffffff812eb4e0
[  167.705002]  ffff880139500f78 0000000000000000 ffff8801323a7590
ffffffff812f3d22
[  167.705002] Call Trace:
[  167.705002]  [<ffffffff812eaf06>] elv_rq_merge_ok+0x96/0xa0
[  167.705002]  [<ffffffff812eb4e0>] elv_try_merge+0x20/0x60
[  167.705002]  [<ffffffff812f3d22>] blk_queue_bio+0x172/0x510
[  167.705002]  [<ffffffff81097952>] ? mark_held_locks+0x82/0x130
[  167.705002]  [<ffffffff812f222a>] generic_make_request+0xca/0x100
[  167.705002]  [<ffffffff812f22d6>] submit_bio+0x76/0xf0
[  167.705002]  [<ffffffff81105bf3>] ? account_page_writeback+0x13/0x20
[  167.705002]  [<ffffffff81106b6d>] ? test_set_page_writeback+0xed/0x1a0
[  167.705002]  [<ffffffff811f5379>] ext4_io_submit+0x29/0x60
[  167.705002]  [<ffffffff811f5575>] ext4_bio_write_page+0x1c5/0x4e0
[  167.705002]  [<ffffffff811f0819>] mpage_da_submit_io+0x509/0x580
[  167.705002]  [<ffffffff811f302e>] mpage_da_map_and_submit+0x1ce/0x450
[  167.705002]  [<ffffffff810fbe42>] ? find_get_pages_tag+0x42/0x290
[  167.705002]  [<ffffffff811f3325>] mpage_add_bh_to_extent+0x75/0x100
[  167.705002]  [<ffffffff811f36d8>] write_cache_pages_da+0x328/0x4a0
[  167.705002]  [<ffffffff8123c47d>] ? jbd2_journal_stop+0x1dd/0x2d0
[  167.705002]  [<ffffffff811f3b7d>] ext4_da_writepages+0x32d/0x830
[  167.705002]  [<ffffffff811081f4>] do_writepages+0x24/0x40
[  167.705002]  [<ffffffff81177ef1>] writeback_single_inode+0x171/0x590
[  167.705002]  [<ffffffff81178740>] writeback_sb_inodes+0x1b0/0x290
[  167.705002]  [<ffffffff811788be>] __writeback_inodes_wb+0x9e/0xd0
[  167.705002]  [<ffffffff81178aeb>] wb_writeback+0x1fb/0x520
[  167.705002]  [<ffffffff81179290>] ? wb_do_writeback+0x100/0x290
[  167.705002]  [<ffffffff811792e0>] wb_do_writeback+0x150/0x290
[  167.705002]  [<ffffffff8183510f>] ?
_raw_spin_unlock_irqrestore+0x3f/0x70
[  167.705002]  [<ffffffff811794b3>] bdi_writeback_thread+0x93/0x450
[  167.705002]  [<ffffffff81179420>] ? wb_do_writeback+0x290/0x290
[  167.705002]  [<ffffffff8105fbd3>] kthread+0x93/0xa0
[  167.705002]  [<ffffffff8183ec64>] kernel_thread_helper+0x4/0x10
[  167.705002]  [<ffffffff8183559d>] ? retint_restore_args+0xe/0xe
[  167.705002]  [<ffffffff8105fb40>] ? __init_kthread_worker+0x70/0x70
[  167.705002]  [<ffffffff8183ec60>] ? gs_change+0xb/0xb
[  167.705002] Code: 48 83 fa 01 74 0e 8b 43 40 45 31 e4 83 e0 11 83 f8 01
74 5a 48 8b 83 98 00 00 00 65 48 8b 0c 25 40 b9 00 00 48 8b b9 20 0e 00 00
<48> 3b 78 08 74 1c 45 31 e4 48 85 ff 74 35 48 8b 36 e8 a1 cf fe 
[  167.705002] RIP  [<ffffffff8130a389>] cfq_allow_merge+0x59/0xb0
[  167.705002]  RSP <ffff8801323a74e0>
[  167.705002] CR2: 0000000000000008
[  168.096368] ---[ end trace 80dfb68136073c47 ]---

Thanks
Vivek

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 13:51                 ` Vivek Goyal
@ 2012-01-18 14:20                   ` Vivek Goyal
  2012-01-18 16:09                     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2012-01-18 14:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux kernel mailing list, Jens Axboe

On Wed, Jan 18, 2012 at 08:51:26AM -0500, Vivek Goyal wrote:

[..]
> > Subject: block: fix NULL icq_cache reference
> > 
> > CPU0:					CPU1:
> > 					switch from cfq to noop
> > 					   elv_quiesce_start
> > C: get_request
> > A:   ioc_create_icq
> >       alloc icq with cfq
> > 					   B: do elevator switch
> > 					       ioc_clear_queue
> > 					   elv_quiesce_end
> >       insert icq to ioc
> > 					switch from noop to cfq
> > 					    elv_quiesce_start
> > 					    do elevator switch
> > 					       ioc_clear_queue
> > 					    elv_quiesce_end
> > CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
> > in the second ioc_clear_queue, the ioc has icq in its list, but current
> > elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.
> > 
> > In above cases, if A runs after B, ioc_create_icq will have a NULL
> > et->icq_cache, this will trigger another crash.
> > 
> > Note, get_request caches et without lock hold. Between C and A, a elevator
> > switch can start. But we will have elvpriv++, elv_quiesce_start will drain
> > all requests first. So this will not trigger crash.
> > 
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > ---
> >  block/blk-core.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > Index: linux/block/blk-core.c
> > ===================================================================
> > --- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
> > +++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
> > @@ -872,11 +872,11 @@ retry:
> >  	spin_unlock_irq(q->queue_lock);
> >  
> >  	/* create icq if missing */
> > -	if (unlikely(et->icq_cache && !icq))
> > +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
> >  		icq = ioc_create_icq(q, gfp_mask);
> >  
> >  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> > -	if (likely(!et->icq_cache || icq))
> > +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
> >  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
> 
> Not allocating icq if request is never going to go to elevator as elevator
> switch was happening makes sense to me.
> 
> I tried this patch. It went little further and crashed at a different
> place. I think this seems to be separate merging issue Tejun is trying
> to track down.

Applied Tejun's debug patch to return early and not call into elevator
for checking whether merge is allowed or not. Things seems to be stable
now for me.

So Shaohua's patch did fix the original crash for me.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18  6:03               ` Shaohua Li
  2012-01-18 13:51                 ` Vivek Goyal
@ 2012-01-18 16:07                 ` Tejun Heo
  2012-01-19  1:41                   ` [patch]block: fix NULL icq_cache reference Shaohua Li
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-18 16:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Vivek Goyal, linux kernel mailing list, Jens Axboe

Hello,

On Wed, Jan 18, 2012 at 02:03:22PM +0800, Shaohua Li wrote:
> Subject: block: fix NULL icq_cache reference
> 
> CPU0:					CPU1:
> 					switch from cfq to noop
> 					   elv_quiesce_start
> C: get_request
> A:   ioc_create_icq
>       alloc icq with cfq
> 					   B: do elevator switch
> 					       ioc_clear_queue
> 					   elv_quiesce_end
>       insert icq to ioc
> 					switch from noop to cfq
> 					    elv_quiesce_start
> 					    do elevator switch
> 					       ioc_clear_queue
> 					    elv_quiesce_end
> CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
> in the second ioc_clear_queue, the ioc has icq in its list, but current
> elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.
> 
> In above cases, if A runs after B, ioc_create_icq will have a NULL
> et->icq_cache, this will trigger another crash.
> 
> Note, get_request caches et without lock hold. Between C and A, a elevator
> switch can start. But we will have elvpriv++, elv_quiesce_start will drain
> all requests first. So this will not trigger crash.

Thanks a lot for tracking it down.

Hmmm... but I'm having a difficult time following the description.
Maybe we can simplify a bit?  e.g. sth like the following?

  Once a queue is quiesced, it's not supposed to have any elvpriv data
  or icq's, and elevator switching depends on that.  Request alloc
  path followed the rule for elvpriv data but forgot apply it to
  icq's leading to the following crash during elevator switch.

  <oops log>

  Fix it by not allocating icq's if ELVPRIV is not set for the
  request.

> Index: linux/block/blk-core.c
> ===================================================================
> --- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
> +++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
> @@ -872,11 +872,11 @@ retry:
>  	spin_unlock_irq(q->queue_lock);
>  
>  	/* create icq if missing */
> -	if (unlikely(et->icq_cache && !icq))
> +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>  		icq = ioc_create_icq(q, gfp_mask);
>  
>  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> -	if (likely(!et->icq_cache || icq))
> +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);

Hmmm... I was trying to avoid adding a goto label with the double
testing but with REQ_ELVPRIV test added, it looks more confusing.
Maybe something like the following is better?

	/* rqs are guaranteed to have icq on elv_set_request() if requested */
	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
		icq = ioc_create_icq(q, gfp_mask);
		if (!icq)
			goto fail_icq;
	}
	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
 fail_icq:

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 14:20                   ` Vivek Goyal
@ 2012-01-18 16:09                     ` Tejun Heo
  2012-01-18 16:24                       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-18 16:09 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Shaohua Li, linux kernel mailing list, Jens Axboe

On Wed, Jan 18, 2012 at 09:20:05AM -0500, Vivek Goyal wrote:
> > Not allocating icq if request is never going to go to elevator as elevator
> > switch was happening makes sense to me.
> > 
> > I tried this patch. It went little further and crashed at a different
> > place. I think this seems to be separate merging issue Tejun is trying
> > to track down.
> 
> Applied Tejun's debug patch to return early and not call into elevator
> for checking whether merge is allowed or not. Things seems to be stable
> now for me.

Yeah, plug merge is calling into elevator code without any
synchronization, so it's bound to be broken.  Given plugging is
per-task, I don't think we really need to query elevator about merging
bio's.  The request is not on elevator and plugging is part of issuing
mechanism, not scheduling, after all.  Jens, what do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:09                     ` Tejun Heo
@ 2012-01-18 16:24                       ` Jens Axboe
  2012-01-18 16:31                         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2012-01-18 16:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, Shaohua Li, linux kernel mailing list

On 2012-01-18 17:09, Tejun Heo wrote:
> On Wed, Jan 18, 2012 at 09:20:05AM -0500, Vivek Goyal wrote:
>>> Not allocating icq if request is never going to go to elevator as elevator
>>> switch was happening makes sense to me.
>>>
>>> I tried this patch. It went little further and crashed at a different
>>> place. I think this seems to be separate merging issue Tejun is trying
>>> to track down.
>>
>> Applied Tejun's debug patch to return early and not call into elevator
>> for checking whether merge is allowed or not. Things seems to be stable
>> now for me.
> 
> Yeah, plug merge is calling into elevator code without any
> synchronization, so it's bound to be broken.  Given plugging is
> per-task, I don't think we really need to query elevator about merging
> bio's.  The request is not on elevator and plugging is part of issuing
> mechanism, not scheduling, after all.  Jens, what do you think?

Hmmm. We can bypass asking the elevator, as long as we query the
restrictions. Does the below, by itself, resolve the crash? If yes, let
me cook up a patch splitting the elv and blk rq merging logic.

diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8..c2f319a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,8 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+#if 0
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
+#endif
 
 	return 1;
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:24                       ` Jens Axboe
@ 2012-01-18 16:31                         ` Jens Axboe
  2012-01-18 16:36                           ` Vivek Goyal
  2012-01-18 16:55                           ` Tejun Heo
  0 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2012-01-18 16:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, Shaohua Li, linux kernel mailing list

On 2012-01-18 17:24, Jens Axboe wrote:
> On 2012-01-18 17:09, Tejun Heo wrote:
>> On Wed, Jan 18, 2012 at 09:20:05AM -0500, Vivek Goyal wrote:
>>>> Not allocating icq if request is never going to go to elevator as elevator
>>>> switch was happening makes sense to me.
>>>>
>>>> I tried this patch. It went little further and crashed at a different
>>>> place. I think this seems to be separate merging issue Tejun is trying
>>>> to track down.
>>>
>>> Applied Tejun's debug patch to return early and not call into elevator
>>> for checking whether merge is allowed or not. Things seems to be stable
>>> now for me.
>>
>> Yeah, plug merge is calling into elevator code without any
>> synchronization, so it's bound to be broken.  Given plugging is
>> per-task, I don't think we really need to query elevator about merging
>> bio's.  The request is not on elevator and plugging is part of issuing
>> mechanism, not scheduling, after all.  Jens, what do you think?
> 
> Hmmm. We can bypass asking the elevator, as long as we query the
> restrictions. Does the below, by itself, resolve the crash? If yes, let
> me cook up a patch splitting the elv and blk rq merging logic.

Something like the below, completely untested.

But thinking about this a bit while doing it, why is the IO scheduler
going away while we have plugged requests that are elvpriv?

diff --git a/block/blk-core.c b/block/blk-core.c
index e6c05a9..75eba5c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1283,7 +1283,7 @@ static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		if (rq->q != q)
 			continue;
 
-		el_ret = elv_try_merge(rq, bio);
+		el_ret = blk_try_merge(rq, bio);
 		if (el_ret == ELEVATOR_BACK_MERGE) {
 			ret = bio_attempt_back_merge(q, rq, bio);
 			if (ret)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index cfcc37c..ee9ec90 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -471,3 +471,59 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 {
 	return attempt_merge(q, rq, next);
 }
+
+int blk_rq_merge_ok(struct request *rq, struct bio *bio)
+{
+	if (!rq_mergeable(rq))
+		return 0;
+
+	/*
+	 * Don't merge file system requests and discard requests
+	 */
+	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
+		return 0;
+
+	/*
+	 * Don't merge discard requests and secure discard requests
+	 */
+	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
+		return 0;
+
+	/*
+	 * different data direction or already started, don't merge
+	 */
+	if (bio_data_dir(bio) != rq_data_dir(rq))
+		return 0;
+
+	/*
+	 * must be same device and not a special request
+	 */
+	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
+		return 0;
+
+	/*
+	 * only merge integrity protected bio into ditto rq
+	 */
+	if (bio_integrity(bio) != blk_integrity_rq(rq))
+		return 0;
+
+	return 1;
+}
+EXPORT_SYMBOL(blk_rq_merge_ok);
+
+int blk_try_merge(struct request *__rq, struct bio *bio)
+{
+	int ret = ELEVATOR_NO_MERGE;
+
+	/*
+	 * we can merge and sequence is ok, check if it's possible
+	 */
+	if (blk_rq_merge_ok(__rq, bio)) {
+		if (blk_rq_pos(__rq) + blk_rq_sectors(__rq) == bio->bi_sector)
+			ret = ELEVATOR_BACK_MERGE;
+		else if (blk_rq_pos(__rq) - bio_sectors(bio) == bio->bi_sector)
+			ret = ELEVATOR_FRONT_MERGE;
+	}
+
+	return ret;
+}
diff --git a/block/blk.h b/block/blk.h
index 7efd772..a117fa9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -137,6 +137,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
 				struct request *next);
 void blk_recalc_rq_segments(struct request *rq);
 void blk_rq_set_mixed_merge(struct request *rq);
+int blk_rq_merge_ok(struct request *rq, struct bio *bio);
+int blk_try_merge(struct request *rq, struct bio *bio);
 
 void blk_queue_congestion_threshold(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8..a1a75f7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -72,39 +72,8 @@ static int elv_iosched_allow_merge(struct request *rq, struct bio *bio)
  */
 int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 {
-	if (!rq_mergeable(rq))
+	if (!blk_rq_merge_ok(rq, bio))
 		return 0;
-
-	/*
-	 * Don't merge file system requests and discard requests
-	 */
-	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
-		return 0;
-
-	/*
-	 * Don't merge discard requests and secure discard requests
-	 */
-	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
-		return 0;
-
-	/*
-	 * different data direction or already started, don't merge
-	 */
-	if (bio_data_dir(bio) != rq_data_dir(rq))
-		return 0;
-
-	/*
-	 * must be same device and not a special request
-	 */
-	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
-		return 0;
-
-	/*
-	 * only merge integrity protected bio into ditto rq
-	 */
-	if (bio_integrity(bio) != blk_integrity_rq(rq))
-		return 0;
-
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
@@ -114,19 +83,13 @@ EXPORT_SYMBOL(elv_rq_merge_ok);
 
 int elv_try_merge(struct request *__rq, struct bio *bio)
 {
-	int ret = ELEVATOR_NO_MERGE;
-
 	/*
 	 * we can merge and sequence is ok, check if it's possible
 	 */
-	if (elv_rq_merge_ok(__rq, bio)) {
-		if (blk_rq_pos(__rq) + blk_rq_sectors(__rq) == bio->bi_sector)
-			ret = ELEVATOR_BACK_MERGE;
-		else if (blk_rq_pos(__rq) - bio_sectors(bio) == bio->bi_sector)
-			ret = ELEVATOR_FRONT_MERGE;
-	}
+	if (elv_rq_merge_ok(__rq, bio))
+		return blk_try_merge(__rq, bio);
 
-	return ret;
+	return ELEVATOR_NO_MERGE;
 }
 
 static struct elevator_type *elevator_find(const char *name)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:31                         ` Jens Axboe
@ 2012-01-18 16:36                           ` Vivek Goyal
  2012-01-18 17:10                             ` Tejun Heo
  2012-01-18 19:05                             ` Jens Axboe
  2012-01-18 16:55                           ` Tejun Heo
  1 sibling, 2 replies; 30+ messages in thread
From: Vivek Goyal @ 2012-01-18 16:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, Shaohua Li, linux kernel mailing list

On Wed, Jan 18, 2012 at 05:31:06PM +0100, Jens Axboe wrote:
> On 2012-01-18 17:24, Jens Axboe wrote:
> > On 2012-01-18 17:09, Tejun Heo wrote:
> >> On Wed, Jan 18, 2012 at 09:20:05AM -0500, Vivek Goyal wrote:
> >>>> Not allocating icq if request is never going to go to elevator as elevator
> >>>> switch was happening makes sense to me.
> >>>>
> >>>> I tried this patch. It went little further and crashed at a different
> >>>> place. I think this seems to be separate merging issue Tejun is trying
> >>>> to track down.
> >>>
> >>> Applied Tejun's debug patch to return early and not call into elevator
> >>> for checking whether merge is allowed or not. Things seems to be stable
> >>> now for me.
> >>
> >> Yeah, plug merge is calling into elevator code without any
> >> synchronization, so it's bound to be broken.  Given plugging is
> >> per-task, I don't think we really need to query elevator about merging
> >> bio's.  The request is not on elevator and plugging is part of issuing
> >> mechanism, not scheduling, after all.  Jens, what do you think?
> > 
> > Hmmm. We can bypass asking the elevator, as long as we query the
> > restrictions. Does the below, by itself, resolve the crash? If yes, let
> > me cook up a patch splitting the elv and blk rq merging logic.
> 
> Something like the below, completely untested.
> 
> But thinking about this a bit while doing it, why is the IO scheduler
> going away while we have plugged requests that are elvpriv?

Not calling ioscheduler during plug merge will allow merging of sync/async
requests together. I guess we wouldn't want that. The only check we can
skip in case of plug merge, is whether bio and rq beong to same task/cfqq
or not.

May be separate elevator functions for plug merge (without lock) and
elevator merge (with lock) will do?

Thanks
Vivek

> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e6c05a9..75eba5c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1283,7 +1283,7 @@ static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
>  		if (rq->q != q)
>  			continue;
>  
> -		el_ret = elv_try_merge(rq, bio);
> +		el_ret = blk_try_merge(rq, bio);
>  		if (el_ret == ELEVATOR_BACK_MERGE) {
>  			ret = bio_attempt_back_merge(q, rq, bio);
>  			if (ret)
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index cfcc37c..ee9ec90 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -471,3 +471,59 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
>  {
>  	return attempt_merge(q, rq, next);
>  }
> +
> +int blk_rq_merge_ok(struct request *rq, struct bio *bio)
> +{
> +	if (!rq_mergeable(rq))
> +		return 0;
> +
> +	/*
> +	 * Don't merge file system requests and discard requests
> +	 */
> +	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
> +		return 0;
> +
> +	/*
> +	 * Don't merge discard requests and secure discard requests
> +	 */
> +	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
> +		return 0;
> +
> +	/*
> +	 * different data direction or already started, don't merge
> +	 */
> +	if (bio_data_dir(bio) != rq_data_dir(rq))
> +		return 0;
> +
> +	/*
> +	 * must be same device and not a special request
> +	 */
> +	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
> +		return 0;
> +
> +	/*
> +	 * only merge integrity protected bio into ditto rq
> +	 */
> +	if (bio_integrity(bio) != blk_integrity_rq(rq))
> +		return 0;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL(blk_rq_merge_ok);
> +
> +int blk_try_merge(struct request *__rq, struct bio *bio)
> +{
> +	int ret = ELEVATOR_NO_MERGE;
> +
> +	/*
> +	 * we can merge and sequence is ok, check if it's possible
> +	 */
> +	if (blk_rq_merge_ok(__rq, bio)) {
> +		if (blk_rq_pos(__rq) + blk_rq_sectors(__rq) == bio->bi_sector)
> +			ret = ELEVATOR_BACK_MERGE;
> +		else if (blk_rq_pos(__rq) - bio_sectors(bio) == bio->bi_sector)
> +			ret = ELEVATOR_FRONT_MERGE;
> +	}
> +
> +	return ret;
> +}
> diff --git a/block/blk.h b/block/blk.h
> index 7efd772..a117fa9 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -137,6 +137,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
>  				struct request *next);
>  void blk_recalc_rq_segments(struct request *rq);
>  void blk_rq_set_mixed_merge(struct request *rq);
> +int blk_rq_merge_ok(struct request *rq, struct bio *bio);
> +int blk_try_merge(struct request *rq, struct bio *bio);
>  
>  void blk_queue_congestion_threshold(struct request_queue *q);
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index 91e18f8..a1a75f7 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -72,39 +72,8 @@ static int elv_iosched_allow_merge(struct request *rq, struct bio *bio)
>   */
>  int elv_rq_merge_ok(struct request *rq, struct bio *bio)
>  {
> -	if (!rq_mergeable(rq))
> +	if (!blk_rq_merge_ok(rq, bio))
>  		return 0;
> -
> -	/*
> -	 * Don't merge file system requests and discard requests
> -	 */
> -	if ((bio->bi_rw & REQ_DISCARD) != (rq->bio->bi_rw & REQ_DISCARD))
> -		return 0;
> -
> -	/*
> -	 * Don't merge discard requests and secure discard requests
> -	 */
> -	if ((bio->bi_rw & REQ_SECURE) != (rq->bio->bi_rw & REQ_SECURE))
> -		return 0;
> -
> -	/*
> -	 * different data direction or already started, don't merge
> -	 */
> -	if (bio_data_dir(bio) != rq_data_dir(rq))
> -		return 0;
> -
> -	/*
> -	 * must be same device and not a special request
> -	 */
> -	if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
> -		return 0;
> -
> -	/*
> -	 * only merge integrity protected bio into ditto rq
> -	 */
> -	if (bio_integrity(bio) != blk_integrity_rq(rq))
> -		return 0;
> -
>  	if (!elv_iosched_allow_merge(rq, bio))
>  		return 0;
>  
> @@ -114,19 +83,13 @@ EXPORT_SYMBOL(elv_rq_merge_ok);
>  
>  int elv_try_merge(struct request *__rq, struct bio *bio)
>  {
> -	int ret = ELEVATOR_NO_MERGE;
> -
>  	/*
>  	 * we can merge and sequence is ok, check if it's possible
>  	 */
> -	if (elv_rq_merge_ok(__rq, bio)) {
> -		if (blk_rq_pos(__rq) + blk_rq_sectors(__rq) == bio->bi_sector)
> -			ret = ELEVATOR_BACK_MERGE;
> -		else if (blk_rq_pos(__rq) - bio_sectors(bio) == bio->bi_sector)
> -			ret = ELEVATOR_FRONT_MERGE;
> -	}
> +	if (elv_rq_merge_ok(__rq, bio))
> +		return blk_try_merge(__rq, bio);
>  
> -	return ret;
> +	return ELEVATOR_NO_MERGE;
>  }
>  
>  static struct elevator_type *elevator_find(const char *name)
> 
> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:31                         ` Jens Axboe
  2012-01-18 16:36                           ` Vivek Goyal
@ 2012-01-18 16:55                           ` Tejun Heo
  1 sibling, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2012-01-18 16:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vivek Goyal, Shaohua Li, linux kernel mailing list

Hello, Jens.

On Wed, Jan 18, 2012 at 05:31:06PM +0100, Jens Axboe wrote:
> Something like the below, completely untested.

Yeah, I was thinking about mostly identical approach.  We should do
about the same thing for elv_bio_merged() too.

> But thinking about this a bit while doing it, why is the IO scheduler
> going away while we have plugged requests that are elvpriv?

AFAICS, we don't check whether the request has priv or not.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:36                           ` Vivek Goyal
@ 2012-01-18 17:10                             ` Tejun Heo
  2012-01-18 19:07                               ` Jens Axboe
  2012-01-18 19:05                             ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2012-01-18 17:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Shaohua Li, linux kernel mailing list, koverstreet,
	Neil Brown

Hello, Vivek.

On Wed, Jan 18, 2012 at 11:36:38AM -0500, Vivek Goyal wrote:
> Not calling ioscheduler during plug merge will allow merging of sync/async
> requests together. I guess we wouldn't want that. The only check we can
> skip in case of plug merge, is whether bio and rq beong to same task/cfqq
> or not.
> 
> May be separate elevator functions for plug merge (without lock) and
> elevator merge (with lock) will do?

I don't think so.  As I wrote before, plugging is about issuing, not
scheduling.  In a sense, it's a block layer service to upper layers to
make building larger request easier such that block layer users can do
"plug - issue multiple bios of whatever size - flush" instead of
having to manually build series of maximum sized bios.  As such, bios
submitted during a single plugging fare naturally expected to be for
the same purpose.  They all belong to single IO unit after all.

Note that although request is involved in plug merging, the request
isn't known to elevator (other than set_request(), that is).  The only
reason we go half-way through request issuing, stash request, do plug
merging and then do the rest on flush is because block layer doesn't
have a way to deal with arbitrarily sized bios.  We're piggy backing
on request merge logic because that's the only way we know how to
group bios.

In the long run, I think plugging should be done with bios.  There's
no reason to care about requests at that level.  We're just collecting
bios issued as a unit and are likely to be close to each other.  If
request_queue can deal with arbitrarily sized bios (and list of them),
we can simply collect bios and flush down on unplug.  This would be
much cleaner, make plugging useable for bio based drivers and allow
simplifying stacking drivers.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 16:36                           ` Vivek Goyal
  2012-01-18 17:10                             ` Tejun Heo
@ 2012-01-18 19:05                             ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2012-01-18 19:05 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, Shaohua Li, linux kernel mailing list

On 2012-01-18 17:36, Vivek Goyal wrote:
> On Wed, Jan 18, 2012 at 05:31:06PM +0100, Jens Axboe wrote:
>> On 2012-01-18 17:24, Jens Axboe wrote:
>>> On 2012-01-18 17:09, Tejun Heo wrote:
>>>> On Wed, Jan 18, 2012 at 09:20:05AM -0500, Vivek Goyal wrote:
>>>>>> Not allocating icq if request is never going to go to elevator as elevator
>>>>>> switch was happening makes sense to me.
>>>>>>
>>>>>> I tried this patch. It went little further and crashed at a different
>>>>>> place. I think this seems to be separate merging issue Tejun is trying
>>>>>> to track down.
>>>>>
>>>>> Applied Tejun's debug patch to return early and not call into elevator
>>>>> for checking whether merge is allowed or not. Things seems to be stable
>>>>> now for me.
>>>>
>>>> Yeah, plug merge is calling into elevator code without any
>>>> synchronization, so it's bound to be broken.  Given plugging is
>>>> per-task, I don't think we really need to query elevator about merging
>>>> bio's.  The request is not on elevator and plugging is part of issuing
>>>> mechanism, not scheduling, after all.  Jens, what do you think?
>>>
>>> Hmmm. We can bypass asking the elevator, as long as we query the
>>> restrictions. Does the below, by itself, resolve the crash? If yes, let
>>> me cook up a patch splitting the elv and blk rq merging logic.
>>
>> Something like the below, completely untested.
>>
>> But thinking about this a bit while doing it, why is the IO scheduler
>> going away while we have plugged requests that are elvpriv?
> 
> Not calling ioscheduler during plug merge will allow merging of sync/async
> requests together. I guess we wouldn't want that. The only check we can
> skip in case of plug merge, is whether bio and rq beong to same task/cfqq
> or not.

It's not a huge concern. Since the IO is coming from the same task, it's
definitely related. And for the related cases, we pretty much always
want merging anyway.

> May be separate elevator functions for plug merge (without lock) and
> elevator merge (with lock) will do?

I don't think that's a good idea, just the restriction checking should
be enough.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Kernel crash in icq_free_icq_rcu
  2012-01-18 17:10                             ` Tejun Heo
@ 2012-01-18 19:07                               ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2012-01-18 19:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Shaohua Li, linux kernel mailing list, koverstreet,
	Neil Brown

On 2012-01-18 18:10, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Wed, Jan 18, 2012 at 11:36:38AM -0500, Vivek Goyal wrote:
>> Not calling ioscheduler during plug merge will allow merging of sync/async
>> requests together. I guess we wouldn't want that. The only check we can
>> skip in case of plug merge, is whether bio and rq beong to same task/cfqq
>> or not.
>>
>> May be separate elevator functions for plug merge (without lock) and
>> elevator merge (with lock) will do?
> 
> I don't think so.  As I wrote before, plugging is about issuing, not
> scheduling.  In a sense, it's a block layer service to upper layers to
> make building larger request easier such that block layer users can do
> "plug - issue multiple bios of whatever size - flush" instead of
> having to manually build series of maximum sized bios.  As such, bios
> submitted during a single plugging fare naturally expected to be for
> the same purpose.  They all belong to single IO unit after all.
> 
> Note that although request is involved in plug merging, the request
> isn't known to elevator (other than set_request(), that is).  The only
> reason we go half-way through request issuing, stash request, do plug
> merging and then do the rest on flush is because block layer doesn't
> have a way to deal with arbitrarily sized bios.  We're piggy backing
> on request merge logic because that's the only way we know how to
> group bios.
> 
> In the long run, I think plugging should be done with bios.  There's
> no reason to care about requests at that level.  We're just collecting
> bios issued as a unit and are likely to be close to each other.  If
> request_queue can deal with arbitrarily sized bios (and list of them),
> we can simply collect bios and flush down on unplug.  This would be
> much cleaner, make plugging useable for bio based drivers and allow
> simplifying stacking drivers.

In the longer term, plugging and merging should be much less needed.
Most file systems should have done this upfront already, so the need for
it should be much smaller. For the eventual multiqueue setup, I don't
plan on doing much work around merging at all. Perhaps some basic
merging for when it hits the final queue, but not at submit time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch]block: fix NULL icq_cache reference
  2012-01-18 16:07                 ` Tejun Heo
@ 2012-01-19  1:41                   ` Shaohua Li
  2012-01-19  1:43                     ` Tejun Heo
  2012-01-19  8:20                     ` Jens Axboe
  0 siblings, 2 replies; 30+ messages in thread
From: Shaohua Li @ 2012-01-19  1:41 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe; +Cc: Vivek Goyal, linux kernel mailing list

On Wed, 2012-01-18 at 08:07 -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 18, 2012 at 02:03:22PM +0800, Shaohua Li wrote:
> > Subject: block: fix NULL icq_cache reference
> > 
> > CPU0:					CPU1:
> > 					switch from cfq to noop
> > 					   elv_quiesce_start
> > C: get_request
> > A:   ioc_create_icq
> >       alloc icq with cfq
> > 					   B: do elevator switch
> > 					       ioc_clear_queue
> > 					   elv_quiesce_end
> >       insert icq to ioc
> > 					switch from noop to cfq
> > 					    elv_quiesce_start
> > 					    do elevator switch
> > 					       ioc_clear_queue
> > 					    elv_quiesce_end
> > CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
> > in the second ioc_clear_queue, the ioc has icq in its list, but current
> > elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.
> > 
> > In above cases, if A runs after B, ioc_create_icq will have a NULL
> > et->icq_cache, this will trigger another crash.
> > 
> > Note, get_request caches et without lock hold. Between C and A, a elevator
> > switch can start. But we will have elvpriv++, elv_quiesce_start will drain
> > all requests first. So this will not trigger crash.
> 
> Thanks a lot for tracking it down.
> 
> Hmmm... but I'm having a difficult time following the description.
> Maybe we can simplify a bit?  e.g. sth like the following?
> 
>   Once a queue is quiesced, it's not supposed to have any elvpriv data
>   or icq's, and elevator switching depends on that.  Request alloc
>   path followed the rule for elvpriv data but forgot apply it to
>   icq's leading to the following crash during elevator switch.
> 
>   <oops log>
> 
>   Fix it by not allocating icq's if ELVPRIV is not set for the
>   request.
I'm trying to explain why there is a crash, but fine to use your
description.

> > Index: linux/block/blk-core.c
> > ===================================================================
> > --- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
> > +++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
> > @@ -872,11 +872,11 @@ retry:
> >  	spin_unlock_irq(q->queue_lock);
> >  
> >  	/* create icq if missing */
> > -	if (unlikely(et->icq_cache && !icq))
> > +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
> >  		icq = ioc_create_icq(q, gfp_mask);
> >  
> >  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> > -	if (likely(!et->icq_cache || icq))
> > +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
> >  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
> 
> Hmmm... I was trying to avoid adding a goto label with the double
> testing but with REQ_ELVPRIV test added, it looks more confusing.
> Maybe something like the following is better?
> 
> 	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> 	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
> 		icq = ioc_create_icq(q, gfp_mask);
> 		if (!icq)
> 			goto fail_icq;
> 	}
> 	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>  fail_icq:
this is ok.

Subject: block: fix NULL icq_cache reference

Vivek reported a kernel crash:
[   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
[   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   94.218004] CPU 0
[   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
[   94.218004]
[   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
[   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
[   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
[   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
[   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
[   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
[   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
[   94.218004] Stack:
[   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
[   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
[   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
[   94.218004] Call Trace:
[   94.218004]  <IRQ>
[   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
[   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
[   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
[   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
[   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
[   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
[   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
[   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
[   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
[   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
[   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80

Once a queue is quiesced, it's not supposed to have any elvpriv data or
icq's, and elevator switching depends on that.  Request alloc path
followed the rule for elvpriv data but forgot apply it to icq's
leading to the following crash during elevator switch. Fix it by not
allocating icq's if ELVPRIV is not set for the request.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-core.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c	2012-01-19 08:41:01.000000000 +0800
+++ linux/block/blk-core.c	2012-01-19 08:49:06.000000000 +0800
@@ -872,13 +872,15 @@ retry:
 	spin_unlock_irq(q->queue_lock);
 
 	/* create icq if missing */
-	if (unlikely(et->icq_cache && !icq))
+	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
 		icq = ioc_create_icq(q, gfp_mask);
+		if (!icq)
+			goto fail_icq;
+	}
 
-	/* rqs are guaranteed to have icq on elv_set_request() if requested */
-	if (likely(!et->icq_cache || icq))
-		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
+	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
 
+fail_icq:
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch]block: fix NULL icq_cache reference
  2012-01-19  1:41                   ` [patch]block: fix NULL icq_cache reference Shaohua Li
@ 2012-01-19  1:43                     ` Tejun Heo
  2012-01-19  8:20                     ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2012-01-19  1:43 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jens Axboe, Vivek Goyal, linux kernel mailing list

Hello,

On Thu, Jan 19, 2012 at 09:41:34AM +0800, Shaohua Li wrote:
> > Hmmm... but I'm having a difficult time following the description.
> > Maybe we can simplify a bit?  e.g. sth like the following?
> > 
> >   Once a queue is quiesced, it's not supposed to have any elvpriv data
> >   or icq's, and elevator switching depends on that.  Request alloc
> >   path followed the rule for elvpriv data but forgot apply it to
> >   icq's leading to the following crash during elevator switch.
> > 
> >   <oops log>
> > 
> >   Fix it by not allocating icq's if ELVPRIV is not set for the
> >   request.
>
> I'm trying to explain why there is a crash, but fine to use your
> description.

Yeah, giving details is great.  It just seemed that high level
description seemed lost among the details.  It would have been great
if the description started with high level description and then went
into details.

> Once a queue is quiesced, it's not supposed to have any elvpriv data or
> icq's, and elevator switching depends on that.  Request alloc path
> followed the rule for elvpriv data but forgot apply it to icq's
> leading to the following crash during elevator switch. Fix it by not
> allocating icq's if ELVPRIV is not set for the request.
> 
> Reported-by: Vivek Goyal <vgoyal@redhat.com>
> Tested-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks a lot.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch]block: fix NULL icq_cache reference
  2012-01-19  1:41                   ` [patch]block: fix NULL icq_cache reference Shaohua Li
  2012-01-19  1:43                     ` Tejun Heo
@ 2012-01-19  8:20                     ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2012-01-19  8:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, Vivek Goyal, linux kernel mailing list

On 01/19/2012 02:41 AM, Shaohua Li wrote:
> On Wed, 2012-01-18 at 08:07 -0800, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Jan 18, 2012 at 02:03:22PM +0800, Shaohua Li wrote:
>>> Subject: block: fix NULL icq_cache reference
>>>
>>> CPU0:					CPU1:
>>> 					switch from cfq to noop
>>> 					   elv_quiesce_start
>>> C: get_request
>>> A:   ioc_create_icq
>>>       alloc icq with cfq
>>> 					   B: do elevator switch
>>> 					       ioc_clear_queue
>>> 					   elv_quiesce_end
>>>       insert icq to ioc
>>> 					switch from noop to cfq
>>> 					    elv_quiesce_start
>>> 					    do elevator switch
>>> 					       ioc_clear_queue
>>> 					    elv_quiesce_end
>>> CPU0 leaves some icq to ioc list after elevator is switching from cfq to noop.
>>> in the second ioc_clear_queue, the ioc has icq in its list, but current
>>> elevator is noop. so ioc_exit_icq will get a NULL et->icq_cache.
>>>
>>> In above cases, if A runs after B, ioc_create_icq will have a NULL
>>> et->icq_cache, this will trigger another crash.
>>>
>>> Note, get_request caches et without lock hold. Between C and A, a elevator
>>> switch can start. But we will have elvpriv++, elv_quiesce_start will drain
>>> all requests first. So this will not trigger crash.
>>
>> Thanks a lot for tracking it down.
>>
>> Hmmm... but I'm having a difficult time following the description.
>> Maybe we can simplify a bit?  e.g. sth like the following?
>>
>>   Once a queue is quiesced, it's not supposed to have any elvpriv data
>>   or icq's, and elevator switching depends on that.  Request alloc
>>   path followed the rule for elvpriv data but forgot apply it to
>>   icq's leading to the following crash during elevator switch.
>>
>>   <oops log>
>>
>>   Fix it by not allocating icq's if ELVPRIV is not set for the
>>   request.
> I'm trying to explain why there is a crash, but fine to use your
> description.
> 
>>> Index: linux/block/blk-core.c
>>> ===================================================================
>>> --- linux.orig/block/blk-core.c	2012-01-18 12:44:13.000000000 +0800
>>> +++ linux/block/blk-core.c	2012-01-18 12:45:28.000000000 +0800
>>> @@ -872,11 +872,11 @@ retry:
>>>  	spin_unlock_irq(q->queue_lock);
>>>  
>>>  	/* create icq if missing */
>>> -	if (unlikely(et->icq_cache && !icq))
>>> +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>>>  		icq = ioc_create_icq(q, gfp_mask);
>>>  
>>>  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
>>> -	if (likely(!et->icq_cache || icq))
>>> +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>>>  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>>
>> Hmmm... I was trying to avoid adding a goto label with the double
>> testing but with REQ_ELVPRIV test added, it looks more confusing.
>> Maybe something like the following is better?
>>
>> 	/* rqs are guaranteed to have icq on elv_set_request() if requested */
>> 	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
>> 		icq = ioc_create_icq(q, gfp_mask);
>> 		if (!icq)
>> 			goto fail_icq;
>> 	}
>> 	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>>  fail_icq:
> this is ok.
> 
> Subject: block: fix NULL icq_cache reference
> 
> Vivek reported a kernel crash:
> [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
> [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [   94.218004] CPU 0
> [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
> [   94.218004]
> [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
> [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
> [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
> [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
> [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
> [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
> [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
> [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
> [   94.218004] Stack:
> [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
> [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
> [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
> [   94.218004] Call Trace:
> [   94.218004]  <IRQ>
> [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
> [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
> [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
> [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
> [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
> [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
> [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
> [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
> [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
> [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
> [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
> 
> Once a queue is quiesced, it's not supposed to have any elvpriv data or
> icq's, and elevator switching depends on that.  Request alloc path
> followed the rule for elvpriv data but forgot apply it to icq's
> leading to the following crash during elevator switch. Fix it by not
> allocating icq's if ELVPRIV is not set for the request.
> 
> Reported-by: Vivek Goyal <vgoyal@redhat.com>
> Tested-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>

Thanks, applied.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-01-19  8:21 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-17 20:18 Kernel crash in icq_free_icq_rcu Vivek Goyal
2012-01-17 20:19 ` Jens Axboe
2012-01-17 20:40   ` Vivek Goyal
2012-01-17 20:42     ` Jens Axboe
2012-01-17 20:58       ` Vivek Goyal
2012-01-17 21:01         ` Vivek Goyal
2012-01-17 21:48 ` Tejun Heo
2012-01-17 22:07   ` Vivek Goyal
2012-01-18  1:01     ` Shaohua Li
2012-01-18  1:03       ` Tejun Heo
2012-01-18  1:05         ` Shaohua Li
2012-01-18  1:11           ` Tejun Heo
2012-01-18  1:30             ` Shaohua Li
2012-01-18  2:26               ` Shaohua Li
2012-01-18  4:23                 ` Shaohua Li
2012-01-18  6:03               ` Shaohua Li
2012-01-18 13:51                 ` Vivek Goyal
2012-01-18 14:20                   ` Vivek Goyal
2012-01-18 16:09                     ` Tejun Heo
2012-01-18 16:24                       ` Jens Axboe
2012-01-18 16:31                         ` Jens Axboe
2012-01-18 16:36                           ` Vivek Goyal
2012-01-18 17:10                             ` Tejun Heo
2012-01-18 19:07                               ` Jens Axboe
2012-01-18 19:05                             ` Jens Axboe
2012-01-18 16:55                           ` Tejun Heo
2012-01-18 16:07                 ` Tejun Heo
2012-01-19  1:41                   ` [patch]block: fix NULL icq_cache reference Shaohua Li
2012-01-19  1:43                     ` Tejun Heo
2012-01-19  8:20                     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).