* [patch]block: fix ioc locking warning
From: Shaohua Li @ 2012-02-06  7:50 UTC
To: lkml; +Cc: Linus Torvalds, Jens Axboe, Tejun Heo, Knut Petersen, mroos

Meelis reported a warning:

WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
Hardware name: 939Dual-SATA2
timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176
Call Trace:
 <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
 [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
 [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
 [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
 [<ffffffff810145fd>] ? apic_write+0x11/0x13
 [<ffffffff81027475>] __do_softirq+0x74/0xfa
 [<ffffffff812f337a>] call_softirq+0x1a/0x30
 [<ffffffff81002ff9>] do_softirq+0x31/0x68
 [<ffffffff810276cf>] irq_exit+0x3d/0xa3
 [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
 [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
 <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
 [<ffffffff8100801f>] ? default_idle+0x1e/0x32
 [<ffffffff81008019>] ? default_idle+0x18/0x32
 [<ffffffff810008b1>] cpu_idle+0x87/0xd1
 [<ffffffff812de861>] rest_init+0x85/0x89
 [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
 [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
 [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7

this_q == locked_q is possible. There are two problems here:
1. In the UP case, there is a preemption counter imbalance because
   spin_trylock() always succeeds.
2. In the SMP case, the loop breaks too early.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Knut Petersen <Knut_Petersen@t-online.de>
Tested-by: Knut Petersen <Knut_Petersen@t-online.de>

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 27a06e0..7490b6d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -204,7 +204,9 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 				spin_unlock(last_q->queue_lock);
 			last_q = NULL;

-			if (!spin_trylock(this_q->queue_lock))
+			/* spin_trylock() always successes in UP case */
+			if (this_q != locked_q &&
+			    !spin_trylock(this_q->queue_lock))
 				break;
 			last_q = this_q;
 			continue;
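The UP half of the bug is easy to lose in the trace above. On UP with preemption enabled, spin_trylock() compiles down to little more than preempt_disable() plus "return 1", so it "succeeds" even against a lock the caller already holds. A minimal userspace sketch of how the old code leaked a preempt count (toy counters and illustrative names, not the kernel implementation):

```c
/*
 * Toy model of the UP problem.  preempt_count stands in for the
 * kernel's per-task counter; the lock functions model UP+PREEMPT
 * spinlock behavior, where trylock always "succeeds".
 */
static int preempt_count;

static void up_spin_lock(void)    { preempt_count++; }
static void up_spin_unlock(void)  { preempt_count--; }
static int  up_spin_trylock(void) { preempt_count++; return 1; }

/* Old put path when this_q == locked_q: trylock "succeeds" on the
 * lock we already hold, and no matching unlock is ever issued
 * because the caller is expected to release locked_q itself. */
static void buggy_put(void)
{
    if (up_spin_trylock()) {
        /* exit icqs ... but no up_spin_unlock() for locked_q */
    }
}

/* Returns how many preempt_disable()s leaked across one put. */
static int leaked_after_put(void)
{
    preempt_count = 0;
    up_spin_lock();      /* caller enters holding queue_lock */
    buggy_put();
    up_spin_unlock();    /* caller drops its queue_lock */
    return preempt_count;
}
```

Run once, the model leaks exactly one preempt count, which is the "preempt leak: 00000102 -> 00000103" the warning complains about.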
* Re: [patch]block: fix ioc locking warning
From: Jens Axboe @ 2012-02-06  7:55 UTC
To: Shaohua Li; +Cc: lkml, Linus Torvalds, Tejun Heo, Knut Petersen, mroos

On 02/06/2012 08:50 AM, Shaohua Li wrote:
> Meelis reported a warning:
>
> WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
> timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
> [...]
>
> this_q == locked_q is possible. There are two problems here:
> 1. In the UP case, there is a preemption counter imbalance because
>    spin_trylock() always succeeds.
> 2. In the SMP case, the loop breaks too early.

Thanks Shaohua, applied!

-- 
Jens Axboe
* Re: [patch]block: fix ioc locking warning
From: Vivek Goyal @ 2012-02-06 15:12 UTC
To: Shaohua Li
Cc: lkml, Linus Torvalds, Jens Axboe, Tejun Heo, Knut Petersen, mroos

On Mon, Feb 06, 2012 at 03:50:11PM +0800, Shaohua Li wrote:
> Meelis reported a warning:
>
> WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
> timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
> [...]
>
> this_q == locked_q is possible. There are two problems here:
> 1. In the UP case, there is a preemption counter imbalance because
>    spin_trylock() always succeeds.
> 2. In the SMP case, the loop breaks too early.

Thanks Shaohua. So is this the case where there is more than one cic on
ioc->icq_list and the first cic's queue is not the same as locked_queue,
but some cic other than the first has a queue the same as locked_queue?

In that case the current code will still defer freeing of the ioc and cic
to a worker thread. So this patch introduces an optimization to handle
those cases and avoid calling the worker thread.

Secondly, it fixes the discrepancy in the preemption count on UP machines,
where we are left with one extra preemption count after put_io_context()
finishes. In the UP case spin_trylock() increases the preemption count and
always returns success. As this_q == locked_q we never unlock this queue
and hence never decrement the preemption count, resulting in the
preemption count warning.

The changelog was not obvious, at least to me. I wish it were a little
more descriptive. Anyway, the patch is already committed..

Thanks
Vivek
* Re: [patch]block: fix ioc locking warning
From: Jens Axboe @ 2012-02-06 16:09 UTC
To: Vivek Goyal
Cc: Shaohua Li, lkml, Linus Torvalds, Tejun Heo, Knut Petersen, mroos

On 02/06/2012 04:12 PM, Vivek Goyal wrote:

[..]
> The changelog was not obvious, at least to me. I wish it were a little
> more descriptive. Anyway, the patch is already committed..

We can always amend the changelog, so don't worry about it already being
committed. If you want to add/change something, just send it in.

-- 
Jens Axboe
* Re: [patch]block: fix ioc locking warning
From: Vivek Goyal @ 2012-02-06 16:37 UTC
To: Jens Axboe
Cc: Shaohua Li, lkml, Linus Torvalds, Tejun Heo, Knut Petersen, mroos

On Mon, Feb 06, 2012 at 05:09:05PM +0100, Jens Axboe wrote:

[..]
> We can always amend the changelog, so don't worry about it already being
> committed. If you want to add/change something, just send it in.

How about something like the following?

block: Do not try to lock an already locked queue again

put_io_context() can be called with one of the request queue locks already
held. But if the locked queue's cic is not first on ioc->icq_list, then we
have two possible issues:

- For the above condition, the current code bails out and schedules the
  worker thread for freeing up the ioc. This can be optimized.

- It might happen that we received the queue locked but we still do the
  trylock on the queue. For the SMP case that's not a problem, as we will
  fail to lock an already locked queue, but in the UP case we seem to
  succeed and in the process increment the preempt count. Once we are done
  with ioc_exit_icq(), we do not call spin_unlock() on the locked queue, as
  we are not supposed to. This leads to an imbalance in the preemption
  count, and the following warning was reported.

This patch fixes both of the above issues by making sure we do not try to
lock an already locked queue again.

WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
Hardware name: 939Dual-SATA2
timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
[...]

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Knut Petersen <Knut_Petersen@t-online.de>
Tested-by: Knut Petersen <Knut_Petersen@t-online.de>

Thanks
Vivek
* Re: [patch]block: fix ioc locking warning
From: Tejun Heo @ 2012-02-06 16:44 UTC
To: Vivek Goyal
Cc: Jens Axboe, Shaohua Li, lkml, Linus Torvalds, Knut Petersen, mroos

Hello,

On Mon, Feb 06, 2012 at 11:37:21AM -0500, Vivek Goyal wrote:
> block: Do not try to lock an already locked queue again
>
> [...]
>
> This patch fixes both of the above issues by making sure we do not try to
> lock an already locked queue again.

Yeah, this seems better to me. Jens, if you're gonna amend the
commit, please consider collapsing the following patch into the
original patch too. Thanks.

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 7490b6d..12978fc 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -204,7 +204,14 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q)
 				spin_unlock(last_q->queue_lock);
 			last_q = NULL;

-			/* spin_trylock() always successes in UP case */
+			/*
+			 * If icq for locked_q wasn't at the head of
+			 * icq_list, we can try to switch back to locked_q.
+			 * On SMP, the following locked_q test avoids
+			 * unnecessary deferring to release_work, on UP,
+			 * incorrect lock state transition (trylock
+			 * succeeding while holding the same lock).
+			 */
 			if (this_q != locked_q &&
 			    !spin_trylock(this_q->queue_lock))
 				break;
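The comment above describes the two effects of the locked_q test. A userspace sketch of the fixed walk, with pthread mutexes standing in for queue_lock (a simplification: each foreign lock is dropped right away instead of being carried in last_q, and the names are illustrative, not kernel code):

```c
#include <pthread.h>

/* Toy model of the fixed icq walk: two queues, the caller holds
 * queue 0's lock, and the icq list happens to put queue 1 first.
 * Skipping the trylock when this_q == locked_q lets the walk reap
 * every icq inline instead of breaking early and deferring the
 * rest to release_work. */
static pthread_mutex_t qlock[2] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static int reap_inline(const int *icq_q, int n, int locked_q)
{
    int reaped = 0;

    for (int i = 0; i < n; i++) {
        int this_q = icq_q[i];

        if (this_q != locked_q) {                  /* the fix */
            if (pthread_mutex_trylock(&qlock[this_q]) != 0)
                break;                             /* defer to worker */
            pthread_mutex_unlock(&qlock[this_q]);
        }
        reaped++;                                  /* ioc_exit_icq() stand-in */
    }
    return reaped;
}
```

With the locked_q test in place, an icq list ordered {other queue, locked queue} is reaped entirely inline; without it, the trylock on the already-held lock would fail on SMP and the loop would bail after the first entry.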
* Re: [patch]block: fix ioc locking warning
From: Linus Torvalds @ 2012-02-06 16:58 UTC
To: Tejun Heo; +Cc: Vivek Goyal, Jens Axboe, Shaohua Li, lkml, Knut Petersen, mroos

On Mon, Feb 6, 2012 at 8:44 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Yeah, this seems better to me. Jens, if you're gonna amend the
> commit, please consider collapsing the following patch into the
> original patch too. Thanks.

Guys, is it *really* worth it to do all these crazy games?

How bad is it to just always use the async freeing, instead of this
clearly very fragile crazy direct-freeing-with-serious-locking-issues
thing?

Sure, even ioc_release_fn() isn't trivial wrt lock handling, but at
least it doesn't have to play these *insane* games with recursive
locking.

And if workqueues are too expensive, what saner alternatives might
there be? This code really is insane.

                      Linus
* Re: [patch]block: fix ioc locking warning
From: Tejun Heo @ 2012-02-06 17:27 UTC
To: Linus Torvalds
Cc: Vivek Goyal, Jens Axboe, Shaohua Li, lkml, Knut Petersen, mroos

Hello,

On Mon, Feb 06, 2012 at 08:58:49AM -0800, Linus Torvalds wrote:
> Guys, is it *really* worth it to do all these crazy games?
>
> How bad is it to just always use the async freeing, instead of this
> clearly very fragile crazy direct-freeing-with-serious-locking-issues
> thing?

It's one wq scheduling on exit for any task which has issued an IO. I
don't think it would matter except for task fork/exit microbenchmarks
(or workloads which approximate to that). I'll get some measurements
and strip the optimization if it doesn't really show up.

Thanks.

-- 
tejun
* Re: [patch]block: fix ioc locking warning
From: Jens Axboe @ 2012-02-06 20:16 UTC
To: Tejun Heo
Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos

On 2012-02-06 18:27, Tejun Heo wrote:
> It's one wq scheduling on exit for any task which has issued an IO. I
> don't think it would matter except for task fork/exit microbenchmarks
> (or workloads which approximate to that). I'll get some measurements
> and strip the optimization if it doesn't really show up.

One (arguably stupid) thing that some users do is something like:

$ find . -exec grep foo '{}' \;

So that would probably be a good pathological test case for this.

-- 
Jens Axboe
* [PATCH] block: strip out locking optimization in put_io_context()
From: Tejun Heo @ 2012-02-06 21:54 UTC
To: Jens Axboe
Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos

put_io_context() performed a complex trylock dance to avoid deferring
ioc release to a workqueue. It was also broken on UP because the
trylock was always assumed to succeed, which resulted in an unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most pathological
microbenchmark (forced ioc allocation and a tight fork/exit loop) fails
to show any appreciable performance benefit of the optimization. Strip
it out. If there turn out to be workloads which are affected by this
change, the simpler optimization from the discussion thread can be
applied later.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
---
I couldn't find any statistically meaningful advantage of the
optimization with tight fork/exit tests w/ forced ioc creation on fork,
which gotta be the most pathological test case for the code path. So,
let's remove the ugly optimization. If I missed sth, we can resurrect
the simpler optimization later. Jens, this is on top of linus#master
without Shaohua's patch.

Thanks.
block/blk-cgroup.c | 2 - block/blk-core.c | 2 - block/blk-ioc.c | 90 +++++----------------------------------------- block/cfq-iosched.c | 2 - fs/ioprio.c | 2 - include/linux/blkdev.h | 3 - include/linux/iocontext.h | 5 +- kernel/fork.c | 2 - 8 files changed, 18 insertions(+), 90 deletions(-) Index: work/block/blk-cgroup.c =================================================================== --- work.orig/block/blk-cgroup.c +++ work/block/blk-cgroup.c @@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_cgroup_changed(ioc); - put_io_context(ioc, NULL); + put_io_context(ioc); } } } Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -642,7 +642,7 @@ static inline void blk_free_request(stru if (rq->cmd_flags & REQ_ELVPRIV) { elv_put_request(q, rq); if (rq->elv.icq) - put_io_context(rq->elv.icq->ioc, q); + put_io_context(rq->elv.icq->ioc); } mempool_free(rq, q->rq.rq_pool); Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -29,21 +29,6 @@ void get_io_context(struct io_context *i } EXPORT_SYMBOL(get_io_context); -/* - * Releasing ioc may nest into another put_io_context() leading to nested - * fast path release. As the ioc's can't be the same, this is okay but - * makes lockdep whine. Keep track of nesting and use it as subclass. - */ -#ifdef CONFIG_LOCKDEP -#define ioc_release_depth(q) ((q) ? 
(q)->ioc_release_depth : 0) -#define ioc_release_depth_inc(q) (q)->ioc_release_depth++ -#define ioc_release_depth_dec(q) (q)->ioc_release_depth-- -#else -#define ioc_release_depth(q) 0 -#define ioc_release_depth_inc(q) do { } while (0) -#define ioc_release_depth_dec(q) do { } while (0) -#endif - static void icq_free_icq_rcu(struct rcu_head *head) { struct io_cq *icq = container_of(head, struct io_cq, __rcu_head); @@ -75,11 +60,8 @@ static void ioc_exit_icq(struct io_cq *i if (rcu_dereference_raw(ioc->icq_hint) == icq) rcu_assign_pointer(ioc->icq_hint, NULL); - if (et->ops.elevator_exit_icq_fn) { - ioc_release_depth_inc(q); + if (et->ops.elevator_exit_icq_fn) et->ops.elevator_exit_icq_fn(icq); - ioc_release_depth_dec(q); - } /* * @icq->q might have gone away by the time RCU callback runs @@ -149,79 +131,29 @@ static void ioc_release_fn(struct work_s /** * put_io_context - put a reference of io_context * @ioc: io_context to put - * @locked_q: request_queue the caller is holding queue_lock of (hint) * * Decrement reference count of @ioc and release it if the count reaches - * zero. If the caller is holding queue_lock of a queue, it can indicate - * that with @locked_q. This is an optimization hint and the caller is - * allowed to pass in %NULL even when it's holding a queue_lock. + * zero. */ -void put_io_context(struct io_context *ioc, struct request_queue *locked_q) +void put_io_context(struct io_context *ioc) { - struct request_queue *last_q = locked_q; unsigned long flags; if (ioc == NULL) return; BUG_ON(atomic_long_read(&ioc->refcount) <= 0); - if (locked_q) - lockdep_assert_held(locked_q->queue_lock); - - if (!atomic_long_dec_and_test(&ioc->refcount)) - return; /* - * Destroy @ioc. This is a bit messy because icq's are chained - * from both ioc and queue, and ioc->lock nests inside queue_lock. - * The inner ioc->lock should be held to walk our icq_list and then - * for each icq the outer matching queue_lock should be grabbed. - * ie. 
We need to do reverse-order double lock dancing. - * - * Another twist is that we are often called with one of the - * matching queue_locks held as indicated by @locked_q, which - * prevents performing double-lock dance for other queues. - * - * So, we do it in two stages. The fast path uses the queue_lock - * the caller is holding and, if other queues need to be accessed, - * uses trylock to avoid introducing locking dependency. This can - * handle most cases, especially if @ioc was performing IO on only - * single device. - * - * If trylock doesn't cut it, we defer to @ioc->release_work which - * can do all the double-locking dancing. + * Releasing ioc requires reverse order double locking and we may + * already be holding a queue_lock. Do it asynchronously from wq. */ - spin_lock_irqsave_nested(&ioc->lock, flags, - ioc_release_depth(locked_q)); - - while (!hlist_empty(&ioc->icq_list)) { - struct io_cq *icq = hlist_entry(ioc->icq_list.first, - struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; - - if (this_q != last_q) { - if (last_q && last_q != locked_q) - spin_unlock(last_q->queue_lock); - last_q = NULL; - - if (!spin_trylock(this_q->queue_lock)) - break; - last_q = this_q; - continue; - } - ioc_exit_icq(icq); + if (atomic_long_dec_and_test(&ioc->refcount)) { + spin_lock_irqsave(&ioc->lock, flags); + if (!hlist_empty(&ioc->icq_list)) + schedule_work(&ioc->release_work); + spin_unlock_irqrestore(&ioc->lock, flags); } - - if (last_q && last_q != locked_q) - spin_unlock(last_q->queue_lock); - - spin_unlock_irqrestore(&ioc->lock, flags); - - /* if no icq is left, we're done; otherwise, kick release_work */ - if (hlist_empty(&ioc->icq_list)) - kmem_cache_free(iocontext_cachep, ioc); - else - schedule_work(&ioc->release_work); } EXPORT_SYMBOL(put_io_context); @@ -236,7 +168,7 @@ void exit_io_context(struct task_struct task_unlock(task); atomic_dec(&ioc->nr_tasks); - put_io_context(ioc, NULL); + put_io_context(ioc); } /** Index: 
work/block/cfq-iosched.c =================================================================== --- work.orig/block/cfq-iosched.c +++ work/block/cfq-iosched.c @@ -1794,7 +1794,7 @@ __cfq_slice_expired(struct cfq_data *cfq cfqd->active_queue = NULL; if (cfqd->active_cic) { - put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue); + put_io_context(cfqd->active_cic->icq.ioc); cfqd->active_cic = NULL; } } Index: work/fs/ioprio.c =================================================================== --- work.orig/fs/ioprio.c +++ work/fs/ioprio.c @@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct * ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_ioprio_changed(ioc, ioprio); - put_io_context(ioc, NULL); + put_io_context(ioc); } return err; Index: work/include/linux/blkdev.h =================================================================== --- work.orig/include/linux/blkdev.h +++ work/include/linux/blkdev.h @@ -399,9 +399,6 @@ struct request_queue { /* Throttle data */ struct throtl_data *td; #endif -#ifdef CONFIG_LOCKDEP - int ioc_release_depth; -#endif }; #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ Index: work/include/linux/iocontext.h =================================================================== --- work.orig/include/linux/iocontext.h +++ work/include/linux/iocontext.h @@ -133,7 +133,7 @@ static inline struct io_context *ioc_tas struct task_struct; #ifdef CONFIG_BLOCK -void put_io_context(struct io_context *ioc, struct request_queue *locked_q); +void put_io_context(struct io_context *ioc); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); @@ -141,8 +141,7 @@ void ioc_ioprio_changed(struct io_contex void ioc_cgroup_changed(struct io_context *ioc); #else struct io_context; -static inline void put_io_context(struct io_context *ioc, - struct request_queue *locked_q) { } +static inline void put_io_context(struct io_context *ioc) { } 
static inline void exit_io_context(struct task_struct *task) { } #endif Index: work/kernel/fork.c =================================================================== --- work.orig/kernel/fork.c +++ work/kernel/fork.c @@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f return -ENOMEM; new_ioc->ioprio = ioc->ioprio; - put_io_context(new_ioc, NULL); + put_io_context(new_ioc); } #endif return 0; ^ permalink raw reply [flat|nested] 49+ messages in thread
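The fast path stripped out by the patch above walked the ioc's icq list, trylocking each queue's lock and bailing to the workqueue on failure, with a special case for the queue lock the caller already held. The shape of that loop — and why `this_q == locked_q` had to be special-cased — can be sketched in userspace. This is a minimal Python analogy, not the kernel API; all names and data layouts here are hypothetical:

```python
import threading

def put_io_context_fastpath(icq_list, locked_q=None):
    """Userspace sketch of the removed fast path: walk the icq list,
    trylocking each queue's lock, and defer to a slow path on failure.
    The caller may already hold locked_q's lock (all names hypothetical)."""
    last_q = None
    deferred = False
    for icq in list(icq_list):
        this_q = icq["q"]
        if this_q is not last_q:
            if last_q is not None and last_q is not locked_q:
                last_q["lock"].release()
            last_q = None
            # The UP bug: kernel spin_trylock() "succeeds" even for a lock
            # the caller already holds, so this_q == locked_q had to be
            # skipped explicitly rather than trylocked.
            if this_q is not locked_q and not this_q["lock"].acquire(blocking=False):
                deferred = True     # kick release_work instead
                break
            last_q = this_q
        icq_list.remove(icq)        # "exit" the icq under this_q's lock
    if last_q is not None and last_q is not locked_q:
        last_q["lock"].release()
    return deferred

# Two queues, no contention: everything drains on the fast path.
qa = {"name": "qa", "lock": threading.Lock()}
qb = {"name": "qb", "lock": threading.Lock()}
icqs = [{"q": qa}, {"q": qa}, {"q": qb}]
print(put_io_context_fastpath(icqs), icqs)   # -> False []
```

If any trylock fails, the sketch returns `True` with the remaining icqs still on the list — the analogue of scheduling `release_work` to finish the job with proper double locking.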
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-06 21:54 ` [PATCH] block: strip out locking optimization in put_io_context() Tejun Heo @ 2012-02-07 6:49 ` Jens Axboe 2012-02-07 16:22 ` Tejun Heo 2012-02-07 23:00 ` [PATCH] block: fix lockdep warning on io_context release put_io_context() Tejun Heo 0 siblings, 2 replies; 49+ messages in thread From: Jens Axboe @ 2012-02-07 6:49 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos On 02/06/2012 10:54 PM, Tejun Heo wrote: > put_io_context() performed a complex trylock dancing to avoid > deferring ioc release to workqueue. It was also broken on UP because > trylock was always assumed to succeed which resulted in unbalanced > preemption count. > > While there are ways to fix the UP breakage, even the most > pathological microbench (forced ioc allocation and tight fork/exit > loop) fails to show any appreciable performance benefit of the > optimization. Strip it out. If there turns out to be workloads which > are affected by this change, simpler optimization from the discussion > thread can be applied later. > > Signed-off-by: Tejun Heo <tj@kernel.org> > LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> > --- > I couldn't find any statiscally meaningful advantage of the > optimization with tight fork/exit tests w/ forced ioc creation on > fork, which gotta be the most pathological test case for the code > path. So, let's remove the ugly optimization. If I missed sth, we > can resurrect the simpler optimization later. Jens, this is on top of > linus#master without Shaohua's patch. OK, then I'm fine with cleaning it up. Applied, thanks Tejun. -- Jens Axboe ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 6:49 ` Jens Axboe @ 2012-02-07 16:22 ` Tejun Heo 2012-02-07 16:28 ` Jens Axboe 2012-02-07 23:00 ` [PATCH] block: fix lockdep warning on io_context release put_io_context() Tejun Heo 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-07 16:22 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos Hello, Jens. On Tue, Feb 07, 2012 at 07:49:19AM +0100, Jens Axboe wrote: > > I couldn't find any statiscally meaningful advantage of the > > optimization with tight fork/exit tests w/ forced ioc creation on > > fork, which gotta be the most pathological test case for the code > > path. So, let's remove the ugly optimization. If I missed sth, we > > can resurrect the simpler optimization later. Jens, this is on top of > > linus#master without Shaohua's patch. > > OK, then I'm fine with cleaning it up. Applied, thanks Tejun. Hmmm... how about merging Shaohua's smaller fix first until we figure out what's going on with the performance regression he's seeing? Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 16:22 ` Tejun Heo @ 2012-02-07 16:28 ` Jens Axboe 2012-02-07 16:33 ` Linus Torvalds 0 siblings, 1 reply; 49+ messages in thread From: Jens Axboe @ 2012-02-07 16:28 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos On 2012-02-07 17:22, Tejun Heo wrote: > Hello, Jens. > > On Tue, Feb 07, 2012 at 07:49:19AM +0100, Jens Axboe wrote: >>> I couldn't find any statiscally meaningful advantage of the >>> optimization with tight fork/exit tests w/ forced ioc creation on >>> fork, which gotta be the most pathological test case for the code >>> path. So, let's remove the ugly optimization. If I missed sth, we >>> can resurrect the simpler optimization later. Jens, this is on top of >>> linus#master without Shaohua's patch. >> >> OK, then I'm fine with cleaning it up. Applied, thanks Tejun. > > Hmmm... how about merging Shaohua's smaller fix first until we figure > out what's going on with the performance regression he's seeing? That was already merged in my tree. I don't see how it makes much difference in tracking the regression. You said that removing it made no difference for the find test case, so I'd be more comfortable getting rid of the nasty optimization. I'll send it in with what I have pending tomorrow, so there's still a full day to change things. We can just shuffle patches before then, not a problem. -- Jens Axboe ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 16:28 ` Jens Axboe @ 2012-02-07 16:33 ` Linus Torvalds 2012-02-07 16:47 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Linus Torvalds @ 2012-02-07 16:33 UTC (permalink / raw) To: Jens Axboe; +Cc: Tejun Heo, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos On Tue, Feb 7, 2012 at 8:28 AM, Jens Axboe <axboe@kernel.dk> wrote: > > That was already merged in my tree. I don't see how it makes much > difference in tracking the regression. You said that removing it made no > difference for the find test case, so I'd be more comfortable getting > rid of the nasty optimization. Yeah, please just get rid of the crazy code. Maybe *that* fixes the regression too, who knows? For all we know, the "fast case" is what causes extra locking only to then fail and not even be a fast-path. I think our default action should always be to simplify and clean up code, unless you have seriously hard numbers to show that the code complexity is worth it. Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 16:33 ` Linus Torvalds @ 2012-02-07 16:47 ` Tejun Heo 2012-02-07 17:17 ` Tejun Heo 2012-02-08 0:19 ` Shaohua Li 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-07 16:47 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos Hello, On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote: > Yeah, please just get rid of the crazy code. Maybe *that* fixes the > regression too, who knows? > > For all we know, the "fast case" is what causes extra locking only to > then fail and not even be a fast-path. Yeah, I was about to ask Shaohua to test the version w/o optimization. With heavily loaded request_queue, trylock failure could be frequent, which I wasn't testing. Shaohua, can you please test the version w/o optimization? Also, can you please give a bit more details on the setup? Are there multiple swap devices? Is it SSD or rotating disk? > I think our default action should always be to simplify and clean up > code, unless you have seriously hard numbers to show that the code > complexity is worth it. Sure, it was originally all in put_io_context() and while moving things to wq, it got progressively complex and I lost sense of complexity from staring at it too long. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 16:47 ` Tejun Heo @ 2012-02-07 17:17 ` Tejun Heo 2012-02-08 0:19 ` Shaohua Li 1 sibling, 0 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-07 17:17 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Vivek Goyal, lkml, Linus Torvalds, Knut Petersen, mroos On Tue, Feb 07, 2012 at 08:47:35AM -0800, Tejun Heo wrote: > Yeah, I was about to ask Shaohua to test the version w/o optimization. > With heavily loaded request_queue, trylock failure could be frequent, > which I wasn't testing. > > Shaohua, can you please test the version w/o optimization? Also, can > you please give a bit more details on the setup? Are there multiple > swap devices? Is it SSD or rotating disk? Can you please also apply the following patch on top of block/for-linus and see how it behaves? Thanks. --- block/blk-ioc.c | 54 +++++++++++++++++------------------------------------- 1 file changed, 17 insertions(+), 37 deletions(-) Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -79,52 +79,32 @@ static void ioc_release_fn(struct work_s { struct io_context *ioc = container_of(work, struct io_context, release_work); - struct request_queue *last_q = NULL; + unsigned long flags; - spin_lock_irq(&ioc->lock); + /* + * Exiting icq may call into put_io_context() through elevator + * which will trigger lockdep warning. The ioc's are guaranteed to + * be different, use a different locking subclass here. Using + * irqsave variant as there's no spin_lock_irq_nested(). + */ + spin_lock_irqsave_nested(&ioc->lock, flags, 1); while (!hlist_empty(&ioc->icq_list)) { struct io_cq *icq = hlist_entry(ioc->icq_list.first, struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; + struct request_queue *q = icq->q; - if (this_q != last_q) { - /* - * Need to switch to @this_q. 
Once we release - * @ioc->lock, it can go away along with @cic. - * Hold on to it. - */ - __blk_get_queue(this_q); - - /* - * blk_put_queue() might sleep thanks to kobject - * idiocy. Always release both locks, put and - * restart. - */ - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); - } - - last_q = this_q; - spin_lock_irq(this_q->queue_lock); - spin_lock(&ioc->lock); - continue; + if (spin_trylock(q->queue_lock)) { + ioc_exit_icq(icq); + spin_unlock(q->queue_lock); + } else { + spin_unlock_irqrestore(&ioc->lock, flags); + cpu_relax(); + spin_lock_irqsave_nested(&ioc->lock, flags, 1); } - ioc_exit_icq(icq); - } - - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); } + spin_unlock_irqrestore(&ioc->lock, flags); kmem_cache_free(iocontext_cachep, ioc); } ^ permalink raw reply [flat|nested] 49+ messages in thread
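The loop in the debug patch above holds the ioc lock, trylocks each queue lock, and on failure drops the ioc lock, spins briefly (`cpu_relax()`), and retries — a generic way to avoid lock-order deadlock without ever blocking on the inner-to-outer acquisition. A userspace sketch of the same shape (names hypothetical, `time.sleep(0)` standing in for `cpu_relax()`):

```python
import threading
import time

def exit_all(inner, outer_locks, items):
    """Drain items while holding `inner`; each item also needs its queue
    lock (outer_locks[item]). On trylock failure, drop `inner`, yield,
    and retry -- mirroring the patch's ioc_release_fn() loop."""
    inner.acquire()
    while items:
        q = items[0]
        if outer_locks[q].acquire(blocking=False):
            items.pop(0)                 # ioc_exit_icq() under both locks
            outer_locks[q].release()
        else:
            inner.release()              # back off instead of deadlocking
            time.sleep(0)                # cpu_relax() analogue
            inner.acquire()
    inner.release()

# Simulate a busy queue_lock: the drain spins until it is released.
inner = threading.Lock()
locks = {"q0": threading.Lock()}
items = ["q0", "q0"]
locks["q0"].acquire()
t = threading.Thread(target=exit_all, args=(inner, locks, items))
t.start()
time.sleep(0.05)
locks["q0"].release()                    # queue_lock freed; drain proceeds
t.join()
print(items)                             # -> []
```

The cost of this pattern, which the thread goes on to discuss, is that under heavy queue_lock contention the trylock fails often and the loop burns CPU retrying.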
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-07 16:47 ` Tejun Heo 2012-02-07 17:17 ` Tejun Heo @ 2012-02-08 0:19 ` Shaohua Li 2012-02-08 8:29 ` Shaohua Li 1 sibling, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-08 0:19 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos 2012/2/8 Tejun Heo <tj@kernel.org>: > Hello, > > On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote: >> Yeah, please just get rid of the crazy code. Maybe *that* fixes the >> regression too, who knows? >> >> For all we know, the "fast case" is what causes extra locking only to >> then fail and not even be a fast-path. > > Yeah, I was about to ask Shaohua to test the version w/o optimization. > With heavily loaded request_queue, trylock failure could be frequent, > which I wasn't testing. > > Shaohua, can you please test the version w/o optimization? Also, can > you please give a bit more details on the setup? Are there multiple > swap devices? Is it SSD or rotating disk? The test adds mem=4G on a 2-socket 16-CPU machine. Just make several copies of the kernel source in tmpfs (so there is swapping, depending on your memory size) and run kernel builds in those source trees in the meantime. There is only one swap device, which is a rotating disk. I'll test both patches soon. Thanks, Shaohua ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 0:19 ` Shaohua Li @ 2012-02-08 8:29 ` Shaohua Li 2012-02-08 16:29 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-08 8:29 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos 2012/2/8 Shaohua Li <shaohua.li@intel.com>: > 2012/2/8 Tejun Heo <tj@kernel.org>: >> Hello, >> >> On Tue, Feb 07, 2012 at 08:33:15AM -0800, Linus Torvalds wrote: >>> Yeah, please just get rid of the crazy code. Maybe *that* fixes the >>> regression too, who knows? >>> >>> For all we know, the "fast case" is what causes extra locking only to >>> then fail and not even be a fast-path. >> >> Yeah, I was about to ask Shaohua to test the version w/o optimization. >> With heavily loaded request_queue, trylock failure could be frequent, >> which I wasn't testing. >> >> Shaohua, can you please test the version w/o optimization? Also, can >> you please give a bit more details on the setup? Are there multiple >> swap devices? Is it SSD or rotating disk? > the test adds mem=4G in a 2 sockets 16 CPU machine. > just make several copy of kernel source in tmpfs (so there is swap > depending on your > memory size) and run kernel build in the kernel source in the meaning time. > there is only one swap device which is a rotating disk. > > I'll test both patches soon. Tried all the options, the regression still exists. Any new idea? I'll spend some time on it if I can get anything Thanks, Shaohua ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 8:29 ` Shaohua Li @ 2012-02-08 16:29 ` Tejun Heo 2012-02-08 16:34 ` Linus Torvalds 2012-02-09 6:22 ` Shaohua Li 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-08 16:29 UTC (permalink / raw) To: Shaohua Li Cc: Linus Torvalds, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos Hello, Shaohua. On Wed, Feb 08, 2012 at 04:29:53PM +0800, Shaohua Li wrote: > > the test adds mem=4G in a 2 sockets 16 CPU machine. > > just make several copy of kernel source in tmpfs (so there is swap > > depending on your > > memory size) and run kernel build in the kernel source in the meaning time. > > there is only one swap device which is a rotating disk. > > > > I'll test both patches soon. > > Tried all the options, the regression still exists. Any new idea? > I'll spend some time on it if I can get anything Can you please try the following one? Thanks a lot! --- block/blk-cgroup.c | 2 block/blk-core.c | 2 block/blk-ioc.c | 99 +++++++++++++--------------------------------- block/cfq-iosched.c | 2 fs/ioprio.c | 2 include/linux/iocontext.h | 8 +-- kernel/fork.c | 2 7 files changed, 37 insertions(+), 80 deletions(-) Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -8,9 +8,12 @@ #include <linux/blkdev.h> #include <linux/bootmem.h> /* for max_pfn/max_low_pfn */ #include <linux/slab.h> +#include <linux/rwlock.h> #include "blk.h" +static DEFINE_RWLOCK(ioc_clear_rwlock); + /* * For io context allocations */ @@ -45,7 +48,7 @@ static void ioc_exit_icq(struct io_cq *i struct request_queue *q = icq->q; struct elevator_type *et = q->elevator->type; - lockdep_assert_held(&ioc->lock); + //lockdep_assert_held(&ioc->lock); lockdep_assert_held(q->queue_lock); radix_tree_delete(&ioc->icq_tree, icq->q->id); @@ -71,63 +74,6 @@ static void ioc_exit_icq(struct io_cq *i call_rcu(&icq->__rcu_head, 
icq_free_icq_rcu); } -/* - * Slow path for ioc release in put_io_context(). Performs double-lock - * dancing to unlink all icq's and then frees ioc. - */ -static void ioc_release_fn(struct work_struct *work) -{ - struct io_context *ioc = container_of(work, struct io_context, - release_work); - struct request_queue *last_q = NULL; - - spin_lock_irq(&ioc->lock); - - while (!hlist_empty(&ioc->icq_list)) { - struct io_cq *icq = hlist_entry(ioc->icq_list.first, - struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; - - if (this_q != last_q) { - /* - * Need to switch to @this_q. Once we release - * @ioc->lock, it can go away along with @cic. - * Hold on to it. - */ - __blk_get_queue(this_q); - - /* - * blk_put_queue() might sleep thanks to kobject - * idiocy. Always release both locks, put and - * restart. - */ - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); - } - - last_q = this_q; - spin_lock_irq(this_q->queue_lock); - spin_lock(&ioc->lock); - continue; - } - ioc_exit_icq(icq); - } - - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); - } - - kmem_cache_free(iocontext_cachep, ioc); -} - /** * put_io_context - put a reference of io_context * @ioc: io_context to put @@ -135,7 +81,7 @@ static void ioc_release_fn(struct work_s * Decrement reference count of @ioc and release it if the count reaches * zero. */ -void put_io_context(struct io_context *ioc) +void put_io_context(struct io_context *ioc, struct request_queue *locked_q) { unsigned long flags; @@ -144,16 +90,26 @@ void put_io_context(struct io_context *i BUG_ON(atomic_long_read(&ioc->refcount) <= 0); - /* - * Releasing ioc requires reverse order double locking and we may - * already be holding a queue_lock. Do it asynchronously from wq. 
- */ - if (atomic_long_dec_and_test(&ioc->refcount)) { - spin_lock_irqsave(&ioc->lock, flags); - if (!hlist_empty(&ioc->icq_list)) - schedule_work(&ioc->release_work); - spin_unlock_irqrestore(&ioc->lock, flags); + if (!atomic_long_dec_and_test(&ioc->refcount)) + return; + + read_lock_irqsave(&ioc_clear_rwlock, flags); + + while (!hlist_empty(&ioc->icq_list)) { + struct io_cq *icq = hlist_entry(ioc->icq_list.first, + struct io_cq, ioc_node); + struct request_queue *q = icq->q; + + if (q != locked_q) + spin_lock_nested(q->queue_lock, 1); + + ioc_exit_icq(icq); + + if (q != locked_q) + spin_unlock(q->queue_lock); } + + read_unlock_irqrestore(&ioc_clear_rwlock, flags); } EXPORT_SYMBOL(put_io_context); @@ -168,7 +124,7 @@ void exit_io_context(struct task_struct task_unlock(task); atomic_dec(&ioc->nr_tasks); - put_io_context(ioc); + put_io_context(ioc, NULL); } /** @@ -181,6 +137,8 @@ void ioc_clear_queue(struct request_queu { lockdep_assert_held(q->queue_lock); + write_lock(&ioc_clear_rwlock); + while (!list_empty(&q->icq_list)) { struct io_cq *icq = list_entry(q->icq_list.next, struct io_cq, q_node); @@ -190,6 +148,8 @@ void ioc_clear_queue(struct request_queu ioc_exit_icq(icq); spin_unlock(&ioc->lock); } + + write_unlock(&ioc_clear_rwlock); } void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags, @@ -208,7 +168,6 @@ void create_io_context_slowpath(struct t spin_lock_init(&ioc->lock); INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH); INIT_HLIST_HEAD(&ioc->icq_list); - INIT_WORK(&ioc->release_work, ioc_release_fn); /* * Try to install. 
ioc shouldn't be installed if someone else Index: work/include/linux/iocontext.h =================================================================== --- work.orig/include/linux/iocontext.h +++ work/include/linux/iocontext.h @@ -3,7 +3,6 @@ #include <linux/radix-tree.h> #include <linux/rcupdate.h> -#include <linux/workqueue.h> enum { ICQ_IOPRIO_CHANGED, @@ -113,8 +112,6 @@ struct io_context { struct radix_tree_root icq_tree; struct io_cq __rcu *icq_hint; struct hlist_head icq_list; - - struct work_struct release_work; }; static inline struct io_context *ioc_task_link(struct io_context *ioc) @@ -133,7 +130,7 @@ static inline struct io_context *ioc_tas struct task_struct; #ifdef CONFIG_BLOCK -void put_io_context(struct io_context *ioc); +void put_io_context(struct io_context *ioc, struct request_queue *locked_q); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); @@ -141,7 +138,8 @@ void ioc_ioprio_changed(struct io_contex void ioc_cgroup_changed(struct io_context *ioc); #else struct io_context; -static inline void put_io_context(struct io_context *ioc) { } +static inline void put_io_context(struct io_context *ioc, + struct request_queue *locked_q) { } static inline void exit_io_context(struct task_struct *task) { } #endif Index: work/block/blk-cgroup.c =================================================================== --- work.orig/block/blk-cgroup.c +++ work/block/blk-cgroup.c @@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_cgroup_changed(ioc); - put_io_context(ioc); + put_io_context(ioc, NULL); } } } Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -642,7 +642,7 @@ static inline void blk_free_request(stru if (rq->cmd_flags & REQ_ELVPRIV) { elv_put_request(q, rq); if (rq->elv.icq) - 
put_io_context(rq->elv.icq->ioc); + put_io_context(rq->elv.icq->ioc, q); } mempool_free(rq, q->rq.rq_pool); Index: work/block/cfq-iosched.c =================================================================== --- work.orig/block/cfq-iosched.c +++ work/block/cfq-iosched.c @@ -1787,7 +1787,7 @@ __cfq_slice_expired(struct cfq_data *cfq cfqd->active_queue = NULL; if (cfqd->active_cic) { - put_io_context(cfqd->active_cic->icq.ioc); + put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue); cfqd->active_cic = NULL; } } Index: work/fs/ioprio.c =================================================================== --- work.orig/fs/ioprio.c +++ work/fs/ioprio.c @@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct * ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_ioprio_changed(ioc, ioprio); - put_io_context(ioc); + put_io_context(ioc, NULL); } return err; Index: work/kernel/fork.c =================================================================== --- work.orig/kernel/fork.c +++ work/kernel/fork.c @@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f return -ENOMEM; new_ioc->ioprio = ioc->ioprio; - put_io_context(new_ioc); + put_io_context(new_ioc, NULL); } #endif return 0; ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 16:29 ` Tejun Heo @ 2012-02-08 16:34 ` Linus Torvalds 2012-02-08 16:49 ` Tejun Heo 2012-02-09 6:22 ` Shaohua Li 1 sibling, 1 reply; 49+ messages in thread From: Linus Torvalds @ 2012-02-08 16:34 UTC (permalink / raw) To: Tejun Heo; +Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos On Wed, Feb 8, 2012 at 8:29 AM, Tejun Heo <tj@kernel.org> wrote: > > Can you please try the following one? Thanks a lot! If you can use it as a rwlock, why can't you do it with RCU? Usually rwlocks are a bad idea. They tend to be more expensive than spinlocks, and the extra parallelism is almost never noticeable (except as "more cacheline bounces") for something that is appropriate for a non-sleeping lock. There's a *very* few situations where rwlock is the right thing, but it really almost always is a horribly bad idea. Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
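For readers unfamiliar with the semantics being debated: a reader-writer lock lets many readers hold the lock at once while a writer excludes everyone. That is exactly what the global `ioc_clear_rwlock` patch leans on — many `put_io_context()` callers read-lock concurrently, while `ioc_clear_queue()` write-locks to exclude them all — and exactly what Linus argues is rarely worth the extra cacheline traffic. A minimal userspace sketch of those semantics (Python's stdlib has no rwlock, so this is a toy condition-variable implementation, nothing like the in-kernel `rwlock_t`):

```python
import threading

class RWLock:
    """Toy writer-excluding rwlock, illustration only: concurrent readers,
    exclusive writer. Every operation still bounces one shared lock/
    cacheline around -- the hidden cost Linus points at."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def read_acquire(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def read_release(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def write_acquire(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def write_release(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

rw = RWLock()
rw.read_acquire()
rw.read_acquire()          # two concurrent readers: allowed
rw.read_release()
rw.read_release()
rw.write_acquire()         # writer now excludes everyone
rw.write_release()
```

Note that even though readers proceed "in parallel", each `read_acquire`/`read_release` still writes the shared state — which is why read-mostly paths tend to want RCU rather than a rwlock.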
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 16:34 ` Linus Torvalds @ 2012-02-08 16:49 ` Tejun Heo 2012-02-08 16:56 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-08 16:49 UTC (permalink / raw) To: Linus Torvalds Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos Hello, Linus. On Wed, Feb 08, 2012 at 08:34:53AM -0800, Linus Torvalds wrote: > On Wed, Feb 8, 2012 at 8:29 AM, Tejun Heo <tj@kernel.org> wrote: > > > > Can you please try the following one? Thanks a lot! > > If you can use it as a rwlock, why can't you do it with RCU? The original locking scheme was using RCU which was very fragile and broken on corner cases. The locking restructuring was aimed to make things simpler. While the double locking isn't trivial, it's much easier to grasp and get right than RCU. We might have to revive RCU if the regression can't be tackled otherwise and it probably is possible to do it simpler. Let's see. > Usually rwlocks are a bad idea. They tend to be more expensive than > spinlocks, and the extra parallelism is almost never noticeable > (except as "more cacheline bounces") for something that is appropriate > for a non-sleeping lock. > > There's a *very* few situations where rwlock is the right thing, but > it really almost always is a horribly bad idea. I'm still a bit lost on where the regression is coming from and *suspecting* that queue_lock contention is making the reverse locking behave much worse than expected, so I mostly wanted to take that out and see what happens. rwlock might increase locking overhead per try but it avoids unlock/lock dancing. I'll try to reproduce the regression in a few days and do better analysis. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 16:49 ` Tejun Heo @ 2012-02-08 16:56 ` Tejun Heo 2012-02-08 17:23 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-08 16:56 UTC (permalink / raw) To: Linus Torvalds Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos On Wed, Feb 08, 2012 at 08:49:20AM -0800, Tejun Heo wrote: > I'm still a bit lost on where the regression is coming from and > *suspecting* that queue_lock contention is making the reverse locking > behave much worse than expected, so I mostly wanted to take that out > and see what happens. IOW, we can achieve about the same thing by adding another lock in request_queue. The goal is using an inner lock for ioc clearing so that queue_lock doesn't have to be grabbed inside ioc lock. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 16:56 ` Tejun Heo @ 2012-02-08 17:23 ` Tejun Heo 0 siblings, 0 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-08 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos On Wed, Feb 08, 2012 at 08:56:11AM -0800, Tejun Heo wrote: > On Wed, Feb 08, 2012 at 08:49:20AM -0800, Tejun Heo wrote: > > I'm still a bit lost on where the regression is coming from and > > *suspecting* that queue_lock contention is making the reverse locking > > behave much worse than expected, so I mostly wanted to take that out > > and see what happens. > > IOW, we can achieve about the same thing by adding another lock in > request_queue. The goal is using an inner lock for ioc clearing so > that queue_lock doesn't have to be grabbed inside ioc lock. Urgh....... forget about the above message. My head is still booting. We can't do this w/ per-queue lock as we don't have a way to traverse the associated queues w/o holding the lock and we need read locking on ioc exit path as we may recurse through elevator icq exit (it's not about concurrency). /me goes to brew coffee. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-08 16:29 ` Tejun Heo 2012-02-08 16:34 ` Linus Torvalds @ 2012-02-09 6:22 ` Shaohua Li 2012-02-09 17:59 ` Tejun Heo 1 sibling, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-09 6:22 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos 2012/2/9 Tejun Heo <tj@kernel.org>: > Hello, Shaohua. > > On Wed, Feb 08, 2012 at 04:29:53PM +0800, Shaohua Li wrote: >> > the test adds mem=4G in a 2 sockets 16 CPU machine. >> > just make several copy of kernel source in tmpfs (so there is swap >> > depending on your >> > memory size) and run kernel build in the kernel source in the meaning time. >> > there is only one swap device which is a rotating disk. >> > >> > I'll test both patches soon. >> >> Tried all the options, the regression still exists. Any new idea? >> I'll spend some time on it if I can get anything > > Can you please try the following one? Thanks a lot! It doesn't work. I also double-confirmed that b2efa05265d62 causes the regression. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 6:22 ` Shaohua Li @ 2012-02-09 17:59 ` Tejun Heo 2012-02-09 18:07 ` Linus Torvalds 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-09 17:59 UTC (permalink / raw) To: Shaohua Li Cc: Linus Torvalds, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos Hello, On Thu, Feb 09, 2012 at 02:22:32PM +0800, Shaohua Li wrote: > >> Tried all the options, the regression still exists. Any new idea? > >> I'll spend some time on it if I can get anything > > > > Can you please try the following one? Thanks a lot! > doesn't work. > I also double confirmed b2efa05265d62 causes the regression I'll soon send a RCU based version. I'm still having trouble reproducing the regression tho. I've tried a few different things. * Heavy thrashing: Disk IO dominates everything and CPUs aren't too busy. While how swap behaves affects completion time, I don't see how CPU locking issue comes into play at all under this circumstance. * Some swap load: Simulated w/ 1G memory limit and buliding defconfig kernel in tmpfs. Swap grows to a couple hundred megabytes but build time is still dominated by CPU. I didn't see any meanginful difference before and after the commit - both in terms of wallclock and CPU times. Maybe these two were too extreme to show the problem and I need to push memory limit a bit further, but it would be great if you can give me a bit more details about your testing. * How much memory does the machine have? How is the tmpfs setup and filled up? What's the size of the tmpfs and the output of "free -m" right before test starts? While the test is running, how occupied are the CPUs? On test completion, what's the output of "free -m"? * What exactly is the test and what do you measure? What does "12% regression" mean? Is it wallclock time or CPU time? If it's CPU time, does systime increase dominate the regression? Thanks. 
-- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 17:59 ` Tejun Heo @ 2012-02-09 18:07 ` Linus Torvalds 2012-02-09 19:24 ` Tejun Heo 2012-02-10 3:09 ` Shaohua Li 0 siblings, 2 replies; 49+ messages in thread From: Linus Torvalds @ 2012-02-09 18:07 UTC (permalink / raw) To: Tejun Heo; +Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj@kernel.org> wrote: > > * What exactly is the test and what do you measure? What does "12% > regression" mean? Is it wallclock time or CPU time? If it's CPU > time, does systime increase dominate the regression? Shaohua, it might be interesting to see a profile of the bad case. Now, quite often these kinds of things don't show anything at all - it's just due to cache issues and there's no obvious "we hold spinlock X for 15 seconds total". But if it's actual lock contention rather than just "more scheduling of worker threads", it should show up in the profile quite clearly. That said, I do think the RCU approach is the right one. The whole delayed deallocation (and the replacement patch with rwlocks) really smells like "badly done RCU-like behavior" to me. Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
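The RCU approach Linus suggests is already used for the final icq free in this code (`call_rcu(&icq->__rcu_head, icq_free_icq_rcu)` in the patches above): instead of locking against readers, freeing is deferred until a grace period after which no reader can still hold a reference. A single-threaded toy model of that deferral, purely illustrative (a real grace period waits for all pre-existing RCU read-side critical sections; here `synchronize()` just runs the deferred callbacks):

```python
class ToyRCU:
    """Toy model of call_rcu()/synchronize_rcu(): frees are deferred
    until a grace period, so lockless readers that grabbed a pointer
    before the free was requested never see freed memory."""
    def __init__(self):
        self._pending = []

    def call_rcu(self, cb):
        self._pending.append(cb)       # defer, like call_rcu(&head, fn)

    def synchronize(self):
        # Stand-in for a grace period: run everything deferred so far.
        cbs, self._pending = self._pending, []
        for cb in cbs:
            cb()

freed = []
rcu = ToyRCU()
rcu.call_rcu(lambda: freed.append("icq"))
print(freed)            # -> []   (not freed yet: readers may still look)
rcu.synchronize()
print(freed)            # -> ['icq']
```

This also shows why Tejun hesitates later in the thread: whoever can still reach the object (here, request_queue via the icq lists) must itself be covered by the same deferral scheme, which is where the conditional `synchronize_rcu()` in the queue exit path comes in.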
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 18:07 ` Linus Torvalds @ 2012-02-09 19:24 ` Tejun Heo 2012-02-09 23:48 ` Tejun Heo 2012-02-10 3:09 ` Shaohua Li 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-09 19:24 UTC (permalink / raw) To: Linus Torvalds Cc: Shaohua Li, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos On Thu, Feb 09, 2012 at 10:07:35AM -0800, Linus Torvalds wrote: > On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj@kernel.org> wrote: > > > > * What exactly is the test and what do you measure? What does "12% > > regression" mean? Is it wallclock time or CPU time? If it's CPU > > time, does systime increase dominate the regression? > > Shaohua, it might be interesting to see a profile of the bad case. Yeap, if CPUs are taking more time to do stuff, it would be helpful to obtain before and after profiles. > Now, quite often these kinds of things don't show anything at all - > it's just due to cache issues and there's no obvious "we hold spinlock > X for 15 seconds total". But if it's actual lock contention rather > than just "more scheduling of worker threads", it should show up in > the profile quite clearly. Weird thing is that if it were wq, the rwlock patch should have removed the regression. It removed the reverse locking and the wq deferring. It does replace ioc locking with global readlock but it's weird that that can show up as >10% regression (whatever the measure may be). Even though kernel compiling is pretty fork/exit intensive.... > That said, I do think the RCU approach is the right one. The whole > delayed deallocation (and the replacement patch with rwlocks) really > smells like "badly done RCU-like behavior" to me. I'll probably post it in several hours and think it's gonna be pretty well contained. I probably avoided RCU too hard in that path from the original scary RCU usage. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 19:24 ` Tejun Heo @ 2012-02-09 23:48 ` Tejun Heo 2012-02-10 5:14 ` Shaohua Li 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-09 23:48 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds Hello, Shaohua. Can you please test the following one? It's probably the simplest version w/o RCU and wq deferring. RCUfying isn't too bad but I'm still a bit hesitant because RCU coverage needs to be extended to request_queue via conditional synchronize_rcu() in queue exit path (can't enforce delayed RCU free on request_queues and unconditional synchronize_rcu() may cause excessive delay during boot for certain configurations). It now can be done in the block core layer proper so it shouldn't be as bad tho. If this too flops, I'll get to that. Thanks. --- block/blk-cgroup.c | 2 block/blk-core.c | 2 block/blk-ioc.c | 112 +++++++++++++++++----------------------------- block/cfq-iosched.c | 2 fs/ioprio.c | 2 include/linux/blkdev.h | 3 + include/linux/iocontext.h | 8 +-- kernel/fork.c | 2 8 files changed, 54 insertions(+), 79 deletions(-) Index: work/block/blk-cgroup.c =================================================================== --- work.orig/block/blk-cgroup.c +++ work/block/blk-cgroup.c @@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_cgroup_changed(ioc); - put_io_context(ioc); + put_io_context(ioc, NULL); } } } Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -642,7 +642,7 @@ static inline void blk_free_request(stru if (rq->cmd_flags & REQ_ELVPRIV) { elv_put_request(q, rq); if (rq->elv.icq) - put_io_context(rq->elv.icq->ioc); + put_io_context(rq->elv.icq->ioc, q); } mempool_free(rq, q->rq.rq_pool); Index: 
work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -29,6 +29,21 @@ void get_io_context(struct io_context *i } EXPORT_SYMBOL(get_io_context); +/* + * Releasing ioc may nest into another put_io_context() leading to nested + * fast path release. As the ioc's can't be the same, this is okay but + * makes lockdep whine. Keep track of nesting and use it as subclass. + */ +#ifdef CONFIG_LOCKDEP +#define ioc_release_depth(q) ((q) ? (q)->ioc_release_depth : 0) +#define ioc_release_depth_inc(q) (q)->ioc_release_depth++ +#define ioc_release_depth_dec(q) (q)->ioc_release_depth-- +#else +#define ioc_release_depth(q) 0 +#define ioc_release_depth_inc(q) do { } while (0) +#define ioc_release_depth_dec(q) do { } while (0) +#endif + static void icq_free_icq_rcu(struct rcu_head *head) { struct io_cq *icq = container_of(head, struct io_cq, __rcu_head); @@ -71,63 +86,6 @@ static void ioc_exit_icq(struct io_cq *i call_rcu(&icq->__rcu_head, icq_free_icq_rcu); } -/* - * Slow path for ioc release in put_io_context(). Performs double-lock - * dancing to unlink all icq's and then frees ioc. - */ -static void ioc_release_fn(struct work_struct *work) -{ - struct io_context *ioc = container_of(work, struct io_context, - release_work); - struct request_queue *last_q = NULL; - - spin_lock_irq(&ioc->lock); - - while (!hlist_empty(&ioc->icq_list)) { - struct io_cq *icq = hlist_entry(ioc->icq_list.first, - struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; - - if (this_q != last_q) { - /* - * Need to switch to @this_q. Once we release - * @ioc->lock, it can go away along with @cic. - * Hold on to it. - */ - __blk_get_queue(this_q); - - /* - * blk_put_queue() might sleep thanks to kobject - * idiocy. Always release both locks, put and - * restart. 
- */ - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); - } - - last_q = this_q; - spin_lock_irq(this_q->queue_lock); - spin_lock(&ioc->lock); - continue; - } - ioc_exit_icq(icq); - } - - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); - blk_put_queue(last_q); - } else { - spin_unlock_irq(&ioc->lock); - } - - kmem_cache_free(iocontext_cachep, ioc); -} - /** * put_io_context - put a reference of io_context * @ioc: io_context to put @@ -135,7 +93,7 @@ static void ioc_release_fn(struct work_s * Decrement reference count of @ioc and release it if the count reaches * zero. */ -void put_io_context(struct io_context *ioc) +void put_io_context(struct io_context *ioc, struct request_queue *locked_q) { unsigned long flags; @@ -144,16 +102,33 @@ void put_io_context(struct io_context *i BUG_ON(atomic_long_read(&ioc->refcount) <= 0); - /* - * Releasing ioc requires reverse order double locking and we may - * already be holding a queue_lock. Do it asynchronously from wq. 
- */ - if (atomic_long_dec_and_test(&ioc->refcount)) { - spin_lock_irqsave(&ioc->lock, flags); - if (!hlist_empty(&ioc->icq_list)) - schedule_work(&ioc->release_work); - spin_unlock_irqrestore(&ioc->lock, flags); + if (!atomic_long_dec_and_test(&ioc->refcount)) + return; + + spin_lock_irqsave_nested(&ioc->lock, flags, + ioc_release_depth(locked_q)); + + while (!hlist_empty(&ioc->icq_list)) { + struct io_cq *icq = hlist_entry(ioc->icq_list.first, + struct io_cq, ioc_node); + struct request_queue *q = icq->q; + + if (q == locked_q || spin_trylock(q->queue_lock)) { + ioc_release_depth_inc(q); + ioc_exit_icq(icq); + ioc_release_depth_dec(q); + + if (q != locked_q) + spin_unlock(q->queue_lock); + } else { + spin_unlock_irqrestore(&ioc->lock, flags); + cpu_relax(); + spin_lock_irqsave_nested(&ioc->lock, flags, + ioc_release_depth(locked_q)); + } } + + spin_unlock_irqrestore(&ioc->lock, flags); } EXPORT_SYMBOL(put_io_context); @@ -168,7 +143,7 @@ void exit_io_context(struct task_struct task_unlock(task); atomic_dec(&ioc->nr_tasks); - put_io_context(ioc); + put_io_context(ioc, NULL); } /** @@ -208,7 +183,6 @@ void create_io_context_slowpath(struct t spin_lock_init(&ioc->lock); INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH); INIT_HLIST_HEAD(&ioc->icq_list); - INIT_WORK(&ioc->release_work, ioc_release_fn); /* * Try to install. 
ioc shouldn't be installed if someone else Index: work/block/cfq-iosched.c =================================================================== --- work.orig/block/cfq-iosched.c +++ work/block/cfq-iosched.c @@ -1787,7 +1787,7 @@ __cfq_slice_expired(struct cfq_data *cfq cfqd->active_queue = NULL; if (cfqd->active_cic) { - put_io_context(cfqd->active_cic->icq.ioc); + put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue); cfqd->active_cic = NULL; } } Index: work/fs/ioprio.c =================================================================== --- work.orig/fs/ioprio.c +++ work/fs/ioprio.c @@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct * ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_ioprio_changed(ioc, ioprio); - put_io_context(ioc); + put_io_context(ioc, NULL); } return err; Index: work/include/linux/iocontext.h =================================================================== --- work.orig/include/linux/iocontext.h +++ work/include/linux/iocontext.h @@ -3,7 +3,6 @@ #include <linux/radix-tree.h> #include <linux/rcupdate.h> -#include <linux/workqueue.h> enum { ICQ_IOPRIO_CHANGED, @@ -113,8 +112,6 @@ struct io_context { struct radix_tree_root icq_tree; struct io_cq __rcu *icq_hint; struct hlist_head icq_list; - - struct work_struct release_work; }; static inline struct io_context *ioc_task_link(struct io_context *ioc) @@ -133,7 +130,7 @@ static inline struct io_context *ioc_tas struct task_struct; #ifdef CONFIG_BLOCK -void put_io_context(struct io_context *ioc); +void put_io_context(struct io_context *ioc, struct request_queue *locked_q); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); @@ -141,7 +138,8 @@ void ioc_ioprio_changed(struct io_contex void ioc_cgroup_changed(struct io_context *ioc); #else struct io_context; -static inline void put_io_context(struct io_context *ioc) { } +static inline void put_io_context(struct io_context *ioc, + 
struct request_queue *locked_q) { } static inline void exit_io_context(struct task_struct *task) { } #endif Index: work/kernel/fork.c =================================================================== --- work.orig/kernel/fork.c +++ work/kernel/fork.c @@ -890,7 +890,7 @@ static int copy_io(unsigned long clone_f return -ENOMEM; new_ioc->ioprio = ioc->ioprio; - put_io_context(new_ioc); + put_io_context(new_ioc, NULL); } #endif return 0; Index: work/include/linux/blkdev.h =================================================================== --- work.orig/include/linux/blkdev.h +++ work/include/linux/blkdev.h @@ -399,6 +399,9 @@ struct request_queue { /* Throttle data */ struct throtl_data *td; #endif +#ifdef CONFIG_LOCKDEP + int ioc_release_depth; +#endif }; #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ ^ permalink raw reply [flat|nested] 49+ messages in thread
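The core of the patch above is a classic deadlock-avoidance pattern: when a lock must be taken against the canonical order (here, ioc->lock is already held but queue_lock normally comes first), use trylock and back off on failure instead of blocking. A minimal userspace C sketch of that loop follows; the spinlock type is a stand-in built on C11 atomics, not the kernel's spinlock_t, and the bounded retry count is added so the sketch terminates (the kernel loop retries indefinitely with cpu_relax()).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal spinlock on a C11 atomic_flag, standing in for spinlock_t. */
typedef struct { atomic_flag locked; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static bool spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/*
 * Reverse double locking as in the put_io_context() loop above: ioc_lock
 * is taken unconditionally, queue_lock only via trylock.  On failure,
 * drop ioc_lock so the other side can make progress, then retry.
 * Returns the number of failed rounds before success, or -1 if
 * max_tries ran out (both locks released in that case).
 */
static int lock_both(spinlock_t *ioc_lock, spinlock_t *queue_lock,
		     int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		while (!spin_trylock(ioc_lock))
			;			/* spin_lock(ioc_lock) */
		if (spin_trylock(queue_lock))
			return i;		/* both held; caller unlocks */
		spin_unlock(ioc_lock);		/* back off, avoid deadlock */
	}
	return -1;
}
```

The lockdep subclass bookkeeping in the patch (ioc_release_depth) exists only because this release path can nest into another put_io_context(); it has no analogue in this single-threaded sketch.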
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 23:48 ` Tejun Heo @ 2012-02-10 5:14 ` Shaohua Li 2012-02-10 8:48 ` Shaohua Li 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-10 5:14 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds 2012/2/10 Tejun Heo <tj@kernel.org>: > Hello, Shaohua. > > Can you please test the following one? It's probably the simplest > version w/o RCU and wq deferring. RCUfying isn't too bad but I'm > still a bit hesitant because RCU coverage needs to be extended to > request_queue via conditional synchronize_rcu() in queue exit path > (can't enforce delayed RCU free on request_queues and unconditional > synchronize_rcu() may cause excessive delay during boot for certain > configurations). It now can be done in the block core layer proper so > it shouldn't be as bad tho. If this too flops, I'll get to that. doesn't work. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-10 5:14 ` Shaohua Li @ 2012-02-10 8:48 ` Shaohua Li 2012-02-11 2:17 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-10 8:48 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds 2012/2/10 Shaohua Li <shli@kernel.org>: > 2012/2/10 Tejun Heo <tj@kernel.org>: >> Hello, Shaohua. >> >> Can you please test the following one? It's probably the simplest >> version w/o RCU and wq deferring. RCUfying isn't too bad but I'm >> still a bit hesitant because RCU coverage needs to be extended to >> request_queue via conditional synchronize_rcu() in queue exit path >> (can't enforce delayed RCU free on request_queues and unconditional >> synchronize_rcu() may cause excessive delay during boot for certain >> configurations). It now can be done in the block core layer proper so >> it shouldn't be as bad tho. If this too flops, I'll get to that. > doesn't work. I added a trace in the schedule_work code path of put_io_context, which runs very rarely. So it's not lock contention for sure. It sounds like the only difference between the good/bad cases is that the good case runs with rcu_read_lock/rcu_read_unlock. I also checked slab info; the cfq-related slab doesn't use too much memory, so it's unlikely that rcu latency is tying up too much memory. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-10 8:48 ` Shaohua Li @ 2012-02-11 2:17 ` Tejun Heo 2012-02-11 11:35 ` Jens Axboe 2012-02-13 1:34 ` Shaohua Li 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-11 2:17 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds Hello, On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote: > >> Can you please test the following one? It's probably the simplest > >> version w/o RCU and wq deferring. RCUfying isn't too bad but I'm > >> still a bit hesitant because RCU coverage needs to be extended to > >> request_queue via conditional synchronize_rcu() in queue exit path > >> (can't enforce delayed RCU free on request_queues and unconditional > >> synchronize_rcu() may cause excessive delay during boot for certain > >> configurations). It now can be done in the block core layer proper so > >> it shouldn't be as bad tho. If this too flops, I'll get to that. > > doesn't work. > I added trace in the schedule_work code path of put_io_context, which > runs very rare. So it's not lock contention for sure. > Sounds the only difference between the good/bad cases is the good > case runs with rcu_lock_read/rcu_read_unlock. I also checked slab > info, the cfq related slab doesn't use too many memory, unlikely > because rcu latency uses too many memory. Yeah, that makes much more sense. It just isn't a hot enough path for this sort of micro locking change to matter. I think the problem is that, after the change, the cfqqs aren't being expired immediately on task exit. ie. while moving the cic destruction to the release path, I accidentally removed the exit notification to cfq. I'll come up with a fix. Thank you! -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-11 2:17 ` Tejun Heo @ 2012-02-11 11:35 ` Jens Axboe 2012-02-13 1:34 ` Shaohua Li 1 sibling, 0 replies; 49+ messages in thread From: Jens Axboe @ 2012-02-11 11:35 UTC (permalink / raw) To: Tejun Heo Cc: Shaohua Li, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds On 2012-02-11 03:17, Tejun Heo wrote: > Hello, > > On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote: >>>> Can you please test the following one? It's probably the simplest >>>> version w/o RCU and wq deferring. RCUfying isn't too bad but I'm >>>> still a bit hesitant because RCU coverage needs to be extended to >>>> request_queue via conditional synchronize_rcu() in queue exit path >>>> (can't enforce delayed RCU free on request_queues and unconditional >>>> synchronize_rcu() may cause excessive delay during boot for certain >>>> configurations). It now can be done in the block core layer proper so >>>> it shouldn't be as bad tho. If this too flops, I'll get to that. >>> doesn't work. >> I added trace in the schedule_work code path of put_io_context, which >> runs very rare. So it's not lock contention for sure. >> Sounds the only difference between the good/bad cases is the good >> case runs with rcu_lock_read/rcu_read_unlock. I also checked slab >> info, the cfq related slab doesn't use too many memory, unlikely >> because rcu latency uses too many memory. > > Yeah, that makes much more sense. It just isn't hot enough path for > this sort of micro locking changes to matter. I think the problem is > that, after the change, the cfqq aren't being expired immediately on > task exit. ie. While moving the cic destruction to release path, I > accidentally removed exit notification to cfq. I'll come up with a > fix. Was just thinking about that last night, the missing slice expire on task exit makes a LOT more sense than changed locking. I'm pushing off what I have to Linus today, since I'll be gone skiing next week. 
I will check email regularly and be able to apply patches and so forth, just a heads up on availability. -- Jens Axboe ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-11 2:17 ` Tejun Heo 2012-02-11 11:35 ` Jens Axboe @ 2012-02-13 1:34 ` Shaohua Li 2012-02-13 20:49 ` Tejun Heo 1 sibling, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-13 1:34 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds 2012/2/11 Tejun Heo <tj@kernel.org>: > Hello, > > On Fri, Feb 10, 2012 at 04:48:49PM +0800, Shaohua Li wrote: >> >> Can you please test the following one? It's probably the simplest >> >> version w/o RCU and wq deferring. RCUfying isn't too bad but I'm >> >> still a bit hesitant because RCU coverage needs to be extended to >> >> request_queue via conditional synchronize_rcu() in queue exit path >> >> (can't enforce delayed RCU free on request_queues and unconditional >> >> synchronize_rcu() may cause excessive delay during boot for certain >> >> configurations). It now can be done in the block core layer proper so >> >> it shouldn't be as bad tho. If this too flops, I'll get to that. >> > doesn't work. >> I added trace in the schedule_work code path of put_io_context, which >> runs very rare. So it's not lock contention for sure. >> Sounds the only difference between the good/bad cases is the good >> case runs with rcu_lock_read/rcu_read_unlock. I also checked slab >> info, the cfq related slab doesn't use too many memory, unlikely >> because rcu latency uses too many memory. > > Yeah, that makes much more sense. It just isn't hot enough path for > this sort of micro locking changes to matter. I think the problem is > that, after the change, the cfqq aren't being expired immediately on > task exit. ie. While moving the cic destruction to release path, I > accidentally removed exit notification to cfq. I'll come up with a > fix. ah, I felt a little strange looking at exit_io_context, but didn't realize the exit notification is removed (confused by put_io_context). This makes a lot of sense. 
Good catch! ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-13 1:34 ` Shaohua Li @ 2012-02-13 20:49 ` Tejun Heo 2012-02-14 2:36 ` Shaohua Li 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-13 20:49 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds Hello, Shaohua. Can you please test the following patch? It's combination of three patches which invokes elevator icq exit from exit_io_context(). This unfortunately ends up adding another reverse locking loop and using RCU could be better; unfortunately, the change isn't trivial due to q->queue_lock modification during blk_cleanup_queue() and ioc cleanup being called after that from blk_release_queue() - IOW, while holding RCU, we might end up grabbing the wrong q lock (I don't think this is a new problem). Now that we have proper request draining on queue exit, we can probably move ioc clearing and other operations to blk_cleanup_queue() and then apply RCU, but that's for another merge window. Thanks. Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -29,6 +29,21 @@ void get_io_context(struct io_context *i } EXPORT_SYMBOL(get_io_context); +/* + * Exiting ioc may nest into another put_io_context() leading to nested + * fast path release. As the ioc's can't be the same, this is okay but + * makes lockdep whine. Keep track of nesting and use it as subclass. + */ +#ifdef CONFIG_LOCKDEP +#define ioc_exit_depth(q) ((q) ? 
(q)->ioc_exit_depth : 0) +#define ioc_exit_depth_inc(q) (q)->ioc_exit_depth++ +#define ioc_exit_depth_dec(q) (q)->ioc_exit_depth-- +#else +#define ioc_exit_depth(q) 0 +#define ioc_exit_depth_inc(q) do { } while (0) +#define ioc_exit_depth_dec(q) do { } while (0) +#endif + static void icq_free_icq_rcu(struct rcu_head *head) { struct io_cq *icq = container_of(head, struct io_cq, __rcu_head); @@ -36,17 +51,31 @@ static void icq_free_icq_rcu(struct rcu_ kmem_cache_free(icq->__rcu_icq_cache, icq); } -/* - * Exit and free an icq. Called with both ioc and q locked. - */ +/* Exit an icq. Called with both ioc and q locked. */ static void ioc_exit_icq(struct io_cq *icq) { + struct elevator_type *et = icq->q->elevator->type; + + if (icq->flags & ICQ_EXITED) + return; + + if (et->ops.elevator_exit_icq_fn) { + ioc_exit_depth_inc(icq->q); + et->ops.elevator_exit_icq_fn(icq); + ioc_exit_depth_dec(icq->q); + } + + icq->flags |= ICQ_EXITED; +} + +/* Release an icq. Called with both ioc and q locked. */ +static void ioc_destroy_icq(struct io_cq *icq) +{ struct io_context *ioc = icq->ioc; - struct request_queue *q = icq->q; - struct elevator_type *et = q->elevator->type; + struct elevator_type *et = icq->q->elevator->type; lockdep_assert_held(&ioc->lock); - lockdep_assert_held(q->queue_lock); + lockdep_assert_held(icq->q->queue_lock); radix_tree_delete(&ioc->icq_tree, icq->q->id); hlist_del_init(&icq->ioc_node); @@ -60,8 +89,7 @@ static void ioc_exit_icq(struct io_cq *i if (rcu_dereference_raw(ioc->icq_hint) == icq) rcu_assign_pointer(ioc->icq_hint, NULL); - if (et->ops.elevator_exit_icq_fn) - et->ops.elevator_exit_icq_fn(icq); + ioc_exit_icq(icq); /* * @icq->q might have gone away by the time RCU callback runs @@ -71,70 +99,6 @@ static void ioc_exit_icq(struct io_cq *i call_rcu(&icq->__rcu_head, icq_free_icq_rcu); } -/* - * Slow path for ioc release in put_io_context(). Performs double-lock - * dancing to unlink all icq's and then frees ioc. 
- */ -static void ioc_release_fn(struct work_struct *work) -{ - struct io_context *ioc = container_of(work, struct io_context, - release_work); - struct request_queue *last_q = NULL; - unsigned long flags; - - /* - * Exiting icq may call into put_io_context() through elevator - * which will trigger lockdep warning. The ioc's are guaranteed to - * be different, use a different locking subclass here. Use - * irqsave variant as there's no spin_lock_irq_nested(). - */ - spin_lock_irqsave_nested(&ioc->lock, flags, 1); - - while (!hlist_empty(&ioc->icq_list)) { - struct io_cq *icq = hlist_entry(ioc->icq_list.first, - struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; - - if (this_q != last_q) { - /* - * Need to switch to @this_q. Once we release - * @ioc->lock, it can go away along with @cic. - * Hold on to it. - */ - __blk_get_queue(this_q); - - /* - * blk_put_queue() might sleep thanks to kobject - * idiocy. Always release both locks, put and - * restart. - */ - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irqrestore(&ioc->lock, flags); - blk_put_queue(last_q); - } else { - spin_unlock_irqrestore(&ioc->lock, flags); - } - - last_q = this_q; - spin_lock_irqsave(this_q->queue_lock, flags); - spin_lock_nested(&ioc->lock, 1); - continue; - } - ioc_exit_icq(icq); - } - - if (last_q) { - spin_unlock(last_q->queue_lock); - spin_unlock_irqrestore(&ioc->lock, flags); - blk_put_queue(last_q); - } else { - spin_unlock_irqrestore(&ioc->lock, flags); - } - - kmem_cache_free(iocontext_cachep, ioc); -} - /** * put_io_context - put a reference of io_context * @ioc: io_context to put @@ -142,7 +106,7 @@ static void ioc_release_fn(struct work_s * Decrement reference count of @ioc and release it if the count reaches * zero. 
*/ -void put_io_context(struct io_context *ioc) +void put_io_context(struct io_context *ioc, struct request_queue *locked_q) { unsigned long flags; @@ -151,16 +115,33 @@ void put_io_context(struct io_context *i BUG_ON(atomic_long_read(&ioc->refcount) <= 0); + if (!atomic_long_dec_and_test(&ioc->refcount)) + return; + /* - * Releasing ioc requires reverse order double locking and we may - * already be holding a queue_lock. Do it asynchronously from wq. + * Need to grab both q and ioc locks from ioc side to release all + * icqs. Perform reverse double locking. */ - if (atomic_long_dec_and_test(&ioc->refcount)) { - spin_lock_irqsave(&ioc->lock, flags); - if (!hlist_empty(&ioc->icq_list)) - schedule_work(&ioc->release_work); - spin_unlock_irqrestore(&ioc->lock, flags); + spin_lock_irqsave_nested(&ioc->lock, flags, ioc_exit_depth(locked_q)); + + while (!hlist_empty(&ioc->icq_list)) { + struct io_cq *icq = hlist_entry(ioc->icq_list.first, + struct io_cq, ioc_node); + struct request_queue *q = icq->q; + + if (q == locked_q || spin_trylock(q->queue_lock)) { + ioc_destroy_icq(icq); + if (q != locked_q) + spin_unlock(q->queue_lock); + } else { + spin_unlock_irqrestore(&ioc->lock, flags); + cpu_relax(); + spin_lock_irqsave_nested(&ioc->lock, flags, + ioc_exit_depth(locked_q)); + } } + + spin_unlock_irqrestore(&ioc->lock, flags); } EXPORT_SYMBOL(put_io_context); @@ -168,14 +149,40 @@ EXPORT_SYMBOL(put_io_context); void exit_io_context(struct task_struct *task) { struct io_context *ioc; + struct io_cq *icq; + struct hlist_node *n; task_lock(task); ioc = task->io_context; task->io_context = NULL; task_unlock(task); - atomic_dec(&ioc->nr_tasks); - put_io_context(ioc); + if (!atomic_dec_and_test(&ioc->nr_tasks)) { + put_io_context(ioc, NULL); + return; + } + + /* + * Need ioc lock to walk icq_list and q lock to exit icq. Perform + * reverse double locking. 
+ */ +retry: + spin_lock_irq(&ioc->lock); + hlist_for_each_entry(icq, n, &ioc->icq_list, ioc_node) { + if (icq->flags & ICQ_EXITED) + continue; + if (spin_trylock(icq->q->queue_lock)) { + ioc_exit_icq(icq); + spin_unlock(icq->q->queue_lock); + } else { + spin_unlock_irq(&ioc->lock); + cpu_relax(); + goto retry; + } + } + spin_unlock_irq(&ioc->lock); + + put_io_context(ioc, NULL); } /** @@ -194,7 +201,7 @@ void ioc_clear_queue(struct request_queu struct io_context *ioc = icq->ioc; spin_lock(&ioc->lock); - ioc_exit_icq(icq); + ioc_destroy_icq(icq); spin_unlock(&ioc->lock); } } @@ -215,7 +222,6 @@ void create_io_context_slowpath(struct t spin_lock_init(&ioc->lock); INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH); INIT_HLIST_HEAD(&ioc->icq_list); - INIT_WORK(&ioc->release_work, ioc_release_fn); /* * Try to install. ioc shouldn't be installed if someone else @@ -363,13 +369,13 @@ struct io_cq *ioc_create_icq(struct requ return icq; } -void ioc_set_changed(struct io_context *ioc, int which) +void ioc_set_icq_flags(struct io_context *ioc, unsigned int flags) { struct io_cq *icq; struct hlist_node *n; hlist_for_each_entry(icq, n, &ioc->icq_list, ioc_node) - set_bit(which, &icq->changed); + icq->flags |= flags; } /** @@ -387,7 +393,7 @@ void ioc_ioprio_changed(struct io_contex spin_lock_irqsave(&ioc->lock, flags); ioc->ioprio = ioprio; - ioc_set_changed(ioc, ICQ_IOPRIO_CHANGED); + ioc_set_icq_flags(ioc, ICQ_IOPRIO_CHANGED); spin_unlock_irqrestore(&ioc->lock, flags); } @@ -404,11 +410,33 @@ void ioc_cgroup_changed(struct io_contex unsigned long flags; spin_lock_irqsave(&ioc->lock, flags); - ioc_set_changed(ioc, ICQ_CGROUP_CHANGED); + ioc_set_icq_flags(ioc, ICQ_CGROUP_CHANGED); spin_unlock_irqrestore(&ioc->lock, flags); } EXPORT_SYMBOL(ioc_cgroup_changed); +/** + * icq_get_changed - fetch and clear icq changed mask + * @icq: icq of interest + * + * Fetch and clear ICQ_*_CHANGED bits from @icq. Grabs and releases + * @icq->ioc->lock. 
+ */ +unsigned icq_get_changed(struct io_cq *icq) +{ + unsigned int changed = 0; + unsigned long flags; + + if (unlikely(icq->flags & ICQ_CHANGED_MASK)) { + spin_lock_irqsave(&icq->ioc->lock, flags); + changed = icq->flags & ICQ_CHANGED_MASK; + icq->flags &= ~ICQ_CHANGED_MASK; + spin_unlock_irqrestore(&icq->ioc->lock, flags); + } + return changed; +} +EXPORT_SYMBOL(icq_get_changed); + static int __init blk_ioc_init(void) { iocontext_cachep = kmem_cache_create("blkdev_ioc", Index: work/block/cfq-iosched.c =================================================================== --- work.orig/block/cfq-iosched.c +++ work/block/cfq-iosched.c @@ -1787,7 +1787,7 @@ __cfq_slice_expired(struct cfq_data *cfq cfqd->active_queue = NULL; if (cfqd->active_cic) { - put_io_context(cfqd->active_cic->icq.ioc); + put_io_context(cfqd->active_cic->icq.ioc, cfqd->queue); cfqd->active_cic = NULL; } } @@ -3470,20 +3470,20 @@ cfq_set_request(struct request_queue *q, const int rw = rq_data_dir(rq); const bool is_sync = rq_is_sync(rq); struct cfq_queue *cfqq; + unsigned int changed; might_sleep_if(gfp_mask & __GFP_WAIT); spin_lock_irq(q->queue_lock); /* handle changed notifications */ - if (unlikely(cic->icq.changed)) { - if (test_and_clear_bit(ICQ_IOPRIO_CHANGED, &cic->icq.changed)) - changed_ioprio(cic); + changed = icq_get_changed(&cic->icq); + if (unlikely(changed & ICQ_IOPRIO_CHANGED)) + changed_ioprio(cic); #ifdef CONFIG_CFQ_GROUP_IOSCHED - if (test_and_clear_bit(ICQ_CGROUP_CHANGED, &cic->icq.changed)) - changed_cgroup(cic); + if (unlikely(changed & ICQ_CGROUP_CHANGED)) + changed_cgroup(cic); #endif - } new_queue: cfqq = cic_to_cfqq(cic, is_sync); Index: work/include/linux/iocontext.h =================================================================== --- work.orig/include/linux/iocontext.h +++ work/include/linux/iocontext.h @@ -3,11 +3,13 @@ #include <linux/radix-tree.h> #include <linux/rcupdate.h> -#include <linux/workqueue.h> enum { - ICQ_IOPRIO_CHANGED, - ICQ_CGROUP_CHANGED, + 
ICQ_IOPRIO_CHANGED = 1 << 0, + ICQ_CGROUP_CHANGED = 1 << 1, + ICQ_EXITED = 1 << 2, + + ICQ_CHANGED_MASK = ICQ_IOPRIO_CHANGED | ICQ_CGROUP_CHANGED, }; /* @@ -88,7 +90,8 @@ struct io_cq { struct rcu_head __rcu_head; }; - unsigned long changed; + /* ICQ_*, use atomic bitops */ + unsigned long flags; }; /* @@ -113,8 +116,6 @@ struct io_context { struct radix_tree_root icq_tree; struct io_cq __rcu *icq_hint; struct hlist_head icq_list; - - struct work_struct release_work; }; static inline struct io_context *ioc_task_link(struct io_context *ioc) @@ -133,15 +134,17 @@ static inline struct io_context *ioc_tas struct task_struct; #ifdef CONFIG_BLOCK -void put_io_context(struct io_context *ioc); +void put_io_context(struct io_context *ioc, struct request_queue *locked_q); void exit_io_context(struct task_struct *task); struct io_context *get_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node); void ioc_ioprio_changed(struct io_context *ioc, int ioprio); void ioc_cgroup_changed(struct io_context *ioc); +unsigned int icq_get_changed(struct io_cq *icq); #else struct io_context; -static inline void put_io_context(struct io_context *ioc) { } +static inline void put_io_context(struct io_context *ioc, + struct request_queue *locked_q) { } static inline void exit_io_context(struct task_struct *task) { } #endif Index: work/block/blk-cgroup.c =================================================================== --- work.orig/block/blk-cgroup.c +++ work/block/blk-cgroup.c @@ -1659,7 +1659,7 @@ static void blkiocg_attach(struct cgroup ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_cgroup_changed(ioc); - put_io_context(ioc); + put_io_context(ioc, NULL); } } } Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -642,7 +642,7 @@ static inline void blk_free_request(stru if (rq->cmd_flags & REQ_ELVPRIV) { elv_put_request(q, rq); if (rq->elv.icq) - 
put_io_context(rq->elv.icq->ioc); + put_io_context(rq->elv.icq->ioc, q); } mempool_free(rq, q->rq.rq_pool); Index: work/fs/ioprio.c =================================================================== --- work.orig/fs/ioprio.c +++ work/fs/ioprio.c @@ -51,7 +51,7 @@ int set_task_ioprio(struct task_struct * ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); if (ioc) { ioc_ioprio_changed(ioc, ioprio); - put_io_context(ioc); + put_io_context(ioc, NULL); } return err; Index: work/kernel/fork.c =================================================================== --- work.orig/kernel/fork.c +++ work/kernel/fork.c @@ -910,7 +910,7 @@ static int copy_io(unsigned long clone_f return -ENOMEM; new_ioc->ioprio = ioc->ioprio; - put_io_context(new_ioc); + put_io_context(new_ioc, NULL); } #endif return 0; Index: work/include/linux/blkdev.h =================================================================== --- work.orig/include/linux/blkdev.h +++ work/include/linux/blkdev.h @@ -399,6 +399,9 @@ struct request_queue { /* Throttle data */ struct throtl_data *td; #endif +#ifdef CONFIG_LOCKDEP + int ioc_exit_depth; +#endif }; #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ ^ permalink raw reply [flat|nested] 49+ messages in thread
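One piece of this combined patch easy to show in isolation is the changed-flags handling: the old per-bit atomic test_and_clear on icq->changed becomes a plain flags word protected by ioc->lock, fetched and cleared in one shot by icq_get_changed(). The mask logic, with the bit values taken from the patch, is sketched below in single-threaded C; the real function takes ioc->lock around the read-modify-write, and the struct here is cut down to just the flags field.

```c
/* Flag bits as defined in the patch's iocontext.h hunk. */
enum {
	ICQ_IOPRIO_CHANGED	= 1 << 0,
	ICQ_CGROUP_CHANGED	= 1 << 1,
	ICQ_EXITED		= 1 << 2,

	ICQ_CHANGED_MASK	= ICQ_IOPRIO_CHANGED | ICQ_CGROUP_CHANGED,
};

/* Reduced io_cq: only the flags word matters for this sketch. */
struct io_cq { unsigned long flags; };

/* Fetch-and-clear of the CHANGED bits, mirroring icq_get_changed().
 * ICQ_EXITED is deliberately left untouched by the mask. */
static unsigned icq_get_changed(struct io_cq *icq)
{
	unsigned changed = icq->flags & ICQ_CHANGED_MASK;

	icq->flags &= ~ICQ_CHANGED_MASK;
	return changed;
}
```

The payoff in cfq_set_request() is that one locked fetch replaces two test_and_clear_bit() calls, and the common "nothing changed" case stays lock-free behind the unlikely() check.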
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-13 20:49 ` Tejun Heo @ 2012-02-14 2:36 ` Shaohua Li 2012-02-14 16:39 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-14 2:36 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds 2012/2/14 Tejun Heo <tj@kernel.org>: > Hello, Shaohua. > > Can you please test the following patch? It's combination of three > patches which invokes elevator icq exit from exit_io_context(). This > unfortunately ends up adding another reverse locking loop and using > RCU could be better; unfortunately, the change isn't trivial due to > q->queue_lock modification during blk_cleanup_queue() and ioc cleanup > being called after that from blk_release_queue() - IOW, while holding > RCU, we might end up grabbing the wrong q lock (I don't think this is > a new problem). > > Now that we have proper request draining on queue exit, we can > probably move ioc clearing and other operations to blk_cleanup_queue() > and then apply RCU, but that's for another merge window. This fixed the regression. Thanks! ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-14 2:36 ` Shaohua Li @ 2012-02-14 16:39 ` Tejun Heo 0 siblings, 0 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-14 16:39 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos, Linus Torvalds Hello, On Tue, Feb 14, 2012 at 10:36:36AM +0800, Shaohua Li wrote: > 2012/2/14 Tejun Heo <tj@kernel.org>: > > Now that we have proper request draining on queue exit, we can > > probably move ioc clearing and other operations to blk_cleanup_queue() > > and then apply RCU, but that's for another merge window. > > This fixed the regression. Thanks! Ah... that's a good news. I'll split up the patches and post them. Thank you very much for the bisection and testing all the different patches! -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH] block: strip out locking optimization in put_io_context() 2012-02-09 18:07 ` Linus Torvalds 2012-02-09 19:24 ` Tejun Heo @ 2012-02-10 3:09 ` Shaohua Li 1 sibling, 0 replies; 49+ messages in thread From: Shaohua Li @ 2012-02-10 3:09 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Jens Axboe, Vivek Goyal, lkml, Knut Petersen, mroos 2012/2/10 Linus Torvalds <torvalds@linux-foundation.org>: > On Thu, Feb 9, 2012 at 9:59 AM, Tejun Heo <tj@kernel.org> wrote: >> >> * What exactly is the test and what do you measure? What does "12% >> regression" mean? Is it wallclock time or CPU time? If it's CPU >> time, does systime increase dominate the regression? > > Shaohua, it might be interesting to see a profile of the bad case. > > Now, quite often these kinds of things don't show anything at all - > it's just due to cache issues and there's no obvious "we hold spinlock > X for 15 seconds total". But if it's actual lock contention rather > than just "more scheduling of worker threads", it should show up in > the profile quite clearly. > > That said, I do think the RCU approach is the right one. The whole > delayed deallocation (and the replacement patch with rwlocks) really > smells like "badly done RCU-like behavior" to me. Appears not a lock contention issue. The system is quite idle, about 20% busy. And top shows no cpu is very busy. Before test runs, the system has about 2G free memory (from vmstat) system and user time isn't changed. only real time becomes longer. This suggests IO is slower or there is more IO. But vmstat and iostat data doesn't show significant difference between the good and bad cases. There might be some access pattern changed which makes swap no efficient or working set is wrongly swaped out. ^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH] block: fix lockdep warning on io_context release put_io_context() 2012-02-07 6:49 ` Jens Axboe 2012-02-07 16:22 ` Tejun Heo @ 2012-02-07 23:00 ` Tejun Heo 1 sibling, 0 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-07 23:00 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Vivek Goyal, Shaohua Li, lkml, Knut Petersen, mroos 11a3122f6c "block: strip out locking optimization in put_io_context()" removed ioc_lock depth lockdep annotation along with locking optimization; however, while recursing from put_io_context() is no longer possible, ioc_release_fn() may still end up putting the last reference of another ioc through elevator, which will grab ioc->lock triggering a spurious (as the ioc is always a different one) A-A deadlock warning. As this can only happen one time from ioc_release_fn(), using non-zero subclass from ioc_release_fn() is enough. Use subclass 1. Signed-off-by: Tejun Heo <tj@kernel.org> --- Jens, this one should go with the previous patch to avoid triggering spurious lockdep warning. Thanks. block/blk-ioc.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -80,8 +80,15 @@ static void ioc_release_fn(struct work_s struct io_context *ioc = container_of(work, struct io_context, release_work); struct request_queue *last_q = NULL; + unsigned long flags; - spin_lock_irq(&ioc->lock); + /* + * Exiting icq may call into put_io_context() through elevator + * which will trigger lockdep warning. The ioc's are guaranteed to + * be different, use a different locking subclass here. Use + * irqsave variant as there's no spin_lock_irq_nested().
+ */ + spin_lock_irqsave_nested(&ioc->lock, flags, 1); while (!hlist_empty(&ioc->icq_list)) { struct io_cq *icq = hlist_entry(ioc->icq_list.first, @@ -103,15 +110,15 @@ static void ioc_release_fn(struct work_s */ if (last_q) { spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); + spin_unlock_irqrestore(&ioc->lock, flags); blk_put_queue(last_q); } else { - spin_unlock_irq(&ioc->lock); + spin_unlock_irqrestore(&ioc->lock, flags); } last_q = this_q; - spin_lock_irq(this_q->queue_lock); - spin_lock(&ioc->lock); + spin_lock_irqsave(this_q->queue_lock, flags); + spin_lock_nested(&ioc->lock, 1); continue; } ioc_exit_icq(icq); @@ -119,10 +126,10 @@ static void ioc_release_fn(struct work_s if (last_q) { spin_unlock(last_q->queue_lock); - spin_unlock_irq(&ioc->lock); + spin_unlock_irqrestore(&ioc->lock, flags); blk_put_queue(last_q); } else { - spin_unlock_irq(&ioc->lock); + spin_unlock_irqrestore(&ioc->lock, flags); } kmem_cache_free(iocontext_cachep, ioc); ^ permalink raw reply [flat|nested] 49+ messages in thread
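The subclass trick in the patch above can be illustrated with a toy model of the lockdep check being silenced. lockdep keys on (lock class, subclass), not on lock instances, so acquiring the `ioc->lock` class again while one `ioc->lock` is held looks like an A-A deadlock even though the two iocs are guaranteed to be distinct objects. All names below are illustrative, not kernel APIs:

```c
#include <assert.h>

/* Toy model of the lockdep A-A check the patch above is silencing.
 * lockdep tracks (lock class, subclass) pairs, not lock instances,
 * so re-acquiring the ioc->lock class while another ioc->lock is
 * held matches the deadlock pattern even though the objects differ.
 * All names here are illustrative. */

static int held_subclass = -1;  /* subclass of the ioc->lock currently held */

/* Returns 1 if the model would flag an A-A warning for taking the
 * ioc->lock class at `subclass` while `held_subclass` is held. */
static int ioc_lock_would_warn(int subclass)
{
    return held_subclass == subclass;
}

/* ioc_release_fn() holds its ioc->lock at `release_subclass`; exiting
 * an icq can recurse through the elevator into put_io_context(),
 * which takes a *different* ioc's lock at the default subclass 0. */
static int release_fn_then_put(int release_subclass)
{
    held_subclass = release_subclass;  /* spin_lock_irqsave_nested(..., n) */
    return ioc_lock_would_warn(0);     /* nested put: plain spin_lock()    */
}
```

With both acquisitions at subclass 0 the model warns; bumping `ioc_release_fn()` to subclass 1, as the patch does, makes the pairs distinct and the warning disappears.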
* Re: [patch]block: fix ioc locking warning 2012-02-06 17:27 ` Tejun Heo 2012-02-06 20:16 ` Jens Axboe @ 2012-02-06 20:36 ` Tejun Heo 2012-02-07 0:31 ` Shaohua Li 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-06 20:36 UTC (permalink / raw) To: Linus Torvalds Cc: Vivek Goyal, Jens Axboe, Shaohua Li, lkml, Knut Petersen, mroos On Mon, Feb 06, 2012 at 09:27:06AM -0800, Tejun Heo wrote: > It's one wq scheduling on exit for any task which has issued an IO. I > don't think it would matter except for task fork/exit microbenchs (or > workloads which approximate to that). I'll get some measurements and > strip the optimization if it doesn't really show up. I'm still playing with test methods and getting numbers but the following is the simplified one of the three setups I'm playing with - the current one, simplified and no optimization. There *seems* to be appreciable performance degradation on fork/exit w/ ioc microbenchs so I'm likely to go with the following. I'll post when I know more. Thanks. --- block/blk-ioc.c | 50 +++++++++++++------------------------------------- 1 file changed, 13 insertions(+), 37 deletions(-) Index: work/block/blk-ioc.c =================================================================== --- work.orig/block/blk-ioc.c +++ work/block/blk-ioc.c @@ -158,7 +158,6 @@ static void ioc_release_fn(struct work_s */ void put_io_context(struct io_context *ioc, struct request_queue *locked_q) { - struct request_queue *last_q = locked_q; unsigned long flags; if (ioc == NULL) @@ -171,50 +170,27 @@ void put_io_context(struct io_context *i if (!atomic_long_dec_and_test(&ioc->refcount)) return; - /* - * Destroy @ioc. This is a bit messy because icq's are chained - * from both ioc and queue, and ioc->lock nests inside queue_lock. - * The inner ioc->lock should be held to walk our icq_list and then - * for each icq the outer matching queue_lock should be grabbed. - * ie. We need to do reverse-order double lock dancing. 
- * - * Another twist is that we are often called with one of the - * matching queue_locks held as indicated by @locked_q, which - * prevents performing double-lock dance for other queues. - * - * So, we do it in two stages. The fast path uses the queue_lock - * the caller is holding and, if other queues need to be accessed, - * uses trylock to avoid introducing locking dependency. This can - * handle most cases, especially if @ioc was performing IO on only - * single device. - * - * If trylock doesn't cut it, we defer to @ioc->release_work which - * can do all the double-locking dancing. - */ spin_lock_irqsave_nested(&ioc->lock, flags, ioc_release_depth(locked_q)); - while (!hlist_empty(&ioc->icq_list)) { + /* + * Due to locking order between queue_lock and ioc lock, proper icq + * shoot down requires reverse double locking dance from another + * context. As there usually is no or one icq, try to handle those + * cases synchronously. + */ + if (!hlist_empty(&ioc->icq_list)) { struct io_cq *icq = hlist_entry(ioc->icq_list.first, struct io_cq, ioc_node); - struct request_queue *this_q = icq->q; - - if (this_q != last_q) { - if (last_q && last_q != locked_q) - spin_unlock(last_q->queue_lock); - last_q = NULL; - - if (!spin_trylock(this_q->queue_lock)) - break; - last_q = this_q; - continue; + if (locked_q) { + if (locked_q == icq->q) + ioc_exit_icq(icq); + } else if (spin_trylock(icq->q->queue_lock)) { + ioc_exit_icq(icq); + spin_unlock(icq->q->queue_lock); } - ioc_exit_icq(icq); } - if (last_q && last_q != locked_q) - spin_unlock(last_q->queue_lock); - spin_unlock_irqrestore(&ioc->lock, flags); /* if no icq is left, we're done; otherwise, kick release_work */ ^ permalink raw reply [flat|nested] 49+ messages in thread
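The simplified synchronous path in the patch above can be sketched in userspace C: with `ioc->lock` held, exit the icq inline when the caller already holds the matching `queue_lock`, opportunistically trylock otherwise, and defer to `release_work` when that fails. A toy lock stands in for `queue_lock`, and the struct and function names are stand-ins, not kernel code:

```c
#include <assert.h>

/* Userspace sketch of the simplified tail of put_io_context() in the
 * patch above.  A toy lock models queue_lock; names are stand-ins. */
struct model_lock { int held; };

static int model_trylock(struct model_lock *l)
{
    if (l->held)
        return 0;       /* contended: like spin_trylock() failing */
    l->held = 1;
    return 1;
}

static void model_unlock(struct model_lock *l) { l->held = 0; }

struct model_icq {
    struct model_lock *queue_lock;
    int exited;
};

/* Returns 1 if the icq could be exited synchronously, 0 if the
 * release_work item must do the reverse double-lock dance instead. */
static int try_exit_icq(struct model_icq *icq, struct model_lock *locked_q)
{
    if (locked_q) {
        if (locked_q == icq->queue_lock) {
            icq->exited = 1;            /* fast path: matching lock held */
            return 1;
        }
        return 0;                       /* wrong queue: defer to work    */
    }
    if (model_trylock(icq->queue_lock)) {
        icq->exited = 1;                /* got queue_lock without        */
        model_unlock(icq->queue_lock);  /* inverting the lock order      */
        return 1;
    }
    return 0;                           /* contended: kick release_work  */
}

/* demo state: qa is free, qb is held by someone else */
static struct model_lock qa, qb = { .held = 1 };
static struct model_icq icq_a = { &qa, 0 }, icq_b = { &qb, 0 };
```

This captures why the common cases (no icq, or one icq on the queue the caller already locked) stay synchronous while everything else punts to the workqueue.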
* Re: [patch]block: fix ioc locking warning 2012-02-06 20:36 ` [patch]block: fix ioc locking warning Tejun Heo @ 2012-02-07 0:31 ` Shaohua Li 2012-02-07 0:39 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-07 0:31 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos 2012/2/7 Tejun Heo <tj@kernel.org>: > On Mon, Feb 06, 2012 at 09:27:06AM -0800, Tejun Heo wrote: >> It's one wq scheduling on exit for any task which has issued an IO. I >> don't think it would matter except for task fork/exit microbenchs (or >> workloads which approximate to that). I'll get some measurements and >> strip the optimization if it doesn't really show up. > > I'm still playing with test methods and getting numbers but the > following is the simplified one of the three setups I'm playing with - > the current one, simplified and no optimization. There *seems* to be > appreciable performance degradation on fork/exit w/ ioc microbenchs so > I'm likely to go with the following. I'll post when I know more. Hi, Since you are talking about performance, one of our microbenchmark (swap) shows a regression. Alex bisect it to be b2efa05265d62bc2, which is related to the ioc change. A little strange to me, don't expect such change can cause performance issue. I haven't double check the issue, but if you have ideas, please let me know. Thanks, Shaohua ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 0:31 ` Shaohua Li @ 2012-02-07 0:39 ` Tejun Heo 2012-02-07 0:43 ` Shaohua Li 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2012-02-07 0:39 UTC (permalink / raw) To: Shaohua Li Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos Hello, On Tue, Feb 07, 2012 at 08:31:22AM +0800, Shaohua Li wrote: > Since you are talking about performance, one of our microbenchmark (swap) > shows a regression. Alex bisect it to be b2efa05265d62bc2, which is related to > the ioc change. A little strange to me, don't expect such change can cause > performance issue. I haven't double check the issue, but if you have > ideas, please let me know. Hmmm... I think the only thing which can be made slower by that patch is ioc release (ie. task exits). What does the microbench do? Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 0:39 ` Tejun Heo @ 2012-02-07 0:43 ` Shaohua Li 2012-02-07 0:59 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-07 0:43 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Mon, 2012-02-06 at 16:39 -0800, Tejun Heo wrote: > Hello, > > On Tue, Feb 07, 2012 at 08:31:22AM +0800, Shaohua Li wrote: > > Since you are talking about performance, one of our microbenchmark (swap) > > shows a regression. Alex bisect it to be b2efa05265d62bc2, which is related to > > the ioc change. A little strange to me, don't expect such change can cause > > performance issue. I haven't double check the issue, but if you have > > ideas, please let me know. > > Hmmm... I think the only thing which can be made slower by that patch > is ioc release (ie. task exits). What does the microbench do? Quite simple, copy kernel source to tmpfs in tight memory environment. This will trigger swap. The bisection result appears stable. Thanks, Shaohua ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 0:43 ` Shaohua Li @ 2012-02-07 0:59 ` Tejun Heo 2012-02-07 1:10 ` Shaohua Li 2012-02-07 5:22 ` Shaohua Li 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-07 0:59 UTC (permalink / raw) To: Shaohua Li Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Tue, Feb 07, 2012 at 08:43:50AM +0800, Shaohua Li wrote: > Quite simple, copy kernel source to tmpfs in tight memory environment. > This will trigger swap. The bisection result appears stable. Looked through the commit again but it really doesn't change fast paths. The only paths which can be slower are ioc and q exits. Can you please do the followings? * Share the script used for microbench. Maybe it somehow exposes slower exit path? How much regression are we talking about? * Test with the this_q != locked_q trylock patch. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 0:59 ` Tejun Heo @ 2012-02-07 1:10 ` Shaohua Li 2012-02-07 1:33 ` Shaohua Li 2012-02-07 5:22 ` Shaohua Li 1 sibling, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-07 1:10 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Mon, 2012-02-06 at 16:59 -0800, Tejun Heo wrote: > On Tue, Feb 07, 2012 at 08:43:50AM +0800, Shaohua Li wrote: > > Quite simple, copy kernel source to tmpfs in tight memory environment. > > This will trigger swap. The bisection result appears stable. > > Looked through the commit again but it really doesn't change fast > paths. The only paths which can be slower are ioc and q exits. Can > you please do the followings? > > * Share the script used for microbench. Maybe it somehow exposes > slower exit path? How much regression are we talking about? I need strip the script out from our test framework, but it essentially is a 'time copy -a tmpfs/kernelsource tmpfs/copy'. The time increased ~20% with 3.3-rc1. > * Test with the this_q != locked_q trylock patch. sure. Thanks, Shaohua ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 1:10 ` Shaohua Li @ 2012-02-07 1:33 ` Shaohua Li 0 siblings, 0 replies; 49+ messages in thread From: Shaohua Li @ 2012-02-07 1:33 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Tue, 2012-02-07 at 09:10 +0800, Shaohua Li wrote: > On Mon, 2012-02-06 at 16:59 -0800, Tejun Heo wrote: > > On Tue, Feb 07, 2012 at 08:43:50AM +0800, Shaohua Li wrote: > > > Quite simple, copy kernel source to tmpfs in tight memory environment. > > > This will trigger swap. The bisection result appears stable. > > > > Looked through the commit again but it really doesn't change fast > > paths. The only paths which can be slower are ioc and q exits. Can > > you please do the followings? > > > > * Share the script used for microbench. Maybe it somehow exposes > > slower exit path? How much regression are we talking about? > I need strip the script out from our test framework, but it essentially > is a 'time copy -a tmpfs/kernelsource tmpfs/copy'. The time increased > ~20% with 3.3-rc1. oops, I get confused about our swap and swap-cp test. The workload with regression is swap. which does 'time kernelbuild tmpfs/kernelsource'. we run several such kernelbuild in the meantime to trigger swap. Very sorry to give your wrong info before. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 0:59 ` Tejun Heo 2012-02-07 1:10 ` Shaohua Li @ 2012-02-07 5:22 ` Shaohua Li 2012-02-07 22:34 ` Linus Torvalds 1 sibling, 1 reply; 49+ messages in thread From: Shaohua Li @ 2012-02-07 5:22 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Mon, 2012-02-06 at 16:59 -0800, Tejun Heo wrote: > * Test with the this_q != locked_q trylock patch. this one slightly reduces the regression. without it, the regression is about 17%, with it, the regression is about 12%. Test result is stable. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-07 5:22 ` Shaohua Li @ 2012-02-07 22:34 ` Linus Torvalds 0 siblings, 0 replies; 49+ messages in thread From: Linus Torvalds @ 2012-02-07 22:34 UTC (permalink / raw) To: Shaohua Li; +Cc: Tejun Heo, Vivek Goyal, Jens Axboe, lkml, Knut Petersen, mroos On Mon, Feb 6, 2012 at 9:22 PM, Shaohua Li <shaohua.li@intel.com> wrote: > On Mon, 2012-02-06 at 16:59 -0800, Tejun Heo wrote: >> * Test with the this_q != locked_q trylock patch. > this one slightly reduces the regression. without it, the regression is > about 17%, with it, the regression is about 12%. Test result is stable. What happens when you remove the fast-case entirely? See the other patch by Tejun with the subject "[PATCH] block: strip out locking optimization in put_io_context()" Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-06 7:50 [patch]block: fix ioc locking warning Shaohua Li 2012-02-06 7:55 ` Jens Axboe 2012-02-06 15:12 ` Vivek Goyal @ 2012-02-06 16:22 ` Tejun Heo 2012-02-08 18:07 ` walt 3 siblings, 0 replies; 49+ messages in thread From: Tejun Heo @ 2012-02-06 16:22 UTC (permalink / raw) To: Shaohua Li; +Cc: lkml, Linus Torvalds, Jens Axboe, Knut Petersen, mroos Hello, On Mon, Feb 06, 2012 at 03:50:11PM +0800, Shaohua Li wrote: > diff --git a/block/blk-ioc.c b/block/blk-ioc.c > index 27a06e0..7490b6d 100644 > --- a/block/blk-ioc.c > +++ b/block/blk-ioc.c > @@ -204,7 +204,9 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q) > spin_unlock(last_q->queue_lock); > last_q = NULL; > > - if (!spin_trylock(this_q->queue_lock)) > + /* spin_trylock() always successes in UP case */ > + if (this_q != locked_q && > + !spin_trylock(this_q->queue_lock)) Yeah, this is the right thing to do. I think the comment is slightly misleading. I'll prep a patch. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [patch]block: fix ioc locking warning 2012-02-06 7:50 [patch]block: fix ioc locking warning Shaohua Li ` (2 preceding siblings ...) 2012-02-06 16:22 ` Tejun Heo @ 2012-02-08 18:07 ` walt 3 siblings, 0 replies; 49+ messages in thread From: walt @ 2012-02-08 18:07 UTC (permalink / raw) To: linux-kernel On 02/05/2012 11:50 PM, Shaohua Li wrote: > diff --git a/block/blk-ioc.c b/block/blk-ioc.c > index 27a06e0..7490b6d 100644 > --- a/block/blk-ioc.c > +++ b/block/blk-ioc.c > @@ -204,7 +204,9 @@ void put_io_context(struct io_context *ioc, struct request_queue *locked_q) > spin_unlock(last_q->queue_lock); > last_q = NULL; > > - if (!spin_trylock(this_q->queue_lock)) > + /* spin_trylock() always successes in UP case */ > + if (this_q != locked_q&& > + !spin_trylock(this_q->queue_lock)) > break; > last_q = this_q; > continue; I can see from the rest of this thread that your patch fixes a multitude of different backtraces, including (most importantly :) mine, thanks: http://marc.info/?l=linux-kernel&m=132804761925018&w=2 ^ permalink raw reply [flat|nested] 49+ messages in thread
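The UP preemption-count leak this thread fixes can be reproduced in a toy userspace model. On CONFIG_SMP=n, `spin_trylock()` compiles down to `preempt_disable()` plus `return 1`, so it "succeeds" even on a lock the caller already holds; the integer counter below stands in for the kernel's preemption count:

```c
#include <assert.h>

/* Toy model of the UP (CONFIG_SMP=n) failure mode.  On UP,
 * spin_trylock() is preempt_disable() + "return 1" -- it never
 * fails, even on a lock the caller already holds. */
static int preempt_count;

static void up_spin_lock(void)    { preempt_count++; }  /* preempt_disable() */
static void up_spin_unlock(void)  { preempt_count--; }  /* preempt_enable()  */
static int  up_spin_trylock(void) { preempt_count++; return 1; }

/* Old loop: the trylock ran even when this_q == locked_q.  On UP it
 * "succeeds" a second time, but the loop's unlock path skips last_q
 * when last_q == locked_q, so one preempt_disable() is never paired
 * -- the "preempt leak: 00000102 -> 00000103" in the oops above. */
static int old_put_io_context_model(void)
{
    preempt_count = 0;
    up_spin_lock();       /* caller already holds locked_q->queue_lock */
    up_spin_trylock();    /* this_q == locked_q: bogus re-acquire      */
    /* loop tail: last_q == locked_q, so no unlock happens here        */
    up_spin_unlock();     /* caller drops locked_q                     */
    return preempt_count; /* leaks one preempt_disable()               */
}

/* Fixed loop: the "this_q != locked_q" guard skips the trylock. */
static int fixed_put_io_context_model(void)
{
    preempt_count = 0;
    up_spin_lock();
    /* this_q == locked_q: guard prevents the trylock entirely */
    up_spin_unlock();
    return preempt_count; /* balanced */
}
```

On SMP the same unguarded trylock fails instead (the lock is genuinely held), breaking the loop too early — the second problem Shaohua's patch notes.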