* kvm deadlock
@ 2011-12-05 22:48 Nate Custer
  2011-12-14 12:25 ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Nate Custer @ 2011-12-05 22:48 UTC (permalink / raw)
  To: kvm

Hello,

I am struggling with repeatable full hardware locks when running 8-12 KVM VMs. At some point before the hard lock I get an inconsistent lock state warning. An example of this can be found here:

http://pastebin.com/8wKhgE2C

After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:

http://pastebin.com/PbcN76bd

All of the cores end up hung and the server stops responding to all input, including SysRq commands. 

I have seen this behavior on two machines (dual E5606, running Fedora 16); both passed cpuburnin testing and memtest86 scans without error.

I have reproduced the crash and stack traces with a Fedora debugging kernel (3.1.2-1) and with a vanilla 3.1.4 kernel.

Nate Custer
QA Analyst
cPanel Inc


* Re: kvm deadlock
  2011-12-05 22:48 kvm deadlock Nate Custer
@ 2011-12-14 12:25 ` Marcelo Tosatti
  2011-12-14 13:43   ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2011-12-14 12:25 UTC (permalink / raw)
  To: Nate Custer; +Cc: kvm

On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> Hello,
> 
> I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
> 
> http://pastebin.com/8wKhgE2C
> 
> After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
> 
> http://pastebin.com/PbcN76bd
> 
> All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
> 
> I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
> 
> I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.

Busted hardware, apparently. Can you reproduce these issues with the
same workload on different hardware?



* Re: kvm deadlock
  2011-12-14 12:25 ` Marcelo Tosatti
@ 2011-12-14 13:43   ` Avi Kivity
  2011-12-14 14:00     ` Marcelo Tosatti
  2011-12-14 16:03     ` Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Avi Kivity @ 2011-12-14 13:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nate Custer, kvm, linux-kernel, Jens Axboe

On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> > Hello,
> > 
> > I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
> > 
> > http://pastebin.com/8wKhgE2C
> > 
> > After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
> > 
> > http://pastebin.com/PbcN76bd
> > 
> > All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
> > 
> > I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
> > 
> > I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
>
> Busted hardware, apparently. Can you reproduce these issues with the
> same workload on different hardware?

I don't think it's hardware related.  The second trace (in the first
paste) is called during swap, so GFP_FS is set.  The first one is not,
so GFP_FS is clear.  Lockdep is worried about the following scenario:

  acpi_early_init() is called
  calls pcpu_alloc(), which takes pcpu_alloc_mutex
  eventually, calls kmalloc(), or some other allocation function
  no memory, so swap
  call try_to_free_pages()
  submit_bio()
  blk_throtl_bio()
  blkio_alloc_blkg_stats()
  alloc_percpu()
  pcpu_alloc(), which takes pcpu_alloc_mutex
  deadlock

It's a little unlikely that acpi_early_init() will OOM, but lockdep
doesn't know that.  Other callers of pcpu_alloc() could trigger the same
thing.

When lockdep says

[ 5839.924953] other info that might help us debug this:
[ 5839.925396]  Possible unsafe locking scenario:
[ 5839.925397]
[ 5839.925840]        CPU0
[ 5839.926063]        ----
[ 5839.926287]   lock(pcpu_alloc_mutex);
[ 5839.926533]   <Interrupt>
[ 5839.926756]     lock(pcpu_alloc_mutex);
[ 5839.926986]

It really means

   <swap, set GFP_FS>

GFP_FS simply marks the beginning of a nested, unrelated context that
uses the same thread, just like an interrupt.  Kudos to lockdep for
catching that.

I think the allocation in blkio_alloc_blkg_stats() should be moved out
of the I/O path into some init function. Copying Jens.
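To make the recursion concrete, here is a minimal userspace analogue of
what lockdep is warning about (illustrative only; none of this is kernel
code and the function names are made up):

/* Minimal userspace analogue of the pcpu_alloc_mutex recursion. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pcpu_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for blk_throtl_bio() -> blkio_alloc_blkg_stats()
 * -> alloc_percpu(). */
static void submit_io_and_alloc_percpu(void)
{
	pthread_mutex_lock(&pcpu_alloc_mutex);	/* second acquisition: deadlock */
	pthread_mutex_unlock(&pcpu_alloc_mutex);
}

/* Stands in for an allocation hitting reclaim and swapping (GFP_FS set). */
static void allocate_under_memory_pressure(void)
{
	submit_io_and_alloc_percpu();
}

/* Stands in for pcpu_alloc() called from e.g. acpi_early_init(). */
static void pcpu_alloc_like(void)
{
	pthread_mutex_lock(&pcpu_alloc_mutex);	/* first acquisition */
	allocate_under_memory_pressure();	/* re-enters the same lock */
	pthread_mutex_unlock(&pcpu_alloc_mutex);
}

int main(void)
{
	pcpu_alloc_like();	/* never returns: self-deadlock */
	printf("unreachable\n");
	return 0;
}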

-- 
error compiling committee.c: too many arguments to function



* Re: kvm deadlock
  2011-12-14 13:43   ` Avi Kivity
@ 2011-12-14 14:00     ` Marcelo Tosatti
  2011-12-14 14:02       ` Avi Kivity
  2011-12-14 16:03     ` Jens Axboe
  1 sibling, 1 reply; 28+ messages in thread
From: Marcelo Tosatti @ 2011-12-14 14:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nate Custer, kvm, linux-kernel, Jens Axboe

On Wed, Dec 14, 2011 at 03:43:09PM +0200, Avi Kivity wrote:
> On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> > On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> > > Hello,
> > > 
> > > I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
> > > 
> > > http://pastebin.com/8wKhgE2C
> > > 
> > > After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
> > > 
> > > http://pastebin.com/PbcN76bd
> > > 
> > > All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
> > > 
> > > I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
> > > 
> > > I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >
> > Busted hardware, apparently. Can you reproduce these issues with the
> > same workload on different hardware?
> 
> I don't think it's hardware related.  The second trace (in the first
> paste) is called during swap, so GFP_FS is set.  The first one is not,
> so GFP_FS is clear.  Lockdep is worried about the following scenario:
> 
>   acpi_early_init() is called
>   calls pcpu_alloc(), which takes pcpu_alloc_mutex
>   eventually, calls kmalloc(), or some other allocation function
>   no memory, so swap
>   call try_to_free_pages()
>   submit_bio()
>   blk_throtl_bio()
>   blkio_alloc_blkg_stats()
>   alloc_percpu()
>   pcpu_alloc(), which takes pcpu_alloc_mutex
>   deadlock
> 
> It's a little unlikely that acpi_early_init() will OOM, but lockdep
> doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> thing.
> 
> When lockdep says
> 
> [ 5839.924953] other info that might help us debug this:
> [ 5839.925396]  Possible unsafe locking scenario:
> [ 5839.925397]
> [ 5839.925840]        CPU0
> [ 5839.926063]        ----
> [ 5839.926287]   lock(pcpu_alloc_mutex);
> [ 5839.926533]   <Interrupt>
> [ 5839.926756]     lock(pcpu_alloc_mutex);
> [ 5839.926986]
> 
> It really means
> 
>    <swap, set GFP_FS>
> 
> GFP_FS simply marks the beginning of a nested, unrelated context that
> uses the same thread, just like an interrupt.  Kudos to lockdep for
> catching that.
> 
> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> of the I/O path into some init function. Copying Jens.

The other traces have apparently bogus NMI interrupts, but it might be a
software bug, OK.



* Re: kvm deadlock
  2011-12-14 14:00     ` Marcelo Tosatti
@ 2011-12-14 14:02       ` Avi Kivity
  2011-12-14 14:06         ` Marcelo Tosatti
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2011-12-14 14:02 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nate Custer, kvm, linux-kernel, Jens Axboe

On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> The other traces have apparently bogus NMI interrupts, but it might be a
> software bug, OK.

These are from sysrq-blah, no?  I'm looking at them now.



-- 
error compiling committee.c: too many arguments to function



* Re: kvm deadlock
  2011-12-14 14:02       ` Avi Kivity
@ 2011-12-14 14:06         ` Marcelo Tosatti
  2011-12-14 14:17           ` Nate Custer
  2011-12-14 14:27           ` Avi Kivity
  0 siblings, 2 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2011-12-14 14:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nate Custer, kvm, linux-kernel, Jens Axboe

On Wed, Dec 14, 2011 at 04:02:48PM +0200, Avi Kivity wrote:
> On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> > The other traces have apparently bogus NMI interrupts, but it might be a
> > software bug, OK.
> 
> These are from sysrq-blah, no?  I'm looking at them now.

I don't know. Is it a hang? The second one could actually be memory
corruption (of the timer olist) instead of a bogus NMI.

Nate?




* Re: kvm deadlock
  2011-12-14 14:06         ` Marcelo Tosatti
@ 2011-12-14 14:17           ` Nate Custer
  2011-12-14 14:20             ` Marcelo Tosatti
  2011-12-14 14:28             ` Avi Kivity
  2011-12-14 14:27           ` Avi Kivity
  1 sibling, 2 replies; 28+ messages in thread
From: Nate Custer @ 2011-12-14 14:17 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, linux-kernel, Jens Axboe


On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> I don't know. Its a hang ? It could be memory corruption (of the timer
> olist) instead of a bogus NMI actually, the second.


What is pasted in the second paste is what came scrolling across the console right before the end of all responsiveness. It came from a dmesg dump; the next dmesg command was not accepted via ssh, and the attached console showed the same stack trace. At that point the system refused to respond to any direct keyboard input, including the SysRq commands that I expected to work after a core dump.

The issue happened with two servers (same hardware, same build group, so there is a chance of a bad hardware batch). Switching to an older kernel/kvm setup in RHEL 6.2 has corrected the issue, which suggests a software issue to me.

Nate Custer





* Re: kvm deadlock
  2011-12-14 14:17           ` Nate Custer
@ 2011-12-14 14:20             ` Marcelo Tosatti
  2011-12-14 14:28             ` Avi Kivity
  1 sibling, 0 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2011-12-14 14:20 UTC (permalink / raw)
  To: Nate Custer; +Cc: Avi Kivity, kvm, linux-kernel, Jens Axboe

On Wed, Dec 14, 2011 at 08:17:45AM -0600, Nate Custer wrote:
> 
> On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> > I don't know. Its a hang ? It could be memory corruption (of the timer
> > olist) instead of a bogus NMI actually, the second.
> 
> 
> What is pasted in the second paste is what came scrolling across the console right before the end of all responsiveness. It came from a dmesg dump, the next dmesg command was not accepted via ssh and the console attached showed the same stack trace. At that point the system refused to respond to any direct keyboard input, including the SysRq commands that I expected to work after a core dump. 
> 
> The issue happened with two servers (same hardware, same build group so there is a chance of a bad hardware batch). Switching to an older kernel/kvm setup in RHEL 6.2 has corrected the issue, which suggests a software issue to me.

Right. Perhaps try an older upstream kernel to find a culprit then.



* Re: kvm deadlock
  2011-12-14 14:06         ` Marcelo Tosatti
  2011-12-14 14:17           ` Nate Custer
@ 2011-12-14 14:27           ` Avi Kivity
  1 sibling, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2011-12-14 14:27 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nate Custer, kvm, linux-kernel, Jens Axboe

On 12/14/2011 04:06 PM, Marcelo Tosatti wrote:
> On Wed, Dec 14, 2011 at 04:02:48PM +0200, Avi Kivity wrote:
> > On 12/14/2011 04:00 PM, Marcelo Tosatti wrote:
> > > The other traces have apparently bogus NMI interrupts, but it might be a
> > > software bug, OK.
> > 
> > These are from sysrq-blah, no?  I'm looking at them now.
>
> I don't know. Its a hang ? It could be memory corruption (of the timer
> olist) instead of a bogus NMI actually, the second.

It looks like lots of CPUs are waiting on the smp_call_function_single()
lock, and RCU is complaining:


[ 4959.814010]  [<ffffffff81252e47>] __const_udelay+0x2c/0x2e
[ 4959.814017]  [<ffffffff81027449>] native_safe_apic_wait_icr_idle+0x31/0x3d
[ 4959.814024]  [<ffffffff81027f11>] __default_send_IPI_dest_field.constprop.0+0x23/0x5d
[ 4959.814032]  [<ffffffff81027f93>] default_send_IPI_mask_sequence_phys+0x48/0x97
[ 4959.814039]  [<ffffffff81089848>] ? tick_nohz_handler+0xdf/0xdf
[ 4959.814044]  [<ffffffff8102b561>] physflat_send_IPI_all+0x17/0x19
[ 4959.814052]  [<ffffffff81028102>] arch_trigger_all_cpu_backtrace+0x57/0x89
[ 4959.814057]  [<ffffffff810c7711>] __rcu_pending+0x89/0x328
[ 4959.814063]  [<ffffffff81089848>] ? tick_nohz_handler+0xdf/0xdf
[ 4959.814067]  [<ffffffff810c7e0a>] rcu_check_callbacks+0x88/0xb9
[ 4959.814071]  [<ffffffff8106aeba>] update_process_times+0x3f/0x75

Maybe the core issue is that CPU 3 is spinning in do_insn_fetch() and
denying rcu grace periods.  Nate, can you provide a few more dumps (this
is looking at the second paste, so more of the same)?

-- 
error compiling committee.c: too many arguments to function



* Re: kvm deadlock
  2011-12-14 14:17           ` Nate Custer
  2011-12-14 14:20             ` Marcelo Tosatti
@ 2011-12-14 14:28             ` Avi Kivity
  1 sibling, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2011-12-14 14:28 UTC (permalink / raw)
  To: Nate Custer; +Cc: Marcelo Tosatti, kvm, linux-kernel, Jens Axboe

On 12/14/2011 04:17 PM, Nate Custer wrote:
> On Dec 14, 2011, at 8:06 AM, Marcelo Tosatti wrote:
> > I don't know. Its a hang ? It could be memory corruption (of the timer
> > olist) instead of a bogus NMI actually, the second.
>
>
> What is pasted in the second paste is what came scrolling across the console right before the end of all responsiveness. It came from a dmesg dump, the next dmesg command was not accepted via ssh and the console attached showed the same stack trace. At that point the system refused to respond to any direct keyboard input, including the SysRq commands that I expected to work after a core dump. 
>
> The issue happened with two servers (same hardware, same build group so there is a chance of a bad hardware batch). Switching to an older kernel/kvm setup in RHEL 6.2 has corrected the issue, which suggests a software issue to me.
>

If you can, please set up netconsole, which should give us complete
dmesg captures.

-- 
error compiling committee.c: too many arguments to function



* Re: kvm deadlock
  2011-12-14 13:43   ` Avi Kivity
  2011-12-14 14:00     ` Marcelo Tosatti
@ 2011-12-14 16:03     ` Jens Axboe
  2011-12-14 17:03       ` Vivek Goyal
  2011-12-15 19:47       ` [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock) Vivek Goyal
  1 sibling, 2 replies; 28+ messages in thread
From: Jens Axboe @ 2011-12-14 16:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Nate Custer, kvm, linux-kernel, Vivek Goyal

On 2011-12-14 14:43, Avi Kivity wrote:
> On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
>> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
>>> Hello,
>>>
>>> I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
>>>
>>> http://pastebin.com/8wKhgE2C
>>>
>>> After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
>>>
>>> http://pastebin.com/PbcN76bd
>>>
>>> All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
>>>
>>> I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
>>>
>>> I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
>>
>> Busted hardware, apparently. Can you reproduce these issues with the
>> same workload on different hardware?
> 
> I don't think it's hardware related.  The second trace (in the first
> paste) is called during swap, so GFP_FS is set.  The first one is not,
> so GFP_FS is clear.  Lockdep is worried about the following scenario:
> 
>   acpi_early_init() is called
>   calls pcpu_alloc(), which takes pcpu_alloc_mutex
>   eventually, calls kmalloc(), or some other allocation function
>   no memory, so swap
>   call try_to_free_pages()
>   submit_bio()
>   blk_throtl_bio()
>   blkio_alloc_blkg_stats()
>   alloc_percpu()
>   pcpu_alloc(), which takes pcpu_alloc_mutex
>   deadlock
> 
> It's a little unlikely that acpi_early_init() will OOM, but lockdep
> doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> thing.
> 
> When lockdep says
> 
> [ 5839.924953] other info that might help us debug this:
> [ 5839.925396]  Possible unsafe locking scenario:
> [ 5839.925397]
> [ 5839.925840]        CPU0
> [ 5839.926063]        ----
> [ 5839.926287]   lock(pcpu_alloc_mutex);
> [ 5839.926533]   <Interrupt>
> [ 5839.926756]     lock(pcpu_alloc_mutex);
> [ 5839.926986]
> 
> It really means
> 
>    <swap, set GFP_FS>
> 
> GFP_FS simply marks the beginning of a nested, unrelated context that
> uses the same thread, just like an interrupt.  Kudos to lockdep for
> catching that.
> 
> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> of the I/O path into some init function. Copying Jens.

That's completely buggy: basically you end up with a GFP_KERNEL
allocation from the IO submit path. Vivek, per_cpu data needs to be set
up at init time. You can't allocate it dynamically off the IO path.

-- 
Jens Axboe



* Re: kvm deadlock
  2011-12-14 16:03     ` Jens Axboe
@ 2011-12-14 17:03       ` Vivek Goyal
  2011-12-14 17:09         ` Jens Axboe
  2011-12-15 19:47       ` [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock) Vivek Goyal
  1 sibling, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-14 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel, Tejun Heo

On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> > 
> > I don't think it's hardware related.  The second trace (in the first
> > paste) is called during swap, so GFP_FS is set.  The first one is not,
> > so GFP_FS is clear.  Lockdep is worried about the following scenario:
> > 
> >   acpi_early_init() is called
> >   calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >   eventually, calls kmalloc(), or some other allocation function
> >   no memory, so swap
> >   call try_to_free_pages()
> >   submit_bio()
> >   blk_throtl_bio()
> >   blkio_alloc_blkg_stats()
> >   alloc_percpu()
> >   pcpu_alloc(), which takes pcpu_alloc_mutex
> >   deadlock
> > 
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> > thing.
> > 
> > When lockdep says
> > 
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]        CPU0
> > [ 5839.926063]        ----
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]   <Interrupt>
> > [ 5839.926756]     lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> > 
> > It really means
> > 
> >    <swap, set GFP_FS>
> > 
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt.  Kudos to lockdep for
> > catching that.
> > 
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
> 
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

Hi Jens,

I am wondering how CFQ gets away with a blocking cfqq allocation in the
IO submit path. I see that blk_queue_bio() will do the following:

blk_queue_bio()
 get_request_wait()
  get_request(..,..,GFP_NOIO)
   blk_alloc_request()
    elv_set_request()
     cfq_set_request()
      ---> Can sleep and do memory allocation in IO submit path as
           GFP_NOIO has __GFP_WAIT.

So that means a sleeping allocation from the IO submit path is not
necessarily a problem?

But in the case of per cpu data allocation, we might already be holding
pcpu_alloc_mutex at the time we call into the pcpu allocator again, and
that might lead to a deadlock (as Avi mentioned). If so, then it is a
problem.

Right now allocation of the root group and its associated stats happens
at queue initialization time. For non-root cgroups, group allocation and
the associated per cpu stats allocation happen dynamically when the IO is
submitted. So in this case maybe we are creating a new blkio cgroup and
then doing IO, which leads to this warning.

I am not sure how to move this allocation to an init path. These stats
are per group, and groups are created dynamically as IO happens in them.
The only init path seems to be cgroup creation time, but a blkg is an
object contained in a parent object, and at that time the parent object
is not available. It is created dynamically at IO time (cfq_group,
blkio_group, etc.).

Though it is a little hackish, can we just delay the allocation of the
stats if pcpu_alloc_mutex is held? We would have to make pcpu_alloc_mutex
non-static, though. Delaying just means we do not capture the stats for
some time; sooner or later we will get regular IO with pcpu_alloc_mutex
not held and we can do the per cpu allocation at that point. I will write
a test patch.

Or maybe there is a safer version of pcpu alloc which will return
without allocating if pcpu_alloc_mutex is already locked.
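Roughly, the kind of helper I have in mind (a hypothetical sketch only;
no such try-variant exists, and pcpu_alloc_mutex is currently static to
mm/percpu.c):

/*
 * Hypothetical sketch only.  Note the check is also coarse: it bails
 * out whenever *anyone* is allocating, not just when the current
 * context would self-deadlock.
 */
void __percpu *alloc_percpu_try(size_t size, size_t align)
{
	if (mutex_is_locked(&pcpu_alloc_mutex))
		return NULL;	/* caller skips/delays the stats allocation */
	return __alloc_percpu(size, align);
}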

CCing Tejun too.

Thanks
Vivek


* Re: kvm deadlock
  2011-12-14 17:03       ` Vivek Goyal
@ 2011-12-14 17:09         ` Jens Axboe
  2011-12-14 17:22           ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2011-12-14 17:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel, Tejun Heo

On 2011-12-14 18:03, Vivek Goyal wrote:
>>> I think the allocation in blkio_alloc_blkg_stats() should be moved out
>>> of the I/O path into some init function. Copying Jens.
>>
>> That's completely buggy, basically you end up with a GFP_KERNEL
>> allocation from the IO submit path. Vivek, per_cpu data needs to be set
>> up at init time. You can't allocate it dynamically off the IO path.
> 
> Hi Jens,
> 
> I am wondering how does CFQ get away with blocking cfqq allocation in
> IO submit path. I see that blk_queue_bio() will do following.
> 
> blk_queue_bio()
>  get_request_wait()
>   get_request(..,..,GFP_NOIO)
>    blk_alloc_request()
>     elv_set_request()
>      cfq_set_request()
>       ---> Can sleep and do memory allocation in IO submit path as
>            GFP_NOIO has __GFP_WAIT.
> 
> So that means sleeping allocation from IO submit path is not necessarily
> a problem?

__GFP_WAIT isn't the problem; you can block in the IO path. You cannot,
however, recurse back into IO submission. That's why CFQ is using
GFP_NOIO, which implies that waiting is OK, but submitting new IO to
satisfy the allocation is not.
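For illustration (a sketch only, not the actual CFQ or blk-cgroup code):

/*
 * Allocating on the IO submission path is fine as long as reclaim
 * cannot recurse back into IO submission.
 */
static struct cfq_queue *alloc_cfqq_on_io_path(struct cfq_data *cfqd)
{
	/* GFP_NOIO: may sleep and reclaim clean pages, but will not
	 * submit new IO (so it cannot re-enter the block layer). */
	return kmem_cache_alloc_node(cfq_pool, GFP_NOIO | __GFP_ZERO,
				     cfqd->queue->node);
}

static int alloc_group_stats_on_io_path(struct blkio_group *blkg)
{
	/* alloc_percpu() is effectively a GFP_KERNEL allocation; under
	 * memory pressure it can recurse into try_to_free_pages() ->
	 * submit_bio(), i.e. back into the IO path.  This is the
	 * problematic case. */
	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
	return blkg->stats_cpu ? 0 : -ENOMEM;
}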

> But in case of per cpu data allocation, we might be holding
> pcpu_alloc_mutex() already at the time of calling pcpu allocation
> again and that might lead to deadlock.  (As Avi mentioned). If yes,
> then it is a problem.

It is both a problem and somewhat of a mess...

> Right now allocation of root group and associated stats happens at
> queue initialization time. For non-root cgroups, group allocation and
> associated per cpu stats allocation happens dynamically when the IO is
> submitted. So in this case may be we are creating a new blkio cgroup
> and then doing IO which leads to this warning.
> 
> I am not sure how to move this allocation to init path. These stats
> are per group and groups are created dynamically as IO happens in
> them. Only init path seems to be cgroup creation time. blkg is an
> object which is contained in a parent object and at that time parent
> object is not available. It is created dynamically at the IO time
> (cfq_group, blkio_group etc).
> 
> Though it is little hackish but can we just delay the allocation of
> stats if pcpu_alloc_mutex is held. We shall have to make
> pcpu_alloc_mutex non static though. Delaying will just not capture the
> stats for some time and sooner or later we will get regular IO with
> pcpu_alloc_mutex not held and we can do per cpu allocation at that
> time. I will write a a test patch.
> 
> Or may be there is a safer version of pcpu alloc which will return
> without allocation if pcpu_alloc_mutex is already locked.

Neither of these proposed solutions is really pretty. It would be
cleaner, and give a better fast path at runtime, if you could allocate
all of this when the non-root groups are set up. Why isn't it done that
way right now?

-- 
Jens Axboe



* Re: kvm deadlock
  2011-12-14 17:09         ` Jens Axboe
@ 2011-12-14 17:22           ` Vivek Goyal
  2011-12-14 18:16             ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-14 17:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel, Tejun Heo

On Wed, Dec 14, 2011 at 06:09:55PM +0100, Jens Axboe wrote:
> On 2011-12-14 18:03, Vivek Goyal wrote:
> >>> I think the allocation in blkio_alloc_blkg_stats() should be moved out
> >>> of the I/O path into some init function. Copying Jens.
> >>
> >> That's completely buggy, basically you end up with a GFP_KERNEL
> >> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> >> up at init time. You can't allocate it dynamically off the IO path.
> > 
> > Hi Jens,
> > 
> > I am wondering how does CFQ get away with blocking cfqq allocation in
> > IO submit path. I see that blk_queue_bio() will do following.
> > 
> > blk_queue_bio()
> >  get_request_wait()
> >   get_request(..,..,GFP_NOIO)
> >    blk_alloc_request()
> >     elv_set_request()
> >      cfq_set_request()
> >       ---> Can sleep and do memory allocation in IO submit path as
> >            GFP_NOIO has __GFP_WAIT.
> > 
> > So that means sleeping allocation from IO submit path is not necessarily
> > a problem?
> 

[..]
> __GFP_WAIT isn't the problem, you can block in the IO path. You cannot,
> however, recurse back into IO submission. That's why CFQ is using
> GFP_NOIO, implying that waiting is OK, but submitting new IO to satisfy
> the allocation is not.

OK, got it. So even if we implement a mutex_is_locked() check, it does
not protect me from the possibility of the per cpu allocation path
recursing into IO submission. That means there needs to be a variant of
per cpu allocation which can take the allocation flags as a parameter and
honor them.
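Something along these lines, i.e. a hypothetical interface (nothing like
it exists in the percpu allocator right now):

/* Hypothetical interface only -- pcpu_alloc() currently always behaves
 * like GFP_KERNEL and takes no gfp argument. */
void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);

#define alloc_percpu_gfp(type, gfp)					\
	(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type),	\
					    __alignof__(type), gfp)

/* An IO-path caller could then do, e.g.:
 *	blkg->stats_cpu = alloc_percpu_gfp(struct blkio_group_stats_cpu,
 *					   GFP_NOIO);
 */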

> 
> > But in case of per cpu data allocation, we might be holding
> > pcpu_alloc_mutex() already at the time of calling pcpu allocation
> > again and that might lead to deadlock.  (As Avi mentioned). If yes,
> > then it is a problem.
> 
> It is both a problem and somewhat of a mess...
> 
> > Right now allocation of root group and associated stats happens at
> > queue initialization time. For non-root cgroups, group allocation and
> > associated per cpu stats allocation happens dynamically when the IO is
> > submitted. So in this case may be we are creating a new blkio cgroup
> > and then doing IO which leads to this warning.
> > 
> > I am not sure how to move this allocation to init path. These stats
> > are per group and groups are created dynamically as IO happens in
> > them. Only init path seems to be cgroup creation time. blkg is an
> > object which is contained in a parent object and at that time parent
> > object is not available. It is created dynamically at the IO time
> > (cfq_group, blkio_group etc).
> > 
> > Though it is little hackish but can we just delay the allocation of
> > stats if pcpu_alloc_mutex is held. We shall have to make
> > pcpu_alloc_mutex non static though. Delaying will just not capture the
> > stats for some time and sooner or later we will get regular IO with
> > pcpu_alloc_mutex not held and we can do per cpu allocation at that
> > time. I will write a a test patch.
> > 
> > Or may be there is a safer version of pcpu alloc which will return
> > without allocation if pcpu_alloc_mutex is already locked.
> 
> Both of these proposed solutions aren't really pretty. It would be
> cleaner and have better runtime for the fast path if you could alloc all
> of these when the non-root groups are setup. Why isn't it done that way
> right now?

Actually the non-root group itself is set up when the first IO happens in
the group. When an IO is submitted, we determine which cgroup it belongs
to, then we see whether we have already set up the group. If not, we go
on to allocate one (currently I use GFP_ATOMIC for the group allocation).

So the group allocation itself happens in the IO path (like cfqq
allocation), hence the per group per cpu stats are also allocated in that
path.
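Roughly, the pattern on the IO submission path is (a simplified sketch
modelled on throtl_get_tg(); locking, the queue-dead check and the
re-lookup race handling are all trimmed):

static struct throtl_grp *get_tg_for_bio(struct throtl_data *td,
					 struct blkio_cgroup *blkcg)
{
	struct throtl_grp *tg;

	/* Common case: the group already exists for this cgroup. */
	tg = throtl_find_tg(td, blkcg);
	if (tg)
		return tg;

	/*
	 * First IO from this cgroup on this queue: allocate the group
	 * now, on the IO submission path -- which is where the per cpu
	 * stats allocation currently ends up as well.
	 */
	return throtl_alloc_tg(td);
}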

We can't set up blkio groups at cgroup creation time, as at that point we
don't even know how many request queues there are and on how many of them
cfq is being used.

Setting up the root group at queue initialization time is easy. But we
can't set up all the other cgroups at that time, as that does not take
care of future cgroup creation; it would also waste memory for cgroups
that exist but have no IO happening in them.

Thanks
Vivek


* Re: kvm deadlock
  2011-12-14 17:22           ` Vivek Goyal
@ 2011-12-14 18:16             ` Tejun Heo
  2011-12-14 18:41               ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2011-12-14 18:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel

Hello,

On Wed, Dec 14, 2011 at 12:22:34PM -0500, Vivek Goyal wrote:
> [..]
> > __GFP_WAIT isn't the problem, you can block in the IO path. You cannot,
> > however, recurse back into IO submission. That's why CFQ is using
> > GFP_NOIO, implying that waiting is OK, but submitting new IO to satisfy
> > the allocation is not.
> 
> Ok. Got it. So even if we implement mutex_is_locked() check, it does not
> protect me from possiblity of per cpu allocation path recursing into
> IO submission. That means, there needs to be a variant of per cpu
> allocation which can take the allocation flags as parameter and honor
> these flags.

Slightly tangential but we actually have a bug here.  Under high
enough memory pressure, ioc or cic allocation can fail which will make
request alloc fail and retry, which isn't guaranteed to make forward
progress.  struct request itself is mempool backed but ioc/cic aren't.
It seems hitting this problem (and thus IO / memalloc deadlock) isn't
too difficult w/ memcg.
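(For reference, the forward-progress guarantee for struct request comes
from the mempool pattern, roughly as sketched below; this mirrors the
block layer's request-list setup and is not literal code.)

static mempool_t *rq_pool;

static int init_request_pool(void)
{
	/* BLKDEV_MIN_RQ elements are pre-allocated up front. */
	rq_pool = mempool_create_slab_pool(BLKDEV_MIN_RQ, request_cachep);
	return rq_pool ? 0 : -ENOMEM;
}

static struct request *alloc_request_noio(void)
{
	/* With __GFP_WAIT set (as in GFP_NOIO), mempool_alloc() dips
	 * into the pre-allocated reserve and, if that is exhausted,
	 * sleeps until an element is freed -- it keeps making forward
	 * progress, unlike a plain slab allocation, which can fail. */
	return mempool_alloc(rq_pool, GFP_NOIO);
}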

An easy fix would be using INSERT_BACK instead of INSERT_SORT if
elevator_set() fails.  I'll soon post patches to fix the problem.

> > > Or may be there is a safer version of pcpu alloc which will return
> > > without allocation if pcpu_alloc_mutex is already locked.

pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
try to make percpu keep a cache of areas for this type of allocation, but
I personally think doing percpu allocation from atomic context or the IO
path is a bad idea.  Hmmm...

Thanks.

-- 
tejun


* Re: kvm deadlock
  2011-12-14 18:16             ` Tejun Heo
@ 2011-12-14 18:41               ` Vivek Goyal
  2011-12-14 23:06                 ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-14 18:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel

On Wed, Dec 14, 2011 at 10:16:23AM -0800, Tejun Heo wrote:

[..]
> > > > Or may be there is a safer version of pcpu alloc which will return
> > > > without allocation if pcpu_alloc_mutex is already locked.
> 
> pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
> try to make percpu keep cache of areas for this type of allocation but
> I personally think doing percpu allocation from atomic context or IO
> path is a bad idea.  Hmmm...

It looks like I am running out of options here.  I can't find a suitable
path where I can allocate these stats outside the IO path. Because
devices can be plugged in dynamically (and these stats are per cgroup,
per device), and cgroups can be created dynamically after device
creation, I can't do any static allocation outside the IO path. That more
or less makes using per cpu memory areas for these stats impossible.

For a moment I thought of doing the allocation from a worker thread
after taking a reference on the original group, and allowing the IO
submission to continue without blocking. The only catch is that until the
per cpu areas are allocated, we will not collect any stats.

But for locking we rely on the request queue lock, and the request queue
might be gone by the time the per cpu areas are allocated. That means the
group needs a reference on the request queue. Request queue referencing
and lifetime handling is already full of bugs, so I don't feel
comfortable adding more code there (at least until your cleanup patches
go in).

Hmm, is reverting the per cpu blkio group stats the only sane choice
left now?

Thanks
Vivek


* Re: kvm deadlock
  2011-12-14 18:41               ` Vivek Goyal
@ 2011-12-14 23:06                 ` Vivek Goyal
  0 siblings, 0 replies; 28+ messages in thread
From: Vivek Goyal @ 2011-12-14 23:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel

On Wed, Dec 14, 2011 at 01:41:34PM -0500, Vivek Goyal wrote:
> On Wed, Dec 14, 2011 at 10:16:23AM -0800, Tejun Heo wrote:
> 
> [..]
> > > > > Or may be there is a safer version of pcpu alloc which will return
> > > > > without allocation if pcpu_alloc_mutex is already locked.
> > 
> > pcpu alloc depends on vmalloc allocation, so it isn't trivial.  We can
> > try to make percpu keep cache of areas for this type of allocation but
> > I personally think doing percpu allocation from atomic context or IO
> > path is a bad idea.  Hmmm...
> 
> Looks like I am running out of options here.  I can't find a suitable path
> where I can allocate these stats out of IO path. Because devices can be
> plugged in dynamically (and these stats are per cgroup, per device), and
> cgroups can be created dynamically after device creation, I can't do any
> static allocation out of IO path. So that kind of makes use of per cpu
> memory areas for stats in this case impossible.
> 
> For a moment I thought of doing allocation from worker thread after taking
> a reference on the original group. Allow the IO submission to continue without
> blocking. Just that till per cpu areas are allocated, we will not
> collect any stats.
>

OK, even though I am not convinced about it yet, I wrote a small patch
to allocate the per cpu data area asynchronously from worker thread
context (except for the root cgroup). We continue to submit the IO
without blocking on the per cpu data, and if the per cpu data has not
been allocated yet, we lose that group's stats for that duration.

Here is a half-baked test patch which compiles and boots. The CFQ
changes are not done yet. I have yet to find out whether I need any
additional locking or barriers; right now I am just relying on the
reference counting of the group objects.

I am posting it here to figure out whether this kind of approach is
acceptable or completely frowned upon.

Thanks
Vivek

Not-signed-half-baked-test-patch
---
 block/blk-cgroup.c   |   29 +++++++++++++++++++++++++++
 block/blk-cgroup.h   |    2 +
 block/blk-throttle.c |   53 ++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 75 insertions(+), 9 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig	2011-12-14 23:25:00.000000000 -0500
+++ block/blk-throttle.c	2011-12-14 23:46:58.000000000 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,7 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+	/* Will delayed allocation be visible here for sure? */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +183,24 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct throtl_grp *tg = container_of(work, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		throtl_put_tg(tg);
+		return;
+	}
+
+	blkio_set_blkg_stats_pcpu(&tg->blkg, stat_ptr);
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +208,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
 	tg->bps[0] = tg->bps[1] = -1;
@@ -264,7 +285,7 @@ static void throtl_init_add_tg_lists(str
 }
 
 /* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td, bool async)
 {
 	struct throtl_grp *tg = NULL;
 	int ret;
@@ -273,14 +294,28 @@ static struct throtl_grp *throtl_alloc_t
 	if (!tg)
 		return NULL;
 
-	ret = blkio_alloc_blkg_stats(&tg->blkg);
+	if (!async) {
+		ret = blkio_alloc_blkg_stats(&tg->blkg);
 
-	if (ret) {
-		kfree(tg);
-		return NULL;
+		if (ret) {
+        		kfree(tg);
+        		return NULL;
+		}
 	}
 
 	throtl_init_group(tg);
+
+	if (async) {
+		/*
+		 * Schedule the group per cpu stat allocation through worker
+		 * thread
+		 */
+		throtl_ref_get_tg(tg);
+		if(!schedule_work(&tg->stat_alloc_work))
+			/* nobody should have scheduled this work before */
+			WARN_ON(1);
+	}
+
 	return tg;
 }
 
@@ -329,14 +364,14 @@ static struct throtl_grp * throtl_get_tg
 	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
 
-	tg = throtl_alloc_tg(td);
+	tg = throtl_alloc_tg(td, 1);
 
 	/* Group allocated and queue is still alive. take the lock */
 	spin_lock_irq(q->queue_lock);
 
 	/* Make sure @q is still alive */
 	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		kfree(tg);
+		throtl_put_tg(tg);
 		return NULL;
 	}
 
@@ -353,7 +388,7 @@ static struct throtl_grp * throtl_get_tg
 	__tg = throtl_find_tg(td, blkcg);
 
 	if (__tg) {
-		kfree(tg);
+		throtl_put_tg(tg);
 		rcu_read_unlock();
 		return __tg;
 	}
@@ -1254,7 +1289,7 @@ int blk_throtl_init(struct request_queue
 
 	/* alloc and Init root group. */
 	td->queue = q;
-	tg = throtl_alloc_tg(td);
+	tg = throtl_alloc_tg(td, 0);
 
 	if (!tg) {
 		kfree(td);
Index: block/blk-cgroup.c
===================================================================
--- block/blk-cgroup.c.orig	2011-12-14 23:25:00.000000000 -0500
+++ block/blk-cgroup.c	2011-12-14 23:25:24.000000000 -0500
@@ -400,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -446,6 +449,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -477,6 +483,19 @@ int blkio_alloc_blkg_stats(struct blkio_
 }
 EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
 
+void* blkio_alloc_blkg_stats_pcpu(void)
+{
+	/* Allocate memory for per cpu stats */
+	return alloc_percpu(struct blkio_group_stats_cpu);
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats_pcpu);
+
+void blkio_set_blkg_stats_pcpu(struct blkio_group *blkg, void *stats_cpu)
+{
+	blkg->stats_cpu = stats_cpu;
+}
+EXPORT_SYMBOL_GPL(blkio_set_blkg_stats_pcpu);
+
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid)
@@ -551,6 +570,12 @@ static void blkio_reset_stats_cpu(struct
 {
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	/* blkg->stats_cpu can have delayed allocation */
+
+	if (!blkg->stats_cpu)
+		return;
+
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -673,6 +698,10 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	/* blkg->stats_cpu might not have been allocated yet */
+	if (blkg->stats_cpu == NULL)
+		return 0;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
Index: block/blk-cgroup.h
===================================================================
--- block/blk-cgroup.h.orig	2011-12-14 23:25:00.000000000 -0500
+++ block/blk-cgroup.h	2011-12-14 23:25:24.000000000 -0500
@@ -311,6 +311,8 @@ extern void blkiocg_add_blkio_group(stru
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
 extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
+extern void* blkio_alloc_blkg_stats_pcpu(void);
+extern void blkio_set_blkg_stats_pcpu(struct blkio_group *blkg, void *stats_cpu);
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);


* [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-14 16:03     ` Jens Axboe
  2011-12-14 17:03       ` Vivek Goyal
@ 2011-12-15 19:47       ` Vivek Goyal
       [not found]         ` <E73DB38E-AFC5-445D-9E76-DE599B36A814@cpanel.net>
  1 sibling, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-15 19:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Avi Kivity, Marcelo Tosatti, Nate Custer, kvm, linux-kernel

On Wed, Dec 14, 2011 at 05:03:54PM +0100, Jens Axboe wrote:
> On 2011-12-14 14:43, Avi Kivity wrote:
> > On 12/14/2011 02:25 PM, Marcelo Tosatti wrote:
> >> On Mon, Dec 05, 2011 at 04:48:16PM -0600, Nate Custer wrote:
> >>> Hello,
> >>>
> >>> I am struggling with repeatable full hardware locks when running 8-12 KVM vms. At some point before the hard lock I get a inconsistent lock state warning. An example of this can be found here:
> >>>
> >>> http://pastebin.com/8wKhgE2C
> >>>
> >>> After that the server continues to run for a while and then starts its death spiral. When it reaches that point it fails to log anything further to the disk, but by attaching a console I have been able to get a stack trace documenting the final implosion:
> >>>
> >>> http://pastebin.com/PbcN76bd
> >>>
> >>> All of the cores end up hung and the server stops responding to all input, including SysRq commands. 
> >>>
> >>> I have seen this behavior on two machines (dual E5606 running Fedora 16) both passed cpuburnin testing and memtest86 scans without error. 
> >>>
> >>> I have reproduced the crash and stack traces from a Fedora debugging kernel - 3.1.2-1 and with a vanilla 3.1.4 kernel.
> >>
> >> Busted hardware, apparently. Can you reproduce these issues with the
> >> same workload on different hardware?
> > 
> > I don't think it's hardware related.  The second trace (in the first
> > paste) is called during swap, so GFP_FS is set.  The first one is not,
> > so GFP_FS is clear.  Lockdep is worried about the following scenario:
> > 
> >   acpi_early_init() is called
> >   calls pcpu_alloc(), which takes pcpu_alloc_mutex
> >   eventually, calls kmalloc(), or some other allocation function
> >   no memory, so swap
> >   call try_to_free_pages()
> >   submit_bio()
> >   blk_throtl_bio()
> >   blkio_alloc_blkg_stats()
> >   alloc_percpu()
> >   pcpu_alloc(), which takes pcpu_alloc_mutex
> >   deadlock
> > 
> > It's a little unlikely that acpi_early_init() will OOM, but lockdep
> > doesn't know that.  Other callers of pcpu_alloc() could trigger the same
> > thing.
> > 
> > When lockdep says
> > 
> > [ 5839.924953] other info that might help us debug this:
> > [ 5839.925396]  Possible unsafe locking scenario:
> > [ 5839.925397]
> > [ 5839.925840]        CPU0
> > [ 5839.926063]        ----
> > [ 5839.926287]   lock(pcpu_alloc_mutex);
> > [ 5839.926533]   <Interrupt>
> > [ 5839.926756]     lock(pcpu_alloc_mutex);
> > [ 5839.926986]
> > 
> > It really means
> > 
> >    <swap, set GFP_FS>
> > 
> > GFP_FS simply marks the beginning of a nested, unrelated context that
> > uses the same thread, just like an interrupt.  Kudos to lockdep for
> > catching that.
> > 
> > I think the allocation in blkio_alloc_blkg_stats() should be moved out
> > of the I/O path into some init function. Copying Jens.
> 
> That's completely buggy, basically you end up with a GFP_KERNEL
> allocation from the IO submit path. Vivek, per_cpu data needs to be set
> up at init time. You can't allocate it dynamically off the IO path.

OK, I continued developing the patch, which tries to allocate the per
cpu stats from worker thread context. Here is the patch.

Can the reporter please try out the patch and see if it helps? I am not
sure whether the deadlock was caused by the mutex issue or not, but the
patch should at least get rid of the lockdep warning.

This patch is on top of the for-3.3/core branch of Jens's linux-block
tree. If it does not apply to your kernel version, let me know the
version you are testing with and I will generate another version of the
patch.

If the testing results are good, I will break the patch down into small
pieces and post it as a series separately.

Thanks
Vivek

Allocate per cpu data from worker thread context to avoid the
possibility of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of the "for-3.3/core" branch of the linux-block tree.

---
 block/blk-cgroup.c   |   31 ++++++++++++-
 block/blk-cgroup.h   |    9 +++
 block/blk-throttle.c |  116 ++++++++++++++++++++++++++++++++-----------------
 block/cfq-iosched.c  |  119 ++++++++++++++++++++++++++++++---------------------
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig	2011-12-15 20:31:05.000000000 -0500
+++ block/blk-throttle.c	2011-12-15 20:32:15.000000000 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * Will delayed allocation be visible here for sure? I am relying
+	 * on the fact that after blkg.stats_cpu assignment, we drop
+	 * reference to group using atomic_dec() which should imply
+	 * barrier
+	 */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg) {
+	if (tg->blkg.stats_cpu != NULL)
+		return;
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	throtl_ref_get_tg(tg);
+	if (!schedule_work(&tg->stat_alloc_work)) {
+		/* Work is already scheduled by somebody */
+		throtl_put_tg(tg);
+	}
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct throtl_grp *tg = container_of(work, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+	unsigned long ret;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/* If memory allocation failed, try again */
+		throtl_check_schedule_pcpu_stat_alloc(tg);
+		throtl_put_tg(tg);
+		return;
+	}
+
+	ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+					(unsigned long)stat_ptr);
+
+	if (ret != (unsigned long)stat_ptr) {
+		/* Somebody else got to it first */
+		free_percpu(tg->blkg.stats_cpu);
+	}
+
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
 	tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
 	throtl_add_group_to_td_list(td, tg);
 }
 
-/* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+/* Allocates per cpu stats asynchronously with the help of worker thread */
+static struct throtl_grp *throtl_alloc_tg_async(struct throtl_data *td)
+{
+	struct throtl_grp *tg = NULL;
+
+	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
+	if (!tg)
+		return NULL;
+
+	throtl_init_group(tg);
+	throtl_check_schedule_pcpu_stat_alloc(tg);
+	return tg;
+}
+
+/*
+ * Should be called without queue lock and outside of rcu period as per
+ * cpu alloc will block
+ */
+static struct throtl_grp *throtl_alloc_tg_sync(struct throtl_data *td)
 {
 	struct throtl_grp *tg = NULL;
 	int ret;
@@ -273,7 +341,7 @@ static struct throtl_grp *throtl_alloc_t
 	if (!tg)
 		return NULL;
 
-	ret = blkio_alloc_blkg_stats(&tg->blkg);
+	ret = blkio_alloc_set_blkg_stats(&tg->blkg);
 
 	if (ret) {
 		kfree(tg);
@@ -305,7 +373,7 @@ throtl_grp *throtl_find_tg(struct throtl
 
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
-	struct throtl_grp *tg = NULL, *__tg = NULL;
+	struct throtl_grp *tg = NULL;
 	struct blkio_cgroup *blkcg;
 	struct request_queue *q = td->queue;
 
@@ -321,46 +389,12 @@ static struct throtl_grp * throtl_get_tg
 		return tg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 */
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
-	tg = throtl_alloc_tg(td);
-
-	/* Group allocated and queue is still alive. take the lock */
-	spin_lock_irq(q->queue_lock);
-
-	/* Make sure @q is still alive */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		kfree(tg);
-		return NULL;
-	}
-
-	/*
-	 * Initialize the new group. After sleeping, read the blkcg again.
-	 */
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__tg = throtl_find_tg(td, blkcg);
-
-	if (__tg) {
-		kfree(tg);
-		rcu_read_unlock();
-		return __tg;
-	}
+	tg = throtl_alloc_tg_async(td);
 
 	/* Group allocation failed. Account the IO to root group */
 	if (!tg) {
 		tg = td->root_tg;
+		rcu_read_unlock();
 		return tg;
 	}
 
@@ -1254,7 +1288,7 @@ int blk_throtl_init(struct request_queue
 
 	/* alloc and Init root group. */
 	td->queue = q;
-	tg = throtl_alloc_tg(td);
+	tg = throtl_alloc_tg_sync(td);
 
 	if (!tg) {
 		kfree(td);
Index: block/blk-cgroup.c
===================================================================
--- block/blk-cgroup.c.orig	2011-12-15 20:31:05.000000000 -0500
+++ block/blk-cgroup.c	2011-12-15 20:32:15.000000000 -0500
@@ -400,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -446,6 +449,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -465,9 +471,11 @@ EXPORT_SYMBOL_GPL(blkiocg_update_io_merg
 
 /*
  * This function allocates the per cpu stats for blkio_group. Should be called
- * from sleepable context as alloc_per_cpu() requires that.
+ * from sleepable context as alloc_per_cpu() requires that. percpu alloc does
+ * not take any flags and does GFP_KERNEL allocations. Don't use it from
+ * IO submission path which usually might require GFP_NOIO.
  */
-int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+int blkio_alloc_set_blkg_stats(struct blkio_group *blkg)
 {
 	/* Allocate memory for per cpu stats */
 	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
@@ -475,7 +483,14 @@ int blkio_alloc_blkg_stats(struct blkio_
 		return -ENOMEM;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+EXPORT_SYMBOL_GPL(blkio_alloc_set_blkg_stats);
+
+void* blkio_alloc_blkg_stats_pcpu(void)
+{
+	/* Allocate memory for per cpu stats */
+	return alloc_percpu(struct blkio_group_stats_cpu);
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats_pcpu);
 
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
@@ -551,6 +566,12 @@ static void blkio_reset_stats_cpu(struct
 {
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	/* blkg->stats_cpu can have delayed allocation */
+
+	if (!blkg->stats_cpu)
+		return;
+
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -673,6 +694,10 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	/* blkg->stats_cpu might not have been allocated yet */
+	if (blkg->stats_cpu == NULL)
+		return 0;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
Index: block/blk-cgroup.h
===================================================================
--- block/blk-cgroup.h.orig	2011-12-15 20:31:05.000000000 -0500
+++ block/blk-cgroup.h	2011-12-15 20:32:15.000000000 -0500
@@ -310,7 +310,12 @@ extern struct blkio_cgroup *task_blkio_c
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
-extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
+extern int blkio_alloc_set_blkg_stats(struct blkio_group *blkg);
+extern void* blkio_alloc_blkg_stats_pcpu(void);
+
+#define blkio_cmpxchg_blkg_stats(blkg, oldval, newval) \
+	cmpxchg((unsigned long *)&(blkg)->stats_cpu, oldval, newval);
+
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
@@ -338,7 +343,7 @@ static inline void blkiocg_add_blkio_gro
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid) {}
 
-static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
+static inline int blkio_alloc_set_blkg_stats(struct blkio_group *blkg) { return 0; }
 
 static inline int
 blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
Index: block/cfq-iosched.c
===================================================================
--- block/cfq-iosched.c.orig	2011-12-15 20:31:05.000000000 -0500
+++ block/cfq-iosched.c	2011-12-15 20:32:15.000000000 -0500
@@ -209,7 +209,12 @@ struct cfq_group {
 	struct blkio_group blkg;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	struct hlist_node cfqd_node;
-	int ref;
+	/*
+	 * blkg per cpu stat allocation code will need to put reference
+	 * without having queue lock. Hence keep it atomic.
+	 */
+	atomic_t ref;
+	struct work_struct stat_alloc_work;
 #endif
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
@@ -1012,6 +1017,9 @@ static void cfq_group_served(struct cfq_
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void cfq_put_cfqg(struct cfq_group *cfqg);
+static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg);
+
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
 	if (blkg)
@@ -1054,6 +1062,52 @@ static void cfq_init_add_cfqg_lists(stru
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 }
 
+static inline void cfq_check_schedule_pcpu_stat_alloc(struct cfq_group *cfqg) {
+	if (cfqg->blkg.stats_cpu != NULL)
+		return;
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	cfq_ref_get_cfqg(cfqg);
+	if (!schedule_work(&cfqg->stat_alloc_work)) {
+		/* Work is already scheduled by somebody */
+		cfq_put_cfqg(cfqg);
+	}
+}
+
+/* No locks taken. A reference to cfq_group taken before invocation */
+static void cfqg_stat_alloc_fn(struct work_struct *work)
+{
+	struct cfq_group *cfqg = container_of(work, struct cfq_group,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+	unsigned long ret;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/*
+		 * Memory allocation failed. Try again. Should an upper limit
+		 * be put on number of retries?
+		 */
+		cfq_check_schedule_pcpu_stat_alloc(cfqg);
+		cfq_put_cfqg(cfqg);
+		return;
+	}
+
+	ret = blkio_cmpxchg_blkg_stats(&cfqg->blkg, 0,
+					(unsigned long)stat_ptr);
+
+	if (ret != (unsigned long)stat_ptr) {
+		/* Somebody else got to it first */
+		free_percpu(cfqg->blkg.stats_cpu);
+	}
+
+	cfq_put_cfqg(cfqg);
+}
+
 /*
  * Should be called from sleepable context. No request queue lock as per
  * cpu stats are allocated dynamically and alloc_percpu needs to be called
@@ -1062,7 +1116,7 @@ static void cfq_init_add_cfqg_lists(stru
 static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 {
 	struct cfq_group *cfqg = NULL;
-	int i, j, ret;
+	int i, j;
 	struct cfq_rb_root *st;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1072,6 +1126,7 @@ static struct cfq_group * cfq_alloc_cfqg
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
 	RB_CLEAR_NODE(&cfqg->rb_node);
+	INIT_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 
 	cfqg->ttime.last_end_request = jiffies;
 
@@ -1081,14 +1136,9 @@ static struct cfq_group * cfq_alloc_cfqg
 	 * elevator which will be dropped by either elevator exit
 	 * or cgroup deletion path depending on who is exiting first.
 	 */
-	cfqg->ref = 1;
-
-	ret = blkio_alloc_blkg_stats(&cfqg->blkg);
-	if (ret) {
-		kfree(cfqg);
-		return NULL;
-	}
+	atomic_set(&cfqg->ref, 1);
 
+	cfq_check_schedule_pcpu_stat_alloc(cfqg);
 	return cfqg;
 }
 
@@ -1124,8 +1174,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, str
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	struct blkio_cgroup *blkcg;
-	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
-	struct request_queue *q = cfqd->queue;
+	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
@@ -1135,41 +1184,13 @@ static struct cfq_group *cfq_get_cfqg(st
 		return cfqg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 *
-	 * Not taking any queue reference here and assuming that queue is
-	 * around by the time we return. CFQ queue allocation code does
-	 * the same. It might be racy though.
-	 */
-
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
 	cfqg = cfq_alloc_cfqg(cfqd);
-
-	spin_lock_irq(q->queue_lock);
-
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__cfqg = cfq_find_cfqg(cfqd, blkcg);
-
-	if (__cfqg) {
-		kfree(cfqg);
+	if (!cfqg) {
+		cfqg = &cfqd->root_group;
 		rcu_read_unlock();
-		return __cfqg;
+		return cfqg;
 	}
 
-	if (!cfqg)
-		cfqg = &cfqd->root_group;
-
 	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
 	rcu_read_unlock();
 	return cfqg;
@@ -1177,7 +1198,7 @@ static struct cfq_group *cfq_get_cfqg(st
 
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 {
-	cfqg->ref++;
+	atomic_inc(&cfqg->ref);
 	return cfqg;
 }
 
@@ -1189,7 +1210,7 @@ static void cfq_link_cfqq_cfqg(struct cf
 
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
-	cfqq->cfqg->ref++;
+	atomic_inc(&cfqq->cfqg->ref);
 }
 
 static void cfq_put_cfqg(struct cfq_group *cfqg)
@@ -1197,9 +1218,8 @@ static void cfq_put_cfqg(struct cfq_grou
 	struct cfq_rb_root *st;
 	int i, j;
 
-	BUG_ON(cfqg->ref <= 0);
-	cfqg->ref--;
-	if (cfqg->ref)
+	BUG_ON(atomic_read(&cfqg->ref) <= 0);
+	if (!atomic_dec_and_test(&cfqg->ref))
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
@@ -4025,6 +4045,7 @@ static void *cfq_init_queue(struct reque
 	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+	INIT_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 	/*
 	 * Set root group reference to 2. One reference will be dropped when
 	 * all groups on cfqd->cfqg_list are being deleted during queue exit.
@@ -4032,9 +4053,9 @@ static void *cfq_init_queue(struct reque
 	 * group as it is statically allocated and gets destroyed when
 	 * throtl_data goes away.
 	 */
-	cfqg->ref = 2;
+	atomic_set(&cfqg->ref, 2);
 
-	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
+	if (blkio_alloc_set_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
 		kfree(cfqd);
 		return NULL;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
       [not found]         ` <E73DB38E-AFC5-445D-9E76-DE599B36A814@cpanel.net>
@ 2011-12-16 20:29           ` Vivek Goyal
  2011-12-18 21:25             ` Nate Custer
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-16 20:29 UTC (permalink / raw)
  To: Nate Custer; +Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote:
> 
> On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
> > Ok, I continued to develop on the patch which tries to allocate per cpu
> > stats from worker thread context. Here is the patch.
> > 
> > Can the reporter please try out the patch and see if it helps. I am not
> > sure if deadlock was because of mutex issue or not, but it should help
> > get rid of lockdep warning.
> > 
> > This patch is on top of for-3.3/core branch of jens's linux-block tree.
> > If it does not apply on your kernel version, do let me know the version 
> > you are testing with and I will generate another version of patch.
> > 
> > If testing results are good, I will break down the patch in small pieces
> > and post as a series separately.
> > 
> > Thanks
> > Vivek
> 
> Running on a fedora-16 box with the patch applied to the linux-block tree I still had deadlocks. In fact it seemed to happen much faster and with ligher workloads.
> 
> I was able to get netconsole setup and a full stacktrace is posted here:
> 
> http://pastebin.com/9Rq68exU

Thanks for testing it, Nate. I did some debugging and found out that the
patch is doing a double free on the per cpu pointer, hence the crash you
are running into. I could reproduce the problem on my box; it is just a
matter of doing rmdir on the blkio cgroup.

I had understood the cmpxchg() semantics wrong. I have fixed that now and
see no crashes on directory removal. Can you please give this version a
try?
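
The corrected pattern, stripped down (cmpxchg() returns the previous
value, so our swap succeeded only if that value was 0):

	ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
					(unsigned long)stat_ptr);

	if (ret != 0) {
		/* somebody else installed stats_cpu first; drop our copy */
		free_percpu(stat_ptr);
	}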

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid the possibility
of a deadlock under memory pressure.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

This patch is generated on top of the "for-3.3/core" branch of the linux-block tree.

-v2:
 Fixed the issue of double free on percpu pointer.
---
 block/blk-cgroup.c   |   31 ++++++++++++-
 block/blk-cgroup.h   |    9 +++
 block/blk-throttle.c |  116 ++++++++++++++++++++++++++++++++-----------------
 block/cfq-iosched.c  |  119 ++++++++++++++++++++++++++++++---------------------
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig	2011-12-17 01:49:34.000000000 -0500
+++ block/blk-throttle.c	2011-12-17 02:21:50.000000000 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * Will delayed allocation be visible here for sure? I am relying
+	 * on the fact that after blkg.stats_cpu assignment, we drop
+	 * reference to group using atomic_dec() which should imply
+	 * barrier
+	 */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg) {
+	if (tg->blkg.stats_cpu != NULL)
+		return;
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	throtl_ref_get_tg(tg);
+	if (!schedule_work(&tg->stat_alloc_work)) {
+		/* Work is already scheduled by somebody */
+		throtl_put_tg(tg);
+	}
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct throtl_grp *tg = container_of(work, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+	unsigned long ret;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/* If memory allocation failed, try again */
+		throtl_check_schedule_pcpu_stat_alloc(tg);
+		throtl_put_tg(tg);
+		return;
+	}
+
+	ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+					(unsigned long)stat_ptr);
+
+	if (ret != 0) {
+		/* Somebody else got to it first */
+		free_percpu(stat_ptr);
+	}
+
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
 	tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
 	throtl_add_group_to_td_list(td, tg);
 }
 
-/* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+/* Allocates per cpu stats asynchronously with the help of worker thread */
+static struct throtl_grp *throtl_alloc_tg_async(struct throtl_data *td)
+{
+	struct throtl_grp *tg = NULL;
+
+	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
+	if (!tg)
+		return NULL;
+
+	throtl_init_group(tg);
+	throtl_check_schedule_pcpu_stat_alloc(tg);
+	return tg;
+}
+
+/*
+ * Should be called without queue lock and outside of rcu period as per
+ * cpu alloc will block
+ */
+static struct throtl_grp *throtl_alloc_tg_sync(struct throtl_data *td)
 {
 	struct throtl_grp *tg = NULL;
 	int ret;
@@ -273,7 +341,7 @@ static struct throtl_grp *throtl_alloc_t
 	if (!tg)
 		return NULL;
 
-	ret = blkio_alloc_blkg_stats(&tg->blkg);
+	ret = blkio_alloc_set_blkg_stats(&tg->blkg);
 
 	if (ret) {
 		kfree(tg);
@@ -305,7 +373,7 @@ throtl_grp *throtl_find_tg(struct throtl
 
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
-	struct throtl_grp *tg = NULL, *__tg = NULL;
+	struct throtl_grp *tg = NULL;
 	struct blkio_cgroup *blkcg;
 	struct request_queue *q = td->queue;
 
@@ -321,46 +389,12 @@ static struct throtl_grp * throtl_get_tg
 		return tg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 */
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
-	tg = throtl_alloc_tg(td);
-
-	/* Group allocated and queue is still alive. take the lock */
-	spin_lock_irq(q->queue_lock);
-
-	/* Make sure @q is still alive */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		kfree(tg);
-		return NULL;
-	}
-
-	/*
-	 * Initialize the new group. After sleeping, read the blkcg again.
-	 */
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__tg = throtl_find_tg(td, blkcg);
-
-	if (__tg) {
-		kfree(tg);
-		rcu_read_unlock();
-		return __tg;
-	}
+	tg = throtl_alloc_tg_async(td);
 
 	/* Group allocation failed. Account the IO to root group */
 	if (!tg) {
 		tg = td->root_tg;
+		rcu_read_unlock();
 		return tg;
 	}
 
@@ -1254,7 +1288,7 @@ int blk_throtl_init(struct request_queue
 
 	/* alloc and Init root group. */
 	td->queue = q;
-	tg = throtl_alloc_tg(td);
+	tg = throtl_alloc_tg_sync(td);
 
 	if (!tg) {
 		kfree(td);
Index: block/blk-cgroup.c
===================================================================
--- block/blk-cgroup.c.orig	2011-12-17 01:49:34.000000000 -0500
+++ block/blk-cgroup.c	2011-12-17 01:49:35.000000000 -0500
@@ -400,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -446,6 +449,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -465,9 +471,11 @@ EXPORT_SYMBOL_GPL(blkiocg_update_io_merg
 
 /*
  * This function allocates the per cpu stats for blkio_group. Should be called
- * from sleepable context as alloc_per_cpu() requires that.
+ * from sleepable context as alloc_per_cpu() requires that. percpu alloc does
+ * not take any flags and does GFP_KERNEL allocations. Don't use it from
+ * IO submission path which usually might require GFP_NOIO.
  */
-int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+int blkio_alloc_set_blkg_stats(struct blkio_group *blkg)
 {
 	/* Allocate memory for per cpu stats */
 	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
@@ -475,7 +483,14 @@ int blkio_alloc_blkg_stats(struct blkio_
 		return -ENOMEM;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+EXPORT_SYMBOL_GPL(blkio_alloc_set_blkg_stats);
+
+void* blkio_alloc_blkg_stats_pcpu(void)
+{
+	/* Allocate memory for per cpu stats */
+	return alloc_percpu(struct blkio_group_stats_cpu);
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats_pcpu);
 
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
@@ -551,6 +566,12 @@ static void blkio_reset_stats_cpu(struct
 {
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	/* blkg->stats_cpu can have delayed allocation */
+
+	if (!blkg->stats_cpu)
+		return;
+
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -673,6 +694,10 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	/* blkg->stats_cpu might not have been allocated yet */
+	if (blkg->stats_cpu == NULL)
+		return 0;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
Index: block/blk-cgroup.h
===================================================================
--- block/blk-cgroup.h.orig	2011-12-17 01:49:34.000000000 -0500
+++ block/blk-cgroup.h	2011-12-17 01:49:35.000000000 -0500
@@ -310,7 +310,12 @@ extern struct blkio_cgroup *task_blkio_c
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
-extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
+extern int blkio_alloc_set_blkg_stats(struct blkio_group *blkg);
+extern void* blkio_alloc_blkg_stats_pcpu(void);
+
+#define blkio_cmpxchg_blkg_stats(blkg, oldval, newval) \
+	cmpxchg((unsigned long *)&(blkg)->stats_cpu, oldval, newval);
+
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
@@ -338,7 +343,7 @@ static inline void blkiocg_add_blkio_gro
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid) {}
 
-static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
+static inline int blkio_alloc_set_blkg_stats(struct blkio_group *blkg) { return 0; }
 
 static inline int
 blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
Index: block/cfq-iosched.c
===================================================================
--- block/cfq-iosched.c.orig	2011-12-17 01:49:34.000000000 -0500
+++ block/cfq-iosched.c	2011-12-17 02:21:50.000000000 -0500
@@ -209,7 +209,12 @@ struct cfq_group {
 	struct blkio_group blkg;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	struct hlist_node cfqd_node;
-	int ref;
+	/*
+	 * blkg per cpu stat allocation code will need to put reference
+	 * without having queue lock. Hence keep it atomic.
+	 */
+	atomic_t ref;
+	struct work_struct stat_alloc_work;
 #endif
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
@@ -1012,6 +1017,9 @@ static void cfq_group_served(struct cfq_
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void cfq_put_cfqg(struct cfq_group *cfqg);
+static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg);
+
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
 	if (blkg)
@@ -1054,6 +1062,52 @@ static void cfq_init_add_cfqg_lists(stru
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 }
 
+static inline void cfq_check_schedule_pcpu_stat_alloc(struct cfq_group *cfqg) {
+	if (cfqg->blkg.stats_cpu != NULL)
+		return;
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	cfq_ref_get_cfqg(cfqg);
+	if (!schedule_work(&cfqg->stat_alloc_work)) {
+		/* Work is already scheduled by somebody */
+		cfq_put_cfqg(cfqg);
+	}
+}
+
+/* No locks taken. A reference to cfq_group taken before invocation */
+static void cfqg_stat_alloc_fn(struct work_struct *work)
+{
+	struct cfq_group *cfqg = container_of(work, struct cfq_group,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+	unsigned long ret;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/*
+		 * Memory allocation failed. Try again. Should an upper limit
+		 * be put on number of retries?
+		 */
+		cfq_check_schedule_pcpu_stat_alloc(cfqg);
+		cfq_put_cfqg(cfqg);
+		return;
+	}
+
+	ret = blkio_cmpxchg_blkg_stats(&cfqg->blkg, 0,
+					(unsigned long)stat_ptr);
+
+	if (ret != 0) {
+		/* Somebody else got to it first */
+		free_percpu(stat_ptr);
+	}
+
+	cfq_put_cfqg(cfqg);
+}
+
 /*
  * Should be called from sleepable context. No request queue lock as per
  * cpu stats are allocated dynamically and alloc_percpu needs to be called
@@ -1062,7 +1116,7 @@ static void cfq_init_add_cfqg_lists(stru
 static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 {
 	struct cfq_group *cfqg = NULL;
-	int i, j, ret;
+	int i, j;
 	struct cfq_rb_root *st;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1072,6 +1126,7 @@ static struct cfq_group * cfq_alloc_cfqg
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
 	RB_CLEAR_NODE(&cfqg->rb_node);
+	INIT_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 
 	cfqg->ttime.last_end_request = jiffies;
 
@@ -1081,14 +1136,9 @@ static struct cfq_group * cfq_alloc_cfqg
 	 * elevator which will be dropped by either elevator exit
 	 * or cgroup deletion path depending on who is exiting first.
 	 */
-	cfqg->ref = 1;
-
-	ret = blkio_alloc_blkg_stats(&cfqg->blkg);
-	if (ret) {
-		kfree(cfqg);
-		return NULL;
-	}
+	atomic_set(&cfqg->ref, 1);
 
+	cfq_check_schedule_pcpu_stat_alloc(cfqg);
 	return cfqg;
 }
 
@@ -1124,8 +1174,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, str
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	struct blkio_cgroup *blkcg;
-	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
-	struct request_queue *q = cfqd->queue;
+	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
@@ -1135,41 +1184,13 @@ static struct cfq_group *cfq_get_cfqg(st
 		return cfqg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 *
-	 * Not taking any queue reference here and assuming that queue is
-	 * around by the time we return. CFQ queue allocation code does
-	 * the same. It might be racy though.
-	 */
-
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
 	cfqg = cfq_alloc_cfqg(cfqd);
-
-	spin_lock_irq(q->queue_lock);
-
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__cfqg = cfq_find_cfqg(cfqd, blkcg);
-
-	if (__cfqg) {
-		kfree(cfqg);
+	if (!cfqg) {
+		cfqg = &cfqd->root_group;
 		rcu_read_unlock();
-		return __cfqg;
+		return cfqg;
 	}
 
-	if (!cfqg)
-		cfqg = &cfqd->root_group;
-
 	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
 	rcu_read_unlock();
 	return cfqg;
@@ -1177,7 +1198,7 @@ static struct cfq_group *cfq_get_cfqg(st
 
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 {
-	cfqg->ref++;
+	atomic_inc(&cfqg->ref);
 	return cfqg;
 }
 
@@ -1189,7 +1210,7 @@ static void cfq_link_cfqq_cfqg(struct cf
 
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
-	cfqq->cfqg->ref++;
+	atomic_inc(&cfqq->cfqg->ref);
 }
 
 static void cfq_put_cfqg(struct cfq_group *cfqg)
@@ -1197,9 +1218,8 @@ static void cfq_put_cfqg(struct cfq_grou
 	struct cfq_rb_root *st;
 	int i, j;
 
-	BUG_ON(cfqg->ref <= 0);
-	cfqg->ref--;
-	if (cfqg->ref)
+	BUG_ON(atomic_read(&cfqg->ref) <= 0);
+	if (!atomic_dec_and_test(&cfqg->ref))
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
@@ -4025,6 +4045,7 @@ static void *cfq_init_queue(struct reque
 	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+	INIT_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 	/*
 	 * Set root group reference to 2. One reference will be dropped when
 	 * all groups on cfqd->cfqg_list are being deleted during queue exit.
@@ -4032,9 +4053,9 @@ static void *cfq_init_queue(struct reque
 	 * group as it is statically allocated and gets destroyed when
 	 * throtl_data goes away.
 	 */
-	cfqg->ref = 2;
+	atomic_set(&cfqg->ref, 2);
 
-	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
+	if (blkio_alloc_set_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
 		kfree(cfqd);
 		return NULL;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-16 20:29           ` Vivek Goyal
@ 2011-12-18 21:25             ` Nate Custer
  2011-12-19 13:40               ` Vivek Goyal
  2011-12-19 17:27               ` Vivek Goyal
  0 siblings, 2 replies; 28+ messages in thread
From: Nate Custer @ 2011-12-18 21:25 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel


On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> Thanks for testing it Nate. I did some debugging and found out that patch
> is doing double free on per cpu pointer hence the crash you are running
> into. I could reproduce this problem on my box. It is just a matter of
> doing rmdir on the blkio cgroup.
> 
> I understood the cmpxchg() semantics wrong. I have fixed it now and
> no crashes on directory removal. Can you please give this version a
> try.
> 
> Thanks
> Vivek

After 24 hours of stress testing the machine remains up and working without issue. I will continue to test it, but am reasonably confident that this patch resolves my issue.

Nate Custer

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-18 21:25             ` Nate Custer
@ 2011-12-19 13:40               ` Vivek Goyal
  2011-12-19 17:27               ` Vivek Goyal
  1 sibling, 0 replies; 28+ messages in thread
From: Vivek Goyal @ 2011-12-19 13:40 UTC (permalink / raw)
  To: Nate Custer; +Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without issue. I will continue to test it, but am reasonably confident that this patch resolves my issue.

That's good to know. While you continue to stress test, I will make some
minor modifications to the patch. (I am thinking of delaying the retry of
the per cpu memory allocation in case previous attempts fail.)
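
Something along these lines in the worker, assuming stat_alloc_work is
turned into a delayed_work (just a sketch; reference handling omitted):

	stat_ptr = blkio_alloc_blkg_stats_pcpu();
	if (stat_ptr == NULL) {
		/* allocation failed; re-arm and retry in about a second */
		queue_delayed_work(system_nrt_wq, &tg->stat_alloc_work, HZ);
		return;
	}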

Thanks
Vivek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-18 21:25             ` Nate Custer
  2011-12-19 13:40               ` Vivek Goyal
@ 2011-12-19 17:27               ` Vivek Goyal
  2011-12-19 17:35                 ` Tejun Heo
  1 sibling, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-19 17:27 UTC (permalink / raw)
  To: Nate Custer
  Cc: Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel, Tejun Heo

On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> 
> On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > Thanks for testing it Nate. I did some debugging and found out that patch
> > is doing double free on per cpu pointer hence the crash you are running
> > into. I could reproduce this problem on my box. It is just a matter of
> > doing rmdir on the blkio cgroup.
> > 
> > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > no crashes on directory removal. Can you please give this version a
> > try.
> > 
> > Thanks
> > Vivek
> 
> After 24 hours of stress testing the machine remains up and working without issue. I will continue to test it, but am reasonably confident that this patch resolves my issue.
> 

Hi Nate,

I have come up with the final version of the patch. This time I have used
a non-reentrant work queue to queue the stat alloc work. This also gets
rid of the cmpxchg code, as there is only one writer at a time. There are
a couple of other small cleanups.
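
Since the non-reentrant workqueue guarantees that the work item never runs
on two CPUs at the same time, the worker is the sole writer of
blkg->stats_cpu and a plain store is enough (sketch only; the real hunks
are below):

	stat_ptr = blkio_alloc_blkg_stats_pcpu();
	if (stat_ptr)
		/* single writer, so no cmpxchg needed */
		blkio_set_blkg_pcpu_stats(&tg->blkg, stat_ptr);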

Can you please give this patch a try as well, to make sure I have not
broken something while making these changes?

This version is based on 3.2-rc6. Once you confirm the results, I will
rebase it on top of the linux-block "for-3.3/core" branch and post it to
Jens for inclusion.

Thanks
Vivek

Block cgroup currently allocates percpu data for some stats. This
allocation happens in the IO submission path, which can recurse into the
IO stack.

The percpu allocation API does not take any allocation flags as input;
hence, to avoid the deadlock problem under memory pressure, alloc the per
cpu data from a worker thread context.

The only side effect of the delayed allocation is that we will lose the
blkio cgroup stats for that group for a small duration.

In case the per cpu memory allocation fails, the worker thread re-arms
the work with a delay of 1 second.

This patch is generated on top of 3.2-rc6

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reported-by: Nate Custer <nate@cpanel.net>
Tested-by: Nate Custer <nate@cpanel.net>
---
 block/blk-cgroup.c   |   31 ++++++++++++-
 block/blk-cgroup.h   |   11 +++-
 block/blk-throttle.c |  115 ++++++++++++++++++++++++++++++++-------------------
 block/cfq-iosched.c  |  111 +++++++++++++++++++++++++++----------------------
 4 files changed, 173 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===================================================================
--- block/blk-throttle.c.orig	2011-12-19 23:10:53.000000000 -0500
+++ block/blk-throttle.c	2011-12-19 23:11:14.000000000 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
 	int limits_changed;
 
 	struct rcu_head rcu_head;
+	struct delayed_work stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,12 @@ static void throtl_free_tg(struct rcu_he
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+
+	/*
+	 * delayed per cpu allocation should be visible here due to
+	 * barrier guarantees offered by atomic_dec_and_test() called
+	 * during dropping reference.
+	 */
 	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
@@ -181,6 +188,48 @@ static void throtl_put_tg(struct throtl_
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_schedule_pcpu_stat_alloc(struct throtl_grp *tg,
+				unsigned long delay) {
+	WARN_ON(tg->blkg.stats_cpu != NULL);
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	throtl_ref_get_tg(tg);
+
+	/* queue work on non-rentrant work queue */
+	queue_delayed_work(system_nrt_wq, &tg->stat_alloc_work, delay);
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct throtl_grp *tg = container_of(dwork, struct throtl_grp,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/*
+		 * If memory allocation failed, try again with delay of 1s
+		 * FIXME: Put an upper limit on number of retries?
+		 */
+		throtl_schedule_pcpu_stat_alloc(tg, HZ);
+		throtl_put_tg(tg);
+		return;
+	}
+
+	/*
+	 * There is only one writer at a time and that is worker thread. Hence
+	 * no synchronization should be required.
+	 */
+	blkio_set_blkg_pcpu_stats(&tg->blkg, stat_ptr);
+	throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
 	INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +237,7 @@ static void throtl_init_group(struct thr
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
+	INIT_DELAYED_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
 	/* Practically unlimited BW */
 	tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +313,25 @@ static void throtl_init_add_tg_lists(str
 	throtl_add_group_to_td_list(td, tg);
 }
 
-/* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+/* Allocates per cpu stats asynchronously with the help of worker thread */
+static struct throtl_grp *throtl_alloc_tg_async(struct throtl_data *td)
+{
+	struct throtl_grp *tg = NULL;
+
+	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
+	if (!tg)
+		return NULL;
+
+	throtl_init_group(tg);
+	throtl_schedule_pcpu_stat_alloc(tg, 0);
+	return tg;
+}
+
+/*
+ * Should be called without queue lock and outside of rcu period as per
+ * cpu alloc will block
+ */
+static struct throtl_grp *throtl_alloc_tg_sync(struct throtl_data *td)
 {
 	struct throtl_grp *tg = NULL;
 	int ret;
@@ -273,7 +340,7 @@ static struct throtl_grp *throtl_alloc_t
 	if (!tg)
 		return NULL;
 
-	ret = blkio_alloc_blkg_stats(&tg->blkg);
+	ret = blkio_alloc_set_blkg_stats(&tg->blkg);
 
 	if (ret) {
 		kfree(tg);
@@ -305,7 +372,7 @@ throtl_grp *throtl_find_tg(struct throtl
 
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
-	struct throtl_grp *tg = NULL, *__tg = NULL;
+	struct throtl_grp *tg = NULL;
 	struct blkio_cgroup *blkcg;
 	struct request_queue *q = td->queue;
 
@@ -321,46 +388,12 @@ static struct throtl_grp * throtl_get_tg
 		return tg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 */
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
-	tg = throtl_alloc_tg(td);
-
-	/* Group allocated and queue is still alive. take the lock */
-	spin_lock_irq(q->queue_lock);
-
-	/* Make sure @q is still alive */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		kfree(tg);
-		return NULL;
-	}
-
-	/*
-	 * Initialize the new group. After sleeping, read the blkcg again.
-	 */
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__tg = throtl_find_tg(td, blkcg);
-
-	if (__tg) {
-		kfree(tg);
-		rcu_read_unlock();
-		return __tg;
-	}
+	tg = throtl_alloc_tg_async(td);
 
 	/* Group allocation failed. Account the IO to root group */
 	if (!tg) {
 		tg = td->root_tg;
+		rcu_read_unlock();
 		return tg;
 	}
 
@@ -1254,7 +1287,7 @@ int blk_throtl_init(struct request_queue
 
 	/* alloc and Init root group. */
 	td->queue = q;
-	tg = throtl_alloc_tg(td);
+	tg = throtl_alloc_tg_sync(td);
 
 	if (!tg) {
 		kfree(td);
Index: block/blk-cgroup.c
===================================================================
--- block/blk-cgroup.c.orig	2011-12-19 23:10:53.000000000 -0500
+++ block/blk-cgroup.c	2011-12-19 23:11:14.000000000 -0500
@@ -400,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -446,6 +449,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (blkg->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -465,9 +471,11 @@ EXPORT_SYMBOL_GPL(blkiocg_update_io_merg
 
 /*
  * This function allocates the per cpu stats for blkio_group. Should be called
- * from sleepable context as alloc_per_cpu() requires that.
+ * from sleepable context as alloc_per_cpu() requires that. percpu alloc does
+ * not take any flags and does GFP_KERNEL allocations. Don't use it from
+ * IO submission path which usually might require GFP_NOIO.
  */
-int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+int blkio_alloc_set_blkg_stats(struct blkio_group *blkg)
 {
 	/* Allocate memory for per cpu stats */
 	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
@@ -475,7 +483,14 @@ int blkio_alloc_blkg_stats(struct blkio_
 		return -ENOMEM;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+EXPORT_SYMBOL_GPL(blkio_alloc_set_blkg_stats);
+
+void* blkio_alloc_blkg_stats_pcpu(void)
+{
+	/* Allocate memory for per cpu stats */
+	return alloc_percpu(struct blkio_group_stats_cpu);
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats_pcpu);
 
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
@@ -551,6 +566,12 @@ static void blkio_reset_stats_cpu(struct
 {
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	/* blkg->stats_cpu can have delayed allocation */
+
+	if (!blkg->stats_cpu)
+		return;
+
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -673,6 +694,10 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	/* blkg->stats_cpu might not have been allocated yet */
+	if (blkg->stats_cpu == NULL)
+		return 0;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
Index: block/blk-cgroup.h
===================================================================
--- block/blk-cgroup.h.orig	2011-12-19 23:06:11.000000000 -0500
+++ block/blk-cgroup.h	2011-12-19 23:11:14.000000000 -0500
@@ -310,7 +310,14 @@ extern struct blkio_cgroup *task_blkio_c
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
-extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
+extern int blkio_alloc_set_blkg_stats(struct blkio_group *blkg);
+extern void* blkio_alloc_blkg_stats_pcpu(void);
+
+static inline void blkio_set_blkg_pcpu_stats(struct blkio_group *blkg, void *ptr)
+{
+	blkg->stats_cpu = ptr;
+}
+
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
@@ -338,7 +345,7 @@ static inline void blkiocg_add_blkio_gro
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid) {}
 
-static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
+static inline int blkio_alloc_set_blkg_stats(struct blkio_group *blkg) { return 0; }
 
 static inline int
 blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
Index: block/cfq-iosched.c
===================================================================
--- block/cfq-iosched.c.orig	2011-12-19 23:10:53.000000000 -0500
+++ block/cfq-iosched.c	2011-12-19 23:11:14.000000000 -0500
@@ -209,7 +209,12 @@ struct cfq_group {
 	struct blkio_group blkg;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	struct hlist_node cfqd_node;
-	int ref;
+	/*
+	 * blkg per cpu stat allocation code will need to put reference
+	 * without having queue lock. Hence keep it atomic.
+	 */
+	atomic_t ref;
+	struct delayed_work stat_alloc_work;
 #endif
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
@@ -1012,6 +1017,9 @@ static void cfq_group_served(struct cfq_
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void cfq_put_cfqg(struct cfq_group *cfqg);
+static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg);
+
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
 	if (blkg)
@@ -1054,6 +1062,44 @@ static void cfq_init_add_cfqg_lists(stru
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 }
 
+static inline void cfq_schedule_pcpu_stat_alloc(struct cfq_group *cfqg,
+				unsigned long delay) {
+	WARN_ON(cfqg->blkg.stats_cpu != NULL);
+
+	/*
+	 * Schedule the group per cpu stat allocation through worker
+	 * thread
+	 */
+	cfq_ref_get_cfqg(cfqg);
+
+	/* queue work on non-rentrant work queue */
+	queue_delayed_work(system_nrt_wq, &cfqg->stat_alloc_work, delay);
+}
+
+/* No locks taken. A reference to cfq_group taken before invocation */
+static void cfqg_stat_alloc_fn(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct cfq_group *cfqg = container_of(dwork, struct cfq_group,
+						stat_alloc_work);
+	void *stat_ptr = NULL;
+
+	stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+	if (stat_ptr == NULL) {
+		/*
+		 * Memory allocation failed. Try again. Should an upper limit
+		 * be put on number of retries?
+		 */
+		cfq_schedule_pcpu_stat_alloc(cfqg, HZ);
+		cfq_put_cfqg(cfqg);
+		return;
+	}
+
+	blkio_set_blkg_pcpu_stats(&cfqg->blkg, stat_ptr);
+	cfq_put_cfqg(cfqg);
+}
+
 /*
  * Should be called from sleepable context. No request queue lock as per
  * cpu stats are allocated dynamically and alloc_percpu needs to be called
@@ -1062,7 +1108,7 @@ static void cfq_init_add_cfqg_lists(stru
 static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 {
 	struct cfq_group *cfqg = NULL;
-	int i, j, ret;
+	int i, j;
 	struct cfq_rb_root *st;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1072,6 +1118,7 @@ static struct cfq_group * cfq_alloc_cfqg
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
 	RB_CLEAR_NODE(&cfqg->rb_node);
+	INIT_DELAYED_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 
 	cfqg->ttime.last_end_request = jiffies;
 
@@ -1081,14 +1128,9 @@ static struct cfq_group * cfq_alloc_cfqg
 	 * elevator which will be dropped by either elevator exit
 	 * or cgroup deletion path depending on who is exiting first.
 	 */
-	cfqg->ref = 1;
-
-	ret = blkio_alloc_blkg_stats(&cfqg->blkg);
-	if (ret) {
-		kfree(cfqg);
-		return NULL;
-	}
+	atomic_set(&cfqg->ref, 1);
 
+	cfq_schedule_pcpu_stat_alloc(cfqg, 0);
 	return cfqg;
 }
 
@@ -1124,8 +1166,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, str
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	struct blkio_cgroup *blkcg;
-	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
-	struct request_queue *q = cfqd->queue;
+	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
@@ -1135,41 +1176,13 @@ static struct cfq_group *cfq_get_cfqg(st
 		return cfqg;
 	}
 
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 *
-	 * Not taking any queue reference here and assuming that queue is
-	 * around by the time we return. CFQ queue allocation code does
-	 * the same. It might be racy though.
-	 */
-
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
 	cfqg = cfq_alloc_cfqg(cfqd);
-
-	spin_lock_irq(q->queue_lock);
-
-	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__cfqg = cfq_find_cfqg(cfqd, blkcg);
-
-	if (__cfqg) {
-		kfree(cfqg);
+	if (!cfqg) {
+		cfqg = &cfqd->root_group;
 		rcu_read_unlock();
-		return __cfqg;
+		return cfqg;
 	}
 
-	if (!cfqg)
-		cfqg = &cfqd->root_group;
-
 	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
 	rcu_read_unlock();
 	return cfqg;
@@ -1177,7 +1190,7 @@ static struct cfq_group *cfq_get_cfqg(st
 
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
 {
-	cfqg->ref++;
+	atomic_inc(&cfqg->ref);
 	return cfqg;
 }
 
@@ -1189,7 +1202,7 @@ static void cfq_link_cfqq_cfqg(struct cf
 
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
-	cfqq->cfqg->ref++;
+	atomic_inc(&cfqq->cfqg->ref);
 }
 
 static void cfq_put_cfqg(struct cfq_group *cfqg)
@@ -1197,9 +1210,8 @@ static void cfq_put_cfqg(struct cfq_grou
 	struct cfq_rb_root *st;
 	int i, j;
 
-	BUG_ON(cfqg->ref <= 0);
-	cfqg->ref--;
-	if (cfqg->ref)
+	BUG_ON(atomic_read(&cfqg->ref) <= 0);
+	if (!atomic_dec_and_test(&cfqg->ref))
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
@@ -4032,6 +4044,7 @@ static void *cfq_init_queue(struct reque
 	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
+	INIT_DELAYED_WORK(&cfqg->stat_alloc_work, cfqg_stat_alloc_fn);
 	/*
 	 * Set root group reference to 2. One reference will be dropped when
 	 * all groups on cfqd->cfqg_list are being deleted during queue exit.
@@ -4039,9 +4052,9 @@ static void *cfq_init_queue(struct reque
 	 * group as it is statically allocated and gets destroyed when
 	 * throtl_data goes away.
 	 */
-	cfqg->ref = 2;
+	atomic_set(&cfqg->ref, 2);
 
-	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
+	if (blkio_alloc_set_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
 
 		spin_lock(&cic_index_lock);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-19 17:27               ` Vivek Goyal
@ 2011-12-19 17:35                 ` Tejun Heo
  2011-12-19 18:27                   ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2011-12-19 17:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nate Custer, Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > 
> > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > Thanks for testing it Nate. I did some debugging and found out that patch
> > > is doing double free on per cpu pointer hence the crash you are running
> > > into. I could reproduce this problem on my box. It is just a matter of
> > > doing rmdir on the blkio cgroup.
> > > 
> > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > no crashes on directory removal. Can you please give this version a
> > > try.
> > > 
> > > Thanks
> > > Vivek
> > 
> > After 24 hours of stress testing the machine remains up and working without issue. I will continue to test it, but am reasonably confident that this patch resolves my issue.
> > 
> 
> Hi Nate,
> 
> I have come up with final version of the patch. This time I have used non
> rentrant work queue to queue the stat alloc work. This also gets rid of
> cmpxchg code as there is only one writer at a time. There are couple of
> other small cleanups.
> 
> Can you please give patch also a try to make sure I have not broken
> something while doing changes.
> 
> This version is based on 3.2-rc6. Once you confirm the results, I will
> rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> inclusion.
> 
> Thanks
> Vivek
> 
> Block cgroup currently allocates percpu data for some stats. This allocation
> is happening in IO submission path which can recurse in IO stack.
> 
> Percpu allocation API does not take any allocation flags as input, hence
> to avoid the deadlock problem under memory pressure, alloc per cpu data
> from a worker thread context.
> 
> Only side affect of delayed allocation is that we will lose the blkio cgroup
> stats for that group a small duration.
> 
> In case per cpu memory allocation fails, worker thread re-arms the work
> with a delay of 1 second.
> 
> This patch is generated on top of 3.2-rc6
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Reported-by: Nate Custer <nate@cpanel.net>
> Tested-by: Nate Custer <nate@cpanel.net>

Hmmm... I really don't like this approach.  It's so unnecessarily
complex, with extra refcounting and all, when about the same thing can
be achieved by implementing a simple mempool which is filled
asynchronously.  Also, the fix seems way too invasive even for -rc6,
let alone -stable.  If reverting isn't gonna be invasive, maybe that's
a better approach for now?

I've been thinking about it and I think the use case is legit, and
maybe making the percpu allocator support that isn't such a bad idea.
I'm not completely sure yet tho.  I'll give it a try later today.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-19 17:35                 ` Tejun Heo
@ 2011-12-19 18:27                   ` Vivek Goyal
  2011-12-19 22:56                     ` Tejun Heo
  2011-12-20 12:49                     ` Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Vivek Goyal @ 2011-12-19 18:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nate Custer, Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On Mon, Dec 19, 2011 at 09:35:19AM -0800, Tejun Heo wrote:
> On Mon, Dec 19, 2011 at 12:27:17PM -0500, Vivek Goyal wrote:
> > On Sun, Dec 18, 2011 at 03:25:48PM -0600, Nate Custer wrote:
> > > 
> > > On Dec 16, 2011, at 2:29 PM, Vivek Goyal wrote:
> > > > Thanks for testing it Nate. I did some debugging and found out that patch
> > > > is doing double free on per cpu pointer hence the crash you are running
> > > > into. I could reproduce this problem on my box. It is just a matter of
> > > > doing rmdir on the blkio cgroup.
> > > > 
> > > > I understood the cmpxchg() semantics wrong. I have fixed it now and
> > > > no crashes on directory removal. Can you please give this version a
> > > > try.
> > > > 
> > > > Thanks
> > > > Vivek
> > > 
> > > After 24 hours of stress testing the machine remains up and working without issue. I will continue to test it, but am reasonably confident that this patch resolves my issue.
> > > 
> > 
> > Hi Nate,
> > 
> > I have come up with final version of the patch. This time I have used non
> > rentrant work queue to queue the stat alloc work. This also gets rid of
> > cmpxchg code as there is only one writer at a time. There are couple of
> > other small cleanups.
> > 
> > Can you please give patch also a try to make sure I have not broken
> > something while doing changes.
> > 
> > This version is based on 3.2-rc6. Once you confirm the results, I will
> > rebase it on top of "linux-block for-3.3/core" and post it to Jens for
> > inclusion.
> > 
> > Thanks
> > Vivek
> > 
> > Block cgroup currently allocates percpu data for some stats. This allocation
> > is happening in IO submission path which can recurse in IO stack.
> > 
> > Percpu allocation API does not take any allocation flags as input, hence
> > to avoid the deadlock problem under memory pressure, alloc per cpu data
> > from a worker thread context.
> > 
> > Only side affect of delayed allocation is that we will lose the blkio cgroup
> > stats for that group a small duration.
> > 
> > In case per cpu memory allocation fails, worker thread re-arms the work
> > with a delay of 1 second.
> > 
> > This patch is generated on top of 3.2-rc6
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > Reported-by: Nate Custer <nate@cpanel.net>
> > Tested-by: Nate Custer <nate@cpanel.net>
> 
> Hmmm... I really don't like this approach.  It's so unnecessarily
> complex with extra refcounting and all when about the same thing can
> be achieved by implementing simple mempool which is filled
> asynchronously.  Also, the fix seems way too invasive even for -rc6,
> let alone -stable.  If reverting isn't gonna be invasive, maybe that's
> a better approach for now?

I think reverting the previous series is not going to be simple either. It
had 13 patches.

https://lkml.org/lkml/2011/5/19/560

By making the stats per cpu, I was able to reduce contention on the
request queue lock. Now we would have to bring that lock back.

So to me, neither solution is very clean/simple. Maybe we can take this
patch in as a temporary stopgap measure for 3.2, while you cook a patch
for the per cpu allocator for 3.3 onwards.

This patch looks big, but it is not very complex. Now allocation of the
group happens without dropping the queue lock, and it gets rid of all the
trickery of dropping the lock, reacquiring it, checking whether the queue
is dead, etc.

Some code is related to making the group reference counting atomic again.
(It was atomic in the past, so we just restore that functionality.)

The only core piece is taking a reference on the group and asking a
worker thread to do the allocation. Maybe if I break the patch into
smaller pieces, it will be clearer.

So if I had to choose between reverting the series and taking this patch
as a stopgap measure, I would prefer taking this patch. It can get some
soak time in Jens's tree and, if everything looks good, be pushed for
3.2. If things look good there, then it can be pulled into -stable. So
it will be a while before these changes reach -stable.

> 
> I've been thinking about it and I think the use case is legit and
> maybe making percpu allocator support that isn't such a bad idea.  I'm
> not completely sure yet tho.  I'll give it a try later today.

Ok, that's good to know. If the per cpu allocator can support this use
case, that will be good for 3.3 onwards. This seems to be the right way
to fix the problem.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-19 18:27                   ` Vivek Goyal
@ 2011-12-19 22:56                     ` Tejun Heo
  2011-12-20 14:50                       ` Vivek Goyal
  2011-12-20 12:49                     ` Jens Axboe
  1 sibling, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2011-12-19 22:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nate Custer, Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

Hello, Vivek.

On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> Ok, that's good to know. If per cpu allocator can support this use case,
> it will be good for 3.3 onwards. This seems to be right way to go to fix
> the problem.

Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
interim solution would be making a simplistic mempool so that later,
when percpu can do it, it can be swapped easily.  I really can't see
much benefit in adding refcounting on top of everything just for this.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-19 18:27                   ` Vivek Goyal
  2011-12-19 22:56                     ` Tejun Heo
@ 2011-12-20 12:49                     ` Jens Axboe
  1 sibling, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2011-12-20 12:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Nate Custer, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On 2011-12-19 19:27, Vivek Goyal wrote:
> I think reverting the previous series is not going to be simple either. It
> had 13 patches.
> 
> https://lkml.org/lkml/2011/5/19/560
> 
> By making the stats per-cpu, I was able to reduce contention on the
> request queue lock. Now we shall have to bring the lock back.

That's not going to happen; having that lock in there was a disaster for
small IO on fast devices. It's essentially the limiting factor in
benchmark runs on the RHEL kernels that have it included and enabled...


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-19 22:56                     ` Tejun Heo
@ 2011-12-20 14:50                       ` Vivek Goyal
  2011-12-20 20:45                         ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2011-12-20 14:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nate Custer, Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

On Mon, Dec 19, 2011 at 02:56:35PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Dec 19, 2011 at 01:27:17PM -0500, Vivek Goyal wrote:
> > Ok, that's good to know. If the per-cpu allocator can support this use
> > case, it will be good for 3.3 onwards. This seems to be the right way to
> > go to fix the problem.
> 
> Ummm... if we're gonna make percpu usable w/ GFP_NOIO, the right
> interim solution would be making a simplistic mempool so that later
> when percpu can do it, it can be swapped easily.  I really can't see
> much benefit of adding refcounting on top of everything just for this.

Ok. So are you suggesting that I write a simple mempool-like
implementation of my own for the group and per-cpu data allocation?
Keep a certain number of elements in the cache and trigger a worker
thread to allocate more once the number of elements goes below a
threshold. If the pool has run out of pre-allocated elements, the
allocation will fail and the IO will be accounted to the root group?

I am looking at the mempool implementation (mempool_create()) and it
looks like it is not suitable for my use case. mempool_alloc() will call
into the alloc function I provide and pass the gfp flags, but I can't
implement an alloc function that honors those flags, because the per-cpu
allocator does not take any flags.
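
Just to illustrate the mismatch (this is an example callback, not code
from any patch): a mempool allocation callback has to accept gfp flags,
but alloc_percpu() has no gfp argument and may sleep regardless, so the
callback below cannot actually honor GFP_NOIO:

#include <linux/types.h>
#include <linux/mempool.h>
#include <linux/percpu.h>

static void *percpu_stats_pool_alloc(gfp_t gfp_mask, void *pool_data)
{
	/* alloc_percpu() ignores gfp_mask because it has nowhere to
	 * put it; it may block on vmalloc area setup even for GFP_NOIO. */
	return (void __force *)alloc_percpu(u64);
}

static void percpu_stats_pool_free(void *element, void *pool_data)
{
	free_percpu((void __percpu __force *)element);
}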

So IIUC, the existing mempool implementation is not directly usable for
my requirement, and I need to write some code of my own for a caching
layer which always allocates objects from a reserve and refills the
pool asynchronously with the help of a worker thread.
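
Roughly, what I have in mind is something like the sketch below (purely
illustrative: the names and the fixed-size reserve array are assumptions
for this example, not code from a patch). The IO path only ever takes an
element from the reserve and never sleeps; a worker refills the reserve
from process context:

#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#define STATS_RESERVE_MIN	8

struct stats_pool {
	spinlock_t lock;
	void __percpu *reserve[2 * STATS_RESERVE_MIN];
	int nr;					/* elements currently held */
	struct work_struct refill_work;
};

static void stats_pool_refill_fn(struct work_struct *work)
{
	struct stats_pool *pool = container_of(work, struct stats_pool,
					       refill_work);
	unsigned long flags;

	/* Worker context: alloc_percpu() is free to sleep here. */
	for (;;) {
		void __percpu *p = alloc_percpu(u64);

		if (!p)
			break;

		spin_lock_irqsave(&pool->lock, flags);
		if (pool->nr >= ARRAY_SIZE(pool->reserve)) {
			spin_unlock_irqrestore(&pool->lock, flags);
			free_percpu(p);
			break;
		}
		pool->reserve[pool->nr++] = p;
		spin_unlock_irqrestore(&pool->lock, flags);
	}
}

/*
 * Called from the IO path (queue lock held): never sleeps.  Returns
 * NULL when the reserve is empty; the caller would then account the
 * IO to the root group.
 */
static void __percpu *stats_pool_alloc(struct stats_pool *pool)
{
	void __percpu *p = NULL;
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	if (pool->nr)
		p = pool->reserve[--pool->nr];
	if (pool->nr < STATS_RESERVE_MIN)
		schedule_work(&pool->refill_work);	/* refill async */
	spin_unlock_irqrestore(&pool->lock, flags);

	return p;
}

static void stats_pool_init(struct stats_pool *pool)
{
	spin_lock_init(&pool->lock);
	pool->nr = 0;
	INIT_WORK(&pool->refill_work, stats_pool_refill_fn);
	schedule_work(&pool->refill_work);	/* pre-fill the reserve */
}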

Thanks
Vivek

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
  2011-12-20 14:50                       ` Vivek Goyal
@ 2011-12-20 20:45                         ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2011-12-20 20:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nate Custer, Jens Axboe, Avi Kivity, Marcelo Tosatti, kvm, linux-kernel

Hello,

On Tue, Dec 20, 2011 at 09:50:24AM -0500, Vivek Goyal wrote:
> So IIUC, the existing mempool implementation is not directly usable for
> my requirement, and I need to write some code of my own for a caching
> layer which always allocates objects from a reserve and refills the
> pool asynchronously with the help of a worker thread.

I've been looking at it and I don't think allowing the percpu allocator
to be called from the no-IO path is a good idea.  The on-demand area
filling is tied into vmalloc area management, which in turn is tied to
arch page table code, and I really want to avoid pre-allocating a full
chunk - it can be huge.  I'm trying to extend mempool to cover percpu
areas, so please wait a bit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-12-20 20:45 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2011-12-05 22:48 kvm deadlock Nate Custer
2011-12-14 12:25 ` Marcelo Tosatti
2011-12-14 13:43   ` Avi Kivity
2011-12-14 14:00     ` Marcelo Tosatti
2011-12-14 14:02       ` Avi Kivity
2011-12-14 14:06         ` Marcelo Tosatti
2011-12-14 14:17           ` Nate Custer
2011-12-14 14:20             ` Marcelo Tosatti
2011-12-14 14:28             ` Avi Kivity
2011-12-14 14:27           ` Avi Kivity
2011-12-14 16:03     ` Jens Axboe
2011-12-14 17:03       ` Vivek Goyal
2011-12-14 17:09         ` Jens Axboe
2011-12-14 17:22           ` Vivek Goyal
2011-12-14 18:16             ` Tejun Heo
2011-12-14 18:41               ` Vivek Goyal
2011-12-14 23:06                 ` Vivek Goyal
2011-12-15 19:47       ` [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock) Vivek Goyal
     [not found]         ` <E73DB38E-AFC5-445D-9E76-DE599B36A814@cpanel.net>
2011-12-16 20:29           ` Vivek Goyal
2011-12-18 21:25             ` Nate Custer
2011-12-19 13:40               ` Vivek Goyal
2011-12-19 17:27               ` Vivek Goyal
2011-12-19 17:35                 ` Tejun Heo
2011-12-19 18:27                   ` Vivek Goyal
2011-12-19 22:56                     ` Tejun Heo
2011-12-20 14:50                       ` Vivek Goyal
2011-12-20 20:45                         ` Tejun Heo
2011-12-20 12:49                     ` Jens Axboe
