linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* docker crashes rcuos in __blkg_release_rcu
@ 2014-06-08 22:22 Joe Lawrence
  2014-06-09 17:47 ` Vivek Goyal
  2014-06-10 19:52 ` docker crashes rcuos in __blkg_release_rcu Christoph Lameter
  0 siblings, 2 replies; 12+ messages in thread
From: Joe Lawrence @ 2014-06-08 22:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, Vivek Goyal

Hi Tejun, Vivek,

I came across this crash when attempting to run the 'hello world'
example from the Getting Started section on the docker.io homepage.

Repro kernels:

(upstream linus) 3.15.0
(RHEL7 RC-2)     3.10.0-121.el7.x86_64

To reproduce, boot with slub_debug=FZPU and run the example.

  % # RHEL7 needs docker-io from EPEL
  % yum install http://dl.fedoraproject.org/pub/epel/beta/7/x86_64/epel-release-7-0.1.noarch.rpm
  % rpm -ivh epel-release-7-0.1.noarch.rpm
  % yum install docker-io
  
  % systemctl start docker
  % docker run ubuntu /bin/echo hello world

The host crashes every time with the following stack trace:

general protection fault: 0000 [#1] SMP 
Modules linked in: veth xt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c loop bonding sg x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel igb ixgbe ghash_clmulni_intel aesni_intel nfsd lrw gf128mul glue_helper ablk_helper dm_service_time cryptd pcspkr ptp auth_rpcgss ntb pps_core nfs_acl ses lockd mdio i2c_algo_bit enclosure ipmi_devintf dca ipmi_msghandler i2c_core dm_multipath sunrpc dm_mod ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common sr_mod cdrom qla2xxx mpt3sas mpt2sas scsi_transport_fc usb_storage scsi_tgt raid_class scsi_transport_sas
CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
Stack:
 ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
Call Trace:
 [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
 [<ffffffff810b6a00>] ? abort_exclusive_wait+0xb0/0xb0
 [<ffffffff810d1b40>] ? rcu_start_gp+0x40/0x40
 [<ffffffff81091d81>] kthread+0xe1/0x100
 [<ffffffff81091ca0>] ? kthread_create_on_node+0x1a0/0x1a0
 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81091ca0>] ? kthread_create_on_node+0x1a0/0x1a0
Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 
RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
 RSP <ffff88085403fdf0>

crash> dis -l _raw_spin_lock_irq

kernel/locking/spinlock.c: 166
  <_raw_spin_lock_irq>:        data32 data32 data32 xchg %ax,%ax
  <_raw_spin_lock_irq+0x5>:    push   %rbp
  <_raw_spin_lock_irq+0x6>:    mov    %rsp,%rbp
arch/x86/include/asm/paravirt.h: 814
  <_raw_spin_lock_irq+0x9>:    cli    
  <_raw_spin_lock_irq+0xa>:    data32 xchg %ax,%ax
  <_raw_spin_lock_irq+0xd>:    data32 xchg %ax,%ax
arch/x86/include/asm/spinlock.h: 86
  <_raw_spin_lock_irq+0x10>:   mov    $0x20000,%eax
  <_raw_spin_lock_irq+0x15>:   lock xadd %eax,(%rdi)       <<

arch/x86/include/asm/spinlock.h:

 82 static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 83 {
 84         register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 85 
 86         inc = xadd(&lock->tickets, inc);               <<

.tickets is offset 0 from arch_spinlock_t, so RDI should be the
arch_spinlock_t lock:
RDI: 6b6b6b6b6b6b6b6b

Back up a frame and get bearings...

crash> dis -l __blkg_release_rcu

block/blk-cgroup.c: 402
  <__blkg_release_rcu+0x56>:   cmpq   $0x0,-0x80(%r12)
  <__blkg_release_rcu+0x5c>:   je     0xffffffff812cc001 <__blkg_release_rcu+0xb1>
block/blk-cgroup.c: 403
  <__blkg_release_rcu+0x5e>:   mov    -0xb0(%r12),%rax
include/linux/spinlock.h: 328
  <__blkg_release_rcu+0x66>:   mov    0x460(%rax),%rdi
  <__blkg_release_rcu+0x6d>:   callq  0xffffffff8162e9d0 <_raw_spin_lock_irq>

block/blk-cgroup.c:

 387 void __blkg_release_rcu(struct rcu_head *rcu_head)
 388 {
 ...
 400         /* release the blkcg and parent blkg refs this blkg has been holding */
 401         css_put(&blkg->blkcg->css);
 402         if (blkg->parent) {
 403                 spin_lock_irq(blkg->q->queue_lock);
 404                 blkg_put(blkg->parent);
 405                 spin_unlock_irq(blkg->q->queue_lock);
 406         }

RAX is the struct request_queue*, but has been re-used by
_raw_spin_lock_irq.  How about R12?

crash> struct -o blkcg_gq | grep b0                                                                       
  [0xb0] struct callback_head callback_head;

... and ...

block/blk-cgroup.c: 389
  <__blkg_release_rcu+0xb>:    lea    -0xb0(%rdi),%r13
block/blk-cgroup.c: 388
  <__blkg_release_rcu+0x12>:   push   %r12
  <__blkg_release_rcu+0x14>:   mov    %rdi,%r12

Chances are R12 is struct rcu_head *rcu_head and R13 is struct blkcg_gq*

R13: ffff88103c17a080

crash> p/x 0xffff88103c17a130-0xb0                                                                        
$2 = 0xffff88103c17a080

Yup.

crash> struct blkcg_gq 0xffff88103c17a080 | grep q
struct blkcg_gq {
  q = 0xffff88103fc7df90,

crash> rd 0xffff88103fc7df90 0xee
... all 0x6b's ...

Summary thus far:

R12: ffff88103c17a130 = struct rcu_head *rcu_head 
R13: ffff88103c17a080 = struct blkcg_gq *blkg
     ffff88103fc7df90 = struct request_queue *blkg->q (contains 0x6b
                                                       poison-pattern)

commit 2a4fd070 "blkcg: move bulk of blkcg_gq release operations to the
RCU callback" shuffled around some code in this space, introducing the
the calls to spin_[un]lock_irq(blkg->q->queue_lock).

Tejun -- I still have the vmcore here if you would like further analysis
or test patches you would like me to try.

Vivek -- might slub_debug be a reliable repro for RHBZ-1019584 (closed,
needinfo)?

Regards,

-- Joe

[1] https://www.docker.io/gettingstarted/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1019584

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-08 22:22 docker crashes rcuos in __blkg_release_rcu Joe Lawrence
@ 2014-06-09 17:47 ` Vivek Goyal
  2014-06-09 18:27   ` Vivek Goyal
  2014-06-10 19:52 ` docker crashes rcuos in __blkg_release_rcu Christoph Lameter
  1 sibling, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2014-06-09 17:47 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-kernel, Tejun Heo

On Sun, Jun 08, 2014 at 06:22:00PM -0400, Joe Lawrence wrote:

[..]
> Summary thus far:
> 
> R12: ffff88103c17a130 = struct rcu_head *rcu_head 
> R13: ffff88103c17a080 = struct blkcg_gq *blkg
>      ffff88103fc7df90 = struct request_queue *blkg->q (contains 0x6b
>                                                        poison-pattern)
> 
> commit 2a4fd070 "blkcg: move bulk of blkcg_gq release operations to the
> RCU callback" shuffled around some code in this space, introducing the
> the calls to spin_[un]lock_irq(blkg->q->queue_lock).
> 

Hi Joe,

Thanks for reporting and debugging this issue. So in summary it looks
like that we have freed request queue associated with the blkg and
when blkg is freed later and tries to access spin lock embedded in
request queue, it crashes.

So the question is why request queue is being freed early. Are there any
reference counting issues.

I will spend some more time staring at the code.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-09 17:47 ` Vivek Goyal
@ 2014-06-09 18:27   ` Vivek Goyal
  2014-06-10 18:39     ` Joe Lawrence
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2014-06-09 18:27 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-kernel, Tejun Heo

On Mon, Jun 09, 2014 at 01:47:08PM -0400, Vivek Goyal wrote:
> On Sun, Jun 08, 2014 at 06:22:00PM -0400, Joe Lawrence wrote:
> 
> [..]
> > Summary thus far:
> > 
> > R12: ffff88103c17a130 = struct rcu_head *rcu_head 
> > R13: ffff88103c17a080 = struct blkcg_gq *blkg
> >      ffff88103fc7df90 = struct request_queue *blkg->q (contains 0x6b
> >                                                        poison-pattern)
> > 
> > commit 2a4fd070 "blkcg: move bulk of blkcg_gq release operations to the
> > RCU callback" shuffled around some code in this space, introducing the
> > the calls to spin_[un]lock_irq(blkg->q->queue_lock).
> > 
> 
> Hi Joe,
> 
> Thanks for reporting and debugging this issue. So in summary it looks
> like that we have freed request queue associated with the blkg and
> when blkg is freed later and tries to access spin lock embedded in
> request queue, it crashes.
> 
> So the question is why request queue is being freed early. Are there any
> reference counting issues.

I am wondering if we need to take a reference on the queue
(blk_get_queue()) in blkg_alloc(), to make sure request queue is
still around when blkg is being freed.

I will try to reproduce the issue locally.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-09 18:27   ` Vivek Goyal
@ 2014-06-10 18:39     ` Joe Lawrence
  2014-06-10 19:14       ` Vivek Goyal
  2014-06-11 16:32       ` Vivek Goyal
  0 siblings, 2 replies; 12+ messages in thread
From: Joe Lawrence @ 2014-06-10 18:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, Tejun Heo

On Mon, 9 Jun 2014 14:27:29 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:
> ... snip ...
> So the question is why request queue is being freed early. Are there any
> reference counting issues.  

Hi Vivek,

Thanks for taking a look.  For extra debugging, I wrote a quick set of
kprobes that:

  1 - On blkg_alloc entry, save the request_queue's kobj address in a
      list
  2 - On kobject_put entry, dump the stack if the kobj is found in that
      list

and this was the trace for the final kobject put for the
request_queue before a crash:

JL: kobject_put kobj(queue) @ ffff88084d89c9e8, refcount=1
------------[ cut here ]------------
WARNING: CPU: 27 PID: 11060 at /h/jlawrenc/kprobes/docker/probes_blk.c:166 kret_entry_kobject_put+0x47/0x50 [docker_debug]()
[ ... snip modules ... ]
CPU: 27 PID: 11060 Comm: docker Tainted: G        W  OE 3.15.0 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
 0000000000000000 0000000093cbdc81 ffff88104196fae8 ffffffff8162738d
 0000000000000000 ffff88104196fb20 ffffffff8106d81d ffff88084d89c9e8
 ffff881041912cd0 ffffffffa0181020 ffff88104196fbe0 ffffffffa01810c8
Call Trace:
 [<ffffffff8162738d>] dump_stack+0x45/0x56
 [<ffffffff8106d81d>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff8106d94a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa017f107>] kret_entry_kobject_put+0x47/0x50 [docker_debug]
 [<ffffffff816335ee>] pre_handler_kretprobe+0x9e/0x1c0
 [<ffffffff81635a2f>] opt_pre_handler+0x4f/0x90
 [<ffffffff81631dd7>] optimized_callback+0x97/0xb0
 [<ffffffff812dde01>] ? kobject_put+0x1/0x60
 [<ffffffff812b4561>] ? blk_cleanup_queue+0x101/0x1a0
 [<ffffffffa011114b>] ? __dm_destroy+0x1db/0x260 [dm_mod]
 [<ffffffffa0111f53>] ? dm_destroy+0x13/0x20 [dm_mod]
 [<ffffffffa0117a2e>] ? dev_remove+0x11e/0x180 [dm_mod]
 [<ffffffffa0117910>] ? dev_suspend+0x250/0x250 [dm_mod]
 [<ffffffffa0118105>] ? ctl_ioctl+0x255/0x500 [dm_mod]
 [<ffffffff8118483f>] ? do_wp_page+0x38f/0x750
 [<ffffffffa01183c3>] ? dm_ctl_ioctl+0x13/0x20 [dm_mod]
 [<ffffffff811e1c20>] ? do_vfs_ioctl+0x2e0/0x4a0
 [<ffffffff81277d56>] ? file_has_perm+0xa6/0xb0
 [<ffffffff811e1e61>] ? SyS_ioctl+0x81/0xa0
 [<ffffffff816381e9>] ? system_call_fastpath+0x16/0x1b
---[ end trace b4b8112437afdac8 ]---
 
so I think when dm_destroy() is called, it leads to the request_queue
in question going away.

> I am wondering if we need to take a reference on the queue
> (blk_get_queue()) in blkg_alloc(), to make sure request queue is
> still around when blkg is being freed.

I experimented with this and the crash does go away (and the docker
invocation completes successfully).  I wasn't sure where the
accompanying blk_put_queue() should go.  If I put it in blkg_free, the
kref accounting doesn't seem to even out, ie they never fall to zero.

> I will try to reproduce the issue locally.

Any luck?  I found that slub_debug was required to draw out the crash,
otherwise the use-after-free silently goes about its business.

Hope this helps,

-- Joe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-10 18:39     ` Joe Lawrence
@ 2014-06-10 19:14       ` Vivek Goyal
  2014-06-11 16:32       ` Vivek Goyal
  1 sibling, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2014-06-10 19:14 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-kernel, Tejun Heo

On Tue, Jun 10, 2014 at 02:39:06PM -0400, Joe Lawrence wrote:

[..]
> > I am wondering if we need to take a reference on the queue
> > (blk_get_queue()) in blkg_alloc(), to make sure request queue is
> > still around when blkg is being freed.
> 
> I experimented with this and the crash does go away (and the docker
> invocation completes successfully).  I wasn't sure where the
> accompanying blk_put_queue() should go.  If I put it in blkg_free, the
> kref accounting doesn't seem to even out, ie they never fall to zero.

I think blkg_free() is logical place to put down queue reference. I am
not sure why it does not even out. Will spend some time on this. 

> 
> > I will try to reproduce the issue locally.
> 
> Any luck?  I found that slub_debug was required to draw out the crash,
> otherwise the use-after-free silently goes about its business.

I had tried it yesterday and saw the crash out of 3 attempts. I am
right now busy in something. Give me 1-2 days and I will be able to
spend more time on this problem.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-08 22:22 docker crashes rcuos in __blkg_release_rcu Joe Lawrence
  2014-06-09 17:47 ` Vivek Goyal
@ 2014-06-10 19:52 ` Christoph Lameter
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2014-06-10 19:52 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-kernel, Tejun Heo, Vivek Goyal

On Sun, 8 Jun 2014, Joe Lawrence wrote:

>
> .tickets is offset 0 from arch_spinlock_t, so RDI should be the
> arch_spinlock_t lock:
> RDI: 6b6b6b6b6b6b6b6b

Slub has overwritten the object when it was freed with 0x6b.
So this is an access after free.

It works without debug because the object may still linger around (but
there is no guarantee that the memory has not been reused).


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-10 18:39     ` Joe Lawrence
  2014-06-10 19:14       ` Vivek Goyal
@ 2014-06-11 16:32       ` Vivek Goyal
  2014-06-19 20:26         ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2014-06-11 16:32 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-kernel, Tejun Heo, Cgroups

On Tue, Jun 10, 2014 at 02:39:06PM -0400, Joe Lawrence wrote:
> 
> Hi Vivek,
> 
> Thanks for taking a look.  For extra debugging, I wrote a quick set of
> kprobes that:
> 
>   1 - On blkg_alloc entry, save the request_queue's kobj address in a
>       list
>   2 - On kobject_put entry, dump the stack if the kobj is found in that
>       list
> 
> and this was the trace for the final kobject put for the
> request_queue before a crash:
> 
> JL: kobject_put kobj(queue) @ ffff88084d89c9e8, refcount=1
> ------------[ cut here ]------------
> WARNING: CPU: 27 PID: 11060 at /h/jlawrenc/kprobes/docker/probes_blk.c:166 kret_entry_kobject_put+0x47/0x50 [docker_debug]()
> [ ... snip modules ... ]
> CPU: 27 PID: 11060 Comm: docker Tainted: G        W  OE 3.15.0 #1
> Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
>  0000000000000000 0000000093cbdc81 ffff88104196fae8 ffffffff8162738d
>  0000000000000000 ffff88104196fb20 ffffffff8106d81d ffff88084d89c9e8
>  ffff881041912cd0 ffffffffa0181020 ffff88104196fbe0 ffffffffa01810c8
> Call Trace:
>  [<ffffffff8162738d>] dump_stack+0x45/0x56
>  [<ffffffff8106d81d>] warn_slowpath_common+0x7d/0xa0
>  [<ffffffff8106d94a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffffa017f107>] kret_entry_kobject_put+0x47/0x50 [docker_debug]
>  [<ffffffff816335ee>] pre_handler_kretprobe+0x9e/0x1c0
>  [<ffffffff81635a2f>] opt_pre_handler+0x4f/0x90
>  [<ffffffff81631dd7>] optimized_callback+0x97/0xb0
>  [<ffffffff812dde01>] ? kobject_put+0x1/0x60
>  [<ffffffff812b4561>] ? blk_cleanup_queue+0x101/0x1a0
>  [<ffffffffa011114b>] ? __dm_destroy+0x1db/0x260 [dm_mod]
>  [<ffffffffa0111f53>] ? dm_destroy+0x13/0x20 [dm_mod]
>  [<ffffffffa0117a2e>] ? dev_remove+0x11e/0x180 [dm_mod]
>  [<ffffffffa0117910>] ? dev_suspend+0x250/0x250 [dm_mod]
>  [<ffffffffa0118105>] ? ctl_ioctl+0x255/0x500 [dm_mod]
>  [<ffffffff8118483f>] ? do_wp_page+0x38f/0x750
>  [<ffffffffa01183c3>] ? dm_ctl_ioctl+0x13/0x20 [dm_mod]
>  [<ffffffff811e1c20>] ? do_vfs_ioctl+0x2e0/0x4a0
>  [<ffffffff81277d56>] ? file_has_perm+0xa6/0xb0
>  [<ffffffff811e1e61>] ? SyS_ioctl+0x81/0xa0
>  [<ffffffff816381e9>] ? system_call_fastpath+0x16/0x1b
> ---[ end trace b4b8112437afdac8 ]---
>  
> so I think when dm_destroy() is called, it leads to the request_queue
> in question going away.
> 
> > I am wondering if we need to take a reference on the queue
> > (blk_get_queue()) in blkg_alloc(), to make sure request queue is
> > still around when blkg is being freed.
> 
> I experimented with this and the crash does go away (and the docker
> invocation completes successfully).  I wasn't sure where the
> accompanying blk_put_queue() should go.  If I put it in blkg_free, the
> kref accounting doesn't seem to even out, ie they never fall to zero.

CC cgroups list.

Ok, I think I figured out why reference counting does not seem to even
out.

There are two ways to destroy blkg. Either device goes away and
blk_release_queue() will take care of removing blkg or cgroup is deleted
and that will take care of cleaning up blkg. I think only exception is
root blkg where one can not delete root cgroup so it is cleaned up only
when request queue goes away.

Now if blkg holds a reference to queue, then blk_release_queue() never
gets called. And root blkg can't be cleaned till queue goes away. So
this seems like chicken and egg situation.

Even for non-root blkg, blkg will not be cleaned till cgroup goes away.

Tejun, any thoughts on how to solve this issue. Delaying blkg release
in rcu context and then expecting queue to be still present is causing
this problem.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: docker crashes rcuos in __blkg_release_rcu
  2014-06-11 16:32       ` Vivek Goyal
@ 2014-06-19 20:26         ` Tejun Heo
  2014-06-19 21:42           ` [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2014-06-19 20:26 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Joe Lawrence, linux-kernel, Cgroups

Sorry about the late reply.

On Wed, Jun 11, 2014 at 12:32:29PM -0400, Vivek Goyal wrote:
> Tejun, any thoughts on how to solve this issue. Delaying blkg release
> in rcu context and then expecting queue to be still present is causing
> this problem.

Heh, this is hilarious.  If you look at the comment right above
__blkg_release_rcu(), it says

 * A group is RCU protected, but having an rcu lock does not mean that one
 * can access all the fields of blkg and assume these are valid.  For
 * example, don't try to follow throtl_data and request queue links.
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

And yet the code brazenly derefs the ->q link to access the lock there
and causes oops.  This is from 2a4fd070ee85 ("blkcg: move bulk of
blkcg_gq release operations to the RCU callback").  I stupidly didn't
realize what I was doing even while moving the comment itself.

Well, the obvious solution is making blkg ref an atomic.  I was
planning to convert it to percpu_ref anyway.  We can first convert it
to atomic_t for -stable and then to percpu_ref.  Will prep a patch.

Thanks for tracking it down!

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  2014-06-19 20:26         ` Tejun Heo
@ 2014-06-19 21:42           ` Tejun Heo
  2014-06-20 14:39             ` Vivek Goyal
  2014-06-20 18:50             ` Joe Lawrence
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2014-06-19 21:42 UTC (permalink / raw)
  To: Jens Axboe, Joe Lawrence; +Cc: linux-kernel, Cgroups, Vivek Goyal

Hello,

So, this patch should do.  Joe, Vivek, can one of you guys please
verify that the oops goes away with this patch?

Jens, the original thread can be read at

  http://thread.gmane.org/gmane.linux.kernel/1720729

The fix converts blkg->refcnt from int to atomic_t.  It does some
overhead but it should be minute compared to everything else which is
going on and the involved cacheline bouncing, so I think it's highly
unlikely to cause any noticeable difference.  Also, the refcnt in
question should be converted to a perpcu_ref for blk-mq anyway, so the
atomic_t is likely to go away pretty soon anyway.

Thanks.

------- 8< -------
__blkg_release_rcu() may be invoked after the associated request_queue
is released with a RCU grace period inbetween.  As such, the function
and callbacks invoked from it must not dereference the associated
request_queue.  This is clearly indicated in the comment above the
function.

Unfortunately, while trying to fix a different issue, 2a4fd070ee85
("blkcg: move bulk of blkcg_gq release operations to the RCU
callback") ignored this and added [un]locking of @blkg->q->queue_lock
to __blkg_release_rcu().  This of course can cause oops as the
request_queue may be long gone by the time this code gets executed.

  general protection fault: 0000 [#1] SMP
  CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
  Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
  task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
  RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
  RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
  RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
  RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
  RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
  R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
  R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
  Stack:
   ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
   ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
   ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
  Call Trace:
   [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
   [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
   [<ffffffff81091d81>] kthread+0xe1/0x100
   [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
  Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
  +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
  +b7
  RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
   RSP <ffff88085403fdf0>

The request_queue locking was added because blkcg_gq->refcnt is an int
protected with the queue lock and __blkg_release_rcu() needs to put
the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
dropping queue locking in the function.

Given the general heavy weight of the current request_queue and blkcg
operations, this is unlikely to cause any noticeable overhead.
Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
the near future, so whatever (most likely negligible) overhead it may
add is temporary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
Cc: stable@vger.kernel.org
---
 block/blk-cgroup.c |    7 ++-----
 block/blk-cgroup.h |   17 +++++++----------
 2 files changed, 9 insertions(+), 15 deletions(-)

--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -80,7 +80,7 @@ static struct blkcg_gq *blkg_alloc(struc
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
-	blkg->refcnt = 1;
+	atomic_set(&blkg->refcnt, 1);
 
 	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
 	if (blkcg != &blkcg_root) {
@@ -399,11 +399,8 @@ void __blkg_release_rcu(struct rcu_head
 
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
-	if (blkg->parent) {
-		spin_lock_irq(blkg->q->queue_lock);
+	if (blkg->parent)
 		blkg_put(blkg->parent);
-		spin_unlock_irq(blkg->q->queue_lock);
-	}
 
 	blkg_free(blkg);
 }
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/seq_file.h>
 #include <linux/radix-tree.h>
 #include <linux/blkdev.h>
+#include <linux/atomic.h>
 
 /* Max limits for throttle policy */
 #define THROTL_IOPS_MAX		UINT_MAX
@@ -104,7 +105,7 @@ struct blkcg_gq {
 	struct request_list		rl;
 
 	/* reference count */
-	int				refcnt;
+	atomic_t			refcnt;
 
 	/* is this blkg online? protected by both blkcg and q locks */
 	bool				online;
@@ -257,13 +258,12 @@ static inline int blkg_path(struct blkcg
  * blkg_get - get a blkg reference
  * @blkg: blkg to get
  *
- * The caller should be holding queue_lock and an existing reference.
+ * The caller should be holding an existing reference.
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-	lockdep_assert_held(blkg->q->queue_lock);
-	WARN_ON_ONCE(!blkg->refcnt);
-	blkg->refcnt++;
+	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
+	atomic_inc(&blkg->refcnt);
 }
 
 void __blkg_release_rcu(struct rcu_head *rcu);
@@ -271,14 +271,11 @@ void __blkg_release_rcu(struct rcu_head
 /**
  * blkg_put - put a blkg reference
  * @blkg: blkg to put
- *
- * The caller should be holding queue_lock.
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-	lockdep_assert_held(blkg->q->queue_lock);
-	WARN_ON_ONCE(blkg->refcnt <= 0);
-	if (!--blkg->refcnt)
+	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
+	if (atomic_dec_and_test(&blkg->refcnt))
 		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  2014-06-19 21:42           ` [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Tejun Heo
@ 2014-06-20 14:39             ` Vivek Goyal
  2014-06-20 18:50               ` Jens Axboe
  2014-06-20 18:50             ` Joe Lawrence
  1 sibling, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2014-06-20 14:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, Joe Lawrence, linux-kernel, Cgroups

On Thu, Jun 19, 2014 at 05:42:57PM -0400, Tejun Heo wrote:
> Hello,
> 
> So, this patch should do.  Joe, Vivek, can one of you guys please
> verify that the oops goes away with this patch?

Hi Tejun,

This patch seems to fix the issue for me. Tried 10 times and no crash.

So now one need to hold queue lock for getting refernce on the group
only if caller does not already have a reference and if group has been
looked up from some tree/queue etc. I guess only such usage seems to
be in blkg_create() where we take a reference on parent after looking
it up.

This patch looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Thanks
Vivek

> 
> Jens, the original thread can be read at
> 
>   http://thread.gmane.org/gmane.linux.kernel/1720729
> 
> The fix converts blkg->refcnt from int to atomic_t.  It does some
> overhead but it should be minute compared to everything else which is
> going on and the involved cacheline bouncing, so I think it's highly
> unlikely to cause any noticeable difference.  Also, the refcnt in
> question should be converted to a perpcu_ref for blk-mq anyway, so the
> atomic_t is likely to go away pretty soon anyway.
> 
> Thanks.
> 
> ------- 8< -------
> __blkg_release_rcu() may be invoked after the associated request_queue
> is released with a RCU grace period inbetween.  As such, the function
> and callbacks invoked from it must not dereference the associated
> request_queue.  This is clearly indicated in the comment above the
> function.
> 
> Unfortunately, while trying to fix a different issue, 2a4fd070ee85
> ("blkcg: move bulk of blkcg_gq release operations to the RCU
> callback") ignored this and added [un]locking of @blkg->q->queue_lock
> to __blkg_release_rcu().  This of course can cause oops as the
> request_queue may be long gone by the time this code gets executed.
> 
>   general protection fault: 0000 [#1] SMP
>   CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
>   Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
>   task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
>   RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
>   RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
>   RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
>   RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
>   RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
>   R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
>   R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
>   FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
>   Stack:
>    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
>    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
>    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
>   Call Trace:
>    [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
>    [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
>    [<ffffffff81091d81>] kthread+0xe1/0x100
>    [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
>   Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
>   +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
>   +b7
>   RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
>    RSP <ffff88085403fdf0>
> 
> The request_queue locking was added because blkcg_gq->refcnt is an int
> protected with the queue lock and __blkg_release_rcu() needs to put
> the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
> dropping queue locking in the function.
> 
> Given the general heavy weight of the current request_queue and blkcg
> operations, this is unlikely to cause any noticeable overhead.
> Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
> the near future, so whatever (most likely negligible) overhead it may
> add is temporary.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
> Cc: stable@vger.kernel.org
> ---
>  block/blk-cgroup.c |    7 ++-----
>  block/blk-cgroup.h |   17 +++++++----------
>  2 files changed, 9 insertions(+), 15 deletions(-)
> 
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -80,7 +80,7 @@ static struct blkcg_gq *blkg_alloc(struc
>  	blkg->q = q;
>  	INIT_LIST_HEAD(&blkg->q_node);
>  	blkg->blkcg = blkcg;
> -	blkg->refcnt = 1;
> +	atomic_set(&blkg->refcnt, 1);
>  
>  	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
>  	if (blkcg != &blkcg_root) {
> @@ -399,11 +399,8 @@ void __blkg_release_rcu(struct rcu_head
>  
>  	/* release the blkcg and parent blkg refs this blkg has been holding */
>  	css_put(&blkg->blkcg->css);
> -	if (blkg->parent) {
> -		spin_lock_irq(blkg->q->queue_lock);
> +	if (blkg->parent)
>  		blkg_put(blkg->parent);
> -		spin_unlock_irq(blkg->q->queue_lock);
> -	}
>  
>  	blkg_free(blkg);
>  }
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -18,6 +18,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/radix-tree.h>
>  #include <linux/blkdev.h>
> +#include <linux/atomic.h>
>  
>  /* Max limits for throttle policy */
>  #define THROTL_IOPS_MAX		UINT_MAX
> @@ -104,7 +105,7 @@ struct blkcg_gq {
>  	struct request_list		rl;
>  
>  	/* reference count */
> -	int				refcnt;
> +	atomic_t			refcnt;
>  
>  	/* is this blkg online? protected by both blkcg and q locks */
>  	bool				online;
> @@ -257,13 +258,12 @@ static inline int blkg_path(struct blkcg
>   * blkg_get - get a blkg reference
>   * @blkg: blkg to get
>   *
> - * The caller should be holding queue_lock and an existing reference.
> + * The caller should be holding an existing reference.
>   */
>  static inline void blkg_get(struct blkcg_gq *blkg)
>  {
> -	lockdep_assert_held(blkg->q->queue_lock);
> -	WARN_ON_ONCE(!blkg->refcnt);
> -	blkg->refcnt++;
> +	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> +	atomic_inc(&blkg->refcnt);
>  }
>  
>  void __blkg_release_rcu(struct rcu_head *rcu);
> @@ -271,14 +271,11 @@ void __blkg_release_rcu(struct rcu_head
>  /**
>   * blkg_put - put a blkg reference
>   * @blkg: blkg to put
> - *
> - * The caller should be holding queue_lock.
>   */
>  static inline void blkg_put(struct blkcg_gq *blkg)
>  {
> -	lockdep_assert_held(blkg->q->queue_lock);
> -	WARN_ON_ONCE(blkg->refcnt <= 0);
> -	if (!--blkg->refcnt)
> +	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> +	if (atomic_dec_and_test(&blkg->refcnt))
>  		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
>  }
>  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  2014-06-19 21:42           ` [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Tejun Heo
  2014-06-20 14:39             ` Vivek Goyal
@ 2014-06-20 18:50             ` Joe Lawrence
  1 sibling, 0 replies; 12+ messages in thread
From: Joe Lawrence @ 2014-06-20 18:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-kernel, Cgroups, Vivek Goyal

On Thu, 19 Jun 2014 17:42:57 -0400
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> So, this patch should do.  Joe, Vivek, can one of you guys please
> verify that the oops goes away with this patch?

Tejun -- thanks for fixing!  Looks good here, no issues running w/slub
debug enabled.

-- Joe


> Jens, the original thread can be read at
> 
>   http://thread.gmane.org/gmane.linux.kernel/1720729
> 
> The fix converts blkg->refcnt from int to atomic_t.  It does some
> overhead but it should be minute compared to everything else which is
> going on and the involved cacheline bouncing, so I think it's highly
> unlikely to cause any noticeable difference.  Also, the refcnt in
> question should be converted to a perpcu_ref for blk-mq anyway, so the
> atomic_t is likely to go away pretty soon anyway.
> 
> Thanks.
> 
> ------- 8< -------
> __blkg_release_rcu() may be invoked after the associated request_queue
> is released with a RCU grace period inbetween.  As such, the function
> and callbacks invoked from it must not dereference the associated
> request_queue.  This is clearly indicated in the comment above the
> function.
> 
> Unfortunately, while trying to fix a different issue, 2a4fd070ee85
> ("blkcg: move bulk of blkcg_gq release operations to the RCU
> callback") ignored this and added [un]locking of @blkg->q->queue_lock
> to __blkg_release_rcu().  This of course can cause oops as the
> request_queue may be long gone by the time this code gets executed.
> 
>   general protection fault: 0000 [#1] SMP
>   CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
>   Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
>   task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
>   RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
>   RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
>   RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
>   RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
>   RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
>   R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
>   R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
>   FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
>   Stack:
>    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
>    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
>    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
>   Call Trace:
>    [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
>    [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
>    [<ffffffff81091d81>] kthread+0xe1/0x100
>    [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
>   Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
>   +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
>   +b7
>   RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
>    RSP <ffff88085403fdf0>
> 
> The request_queue locking was added because blkcg_gq->refcnt is an int
> protected with the queue lock and __blkg_release_rcu() needs to put
> the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
> dropping queue locking in the function.
> 
> Given the general heavy weight of the current request_queue and blkcg
> operations, this is unlikely to cause any noticeable overhead.
> Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
> the near future, so whatever (most likely negligible) overhead it may
> add is temporary.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
> Cc: stable@vger.kernel.org
> ---
>  block/blk-cgroup.c |    7 ++-----
>  block/blk-cgroup.h |   17 +++++++----------
>  2 files changed, 9 insertions(+), 15 deletions(-)
> 
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -80,7 +80,7 @@ static struct blkcg_gq *blkg_alloc(struc
>  	blkg->q = q;
>  	INIT_LIST_HEAD(&blkg->q_node);
>  	blkg->blkcg = blkcg;
> -	blkg->refcnt = 1;
> +	atomic_set(&blkg->refcnt, 1);
>  
>  	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
>  	if (blkcg != &blkcg_root) {
> @@ -399,11 +399,8 @@ void __blkg_release_rcu(struct rcu_head
>  
>  	/* release the blkcg and parent blkg refs this blkg has been holding */
>  	css_put(&blkg->blkcg->css);
> -	if (blkg->parent) {
> -		spin_lock_irq(blkg->q->queue_lock);
> +	if (blkg->parent)
>  		blkg_put(blkg->parent);
> -		spin_unlock_irq(blkg->q->queue_lock);
> -	}
>  
>  	blkg_free(blkg);
>  }
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -18,6 +18,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/radix-tree.h>
>  #include <linux/blkdev.h>
> +#include <linux/atomic.h>
>  
>  /* Max limits for throttle policy */
>  #define THROTL_IOPS_MAX		UINT_MAX
> @@ -104,7 +105,7 @@ struct blkcg_gq {
>  	struct request_list		rl;
>  
>  	/* reference count */
> -	int				refcnt;
> +	atomic_t			refcnt;
>  
>  	/* is this blkg online? protected by both blkcg and q locks */
>  	bool				online;
> @@ -257,13 +258,12 @@ static inline int blkg_path(struct blkcg
>   * blkg_get - get a blkg reference
>   * @blkg: blkg to get
>   *
> - * The caller should be holding queue_lock and an existing reference.
> + * The caller should be holding an existing reference.
>   */
>  static inline void blkg_get(struct blkcg_gq *blkg)
>  {
> -	lockdep_assert_held(blkg->q->queue_lock);
> -	WARN_ON_ONCE(!blkg->refcnt);
> -	blkg->refcnt++;
> +	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> +	atomic_inc(&blkg->refcnt);
>  }
>  
>  void __blkg_release_rcu(struct rcu_head *rcu);
> @@ -271,14 +271,11 @@ void __blkg_release_rcu(struct rcu_head
>  /**
>   * blkg_put - put a blkg reference
>   * @blkg: blkg to put
> - *
> - * The caller should be holding queue_lock.
>   */
>  static inline void blkg_put(struct blkcg_gq *blkg)
>  {
> -	lockdep_assert_held(blkg->q->queue_lock);
> -	WARN_ON_ONCE(blkg->refcnt <= 0);
> -	if (!--blkg->refcnt)
> +	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> +	if (atomic_dec_and_test(&blkg->refcnt))
>  		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
>  }
>  


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  2014-06-20 14:39             ` Vivek Goyal
@ 2014-06-20 18:50               ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2014-06-20 18:50 UTC (permalink / raw)
  To: Vivek Goyal, Tejun Heo; +Cc: Joe Lawrence, linux-kernel, Cgroups

On 06/20/2014 08:39 AM, Vivek Goyal wrote:
> On Thu, Jun 19, 2014 at 05:42:57PM -0400, Tejun Heo wrote:
>> Hello,
>>
>> So, this patch should do.  Joe, Vivek, can one of you guys please
>> verify that the oops goes away with this patch?
> 
> Hi Tejun,
> 
> This patch seems to fix the issue for me. Tried 10 times and no crash.
> 
> So now one need to hold queue lock for getting refernce on the group
> only if caller does not already have a reference and if group has been
> looked up from some tree/queue etc. I guess only such usage seems to
> be in blkg_create() where we take a reference on parent after looking
> it up.
> 
> This patch looks good to me.
> 
> Acked-by: Vivek Goyal <vgoyal@redhat.com>

Thanks. Tejun, I'll queue this up for this cycle.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-06-20 19:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-08 22:22 docker crashes rcuos in __blkg_release_rcu Joe Lawrence
2014-06-09 17:47 ` Vivek Goyal
2014-06-09 18:27   ` Vivek Goyal
2014-06-10 18:39     ` Joe Lawrence
2014-06-10 19:14       ` Vivek Goyal
2014-06-11 16:32       ` Vivek Goyal
2014-06-19 20:26         ` Tejun Heo
2014-06-19 21:42           ` [PATCH block/for-linus] blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Tejun Heo
2014-06-20 14:39             ` Vivek Goyal
2014-06-20 18:50               ` Jens Axboe
2014-06-20 18:50             ` Joe Lawrence
2014-06-10 19:52 ` docker crashes rcuos in __blkg_release_rcu Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).