* Use after free with BFQ and cgroups
@ 2021-11-25 17:28 Jan Kara
  2021-11-26 14:47   ` Michal Koutný
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-11-25 17:28 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, fvogdt

Hello!

Our test VMs started crashing recently (seemingly starting with the 5.15
kernel). When we enabled KASAN, we got reports of a bfq_group being used
after being freed, like the following (the reports differ a bit in where
exactly BFQ hits the UAF):

[  235.949241] ==================================================================
[  235.950326] BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
[  235.951369] Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544

[  235.953476] CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G            E     5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
[  235.955726] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[  235.958007] Call Trace:
[  235.959157]  <IRQ>
[  235.960287]  dump_stack_lvl+0x46/0x5a
[  235.961412]  print_address_description.constprop.0+0x1f/0x140
[  235.962556]  ? __bfq_deactivate_entity+0x9cb/0xa50
[  235.963707]  kasan_report.cold+0x7f/0x11b
[  235.964841]  ? __bfq_deactivate_entity+0x9cb/0xa50
[  235.965970]  __bfq_deactivate_entity+0x9cb/0xa50
[  235.967092]  ? update_curr+0x32f/0x5d0
[  235.968227]  bfq_deactivate_entity+0xa0/0x1d0
[  235.969365]  bfq_del_bfqq_busy+0x28a/0x420
[  235.970481]  ? resched_curr+0x116/0x1d0
[  235.971573]  ? bfq_requeue_bfqq+0x70/0x70
[  235.972657]  ? check_preempt_wakeup+0x52b/0xbc0
[  235.973748]  __bfq_bfqq_expire+0x1a2/0x270
[  235.974822]  bfq_bfqq_expire+0xd16/0x2160
[  235.975893]  ? try_to_wake_up+0x4ee/0x1260
[  235.976965]  ? bfq_end_wr_async_queues+0xe0/0xe0
[  235.978039]  ? _raw_write_unlock_bh+0x60/0x60
[  235.979105]  ? _raw_spin_lock_irq+0x81/0xe0
[  235.980162]  bfq_idle_slice_timer+0x109/0x280
[  235.981199]  ? bfq_dispatch_request+0x4870/0x4870
[  235.982220]  __hrtimer_run_queues+0x37d/0x700
[  235.983242]  ? enqueue_hrtimer+0x1b0/0x1b0
[  235.984278]  ? kvm_clock_get_cycles+0xd/0x10
[  235.985301]  ? ktime_get_update_offsets_now+0x6f/0x280
[  235.986317]  hrtimer_interrupt+0x2c8/0x740
[  235.987321]  __sysvec_apic_timer_interrupt+0xcd/0x260
[  235.988357]  sysvec_apic_timer_interrupt+0x6a/0x90
[  235.989373]  </IRQ>
[  235.990355]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  235.991366] RIP: 0010:do_seccomp+0x4f5/0x1f40
[  235.992376] Code: 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 cb 14 00 00 48 8b bd d8 0b 00 00 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 <8b> 4c 24 30 85 c9 0f 85 06 07 00 00 8b 54 24 04 85 d2 74 19 4d 85
[  235.994481] RSP: 0018:ffffc900020cfd48 EFLAGS: 00000246
[  235.995546] RAX: dffffc0000000000 RBX: 1ffff92000419fb1 RCX: ffffffffb9a8d89d
[  235.996638] RDX: 1ffff1100080f17b RSI: 0000000000000008 RDI: ffff888008c56040
[  235.997717] RBP: ffff888004078000 R08: 0000000000000001 R09: ffff88800407800f
[  235.998784] R10: ffffed100080f001 R11: 0000000000000001 R12: 00000000ffffffff
[  235.999852] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  236.000906]  ? do_seccomp+0xfed/0x1f40
[  236.001937]  ? do_seccomp+0xfed/0x1f40
[  236.002938]  ? get_nth_filter+0x2e0/0x2e0
[  236.003932]  ? security_task_prctl+0x66/0xd0
[  236.004910]  __do_sys_prctl+0x420/0xd60
[  236.005842]  ? handle_mm_fault+0x196/0x610
[  236.006739]  ? __ia32_compat_sys_getrusage+0x90/0x90
[  236.007611]  ? up_read+0x15/0x90
[  236.008477]  do_syscall_64+0x5c/0x80
[  236.009349]  ? exc_page_fault+0x60/0xc0
[  236.010219]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  236.011094] RIP: 0033:0x561fa9ceec6a
[  236.011976] Code: e8 db 46 f8 ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
[  236.013823] RSP: 002b:000000c000116e38 EFLAGS: 00000216 ORIG_RAX: 000000000000009d
[  236.014778] RAX: ffffffffffffffda RBX: 000000c000028000 RCX: 0000561fa9ceec6a
[  236.015748] RDX: 000000c000116ee0 RSI: 0000000000000002 RDI: 0000000000000016
[  236.016716] RBP: 000000c000116e90 R08: 0000000000000000 R09: 0000000000000000
[  236.017685] R10: 0000000000000000 R11: 0000000000000216 R12: 00000000000000b8
[  236.018645] R13: 00000000000000b7 R14: 0000000000000200 R15: 0000000000000004

[  236.020558] Allocated by task 485:
[  236.021511]  kasan_save_stack+0x1b/0x40
[  236.022460]  __kasan_kmalloc+0xa4/0xd0
[  236.023410]  bfq_pd_alloc+0xa8/0x170
[  236.024351]  blkg_alloc+0x397/0x540
[  236.025287]  blkg_create+0x66b/0xcd0
[  236.026219]  bio_associate_blkg_from_css+0x43c/0xb20
[  236.027161]  bio_associate_blkg+0x66/0x100
[  236.028098]  submit_extent_page+0x744/0x1380 [btrfs]
[  236.029126]  __extent_writepage_io+0x605/0xaa0 [btrfs]
[  236.030113]  __extent_writepage+0x360/0x740 [btrfs]
[  236.031093]  extent_write_cache_pages+0x5a7/0xa50 [btrfs]
[  236.032084]  extent_writepages+0xcb/0x1a0 [btrfs]
[  236.033063]  do_writepages+0x188/0x720
[  236.033997]  filemap_fdatawrite_wbc+0x19f/0x2b0
[  236.034929]  filemap_fdatawrite_range+0x99/0xd0
[  236.035855]  btrfs_fdatawrite_range+0x46/0xf0 [btrfs]
[  236.036833]  start_ordered_ops.constprop.0+0xb6/0x110 [btrfs]
[  236.037803]  btrfs_sync_file+0x1bf/0xe70 [btrfs]
[  236.038747]  __x64_sys_fsync+0x51/0x80
[  236.039622]  do_syscall_64+0x5c/0x80
[  236.040468]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[  236.042137] Freed by task 10561:
[  236.042966]  kasan_save_stack+0x1b/0x40
[  236.043802]  kasan_set_track+0x1c/0x30
[  236.044628]  kasan_set_free_info+0x20/0x30
[  236.045437]  __kasan_slab_free+0x10b/0x140
[  236.046256]  slab_free_freelist_hook+0x8e/0x180
[  236.047081]  kfree+0xc7/0x400
[  236.047907]  blkg_free.part.0+0x78/0xf0
[  236.048736]  rcu_do_batch+0x365/0x1280
[  236.049558]  rcu_core+0x493/0x8d0
[  236.050376]  __do_softirq+0x18e/0x544

After some poking, looking into crashdumps, and applying some debug patches,
the following seems to be happening: We have a process P in blkcg G. Now G
is taken offline, so its bfq_group is cleaned up in bfq_pd_offline(), but P
still holds a reference to G from its bfq_queue. Then P submits IO and G
gets inserted into the service tree despite already being offline. The IO
completes, P exits, the bfq_queue pointing to G gets destroyed, the last
reference to G is dropped, and G gets freed although it is still inserted
in the service tree. Eventually someone trips over the freed memory.
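
To make that lifecycle concrete, here is a minimal userspace model of what
seems to be going on. It is purely illustrative (not kernel code, all names
are made up); the "service tree" is just a list holding a plain pointer:

/* Toy model of the reported lifecycle, not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct group {
	int refcnt;
	int weight;
	struct group *next_in_tree;	/* plain pointer, no reference taken */
};

static struct group *service_tree;	/* stands in for the parent's tree */

static void group_put(struct group *g)
{
	if (--g->refcnt == 0)
		free(g);		/* freed while still on service_tree */
}

int main(void)
{
	struct group *g = calloc(1, sizeof(*g));

	g->refcnt = 1;			/* the reference held by the bfq_queue */
	g->weight = 100;

	/* P submits IO: G is linked into the tree, no extra reference. */
	g->next_in_tree = service_tree;
	service_tree = g;

	/* IO completes, the queue is destroyed, last reference dropped. */
	group_put(g);

	/* Later the scheduler walks the tree: use after free. */
	printf("weight = %d\n", service_tree->weight);
	return 0;
}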

Now I was looking into how best to fix this. There are several
possibilities and I'm not sure which one to pick, so that's why I'm writing
to you. bfq_pd_offline() walks all entities in the service trees and tries
to get rid of references to the bfq_group (by reparenting entities). Is this
guaranteed to see all entities that point to G? From the scenario I'm
observing it seems this can miss entities pointing to G - e.g. if they are
in the idle tree, we just remove them from the idle tree but we don't change
entity->parent, so they still point to G. This can be seen as one culprit of
the bug.

Or alternatively, should we e.g. add __bfq_deactivate_entity() to
bfq_put_queue() when that function is dropping the last queue in a bfq_group?

Or should we just reparent bfq queues that have an already-dead parent on
activation?

What's your opinion?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-11-26 14:47   ` Michal Koutný
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Koutný @ 2021-11-26 14:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: Paolo Valente, linux-block, fvogdt, cgroups

Hello.

On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
[...]
+Cc cgroups ML
https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/


I understand there are more objects than blkcgs, but I assume it can
eventually boil down to blkcg references, so I suggest another alternative.
(But I may easily be missing the relations between BFQ objects, so consider
this only a high-level opinion.)

> After some poking, looking into crashdumps, and applying some debug patches
> the following seems to be happening: We have a process P in blkcg G. Now
> G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
> still holds reference to G from its bfq_queue. Then P submits IO, G gets
> inserted into service tree despite being already offline.

(If G is offline, P can only be a zombie, just saying. (I guess it can
still be Q's IO on behalf of G.))

IIUC, the reference to G is only held by P. If the G reference is copied
into another structure (the service tree) it should get another
reference. My naïve proposal would be css_get(). (1)
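
To illustrate what I mean by (1), reusing the toy group/service_tree model
sketched earlier in the thread (the helpers below are invented; in the
kernel the pin/unpin would be css_get()/css_put() on the blkcg's css):

static void group_get(struct group *g)
{
	g->refcnt++;			/* ~css_get(): pin the object */
}

static void tree_insert(struct group *g)
{
	group_get(g);			/* pin while linked into the tree */
	g->next_in_tree = service_tree;
	service_tree = g;
}

static void tree_remove(struct group *g)
{
	service_tree = g->next_in_tree;	/* simplified: g is the head */
	g->next_in_tree = NULL;
	group_put(g);			/* ~css_put(): frees only if last ref */
}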

> IO completes, P exits, bfq_queue pointing to G gets destroyed, the
> last reference to G is dropped, G gets freed although is it still
> inserted in the service tree.  Eventually someone trips over the freed
> memory.

Isn't it the bfq_queue.bfq_entity that's inserted in the service tree
(not blkcg G)?
You write bfq_queue is destroyed, shouldn't that remove it from the
service tree? (2)

> Now I was looking into how to best fix this. There are several
> possibilities and I'm not sure which one to pick so that's why I'm writing
> to you. bfq_pd_offline() is walking all entities in service trees and
> trying to get rid of references to bfq_group (by reparenting entities).
> Is this guaranteed to see all entities that point to G? From the scenario
> I'm observing it seems this can miss entities pointing to G - e.g. if they
> are in idle tree, we will just remove them from the idle tree but we won't
> change entity->parent so they still point to G. This can be seen as one
> culprit of the bug.

There can be two types of references to blkcg (transitively via
bfq_group):
a) "plain" (just a pointer stored somewhere),
b) "pinned" (marked by css_get() of the respective blkcg).

The bfq_pd_offline() callback should erase all plain references (e.g. by
reparenting) or poke the holders of pinned references to release (unpin)
them eventually (so that blkcg goes away).

I reckon it's not possible to traverse all references in the
bfq_pd_offline().

> Or alternatively, should we e.g. add __bfq_deactivate_entity() to
> bfq_put_queue() when that function is dropping last queue in a bfq_group?

I guess this is what I wondered about in (2). (But I'm not sure this
really is proof against subsequent re-insertions into the tree.)

> Or should we just reparent bfq queues that have already dead parent on
> activation?

If (1) used css_tryget_online(), the parent (or ancestor if it happened
to be offlined too) could be the fallback.
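
A sketch of that fallback, in case it helps (css_tryget_online() and
css->parent are real; the surrounding context is invented for illustration):

struct cgroup_subsys_state *css = &blkcg->css;	/* 'blkcg' assumed from context */

/* Try to pin the blkcg; if it is already offline, walk up to the
 * nearest online ancestor instead.  The loop terminates because the
 * root css never goes offline. */
while (!css_tryget_online(css))
	css = css->parent;
/* 'css' is now a pinned, online (possibly ancestor) css. */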

> What's your opinion?

The question here is how long the offlined blkcgs would stay around if
they were directly pinned upon IO submission. If it's unbounded, then
reparenting makes more sense.


Michal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-11-29 17:11     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2021-11-29 17:11 UTC (permalink / raw)
  To: Michal Koutný; +Cc: Jan Kara, Paolo Valente, linux-block, fvogt, cgroups

On Fri 26-11-21 15:47:24, Michal Koutný wrote:
> Hello.
> 
> On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
> [...]
> +Cc cgroups ML
> https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
> 
> 
> I understand there are more objects than blkcgs but I assume it can
> eventually boil down to blkcg references, so I suggest another
> alternative. (But I may easily miss the relations between BFQ objects,
> so consider this only high-level opinion.)
> 
> > After some poking, looking into crashdumps, and applying some debug patches
> > the following seems to be happening: We have a process P in blkcg G. Now
> > G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
> > still holds reference to G from its bfq_queue. Then P submits IO, G gets
> > inserted into service tree despite being already offline.
> 
> (If G is offline, P can only be zombie, just saying. (I guess it can
> still be Q's IO on behalf of G.))
> 
> IIUC, the reference to G is only held by P. If the G reference is copied
> into another structure (the service tree) it should get another
> reference. My naïve proposal would be css_get(). (1)

So I was looking into this puzzle. The answer is the following:

The process P (podman, pid 2571) is currently attached to the root cgroup,
but it has an io_context with a BFQ queue that points to the already-offline
G as its parent. The bio is thus associated with the root cgroup (via
bio->bi_blkg), but BFQ uses the io_context to look up the BFQ queue where
the IO should be queued and then uses that queue's parent to determine the
blkg to which the IO should be charged, and thus gets to the dying cgroup.
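
In rough pseudocode, the two lookups that disagree are (the helper names
bic_to_bfqq(), bfqq_group() and bfqg_to_blkg() come from bfq-iosched.c /
bfq-cgroup.c, but this is only an illustration of the flow, not a verbatim
excerpt):

/* What the bio itself says: the cgroup the task is in *now* (root). */
struct blkcg_gq *blkg_now = bio->bi_blkg;

/* What BFQ actually uses: the cached per-task io_context state. */
struct bfq_queue *bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf));
struct bfq_group *bfqg = bfqq_group(bfqq);		/* may still be the offline G */
struct blkcg_gq *blkg_charged = bfqg_to_blkg(bfqg);	/* the blkg that gets charged */

/* If the task was moved after bfqq was set up, the two blkgs can differ. */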

Apparently P got moved from G to the root cgroup recently, and a reference
to G was left behind in the BFQ queue structure.

> > IO completes, P exits, bfq_queue pointing to G gets destroyed, the
> > last reference to G is dropped, G gets freed although is it still
> > inserted in the service tree.  Eventually someone trips over the freed
> > memory.
> 
> Isn't it the bfq_queue.bfq_entity that's inserted in the service tree
> (not blkcg G)?

Yes, it is. But that entity is part of the bfq_group structure, which is
the pd (policy data) for the blkcg.

> You write bfq_queue is destroyed, shouldn't that remove it from the
> service tree? (2)

Yes, the BFQ queue is removed from the service trees on destruction. But
its parent - the bfq_group - is not removed from its own service tree. And
that's where we hit the problem.

> > Now I was looking into how to best fix this. There are several
> > possibilities and I'm not sure which one to pick so that's why I'm writing
> > to you. bfq_pd_offline() is walking all entities in service trees and
> > trying to get rid of references to bfq_group (by reparenting entities).
> > Is this guaranteed to see all entities that point to G? From the scenario
> > I'm observing it seems this can miss entities pointing to G - e.g. if they
> > are in idle tree, we will just remove them from the idle tree but we won't
> > change entity->parent so they still point to G. This can be seen as one
> > culprit of the bug.
> 
> There can be two types of references to blkcg (transitively via
> bfq_group):
> a) "plain" (just a pointer stored somewhere),
> b) "pinned" (marked by css_get() of the respective blkcg).
> 
> The bfq_pd_offline() callback should erase all plain references (e.g. by
> reparenting) or poke the holders of pinned references to release (unpin)
> them eventually (so that blkcg goes away).
> 
> I reckon it's not possible to traverse all references in the
> bfq_pd_offline().

So bfq_pd_offline() does erase all plain references AFAICT. But later, new
plain references (in the service tree) can be created from the existing
"pinned" ones, and once the pinned references go away, those newly created
plain references cause trouble. And the more I look into this, the more I'm
convinced bfq_pd_offline() should be more careful and also remove the pinned
references from bfq queues. It actually does that for most queues but it can
currently miss some... I'll look into that.

Thanks for your very good questions and hints!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
  2021-11-26 14:47   ` Michal Koutný
  (?)
  (?)
@ 2021-11-29 17:12   ` Tejun Heo
  2021-11-30 11:50     ` Jan Kara
  -1 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2021-11-29 17:12 UTC (permalink / raw)
  To: Michal Koutný; +Cc: Jan Kara, Paolo Valente, linux-block, fvogdt, cgroups

On Fri, Nov 26, 2021 at 03:47:24PM +0100, Michal Koutný wrote:
> The question here is how long would stay the offlined blkcgs around if
> they were directly pinned upon the IO submission. If it's unbound, then
> reparenting makes more sense.

It should be fine to pin whatever's necessary while the related IOs are in
flight, and the percpu_ref used for css refcnting isn't gonna make any
noticeable difference in terms of overhead.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
  2021-11-29 17:12   ` Tejun Heo
@ 2021-11-30 11:50     ` Jan Kara
  2021-11-30 16:22         ` Tejun Heo
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-11-30 11:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Jan Kara, Paolo Valente, linux-block, fvogdt, cgroups

On Mon 29-11-21 07:12:42, Tejun Heo wrote:
> On Fri, Nov 26, 2021 at 03:47:24PM +0100, Michal Koutný wrote:
> > The question here is how long would stay the offlined blkcgs around if
> > they were directly pinned upon the IO submission. If it's unbound, then
> > reparenting makes more sense.
> 
> It should be fine to pin whatever's necessary while related IOs are in
> flight and percpu_ref used for css refcnting isn't gonna make any noticeable
> difference in terms of overhead.

Yes, holding a cgroup ref from in-flight IO would be fine. But that is not
really our problem.

The problem is that the bfq_queue associated with a task effectively holds a
reference to the potentially dead cgroup, and that reference can stay there
until the task (which itself got reparented to the root cgroup) exits. So I
think we need to reparent these bfq_queue structures as well to avoid
holding the cgroup in a zombie state excessively long.
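
Roughly the direction I have in mind, just as a sketch (bfq_bfqq_move() and
bfqd->root_group exist today, but the iteration helper below is hypothetical
and the real fix would have to deal with locking and the queue-merge chains):

/* Sketch only: move every bfq_queue still pointing at the dying group
 * to the root group, so per-task io_context state can neither keep the
 * dead blkcg pinned nor point at it after it is freed.
 * for_each_bfqq_in_group() does not exist; it stands for whatever
 * iteration the real fix ends up using. */
static void bfq_reparent_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
{
	struct bfq_queue *bfqq;

	for_each_bfqq_in_group(bfqg, bfqq)
		bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
}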

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-11-30 16:22         ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2021-11-30 16:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michal Koutný, Paolo Valente, linux-block, fvogdt, cgroups

Hello,

On Tue, Nov 30, 2021 at 12:50:10PM +0100, Jan Kara wrote:
> The problem is bfq_queue associated with a task effectively holds a
> reference to the potentially dead cgroup and the reference can stay there
> until the task (that itself got reparented to the root cgroup) exits. So I
> think we need to reparent these bfq_queue structures as well to avoid
> holding cgroup in zombie state excessively long.

Ah, I see. Yeah, that's not great. Agree that it'd be better to reparent
(probably just punt to the root cgroup).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
  2021-11-29 17:11     ` Jan Kara
@ 2021-12-09  2:23       ` yukuai (C)
  -1 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2021-12-09  2:23 UTC (permalink / raw)
  To: Jan Kara, Michal Koutný; +Cc: Paolo Valente, linux-block, fvogt, cgroups

On 2021/11/30 1:11, Jan Kara wrote:
> On Fri 26-11-21 15:47:24, Michal Koutný wrote:
>> Hello.
>>
>> On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
>> [...]
>> +Cc cgroups ML
>> https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
>>
>>
>> I understand there are more objects than blkcgs but I assume it can
>> eventually boil down to blkcg references, so I suggest another
>> alternative. (But I may easily miss the relations between BFQ objects,
>> so consider this only high-level opinion.)
>>
>>> After some poking, looking into crashdumps, and applying some debug patches
>>> the following seems to be happening: We have a process P in blkcg G. Now
>>> G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
>>> still holds reference to G from its bfq_queue. Then P submits IO, G gets
>>> inserted into service tree despite being already offline.
>>
>> (If G is offline, P can only be zombie, just saying. (I guess it can
>> still be Q's IO on behalf of G.))
>>
>> IIUC, the reference to G is only held by P. If the G reference is copied
>> into another structure (the service tree) it should get another
>> reference. My naïve proposal would be css_get(). (1)
> 
> So I was looking into this puzzle. The answer is following:
> 
> The process P (podman, pid 2571) is currently attached to the root cgroup
> but it has io_context with BFQ queue that points to the already-offline G
> as a parent. The bio is thus associated with the root cgroup (via
> bio->bi_blkg) but BFQ uses io_context to lookup the BFQ queue where IO
> should be queued and then uses its parent to determine blkg which it should
> be charged and thus gets to the dying cgroup.

Hi, Jan

After some code review, we found that the root cause of the problem
seems to be different.

If the process is moved from group G to the root group, and a new IO is
issued from the process, then bfq should detect this and change the
bfq_queue's parent to the root bfq_group:

bfq_insert_request
  bfq_init_rq
    bfq_bic_update_cgroup
      serial_nr = __bio_blkcg(bio)->css.serial_nr; -> from the root group
      bic->blkcg_serial_nr == serial_nr -> does not hold, because
        bic->blkcg_serial_nr is still from group G
      __bfq_bic_change_cgroup -> the bfq_queue's parent will be changed
        to the root group

And we think the following path can trigger the problem:

1) Processes P1 and P2 are currently in cgroup C1, corresponding to
bfq_queues q1 and q2 and bfq_group g1, and q1 and q2 are merged:
q1->new_bfqq = q2.

2) P1 is moved from C1 to the root cgroup; q1->new_bfqq is still q2
and the flag BFQQF_split_coop is not set yet.

3) P2 exits; q2 won't go away because it is still referenced through
the queue merge.

4) C1 is deleted and g1 is offlined.

5) A new IO is issued in q1; q1's parent entity is changed to the root,
however the IO ends up in q1->new_bfqq = q2, and thus the offlined g1
is inserted into the service tree through q2.

6) P1 exits, q2 exits, and finally g1 is freed while it is still in the
service tree of its parent.

We confirmed this with our reproducer by applying a simple patch:
stop merging bfq_queues if their parents are different.

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1ce1a99a7160..14c1d1c3811e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2626,6 +2626,11 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
 	while ((__bfqq = new_bfqq->new_bfqq)) {
 		if (__bfqq == bfqq)
 			return NULL;
+		if (__bfqq->entity.parent != bfqq->entity.parent) {
+			if (bfq_bfqq_coop(__bfqq))
+				bfq_mark_bfqq_split_coop(__bfqq);
+			return NULL;
+		}
 		new_bfqq = __bfqq;
 	}

@@ -2825,8 +2830,16 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	if (bfq_too_late_for_merging(bfqq))
 		return NULL;

-	if (bfqq->new_bfqq)
-		return bfqq->new_bfqq;
+	if (bfqq->new_bfqq) {
+		struct bfq_queue *new_bfqq = bfqq->new_bfqq;
+
+		if (bfqq->entity.parent == new_bfqq->entity.parent)
+			return new_bfqq;
+
+		if (bfq_bfqq_coop(new_bfqq))
+			bfq_mark_bfqq_split_coop(new_bfqq);
+		return NULL;
+	}

Do you think this analysis is correct?

Thanks,
Kuai
> 
> Apparently P got recently moved from G to the root cgroup and there was
> reference left in the BFQ queue structure to G.
> 
>>> IO completes, P exits, bfq_queue pointing to G gets destroyed, the
>>> last reference to G is dropped, G gets freed although is it still
>>> inserted in the service tree.  Eventually someone trips over the freed
>>> memory.
>>
>> Isn't it the bfq_queue.bfq_entity that's inserted in the service tree
>> (not blkcg G)?
> 
> Yes, it is. But the entity is part of bfq_group structure which is the pd
> for the blkcg.
> 
>> You write bfq_queue is destroyed, shouldn't that remove it from the
>> service tree? (2)
> 
> Yes, BFQ queue is removed from the service trees on destruction. But its
> parent - bfq_group - is not removed from its service tree. And that's where
> we hit the problem.
> 
>>> Now I was looking into how to best fix this. There are several
>>> possibilities and I'm not sure which one to pick so that's why I'm writing
>>> to you. bfq_pd_offline() is walking all entities in service trees and
>>> trying to get rid of references to bfq_group (by reparenting entities).
>>> Is this guaranteed to see all entities that point to G? From the scenario
>>> I'm observing it seems this can miss entities pointing to G - e.g. if they
>>> are in idle tree, we will just remove them from the idle tree but we won't
>>> change entity->parent so they still point to G. This can be seen as one
>>> culprit of the bug.
>>
>> There can be two types of references to blkcg (transitively via
>> bfq_group):
>> a) "plain" (just a pointer stored somewhere),
>> b) "pinned" (marked by css_get() of the respective blkcg).
>>
>> The bfq_pd_offline() callback should erase all plain references (e.g. by
>> reparenting) or poke the holders of pinned references to release (unpin)
>> them eventually (so that blkcg goes away).
>>
>> I reckon it's not possible to traverse all references in the
>> bfq_pd_offline().
> 
> So bfq_pd_offline() does erase all plain references AFAICT. But later it
> can create new plain references (service tree) from the existing "pinned"
> ones and once pinned references go away, those created plain references
> cause trouble. And the more I'm looking into this the more I'm convinced
> bfq_pd_offline() should be more careful and remove also the pinned
> references from bfq queues. It actually does it for most queues but it can
> currently miss some... I'll look into that.
> 
> Thanks for your very good questions and hints!
> 
> 								Honza
> 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-09 15:33         ` Paolo Valente
  0 siblings, 0 replies; 24+ messages in thread
From: Paolo Valente @ 2021-12-09 15:33 UTC (permalink / raw)
  To: yukuai (C); +Cc: Jan Kara, Michal Koutný, linux-block, fvogt, cgroups



> On 9 Dec 2021, at 03:23, yukuai (C) <yukuai3@huawei.com> wrote:
> 
> On 2021/11/30 1:11, Jan Kara wrote:
>> On Fri 26-11-21 15:47:24, Michal Koutný wrote:
>>> Hello.
>>> 
>>> On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
>>> [...]
>>> +Cc cgroups ML
>>> https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
>>> 
>>> 
>>> I understand there are more objects than blkcgs but I assume it can
>>> eventually boil down to blkcg references, so I suggest another
>>> alternative. (But I may easily miss the relations between BFQ objects,
>>> so consider this only high-level opinion.)
>>> 
>>>> After some poking, looking into crashdumps, and applying some debug patches
>>>> the following seems to be happening: We have a process P in blkcg G. Now
>>>> G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
>>>> still holds reference to G from its bfq_queue. Then P submits IO, G gets
>>>> inserted into service tree despite being already offline.
>>> 
>>> (If G is offline, P can only be zombie, just saying. (I guess it can
>>> still be Q's IO on behalf of G.))
>>> 
>>> IIUC, the reference to G is only held by P. If the G reference is copied
>>> into another structure (the service tree) it should get another
>>> reference. My naïve proposal would be css_get(). (1)
>> So I was looking into this puzzle. The answer is following:
>> The process P (podman, pid 2571) is currently attached to the root cgroup
>> but it has io_context with BFQ queue that points to the already-offline G
>> as a parent. The bio is thus associated with the root cgroup (via
>> bio->bi_blkg) but BFQ uses io_context to lookup the BFQ queue where IO
>> should be queued and then uses its parent to determine blkg which it should
>> be charged and thus gets to the dying cgroup.
> 
> Hi, Jan
> 
> After some code review, we found that the root cause of the problem
> semms to be different.
> 
> If the process is moved from group G to root group, and a new io is
> issued from the process, then bfq should detect this and changing
> bfq_queue's parent to root bfq_group:
> 
> bfq_insert_request
> bfq_init_rq
>  bfq_bic_update_cgroup
>   serial_nr = __bio_blkcg(bio)->css.serial_nr; -> from root group
>   bic->blkcg_serial_nr == serial_nr -> this do not pass,because bic->blkcg_serial_nr is still from group G
>   __bfq_bic_change_cgroup -> bfq_queue parent will be changed to root group
> 
> And we think the following path is possible to trigger the problem:
> 
> 1) process P1 and P2 is currently in cgroup C1, corresponding to
> bfq_queue q1, q2 and bfq_group g1. And q1 and q2 are merged:
> q1->next_bfqq = q2.
> 
> 2) move P1 from C1 to root_cgroup, q1->next_bfqq is still q2
> and flag BFQQF_split_coop is not set yet.
> 
> 3) P2 exit, q2 won't exit because it's still referenced through
> queue merge.
> 
> 4) delete C1, g1 is offlined
> 
> 5) issue a new io in q1, q1's parent entity will change to root,
> however the io will end up in q1->next_bfqq = q2, and thus the
> offlined g1 is inserted to service tree through q2.
> 
> 6) P1 exit, q2 exit, and finially g1 is freed, while g1 is still
> in service tree of it's parent.
> 
> We confirmed this by our reproducer through a simple patch:
> stop merging bfq_queues if their parents are different.
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 1ce1a99a7160..14c1d1c3811e 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -2626,6 +2626,11 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
>        while ((__bfqq = new_bfqq->new_bfqq)) {
>                if (__bfqq == bfqq)
>                        return NULL;
> +               if (__bfqq->entity.parent != bfqq->entity.parent) {
> +                       if (bfq_bfqq_coop(__bfqq))
> +                               bfq_mark_bfqq_split_coop(__bfqq);
> +                       return NULL;
> +               }
>                new_bfqq = __bfqq;
>        }
> 
> @@ -2825,8 +2830,16 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>        if (bfq_too_late_for_merging(bfqq))
>                return NULL;
> 
> -       if (bfqq->new_bfqq)
> -               return bfqq->new_bfqq;
> +       if (bfqq->new_bfqq) {
> +               struct bfq_queue *new_bfqq = bfqq->new_bfqq;
> +
> +               if(bfqq->entity.parent == new_bfqq->entity.parent)
> +                       return new_bfqq;
> +
> +               if(bfq_bfqq_coop(new_bfqq))
> +                       bfq_mark_bfqq_split_coop(new_bfqq);
> +               return NULL;
> +       }
> 
> Do you think this analysis is correct?
> 

Hi Kuai,
I haven't checked all the details, but, for sure, two bfq_queues belonging
to different parent entities (i.e., to different groups) must not be
merged.  So, if two bfq_queues happen to be (still) merged while belonging
to different groups, then the scheduler is in an inconsistent state, and
trouble may follow.

Let's see comments from the other people working on this issue.

Thanks for spotting this,
Paolo

> Thanks,
> Kuai
>> Apparently P got recently moved from G to the root cgroup and there was
>> reference left in the BFQ queue structure to G.
>>>> IO completes, P exits, bfq_queue pointing to G gets destroyed, the
>>>> last reference to G is dropped, G gets freed although is it still
>>>> inserted in the service tree.  Eventually someone trips over the freed
>>>> memory.
>>> 
>>> Isn't it the bfq_queue.bfq_entity that's inserted in the service tree
>>> (not blkcg G)?
>> Yes, it is. But the entity is part of bfq_group structure which is the pd
>> for the blkcg.
>>> You write bfq_queue is destroyed, shouldn't that remove it from the
>>> service tree? (2)
>> Yes, BFQ queue is removed from the service trees on destruction. But its
>> parent - bfq_group - is not removed from its service tree. And that's where
>> we hit the problem.
>>>> Now I was looking into how to best fix this. There are several
>>>> possibilities and I'm not sure which one to pick so that's why I'm writing
>>>> to you. bfq_pd_offline() is walking all entities in service trees and
>>>> trying to get rid of references to bfq_group (by reparenting entities).
>>>> Is this guaranteed to see all entities that point to G? From the scenario
>>>> I'm observing it seems this can miss entities pointing to G - e.g. if they
>>>> are in idle tree, we will just remove them from the idle tree but we won't
>>>> change entity->parent so they still point to G. This can be seen as one
>>>> culprit of the bug.
>>> 
>>> There can be two types of references to blkcg (transitively via
>>> bfq_group):
>>> a) "plain" (just a pointer stored somewhere),
>>> b) "pinned" (marked by css_get() of the respective blkcg).
>>> 
>>> The bfq_pd_offline() callback should erase all plain references (e.g. by
>>> reparenting) or poke the holders of pinned references to release (unpin)
>>> them eventually (so that blkcg goes away).
>>> 
>>> I reckon it's not possible to traverse all references in the
>>> bfq_pd_offline().
>> So bfq_pd_offline() does erase all plain references AFAICT. But later it
>> can create new plain references (service tree) from the existing "pinned"
>> ones and once pinned references go away, those created plain references
>> cause trouble. And the more I'm looking into this the more I'm convinced
>> bfq_pd_offline() should be more careful and remove also the pinned
>> references from bfq queues. It actually does it for most queues but it can
>> currently miss some... I'll look into that.
>> Thanks for your very good questions and hints!
>> 								Honza


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
  2021-12-09  2:23       ` yukuai (C)
  (?)
  (?)
@ 2021-12-13 17:33       ` Jan Kara
  2021-12-14  1:24           ` yukuai (C)
  -1 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2021-12-13 17:33 UTC (permalink / raw)
  To: yukuai (C)
  Cc: Jan Kara, Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

Hello!

On Thu 09-12-21 10:23:33, yukuai (C) wrote:
> 在 2021/11/30 1:11, Jan Kara 写道:
> > On Fri 26-11-21 15:47:24, Michal Koutný wrote:
> > > Hello.
> > > 
> > > On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
> > > [...]
> > > +Cc cgroups ML
> > > https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
> > > 
> > > 
> > > I understand there are more objects than blkcgs but I assume it can
> > > eventually boil down to blkcg references, so I suggest another
> > > alternative. (But I may easily miss the relations between BFQ objects,
> > > so consider this only high-level opinion.)
> > > 
> > > > After some poking, looking into crashdumps, and applying some debug patches
> > > > the following seems to be happening: We have a process P in blkcg G. Now
> > > > G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
> > > > still holds reference to G from its bfq_queue. Then P submits IO, G gets
> > > > inserted into service tree despite being already offline.
> > > 
> > > (If G is offline, P can only be zombie, just saying. (I guess it can
> > > still be Q's IO on behalf of G.))
> > > 
> > > IIUC, the reference to G is only held by P. If the G reference is copied
> > > into another structure (the service tree) it should get another
> > > reference. My naïve proposal would be css_get(). (1)
> > 
> > So I was looking into this puzzle. The answer is following:
> > 
> > The process P (podman, pid 2571) is currently attached to the root cgroup
> > but it has io_context with BFQ queue that points to the already-offline G
> > as a parent. The bio is thus associated with the root cgroup (via
> > bio->bi_blkg) but BFQ uses io_context to lookup the BFQ queue where IO
> > should be queued and then uses its parent to determine blkg which it should
> > be charged and thus gets to the dying cgroup.
> 
> After some code review, we found that the root cause of the problem
> seems to be different.
> 
> If the process is moved from group G to the root group, and a new io is
> issued from the process, then bfq should detect this and change the
> bfq_queue's parent to the root bfq_group:
> 
> bfq_insert_request
>  bfq_init_rq
>   bfq_bic_update_cgroup
>    serial_nr = __bio_blkcg(bio)->css.serial_nr; -> from root group
>    bic->blkcg_serial_nr == serial_nr -> this does not pass, because
> bic->blkcg_serial_nr is still from group G

So in the crashdump I have available, I can see that 
_bio_blkcg(bio)->css.serial_nr is 4. Also bic->blkcg_serial_nr is 4. But
bic->bfqq[1] is a bfq_queue that has its entity->parent set to already
offlined bfq_group. Not sure how that is possible...

>    __bfq_bic_change_cgroup -> bfq_queue parent will be changed to root group
> 
> And we think the following path is possible to trigger the problem:
> 
> 1) processes P1 and P2 are currently in cgroup C1, corresponding to
> bfq_queues q1, q2 and bfq_group g1. And q1 and q2 are merged:
> q1->next_bfqq = q2.

I agree shared queues are some factor in this - the problematic bfq_queue
pointing to the dead bfq_group has 'coop' flag set, pid == -1, bic ==
NULL. So clearly it has been merged with another bfq_queue.

> 2) move P1 from C1 to root_cgroup, q1->next_bfqq is still q2
> and flag BFQQF_split_coop is not set yet.

There's no next_bfqq in the kernel I'm looking into... Generally the merge
code seems to be working somewhat differently to what you describe (I'm
looking into 5.16-rc3 kernel).

> 3) P2 exits, q2 won't exit because it's still referenced through
> queue merge.
> 
> 4) delete C1, g1 is offlined
> 
> 5) a new io is issued in q1, q1's parent entity will change to root,
> however the io will end up in q1->next_bfqq = q2, and thus the
> offlined g1 is inserted into the service tree through q2.
> 
> 6) P1 exits, q2 exits, and finally g1 is freed, while g1 is still
> in the service tree of its parent.
> 
> We confirmed this by our reproducer through a simple patch:
> stop merging bfq_queues if their parents are different.
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 1ce1a99a7160..14c1d1c3811e 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -2626,6 +2626,11 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct
> bfq_queue *new_bfqq)
>         while ((__bfqq = new_bfqq->new_bfqq)) {
>                 if (__bfqq == bfqq)
>                         return NULL;
> +               if (__bfqq->entity.parent != bfqq->entity.parent) {
> +                       if (bfq_bfqq_coop(__bfqq))
> +                               bfq_mark_bfqq_split_coop(__bfqq);
> +                       return NULL;
> +               }
>                 new_bfqq = __bfqq;
>         }
> 
> @@ -2825,8 +2830,16 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct
> bfq_queue *bfqq,
>         if (bfq_too_late_for_merging(bfqq))
>                 return NULL;
> 
> -       if (bfqq->new_bfqq)
> -               return bfqq->new_bfqq;
> +       if (bfqq->new_bfqq) {
> +               struct bfq_queue *new_bfqq = bfqq->new_bfqq;
> +
> +               if(bfqq->entity.parent == new_bfqq->entity.parent)
> +                       return new_bfqq;
> +
> +               if(bfq_bfqq_coop(new_bfqq))
> +                       bfq_mark_bfqq_split_coop(new_bfqq);
> +               return NULL;
> +       }
> 
> Do you think this analysis is correct?

Honestly, I'm not sure. At this point I'm not sure why the bfqq pointed to
from bic didn't get reparented to the new cgroup when bio was submitted...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-14  1:24           ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2021-12-14  1:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

在 2021/12/14 1:33, Jan Kara 写道:
> Hello!
> 
> On Thu 09-12-21 10:23:33, yukuai (C) wrote:
>> 在 2021/11/30 1:11, Jan Kara 写道:
>>> On Fri 26-11-21 15:47:24, Michal Koutný wrote:
>>>> Hello.
>>>>
>>>> On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
>>>> [...]
>>>> +Cc cgroups ML
>>>> https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
>>>>
>>>>
>>>> I understand there are more objects than blkcgs but I assume it can
>>>> eventually boil down to blkcg references, so I suggest another
>>>> alternative. (But I may easily miss the relations between BFQ objects,
>>>> so consider this only high-level opinion.)
>>>>
>>>>> After some poking, looking into crashdumps, and applying some debug patches
>>>>> the following seems to be happening: We have a process P in blkcg G. Now
>>>>> G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
>>>>> still holds reference to G from its bfq_queue. Then P submits IO, G gets
>>>>> inserted into service tree despite being already offline.
>>>>
>>>> (If G is offline, P can only be zombie, just saying. (I guess it can
>>>> still be Q's IO on behalf of G.))
>>>>
>>>> IIUC, the reference to G is only held by P. If the G reference is copied
>>>> into another structure (the service tree) it should get another
>>>> reference. My naïve proposal would be css_get(). (1)
>>>
>>> So I was looking into this puzzle. The answer is following:
>>>
>>> The process P (podman, pid 2571) is currently attached to the root cgroup
>>> but it has io_context with BFQ queue that points to the already-offline G
>>> as a parent. The bio is thus associated with the root cgroup (via
>>> bio->bi_blkg) but BFQ uses io_context to lookup the BFQ queue where IO
>>> should be queued and then uses its parent to determine blkg which it should
>>> be charged and thus gets to the dying cgroup.
>>
>> After some code review, we found that the root cause of the problem
>> seems to be different.
>>
>> If the process is moved from group G to the root group, and a new io is
>> issued from the process, then bfq should detect this and change the
>> bfq_queue's parent to the root bfq_group:
>>
>> bfq_insert_request
>>   bfq_init_rq
>>    bfq_bic_update_cgroup
>>     serial_nr = __bio_blkcg(bio)->css.serial_nr; -> from root group
>>     bic->blkcg_serial_nr == serial_nr -> this does not pass, because
>> bic->blkcg_serial_nr is still from group G
> 
> So in the crashdump I have available, I can see that
> _bio_blkcg(bio)->css.serial_nr is 4. Also bic->blkcg_serial_nr is 4. But
> bic->bfqq[1] is a bfq_queue that has its entity->parent set to already
> offlined bfq_group. Not sure how that is possible...
> 
>>     __bfq_bic_change_cgroup -> bfq_queue parent will be changed to root group
>>
>> And we think the following path is possible to trigger the problem:
>>
>> 1) processes P1 and P2 are currently in cgroup C1, corresponding to
>> bfq_queues q1, q2 and bfq_group g1. And q1 and q2 are merged:
>> q1->next_bfqq = q2.
> 
> I agree shared queues are some factor in this - the problematic bfq_queue
> pointing to the dead bfq_group has 'coop' flag set, pid == -1, bic ==
> NULL. So clearly it has been merged with another bfq_queue.
> 
>> 2) move P1 from C1 to root_cgroup, q1->next_bfqq is still q2
>> and flag BFQQF_split_coop is not set yet.
> 
> There's no next_bfqq in the kernel I'm looking into... Generally the merge
> code seems to be working somewhat differently to what you describe (I'm
> looking into 5.16-rc3 kernel).

Hi, Jan

Sorry, it should be new_bfqq, which will be set if the queue is merged
into another queue.
> 
>> 3) P2 exits, q2 won't exit because it's still referenced through
>> queue merge.
>>
>> 4) delete C1, g1 is offlined
>>
>> 5) a new io is issued in q1, q1's parent entity will change to root,
>> however the io will end up in q1->next_bfqq = q2, and thus the
>> offlined g1 is inserted into the service tree through q2.
>>
>> 6) P1 exits, q2 exits, and finally g1 is freed, while g1 is still
>> in the service tree of its parent.
>>
>> We confirmed this by our reproducer through a simple patch:
>> stop merging bfq_queues if their parents are different.
>>
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 1ce1a99a7160..14c1d1c3811e 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -2626,6 +2626,11 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct
>> bfq_queue *new_bfqq)
>>          while ((__bfqq = new_bfqq->new_bfqq)) {
>>                  if (__bfqq == bfqq)
>>                          return NULL;
>> +               if (__bfqq->entity.parent != bfqq->entity.parent) {
>> +                       if (bfq_bfqq_coop(__bfqq))
>> +                               bfq_mark_bfqq_split_coop(__bfqq);
>> +                       return NULL;
>> +               }
>>                  new_bfqq = __bfqq;
>>          }
>>
>> @@ -2825,8 +2830,16 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct
>> bfq_queue *bfqq,
>>          if (bfq_too_late_for_merging(bfqq))
>>                  return NULL;
>>
>> -       if (bfqq->new_bfqq)
>> -               return bfqq->new_bfqq;
>> +       if (bfqq->new_bfqq) {
>> +               struct bfq_queue *new_bfqq = bfqq->new_bfqq;
>> +
>> +               if(bfqq->entity.parent == new_bfqq->entity.parent)
>> +                       return new_bfqq;
>> +
>> +               if(bfq_bfqq_coop(new_bfqq))
>> +                       bfq_mark_bfqq_split_coop(new_bfqq);
>> +               return NULL;
>> +       }
>>
>> Do you think this analysis is correct?
> 
> Honestly, I'm not sure. At this point I'm not sure why the bfqq pointed to
> from bic didn't get reparented to the new cgroup when bio was submitted...

After queue merging, bic is set to the new bfqq, for example:

__bfq_insert_request
  new_bfqq = bfq_setup_cooperator
  bfq_merge_bfqqs -> before this is done, bic is set to old bfqq
   bic_set_bfqq(bic, new_bfqq, 1); -> bic is set to new bfqq
  rq->elv.priv[1] = new_bfqq;

I think the current problem is that if a bfq_queue is merged, task migration
won't break such cooperation, thus io issued from one cgroup may end up in
a bfq_queue that belongs to another cgroup, which might be offlined.
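
For what it's worth, the life-cycle problem can be modelled outside the
kernel. Below is a purely illustrative user-space sketch - none of it is
BFQ code, the structs only mirror the parent/new_bfqq links from the steps
above. Built with "gcc -fsanitize=address", the final access is reported as
a heap-use-after-free:

#include <stdio.h>
#include <stdlib.h>

struct group { const char *name; };

struct queue {
	struct group *parent;
	struct queue *new_bfqq;		/* merge target */
};

int main(void)
{
	struct group root = { "root" };
	struct group *g1 = malloc(sizeof(*g1));
	struct queue q1 = { g1, NULL };
	struct queue q2 = { g1, NULL };

	g1->name = "g1";		/* q1 and q2 live in cgroup C1 (g1) */

	q1.new_bfqq = &q2;		/* 1) q1 and q2 are merged */
	q1.parent = &root;		/* 2) P1 moved to root, only q1 is reparented */

	/*
	 * 5) a new io in q1 is redirected to q2, so q2's stale parent g1
	 *    ends up "on the service tree" - modelled by this saved pointer.
	 */
	struct queue *target = q1.new_bfqq ? q1.new_bfqq : &q1;
	struct group *on_service_tree = target->parent;

	/* 6) q2 exits and g1 is freed while the service tree still points to it */
	free(g1);

	/* ... and the next walk of the service tree is a use after free */
	printf("io charged to group %s\n", on_service_tree->name);
	return 0;
}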

Thanks,
Kuai
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-20 18:38             ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2021-12-20 18:38 UTC (permalink / raw)
  To: yukuai (C)
  Cc: Jan Kara, Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

On Tue 14-12-21 09:24:58, yukuai (C) wrote:
> 在 2021/12/14 1:33, Jan Kara 写道:
> > Hello!
> > 
> > On Thu 09-12-21 10:23:33, yukuai (C) wrote:
> > > 在 2021/11/30 1:11, Jan Kara 写道:
> > > > On Fri 26-11-21 15:47:24, Michal Koutný wrote:
> > > > > Hello.
> > > > > 
> > > > > On Thu, Nov 25, 2021 at 06:28:09PM +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > [...]
> > > > > +Cc cgroups ML
> > > > > https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
> > > > > 
> > > > > 
> > > > > I understand there are more objects than blkcgs but I assume it can
> > > > > eventually boil down to blkcg references, so I suggest another
> > > > > alternative. (But I may easily miss the relations between BFQ objects,
> > > > > so consider this only high-level opinion.)
> > > > > 
> > > > > > After some poking, looking into crashdumps, and applying some debug patches
> > > > > > the following seems to be happening: We have a process P in blkcg G. Now
> > > > > > G is taken offline so bfq_group is cleaned up in bfq_pd_offline() but P
> > > > > > still holds reference to G from its bfq_queue. Then P submits IO, G gets
> > > > > > inserted into service tree despite being already offline.
> > > > > 
> > > > > (If G is offline, P can only be zombie, just saying. (I guess it can
> > > > > still be Q's IO on behalf of G.))
> > > > > 
> > > > > IIUC, the reference to G is only held by P. If the G reference is copied
> > > > > into another structure (the service tree) it should get another
> > > > > reference. My naïve proposal would be css_get(). (1)
> > > > 
> > > > So I was looking into this puzzle. The answer is following:
> > > > 
> > > > The process P (podman, pid 2571) is currently attached to the root cgroup
> > > > but it has io_context with BFQ queue that points to the already-offline G
> > > > as a parent. The bio is thus associated with the root cgroup (via
> > > > bio->bi_blkg) but BFQ uses io_context to lookup the BFQ queue where IO
> > > > should be queued and then uses its parent to determine blkg which it should
> > > > be charged and thus gets to the dying cgroup.
> > > 
> > > After some code review, we found that the root cause of the problem
> > > seems to be different.
> > > 
> > > If the process is moved from group G to the root group, and a new io is
> > > issued from the process, then bfq should detect this and change the
> > > bfq_queue's parent to the root bfq_group:
> > > 
> > > bfq_insert_request
> > >   bfq_init_rq
> > >    bfq_bic_update_cgroup
> > >     serial_nr = __bio_blkcg(bio)->css.serial_nr; -> from root group
> > >     bic->blkcg_serial_nr == serial_nr -> this does not pass, because
> > > bic->blkcg_serial_nr is still from group G
> > 
> > So in the crashdump I have available, I can see that
> > _bio_blkcg(bio)->css.serial_nr is 4. Also bic->blkcg_serial_nr is 4. But
> > bic->bfqq[1] is a bfq_queue that has its entity->parent set to already
> > offlined bfq_group. Not sure how that is possible...
> > 
> > >     __bfq_bic_change_cgroup -> bfq_queue parent will be changed to root group
> > > 
> > > And we think the following path is possible to trigger the problem:
> > > 
> > > 1) processes P1 and P2 are currently in cgroup C1, corresponding to
> > > bfq_queues q1, q2 and bfq_group g1. And q1 and q2 are merged:
> > > q1->next_bfqq = q2.
> > 
> > I agree shared queues are some factor in this - the problematic bfq_queue
> > pointing to the dead bfq_group has 'coop' flag set, pid == -1, bic ==
> > NULL. So clearly it has been merged with another bfq_queue.
> > 
> > > 2) move P1 from C1 to root_cgroup, q1->next_bfqq is still q2
> > > and flag BFQQF_split_coop is not set yet.
> > 
> > There's no next_bfqq in the kernel I'm looking into... Generally the merge
> > code seems to be working somewhat differently to what you describe (I'm
> > looking into 5.16-rc3 kernel).
> 
> Hi, Jan
> 
> Sorry, it should be new_bfqq, which will be set if the queue is merged
> into another queue.
> > 
> > > 3) P2 exits, q2 won't exit because it's still referenced through
> > > queue merge.
> > > 
> > > 4) delete C1, g1 is offlined
> > > 
> > > 5) a new io is issued in q1, q1's parent entity will change to root,
> > > however the io will end up in q1->next_bfqq = q2, and thus the
> > > offlined g1 is inserted into the service tree through q2.
> > > 
> > > 6) P1 exits, q2 exits, and finally g1 is freed, while g1 is still
> > > in the service tree of its parent.
> > > 
> > > We confirmed this by our reproducer through a simple patch:
> > > stop merging bfq_queues if their parents are different.
> > > 
> > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> > > index 1ce1a99a7160..14c1d1c3811e 100644
> > > --- a/block/bfq-iosched.c
> > > +++ b/block/bfq-iosched.c
> > > @@ -2626,6 +2626,11 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct
> > > bfq_queue *new_bfqq)
> > >          while ((__bfqq = new_bfqq->new_bfqq)) {
> > >                  if (__bfqq == bfqq)
> > >                          return NULL;
> > > +               if (__bfqq->entity.parent != bfqq->entity.parent) {
> > > +                       if (bfq_bfqq_coop(__bfqq))
> > > +                               bfq_mark_bfqq_split_coop(__bfqq);
> > > +                       return NULL;
> > > +               }
> > >                  new_bfqq = __bfqq;
> > >          }
> > > 
> > > @@ -2825,8 +2830,16 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct
> > > bfq_queue *bfqq,
> > >          if (bfq_too_late_for_merging(bfqq))
> > >                  return NULL;
> > > 
> > > -       if (bfqq->new_bfqq)
> > > -               return bfqq->new_bfqq;
> > > +       if (bfqq->new_bfqq) {
> > > +               struct bfq_queue *new_bfqq = bfqq->new_bfqq;
> > > +
> > > +               if(bfqq->entity.parent == new_bfqq->entity.parent)
> > > +                       return new_bfqq;
> > > +
> > > +               if(bfq_bfqq_coop(new_bfqq))
> > > +                       bfq_mark_bfqq_split_coop(new_bfqq);
> > > +               return NULL;
> > > +       }
> > > 
> > > Do you think this analysis is correct?
> > 
> > Honestly, I'm not sure. At this point I'm not sure why the bfqq pointed to
> > from bic didn't get reparented to the new cgroup when bio was submitted...
> 
> After queue merging, bic is set to the new bfqq, for example:
> 
> __bfq_insert_request
>  new_bfqq = bfq_setup_cooperator
>  bfq_merge_bfqqs -> before this is done, bic is set to old bfqq
>   bic_set_bfqq(bic, new_bfqq, 1); -> bic is set to new bfqq
>  rq->elv.priv[1] = new_bfqq;
> 
> I think the current problem is that if a bfq_queue is merged, task migration
> won't break such cooperation, thus io issued from one cgroup may end up in
> a bfq_queue that belongs to another cgroup, which might be offlined.

OK, I now understand what you mean. Indeed the full code is like:

bfq_insert_request()
  ...
  bfq_init_rq()
    bfq_check_ioprio_change(bic, bio);
    bfq_bic_update_cgroup(bic, bio);
      - this correctly moves bfqq of the process to appropriate blkcg
    bfqq = bfq_get_bfqq_handle_split()
    ... now associate request with bfqq ...
  __bfq_insert_request()
    new_bfqq = bfq_setup_cooperator()
    ... move request to new_bfqq ... there is no guarantee about the
    parent of new_bfqq, it may be a different cgroup if bic->stable_merge_bfqq
    predates process move, it may point to a dead cgroup or whatever.

Now I think there are several things wrong about the current code:

1) A process can submit IO for multiple cgroups (e.g. if fsync() is doing
writeback of pages created in the page cache by a process from a different
cgroup). Thus a bfqq can contain IO from multiple cgroups at the same time
and reparenting of the bfqq in bfq_bic_update_cgroup() is going to have
interesting side-effects on the scheduling decisions. This is going to be
difficult to fully solve - probably the bfqq in the bic should have as its
parent the cgroup pointed to by the css of the process, and if we detect
that the currently submitted bio belongs to a different cgroup, we need to
lookup & create a new bfqq with the appropriate parent and queue the request
there. But this is not a critical issue at this point so we can leave it for
later, I just wanted to mention it to set the frame for the problems below.

2) Bfqq merging (bfq_setup_cooperator() and friends) happens in
__bfq_insert_request(), rather late during request queueing. As a result we
have to redo several things when we decide to merge queues (as the request
is already queued in the previous queue) and we fail to update the cgroup
parent. IMHO we should reorganize the code flow so that all the queue
merging & splitting happens first, we get the final bfqq out of that, and we
associate the request with that bfqq (including the cgroup handling) - see
the ordering sketch below the list. This will also fix the use-after-free
issue.

3) If the process is moved to a different cgroup (or just submits a bio for
a different cgroup), bic->stable_merge_bfqq may point to a bfqq from a
different cgroup and thus merging them is undesirable. The solution for 1)
will also deal with this problem but until then, I think that if
bfq_bic_update_cgroup() notices a cgroup mismatch, it should also clear
bic->stable_merge_bfqq to avoid this problem (a sketch follows the plan
below). As a side-effect this will also fix the use-after-free issue even
without solving 2).

4) I find it fragile that we may have bfqqs around with dead cgroups as
their parent; if we mistakenly activate these bfqqs (and thus their parent
cgroups) and insert them in the service tree, the use-after-free problem is
right there. So I think that my patch to actively reparent all bfqqs from a
dying cgroup makes sense (alternatively we could make sure the service tree
holds a ref for the cgroups it contains, but that seems like a more
expensive and intrusive fix) and I'd keep it as a safety measure for the
future.
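
To make 2) concrete, the intended ordering would be roughly the following
(hand-waving sketch only; the exact arguments and how the request gets
associated with the queue are glossed over, the function names are the ones
already discussed above):

bfq_insert_request()
  bfq_init_rq()
    bfq_check_ioprio_change(bic, bio);
    bfq_bic_update_cgroup(bic, bio);
    bfqq = bfq_get_bfqq_handle_split(...);
    bfqq = bfq_setup_cooperator(...);	/* merge / split decided here,
					   while the request is not queued
					   anywhere yet */
    ... only now associate the request with the final bfqq (and thus
        with its bfq_group / blkg) ...
  __bfq_insert_request()
    ... no queue switching left to do here ...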

So my current plan is:

Implement 3) as a quick and easily backportable fix for the use-after-free
issue. Then work on 2) and submit it with 4) as cleanups & future-proofing.
Once this is done, we can think about 1) but probably we should try to
see how bad that problem is in practice and whether the solution is worth
it.
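
For 3), the check itself should be small. An untested sketch of the idea,
placed in bfq_bic_update_cgroup() where the serial number mismatch is
detected (how to properly drop the reference that stable_merge_bfqq holds
is left open here and would need a suitable helper):

	serial_nr = __bio_blkcg(bio)->css.serial_nr;
	if (unlikely(bic->blkcg_serial_nr != serial_nr) &&
	    bic->stable_merge_bfqq) {
		/*
		 * The process has changed blkcg since the stable-merge
		 * candidate was recorded, so that bfqq may sit in a
		 * different (possibly already offline) group - forget it
		 * before the bic's queues get reparented.  Releasing the
		 * reference held through stable_merge_bfqq still needs a
		 * proper helper.
		 */
		bic->stable_merge_bfqq = NULL;
	}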

What do people think?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-22 15:21         ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2021-12-22 15:21 UTC (permalink / raw)
  To: yukuai (C)
  Cc: Jan Kara, Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

On Thu 09-12-21 10:23:33, yukuai (C) wrote:
> We confirmed this by our reproducer through a simple patch:
> stop merging bfq_queues if their parents are different.

Can you please share your reproducer? I have prepared some patches which
I'd like to verify before posting... Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-23  1:02           ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2021-12-23  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

[-- Attachment #1: Type: text/plain, Size: 444 bytes --]

在 2021/12/22 23:21, Jan Kara 写道:
> On Thu 09-12-21 10:23:33, yukuai (C) wrote:
>> We confirmed this by our reproducer through a simple patch:
>> stop merging bfq_queues if their parents are different.
> 
> Can you please share your reproducer? I have prepared some patches which
> I'd like to verify before posting... Thanks!

Hi,

Here is the reproducer; usually the problem comes up within an hour.

Thanks,
Kuai
> 
> 								Honza
> 

[-- Attachment #2: null_bad.sh --]
[-- Type: text/plain, Size: 3242 bytes --]

#!/bin/bash
#
# Reproducer for the BFQ use-after-free: repeatedly load null_blk, run fio
# jobs in blkio cgroups against the null devices with the bfq scheduler,
# migrate the tasks back to the root cgroup, then remove the cgroups and
# the module. Requires root, a cgroup v1 blkio hierarchy, null_blk, fio
# and cgexec (libcgroup).
NR=1
basedir=/sys/fs/cgroup/blkio/null
CG_PREFIX=/sys/fs/cgroup/blkio/null/nullb

function set_cgroup()
{
	testdir=$1
	dev_id=$2
	let weight=RANDOM%900+100
	let iops=RANDOM%1000+100
	let bps=RANDOM%10485760+10485760
	echo "$weight" > $testdir/blkio.bfq.weight
	echo "$dev_id $iops" > $testdir/blkio.throttle.read_iops_device
	echo "$dev_id $iops" > $testdir/blkio.throttle.write_iops_device
	echo "$dev_id $bps" > $testdir/blkio.throttle.read_bps_device
	echo "$dev_id $bps" > $testdir/blkio.throttle.write_bps_device
}

function set_sys()
{
	local queue_dir=/sys/block/$1/queue

	let rq_affinity=RANDOM%3
	echo $rq_affinity > $queue_dir/rq_affinity

	let add_random=RANDOM%2
	echo $add_random > $queue_dir/add_random

	let rotational=RANDOM%2
	echo $rotational > $queue_dir/rotational

	let nomerges=RANDOM%2
	echo $nomerges > $queue_dir/nomerges

	# Note: the randomized $scheduler value chosen below is never used;
	# bfq is unconditionally selected at the end, which is what this
	# test needs.
	let s_num=RANDOM%5
	case $s_num in
		0)
		scheduler=none
		;;
		1)
		scheduler=bfq
		;;
		2)
		scheduler=bfq
		;;
		3)
		scheduler=none
		;;
	esac
	echo bfq > $queue_dir/scheduler
}

create_cg()
{
	local i
	local path

	for i in $(seq 0 $NR)
	do
		path=${CG_PREFIX}${i}
		mkdir -p $path
	done
}

switch_cg()
{
	local path=${CG_PREFIX}$1
	local t

	for t in $(cat $path/tasks)
	do
		echo $t > /sys/fs/cgroup/blkio/tasks
	done

	echo "tasks in $path"
	cat $path/tasks
}

rm_cg()
{
	local path=${CG_PREFIX}$1

	rmdir $path
	return $?
}

mkdir $basedir
cgdir1=/sys/fs/cgroup/blkio/null/nullb0
cgdir2=/sys/fs/cgroup/blkio/null/nullb1

ADD_MOD="modprobe null_blk"
while true
do
	let flag=RANDOM%2
	if [ $flag -eq 1 ];then
		$ADD_MOD queue_mode=2 blocking=1 nr_devices=2
	else
		$ADD_MOD queue_mode=2 nr_devices=2
	fi
		
	create_cg

	dev_id=`lsblk | grep nullb0 | awk '{print $2}'`
	set_cgroup $basedir $dev_id 
	set_sys nullb0

	dev_id=`lsblk | grep nullb1 | awk '{print $2}'`
	set_cgroup $basedir $dev_id 
	set_sys nullb1

	let flag=RANDOM%20
	if [ $flag -eq 5 ];then
		echo 1 > /sys/block/nullb0/make-it-fail
		echo 1 > /sys/block/nullb1/make-it-fail
	else
		echo 0 > /sys/block/nullb0/make-it-fail
		echo 0 > /sys/block/nullb1/make-it-fail
	fi

	i=0
	while [ $i -le 3 ]
	do
		cgexec -g "blkio:null/nullb0" fio -filename=/dev/nullb0 -ioengine=libaio -time_based=1 -rw=rw -thread -size=100g -bs=512 -numjobs=4 -iodepth=8 -runtime=5 -group_reporting -name=brd-IOwrite -rwmixread=50 &>/dev/null &
		cgexec -g "blkio:null/nullb0" fio -filename=/dev/nullb0 -ioengine=psync -direct=1 -time_based=1 -rw=rw -thread -size=100g -bs=512 -numjobs=4 -iodepth=8 -runtime=5 -group_reporting -name=brd-IOwrite -rwmixread=50 &>/dev/null &
		cgexec -g "blkio:null/nullb1" fio -filename=/dev/nullb1 -ioengine=libaio -time_based=1 -rw=rw -thread -size=100g -bs=1024k -numjobs=4 -iodepth=8 -runtime=5 -group_reporting -name=brd-IOwrite -rwmixread=50 &>/dev/null &
		cgexec -g "blkio:null/nullb1" fio -filename=/dev/nullb1 -ioengine=psync -direct=1 -time_based=1 -rw=rw -thread -size=100g -bs=1024k -numjobs=4 -iodepth=8 -runtime=5 -group_reporting -name=brd-IOwrite -rwmixread=50 &>/dev/null &
		((i=i+1))
	done

	sleep 3

	until rm_cg 0
	do
		switch_cg 0
		sleep 0.1
	done

	until rm_cg 1
	do
		switch_cg 1
		sleep 0.1
	done

	while true
	do
		rmmod null_blk &>/dev/null && break
		sleep 0.1
	done
done


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Use after free with BFQ and cgroups
@ 2021-12-23 17:13             ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2021-12-23 17:13 UTC (permalink / raw)
  To: yukuai (C)
  Cc: Jan Kara, Michal Koutný, Paolo Valente, linux-block, fvogt, cgroups

On Thu 23-12-21 09:02:55, yukuai (C) wrote:
> 在 2021/12/22 23:21, Jan Kara 写道:
> > On Thu 09-12-21 10:23:33, yukuai (C) wrote:
> > > We confirmed this by our reproducer through a simple patch:
> > > stop merging bfq_queues if their parents are different.
> > 
> > Can you please share your reproducer? I have prepared some patches which
> > I'd like to verify before posting... Thanks!
> 
> Hi,
> 
> Here is the reproducer, usually the problem will come up within an
> hour.

Thanks! For some reason the reproducer does not trigger the KASAN warning
for me (with or without my patches). But anyway, I'll post my fixes and
we'll see what people think; also I'd be grateful if you could test them
in your environment. Thanks.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-12-23 17:13 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-25 17:28 Use after free with BFQ and cgroups Jan Kara
2021-11-26 14:47 ` Michal Koutný
2021-11-26 14:47   ` Michal Koutný
2021-11-29 17:11   ` Jan Kara
2021-11-29 17:11     ` Jan Kara
2021-12-09  2:23     ` yukuai (C)
2021-12-09  2:23       ` yukuai (C)
2021-12-09 15:33       ` Paolo Valente
2021-12-09 15:33         ` Paolo Valente
2021-12-13 17:33       ` Jan Kara
2021-12-14  1:24         ` yukuai (C)
2021-12-14  1:24           ` yukuai (C)
2021-12-20 18:38           ` Jan Kara
2021-12-20 18:38             ` Jan Kara
2021-12-22 15:21       ` Jan Kara
2021-12-22 15:21         ` Jan Kara
2021-12-23  1:02         ` yukuai (C)
2021-12-23  1:02           ` yukuai (C)
2021-12-23 17:13           ` Jan Kara
2021-12-23 17:13             ` Jan Kara
2021-11-29 17:12   ` Tejun Heo
2021-11-30 11:50     ` Jan Kara
2021-11-30 16:22       ` Tejun Heo
2021-11-30 16:22         ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.