* Re: [BUG] potential hugetlb css refcounting issues
       [not found] <8a4f2fbc-76e8-b67b-f110-30beff2228f5@oracle-com>
@ 2021-08-27 22:58   ` Guillaume Morin
  0 siblings, 0 replies; 7+ messages in thread
From: Guillaume Morin @ 2021-08-27 22:58 UTC (permalink / raw)
  To: almasrymina, mike.kravetz, cgroups, guillaume, linux-mm

Hello Mike,

I really appreciate the quick reply.

Mike Kravetz wrote:
> There have been other hugetlb cgroup fixes since 5.10.  I do not believe
> they are related to the underflow issue you have seen.  Just FYI.

Yes, I am aware. I did my best to review all the recent changes not
backported to 5.10 and could not find anything related. I cherry-picked
a couple of fixes just in case, but the bug did not go away.

> However, when a vma is split both resulting vmas would be 'owners' of
> private mapping reserves without incrementing the refcount which would
> lead to the underflow you describe.

Indeed, and I do know that the programs running on my reproducer
machines do split vmas.
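
For what it's worth, the splitting boils down to something like this
(a minimal hypothetical sketch, not our actual code; it assumes a
2 MiB huge page size and MAP_HUGETLB support):

#include <sys/mman.h>

int main(void)
{
	size_t huge = 2UL << 20;	/* assumed huge page size */
	char *p = mmap(NULL, 4 * huge, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;			/* fault in one huge page */
	/* mprotect() on a subrange splits the single vma into three,
	 * so hugetlb_vm_op_open() runs for the new sibling vmas */
	mprotect(p + huge, huge, PROT_READ);
	munmap(p, 4 * huge);		/* hugetlb_vm_op_close() per vma */
	return 0;
}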

>> 2. After 08cf9faf75580, __free_huge_page() decrements the css
>> refcount for _each_ page unconditionally by calling
>> hugetlb_cgroup_uncharge_page_rsvd().  But a css reference is only
>> taken *per page* outside the reserve case in alloc_huge_page()
>> (i.e. hugetlb_cgroup_charge_cgroup_rsvd() is called only if
>> deferred_reserve is true).  In the reserve case, there is only one
>> css reference, linked to the resv map (taken in
>> hugetlb_reserve_pages()).  This also leads to an underflow of the
>> counter.  A scheme similar to HPageRestoreReserve could be used to
>> track which pages were allocated in the deferred_reserve case and
>> to call hugetlb_cgroup_uncharge_page_rsvd() only for these during
>> freeing.

> I am not sure about the above analysis.  It is true that
> hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
> free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
> only decrement the css refcount if there is a non-NULL hugetlb_cgroup
> pointer in the page.  And, the pointer in the page would only be set
> in the 'deferred_reserve' path of alloc_huge_page.  Unless I am
> missing something, they seem to balance.

Now that you have explained it, I am pretty sure that you're right and
I was wrong.

I'll confirm that I can't reproduce the issue without my change for
issue 2.

Thank you,

Guillaume.

-- 
Guillaume Morin <guillaume@morinfr.org>


* Re: [BUG] potential hugetlb css refcounting issues
@ 2021-08-28 19:37     ` Guillaume Morin
  0 siblings, 0 replies; 7+ messages in thread
From: Guillaume Morin @ 2021-08-28 19:37 UTC (permalink / raw)
  To: almasrymina, mike.kravetz, cgroups, guillaume, linux-mm

On 28 Aug  0:58, Guillaume Morin wrote:
> > I am not sure about the above analysis.  It is true that
> > hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
> > free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
> > only decrement the css refcount if there is a non-NULL hugetlb_cgroup
> > pointer in the page.  And, the pointer in the page would only be set
> > in the 'deferred_reserve' path of alloc_huge_page.  Unless I am
> > missing something, they seem to balance.
> 
> Now that you have explained it, I am pretty sure that you're right
> and I was wrong.
> 
> I'll confirm that I can't reproduce the issue without my change for
> issue 2.

Confirmed. With only the patch for the first issue, the bug is indeed
fixed. I must have messed something up during my testing...

Anyway, this is the change for issue 1:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea35ba6699f..00ad4af0399b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4033,8 +4033,11 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 	 * after this open call completes.  It is therefore safe to take a
 	 * new reference here without additional locking.
 	 */
-	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+	if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+		if (resv->css)
+			css_get(resv->css);
 		kref_get(&resv->refs);
+	}
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
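
With this change the gets and puts should balance. Roughly, as I
understand the lifecycle (a sketch, assuming one split of the owner
vma):

/*
 * hugetlb_reserve_pages()  -> css_get()   once, for the original vma
 * hugetlb_vm_op_open()     -> css_get()   once per split sibling
 * hugetlb_vm_op_close()    -> css_put()   once per owner vma
 *
 * gets == puts, so the css refcount no longer underflows
 */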

-- 
Guillaume Morin <guillaume@morinfr.org>


* Re: [BUG] potential hugetlb css refcounting issues
@ 2021-08-27 22:22   ` Mike Kravetz
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Kravetz @ 2021-08-27 22:22 UTC (permalink / raw)
  To: Mina Almasry, cgroups; +Cc: Linux Memory Management List

CC'ing linux-mm 

On 8/27/21 8:11 AM, Guillaume Morin wrote:
> Hi,
> 
> After upgrading from 5.4 to 5.10 (though I believe that these issues
> are still present in 5.14), we noticed refcount warnings pertaining to
> a corrupted reference count of the hugetlb css, e.g.
> 
> [  704.259734] percpu ref (css_release) <= 0 (-1) after switching to atomic
> [  704.259755] WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
> [  704.259911] CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G           O      5.10.60 #1
> [  704.259916] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
> [  704.259920] Code: eb b1 80 3d 37 4f 0a 01 00 0f 85 5d ff ff ff 49 8b 55 e0 48 c7 c7 38 57 0d 94 c6 05 1f 4f 0a 01 01 49 8b 75 e8 e8 a9 e5 c1 ff <0f> 0b e9 3b ff ff ff 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89
> [  704.259922] RSP: 0000:ffffb19b4684bdd0 EFLAGS: 00010282
> [  704.259924] RAX: 0000000000000000 RBX: 7ffffffffffffffe RCX: 0000000000000027
> [  704.259926] RDX: 0000000000000000 RSI: ffff9a81ffb58b40 RDI: ffff9a81ffb58b48
> [  704.259927] RBP: ffffb19b4684bde8 R08: 0000000000000003 R09: 0000000000000001
> [  704.259929] R10: 0000000000000003 R11: ffffb19b4684bb70 R12: 0000370946a03b50
> [  704.259931] R13: ffff9a72c9ceb860 R14: 0000000000000000 R15: ffff9a72c42f4000
> [  704.259933] FS:  0000000000000000(0000) GS:ffff9a81ffb40000(0000) knlGS:0000000000000000
> [  704.259935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  704.259936] CR2: 0000000001416318 CR3: 000000011e1ac003 CR4: 00000000003706e0
> [  704.259938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  704.259939] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  704.259941] Call Trace:
> [  704.259950]  rcu_core+0x30f/0x530
> [  704.259955]  rcu_core_si+0xe/0x10
> [  704.259959]  __do_softirq+0x103/0x2a2
> [  704.259964]  ? sort_range+0x30/0x30
> [  704.259968]  run_ksoftirqd+0x2b/0x40
> [  704.259971]  smpboot_thread_fn+0x11a/0x170
> [  704.259975]  kthread+0x10a/0x140
> [  704.259978]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  704.259983]  ret_from_fork+0x22/0x30
> 
> The box would soon crash due to a GPF or NULL pointer dereference,
> either in cgroups_destroy or in the kill_css path. We confirmed the
> issue was specific to the hugetlb css by manually disabling its use
> and verifying that the box then stayed up and happy.

Thanks for reporting this Guillaume!

There have been other hugetlb cgroup fixes since 5.10.  I do not believe
they are related to the underflow issue you have seen.  Just FYI.

I do hope Mina is taking a look as well.  His recent/ongoing mremap work
is also touching these areas of the code.

> I believe there might be two distinct bugs leading to this. I am not
> familiar with the cgroup code, so I might be off base here. I did my
> best to track this down and understand the logic. Any feedback will
> be welcome.
> 
> I have not provided patches because, if I am correct, they're fairly
> trivial, and since I am unfamiliar with this code, I am afraid they
> would not be that helpful.  But I can provide them if anybody is
> interested.
> 
> 1. Since e9fe92ae0cd2, hugetlb_vm_op_close() decreases the refcount of
> the css (if present) through the hugetlb_cgroup_uncharge_counter() call
> if a resv map is set on the vma and the owner flag is present (i.e.
> private mapping).  In the most basic case, the corresponding refcount
> increase is done in hugetlb_reserve_pages().
> However, when sibling vmas are opened and hugetlb_vm_op_open() is
> called, the resv map reference count is increased (if vma_resv_map(vma)
> is not NULL for private mappings), but the refcount of a potential
> resv->css (i.e. if resv->css != NULL) is not.
> When these sibling vmas are closed, the refcount will still be
> decreased once per such vma, leading to an underflow and premature
> release (potentially a use after free) of the hugetlb css.  The fix
> would be to call css_get() in hugetlb_vm_op_open() if resv->css != NULL.

That does appear to be a real issue.  In the 'common' fork case, a vma
is duplicated but reset_vma_resv_huge_pages() is called to clear the
pointer to the reserve map for private mappings before calling
hugetlb_vm_op_open.  However, when a vma is split both resulting vmas
would be 'owners' of private mapping reserves without incrementing the
refcount which would lead to the underflow you describe.
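
Roughly, the sequence I have in mind (a simplified sketch, not the
actual code):

/*
 * hugetlb_reserve_pages()   css_get(resv->css)      css refcount +1
 * __split_vma()
 *   hugetlb_vm_op_open()    kref_get(&resv->refs)   css refcount unchanged
 * unmap/exit, per vma:
 *   hugetlb_vm_op_close()   hugetlb_cgroup_uncharge_counter()
 *                           -> css_put(resv->css)   css refcount -1 each
 *
 * One get but two puts after a single split: underflow.
 */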

> 2. After 08cf9faf75580, __free_huge_page() decrements the css refcount
> for _each_ page unconditionally by calling
> hugetlb_cgroup_uncharge_page_rsvd().
> But a css reference is only taken *per page* outside the reserve case
> in alloc_huge_page() (i.e. hugetlb_cgroup_charge_cgroup_rsvd() is
> called only if deferred_reserve is true).  In the reserve case, there
> is only one css reference, linked to the resv map (taken in
> hugetlb_reserve_pages()).
> This also leads to an underflow of the counter.  A scheme similar to
> HPageRestoreReserve could be used to track which pages were allocated
> in the deferred_reserve case and to call
> hugetlb_cgroup_uncharge_page_rsvd() only for these during freeing.

I am not sure about the above analysis.  It is true that
hugetlb_cgroup_uncharge_page_rsvd is called unconditionally in
free_huge_page.  However, IIUC hugetlb_cgroup_uncharge_page_rsvd will
only decrement the css refcount if there is a non-NULL hugetlb_cgroup
pointer in the page.  And, the pointer in the page would only be set in
the 'deferred_reserve' path of alloc_huge_page.  Unless I am missing
something, they seem to balance.
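
That is, simplified (a sketch of the two paths as I read them, not the
exact source):

/* alloc_huge_page(), simplified: */
if (deferred_reserve) {
	/* css_get() on the reservation cgroup */
	hugetlb_cgroup_charge_cgroup_rsvd(idx, pages_per_huge_page(h),
					  &h_cg_rsvd);
	/* stores the hugetlb_cgroup pointer in the page */
	hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
					  h_cg_rsvd, page);
}

/* free_huge_page(), simplified: css_put() happens only when the stored
 * hugetlb_cgroup pointer is non-NULL, i.e. only for pages that went
 * through the deferred_reserve path above */
hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
				  pages_per_huge_page(h), page);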

In any case, I will take a closer look at the code with an eye on
underflows.  Thanks for the report and the help!
-- 
Mike Kravetz


* [BUG] potential hugetlb css refcounting issues
@ 2021-08-27 15:11 Guillaume Morin
  2021-08-27 22:22   ` Mike Kravetz
  0 siblings, 1 reply; 7+ messages in thread
From: Guillaume Morin @ 2021-08-27 15:11 UTC (permalink / raw)
  To: Mina Almasry, Mike Kravetz; +Cc: cgroups

Hi,

After upgrading from 5.4 to 5.10 (though I believe that these issues
are still present in 5.14), we noticed refcount warnings pertaining to
a corrupted reference count of the hugetlb css, e.g.

[  704.259734] percpu ref (css_release) <= 0 (-1) after switching to atomic
[  704.259755] WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
[  704.259911] CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G           O      5.10.60 #1
[  704.259916] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
[  704.259920] Code: eb b1 80 3d 37 4f 0a 01 00 0f 85 5d ff ff ff 49 8b 55 e0 48 c7 c7 38 57 0d 94 c6 05 1f 4f 0a 01 01 49 8b 75 e8 e8 a9 e5 c1 ff <0f> 0b e9 3b ff ff ff 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89
[  704.259922] RSP: 0000:ffffb19b4684bdd0 EFLAGS: 00010282
[  704.259924] RAX: 0000000000000000 RBX: 7ffffffffffffffe RCX: 0000000000000027
[  704.259926] RDX: 0000000000000000 RSI: ffff9a81ffb58b40 RDI: ffff9a81ffb58b48
[  704.259927] RBP: ffffb19b4684bde8 R08: 0000000000000003 R09: 0000000000000001
[  704.259929] R10: 0000000000000003 R11: ffffb19b4684bb70 R12: 0000370946a03b50
[  704.259931] R13: ffff9a72c9ceb860 R14: 0000000000000000 R15: ffff9a72c42f4000
[  704.259933] FS:  0000000000000000(0000) GS:ffff9a81ffb40000(0000) knlGS:0000000000000000
[  704.259935] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  704.259936] CR2: 0000000001416318 CR3: 000000011e1ac003 CR4: 00000000003706e0
[  704.259938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  704.259939] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  704.259941] Call Trace:
[  704.259950]  rcu_core+0x30f/0x530
[  704.259955]  rcu_core_si+0xe/0x10
[  704.259959]  __do_softirq+0x103/0x2a2
[  704.259964]  ? sort_range+0x30/0x30
[  704.259968]  run_ksoftirqd+0x2b/0x40
[  704.259971]  smpboot_thread_fn+0x11a/0x170
[  704.259975]  kthread+0x10a/0x140
[  704.259978]  ? kthread_create_worker_on_cpu+0x70/0x70
[  704.259983]  ret_from_fork+0x22/0x30

The box would soon crash due to a GPF or NULL pointer dereference,
either in cgroups_destroy or in the kill_css path. We confirmed the
issue was specific to the hugetlb css by manually disabling its use
and verifying that the box then stayed up and happy.

I believe there might be two distinct bugs leading to this. I am not
familiar with the cgroup code, so I might be off base here. I did my
best to track this down and understand the logic. Any feedback will be
welcome.

I have not provided patches because, if I am correct, they're fairly
trivial, and since I am unfamiliar with this code, I am afraid they
would not be that helpful.  But I can provide them if anybody is
interested.

1. Since e9fe92ae0cd2, hugetlb_vm_op_close() decreases the refcount of
the css (if present) through the hugetlb_cgroup_uncharge_counter() call
if a resv map is set on the vma and the owner flag is present (i.e.
private mapping).  In the most basic case, the corresponding refcount
increase is done in hugetlb_reserve_pages().
However, when sibling vmas are opened and hugetlb_vm_op_open() is
called, the resv map reference count is increased (if vma_resv_map(vma)
is not NULL for private mappings), but the refcount of a potential
resv->css (i.e. if resv->css != NULL) is not.
When these sibling vmas are closed, the refcount will still be
decreased once per such vma, leading to an underflow and premature
release (potentially a use after free) of the hugetlb css.  The fix
would be to call css_get() in hugetlb_vm_op_open() if resv->css != NULL.
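
For reference, the decrement on the close side looks roughly like this,
as far as I can tell (simplified from mm/hugetlb_cgroup.c):

void hugetlb_cgroup_uncharge_counter(struct resv_map *resv,
				     unsigned long start, unsigned long end)
{
	if (!resv || !resv->reservation_counter || !resv->css)
		return;

	page_counter_uncharge(resv->reservation_counter,
			      (end - start) * resv->pages_per_hpage);
	/* one put per hugetlb_vm_op_close(), i.e. per owner vma */
	css_put(resv->css);
}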

2. After 08cf9faf75580, __free_huge_page() decrements the css refcount
for _each_ page unconditionally by calling
hugetlb_cgroup_uncharge_page_rsvd().
But a css reference is only taken *per page* outside the reserve case
in alloc_huge_page() (i.e. hugetlb_cgroup_charge_cgroup_rsvd() is
called only if deferred_reserve is true).  In the reserve case, there
is only one css reference, linked to the resv map (taken in
hugetlb_reserve_pages()).
This also leads to an underflow of the counter.  A scheme similar to
HPageRestoreReserve could be used to track which pages were allocated
in the deferred_reserve case and to call
hugetlb_cgroup_uncharge_page_rsvd() only for these during freeing.

HTH,

Guillaume.

-- 
Guillaume Morin <guillaume@morinfr.org>
