[RFC] hugetlb_cgroup: fix unbalanced css_put for shared mappings

Message ID 20210123093111.60785-1-linmiaohe@huawei.com
State New, archived

Commit Message

Miaohe Lin Jan. 23, 2021, 9:31 a.m. UTC
The current implementation of hugetlb_cgroup for shared mappings can end up
holding different numbers of css references for the same final set of
file_regions. Consider the following two scenarios:

1. Assume the initial css reference count of the hugetlb_cgroup is 1:
  1.1 Call hugetlb_reserve_pages with from = 1, to = 2. The css reference
count is now 2, associated with 1 file_region.
  1.2 Call hugetlb_reserve_pages with from = 2, to = 3. The css reference
count is now 3, associated with 2 file_regions.
  1.3 coalesce_file_region coalesces these two file_regions into one, so
the css reference count is 3, associated with 1 file_region.

2. Assume the initial css reference count of the hugetlb_cgroup is 1 again:
  2.1 Call hugetlb_reserve_pages with from = 1, to = 3. The css reference
count is now 2, associated with 1 file_region.

Therefore, a single file_region may hold one or more css references. This
inconsistency leads to unbalanced css_put() calls. If we css_put the
references one by one (e.g. the hole punch case), scenario 2 drops one
reference too many. If we css_put them all at once (e.g. the truncate
case), scenario 1 leaks one reference.

To fix this, make sure each file_region holds exactly one css reference:
release one css reference before coalescing in coalesce_file_region, and
only put the css reference when the entire file_region is removed.

Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: stable@kernel.org
---
 include/linux/hugetlb_cgroup.h |  6 ++++--
 mm/hugetlb.c                   | 18 ++++++++++++++----
 mm/hugetlb_cgroup.c            | 10 ++++++++--
 3 files changed, 26 insertions(+), 8 deletions(-)

Comments

Mike Kravetz Feb. 8, 2021, 7:52 p.m. UTC | #1
On 1/23/21 1:31 AM, Miaohe Lin wrote:
> The current implementation of hugetlb_cgroup for shared mappings can end up
> holding different numbers of css references for the same final set of
> file_regions. Consider the following two scenarios:
> 
> 1. Assume the initial css reference count of the hugetlb_cgroup is 1:
>   1.1 Call hugetlb_reserve_pages with from = 1, to = 2. The css reference
> count is now 2, associated with 1 file_region.
>   1.2 Call hugetlb_reserve_pages with from = 2, to = 3. The css reference
> count is now 3, associated with 2 file_regions.
>   1.3 coalesce_file_region coalesces these two file_regions into one, so
> the css reference count is 3, associated with 1 file_region.
> 
> 2. Assume the initial css reference count of the hugetlb_cgroup is 1 again:
>   2.1 Call hugetlb_reserve_pages with from = 1, to = 3. The css reference
> count is now 2, associated with 1 file_region.
> 
> Therefore, a single file_region may hold one or more css references. This
> inconsistency leads to unbalanced css_put() calls. If we css_put the
> references one by one (e.g. the hole punch case), scenario 2 drops one
> reference too many. If we css_put them all at once (e.g. the truncate
> case), scenario 1 leaks one reference.

Sorry for the delay in replying.  This is tricky code and I needed some quiet
time to study it.

I agree that the issue described exists.  Can you describe what a user would
see in the above imbalance scenarios?  What happens if we do one too many
css_put calls?  What happens if we leak the reference and do not do the
required number of css_puts?

The code changes look correct.

I just wish this code was not so complicated.  I think the private mapping
case could be simplified to only take a single css_ref per reserve map.
However, for shared mappings we need to track each individual reservation
which adds the complexity.  I can not think of a better way to do things.

Please update commit message with an explanation of what users might see
because of this issue and resubmit as a patch.

Thanks,
Miaohe Lin Feb. 9, 2021, 3:27 a.m. UTC | #2
On 2021/2/9 3:52, Mike Kravetz wrote:
> On 1/23/21 1:31 AM, Miaohe Lin wrote:
>> The current implementation of hugetlb_cgroup for shared mappings can end up
>> holding different numbers of css references for the same final set of
>> file_regions. Consider the following two scenarios:
>>
>> 1. Assume the initial css reference count of the hugetlb_cgroup is 1:
>>   1.1 Call hugetlb_reserve_pages with from = 1, to = 2. The css reference
>> count is now 2, associated with 1 file_region.
>>   1.2 Call hugetlb_reserve_pages with from = 2, to = 3. The css reference
>> count is now 3, associated with 2 file_regions.
>>   1.3 coalesce_file_region coalesces these two file_regions into one, so
>> the css reference count is 3, associated with 1 file_region.
>>
>> 2. Assume the initial css reference count of the hugetlb_cgroup is 1 again:
>>   2.1 Call hugetlb_reserve_pages with from = 1, to = 3. The css reference
>> count is now 2, associated with 1 file_region.
>>
>> Therefore, a single file_region may hold one or more css references. This
>> inconsistency leads to unbalanced css_put() calls. If we css_put the
>> references one by one (e.g. the hole punch case), scenario 2 drops one
>> reference too many. If we css_put them all at once (e.g. the truncate
>> case), scenario 1 leaks one reference.
> 
> Sorry for the delay in replying.  This is tricky code and I needed some quiet
> time to study it.
> 

That's fine. I was trying to catch more buggy cases too.

> I agree that the issue described exists.  Can you describe what a user would
> see in the above imbalance scenarios?  What happens if we do one too many
> css_put calls?  What happens if we leak the reference and do not do the
> required number of css_puts?
> 

The imbalanced css_get/css_put would leave a non-zero reference count when we
try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed
__but__ the associated resources are not freed. In a really busy workload,
this might eventually result in OOM or in being unable to create new hugetlb
cgroups.

> The code changes look correct.
> 
> I just wish this code was not so complicated.  I think the private mapping
> case could be simplified to only take a single css_ref per reserve map.

Could you explain this in more detail?
It seems one reserve map already takes a single css_ref, and a hugepage
outside of a reservation takes a single css_ref too.

> However, for shared mappings we need to track each individual reservation
> which adds the complexity.  I can not think of a better way to do things.
> 

I can't think of one either. And the fix might make the code more complex. :(

> Please update commit message with an explanation of what users might see
> because of this issue and resubmit as a patch.
> 

Will do. Thanks.

> Thanks,
> 

Many thanks for your reply. :)
Mike Kravetz Feb. 9, 2021, 6:56 p.m. UTC | #3
On 2/8/21 7:27 PM, Miaohe Lin wrote:
> On 2021/2/9 3:52, Mike Kravetz wrote:
>> On 1/23/21 1:31 AM, Miaohe Lin wrote:
>>> The current implementation of hugetlb_cgroup for shared mappings can end up
>>> holding different numbers of css references for the same final set of
>>> file_regions. Consider the following two scenarios:
>>>
>>> 1. Assume the initial css reference count of the hugetlb_cgroup is 1:
>>>   1.1 Call hugetlb_reserve_pages with from = 1, to = 2. The css reference
>>> count is now 2, associated with 1 file_region.
>>>   1.2 Call hugetlb_reserve_pages with from = 2, to = 3. The css reference
>>> count is now 3, associated with 2 file_regions.
>>>   1.3 coalesce_file_region coalesces these two file_regions into one, so
>>> the css reference count is 3, associated with 1 file_region.
>>>
>>> 2. Assume the initial css reference count of the hugetlb_cgroup is 1 again:
>>>   2.1 Call hugetlb_reserve_pages with from = 1, to = 3. The css reference
>>> count is now 2, associated with 1 file_region.
>>>
>>> Therefore, a single file_region may hold one or more css references. This
>>> inconsistency leads to unbalanced css_put() calls. If we css_put the
>>> references one by one (e.g. the hole punch case), scenario 2 drops one
>>> reference too many. If we css_put them all at once (e.g. the truncate
>>> case), scenario 1 leaks one reference.
>>
>> Sorry for the delay in replying.  This is tricky code and I needed some quiet
>> time to study it.
>>
> 
> That's fine. I was trying to catch more buggy cases too.
> 
>> I agree that the issue described exists.  Can you describe what a user would
>> see in the above imbalance scenarios?  What happens if we do one too many
>> css_put calls?  What happens if we leak the reference and do not do the
>> required number of css_puts?
>>
> 
> The imbalanced css_get/css_put would leave a non-zero reference count when we
> try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed
> __but__ the associated resources are not freed. In a really busy workload,
> this might eventually result in OOM or in being unable to create new hugetlb
> cgroups.
> 
>> The code changes look correct.
>>
>> I just wish this code was not so complicated.  I think the private mapping
>> case could be simplified to only take a single css_ref per reserve map.
> 
> Could you explain this in more detail?
> It seems one reserve map already takes a single css_ref, and a hugepage
> outside of a reservation takes a single css_ref too.

Let me preface this by saying that my cgroup knowledge is limited.
For private mappings, all reservations will be associated with the same
cgroup because only one process can access the mapping. Since there is only
one process, we only need to hold one css reference; individual counters can
be incremented as needed without increasing the css reference count. We take
a reference when the resv_map is created and drop the reference when it is
deleted.

This does not work for shared mappings as you can have multiple processes in
multiple cgroups taking reservations on the same file.  This is why you need
per-reservation reference accounting in this case.
Miaohe Lin Feb. 10, 2021, 2:11 a.m. UTC | #4
Hi:
On 2021/2/10 2:56, Mike Kravetz wrote:
> On 2/8/21 7:27 PM, Miaohe Lin wrote:
>> On 2021/2/9 3:52, Mike Kravetz wrote:
>>> On 1/23/21 1:31 AM, Miaohe Lin wrote:
>>>> The current implementation of hugetlb_cgroup for shared mappings can end up
>>>> holding different numbers of css references for the same final set of
>>>> file_regions. Consider the following two scenarios:
>>>>
>>>> 1. Assume the initial css reference count of the hugetlb_cgroup is 1:
>>>>   1.1 Call hugetlb_reserve_pages with from = 1, to = 2. The css reference
>>>> count is now 2, associated with 1 file_region.
>>>>   1.2 Call hugetlb_reserve_pages with from = 2, to = 3. The css reference
>>>> count is now 3, associated with 2 file_regions.
>>>>   1.3 coalesce_file_region coalesces these two file_regions into one, so
>>>> the css reference count is 3, associated with 1 file_region.
>>>>
>>>> 2. Assume the initial css reference count of the hugetlb_cgroup is 1 again:
>>>>   2.1 Call hugetlb_reserve_pages with from = 1, to = 3. The css reference
>>>> count is now 2, associated with 1 file_region.
>>>>
>>>> Therefore, a single file_region may hold one or more css references. This
>>>> inconsistency leads to unbalanced css_put() calls. If we css_put the
>>>> references one by one (e.g. the hole punch case), scenario 2 drops one
>>>> reference too many. If we css_put them all at once (e.g. the truncate
>>>> case), scenario 1 leaks one reference.
>>>
>>> Sorry for the delay in replying.  This is tricky code and I needed some quiet
>>> time to study it.
>>>
>>
>> That's fine. I was trying to catch more buggy cases too.
>>
>>> I agree that the issue described exists.  Can you describe what a user would
>>> see in the above imbalance scenarios?  What happens if we do one too many
>>> css_put calls?  What happens if we leak the reference and do not do the
>>> required number of css_puts?
>>>
>>
>> The imbalanced css_get/css_put would leave a non-zero reference count when we
>> try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed
>> __but__ the associated resources are not freed. In a really busy workload,
>> this might eventually result in OOM or in being unable to create new hugetlb
>> cgroups.
>>
>>> The code changes look correct.
>>>
>>> I just wish this code was not so complicated.  I think the private mapping
>>> case could be simplified to only take a single css_ref per reserve map.
>>
>> Could you explain this in more detail?
>> It seems one reserve map already takes a single css_ref, and a hugepage
>> outside of a reservation takes a single css_ref too.
> 
> Let me preface this by saying that my cgroup knowledge is limited.
> For private mappings, all reservations will be associated with the same
> cgroup because only one process can access the mapping. Since there is only
> one process, we only need to hold one css reference; individual counters can
> be incremented as needed without increasing the css reference count. We take
> a reference when the resv_map is created and drop the reference when it is
> deleted.
> 

I see. Many thanks for the detailed explanation. This could be a point for future optimization.

> This does not work for shared mappings as you can have multiple processes in
> multiple cgroups taking reservations on the same file.  This is why you need
> per-reservation reference accounting in this case.

Thanks again. :)

Patch

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2ad6e92f124a..7610efcd96bd 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -138,7 +138,8 @@  extern void hugetlb_cgroup_uncharge_counter(struct resv_map *resv,
 
 extern void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
 						struct file_region *rg,
-						unsigned long nr_pages);
+						unsigned long nr_pages,
+						bool region_del);
 
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
@@ -147,7 +148,8 @@  extern void hugetlb_cgroup_migrate(struct page *oldhpage,
 #else
 static inline void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
 						       struct file_region *rg,
-						       unsigned long nr_pages)
+						       unsigned long nr_pages,
+						       bool region_del)
 {
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6bad1f686c5..777bc0e45bf3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -298,6 +298,14 @@  static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
 #endif
 }
 
+static void put_uncharge_info(struct file_region *rg)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (rg->css)
+		css_put(rg->css);
+#endif
+}
+
 static bool has_same_uncharge_info(struct file_region *rg,
 				   struct file_region *org)
 {
@@ -321,6 +329,7 @@  static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 		prg->to = rg->to;
 
 		list_del(&rg->link);
+		put_uncharge_info(rg);
 		kfree(rg);
 
 		rg = prg;
@@ -332,6 +341,7 @@  static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 		nrg->from = rg->from;
 
 		list_del(&rg->link);
+		put_uncharge_info(rg);
 		kfree(rg);
 	}
 }
@@ -664,7 +674,7 @@  static long region_del(struct resv_map *resv, long f, long t)
 
 			del += t - f;
 			hugetlb_cgroup_uncharge_file_region(
-				resv, rg, t - f);
+				resv, rg, t - f, false);
 
 			/* New entry for end of split region */
 			nrg->from = t;
@@ -685,7 +695,7 @@  static long region_del(struct resv_map *resv, long f, long t)
 		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
 			del += rg->to - rg->from;
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
-							    rg->to - rg->from);
+							    rg->to - rg->from, true);
 			list_del(&rg->link);
 			kfree(rg);
 			continue;
@@ -693,13 +703,13 @@  static long region_del(struct resv_map *resv, long f, long t)
 
 		if (f <= rg->from) {	/* Trim beginning of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
-							    t - rg->from);
+							    t - rg->from, false);
 
 			del += t - rg->from;
 			rg->from = t;
 		} else {		/* Trim end of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
-							    rg->to - f);
+							    rg->to - f, false);
 
 			del += rg->to - f;
 			rg->to = f;
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 9182848dda3e..8909e075d441 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -391,7 +391,8 @@  void hugetlb_cgroup_uncharge_counter(struct resv_map *resv, unsigned long start,
 
 void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
 					 struct file_region *rg,
-					 unsigned long nr_pages)
+					 unsigned long nr_pages,
+					 bool region_del)
 {
 	if (hugetlb_cgroup_disabled() || !resv || !rg || !nr_pages)
 		return;
@@ -400,7 +401,12 @@  void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
 	    !resv->reservation_counter) {
 		page_counter_uncharge(rg->reservation_counter,
 				      nr_pages * resv->pages_per_hpage);
-		css_put(rg->css);
+		/*
+		 * Only do css_put(rg->css) when we delete the entire region
+		 * because one file_region only holds one rg->css reference.
+		 */
+		if (region_del)
+			css_put(rg->css);
 	}
 }