* [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
@ 2020-12-03 22:02 Mike Kravetz
  2020-12-03 22:11 ` Shakeel Butt
  0 siblings, 1 reply; 2+ messages in thread
From: Mike Kravetz @ 2020-12-03 22:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel, cgroups
  Cc: Mina Almasry, Shakeel Butt, David Rientjes, Greg Thelen,
	Sandipan Das, Shuah Khan, Adrian Moreno, Andrew Morton,
	Mike Kravetz, stable

Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
using hugetlbfs.  In this environment the issue is reproduced by:
1 - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)
2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue
3 - Delete the Pod (or let it "Complete").

This would result in a kworker thread going into a tight loop (top output):
 1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45
kworker/28:7+cgroup_destroy

'perf top -g' reports:
-   63.28%     0.01%  [kernel]                    [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
      - 0.57% schedule
         - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning.  Little else can be done on the system as the
cgroup_mutex cannot be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.

The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each hstate,
and is repeated until hugetlb_cgroup_have_usage returns false.  The routine
moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts.
The routine hugetlb_cgroup_have_usage is checking for both 'usage' and
'reservation' counts.  What to do with reservation counts when
reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup with reservation
counts.  Unfortunately, the code checking reservation counts was
incorrectly added to hugetlb_cgroup_have_usage.
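
For illustration, a condensed sketch of the pre-fix interaction
(paraphrased from the code touched by the diff below, not the exact
upstream source):

	/* hugetlb_cgroup_css_offline(), simplified */
	int idx = 0;

	do {
		for_each_hstate(h) {
			spin_lock(&hugetlb_lock);
			/* reparents only the 'usage' counters */
			list_for_each_entry(page, &h->hugepage_activelist, lru)
				hugetlb_cgroup_move_parent(idx, h_cg, page);
			spin_unlock(&hugetlb_lock);
			idx++;	/* never reset to 0 for the next pass */
		}
		cond_resched();
	/* returns true while 'usage' OR 'reservation' counters are non-zero */
	} while (hugetlb_cgroup_have_usage(h_cg));

Since nothing in the loop body moves or clears the 'reservation'
counters, a non-zero reservation count keeps hugetlb_cgroup_have_usage
returning true and the do-while loop never terminates.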

To fix the issue, simply remove the check for reservation counts.  While
fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed.
The hstate index is not reinitialized each time through the do-while loop.
Fix this as well.

Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Cc: <stable@vger.kernel.org>
Reported-by: Adrian Moreno <amorenoz@redhat.com>
Tested-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb_cgroup.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 1f87aec9ab5c..9182848dda3e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+				hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
-- 
2.28.0




* Re: [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
  2020-12-03 22:02 [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations Mike Kravetz
@ 2020-12-03 22:11 ` Shakeel Butt
  0 siblings, 0 replies; 2+ messages in thread
From: Shakeel Butt @ 2020-12-03 22:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux MM, LKML, Cgroups, Mina Almasry, David Rientjes,
	Greg Thelen, Sandipan Das, Shuah Khan, Adrian Moreno,
	Andrew Morton, stable

On Thu, Dec 3, 2020 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
> using hugetlbfs.  In this environment the issue is reproduced by:
> 1 - Start a simple pod that uses the recently added HugePages medium
>     feature (pod yaml attached)
> 2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
>     packets) nor interact with real hardware. It seems just initializing
>     the EAL layer (which handles hugepage reservation and locking) is
>     enough to trigger the issue
> 3 - Delete the Pod (or let it "Complete").
>
> This would result in a kworker thread going into a tight loop (top output):
>  1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45
> kworker/28:7+cgroup_destroy
>
> 'perf top -g' reports:
> -   63.28%     0.01%  [kernel]                    [k] worker_thread
>    - 49.97% worker_thread
>       - 52.64% process_one_work
>          - 62.08% css_killed_work_fn
>             - hugetlb_cgroup_css_offline
>                  41.52% _raw_spin_lock
>                - 2.82% _cond_resched
>                     rcu_all_qs
>                  2.66% PageHuge
>       - 0.57% schedule
>          - 0.57% __schedule
>
> We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
> Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
> infinitely spinning.  Little else can be done on the system as the
> cgroup_mutex cannot be acquired.
>
> Do note that the issue can be reproduced by simply offlining a hugetlb
> cgroup containing pages with reservation counts.
>
> The loop in hugetlb_cgroup_css_offline is moving page counts from the
> cgroup being offlined to the parent cgroup.  This is done for each hstate,
> and is repeated until hugetlb_cgroup_have_usage returns false.  The routine
> moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts.
> The routine hugetlb_cgroup_have_usage is checking for both 'usage' and
> 'reservation' counts.  What to do with reservation counts when
> reparenting was discussed here:
>
> https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
>
> The decision was made to leave a zombie cgroup with reservation
> counts.  Unfortunately, the code checking reservation counts was
> incorrectly added to hugetlb_cgroup_have_usage.
>
> To fix the issue, simply remove the check for reservation counts.  While
> fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed.
> The hstate index is not reinitialized each time through the do-while loop.
> Fix this as well.
>
> Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
> Cc: <stable@vger.kernel.org>
> Reported-by: Adrian Moreno <amorenoz@redhat.com>
> Tested-by: Adrian Moreno <amorenoz@redhat.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


