From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1462C433FE for ; Sun, 6 Dec 2020 06:15:15 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7A90923110 for ; Sun, 6 Dec 2020 06:15:15 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7A90923110 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 1FC0F6B007B; Sun, 6 Dec 2020 01:15:15 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 188946B007D; Sun, 6 Dec 2020 01:15:15 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 04D386B007E; Sun, 6 Dec 2020 01:15:14 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0186.hostedemail.com [216.40.44.186]) by kanga.kvack.org (Postfix) with ESMTP id E238F6B007B for ; Sun, 6 Dec 2020 01:15:14 -0500 (EST) Received: from smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id A34FD8249980 for ; Sun, 6 Dec 2020 06:15:14 +0000 (UTC) X-FDA: 77561844948.15.tub71_3a1090c273d3 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin15.hostedemail.com (Postfix) with ESMTP id 7F9131814B0C1 for ; Sun, 6 Dec 2020 06:15:14 +0000 (UTC) X-HE-Tag: tub71_3a1090c273d3 X-Filterd-Recvd-Size: 5520 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by imf13.hostedemail.com (Postfix) with ESMTP for ; Sun, 6 Dec 2020 06:15:13 +0000 (UTC) Date: Sat, 05 Dec 2020 22:15:12 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linux-foundation.org; s=korg; t=1607235313; bh=QSV5zHJaoaRWLvhtj+cRLdEs+5kHkjXZ5jnz/ZOflFQ=; h=From:To:Subject:In-Reply-To:From; b=2oAHjelcAt/dvwYbvsSwiEL9VxeNnBNYRJY8pPZfNM0nj27ceSz7DO1OCNfqIvktx GzXB4yEN1u+5tMAtW/MPBSKpN0Ujfh4+ibF2+QouN8B8biueR/DqTCjjJVAWkslBjY PUzuahrlTmEAaMLWKuBhc/iOA071244mT7UjHUmw= From: Andrew Morton To: akpm@linux-foundation.org, almasrymina@google.com, amorenoz@redhat.com, gthelen@google.com, linux-mm@kvack.org, mike.kravetz@oracle.com, mm-commits@vger.kernel.org, rientjes@google.com, sandipan@linux.ibm.com, shakeelb@google.com, shuah@kernel.org, stable@vger.kernel.org, torvalds@linux-foundation.org Subject: [patch 11/12] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations Message-ID: <20201206061512.FBawJ1qq2%akpm@linux-foundation.org> In-Reply-To: <20201205221412.67f14b9b3a5ef531c76dd452@linux-foundation.org> User-Agent: s-nail v14.8.16 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Mike Kravetz Subject: hugetlb_cgroup: fix offline of hugetlb cgroup with reservations Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload using hugetlbfs. In this environment the issue is reproduced by: 1 - Start a simple pod that uses the recently added HugePages medium feature (pod yaml attached) 2 - Start a DPDK app. It doesn't need to run successfully (as in transfer packets) nor interact with real hardware. It seems just initializing the EAL layer (which handles hugepage reservation and locking) is enough to trigger the issue 3 - Delete the Pod (or let it "Complete"). This would result in a kworker thread going into a tight loop (top output): 1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy 'perf top -g' reports: - 63.28% 0.01% [kernel] [k] worker_thread - 49.97% worker_thread - 52.64% process_one_work - 62.08% css_killed_work_fn - hugetlb_cgroup_css_offline 41.52% _raw_spin_lock - 2.82% _cond_resched rcu_all_qs 2.66% PageHuge - 0.57% schedule - 0.57% __schedule We are spinning in the do-while loop in hugetlb_cgroup_css_offline. Worse yet, we are holding the master cgroup lock (cgroup_mutex) while infinitely spinning. Little else can be done on the system as the cgroup_mutex can not be acquired. Do note that the issue can be reproduced by simply offlining a hugetlb cgroup containing pages with reservation counts. The loop in hugetlb_cgroup_css_offline is moving page counts from the cgroup being offlined to the parent cgroup. This is done for each hstate, and is repeated until hugetlb_cgroup_have_usage returns false. The routine moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts. The routine hugetlb_cgroup_have_usage is checking for both 'usage' and 'reservation' counts. Discussion about what to do with reservation counts when reparenting was discussed here: https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/ The decision was made to leave a zombie cgroup for with reservation counts. Unfortunately, the code checking reservation counts was incorrectly added to hugetlb_cgroup_have_usage. To fix the issue, simply remove the check for reservation counts. While fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed. The hstate index is not reinitialized each time through the do-while loop. Fix this as well. Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations") Reported-by: Adrian Moreno Tested-by: Adrian Moreno Signed-off-by: Mike Kravetz Reviewed-by: Shakeel Butt Cc: Mina Almasry Cc: David Rientjes Cc: Greg Thelen Cc: Sandipan Das Cc: Shuah Khan Cc: Signed-off-by: Andrew Morton --- mm/hugetlb_cgroup.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) --- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-fix-offline-of-hugetlb-cgroup-with-reservations +++ a/mm/hugetlb_cgroup.c @@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_u for (idx = 0; idx < hugetlb_max_hstate; idx++) { if (page_counter_read( - hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) || - page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd( - h_cg, idx))) { + hugetlb_cgroup_counter_from_cgroup(h_cg, idx))) return true; - } } return false; } @@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(s struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); struct hstate *h; struct page *page; - int idx = 0; + int idx; do { + idx = 0; for_each_hstate(h) { spin_lock(&hugetlb_lock); list_for_each_entry(page, &h->hugepage_activelist, lru) _