linux-kernel.vger.kernel.org archive mirror
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Adrian Moreno <amorenoz@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	Mina Almasry <almasrymina@google.com>,
	David Rientjes <rientjes@google.com>,
	Greg Thelen <gthelen@google.com>,
	Sandipan Das <sandipan@linux.ibm.com>,
	Shuah Khan <shuah@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5.9 55/75] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Date: Thu, 10 Dec 2020 15:27:20 +0100
Message-ID: <20201210142608.763373590@linuxfoundation.org>
In-Reply-To: <20201210142606.074509102@linuxfoundation.org>

From: Mike Kravetz <mike.kravetz@oracle.com>

commit 7a5bde37983d37783161681ff7c6122dfd081791 upstream.

Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
using hugetlbfs.  In this environment the issue is reproduced by:

 - Start a simple pod that uses the recently added HugePages medium
   feature (pod yaml attached)

 - Start a DPDK app. It doesn't need to run successfully (as in transfer
   packets) nor interact with real hardware. It seems just initializing
   the EAL layer (which handles hugepage reservation and locking) is
   enough to trigger the issue

 - Delete the Pod (or let it "Complete").

This would result in a kworker thread going into a tight loop (top output):

   1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

  -   63.28%     0.01%  [kernel]                    [k] worker_thread
     - 49.97% worker_thread
        - 52.64% process_one_work
           - 62.08% css_killed_work_fn
              - hugetlb_cgroup_css_offline
                   41.52% _raw_spin_lock
                 - 2.82% _cond_resched
                      rcu_all_qs
                   2.66% PageHuge
        - 0.57% schedule
           - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
spinning indefinitely.  Little else can be done on the system because
cgroup_mutex cannot be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.

The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts.  What to do with reservation
counts when reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup behind for cgroups with
reservation counts.  Unfortunately, the code checking reservation counts
was incorrectly added to hugetlb_cgroup_have_usage.
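
The mismatch described above can be illustrated with a small userspace
model.  This is only a sketch, not kernel code: struct model_cg,
MAX_HSTATE, MAX_PASSES and all function names are hypothetical, locking
and page lists are left out, and the pass cap exists only so the model
terminates where the kernel would spin.

```c
#include <stdbool.h>

#define MAX_HSTATE 2     /* assumed number of huge page sizes in this model */
#define MAX_PASSES 1000  /* cap so the model stops instead of spinning */

/* Hypothetical stand-in for a hugetlb cgroup: per-hstate counters only. */
struct model_cg {
	unsigned long usage[MAX_HSTATE];
	unsigned long rsvd[MAX_HSTATE];
};

/* Models the pre-fix check: true if any usage OR reservation count is set. */
bool have_usage_buggy(const struct model_cg *cg)
{
	for (int idx = 0; idx < MAX_HSTATE; idx++)
		if (cg->usage[idx] || cg->rsvd[idx])
			return true;
	return false;
}

/* Models the fixed check: only usage counts are considered. */
bool have_usage_fixed(const struct model_cg *cg)
{
	for (int idx = 0; idx < MAX_HSTATE; idx++)
		if (cg->usage[idx])
			return true;
	return false;
}

/* Models reparenting: only usage counts move; reservations stay behind. */
void move_usage_to_parent(struct model_cg *cg, struct model_cg *parent)
{
	for (int idx = 0; idx < MAX_HSTATE; idx++) {
		parent->usage[idx] += cg->usage[idx];
		cg->usage[idx] = 0;
	}
}

/*
 * Models the do-while loop in the offline path.  Returns the number of
 * passes taken, or -1 if the cap was reached (the kernel would not stop).
 */
int offline_passes(struct model_cg *cg, struct model_cg *parent,
		   bool (*have_usage)(const struct model_cg *))
{
	int passes = 0;

	do {
		if (passes == MAX_PASSES)
			return -1;
		move_usage_to_parent(cg, parent);
		passes++;
	} while (have_usage(cg));

	return passes;
}
```

In this model, a cgroup with a nonzero rsvd count never satisfies the
buggy exit condition, so offline_passes() reaches the cap and returns
-1.  With have_usage_fixed the loop ends after one pass and the
reservation count simply stays on the offlined cgroup.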

To fix the issue, simply remove the check for reservation counts.  While
fixing this issue, a related bug in hugetlb_cgroup_css_offline was
noticed.  The hstate index is not reinitialized each time through the
do-while loop.  Fix this as well.
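
The hstate index problem can be sketched the same way.  The model below
assumes that idx is incremented once per hstate inside the inner loop
(the hunk in this patch does not show that line) and that the outer
loop runs twice.  NR_HSTATES, NR_PASSES and walk() are hypothetical
names used for illustration only.

```c
#define NR_HSTATES 2  /* assumed number of huge page sizes in this model */
#define NR_PASSES  2  /* assumed: the outer do-while loop runs twice */

/*
 * Records which counter index each (pass, hstate) step would use.
 * reinit == 0 models the old code, reinit != 0 models the fix.
 */
void walk(int reinit, int used[NR_PASSES][NR_HSTATES])
{
	int idx = 0;  /* old code: initialized once, at declaration */
	int pass = 0;

	do {
		if (reinit)
			idx = 0;  /* the fix: restart at the first hstate */
		for (int h = 0; h < NR_HSTATES; h++) {
			used[pass][h] = idx;  /* counts for idx would move here */
			idx++;
		}
		pass++;
	} while (pass < NR_PASSES);
}
```

Without the reset, the second pass in this model uses indexes 2 and 3,
which do not correspond to any hstate.  With the reset, every pass uses
indexes 0 and 1.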

Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Reported-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/hugetlb_cgroup.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_u
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+				hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(s
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)


