linux-mm.kvack.org archive mirror
* cgroup_destroy kworker loops on hugetlb_cgroup_css_offline
@ 2020-11-17  8:39 Adrian Moreno
From: Adrian Moreno @ 2020-11-17  8:39 UTC (permalink / raw)
  To: linux-mm

Hello Mike,

I don't usually work on the kernel so please excuse any inaccuracies.

I'm contacting you off-list because, if what I'm facing is confirmed, it might
be considered a security issue (DoS). I'll leave that to your judgement.

I'm seeing an issue related to hugetlb_cgroup:

I'm running:
kubernetes 1.19 + containerd/docker
kernel 5.9.0-36.fc34.x86_64
kernel params: systemd.unified_cgroup_hierarchy=0 default_hugepagesz=1G
hugepagesz=1G hugepages=10
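
One quick way to confirm that the 1G pool requested by these boot parameters
was actually allocated is to read the hugepage counters from /proc/meminfo.
The Python sketch below is only an illustration: the helper name
parse_hugepage_counters is made up, and it assumes nothing beyond the standard
HugePages_* and Hugepagesize fields.

```python
import re

def parse_hugepage_counters(meminfo_text):
    """Return the HugePages_* counters and Hugepagesize (in kB) as ints."""
    counters = {}
    for line in meminfo_text.splitlines():
        # Matches e.g. "HugePages_Total:      10" or "Hugepagesize: 1048576 kB"
        match = re.match(r"(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters
```

On a Linux host this can be fed with open("/proc/meminfo").read(). With the
parameters above, and assuming the boot-time allocation succeeded,
HugePages_Total would be expected to be 10 and Hugepagesize 1048576 (kB).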

I'm still trying to isolate aspects of my setup; currently my reproducer is:
1 - Start a simple pod that uses the recently added HugePages medium feature [1]
(pod yaml attached)
2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
packets) or interact with real hardware. Just initializing the EAL layer
(which handles hugepage reservation and locking) seems to be enough to trigger
the issue.
3 - Delete the Pod (or let it "Complete").

This results in what seems to be a kworker thread endlessly looping over a
spin_lock.

top:
 1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy
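
To catch this state automatically (for example while bisecting), the top entry
above can be matched in batch-mode output. The Python sketch below assumes the
default top column layout (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM,
TIME+, COMMAND) with each process on a single line; the function name and the
90% threshold are arbitrary choices for illustration, not part of any tool.

```python
def find_spinning_cgroup_workers(top_output, cpu_threshold=90.0):
    """Return (pid, cpu_percent, command) for busy cgroup_destroy kworkers."""
    hits = []
    for line in top_output.splitlines():
        fields = line.split()
        # Default top layout has 12 columns; COMMAND is the last one.
        # Header and summary lines do not start with a numeric PID.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        command = fields[11]
        if not (command.startswith("kworker/") and "cgroup_destroy" in command):
            continue
        try:
            cpu = float(fields[8])
        except ValueError:
            # Unexpected %CPU format (e.g. a locale with comma decimals).
            continue
        if cpu >= cpu_threshold:
            hits.append((int(fields[0]), cpu, command))
    return hits
```

The input could come from something like "top -b -n 1"; a non-empty result
after the pod is deleted would indicate that the worker is stuck.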

'perf top -g' reports:
-   63.28%     0.01%  [kernel]                    [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
      - 0.57% schedule
         - 0.57% __schedule

Under certain circumstances (which I'm still trying to understand) this makes
the kernel quite unresponsive, requiring a hard reboot.

I've isolated the issue in a VM and was about to start bisecting it (it does
not happen on kernel-5.6.6-300.fc32).

Do you have any clues or pointers on how to troubleshoot this issue further?

Thanks,

-- 
Adrián Moreno


[1] https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/

[-- Attachment #2: test.yaml --]
[-- Type: application/x-yaml, Size: 734 bytes --]
