From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Roman Gushchin <guro@fb.com>,
	Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5.4 34/54] mm: list_lru: set shrinker map bit when child nr_items is not zero
Date: Thu, 10 Dec 2020 15:27:11 +0100
Message-ID: <20201210142603.714134316@linuxfoundation.org>
In-Reply-To: <20201210142602.037095225@linuxfoundation.org>

From: Yang Shi <shy828301@gmail.com>

commit 8199be001a470209f5c938570cc199abb012fe53 upstream.

While investigating a slab cache bloat problem, we saw a significant
amount of negative dentry cache.  Confusingly, it was neither shrunk by
the reclaimer (the host was under very tight memory pressure) nor by
dropping caches.  The vmcore shows over 14M negative dentry objects on
the lru, but tracing shows they were never scanned at all.

Further investigation shows that the memcg's vfs shrinker_map bit is not
set, so both the reclaimer and dropping caches simply skip calling the
vfs shrinker.  We had to reboot the hosts to get the memory back.
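
To illustrate why a cleared bit hides objects from reclaim, here is a
minimal userspace sketch, not kernel code.  The names sketch_memcg and
sketch_shrink_slab and the fixed-size bitmap are assumptions made for
illustration; the sketch only mirrors the idea that the memcg-aware
reclaim path visits just the shrinkers whose map bit is set.

```c
#define SKETCH_NR_SHRINKERS 8

/* Hypothetical stand-in for a memcg: one map bit and one counter per shrinker. */
struct sketch_memcg {
	unsigned long shrinker_map;             /* bit i set => shrinker i may have objects */
	long nr_items[SKETCH_NR_SHRINKERS];     /* objects actually sitting on each lru */
};

/* Visit only shrinkers whose bit is set; return the number of objects freed. */
static long sketch_shrink_slab(struct sketch_memcg *memcg)
{
	long freed = 0;

	for (int i = 0; i < SKETCH_NR_SHRINKERS; i++) {
		if (!(memcg->shrinker_map & (1UL << i)))
			continue;               /* bit clear: this shrinker is never called */
		freed += memcg->nr_items[i];
		memcg->nr_items[i] = 0;
	}
	return freed;
}
```

With the bit clear, a call frees nothing no matter how large nr_items
is, which matches the "not scanned at all" symptom described above.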

I didn't manage to come up with a reproducer in a test environment, and
the problem can't be reproduced after rebooting.  But code inspection
suggests there is a race between clearing the shrinker map bit and
reparenting.  The hypothesis is elaborated below.

The memcg hierarchy in our production environment looks like:

                root
               /    \
          system   user

The main workloads run under the user slice's children, which create and
remove memcgs frequently.  So reparenting happens very often under the
user slice, but no task is under the user slice directly.

So with the frequent reparenting and tight memory pressure, the following
hypothetical race condition may happen:

       CPU A                            CPU B
reparent
    dst->nr_items == 0
                                 shrinker:
                                     total_objects == 0
    add src->nr_items to dst
    set_bit
                                     return SHRINK_EMPTY
                                     clear_bit
child memcg offline
    replace child's kmemcg_id with
    parent's (in memcg_offline_kmem())
                                  list_lru_del() between shrinker runs
                                     see parent's kmemcg_id
                                     dec dst->nr_items
reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()

                                 The second run of shrinker:
                                     read nr_items without any
                                     synchronization, so it may
                                     see intermediate negative
                                     nr_items then total_objects
                                     may return 0 coincidentally

                                     keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst

After this point dst->nr_items may never reach zero, so reparenting will
never set the shrinker_map bit again.  And since there is no task
directly under the user slice, no new object will be added to its lru to
set the shrinker map bit either.  The bit stays cleared forever.

How does list_lru_del() race with reparenting?  Reparenting replaces the
children's kmemcg_id with the parent's without holding nlru->lock, so
list_lru_del() may see the parent's kmemcg_id while actually deleting
items from the child's lru.  It then decrements the parent's nr_items,
so the parent's nr_items may go negative, as commit 2788cf0c401c ("memcg:
reparent list_lrus and free kmemcg_id on css offline") says.
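
The accounting mismatch can be modeled in a few lines of userspace C.
This is only a sketch under simplifying assumptions: toy_lru_one,
toy_memcg, toy_offline and toy_list_lru_del are hypothetical names,
there is no locking, and the interleaving is replayed sequentially
rather than raced.

```c
/* Hypothetical, simplified model of one per-memcg lru counter. */
struct toy_lru_one {
	long nr_items;
};

/* Hypothetical memcg: kmemcg_id selects which counter list_lru_del() touches. */
struct toy_memcg {
	int kmemcg_id;
};

/* Model of the id switch at offline time, done before items are spliced over. */
static void toy_offline(struct toy_memcg *child, const struct toy_memcg *parent)
{
	child->kmemcg_id = parent->kmemcg_id;
}

/* Model of list_lru_del(): the counter to decrement is chosen by kmemcg_id. */
static void toy_list_lru_del(struct toy_lru_one *lrus, const struct toy_memcg *memcg)
{
	lrus[memcg->kmemcg_id].nr_items--;
}
```

If a child holding 3 items goes offline and 2 of them are deleted
before the splice, the parent's counter (starting at 0) drops to -2
while the child's counter still reads 3.  The sum stays correct at 1,
which is why a negative parent count implies a nonzero child count.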

It is impossible for dst->nr_items to go negative while src->nr_items is
zero at the same time, so it seems we could set the shrinker map bit iff
src->nr_items != 0.  We could synchronize list_lru_count_one() and
reparenting with nlru->lock, but checking src->nr_items in reparenting
seems simplest and avoids lock contention.
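
As a rough userspace comparison of the two conditions (a sketch, not
the kernel code itself): rp_lru, reparent_old and reparent_new are
hypothetical names, the shrinker map bit is reduced to a plain bool,
and locking is omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a per-memcg lru; only the counter is modeled. */
struct rp_lru {
	long nr_items;
};

/* Pre-patch condition: set the bit only if dst was empty and src was not. */
static bool reparent_old(struct rp_lru *src, struct rp_lru *dst, bool bit)
{
	bool set = (!dst->nr_items && src->nr_items);

	dst->nr_items += src->nr_items;
	if (set)
		bit = true;
	src->nr_items = 0;
	return bit;
}

/* Post-patch condition: set the bit whenever src contributes any items. */
static bool reparent_new(struct rp_lru *src, struct rp_lru *dst, bool bit)
{
	if (src->nr_items) {
		dst->nr_items += src->nr_items;
		bit = true;
		src->nr_items = 0;
	}
	return bit;
}
```

Starting from dst->nr_items == -2, src->nr_items == 10 and a cleared
bit, both variants leave dst->nr_items at 8, but only reparent_new()
returns the bit set.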

Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>	[4.19]
Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/list_lru.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -544,7 +544,6 @@ static void memcg_drain_list_lru_node(st
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -556,11 +555,12 @@ static void memcg_drain_list_lru_node(st
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }


