linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Johannes Weiner <hannes@cmpxchg.org>,
	Shakeel Butt <shakeelb@google.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: [PATCH 5.1 24/55] mm: fix page cache convergence regression
Date: Tue,  2 Jul 2019 10:01:32 +0200	[thread overview]
Message-ID: <20190702080125.306608913@linuxfoundation.org> (raw)
In-Reply-To: <20190702080124.103022729@linuxfoundation.org>

From: Johannes Weiner <hannes@cmpxchg.org>

commit 7b785645e8f13e17cbce492708cf6e7039d32e46 upstream.

Since a28334862993 ("page cache: Finish XArray conversion"), on most
major Linux distributions, the page cache doesn't correctly transition
when the hot data set is changing, and leaves the new pages thrashing
indefinitely instead of kicking out the cold ones.

On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
running stock Arch Linux:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120086/153600 workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
104029/153600 workingset-a
120268/153600 workingset-b

workingset-b is a 600M file on a 1G host that is otherwise entirely
idle. No matter how often it's being accessed, it won't get cached.

While investigating, I noticed that the non-resident information gets
aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
a problem because a workingset transition like this relies on the
non-resident information tracked in the page cache tree of evicted
file ranges: when the cache faults are refaults of recently evicted
cache, we challenge the existing active set, and that allows a new
workingset to establish itself.

Tracing the shrinker that maintains this memory revealed that all page
cache tree nodes were allocated to the root cgroup. This is a problem,
because 1) the shrinker sizes the amount of non-resident information
it keeps to the size of the cgroup's other memory and 2) on most major
Linux distributions, only kernel threads live in the root cgroup and
everything else gets put into services or session groups:

[root@ham ~]# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c1.scope

As a result, we basically maintain no non-resident information for the
workloads running on the system, thus breaking the caching algorithm.

Looking through the code, I found the culprit in the above-mentioned
patch: when switching from the radix tree to xarray, it dropped the
__GFP_ACCOUNT flag from the tree node allocations - the flag that
makes sure the allocated memory gets charged to and tracked by the
cgroup of the calling process - in this case, the one doing the fault.

To fix this, allow xarray users to specify per-tree flag that makes
xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
tree annotation to request such cgroup tracking for the cache nodes.

With this patch applied, the page cache correctly converges on new
workingsets again after just a few iterations:

[root@ham ~]# ./reclaimtest.sh
+ dd of=workingset-a bs=1M count=0 seek=600
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ cat workingset-a
+ ./mincore workingset-a
153600/153600 workingset-a
+ dd of=workingset-b bs=1M count=0 seek=600
+ cat workingset-b
+ ./mincore workingset-a workingset-b
124607/153600 workingset-a
87876/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
81313/153600 workingset-a
133321/153600 workingset-b
+ cat workingset-b
+ ./mincore workingset-a workingset-b
63036/153600 workingset-a
153600/153600 workingset-b

Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/inode.c             |    2 +-
 include/linux/xarray.h |    1 +
 lib/xarray.c           |   12 ++++++++++--
 3 files changed, 12 insertions(+), 3 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -349,7 +349,7 @@ EXPORT_SYMBOL(inc_nlink);
 
 static void __address_space_init_once(struct address_space *mapping)
 {
-	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
+	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -265,6 +265,7 @@ enum xa_lock_type {
 #define XA_FLAGS_TRACK_FREE	((__force gfp_t)4U)
 #define XA_FLAGS_ZERO_BUSY	((__force gfp_t)8U)
 #define XA_FLAGS_ALLOC_WRAPPED	((__force gfp_t)16U)
+#define XA_FLAGS_ACCOUNT	((__force gfp_t)32U)
 #define XA_FLAGS_MARK(mark)	((__force gfp_t)((1U << __GFP_BITS_SHIFT) << \
 						(__force unsigned)(mark)))
 
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -298,6 +298,8 @@ bool xas_nomem(struct xa_state *xas, gfp
 		xas_destroy(xas);
 		return false;
 	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
 	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 	if (!xas->xa_alloc)
 		return false;
@@ -325,6 +327,8 @@ static bool __xas_nomem(struct xa_state
 		xas_destroy(xas);
 		return false;
 	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
 	if (gfpflags_allow_blocking(gfp)) {
 		xas_unlock_type(xas, lock_type);
 		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
@@ -358,8 +362,12 @@ static void *xas_alloc(struct xa_state *
 	if (node) {
 		xas->xa_alloc = NULL;
 	} else {
-		node = kmem_cache_alloc(radix_tree_node_cachep,
-					GFP_NOWAIT | __GFP_NOWARN);
+		gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
+
+		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+			gfp |= __GFP_ACCOUNT;
+
+		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 		if (!node) {
 			xas_set_err(xas, -ENOMEM);
 			return NULL;



  parent reply	other threads:[~2019-07-02  8:16 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-02  8:01 [PATCH 5.1 00/55] 5.1.16-stable review Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 01/55] arm64: Dont unconditionally add -Wno-psabi to KBUILD_CFLAGS Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 02/55] Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 03/55] qmi_wwan: Fix out-of-bounds read Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 04/55] fs/proc/array.c: allow reporting eip/esp for all coredumping threads Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 05/55] mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 06/55] fs/binfmt_flat.c: make load_flat_shared_library() work Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 07/55] clk: tegra210: Fix default rates for HDA clocks Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 08/55] clk: socfpga: stratix10: fix divider entry for the emac clocks Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 09/55] drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 10/55] drm/i915: Save the old CDCLK atomic state Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 11/55] drm/i915: Remove redundant store of logical CDCLK state Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 12/55] drm/i915: Skip modeset for cdclk changes if possible Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 13/55] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 14/55] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 15/55] mm/page_idle.c: fix oops because end_pfn is larger than max_pfn Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 16/55] mm, swap: fix THP swap out Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 17/55] dm init: fix incorrect uses of kstrndup() Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 18/55] dm log writes: make sure super sector log updates are written in order Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 19/55] io_uring: ensure req->file is cleared on allocation Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 20/55] scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck() Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 21/55] x86/speculation: Allow guests to use SSBD even if host does not Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 22/55] x86/microcode: Fix the microcode load on CPU hotplug for real Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 23/55] x86/resctrl: Prevent possible overrun during bitmap operations Greg Kroah-Hartman
2019-07-02  8:01 ` Greg Kroah-Hartman [this message]
2019-07-02  8:01 ` [PATCH 5.1 25/55] efi/memreserve: deal with memreserve entries in unmapped memory Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 26/55] NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 27/55] cpu/speculation: Warn on unsupported mitigations= parameter Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 28/55] SUNRPC: Fix up calculation of client message length Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 29/55] irqchip/mips-gic: Use the correct local interrupt map registers Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 30/55] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 31/55] bonding: Always enable vlan tx offload Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 32/55] ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 33/55] net/packet: fix memory leak in packet_set_ring() Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 34/55] net: remove duplicate fetch in sock_getsockopt Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 35/55] net: stmmac: fixed new system time seconds value calculation Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 36/55] net: stmmac: set IC bit when transmitting frames with HW timestamp Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 37/55] net/tls: fix page double free on TX cleanup Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 38/55] sctp: change to hold sk after auth shkey is created successfully Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 39/55] team: Always enable vlan tx offload Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 40/55] tipc: change to use register_pernet_device Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 41/55] tipc: check msg->req data len in tipc_nl_compat_bearer_disable Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 42/55] tun: wake up waitqueues after IFF_UP is set Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 43/55] net: aquantia: fix vlans not working over bridged network Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 44/55] bpf: simplify definition of BPF_FIB_LOOKUP related flags Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 45/55] bpf: lpm_trie: check left child of last leftmost node for NULL Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 46/55] bpf: fix nested bpf tracepoints with per-cpu data Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 47/55] bpf: fix unconnected udp hooks Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 48/55] bpf: udp: Avoid calling reuseports bpf_prog from udp_gro Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 49/55] bpf: udp: ipv6: Avoid running reuseports bpf_prog from __udp6_lib_err Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 50/55] arm64: futex: Avoid copying out uninitialised stack in failed cmpxchg() Greg Kroah-Hartman
2019-07-02  8:01 ` [PATCH 5.1 51/55] bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd Greg Kroah-Hartman
2019-07-03  2:02   ` Sasha Levin
2019-07-03  7:24     ` Greg Kroah-Hartman
2019-07-02  8:02 ` [PATCH 5.1 52/55] futex: Update comments and docs about return values of arch futex code Greg Kroah-Hartman
2019-07-02  8:02 ` [PATCH 5.1 53/55] RDMA: Directly cast the sockaddr union to sockaddr Greg Kroah-Hartman
2019-07-02  8:02 ` [PATCH 5.1 54/55] fanotify: update connector fsid cache on add mark Greg Kroah-Hartman
2019-07-02  8:02 ` [PATCH 5.1 55/55] tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb Greg Kroah-Hartman
2019-07-02 14:32 ` [PATCH 5.1 00/55] 5.1.16-stable review kernelci.org bot
2019-07-02 17:39 ` Naresh Kamboju
2019-07-03  9:11   ` Greg Kroah-Hartman
2019-07-02 18:06 ` Jiunn Chang
2019-07-02 21:09 ` Kelsey Skunberg
2019-07-02 22:56 ` shuah
2019-07-03  9:12   ` Greg Kroah-Hartman
2019-07-03  6:26 ` Shreeya Patel
2019-07-03 10:21 ` Jon Hunter
2019-07-03 10:49   ` Greg Kroah-Hartman
2019-07-04  5:27 ` Bharath Vedartham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190702080125.306608913@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shakeelb@google.com \
    --cc=stable@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).