linux-kernel.vger.kernel.org archive mirror
From: Yosry Ahmed <yosryahmed@google.com>
To: Liu Shixin <liushixin2@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	wangkefeng.wang@huawei.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2] mm: vmscan: reclaim anon pages if there are swapcache pages
Date: Thu, 24 Aug 2023 11:27:12 -0700	[thread overview]
Message-ID: <CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com> (raw)
In-Reply-To: <14e15f31-f3d3-4169-8ed9-fb36e57cf578@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 6752 bytes --]

On Wed, Aug 23, 2023 at 8:39 PM Liu Shixin <liushixin2@huawei.com> wrote:
>
>
>
> On 2023/8/23 23:29, Yosry Ahmed wrote:
> > On Wed, Aug 23, 2023 at 6:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >> On Wed 23-08-23 10:00:58, Liu Shixin wrote:
> >>>
> >>> On 2023/8/23 0:35, Yosry Ahmed wrote:
> >>>> On Mon, Aug 21, 2023 at 6:54 PM Liu Shixin <liushixin2@huawei.com> wrote:
> >>>>> When the space of swap devices is exhausted, only file pages can be
> >>>>> reclaimed. But there may still be swapcache pages on the anon LRU list,
> >>>>> which can lead to a premature out-of-memory.
> >>>>>
> >>>>> This problem can be fixed by checking the number of swapcache pages in
> >>>>> can_reclaim_anon_pages(). For memcg v2, there is a swapcache stat that
> >>>>> can be used directly. For memcg v1, use total_swapcache_pages() instead,
> >>>>> which may not be accurate but solves the problem.
> >>>> Interesting find. I wonder if we really don't have any handling of
> >>>> this situation.
> >>> I have already tested this and can confirm that it is a real problem.
> >>> With 9MB of swap space and a 10MB mem_cgroup limit, when allocating 15MB
> >>> of memory, there is a probability that an OOM occurs.
> >> Could you be more specific about the test and the oom report?
> > I actually couldn't reproduce it using 9MB of zram and a cgroup with a
> > 10MB limit trying to allocate 15MB of tmpfs, no matter how many
> > repetitions I do.
> The following is the printout of the testcase I used. In fact, the probability
> of triggering this problem is very low. You can adjust the size of the swap
> space to increase the probability of reproducing it.
>

I was actually trying to reproduce this in the worst way possible:
- Tmpfs is really eager to remove pages from the swapcache once they
are swapped in and added to the page cache, unlike anon.
- With zram we skip the swapcache in the swap fault path for non-shared pages.

I tried again with anonymous pages on a disk swapfile and I can
reliably reproduce the problem with the attached script. It basically
sets up 6MB of disk swap, creates a cgroup with 10MB limit, and runs
an allocator program. The program allocates 13MB and writes to them
twice (to induce refaults), and repeats this 100 times.
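For anyone re-running this, a quick way to confirm that pages are lingering in
the swapcache during a run is the global counter in /proc/meminfo (on cgroup
v2 the per-memcg equivalent should be the "swapcached" field in memory.stat):

```shell
# Global swapcache residency; watch this climb while the reproducer runs.
# The per-cgroup view on v2 is "swapcached" in $CGROUP/memory.stat.
grep SwapCached /proc/meminfo
```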

Here is an OOM log:

alloc_anon invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0,
oom_score_adj=0
[ 1974.205755] CPU: 252 PID: 116849 Comm: alloc_anon Tainted: G
   O       6.5.0-smp-DEV #21
[ 1974.205758] Hardware name: Google, Inc.
Arcadia_IT_80/Arcadia_IT_80, BIOS 12.22.0 06/14/2023
[ 1974.205759] Call Trace:
[ 1974.205761]  <TASK>
[ 1974.205764]  dump_stack_lvl+0x5d/0x90
[ 1974.205770]  dump_stack+0x14/0x20
[ 1974.205772]  dump_header+0x52/0x230
[ 1974.205775]  oom_kill_process+0x103/0x200
[ 1974.205777]  out_of_memory+0x3ae/0x590
[ 1974.205779]  mem_cgroup_out_of_memory+0x110/0x150
[ 1974.205782]  try_charge_memcg+0x3f9/0x760
[ 1974.205784]  ? __folio_alloc+0x1e/0x40
[ 1974.205787]  charge_memcg+0x3d/0x180
[ 1974.205788]  __mem_cgroup_charge+0x2f/0x70
[ 1974.205790]  do_pte_missing+0x7e8/0xd10
[ 1974.205792]  handle_mm_fault+0x6a5/0xaa0
[ 1974.205795]  do_user_addr_fault+0x387/0x560
[ 1974.205798]  exc_page_fault+0x67/0x120
[ 1974.205799]  asm_exc_page_fault+0x2b/0x30
[ 1974.205802] RIP: 0033:0x40ee89
[ 1974.205804] Code: fe 7f 40 40 c5 fe 7f 40 60 48 83 c7 80 48 81 fa
00 01 00 00 76 2b 48 8d 90 80 00 00 00 48 83 e2 c0 c5 fd 7f 02 c5 fd
7f 40
[ 1974.205806] RSP: 002b:00007ffe9bf6a2d8 EFLAGS: 00010283
[ 1974.205808] RAX: 00000000007286c0 RBX: 00007ffe9bf6a4e8 RCX: 00000000007286c0
[ 1974.205809] RDX: 0000000001215fc0 RSI: 0000000000000001 RDI: 0000000001228640
[ 1974.205810] RBP: 00007ffe9bf6a310 R08: 0000000000b20951 R09: 0000000000000000
[ 1974.205811] R10: 00000000004a21e0 R11: 0000000000749000 R12: 00007ffe9bf6a4c8
[ 1974.205812] R13: 000000000049e6f0 R14: 0000000000000001 R15: 0000000000000001
[ 1974.205815]  </TASK>
[ 1974.205815] memory: usage 10240kB, limit 10240kB, failcnt 4480498
[ 1974.205817] swap: usage 6140kB, limit 9007199254740988kB, failcnt 0
[ 1974.205818] Memory cgroup stats for /a:
[ 1974.205895] anon 5222400
[ 1974.205896] file 0
[ 1974.205896] kernel 106496
[ 1974.205897] kernel_stack 0
[ 1974.205897] pagetables 45056
[ 1974.205898] sec_pagetables 0
[ 1974.205898] percpu 12384
[ 1974.205898] sock 0
[ 1974.205899] vmalloc 0
[ 1974.205899] shmem 0
[ 1974.205900] zswap 0
[ 1974.205900] zswapped 0
[ 1974.205900] file_mapped 0
[ 1974.205901] file_dirty 0
[ 1974.205901] file_writeback 122880
[ 1974.205902] swapcached 5156864
[ 1974.205902] anon_thp 2097152
[ 1974.205902] file_thp 0
[ 1974.205903] shmem_thp 0
[ 1974.205903] inactive_anon 9396224
[ 1974.205904] active_anon 925696
[ 1974.205904] inactive_file 0
[ 1974.205904] active_file 0
[ 1974.205905] unevictable 0
[ 1974.205905] slab_reclaimable 1824
[ 1974.205906] slab_unreclaimable 7728
[ 1974.205906] slab 9552
[ 1974.205907] workingset_refault_anon 5220628
[ 1974.205907] workingset_refault_file 0
[ 1974.205907] workingset_activate_anon 212458
[ 1974.205908] workingset_activate_file 0
[ 1974.205908] workingset_restore_anon 110673
[ 1974.205909] workingset_restore_file 0
[ 1974.205909] workingset_nodereclaim 0
[ 1974.205910] pgscan 15788241
[ 1974.205910] pgsteal 5222782
[ 1974.205910] pgscan_kswapd 0
[ 1974.205911] pgscan_direct 15786393
[ 1974.205911] pgscan_khugepaged 1848
[ 1974.205912] pgsteal_kswapd 0
[ 1974.205912] pgsteal_direct 5222044
[ 1974.205912] pgsteal_khugepaged 738
[ 1974.205913] pgfault 5298719
[ 1974.205913] pgmajfault 692459
[ 1974.205914] pgrefill 350803
[ 1974.205914] pgactivate 140230
[ 1974.205914] pgdeactivate 350803
[ 1974.205915] pglazyfree 0
[ 1974.205915] pglazyfreed 0
[ 1974.205916] zswpin 0
[ 1974.205916] zswpout 0
[ 1974.205916] thp_fault_alloc 125
[ 1974.205917] thp_collapse_alloc 2
[ 1974.205917] Tasks state (memory values in pages):
[ 1974.205918] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes
swapents oom_score_adj name
[ 1974.205919] [ 116849]     0 116849     3059     1722    53248
1024             0 alloc_anon
[ 1974.205922] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/a,task_memcg=/a,task=alloc_anon,pid=10
[ 1974.205929] Memory cgroup out of memory: Killed process 116849
(alloc_anon) total-vm:12236kB, anon-rss:6888kB, file-rss:0kB,
shmem-rss:0kB,0

It shows that we are using 10MB of memory and 6MB of swap, but that's
because the ~5MB of memory in the swapcache are doubling as both
memory and swap if I understand correctly.
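As a rough sanity check of that reading, here is a back-of-the-envelope sketch
using the byte values from the memory.stat dump above (assuming anon,
swapcached, and kernel are the dominant charges in this cgroup):

```shell
# Figures copied from the memory.stat dump above (bytes).
anon=5222400
swapcached=5156864
kernel=106496

# Swapcache pages are charged to the memcg's memory usage while also
# occupying a swap slot, so they show up in both the "memory: usage"
# and "swap: usage" lines of the OOM report.
charged_kb=$(( (anon + swapcached + kernel) / 1024 ))
echo "$charged_kb"   # 10240, matching "memory: usage 10240kB, limit 10240kB"
```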

I could not reproduce the problem with this patch (but perhaps I
didn't try hard enough).

[-- Attachment #2: repro.sh --]
[-- Type: text/x-sh, Size: 309 bytes --]

#!/bin/bash

SWAPFILE="./swapfile"
CGROUP="/sys/fs/cgroup/a"

dd if=/dev/zero of=$SWAPFILE bs=1M count=6
chmod 0600 $SWAPFILE
mkswap $SWAPFILE
swapon $SWAPFILE

mkdir $CGROUP
echo 10m > $CGROUP/memory.max
(echo 0 > $CGROUP/cgroup.procs && ./alloc_anon 15 100)
rmdir $CGROUP

swapoff $SWAPFILE
rm -f $SWAPFILE

[-- Attachment #3: alloc_anon.c --]
[-- Type: application/octet-stream, Size: 538 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <err.h>

#define MB_TO_BYTES(x) ((size_t)(x) << 20)

int main(int argc, char **argv)
{
	int iterations;
	size_t len;
	void *buf;

	if (argc != 3)
		err(-1, "Wrong number of args");

	len = (size_t)atoi(argv[1]);
	len = MB_TO_BYTES(len);

	iterations = atoi(argv[2]);

	for (int i = 0; i < iterations; i++) {
		buf = malloc(len);
		if (!buf)
			err(-1, "malloc failed");
		memset(buf, 1, len);
		memset(buf, 2, len);
		free(buf);
	}
	return 0;
}


Thread overview: 10+ messages
2023-08-22  2:49 [PATCH v2] mm: vmscan: reclaim anon pages if there are swapcache pages Liu Shixin
2023-08-22 16:35 ` Yosry Ahmed
2023-08-23  2:00   ` Liu Shixin
2023-08-23 13:12     ` Michal Hocko
2023-08-23 15:29       ` Yosry Ahmed
2023-08-24  3:39         ` Liu Shixin
2023-08-24 18:27           ` Yosry Ahmed [this message]
2023-08-24  8:48   ` Huang, Ying
2023-08-24 18:31     ` Yosry Ahmed
2023-08-25  0:47       ` Huang, Ying
