From: Gavin Shan <gwshan@linux.vnet.ibm.com>
To: Anton Blanchard <anton@samba.org>
Cc: Balbir Singh <bsingharora@gmail.com>,
	Gavin Shan <gwshan@linux.vnet.ibm.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] powerpc/mm: Fix RECLAIM_DISTANCE
Date: Tue, 31 Jan 2017 15:33:55 +1100	[thread overview]
Message-ID: <20170131043355.GA25724@gwshan> (raw)
In-Reply-To: <20170130120240.5018f476@kryten>

On Mon, Jan 30, 2017 at 12:02:40PM +1100, Anton Blanchard wrote:
>Hi,
>
>> Anton suggested that NUMA distances in powerpc mattered and hurt
>> performance without this setting. We need to validate to see if this
>> is still true. A simple way to start would be benchmarking
>
>The original issue was that we never reclaimed local clean pagecache.
>
>I just tried all settings for /proc/sys/vm/zone_reclaim_mode and none
>of them caused me to reclaim local clean pagecache! We are very broken.
>
>I would think we have test cases for this, but here is a dumb one.
>First something to consume memory:
>
># cat alloc.c
>
>#include <stdio.h>
>#include <stdlib.h>
>#include <unistd.h>
>#include <string.h>
>#include <assert.h>
>
>int main(int argc, char *argv[])
>{
>	void *p;
>
>	unsigned long size;
>
>	size = strtoul(argv[1], NULL, 0);
>
>	p = malloc(size);
>	assert(p);
>	memset(p, 0, size);
>	printf("%p\n", p);
>
>	sleep(3600);
>
>	return 0;
>}
>
>Now create a file to consume pagecache. My nodes have 32GB each, so
>I create 16GB, enough to consume half of the node:
>
>dd if=/dev/zero of=/tmp/file bs=1G count=16
>
>Clear out our pagecache:
>
>sync
>echo 3 > /proc/sys/vm/drop_caches
>
>Bring it in on node 0:
>
>taskset -c 0 cat /tmp/file > /dev/null
>
>Consume 24GB of memory on node 0:
>
>taskset -c 0 ./alloc 25769803776
>
>In all zone reclaim modes, the pagecache never gets reclaimed:
>
># grep FilePages /sys/devices/system/node/node0/meminfo
>
>Node 0 FilePages:      16757376 kB
>
>And our alloc process shows lots of off node memory used:
>
>3ff9a4630000 default anon=393217 dirty=393217 N0=112474 N1=220490 N16=60253 kernelpagesize_kB=64
>
>Clearly nothing is working. Gavin, if your patch fixes this we should
>get it into stable too.
>

Anton, I think the behaviour is actually expected, and it's not directly
relevant to the issue addressed by the patch; I will explain why in my
reply to Michael. There are two nodes in your system, and memory is expected
to be allocated from node-0 first. If node-0 doesn't have enough free memory,
the allocator falls back to node-1, which means we need more memory pressure.
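For reference, the per-node counts in the numa_maps line above can be decoded
to confirm the off-node fallback; a quick sketch (page counts copied from the
quoted output, using the reported 64kB kernel page size):

```python
# Per-node page counts from the quoted /proc/<pid>/numa_maps line
# (kernelpagesize_kB=64, so each page is 64 KiB)
pages = {"N0": 112474, "N1": 220490, "N16": 60253}
page_kib = 64

total_pages = sum(pages.values())
print(total_pages)  # 393217, matching anon=393217 in the output

# Roughly 24 GiB in total, with most of it off node 0
for node, count in pages.items():
    print(f"{node}: {count * page_kib / (1024 ** 2):.1f} GiB")
print(f"total: {total_pages * page_kib / (1024 ** 2):.1f} GiB")
```

So only about 6.9 GiB of the ~24 GiB heap landed on node 0, with the rest
spilling to node 1 and node 16.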

In the experiment, 40GB is allocated: 16GB for pagecache and 24GB for heap.
That doesn't exceed the machine's memory capacity (64GB), so page reclaim in
the fast and slow paths wasn't triggered, which is why the pagecache wasn't
dropped. I think __GFP_THISNODE isn't specified when the page-fault handler
allocates pages to accommodate the VMA for the heap.
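For completeness, note that zone_reclaim_mode is a bitmask rather than an
enum (see Documentation/sysctl/vm.txt); a sketch of the meaningful settings,
assuming root:

```shell
# zone_reclaim_mode bits, per Documentation/sysctl/vm.txt:
#   1 = zone reclaim on
#   2 = write dirty pages during zone reclaim
#   4 = swap pages during zone reclaim
cat /proc/sys/vm/zone_reclaim_mode        # current setting
echo 1 > /proc/sys/vm/zone_reclaim_mode   # reclaim clean pagecache only
echo 3 > /proc/sys/vm/zone_reclaim_mode   # also write back dirty pages
echo 7 > /proc/sys/vm/zone_reclaim_mode   # also swap anonymous pages
```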

*Without* the patch applied, I got the results below on a system with two
NUMA nodes, each with 64GB of memory. I don't think the patch will change
this behaviour:

# cat /proc/sys/vm/zone_reclaim_mode 
0

Drop pagecache
Read an 8GB file, so the pagecache consumes 8GB of memory.
Node 0 FilePages:       8496960 kB
taskset -c 0 ./alloc 137438953472       <- 128GB sized heap
Node 0 FilePages:        503424 kB

Eventually, some swap space was used as well:

# free -m
              total        used        free      shared  buff/cache   available
Mem:         130583      129203         861          10         518         297
Swap:         10987        3145        7842
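For clarity, the two runs sit on opposite sides of total memory capacity,
which is the point above; a quick arithmetic check (sizes in GiB, taken from
the two experiments):

```python
# Anton's run: two 32 GiB nodes, 16 GiB pagecache + 24 GiB heap
anton_total = 64
anton_demand = 16 + 24
print(anton_demand < anton_total)   # True: no global pressure, nothing reclaimed

# This run: two 64 GiB nodes, 8 GiB pagecache + 128 GiB heap
gavin_total = 128
gavin_demand = 8 + 128
print(gavin_demand > gavin_total)   # True: pressure drops pagecache and swaps

# FilePages fell from 8496960 kB to 503424 kB once the heap grew
reclaimed_gib = (8496960 - 503424) / (1024 ** 2)
print(f"pagecache reclaimed: {reclaimed_gib:.1f} GiB")  # ~7.6 GiB
```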

Thanks,
Gavin

