* Problem in Page Cache Replacement
@ 2012-11-20 17:42 metin d
  2012-11-20 18:25 ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: metin d @ 2012-11-20 17:42 UTC (permalink / raw)
  To: linux-kernel

I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to a large part of data-1's pages in its page cache, and gives only about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days.
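
For reference, the kind of residency check that fincore-style tools perform can be sketched with mmap(2) and mincore(2). This is only a minimal sketch, not the fincore source: the function name count_resident_pages is hypothetical, and it assumes the file is small enough to map in one piece.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* exposes mincore() on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of a file are resident in
 * the page cache. Stores the file's total page count in *total_pages
 * (if non-NULL). Returns the resident page count, or -1 on error.
 */
long count_resident_pages(const char *path, long *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long npages, resident = 0, i;
	unsigned char *vec;
	struct stat st;
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	npages = (st.st_size + page_size - 1) / page_size;
	if (total_pages)
		*total_pages = npages;
	if (npages == 0) {		/* mmap() rejects zero-length maps */
		close(fd);
		return 0;
	}

	/* Mapping the file does not read it in; pages are faulted lazily. */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (vec && mincore(map, st.st_size, vec) == 0) {
		/* Bit 0 of each byte is set when that page is resident. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	} else {
		resident = -1;
	}

	free(vec);
	munmap(map, st.st_size);
	return resident;
}
```

Calling it on each relation file of data-1 and data-2 and summing the results should give roughly the per-database numbers described above.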

Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to any suggestions you think might relate to the problem.

This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64

Edit: it also seems that I have only one NUMA node, in case you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
  0:  10

* Re: Problem in Page Cache Replacement
  2012-11-20 17:42 Problem in Page Cache Replacement metin d
@ 2012-11-20 18:25 ` Jan Kara
  2012-11-21  8:03   ` metin d
                     ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jan Kara @ 2012-11-20 18:25 UTC (permalink / raw)
  To: metin d; +Cc: linux-kernel, linux-mm

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
> 
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
> 
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
> 
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
> 
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
> 
> and it seems that I use one NUMA instance, if  you think that it can a problem.
> 
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: Problem in Page Cache Replacement
  2012-11-20 18:25 ` Jan Kara
@ 2012-11-21  8:03   ` metin d
  2012-11-21  8:13     ` metin d
  2012-11-21 21:34   ` Johannes Weiner
  2012-11-23  1:58   ` Jaegeuk Hanse
  2 siblings, 1 reply; 30+ messages in thread
From: metin d @ 2012-11-21  8:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-mm



>  Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?


I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of the frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?

Thank you,

Metin


* Re: Problem in Page Cache Replacement
  2012-11-21  8:03   ` metin d
@ 2012-11-21  8:13     ` metin d
  2012-11-21  8:34       ` Jaegeuk Hanse
  0 siblings, 1 reply; 30+ messages in thread
From: metin d @ 2012-11-21  8:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-mm

>  Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of the frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?

Thank you,

Metin

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
> 
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
> 
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
> 
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
> 
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
> 
> and it seems that I use one NUMA instance, if  you think that it can a problem.
> 
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Problem in Page Cache Replacement
  2012-11-21  8:13     ` metin d
@ 2012-11-21  8:34       ` Jaegeuk Hanse
  2012-11-21  9:02         ` Fengguang Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-21  8:34 UTC (permalink / raw)
  To: metin d, Fengguang Wu; +Cc: Jan Kara, linux-kernel, linux-mm

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:
>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>
> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>
> My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?
>
> Thank you,
>
> Metin
>
> On Tue 20-11-12 09:42:42, metin d wrote:
>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>> same machine. Both databases keep 40 GB of data, and the total memory
>> available on the machine is 68GB.
>>
>> I started data-1 and data-2, and ran several queries to go over all their
>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>> For some reason, the OS still holds on to large parts of data-1's pages
>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>> a result, my queries on data-2 keep hitting disk.
>>
>> I'm checking page cache usage with fincore. When I run a table scan query
>> against data-2, I see that data-2's pages get evicted and put back into
>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>> although they haven't been touched for days.
>>
>> Does anybody know why data-1's pages aren't evicted from the page cache?
>> I'm open to all kind of suggestions you think it might relate to problem.
>    Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches
>    does it evict data-1 pages from memory?
>
>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>> swap space. The kernel version is:
>>
>> $ uname -r
>> 3.2.28-45.62.amzn1.x86_64
>> Edit:
>>
>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>
>> $ numactl --hardware
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 70007 MB
>> node 0 free: 360 MB
>> node distances:
>> node   0
>>     0:  10


* Re: Problem in Page Cache Replacement
  2012-11-21  8:34       ` Jaegeuk Hanse
@ 2012-11-21  9:02         ` Fengguang Wu
  2012-11-21  9:10           ` Fengguang Wu
  2012-11-21  9:42           ` Jaegeuk Hanse
  0 siblings, 2 replies; 30+ messages in thread
From: Fengguang Wu @ 2012-11-21  9:02 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> Cc Fengguang Wu.
> 
> On 11/21/2012 04:13 PM, metin d wrote:
> >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >
> >We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.

> >My understanding was that under memory pressure from heavily
> >accessed pages, unused pages would eventually get evicted. Is there
> >anything else we can try on this host to understand why this is
> >happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
   (please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' as root, to view the page status for the
   remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree at tools/vm/page-types.c

Sorry, that sounds a bit convoluted. I do have a patch to directly dump
the page cache status of a user-specified file, but it's not upstream
yet.

Thanks,
Fengguang

> >On Tue 20-11-12 09:42:42, metin d wrote:
> >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>same machine. Both databases keep 40 GB of data, and the total memory
> >>available on the machine is 68GB.
> >>
> >>I started data-1 and data-2, and ran several queries to go over all their
> >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>For some reason, the OS still holds on to large parts of data-1's pages
> >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>a result, my queries on data-2 keep hitting disk.
> >>
> >>I'm checking page cache usage with fincore. When I run a table scan query
> >>against data-2, I see that data-2's pages get evicted and put back into
> >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>although they haven't been touched for days.
> >>
> >>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>I'm open to all kind of suggestions you think it might relate to problem.
> >   Curious. Added linux-mm list to CC to catch more attention. If you run
> >echo 1 >/proc/sys/vm/drop_caches
> >   does it evict data-1 pages from memory?
> >
> >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>swap space. The kernel version is:
> >>
> >>$ uname -r
> >>3.2.28-45.62.amzn1.x86_64
> >>Edit:
> >>
> >>and it seems that I use one NUMA instance, if  you think that it can a problem.
> >>
> >>$ numactl --hardware
> >>available: 1 nodes (0)
> >>node 0 cpus: 0 1 2 3 4 5 6 7
> >>node 0 size: 70007 MB
> >>node 0 free: 360 MB
> >>node distances:
> >>node   0
> >>    0:  10

[-- Attachment #2: fadvise.c --]
[-- Type: text/x-csrc, Size: 1904 bytes --]

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#include "fadvise.h"

char *progname;

static void usage(void)
{
	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
	fprintf(stderr, "      advice: normal sequential willneed noreuse "
					"dontneed asyncwrite writewait\n");
	exit(1);
}

int
main(int argc, char *argv[])
{
	int c;
	int fd;
	char *sadvice;
	char *filename;
	loff_t offset;
	unsigned long length;
	int advice = 0;
	int ret;
	int loops = 1;

	progname = argv[0];

	while ((c = getopt(argc, argv, "")) != -1) {
		switch (c) {
		}
	}

	if (optind == argc)
		usage();
	filename = argv[optind++];

	if (optind == argc)
		usage();
	offset = strtoull(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	length = strtoul(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	sadvice = argv[optind++];

	if (optind != argc)
		loops = strtol(argv[optind++], NULL, 0);

	if (optind != argc)
		usage();

	if (!strcmp(sadvice, "normal"))
		advice = POSIX_FADV_NORMAL;
	else if (!strcmp(sadvice, "sequential"))
		advice = POSIX_FADV_SEQUENTIAL;
	else if (!strcmp(sadvice, "willneed"))
		advice = POSIX_FADV_WILLNEED;
	else if (!strcmp(sadvice, "noreuse"))
		advice = POSIX_FADV_NOREUSE;
	else if (!strcmp(sadvice, "dontneed"))
		advice = POSIX_FADV_DONTNEED;
	else if (!strcmp(sadvice, "asyncwrite"))
		advice = LINUX_FADV_ASYNC_WRITE;
	else if (!strcmp(sadvice, "writewait"))
		advice = LINUX_FADV_WRITE_WAIT;
	else
		usage();

	fd = open(filename, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "%s: cannot open `%s': %s\n",
			progname, filename, strerror(errno));
		exit(1);
	}

	while (loops--) {
		ret = __posix_fadvise64(fd, offset, length, advice);
		if (ret) {
			fprintf(stderr, "%s: fadvise() failed: %s\n",
				progname, strerror(errno));
			exit(1);
		}
	}
	close(fd);
	exit(0);
}

[-- Attachment #3: fadvise.h --]
[-- Type: text/x-chdr, Size: 2375 bytes --]

#include <asm/unistd.h>
#include <sys/errno.h>

#ifndef __NR_fadvise64
#if defined (__i386__)
#define __NR_fadvise64          250
#elif defined(__powerpc__)
#define __NR_fadvise64          233
#elif defined(__ia64__)
#define __NR_fadvise64		1234
#elif defined(__x86_64__)
#define __NR_fadvise64		221
#endif
#endif

#ifndef LINUX_FADV_ASYNC_WRITE
#define LINUX_FADV_ASYNC_WRITE 32
#endif

#ifndef LINUX_FADV_WRITE_WAIT
#define LINUX_FADV_WRITE_WAIT 33
#endif

#ifndef __x86_64__
_syscall5(int,fadvise64, int,fd, long,offset_lo,
		long,offset_hi, size_t,len, int,advice)
#endif

/* Works by luck on ppc32, fails on ppc64 */
#if defined(__i386__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, 0, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, offset >> 32, len, advice);
}
#elif defined(__powerpc64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__powerpc__)

/* 
 * long longs are passed in an odd even register pair on ppc32 so
 * we need to pad before offset
 *
 * Note also the glibc syscall() function for ppc has been broken for
 * 6 argument syscalls until recently (~2.3.1 CVS)
 */
#define ppc_fadvise64(fd, offset_hi, offset_lo, len, advice) \
	syscall(__NR_fadvise64, fd, 0, offset_hi, offset_lo, len, advice)

int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, 0, offset, len, advice);
}

/* big endian, akpm. */
int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, (unsigned int)(offset >> 32),
			(unsigned int)(offset & 0xffffffff), len, advice);
}
#elif defined(__ia64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__x86_64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return -1;
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return syscall(__NR_fadvise64, fd, offset, len, advice);
}
#endif

* Re: Problem in Page Cache Replacement
  2012-11-21  9:02         ` Fengguang Wu
@ 2012-11-21  9:10           ` Fengguang Wu
  2012-11-21  9:42           ` Jaegeuk Hanse
  1 sibling, 0 replies; 30+ messages in thread
From: Fengguang Wu @ 2012-11-21  9:10 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

On Wed, Nov 21, 2012 at 05:02:04PM +0800, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> > Cc Fengguang Wu.
> > 
> > On 11/21/2012 04:13 PM, metin d wrote:
> > >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> > >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> > >
> > >We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> 
> > >My understanding was that under memory pressure from heavily
> > >accessed pages, unused pages would eventually get evicted. Is there
> > >anything else we can try on this host to understand why this is
> > >happening?
> 
> We may debug it this way.

Better to add a step

0) run 'page-types -r' to get an initial view of the page cache
   status.

Thanks,
Fengguang

> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>    (please double check via /proc/vmstat whether it does the expected work)
> 
> 2) run 'page-types -r' with root, to view the page status for the
>    remaining pages of data-1
> 
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> 
> page-types can be found in the kernel source tree tools/vm/page-types.c
> 
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.
> 
> Thanks,
> Fengguang
> 
> > >On Tue 20-11-12 09:42:42, metin d wrote:
> > >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > >>same machine. Both databases keep 40 GB of data, and the total memory
> > >>available on the machine is 68GB.
> > >>
> > >>I started data-1 and data-2, and ran several queries to go over all their
> > >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> > >>For some reason, the OS still holds on to large parts of data-1's pages
> > >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > >>a result, my queries on data-2 keep hitting disk.
> > >>
> > >>I'm checking page cache usage with fincore. When I run a table scan query
> > >>against data-2, I see that data-2's pages get evicted and put back into
> > >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> > >>although they haven't been touched for days.
> > >>
> > >>Does anybody know why data-1's pages aren't evicted from the page cache?
> > >>I'm open to all kind of suggestions you think it might relate to problem.
> > >   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >echo 1 >/proc/sys/vm/drop_caches
> > >   does it evict data-1 pages from memory?
> > >
> > >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> > >>swap space. The kernel version is:
> > >>
> > >>$ uname -r
> > >>3.2.28-45.62.amzn1.x86_64
> > >>Edit:
> > >>
> > >>and it seems that I use one NUMA instance, if  you think that it can a problem.
> > >>
> > >>$ numactl --hardware
> > >>available: 1 nodes (0)
> > >>node 0 cpus: 0 1 2 3 4 5 6 7
> > >>node 0 size: 70007 MB
> > >>node 0 free: 360 MB
> > >>node distances:
> > >>node   0
> > >>    0:  10

* Re: Problem in Page Cache Replacement
  2012-11-21  9:02         ` Fengguang Wu
  2012-11-21  9:10           ` Fengguang Wu
@ 2012-11-21  9:42           ` Jaegeuk Hanse
  2012-11-21 10:00             ` metin d
                               ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-21  9:42 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>> Cc Fengguang Wu.
>>
>> On 11/21/2012 04:13 PM, metin d wrote:
>>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>>>
>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>>> My understanding was that under memory pressure from heavily
>>> accessed pages, unused pages would eventually get evicted. Is there
>>> anything else we can try on this host to understand why this is
>>> happening?
> We may debug it this way.
>
> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>     (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>     remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.

Hi Fengguang,

Thanks for your detailed steps; I think Metin can give them a try.

             flags  page-count       MB  symbolic-flags                       long-symbolic-flags
0x0000000000000000      607699     2373  ___________________________________
0x0000000100000000      343227     1340  _______________________r___________  reserved

But I have some questions about the output of page-types:

Does 2373 MB here mean the total memory in use, including page cache? I
don't think so.
Which kinds of pages are marked reserved?
Which long-symbolic-flags line corresponds to the page cache?

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>> available on the machine is 68GB.
>>>>
>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>> a result, my queries on data-2 keep hitting disk.
>>>>
>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>> although they haven't been touched for days.
>>>>
>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>> echo 1 >/proc/sys/vm/drop_caches
>>>    does it evict data-1 pages from memory?
>>>
>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>>> swap space. The kernel version is:
>>>>
>>>> $ uname -r
>>>> 3.2.28-45.62.amzn1.x86_64
>>>> Edit:
>>>>
>>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>>>
>>>> $ numactl --hardware
>>>> available: 1 nodes (0)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 70007 MB
>>>> node 0 free: 360 MB
>>>> node distances:
>>>> node   0
>>>>     0:  10


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-21  9:42           ` Jaegeuk Hanse
@ 2012-11-21 10:00             ` metin d
       [not found]             ` <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com>
  2012-11-22 15:26             ` Fengguang Wu
  2 siblings, 0 replies; 30+ messages in thread
From: metin d @ 2012-11-21 10:00 UTC (permalink / raw)
  To: Jaegeuk Hanse, Fengguang Wu
  Cc: Jan Kara, linux-kernel, linux-mm, Metin Döşlü

[-- Attachment #1: Type: text/plain, Size: 4684 bytes --]



Hi Fengguang,

I ran the tests and attached the results. I guess the line below shows data-1's page cache pages.

0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private

Metin


________________________________
From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: Fengguang Wu <fengguang.wu@intel.com> 
Cc: metin d <metdos@yahoo.com>; Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; "linux-mm@kvack.org" <linux-mm@kvack.org> 
Sent: Wednesday, November 21, 2012 11:42 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>> Cc Fengguang Wu.
>>
>> On 11/21/2012 04:13 PM, metin d wrote:
>>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>>>
>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>>> My understanding was that under memory pressure from heavily
>>> accessed pages, unused pages would eventually get evicted. Is there
>>> anything else we can try on this host to understand why this is
>>> happening?
> We may debug it this way.
>
> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>     (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>     remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.

Hi Fengguang,

Thanks for your detailed steps; I think Metin can give them a try.

             flags    page-count       MB  symbolic-flags                       long-symbolic-flags
0x0000000000000000        607699     2373  ___________________________________
0x0000000100000000        343227     1340  _______________________r___________  reserved

But I have some questions about the page-types output:

Does the 2373 MB here mean the total memory in use, including page cache?
I don't think so.
Which kinds of pages are marked reserved?
Which line of long-symbolic-flags corresponds to page cache?

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>> available on the machine is 68GB.
>>>>
>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>> a result, my queries on data-2 keep hitting disk.
>>>>
>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>> although they haven't been touched for days.
>>>>
>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>> echo 1 >/proc/sys/vm/drop_caches
>>>    does it evict data-1 pages from memory?
>>>
>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>>> swap space. The kernel version is:
>>>>
>>>> $ uname -r
>>>> 3.2.28-45.62.amzn1.x86_64
>>>> Edit:
>>>>
>>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>>>
>>>> $ numactl --hardware
>>>> available: 1 nodes (0)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 70007 MB
>>>> node 0 free: 360 MB
>>>> node distances:
>>>> node   0
>>>>     0:  10

[-- Attachment #2: page-types_after.txt --]
[-- Type: text/plain, Size: 5453 bytes --]

             flags	page-count       MB  symbolic-flags			long-symbolic-flags
0x0000000000000000	   5508317    21516  ___________________________________	
0x0000000100000000	    335993     1312  _______________________r___________	reserved
0x0000002100000000	     35634      139  _______________________r____O______	reserved,owner_private
0x0000000000010000	     45069      176  ________________T__________________	compound_tail
0x0000002000000000	      1516        5  ____________________________O______	owner_private
0x0000000800000004	         1        0  __R_______________________P________	referenced,private
0x0000000000008000	        10        0  _______________H___________________	compound_head
0x0000000000000004	         1        0  __R________________________________	referenced
0x0000000800000024	       166        0  __R__l____________________P________	referenced,lru,private
0x0000000400000028	       295        1  ___U_l___________________d_________	uptodate,lru,mappedtodisk
0x0001000400000028	         3        0  ___U_l___________________d_____I___	uptodate,lru,mappedtodisk,readahead
0x0000000000000028	         1        0  ___U_l_____________________________	uptodate,lru
0x000000040000002c	    262144     1024  __RU_l___________________d_________	referenced,uptodate,lru,mappedtodisk
0x000000080000002c	         5        0  __RU_l____________________P________	referenced,uptodate,lru,private
0x000000000000403c	       185        0  __RUDl________b____________________	referenced,uptodate,dirty,lru,swapbacked
0x0000000800000060	       163        0  _____lA___________________P________	lru,active,private
0x0000000800000064	     36739      143  __R__lA___________________P________	referenced,lru,active,private
0x0000000400000068	    527810     2061  ___U_lA__________________d_________	uptodate,lru,active,mappedtodisk
0x0000000800000068	       576        2  ___U_lA___________________P________	uptodate,lru,active,private
0x0000000c00000068	       116        0  ___U_lA__________________dP________	uptodate,lru,active,mappedtodisk,private
0x000000080000006c	   6584051    25718  __RU_lA___________________P________	referenced,uptodate,lru,active,private
0x000000040000006c	   1302211     5086  __RU_lA__________________d_________	referenced,uptodate,lru,active,mappedtodisk
0x0000000c0000006c	       431        1  __RU_lA__________________dP________	referenced,uptodate,lru,active,mappedtodisk,private
0x000000000000006c	       128        0  __RU_lA____________________________	referenced,uptodate,lru,active
0x0000000800000074	         2        0  __R_DlA___________________P________	referenced,dirty,lru,active,private
0x0000000000004078	        56        0  ___UDlA_______b____________________	uptodate,dirty,lru,active,swapbacked
0x000000000000407c	       122        0  __RUDlA_______b____________________	referenced,uptodate,dirty,lru,active,swapbacked
0x000000080000007c	         1        0  __RUDlA___________________P________	referenced,uptodate,dirty,lru,active,private
0x0000000000008080	     14495       56  _______S_______H___________________	slab,compound_head
0x0000000000000080	    250498      978  _______S___________________________	slab
0x0000000000000400	   2990908    11683  __________B________________________	buddy
0x0000000000000800	        16        0  ___________M_______________________	mmap
0x0000000100000804	         1        0  __R________M___________r___________	referenced,mmap,reserved
0x000000060004082c	       391        1  __RU_l_____M______u_____md_________	referenced,uptodate,lru,mmap,unevictable,mlocked,mappedtodisk
0x0000000a0004082c	       321        1  __RU_l_____M______u_____m_P________	referenced,uptodate,lru,mmap,unevictable,mlocked,private
0x0000000000004838	      8450       33  ___UDl_____M__b____________________	uptodate,dirty,lru,mmap,swapbacked
0x000000000000483c	      2045        7  __RUDl_____M__b____________________	referenced,uptodate,dirty,lru,mmap,swapbacked
0x0000000800000868	        19        0  ___U_lA____M______________P________	uptodate,lru,active,mmap,private
0x0000000400000868	         5        0  ___U_lA____M_____________d_________	uptodate,lru,active,mmap,mappedtodisk
0x000000040000086c	      1891        7  __RU_lA____M_____________d_________	referenced,uptodate,lru,active,mmap,mappedtodisk
0x000000080000086c	       126        0  __RU_lA____M______________P________	referenced,uptodate,lru,active,mmap,private
0x0000000000004878	        85        0  ___UDlA____M__b____________________	uptodate,dirty,lru,active,mmap,swapbacked
0x000000000000487c	      2263        8  __RUDlA____M__b____________________	referenced,uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000005008	        13        0  ___U________a_b____________________	uptodate,anonymous,swapbacked
0x0000000000005808	        16        0  ___U_______Ma_b____________________	uptodate,mmap,anonymous,swapbacked
0x0000000200045828	         8        0  ___U_l_____Ma_b___u_____m__________	uptodate,lru,mmap,anonymous,swapbacked,unevictable,mlocked
0x000000020004582c	       651        2  __RU_l_____Ma_b___u_____m__________	referenced,uptodate,lru,mmap,anonymous,swapbacked,unevictable,mlocked
0x0000000000005868	      8058       31  ___U_lA____Ma_b____________________	uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c	        42        0  __RU_lA____Ma_b____________________	referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total	  17922048    70008


[-- Attachment #3: page-types_before.txt --]
[-- Type: text/plain, Size: 5551 bytes --]

             flags	page-count       MB  symbolic-flags			long-symbolic-flags
0x0000000000000000	    121628      475  ___________________________________	
0x0000000100000000	    335993     1312  _______________________r___________	reserved
0x0000002100000000	     35634      139  _______________________r____O______	reserved,owner_private
0x0000000000010000	     45429      177  ________________T__________________	compound_tail
0x0000002000000000	      1389        5  ____________________________O______	owner_private
0x0000000400000001	         6        0  L________________________d_________	locked,mappedtodisk
0x0000000000008000	        10        0  _______________H___________________	compound_head
0x0000000000000004	         1        0  __R________________________________	referenced
0x0000000400000021	        64        0  L____l___________________d_________	locked,lru,mappedtodisk
0x0001000400000021	         1        0  L____l___________________d_____I___	locked,lru,mappedtodisk,readahead
0x0000000800000024	       171        0  __R__l____________________P________	referenced,lru,private
0x0000000400000028	      4093       15  ___U_l___________________d_________	uptodate,lru,mappedtodisk
0x0001000400000028	        59        0  ___U_l___________________d_____I___	uptodate,lru,mappedtodisk,readahead
0x0000000000000028	         1        0  ___U_l_____________________________	uptodate,lru
0x000000040000002c	   8598032    33586  __RU_l___________________d_________	referenced,uptodate,lru,mappedtodisk
0x000000080000002c	        10        0  __RU_l____________________P________	referenced,uptodate,lru,private
0x000000000000403c	       185        0  __RUDl________b____________________	referenced,uptodate,dirty,lru,swapbacked
0x0000000800000060	       163        0  _____lA___________________P________	lru,active,private
0x0000000800000064	     36741      143  __R__lA___________________P________	referenced,lru,active,private
0x0000000400000068	    527834     2061  ___U_lA__________________d_________	uptodate,lru,active,mappedtodisk
0x0000000800000068	       695        2  ___U_lA___________________P________	uptodate,lru,active,private
0x0000000c00000068	       116        0  ___U_lA__________________dP________	uptodate,lru,active,mappedtodisk,private
0x000000080000006c	   6584066    25719  __RU_lA___________________P________	referenced,uptodate,lru,active,private
0x000000040000006c	   1325273     5176  __RU_lA__________________d_________	referenced,uptodate,lru,active,mappedtodisk
0x0000000c0000006c	       431        1  __RU_lA__________________dP________	referenced,uptodate,lru,active,mappedtodisk,private
0x000000000000006c	       128        0  __RU_lA____________________________	referenced,uptodate,lru,active
0x0000000000004078	        56        0  ___UDlA_______b____________________	uptodate,dirty,lru,active,swapbacked
0x000000000000407c	       122        0  __RUDlA_______b____________________	referenced,uptodate,dirty,lru,active,swapbacked
0x000000080000007c	         1        0  __RUDlA___________________P________	referenced,uptodate,dirty,lru,active,private
0x0000000000008080	     14571       56  _______S_______H___________________	slab,compound_head
0x0000000000000080	    250546      978  _______S___________________________	slab
0x0000000000000400	     14701       57  __________B________________________	buddy
0x0000000000000800	        16        0  ___________M_______________________	mmap
0x0000000100000804	         1        0  __R________M___________r___________	referenced,mmap,reserved
0x000000060004082c	       391        1  __RU_l_____M______u_____md_________	referenced,uptodate,lru,mmap,unevictable,mlocked,mappedtodisk
0x0000000a0004082c	       321        1  __RU_l_____M______u_____m_P________	referenced,uptodate,lru,mmap,unevictable,mlocked,private
0x0000000000004838	      8385       32  ___UDl_____M__b____________________	uptodate,dirty,lru,mmap,swapbacked
0x000000000000483c	      2045        7  __RUDl_____M__b____________________	referenced,uptodate,dirty,lru,mmap,swapbacked
0x0000000800000868	        19        0  ___U_lA____M______________P________	uptodate,lru,active,mmap,private
0x0000000400000868	         5        0  ___U_lA____M_____________d_________	uptodate,lru,active,mmap,mappedtodisk
0x000000040000086c	      1891        7  __RU_lA____M_____________d_________	referenced,uptodate,lru,active,mmap,mappedtodisk
0x000000080000086c	       126        0  __RU_lA____M______________P________	referenced,uptodate,lru,active,mmap,private
0x0000000000004878	        85        0  ___UDlA____M__b____________________	uptodate,dirty,lru,active,mmap,swapbacked
0x000000000000487c	      2263        8  __RUDlA____M__b____________________	referenced,uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000005008	         4        0  ___U________a_b____________________	uptodate,anonymous,swapbacked
0x0000000000005808	        25        0  ___U_______Ma_b____________________	uptodate,mmap,anonymous,swapbacked
0x0000000200045828	         8        0  ___U_l_____Ma_b___u_____m__________	uptodate,lru,mmap,anonymous,swapbacked,unevictable,mlocked
0x000000020004582c	       651        2  __RU_l_____Ma_b___u_____m__________	referenced,uptodate,lru,mmap,anonymous,swapbacked,unevictable,mlocked
0x0000000000005868	      7623       29  ___U_lA____Ma_b____________________	uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c	        39        0  __RU_lA____Ma_b____________________	referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total	  17922048    70008

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
       [not found]               ` <50ACA634.5000007@gmail.com>
@ 2012-11-21 10:07                 ` Metin Döşlü
  2012-11-22 15:41                   ` Fengguang Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Metin Döşlü @ 2012-11-21 10:07 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: Fengguang Wu, Jan Kara, linux-kernel, linux-mm

On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse <jaegeuk.hanse@gmail.com> wrote:
>
> On 11/21/2012 05:58 PM, metin d wrote:
>
> Hi Fengguang,
>
> I ran the tests and attached the results. I guess the line below shows data-1's page cache pages.
>
> 0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private
>
>
> I think this is just one state of page cache pages.

But why are these page cache pages in this state, as opposed to the
other page cache pages? From the results I conclude that:

data-1 pages are in state : referenced,uptodate,lru,active,private
data-2 pages are in state : referenced,uptodate,lru,mappedtodisk

>
>
>
>
> Metin
>
>
> ----- Original Message -----
> From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
> To: Fengguang Wu <fengguang.wu@intel.com>
> Cc: metin d <metdos@yahoo.com>; Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; "linux-mm@kvack.org" <linux-mm@kvack.org>
> Sent: Wednesday, November 21, 2012 11:42 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> >> Cc Fengguang Wu.
> >>
> >> On 11/21/2012 04:13 PM, metin d wrote:
> >>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
> >>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >>>
> >>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> >>> My understanding was that under memory pressure from heavily
> >>> accessed pages, unused pages would eventually get evicted. Is there
> >>> anything else we can try on this host to understand why this is
> >>> happening?
> > We may debug it this way.
> >
> > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
> >    (please double check via /proc/vmstat whether it does the expected work)
> >
> > 2) run 'page-types -r' with root, to view the page status for the
> >    remaining pages of data-1
> >
> > The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> > Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> >
> > page-types can be found in the kernel source tree tools/vm/page-types.c
> >
> > Sorry that sounds a bit twisted.. I do have a patch to directly dump
> > page cache status of a user specified file, however it's not
> > upstreamed yet.
>
> Hi Fengguang,
>
> Thanks for your detailed steps; I think Metin can give them a try.
>
>              flags    page-count       MB  symbolic-flags                       long-symbolic-flags
> 0x0000000000000000        607699     2373  ___________________________________
> 0x0000000100000000        343227     1340  _______________________r___________  reserved
>
> But I have some questions about the page-types output:
>
> Does the 2373 MB here mean the total memory in use, including page cache?
> I don't think so.
> Which kinds of pages are marked reserved?
> Which line of long-symbolic-flags corresponds to page cache?
>
> Regards,
> Jaegeuk
>
> >
> > Thanks,
> > Fengguang
> >
> >>> On Tue 20-11-12 09:42:42, metin d wrote:
> >>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>>> same machine. Both databases keep 40 GB of data, and the total memory
> >>>> available on the machine is 68GB.
> >>>>
> >>>> I started data-1 and data-2, and ran several queries to go over all their
> >>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>>> For some reason, the OS still holds on to large parts of data-1's pages
> >>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>>> a result, my queries on data-2 keep hitting disk.
> >>>>
> >>>> I'm checking page cache usage with fincore. When I run a table scan query
> >>>> against data-2, I see that data-2's pages get evicted and put back into
> >>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>> although they haven't been touched for days.
> >>>>
> >>>> Does anybody know why data-1's pages aren't evicted from the page cache?
> >>>> I'm open to all kind of suggestions you think it might relate to problem.
> >>>    Curious. Added linux-mm list to CC to catch more attention. If you run
> >>> echo 1 >/proc/sys/vm/drop_caches
> >>>    does it evict data-1 pages from memory?
> >>>
> >>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>>> swap space. The kernel version is:
> >>>>
> >>>> $ uname -r
> >>>> 3.2.28-45.62.amzn1.x86_64
> >>>> Edit:
> >>>>
> >>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
> >>>>
> >>>> $ numactl --hardware
> >>>> available: 1 nodes (0)
> >>>> node 0 cpus: 0 1 2 3 4 5 6 7
> >>>> node 0 size: 70007 MB
> >>>> node 0 free: 360 MB
> >>>> node distances:
> >>>> node  0
> >>>>    0:  10
>
>



--
Metin Döşlü

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-20 18:25 ` Jan Kara
  2012-11-21  8:03   ` metin d
@ 2012-11-21 21:34   ` Johannes Weiner
  2012-11-21 22:01     ` metin d
  2012-11-22  0:48     ` Jaegeuk Hanse
  2012-11-23  1:58   ` Jaegeuk Hanse
  2 siblings, 2 replies; 30+ messages in thread
From: Johannes Weiner @ 2012-11-21 21:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: metin d, linux-kernel, linux-mm

Hi,

On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
> On Tue 20-11-12 09:42:42, metin d wrote:
> > I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > same machine. Both databases keep 40 GB of data, and the total memory
> > available on the machine is 68GB.
> > 
> > I started data-1 and data-2, and ran several queries to go over all their
> > data. Then, I shut down data-1 and kept issuing queries against data-2.
> > For some reason, the OS still holds on to large parts of data-1's pages
> > in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > a result, my queries on data-2 keep hitting disk.
> > 
> > I'm checking page cache usage with fincore. When I run a table scan query
> > against data-2, I see that data-2's pages get evicted and put back into
> > the cache in a round-robin manner. Nothing happens to data-1's pages,
> > although they haven't been touched for days.
> > 
> > Does anybody know why data-1's pages aren't evicted from the page cache?
> > I'm open to all kind of suggestions you think it might relate to problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list.  I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.

I have a series of patches that detects a thrashing inactive list and
handles working set changes up to the size of memory.  Would you be
willing to test them?  They are currently based on 3.4, let me know
what version works best for you.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-21 21:34   ` Johannes Weiner
@ 2012-11-21 22:01     ` metin d
  2012-11-22  0:48     ` Jaegeuk Hanse
  1 sibling, 0 replies; 30+ messages in thread
From: metin d @ 2012-11-21 22:01 UTC (permalink / raw)
  To: Johannes Weiner, Jan Kara
  Cc: linux-kernel, linux-mm, Metin Döşlü

Hi,

Yes, data-2 is bigger than half of memory. I'm willing to try those patches.

This is the version of this machine:

$ uname -r
3.2.28-45.62.amzn1.x86_64



----- Original Message -----
From: Johannes Weiner <hannes@cmpxchg.org>
To: Jan Kara <jack@suse.cz>
Cc: metin d <metdos@yahoo.com>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; linux-mm@kvack.org
Sent: Wednesday, November 21, 2012 11:34 PM
Subject: Re: Problem in Page Cache Replacement

Hi,

On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
> On Tue 20-11-12 09:42:42, metin d wrote:
> > I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > same machine. Both databases keep 40 GB of data, and the total memory
> > available on the machine is 68GB.
> > 
> > I started data-1 and data-2, and ran several queries to go over all their
> > data. Then, I shut down data-1 and kept issuing queries against data-2.
> > For some reason, the OS still holds on to large parts of data-1's pages
> > in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > a result, my queries on data-2 keep hitting disk.
> > 
> > I'm checking page cache usage with fincore. When I run a table scan query
> > against data-2, I see that data-2's pages get evicted and put back into
> > the cache in a round-robin manner. Nothing happens to data-1's pages,
> > although they haven't been touched for days.
> > 
> > Does anybody know why data-1's pages aren't evicted from the page cache?
> > I'm open to all kind of suggestions you think it might relate to problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list.  I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.

I have a series of patches that detects a thrashing inactive list and
handles working set changes up to the size of memory.  Would you be
willing to test them?  They are currently based on 3.4, let me know
what version works best for you.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-21 21:34   ` Johannes Weiner
  2012-11-21 22:01     ` metin d
@ 2012-11-22  0:48     ` Jaegeuk Hanse
  2012-11-22  1:09       ` Johannes Weiner
  1 sibling, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-22  0:48 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Jan Kara, metin d, linux-kernel, linux-mm

On 11/22/2012 05:34 AM, Johannes Weiner wrote:
> Hi,
>
> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
>> On Tue 20-11-12 09:42:42, metin d wrote:
>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>> same machine. Both databases keep 40 GB of data, and the total memory
>>> available on the machine is 68GB.
>>>
>>> I started data-1 and data-2, and ran several queries to go over all their
>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>> For some reason, the OS still holds on to large parts of data-1's pages
>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>> a result, my queries on data-2 keep hitting disk.
>>>
>>> I'm checking page cache usage with fincore. When I run a table scan query
>>> against data-2, I see that data-2's pages get evicted and put back into
>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>> although they haven't been touched for days.
>>>
>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>> I'm open to all kind of suggestions you think it might relate to problem.
> This might be because we do not deactivate pages as long as there is
> cache on the inactive list.  I'm guessing that the inter-reference
> distance of data-2 is bigger than half of memory, so it's never
> getting activated and data-1 is never challenged.

Hi Johannes,

What's the meaning of "inter-reference distance" and why compare it with
half of memory? What's the trick?

Regards,
Jaegeuk

>
> I have a series of patches that detects a thrashing inactive list and
> handles working set changes up to the size of memory.  Would you be
> willing to test them?  They are currently based on 3.4, let me know
> what version works best for you.
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22  0:48     ` Jaegeuk Hanse
@ 2012-11-22  1:09       ` Johannes Weiner
  2012-11-22  9:37         ` metin d
  2012-11-22 13:16         ` Jaegeuk Hanse
  0 siblings, 2 replies; 30+ messages in thread
From: Johannes Weiner @ 2012-11-22  1:09 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: Jan Kara, metin d, linux-kernel, linux-mm

On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:
> On 11/22/2012 05:34 AM, Johannes Weiner wrote:
> >Hi,
> >
> >On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
> >>On Tue 20-11-12 09:42:42, metin d wrote:
> >>>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>>same machine. Both databases keep 40 GB of data, and the total memory
> >>>available on the machine is 68GB.
> >>>
> >>>I started data-1 and data-2, and ran several queries to go over all their
> >>>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>>For some reason, the OS still holds on to large parts of data-1's pages
> >>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>>a result, my queries on data-2 keep hitting disk.
> >>>
> >>>I'm checking page cache usage with fincore. When I run a table scan query
> >>>against data-2, I see that data-2's pages get evicted and put back into
> >>>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>although they haven't been touched for days.
> >>>
> >>>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>>I'm open to all kind of suggestions you think it might relate to problem.
> >This might be because we do not deactivate pages as long as there is
> >cache on the inactive list.  I'm guessing that the inter-reference
> >distance of data-2 is bigger than half of memory, so it's never
> >getting activated and data-1 is never challenged.
> 
> Hi Johannes,
> 
> What's the meaning of "inter-reference distance"

It's the number of memory accesses between two accesses to the same
page:

  A B C D A B C E ...
    |_______|
    |       |

> and why compare it with half of memoy, what's the trick?

If B gets accessed twice, it gets activated.  If it gets evicted in
between, the second access will be a fresh page fault and B will not
be recognized as frequently used.

Our cutoff for scanning the active list is cache size / 2 right now
(inactive_file_is_low), leaving 50% of memory to the inactive list.
If the inter-reference distance for pages on the inactive list is
bigger than that, they get evicted before their second access.
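
A minimal sketch of this effect, assuming a toy model rather than the
kernel's real reclaim code: pages start on the inactive list, are
promoted on a second access while still cached, and the active list is
only shrunk when it is larger than the inactive list. The function
name simulate() and the page labels are hypothetical.

```python
from collections import OrderedDict

def simulate(accesses, cache_size, protected=()):
    """Toy two-list page cache.

    Returns (hits, activations, protected pages still cached).
    This is an illustration only, not the kernel's algorithm.
    """
    protected = list(protected)
    active = OrderedDict((p, None) for p in protected)  # stale, protected pages
    inactive = OrderedDict()
    hits = activations = 0
    for page in accesses:
        if page in active:
            hits += 1
            active.move_to_end(page)
        elif page in inactive:
            # Second access while still cached: promote to the active list.
            hits += 1
            activations += 1
            del inactive[page]
            active[page] = None
        else:
            # First access, or a refault after eviction: start on inactive.
            inactive[page] = None
        while len(active) + len(inactive) > cache_size:
            if len(active) > len(inactive):
                # Inactive list is low: demote the oldest active page.
                old, _ = active.popitem(last=False)
                inactive[old] = None
            else:
                # Otherwise only the inactive list is reclaimed.
                inactive.popitem(last=False)
    left = sum(1 for p in protected if p in active or p in inactive)
    return hits, activations, left

stale = ["data-1:%d" % i for i in range(4)]
print(simulate(list(range(8)) * 5, 10, stale))  # scan wider than the inactive list
print(simulate(list(range(5)) * 5, 10, stale))  # scan that fits
```

With 10 cache slots and 4 stale active pages, a cyclic scan over 8
pages never hits and never activates anything (0, 0, 4), while a scan
over 5 pages is fully activated after the first pass (20, 5, 4).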

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22  1:09       ` Johannes Weiner
@ 2012-11-22  9:37         ` metin d
  2012-11-22 13:16         ` Jaegeuk Hanse
  1 sibling, 0 replies; 30+ messages in thread
From: metin d @ 2012-11-22  9:37 UTC (permalink / raw)
  To: Johannes Weiner, Jaegeuk Hanse
  Cc: Jan Kara, linux-kernel, linux-mm, Metin Döşlü

Hi Johannes,

Yes, the problem was as you predicted. I made the data-2 pages "active" by manually reading them twice, and data-1's pages finally got evicted from the page cache.
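
A minimal sketch of such a "read it twice" pass, assuming plain
buffered reads are enough to get the pages referenced a second time.
The helper name read_twice() and the chunk size are hypothetical, and
whether the kernel actually promotes the pages depends on the kernel
version and on how much of the file stays cached between passes.

```python
def read_twice(path, chunk_size=1 << 20):
    """Read a file sequentially two times; return the total bytes read."""
    total = 0
    for _ in range(2):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

Calling read_twice() on each data file of the database, one file at a
time, keeps the distance between the two accesses to a page no larger
than that file.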

We have large files in PostgreSQL and Hadoop that we scan sequentially, and we try to fit our working set into total memory. So I hope your patches make it into an upcoming Linux kernel release soon.

Thanks,
Metin



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22  1:09       ` Johannes Weiner
  2012-11-22  9:37         ` metin d
@ 2012-11-22 13:16         ` Jaegeuk Hanse
  2012-11-22 16:17           ` Johannes Weiner
  1 sibling, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-22 13:16 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Jan Kara, metin d, linux-kernel, linux-mm

On 11/22/2012 09:09 AM, Johannes Weiner wrote:
> On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:
>> On 11/22/2012 05:34 AM, Johannes Weiner wrote:
>>> Hi,
>>>
>>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
>>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>>> available on the machine is 68GB.
>>>>>
>>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>>> a result, my queries on data-2 keep hitting disk.
>>>>>
>>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>>> although they haven't been touched for days.
>>>>>
>>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>> This might be because we do not deactive pages as long as there is
>>> cache on the inactive list.  I'm guessing that the inter-reference
>>> distance of data-2 is bigger than half of memory, so it's never
>>> getting activated and data-1 is never challenged.
>> Hi Johannes,
>>
>> What's the meaning of "inter-reference distance"
> It's the number of memory accesses between two accesses to the same
> page:
>
>    A B C D A B C E ...
>      |_______|
>      |       |
>
>> and why compare it with half of memoy, what's the trick?
> If B gets accessed twice, it gets activated.  If it gets evicted in
> between, the second access will be a fresh page fault and B will not
> be recognized as frequently used.
>
> Our cutoff for scanning the active list is cache size / 2 right now
> (inactive_file_is_low), leaving 50% of memory to the inactive list.
> If the inter-reference distance for pages on the inactive list is
> bigger than that, they get evicted before their second access.

Hi Johannes,

Thanks for your explanation. But could you give a short description of
how you resolve this inactive list thrashing issue?

Regards,
Jaegeuk




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-21  9:42           ` Jaegeuk Hanse
  2012-11-21 10:00             ` metin d
       [not found]             ` <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com>
@ 2012-11-22 15:26             ` Fengguang Wu
  2012-11-23  1:32               ` Jaegeuk Hanse
  2 siblings, 1 reply; 30+ messages in thread
From: Fengguang Wu @ 2012-11-22 15:26 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

Hi Jaegeuk,

Sorry for the delay. I'm traveling these days..

On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote:
> On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> >On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> >>Cc Fengguang Wu.
> >>
> >>On 11/21/2012 04:13 PM, metin d wrote:
> >>>>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>>>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >>>I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >>>
> >>>We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> >>>My understanding was that under memory pressure from heavily
> >>>accessed pages, unused pages would eventually get evicted. Is there
> >>>anything else we can try on this host to understand why this is
> >>>happening?
> >We may debug it this way.
> >
> >1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
> >    (please double check via /proc/vmstat whether it does the expected work)
> >
> >2) run 'page-types -r' with root, to view the page status for the
> >    remaining pages of data-1
> >
> >The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> >Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> >
> >page-types can be found in the kernel source tree tools/vm/page-types.c
> >
> >Sorry that sounds a bit twisted.. I do have a patch to directly dump
> >page cache status of a user specified file, however it's not
> >upstreamed yet.
> 
> Hi Fengguang,
> 
> Thanks for you detail steps, I think metin can have a try.
> 
>              flags  page-count       MB  symbolic-flags                       long-symbolic-flags
> 0x0000000000000000      607699     2373  ___________________________________
> 0x0000000100000000      343227     1340  _______________________r___________  reserved
 
We don't actually need to care about the above two page states.
Page cache pages will never be in the special reserved or
all-flags-cleared state.

> But I have some questions of the print of page-type:
> 
> Is 2373MB here mean total memory in used include page cache? I don't
> think so.
> Which kind of pages will be marked reserved?
> Which line of long-symbolic-flags is for page cache?

The (lru && !anonymous) pages are page cache pages.
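
A minimal sketch of that rule applied to the flags column of the
page-types output. The bit positions below are a subset assumed to
follow include/linux/kernel-page-flags.h and tools/vm/page-types.c;
they agree with the flag values quoted in this thread, but please
verify them against your kernel tree. The helper names are
hypothetical.

```python
# Assumed subset of kpageflags bit positions (verify against your tree).
KPF_BITS = {
    2: "referenced",
    3: "uptodate",
    4: "dirty",
    5: "lru",
    6: "active",
    10: "buddy",
    12: "anonymous",
    32: "reserved",
    34: "mappedtodisk",
    35: "private",
}

def decode(flags):
    """Return the known flag names set in a page-types flags value."""
    return [name for bit, name in sorted(KPF_BITS.items()) if flags & (1 << bit)]

def is_page_cache(flags):
    """Apply the (lru && !anonymous) rule."""
    names = decode(flags)
    return "lru" in names and "anonymous" not in names

print(decode(0x000000080000006C))
print(is_page_cache(0x000000080000006C), is_page_cache(0x0000000100000000))
```

The first value decodes to referenced, uptodate, lru, active, private
and counts as page cache; the reserved-only value does not.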

Thanks,
Fengguang

> >>>On Tue 20-11-12 09:42:42, metin d wrote:
> >>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>>>same machine. Both databases keep 40 GB of data, and the total memory
> >>>>available on the machine is 68GB.
> >>>>
> >>>>I started data-1 and data-2, and ran several queries to go over all their
> >>>>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>>>For some reason, the OS still holds on to large parts of data-1's pages
> >>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>>>a result, my queries on data-2 keep hitting disk.
> >>>>
> >>>>I'm checking page cache usage with fincore. When I run a table scan query
> >>>>against data-2, I see that data-2's pages get evicted and put back into
> >>>>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>>although they haven't been touched for days.
> >>>>
> >>>>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>>>I'm open to all kind of suggestions you think it might relate to problem.
> >>>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>>echo 1 >/proc/sys/vm/drop_caches
> >>>   does it evict data-1 pages from memory?
> >>>
> >>>>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>>>swap space. The kernel version is:
> >>>>
> >>>>$ uname -r
> >>>>3.2.28-45.62.amzn1.x86_64
> >>>>Edit:
> >>>>
> >>>>and it seems that I use one NUMA instance, if  you think that it can a problem.
> >>>>
> >>>>$ numactl --hardware
> >>>>available: 1 nodes (0)
> >>>>node 0 cpus: 0 1 2 3 4 5 6 7
> >>>>node 0 size: 70007 MB
> >>>>node 0 free: 360 MB
> >>>>node distances:
> >>>>node   0
> >>>>    0:  10

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-21 10:07                 ` Metin Döşlü
@ 2012-11-22 15:41                   ` Fengguang Wu
  2012-11-22 15:53                     ` Fengguang Wu
  2012-11-24 15:06                     ` Metin Döşlü
  0 siblings, 2 replies; 30+ messages in thread
From: Fengguang Wu @ 2012-11-22 15:41 UTC (permalink / raw)
  To: Metin Döşlü
  Cc: Jaegeuk Hanse, Jan Kara, linux-kernel, linux-mm

On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse <jaegeuk.hanse@gmail.com> wrote:
> >
> > On 11/21/2012 05:58 PM, metin d wrote:
> >
> > Hi Fengguang,
> >
> > I run tests and attached the results. The line below I guess shows the data-1 page caches.
> >
> > 0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private
> >
> >
> > I thinks this is just one state of page cache pages.
> 
> But why these page caches are in this state as opposed to other page
> caches. From the results I conclude that:
> 
> data-1 pages are in state : referenced,uptodate,lru,active,private

I wonder if it's this code that stops data-1 pages from being
reclaimed:

shrink_page_list():

                if (page_has_private(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;

What's the filesystem used?

> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 15:41                   ` Fengguang Wu
@ 2012-11-22 15:53                     ` Fengguang Wu
  2012-11-23  2:10                       ` Jaegeuk Hanse
  2012-11-25 20:08                       ` Rik van Riel
  2012-11-24 15:06                     ` Metin Döşlü
  1 sibling, 2 replies; 30+ messages in thread
From: Fengguang Wu @ 2012-11-22 15:53 UTC (permalink / raw)
  To: Metin Döşlü
  Cc: Jaegeuk Hanse, Jan Kara, linux-kernel, linux-mm

On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
> > On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse <jaegeuk.hanse@gmail.com> wrote:
> > >
> > > On 11/21/2012 05:58 PM, metin d wrote:
> > >
> > > Hi Fengguang,
> > >
> > > I run tests and attached the results. The line below I guess shows the data-1 page caches.
> > >
> > > 0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private
> > >
> > >
> > > I thinks this is just one state of page cache pages.
> > 
> > But why these page caches are in this state as opposed to other page
> > caches. From the results I conclude that:
> > 
> > data-1 pages are in state : referenced,uptodate,lru,active,private
> 
> I wonder if it's this code that stops data-1 pages from being
> reclaimed:
> 
> shrink_page_list():
> 
>                 if (page_has_private(page)) {
>                         if (!try_to_release_page(page, sc->gfp_mask))
>                                 goto activate_locked;
> 
> What's the filesystem used?

Ah it's more likely caused by this logic:

        if (is_active_lru(lru)) {
                if (inactive_list_is_low(mz, file))
                        shrink_active_list(nr_to_scan, mz, sc, priority, file);

The active file list won't be scanned at all if it's smaller than the
inactive list. In this case, it's inactive=33586MB > active=25719MB. So
the data-1 pages in the active list will never be scanned and reclaimed.
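
A minimal sketch of that check with the numbers from this report,
assuming the test reduces to a plain comparison of the two file LRU
sizes. The Python function below is a hypothetical stand-in, not the
kernel function itself.

```python
def inactive_file_is_low(active_mb, inactive_mb):
    """Assumed rule: the active file list is scanned only when it is
    larger than the inactive file list."""
    return active_mb > inactive_mb

# Sizes reported in this thread, in MB.
print(inactive_file_is_low(active_mb=25719, inactive_mb=33586))
```

This prints False for the reported sizes, so under this rule the
active list holding the data-1 pages is left alone.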

> > data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
> 
> Thanks,
> Fengguang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 13:16         ` Jaegeuk Hanse
@ 2012-11-22 16:17           ` Johannes Weiner
  2012-11-23  2:14             ` Jaegeuk Hanse
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Weiner @ 2012-11-22 16:17 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: Jan Kara, metin d, linux-kernel, linux-mm

On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote:
> On 11/22/2012 09:09 AM, Johannes Weiner wrote:
> >On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:
> >>On 11/22/2012 05:34 AM, Johannes Weiner wrote:
> >>>Hi,
> >>>
> >>>On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
> >>>>On Tue 20-11-12 09:42:42, metin d wrote:
> >>>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>>>>same machine. Both databases keep 40 GB of data, and the total memory
> >>>>>available on the machine is 68GB.
> >>>>>
> >>>>>I started data-1 and data-2, and ran several queries to go over all their
> >>>>>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>>>>For some reason, the OS still holds on to large parts of data-1's pages
> >>>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>>>>a result, my queries on data-2 keep hitting disk.
> >>>>>
> >>>>>I'm checking page cache usage with fincore. When I run a table scan query
> >>>>>against data-2, I see that data-2's pages get evicted and put back into
> >>>>>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>>>although they haven't been touched for days.
> >>>>>
> >>>>>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>>>>I'm open to all kind of suggestions you think it might relate to problem.
> >>>This might be because we do not deactive pages as long as there is
> >>>cache on the inactive list.  I'm guessing that the inter-reference
> >>>distance of data-2 is bigger than half of memory, so it's never
> >>>getting activated and data-1 is never challenged.
> >>Hi Johannes,
> >>
> >>What's the meaning of "inter-reference distance"
> >It's the number of memory accesses between two accesses to the same
> >page:
> >
> >   A B C D A B C E ...
> >     |_______|
> >     |       |
> >
> >>and why compare it with half of memoy, what's the trick?
> >If B gets accessed twice, it gets activated.  If it gets evicted in
> >between, the second access will be a fresh page fault and B will not
> >be recognized as frequently used.
> >
> >Our cutoff for scanning the active list is cache size / 2 right now
> >(inactive_file_is_low), leaving 50% of memory to the inactive list.
> >If the inter-reference distance for pages on the inactive list is
> >bigger than that, they get evicted before their second access.
> 
> Hi Johannes,
> 
> Thanks for your explanation. But could you give a short description
> of how you resolve this inactive list thrashing issues?

I remember a time stamp of evicted file pages in the page cache radix
tree, which lets me reconstruct the inter-reference distance when a
page is faulted back in, even after it has been evicted from cache.
This way I can tell a one-time sequence from thrashing, no matter how
small the inactive list.

When thrashing is detected, I start deactivating protected pages and
put them next to the refaulted cache on the head of the inactive list
and let them fight it out as usual.  In this reported case, the old
data will be challenged and since it's no longer used, it will just
drop off the inactive list eventually.  If the guess is wrong and the
deactivated memory is used more heavily than the refaulting pages,
they will just get activated again without incurring any disruption
like a major fault.
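
A minimal sketch of the time stamp idea, not the actual patches. It
assumes the "time stamp" is a running eviction counter and that a
refault counts as thrashing when the number of evictions since the
page left the cache is no larger than the active list; that threshold
is an assumption, since the exact criterion is not spelled out here.
All names are hypothetical.

```python
class RefaultTracker:
    """Remembers when pages were evicted so a later refault can be judged."""

    def __init__(self):
        self.evictions = 0  # running eviction counter, used as the time stamp
        self.shadow = {}    # page -> counter value when it was evicted

    def on_evict(self, page):
        self.shadow[page] = self.evictions
        self.evictions += 1

    def refault_distance(self, page):
        """Evictions since this page was evicted, or None if it is unknown."""
        stamp = self.shadow.pop(page, None)
        if stamp is None:
            return None
        return self.evictions - stamp

def is_thrashing(distance, active_size):
    # Assumed rule: the page would have stayed cached if the slots held
    # by the active list had been available to the inactive list.
    return distance is not None and distance <= active_size

t = RefaultTracker()
t.on_evict("p0")
t.on_evict("p1")
d = t.refault_distance("p0")
print(d, is_thrashing(d, active_size=4))
```

Here p0 refaults after two evictions, which is within the four slots
of the active list, so the sketch flags it as thrashing. A page that
was never cached before yields None and is treated as a one-time
access.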

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 15:26             ` Fengguang Wu
@ 2012-11-23  1:32               ` Jaegeuk Hanse
  2012-11-23  2:25                 ` Fengguang Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-23  1:32 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

On 11/22/2012 11:26 PM, Fengguang Wu wrote:
> Hi Jaegeuk,
>
> Sorry for the delay. I'm traveling these days..
>
> On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote:
>> On 11/21/2012 05:02 PM, Fengguang Wu wrote:
>>> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>>>> Cc Fengguang Wu.
>>>>
>>>> On 11/21/2012 04:13 PM, metin d wrote:
>>>>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>>>>>
>>>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>>>>> My understanding was that under memory pressure from heavily
>>>>> accessed pages, unused pages would eventually get evicted. Is there
>>>>> anything else we can try on this host to understand why this is
>>>>> happening?
>>> We may debug it this way.
>>>
>>> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>>>     (please double check via /proc/vmstat whether it does the expected work)
>>>
>>> 2) run 'page-types -r' with root, to view the page status for the
>>>     remaining pages of data-1
>>>
>>> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
>>> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>>>
>>> page-types can be found in the kernel source tree tools/vm/page-types.c
>>>
>>> Sorry that sounds a bit twisted.. I do have a patch to directly dump
>>> page cache status of a user specified file, however it's not
>>> upstreamed yet.
>> Hi Fengguang,
>>
>> Thanks for you detail steps, I think metin can have a try.
>>
>>              flags  page-count       MB  symbolic-flags                       long-symbolic-flags
>> 0x0000000000000000      607699     2373  ___________________________________
>> 0x0000000100000000      343227     1340  _______________________r___________  reserved
>   
> We don't need to care about the above two pages states actually.
> Page cache pages will never be in the special reserved or
> all-flags-cleared state.

Hi Fengguang,

Thanks for your response. But which kinds of pages are in the special
reserved state, and which are in the all-flags-cleared state?

Regards,
Jaegeuk

>
>> But I have some questions of the print of page-type:
>>
>> Is 2373MB here mean total memory in used include page cache? I don't
>> think so.
>> Which kind of pages will be marked reserved?
>> Which line of long-symbolic-flags is for page cache?
> The (lru && !anonymous) pages are page cache pages.
>
> Thanks,
> Fengguang
>
>>>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>>>> available on the machine is 68GB.
>>>>>>
>>>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>>>> a result, my queries on data-2 keep hitting disk.
>>>>>>
>>>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>>>> although they haven't been touched for days.
>>>>>>
>>>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>>>> echo 1 >/proc/sys/vm/drop_caches
>>>>>    does it evict data-1 pages from memory?
>>>>>
>>>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>>>>> swap space. The kernel version is:
>>>>>>
>>>>>> $ uname -r
>>>>>> 3.2.28-45.62.amzn1.x86_64
>>>>>> Edit:
>>>>>>
>>>>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>>>>>
>>>>>> $ numactl --hardware
>>>>>> available: 1 nodes (0)
>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>> node 0 size: 70007 MB
>>>>>> node 0 free: 360 MB
>>>>>> node distances:
>>>>>> node   0
>>>>>>     0:  10


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-20 18:25 ` Jan Kara
  2012-11-21  8:03   ` metin d
  2012-11-21 21:34   ` Johannes Weiner
@ 2012-11-23  1:58   ` Jaegeuk Hanse
  2012-11-23  8:08     ` metin d
  2 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-23  1:58 UTC (permalink / raw)
  To: metin d; +Cc: Jan Kara, linux-kernel, linux-mm

On 11/21/2012 02:25 AM, Jan Kara wrote:
> On Tue 20-11-12 09:42:42, metin d wrote:
>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>> same machine. Both databases keep 40 GB of data, and the total memory
>> available on the machine is 68GB.
>>
>> I started data-1 and data-2, and ran several queries to go over all their
>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>> For some reason, the OS still holds on to large parts of data-1's pages
>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>> a result, my queries on data-2 keep hitting disk.
>>
>> I'm checking page cache usage with fincore. When I run a table scan query
>> against data-2, I see that data-2's pages get evicted and put back into
>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>> although they haven't been touched for days.

Hi metin d,

Is fincore a tool or ...? How can I get it?

Regards,
Jaegeuk

>>
>> Does anybody know why data-1's pages aren't evicted from the page cache?
>> I'm open to all kind of suggestions you think it might relate to problem.
>    Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches
>    does it evict data-1 pages from memory?
>
>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>> swap space. The kernel version is:
>>
>> $ uname -r
>> 3.2.28-45.62.amzn1.x86_64
>> Edit:
>>
>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>
>> $ numactl --hardware
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 70007 MB
>> node 0 free: 360 MB
>> node distances:
>> node   0
>>    0:  10
> 								Honza


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 15:53                     ` Fengguang Wu
@ 2012-11-23  2:10                       ` Jaegeuk Hanse
  2012-11-25 20:08                       ` Rik van Riel
  1 sibling, 0 replies; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-23  2:10 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Metin Döşlü, Jan Kara, linux-kernel, linux-mm

On 11/22/2012 11:53 PM, Fengguang Wu wrote:
> On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote:
>> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
>>> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse <jaegeuk.hanse@gmail.com> wrote:
>>>> On 11/21/2012 05:58 PM, metin d wrote:
>>>>
>>>> Hi Fengguang,
>>>>
>>>> I run tests and attached the results. The line below I guess shows the data-1 page caches.
>>>>
>>>> 0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private
>>>>
>>>>
>>>> I thinks this is just one state of page cache pages.
>>> But why these page caches are in this state as opposed to other page
>>> caches. From the results I conclude that:
>>>
>>> data-1 pages are in state : referenced,uptodate,lru,active,private
>> I wonder if it's this code that stops data-1 pages from being
>> reclaimed:
>>
>> shrink_page_list():
>>
>>                  if (page_has_private(page)) {
>>                          if (!try_to_release_page(page, sc->gfp_mask))
>>                                  goto activate_locked;
>>
>> What's the filesystem used?
> Ah it's more likely caused by this logic:
>
>          if (is_active_lru(lru)) {
>                  if (inactive_list_is_low(mz, file))
>                          shrink_active_list(nr_to_scan, mz, sc, priority, file);
>
> The active file list won't be scanned at all if it's smaller than the
> inactive list. In this case, it's inactive=33586MB > active=25719MB. So
> the data-1 pages in the active list will never be scanned and reclaimed.

Hi Fengguang,

It seems that most of data-1's file pages are on the active LRU list and
most of data-2's file pages are on the inactive LRU list. As Johannes
mentioned, if the inter-reference distance is bigger than half of memory,
the pages will not be activated. How do you intend to resolve this issue?
Is Johannes's fix for inactive list thrashing available?

Regards,
Jaegeuk

>
>>> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
>> Thanks,
>> Fengguang


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 16:17           ` Johannes Weiner
@ 2012-11-23  2:14             ` Jaegeuk Hanse
  0 siblings, 0 replies; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-23  2:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Jan Kara, metin d, linux-kernel, linux-mm

On 11/23/2012 12:17 AM, Johannes Weiner wrote:
> On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote:
>> On 11/22/2012 09:09 AM, Johannes Weiner wrote:
>>> On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:
>>>> On 11/22/2012 05:34 AM, Johannes Weiner wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:
>>>>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>>>>> available on the machine is 68GB.
>>>>>>>
>>>>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>>>>> a result, my queries on data-2 keep hitting disk.
>>>>>>>
>>>>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>>>>> although they haven't been touched for days.
>>>>>>>
>>>>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>>>> This might be because we do not deactive pages as long as there is
>>>>> cache on the inactive list.  I'm guessing that the inter-reference
>>>>> distance of data-2 is bigger than half of memory, so it's never
>>>>> getting activated and data-1 is never challenged.
>>>> Hi Johannes,
>>>>
>>>> What's the meaning of "inter-reference distance"
>>> It's the number of memory accesses between two accesses to the same
>>> page:
>>>
>>>    A B C D A B C E ...
>>>      |_______|
>>>      |       |
>>>
>>>> and why compare it with half of memoy, what's the trick?
>>> If B gets accessed twice, it gets activated.  If it gets evicted in
>>> between, the second access will be a fresh page fault and B will not
>>> be recognized as frequently used.
>>>
>>> Our cutoff for scanning the active list is cache size / 2 right now
>>> (inactive_file_is_low), leaving 50% of memory to the inactive list.
>>> If the inter-reference distance for pages on the inactive list is
>>> bigger than that, they get evicted before their second access.
>> Hi Johannes,
>>
>> Thanks for your explanation. But could you give a short description
>> of how you resolve this inactive list thrashing issue?
> I remember a time stamp of evicted file pages in the page cache radix
> tree that let me reconstruct the inter-reference distance even after a
> page has been evicted from cache when it's faulted back in.  This way
> I can tell a one-time sequence from thrashing, no matter how small the
> inactive list.
>
> When thrashing is detected, I start deactivating protected pages and
> put them next to the refaulted cache on the head of the inactive list
> and let them fight it out as usual.  In this reported case, the old
> data will be challenged and since it's no longer used, it will just
> drop off the inactive list eventually.  If the guess is wrong and the
> deactivated memory is used more heavily than the refaulting pages,
> they will just get activated again without incurring any disruption
> like a major fault.

Hi Johannes,

Do you also add the time stamp to the protected pages that you
deactivate when thrashing occurs?

Regards,
Jaegeuk
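
The time-stamp idea quoted above can be sketched in a few lines. This
is only an illustration, not Johannes's patch: the class name, the use
of an eviction counter as the "time stamp", and the rule that compares
the refault distance against the active list size are all assumptions
made for this example.

```python
class ShadowTracker:
    """Remembers when pages were evicted (hypothetical helper)."""

    def __init__(self):
        self.evictions = 0   # running eviction counter, used as a clock
        self.shadow = {}     # page -> counter value at eviction time

    def record_eviction(self, page):
        self.evictions += 1
        self.shadow[page] = self.evictions

    def refault_distance(self, page):
        """Evictions that happened since `page` was evicted, or None
        if the page has no shadow entry (a first-time fault)."""
        stamp = self.shadow.pop(page, None)
        if stamp is None:
            return None
        return self.evictions - stamp


def looks_like_thrashing(distance, active_size):
    # Assumed rule: if the page would have survived had the active
    # list's memory been available to the inactive list, the active
    # pages are worth challenging.
    return distance is not None and distance <= active_size


if __name__ == "__main__":
    tracker = ShadowTracker()
    for page in ["a", "b", "c", "d"]:
        tracker.record_eviction(page)
    d = tracker.refault_distance("a")
    print(d, looks_like_thrashing(d, active_size=26))
```

A page that faults without a shadow entry yields None, which is how
the sketch tells a one-time sequential read from a refault.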




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-23  1:32               ` Jaegeuk Hanse
@ 2012-11-23  2:25                 ` Fengguang Wu
  0 siblings, 0 replies; 30+ messages in thread
From: Fengguang Wu @ 2012-11-23  2:25 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: metin d, Jan Kara, linux-kernel, linux-mm

Jaegeuk,

> Thanks for your response. But which kind of pages are in the special
> reserved and which are all-flags-cleared?

The all-flags-cleared pages are mostly free pages in the buddy system.
The pages with flag "buddy" are also free pages: the buddy system only
marks the head page of each order-N free range with flag "buddy".

The reserved pages come from many sources: they may be set for memory
reserved for BIOS, memory holes, offlined memory, or memory used by
some device drivers.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-23  1:58   ` Jaegeuk Hanse
@ 2012-11-23  8:08     ` metin d
  2012-11-23  8:17       ` Jaegeuk Hanse
  0 siblings, 1 reply; 30+ messages in thread
From: metin d @ 2012-11-23  8:08 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: Jan Kara, linux-kernel, linux-mm

----- Original Message -----

From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: metin d <metdos@yahoo.com>
Cc: Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; linux-mm@kvack.org
Sent: Friday, November 23, 2012 3:58 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 02:25 AM, Jan Kara wrote:
> On Tue 20-11-12 09:42:42, metin d wrote:
>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>> same machine. Both databases keep 40 GB of data, and the total memory
>> available on the machine is 68GB.
>>
>> I started data-1 and data-2, and ran several queries to go over all their
>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>> For some reason, the OS still holds on to large parts of data-1's pages
>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>> a result, my queries on data-2 keep hitting disk.
>>
>> I'm checking page cache usage with fincore. When I run a table scan query
>> against data-2, I see that data-2's pages get evicted and put back into
>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>> although they haven't been touched for days.

> Hi metin d,

> fincore is a tool or ...? How could I get it?

> Regards,
> Jaegeuk


Hi Jaegeuk,

Yes, it is a tool. You can get it from here:
http://code.google.com/p/linux-ftools/


Regards,
Metin
>>
>> Does anybody know why data-1's pages aren't evicted from the page cache?
>> I'm open to all kind of suggestions you think it might relate to problem.
>    Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches
>    does it evict data-1 pages from memory?
>
>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>> swap space. The kernel version is:
>>
>> $ uname -r
>> 3.2.28-45.62.amzn1.x86_64
>> Edit:
>>
>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>
>> $ numactl --hardware
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 70007 MB
>> node 0 free: 360 MB
>> node distances:
>> node   0
>>    0:  10
>                                 Honza

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-23  8:08     ` metin d
@ 2012-11-23  8:17       ` Jaegeuk Hanse
  2012-11-23  8:25         ` metin d
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Hanse @ 2012-11-23  8:17 UTC (permalink / raw)
  To: metin d; +Cc: Jan Kara, linux-kernel, linux-mm

On 11/23/2012 04:08 PM, metin d wrote:
> ----- Original Message -----
>
> From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
> To: metin d <metdos@yahoo.com>
> Cc: Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; linux-mm@kvack.org
> Sent: Friday, November 23, 2012 3:58 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 02:25 AM, Jan Kara wrote:
>> On Tue 20-11-12 09:42:42, metin d wrote:
>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>> same machine. Both databases keep 40 GB of data, and the total memory
>>> available on the machine is 68GB.
>>>
>>> I started data-1 and data-2, and ran several queries to go over all their
>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>> For some reason, the OS still holds on to large parts of data-1's pages
>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>> a result, my queries on data-2 keep hitting disk.
>>>
>>> I'm checking page cache usage with fincore. When I run a table scan query
>>> against data-2, I see that data-2's pages get evicted and put back into
>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>> although they haven't been touched for days.
>> Hi metin d,
>> fincore is a tool or ...? How could I get it?
>> Regards,
>> Jaegeuk
>
> Hi Jaegeuk,
>
> Yes, it is a tool, you get it from here :
> http://code.google.com/p/linux-ftools/

Hi Metin,

Could you give me a link to download it? I can't get it from the link
you gave me. Thanks in advance. :-)

Regards,
Jaegeuk

>
>
> Regards,
> Metin
>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>> I'm open to all kind of suggestions you think it might relate to problem.
>>      Curious. Added linux-mm list to CC to catch more attention. If you run
>> echo 1 >/proc/sys/vm/drop_caches
>>      does it evict data-1 pages from memory?
>>
>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>> swap space. The kernel version is:
>>>
>>> $ uname -r
>>> 3.2.28-45.62.amzn1.x86_64
>>> Edit:
>>>
>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>>
>>> $ numactl --hardware
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 70007 MB
>>> node 0 free: 360 MB
>>> node distances:
>>> node   0
>>>      0:  10
>>                                  Honza


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-23  8:17       ` Jaegeuk Hanse
@ 2012-11-23  8:25         ` metin d
  0 siblings, 0 replies; 30+ messages in thread
From: metin d @ 2012-11-23  8:25 UTC (permalink / raw)
  To: Jaegeuk Hanse; +Cc: Jan Kara, linux-kernel, linux-mm

----- Original Message -----

From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: metin d <metdos@yahoo.com>
Cc: Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; "linux-mm@kvack.org" <linux-mm@kvack.org>
Sent: Friday, November 23, 2012 10:17 AM
Subject: Re: Problem in Page Cache Replacement

On 11/23/2012 04:08 PM, metin d wrote:
> ----- Original Message -----
>
> From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
> To: metin d <metdos@yahoo.com>
> Cc: Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; linux-mm@kvack.org
> Sent: Friday, November 23, 2012 3:58 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 02:25 AM, Jan Kara wrote:
>> On Tue 20-11-12 09:42:42, metin d wrote:
>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>> same machine. Both databases keep 40 GB of data, and the total memory
>>> available on the machine is 68GB.
>>>
>>> I started data-1 and data-2, and ran several queries to go over all their
>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>> For some reason, the OS still holds on to large parts of data-1's pages
>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>> a result, my queries on data-2 keep hitting disk.
>>>
>>> I'm checking page cache usage with fincore. When I run a table scan query
>>> against data-2, I see that data-2's pages get evicted and put back into
>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>> although they haven't been touched for days.
>> Hi metin d,
>> fincore is a tool or ...? How could I get it?
>> Regards,
>> Jaegeuk
>
> Hi Jaegeuk,
>
> Yes, it is a tool, you get it from here :
> http://code.google.com/p/linux-ftools/


> Hi Metin,

> Could you give me a link to download it? I can't get it from the link 
> you give me. Thanks in advance. :-)

> Regards,
> Jaegeuk

Hi Jaegeuk,

You may need to install Mercurial on your system. I'm able to download the source code with this command:

hg clone https://code.google.com/p/linux-ftools/


Regards,
Metin

>
>
> Regards,
> Metin
>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>> I'm open to all kind of suggestions you think it might relate to problem.
>>      Curious. Added linux-mm list to CC to catch more attention. If you run
>> echo 1 >/proc/sys/vm/drop_caches
>>      does it evict data-1 pages from memory?
>>
>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>> swap space. The kernel version is:
>>>
>>> $ uname -r
>>> 3.2.28-45.62.amzn1.x86_64
>>> Edit:
>>>
>>> and it seems that I use one NUMA instance, if  you think that it can a problem.
>>>
>>> $ numactl --hardware
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 70007 MB
>>> node 0 free: 360 MB
>>> node distances:
>>> node   0
>>>      0:  10
>>                                  Honza

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 15:41                   ` Fengguang Wu
  2012-11-22 15:53                     ` Fengguang Wu
@ 2012-11-24 15:06                     ` Metin Döşlü
  1 sibling, 0 replies; 30+ messages in thread
From: Metin Döşlü @ 2012-11-24 15:06 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Jaegeuk Hanse, Jan Kara, linux-kernel, linux-mm

On Thu, Nov 22, 2012 at 5:41 PM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
>> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse <jaegeuk.hanse@gmail.com> wrote:
>> >
>> > On 11/21/2012 05:58 PM, metin d wrote:
>> >
>> > Hi Fengguang,
>> >
>> > I ran tests and attached the results. I guess the line below shows the data-1 page cache pages.
>> >
>> > 0x000000080000006c       6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private
>> >
>> >
>> > I think this is just one state of page cache pages.
>>
>> But why are these page cache pages in this state, as opposed to the
>> other page cache pages? From the results I conclude that:
>>
>> data-1 pages are in state : referenced,uptodate,lru,active,private
>
> I wonder if it's this code that stops data-1 pages from being
> reclaimed:
>
> shrink_page_list():
>
>                 if (page_has_private(page)) {
>                         if (!try_to_release_page(page, sc->gfp_mask))
>                                 goto activate_locked;
>
> What's the filesystem used?

It was ext3.

>> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
>
> Thanks,
> Fengguang



-- 
Metin Döşlü
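
The flag word in the quoted page-types line can be decoded by hand.
The sketch below assumes the bit numbers used by the kernel's
page-types tool as best I recall them (the table is deliberately
partial) and a 4 KiB page size; both are assumptions and should be
checked against the tool's source for the kernel in use.

```python
# Assumed bit positions (partial table, to be verified against the
# page-types source for the running kernel).
KPF_BITS = {
    0: "locked", 1: "error", 2: "referenced", 3: "uptodate",
    4: "dirty", 5: "lru", 6: "active", 7: "slab",
    8: "writeback", 9: "reclaim", 10: "buddy",
    32: "reserved", 33: "mlocked", 34: "mappedtodisk", 35: "private",
}
PAGE_SIZE = 4096  # bytes; assumed


def decode_flags(flags):
    """Return the names of the bits set in `flags`, lowest bit first."""
    names = []
    bit = 0
    while (flags >> bit) != 0:
        if (flags >> bit) & 1:
            names.append(KPF_BITS.get(bit, "bit%d" % bit))
        bit += 1
    return names


def pages_to_mb(pages):
    """Convert a page count to whole MB, rounding down."""
    return pages * PAGE_SIZE // (1024 * 1024)


if __name__ == "__main__":
    print(",".join(decode_flags(0x000000080000006C)))
    print(pages_to_mb(6584051))
```

Under these assumptions the data-1 flag word decodes to the same five
names shown in the report, and the page count converts to the MB
figure in the same line.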

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Problem in Page Cache Replacement
  2012-11-22 15:53                     ` Fengguang Wu
  2012-11-23  2:10                       ` Jaegeuk Hanse
@ 2012-11-25 20:08                       ` Rik van Riel
  1 sibling, 0 replies; 30+ messages in thread
From: Rik van Riel @ 2012-11-25 20:08 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Metin Döşlü,
	Jaegeuk Hanse, Jan Kara, linux-kernel, linux-mm, Johannes Weiner

On 11/22/2012 10:53 AM, Fengguang Wu wrote:

> Ah it's more likely caused by this logic:
>
>          if (is_active_lru(lru)) {
>                  if (inactive_list_is_low(mz, file))
>                          shrink_active_list(nr_to_scan, mz, sc, priority, file);
>
> The active file list won't be scanned at all if it's smaller than the
> inactive list. In this case, it's inactive=33586MB > active=25719MB. So
> the data-1 pages in the active list will never be scanned and reclaimed.

That's it, indeed.

The reason we have that code is that otherwise one large streaming
IO could easily end up evicting the entire page cache working set.

Usually it works well, because the new page cache working set tends
to get touched twice while on the inactive list, and the old working
set gets demoted from the active list.

Only in a few very specific cases, where the inter-reference distance
of the new working set is larger than the size of the inactive list,
does it fail.

Something like Johannes's patches should solve the problem.

-- 
All rights reversed
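
The failure mode described in this message can be reproduced with a
toy model. The sketch below is not kernel code: the class, the page
counts (60 cache slots, 26 data-1 pages, 40 data-2 pages) and the
simplified promotion rule (a second access while cached activates the
page) are assumptions chosen to mirror the proportions reported in
the thread.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy active/inactive page cache; a model, not the kernel's code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()    # oldest page first
        self.inactive = OrderedDict()  # oldest page first
        self.faults = 0

    def inactive_is_low(self):
        # The active list is only scanned when it is larger than the
        # inactive list.
        return len(self.inactive) < len(self.active)

    def _reclaim_one(self):
        if self.inactive_is_low():
            # Demote the oldest active page so it can be challenged.
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True
        # Evict the oldest inactive page.
        self.inactive.popitem(last=False)

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            # Second access while still cached: promote to active.
            del self.inactive[page]
            self.active[page] = True
        else:
            self.faults += 1
            if len(self.active) + len(self.inactive) >= self.capacity:
                self._reclaim_one()
            self.inactive[page] = True


def run_scenario(capacity=60, old_pages=26, new_pages=40, passes=3):
    cache = TwoListCache(capacity)
    old = [("data-1", i) for i in range(old_pages)]
    new = [("data-2", i) for i in range(new_pages)]
    for _ in range(2):          # touch data-1 twice so it becomes active
        for page in old:
            cache.access(page)
    faults_before = cache.faults
    for _ in range(passes):     # repeated table scans over data-2
        for page in new:
            cache.access(page)
    data2_faults = cache.faults - faults_before
    data1_still_active = sum(1 for p in old if p in cache.active)
    return data2_faults, data1_still_active


if __name__ == "__main__":
    print(run_scenario())
```

With the default numbers the inactive list holds 34 pages while the
scan cycles over 40, so every data-2 access faults and none of the
data-1 pages ever leaves the active list. Shrinking the scan to 30
pages (new_pages=30) lets data-2 be promoted after its first pass.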

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-11-25 20:09 UTC | newest]

Thread overview: 30+ messages
2012-11-20 17:42 Problem in Page Cache Replacement metin d
2012-11-20 18:25 ` Jan Kara
2012-11-21  8:03   ` metin d
2012-11-21  8:13     ` metin d
2012-11-21  8:34       ` Jaegeuk Hanse
2012-11-21  9:02         ` Fengguang Wu
2012-11-21  9:10           ` Fengguang Wu
2012-11-21  9:42           ` Jaegeuk Hanse
2012-11-21 10:00             ` metin d
     [not found]             ` <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com>
     [not found]               ` <50ACA634.5000007@gmail.com>
2012-11-21 10:07                 ` Metin Döşlü
2012-11-22 15:41                   ` Fengguang Wu
2012-11-22 15:53                     ` Fengguang Wu
2012-11-23  2:10                       ` Jaegeuk Hanse
2012-11-25 20:08                       ` Rik van Riel
2012-11-24 15:06                     ` Metin Döşlü
2012-11-22 15:26             ` Fengguang Wu
2012-11-23  1:32               ` Jaegeuk Hanse
2012-11-23  2:25                 ` Fengguang Wu
2012-11-21 21:34   ` Johannes Weiner
2012-11-21 22:01     ` metin d
2012-11-22  0:48     ` Jaegeuk Hanse
2012-11-22  1:09       ` Johannes Weiner
2012-11-22  9:37         ` metin d
2012-11-22 13:16         ` Jaegeuk Hanse
2012-11-22 16:17           ` Johannes Weiner
2012-11-23  2:14             ` Jaegeuk Hanse
2012-11-23  1:58   ` Jaegeuk Hanse
2012-11-23  8:08     ` metin d
2012-11-23  8:17       ` Jaegeuk Hanse
2012-11-23  8:25         ` metin d
