* [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
@ 2013-02-14 12:03 ` Mel Gorman
From: Mel Gorman @ 2013-02-14 12:03 UTC
  To: Andrew Morton; +Cc: Rob van der Heij, Hugh Dickins, Linux-MM, LKML

Rob van der Heij reported the following (paraphrased) in private mail.

	The scenario is that I want to avoid backups filling up the page
	cache and purging stuff that is more likely to be used again (this
	is with s390x Linux on z/VM, so I don't give it so much memory
	that we wouldn't care anymore). So I have something with LD_PRELOAD
	that intercepts the close() call (from tar, in this case) and
	issues a posix_fadvise() just before closing the file.

	This mostly works, except for small files (less than 14 pages)
	that remain in the page cache after the fact.

Unfortunately, Rob has not had a chance to test this exact patch, but the
test program below should reproduce the problem he described.

The issue is the per-cpu pagevecs for LRU additions. If pages are added
by one CPU but fadvise() is called on another, the pages remain resident
because invalidate_mapping_pages() only drains the local CPU's pagevecs
via its call to pagevec_release(). The user-visible effect is that a
program that uses fadvise() properly is not obeyed.
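
For illustration, the batching can be modelled in userspace with a toy
program (this is only a sketch of the idea, not the kernel's code, although
the helper names loosely mirror mm/swap.c; PAGEVEC_SIZE really is 14, which
is where the "less than 14 pages" above comes from). Each page queued for
LRU addition holds an extra reference until the queueing CPU's batch is
drained, and invalidation skips pages with extra references:

#include <stdio.h>

#define PAGEVEC_SIZE	14	/* the kernel's LRU batch size */
#define NCPUS		2
#define NPAGES		5	/* a "small file": fewer than 14 pages */

struct page {
	int refcount;		/* 1 == only the page cache references it */
};

struct pagevec {
	int nr;
	struct page *pages[PAGEVEC_SIZE];
};

static struct pagevec lru_add_pvec[NCPUS];	/* one pending batch per CPU */

/* Queue a page for LRU addition; the batch takes a reference. */
void lru_cache_add(struct page *page, int cpu)
{
	struct pagevec *pvec = &lru_add_pvec[cpu];

	page->refcount++;
	pvec->pages[pvec->nr++] = page;
	/* a full batch would be flushed to the LRU here; this toy
	 * model never fills one, just like a small file does not */
}

/* Flush one CPU's batch, dropping the references it held. */
void lru_add_drain(int cpu)
{
	struct pagevec *pvec = &lru_add_pvec[cpu];

	while (pvec->nr > 0)
		pvec->pages[--pvec->nr]->refcount--;
}

/* Invalidation succeeds only if nothing but the page cache holds the page. */
int try_invalidate(struct page *page)
{
	return page->refcount == 1;
}

int main(void)
{
	struct page pages[NPAGES];
	int i, freed;

	for (i = 0; i < NPAGES; i++) {
		pages[i].refcount = 1;		/* the page cache's reference */
		lru_cache_add(&pages[i], 0);	/* file written on CPU 0 */
	}

	/* fadvise() running on CPU 1 drains only the local batch */
	lru_add_drain(1);
	for (freed = 0, i = 0; i < NPAGES; i++)
		freed += try_invalidate(&pages[i]);
	printf("after local drain:  %d/%d invalidated\n", freed, NPAGES);

	/* what lru_add_drain_all() provides: drain every CPU's batch */
	for (i = 0; i < NCPUS; i++)
		lru_add_drain(i);
	for (freed = 0, i = 0; i < NPAGES; i++)
		freed += try_invalidate(&pages[i]);
	printf("after global drain: %d/%d invalidated\n", freed, NPAGES);
	return 0;
}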

A possible fix would be to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs whenever a
pagevec page could not be discarded. The downside is that an inode cache
shrink would then send a global IPI, and memory pressure potentially causing
global IPI storms would be very undesirable.

Instead, this patch checks during fadvise(POSIX_FADV_DONTNEED) whether
invalidate_mapping_pages() discarded all the requested pages. If only a
subset of the pages was discarded, it drains the LRU pagevecs and tries
again. If the second attempt fails, it assumes the pages are mapped, locked
or dirty and leaves them alone. With this patch, an application using
fadvise() correctly will be obeyed, but there is a downside: a malicious
application can force the kernel to send global IPIs and increase overhead.

If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue, but a system call that does not work as
advertised is weak.

The following test program demonstrates the problem. It should never report
that pages are still resident, but does so without this patch. It assumes
that CPUs 0 and 1 exist.

#define _GNU_SOURCE		/* for cpu_set_t, CPU_SET() and sched_setaffinity() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

#ifndef FILESIZE_PAGES
#define FILESIZE_PAGES 8	/* assumed default; small enough to trigger the bug */
#endif

int main() {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare a buffer for writing */
	expected = FILESIZE_PAGES * pagesize;
	buf = malloc(expected + 1);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}
	buf[expected] = 0;
	memset(buf, 'a', expected);

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");
	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	munmap(buf, expected);
	close(fd);
	free(vec);
	exit(EXIT_SUCCESS);
}
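
The program builds with something like "gcc -o fadvise-test fadvise-test.c"
(the file name here is arbitrary); the FILESIZE_PAGES default above is only
an assumed value and can be overridden with -DFILESIZE_PAGES=<n>.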

Cc: stable@vger.kernel.org
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/fadvise.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index a47f0f5..909ec55 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
 #include <linux/fadvise.h>
 #include <linux/writeback.h>
 #include <linux/syscalls.h>
+#include <linux/swap.h>
 
 #include <asm/unistd.h>
 
@@ -120,9 +121,22 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);
 
-		if (end_index >= start_index)
-			invalidate_mapping_pages(mapping, start_index,
+		if (end_index >= start_index) {
+			unsigned long count = invalidate_mapping_pages(mapping,
+						start_index, end_index);
+
+			/*
+			 * If fewer pages were invalidated than expected then
+			 * it is possible that some of the pages were on
+			 * a per-cpu pagevec for a remote CPU. Drain all
+			 * pagevecs and try again.
+			 */
+			if (count < (end_index - start_index + 1)) {
+				lru_add_drain_all();
+				invalidate_mapping_pages(mapping, start_index,
 						end_index);
+			}
+		}
 		break;
 	default:
 		ret = -EINVAL;



* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-14 12:03 ` Mel Gorman
@ 2013-02-14 17:07   ` Rob van der Heij
From: Rob van der Heij @ 2013-02-14 17:07 UTC
  To: Mel Gorman; +Cc: Andrew Morton, Hugh Dickins, Linux-MM, LKML

On 14 February 2013 13:03, Mel Gorman <mgorman@suse.de> wrote:
> Rob van der Heij reported the following (paraphrased) in private mail.
>
>         The scenario is that I want to avoid backups filling up the page
>         cache and purging stuff that is more likely to be used again (this
>         is with s390x Linux on z/VM, so I don't give it so much memory
>         that we wouldn't care anymore). So I have something with LD_PRELOAD
>         that intercepts the close() call (from tar, in this case) and
>         issues a posix_fadvise() just before closing the file.
>
>         This mostly works, except for small files (less than 14 pages)
>         that remain in the page cache after the fact.
>
> Unfortunately, Rob has not had a chance to test this exact patch, but the
> test program below should reproduce the problem he described.

I'm happy with the patch and couldn't find another scenario that
breaks. This is going to help us reduce the footprint in virtualized
environments, as long as the programs are well behaved. Thanks.

Reported-and-tested-by: Rob van der Heij <rvdheij@gmail.com>

> [...]


* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-14 12:03 ` Mel Gorman
@ 2013-02-14 20:39   ` Andrew Morton
From: Andrew Morton @ 2013-02-14 20:39 UTC
  To: Mel Gorman; +Cc: Rob van der Heij, Hugh Dickins, Linux-MM, LKML

On Thu, 14 Feb 2013 12:03:49 +0000
Mel Gorman <mgorman@suse.de> wrote:

> Rob van der Heij reported the following (paraphrased) in private mail.
> 
> 	The scenario is that I want to avoid backups filling up the page
> 	cache and purging stuff that is more likely to be used again (this
> 	is with s390x Linux on z/VM, so I don't give it so much memory
> 	that we wouldn't care anymore). So I have something with LD_PRELOAD
> 	that intercepts the close() call (from tar, in this case) and
> 	issues a posix_fadvise() just before closing the file.
> 
> 	This mostly works, except for small files (less than 14 pages)
> 	that remain in the page cache after the fact.

Sigh.  We've had the "my backups swamp pagecache" thing for 15 years
and it's still happening.

It should be possible nowadays to toss your backup application into a
container to constrain its pagecache usage.  So we can type

	run-in-a-memcg -m 200MB /my/backup/program

and voila.  Does such a script exist and work?

> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -17,6 +17,7 @@
>  #include <linux/fadvise.h>
>  #include <linux/writeback.h>
>  #include <linux/syscalls.h>
> +#include <linux/swap.h>
>  
>  #include <asm/unistd.h>
>  
> @@ -120,9 +121,22 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>  		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
>  		end_index = (endbyte >> PAGE_CACHE_SHIFT);
>  
> -		if (end_index >= start_index)
> -			invalidate_mapping_pages(mapping, start_index,
> +		if (end_index >= start_index) {
> +			unsigned long count = invalidate_mapping_pages(mapping,
> +						start_index, end_index);
> +
> +			/*
> +			 * If fewer pages were invalidated than expected then
> +			 * it is possible that some of the pages were on
> +			 * a per-cpu pagevec for a remote CPU. Drain all
> +			 * pagevecs and try again.
> +			 */
> +			if (count < (end_index - start_index + 1)) {
> +				lru_add_drain_all();
> +				invalidate_mapping_pages(mapping, start_index,
>  						end_index);
> +			}
> +		}
>  		break;
>  	default:
>  		ret = -EINVAL;

Those LRU pagevecs are a right pain.  They provided useful gains way
back when I first inflicted them upon Linux, but it would be nice to
confirm whether they're still worthwhile and if so, whether the
benefits can be replicated with some less intrusive scheme.



* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-14 20:39   ` Andrew Morton
@ 2013-02-15 11:04     ` Michal Hocko
From: Michal Hocko @ 2013-02-15 11:04 UTC
  To: Andrew Morton; +Cc: Mel Gorman, Rob van der Heij, Hugh Dickins, Linux-MM, LKML

On Thu 14-02-13 12:39:26, Andrew Morton wrote:
> On Thu, 14 Feb 2013 12:03:49 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Rob van der Heij reported the following (paraphrased) in private mail.
> > 
> > 	The scenario is that I want to avoid backups filling up the page
> > 	cache and purging stuff that is more likely to be used again (this
> > 	is with s390x Linux on z/VM, so I don't give it so much memory
> > 	that we wouldn't care anymore). So I have something with LD_PRELOAD
> > 	that intercepts the close() call (from tar, in this case) and
> > 	issues a posix_fadvise() just before closing the file.
> > 
> > 	This mostly works, except for small files (less than 14 pages)
> > 	that remain in the page cache after the fact.
> 
> Sigh.  We've had the "my backups swamp pagecache" thing for 15 years
> and it's still happening.
> 
> It should be possible nowadays to toss your backup application into a
> container to constrain its pagecache usage.  So we can type
> 
> 	run-in-a-memcg -m 200MB /my/backup/program
> 
> and voila.  Does such a script exist and work?

The script would be as simple as:
cgcreate -g memory:backups/`whoami`
cgset -r memory.limit_in_bytes=200MB backups/`whoami`
cgexec -g memory:backups/`whoami` /my/backup/program

It just expects that the admin sets up a backups group which allows the
user to create a subgroup (w permission on the directory) and probably sets
some reasonable cap for all backups.
[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-15 11:04     ` Michal Hocko
@ 2013-02-15 16:14       ` Rob van der Heij
From: Rob van der Heij @ 2013-02-15 16:14 UTC
  To: Michal Hocko; +Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Linux-MM, LKML

On 15 February 2013 12:04, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 14-02-13 12:39:26, Andrew Morton wrote:
>> On Thu, 14 Feb 2013 12:03:49 +0000
>> Mel Gorman <mgorman@suse.de> wrote:
>>
>> > Rob van der Heij reported the following (paraphrased) in private mail.
>> >
>> >     The scenario is that I want to avoid backups filling up the page
>> >     cache and purging stuff that is more likely to be used again (this
>> >     is with s390x Linux on z/VM, so I don't give it so much memory
>> >     that we wouldn't care anymore). So I have something with LD_PRELOAD
>> >     that intercepts the close() call (from tar, in this case) and
>> >     issues a posix_fadvise() just before closing the file.
>> >
>> >     This mostly works, except for small files (less than 14 pages)
>> >     that remain in the page cache after the fact.
>>
>> Sigh.  We've had the "my backups swamp pagecache" thing for 15 years
>> and it's still happening.
>>
>> It should be possible nowadays to toss your backup application into a
>> container to constrain its pagecache usage.  So we can type
>>
>>       run-in-a-memcg -m 200MB /my/backup/program
>>
>> and voila.  Does such a script exist and work?
>
> The script would be as simple as:
> cgcreate -g memory:backups/`whoami`
> cgset -r memory.limit_in_bytes=200MB backups/`whoami`
> cgexec -g memory:backups/`whoami` /my/backup/program
>
> It just expects that admin sets up backups group which allows the user
> to create a subgroup (w permission on the directory) and probably set up
> some reasonable cap for all backups

Cool. This is promising enough to bridge my skills gap. It appears to
work as promised, but I would have to understand why it takes
significantly more CPU than my ugly posix_fadvise() call on close...

Rob


* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-15 16:14       ` Rob van der Heij
@ 2013-02-15 16:48         ` Michal Hocko
From: Michal Hocko @ 2013-02-15 16:48 UTC
  To: Rob van der Heij; +Cc: Andrew Morton, Mel Gorman, Hugh Dickins, Linux-MM, LKML

On Fri 15-02-13 17:14:10, Rob van der Heij wrote:
> On 15 February 2013 12:04, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 14-02-13 12:39:26, Andrew Morton wrote:
> >> On Thu, 14 Feb 2013 12:03:49 +0000
> >> Mel Gorman <mgorman@suse.de> wrote:
> >>
> >> > Rob van der Heij reported the following (paraphrased) in private mail.
> >> >
> >> >     The scenario is that I want to avoid backups filling up the page
> >> >     cache and purging stuff that is more likely to be used again (this
> >> >     is with s390x Linux on z/VM, so I don't give it so much memory
> >> >     that we wouldn't care anymore). So I have something with LD_PRELOAD
> >> >     that intercepts the close() call (from tar, in this case) and
> >> >     issues a posix_fadvise() just before closing the file.
> >> >
> >> >     This mostly works, except for small files (less than 14 pages)
> >> >     that remain in the page cache after the fact.
> >>
> >> Sigh.  We've had the "my backups swamp pagecache" thing for 15 years
> >> and it's still happening.
> >>
> >> It should be possible nowadays to toss your backup application into a
> >> container to constrain its pagecache usage.  So we can type
> >>
> >>       run-in-a-memcg -m 200MB /my/backup/program
> >>
> >> and voila.  Does such a script exist and work?
> >
> > The script would be as simple as:
> > cgcreate -g memory:backups/`whoami`
> > cgset -r memory.limit_in_bytes=200MB backups/`whoami`
> > cgexec -g memory:backups/`whoami` /my/backup/program
> >
> > It just expects that admin sets up backups group which allows the user
> > to create a subgroup (w permission on the directory) and probably set up
> > some reasonable cap for all backups
> 
> Cool. This is promising enough to bridge my skills gap. It appears to
> work as promised, but I would have to understand why it takes
> significantly more CPU than my ugly posix_fadvise() call on close...

I would guess that a lot of reclaim is the answer. Note that each
memcg has its own LRU and the limit is enforced by per-group
reclaim. I wouldn't expect the difference to be very big, though.
What do you mean by significantly more?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fadvise: Drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  2013-02-14 20:39   ` Andrew Morton
@ 2013-02-19 11:57     ` Mel Gorman
From: Mel Gorman @ 2013-02-19 11:57 UTC
  To: Andrew Morton; +Cc: Rob van der Heij, Hugh Dickins, Linux-MM, LKML

On Thu, Feb 14, 2013 at 12:39:26PM -0800, Andrew Morton wrote:
> On Thu, 14 Feb 2013 12:03:49 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Rob van der Heij reported the following (paraphrased) in private mail.
> > 
> > 	The scenario is that I want to avoid backups filling up the page
> > 	cache and purging stuff that is more likely to be used again (this
> > 	is with s390x Linux on z/VM, so I don't give it so much memory
> > 	that we wouldn't care anymore). So I have something with LD_PRELOAD
> > 	that intercepts the close() call (from tar, in this case) and
> > 	issues a posix_fadvise() just before closing the file.
> > 
> > 	This mostly works, except for small files (less than 14 pages)
> > 	that remain in the page cache after the fact.
> 
> Sigh.  We've had the "my backups swamp pagecache" thing for 15 years
> and it's still happening.
> 

Yes. There have been variations of it too such as applications being pushed
prematurely into swap. I'm not certain how well we currently handle that
because I haven't checked in a few months.

> It should be possible nowadays to toss your backup application into a
> container to constrain its pagecache usage.  So we can type
> 
> 	run-in-a-memcg -m 200MB /my/backup/program
> 
> and voila.  Does such a script exist and work?
> 

Michal already gave an example. It might run slower if the backup
application has to stall in direct reclaim to keep the container within its
limits, though.

> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/fadvise.h>
> >  #include <linux/writeback.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/swap.h>
> >  
> >  #include <asm/unistd.h>
> >  
> > @@ -120,9 +121,22 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >  		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
> >  		end_index = (endbyte >> PAGE_CACHE_SHIFT);
> >  
> > -		if (end_index >= start_index)
> > -			invalidate_mapping_pages(mapping, start_index,
> > +		if (end_index >= start_index) {
> > +			unsigned long count = invalidate_mapping_pages(mapping,
> > +						start_index, end_index);
> > +
> > +			/*
> > +			 * If fewer pages were invalidated than expected then
> > +			 * it is possible that some of the pages were on
> > +			 * a per-cpu pagevec for a remote CPU. Drain all
> > +			 * pagevecs and try again.
> > +			 */
> > +			if (count < (end_index - start_index + 1)) {
> > +				lru_add_drain_all();
> > +				invalidate_mapping_pages(mapping, start_index,
> >  						end_index);
> > +			}
> > +		}
> >  		break;
> >  	default:
> >  		ret = -EINVAL;
> 
> Those LRU pagevecs are a right pain.  They provided useful gains way
> back when I first inflicted them upon Linux, but it would be nice to
> confirm whether they're still worthwhile and if so, whether the
> benefits can be replicated with some less intrusive scheme.
> 

I know. Unfortunately I've had "Implement pagevec removal and test" on my
TODO list for the guts of a year now. It's long overdue that I actually sit
down and just do it. It's a similar story for the per-cpu lists in front of
the page allocator, which are overdue a check on whether they can be
replaced. I actually have a prototype replacement for that lying around, but
it ran slower in tests and has bit-rotted since, as it was based on kernel
3.4.

-- 
Mel Gorman
SUSE Labs
