* [RFC v4 0/3] Support volatile for anonymous range
@ 2012-12-18  6:47 ` Minchan Kim
From: Minchan Kim @ 2012-12-18  6:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

This is still an RFC because we need more input from user-space
people and more discussion about the interface and the reclaim policy
for volatile pages. I also want to expand this concept to a tmpfs
volatile range if that is possible without a big performance drop for
the anonymous volatile range. (Let's define our terms: anon volatile
vs. tmpfs volatile? John?)

NOTE: I didn't consider THP/KSM, so you should disable them for testing.

I hope for more input from user-space allocator people, and for them
to test the patch with their allocators, because getting real value
out of this might require a design change in their arena management.

Changelog from v4

 * Add new system calls mvolatile/mnovolatile
 * Send SIGBUS when the user tries to access a volatile range
 * Rebased on v3.7
 * Applied a bug fix from John Stultz. Thanks!

Changelog from v3

 * Remove madvise(addr, length, MADV_NOVOLATILE).
 * Add a vmstat counter for the number of discarded volatile pages
 * Discard volatile pages without promotion in the reclaim path

This is based on v3.7

- What is mvolatile(addr, length)?

  It's a hint the user delivers to the kernel so that the kernel can
  *discard* pages in the range at any time.

- What happens if the user accesses a page (i.e., virtual address)
  that the kernel has discarded?

  The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?

  Call mnovolatile(addr, length) on the range before accessing it
  again, as sketched below.
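
  For illustration, a minimal sketch of that ordering, using the
  syscall wrappers from the attached test program (refill() and use()
  are hypothetical application routines):

	if (mnovolatile(buf, len) == 1)	/* 1: pages were purged */
		refill(buf, len);	/* regenerate the lost contents */
	use(buf, len);			/* safe now: no SIGBUS */
	mvolatile(buf, len);		/* hand it back as discardable */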

- What happens if the user accesses a page (i.e., virtual address)
  that the kernel has not discarded?

  The user sees the old data, without a page fault.

- How is it different from madvise(DONTNEED)?

  System call semantics

  DONTNEED guarantees that the user always sees zero-filled pages
  after the madvise call, while with mvolatile the user may see the
  old data or encounter SIGBUS.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead increases linearly with the number of mapped pages. Worse,
  if the user then writes to a zapped page, a page fault plus a page
  allocation plus a memset happens.

  mvolatile just marks a flag on the range (i.e., the VMA) instead of
  zapping all the PTEs in the VMA, so it doesn't touch the PTEs at all.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because mvolatile just marks
     a flag on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate overhead (e.g., zapping PTEs + page
     fault + page allocation + memset(PAGE_SIZE)) if memory pressure
     isn't severe.

  3. It has the potential to zap all PTEs and free the pages when
     memory pressure is severe, so the reclaim overhead could
     disappear. - TODO

- Isn't there any drawback?

  madvise(DONTNEED) doesn't need the exclusive mmap_sem, so concurrent
  page faults from other threads are allowed. But m[no]volatile needs
  the exclusive mmap_sem, so other threads would block if they try to
  access not-yet-mapped pages. That's why I designed m[no]volatile's
  overhead to be as small as possible.

  It could also suffer from increased max RSS usage: madvise(DONTNEED)
  deallocates pages instantly when the system call is issued, while
  mvolatile delays that until memory pressure happens, so if memory
  pressure becomes severe because of the max RSS increase, the system
  would suffer. For that, the allocator needs some balancing logic, or
  the kernel might handle it by zapping the pages even though the user
  only called mvolatile, if memory pressure is severe.
  The problem is how we know that memory pressure is severe.
  One solution is to check whether kswapd is active (see the sketch
  below); another is Anton's mempressure work, so the allocator can
  handle it.
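
  As a purely illustrative sketch of the kswapd heuristic (an
  assumption about what the allocator's balance logic could look
  like, not part of this series; needs <stdio.h> and <string.h>, and
  the counter names vary across kernel versions): poll /proc/vmstat
  and treat growth of the pgscan_kswapd* counters as pressure.

	/* Sum every "pgscan_kswapd*" counter in /proc/vmstat. */
	static unsigned long long kswapd_scanned(void)
	{
		char name[64];
		unsigned long long val, sum = 0;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 0;
		while (fscanf(f, "%63s %llu", name, &val) == 2)
			if (!strncmp(name, "pgscan_kswapd", 13))
				sum += val;
		fclose(f);
		return sum;
	}

	/* Has kswapd scanned pages since the last poll? */
	static int memory_pressure_rising(void)
	{
		static unsigned long long last;
		unsigned long long now = kswapd_scanned();
		int rising = now > last;

		last = now;
		return rising;
	}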

- What is this targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, and the
  heap management of virtual machines like Dalvik. It also comes in
  handy for embedded systems which don't have a swap device and so
  can't reclaim anonymous pages; by discarding instead of swapping
  out, it can be used on such non-swap systems.
  For that, we have to age the anon LRU list even though we don't
  have swap, because I don't want to discard volatile pages at top
  priority when memory pressure happens. Volatile in this patch means
  "we don't need to swap this out because the user can handle data
  disappearing suddenly", NOT "this is useless, so hurry up and
  reclaim it". So I want to apply the same aging rule as for normal
  pages to them.

  Background aging of anonymous pages on a non-swap system would be a
  trade-off for getting this feature. We actually did exactly that
  until [1] was merged two years ago, and I believe the gain from this
  patch will beat the cost of anon LRU aging once allocators start to
  use it. (This patch doesn't include background aging for the
  non-swap case, but adding it is trivial if we decide to.)

  As another option, we could zap the range like madvise(DONTNEED)
  does when mvolatile is called and we don't have swap space.

- Stupid performance test

  I attach a test program/script which is utter crap: no current smart
  allocator would behave like this, so we need more practical data
  with a real allocator.

  KVM - 8 core, 2G

VOLATILE test
13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata 2624096maxresident)k
0inputs+0outputs (0major+164050minor)pagefaults 0swaps

DONTNEED test
23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata 213088maxresident)k
0inputs+0outputs (0major+16384210minor)pagefaults 0swaps

  x86-64 - 12 core, 2G

VOLATILE test
33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata 3935008maxresident)k
0inputs+0outputs (0major+245989minor)pagefaults 0swaps

DONTNEED test
28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata 387776maxresident)k

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Any comments are welcome!

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Minchan Kim (3):
  Introduce new system call mvolatile
  Discard volatile page
  add PGVOLATILE vmstat count

 arch/x86/syscalls/syscall_64.tbl |    3 +-
 include/linux/mm.h               |    1 +
 include/linux/mm_types.h         |    2 +
 include/linux/rmap.h             |    3 +
 include/linux/syscalls.h         |    2 +
 include/linux/vm_event_item.h    |    2 +-
 mm/Makefile                      |    4 +-
 mm/huge_memory.c                 |    9 +-
 mm/ksm.c                         |    3 +-
 mm/memory.c                      |    2 +
 mm/migrate.c                     |    6 +-
 mm/mlock.c                       |    5 +-
 mm/mmap.c                        |    2 +-
 mm/mvolatile.c                   |  396 ++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |   97 +++++++++-
 mm/vmscan.c                      |    4 +
 mm/vmstat.c                      |    1 +
 17 files changed, 527 insertions(+), 15 deletions(-)
 create mode 100644 mm/mvolatile.c

================== 8< =============================

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SYS_mvolatile 313
#define SYS_mnovolatile 314

#define ALLOC_SIZE (8 << 20)
#define MAP_SIZE  (ALLOC_SIZE * 10)
#define PAGE_SIZE (1 << 12)
#define RETRY 100

pthread_barrier_t barrier;
int mode;
#define VOLATILE_MODE 1

static int mvolatile(void *addr, size_t length)
{
	return syscall(SYS_mvolatile, addr, length);
}

static int mnovolatile(void *addr, size_t length)
{
	return syscall(SYS_mnovolatile, addr, length);
}

void *thread_entry(void *data)
{
	unsigned long i;
	cpu_set_t set;
	int cpu = *(int*)data;
	void *mmap_area;
	int retry = RETRY;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	/* Single PROT_NONE page; presumably a separator so each thread's
	 * big mapping doesn't land adjacent to a neighbor's. Use fd = -1,
	 * the portable value for MAP_ANONYMOUS. */
	mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
					MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (mmap_area == MAP_FAILED) {
		fprintf(stderr, "Fail to mmap [%d]\n", *(int*)data);
		exit(1);
	}

	pthread_barrier_wait(&barrier);

	while(retry--) {
		if (mode == VOLATILE_MODE) {
			mvolatile(mmap_area, MAP_SIZE);
			for (i = 0; i < MAP_SIZE; i+= ALLOC_SIZE) {
				mnovolatile(mmap_area + i, ALLOC_SIZE);
				memset(mmap_area + i, i, ALLOC_SIZE);
				mvolatile(mmap_area + i, ALLOC_SIZE);
			}
		} else {
			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
				memset(mmap_area + i, i, ALLOC_SIZE);
				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
			}
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, nr_thread;
	int *data;

	if (argc < 3)
		return 1;

	nr_thread = atoi(argv[1]);
	mode = atoi(argv[2]);

	pthread_t *thread = malloc(sizeof(pthread_t) * nr_thread);
	data = malloc(sizeof(int) * nr_thread);
	pthread_barrier_init(&barrier, NULL, nr_thread);

	for (i = 0; i < nr_thread; i++) {
		data[i] = i;
		if (pthread_create(&thread[i], NULL, thread_entry, &data[i])) {
			/* pthread_create() returns an error number and does
			 * not set errno, so perror() would be misleading. */
			fprintf(stderr, "Fail to create thread\n");
			exit(1);
		}
	}

	for (i = 0; i < nr_thread; i++) {
		if (pthread_join(thread[i], NULL))
			fprintf(stderr, "Fail to join thread\n");
		printf("[%d] thread done\n", i);
	}

	return 0;
}
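
For reference, the program takes a thread count and a mode as its
arguments: mode 1 selects the VOLATILE path, any other value the
DONTNEED path. So the numbers above come from runs along the lines of
(the binary name is an assumption):

	$ time ./volatile-test 8 1	# VOLATILE test
	$ time ./volatile-test 8 0	# DONTNEED test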
-- 
1.7.9.5



* [RFC v4 1/3] Introduce new system call mvolatile
  2012-12-18  6:47 ` Minchan Kim
@ 2012-12-18  6:47   ` Minchan Kim
From: Minchan Kim @ 2012-12-18  6:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

This patch adds the new system calls m[no]volatile. If some user asks
for an is_volatile system call, that could be added, too.

The reason I introduced new system calls instead of madvise is that
m[no]volatile's VMA handling is totally different from madvise's VMA
handling.

1) m[no]volatile should succeed even if the range includes unmapped
   or non-volatile ranges. It just skips such ranges instead of
   stopping and returning an error when it encounters an invalid
   range. That is convenient for the user, who otherwise would have
   to make several calls over small ranges.
   - Suggested by John Stultz

2) The propagation of purged state between VMAs should be atomic with
   respect to m[no]volatile and reclaim. For that, we would have to
   tweak vma_merge/split_vma's anon_vma handling. Those are very
   common operations and I don't want to add unnecessary overhead and
   code there if it is avoidable.

3) The purged state of a volatile range should be propagated out to
   the user by the mnovolatile operation, and that should be atomic
   with reclaim, too.

To meet the above requirements, I introduced the new system calls
m[no]volatile. They don't change vma_merge/split_vma; instead they
repair the VMAs after the VMA operation.

mvolatile(start, len)'s semantics are as follows.

1) It makes the range (start, len) volatile even if the range
   includes unmapped areas, special mappings and mlocked areas, which
   are just skipped. It doesn't support hugepages and KSM yet. - TODO
   Return -EINVAL if the range doesn't include a single valid VMA.
   Return -ENOMEM, interrupting the range operation, if memory is not
   enough to merge/split VMAs. In this case, some of the range would
   be volatile and the rest not, so the user has to call mvolatile
   again after cancelling the whole range with mnovolatile (see the
   sketch after the mnovolatile semantics below).
   Return 0 if the range consists only of proper VMAs.
   Return 1 if part of the range includes a hole/huge/ksm/mlock/special
   area.

2) If the user calls mvolatile on a range that is already a volatile
   VMA, even one in purged state, the VOLATILE attribute remains but
   the purged state is reset. I expect some users will want to split
   a volatile VMA into smaller ranges. Although they could do that
   with mnovolatile(whole range) followed by several mvolatile(smaller
   range) calls, this behavior avoids the mnovolatile when they don't
   care about the purged state. I'm not sure we really need this, so
   I would like to hear opinions. Unfortunately, the current
   implementation doesn't split the volatile VMA at the new range
   boundaries in this case. I forgot to implement that in this version
   but decided to send the series anyway to gather opinions, because
   implementing it is rather trivial once we decide.

mnovolatile(start, len)'s semantics are as follows.

1) It makes the range (start, len) non-volatile even if the range
   includes unmapped areas, special mappings and non-volatile ranges,
   which are just skipped.

2) If the range was purged, it returns 1 regardless of whether the
   range includes invalid areas.

3) It returns -ENOMEM if the system doesn't have enough memory for
   the VMA operation.

4) It returns -EINVAL if the range doesn't include a single valid
   VMA.

5) If the user tries to access a purged range without an mnovolatile
   call, they get SIGBUS, which the next patch implements.
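
For illustration, a minimal sketch of the error handling described
above, assuming userspace wrappers around syscall(2) like the ones in
the test program from the cover letter (so kernel errors surface as
-1 with errno set); recover() is a hypothetical callback:

	if (mvolatile(addr, len) < 0) {
		if (errno == ENOMEM)
			/* Partially applied: cancel the whole range so
			 * it is back in a known, non-volatile state. */
			mnovolatile(addr, len);
		return -1;
	}
	/* ... later, before touching the data again ... */
	switch (mnovolatile(addr, len)) {
	case 1:			/* purged: the contents are gone */
		recover(addr, len);
		break;
	case 0:			/* not purged: the data is intact */
		break;
	default:		/* -1: EINVAL or ENOMEM, see errno */
		return -1;
	}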

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/syscalls/syscall_64.tbl |    3 +-
 include/linux/mm.h               |    1 +
 include/linux/mm_types.h         |    2 +
 include/linux/syscalls.h         |    2 +
 mm/Makefile                      |    4 +-
 mm/huge_memory.c                 |    9 +-
 mm/ksm.c                         |    3 +-
 mm/mlock.c                       |    5 +-
 mm/mmap.c                        |    2 +-
 mm/mvolatile.c                   |  396 ++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |    2 +
 11 files changed, 419 insertions(+), 10 deletions(-)
 create mode 100644 mm/mvolatile.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a582bfe..7da9c4a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -319,7 +319,8 @@
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
-
+313	common	mvolatile		sys_mvolatile
+314	common	mnovolatile		sys_mnovolatile
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..94742c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
+#define VM_VOLATILE	0x00001000	/* Pages could be discarded without swapout */
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..ef2a4a4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -275,6 +275,8 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	/* Page in this vma was discarded */
+	bool purged;			/* Serialized by anon_vma's mutex */
 };
 
 struct core_thread {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 727f0cd..a8ded1c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
 asmlinkage long sys_mlockall(int flags);
 asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..962b69f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -4,8 +4,8 @@
 
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   mvolatile.o mlock.o mmap.o mprotect.o mremap.o msync.o \
+			   rmap.o vmalloc.o pagewalk.o pgtable-generic.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3fe062d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1477,7 +1477,7 @@ out:
 	return ret;
 }
 
-#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
+#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE|VM_VOLATILE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
@@ -1641,8 +1641,11 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
 		 * page fault if needed.
 		 */
 		return 0;
-	if (vma->vm_ops)
-		/* khugepaged not yet working on file or special mappings */
+	if (vma->vm_ops || vma->vm_flags & VM_VOLATILE)
+		/*
+		 * khugepaged not yet working on file, special mappings
+		 * or volatile vmas.
+		 */
 		return 0;
 	VM_BUG_ON(vma->vm_flags & VM_NO_THP);
 	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
diff --git a/mm/ksm.c b/mm/ksm.c
index ae539f0..2775f59 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1486,7 +1486,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		 */
 		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
 				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP))
+				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP   |
+				 VM_VOLATILE))
 			return 0;		/* just ignore the advice */
 
 #ifdef VM_SAO
diff --git a/mm/mlock.c b/mm/mlock.c
index f0b9ce5..db3a477 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -316,8 +316,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int ret = 0;
 	int lock = !!(newflags & VM_LOCKED);
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || (vma->vm_flags &
+		 (VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
+		 vma == get_gate_vma(current->mm))
 		goto out;	/* don't set VM_LOCKED,  don't count */
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a796c4..e4ac12d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -808,7 +808,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	 * We later require that vma->vm_flags == vm_flags,
 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
 	 */
-	if (vm_flags & VM_SPECIAL)
+	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
 		return NULL;
 
 	if (prev)
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
new file mode 100644
index 0000000..ab5de2b
--- /dev/null
+++ b/mm/mvolatile.c
@@ -0,0 +1,396 @@
+/*
+ *	linux/mm/mvolatile.c
+ *
+ *  Copyright 2012 Minchan Kim
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm_types.h>
+#include <linux/mm.h>
+#include <linux/syscalls.h>
+#include <linux/rmap.h>
+#include <linux/mempolicy.h>
+
+#define NO_PURGED	0
+#define PURGED		1
+
+/*
+ * N: Normal VMA
+ * P: Purged volatile VMA
+ * V: Volatile VMA
+ *
+ * Assume that each VMA has two blocks, so cases 1-8 consist of three VMAs.
+ * For example, NNPPVV means VMA1 is a normal VMA, VMA2 a purged volatile
+ * VMA, and VMA3 a volatile VMA. As another example, NNPVVV means VMA1 is a
+ * normal VMA, VMA2-1 a purged volatile VMA, and VMA2-2 a volatile VMA.
+ *
+ * Cases 7,8 create a new VMA, which we call VMA4; it can be located before
+ * VMA2 or after it.
+ *
+ * Notice: The merge between volatile VMAs shouldn't happen.
+ * If we call mnovolatile(VMA2),
+ *
+ * Case 1 NNPPVV -> NNNNVV
+ * Case 2 VVPPNN -> VVNNNN
+ * Case 3 NNPPNN -> NNNNNN
+ * Case 4 NNPPVV -> NNNPVV
+ * case 5 VVPPNN -> VVPNNN
+ * case 6 VVPPVV -> VVNNVV
+ * case 7 VVPPVV -> VVNPVV
+ * case 8 VVPPVV -> VVPNVV
+ */
+static int do_mnovolatile(struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, bool *purged)
+{
+	pgoff_t pgoff;
+	bool old_purged;
+	unsigned long this_start, this_end;
+	unsigned long next_start, next_end;
+	vm_flags_t new_flags, old_flags;
+	struct mm_struct *mm = vma->vm_mm;
+	int error = 0;
+	struct vm_area_struct *next = NULL;
+	next_start = next_end = 0;
+
+	old_flags = vma->vm_flags;
+	new_flags = old_flags & ~VM_VOLATILE;
+
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		goto success;
+	}
+
+	/*
+	 * From now on, the purged state is frozen, which closes the race
+	 * with reclaim and makes the work easy.
+	 */
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = new_flags;
+	vma_unlock_anon_vma(vma);
+
+	/*
+	 * Setting vm_flags before the vma adjust/split has a problem with
+	 * flag propagation and with errors happening during the operation.
+	 * To prevent that, we need more tweaking.
+	 */
+	old_purged = vma->purged;
+	*purged |= old_purged;
+	vma->purged = false;
+
+	this_start = vma->vm_start;
+	this_end = vma->vm_end;
+
+	if (*prev) {
+		next = (*prev)->vm_next;
+		if (next) {
+			next_start = next->vm_start;
+			next_end = next->vm_end;
+		}
+	}
+
+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
+			vma->vm_file, pgoff, vma_policy(vma));
+	if (*prev) {
+		/* case 1 -> Nothing */
+		/* case 2 -> Nothing */
+		/* case 3 -> Nothing */
+		/* case 4 -> Set VMA2-2 with old_flags and old_purged */
+		/* case 5 -> Set VMA2-1 with old_flags and old_purged */
+		if ((*prev)->vm_end == this_end) /* case 1 */
+			goto next;
+		else if ((*prev)->vm_end == next_end) /* case 2 */
+			goto next;
+		else if ((*prev)->vm_end > next_end) /* case 3 */
+			goto next;
+		else if ((*prev)->vm_end > this_start) { /* case 4 */
+			vma_lock_anon_vma(next);
+			next->vm_flags = old_flags;
+			next->purged = old_purged;
+			vma_unlock_anon_vma(next);
+		} else if ((*prev)->vm_end < this_end) { /* case 5 */
+			vma_lock_anon_vma(*prev);
+			(*prev)->vm_flags = old_flags;
+			(*prev)->purged = old_purged;
+			vma_unlock_anon_vma(*prev);
+		}
+next:
+		vma = *prev;
+		goto success;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		struct vm_area_struct *tmp_vma;
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+		/* case 8 -> Set VMA4 with old_flags and old_purged */
+		tmp_vma = vma->vm_prev;
+		vma_lock_anon_vma(tmp_vma);
+		tmp_vma->vm_flags = old_flags;
+		tmp_vma->purged = old_purged;
+		vma_unlock_anon_vma(tmp_vma);
+	}
+
+	if (end != vma->vm_end) {
+		struct vm_area_struct *tmp_vma;
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+		/* case 7 -> Set VMA4 with old_flags and old_purged */
+		tmp_vma = vma->vm_next;
+		vma_lock_anon_vma(tmp_vma);
+		tmp_vma->vm_flags = old_flags;
+		tmp_vma->purged = old_purged;
+		vma_unlock_anon_vma(tmp_vma);
+	}
+
+success:
+	return 0;
+out:
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = old_flags;
+	vma->purged = old_purged;
+	vma_unlock_anon_vma(vma);
+	return error;
+}
+
+/* I didn't look into KSM/hugepage, so they are disabled */
+#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
+				VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
+
+static int do_mvolatile(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		unsigned long start, unsigned long end)
+{
+	int error = -EINVAL;
+	vm_flags_t new_flags = vma->vm_flags;
+	struct mm_struct *mm = vma->vm_mm;
+
+	new_flags |= VM_VOLATILE;
+
+	/* Note: the current version doesn't support volatile on file vmas */
+	if (vma->vm_file) {
+		*prev = vma;
+		goto out;
+	}
+
+	if (vma->vm_flags & VM_NO_VOLATILE ||
+			(vma == get_gate_vma(current->mm))) {
+		*prev = vma;
+		goto out;
+	}
+	/*
+	 * In case of calling mvolatile again,
+	 * we just reset the purged state.
+	 */
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		vma_lock_anon_vma(vma);
+		vma->purged = false;
+		vma_unlock_anon_vma(vma);
+		error = 0;
+		goto out;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+	}
+
+	if (end != vma->vm_end) {
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+	}
+
+	error = 0;
+
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = new_flags;
+	vma_unlock_anon_vma(vma);
+out:
+	return error;
+}
+
+/*
+ * Return -EINVAL if the range doesn't include a single valid vma.
+ * Return -ENOMEM, interrupting the range operation, if memory is not enough to
+ * merge/split vmas.
+ * Return 0 if range consists of only proper vmas.
+ * Return 1 if part of range includes hole/huge/ksm/mlock/special area.
+ */
+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	bool invalid = false;
+	int error = -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+			invalid = true;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mvolatile(vma, &prev, start, tmp);
+		if (error == -ENOMEM) {
+			up_write(&current->mm->mmap_sem);
+			return error;
+		}
+		if (error == -EINVAL)
+			invalid = true;
+		else
+			error = 0;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+	return error ? error : (invalid ? 1 : 0);
+}
+
+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	int ret, error = -EINVAL;
+	bool purged = false;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mnovolatile(vma, &prev, start, tmp, &purged);
+		if (error) {
+			WARN_ON(error != -ENOMEM);
+			goto out;
+		}
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+
+	if (error)
+		ret = error;
+	else if (purged)
+		ret = PURGED;
+	else
+		ret = NO_PURGED;
+
+	return ret;
+}
+
+/* Not intended to be merged; just for testing */
+SYSCALL_DEFINE2(mpurge, unsigned long, start, size_t, len)
+{
+	int error = -EINVAL;
+	unsigned long end;
+	struct vm_area_struct *vma;
+
+	down_read(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma(current->mm, start);
+	if (!vma || vma->vm_start >= end)
+		goto out;
+
+	if (!(vma->vm_flags & VM_VOLATILE))
+		goto out;
+
+	vma_lock_anon_vma(vma);
+	vma->purged = true; error = 0;
+	vma_unlock_anon_vma(vma);
+out:
+	up_read(&current->mm->mmap_sem);
+
+	return error;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..7f4493c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -308,6 +308,8 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	/* Propagate parent's purged state to child */
+	vma->purged = pvma->purged;
 	anon_vma_unlock(anon_vma);
 
 	return 0;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC v4 1/3] Introduce new system call mvolatile
@ 2012-12-18  6:47   ` Minchan Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-12-18  6:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

This patch adds new system call m[no]volatile. If some user asks
is_volatile system call, it could, too.

The reason why I introduced new system call instead of madvise is
m[no]volatile vma handling is totally different with madvise's vma
handling.

1) The m[no]volatile should be successful although the range includes
   unmapped or non-volatile range. It just skips such range without stop
   with returning error although it encounter invalid range.
   It makes user convenient without calling several calling of small range.
   - Suggested by John Stultz

2) The propagation of purged state between vmas should be atomic between
   m[no]volatile and reclaim. For it, we need to tweak vma_merge/split_vma's
   anon_vma handling. It's very common operation and I don't want to add
   unnecessary overhead and code if it is possbile.

3) The purged state of volatile range should be propagated out to user
   with mnovolatile operation and it should be atomic with reclaim, too.

For meeting above requirements, I introudced new system call m[no]volatile.
It doesn't change vma_merge/split and repair vmas after vma operation.

So mvolatile(start, len)'s semantics is following as.

1) It makes range(start, len) as volatile although the range includes
   unmapped area, speacial mapping and mlocked area which are just skipped.
   Now it doesn't support Hugepage and KSM. - TODO
   Return -EINVAL if range doesn't include a right vma at all.
   Return -ENOMEM with interrupting range opeartion if memory is not
   enough to merge/split vmas. In this case, some range would be volatile
   and others not. So user have to recall mvolatile after he cancel all
   range by mnovolatile.
   Return 0 if range consists of only proper vmas.
   Return 1 if part of range includes hole/huge/ksm/mlock/special area.

2) If user calls mvolatile to the range which was already volatile VMA and
   even purged state, VOLATILE attributes still remains but purged state
   is reset. I expect some user want to split volatile vma into smaller
   ranges. Although he can do it for mnovlatile(whole range) and serveral calling
   with movlatile(smaller range), this function can avoid mnovolatile if he
   doesn't care purged state. I'm not sure we really need this function so
   I hope listen opinions. Unfortunately, current implemenation doesn't split
   volatile VMA with new range in this case. I forgot implementing it
   in this version but decide to send it to listen opinions because implementing
   is rather trivial if we decided.

mnovolatile(start, len)'s semantics is following as.

1) It makes range(start, len) as volatile although the range includes
   unmapped area, speacial mapping and non-volatile range which are just
   skipped.

2) If the range is purged, it will return 1 regardless of including invalid
   range.

3) It returns -ENOMEM if system doesn't have enough memory for vma operation.

4) It returns -EINVAL if range doesn't include a right vma at all.

5) If user try to access purged range without mnovoatile call, it encounters
   SIGBUS which would show up next patch.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/syscalls/syscall_64.tbl |    3 +-
 include/linux/mm.h               |    1 +
 include/linux/mm_types.h         |    2 +
 include/linux/syscalls.h         |    2 +
 mm/Makefile                      |    4 +-
 mm/huge_memory.c                 |    9 +-
 mm/ksm.c                         |    3 +-
 mm/mlock.c                       |    5 +-
 mm/mmap.c                        |    2 +-
 mm/mvolatile.c                   |  396 ++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |    2 +
 11 files changed, 419 insertions(+), 10 deletions(-)
 create mode 100644 mm/mvolatile.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a582bfe..7da9c4a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -319,7 +319,8 @@
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
-
+313	common	mvolatile		sys_mvolatile
+314	common	mnovolatile		sys_mnovolatile
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..94742c4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
+#define VM_VOLATILE	0x00001000	/* Pages could be discarded without swapout */
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..ef2a4a4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -275,6 +275,8 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	/* Page in this vma was discarded*/
+	bool purged;			/* Serialized by anon_vma's mutex */
 };
 
 struct core_thread {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 727f0cd..a8ded1c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
 asmlinkage long sys_mlockall(int flags);
 asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..962b69f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -4,8 +4,8 @@
 
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   mvolatile.o mlock.o mmap.o mprotect.o mremap.o msync.o \
+			   rmap.o vmalloc.o pagewalk.o pgtable-generic.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3fe062d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1477,7 +1477,7 @@ out:
 	return ret;
 }
 
-#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
+#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE|VM_VOLATILE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
@@ -1641,8 +1641,11 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
 		 * page fault if needed.
 		 */
 		return 0;
-	if (vma->vm_ops)
-		/* khugepaged not yet working on file or special mappings */
+	if (vma->vm_ops || vma->vm_flags & VM_VOLATILE)
+		/*
+		 * khugepaged not yet working on file,special mappings
+		 * and volatile.
+		 */
 		return 0;
 	VM_BUG_ON(vma->vm_flags & VM_NO_THP);
 	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
diff --git a/mm/ksm.c b/mm/ksm.c
index ae539f0..2775f59 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1486,7 +1486,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		 */
 		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
 				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP))
+				 VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP   |
+				 VM_VOLATILE))
 			return 0;		/* just ignore the advice */
 
 #ifdef VM_SAO
diff --git a/mm/mlock.c b/mm/mlock.c
index f0b9ce5..db3a477 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -316,8 +316,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int ret = 0;
 	int lock = !!(newflags & VM_LOCKED);
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || (vma->vm_flags &
+		 (VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
+		 vma == get_gate_vma(current->mm))
 		goto out;	/* don't set VM_LOCKED,  don't count */
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9a796c4..e4ac12d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -808,7 +808,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	 * We later require that vma->vm_flags == vm_flags,
 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
 	 */
-	if (vm_flags & VM_SPECIAL)
+	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
 		return NULL;
 
 	if (prev)
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
new file mode 100644
index 0000000..ab5de2b
--- /dev/null
+++ b/mm/mvolatile.c
@@ -0,0 +1,396 @@
+/*
+ *	linux/mm/mvolatile.c
+ *
+ *  Copyright 2012 Minchan Kim
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm_types.h>
+#include <linux/mm.h>
+#include <linux/syscalls.h>
+#include <linux/rmap.h>
+#include <linux/mempolicy.h>
+
+#define NO_PURGED	0
+#define PURGED		1
+
+/*
+ * N: Normal VMA
+ * P: Purged volatile VMA
+ * V: Volatile VMA
+ *
+ * Assume that each VMA has two block so case 1-8 consists of three VMA.
+ * For example, NNPPVV means VMA1 has normal VMA, VMA2 has purged volailte VMA,
+ * and VMA3 has volatile VMA. With another example, NNPVVV means VMA1 has normal VMA,
+ * VMA2-1 has purged volatile VMA, VMA2-2 has volatile VMA.
+ *
+ * Case 7,8 create a new VMA and we call it VMA4 which can be loated before VMA2
+ * or after.
+ *
+ * Notice: The merge between volatile VMAs shouldn't happen.
+ * If we call mnovolatile(VMA2),
+ *
+ * Case 1 NNPPVV -> NNNNVV
+ * Case 2 VVPPNN -> VVNNNN
+ * Case 3 NNPPNN -> NNNNNN
+ * Case 4 NNPPVV -> NNNPVV
+ * case 5 VVPPNN -> VVPNNN
+ * case 6 VVPPVV -> VVNNVV
+ * case 7 VVPPVV -> VVNPVV
+ * case 8 VVPPVV -> VVPNVV
+ */
+static int do_mnovolatile(struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, bool *purged)
+{
+	pgoff_t pgoff;
+	bool old_purged;
+	unsigned long this_start, this_end;
+	unsigned long next_start, next_end;
+	vm_flags_t new_flags, old_flags;
+	struct mm_struct *mm = vma->vm_mm;
+	int error = 0;
+	struct vm_area_struct *next = NULL;
+	next_start = next_end = 0;
+
+	old_flags = vma->vm_flags;
+	new_flags = old_flags & ~VM_VOLATILE;
+
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		goto success;
+	}
+
+	/*
+	 * From now on, purged state is freezed so closing the race with
+	 * reclaim. It makes works easy.
+	 */
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = new_flags;
+	vma_unlock_anon_vma(vma);
+
+	/*
+	 * Setting vm_flags before vma adjust/split has a problem about
+	 * flag propatation or when error happens during the operation.
+	 * For preventing, we need more tweaking.
+	 */
+	old_purged = vma->purged;
+	*purged |= old_purged;
+	vma->purged = false;
+
+	this_start = vma->vm_start;
+	this_end = vma->vm_end;
+
+	if (*prev) {
+		next = (*prev)->vm_next;
+		if (next) {
+			next_start = next->vm_start;
+			next_end = next->vm_end;
+		}
+	}
+
+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
+			vma->vm_file, pgoff, vma_policy(vma));
+	if (*prev) {
+		/* case 1 -> Nothing */
+		/* case 2 -> Nothing */
+		/* case 3 -> Nothing */
+		/* case 4 -> Set VMA2-2 with old_flags and old_purged */
+		/* case 5 -> Set VMA2-1 with old_flags and old_purged */
+		if ((*prev)->vm_end == this_end) /* case 1 */
+			goto next;
+		else if ((*prev)->vm_end == next_end) /* case 2 */
+			goto next;
+		else if ((*prev)->vm_end > next_end) /* case 3 */
+			goto next;
+		else if ((*prev)->vm_end > this_start) { /* case 4 */
+			vma_lock_anon_vma(next);
+			next->vm_flags = old_flags;
+			next->purged = old_purged;
+			vma_unlock_anon_vma(next);
+		} else if ((*prev)->vm_end < this_end) { /* case 5 */
+			vma_lock_anon_vma(*prev);
+			(*prev)->vm_flags = old_flags;
+			(*prev)->purged = old_purged;
+			vma_unlock_anon_vma(*prev);
+		}
+next:
+		vma = *prev;
+		goto success;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		struct vm_area_struct *tmp_vma;
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+		/* case 8 -> Set VMA4 with old_flags and old_purged */
+		tmp_vma = vma->vm_prev;
+		vma_lock_anon_vma(tmp_vma);
+		tmp_vma->vm_flags = old_flags;
+		tmp_vma->purged = old_purged;
+		vma_unlock_anon_vma(tmp_vma);
+	}
+
+	if (end != vma->vm_end) {
+		struct vm_area_struct *tmp_vma;
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+		/* case 7 -> Set VMA4 with old_flags and old_purged */
+		tmp_vma = vma->vm_next;
+		vma_lock_anon_vma(tmp_vma);
+		tmp_vma->vm_flags = old_flags;
+		tmp_vma->purged = old_purged;
+		vma_unlock_anon_vma(tmp_vma);
+	}
+
+success:
+	return 0;
+out:
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = old_flags;
+	vma->purged = old_purged;
+	vma_unlock_anon_vma(vma);
+	return error;
+}
+
+/* I didn't look into KSM/Hugepage so disalbed them */
+#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
+				VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
+
+static int do_mvolatile(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		unsigned long start, unsigned long end)
+{
+	int error = -EINVAL;
+	vm_flags_t new_flags = vma->vm_flags;
+	struct mm_struct *mm = vma->vm_mm;
+
+	new_flags |= VM_VOLATILE;
+
+	/* Note : Current version doesn't support file vma volatile */
+	if (vma->vm_file) {
+		*prev = vma;
+		goto out;
+	}
+
+	if (vma->vm_flags & VM_NO_VOLATILE ||
+			(vma == get_gate_vma(current->mm))) {
+		*prev = vma;
+		goto out;
+	}
+	/*
+	 * In case of calling MADV_VOLATILE again,
+	 * We just reset purged state.
+	 */
+	if (new_flags == vma->vm_flags) {
+		*prev = vma;
+		vma_lock_anon_vma(vma);
+		vma->purged = false;
+		vma_unlock_anon_vma(vma);
+		error = 0;
+		goto out;
+	}
+
+	*prev = vma;
+
+	if (start != vma->vm_start) {
+		error = split_vma(mm, vma, start, 1);
+		if (error)
+			goto out;
+	}
+
+	if (end != vma->vm_end) {
+		error = split_vma(mm, vma, end, 0);
+		if (error)
+			goto out;
+	}
+
+	error = 0;
+
+	vma_lock_anon_vma(vma);
+	vma->vm_flags = new_flags;
+	vma_unlock_anon_vma(vma);
+out:
+	return error;
+}
+
+/*
+ * Return -EINVAL if the range doesn't include a suitable vma at all.
+ * Return -ENOMEM, interrupting the range operation, if there is not enough
+ * memory to merge/split vmas.
+ * Return 0 if the range consists only of proper vmas.
+ * Return 1 if part of the range includes a hole/huge/ksm/mlock/special area.
+ */
+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	bool invalid = false;
+	int error = -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+			invalid = true;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mvolatile(vma, &prev, start, tmp);
+		if (error == -ENOMEM) {
+			up_write(&current->mm->mmap_sem);
+			return error;
+		}
+		if (error == -EINVAL)
+			invalid = true;
+		else
+			error = 0;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+	/* bad arguments or no suitable vma at all: report -EINVAL as documented */
+	if (error == -EINVAL && !invalid)
+		return error;
+	return invalid ? 1 : 0;
+}
+
+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	int ret, error = -EINVAL;
+	bool purged = false;
+
+	down_write(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (!vma)
+		goto out;
+
+	if (start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = do_mnovolatile(vma, &prev, start, tmp, &purged);
+		if (error) {
+			WARN_ON(error != -ENOMEM);
+			goto out;
+		}
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+
+		vma = prev->vm_next;
+		if (!vma)
+			break;
+	}
+out:
+	up_write(&current->mm->mmap_sem);
+
+	if (error)
+		ret = error;
+	else if (purged)
+		ret = PURGED;
+	else
+		ret = NO_PURGED;
+
+	return ret;
+}
+
+/* Not intended for merging; just for testing */
+SYSCALL_DEFINE2(mpurge, unsigned long, start, size_t, len)
+{
+	int error = -EINVAL;
+	unsigned long end;
+	struct vm_area_struct *vma;
+
+	down_read(&current->mm->mmap_sem);
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	vma = find_vma(current->mm, start);
+	if (!vma || vma->vm_start >= end)
+		goto out;
+
+	if (!(vma->vm_flags & VM_VOLATILE))
+		goto out;
+
+	vma_lock_anon_vma(vma);
+	vma->purged = true;
+	vma_unlock_anon_vma(vma);
+	error = 0;
+out:
+	up_read(&current->mm->mmap_sem);
+
+	return error;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ee1ef0..7f4493c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -308,6 +308,8 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	/* Propagate parent's purged state to child */
+	vma->purged = pvma->purged;
 	anon_vma_unlock(anon_vma);
 
 	return 0;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC v4 2/3] Discard volatile page
  2012-12-18  6:47 ` Minchan Kim
@ 2012-12-18  6:47   ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-12-18  6:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

The VM doesn't need to swap out volatile pages. Instead, it just
discards them and sets the vma's purged state to true, so if the user
tries to access a purged vma without calling mnovolatile first, the
access raises SIGBUS.

The reclaimer reclaims a volatile page only when it reaches the tail of
the LRU, regardless of recent references. So while there is no memory
pressure the page is not evicted, which reduces the number of minor
faults. Even under memory pressure, the page isn't evicted until it
reaches the tail of the LRU. This can mitigate fault/data-regeneration
overhead as long as memory pressure isn't severe. But it's not a solid
design and needs more discussion.
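
For illustration, the intended userspace flow is roughly the following
(a minimal, untested sketch; mvolatile()/mnovolatile() are raw-syscall
wrappers as in the cover letter's test program, LEN is the length of
the range, and PURGED is assumed to be the positive return value
mnovolatile uses to report that some page in the range was discarded):

char *buf = mmap(NULL, LEN, PROT_READ|PROT_WRITE,
		MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

memset(buf, 0xaa, LEN);		/* fill with data we can regenerate */
mvolatile(buf, LEN);		/* kernel may discard it from now on */

/* touching buf here may raise SIGBUS if pages were already discarded */

if (mnovolatile(buf, LEN) == PURGED)
	memset(buf, 0xaa, LEN);	/* regenerate the lost contents */
/* buf is stable and safe to use again */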

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h |    3 ++
 mm/memory.c          |    2 ++
 mm/migrate.c         |    6 ++--
 mm/rmap.c            |   95 ++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c          |    3 ++
 5 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bfe1f47..ed263bb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -80,6 +80,8 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	/* ignore volatile; should be revisited to handle migration entries */
+	TTU_IGNORE_VOLATILE = (1 << 11),
 };
 
 #ifdef CONFIG_MMU
@@ -261,5 +263,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
+#define SWAP_DISCARD	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..71e06fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3459,6 +3459,8 @@ int handle_pte_fault(struct mm_struct *mm,
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, flags, entry);
 			}
+			if (unlikely(vma->vm_flags & VM_VOLATILE))
+				return VM_FAULT_SIGBUS;
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..bf9d76a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
diff --git a/mm/rmap.c b/mm/rmap.c
index 7f4493c..02ee1a3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1189,6 +1189,64 @@ out:
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+		unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	swp_entry_t entry = { .val = page_private(page) };
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(!PageAnon(page));
+	VM_BUG_ON(!PageSwapCache(page));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return 0;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return 0;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+
+	VM_BUG_ON(pmd_trans_huge(*pmd));
+
+	pte = pte_offset_map(pmd, address);
+	/* Make a quick check before getting the lock */
+	if (!pte_present(*pte)) {
+		pte_unmap(pte);
+		return 0;
+	}
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	if (entry.val != pte_to_swp_entry(*pte).val) {
+		pte_unmap_unlock(pte, ptl);
+		return 0;
+	}
+
+	/* Nuke the page table entry. */
+	flush_cache_page(vma, address, page_to_pfn(page));
+	ptep_clear_flush(vma, address, pte);
+
+	/* try_to_unmap_one increased MM_SWAPENTS */
+	dec_mm_counter(mm, MM_SWAPENTS);
+	swap_free(entry);
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, address);
+	return 1;
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
@@ -1475,6 +1533,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	pgoff_t pgoff;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
+	bool is_volatile = true;
+
+	if (flags & TTU_IGNORE_VOLATILE)
+		is_volatile = false;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -1494,8 +1556,17 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 		 * temporary VMAs until after exec() completes.
 		 */
 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-				is_vma_temporary_stack(vma))
+				is_vma_temporary_stack(vma)) {
+			is_volatile = false;
 			continue;
+		}
+
+		/*
+		 * A volatile page will only be purged if ALL vmas
+		 * pointing to it are VM_VOLATILE.
+		 */
+		if (!(vma->vm_flags & VM_VOLATILE))
+			is_volatile = false;
 
 		address = vma_address(page, vma);
 		ret = try_to_unmap_one(page, vma, address, flags);
@@ -1503,6 +1574,25 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 			break;
 	}
 
+	if (page_mapped(page) || !is_volatile)
+		goto out;
+
+	/*
+	 * Here, all vmas pointing to the page are volatile and all ptes
+	 * hold swap entries. PG_locked prevents a race with do_swap_page.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long address;
+
+		address = vma_address(page, vma);
+		if (try_to_zap_one(page, vma, address))
+			vma->purged = true;
+	}
+	/* We're throwing this page out, so mark it clean */
+	ClearPageDirty(page);
+	ret = SWAP_DISCARD;
+out:
 	page_unlock_anon_vma(anon_vma);
 	return ret;
 }
@@ -1628,6 +1718,7 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
+ * SWAP_DISCARD - page is volatile.
  */
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
@@ -1642,7 +1733,7 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 		ret = try_to_unmap_anon(page, flags);
 	else
 		ret = try_to_unmap_file(page, flags);
-	if (ret != SWAP_MLOCK && !page_mapped(page))
+	if (ret != SWAP_MLOCK && !page_mapped(page) && ret != SWAP_DISCARD)
 		ret = SWAP_SUCCESS;
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..cfe95d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -793,6 +793,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, ttu_flags)) {
+			case SWAP_DISCARD:
+				goto discard_page;
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -861,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
+discard_page:
 		/*
 		 * If the page has buffers, try to free the buffer mappings
 		 * associated with this page. If we succeed we try to free
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC v4 3/3] add PGVOLATILE vmstat count
  2012-12-18  6:47 ` Minchan Kim
@ 2012-12-18  6:47   ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-12-18  6:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Minchan Kim, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

This patch adds a pgvolatile vmstat entry so the admin can see how many
volatile pages have been discarded by the VM so far. It could be a good
indicator of the patch's effect during testing, but I'm still not sure
we need it in real practice. Will rethink it.
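
(For example, during a test run the counter can be watched with
"grep pgvolatile /proc/vmstat", matching the vmstat_text entry added
below.)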

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
CC: David Rientjes <rientjes@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vm_event_item.h |    2 +-
 mm/vmscan.c                   |    1 +
 mm/vmstat.c                   |    1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..f83c3d2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,7 +23,7 @@
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cfe95d3..1ec7345 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -794,6 +794,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, ttu_flags)) {
 			case SWAP_DISCARD:
+				count_vm_event(PGVOLATILE);
 				goto discard_page;
 			case SWAP_FAIL:
 				goto activate_locked;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..9fd8ead 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -747,6 +747,7 @@ const char * const vmstat_text[] = {
 	TEXTS_FOR_ZONES("pgalloc")
 
 	"pgfree",
+	"pgvolatile",
 	"pgactivate",
 	"pgdeactivate",
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-18  6:47 ` Minchan Kim
@ 2012-12-18 18:27   ` Arun Sharma
  -1 siblings, 0 replies; 20+ messages in thread
From: Arun Sharma @ 2012-12-18 18:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk, sanjay,
	Paul Turner, David Rientjes, John Stultz, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On 12/17/12 10:47 PM, Minchan Kim wrote:

> I hope more inputs from user-space allocator people and test patch
> with their allocator because it might need design change of arena
> management for getting real vaule.

jemalloc knows how to handle MADV_FREE on platforms that support it.
This looks similar (we'll need a SIGBUS handler that does the right
thing, i.e. zero the page and mark it as non-volatile in the common case).
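
Something like this, perhaps (a rough, untested sketch; SYS_mnovolatile
is the syscall number from the cover letter's test program, and I'm
assuming that once mnovolatile() has run, the retried access simply
faults in a zero-filled page, so the handler itself needs no memset):

#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SYS_mnovolatile 314

static long pgsz;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	/* page-align the faulting address and make it stable again */
	unsigned long addr = (unsigned long)si->si_addr & ~(pgsz - 1);

	syscall(SYS_mnovolatile, addr, pgsz);
	/* ... also tell the allocator this chunk lost its contents ... */
}

static void install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	pgsz = sysconf(_SC_PAGESIZE);
	sigaction(SIGBUS, &sa, NULL);
}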

All of this of course assumes that apps madvise the kernel through APIs 
exposed by the malloc implementation - not via a raw syscall.

In other words, some new user space code needs to be written to test 
this out fully. Sounds feasible though.

  -Arun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-18 18:27   ` Arun Sharma
@ 2012-12-20  1:34     ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-12-20  1:34 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk, sanjay,
	Paul Turner, David Rientjes, John Stultz, Christoph Lameter,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Tue, Dec 18, 2012 at 10:27:46AM -0800, Arun Sharma wrote:
> On 12/17/12 10:47 PM, Minchan Kim wrote:
> 
> >I hope more inputs from user-space allocator people and test patch
> >with their allocator because it might need design change of arena
> >management for getting real vaule.
> 
> jemalloc knows how to handle MADV_FREE on platforms that support it.
> This looks similar (we'll need a SIGBUS handler that does the right
> thing = zero the page + mark it as non-volatile in the common case).

That doesn't work in the malloc case, because by the time the signal
handler marks the range non-volatile it is too late.

For example,
free(P1-P4) -> mvolatile(P1-P4) -> VM discard(P3) -> alloc(P1-P4) ->
use P1 -> VM discard(P1) -> use P3 -> SIGBUS -> mark nonvolatile ->
lost P1.

So we should call mnovolatile before handing the free space back to the user.
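
In allocator terms, roughly (arena_carve()/arena_release() are made-up
names standing in for the allocator's own free-list bookkeeping, and
struct arena is hypothetical):

void *arena_alloc(struct arena *a, size_t size)
{
	void *p = arena_carve(a, size);	/* grab a free chunk */

	/*
	 * Stabilize the range *before* the caller can touch it, so
	 * the VM can no longer discard what gets written there.
	 */
	mnovolatile(p, size);
	return p;
}

void arena_free(struct arena *a, void *p, size_t size)
{
	arena_release(a, p, size);	/* back to the free lists */
	mvolatile(p, size);		/* VM may discard it from now on */
}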

> 
> All of this of course assumes that apps madvise the kernel through
> APIs exposed by the malloc implementation - not via a raw syscall.
> 
> In other words, some new user space code needs to be written to test

Agreed. I might want to design a new allocator around these system calls
if existing allocators cannot use them efficiently, because it might
require a change in allocator design. MADV_FREE/MADV_DONTNEED isn't cheap,
since it has to walk the ptes/page descriptors in the range to mark them,
so I guess allocators avoid calling such advice system calls frequently,
and even when they do call them, they batch a big range into one call.
Just my guess.

But mvolatile/mnovolatile is cheaper, so you can call it more frequently
on smaller ranges, and the VM gets easily-reclaimable pages sooner.
Another benefit of mvolatile is that it can change its behavior when
memory pressure is severe, zapping all pages like DONTNEED, so it can
work very flexibly.
The downside of that approach is that calling it on small ranges can
increase the number of VMAs, so we might need a tuning point for VMA size.

> this out fully. Sounds feasible though.

Thanks!

> 
>  -Arun
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-18  6:47 ` Minchan Kim
@ 2012-12-26  2:37   ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 20+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  2:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro

(2012/12/18 15:47), Minchan Kim wrote:
> This is still RFC because we need more input from user-space
> people and discussion about interface/reclaim policy of volatile
> pages and I want to expand this concept to tmpfs volatile range
> if it is possbile without big performance drop of anonymous volatile
> rnage (Let's define our term. anon volatile VS tmpfs volatile? John?)
> 
> NOTE: I didn't consider THP/KSM so for test, you should disable them.
> 
> I hope more inputs from user-space allocator people and test patch
> with their allocator because it might need design change of arena
> management for getting real vaule.
> 
> Changelog from v4
> 
>   * Add new system call mvolatile/mnovolatile
>   * Add sigbus when user try to access volatile range
>   * Rebased on v3.7
>   * Applied bug fix from John Stultz, Thanks!
> 
> Changelog from v3
> 
>   * Removing madvise(addr, length, MADV_NOVOLATILE).
>   * add vmstat about the number of discarded volatile pages
>   * discard volatile pages without promotion in reclaim path
> 
> This is based on v3.7
> 
> - What's the mvolatile(addr, length)?
> 
>    It's a hint that user deliver to kernel so kernel can *discard*
>    pages in a range anytime.
> 

Can this work against both PRIVATE and SHARED mappings?

What happens at fork()? Are VOLATILE ranges copied?


> - What happens if user access page(ie, virtual address) discarded
>    by kernel?
> 
>    The user can encounter SIGBUS.
> 
> - What should user do for avoding SIGBUS?
>    He should call mnovolatie(addr, length) before accessing the range
>    which was called by mvolatile.
> 
Will mnovolatile() return whether the range was discarded or not?

What should the user do in the signal handler?
Can all the expected operations be done in a signal-safe manner?
(IOW, can the user easily do enough work without taking any locks in userland?)

> - What happens if user access page(ie, virtual address) doesn't
>    discarded by kernel?
> 
>    The user can see old data without page fault.
> 

What happens when the user calls mvolatile() against an mlock()'d range, or
calls mlock() against an mvolatile()'d range?

Hm, by the way, does the user need to attach pages to the process by causing
page faults (as you do with memset()) before calling mvolatile()?

I think your approach is interesting, anyway.

Thanks,
-Kame


> - What's different with madvise(DONTNEED)?
> 
>    System call semantic
> 
>    DONTNEED makes sure user always can see zero-fill pages after
>    he calls madvise while mvolatile can see old data or encounter
>    SIGBUS.
> 
>    Internal implementation
> 
>    The madvise(DONTNEED) should zap all mapped pages in range so
>    overhead is increased linearly with the number of mapped pages.
>    Even, if user access zapped pages as write mode, page fault +
>    page allocation + memset should be happened.
> 
>    The mvolatile just marks the flag in a range(ie, VMA) instead of
>    zapping all of pte in the vma so it doesn't touch ptes any more.
> 
> - What's the benefit compared to DONTNEED?
> 
>    1. The system call overhead is smaller because mvolatile just marks
>       the flag to VMA instead of zapping all the page in a range so
>       overhead should be very small.
> 
>    2. It has a chance to eliminate overheads (ex, zapping pte + page fault
>       + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
>       severe.
> 
>    3. It has a potential to zap all ptes and free the pages if memory
>       pressure is severe so reclaim overhead could be disappear - TODO
> 
> - Isn't there any drawback?
> 
>    Madvise(DONTNEED) doesn't need exclusive mmap_sem so concurrent page
>    fault of other threads could be allowed. But m[no]volatile needs
>    exclusive mmap_sem so other thread would be blocked if they try to
>    access not-yet-mapped pages. That's why I design m[no]volatile
>    overhead should be small as far as possible.
> 
>    It could suffer from max rss usage increasement because madvise(DONTNEED)
>    deallocates pages instantly when the system call is issued while mvoatile
>    delays it until memory pressure happens so if memory pressure is severe by
>    max rss incresement, system would suffer. First of all, allocator needs
>    some balance logic for that or kernel might handle it by zapping pages
>    although user calls mvolatile if memory pressure is severe.
>    The problem is how we know memory pressure is severe.
>    One of solution is to see kswapd is active or not. Another solution is
>    Anton's mempressure so allocator can handle it.
> 
> - What's for targetting?
> 
>    Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
>    of virtual machine like Dalvik. Also, it comes in handy for embedded
>    which doesn't have swap device so they can't reclaim anonymous pages.
>    By discarding instead of swapout, it could be used in the non-swap system.
>    For it, we have to age anon lru list although we don't have swap because
>    I don't want to discard volatile pages by top priority when memory pressure
>    happens as volatile in this patch means "We don't need to swap out because
>    user can handle the situation which data are disappear suddenly", NOT
>    "They are useless so hurry up to reclaim them". So I want to apply same
>    aging rule of nomal pages to them.
> 
>    Anonymous page background aging of non-swap system would be a trade-off
>    for getting good feature. Even, we had done it two years ago until merge
>    [1] and I believe gain of this patch will beat loss of anon lru aging's
>    overead once all of allocator start to use madvise.
>    (This patch doesn't include background aging in case of non-swap system
>    but it's trivial if we decide)
> 
>    As another choice, we can zap the range like madvise(DONTNEED) when mvolatile
>    is called if we don't have swap space.
> 
> - Stupid performance test
>    I attach test program/script which are utter crap and I don't expect
>    current smart allocator never have done it so we need more practical data
>    with real allocator.
> 
>    KVM - 8 core, 2G
> 
> VOLATILE test
> 13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata 2624096maxresident)k
> 0inputs+0outputs (0major+164050minor)pagefaults 0swaps
> 
> DONTNEED test
> 23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata 213088maxresident)k
> 0inputs+0outputs (0major+16384210minor)pagefaults 0swaps
> 
>    x86-64 - 12 core, 2G
> 
> VOLATILE test
> 33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata 3935008maxresident)k
> 0inputs+0outputs (0major+245989minor)pagefaults 0swaps
> 
> DONTNEED test
> 28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata 387776maxresident)k
> 
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
> 
> Any comments are welcome!
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Arun Sharma <asharma@fb.com>
> Cc: sanjay@google.com
> Cc: Paul Turner <pjt@google.com>
> CC: David Rientjes <rientjes@google.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Minchan Kim (3):
>    Introduce new system call mvolatile
>    Discard volatile page
>    add PGVOLATILE vmstat count
> 
>   arch/x86/syscalls/syscall_64.tbl |    3 +-
>   include/linux/mm.h               |    1 +
>   include/linux/mm_types.h         |    2 +
>   include/linux/rmap.h             |    3 +
>   include/linux/syscalls.h         |    2 +
>   include/linux/vm_event_item.h    |    2 +-
>   mm/Makefile                      |    4 +-
>   mm/huge_memory.c                 |    9 +-
>   mm/ksm.c                         |    3 +-
>   mm/memory.c                      |    2 +
>   mm/migrate.c                     |    6 +-
>   mm/mlock.c                       |    5 +-
>   mm/mmap.c                        |    2 +-
>   mm/mvolatile.c                   |  396 ++++++++++++++++++++++++++++++++++++++
>   mm/rmap.c                        |   97 +++++++++-
>   mm/vmscan.c                      |    4 +
>   mm/vmstat.c                      |    1 +
>   17 files changed, 527 insertions(+), 15 deletions(-)
>   create mode 100644 mm/mvolatile.c
> 
> ================== 8< =============================
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <pthread.h>
> #include <sched.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> 
> #define SYS_mvolatile 313
> #define SYS_mnovolatile 314
> 
> #define ALLOC_SIZE (8 << 20)
> #define MAP_SIZE  (ALLOC_SIZE * 10)
> #define PAGE_SIZE (1 << 12)
> #define RETRY 100
> 
> pthread_barrier_t barrier;
> int mode;
> #define VOLATILE_MODE 1
> 
> static int mvolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mvolatile, addr, length);
> }
> 
> static int mnovolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mnovolatile, addr, length);
> }
> 
> void *thread_entry(void *data)
> {
> 	unsigned long i;
> 	cpu_set_t set;
> 	int cpu = *(int*)data;
> 	void *mmap_area;
> 	int retry = RETRY;
> 
> 	CPU_ZERO(&set);
> 	CPU_SET(cpu, &set);
> 	sched_setaffinity(0, sizeof(set), &set);
> 
> 	mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
> 					MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 	if (mmap_area == MAP_FAILED) {
> 		fprintf(stderr, "Fail to mmap [%d]\n", *(int*)data);
> 		exit(1);
> 	}
> 
> 	pthread_barrier_wait(&barrier);
> 
> 	while(retry--) {
> 		if (mode == VOLATILE_MODE) {
> 			mvolatile(mmap_area, MAP_SIZE);
> 			for (i = 0; i < MAP_SIZE; i+= ALLOC_SIZE) {
> 				mnovolatile(mmap_area + i, ALLOC_SIZE);
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				mvolatile(mmap_area + i, ALLOC_SIZE);
> 			}
> 		} else {
> 			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
> 			}
> 		}
> 	}
> 	return NULL;
> }
> 
> int main(int argc, char *argv[])
> {
> 	int i, nr_thread;
> 	int *data;
> 
> 	if (argc < 3)
> 		return 1;
> 
> 	nr_thread = atoi(argv[1]);
> 	mode = atoi(argv[2]);
> 
> 	pthread_t *thread = malloc(sizeof(pthread_t) * nr_thread);
> 	data = malloc(sizeof(int) * nr_thread);
> 	pthread_barrier_init(&barrier, NULL, nr_thread);
> 
> 	for (i = 0; i < nr_thread; i++) {
> 		data[i] = i;
> 		if (pthread_create(&thread[i], NULL, thread_entry, &data[i])) {
> 			perror("Fail to create thread\n");
> 			exit(1);
> 		}
> 	}
> 
> 	for (i = 0; i < nr_thread; i++) {
> 		if (pthread_join(thread[i], NULL))
> 			perror("Fail to join thread\n");
> 		printf("[%d] thread done\n", i);
> 	}
> 
> 	return 0;
> }
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/3] Support volatile for anonymous range
@ 2012-12-26  2:37   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 20+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  2:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro

(2012/12/18 15:47), Minchan Kim wrote:
> This is still RFC because we need more input from user-space
> people and discussion about interface/reclaim policy of volatile
> pages and I want to expand this concept to tmpfs volatile range
> if it is possbile without big performance drop of anonymous volatile
> rnage (Let's define our term. anon volatile VS tmpfs volatile? John?)
> 
> NOTE: I didn't consider THP/KSM so for test, you should disable them.
> 
> I hope more inputs from user-space allocator people and test patch
> with their allocator because it might need design change of arena
> management for getting real vaule.
> 
> Changelog from v4
> 
>   * Add new system call mvolatile/mnovolatile
>   * Add sigbus when user try to access volatile range
>   * Rebased on v3.7
>   * Applied bug fix from John Stultz, Thanks!
> 
> Changelog from v3
> 
>   * Removing madvise(addr, length, MADV_NOVOLATILE).
>   * add vmstat about the number of discarded volatile pages
>   * discard volatile pages without promotion in reclaim path
> 
> This is based on v3.7
> 
> - What's the mvolatile(addr, length)?
> 
>    It's a hint that user deliver to kernel so kernel can *discard*
>    pages in a range anytime.
> 

This can work against both of PRIVATE and SHARED mapping  ?

What happens at fork() ? VOLATILE ranges are copied ?


> - What happens if user access page(ie, virtual address) discarded
>    by kernel?
> 
>    The user can encounter SIGBUS.
> 
> - What should user do for avoding SIGBUS?
>    He should call mnovolatie(addr, length) before accessing the range
>    which was called by mvolatile.
> 
Will mnovolatile() return whether the range is discarded or not ?

What the user should do in signal handler ?
Can the all expected opereations be done in signal-safe manner ?
(IOW, can user do enough job easily without taking any locks in userland ?)

> - What happens if user access page(ie, virtual address) doesn't
>    discarded by kernel?
> 
>    The user can see old data without page fault.
> 

What happens when ther user calls mvolatile() against mlock()'d range or
calling mlock() against mvolatile()'d range ?

Hm, by the way, the user need to attach pages to the process by causing page-fault
(as you do by memset()) before calling mvolatile() ?

I think your approach is interesting, anyway.

Thanks,
-Kame


> - What's different with madvise(DONTNEED)?
> 
>    System call semantic
> 
>    DONTNEED makes sure user always can see zero-fill pages after
>    he calls madvise while mvolatile can see old data or encounter
>    SIGBUS.
> 
>    Internal implementation
> 
>    The madvise(DONTNEED) should zap all mapped pages in range so
>    overhead is increased linearly with the number of mapped pages.
>    Even, if user access zapped pages as write mode, page fault +
>    page allocation + memset should be happened.
> 
>    The mvolatile just marks the flag in a range(ie, VMA) instead of
>    zapping all of pte in the vma so it doesn't touch ptes any more.
> 
> - What's the benefit compared to DONTNEED?
> 
>    1. The system call overhead is smaller because mvolatile just marks
>       the flag to VMA instead of zapping all the page in a range so
>       overhead should be very small.
> 
>    2. It has a chance to eliminate overheads (ex, zapping pte + page fault
>       + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
>       severe.
> 
>    3. It has a potential to zap all ptes and free the pages if memory
>       pressure is severe so reclaim overhead could be disappear - TODO
> 
> - Isn't there any drawback?
> 
>    Madvise(DONTNEED) doesn't need exclusive mmap_sem so concurrent page
>    fault of other threads could be allowed. But m[no]volatile needs
>    exclusive mmap_sem so other thread would be blocked if they try to
>    access not-yet-mapped pages. That's why I design m[no]volatile
>    overhead should be small as far as possible.
> 
>    It could suffer from max rss usage increasement because madvise(DONTNEED)
>    deallocates pages instantly when the system call is issued while mvoatile
>    delays it until memory pressure happens so if memory pressure is severe by
>    max rss incresement, system would suffer. First of all, allocator needs
>    some balance logic for that or kernel might handle it by zapping pages
>    although user calls mvolatile if memory pressure is severe.
>    The problem is how we know memory pressure is severe.
>    One of solution is to see kswapd is active or not. Another solution is
>    Anton's mempressure so allocator can handle it.
> 
> - What's for targetting?
> 
>    Firstly, user-space allocator like ptmalloc, tcmalloc or heap management
>    of virtual machine like Dalvik. Also, it comes in handy for embedded
>    which doesn't have swap device so they can't reclaim anonymous pages.
>    By discarding instead of swapout, it could be used in the non-swap system.
>    For it, we have to age anon lru list although we don't have swap because
>    I don't want to discard volatile pages by top priority when memory pressure
>    happens as volatile in this patch means "We don't need to swap out because
>    user can handle the situation which data are disappear suddenly", NOT
>    "They are useless so hurry up to reclaim them". So I want to apply same
>    aging rule of nomal pages to them.
> 
>    Anonymous page background aging of non-swap system would be a trade-off
>    for getting good feature. Even, we had done it two years ago until merge
>    [1] and I believe gain of this patch will beat loss of anon lru aging's
>    overead once all of allocator start to use madvise.
>    (This patch doesn't include background aging in case of non-swap system
>    but it's trivial if we decide)
> 
>    As another choice, we can zap the range like madvise(DONTNEED) when mvolatile
>    is called if we don't have swap space.
> 
> - Stupid performance test
>    I attach test program/script which are utter crap and I don't expect
>    current smart allocator never have done it so we need more practical data
>    with real allocator.
> 
>    KVM - 8 core, 2G
> 
> VOLATILE test
> 13.16user 7.58system 0:06.04elapsed 343%CPU (0avgtext+0avgdata 2624096maxresident)k
> 0inputs+0outputs (0major+164050minor)pagefaults 0swaps
> 
> DONTNEED test
> 23.30user 228.92system 0:33.10elapsed 762%CPU (0avgtext+0avgdata 213088maxresident)k
> 0inputs+0outputs (0major+16384210minor)pagefaults 0swaps
> 
>    x86-64 - 12 core, 2G
> 
> VOLATILE test
> 33.38user 0.44system 0:02.87elapsed 1178%CPU (0avgtext+0avgdata 3935008maxresident)k
> 0inputs+0outputs (0major+245989minor)pagefaults 0swaps
> 
> DONTNEED test
> 28.02user 41.25system 0:05.80elapsed 1192%CPU (0avgtext+0avgdata 387776maxresident)k
> 
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
> 
> Any comments are welcome!
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Arun Sharma <asharma@fb.com>
> Cc: sanjay@google.com
> Cc: Paul Turner <pjt@google.com>
> CC: David Rientjes <rientjes@google.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Minchan Kim (3):
>    Introduce new system call mvolatile
>    Discard volatile page
>    add PGVOLATILE vmstat count
> 
>   arch/x86/syscalls/syscall_64.tbl |    3 +-
>   include/linux/mm.h               |    1 +
>   include/linux/mm_types.h         |    2 +
>   include/linux/rmap.h             |    3 +
>   include/linux/syscalls.h         |    2 +
>   include/linux/vm_event_item.h    |    2 +-
>   mm/Makefile                      |    4 +-
>   mm/huge_memory.c                 |    9 +-
>   mm/ksm.c                         |    3 +-
>   mm/memory.c                      |    2 +
>   mm/migrate.c                     |    6 +-
>   mm/mlock.c                       |    5 +-
>   mm/mmap.c                        |    2 +-
>   mm/mvolatile.c                   |  396 ++++++++++++++++++++++++++++++++++++++
>   mm/rmap.c                        |   97 +++++++++-
>   mm/vmscan.c                      |    4 +
>   mm/vmstat.c                      |    1 +
>   17 files changed, 527 insertions(+), 15 deletions(-)
>   create mode 100644 mm/mvolatile.c
> 
> ================== 8< =============================
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <pthread.h>
> #include <sched.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> 
> #define SYS_mvolatile 313
> #define SYS_mnovolatile 314
> 
> #define ALLOC_SIZE (8 << 20)
> #define MAP_SIZE  (ALLOC_SIZE * 10)
> #define PAGE_SIZE (1 << 12)
> #define RETRY 100
> 
> pthread_barrier_t barrier;
> int mode;
> #define VOLATILE_MODE 1
> 
> static int mvolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mvolatile, addr, length);
> }
> 
> static int mnovolatile(void *addr, size_t length)
> {
> 	return syscall(SYS_mnovolatile, addr, length);
> }
> 
> void *thread_entry(void *data)
> {
> 	unsigned long i;
> 	cpu_set_t set;
> 	int cpu = *(int*)data;
> 	void *mmap_area;
> 	int retry = RETRY;
> 
> 	CPU_ZERO(&set);
> 	CPU_SET(cpu, &set);
> 	sched_setaffinity(0, sizeof(set), &set);
> 
> 	/* PROT_NONE guard page, presumably to keep this thread's area in
> 	 * its own VMA so m[no]volatile calls don't affect neighbours */
> 	mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	mmap_area = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE,
> 					MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	if (mmap_area == MAP_FAILED) {
> 		fprintf(stderr, "Fail to mmap [%d]\n", *(int*)data);
> 		exit(1);
> 	}
> 
> 	pthread_barrier_wait(&barrier);
> 
> 	while (retry--) {
> 		if (mode == VOLATILE_MODE) {
> 			/* mark the whole area volatile, then unmark, dirty
> 			 * and re-mark it chunk by chunk */
> 			mvolatile(mmap_area, MAP_SIZE);
> 			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
> 				mnovolatile(mmap_area + i, ALLOC_SIZE);
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				mvolatile(mmap_area + i, ALLOC_SIZE);
> 			}
> 		} else {
> 			/* baseline: dirty each chunk, then zap it */
> 			for (i = 0; i < MAP_SIZE; i += ALLOC_SIZE) {
> 				memset(mmap_area + i, i, ALLOC_SIZE);
> 				madvise(mmap_area + i, ALLOC_SIZE, MADV_DONTNEED);
> 			}
> 		}
> 	}
> 	return NULL;
> }
> 
> int main(int argc, char *argv[])
> {
> 	int i, nr_thread;
> 	int *data;
> 
> 	if (argc < 3)
> 		return 1;
> 
> 	nr_thread = atoi(argv[1]);
> 	mode = atoi(argv[2]);
> 
> 	pthread_t *thread = malloc(sizeof(pthread_t) * nr_thread);
> 	data = malloc(sizeof(int) * nr_thread);
> 	pthread_barrier_init(&barrier, NULL, nr_thread);
> 
> 	for (i = 0; i < nr_thread; i++) {
> 		data[i] = i;
> 		if (pthread_create(&thread[i], NULL, thread_entry, &data[i])) {
> 			/* pthread_create() returns an error number and does
> 			 * not set errno, so perror() would be misleading */
> 			fprintf(stderr, "Fail to create thread\n");
> 			exit(1);
> 		}
> 	}
> 
> 	for (i = 0; i < nr_thread; i++) {
> 		if (pthread_join(thread[i], NULL))
> 			fprintf(stderr, "Fail to join thread\n");
> 		printf("[%d] thread done\n", i);
> 	}
> 
> 	return 0;
> }
> 



* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-26  2:37   ` Kamezawa Hiroyuki
@ 2012-12-26  3:46     ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2012-12-26  3:46 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro

Hi Kame,

What are you doing this holiday season? :)
I can't believe you're sitting in front of a computer.

On Wed, Dec 26, 2012 at 11:37:02AM +0900, Kamezawa Hiroyuki wrote:
> (2012/12/18 15:47), Minchan Kim wrote:
> > - What's the mvolatile(addr, length)?
> > 
> >    It's a hint that the user delivers to the kernel so the kernel can
> >    *discard* pages in the range at any time.
> > 
> 
> Does this work against both PRIVATE and SHARED mappings?

Yes.

> 
> What happens at fork()? Are VOLATILE ranges copied?

The child VMA just inherits the VM_VOLATILE flag.
If a page is shared like that, it can be discarded only when all VMAs
pointing to it are VM_VOLATILE.
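
For example (illustrative only, using the wrappers from the attached test
program):

	mvolatile(buf, len);            /* parent's VMA is VM_VOLATILE */
	if (fork() == 0) {
		/* the child's copy of the VMA inherits VM_VOLATILE, so the
		 * shared pages remain discardable; if the child called
		 * mnovolatile(buf, len) here, discard would be blocked as
		 * long as the non-volatile child VMA still maps the pages */
	}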

> 
> 
> > - What happens if the user accesses a page (i.e., virtual address)
> >    discarded by the kernel?
> > 
> >    The user can encounter SIGBUS.
> > 
> > - What should the user do to avoid SIGBUS?
> >    Call mnovolatile(addr, length) before accessing a range that was
> >    previously passed to mvolatile.
> > 
> Will mnovolatile() return whether the range was discarded or not?

Absolutely.
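
As an illustration of that pattern (a sketch only: the exact return-value
convention isn't settled in this thread, and regenerate()/use() stand in
for application code):

	/* assume mnovolatile() returns > 0 if any page in the range was
	 * purged, 0 if all data survived, < 0 on error */
	static void access_safely(void *buf, size_t len)
	{
		int purged = mnovolatile(buf, len);

		if (purged < 0)
			perror("mnovolatile");
		else if (purged > 0)
			regenerate(buf, len);   /* hypothetical: rebuild data */

		use(buf, len);                  /* non-volatile now, safe */
		mvolatile(buf, len);            /* done: mark volatile again */
	}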

> 
> What should the user do in the signal handler?

It depends on the use case.
Please read John's mail: http://lwn.net/Articles/518130/
Quoting from the link:
"
But one interesting new tweak on this design, suggested by Taras
Glek and others at Mozilla, is as follows:

Instead of leaving volatile data access as being undefined, when
accessing volatile data, either the data expected will be returned
if it has not been purged, or the application will get a SIGBUS when
it accesses volatile data that has been purged.

Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when they want to access it, then
immediately re-mark it as volatile when they're done. It is in effect
"lazy" with its marking, allowing the kernel to hit it with a signal
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.

Since this approach avoids the more explicit unmark/access/mark
pattern, it avoids the extra overhead required to ensure data is
non-volatile before being accessed.

However, if applications don't want to deal with handling the
sigbus, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.

This allows folks to balance the cost vs complexity in their
application appropriately.

So that's a general overview of how the idea I'm proposing could
be used.
"

> Can all the expected operations be done in a signal-safe manner?
> (IOW, can the user easily do enough work without taking any locks in userland?)

It depends on the design of the user application, but some user-space folks
want it, so I think it can be done well enough. Notably, Android has used
ashmem, another interface with the same goal, but it works only on tmpfs
pages, while mine covers normal anonymous pages; the goal is to support both.

> 
> > - What happens if the user accesses a page (i.e., virtual address) that
> >    was NOT discarded by the kernel?
> > 
> >    The user sees the old data, without a page fault.
> > 
> 
> What happens when the user calls mvolatile() on an mlock()'d range, or
> mlock() on an mvolatile()'d range?

-EINVAL

> 
> Hm, by the way, does the user need to attach pages to the process by causing
> page faults (as you do with memset()) before calling mvolatile()?

For effectiveness, yes.

> 
> I think your approach is interesting, anyway.

Thanks for your interest, Kame.

Happy New Year.

-- 
Kind regards,
Minchan Kim

* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-26  3:46     ` Minchan Kim
@ 2012-12-28  0:24       ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 20+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro

(2012/12/26 12:46), Minchan Kim wrote:
> Hi Kame,
>
> What are you doing this holiday season? :)
> I can't believe you're sitting in front of a computer.
>
Honestly, my holiday starts tomorrow ;) (and lasts until 1/5 next year.)

>>
>> Hm, by the way, does the user need to attach pages to the process by
>> causing page faults (as you do with memset()) before calling mvolatile()?
>
> For effectiveness, yes.
>

Isn't it better to cause the page faults via get_user_pages() inside
mvolatile()? Triggering the faults from userland just seems to add burden
to the apps.

>>
>> I think your approach is interesting, anyway.
>
> Thanks for your interest, Kame.
>
> Happy New Year.
>

A happy new year.

Thanks,
-Kame



* Re: [RFC v4 0/3] Support volatile for anonymous range
  2012-12-28  0:24       ` Kamezawa Hiroyuki
@ 2013-01-04  2:37         ` Minchan Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2013-01-04  2:37 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Arun Sharma, sanjay, Paul Turner, David Rientjes, John Stultz,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner,
	Neil Brown, Mike Hommey, Taras Glek, KOSAKI Motohiro

On Fri, Dec 28, 2012 at 09:24:53AM +0900, Kamezawa Hiroyuki wrote:
> (2012/12/26 12:46), Minchan Kim wrote:
> >Hi Kame,
> >
> >What are you doing this holiday season? :)
> >I can't believe you're sitting in front of a computer.
> >
> Honestly, my holiday starts tomorrow ;) (and lasts until 1/5 next year.)
> 
> >>
> >>Hm, by the way, does the user need to attach pages to the process by
> >>causing page faults (as you do with memset()) before calling mvolatile()?
> >
> >For effectiveness, yes.
> >
> 
> Isn't it better to cause the page faults via get_user_pages() inside
> mvolatile()? Triggering the faults from userland just seems to add burden
> to the apps.

It seems you misunderstood. This patch's goal is to minimize the
minor fault + page allocation + memset-to-zero cost on anon pages
where possible.

If someone (like an allocator) calls madvise(DONTNEED)/munmap on a range
holding garbage-collected memory, the VM zaps all the PTEs, so if the user
tries to reuse that range, the above overheads are unavoidable.

mvolatile avoids them by not zapping the PTEs when memory pressure isn't
severe, while the VM can still discard the pages without swapping them out
when memory pressure does happen (see the sketch below).

So GUP in mvolatile isn't necessary.
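
As an illustration, an allocator's free/reuse path might use the calls
roughly like this (a sketch only; the arena bookkeeping and the names
arena_free_chunk()/arena_reuse_chunk() are hypothetical):

	/* on free of a large chunk: keep the mapping, no PTE zap; the
	 * kernel may discard the pages later under memory pressure */
	void arena_free_chunk(void *chunk, size_t len)
	{
		mvolatile(chunk, len);
	}

	/* on reuse: unmark first; contents may have been discarded, so
	 * the caller must treat the chunk as uninitialized */
	void *arena_reuse_chunk(void *chunk, size_t len)
	{
		mnovolatile(chunk, len);
		return chunk;
	}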


-- 
Kind regards,
Minchan Kim

Thread overview: 20+ messages
2012-12-18  6:47 [RFC v4 0/3] Support volatile for anonymous range Minchan Kim
2012-12-18  6:47 ` [RFC v4 1/3] Introduce new system call mvolatile Minchan Kim
2012-12-18  6:47 ` [RFC v4 2/3] Discard volatile page Minchan Kim
2012-12-18  6:47 ` [RFC v4 3/3] add PGVOLATILE vmstat count Minchan Kim
2012-12-18 18:27 ` [RFC v4 0/3] Support volatile for anonymous range Arun Sharma
2012-12-20  1:34   ` Minchan Kim
2012-12-26  2:37 ` Kamezawa Hiroyuki
2012-12-26  3:46   ` Minchan Kim
2012-12-28  0:24     ` Kamezawa Hiroyuki
2013-01-04  2:37       ` Minchan Kim
