* [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
@ 2014-04-29 21:21 ` John Stultz
  0 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

Another few weeks and another volatile ranges patchset...

After getting the sense that a major objection to the earlier
patches was the introduction of a new syscall (and its somewhat
strange dual length/purged-bit return values), I spent some time
trying to rework the vma manipulations so that we won't fail
mid-way through changing volatility (basically making it atomic).
I think I have it working, and thus, there is no longer the
need for a new syscall, and we can go back to using madvise()
to set and unset pages as volatile.


New changes are:
----------------
o Reworked vma manipulations to be atomic
o Converted back to using madvise() as syscall interface
o Integrated fix from Minchan to avoid SIGBUS faulting race
o Caught/fixed subtle use-after-free bug w/ vma merging
o Lots of minor cleanups and comment improvements


Still on the TODO list
----------------------------------------------------
o Sort out how best to do page accounting when the volatility
  is tracked on a per-mm basis.
o Revisit anonymous page aging on swapless systems
o Draft up re-adding tmpfs/shm file volatility support


Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
Hugh, and others for the great feedback and discussion at
LSF-MM.

thanks
-john


Volatile ranges provide a method for userland to inform the kernel
that a range of memory is safe to discard (i.e. it can be regenerated)
but that userspace may want to try to access it in the future.  It can
be thought of as similar to MADV_DONTNEED, except that the actual
freeing of the memory is delayed and only done under memory pressure,
and the user can try to cancel the action and quickly access any
unpurged pages. The idea originated from Android's ashmem, but I've
since learned that other OSes provide similar functionality.

This functionality allows for a number of interesting uses. One such
example is userland caches that have kernel-triggered eviction under
memory pressure. This allows the kernel to "rightsize" userspace
caches for the current system-wide workload. Things like image bitmap
caches, or rendered HTML in a hidden browser tab, where the data is
not visible and can be regenerated if needed, are good examples.

Both Chrome and Firefox already make use of volatile range-like
functionality via the ashmem interface:
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34

https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc


The basic usage of volatile ranges is as follows:
1) Userland marks a range of memory that can be regenerated if
necessary as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in
the range have been purged.

If userland accesses memory while it is volatile, it will either
get the value stored at that memory if there has been no memory
pressure or the application will get a SIGBUS if the page has been
purged.

Reads or writes to the memory do not affect the volatility state of the
pages.
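
As a rough userland sketch of the two steps above (not part of the
patches): the advice values 18 and 19 are the ones proposed in patch
2/4 and are not in mainline headers, so a kernel without this series
may reject them or interpret them as some other advice. The names
SKETCH_MADV_*, cache_release(), cache_acquire() and
nonvolatile_result() are hypothetical helpers made up for illustration,
and the sketch assumes the C library passes a positive madvise() return
value through unchanged.

```c
#define _DEFAULT_SOURCE		/* for the madvise() declaration */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Advice values as proposed in this series (patch 2/4). They are NOT in
 * mainline headers; only run this against a kernel built with the
 * series applied.
 */
#define SKETCH_MADV_VOLATILE	18
#define SKETCH_MADV_NONVOLATILE	19

enum cache_state { CACHE_INTACT, CACHE_PURGED, CACHE_ERROR };

/* Map the proposed MADV_NONVOLATILE return value to a cache state. */
enum cache_state nonvolatile_result(int ret)
{
	if (ret < 0)
		return CACHE_ERROR;
	/* 1 means at least one page in the range was purged */
	return ret == 1 ? CACHE_PURGED : CACHE_INTACT;
}

/* Step 1: the data can be regenerated, so let the kernel purge it. */
int cache_release(void *buf, size_t len)
{
	assert(buf != NULL);
	return madvise(buf, len, SKETCH_MADV_VOLATILE);
}

/* Step 2: before touching the data again, make it non-volatile. */
enum cache_state cache_acquire(void *buf, size_t len)
{
	assert(buf != NULL);
	return nonvolatile_result(madvise(buf, len, SKETCH_MADV_NONVOLATILE));
}
```

A caller would regenerate the buffer contents whenever cache_acquire()
reports CACHE_PURGED, and must not rely on the contents between
cache_release() and cache_acquire(), since touching a purged page in
that window results in SIGBUS.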

You can read more about the history of volatile ranges here (~reverse
chronological order):
https://lwn.net/Articles/592042/
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


Continuing from the last few releases, this revision is reduced in
scope when compared to earlier attempts. I've only focused on handling
volatility on anonymous memory, and we're storing the volatility in
the VMA.  This may have performance implications compared with the
earlier approach, but it does simplify things. I'm open to
expanding functionality via flags arguments, but for now I want
to keep the focus on what the right default behavior should be and keep
the use cases restricted to help get reviewer interest.

Additionally, since we don't handle volatility on tmpfs files with this
version of the patch, it cannot be used to implement semantics
similar to Android's ashmem. But since shared volatility on files is
more complex, my hope is to start small and grow from there.

Again, much of the logic in this patchset is based on Minchan's earlier
efforts, so I do want to make sure the credit goes to him for his major
contribution!

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>

John Stultz (4):
  swap: Cleanup how special swap file numbers are defined
  MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking
    vmas
  MADV_VOLATILE: Add purged page detection on setting memory
    non-volatile
  MADV_VOLATILE: Add page purging logic & SIGBUS trap

 include/linux/mm.h                     |   1 +
 include/linux/mvolatile.h              |   7 +
 include/linux/swap.h                   |  36 +++-
 include/linux/swapops.h                |  10 +
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   2 -
 mm/madvise.c                           |  14 ++
 mm/memory.c                            |   7 +
 mm/mvolatile.c                         | 353 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   5 +
 mm/vmscan.c                            |  12 ++
 12 files changed, 440 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

-- 
1.9.1



* [PATCH 1/4] swap: Cleanup how special swap file numbers are defined
  2014-04-29 21:21 ` John Stultz
@ 2014-04-29 21:21   ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

The SWP_HWPOISON and SWP_MIGRATION numbers are defined in
a fairly awkward way. Since they are stolen from the top
few values of the 1<<MAX_SWAPFILES_SHIFT bits, the values
themselves are calculated by taking the MAX_SWAPFILES value
(which is defined by subtracting out all the available special
types), and re-adding all the other various special types.

However, in order to preserve the actual values when adding
new entries, one would have to re-add the new entry's value
to all the type definitions. This gets ugly fast.

This patch tries to clean up how these values are defined so
it's simpler to understand how they are calculated and easier
to add new special values.

This is done via an enum list which tracks the various special types,
making the MAX_SWAPFILES definition much simpler. Then we just
define the special type as (MAX_SWAPFILES + <enum val>).

As long as the enum values are added to the top of the enum
instead of the bottom, the values for the types will be preserved.
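
To make the value-preservation argument concrete, here is a small
standalone sketch (not part of the patch) that compares the old
definitions with the enum-based ones after a new entry has been
inserted at the top of the enum. It assumes MAX_SWAPFILES_SHIFT is 5
and that both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE are enabled;
the OLD_* names and SKETCH_NEW_TYPE are hypothetical and exist only for
this illustration.

```c
#include <assert.h>

#define MAX_SWAPFILES_SHIFT 5	/* assumed value for this sketch */

/* Old scheme: counts are subtracted, then re-added by hand. */
#define OLD_SWP_HWPOISON_NUM	1
#define OLD_SWP_MIGRATION_NUM	2
#define OLD_MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - OLD_SWP_MIGRATION_NUM - OLD_SWP_HWPOISON_NUM)
#define OLD_SWP_HWPOISON	OLD_MAX_SWAPFILES
#define OLD_SWP_MIGRATION_READ	(OLD_MAX_SWAPFILES + OLD_SWP_HWPOISON_NUM)
#define OLD_SWP_MIGRATION_WRITE	(OLD_MAX_SWAPFILES + OLD_SWP_HWPOISON_NUM + 1)

/* New scheme, with a hypothetical extra type inserted at the top. */
enum {
	SKETCH_NEW_TYPE_NR,		/* hypothetical new entry */
	SWP_HWPOISON_NR,
	SWP_MIGRATION_READ_NR,
	SWP_MIGRATION_WRITE_NR,
	SWP_MAX_SPECIAL_TYPE_NR,
};
#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_SPECIAL_TYPE_NR)

#define SKETCH_NEW_TYPE		(MAX_SWAPFILES + SKETCH_NEW_TYPE_NR)
#define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_MIGRATION_READ_NR)
#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_MIGRATION_WRITE_NR)

/* Returns 1 if every pre-existing special type kept its old value. */
int special_types_preserved(void)
{
	return SWP_HWPOISON == OLD_SWP_HWPOISON &&
	       SWP_MIGRATION_READ == OLD_SWP_MIGRATION_READ &&
	       SWP_MIGRATION_WRITE == OLD_SWP_MIGRATION_WRITE;
}

/* Sanity check: the new type takes the slot just below the old ones. */
void check_special_types(void)
{
	assert(special_types_preserved());
	assert(SKETCH_NEW_TYPE == MAX_SWAPFILES);
}
```

Because the new entry goes at the top, MAX_SWAPFILES shrinks by one
while every existing offset grows by one, so the sums are unchanged.
Adding the entry at the bottom instead would shift the existing types
down by one.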

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..a32c3da 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+enum {
+	/*
+	 * NOTE: We use the high bits here (subtracting from
+	 * 1<<MAX_SWAPFILES_SHIFT), so to preserve the values insert
+	 * new entries here at the top of the enum, not at the bottom
+	 */
+#ifdef CONFIG_MEMORY_FAILURE
+	SWP_HWPOISON_NR,
+#endif
+#ifdef CONFIG_MIGRATION
+	SWP_MIGRATION_READ_NR,
+	SWP_MIGRATION_WRITE_NR,
+#endif
+	SWP_MAX_SPECIAL_TYPE_NR,
+};
+#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_SPECIAL_TYPE_NR)
+
 /*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
-#else
-#define SWP_MIGRATION_NUM 0
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_MIGRATION_READ_NR)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_MIGRATION_WRITE_NR)
 #endif
 
 /*
  * Handling of hardware poisoned pages with memory corruption.
  */
 #ifdef CONFIG_MEMORY_FAILURE
-#define SWP_HWPOISON_NUM 1
-#define SWP_HWPOISON		MAX_SWAPFILES
-#else
-#define SWP_HWPOISON_NUM 0
+#define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
-#define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
-- 
1.9.1



* [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-04-29 21:21 ` John Stultz
@ 2014-04-29 21:21   ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
which allow for specifying ranges of memory as volatile, and able
to be discarded by the system.

This initial patch simply adds flag handling to madvise, and the
vma handling, splitting and merging the vmas as needed, and marking
them with VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts. So credit to Minchan for his work.
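
The "check everything first, modify afterwards" structure is what lets
the vma handling avoid failing mid-way. A simplified userspace model of
the first (validation) pass might look like the sketch below. It is not
the kernel code: struct range and validate_span() are hypothetical
stand-ins for VMAs and the first loop in madvise_volatile(), and only
the file-backed and hole checks are modelled. One deliberate
assumption: running out of ranges before the end address is reached is
reported as a hole here, which is a guess at the intended semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a VMA: [start, end) plus a file-backed flag. */
struct range {
	unsigned long start, end;
	int file_backed;
};

/*
 * Check that [start, end) is fully covered by anonymous ranges.
 * 'r' must be sorted by address and non-overlapping.
 * Returns 0 if the request could proceed, -ENOMEM on a hole and
 * -EINVAL on a file-backed range. Nothing is modified.
 */
int validate_span(const struct range *r, size_t n,
		  unsigned long start, unsigned long end)
{
	size_t i = 0;

	assert(r != NULL || n == 0);

	/* Like find_vma(): first range whose end lies above start. */
	while (i < n && r[i].end <= start)
		i++;
	if (i == n)
		return -ENOMEM;

	for (; i < n; i++) {
		if (r[i].file_backed)
			return -EINVAL;
		if (start < r[i].start)
			return -ENOMEM;	/* hole before this range */
		start = r[i].end;
		if (start >= end)
			return 0;	/* whole span is covered */
	}
	/* Assumption: ranges ended before 'end', treat it as a hole. */
	return -ENOMEM;
}
```

Only once such a pass succeeds would the splitting and flag updates be
attempted, so a bad request leaves the existing ranges untouched.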

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/mm.h                     |   1 +
 include/linux/mvolatile.h              |   6 ++
 include/uapi/asm-generic/mman-common.h |   5 ++
 mm/Makefile                            |   2 +-
 mm/madvise.c                           |  14 ++++
 mm/mvolatile.c                         | 147 +++++++++++++++++++++++++++++++++
 6 files changed, 174 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..ea8b687 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
new file mode 100644
index 0000000..f53396b
--- /dev/null
+++ b/include/linux/mvolatile.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_MVOLATILE_H
+#define _LINUX_MVOLATILE_H
+
+int madvise_volatile(int bhv, unsigned long start, unsigned long end);
+
+#endif /* _LINUX_MVOLATILE_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36..b74d61d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -39,6 +39,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+
 #define MADV_HWPOISON	100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
@@ -52,6 +53,10 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_VOLATILE	18		/* Mark pages as volatile */
+#define MADV_NONVOLATILE 19		/* Mark pages non-volatile, return 1
+					   if any pages were purged  */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/Makefile b/mm/Makefile
index b484452..9a3dc62 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   iov_iter.o $(mmu-y)
+			   mvolatile.o iov_iter.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..937c026 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include <linux/blkdev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mvolatile.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -413,6 +414,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_VOLATILE:
+	case MADV_NONVOLATILE:
 		return 1;
 
 	default:
@@ -450,9 +453,14 @@ madvise_behavior_valid(int behavior)
  *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
  *		this area with pages of identical content from other such areas.
  *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ *  MADV_VOLATILE - Mark pages as volatile, allowing kernel to purge them under
+ *		pressure.
+ *  MADV_NONVOLATILE - Mark pages as non-volatile. Report if pages were purged.
  *
  * return values:
  *  zero    - success
+ *  1       - (MADV_NONVOLATILE only) some pages marked non-volatile were
+ *            purged.
  *  -EINVAL - start + len < 0, start is not page-aligned,
  *		"behavior" is not a valid value, or application
  *		is attempting to release locked or shared pages.
@@ -478,6 +486,12 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 #endif
 	if (!madvise_behavior_valid(behavior))
 		return error;
+	/*
+	 * MADV_VOLATILE/NONVOLATILE has subtle semantics that require
+	 * we don't use the generic per-vma manipulation below.
+	 */
+	if (behavior == MADV_VOLATILE || behavior == MADV_NONVOLATILE)
+		return madvise_volatile(behavior, start, start+len_in);
 
 	if (start & ~PAGE_MASK)
 		return error;
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
new file mode 100644
index 0000000..edc5894
--- /dev/null
+++ b/mm/mvolatile.c
@@ -0,0 +1,147 @@
+/*
+ * mm/mvolatile.c
+ *
+ * Copyright (C) 2014, LG Electronics, Minchan Kim <minchan@kernel.org>
+ * Copyright (C) 2014 Linaro Ltd., John Stultz <john.stultz@linaro.org>
+ */
+#include <linux/syscalls.h>
+#include <linux/mvolatile.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include <linux/mman.h>
+#include "internal.h"
+
+
+/**
+ * madvise_volatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
+ * @mode: the mode of the volatile range (volatile or non-volatile)
+ * @start: starting address of the volatile range
+ * @end: ending address of the volatile range
+ *
+ * Iterates over the VMAs in the specified range, and marks or clears
+ * them as VM_VOLATILE, splitting or merging them as needed.
+ *
+ * Returns 0 on success
+ * Returns 1 if any pages being marked were purged (MADV_NONVOLATILE only)
+ * Returns error only if no bytes were modified.
+ */
+int madvise_volatile(int mode, unsigned long start, unsigned long end)
+{
+	struct vm_area_struct *vma, *prev;
+	struct mm_struct *mm = current->mm;
+	unsigned long orig_start = start;
+	int ret = 0;
+
+	/* Bit of sanity checking */
+	if ((mode != MADV_VOLATILE) && (mode != MADV_NONVOLATILE))
+		return -EINVAL;
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (end & ~PAGE_MASK)
+		return -EINVAL;
+	if (end < start)
+		return -EINVAL;
+	if (start >= TASK_SIZE)
+		return -EINVAL;
+
+
+	down_write(&mm->mmap_sem);
+	/*
+	 * First, iterate over the VMAs and make sure
+	 * there are no holes or file vmas which would result
+	 * in -EINVAL.
+	 */
+	vma = find_vma(mm, start);
+	if (!vma) {
+		/* return ENOMEM if we're trying to mark unmapped pages */
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	while (vma) {
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* return ENOMEM if we're trying to mark unmapped pages */
+		if (start < vma->vm_start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		start = vma->vm_end;
+		if (start >= end)
+			break;
+		vma = vma->vm_next;
+	}
+
+	/*
+	 * Second, do VMA splitting. Note: If either of these
+	 * fail, we'll make no modifications to the vm_flags,
+	 * and will merge back together any unmodified split
+	 * vmas
+	 */
+	start = orig_start;
+	vma = find_vma(mm, start);
+	if (start != vma->vm_start)
+		ret = split_vma(mm, vma, start, 1);
+
+	vma = find_vma(mm, end-1);
+	/* only need to split if end addr is not at the end of the vma */
+	if (!ret && (end != vma->vm_end))
+		ret = split_vma(mm, vma, end, 0);
+
+	/*
+	 * Third, if splitting was successful modify vm_flags.
+	 * We also will do any vma merging that is needed at
+	 * this point.
+	 */
+	start = orig_start;
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	while (vma) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+
+		new_flags = vma->vm_flags;
+		if (!ret) {
+			if (mode == MADV_VOLATILE)
+				new_flags |= VM_VOLATILE;
+			else /* mode == MADV_NONVOLATILE */
+				new_flags &= ~VM_VOLATILE;
+		}
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, vma->vm_end, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (!prev)
+			prev = vma;
+		else
+			vma = prev;
+
+		vma->vm_flags = new_flags;
+
+		start = vma->vm_end;
+		if (start >= end)
+			break;
+		vma = vma->vm_next;
+	}
+out:
+	up_write(&mm->mmap_sem);
+
+	return ret;
+}
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/4] MADV_VOLATILE: Add purged page detection on setting memory non-volatile
  2014-04-29 21:21 ` John Stultz
@ 2014-04-29 21:21   ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

Users of volatile ranges will need to know if memory was discarded.
This patch adds the purged state tracking required to inform userland
when it marks memory as non-volatile that some memory in that range
was purged and needs to be regenerated.

This is a simplified implementation that reuses some of the logic
from Minchan's earlier efforts, so credit to Minchan for his work.
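
The return-value contract this gives userland can be sketched with a
small self-contained model. This is only an illustration under stated
assumptions: it does not call madvise() at all (the MADV_VOLATILE and
MADV_NONVOLATILE flags exist only with this patch set applied), and
toy_range, toy_mark_volatile(), toy_purge() and toy_mark_nonvolatile()
are hypothetical names that appear nowhere in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 4

/* Hypothetical userspace model of a volatile range; not kernel code. */
struct toy_range {
	bool is_volatile;
	bool purged[TOY_NPAGES];	/* stands in for purged pte swap entries */
};

/* Models madvise(MADV_VOLATILE); always succeeds in this sketch. */
static int toy_mark_volatile(struct toy_range *r)
{
	assert(r != NULL);
	r->is_volatile = true;
	return 0;
}

/* Models reclaim: only pages of a volatile range may be purged. */
static void toy_purge(struct toy_range *r, size_t page)
{
	assert(r != NULL);
	if (r->is_volatile && page < TOY_NPAGES)
		r->purged[page] = true;
}

/*
 * Models madvise(MADV_NONVOLATILE): returns 1 if any page in the
 * range was purged (and clears the purged markers), 0 otherwise.
 */
static int toy_mark_nonvolatile(struct toy_range *r)
{
	int was_purged = 0;
	size_t i;

	assert(r != NULL);
	for (i = 0; i < TOY_NPAGES; i++) {
		if (r->purged[i]) {
			was_purged = 1;
			r->purged[i] = false;
		}
	}
	r->is_volatile = false;
	return was_purged;
}
```

A caller following this pattern would regenerate the contents of the
range whenever the non-volatile call reports 1, and could reuse the
data as-is when it reports 0.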

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h    |  5 +++
 include/linux/swapops.h | 10 ++++++
 mm/mvolatile.c          | 87 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a32c3da..3abc977 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,7 @@ enum {
 	 * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
 	 * new entries here at the top of the enum, not at the bottom
 	 */
+	SWP_MVOLATILE_PURGED_NR,
 #ifdef CONFIG_MEMORY_FAILURE
 	SWP_HWPOISON_NR,
 #endif
@@ -81,6 +82,10 @@ enum {
 #define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
+/*
+ * Purged volatile range pages
+ */
+#define SWP_MVOLATILE_PURGED	(MAX_SWAPFILES + SWP_MVOLATILE_PURGED_NR)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c0f7526..fe9c026 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -161,6 +161,16 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+static inline swp_entry_t make_purged_entry(void)
+{
+	return swp_entry(SWP_MVOLATILE_PURGED, 0);
+}
+
+static inline int is_purged_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_MVOLATILE_PURGED;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Support for hardware poisoned pages
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index edc5894..555d5c4 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -13,8 +13,92 @@
 #include <linux/mmu_notifier.h>
 #include <linux/mm_inline.h>
 #include <linux/mman.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
+struct mvolatile_walker {
+	struct vm_area_struct *vma;
+	int page_was_purged;
+};
+
+
+/**
+ * mvolatile_check_purged_pte - Checks ptes for purged pages
+ * @pmd: pmd to walk
+ * @addr: starting address
+ * @end: end address
+ * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
+ *
+ * Iterates over the ptes in the pmd checking if they have
+ * purged swap entries.
+ *
+ * Sets the mvolatile_walker.page_was_purged to 1 if any were purged,
+ * and clears the purged pte swp entries (since the pages are no
+ * longer volatile, we don't want future accesses to SIGBUS).
+ */
+static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
+					unsigned long end, struct mm_walk *walk)
+{
+	struct mvolatile_walker *vw = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (pmd_trans_huge(*pmd))
+		return 0;
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte)) {
+			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
+
+			if (unlikely(is_purged_entry(mvolatile_entry))) {
+
+				vw->page_was_purged = 1;
+
+				/* clear the pte swp entry */
+				flush_cache_page(vw->vma, addr, pte_pfn(*pte));
+				ptep_clear_flush(vw->vma, addr, pte);
+			}
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+
+/**
+ * mvolatile_check_purged - Sets up a mm_walk to check for purged pages
+ * @vma: ptr to vma we're starting with
+ * @start: start address to walk
+ * @end: end address of walk
+ *
+ * Sets up and calls walk_page_range() to check for purged pages.
+ *
+ * Returns 1 if pages in the range were purged, 0 otherwise.
+ */
+static int mvolatile_check_purged(struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end)
+{
+	struct mvolatile_walker vw;
+	struct mm_walk mvolatile_walk = {
+		.pmd_entry = mvolatile_check_purged_pte,
+		.mm = vma->vm_mm,
+		.private = &vw,
+	};
+	vw.page_was_purged = 0;
+	vw.vma = vma;
+
+	walk_page_range(start, end, &mvolatile_walk);
+
+	return vw.page_was_purged;
+}
+
 
 /**
  * madvise_volatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
@@ -140,6 +224,9 @@ int madvise_volatile(int mode, unsigned long start, unsigned long end)
 			break;
 		vma = vma->vm_next;
 	}
+
+	if (!ret && (mode == MADV_NONVOLATILE))
+		ret = mvolatile_check_purged(vma, orig_start, end);
 out:
 	up_write(&mm->mmap_sem);
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap
  2014-04-29 21:21 ` John Stultz
@ 2014-04-29 21:21   ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

This patch adds the hooks in the vmscan logic to purge volatile pages
and mark their ptes as purged. With this, volatile pages will be purged
under memory pressure, and their ptes replaced with purged swap entries.
If a purged page is accessed before being marked non-volatile, we catch
this and send a SIGBUS.

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.
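
How userland might cope with that SIGBUS can be sketched as follows.
This is a minimal illustration, not part of the patch: the fault is
simulated with raise(SIGBUS) rather than by touching a purged page,
and toy_guarded_access(), toy_touch_purged() and toy_touch_intact()
are hypothetical names chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf toy_recover;

/* Jump back to the guarded call site instead of retrying the access. */
static void toy_sigbus_handler(int sig)
{
	(void)sig;
	siglongjmp(toy_recover, 1);
}

/*
 * Runs touch() with a temporary SIGBUS handler installed.
 * Returns 1 if SIGBUS was delivered during the call, 0 otherwise.
 */
static int toy_guarded_access(void (*touch)(void))
{
	struct sigaction sa, old;
	int faulted;

	assert(touch != NULL);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = toy_sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	/* savemask=1 so the signal mask is restored after the jump */
	if (sigsetjmp(toy_recover, 1) == 0) {
		touch();
		faulted = 0;
	} else {
		faulted = 1;
	}

	sigaction(SIGBUS, &old, NULL);
	return faulted;
}

/* Stand-in for touching a purged volatile page (simulated fault). */
static void toy_touch_purged(void)
{
	raise(SIGBUS);
}

/* Stand-in for touching a page that is still present. */
static void toy_touch_intact(void)
{
}
```

An application that accesses volatile memory without first marking it
non-volatile could wrap the access this way and regenerate the data
when the guarded call reports a fault.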

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/mvolatile.h |   1 +
 mm/internal.h             |   2 -
 mm/memory.c               |   7 +++
 mm/mvolatile.c            | 119 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                 |   5 ++
 mm/vmscan.c               |  12 +++++
 6 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
index f53396b..8b797b7 100644
--- a/include/linux/mvolatile.h
+++ b/include/linux/mvolatile.h
@@ -2,5 +2,6 @@
 #define _LINUX_MVOLATILE_H
 
 int madvise_volatile(int bhv, unsigned long start, unsigned long end);
+extern int purge_volatile_page(struct page *page);
 
 #endif /* _LINUX_MVOLATILE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 07b6736..2213055 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -240,10 +240,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 037b812..cf024bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/mvolatile.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3067,6 +3068,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_purged_entry(entry)) {
+			page_table = pte_offset_map_lock(mm, pmd, address,
+									&ptl);
+			if (likely(pte_same(*page_table, orig_pte)))
+				ret = VM_FAULT_SIGBUS;
+			goto unlock;
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 555d5c4..a7831d3 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -232,3 +232,122 @@ out:
 
 	return ret;
 }
+
+
+/**
+ * try_to_purge_one - Purge a volatile page from a vma
+ * @page: page to purge
+ * @vma: vma to purge page from
+ *
+ * Finds the pte for a page in a vma, marks the pte as purged
+ * and releases the page.
+ */
+static void try_to_purge_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte, swp_entry_to_pte(make_purged_entry()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+}
+
+
+/**
+ * try_to_purge_vpage - check vma chain and purge from vmas marked volatile
+ * @page: page to purge
+ *
+ * Goes over all the vmas that hold a page, and where the vmas are volatile,
+ * purge the page from the vma.
+ *
+ * Returns 0 on success, -1 on error.
+ */
+static int try_to_purge_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating over this loop, some processes could see a page
+	 * as purged while others see it as not purged, because there is no
+	 * global lock between parent and child protecting against the
+	 * mvolatile call during the loop. This is not a problem because the
+	 * page is not a *SHARED* page but a *COW* page, so parent and child
+	 * can see different data at any time. The worst case of this race is
+	 * that a page is purged but cannot be discarded, which causes an
+	 * unnecessary page fault, but that is not severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE)) {
+			ret = -1;
+			break;
+		}
+		try_to_purge_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return ret;
+}
+
+
+/**
+ * purge_volatile_page - If possible, purge the specified volatile page
+ * @page: page to purge
+ *
+ * Attempts to purge a volatile page, and if needed frees the swap page
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int purge_volatile_page(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	/* XXX - for now we only support anonymous volatile pages */
+	if (!PageAnon(page))
+		return -1;
+
+	if (!try_to_purge_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+	return -1;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 9c3e773..efb5c61 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..a267926 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/mvolatile.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -697,6 +698,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_PURGE,
 	PAGEREF_ACTIVATE,
 };
 
@@ -717,6 +719,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If a volatile page reaches the tail of the LRU, we discard the
+	 * page without considering recycling it.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_PURGE;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -944,6 +953,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_PURGE:
+			if (!purge_volatile_page(page))
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap
@ 2014-04-29 21:21   ` John Stultz
  0 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-04-29 21:21 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

This patch adds the hooks in the vmscan logic to purge volatile pages
and mark their ptes as purged. With this, volatile pages will be purged
under memory pressure, and their ptes replaced with a special "purged"
swap entry. If a purged page is accessed before its range is marked
non-volatile, we catch this in the fault path and send a SIGBUS.
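
A rough model of the fault-side decision, for readers who want to see the cases side by side. This is a sketch in userspace C, not the kernel code: the demo_* names are invented, DEMO_FAULT_RETRY stands for "return 0 and let the access be retried", and pte_unchanged stands for the pte_same() recheck done under the pte lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a non-present, non-swap pte entry. */
enum demo_entry { DEMO_ENTRY_MIGRATION, DEMO_ENTRY_HWPOISON, DEMO_ENTRY_PURGED, DEMO_ENTRY_BAD };
enum demo_fault { DEMO_FAULT_RETRY, DEMO_FAULT_HWPOISON, DEMO_FAULT_SIGBUS };

/* Maps an entry kind to the fault outcome, mirroring the if/else chain. */
enum demo_fault demo_fault_on_entry(enum demo_entry e, bool pte_unchanged)
{
	switch (e) {
	case DEMO_ENTRY_MIGRATION:
		return DEMO_FAULT_RETRY;	/* wait for migration, then retry */
	case DEMO_ENTRY_HWPOISON:
		return DEMO_FAULT_HWPOISON;
	case DEMO_ENTRY_PURGED:
		/* Only signal if the pte is still the purged entry we saw. */
		return pte_unchanged ? DEMO_FAULT_SIGBUS : DEMO_FAULT_RETRY;
	case DEMO_ENTRY_BAD:
		return DEMO_FAULT_SIGBUS;
	}
	assert(!"unreachable demo_entry value");
	return DEMO_FAULT_SIGBUS;
}
```

The recheck matters because the range may have been marked non-volatile, and the purged entry cleared, between the first read of the pte and taking the lock.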

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/mvolatile.h |   1 +
 mm/internal.h             |   2 -
 mm/memory.c               |   7 +++
 mm/mvolatile.c            | 119 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                 |   5 ++
 mm/vmscan.c               |  12 +++++
 6 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
index f53396b..8b797b7 100644
--- a/include/linux/mvolatile.h
+++ b/include/linux/mvolatile.h
@@ -2,5 +2,6 @@
 #define _LINUX_MVOLATILE_H
 
 int madvise_volatile(int bhv, unsigned long start, unsigned long end);
+extern int purge_volatile_page(struct page *page);
 
 #endif /* _LINUX_MVOLATILE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 07b6736..2213055 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -240,10 +240,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 037b812..cf024bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/mvolatile.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3067,6 +3068,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_purged_entry(entry)) {
+			page_table = pte_offset_map_lock(mm, pmd, address,
+									&ptl);
+			if (likely(pte_same(*page_table, orig_pte)))
+				ret = VM_FAULT_SIGBUS;
+			goto unlock;
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 555d5c4..a7831d3 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -232,3 +232,122 @@ out:
 
 	return ret;
 }
+
+
+/**
+ * try_to_purge_one - Purge a volatile page from a vma
+ * @page: page to purge
+ * @vma: vma to purge page from
+ *
+ * Finds the pte for a page in a vma, marks the pte as purged
+ * and releases the page.
+ */
+static void try_to_purge_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte, swp_entry_to_pte(make_purged_entry()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+}
+
+
+/**
+ * try_to_purge_vpage - check vma chain and purge from vmas marked volatile
+ * @page: page to purge
+ *
+ * Goes over all the vmas that map a page and, where the vmas are volatile,
+ * purges the page from the vma.
+ *
+ * Returns 0 on success, -1 on error.
+ */
+static int try_to_purge_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating this loop, some processes may see the page as
+	 * purged while others still see it as not purged, because there is
+	 * no global lock between parent and child protecting against the
+	 * mvolatile system call during the loop. That is not a problem:
+	 * the page is not a *SHARED* page but a *COW* page, so parent and
+	 * child may see different data at any time. The worst case of this
+	 * race is that a page was purged but could not be discarded, which
+	 * causes an unnecessary page fault, but that is not severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE)) {
+			ret = -1;
+			break;
+		}
+		try_to_purge_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return ret;
+}
+
+
+/**
+ * purge_volatile_page - If possible, purge the specified volatile page
+ * @page: page to purge
+ *
+ * Attempts to purge a volatile page, and if needed frees the swap page
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int purge_volatile_page(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	/* XXX - for now we only support anonymous volatile pages */
+	if (!PageAnon(page))
+		return -1;
+
+	if (!try_to_purge_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+	return -1;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 9c3e773..efb5c61 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..a267926 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/mvolatile.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -697,6 +698,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_PURGE,
 	PAGEREF_ACTIVATE,
 };
 
@@ -717,6 +719,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If a volatile page reaches the tail of the LRU, we discard it
+	 * without considering whether to recycle it.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_PURGE;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -944,6 +953,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_PURGE:
+			if (!purge_volatile_page(page))
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
-- 
1.9.1
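
For illustration, the "purged but couldn't be discarded" case described in the try_to_purge_vpage() comment can be reproduced with a toy model. This is a sketch under simplifying assumptions: struct demo_vma and demo_try_to_purge() are hypothetical names, and an array stands in for the anon_vma interval tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mapping state: is the vma volatile, was its pte purged. */
struct demo_vma {
	bool is_volatile;
	bool pte_purged;
};

/*
 * Walks the mappings in order and purges each volatile one. Stops at the
 * first non-volatile mapping and returns -1; mappings visited before that
 * point stay purged, which is the partial state the comment describes.
 */
int demo_try_to_purge(struct demo_vma *vmas, size_t n)
{
	assert(vmas != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!vmas[i].is_volatile)
			return -1;
		vmas[i].pte_purged = true;
	}
	return 0;
}
```

With mappings {volatile, non-volatile, volatile} the call returns -1 and only the first mapping ends up purged: that process takes a fault on its next access even though the page itself was not freed.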

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-04-29 21:21   ` John Stultz
@ 2014-05-08  1:21     ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  1:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

Hey John,

On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> which allows for specifying ranges of memory as volatile, and able
> to be discarded by the system.
> 
> This initial patch simply adds flag handling to madvise, and the
> vma handling, splitting and merging the vmas as needed, and marking
> them with VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> This is a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.

Removing the purged argument is a really good thing, but I'm not sure
merging the feature into the madvise syscall is a good idea.
My concern is how we support users who don't want SIGBUS.
I believe we should support them because some users (e.g. a sanitizer)
really want to avoid a MADV_NONVOLATILE call right before overwriting
their cache (e.g. if there is a purged page in a cyclic cache, the user
has to call NONVOLATILE right before overwriting it to avoid SIGBUS).
Moreover, this change makes the unmarking cost O(N), so I'd like to
avoid the NONVOLATILE syscall if possible.
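
To make the overwrite pattern concrete, here is a small in-process model. It is only a sketch: nothing below calls the kernel, all demo_* names are made up, and it assumes that unmarking clears the purged state so that a later store is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGES 4

/* Hypothetical model of one volatile range; not a kernel interface. */
struct demo_range {
	bool is_volatile;
	bool purged[DEMO_PAGES];
	int syscalls;		/* counts mark/unmark calls */
};

void demo_mark_volatile(struct demo_range *r)
{
	r->is_volatile = true;
	r->syscalls++;
}

/* Returns 1 if any page was purged while volatile, else 0. */
int demo_mark_nonvolatile(struct demo_range *r)
{
	int any = 0;

	for (size_t i = 0; i < DEMO_PAGES; i++) {
		any |= r->purged[i];
		r->purged[i] = false;	/* assumption: unmarking clears the state */
	}
	r->is_volatile = false;
	r->syscalls++;
	return any;
}

/* Models memory pressure hitting page i. */
void demo_pressure(struct demo_range *r, size_t i)
{
	assert(i < DEMO_PAGES);
	if (r->is_volatile)
		r->purged[i] = true;
}

/* Models a store: false means the access would have raised SIGBUS. */
bool demo_store_ok(const struct demo_range *r, size_t i)
{
	assert(i < DEMO_PAGES);
	return !r->purged[i];
}

/* Safe overwrite under SIGBUS semantics: unmark, store, mark again. */
bool demo_overwrite(struct demo_range *r, size_t i)
{
	(void)demo_mark_nonvolatile(r);
	bool ok = demo_store_ok(r, i);
	demo_mark_volatile(r);
	return ok;
}
```

Each overwrite of a cache slot costs two extra calls in this model, and that is the overhead I would like to avoid.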

For me, SIGBUS is the more special use case (code pages), but I believe
both are reasonable for their respective use cases. So my preference,
if you really want to merge the volatile range feature into madvise,
is that MADV_VOLATILE just gives zero-filled pages and that SIGBUS
becomes another new advice value, MADV_VOLATILE_SIGBUS.
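
Linux already gives zero-fill on the next touch for private anonymous memory dropped with MADV_DONTNEED, so it can stand in for a purge to show how an application would detect a purged page under zero-fill semantics. This is a sketch, not behaviour of this series: MADV_DONTNEED discards immediately rather than under pressure, and DEMO_MAGIC and the demo_* helpers are made-up names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical non-zero stamp the application writes at the page start. */
#define DEMO_MAGIC 0x766f6c61u

/* Returns 1 if the page no longer carries the stamp, 0 if it does. */
int demo_looks_purged(const void *page)
{
	assert(page != NULL);
	return *(const volatile uint32_t *)page != DEMO_MAGIC;
}

/*
 * Maps one private anonymous page, stamps it, drops it with MADV_DONTNEED
 * (the stand-in for a purge) and checks what the next read sees.
 * Returns 1 if the purge was detected, 0 if not, -1 on setup failure.
 */
int demo_zero_fill_roundtrip(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	if (psz <= 0)
		return -1;
	size_t len = (size_t)psz;

	uint32_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = DEMO_MAGIC;
	int before = demo_looks_purged(p);

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}
	int after = demo_looks_purged(p);

	munmap(p, len);
	return (before == 0 && after == 1) ? 1 : 0;
}
```

With this scheme the user never needs a signal handler and can check the stamp lazily, only when it actually touches the cache entry.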

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/mm.h                     |   1 +
>  include/linux/mvolatile.h              |   6 ++
>  include/uapi/asm-generic/mman-common.h |   5 ++
>  mm/Makefile                            |   2 +-
>  mm/madvise.c                           |  14 ++++
>  mm/mvolatile.c                         | 147 +++++++++++++++++++++++++++++++++
>  6 files changed, 174 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/mvolatile.h
>  create mode 100644 mm/mvolatile.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bf9811e..ea8b687 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> new file mode 100644
> index 0000000..f53396b
> --- /dev/null
> +++ b/include/linux/mvolatile.h
> @@ -0,0 +1,6 @@
> +#ifndef _LINUX_MVOLATILE_H
> +#define _LINUX_MVOLATILE_H
> +
> +int madvise_volatile(int bhv, unsigned long start, unsigned long end);
> +
> +#endif /* _LINUX_MVOLATILE_H */
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index ddc3b36..b74d61d 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -39,6 +39,7 @@
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> +
>  #define MADV_HWPOISON	100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
>  
> @@ -52,6 +53,10 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
>  
> +#define MADV_VOLATILE	18		/* Mark pages as volatile */
> +#define MADV_NONVOLATILE 19		/* Mark pages non-volatile, return 1
> +					   if any pages were purged  */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index b484452..9a3dc62 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o balloon_compaction.o vmacache.o \
>  			   interval_tree.o list_lru.o workingset.o \
> -			   iov_iter.o $(mmu-y)
> +			   mvolatile.o iov_iter.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 539eeb9..937c026 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/mvolatile.h>
>  
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
> @@ -413,6 +414,8 @@ madvise_behavior_valid(int behavior)
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> +	case MADV_VOLATILE:
> +	case MADV_NONVOLATILE:
>  		return 1;
>  
>  	default:
> @@ -450,9 +453,14 @@ madvise_behavior_valid(int behavior)
>   *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
>   *		this area with pages of identical content from other such areas.
>   *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + *  MADV_VOLATILE - Mark pages as volatile, allowing kernel to purge them under
> + *		pressure.
> + *  MADV_NONVOLATILE - Mark pages as non-volatile. Report if pages were purged.
>   *
>   * return values:
>   *  zero    - success
> + *  1       - (MADV_NONVOLATILE only) some pages marked non-volatile were
> + *            purged.
>   *  -EINVAL - start + len < 0, start is not page-aligned,
>   *		"behavior" is not a valid value, or application
>   *		is attempting to release locked or shared pages.
> @@ -478,6 +486,12 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  #endif
>  	if (!madvise_behavior_valid(behavior))
>  		return error;
> +	/*
> +	 * MADV_VOLATILE/NONVOLATILE has subtle semantics that require
> +	 * we don't use the generic per-vma manipulation below.
> +	 */
> +	if (behavior == MADV_VOLATILE || behavior == MADV_NONVOLATILE)
> +		return madvise_volatile(behavior, start, start+len_in);
>  
>  	if (start & ~PAGE_MASK)
>  		return error;
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> new file mode 100644
> index 0000000..edc5894
> --- /dev/null
> +++ b/mm/mvolatile.c
> @@ -0,0 +1,147 @@
> +/*
> + * mm/mvolatile.c
> + *
> + * Copyright (C) 2014, LG Electronics, Minchan Kim <minchan@kernel.org>
> + * Copyright (C) 2014 Linaro Ltd., John Stultz <john.stultz@linaro.org>
> + */
> +#include <linux/syscalls.h>
> +#include <linux/mvolatile.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include <linux/mman.h>
> +#include "internal.h"
> +
> +
> +/**
> + * madvise_volatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> + * @mode: the mode of the volatile range (volatile or non-volatile)
> + * @start: starting address of the volatile range
> + * @end: ending address of the volatile range
> + *
> + * Iterates over the VMAs in the specified range, and marks or clears
> + * them as VM_VOLATILE, splitting or merging them as needed.
> + *
> + * Returns 0 on success
> + * Returns 1 if any pages being marked were purged (MADV_NONVOLATILE only)
> + * Returns error only if no bytes were modified.
> + */
> +int madvise_volatile(int mode, unsigned long start, unsigned long end)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	struct mm_struct *mm = current->mm;
> +	unsigned long orig_start = start;
> +	int ret = 0;
> +
> +	/* Bit of sanity checking */
> +	if ((mode != MADV_VOLATILE) && (mode != MADV_NONVOLATILE))
> +		return -EINVAL;
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (end & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (end < start)
> +		return -EINVAL;
> +	if (start >= TASK_SIZE)
> +		return -EINVAL;
> +
> +
> +	down_write(&mm->mmap_sem);
> +	/*
> +	 * First, iterate over the VMAs and make sure
> +	 * there are no holes or file vmas which would result
> +	 * in -EINVAL.
> +	 */
> +	vma = find_vma(mm, start);
> +	if (!vma) {
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	while (vma) {
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB)) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		if (start < vma->vm_start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		start = vma->vm_end;
> +		if (start >= end)
> +			break;
> +		vma = vma->vm_next;
> +	}
> +
> +	/*
> +	 * Second, do VMA splitting. Note: If either of these
> +	 * fail, we'll make no modifications to the vm_flags,
> +	 * and will merge back together any unmodified split
> +	 * vmas
> +	 */
> +	start = orig_start;
> +	vma = find_vma(mm, start);
> +	if (start != vma->vm_start)
> +		ret = split_vma(mm, vma, start, 1);
> +
> +	vma = find_vma(mm, end-1);
> +	/* only need to split if end addr is not at the end of the vma */
> +	if (!ret && (end != vma->vm_end))
> +		ret = split_vma(mm, vma, end, 0);
> +
> +	/*
> +	 * Third, if splitting was successful modify vm_flags.
> +	 * We also will do any vma merging that is needed at
> +	 * this point.
> +	 */
> +	start = orig_start;
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	while (vma) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +
> +		new_flags = vma->vm_flags;
> +		if (!ret) {
> +			if (mode == MADV_VOLATILE)
> +				new_flags |= VM_VOLATILE;
> +			else /* mode == MADV_NONVOLATILE */
> +				new_flags &= ~VM_VOLATILE;
> +		}
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +		prev = vma_merge(mm, prev, start, vma->vm_end, new_flags,
> +					vma->anon_vma, vma->vm_file, pgoff,
> +					vma_policy(vma));
> +		if (!prev)
> +			prev = vma;
> +		else
> +			vma = prev;
> +
> +		vma->vm_flags = new_flags;
> +
> +		start = vma->vm_end;
> +		if (start >= end)
> +			break;
> +		vma = vma->vm_next;
> +	}
> +out:
> +	up_write(&mm->mmap_sem);
> +
> +	return ret;
> +}
> -- 
> 1.9.1

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
@ 2014-05-08  1:21     ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  1:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

Hey John,

On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> which allows for specifying ranges of memory as volatile, and able
> to be discarded by the system.
> 
> This initial patch simply adds flag handling to madvise, and the
> vma handling, splitting and merging the vmas as needed, and marking
> them with VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> This a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.

Remove purged argument is really good thing but I'm not sure merging
the feature into madvise syscall is good idea.
My concern is how we support user who don't want SIGBUS.
I believe we should support them because someuser(ex, sanitizer) really
want to avoid MADV_NONVOLATILE call right before overwriting their cache
(ex, If there was purged page for cyclic cache, user should call NONVOLATILE
right before overwriting to avoid SIGBUS).
Moreover, this changes made unmarking cost O(N) so I'd like to avoid
NOVOLATILE syscall if possible.

For me, SIGBUS is more special usecase for code pages but I believe
both are reasonable for each usecase so my preference is MADV_VOLATILE
is just zero-filled page and MADV_VOLATILE_SIGBUS, another new advise
if you really want to merge volatile range feature with madvise.

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/mm.h                     |   1 +
>  include/linux/mvolatile.h              |   6 ++
>  include/uapi/asm-generic/mman-common.h |   5 ++
>  mm/Makefile                            |   2 +-
>  mm/madvise.c                           |  14 ++++
>  mm/mvolatile.c                         | 147 +++++++++++++++++++++++++++++++++
>  6 files changed, 174 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/mvolatile.h
>  create mode 100644 mm/mvolatile.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bf9811e..ea8b687 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> new file mode 100644
> index 0000000..f53396b
> --- /dev/null
> +++ b/include/linux/mvolatile.h
> @@ -0,0 +1,6 @@
> +#ifndef _LINUX_MVOLATILE_H
> +#define _LINUX_MVOLATILE_H
> +
> +int madvise_volatile(int bhv, unsigned long start, unsigned long end);
> +
> +#endif /* _LINUX_MVOLATILE_H */
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index ddc3b36..b74d61d 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -39,6 +39,7 @@
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> +
>  #define MADV_HWPOISON	100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
>  
> @@ -52,6 +53,10 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
>  
> +#define MADV_VOLATILE	18		/* Mark pages as volatile */
> +#define MADV_NONVOLATILE 19		/* Mark pages non-volatile, return 1
> +					   if any pages were purged  */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index b484452..9a3dc62 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -18,7 +18,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o balloon_compaction.o vmacache.o \
>  			   interval_tree.o list_lru.o workingset.o \
> -			   iov_iter.o $(mmu-y)
> +			   mvolatile.o iov_iter.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 539eeb9..937c026 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/mvolatile.h>
>  
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
> @@ -413,6 +414,8 @@ madvise_behavior_valid(int behavior)
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> +	case MADV_VOLATILE:
> +	case MADV_NONVOLATILE:
>  		return 1;
>  
>  	default:
> @@ -450,9 +453,14 @@ madvise_behavior_valid(int behavior)
>   *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
>   *		this area with pages of identical content from other such areas.
>   *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + *  MADV_VOLATILE - Mark pages as volatile, allowing kernel to purge them under
> + *		pressure.
> + *  MADV_NONVOLATILE - Mark pages as non-volatile. Report if pages were purged.
>   *
>   * return values:
>   *  zero    - success
> + *  1       - (MADV_NONVOLATILE only) some pages marked non-volatile were
> + *            purged.
>   *  -EINVAL - start + len < 0, start is not page-aligned,
>   *		"behavior" is not a valid value, or application
>   *		is attempting to release locked or shared pages.
> @@ -478,6 +486,12 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  #endif
>  	if (!madvise_behavior_valid(behavior))
>  		return error;
> +	/*
> +	 * MADV_VOLATILE/NONVOLATILE has subtle semantics that requrie
> +	 * we don't use the generic per-vma manipulation below.
> +	 */
> +	if (behavior == MADV_VOLATILE || behavior == MADV_NONVOLATILE)
> +		return madvise_volatile(behavior, start, start+len_in);
>  
>  	if (start & ~PAGE_MASK)
>  		return error;
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> new file mode 100644
> index 0000000..edc5894
> --- /dev/null
> +++ b/mm/mvolatile.c
> @@ -0,0 +1,147 @@
> +/*
> + * mm/mvolatile.c
> + *
> + * Copyright (C) 2014, LG Electronics, Minchan Kim <minchan@kernel.org>
> + * Copyright (C) 2014 Linaro Ltd., John Stultz <john.stultz@linaro.org>
> + */
> +#include <linux/syscalls.h>
> +#include <linux/mvolatile.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include <linux/mman.h>
> +#include "internal.h"
> +
> +
> +/**
> + * madvise_volatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> + * @mode: the mode of the volatile range (volatile or non-volatile)
> + * @start: starting address of the volatile range
> + * @end: ending address of the volatile range
> + *
> + * Iterates over the VMAs in the specified range, and marks or clears
> + * them as VM_VOLATILE, splitting or merging them as needed.
> + *
> + * Returns 0 on success
> + * Returns 1 if any pages being marked were purged (MADV_NONVOLATILE only)
> + * Returns error only if no bytes were modified.
> + */
> +int madvise_volatile(int mode, unsigned long start, unsigned long end)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	struct mm_struct *mm = current->mm;
> +	unsigned long orig_start = start;
> +	int ret = 0;
> +
> +	/* Bit of sanity checking */
> +	if ((mode != MADV_VOLATILE) && (mode != MADV_NONVOLATILE))
> +		return -EINVAL;
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (end & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (end < start)
> +		return -EINVAL;
> +	if (start >= TASK_SIZE)
> +		return -EINVAL;
> +
> +
> +	down_write(&mm->mmap_sem);
> +	/*
> +	 * First, iterate ovver the VMAs and make sure
> +	 * there are no holes or file vmas which would result
> +	 * in -EINVAL.
> +	 */
> +	vma = find_vma(mm, start);
> +	if (!vma) {
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	while (vma) {
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB)) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		if (start < vma->vm_start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		start = vma->vm_end;
> +		if (start >= end)
> +			break;
> +		vma = vma->vm_next;
> +	}
> +
> +	/*
> +	 * Second, do VMA splitting. Note: If either of these
> +	 * fail, we'll make no modifications to the vm_flags,
> +	 * and will merge back together any unmodified split
> +	 * vmas
> +	 */
> +	start = orig_start;
> +	vma = find_vma(mm, start);
> +	if (start != vma->vm_start)
> +		ret = split_vma(mm, vma, start, 1);
> +
> +	vma = find_vma(mm, end-1);
> +	/* only need to split if end addr is not at the beginning of the vma */
> +	if (!ret && (end != vma->vm_end))
> +		ret = split_vma(mm, vma, end, 0);
> +
> +	/*
> +	 * Third, if splitting was successful modify vm_flags.
> +	 * We also will do any vma merging that is needed at
> +	 * this point.
> +	 */
> +	start = orig_start;
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	while (vma) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +
> +		new_flags = vma->vm_flags;
> +		if (!ret) {
> +			if (mode == MADV_VOLATILE)
> +				new_flags |= VM_VOLATILE;
> +			else /* mode == MADV_NONVOLATILE */
> +				new_flags &= ~VM_VOLATILE;
> +		}
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +		prev = vma_merge(mm, prev, start, vma->vm_end, new_flags,
> +					vma->anon_vma, vma->vm_file, pgoff,
> +					vma_policy(vma));
> +		if (!prev)
> +			prev = vma;
> +		else
> +			vma = prev;
> +
> +		vma->vm_flags = new_flags;
> +
> +		start = vma->vm_end;
> +		if (start >= end)
> +			break;
> +		vma = vma->vm_next;
> +	}
> +out:
> +	up_write(&mm->mmap_sem);
> +
> +	return ret;
> +}
> -- 
> 1.9.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] MADV_VOLATILE: Add purged page detection on setting memory non-volatile
  2014-04-29 21:21   ` John Stultz
@ 2014-05-08  1:51     ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  1:51 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Tue, Apr 29, 2014 at 02:21:22PM -0700, John Stultz wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged state tracking required to inform userland
> when it marks memory as non-volatile that some memory in that range
> was purged and needs to be regenerated.
> 
> This is a simplified implementation which uses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Acked-by: Jan Kara <jack@suse.cz>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/swap.h    |  5 +++
>  include/linux/swapops.h | 10 ++++++
>  mm/mvolatile.c          | 87 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 102 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a32c3da..3abc977 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,7 @@ enum {
>  	 * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
>  	 * new entries here at the top of the enum, not at the bottom
>  	 */
> +	SWP_MVOLATILE_PURGED_NR,
>  #ifdef CONFIG_MEMORY_FAILURE
>  	SWP_HWPOISON_NR,
>  #endif
> @@ -81,6 +82,10 @@ enum {
>  #define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
>  #endif
>  
> +/*
> + * Purged volatile range pages
> + */
> +#define SWP_MVOLATILE_PURGED	(MAX_SWAPFILES + SWP_MVOLATILE_PURGED_NR)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index c0f7526..fe9c026 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -161,6 +161,16 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +static inline swp_entry_t make_purged_entry(void)
> +{
> +	return swp_entry(SWP_MVOLATILE_PURGED, 0);
> +}
> +
> +static inline int is_purged_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_MVOLATILE_PURGED;
> +}
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  /*
>   * Support for hardware poisoned pages
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> index edc5894..555d5c4 100644
> --- a/mm/mvolatile.c
> +++ b/mm/mvolatile.c
> @@ -13,8 +13,92 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/mm_inline.h>
>  #include <linux/mman.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include "internal.h"
>  
> +struct mvolatile_walker {
> +	struct vm_area_struct *vma;
> +	int page_was_purged;
> +};
> +
> +
> +/**
> + * mvolatile_check_purged_pte - Checks ptes for purged pages
> + * @pmd: pmd to walk
> + * @addr: starting address
> + * @end: end address
> + * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
> + *
> + * Iterates over the ptes in the pmd checking if they have
> + * purged swap entries.
> + *
> + * Sets the mvolatile_walker.page_was_purged to 1 if any were purged,
> + * and clears the purged pte swp entries (since the pages are no
> + * longer volatile, we don't want future accesses to SIGBUS).
> + */
> +static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
> +					unsigned long end, struct mm_walk *walk)
> +{
> +	struct mvolatile_walker *vw = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	if (pmd_trans_huge(*pmd))
> +		return 0;
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte)) {
> +			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
> +
> +			if (unlikely(is_purged_entry(mvolatile_entry))) {
> +
> +				vw->page_was_purged = 1;
> +
> +				/* clear the pte swp entry */
> +				flush_cache_page(vw->vma, addr, pte_pfn(*pte));

We probably don't need to flush the cache here, because a purged entry has
no page mapped behind it.

> +				ptep_clear_flush(vw->vma, addr, pte);

We probably don't need this either. The present bit is never set for a purged
entry, and when I look at the internals of ptep_clear_flush(), it checks the
present bit and skips the TLB flush when it is clear. So this is okay on x86,
but I'm not sure about other architectures. A clearer function for our purpose
would be pte_clear_not_present_full().

Also, since we are changing the page table, we at least need to call the
mmu_notifier hooks so that mmu_notifier clients are informed of the change.
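
To illustrate the point about non-present entries, here is a minimal userspace
sketch, not kernel code. It models a page table as an array of words where
bit 0 is the present bit and non-present words carry a swap-style type field.
That layout is an assumption made purely for illustration; the real pte layout
is architecture specific. The names toy_pte_t, TOY_PURGED_TYPE,
toy_make_purged(), toy_is_purged() and toy_check_purged() are all hypothetical.
The sketch only shows that detecting and clearing a purged marker touches
non-present entries exclusively, so there is no mapped page whose cache or TLB
state would need flushing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: bit 0 = present, bits 1..5 = swap type (hypothetical layout). */
typedef uint64_t toy_pte_t;

#define TOY_PRESENT      0x1ULL
#define TOY_TYPE_SHIFT   1
#define TOY_TYPE_MASK    0x1fULL
#define TOY_PURGED_TYPE  0x1eULL	/* stand-in for SWP_MVOLATILE_PURGED */

/* Build a purged marker; the present bit stays clear. */
static toy_pte_t toy_make_purged(void)
{
	return TOY_PURGED_TYPE << TOY_TYPE_SHIFT;
}

/* A marker counts as purged only if it is non-present and has the purged type. */
static int toy_is_purged(toy_pte_t pte)
{
	if (pte & TOY_PRESENT)
		return 0;
	return ((pte >> TOY_TYPE_SHIFT) & TOY_TYPE_MASK) == TOY_PURGED_TYPE;
}

/*
 * Walk n entries, clear every purged marker and report whether any was seen.
 * Present entries are never modified, so nothing here would need a flush.
 */
static int toy_check_purged(toy_pte_t *ptes, size_t n)
{
	int was_purged = 0;
	size_t i;

	assert(ptes != NULL);

	for (i = 0; i < n; i++) {
		if (toy_is_purged(ptes[i])) {
			was_purged = 1;
			ptes[i] = 0;	/* empty entry: a later access would fault in a fresh page */
		}
	}
	return was_purged;
}
```

Calling toy_check_purged() on a table that holds one purged marker reports 1 and
leaves the present entries untouched; calling it again on the same table
reports 0, since the marker has been cleared.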

> +			}
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +
> +/**
> + * mvolatile_check_purged - Sets up a mm_walk to check for purged pages
> + * @vma: ptr to vma we're starting with
> + * @start: start address to walk
> + * @end: end address of walk
> + *
> + * Sets up and calls walk_page_range() to check for purged pages.
> + *
> + * Returns 1 if pages in the range were purged, 0 otherwise.
> + */
> +static int mvolatile_check_purged(struct vm_area_struct *vma,
> +					 unsigned long start,
> +					 unsigned long end)
> +{
> +	struct mvolatile_walker vw;
> +	struct mm_walk mvolatile_walk = {
> +		.pmd_entry = mvolatile_check_purged_pte,
> +		.mm = vma->vm_mm,
> +		.private = &vw,
> +	};
> +	vw.page_was_purged = 0;
> +	vw.vma = vma;
> +
> +	walk_page_range(start, end, &mvolatile_walk);
> +
> +	return vw.page_was_purged;
> +}
> +
>  
>  /**
>   * madvise_volatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> @@ -140,6 +224,9 @@ int madvise_volatile(int mode, unsigned long start, unsigned long end)
>  			break;
>  		vma = vma->vm_next;
>  	}
> +
> +	if (!ret && (mode == MADV_NONVOLATILE))
> +		ret = mvolatile_check_purged(vma, orig_start, end);
>  out:
>  	up_write(&mm->mmap_sem);
>  
> -- 
> 1.9.1
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap
  2014-04-29 21:21   ` John Stultz
@ 2014-05-08  5:16     ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  5:16 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Tue, Apr 29, 2014 at 02:21:23PM -0700, John Stultz wrote:
> This patch adds the hooks in the vmscan logic to purge volatile pages
> and mark their pte as purged. With this, volatile pages will be purged
> under pressure, and their pte swap entries marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
> 
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/mvolatile.h |   1 +
>  mm/internal.h             |   2 -
>  mm/memory.c               |   7 +++
>  mm/mvolatile.c            | 119 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/rmap.c                 |   5 ++
>  mm/vmscan.c               |  12 +++++
>  6 files changed, 144 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> index f53396b..8b797b7 100644
> --- a/include/linux/mvolatile.h
> +++ b/include/linux/mvolatile.h
> @@ -2,5 +2,6 @@
>  #define _LINUX_MVOLATILE_H
>  
>  int madvise_volatile(int bhv, unsigned long start, unsigned long end);
> +extern int purge_volatile_page(struct page *page);
>  
>  #endif /* _LINUX_MVOLATILE_H */
> diff --git a/mm/internal.h b/mm/internal.h
> index 07b6736..2213055 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -240,10 +240,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>  
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern unsigned long vma_address(struct page *page,
>  				 struct vm_area_struct *vma);
> -#endif
>  #else /* !CONFIG_MMU */
>  static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812..cf024bd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -61,6 +61,7 @@
>  #include <linux/string.h>
>  #include <linux/dma-debug.h>
>  #include <linux/debugfs.h>
> +#include <linux/mvolatile.h>
>  
>  #include <asm/io.h>
>  #include <asm/pgalloc.h>
> @@ -3067,6 +3068,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			migration_entry_wait(mm, pmd, address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> +		} else if (is_purged_entry(entry)) {
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +									&ptl);
> +			if (likely(pte_same(*page_table, orig_pte)))
> +				ret = VM_FAULT_SIGBUS;
> +			goto unlock;
>  		} else {
>  			print_bad_pte(vma, address, orig_pte, NULL);
>  			ret = VM_FAULT_SIGBUS;
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> index 555d5c4..a7831d3 100644
> --- a/mm/mvolatile.c
> +++ b/mm/mvolatile.c
> @@ -232,3 +232,122 @@ out:
>  
>  	return ret;
>  }
> +
> +
> +/**
> + * try_to_purge_one - Purge a volatile page from a vma
> + * @page: page to purge
> + * @vma: vma to purge page from
> + *
> + * Finds the pte for a page in a vma, marks the pte as purged
> + * and release the page.
> + */
> +static void try_to_purge_one(struct page *page, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t *pte;
> +	pte_t pteval;
> +	spinlock_t *ptl;
> +	unsigned long addr;
> +
> +	VM_BUG_ON(!PageLocked(page));
> +
> +	addr = vma_address(page, vma);
> +	pte = page_check_address(page, mm, addr, &ptl, 0);
> +	if (!pte)
> +		return;
> +
> +	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
> +
> +	flush_cache_page(vma, addr, page_to_pfn(page));
> +	pteval = ptep_clear_flush(vma, addr, pte);
> +
> +	update_hiwater_rss(mm);
> +	if (PageAnon(page))
> +		dec_mm_counter(mm, MM_ANONPAGES);
> +	else
> +		dec_mm_counter(mm, MM_FILEPAGES);

We can add the file-backed page part later, when we move to support vrange-file.

> +
> +	page_remove_rmap(page);
> +	page_cache_release(page);
> +
> +	set_pte_at(mm, addr, pte, swp_entry_to_pte(make_purged_entry()));
> +
> +	pte_unmap_unlock(pte, ptl);
> +	mmu_notifier_invalidate_page(mm, addr);
> +}
> +
> +
> +/**
> + * try_to_purge_vpage - check vma chain and purge from vmas marked volatile
> + * @page: page to purge
> + *
> + * Goes over all the vmas that hold a page, and where the vmas are volatile,
> + * purge the page from the vma.
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +static int try_to_purge_vpage(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	struct anon_vma_chain *avc;
> +	pgoff_t pgoff;
> +	int ret = 0;
> +
> +	anon_vma = page_lock_anon_vma_read(page);
> +	if (!anon_vma)
> +		return -1;
> +
> +	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	/*
> +	 * While iterating over this loop, some processes could see a page as
> +	 * purged while others could see it as not purged, because we have
> +	 * no global lock between parent and child protecting the mvolatile
> +	 * system call during this loop. But that's not a problem, because the
> +	 * page is not a *SHARED* page but a *COW* page, so parent and child
> +	 * can see different data at any time. The worst case of this race is
> +	 * that a page was purged but couldn't be discarded, which causes an
> +	 * unnecessary page fault, but that wouldn't be severe.
> +	 */
> +	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
> +		struct vm_area_struct *vma = avc->vma;
> +
> +		if (!(vma->vm_flags & VM_VOLATILE)) {
> +			ret = -1;
> +			break;
> +		}
> +		try_to_purge_one(page, vma);
> +	}
> +	page_unlock_anon_vma_read(anon_vma);
> +	return ret;
> +}
> +
> +
> +/**
> + * purge_volatile_page - If possible, purge the specified volatile page
> + * @page: page to purge
> + *
> + * Attempts to purge a volatile page, and if needed frees the swap page
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +int purge_volatile_page(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +	VM_BUG_ON(PageLRU(page));
> +
> +	/* XXX - for now we only support anonymous volatile pages */
> +	if (!PageAnon(page))
> +		return -1;
> +
> +	if (!try_to_purge_vpage(page)) {
> +		if (PageSwapCache(page))
> +			try_to_free_swap(page);
> +
> +		if (page_freeze_refs(page, 1)) {
> +			unlock_page(page);
> +			return 0;
> +		}
> +	}
> +	return -1;
> +}
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9c3e773..efb5c61 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  				referenced++;
>  		}
>  		pte_unmap_unlock(pte, ptl);
> +		if (vma->vm_flags & VM_VOLATILE) {
> +			pra->mapcount = 0;
> +			pra->vm_flags |= VM_VOLATILE;
> +			return SWAP_FAIL;
> +		}
>  	}
>  
>  	if (referenced) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3f56c8d..a267926 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -43,6 +43,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
>  #include <linux/prefetch.h>
> +#include <linux/mvolatile.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -697,6 +698,7 @@ enum page_references {
>  	PAGEREF_RECLAIM,
>  	PAGEREF_RECLAIM_CLEAN,
>  	PAGEREF_KEEP,
> +	PAGEREF_PURGE,
>  	PAGEREF_ACTIVATE,
>  };
>  
> @@ -717,6 +719,13 @@ static enum page_references page_check_references(struct page *page,
>  	if (vm_flags & VM_LOCKED)
>  		return PAGEREF_RECLAIM;
>  
> +	/*
> +	 * If volatile page is reached on LRU's tail, we discard the
> +	 * If a volatile page reaches the LRU's tail, we discard the
> +	 * page without considering recycling it.
> +	if (vm_flags & VM_VOLATILE)
> +		return PAGEREF_PURGE;
> +
>  	if (referenced_ptes) {
>  		if (PageSwapBacked(page))
>  			return PAGEREF_ACTIVATE;
> @@ -944,6 +953,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		switch (references) {
>  		case PAGEREF_ACTIVATE:
>  			goto activate_locked;
> +		case PAGEREF_PURGE:
> +			if (!purge_volatile_page(page))
> +				goto free_it;
>  		case PAGEREF_KEEP:
>  			goto keep_locked;
>  		case PAGEREF_RECLAIM:
> -- 
> 1.9.1
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap
@ 2014-05-08  5:16     ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  5:16 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Tue, Apr 29, 2014 at 02:21:23PM -0700, John Stultz wrote:
> This patch adds the hooks in the vmscan logic to purge volatile pages
> and mark their pte as purged. With this, volatile pages will be purged
> under pressure, and their ptes swap entry's marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
> 
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/mvolatile.h |   1 +
>  mm/internal.h             |   2 -
>  mm/memory.c               |   7 +++
>  mm/mvolatile.c            | 119 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/rmap.c                 |   5 ++
>  mm/vmscan.c               |  12 +++++
>  6 files changed, 144 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> index f53396b..8b797b7 100644
> --- a/include/linux/mvolatile.h
> +++ b/include/linux/mvolatile.h
> @@ -2,5 +2,6 @@
>  #define _LINUX_MVOLATILE_H
>  
>  int madvise_volatile(int bhv, unsigned long start, unsigned long end);
> +extern int purge_volatile_page(struct page *page);
>  
>  #endif /* _LINUX_MVOLATILE_H */
> diff --git a/mm/internal.h b/mm/internal.h
> index 07b6736..2213055 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -240,10 +240,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>  
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern unsigned long vma_address(struct page *page,
>  				 struct vm_area_struct *vma);
> -#endif
>  #else /* !CONFIG_MMU */
>  static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b812..cf024bd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -61,6 +61,7 @@
>  #include <linux/string.h>
>  #include <linux/dma-debug.h>
>  #include <linux/debugfs.h>
> +#include <linux/mvolatile.h>
>  
>  #include <asm/io.h>
>  #include <asm/pgalloc.h>
> @@ -3067,6 +3068,12 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			migration_entry_wait(mm, pmd, address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> +		} else if (is_purged_entry(entry)) {
> +			page_table = pte_offset_map_lock(mm, pmd, address,
> +									&ptl);
> +			if (likely(pte_same(*page_table, orig_pte)))
> +				ret = VM_FAULT_SIGBUS;
> +			goto unlock;
>  		} else {
>  			print_bad_pte(vma, address, orig_pte, NULL);
>  			ret = VM_FAULT_SIGBUS;
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> index 555d5c4..a7831d3 100644
> --- a/mm/mvolatile.c
> +++ b/mm/mvolatile.c
> @@ -232,3 +232,122 @@ out:
>  
>  	return ret;
>  }
> +
> +
> +/**
> + * try_to_purge_one - Purge a volatile page from a vma
> + * @page: page to purge
> + * @vma: vma to purge page from
> + *
> + * Finds the pte for a page in a vma, marks the pte as purged
> + * and release the page.
> + */
> +static void try_to_purge_one(struct page *page, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t *pte;
> +	pte_t pteval;
> +	spinlock_t *ptl;
> +	unsigned long addr;
> +
> +	VM_BUG_ON(!PageLocked(page));
> +
> +	addr = vma_address(page, vma);
> +	pte = page_check_address(page, mm, addr, &ptl, 0);
> +	if (!pte)
> +		return;
> +
> +	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
> +
> +	flush_cache_page(vma, addr, page_to_pfn(page));
> +	pteval = ptep_clear_flush(vma, addr, pte);
> +
> +	update_hiwater_rss(mm);
> +	if (PageAnon(page))
> +		dec_mm_counter(mm, MM_ANONPAGES);
> +	else
> +		dec_mm_counter(mm, MM_FILEPAGES);

We can add file-backed page part later when we move to suppport vrange-file.

> +
> +	page_remove_rmap(page);
> +	page_cache_release(page);
> +
> +	set_pte_at(mm, addr, pte, swp_entry_to_pte(make_purged_entry()));
> +
> +	pte_unmap_unlock(pte, ptl);
> +	mmu_notifier_invalidate_page(mm, addr);
> +}
> +
> +
> +/**
> + * try_to_purge_vpage - check vma chain and purge from vmas marked volatile
> + * @page: page to purge
> + *
> + * Goes over all the vmas that hold a page, and where the vmas are volatile,
> + * purge the page from the vma.
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +static int try_to_purge_vpage(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	struct anon_vma_chain *avc;
> +	pgoff_t pgoff;
> +	int ret = 0;
> +
> +	anon_vma = page_lock_anon_vma_read(page);
> +	if (!anon_vma)
> +		return -1;
> +
> +	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	/*
> +	 * During interating the loop, some processes could see a page as
> +	 * purged while others could see a page as not-purged because we have
> +	 * no global lock between parent and child for protecting mvolatile
> +	 * system call during this loop. But it's not a problem because the
> +	 * page is  not *SHARED* page but *COW* page so parent and child can
> +	 * see other data anytime. The worst case by this race is a page was
> +	 * purged but couldn't be discarded so it makes unnecessary pagefault
> +	 * but it wouldn't be severe.
> +	 */
> +	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
> +		struct vm_area_struct *vma = avc->vma;
> +
> +		if (!(vma->vm_flags & VM_VOLATILE)) {
> +			ret = -1;
> +			break;
> +		}
> +		try_to_purge_one(page, vma);
> +	}
> +	page_unlock_anon_vma_read(anon_vma);
> +	return ret;
> +}
> +
> +
> +/**
> + * purge_volatile_page - If possible, purge the specified volatile page
> + * @page: page to purge
> + *
> + * Attempts to purge a volatile page, and if needed frees the swap page
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +int purge_volatile_page(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +	VM_BUG_ON(PageLRU(page));
> +
> +	/* XXX - for now we only support anonymous volatile pages */
> +	if (!PageAnon(page))
> +		return -1;
> +
> +	if (!try_to_purge_vpage(page)) {
> +		if (PageSwapCache(page))
> +			try_to_free_swap(page);
> +
> +		if (page_freeze_refs(page, 1)) {
> +			unlock_page(page);
> +			return 0;
> +		}
> +	}
> +	return -1;
> +}
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9c3e773..efb5c61 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  				referenced++;
>  		}
>  		pte_unmap_unlock(pte, ptl);
> +		if (vma->vm_flags & VM_VOLATILE) {
> +			pra->mapcount = 0;
> +			pra->vm_flags |= VM_VOLATILE;
> +			return SWAP_FAIL;
> +		}
>  	}
>  
>  	if (referenced) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3f56c8d..a267926 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -43,6 +43,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
>  #include <linux/prefetch.h>
> +#include <linux/mvolatile.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -697,6 +698,7 @@ enum page_references {
>  	PAGEREF_RECLAIM,
>  	PAGEREF_RECLAIM_CLEAN,
>  	PAGEREF_KEEP,
> +	PAGEREF_PURGE,
>  	PAGEREF_ACTIVATE,
>  };
>  
> @@ -717,6 +719,13 @@ static enum page_references page_check_references(struct page *page,
>  	if (vm_flags & VM_LOCKED)
>  		return PAGEREF_RECLAIM;
>  
> +	/*
> +	 * If a volatile page is reached at the LRU's tail, we discard the
> +	 * page without considering recycling it.
> +	 */
> +	if (vm_flags & VM_VOLATILE)
> +		return PAGEREF_PURGE;
> +
>  	if (referenced_ptes) {
>  		if (PageSwapBacked(page))
>  			return PAGEREF_ACTIVATE;
> @@ -944,6 +953,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		switch (references) {
>  		case PAGEREF_ACTIVATE:
>  			goto activate_locked;
> +		case PAGEREF_PURGE:
> +			if (!purge_volatile_page(page))
> +				goto free_it;
>  		case PAGEREF_KEEP:
>  			goto keep_locked;
>  		case PAGEREF_RECLAIM:
> -- 
> 1.9.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-04-29 21:21 ` John Stultz
@ 2014-05-08  5:58   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08  5:58 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
> 
> After getting the sense that the a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

As I said in my reply to the other patch, I'm OK with this, but I'd
like it to be clear that we support zero-filled pages as well as SIGBUS.
If we want to use madvise, maybe we need another advice flag like
MADV_VOLATILE_SIGBUS.
> 
> 
> New changes are:
> ----------------
> o Reworked vma manipulations to be be atomic
> o Converted back to using madvise() as syscall interface
> o Integrated fix from Minchan to avoid SIGBUS faulting race
> o Caught/fixed subtle use-after-free bug w/ vma merging
> o Lots of minor cleanups and comment improvements
> 
> 
> Still on the TODO list
> ----------------------------------------------------
> o Sort out how best to do page accounting when the volatility
>   is tracked on a per-mm basis.

What is your concern about page accounting?
Could you elaborate so that everybody can understand the concern
clearly?

> o Revisit anonymous page aging on swapless systems

One idea is that we can age forcefully on a swapless system if the system
has volatile vmas or lazyfree pages. If there are no volatile vmas and no
lazyfree pages, we can stop the aging automatically.
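
To make the condition concrete, here is a minimal sketch in plain userspace C,
not kernel code. The struct, its fields, and should_age_anon() are hypothetical
names chosen for illustration and do not come from the patchset:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the counters the idea depends on. */
struct reclaim_state {
	long nr_swap_pages;      /* free swap slots; 0 on a swapless system */
	long nr_volatile_vmas;   /* vmas currently marked volatile */
	long nr_lazyfree_pages;  /* pages queued by a lazy-free mechanism */
};

/*
 * Decide whether the anonymous LRU should be aged.
 * With swap available, anonymous pages are aged as usual.
 * Without swap, aging only pays off while something purgeable exists.
 */
static bool should_age_anon(const struct reclaim_state *rs)
{
	assert(rs);
	if (rs->nr_swap_pages > 0)
		return true;
	return rs->nr_volatile_vmas > 0 || rs->nr_lazyfree_pages > 0;
}
```

The point of the second test is that aging stops by itself once the last
volatile vma or lazyfree page goes away, so swapless systems that never use
the feature pay nothing extra.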

> o Draft up re-adding tmpfs/shm file volatility support
> 
  o One concern from Minchan:
  I really like an O(1) cost for the unmarking syscall.

The vrange syscall benefits others, not the caller. I mean, if a process calls
the vrange syscall, it sacrifices its own resources for others when
an emergency happens, so if the syscall's overhead is rather expensive,
nobody will want to use it.

One idea is to put an increasing counter in mm_struct and assign the token
to the volatile vma. Maybe we can squeeze it into the lower bits of
vma->vm_start if we don't want to bloat the vma size, because we always hold
mmap_sem for write when we handle the vrange syscall.
We can then store the token together with the purged mark in the pte when the
purge happens. With this, the unmarking syscall can bail out as soon as it
finds a purged entry, so the remaining ptes keep their purged marks even though
the unmarking syscall is done. But that's no problem, because if the vma is
marked volatile again, the token will change (ie, increase) and
won't match the pte's token. When a page fault occurs, we can compare
the tokens to decide whether to emit SIGBUS. If they don't match, we can ignore
the purged mark and just map a new page to the pte.

One problem is overflow of the counter. In that case, we can deliver a false
positive to the user, but that isn't severe either, because the user has to be
prepared to handle SIGBUS if he wants to use the vrange syscall with the SIGBUS
model.
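
A rough userspace model of the token idea, just to pin down the bookkeeping.
Everything here (the sim_* structs, the function names, the 32-bit counter
width, and checking the token in the unmark path) is an assumption made for
illustration and is not taken from the patchset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPTES 8

/* Hypothetical stand-ins for mm_struct, vm_area_struct and a pte. */
struct sim_mm  { uint32_t vol_counter; };
struct sim_pte { bool purged; uint32_t token; };
struct sim_vma {
	bool volatile_now;
	uint32_t token;             /* generation taken when marked volatile */
	struct sim_pte pte[NPTES];
};

enum fault_result { FAULT_MAP_NEW_PAGE, FAULT_SIGBUS };

/* Mark volatile: take a fresh generation from the mm-wide counter. */
static void mark_volatile(struct sim_mm *mm, struct sim_vma *vma)
{
	vma->token = ++mm->vol_counter;
	vma->volatile_now = true;
}

/* Reclaim purges a page: stamp the pte with the vma's current token. */
static void purge_pte(struct sim_vma *vma, int idx)
{
	assert(vma->volatile_now);
	vma->pte[idx].purged = true;
	vma->pte[idx].token = vma->token;
}

/*
 * Unmark: bail out at the first purged pte of the current generation
 * instead of clearing every pte. Stale purged ptes are left in place.
 * Returns true if something was purged.
 */
static bool mark_nonvolatile(struct sim_vma *vma)
{
	vma->volatile_now = false;
	for (int i = 0; i < NPTES; i++)
		if (vma->pte[i].purged && vma->pte[i].token == vma->token)
			return true;
	return false;
}

/* Fault: SIGBUS only if the purge belongs to the live volatile generation. */
static enum fault_result handle_fault(struct sim_vma *vma, int idx)
{
	struct sim_pte *pte = &vma->pte[idx];

	if (pte->purged && vma->volatile_now && pte->token == vma->token)
		return FAULT_SIGBUS;
	pte->purged = false;        /* stale mark or non-volatile: map a new page */
	return FAULT_MAP_NEW_PAGE;
}
```

If the counter wraps, a stale pte token can match the new vma token again,
which is the false-positive SIGBUS case mentioned above.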

> 
> Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
> Hugh, and others for the great feedback and discussion at
> LSF-MM.
> 
> thanks
> -john
> 
> 
> Volatile ranges provides a method for userland to inform the kernel
> that a range of memory is safe to discard (ie: can be regenerated)
> but userspace may want to try access it in the future.  It can be
> thought of as similar to MADV_DONTNEED, but that the actual freeing
> of the memory is delayed and only done under memory pressure, and the
> user can try to cancel the action and be able to quickly access any
> unpurged pages. The idea originated from Android's ashmem, but I've
> since learned that other OSes provide similar functionality.
> 
> This functionality allows for a number of interesting uses. One such
> example is: Userland caches that have kernel triggered eviction under
> memory pressure. This allows for the kernel to "rightsize" userspace
> caches for current system-wide workload. Things like image bitmap
> caches, or rendered HTML in a hidden browser tab, where the data is
> not visible and can be regenerated if needed, are good examples.
> 
> Both Chrome and Firefox already make use of volatile range-like
> functionality via the ashmem interface:
> https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34
> 
> https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc
> 
> 
> The basic usage of volatile ranges is as so:
> 1) Userland marks a range of memory that can be regenerated if
> necessary as volatile
> 2) Before accessing the memory again, userland marks the memory as
> nonvolatile, and the kernel will provide notification if any pages in
> the range has been purged.
> 
> If userland accesses memory while it is volatile, it will either
> get the value stored at that memory if there has been no memory
> pressure or the application will get a SIGBUS if the page has been
> purged.
> 
> Reads or writes to the memory do not affect the volatility state of the
> pages.
> 
> You can read more about the history of volatile ranges here (~reverse
> chronological order):
> https://lwn.net/Articles/592042/
> https://lwn.net/Articles/590991/
> http://permalink.gmane.org/gmane.linux.kernel.mm/98848
> http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> https://lwn.net/Articles/522135/
> https://lwn.net/Kernel/Index/#Volatile_ranges
> 
> 
> Continuing from the last few releases, this revision is reduced in
> scope when compared to earlier attempts. I've only focused on handled
> volatility on anonymous memory, and we're storing the volatility in
> the VMA.  This may have performance implications compared with the
> earlier approach, but it does simplify the approach. I'm open to
> expanding functionality via flags arguments, but for now I'm wanting
> to keep focus on what the right default behavior should be and keep
> the use cases restricted to help get reviewer interest.
> 
> Additionally, since we don't handle volatility on tmpfs files with this
> version of the patch, it is not able to be used to implement semantics
> similar to Android's ashmem. But since shared volatiltiy on files is
> more complex, my hope is to start small and hopefully grow from there.
> 
> Again, much of the logic in this patchset is based on Minchan's earlier
> efforts, so I do want to make sure the credit goes to him for his major
> contribution!
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Keith Packard <keithp@keithp.com>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> 
> John Stultz (4):
>   swap: Cleanup how special swap file numbers are defined
>   MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking
>     vmas
>   MADV_VOLATILE: Add purged page detection on setting memory
>     non-volatile
>   MADV_VOLATILE: Add page purging logic & SIGBUS trap
> 
>  include/linux/mm.h                     |   1 +
>  include/linux/mvolatile.h              |   7 +
>  include/linux/swap.h                   |  36 +++-
>  include/linux/swapops.h                |  10 +
>  include/uapi/asm-generic/mman-common.h |   5 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |   2 -
>  mm/madvise.c                           |  14 ++
>  mm/memory.c                            |   7 +
>  mm/mvolatile.c                         | 353 +++++++++++++++++++++++++++++++++
>  mm/rmap.c                              |   5 +
>  mm/vmscan.c                            |  12 ++
>  12 files changed, 440 insertions(+), 14 deletions(-)
>  create mode 100644 include/linux/mvolatile.h
>  create mode 100644 mm/mvolatile.c
> 
> -- 
> 1.9.1

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-08  1:21     ` Minchan Kim
@ 2014-05-08 16:38       ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 16:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/07/2014 06:21 PM, Minchan Kim wrote:
> Hey John,
>
> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
>> which allows for specifying ranges of memory as volatile, and able
>> to be discarded by the system.
>>
>> This initial patch simply adds flag handling to madvise, and the
>> vma handling, splitting and merging the vmas as needed, and marking
>> them with VM_VOLATILE.
>>
>> No purging or discarding of volatile ranges is done at this point.
>>
>> This a simplified implementation which reuses some of the logic
>> from Minchan's earlier efforts. So credit to Minchan for his work.
> Remove purged argument is really good thing but I'm not sure merging
> the feature into madvise syscall is good idea.
> My concern is how we support user who don't want SIGBUS.
> I believe we should support them because someuser(ex, sanitizer) really
> want to avoid MADV_NONVOLATILE call right before overwriting their cache
> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
> right before overwriting to avoid SIGBUS).

So... Why not use MADV_FREE then for this case?

Just to be clear, by moving back to madvise, I'm not trying to replace
MADV_FREE. I think your work there is still useful and splitting the
semantics between the two is cleaner.


> Moreover, this changes made unmarking cost O(N) so I'd like to avoid
> NOVOLATILE syscall if possible.
Well, I think that change was made in v13, but yes. NONVOLATILE is currently an
expensive operation in order to keep the semantics simpler, as requested
by Johannes and Kosaki-san.


> For me, SIGBUS is more special usecase for code pages but I believe
> both are reasonable for each usecase so my preference is MADV_VOLATILE
> is just zero-filled page and MADV_VOLATILE_SIGBUS, another new advise
> if you really want to merge volatile range feature with madvise.

This I disagree with. Even for non-code page cases, SIGBUS on volatile
page access is important for normal users who might accidentally touch
volatile data, so they know they are corrupting their data. I know
Johannes suggested this is simply a use-after-free issue, but I really
feel it results in having very strange semantics. And for those cases
where there is a benefit to zero-fill, MADV_FREE seems more appropriate.
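
To spell out the split I'm arguing for, here is a tiny model in userspace C.
It only encodes the semantics discussed in this thread (SIGBUS for purged
volatile pages, zero-fill for the MADV_FREE style); the enum and function
names are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

enum range_kind { RANGE_VOLATILE, RANGE_LAZYFREE };
enum access_result { ACCESS_OLD_DATA, ACCESS_ZERO_FILL, ACCESS_SIGBUS };

/* What touching a page in the range observes, in this model. */
static enum access_result touch_page(enum range_kind kind, bool purged)
{
	if (!purged)
		return ACCESS_OLD_DATA;   /* no memory pressure hit this page */
	if (kind == RANGE_VOLATILE)
		return ACCESS_SIGBUS;     /* loud failure: data is known lost */
	assert(kind == RANGE_LAZYFREE);
	return ACCESS_ZERO_FILL;          /* silent fresh page */
}
```

The only difference between the two is the purged case, and that is exactly
where an accidental access either gets caught or quietly reads zeros.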

thanks
-john




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap
  2014-05-08  5:16     ` Minchan Kim
@ 2014-05-08 16:39       ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 16:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/07/2014 10:16 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:23PM -0700, John Stultz wrote:
>> +	update_hiwater_rss(mm);
>> +	if (PageAnon(page))
>> +		dec_mm_counter(mm, MM_ANONPAGES);
>> +	else
>> +		dec_mm_counter(mm, MM_FILEPAGES);
> We can add file-backed page part later when we move to suppport vrange-file.

Fair enough. That bit is easy to drop for now.

thanks
-john

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-05-08  5:58   ` Minchan Kim
@ 2014-05-08 17:04     ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 17:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/07/2014 10:58 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
>> Another few weeks and another volatile ranges patchset...
>>
>> After getting the sense that the a major objection to the earlier
>> patches was the introduction of a new syscall (and its somewhat
>> strange dual length/purged-bit return values), I spent some time
>> trying to rework the vma manipulations so we can be we won't fail
>> mid-way through changing volatility (basically making it atomic).
>> I think I have it working, and thus, there is no longer the
>> need for a new syscall, and we can go back to using madvise()
>> to set and unset pages as volatile.
> As I said in my reply to the other patch, I'm OK with this, but I'd
> like it to be clear that we support zero-filled pages as well as SIGBUS.
> If we want to use madvise, maybe we need another advice flag like
> MADV_VOLATILE_SIGBUS.

I still disagree that zero-fill is more obvious behavior. And again, I
still support MADV_VOLATILE and MADV_FREE both being added, as they
really do have different use cases that I'd rather not try to fit into
one operation.


>>
>> New changes are:
>> ----------------
>> o Reworked vma manipulations to be be atomic
>> o Converted back to using madvise() as syscall interface
>> o Integrated fix from Minchan to avoid SIGBUS faulting race
>> o Caught/fixed subtle use-after-free bug w/ vma merging
>> o Lots of minor cleanups and comment improvements
>>
>>
>> Still on the TODO list
>> ----------------------------------------------------
>> o Sort out how best to do page accounting when the volatility
>>   is tracked on a per-mm basis.
> What is your concern about page accounting?
> Could you elaborate so that everybody can understand the concern
> clearly?

Basically the issue is that since we keep the volatility in the vma,
when we mark a page as volatile, it's only marking the page for that
process, not globally (since the page may be COWed). This makes
accurately keeping track of the number of actual pages that are volatile
somewhat difficult, since we can't just add one for each page marked and
subtract one for each page unmarked (for tmpfs/shm file based
volatility, where volatility is shared globally, this will be much easier ;).

It might not be too hard to keep a per-process count of volatile
pages, but in that case we could see some strange effects where it
seems like there are 3x the number of actual volatile pages, and that
might throw off some of the scanning. So it's something I've deferred a
bit to think about.
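
As a rough illustration of the accounting problem described above, here is a toy userspace model, not code from the patchset. Every name in it (proc_model, count_per_process, count_unique_pages, MAX_PFN) is hypothetical. It assumes three processes that share the same four COW pages and have each marked them volatile, and compares a per-process tally with a count of distinct physical pages.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PFN 64	/* size of the toy physical page space (assumption) */

/* Hypothetical model: the physical pages one process has marked volatile. */
struct proc_model {
	const unsigned *pfns;	/* page frame numbers marked by this process */
	size_t nr;		/* number of entries in pfns */
};

/* Per-process accounting: add one for every page each process marked. */
size_t count_per_process(const struct proc_model *procs, size_t nr_procs)
{
	size_t total = 0;

	for (size_t i = 0; i < nr_procs; i++)
		total += procs[i].nr;
	return total;
}

/* Count each physical page once, however many processes marked it. */
size_t count_unique_pages(const struct proc_model *procs, size_t nr_procs)
{
	bool seen[MAX_PFN] = { false };
	size_t total = 0;

	for (size_t i = 0; i < nr_procs; i++) {
		for (size_t j = 0; j < procs[i].nr; j++) {
			unsigned pfn = procs[i].pfns[j];

			if (pfn < MAX_PFN && !seen[pfn]) {
				seen[pfn] = true;
				total++;
			}
		}
	}
	return total;
}
```

With three processes sharing four pages, the per-process tally comes to 12 while only 4 physical pages are actually volatile, which is the 3x effect mentioned above.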



>> o Revisit anonymous page aging on swapless systems
> One idea is that we can age forcefully on a swapless system if the system
> has volatile vmas or lazyfree pages. If the number of volatile vmas or
> lazyfree pages is zero, we can stop the aging automatically.

I'll look into this some more.


>
>> o Draft up re-adding tmpfs/shm file volatility support
>>
>   o One concern from minchan.
>   I really like O(1) cost of unmarking syscall.
>
> Vrange syscall is for others, not itself. I mean if some process calls
> the vrange syscall, it sacrifices its own resources for others when an
> emergency happens, so if the syscall overhead is rather expensive,
> nobody will want to use it.

So yes. I agree the cost is more expensive than I'd like. However, I'd
like to get a consensus on the expected behavior established and get
folks first agreeing to the semantics and the interface. Then we can
follow up with optimizations.

> One idea is to put an increasing counter in mm_struct and assign the token
> to the volatile vma. Maybe we can squeeze it into vma->vm_start's lower
> bits if we don't want to bloat the vma size, because we always hold mmap_sem
> with the write-side lock when we handle the vrange syscall.
> And we can store the token together with the purged mark in the pte when
> the purge happens. With this, we can bail out as soon as we find a purged
> entry in the unmarking syscall, so the remaining ptes still carry the purged
> mark although the unmarking syscall is done. But that's no problem, because
> if the vma is marked as volatile again, the token will change (ie, increase)
> and won't match the pte's token. When a page fault occurs, we can compare
> the tokens to emit SIGBUS. If they don't match, we can ignore it and just
> map a new page to the pte.
>
> One problem is overflow of the counter. In that case, we can deliver a false
> positive to the user, but it isn't severe either, because the user has to be
> prepared to handle SIGBUS if he wants to use the vrange syscall with the
> SIGBUS model.

This sounds like an interesting optimization. But again, I worry that
adding these edge cases (which I honestly really don't see as
problematic) muddies the water and keeps reviewers away. I'd rather wait
until after we have something settled behavior wise, then start
discussing these performance optimizations that may cause
safe-but-false-positives.
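
To make the token idea concrete, here is a minimal userspace sketch of how the comparison might work. It is only a model under stated assumptions: the structures and functions (mm_model, vma_model, pte_model, mark_volatile, purge_page, handle_fault) are hypothetical stand-ins, the 4-bit token width is arbitrary, and the check that the vma is currently volatile before raising SIGBUS is an assumption not spelled out in the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOKEN_BITS 4u				/* arbitrary width (assumption) */
#define TOKEN_MASK ((1u << TOKEN_BITS) - 1u)

/* Hypothetical stand-ins for mm_struct, vm_area_struct and a pte. */
struct mm_model  { uint32_t next_token; };
struct vma_model { bool is_volatile; uint32_t token; };
struct pte_model { bool purged; uint32_t token; };

enum fault_action { FAULT_MAP_NEW_PAGE, FAULT_SIGBUS };

/* Marking hands the vma a fresh token from the per-mm counter. */
void mark_volatile(struct mm_model *mm, struct vma_model *vma)
{
	vma->token = mm->next_token;
	mm->next_token = (mm->next_token + 1u) & TOKEN_MASK;	/* may wrap */
	vma->is_volatile = true;
}

/* Unmarking is O(1) here: stale purged ptes are left in place. */
void mark_nonvolatile(struct vma_model *vma)
{
	vma->is_volatile = false;
}

/* Purging records the vma's current token next to the purged mark. */
void purge_page(const struct vma_model *vma, struct pte_model *pte)
{
	pte->purged = true;
	pte->token = vma->token;
}

/* On fault, only a purged pte whose token matches raises SIGBUS. */
enum fault_action handle_fault(const struct vma_model *vma,
			       struct pte_model *pte)
{
	if (pte->purged && vma->is_volatile && pte->token == vma->token)
		return FAULT_SIGBUS;

	pte->purged = false;	/* stale entry: ignore it and map a new page */
	return FAULT_MAP_NEW_PAGE;
}
```

In this model a pte purged under an older marking faults in a fresh page after the range is re-marked, but once the counter wraps around (16 re-markings with a 4-bit token) a stale pte can match again and produce the false-positive SIGBUS discussed above.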


Thanks so much for your review and guidance here (I was worried I had
lost everyone's attention again). I really appreciate the feedback!

thanks
-john







^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-04-29 21:21 ` John Stultz
@ 2014-05-08 17:12   ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 17:12 UTC (permalink / raw)
  To: LKML, Johannes Weiner
  Cc: Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
	Neil Brown, Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	linux-mm

On 04/29/2014 02:21 PM, John Stultz wrote:
> Another few weeks and another volatile ranges patchset...
>
> After getting the sense that a major objection to the earlier
> patches was the introduction of a new syscall (and its somewhat
> strange dual length/purged-bit return values), I spent some time
> trying to rework the vma manipulations so we can be sure we won't fail
> mid-way through changing volatility (basically making it atomic).
> I think I have it working, and thus, there is no longer the
> need for a new syscall, and we can go back to using madvise()
> to set and unset pages as volatile.

Johannes: To get some feedback, maybe I'll needle you directly here a
bit. :)

Does moving this interface to madvise help reduce your objections?  I
feel like your cleaning-the-dirty-bit idea didn't work out, but I was
hoping that by reworking the vma manipulations to be atomic, we could
move to madvise and still avoid the new syscall that you seemed bothered
by. But I've not really heard much from you recently so I worry your
concerns on this were actually elsewhere, and I'm just churning the
patch needlessly.

thanks
-john




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] MADV_VOLATILE: Add purged page detection on setting memory non-volatile
  2014-05-08  1:51     ` Minchan Kim
@ 2014-05-08 21:45       ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 21:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/07/2014 06:51 PM, Minchan Kim wrote:
> On Tue, Apr 29, 2014 at 02:21:22PM -0700, John Stultz wrote:
>> +/**
>> + * mvolatile_check_purged_pte - Checks ptes for purged pages
>> + * @pmd: pmd to walk
>> + * @addr: starting address
>> + * @end: end address
>> + * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
>> + *
>> + * Iterates over the ptes in the pmd checking if they have
>> + * purged swap entries.
>> + *
>> + * Sets the mvolatile_walker.page_was_purged to 1 if any were purged,
>> + * and clears the purged pte swp entries (since the pages are no
>> + * longer volatile, we don't want future accesses to SIGBUS).
>> + */
>> +static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
>> +					unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct mvolatile_walker *vw = walk->private;
>> +	pte_t *pte;
>> +	spinlock_t *ptl;
>> +
>> +	if (pmd_trans_huge(*pmd))
>> +		return 0;
>> +	if (pmd_trans_unstable(pmd))
>> +		return 0;
>> +
>> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +		if (!pte_present(*pte)) {
>> +			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
>> +
>> +			if (unlikely(is_purged_entry(mvolatile_entry))) {
>> +
>> +				vw->page_was_purged = 1;
>> +
>> +				/* clear the pte swp entry */
>> +				flush_cache_page(vw->vma, addr, pte_pfn(*pte));
> Maybe we don't need to flush the cache because there is no mapped page.
>
>> +				ptep_clear_flush(vw->vma, addr, pte);
> Maybe we don't need this, either. We didn't set the present bit for the purged
> page, and when I look at the internals of ptep_clear_flush, it checks the present
> bit and skips the TLB flush, so it's okay for x86, but I'm not sure about other
> architectures. A clearer function for our purpose would be pte_clear_not_present_full.

Ok.. basically I just wanted to zap the pseudo-swp entry, so it will be
zero-filled from here on out.


> And we are changing the page table, so at the least we need to call the
> mmu_notifier to inform its clients of the change.

So yes, this is one item from my last iteration that I didn't act on
yet. It wasn't clear to me here that we need to do the mmu_notifier,
since the page is evicted earlier via try_to_purge_one (and we do notify
then). But in just removing the pseudo-swap entry we need to do a
notification as well? Is there someplace where the mmu_notifier rules
are better documented?

thanks
-john


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-08 16:38       ` John Stultz
@ 2014-05-08 23:12         ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08 23:12 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
> On 05/07/2014 06:21 PM, Minchan Kim wrote:
> > Hey John,
> >
> > On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> >> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> >> which allows for specifying ranges of memory as volatile, and able
> >> to be discarded by the system.
> >>
> >> This initial patch simply adds flag handling to madvise, and the
> >> vma handling, splitting and merging the vmas as needed, and marking
> >> them with VM_VOLATILE.
> >>
> >> No purging or discarding of volatile ranges is done at this point.
> >>
> >> This is a simplified implementation which reuses some of the logic
> >> from Minchan's earlier efforts. So credit to Minchan for his work.
> > Removing the purged argument is a really good thing, but I'm not sure merging
> > the feature into the madvise syscall is a good idea.
> > My concern is how we support users who don't want SIGBUS.
> > I believe we should support them because some users (ex, sanitizer) really
> > want to avoid the MADV_NONVOLATILE call right before overwriting their cache
> > (ex, if there was a purged page in a cyclic cache, the user would have to call
> > NONVOLATILE right before overwriting to avoid SIGBUS).
> 
> So... Why not use MADV_FREE then for this case?

MADV_FREE is a one-shot operation. I mean we have to call it again to make
the pages lazyfree, while vrange could preserve volatility.
Please think about the thread-sanitizer use case. They mmap 70TB once at startup
and want to mark the range as volatile. If they use MADV_FREE instead of
volatile, they have to mark 70TB as lazyfree periodically, which is terrible
because MADV_FREE's cost is O(N).
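
The one-shot behavior can be sketched from userspace as follows. This is a hedged illustration, not part of the patchset: lazyfree_cycle is a hypothetical helper, and it assumes a kernel and libc that expose MADV_FREE through madvise() (mainline Linux gained it in 4.5, after this thread). Each round dirties the buffer again, which cancels the earlier lazy-free state, so the advice has to be reissued over the whole range every time.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: repeatedly reuse an anonymous buffer and hand it
 * back with MADV_FREE. Returns 0 on success or a negative errno value.
 */
int lazyfree_cycle(size_t npages, int rounds)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	char *buf;
	int rc = 0;

	if (page <= 0)
		return -EINVAL;
	len = npages * (size_t)page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -errno;

	for (int i = 0; i < rounds && rc == 0; i++) {
		/* Writing to the pages cancels any earlier lazy-free state. */
		memset(buf, 0xaa, len);
#ifdef MADV_FREE
		/* So the advice must be reissued over the whole range. */
		if (madvise(buf, len, MADV_FREE) != 0)
			rc = -errno;
#else
		rc = -ENOSYS;	/* headers without MADV_FREE */
#endif
	}

	munmap(buf, len);
	return rc;
}
```

A caller would invoke something like lazyfree_cycle(4, 3); with a persistent volatile marking, the per-round madvise() call over the full range would not be needed.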

> 
> Just to be clear, by moving back to madvise, I'm not trying to replace
> MADV_FREE. I think your work there is still useful and splitting the
> semantics between the two is cleaner.

I know.
The new vrange syscall, which works with the existing VMAs instead of a new
vrange interval tree, removed a big concern from mm folks about duplicating
the management layer (ex, vm_area_struct and vrange interval tree), and
it removed my concern about mmap_sem write-side lock scalability for the
allocator use case, so we can make the implementation simple and clear.
I like it, but zero-page vs SIGBUS is another issue we should reach an
agreement on.

> 
> 
> > Moreover, this change made the unmarking cost O(N), so I'd like to avoid
> > the NONVOLATILE syscall if possible.
> Well, I think that was made in v13, but yes. NONVOLATILE is currently an
> expensive operation in order to keep the semantics simpler, as requested
> by Johannes and Kosaki-san.
> 
> 
> > For me, SIGBUS is a more special use case for code pages, but I believe
> > both are reasonable for their own use cases, so my preference is that
> > MADV_VOLATILE is just zero-filled pages and MADV_VOLATILE_SIGBUS is another
> > new advice, if you really want to merge the volatile range feature with madvise.
> 
> This I disagree with. Even for non-code page cases, SIGBUS on volatile
> page access is important for normal users who might accidentally touch
> volatile data, so they know they are corrupting their data. I know
> Johannes suggested this is simply a use-after-free issue, but I really
> feel it results in having very strange semantics. And for those cases
> where there is a benefit to zero-fill, MADV_FREE seems more appropriate.

I already explained above why MADV_FREE is not enough.

> 
> thanks
> -john
> 
> 
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-05-08 17:04     ` John Stultz
@ 2014-05-08 23:29       ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08 23:29 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 10:04:49AM -0700, John Stultz wrote:
> On 05/07/2014 10:58 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:19PM -0700, John Stultz wrote:
> >> Another few weeks and another volatile ranges patchset...
> >>
> >> After getting the sense that a major objection to the earlier
> >> patches was the introduction of a new syscall (and its somewhat
> >> strange dual length/purged-bit return values), I spent some time
> >> trying to rework the vma manipulations so we can be sure we won't fail
> >> mid-way through changing volatility (basically making it atomic).
> >> I think I have it working, and thus, there is no longer the
> >> need for a new syscall, and we can go back to using madvise()
> >> to set and unset pages as volatile.
> > As I said in my reply to the other patch, I'm OK with this, but I'd
> > like to make it clear that we should support zero-filled pages as well
> > as SIGBUS. If we want to use madvise, maybe we need another advice flag
> > like MADV_VOLATILE_SIGBUS.
> 
> I still disagree that zero-fill is more obvious behavior. And again, I
> still support MADV_VOLATILE and MADV_FREE both being added, as they
> really do have different use cases that I'd rather not try to fit into
> one operation.

As I replied in the previous mail, MADV_FREE is a one-shot operation, so pages
faulted in afterwards are not affected, and the caller has to call the syscall
again at some point to make the range volatile again. MADV_FREE is also O(N),
so vrange with zero-fill could avoid that cost entirely.

> 
> 
> >>
> >> New changes are:
> >> ----------------
> >> o Reworked vma manipulations to be atomic
> >> o Converted back to using madvise() as syscall interface
> >> o Integrated fix from Minchan to avoid SIGBUS faulting race
> >> o Caught/fixed subtle use-after-free bug w/ vma merging
> >> o Lots of minor cleanups and comment improvements
> >>
> >>
> >> Still on the TODO list
> >> ----------------------------------------------------
> >> o Sort out how best to do page accounting when the volatility
> >>   is tracked on a per-mm basis.
> > What is your concern about page accounting?
> > Could you elaborate on it so that everybody can understand your concern
> > clearly?
> 
> Basically the issue is that since we keep the volatility in the vma,
> when we mark a page as volatile, it's only marking the page for that
> process, not globally (since the page may be COWed). This makes
> accurately keeping track of the number of actual pages that are volatile
> somewhat difficult, since we can't just add one for each page marked and
> subtract one for each page unmarked (for tmpfs/shm file based
> volatility, where volatility is shared globally, this will be much easier ;).
> 
> It might not be too hard to keep a per-process count of volatile
> pages, but in that case we could see some strange effects where it
> seems like there are 3x the number of actual volatile pages, and that
> might throw off some of the scanning. So it's something I've deferred a
> bit to think about.

Okay. So, why do you want to account volatile pages?
Originally, what I expected was to age the anonymous LRU list until the
count drops to zero, so the aging overhead would be zero if there are no
volatile pages left in the system, but the downside of that approach is that
it makes the vrange marking syscall cost O(N). That's why I suggested counting
volatile *vmas* instead of volatile *pages*. It could cause unnecessary aging
of the anon LRU list if there are no physical pages in the vma, but I think
it's a good deal because we move the overhead from the hot path to the slow
path, and that's one of the design goals of the vrange syscall.
We might make an effort to make such aging less aggressive in the future,
which would be another topic.
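
A minimal sketch of the vma-counting idea, under the assumption that a single system-wide counter is enough for the aging decision. The names (sys_model, volatile_vma_added, volatile_vma_removed, should_age_anon_lru) are hypothetical and do not come from the patchset. Marking and unmarking touch only a counter, so they stay O(1) regardless of how many pages the vma covers.

```c
#include <stdbool.h>

/* Hypothetical system-wide state used to decide on anon LRU aging. */
struct sys_model {
	bool has_swap;
	unsigned long nr_volatile_vmas;
	unsigned long nr_lazyfree_pages;
};

/* O(1): called when a vma is marked volatile. */
void volatile_vma_added(struct sys_model *s)
{
	s->nr_volatile_vmas++;
}

/* O(1): called when a volatile vma is unmarked or torn down. */
void volatile_vma_removed(struct sys_model *s)
{
	if (s->nr_volatile_vmas > 0)
		s->nr_volatile_vmas--;
}

/*
 * On a swapless system, age the anonymous LRU only while something
 * purgeable might exist; with swap, age as usual.
 */
bool should_age_anon_lru(const struct sys_model *s)
{
	return s->has_swap ||
	       s->nr_volatile_vmas > 0 ||
	       s->nr_lazyfree_pages > 0;
}
```

The trade-off is the one described above: a volatile vma with no resident pages still keeps aging switched on, but the cost lands in reclaim rather than in the marking syscall.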

> 
> 
> 
> >> o Revisit anonymous page aging on swapless systems
> > One idea is that we can age forcefully on a swapless system if the system
> > has volatile vmas or lazyfree pages. If the number of volatile vmas or
> > lazyfree pages is zero, we can stop the aging automatically.
> 
> I'll look into this some more.
> 
> 
> >
> >> o Draft up re-adding tmpfs/shm file volatility support
> >>
> >   o One concern from minchan.
> >   I really like O(1) cost of unmarking syscall.
> >
> > The vrange syscall is for others, not for the caller itself. I mean, if
> > a process calls the vrange syscall, it sacrifices its own resources for
> > others when an emergency happens, so if the syscall is rather
> > expensive, nobody will want to use it.
> 
> So yes. I agree the cost is more expensive than I'd like. However, I'd
> like to get a consensus on the expected behavior established and get
> folks first agreeing to the semantics and the interface. Then we can
> follow up with optimizations.

Oops, I forgot to mention "we could do it as an optimization in the future".
I absolutely agree with you. I don't want to do that at this stage; I just
want to record one idea for optimizing it, so don't get me wrong. It's not
an objection.

> 
> > One idea is to put an increasing counter in mm_struct and assign its
> > value as a token to each volatile vma. Maybe we can squeeze it into
> > the lower bits of vma->vm_start if we don't want to bloat the vma,
> > because we always hold mmap_sem for write when we handle the vrange
> > syscall.
> > When a purge happens, we store the token in the pte together with the
> > purged mark. With this, the unmarking syscall can bail out as soon as
> > it finds a purged entry, so the remaining ptes keep their purged
> > entries even though the unmarking syscall is done. That's no problem,
> > because if the vma is marked volatile again, the token will change
> > (i.e., increase) and no longer match the pte's token. When a page
> > fault occurs, we compare the tokens and emit SIGBUS if they match. If
> > they don't match, we can ignore the stale entry and just map a new
> > page into the pte.
> >
> > One problem is counter overflow. In that case we can deliver a false
> > positive to the user, but it isn't severe, because the user must
> > already be prepared to handle SIGBUS if he wants to use the vrange
> > syscall with the SIGBUS model.
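
A minimal userspace model of the token idea quoted above (not kernel
code; all types and names are hypothetical, and the token width is
deliberately tiny so that wrap-around is easy to demonstrate). It
assumes a fault raises SIGBUS only when the vma is volatile and the
token stored in the purged pte matches the vma's current token.

```c
#include <stdint.h>

#define TOKEN_BITS 4                        /* assumption: tiny on purpose */
#define TOKEN_MASK ((1u << TOKEN_BITS) - 1)

struct toy_mm  { uint32_t next_token; };
struct toy_vma { int is_volatile; uint32_t token; };
struct toy_pte { int purged; uint32_t token; };

enum fault_result { FAULT_MAP_NEW_PAGE, FAULT_SIGBUS };

/* Marking takes a fresh token from the per-mm counter. */
static void mark_volatile(struct toy_mm *mm, struct toy_vma *vma)
{
	mm->next_token = (mm->next_token + 1) & TOKEN_MASK;
	vma->token = mm->next_token;
	vma->is_volatile = 1;
}

/* Unmarking does not visit any pte, so stale purged entries may remain. */
static void mark_nonvolatile(struct toy_vma *vma)
{
	vma->is_volatile = 0;
}

/* Purging records the vma's current token next to the purged mark. */
static void purge(const struct toy_vma *vma, struct toy_pte *pte)
{
	pte->purged = 1;
	pte->token = vma->token;
}

/* SIGBUS only for a purged entry whose token is still current. */
static enum fault_result handle_fault(const struct toy_vma *vma,
				      struct toy_pte *pte)
{
	if (pte->purged && vma->is_volatile && pte->token == vma->token)
		return FAULT_SIGBUS;
	pte->purged = 0;                    /* stale entry: map a new page */
	return FAULT_MAP_NEW_PAGE;
}
```

With a 4-bit token, sixteen re-markings bring the counter back to an
old value, so a stale purged entry can match again and produce the
safe-but-false-positive SIGBUS described above.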
> 
> This sounds like an interesting optimization. But again, I worry that
> adding these edge cases (which I honestly really don't see as
> problematic) muddies the water and keeps reviewers away. I'd rather wait
> until after we have something settled behavior wise, then start
> discussing these performance optimizations that may cause
> safe-but-false-positives.
> 
> 
> Thanks so much for your review and guidance here (I was worried I had
> lost everyone's attention again). I really appreciate the feedback!
> 
> thanks
> -john

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-08 23:12         ` Minchan Kim
@ 2014-05-08 23:43           ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-08 23:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/08/2014 04:12 PM, Minchan Kim wrote:
> On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
>> On 05/07/2014 06:21 PM, Minchan Kim wrote:
>>> Hey John,
>>>
>>> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
>>>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
>>>> which allows for specifying ranges of memory as volatile, and able
>>>> to be discarded by the system.
>>>>
>>>> This initial patch simply adds flag handling to madvise, and the
>>>> vma handling, splitting and merging the vmas as needed, and marking
>>>> them with VM_VOLATILE.
>>>>
>>>> No purging or discarding of volatile ranges is done at this point.
>>>>
>>>> This a simplified implementation which reuses some of the logic
>>>> from Minchan's earlier efforts. So credit to Minchan for his work.
>>> Remove purged argument is really good thing but I'm not sure merging
>>> the feature into madvise syscall is good idea.
>>> My concern is how we support user who don't want SIGBUS.
>>> I believe we should support them because someuser(ex, sanitizer) really
>>> want to avoid MADV_NONVOLATILE call right before overwriting their cache
>>> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
>>> right before overwriting to avoid SIGBUS).
>> So... Why not use MADV_FREE then for this case?
> MADV_FREE is one-shot operation. I mean we should call it again to make
> them lazyfree while vrange could preserve volatility.
> Pz, think about thread-sanitizer usecase. They do mmap 70TB once start up
> and want to mark the range as volatile. If they uses MADV_FREE instead of
> volatile, they should mark 70TB as lazyfree periodically, which is terrible
> because MADV_FREE's cost is O(N).

I still have had difficulty seeing the thread-sanitizer usage as a
generic enough model for other applications. I realize they want to
avoid marking and unmarking ranges (and they want that marking and
unmarking to be very cheap), but the zero-fill purged page (while still
preserving volatility) causes lots of *very* strange behavior:

* How do general applications know the difference between a purged page
and a valid empty page?
* When reading/writing a page, what happens if half-way the application
is preempted, and the page is purged?
* If a volatile page is purged, then zero-filled on a read or write,
what is its purged state when we're marking it non-volatile?

These use cases don't seem completely baked, or maybe I've just not been
able to comprehend them yet. But I don't quite understand the desire to
prioritize this style of usage over other, simpler and better
established usage.

I'll grant that there may be some form of semantics that works for this,
and I'm open to considering support for it at some point if it
becomes more clear, but I don't think these stranger (to me at least)
cases should be the default, and I really worry that these requests
continue to make the basic usage harder to understand for reviewers.


>> Just to be clear, by moving back to madvise, I'm not trying to replace
>> MADV_FREE. I think you're work there is still useful and splitting the
>> semantics between the two is cleaner.
> I know.
> New vrange syscall which works with existing VMA instead of new vrange
> interval tree removed big concern from mm folks about duplicating
> of manage layer(ex, vm_area_struct and vrange inteval tree) and
> it removed my concern that mmap_sem write-side lock scalability for
> allocator usecase so we can make the implemenation simple and clear.
> I like it but zero-page VS SIGBUS is another issue we should make an
> agreement.

Zero-fill makes sense to me for MADV_FREE, where we're not trying to
recover the data, but just save the cost of releasing and re-faulting
possibly frequently used pages. The contents are not intended to be
recovered. Thus the semantics there are reasonable.

With volatility (which persists until marked non-volatile), zero-filled
purged page access breaks quite a bit of the established semantics (see
the strange behavior questions listed above).

With SIGBUS semantics, it's very clear and much simpler. The
application has hit a page that no longer exists and is clearly
notified (via SIGBUS). There's no way for the purged state to become
lost (other than the application ignoring the return value from
MADV_NONVOLATILE).
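
As a rough userspace illustration of the SIGBUS model: MADV_VOLATILE is
not available in mainline kernels, so this sketch uses a shared file
mapping that is truncated behind the application's back as a stand-in
for a purged page. The helper names (try_read, sigbus_demo) and the
temporary file path are made up for this example, and it assumes a
Linux system with a writable /tmp.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);      /* unwind out of the faulting read */
}

/* Read one byte; return 0 on success, -1 if the access raised SIGBUS
 * (or if the handler could not be installed). */
static int try_read(const volatile char *p, char *out)
{
	struct sigaction sa, old;
	int rc;

	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(sigbus_env, 1) == 0) {
		*out = *p;              /* may fault */
		rc = 0;
	} else {
		rc = -1;                /* the page is gone */
	}
	sigaction(SIGBUS, &old, NULL);
	return rc;
}

/* Returns 0 if the intact page was readable and the "purged" page
 * reported SIGBUS, -1 otherwise. */
static int sigbus_demo(void)
{
	char path[] = "/tmp/sigbus-demo-XXXXXX";
	long pagesz = sysconf(_SC_PAGESIZE);
	char c = 0;
	char *map;
	int fd, before, after;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, pagesz) != 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	map[0] = 'x';
	before = try_read(map, &c);     /* page present: expect 0 */
	if (ftruncate(fd, 0) != 0)      /* stand-in for a purge */
		before = -1;
	after = try_read(map, &c);      /* page gone: expect -1 */

	munmap(map, (size_t)pagesz);
	close(fd);
	return (before == 0 && after == -1) ? 0 : -1;
}
```

The point of the sketch is only that the loss is reported explicitly:
the application cannot mistake the missing page for valid zeroed data.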

Again, I do understand that folks want a solution to the
thread-sanitizer usage model, but I really think, much as we found with
MADV_FREE, that it's really quite different semantics that are wanted,
and trying to mix them doesn't help get anything reviewed/merged.

thanks
-john


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/4] MADV_VOLATILE: Add purged page detection on setting memory non-volatile
  2014-05-08 21:45       ` John Stultz
@ 2014-05-08 23:45         ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-08 23:45 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 02:45:27PM -0700, John Stultz wrote:
> On 05/07/2014 06:51 PM, Minchan Kim wrote:
> > On Tue, Apr 29, 2014 at 02:21:22PM -0700, John Stultz wrote:
> >> +/**
> >> + * mvolatile_check_purged_pte - Checks ptes for purged pages
> >> + * @pmd: pmd to walk
> >> + * @addr: starting address
> >> + * @end: end address
> >> + * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
> >> + *
> >> + * Iterates over the ptes in the pmd checking if they have
> >> + * purged swap entries.
> >> + *
> >> + * Sets the mvolatile_walker.page_was_purged to 1 if any were purged,
> >> + * and clears the purged pte swp entries (since the pages are no
> >> + * longer volatile, we don't want future accesses to SIGBUS).
> >> + */
> >> +static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
> >> +					unsigned long end, struct mm_walk *walk)
> >> +{
> >> +	struct mvolatile_walker *vw = walk->private;
> >> +	pte_t *pte;
> >> +	spinlock_t *ptl;
> >> +
> >> +	if (pmd_trans_huge(*pmd))
> >> +		return 0;
> >> +	if (pmd_trans_unstable(pmd))
> >> +		return 0;
> >> +
> >> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> >> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> >> +		if (!pte_present(*pte)) {
> >> +			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
> >> +
> >> +			if (unlikely(is_purged_entry(mvolatile_entry))) {
> >> +
> >> +				vw->page_was_purged = 1;
> >> +
> >> +				/* clear the pte swp entry */
> >> +				flush_cache_page(vw->vma, addr, pte_pfn(*pte));
> > Maybe we don't need to flush the cache because there is no mapped page.
> >
> >> +				ptep_clear_flush(vw->vma, addr, pte);
> > Maybe we don't need this, either. We didn't set present bit for purged
> > page but when I look at the internal of ptep_clear_flush, it checks present bit
> > and skip the TLB flush so it's okay for x86 but not sure other architecture.
> > More clear function for our purpose would be pte_clear_not_present_full.
> 
> Ok.. basically I just wanted to zap the psudo-swp entry, so it will be
> zero-filled from here on out.
> 
> 
> > And we are changing page table so at least, we need to handle mmu_notifier to
> > inform that to the client of mmu_notifier.
> 
> So yes, this is one item from my last iteration that I didn't act on
> yet. It wasn't clear to me here that we need to do the mmu_notifier,
> since the page is evicted earlier via try_to_purge_one (and we do notify
> then). But in just removing the psudo-swap entry we need to do a
> notification as well? Is there someplace where the mmu_notifier rules

Hmm, it seems your claim is reasonable so we don't need to call it again, sorry.
But let's double check with KVM people.

> are better documented?
> 
> thanks
> -john

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-08 23:43           ` John Stultz
@ 2014-05-09  0:07             ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-09  0:07 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 04:43:07PM -0700, John Stultz wrote:
> On 05/08/2014 04:12 PM, Minchan Kim wrote:
> > On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
> >> On 05/07/2014 06:21 PM, Minchan Kim wrote:
> >>> Hey John,
> >>>
> >>> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> >>>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> >>>> which allows for specifying ranges of memory as volatile, and able
> >>>> to be discarded by the system.
> >>>>
> >>>> This initial patch simply adds flag handling to madvise, and the
> >>>> vma handling, splitting and merging the vmas as needed, and marking
> >>>> them with VM_VOLATILE.
> >>>>
> >>>> No purging or discarding of volatile ranges is done at this point.
> >>>>
> >>>> This a simplified implementation which reuses some of the logic
> >>>> from Minchan's earlier efforts. So credit to Minchan for his work.
> >>> Remove purged argument is really good thing but I'm not sure merging
> >>> the feature into madvise syscall is good idea.
> >>> My concern is how we support user who don't want SIGBUS.
> >>> I believe we should support them because someuser(ex, sanitizer) really
> >>> want to avoid MADV_NONVOLATILE call right before overwriting their cache
> >>> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
> >>> right before overwriting to avoid SIGBUS).
> >> So... Why not use MADV_FREE then for this case?
> > MADV_FREE is one-shot operation. I mean we should call it again to make
> > them lazyfree while vrange could preserve volatility.
> > Pz, think about thread-sanitizer usecase. They do mmap 70TB once start up
> > and want to mark the range as volatile. If they uses MADV_FREE instead of
> > volatile, they should mark 70TB as lazyfree periodically, which is terrible
> > because MADV_FREE's cost is O(N).
> 
> I still have had difficulty seeing the thread-sanitizer usage as a
> generic enough model for other applications. I realize they want to
> avoid marking and unmarking ranges (and they want that marking and
> unmarking to be very cheap), but the zero-fill purged page (while still
> preserving volatility) causes lots of *very* strange behavior:
 
I don't think it's only for thread-sanitizer.
Please consider the following use case.

Let's assume a big volatile cache.
When a request comes in, the cache manager looks the object up in the
cache and, if it finds it, calls vrange(NOVOLATILE) right before
passing it to the user and checks whether it was purged.
If it wasn't purged, the cache manager can pass the object to the user.
But it's a circular cache, so when there are no requests from the user the
cache manager keeps overwriting objects and can easily hit SIGBUS.
So with the current semantics, the cache manager must always call
vrange(NOVOLATILE) right before overwriting. Otherwise, it has to register
a SIGBUS handler to unmark volatility page by page. SIGH.
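
The lookup path described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions: the names
cache_slot, slot_mark_nonvolatile and slot_lookup are hypothetical, the
"present" flag stands in for whatever the kernel would report when a
range is marked non-volatile, and nothing here calls the proposed
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one cached object, kept outside the volatile range. */
struct cache_slot {
    int key;
    bool present;      /* false once the backing pages have been purged */
    bool is_volatile;  /* true while the kernel may purge the pages */
};

enum lookup_result { LOOKUP_MISS, LOOKUP_PURGED, LOOKUP_HIT };

/* Models vrange(NOVOLATILE): pins the pages and reports whether they were purged. */
static bool slot_mark_nonvolatile(struct cache_slot *s)
{
    bool was_purged = !s->present;

    s->is_volatile = false;
    return was_purged;
}

/*
 * Lookup path: on a key match, mark the object non-volatile right before
 * handing it out and tell the caller whether the contents survived.
 */
static enum lookup_result slot_lookup(struct cache_slot *s, int key)
{
    assert(s != NULL);
    if (s->key != key)
        return LOOKUP_MISS;
    return slot_mark_nonvolatile(s) ? LOOKUP_PURGED : LOOKUP_HIT;
}
```

On LOOKUP_PURGED the slot is left non-volatile in this model, so the
caller can regenerate the object in place before marking it volatile
again.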

If we support zero-fill, the cache manager could overwrite objects without
SIGBUS handling or a vrange(NOVOLATILE) call right before overwriting.
All we need is a vrange(NOVOLATILE) call right before passing the object
to the user.

> 
> * How do general applications know the difference between a purged page
> and a valid empty page?
> * When reading/writing a page, what happens if half-way the application
> is preempted, and the page is purged?
> * If a volatile page is purged, then zero-filled on a read or write,
> what is its purged state when we're marking it non-volatile?

Maybe above scenario goes your questions to VOID.

> 
> These use cases don't seem completely baked, or maybe I've just not been
> able to comprehend them yet. But I don't quite understand the desire to
> prioritize this style of usage over other simpler and more well
> established usage?

I think it's one of the typical use cases for the vrange syscall.

> 
> I'll grant that there may be some form of semantics that work for this,
> and I'm open to considering support for those at some point if they
> become more clear, but I don't think these stranger(to me at least)
> cases should be the default, and I really worry that these requests
> continue to make the basic usage harder to understand for reviewers.
> 
> 
> >> Just to be clear, by moving back to madvise, I'm not trying to replace
> >> MADV_FREE. I think you're work there is still useful and splitting the
> >> semantics between the two is cleaner.
> > I know.
> > New vrange syscall which works with existing VMA instead of new vrange
> > interval tree removed big concern from mm folks about duplicating
> > of manage layer(ex, vm_area_struct and vrange inteval tree) and
> > it removed my concern that mmap_sem write-side lock scalability for
> > allocator usecase so we can make the implemenation simple and clear.
> > I like it but zero-page VS SIGBUS is another issue we should make an
> > agreement.
> 
> Zero-fill makes sense to me for MADV_FREE, where we're not trying to
> recover the data, but just save the cost of releasing and re-faulting
> possibly frequently used pages. The contents are not intended to be
> recovered. Thus semantics there are reasonable.
> 
> With volatility (which persists until marked non-volatile), zero-filled
> purged page access breaks quite a bit of the established semantics (see
> the strange behavior questions listed above).
> 
> With SIGBUS semantics, its very clear and much more simple. The
> application has hit an page that no longer exists and is clearly
> notified (via SIGBUS). There's no way for the purged state to become
> lost (other then the application ignoring the return value from
> MADV_NONVOLATILE).
> 
> Again, I do understand that folks want a solution to the
> thread-sanitizer usage model, but I really think, much as we found with
> MADV_FREE, that its really a quite different semantics that are wanted,
> and trying to mix them doesn't help get anything reviewed/merged.
> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-09  0:07             ` Minchan Kim
@ 2014-05-09  0:24               ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-05-09  0:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On 05/08/2014 05:07 PM, Minchan Kim wrote:
> On Thu, May 08, 2014 at 04:43:07PM -0700, John Stultz wrote:
>> On 05/08/2014 04:12 PM, Minchan Kim wrote:
>>> On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
>>>> On 05/07/2014 06:21 PM, Minchan Kim wrote:
>>>>> Hey John,
>>>>>
>>>>> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
>>>>>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
>>>>>> which allows for specifying ranges of memory as volatile, and able
>>>>>> to be discarded by the system.
>>>>>>
>>>>>> This initial patch simply adds flag handling to madvise, and the
>>>>>> vma handling, splitting and merging the vmas as needed, and marking
>>>>>> them with VM_VOLATILE.
>>>>>>
>>>>>> No purging or discarding of volatile ranges is done at this point.
>>>>>>
>>>>>> This a simplified implementation which reuses some of the logic
>>>>>> from Minchan's earlier efforts. So credit to Minchan for his work.
>>>>> Remove purged argument is really good thing but I'm not sure merging
>>>>> the feature into madvise syscall is good idea.
>>>>> My concern is how we support user who don't want SIGBUS.
>>>>> I believe we should support them because someuser(ex, sanitizer) really
>>>>> want to avoid MADV_NONVOLATILE call right before overwriting their cache
>>>>> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
>>>>> right before overwriting to avoid SIGBUS).
>>>> So... Why not use MADV_FREE then for this case?
>>> MADV_FREE is one-shot operation. I mean we should call it again to make
>>> them lazyfree while vrange could preserve volatility.
>>> Pz, think about thread-sanitizer usecase. They do mmap 70TB once start up
>>> and want to mark the range as volatile. If they uses MADV_FREE instead of
>>> volatile, they should mark 70TB as lazyfree periodically, which is terrible
>>> because MADV_FREE's cost is O(N).
>> I still have had difficulty seeing the thread-sanitizer usage as a
>> generic enough model for other applications. I realize they want to
>> avoid marking and unmarking ranges (and they want that marking and
>> unmarking to be very cheap), but the zero-fill purged page (while still
>> preserving volatility) causes lots of *very* strange behavior:
>  
> I don't think it's for only thread-sanitizer.
> Pz, think following usecase.
>
> Let's assume big volatile cache.
> If there is request for cache, it should find a object in a cache
> and if it found, it should call vrange(NOVOLATILE) right before
> passing it to the user and investigate it was purged or not.
> If it wasn't purged, cache manager could pass the object to the user.
> But it's circular cache so if there is no request from user, cache manager
> always overwrites objects so it could encounter SIGBUS easily
> so as current sematic, cache manager always should call vrange(NOVOLATILE)
> right before the overwriting. Otherwise, it should register SIGBUS handler
> to unmark volatile by page unit. SIGH.
>
> If we support zero-fill, cache manager could overwrite object without
> SIGBUS handling or vrange(NOVOLATILE) call right before overwriting.
> Just what we need is vrange(NOVOLATILE) call right before passing it
> to user.

But that wouldn't work. If the page was purged halfway through writing
it, we end up with a page of half zero data and half written data. What
would the page state be at that point? Purged? Not purged?

* If it's not purged (since a write was done to the page after being
zero-filled), we will silently return corrupted data to the user.

* If it is considered purged, how do we store that data, given that we
currently detect purged pages by checking whether they are present when
we mark them non-volatile?


This sort of zero-fill behavior on volatile pages only seems to make
sense if pages are written atomically.
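
To make the torn-write case concrete, here is a minimal user-space
model of it. It is a sketch under stated assumptions, not kernel code:
model_page, model_purge and the other names are hypothetical, purging is
modeled as clearing a "present" flag, and zero-fill is modeled as a
fresh zeroed page appearing on the next access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model of one volatile page. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
    bool present;   /* models "is the page still mapped?" */
};

/* Models the kernel purging the page under memory pressure. */
static void model_purge(struct model_page *p)
{
    p->present = false;
}

/* Models zero-fill semantics: a write to a purged page first gets a zeroed page. */
static void model_write(struct model_page *p, size_t off, size_t len,
                        unsigned char value)
{
    assert(off + len <= MODEL_PAGE_SIZE);
    if (!p->present) {
        memset(p->data, 0, MODEL_PAGE_SIZE);
        p->present = true;
    }
    memset(p->data + off, value, len);
}

/* Models purge detection at mark-non-volatile time: purged iff not present. */
static bool model_was_purged(const struct model_page *p)
{
    return !p->present;
}

/* Returns true if a purge in the middle of a fill goes unnoticed. */
static bool torn_write_goes_undetected(void)
{
    struct model_page p = { .present = true };
    size_t half = MODEL_PAGE_SIZE / 2;

    model_write(&p, 0, half, 0x55);     /* first half of the new object */
    model_purge(&p);                    /* purge while the writer is preempted */
    model_write(&p, half, half, 0x55);  /* second half lands on a zeroed page */

    /* First half is zeros, second half is data, yet the page looks intact. */
    return p.data[0] == 0 && p.data[MODEL_PAGE_SIZE - 1] == 0x55 &&
           !model_was_purged(&p);
}
```

In this model the presence check reports "not purged" even though half
of the object has been lost, which is the silent-corruption case in the
first bullet.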

The SIGBUS handling solution you SIGH'ed at above actually seems
reasonable, because it would allow the page to be safely filled
atomically (marking it non-volatile, filling it and then re-marking it
volatile). Sure, it would cost more, but fast and wrong isn't really a
valid option.
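
The fill sequence in parentheses above (mark non-volatile, fill, re-mark
volatile) can be sketched the same way. Again this is a user-space model
with hypothetical names (vr_page, vr_try_purge, vr_fill_safely); it
assumes a purge can only succeed while the page is marked volatile, and
that a purged page comes back as zeros once it is non-volatile and
touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define VR_PAGE_SIZE 4096

/* Hypothetical model of one page that can be marked volatile. */
struct vr_page {
    unsigned char data[VR_PAGE_SIZE];
    bool present;
    bool is_volatile;
};

/* Models memory pressure: only volatile, present pages can be purged. */
static bool vr_try_purge(struct vr_page *p)
{
    if (!p->is_volatile || !p->present)
        return false;
    p->present = false;
    return true;
}

/* Models marking non-volatile: pins the page and reports whether it was purged. */
static bool vr_mark_nonvolatile(struct vr_page *p)
{
    bool was_purged = !p->present;

    p->is_volatile = false;
    if (was_purged) {
        /* Assumption: the page is re-faulted as zeros on the next access. */
        memset(p->data, 0, VR_PAGE_SIZE);
        p->present = true;
    }
    return was_purged;
}

static void vr_mark_volatile(struct vr_page *p)
{
    p->is_volatile = true;
}

/* Fills the whole page; returns true if the fill completed without a purge. */
static bool vr_fill_safely(struct vr_page *p, unsigned char value)
{
    size_t half = VR_PAGE_SIZE / 2;
    bool purged_midway;

    assert(p != NULL);
    vr_mark_nonvolatile(p);           /* old contents are overwritten anyway */
    memset(p->data, value, half);
    purged_midway = vr_try_purge(p);  /* refused: the page is non-volatile */
    memset(p->data + half, value, half);
    vr_mark_volatile(p);              /* purgeable again once complete */

    return !purged_midway && p->data[0] == value &&
           p->data[VR_PAGE_SIZE - 1] == value;
}
```

The cost is two extra marking calls per fill, which is the "it would
cost more" part of the trade-off.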



>
>> * How do general applications know the difference between a purged page
>> and a valid empty page?
>> * When reading/writing a page, what happens if half-way the application
>> is preempted, and the page is purged?
>> * If a volatile page is purged, then zero-filled on a read or write,
>> what is its purged state when we're marking it non-volatile?
> Maybe above scenario goes your questions to VOID.

I'm not sure I understand this.


>
>> These use cases don't seem completely baked, or maybe I've just not been
>> able to comprehend them yet. But I don't quite understand the desire to
>> prioritize this style of usage over other simpler and more well
>> established usage?
> I think it's one of typical usecase of vrange syscall.

I apologize if I seem stubborn, but I just can't see how it would
work sanely.

thanks
-john


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
  2014-05-09  0:24               ` John Stultz
@ 2014-05-09  0:41                 ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-09  0:41 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 05:24:41PM -0700, John Stultz wrote:
> On 05/08/2014 05:07 PM, Minchan Kim wrote:
> > On Thu, May 08, 2014 at 04:43:07PM -0700, John Stultz wrote:
> >> On 05/08/2014 04:12 PM, Minchan Kim wrote:
> >>> On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
> >>>> On 05/07/2014 06:21 PM, Minchan Kim wrote:
> >>>>> Hey John,
> >>>>>
> >>>>> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> >>>>>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> >>>>>> which allows for specifying ranges of memory as volatile, and able
> >>>>>> to be discarded by the system.
> >>>>>>
> >>>>>> This initial patch simply adds flag handling to madvise, and the
> >>>>>> vma handling, splitting and merging the vmas as needed, and marking
> >>>>>> them with VM_VOLATILE.
> >>>>>>
> >>>>>> No purging or discarding of volatile ranges is done at this point.
> >>>>>>
> >>>>>> This a simplified implementation which reuses some of the logic
> >>>>>> from Minchan's earlier efforts. So credit to Minchan for his work.
> >>>>> Remove purged argument is really good thing but I'm not sure merging
> >>>>> the feature into madvise syscall is good idea.
> >>>>> My concern is how we support user who don't want SIGBUS.
> >>>>> I believe we should support them because someuser(ex, sanitizer) really
> >>>>> want to avoid MADV_NONVOLATILE call right before overwriting their cache
> >>>>> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
> >>>>> right before overwriting to avoid SIGBUS).
> >>>> So... Why not use MADV_FREE then for this case?
> >>> MADV_FREE is one-shot operation. I mean we should call it again to make
> >>> them lazyfree while vrange could preserve volatility.
> >>> Pz, think about thread-sanitizer usecase. They do mmap 70TB once start up
> >>> and want to mark the range as volatile. If they uses MADV_FREE instead of
> >>> volatile, they should mark 70TB as lazyfree periodically, which is terrible
> >>> because MADV_FREE's cost is O(N).
> >> I still have had difficulty seeing the thread-sanitizer usage as a
> >> generic enough model for other applications. I realize they want to
> >> avoid marking and unmarking ranges (and they want that marking and
> >> unmarking to be very cheap), but the zero-fill purged page (while still
> >> preserving volatility) causes lots of *very* strange behavior:
> >  
> > I don't think it's for only thread-sanitizer.
> > Pz, think following usecase.
> >
> > Let's assume big volatile cache.
> > If there is request for cache, it should find a object in a cache
> > and if it found, it should call vrange(NOVOLATILE) right before
> > passing it to the user and investigate it was purged or not.
> > If it wasn't purged, cache manager could pass the object to the user.
> > But it's circular cache so if there is no request from user, cache manager
> > always overwrites objects so it could encounter SIGBUS easily
> > so as current sematic, cache manager always should call vrange(NOVOLATILE)
> > right before the overwriting. Otherwise, it should register SIGBUS handler
> > to unmark volatile by page unit. SIGH.
> >
> > If we support zero-fill, cache manager could overwrite object without
> > SIGBUS handling or vrange(NOVOLATILE) call right before overwriting.
> > Just what we need is vrange(NOVOLATILE) call right before passing it
> > to user.
> 
> But that wouldn't work. If the page was purged half way through writing
> it, we end up with a page of half zero data and half written data. What
> would the page state be at that point? Purged? Not purged?

You're right. An application might detect it by adding a sentinel in the
header, but I don't think that should be the generic model for zero-fill
semantics, although some applications could do it.
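
A minimal sketch of such a header sentinel, assuming zero-fill
semantics and a nonzero magic value written after the payload. The
names cache_obj, OBJ_MAGIC and obj_model_zero_fill are hypothetical and
the purge is only modeled with memset():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_MAGIC   0x564F4C41u  /* arbitrary nonzero value (assumption) */
#define OBJ_PAYLOAD 64

/* Hypothetical cached object with a sentinel in its header. */
struct cache_obj {
    uint32_t magic;
    unsigned char payload[OBJ_PAYLOAD];
};

/* Store the payload first and set the sentinel last. */
static void obj_store(struct cache_obj *o, const unsigned char *src)
{
    assert(o != NULL && src != NULL);
    memcpy(o->payload, src, OBJ_PAYLOAD);
    o->magic = OBJ_MAGIC;
}

/* Models a purge followed by a zero-filled re-fault of the object's page. */
static void obj_model_zero_fill(struct cache_obj *o)
{
    memset(o, 0, sizeof(*o));
}

/* A zero-filled object can never carry the nonzero sentinel. */
static bool obj_is_valid(const struct cache_obj *o)
{
    return o->magic == OBJ_MAGIC;
}
```

This only catches a purge that happens after the object was completely
written. A purge in the middle of obj_store() would still be followed by
the sentinel write, so it does not address the torn-write case.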

> 
> * If its not purged (since a write was done to the page after being
> zero-filled), we will silently return to the user corrupted data.
> 
> * If it is considered purged, how do we store that data? Since we
> currently detect purged pages by checking if they are present when we
> mark non-volatile.
> 
> 
> This sort of zero-fill behavior on volatile pages only seems to make
> sense if pages are written atomically.
> 
> The SIGBUS handling solution you SIGH'ed at above actually seems
> reasonable, because it would allow the page to be safely filled
> atomically (marking it non-volatile, filling it and then re-marking it
> volatile). Sure it would cost more, fast and wrong isn't really a valid
> option.

Got it. My scenario was totally broken, so I no longer insist on such a model.
Let's go with the SIGBUS model first, unless there is a strong requirement
from user folks.

Thanks for pointing that out, John!

> 
> 
> 
> >
> >> * How do general applications know the difference between a purged page
> >> and a valid empty page?
> >> * When reading/writing a page, what happens if half-way the application
> >> is preempted, and the page is purged?
> >> * If a volatile page is purged, then zero-filled on a read or write,
> >> what is its purged state when we're marking it non-volatile?
> > Maybe above scenario goes your questions to VOID.
> 
> I'm not sure I understand this.
> 
> 
> >
> >> These use cases don't seem completely baked, or maybe I've just not been
> >> able to comprehend them yet. But I don't quite understand the desire to
> >> prioritize this style of usage over other simpler and more well
> >> established usage?
> > I think it's one of typical usecase of vrange syscall.
> 
> I apologize if I'm seeming stubborn, but I just can't see how it would
> work sanely.
> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas
@ 2014-05-09  0:41                 ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2014-05-09  0:41 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Thu, May 08, 2014 at 05:24:41PM -0700, John Stultz wrote:
> On 05/08/2014 05:07 PM, Minchan Kim wrote:
> > On Thu, May 08, 2014 at 04:43:07PM -0700, John Stultz wrote:
> >> On 05/08/2014 04:12 PM, Minchan Kim wrote:
> >>> On Thu, May 08, 2014 at 09:38:40AM -0700, John Stultz wrote:
> >>>> On 05/07/2014 06:21 PM, Minchan Kim wrote:
> >>>>> Hey John,
> >>>>>
> >>>>> On Tue, Apr 29, 2014 at 02:21:21PM -0700, John Stultz wrote:
> >>>>>> This patch introduces MADV_VOLATILE/NONVOLATILE flags to madvise(),
> >>>>>> which allows for specifying ranges of memory as volatile, and able
> >>>>>> to be discarded by the system.
> >>>>>>
> >>>>>> This initial patch simply adds flag handling to madvise, and the
> >>>>>> vma handling, splitting and merging the vmas as needed, and marking
> >>>>>> them with VM_VOLATILE.
> >>>>>>
> >>>>>> No purging or discarding of volatile ranges is done at this point.
> >>>>>>
> >>>>>> This a simplified implementation which reuses some of the logic
> >>>>>> from Minchan's earlier efforts. So credit to Minchan for his work.
> >>>>> Remove purged argument is really good thing but I'm not sure merging
> >>>>> the feature into madvise syscall is good idea.
> >>>>> My concern is how we support user who don't want SIGBUS.
> >>>>> I believe we should support them because someuser(ex, sanitizer) really
> >>>>> want to avoid MADV_NONVOLATILE call right before overwriting their cache
> >>>>> (ex, If there was purged page for cyclic cache, user should call NONVOLATILE
> >>>>> right before overwriting to avoid SIGBUS).
> >>>> So... Why not use MADV_FREE then for this case?
> >>> MADV_FREE is one-shot operation. I mean we should call it again to make
> >>> them lazyfree while vrange could preserve volatility.
> >>> Pz, think about thread-sanitizer usecase. They do mmap 70TB once start up
> >>> and want to mark the range as volatile. If they uses MADV_FREE instead of
> >>> volatile, they should mark 70TB as lazyfree periodically, which is terrible
> >>> because MADV_FREE's cost is O(N).
> >> I still have had difficulty seeing the thread-sanitizer usage as a
> >> generic enough model for other applications. I realize they want to
> >> avoid marking and unmarking ranges (and they want that marking and
> >> unmarking to be very cheap), but the zero-fill purged page (while still
> >> preserving volatility) causes lots of *very* strange behavior:
> >  
> > I don't think it's only for thread-sanitizer.
> > Please think about the following usecase.
> >
> > Let's assume a big volatile cache.
> > If there is a request for the cache, it should find an object in the cache,
> > and if it finds one, it should call vrange(NOVOLATILE) right before
> > passing it to the user and check whether it was purged or not.
> > If it wasn't purged, the cache manager can pass the object to the user.
> > But it's a circular cache, so if there is no request from the user, the cache
> > manager keeps overwriting objects and could easily encounter SIGBUS.
> > So with the current semantics, the cache manager always has to call
> > vrange(NOVOLATILE) right before overwriting. Otherwise, it has to register a
> > SIGBUS handler to unmark volatility page by page. SIGH.
> >
> > If we support zero-fill, the cache manager could overwrite objects without
> > SIGBUS handling or a vrange(NOVOLATILE) call right before overwriting.
> > All we need is a vrange(NOVOLATILE) call right before passing the object
> > to the user.
> 
> But that wouldn't work. If the page was purged half way through writing
> it, we end up with a page of half zero data and half written data. What
> would the page state be at that point? Purged? Not purged?

You're right. An application might detect it by adding a sentinel in the
header, but I don't think that should be the generic model for zero-fill
semantics, although some applications could do it.

> 
> * If it's not purged (since a write was done to the page after being
> zero-filled), we will silently return corrupted data to the user.
> 
> * If it is considered purged, how do we store that state? We
> currently detect purged pages by checking if they are present when we
> mark non-volatile.
> 
> 
> This sort of zero-fill behavior on volatile pages only seems to make
> sense if pages are written atomically.
> 
> The SIGBUS handling solution you SIGH'ed at above actually seems
> reasonable, because it would allow the page to be safely filled
> atomically (marking it non-volatile, filling it and then re-marking it
> volatile). Sure, it would cost more, but fast and wrong isn't really a
> valid option.

Got it. My scenario was totally broken, so I won't insist on that model any more.
Let's go with the SIGBUS model first if there is no strong requirement
from user folks.

Thanks for pointing that out, John!
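
The half-written page problem discussed above can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: the
struct and every sim_*/fill_* name are hypothetical stand-ins invented for
illustration, not the interface from the patchset, and sim_purge() merely
imitates the kernel discarding a volatile range.

```c
#include <stdbool.h>
#include <string.h>

enum { OBJ_SIZE = 64 };

/* Hypothetical model of one cached object living in a volatile range. */
struct vobj {
	unsigned char data[OBJ_SIZE];
	bool is_volatile;
	bool purged;
};

/* Stand-in for reclaim: only volatile objects may be discarded. */
bool sim_purge(struct vobj *o)
{
	if (!o->is_volatile)
		return false;
	memset(o->data, 0, OBJ_SIZE);	/* contents are lost (zero-fill) */
	o->purged = true;
	return true;
}

/* Stand-in for marking non-volatile; reports whether a purge happened. */
bool sim_mark_nonvolatile(struct vobj *o)
{
	bool was_purged = o->purged;

	o->is_volatile = false;
	o->purged = false;
	return was_purged;
}

void sim_mark_volatile(struct vobj *o)
{
	o->is_volatile = true;
}

/* Zero-fill style: overwrite while the object is still volatile. */
void fill_while_volatile(struct vobj *o, unsigned char byte, bool purge_midway)
{
	memset(o->data, byte, OBJ_SIZE / 2);
	if (purge_midway)
		sim_purge(o);		/* succeeds: first half is zeroed */
	memset(o->data + OBJ_SIZE / 2, byte, OBJ_SIZE / 2);
}

/* SIGBUS-model style: unmark, fill, then re-mark volatile. */
void fill_atomically(struct vobj *o, unsigned char byte, bool purge_midway)
{
	sim_mark_nonvolatile(o);
	memset(o->data, byte, OBJ_SIZE / 2);
	if (purge_midway)
		sim_purge(o);		/* refused: object is not volatile */
	memset(o->data + OBJ_SIZE / 2, byte, OBJ_SIZE / 2);
	sim_mark_volatile(o);
}

/* True if every byte of the object holds the expected fill value. */
bool is_intact(const struct vobj *o, unsigned char byte)
{
	for (int i = 0; i < OBJ_SIZE; i++)
		if (o->data[i] != byte)
			return false;
	return true;
}
```

Calling fill_while_volatile() with purge_midway set leaves the first half of
the object zeroed and the second half written, which is the torn state John
describes; fill_atomically() with the same argument leaves the object intact
because the simulated purge is refused while the object is non-volatile.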

> 
> 
> 
> >
> >> * How do general applications know the difference between a purged page
> >> and a valid empty page?
> >> * When reading/writing a page, what happens if half-way the application
> >> is preempted, and the page is purged?
> >> * If a volatile page is purged, then zero-filled on a read or write,
> >> what is its purged state when we're marking it non-volatile?
> > Maybe above scenario goes your questions to VOID.
> 
> I'm not sure I understand this.
> 
> 
> >
> >> These use cases don't seem completely baked, or maybe I've just not been
> >> able to comprehend them yet. But I don't quite understand the desire to
> >> prioritize this style of usage over other simpler and more well
> >> established usage?
> > I think it's one of the typical usecases of the vrange syscall.
> 
> I apologize if I'm seeming stubborn, but I just can't see how it would
> work sanely.
> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-05-08 17:12   ` John Stultz
@ 2014-06-03 14:57     ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2014-06-03 14:57 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
> On 04/29/2014 02:21 PM, John Stultz wrote:
> > Another few weeks and another volatile ranges patchset...
> >
> > After getting the sense that the a major objection to the earlier
> > patches was the introduction of a new syscall (and its somewhat
> > strange dual length/purged-bit return values), I spent some time
> > trying to rework the vma manipulations so we can be we won't fail
> > mid-way through changing volatility (basically making it atomic).
> > I think I have it working, and thus, there is no longer the
> > need for a new syscall, and we can go back to using madvise()
> > to set and unset pages as volatile.
> 
> Johannes: To get some feedback, maybe I'll needle you directly here a
> bit. :)
> 
> Does moving this interface to madvise help reduce your objections?  I
> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
> hoping that by reworking the vma manipulations to be atomic, we could
> move to madvise and still avoid the new syscall that you seemed bothered
> by. But I've not really heard much from you recently so I worry your
> concerns on this were actually elsewhere, and I'm just churning the
> patch needlessly.

My objection was not the syscall.

From a reclaim perspective, using the dirty state to denote whether a
swap-backed page needs writeback before reclaim is quite natural and I
much prefer Minchan's changes to the reclaim code over yours.

From an interface point of view, I would prefer the simplicity of
cleaning dirty bits to invalidate pages, and a default of zero-filling
invalidated pages instead of sending SIGBUS.  This also is quite
natural when you think of anon/shmem mappings as cache pages on top of
/dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
well to tmpfs.

At the same time, I acknowledge that there are usecases that want
SIGBUS delivery for more than just convenience in order to implement
userspace fault handling, and this is the only place where I see a
real divergence in actual functionality from Minchan's code.

That, however, truly is a separate virtual memory feature.  Would it
be possible for you to take MADV_FREE and MADV_REVIVE as a base and
implement an madvise op that switches the no-page behavior of a VMA
from zero-filling to SIGBUS delivery?


* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-06-03 14:57     ` Johannes Weiner
@ 2014-06-16 20:12       ` John Stultz
  -1 siblings, 0 replies; 48+ messages in thread
From: John Stultz @ 2014-06-16 20:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, May 08, 2014 at 10:12:40AM -0700, John Stultz wrote:
>> On 04/29/2014 02:21 PM, John Stultz wrote:
>> > Another few weeks and another volatile ranges patchset...
>> >
>> > After getting the sense that the a major objection to the earlier
>> > patches was the introduction of a new syscall (and its somewhat
>> > strange dual length/purged-bit return values), I spent some time
>> > trying to rework the vma manipulations so we can be we won't fail
>> > mid-way through changing volatility (basically making it atomic).
>> > I think I have it working, and thus, there is no longer the
>> > need for a new syscall, and we can go back to using madvise()
>> > to set and unset pages as volatile.
>>
>> Johannes: To get some feedback, maybe I'll needle you directly here a
>> bit. :)
>>
>> Does moving this interface to madvise help reduce your objections?  I
>> feel like your cleaning-the-dirty-bit idea didn't work out, but I was
>> hoping that by reworking the vma manipulations to be atomic, we could
>> move to madvise and still avoid the new syscall that you seemed bothered
>> by. But I've not really heard much from you recently so I worry your
>> concerns on this were actually elsewhere, and I'm just churning the
>> patch needlessly.
>
> My objection was not the syscall.
>
> From a reclaim perspective, using the dirty state to denote whether a
> swap-backed page needs writeback before reclaim is quite natural and I
> much prefer Minchan's changes to the reclaim code over yours.
>
> From an interface point of view, I would prefer the simplicity of
> cleaning dirty bits to invalidate pages, and a default of zero-filling
> invalidated pages instead of sending SIGBUS.  This also is quite
> natural when you think of anon/shmem mappings as cache pages on top of
> /dev/zero (see mmap_zero() and shmem_zero_setup()).  And it translates
> well to tmpfs.
>
> At the same time, I acknowledge that there are usecases that want
> SIGBUS delivery for more than just convenience in order to implement
> userspace fault handling, and this is the only place where I see a
> real divergence in actual functionality from Minchan's code.

Thanks for the clarification and feedback. Sorry for my slow response,
as I was on vacation for a week and am just now catching up on this.

So again, SIGBUS for userspace fault handling is really more of a
side-effect of having more userspace friendly semantics, and isn't
really the primary goal/usage model.

Zerofill semantics are mostly problematic because they make userspace
mistakes harder to find and diagnose. Android's ashmem actually uses
zerofill semantics, so while I see it as less ideal, technically
zerofill would work here.

However, combining zerofill with your preferred overloading of the
dirty state is particularly problematic because it makes any dirtying
of volatile data clear both the volatile state and the purged
state for the entire page. Clearing the volatile state is surprising but
less problematic; clearing the purged state means applications
could get a partial zero page (for whatever wasn't written)
and no warning that their data was lost.  This is a very surprising
and unfriendly side-effect from a userspace perspective.

For context,  Android's ashmem preserves both the volatile and purged
state on volatile page dirtying (since the volatility and purged state
are kept in their own range structure independently from the VM).
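
To make the difference concrete, here is a toy comparison of the two
bookkeeping schemes. It is a sketch only: both structs and all db_*/rt_*
functions are hypothetical names made up for this illustration, and the
"dirty bit" scheme is a simplified reading of the clean-page approach
discussed in this thread, not code from either patchset or from ashmem.

```c
#include <stdbool.h>

/*
 * Scheme A (assumed model): volatility is overloaded on the dirty bit.
 * A clean page is volatile; a purged page is simply not present.
 */
struct dirtybit_page {
	bool present;
	bool dirty;
};

void db_mark_volatile(struct dirtybit_page *p)
{
	p->dirty = false;
}

void db_purge(struct dirtybit_page *p)
{
	if (!p->dirty)
		p->present = false;
}

/* A write faults in a zero page if needed and dirties it. */
void db_write(struct dirtybit_page *p)
{
	p->present = true;
	p->dirty = true;
}

/* Purge detection relies on the page being absent. */
bool db_mark_nonvolatile(struct dirtybit_page *p)
{
	bool purged = !p->present;

	p->dirty = true;
	return purged;
}

/*
 * Scheme B (assumed model): volatile and purged state live in a separate
 * range structure, independent of the page's dirty/present state.
 */
struct range_page {
	bool present;
	bool is_volatile;
	bool purged;
};

void rt_mark_volatile(struct range_page *p)
{
	p->is_volatile = true;
}

void rt_purge(struct range_page *p)
{
	if (p->is_volatile) {
		p->present = false;
		p->purged = true;
	}
}

/* A write zero-fills the page but leaves the range state untouched. */
void rt_write(struct range_page *p)
{
	p->present = true;
}

bool rt_mark_nonvolatile(struct range_page *p)
{
	bool purged = p->purged;

	p->is_volatile = false;
	p->purged = false;
	return purged;
}
```

Running the sequence mark volatile, purge, write, mark non-volatile through
both models shows the concern: in scheme A the write makes the page present
and dirty again, so the final call reports no purge, while scheme B still
reports that the data was lost.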

> That, however, truly is a separate virtual memory feature.  Would it
> be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> implement an madvise op that switches the no-page behavior of a VMA
> from zero-filling to SIGBUS delivery?

I'll see if I can look into it if I get some time. However, I suspect
it's more likely I'll just have to admit defeat on this one and let
someone else champion the effort. Interest and reviews have seemingly
dropped again here and with other work ramping up, I'm not sure if
I'll be able to justify further work on this. :(

thanks
-john


* Re: [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!)
  2014-06-16 20:12       ` John Stultz
@ 2014-06-16 22:24         ` Andrea Arcangeli
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2014-06-16 22:24 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	linux-mm

Hello everyone,

On Mon, Jun 16, 2014 at 01:12:41PM -0700, John Stultz wrote:
> On Tue, Jun 3, 2014 at 7:57 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > That, however, truly is a separate virtual memory feature.  Would it
> > be possible for you to take MADV_FREE and MADV_REVIVE as a base and
> > implement an madvise op that switches the no-page behavior of a VMA
> > from zero-filling to SIGBUS delivery?
> 
> I'll see if I can look into it if I get some time. However, I suspect
> its more likely I'll just have to admit defeat on this one and let
> someone else champion the effort. Interest and reviews have seemingly
> dropped again here and with other work ramping up, I'm not sure if
> I'll be able to justify further work on this. :(

About adding an madvise op that switches the no-page behavior from
zero-filling to SIGBUS delivery (right now only for anonymous vmas, but
we can evaluate extending it): I've mostly completed the
userfaultfd/madvise(MADV_USERFAULT) work according to the design I
described earlier. As we discussed earlier, that may fit the bill if
extended to tmpfs? The first preliminary tests just passed last week.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/?h=userfault

If userfaultfd() isn't instantiated by the process, it only sends a
SIGBUS to the thread accessing the unmapped virtual address
(handle_mm_fault returns VM_FAULT_SIGBUS). The address of the fault
is then available in siginfo->si_addr.
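
For readers unfamiliar with how userspace picks up the fault address, the
following is a rough sketch of a SIGBUS handler reading si_addr. It does
not use MADV_USERFAULT (which is assumed unavailable outside the branch
linked above); as a stand-in, it provokes SIGBUS by touching a shared
mapping of an empty temporary file past its end, which is an assumption
about the host being Linux-like. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static void *volatile fault_addr;

/* SA_SIGINFO handler: record the faulting address, then unwind. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	fault_addr = info->si_addr;
	siglongjmp(fault_jmp, 1);
}

/*
 * Returns 1 if SIGBUS was delivered and si_addr fell inside the touched
 * page, 0 if si_addr was elsewhere, and -1 on setup failure or if no
 * SIGBUS arrived.
 */
int sigbus_reports_fault_addr(void)
{
	struct sigaction sa, old;
	long page = sysconf(_SC_PAGESIZE);
	FILE *tmp = tmpfile();		/* empty file: no backing pages */
	char *p;
	int result = -1;

	if (tmp == NULL)
		return -1;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fileno(tmp), 0);
	if (p == MAP_FAILED) {
		fclose(tmp);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) == 0) {
		fault_addr = NULL;
		if (sigsetjmp(fault_jmp, 1) == 0)
			(void)*(volatile char *)p;	/* expected to fault */
		else
			result = (char *)fault_addr >= p &&
				 (char *)fault_addr < p + page;
		sigaction(SIGBUS, &old, NULL);	/* restore old handler */
	}

	munmap(p, (size_t)page);
	fclose(tmp);
	return result;
}
```

A real user of the SIGBUS model would, instead of jumping out, use the
recorded address to decide which object or page needs to be regenerated
before retrying the access.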

You strictly need a memory externalization thread opening the
userfaultfd and speaking the userfaultfd protocol only if you need to
access the memory also through syscalls or drivers doing GUP
calls. This allows memory mapped in a secondary MMU for example to be
externalized without a single change to the secondary MMU code. The
userfault becomes invisible to
handle_mm_fault/gup()/gup_fast/FOLL_NOWAIT etc.... The only
requirement is that the memory externalization thread never accesses
any memory in the MADV_USERFAULT marked regions (and if it does
because of a bug, the deadlock should be quite apparent by simply
checking the stack trace of the externalization thread blocked in
handle_userfault(), sigkill will then clear it up :). If you close the
userfaultfd, the SIGBUS behavior will immediately return for the
MADV_USERFAULT marked regions and any hung task waiting to be woken
will get an immediate SIGBUS.


end of thread, other threads:[~2014-06-16 22:24 UTC | newest]

Thread overview: 48+ messages
2014-04-29 21:21 [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!) John Stultz
2014-04-29 21:21 ` John Stultz
2014-04-29 21:21 ` [PATCH 1/4] swap: Cleanup how special swap file numbers are defined John Stultz
2014-04-29 21:21   ` John Stultz
2014-04-29 21:21 ` [PATCH 2/4] MADV_VOLATILE: Add MADV_VOLATILE/NONVOLATILE hooks and handle marking vmas John Stultz
2014-04-29 21:21   ` John Stultz
2014-05-08  1:21   ` Minchan Kim
2014-05-08  1:21     ` Minchan Kim
2014-05-08 16:38     ` John Stultz
2014-05-08 16:38       ` John Stultz
2014-05-08 23:12       ` Minchan Kim
2014-05-08 23:12         ` Minchan Kim
2014-05-08 23:43         ` John Stultz
2014-05-08 23:43           ` John Stultz
2014-05-09  0:07           ` Minchan Kim
2014-05-09  0:07             ` Minchan Kim
2014-05-09  0:24             ` John Stultz
2014-05-09  0:24               ` John Stultz
2014-05-09  0:41               ` Minchan Kim
2014-05-09  0:41                 ` Minchan Kim
2014-04-29 21:21 ` [PATCH 3/4] MADV_VOLATILE: Add purged page detection on setting memory non-volatile John Stultz
2014-04-29 21:21   ` John Stultz
2014-05-08  1:51   ` Minchan Kim
2014-05-08  1:51     ` Minchan Kim
2014-05-08 21:45     ` John Stultz
2014-05-08 21:45       ` John Stultz
2014-05-08 23:45       ` Minchan Kim
2014-05-08 23:45         ` Minchan Kim
2014-04-29 21:21 ` [PATCH 4/4] MADV_VOLATILE: Add page purging logic & SIGBUS trap John Stultz
2014-04-29 21:21   ` John Stultz
2014-05-08  5:16   ` Minchan Kim
2014-05-08  5:16     ` Minchan Kim
2014-05-08 16:39     ` John Stultz
2014-05-08 16:39       ` John Stultz
2014-05-08  5:58 ` [PATCH 0/4] Volatile Ranges (v14 - madvise reborn edition!) Minchan Kim
2014-05-08  5:58   ` Minchan Kim
2014-05-08 17:04   ` John Stultz
2014-05-08 17:04     ` John Stultz
2014-05-08 23:29     ` Minchan Kim
2014-05-08 23:29       ` Minchan Kim
2014-05-08 17:12 ` John Stultz
2014-05-08 17:12   ` John Stultz
2014-06-03 14:57   ` Johannes Weiner
2014-06-03 14:57     ` Johannes Weiner
2014-06-16 20:12     ` John Stultz
2014-06-16 20:12       ` John Stultz
2014-06-16 22:24       ` Andrea Arcangeli
2014-06-16 22:24         ` Andrea Arcangeli
