linux-kernel.vger.kernel.org archive mirror
* [patch 0/6] Guest page hinting version 6.
@ 2008-03-12 13:21 Martin Schwidefsky
  2008-03-12 13:21 ` [patch 1/6] Guest page hinting: core + volatile page cache Martin Schwidefsky
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh

Greetings,
I've dusted off the guest page hinting patches and ported them to today's
upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
that is easy to fix. The code still works as expected on my test system.

Our z/VM performance team recently published a report on guest page
hinting vs. the ballooner approach on SLES10 for a farm of web servers.
The code on SLES10 differs a bit from the upstream variant but the
performance results should still be valid.  You will find the report
here:

  http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html

(the VMRM-CMM the web page speaks about is the balloon approach,
 CMMA is the guest page hinting).

Both approaches to the memory overcommit problem show comparable benefits
for this workload, with an advantage for guest page hinting for a large
number of guests. For other workloads your mileage may vary.

The main benefit for guest page hinting vs. the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, a complex algorithm that calculates the working set sizes and for
the calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.

The last versions of the patches do not differ much, so I consider the
code to be stable. My question now is how to proceed with the code. I sure
would love to see the code going upstream some day but that depends on
the mm developers as the code adds complexity that needs to be supported.
If the general feeling is that the advantages of this approach do not
warrant the added complexity this will likely be the last time you
will hear about guest page hinting.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* [patch 1/6] Guest page hinting: core + volatile page cache.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 23:12   ` Rusty Russell
  2008-03-12 13:21 ` [patch 2/6] Guest page hinting: volatile swap cache Martin Schwidefsky
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 001-hva-core.diff --]
[-- Type: text/plain, Size: 44875 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

The guest page hinting patchset introduces code that passes guest
page usage information to the host system that virtualizes the
memory of its guests. There are three different page states:
* Unused: The page content is of no interest to the guest. The
  host can forget the page content and replace it with a page
  containing zeroes.
* Stable: The page content is needed by the guest and has to be
  preserved by the host.
* Volatile: The page content is useful to the guest but not
  essential. The host can discard the page but has to deliver a
  special kind of fault to the guest if the guest accesses a
  page discarded by the host.
   
The unused state is used for free pages. It allows the host to avoid
the paging of empty pages. The default state for non-free pages is
stable. The host can write stable pages to a swap device but has
to restore the page if the guest accesses it. The volatile page state
is used for clean uptodate page cache pages. The host can choose to
discard volatile pages as part of its vmscan operation instead of
writing them to the host's paging device. The guest system doesn't
notice that a volatile page is gone until it tries to access the page
or tries to make the page stable again. For a guest access to a
discarded page the host generates a discard fault to notify the guest.
The guest has to remove the page from the cache and reload the page
from its backing device.
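
As an illustration only (this is not part of the patch), the three
states and the host's options can be modelled in a few lines of
userspace C. All names below (hint_state, model_page, host_reclaim,
guest_access_faults, guest_make_stable) are hypothetical and chosen
for this sketch; the real state changes are done by the architecture
primitives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three guest page states. */
enum hint_state { HINT_UNUSED, HINT_STABLE, HINT_VOLATILE };

struct model_page {
	enum hint_state state;
	bool discarded;		/* host has dropped the page content */
};

/*
 * Host side: reclaim the page frame. Returns true if the host has to
 * write the content to its paging device, false if it can drop it.
 */
static bool host_reclaim(struct model_page *p)
{
	assert(p);
	switch (p->state) {
	case HINT_UNUSED:
		/* Content is of no interest, simply forget it. */
		return false;
	case HINT_VOLATILE:
		/* Guest can recreate the content: discard, no i/o. */
		p->discarded = true;
		return false;
	case HINT_STABLE:
		break;
	}
	/* Stable content has to be preserved by paging it out. */
	return true;
}

/* Guest side: an access to a discarded page raises a discard fault. */
static bool guest_access_faults(const struct model_page *p)
{
	assert(p);
	return p->discarded;
}

/*
 * Guest side: try to make the page stable. Fails if the host has
 * already discarded the page; the guest then has to drop the page
 * from its cache and reload it from the backing device.
 */
static bool guest_make_stable(struct model_page *p)
{
	assert(p);
	if (p->discarded)
		return false;
	p->state = HINT_STABLE;
	return true;
}
```

In this model a volatile page that the host reclaims costs no paging
i/o, and the guest only learns about the loss when it touches the page
or tries to make it stable.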

The volatile state is used for all page cache pages, even for pages
which are referenced by writable ptes. The host needs to be able to
check the dirty state of the pages. Since the host doesn't know where
the page table entries of the guest are located, the volatile state as
introduced by this patch is only usable on architectures with per-page
dirty bits (s390 only). For per-pte dirty bit architectures some
additional code is needed, see patch #4.

The main question is where to put the state transitions between
the volatile and the stable state. The simple solution is to make a
page stable whenever a lookup is done or a page reference is derived
from a page table entry. Attempts to make pages volatile are added at
strategic points. The conditions that prevent a page from being made
volatile:
1) The page is reserved. Some sort of special page.
2) The page is marked dirty in the struct page. The page content is
   more recent than the data on the backing device. The host cannot
   access the linux internal dirty bit so the page needs to be stable.
3) The page is in writeback. The page content is needed for i/o.
4) The page is locked. Someone has exclusive access to the page.
5) The page is anonymous. Swap cache support needs additional code.
   See patch #2.
6) The page has no mapping. Without a backing the page cannot be
   recreated.
7) The page is not uptodate.
8) The page has private information. try_to_release_page can fail,
   e.g. in case the private information is journaling data. The discard
   fault needs to be able to remove the page.
9) The page is already discarded.
10) The page is not on the LRU list. The page has been isolated, some
   processing is done.
11) The page map count is not equal to the page reference count - 1.
   The discard fault handler can remove the page cache reference and
   all mappers of a page. It cannot remove the page reference for any
   other user of the page.
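
The eleven conditions can be read as one predicate. The following
userspace sketch is only a model of that predicate, loosely following
the check_bits/check_counts split in mm/page-states.c. The MP_* flag
values, struct model_page and the model_* function names are
assumptions made for the example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the model, not the kernel's PG_* values. */
enum {
	MP_RESERVED  = 1u << 0,
	MP_DIRTY     = 1u << 1,
	MP_WRITEBACK = 1u << 2,
	MP_LOCKED    = 1u << 3,
	MP_ANON      = 1u << 4,
	MP_UPTODATE  = 1u << 5,
	MP_PRIVATE   = 1u << 6,
	MP_DISCARDED = 1u << 7,
	MP_LRU       = 1u << 8,
};

struct model_page {
	unsigned int flags;
	bool has_mapping;	/* page is backed by a mapping */
	int mapcount;		/* number of ptes mapping the page */
	int refcount;		/* page reference count */
};

/* Conditions 1-10: flag bits and the mapping. */
static bool model_check_bits(const struct model_page *p)
{
	const unsigned int must_be_clear = MP_RESERVED | MP_DIRTY |
		MP_WRITEBACK | MP_LOCKED | MP_ANON | MP_PRIVATE |
		MP_DISCARDED;
	const unsigned int must_be_set = MP_UPTODATE | MP_LRU;

	assert(p);
	if (p->flags & must_be_clear)
		return false;
	if ((p->flags & must_be_set) != must_be_set)
		return false;
	return p->has_mapping;
}

/*
 * Condition 11: every reference is either a mapper or one of the
 * 'offset' references the caller knows about (the page cache
 * reference counts as one).
 */
static bool model_check_counts(const struct model_page *p, int offset)
{
	assert(p);
	return p->mapcount + offset == p->refcount;
}

static bool model_can_make_volatile(const struct model_page *p, int offset)
{
	return model_check_bits(p) && model_check_counts(p, offset);
}
```

A clean, uptodate page on the LRU with two mappers and a reference
count of three passes with offset 1. The same page fails with offset 2,
or as soon as it is marked dirty.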

The transitions to stable are done by find_get_pages() and its variants,
in follow_page if the FOLL_GET flag is set, by copy-on-write in
do_wp_page, and by the early copy-on-write in do_no_page. For page cache
pages this is always done with a call to page_make_stable().
To make enough pages discardable by the host an attempt to do the
transition to volatile state is done at several places:
1) When a page gets unlocked (unlock_page).
2) When writeback has finished (test_clear_page_writeback).
3) When the page reference counter is decreased (__free_pages,
   page_cache_release alias put_page_check and __pagevec_release_nonlru
   right before the put_page_testzero call).
4) When the map counter is increased (page_add_file_rmap).
5) When a page is moved from the active list to the inactive list.
6) In filemap_nopage after the wait for readpage has finished. This try
   is necessary because filemap_nopage held an additional reference to
   the page so that the page_make_volatile call in unlock_page could not
   do the state transition.
The function for the state transitions to volatile is page_make_volatile().

The major obstacles that need to get addressed:
* Concurrent page state changes:
  To guard against concurrent page state updates some kind of lock
  is needed. If page_make_volatile() has already done the 11 checks it
  will issue the state change primitive. If in the meantime one of
  the conditions has changed the user that requires that page in
  stable state will have to wait in the page_make_stable() function
  until the make volatile operation has finished. It is up to the
  architecture to define how this is done with the three primitives
  page_test_set_state_change, page_clear_state_change and
  page_state_change.
  There are some alternatives how this can be done, e.g. a global
  lock, or lock per segment in the kernel page table, or the per page
  bit PG_arch_1 if it is still free.

* Page references acquired from page tables:
  All page references acquired with find_get_page and friends can be
  used to access the page frame content. A page reference grabbed from
  a page table cannot be used to access the page content, the page has
  to be made stable first. If the make stable operation fails because
  the page has been discarded it has to be removed from page cache.
  That removes the page table entry as well.

* Page discard vs. __remove_from_page_cache race
  A new page flag PG_discarded is added. This bit is set for discarded
  pages. It prevents multiple removes of a page from the page cache due
  to concurrent discard faults and/or normal page removals. It also
  prevents the re-add of isolated pages to the lru list in vmscan if
  the page has been discarded while it was not on the lru list.

* Page discard vs. pte establish
  The discard fault handler does three things: 1) set the PG_discarded
  bit for the page, 2) remove the page from all page tables and 3) remove
  the page from the page cache. All page references of the discarded
  page that are still around after step 2 may not be used to establish
  new mappings because step 3 clears the page->mapping field that is
  required to find the mappers. Code that establishes new ptes to pages
  that might be discarded has to check the PG_discarded bit. Step 2 has
  to check all possible locations for a pte of a particular page and check
  if the pte exists or another processor might be in the process of
  establishing one. To do that the page table lock for the pte is used.
  See page_unmap_all and the modified quick check in page_check_address
  for the details.

* copy_one_pte vs. discarded pages
  The code that copies the page tables may not copy ptes for discarded
  pages because this races with the discard fault handler. copy_one_pte
  cannot back out either since there is no automatic repeat of the
  fault that caused the pte modification. Ptes to discarded pages only
  show up in copy_one_pte if a fork races with a discard fault. In this
  case copy_one_pte has to create a pte in the new page table that looks
  like the one that the discard fault handler would have created in the
  original page table if copy_one_pte had not grabbed the page
  table lock first.

* get_user_pages with FOLL_GET
  If get_user_pages is called with a non-NULL pages argument the caller
  has to be able to access the page content using the references
  returned in the pages array. This is done with a check in follow_page
  for the FOLL_GET bit and a call to page_make_stable.
  If get_user_pages is called with NULL as the pages argument the
  pages are not made stable. The caller cannot expect that the pages 
  are available after the call because vmscan might have removed them.

* buffer heads / page_private
  A page that is modified with sys_write will get a buffer-head to
  keep track of the dirty state. The existence of a buffer-head makes
  PagePrivate(page) return true. Pages with private information cannot
  be made volatile. Until the buffer-head is removed the page will
  stay stable. The standard logic is to call try_to_release_page which
  frees the buffer-head only if more than 10% of GFP_USER memory is
  used for buffer heads. Without high memory every page can have a
  buffer-head without running over the limit. The result is that every
  page written to with sys_write will stay stable until it is removed.
  To get these pages volatile again max_buffer_heads is set to zero (!)
  to force a call to try_to_release_page whenever a page is moved from
  the active to the inactive list.

* page_free_discarded hook
  The architecture might want/need to do special things for discarded
  pages before they are freed. E.g. s390 has to delay the freeing of
  discarded pages. To allow this a hook is added to free_hot_cold_page.
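
To illustrate the PG_discarded guard from the list above, here is a
small userspace model using C11 atomics. It is a sketch under stated
assumptions, not the kernel code: struct model_page, MP_DISCARDED, the
host_discarded field (a stand-in for the hardware page state) and the
model_* functions are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MP_DISCARDED 0x1u	/* hypothetical stand-in for PG_discarded */

struct model_page {
	atomic_uint flags;
	bool host_discarded;	/* stand-in for the hardware page state */
	int removals;		/* times the page left the page cache */
};

/* Atomically set the discarded bit and return its previous value. */
static bool model_test_set_discarded(struct model_page *p)
{
	assert(p);
	return atomic_fetch_or(&p->flags, MP_DISCARDED) & MP_DISCARDED;
}

/*
 * Model of the removal path: whoever sets the bit first does the
 * removal, a second caller sees the bit already set and backs off.
 */
static void model_remove_from_page_cache(struct model_page *p)
{
	if (model_test_set_discarded(p))
		return;		/* already removed by someone else */

	p->removals++;		/* the actual removal would go here */

	/* Keep the bit only if the host really discarded the page. */
	if (!p->host_discarded)
		atomic_fetch_and(&p->flags, ~MP_DISCARDED);
}
```

Calling the removal twice on a page the host has discarded removes it
only once. For a page that was not discarded the bit is cleared again
after the removal, so it only acts as a guard while the removal is in
progress.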

Another noticeable change is that the first few lines of code in
try_to_unmap_one that calculate the address from the page and the vma
are moved out of try_to_unmap_one to the callers. This is done to make
try_to_unmap_one usable for the removal of discarded pages in
page_unmap_all.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 fs/buffer.c                 |   12 ++
 include/linux/mm.h          |    1 
 include/linux/page-flags.h  |   13 ++
 include/linux/page-states.h |  123 +++++++++++++++++++++++++++
 include/linux/pagemap.h     |    6 +
 mm/Makefile                 |    1 
 mm/filemap.c                |   74 +++++++++++++++-
 mm/memory.c                 |   58 ++++++++++++
 mm/page-states.c            |  197 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         |    2 
 mm/page_alloc.c             |   14 ++-
 mm/rmap.c                   |   93 ++++++++++++++++++--
 mm/swap.c                   |   14 +++
 mm/vmscan.c                 |   68 ++++++++++-----
 14 files changed, 640 insertions(+), 36 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3275,11 +3275,23 @@ void __init buffer_init(void)
 				SLAB_MEM_SPREAD),
 				init_buffer_head);
 
+#ifdef CONFIG_PAGE_STATES
+	/*
+	 * If volatile page cache is enabled we want to get as many
+	 * pages into volatile state as possible. Pages with private
+	 * information cannot be made volatile. Set max_buffer_heads
+	 * to zero to make shrink_active_list release the private
+	 * information when moving a page from the active to the inactive
+	 * list.
+	 */
+	max_buffer_heads = 0;
+#else
 	/*
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL
 	 */
 	nrpages = (nr_free_buffer_pages() * 10) / 100;
 	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+#endif
 	hotcpu_notifier(buffer_cpu_notify, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -280,6 +280,7 @@ static inline void init_page_count(struc
 }
 
 void put_page(struct page *page);
+void put_page_check(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -108,6 +108,8 @@
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
+#define PG_discarded		20	/* Page discarded by the hypervisor. */
+
 /*
  * Manipulation of page state flags
  */
@@ -296,6 +298,17 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#ifdef CONFIG_PAGE_STATES
+#define PageDiscarded(page)	test_bit(PG_discarded, &(page)->flags)
+#define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags)
+#define TestSetPageDiscarded(page) \
+		test_and_set_bit(PG_discarded, &(page)->flags)
+#else
+#define PageDiscarded(page)		0
+#define ClearPageDiscarded(page)	do { } while (0)
+#define TestSetPageDiscarded(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -12,6 +12,7 @@
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
 #include <linux/bitops.h>
+#include <linux/page-states.h>
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -59,7 +60,11 @@ static inline void mapping_set_gfp_mask(
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
 #define page_cache_get(page)		get_page(page)
+#ifdef CONFIG_PAGE_STATES
+#define page_cache_release(page)	put_page_check(page)
+#else
 #define page_cache_release(page)	put_page(page)
+#endif
 void release_pages(struct page **pages, int nr, int cold);
 
 #ifdef CONFIG_NUMA
@@ -139,6 +144,7 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
+extern void __remove_from_page_cache_nocheck(struct page *page);
 
 /*
  * Return byte-offset into filesystem object for page.
Index: linux-2.6/include/linux/page-states.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/page-states.h
@@ -0,0 +1,123 @@
+#ifndef _LINUX_PAGE_STATES_H
+#define _LINUX_PAGE_STATES_H
+
+/*
+ * include/linux/page-states.h
+ *
+ * Copyright IBM Corp. 2005, 2007
+ *
+ * Authors: Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *          Hubertus Franke <frankeh@watson.ibm.com>
+ *          Himanshu Raj <rhim@cc.gatech.edu>
+ */
+
+#include <linux/pagevec.h>
+
+#ifdef CONFIG_PAGE_STATES
+/*
+ * Guest page hinting primitives that need to be defined in the
+ * architecture header file if PAGE_STATES=y:
+ * - page_host_discards:
+ *     Indicates whether the host system discards guest pages or not.
+ * - page_set_unused:
+ *     Indicates to the host that the page content is of no interest
+ *     to the guest. The host can "forget" the page content and replace
+ *     it with a page containing zeroes.
+ * - page_set_stable:
+ *     Indicate to the host that the page content is needed by the guest.
+ * - page_set_volatile:
+ *     Make the page discardable by the host. Instead of writing the
+ *     page to the hosts swap device, the host can remove the page.
+ *     A guest that accesses such a discarded page gets a special
+ *     discard fault.
+ * - page_set_stable_if_present:
+ *     The page state is set to stable if the page has not been discarded
+ *     by the host. The check and the state change have to be done
+ *     atomically.
+ * - page_discarded:
+ *     Returns true if the page has been discarded by the host.
+ * - page_volatile:
+ *     Returns true if the page is marked volatile.
+ * - page_test_set_state_change:
+ *     Tries to lock the page for state change. The primitive does not need
+ *     to have page granularity, it can lock a range of pages.
+ * - page_clear_state_change:
+ *     Unlocks a page for state changes.
+ * - page_state_change:
+ *     Returns true if the page is locked for state change.
+ * - page_free_discarded:
+ *     Free a discarded page. This might require to put the page on a
+ *     discard list and a synchronization over all cpus. Returns true
+ *     if the architecture backend wants to do special things on free.
+ */
+#include <asm/page-states.h>
+
+extern void page_unmap_all(struct page *page);
+extern void page_discard(struct page *page);
+extern int  __page_make_stable(struct page *page);
+extern void __page_make_volatile(struct page *page, int offset);
+extern void __pagevec_make_volatile(struct pagevec *pvec);
+
+/*
+ * Extended guest page hinting functions defined by using the
+ * architecture primitives:
+ * - page_make_stable:
+ *     Tries to make a page stable. This operation can fail if the
+ *     host has discarded a page. The function returns != 0 if the
+ *     page could not be made stable.
+ * - page_make_volatile:
+ *     Tries to make a page volatile. There are a number of conditions
+ *     that prevent a page from becoming volatile. If at least one
+ *     is true the function does nothing. See mm/page-states.c for
+ *     details.
+ * - pagevec_make_volatile:
+ *     Tries to make a vector of pages volatile. For each page in the
+ *     vector the same conditions apply as for page_make_volatile.
+ * - page_discard:
+ *     Removes a discarded page from the system. The page is removed
+ *     from the LRU list and the radix tree of its mapping.
+ *     page_discard uses page_unmap_all to remove all page table
+ *     entries for a page.
+ */
+
+static inline int page_make_stable(struct page *page)
+{
+	return page_host_discards() ? __page_make_stable(page) : 1;
+}
+
+static inline void page_make_volatile(struct page *page, int offset)
+{
+	if (page_host_discards())
+		__page_make_volatile(page, offset);
+}
+
+static inline void pagevec_make_volatile(struct pagevec *pvec)
+{
+	if (page_host_discards())
+		__pagevec_make_volatile(pvec);
+}
+
+#else
+
+#define page_host_discards()			(0)
+#define page_set_unused(_page,_order)		do { } while (0)
+#define page_set_stable(_page,_order)		do { } while (0)
+#define page_set_volatile(_page)		do { } while (0)
+#define page_set_stable_if_present(_page)	(1)
+#define page_discarded(_page)			(0)
+#define page_volatile(_page)			(0)
+
+#define page_test_set_state_change(_page)	(0)
+#define page_clear_state_change(_page)		do { } while (0)
+#define page_state_change(_page)		(0)
+
+#define page_free_discarded(_page)		(0)
+
+#define page_make_stable(_page)			(1)
+#define page_make_volatile(_page, offset)	do { } while (0)
+#define pagevec_make_volatile(_pagevec)	do { } while (0)
+#define page_discard(_page)			do { } while (0)
+
+#endif
+
+#endif /* _LINUX_PAGE_STATES_H */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 #include "internal.h"
 
 /*
@@ -114,7 +115,7 @@ generic_file_direct_IO(int rw, struct ki
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold a write_lock on the mapping's tree_lock.
  */
-void __remove_from_page_cache(struct page *page)
+void inline __remove_from_page_cache_nocheck(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -138,6 +139,28 @@ void __remove_from_page_cache(struct pag
 	}
 }
 
+void __remove_from_page_cache(struct page *page)
+{
+	/*
+	 * Check if the discard fault handler already removed
+	 * the page from the page cache. If not set the discard
+	 * bit in the page flags to prevent double page free if
+	 * a discard fault is racing with normal page free.
+	 */
+	if (TestSetPageDiscarded(page))
+		return;
+
+	__remove_from_page_cache_nocheck(page);
+
+	/*
+	 * Check the hardware page state and clear the discard
+	 * bit in the page flags only if the page is not
+	 * discarded.
+	 */
+	if (!page_discarded(page))
+		ClearPageDiscarded(page);
+}
+
 void remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -566,6 +589,7 @@ void unlock_page(struct page *page)
 	if (!TestClearPageLocked(page))
 		BUG();
 	smp_mb__after_clear_bit(); 
+	page_make_volatile(page, 1);
 	wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
@@ -639,6 +663,14 @@ struct page * find_get_page(struct addre
 	if (page)
 		page_cache_get(page);
 	read_unlock_irq(&mapping->tree_lock);
+	if (page && unlikely(!page_make_stable(page))) {
+		/*
+		 * The page has been discarded by the host. Run the
+		 * discard handler and return NULL.
+		 */
+		page_discard(page);
+		page = NULL;
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -663,7 +695,15 @@ repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
+		if (unlikely(!page_make_stable(page))) {
+			/*
+			 * The page has been discarded by the host. Run the
+			 * discard handler and return NULL.
+			 */
+			read_unlock_irq(&mapping->tree_lock);
+			page_discard(page);
+			return NULL;
+		} else if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
 			__lock_page(page);
 
@@ -745,11 +785,24 @@ unsigned find_get_pages(struct address_s
 	unsigned int i;
 	unsigned int ret;
 
+repeat:
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
+	for (i = 0; i < ret; i++) {
 		page_cache_get(pages[i]);
+		if (likely(page_make_stable(pages[i])))
+			continue;
+		/*
+		 * Make stable failed, we discard the page and retry the
+		 * whole operation.
+		 */
+		read_unlock_irq(&mapping->tree_lock);
+		page_discard(pages[i]);
+		while (i--)
+			page_cache_release(pages[i]);
+		goto repeat;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -804,11 +857,24 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int i;
 	unsigned int ret;
 
+repeat:
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
+	for (i = 0; i < ret; i++) {
 		page_cache_get(pages[i]);
+		if (likely(page_make_stable(pages[i])))
+			continue;
+		/*
+		 * Make stable failed, we discard the page and retry the
+		 * whole operation.
+		 */
+		read_unlock_irq(&mapping->tree_lock);
+		page_discard(pages[i]);
+		while (i--)
+			page_cache_release(pages[i]);
+		goto repeat;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_PAGE_STATES) += page-states.o
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -487,6 +488,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
+		if (unlikely(PageDiscarded(page)))
+			goto out_discard_pte;
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
@@ -494,6 +497,21 @@ copy_one_pte(struct mm_struct *dst_mm, s
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return;
+
+out_discard_pte:
+	/*
+	 * If the page referred by the pte has the PG_discarded bit set,
+	 * copy_one_pte is racing with page_discard. The pte may not be
+	 * copied or we can end up with a pte pointing to a page not
+	 * in page cache anymore. Do what try_to_unmap_one would do
+	 * if the copy_one_pte had taken place before page_discard.
+	 */
+	if (page->index != linear_page_index(vma, addr))
+		/* If nonlinear, store the file page offset in the pte. */
+		set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
+	else
+		pte_clear(dst_mm, addr, dst_pte);
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -957,6 +975,19 @@ struct page *follow_page(struct vm_area_
 
 	if (flags & FOLL_GET)
 		get_page(page);
+
+	if (flags & FOLL_GET) {
+		/*
+		 * The page is made stable if a reference is acquired.
+		 * If the caller does not get a reference it implies that
+		 * the caller can deal with page faults in case the page
+		 * is swapped out. In this case the caller can deal with
+		 * discard faults as well.
+		 */
+		if (unlikely(!page_make_stable(page)))
+			goto out_discard;
+	}
+
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -980,6 +1011,11 @@ no_page_table:
 		BUG_ON(flags & FOLL_WRITE);
 	}
 	return page;
+
+out_discard:
+	pte_unmap_unlock(ptep, ptl);
+	page_discard(page);
+	return NULL;
 }
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
@@ -1622,6 +1658,11 @@ static int do_wp_page(struct mm_struct *
 		dirty_page = old_page;
 		get_page(dirty_page);
 		reuse = 1;
+		/*
+		 * dirty_page will be set dirty, so it needs to be stable.
+		 */
+		if (unlikely(!page_make_stable(dirty_page)))
+			goto discard;
 	}
 
 	if (reuse) {
@@ -1638,6 +1679,12 @@ static int do_wp_page(struct mm_struct *
 	 * Ok, we need to copy. Oh, well..
 	 */
 	page_cache_get(old_page);
+	/*
+	 * To copy the content of old_page it needs to be stable.
+	 * page_cache_release on old_page will make it volatile again.
+	 */
+	if (unlikely(!page_make_stable(old_page)))
+		goto discard;
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
@@ -1720,6 +1767,10 @@ oom:
 unwritable_page:
 	page_cache_release(old_page);
 	return VM_FAULT_SIGBUS;
+discard:
+	pte_unmap_unlock(page_table, ptl);
+	page_discard(old_page);
+	return VM_FAULT_MINOR;
 }
 
 /*
@@ -2198,7 +2249,7 @@ static int __do_fault(struct mm_struct *
 	vmf.page = NULL;
 
 	BUG_ON(vma->vm_flags & VM_PFNMAP);
-
+retry:
 	if (likely(vma->vm_ops->fault)) {
 		ret = vma->vm_ops->fault(vma, &vmf);
 		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2234,6 +2285,11 @@ static int __do_fault(struct mm_struct *
 				ret = VM_FAULT_OOM;
 				goto out;
 			}
+			if (unlikely(!page_make_stable(vmf.page))) {
+				unlock_page(vmf.page);
+				page_discard(vmf.page);
+				goto retry;
+			}
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 						vma, address);
 			if (!page) {
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -45,6 +45,7 @@
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -245,7 +246,8 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_buddy );
+			1 << PG_buddy |
+			1 << PG_discarded );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -531,6 +533,7 @@ static void __free_pages_ok(struct page 
 		reserved += free_pages_check(page + i);
 	if (reserved)
 		return;
+	page_set_unused(page, order);
 
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -990,10 +993,16 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	if (unlikely(PageDiscarded(page))) {
+		if (page_free_discarded(page))
+			return;
+	}
+
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
 		return;
+	page_set_unused(page, 0);
 
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
@@ -1107,6 +1116,7 @@ again:
 	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
+	page_set_stable(page, order);
 	if (prep_new_page(page, order, gfp_flags))
 		goto again;
 	return page;
@@ -1690,6 +1700,8 @@ void __pagevec_free(struct pagevec *pvec
 
 void __free_pages(struct page *page, unsigned int order)
 {
+	if (page_count(page) > 1)
+		page_make_volatile(page, 2);
 	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
Index: linux-2.6/mm/page-states.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/page-states.c
@@ -0,0 +1,197 @@
+/*
+ * mm/page-states.c
+ *
+ * (C) Copyright IBM Corp. 2005, 2007
+ *
+ * Guest page hinting functions.
+ *
+ * Authors: Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *          Hubertus Franke <frankeh@watson.ibm.com>
+ *          Himanshu Raj <rhim@cc.gatech.edu>
+ */
+
+#include <linux/mm.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include <linux/page-states.h>
+
+#include "internal.h"
+
+/*
+ * Check if there is anything in the page flags or the mapping
+ * that prevents the page from changing its state to volatile.
+ */
+static inline int check_bits(struct page *page)
+{
+	/*
+	 * There are several conditions that prevent a page from becoming
+	 * volatile. The first check is for the page bits.
+	 */
+	if (PageDirty(page) || PageReserved(page) || PageWriteback(page) ||
+	    PageLocked(page) || PagePrivate(page) || PageDiscarded(page) ||
+	    !PageUptodate(page) || !PageLRU(page) || PageAnon(page))
+		return 0;
+
+	/*
+	 * If the page has been truncated there is no point in making
+	 * it volatile. It will be freed soon. And if the mapping ever
+	 * had locked pages all pages of the mapping will stay stable.
+	 */
+	return page_mapping(page) != NULL;
+}
+
+/*
+ * Check the reference counter of the page against the number of
+ * mappings. The caller passes an offset, that is the number of
+ * extra, known references. The page cache itself is one extra
+ * reference. If the caller acquired an additional reference then
+ * the offset would be 2. If the page map counter is equal to the
+ * page count minus the offset then there is no other, unknown
+ * user of the page in the system.
+ */
+static inline int check_counts(struct page *page, unsigned int offset)
+{
+	return page_mapcount(page) + offset == page_count(page);
+}
+
+/*
+ * Attempts to change the state of a page to volatile.
+ * If there is something preventing the state change the page stays
+ * in its current state.
+ */
+void __page_make_volatile(struct page *page, int offset)
+{
+	preempt_disable();
+	if (!page_test_set_state_change(page)) {
+		if (check_bits(page) && check_counts(page, offset))
+			page_set_volatile(page);
+		page_clear_state_change(page);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(__page_make_volatile);
+
+/*
+ * Attempts to change the state of a vector of pages to volatile.
+ * If there is something preventing the state change the page stays
+ * in its current state.
+ */
+void __pagevec_make_volatile(struct pagevec *pvec)
+{
+	struct page *page;
+	int i = pagevec_count(pvec);
+
+	while (--i >= 0) {
+		/*
+		 * If we can't get the state change bit just give up.
+		 * The worst that can happen is that the page will stay
+		 * in the stable state although it might be volatile.
+		 */
+		page = pvec->pages[i];
+		if (!page_test_set_state_change(page)) {
+			if (check_bits(page) && check_counts(page, 1))
+				page_set_volatile(page);
+			page_clear_state_change(page);
+		}
+	}
+}
+EXPORT_SYMBOL(__pagevec_make_volatile);
+
+/*
+ * Attempts to change the state of a page to stable. The host could
+ * have removed a volatile page, the page_set_stable_if_present call
+ * can fail.
+ *
+ * returns "0" on success and "1" on failure
+ */
+int __page_make_stable(struct page *page)
+{
+	/*
+	 * Postpone state change to stable until the state change bit is
+	 * cleared. As long as the state change bit is set another cpu
+	 * is in page_make_volatile for this page. That makes sure that
+	 * no caller of make_stable "overtakes" a make_volatile leaving
+	 * the page in volatile where stable is required.
+	 * The caller of make_stable needs to make sure that no caller
+	 * of make_volatile can make the page volatile right after
+	 * make_stable has finished.
+	 */
+	while (page_state_change(page))
+		cpu_relax();
+	return page_set_stable_if_present(page);
+}
+EXPORT_SYMBOL(__page_make_stable);
+
+/**
+ * __page_discard() - remove a discarded page from the cache
+ *
+ * @page: the page
+ *
+ * The page passed to this function needs to be locked.
+ */
+static void __page_discard(struct page *page)
+{
+	struct address_space *mapping;
+	struct zone *zone;
+
+	/* Paranoia checks. */
+	VM_BUG_ON(PageWriteback(page));
+	VM_BUG_ON(PageDirty(page));
+	VM_BUG_ON(PagePrivate(page));
+
+	/* Set the discarded bit early. */
+	if (TestSetPageDiscarded(page))
+		return;
+
+	/* Unmap the page from all page tables. */
+	page_unmap_all(page);
+
+	/* Check if really all mappers of this page are gone. */
+	VM_BUG_ON(page_mapcount(page) != 0);
+
+	/*
+	 * Remove the page from LRU if it is currently added.
+	 * The users of isolate_lru_pages need to check the
+	 * discarded bit before readding the page to the LRU.
+	 */
+	zone = page_zone(page);
+	spin_lock_irq(&zone->lru_lock);
+	if (PageLRU(page)) {
+		/* Unlink page from lru. */
+		__ClearPageLRU(page);
+		del_page_from_lru(zone, page);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+
+	/* We can't handle swap cache pages (yet). */
+	VM_BUG_ON(PageSwapCache(page));
+
+	/* Remove page from page cache. */
+ 	mapping = page->mapping;
+	write_lock_irq(&mapping->tree_lock);
+	__remove_from_page_cache_nocheck(page);
+	write_unlock_irq(&mapping->tree_lock);
+	__put_page(page);
+}
+
+/**
+ * page_discard() - remove a discarded page from the cache
+ *
+ * @page: the page
+ *
+ * Before calling this function an additional page reference needs to
+ * be acquired. This reference is released by the function.
+ */
+void page_discard(struct page *page)
+{
+	lock_page(page);
+	__page_discard(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(page_discard);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -34,6 +34,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/page-states.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1192,6 +1193,7 @@ int test_clear_page_writeback(struct pag
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			page_make_volatile(page, 1);
 			if (bdi_cap_writeback_dirty(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 #include <asm/tlbflush.h>
 
@@ -247,13 +248,24 @@ pte_t *page_check_address(struct page *p
 		return NULL;
 
 	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
 	/* Make a quick check before getting the lock */
+#ifdef CONFIG_PAGE_STATES
+	/*
+	 * If the page table lock for this pte is taken we have to
+	 * assume that someone might be mapping the page. To solve
+	 * the race of a page discard vs. mapping the page we have
+	 * to serialize the two operations by taking the lock,
+	 * otherwise we end up with a pte for a page that has been
+	 * removed from page cache by the discard fault handler.
+	 */
+	if (!spin_is_locked(ptl))
+#endif
 	if (!pte_present(*pte)) {
 		pte_unmap(pte);
 		return NULL;
 	}
 
-	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
 		*ptlp = ptl;
@@ -617,6 +629,7 @@ void page_add_file_rmap(struct page *pag
 		 * This takes care of balancing the reference counts
 		 */
 		mem_cgroup_uncharge_page(page);
+	page_make_volatile(page, 1);
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -689,19 +702,14 @@ void page_remove_rmap(struct page *page,
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				unsigned long address, int migration)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
 	pte_t *pte;
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 
-	address = vma_address(page, vma);
-	if (address == -EFAULT)
-		goto out;
-
 	pte = page_check_address(page, mm, address, &ptl);
 	if (!pte)
 		goto out;
@@ -766,8 +774,14 @@ static int try_to_unmap_one(struct page 
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 	} else
 #endif
+	{
+#ifdef CONFIG_PAGE_STATES
+		/* If nonlinear, store the file page offset in the pte. */
+		if (page->index != linear_page_index(vma, address))
+			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+#endif
 		dec_mm_counter(mm, file_rss);
-
+	}
 
 	page_remove_rmap(page, vma);
 	page_cache_release(page);
@@ -871,6 +885,7 @@ static int try_to_unmap_anon(struct page
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
+	unsigned long address;
 	int ret = SWAP_AGAIN;
 
 	anon_vma = page_lock_anon_vma(page);
@@ -878,7 +893,10 @@ static int try_to_unmap_anon(struct page
 		return ret;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma, migration);
+		address = vma_address(page, vma);
+		if (address == -EFAULT)
+			continue;
+		ret = try_to_unmap_one(page, vma, address, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			break;
 	}
@@ -903,6 +921,7 @@ static int try_to_unmap_file(struct page
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = SWAP_AGAIN;
+	unsigned long address;
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
@@ -910,7 +929,10 @@ static int try_to_unmap_file(struct page
 
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma, migration);
+		address = vma_address(page, vma);
+		if (address == -EFAULT)
+			continue;
+		ret = try_to_unmap_one(page, vma, address, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			goto out;
 	}
@@ -1011,3 +1033,54 @@ int try_to_unmap(struct page *page, int 
 	return ret;
 }
 
+#ifdef CONFIG_PAGE_STATES
+
+/**
+ * page_unmap_all - removes all mappings of a page
+ *
+ * @page: the page which mapping in the vma should be struck down
+ *
+ * the caller needs to hold page lock
+ */
+void page_unmap_all(struct page* page)
+{
+	struct address_space *mapping = page_mapping(page);
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	unsigned long address;
+	int rc;
+
+	VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		address = vma_address(page, vma);
+		if (address == -EFAULT)
+			continue;
+		rc = try_to_unmap_one(page, vma, address, 0);
+		VM_BUG_ON(rc == SWAP_FAIL);
+	}
+
+	if (list_empty(&mapping->i_mmap_nonlinear))
+		goto out;
+
+	/*
+	 * Remove the non-linear mappings of the page. This is
+	 * awfully slow, but we have to find that discarded page..
+	 */
+	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+			    shared.vm_set.list) {
+		address = vma->vm_start;
+		while (address < vma->vm_end) {
+			rc = try_to_unmap_one(page, vma, address, 0);
+			VM_BUG_ON(rc == SWAP_FAIL);
+			address += PAGE_SIZE;
+		}
+	}
+
+out:
+	spin_unlock(&mapping->i_mmap_lock);
+}
+
+#endif
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -30,6 +30,7 @@
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -77,6 +78,16 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+#ifdef CONFIG_PAGE_STATES
+void put_page_check(struct page *page)
+{
+	if (page_count(page) > 1)
+		page_make_volatile(page, 2);
+	put_page(page);
+}
+EXPORT_SYMBOL(put_page_check);
+#endif
+
 /**
  * put_pages_list(): release a list of pages
  *
@@ -382,6 +393,8 @@ void __pagevec_release_nonlru(struct pag
 		struct page *page = pvec->pages[i];
 
 		VM_BUG_ON(PageLRU(page));
+		if (page_count(page) > 1)
+			page_make_volatile(page, 2);
 		if (put_page_testzero(page))
 			pagevec_add(&pages_to_free, page);
 	}
@@ -411,6 +424,7 @@ void __pagevec_lru_add(struct pagevec *p
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		add_page_to_inactive_list(zone, page);
+		page_make_volatile(page, 2);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -38,6 +38,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -903,13 +904,20 @@ static unsigned long shrink_inactive_lis
 		 */
 		while (!list_empty(&page_list)) {
 			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			/*
+			 * Only readd the page to lru list if it has not
+			 * been discarded.
+			 */
+			if (likely(!PageDiscarded(page))) {
+				VM_BUG_ON(PageLRU(page));
+				SetPageLRU(page);
+				if (PageActive(page))
+					add_page_to_active_list(zone, page);
+				else
+					add_page_to_inactive_list(zone, page);
+			} else
+				ClearPageActive(page);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -1122,14 +1130,23 @@ static void shrink_active_list(unsigned 
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
 		prefetchw_prev_lru_page(page, &l_inactive, flags);
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		VM_BUG_ON(!PageActive(page));
-		ClearPageActive(page);
-
-		list_move(&page->lru, &zone->inactive_list);
-		mem_cgroup_move_lists(page, false);
-		pgmoved++;
+		/*
+		 * Only readd the page to lru list if it has not
+		 * been discarded.
+		 */
+		if (likely(!PageDiscarded(page))) {
+			VM_BUG_ON(PageLRU(page));
+			SetPageLRU(page);
+			VM_BUG_ON(!PageActive(page));
+			ClearPageActive(page);
+			list_move(&page->lru, &zone->inactive_list);
+			mem_cgroup_move_lists(page, false);
+			pgmoved++;
+		} else {
+			ClearPageActive(page);
+			list_del(&page->lru);
+		}
+
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
 			spin_unlock_irq(&zone->lru_lock);
@@ -1137,6 +1154,7 @@ static void shrink_active_list(unsigned 
 			pgmoved = 0;
 			if (buffer_heads_over_limit)
 				pagevec_strip(&pvec);
+			pagevec_make_volatile(&pvec);
 			__pagevec_release(&pvec);
 			spin_lock_irq(&zone->lru_lock);
 		}
@@ -1146,6 +1164,7 @@ static void shrink_active_list(unsigned 
 	if (buffer_heads_over_limit) {
 		spin_unlock_irq(&zone->lru_lock);
 		pagevec_strip(&pvec);
+		pagevec_make_volatile(&pvec);
 		spin_lock_irq(&zone->lru_lock);
 	}
 
@@ -1153,13 +1172,22 @@ static void shrink_active_list(unsigned 
 	while (!list_empty(&l_active)) {
 		page = lru_to_page(&l_active);
 		prefetchw_prev_lru_page(page, &l_active, flags);
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		VM_BUG_ON(!PageActive(page));
+		/*
+		 * Only readd the page to lru list if it has not
+		 * been discarded.
+		 */
+		if (likely(!PageDiscarded(page))) {
+			VM_BUG_ON(PageLRU(page));
+			SetPageLRU(page);
+			VM_BUG_ON(!PageActive(page));
+			list_move(&page->lru, &zone->active_list);
+			mem_cgroup_move_lists(page, true);
+			pgmoved++;
+		} else {
+			ClearPageActive(page);
+			list_del(&page->lru);
+		}
 
-		list_move(&page->lru, &zone->active_list);
-		mem_cgroup_move_lists(page, true);
-		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
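
A note for readers of the page-states.c part of the patch above: the
make-volatile path boils down to a try-lock on a per-page "state change"
bit, followed by a reference count comparison. The fragment below is a
minimal userspace sketch of that shape, not the kernel code. struct
model_page and every model_* name are hypothetical stand-ins, C11 atomics
replace page_test_set_state_change/page_clear_state_change, and the
preempt_disable/preempt_enable pair is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel layout. */
struct model_page {
	int count;            /* models page_count() */
	int mapcount;         /* models page_mapcount() */
	bool blocked;         /* models a check_bits() veto (dirty, locked, ...) */
	bool is_volatile;     /* models the page state visible to the host */
	atomic_bool changing; /* models the per-page state change bit */
};

/* Same arithmetic as check_counts(): mappings + known refs == all refs. */
static bool model_check_counts(const struct model_page *p, int offset)
{
	return p->mapcount + offset == p->count;
}

/* Same shape as __page_make_volatile(): try-lock, check, release. */
static void model_make_volatile(struct model_page *p, int offset)
{
	/* Try-lock: if the bit was already set, give up and stay stable. */
	if (atomic_exchange(&p->changing, true))
		return;
	if (!p->blocked && model_check_counts(p, offset))
		p->is_volatile = true;
	atomic_store(&p->changing, false);
}
```

With two mappings plus the page cache reference (count 3, mapcount 2,
offset 1) the page becomes volatile. One extra unknown reference
(count 4) keeps it stable, which is the conservative outcome the comment
in __pagevec_make_volatile describes.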



* [patch 2/6] Guest page hinting: volatile swap cache.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
  2008-03-12 13:21 ` [patch 1/6] Guest page hinting: core + volatile page cache Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 13:21 ` [patch 3/6] Guest page hinting: mlocked pages Martin Schwidefsky
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 002-hva-swap.diff --]
[-- Type: text/plain, Size: 12791 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

The volatile page state can be used for anonymous pages as well, if
they have been added to the swap cache and the swap write is finished.
The tricky bit is in free_swap_and_cache. The call to find_get_page
deadlocks with the discard handler: if the page has been discarded,
find_get_page will try to remove it. To do that it needs the page table
locks of all mappers, but one of them is held by the caller of
free_swap_and_cache. A special variant of find_get_page is needed that
does not check the page state and returns a page reference even if the
page is discarded.
The second pitfall is that the page needs to be made stable before the
swap slot gets freed. If the page cannot be made stable because it has
been discarded, the swap slot must not be freed because it is still
needed to reload the discarded page from the swap device.
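
The second pitfall can be illustrated with a small userspace model. This
is only a sketch under stated assumptions: struct model_swap_page and the
model_* functions are hypothetical, and the "true means the page is now
stable" return convention is inferred from how page_make_stable is used
in the swapfile.c hunks, not from its definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page states for this sketch only. */
enum model_state { MODEL_STABLE, MODEL_VOLATILE, MODEL_DISCARDED };

struct model_swap_page {
	enum model_state state;
	bool in_swap_cache; /* page is still in the swap cache */
	bool slot_in_use;   /* swap slot still holds the page contents */
	bool dirty;         /* memory copy must be written out again */
};

/* Returns true if the page is stable now, false if the host discarded it. */
static bool model_make_stable(struct model_swap_page *p)
{
	if (p->state == MODEL_DISCARDED)
		return false;
	p->state = MODEL_STABLE;
	return true;
}

/* Drop the swap copy only after the memory copy is known to be safe. */
static void model_drop_swap_copy(struct model_swap_page *p)
{
	if (!model_make_stable(p))
		return; /* contents exist only on the swap device: keep slot */
	p->in_swap_cache = false;
	p->slot_in_use = false;
	p->dirty = true; /* the memory copy is now the only copy */
}
```

The ordering is the point of the sketch: the slot is given up only after
the stable transition has succeeded, so a discarded page can always be
reloaded from swap.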

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 include/linux/pagemap.h |    3 ++
 include/linux/swap.h    |    5 ++++
 mm/filemap.c            |   19 +++++++++++++++++
 mm/memory.c             |   13 +++++++++++-
 mm/page-states.c        |   34 +++++++++++++++++++++++---------
 mm/rmap.c               |   51 ++++++++++++++++++++++++++++++++++++++++++++----
 mm/swap_state.c         |   25 ++++++++++++++++++++++-
 mm/swapfile.c           |   30 ++++++++++++++++++++++------
 mm/vmscan.c             |    3 ++
 9 files changed, 162 insertions(+), 21 deletions(-)

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -61,8 +61,11 @@ static inline void mapping_set_gfp_mask(
 
 #define page_cache_get(page)		get_page(page)
 #ifdef CONFIG_PAGE_STATES
+extern struct page * find_get_page_nodiscard(struct address_space *mapping,
+					     unsigned long index);
 #define page_cache_release(page)	put_page_check(page)
 #else
+#define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index)
 #define page_cache_release(page)	put_page(page)
 #endif
 void release_pages(struct page **pages, int nr, int cold);
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -226,6 +226,7 @@ extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, gfp_t);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern void __delete_from_swap_cache(struct page *);
+extern void __delete_from_swap_cache_nocheck(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
@@ -331,6 +332,10 @@ static inline void __delete_from_swap_ca
 {
 }
 
+static inline void __delete_from_swap_cache_nocheck(struct page *page)
+{
+}
+
 static inline void delete_from_swap_cache(struct page *page)
 {
 }
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -537,6 +537,25 @@ static int __sleep_on_page_lock(void *wo
 	return 0;
 }
 
+#ifdef CONFIG_PAGE_STATES
+
+struct page * find_get_page_nodiscard(struct address_space *mapping,
+				      unsigned long offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+	page = radix_tree_lookup(&mapping->page_tree, offset);
+	if (page)
+		page_cache_get(page);
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+EXPORT_SYMBOL(find_get_page_nodiscard);
+
+#endif
+
 /*
  * In order to wait for pages to become available there must be
  * waitqueues associated with pages. By using a hash table of
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -507,7 +507,18 @@ out_discard_pte:
 	 * in page cache anymore. Do what try_to_unmap_one would do
 	 * if the copy_one_pte had taken place before page_discard.
 	 */
-	if (page->index != linear_page_index(vma, addr))
+	if (PageAnon(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		swap_duplicate(entry);
+		if (list_empty(&dst_mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&dst_mm->mmlist))
+				list_add(&dst_mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+		pte = swp_entry_to_pte(entry);
+		set_pte_at(dst_mm, addr, dst_pte, pte);
+	} else if (page->index != linear_page_index(vma, addr))
 		/* If nonlinear, store the file page offset in the pte. */
 		set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
 	else
Index: linux-2.6/mm/page-states.c
===================================================================
--- linux-2.6.orig/mm/page-states.c
+++ linux-2.6/mm/page-states.c
@@ -19,6 +19,7 @@
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
 #include <linux/page-states.h>
+#include <linux/swap.h>
 
 #include "internal.h"
 
@@ -34,7 +35,16 @@ static inline int check_bits(struct page
 	 */
 	if (PageDirty(page) || PageReserved(page) || PageWriteback(page) ||
 	    PageLocked(page) || PagePrivate(page) || PageDiscarded(page) ||
-	    !PageUptodate(page) || !PageLRU(page) || PageAnon(page))
+	    !PageUptodate(page) || !PageLRU(page) ||
+	    (PageAnon(page) && !PageSwapCache(page)))
+		return 0;
+
+	/*
+	 * Special case shared memory: page is PageSwapCache but not
+	 * PageAnon. page_unmap_all fails for swapped shared memory
+	 * pages.
+	 */
+	if (PageSwapCache(page) && !PageAnon(page))
 		return 0;
 
 	/*
@@ -168,15 +178,21 @@ static void __page_discard(struct page *
 	}
 	spin_unlock_irq(&zone->lru_lock);
 
-	/* We can't handle swap cache pages (yet). */
-	VM_BUG_ON(PageSwapCache(page));
-
-	/* Remove page from page cache. */
+	/* Remove page from page cache/swap cache. */
  	mapping = page->mapping;
-	write_lock_irq(&mapping->tree_lock);
-	__remove_from_page_cache_nocheck(page);
-	write_unlock_irq(&mapping->tree_lock);
-	__put_page(page);
+	if (PageSwapCache(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		write_lock_irq(&swapper_space.tree_lock);
+		__delete_from_swap_cache_nocheck(page);
+		write_unlock_irq(&swapper_space.tree_lock);
+		swap_free(entry);
+		page_cache_release(page);
+	} else {
+		write_lock_irq(&mapping->tree_lock);
+		__remove_from_page_cache_nocheck(page);
+		write_unlock_irq(&mapping->tree_lock);
+ 		__put_page(page);
+	}
 }
 
 /**
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -593,6 +593,7 @@ void page_add_anon_rmap(struct page *pag
 		 */
 		mem_cgroup_uncharge_page(page);
 	}
+	page_make_volatile(page, 1);
 }
 
 /*
@@ -1036,13 +1037,13 @@ int try_to_unmap(struct page *page, int 
 #ifdef CONFIG_PAGE_STATES
 
 /**
- * page_unmap_all - removes all mappings of a page
+ * page_unmap_file - removes all mappings of a file page
  *
  * @page: the page which mapping in the vma should be struck down
  *
  * the caller needs to hold page lock
  */
-void page_unmap_all(struct page* page)
+static void page_unmap_file(struct page* page)
 {
 	struct address_space *mapping = page_mapping(page);
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1051,8 +1052,6 @@ void page_unmap_all(struct page* page)
 	unsigned long address;
 	int rc;
 
-	VM_BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page));
-
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		address = vma_address(page, vma);
@@ -1083,4 +1082,48 @@ out:
 	spin_unlock(&mapping->i_mmap_lock);
 }
 
+/**
+ * page_unmap_anon - removes all mappings of an anonymous page
+ *
+ * @page: the page which mapping in the vma should be struck down
+ *
+ * the caller needs to hold page lock
+ */
+static void page_unmap_anon(struct page* page)
+{
+	struct anon_vma *anon_vma;
+	struct vm_area_struct *vma;
+	unsigned long address;
+	int rc;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return;
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		address = vma_address(page, vma);
+		if (address == -EFAULT)
+			continue;
+		rc = try_to_unmap_one(page, vma, address, 0);
+		VM_BUG_ON(rc == SWAP_FAIL);
+	}
+	page_unlock_anon_vma(anon_vma);
+}
+
+/**
+ * page_unmap_all - removes all mappings of a page
+ *
+ * @page: the page which mapping in the vma should be struck down
+ *
+ * the caller needs to hold page lock
+ */
+void page_unmap_all(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page) || PageReserved(page));
+
+	if (PageAnon(page))
+		page_unmap_anon(page);
+	else
+		page_unmap_file(page);
+}
+
 #endif
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -28,6 +28,7 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
+#include <linux/page-states.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -370,9 +371,11 @@ int remove_exclusive_swap_page(struct pa
 		/* Recheck the page count with the swapcache lock held.. */
 		write_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
-			__delete_from_swap_cache(page);
-			SetPageDirty(page);
-			retval = 1;
+			if (likely(page_make_stable(page))) {
+				__delete_from_swap_cache(page);
+				SetPageDirty(page);
+				retval = 1;
+			}
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
 	}
@@ -401,7 +404,13 @@ void free_swap_and_cache(swp_entry_t ent
 	p = swap_info_get(entry);
 	if (p) {
 		if (swap_entry_free(p, swp_offset(entry)) == 1) {
-			page = find_get_page(&swapper_space, entry.val);
+			/*
+			 * Use find_get_page_nodiscard to avoid the deadlock
+			 * on the swap_lock and the page table lock if the
+			 * page has been discarded.
+			 */
+			page = find_get_page_nodiscard(&swapper_space,
+						       entry.val);
 			if (page && unlikely(TestSetPageLocked(page))) {
 				page_cache_release(page);
 				page = NULL;
@@ -418,8 +427,17 @@ void free_swap_and_cache(swp_entry_t ent
 		/* Also recheck PageSwapCache after page is locked (above) */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
 					(one_user || vm_swap_full())) {
-			delete_from_swap_cache(page);
-			SetPageDirty(page);
+			/*
+			 * To be able to reload the page from swap the
+			 * swap slot may not be freed. The caller of
+			 * free_swap_and_cache holds a page table lock
+			 * for this page. The discarded page can not be
+			 * removed here.
+			 */
+			if (likely(page_make_stable(page))) {
+				delete_from_swap_cache(page);
+				SetPageDirty(page);
+			}
 		}
 		unlock_page(page);
 		page_cache_release(page);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/page-states.h>
 
 #include <asm/pgtable.h>
 
@@ -97,7 +98,7 @@ int add_to_swap_cache(struct page *page,
  * This must be called only on pages that have
  * been verified to be in the swap cache.
  */
-void __delete_from_swap_cache(struct page *page)
+void inline __delete_from_swap_cache_nocheck(struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
@@ -112,6 +113,28 @@ void __delete_from_swap_cache(struct pag
 	INC_CACHE_INFO(del_total);
 }
 
+void __delete_from_swap_cache(struct page *page)
+{
+	/*
+	 * Check if the discard fault handler already removed
+	 * the page from the page cache. If not set the discard
+	 * bit in the page flags to prevent double page free if
+	 * a discard fault is racing with normal page free.
+	 */
+	if (TestSetPageDiscarded(page))
+		return;
+
+	__delete_from_swap_cache_nocheck(page);
+
+	/*
+	 * Check the hardware page state and clear the discard
+	 * bit in the page flags only if the page is not
+	 * discarded.
+	 */
+	if (!page_discarded(page))
+		ClearPageDiscarded(page);
+}
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -492,6 +492,9 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		if (unlikely(PageDiscarded(page)))
+			goto free_it;
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
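
A note on the __delete_from_swap_cache change in swap_state.c above: the
discarded bit doubles as a "removal already done" guard between the
normal free path and the discard fault handler. The fragment below is a
rough userspace model of that guard only. struct model_cached_page and
the model_* names are hypothetical, and an atomic exchange stands in for
TestSetPageDiscarded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a swap cache page; sketch only. */
struct model_cached_page {
	atomic_bool discarded_flag; /* models the PG_discarded page flag */
	bool host_discarded;        /* models the hardware page state */
	int removals;               /* counts actual cache removals */
};

/* Same shape as the new __delete_from_swap_cache(). */
static void model_delete_from_cache(struct model_cached_page *p)
{
	/* If the flag was already set, another path removed the page. */
	if (atomic_exchange(&p->discarded_flag, true))
		return;

	p->removals++; /* models __delete_from_swap_cache_nocheck() */

	/* Keep the flag set if the host really discarded the page. */
	if (!p->host_discarded)
		atomic_store(&p->discarded_flag, false);
}
```

Whichever path sets the flag first performs the removal, and a second
caller returns early, so in this model the page is removed at most once.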



* [patch 3/6] Guest page hinting: mlocked pages.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
  2008-03-12 13:21 ` [patch 1/6] Guest page hinting: core + volatile page cache Martin Schwidefsky
  2008-03-12 13:21 ` [patch 2/6] Guest page hinting: volatile swap cache Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 23:27   ` Rusty Russell
  2008-03-12 13:21 ` [patch 4/6] Guest page hinting: writable page table entries Martin Schwidefsky
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 003-hva-mlock.diff --]
[-- Type: text/plain, Size: 5019 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

Add code to get mlock() working with guest page hinting. The problem
with mlock is that locked pages must not be removed from the page cache.
That means they need to be stable. page_make_volatile needs a way to
check if a page has been locked. To avoid traversing vma lists - which
would hurt performance a lot - a field is added to struct
address_space. This field is set in mlock_fixup if a vma gets mlocked.
The field never gets cleared - once a file has had an mlocked vma, all
future pages added to it will stay stable.

The pages of an mlocked area are made present in the linux page table by
a call to make_pages_present, which calls get_user_pages and follow_page.
The follow_page function is called for each page in the mlocked vma; if
the VM_LOCKED bit in the vma flags is set, the page is made stable.
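
The sticky per-mapping flag can be sketched in userspace as follows. All
names here (struct model_mapping, struct model_vma, MODEL_VM_LOCKED and
the model_* functions) are hypothetical illustrations of the idea, not
the kernel structures, and the volatile check is reduced to the mlocked
test alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_VM_LOCKED 0x1UL /* hypothetical stand-in for VM_LOCKED */

struct model_mapping {
	unsigned int mlocked; /* set once a locked vma has been seen */
};

struct model_vma {
	unsigned long flags;
	struct model_mapping *file_mapping; /* NULL for anonymous memory */
};

/* Models the mlock_fixup hook: the flag is set on lock, never cleared. */
static void model_mlock_fixup(struct model_vma *vma, unsigned long newflags)
{
	if ((newflags & MODEL_VM_LOCKED) && vma->file_mapping)
		vma->file_mapping->mlocked = 1;
	vma->flags = newflags;
}

/* Models the extra make-volatile check: one field read, no vma walk. */
static bool model_may_become_volatile(const struct model_mapping *m)
{
	return m != NULL && !m->mlocked;
}
```

The trade-off is visible in the model: after munlock the mapping still
reports mlocked, so its pages stay stable for good, in exchange for a
check that costs a single field read.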

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 include/linux/fs.h |   10 ++++++++++
 mm/memory.c        |    5 +++--
 mm/mlock.c         |    3 +++
 mm/page-states.c   |    5 ++++-
 mm/rmap.c          |   13 +++++++++++--
 5 files changed, 31 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -513,6 +513,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_PAGE_STATES
+	unsigned int		mlocked;	/* set if VM_LOCKED vmas present */
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -520,6 +523,13 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+static inline void mapping_set_mlocked(struct address_space *mapping)
+{
+#ifdef CONFIG_PAGE_STATES
+	mapping->mlocked = 1;
+#endif
+}
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -987,9 +987,10 @@ struct page *follow_page(struct vm_area_
 	if (flags & FOLL_GET)
 		get_page(page);
 
-	if (flags & FOLL_GET) {
+	if ((flags & FOLL_GET) || (vma->vm_flags & VM_LOCKED)) {
 		/*
-		 * The page is made stable if a reference is acquired.
+		 * The page is made stable if a reference is acquired or
+		 * the vm area is locked.
 		 * If the caller does not get a reference it implies that
 		 * the caller can deal with page faults in case the page
 		 * is swapped out. In this case the caller can deal with
Index: linux-2.6/mm/mlock.c
===================================================================
--- linux-2.6.orig/mm/mlock.c
+++ linux-2.6/mm/mlock.c
@@ -12,6 +12,7 @@
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/fs.h>
 
 int can_do_mlock(void)
 {
@@ -71,6 +72,8 @@ success:
 	 */
 	pages = (end - start) >> PAGE_SHIFT;
 	if (newflags & VM_LOCKED) {
+		if (vma->vm_file && vma->vm_file->f_mapping)
+			mapping_set_mlocked(vma->vm_file->f_mapping);
 		pages = -pages;
 		if (!(newflags & VM_IO))
 			ret = make_pages_present(start, end);
Index: linux-2.6/mm/page-states.c
===================================================================
--- linux-2.6.orig/mm/page-states.c
+++ linux-2.6/mm/page-states.c
@@ -29,6 +29,8 @@
  */
 static inline int check_bits(struct page *page)
 {
+	struct address_space *mapping;
+
 	/*
 	 * There are several conditions that prevent a page from becoming
 	 * volatile. The first check is for the page bits.
@@ -52,7 +54,8 @@ static inline int check_bits(struct page
 	 * it volatile. It will be freed soon. And if the mapping ever
 	 * had locked pages all pages of the mapping will stay stable.
 	 */
-	return page_mapping(page) != NULL;
+	mapping = page_mapping(page);
+	return mapping && !mapping->mlocked;
 }
 
 /*
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -722,8 +722,17 @@ static int try_to_unmap_one(struct page 
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
 			(ptep_clear_flush_young(vma, address, pte)))) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+		/*
+		 * Check for discarded pages. This can happen if there have
+		 * been discarded pages before a vma gets mlocked. The code
+		 * in make_pages_present will force all discarded pages out
+		 * and reload them. That happens after the VM_LOCKED bit
+		 * has been set.
+		 */
+		if (likely(!PageDiscarded(page))) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
 	}
 
 	/* Nuke the page table entry. */

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* [patch 4/6] Guest page hinting: writable page table entries.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
                   ` (2 preceding siblings ...)
  2008-03-12 13:21 ` [patch 3/6] Guest page hinting: mlocked pages Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 23:35   ` Rusty Russell
  2008-03-12 13:21 ` [patch 5/6] Guest page hinting: minor fault optimization Martin Schwidefsky
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 004-hva-prot.diff --]
[-- Type: text/plain, Size: 11418 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

The volatile state for page cache and swap cache pages requires that
the host system is able to determine whether a volatile page is dirty
before removing it. This excludes almost all platforms from using the
scheme. What is needed is a way to distinguish between pages that are
purely read-only and pages that might get written to. This allows
platforms with per-pte dirty bits to use the scheme and gives platforms
with per-page dirty bits a small optimization.

Whenever a writable pte is created, a check is added that moves the
page into the correct state. This needs to be done before
the writable pte is established. To avoid unnecessary state transitions
and the need for a counter, a new page flag PG_writable is added. Only
the creation of the first writable pte will do a page state change.
Even if all the writable ptes pointing to a page are removed again,
the page stays in the safe state until all read-only users of the page
have unmapped it as well. Only then is the PG_writable bit reset.
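
The "first writable pte pays" rule can be sketched in userspace with
C11 atomics. This is only a model: struct page_model, its
state_changes counter and the function names are made up for
illustration, and the kernel uses test_and_set_bit() on page->flags
instead of atomic_fetch_or().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_WRITABLE_MASK (1UL << 21)	/* bit number taken from the patch */

/* Hypothetical model of a page; not the kernel's struct page. */
struct page_model {
	atomic_ulong flags;
	int state_changes;	/* counts simulated page state changes */
};

static void page_model_init(struct page_model *page)
{
	atomic_init(&page->flags, 0UL);
	page->state_changes = 0;
}

/* Returns the previous value of the bit, like test_and_set_bit(). */
static bool test_set_writable(struct page_model *page)
{
	unsigned long old = atomic_fetch_or(&page->flags, PG_WRITABLE_MASK);

	return (old & PG_WRITABLE_MASK) != 0;
}

/* Called before a writable pte is established. */
static void note_writable_pte(struct page_model *page)
{
	if (!test_set_writable(page))
		page->state_changes++;	/* only the first writable pte pays */
}

/* Called once the last user of the page has unmapped it. */
static void reset_writable(struct page_model *page)
{
	atomic_fetch_and(&page->flags, ~PG_WRITABLE_MASK);
}
```

Because the flag stays set until the reset, no counter of writable
ptes is needed.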

The state a page needs to have if a writable pte is present depends
on the platform. A platform with per-pte dirty bits wants to move the
page into stable state, a platform with per-page dirty bits like s390
can decide to move the page into a special state that requires the host
system to check the dirty bit before discarding a page.
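
A small sketch of the platform-dependent choice described above. The
enum and the function are hypothetical names used only for this
illustration. The sketch also assumes that all other conditions for
making a page volatile (page bits, reference counts) already hold.

```c
#include <stdbool.h>

/* Hypothetical names; the patch itself uses page_set_volatile() etc. */
enum page_hint {
	HINT_STABLE,
	HINT_VOLATILE,
	HINT_POTENTIAL_VOLATILE,
};

/*
 * per_page_dirty: true for platforms like s390 that track dirtiness
 * per physical page, false for platforms with per-pte dirty bits.
 */
static enum page_hint hint_for_page(bool per_page_dirty,
				    bool has_writable_pte)
{
	if (!has_writable_pte)
		return HINT_VOLATILE;	/* purely read-only mappings */
	/* A writable pte exists: the host must not lose dirty data. */
	return per_page_dirty ? HINT_POTENTIAL_VOLATILE : HINT_STABLE;
}
```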

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 fs/exec.c                   |    1 
 include/linux/page-flags.h  |    6 ++++
 include/linux/page-states.h |   27 +++++++++++++++++++-
 mm/memory.c                 |    5 +++
 mm/mprotect.c               |    2 +
 mm/page-states.c            |   58 ++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c             |    3 +-
 mm/rmap.c                   |    1 
 8 files changed, 99 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -51,6 +51,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/page-states.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -109,6 +109,7 @@
 #endif
 
 #define PG_discarded		20	/* Page discarded by the hypervisor. */
+#define PG_writable		21	/* Page is mapped writable. */
 
 /*
  * Manipulation of page state flags
@@ -309,6 +310,11 @@ static inline void __ClearPageTail(struc
 #define TestSetPageDiscarded(page)	0
 #endif
 
+#define PageWritable(page) test_bit(PG_writable, &(page)->flags)
+#define TestSetPageWritable(page) \
+		test_and_set_bit(PG_writable, &(page)->flags)
+#define ClearPageWritable(page) clear_bit(PG_writable, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/page-states.h
===================================================================
--- linux-2.6.orig/include/linux/page-states.h
+++ linux-2.6/include/linux/page-states.h
@@ -57,6 +57,9 @@ extern void page_discard(struct page *pa
 extern int  __page_make_stable(struct page *page);
 extern void __page_make_volatile(struct page *page, int offset);
 extern void __pagevec_make_volatile(struct pagevec *pvec);
+extern void __page_check_writable(struct page *page, pte_t pte,
+				  unsigned int offset);
+extern void __page_reset_writable(struct page *page);
 
 /*
  * Extended guest page hinting functions defined by using the
@@ -78,6 +81,12 @@ extern void __pagevec_make_volatile(stru
  *     from the LRU list and the radix tree of its mapping.
  *     page_discard uses page_unmap_all to remove all page table
  *     entries for a page.
+ * - page_check_writable:
+ *     Checks if the page state needs to be adapted because a new
+ *     writable page table entry referring to the page is established.
+ * - page_reset_writable:
+ *     Resets the page state after the last writable page table entry
+ *     referring to the page has been removed.
  */
 
 static inline int page_make_stable(struct page *page)
@@ -97,12 +106,26 @@ static inline void pagevec_make_volatile
 		__pagevec_make_volatile(pvec);
 }
 
+static inline void page_check_writable(struct page *page, pte_t pte,
+				       unsigned int offset)
+{
+	if (page_host_discards() && pte_write(pte) &&
+	    !test_bit(PG_writable, &page->flags))
+		__page_check_writable(page, pte, offset);
+}
+
+static inline void page_reset_writable(struct page *page)
+{
+	if (page_host_discards() && test_bit(PG_writable, &page->flags))
+		__page_reset_writable(page);
+}
+
 #else
 
 #define page_host_discards()			(0)
 #define page_set_unused(_page,_order)		do { } while (0)
 #define page_set_stable(_page,_order)		do { } while (0)
-#define page_set_volatile(_page)		do { } while (0)
+#define page_set_volatile(_page,_writable)	do { } while (0)
 #define page_set_stable_if_present(_page)	(1)
 #define page_discarded(_page)			(0)
 #define page_volatile(_page)			(0)
@@ -117,6 +140,8 @@ static inline void pagevec_make_volatile
 #define page_make_volatile(_page, offset)	do { } while (0)
 #define pagevec_make_volatile(_pagevec)	do { } while (0)
 #define page_discard(_page)			do { } while (0)
+#define page_check_writable(_page,_pte,_off)	do { } while (0)
+#define page_reset_writable(_page)		do { } while (0)
 
 #endif
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1681,6 +1681,7 @@ static int do_wp_page(struct mm_struct *
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_check_writable(old_page, entry, 1);
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
@@ -1728,6 +1729,7 @@ gotten:
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_check_writable(new_page, entry, 2);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2147,6 +2149,7 @@ static int do_swap_page(struct mm_struct
 	}
 
 	flush_icache_page(vma, page);
+	page_check_writable(page, pte, 2);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
 
@@ -2204,6 +2207,7 @@ static int do_anonymous_page(struct mm_s
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	page_check_writable(page, entry, 2);
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte_none(*page_table))
@@ -2365,6 +2369,7 @@ retry:
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_check_writable(page, entry, 2);
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/page-states.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -52,6 +53,7 @@ static void change_pte_range(struct mm_s
 			 */
 			if (dirty_accountable && pte_dirty(ptent))
 				ptent = pte_mkwrite(ptent);
+			page_check_writable(pte_page(ptent), ptent, 1);
 			set_pte_at(mm, addr, pte, ptent);
 #ifdef CONFIG_MIGRATION
 		} else if (!pte_file(oldpte)) {
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -637,7 +637,8 @@ static int prep_new_page(struct page *pa
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk |
+			1 << PG_writable);
 	set_page_private(page, 0);
 	set_page_refcounted(page);
 
Index: linux-2.6/mm/page-states.c
===================================================================
--- linux-2.6.orig/mm/page-states.c
+++ linux-2.6/mm/page-states.c
@@ -82,7 +82,7 @@ void __page_make_volatile(struct page *p
 	preempt_disable();
 	if (!page_test_set_state_change(page)) {
 		if (check_bits(page) && check_counts(page, offset))
-			page_set_volatile(page);
+			page_set_volatile(page, PageWritable(page));
 		page_clear_state_change(page);
 	}
 	preempt_enable();
@@ -108,7 +108,7 @@ void __pagevec_make_volatile(struct page
 		page = pvec->pages[i];
 		if (!page_test_set_state_change(page)) {
 			if (check_bits(page) && check_counts(page, 1))
-				page_set_volatile(page);
+				page_set_volatile(page, PageWritable(page));
 			page_clear_state_change(page);
 		}
 	}
@@ -141,6 +141,60 @@ int __page_make_stable(struct page *page
 EXPORT_SYMBOL(__page_make_stable);
 
 /**
+ * __page_check_writable() - check page state for new writable pte
+ *
+ * @page: the page the new writable pte refers to
+ * @pte: the new writable pte
+ */
+void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
+{
+	int count_ok = 0;
+
+	preempt_disable();
+	while (page_test_set_state_change(page))
+		cpu_relax();
+
+	if (!TestSetPageWritable(page)) {
+		count_ok = check_counts(page, offset);
+		if (check_bits(page) && count_ok)
+			page_set_volatile(page, 1);
+		else
+			/*
+			 * If two processes create a write mapping at the
+			 * same time check_counts will return false or if
+			 * the page is currently isolated from the LRU
+			 * check_bits will return false but the page might
+			 * be in volatile state.
+			 * We have to take care about the dirty bit so the
+			 * only option left is to make the page stable but
+			 * we can try to make it volatile a bit later.
+			 */
+			page_set_stable_if_present(page);
+	}
+	page_clear_state_change(page);
+	if (!count_ok)
+		page_make_volatile(page, 1);
+	preempt_enable();
+}
+EXPORT_SYMBOL(__page_check_writable);
+
+/**
+ * __page_reset_writable() - clear the PageWritable bit
+ *
+ * @page: the page
+ */
+void __page_reset_writable(struct page *page)
+{
+	preempt_disable();
+	if (!page_test_set_state_change(page)) {
+		ClearPageWritable(page);
+		page_clear_state_change(page);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(__page_reset_writable);
+
+/**
  * __page_discard() - remove a discarded page from the cache
  *
  * @page: the page
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -695,6 +695,7 @@ void page_remove_rmap(struct page *page,
 
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+		page_reset_writable(page);
 	}
 }
 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* [patch 5/6] Guest page hinting: minor fault optimization.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
                   ` (3 preceding siblings ...)
  2008-03-12 13:21 ` [patch 4/6] Guest page hinting: writable page table entries Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 13:21 ` [patch 6/6] Guest page hinting: s390 support Martin Schwidefsky
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 005-hva-nohv.diff --]
[-- Type: text/plain, Size: 9306 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

One of the challenges of the guest page hinting scheme is the cost of
the state transitions. If the cost gets too high the whole concept of
page state information is in question. Therefore it is important to
avoid the state transitions when possible. One place where the state
transitions can be avoided is the minor fault path. Why change the page
state to stable in find_get_page and back in page_add_anon_rmap/
page_add_file_rmap if the discarded pages can be handled by the discard
fault handler? If the page is in page/swap cache just map it even if it
is already discarded. The first access to the page will cause a discard
fault which needs to be able to deal with this kind of situation anyway
because of other races in the memory management.

The special find_get_page_nodiscard variant introduced for volatile
swap cache is used which does not change the page state, and a matching
find_lock_page_nodiscard is added. The calls to find_lock_page in
filemap_fault and to find_get_page in lookup_swap_cache are replaced
with the nodiscard variants. The use of these functions creates a new
race. If a minor fault races with the discard of a page, the page may
not get mapped to the page table because the discard handler removed
the page from the cache, which removes the page->mapping that is needed
to find the page table entry. A check for the discarded bit is added to
do_swap_page and __do_fault (formerly do_no_page). The page table lock
for the pte takes care of the synchronization.
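
The added check can be modelled as a predicate that is evaluated while
the page table lock is held. The struct and function below are
hypothetical and only illustrate the condition; locking itself is not
modelled.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what the fault handler sees under the lock. */
struct fault_snapshot {
	unsigned long orig_pte;	/* pte value read before taking the lock */
	unsigned long cur_pte;	/* pte value read under the lock */
	bool page_discarded;	/* models PageDiscarded(page) */
};

/*
 * The page is mapped only if nobody else changed the pte in the
 * meantime and the page was not discarded while the fault was handled.
 */
static bool may_map_page(const struct fault_snapshot *snap)
{
	return snap->cur_pte == snap->orig_pte && !snap->page_discarded;
}
```

If the predicate fails, the fault is simply backed out and retried, as
for any other lost race on the pte.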

That removes the state transitions on the minor fault path. A page that
has been mapped will eventually be unmapped again. On the unmap path
each page that has been removed from the page table is freed with a
call to page_cache_release. In general that causes an unnecessary page
state transition from volatile to volatile. To get rid of these state
transitions as well, a special variant of page_cache_release is added
that does not attempt to make the page volatile.
page_cache_release_nocheck is then used in free_page_and_swap_cache
and release_pages. This makes the unmapping of ptes free of state
transitions.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 include/linux/pagemap.h |    4 ++++
 include/linux/swap.h    |    2 +-
 mm/filemap.c            |   35 ++++++++++++++++++++++++++++++++---
 mm/fremap.c             |    1 +
 mm/memory.c             |    4 ++--
 mm/rmap.c               |    4 +---
 mm/shmem.c              |    7 +++++++
 mm/swap_state.c         |    4 ++--
 8 files changed, 50 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -63,11 +63,15 @@ static inline void mapping_set_gfp_mask(
 #ifdef CONFIG_PAGE_STATES
 extern struct page * find_get_page_nodiscard(struct address_space *mapping,
 					     unsigned long index);
+extern struct page * find_lock_page_nodiscard(struct address_space *mapping,
+					      unsigned long index);
 #define page_cache_release(page)	put_page_check(page)
 #else
 #define find_get_page_nodiscard(mapping, index) find_get_page(mapping, index)
+#define find_lock_page_nodiscard(mapping, index) find_lock_page(mapping, index)
 #define page_cache_release(page)	put_page(page)
 #endif
+#define page_cache_release_nocheck(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
 #ifdef CONFIG_NUMA
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -288,7 +288,7 @@ static inline void disable_swap_token(vo
 /* only sparc can not include linux/pagemap.h in this file
  * so leave page_cache_release and release_pages undeclared... */
 #define free_page_and_swap_cache(page) \
-	page_cache_release(page)
+	page_cache_release_nocheck(page)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr), 0);
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -554,6 +554,35 @@ struct page * find_get_page_nodiscard(st
 
 EXPORT_SYMBOL(find_get_page_nodiscard);
 
+struct page *find_lock_page_nodiscard(struct address_space *mapping,
+				      unsigned long offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+repeat:
+	page = radix_tree_lookup(&mapping->page_tree, offset);
+	if (page) {
+		page_cache_get(page);
+		if (TestSetPageLocked(page)) {
+			read_unlock_irq(&mapping->tree_lock);
+			__lock_page(page);
+			read_lock_irq(&mapping->tree_lock);
+
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != offset)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			}
+		}
+	}
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+EXPORT_SYMBOL(find_lock_page_nodiscard);
+
 #endif
 
 /*
@@ -1424,7 +1453,7 @@ int filemap_fault(struct vm_area_struct 
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
-	page = find_lock_page(mapping, vmf->pgoff);
+	page = find_lock_page_nodiscard(mapping, vmf->pgoff);
 	/*
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
@@ -1432,7 +1461,7 @@ retry_find:
 		if (!page) {
 			page_cache_sync_readahead(mapping, ra, file,
 							   vmf->pgoff, 1);
-			page = find_lock_page(mapping, vmf->pgoff);
+			page = find_lock_page_nodiscard(mapping, vmf->pgoff);
 			if (!page)
 				goto no_cached_page;
 		}
@@ -1471,7 +1500,7 @@ retry_find:
 				start = vmf->pgoff - ra_pages / 2;
 			do_page_cache_readahead(mapping, file, start, ra_pages);
 		}
-		page = find_lock_page(mapping, vmf->pgoff);
+		page = find_lock_page_nodiscard(mapping, vmf->pgoff);
 		if (!page)
 			goto no_cached_page;
 	}
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/page-states.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2131,7 +2131,7 @@ static int do_swap_page(struct mm_struct
 	 * Back out if somebody else already faulted in this pte.
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*page_table, orig_pte)))
+	if (unlikely(!pte_same(*page_table, orig_pte) || PageDiscarded(page)))
 		goto out_nomap;
 
 	if (unlikely(!PageUptodate(page))) {
@@ -2364,7 +2364,7 @@ retry:
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte))) {
+	if (likely(pte_same(*page_table, orig_pte) && !PageDiscarded(page))) {
 		flush_icache_page(vma, page);
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -593,7 +593,6 @@ void page_add_anon_rmap(struct page *pag
 		 */
 		mem_cgroup_uncharge_page(page);
 	}
-	page_make_volatile(page, 1);
 }
 
 /*
@@ -630,7 +629,6 @@ void page_add_file_rmap(struct page *pag
 		 * This takes care of balancing the reference counts
 		 */
 		mem_cgroup_uncharge_page(page);
-	page_make_volatile(page, 1);
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -795,7 +793,7 @@ static int try_to_unmap_one(struct page 
 	}
 
 	page_remove_rmap(page, vma);
-	page_cache_release(page);
+	page_cache_release_nocheck(page);
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -50,6 +50,7 @@
 #include <linux/migrate.h>
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
+#include <linux/page-states.h>
 
 #include <asm/uaccess.h>
 #include <asm/div64.h>
@@ -1290,6 +1291,12 @@ repeat:
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		swappage = lookup_swap_cache(swap);
+		if (swappage && unlikely(!page_make_stable(swappage))) {
+			shmem_swp_unmap(entry);
+			spin_unlock(&info->lock);
+			page_discard(swappage);
+			goto repeat;
+		}
 		if (!swappage) {
 			shmem_swp_unmap(entry);
 			/* here we actually do the io */
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -228,7 +228,7 @@ static inline void free_swap_cache(struc
 void free_page_and_swap_cache(struct page *page)
 {
 	free_swap_cache(page);
-	page_cache_release(page);
+	page_cache_release_nocheck(page);
 }
 
 /*
@@ -262,7 +262,7 @@ struct page * lookup_swap_cache(swp_entr
 {
 	struct page *page;
 
-	page = find_get_page(&swapper_space, entry.val);
+	page = find_get_page_nodiscard(&swapper_space, entry.val);
 
 	if (page)
 		INC_CACHE_INFO(find_success);

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
                   ` (4 preceding siblings ...)
  2008-03-12 13:21 ` [patch 5/6] Guest page hinting: minor fault optimization Martin Schwidefsky
@ 2008-03-12 13:21 ` Martin Schwidefsky
  2008-03-12 16:19   ` Jeremy Fitzhardinge
  2008-03-12 22:41 ` [patch 0/6] Guest page hinting version 6 Rusty Russell
  2008-03-13 16:57 ` Hugh Dickins
  7 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-s390, virtualization
  Cc: akpm, nickpiggin, hugh, zach, frankeh, Martin Schwidefsky

[-- Attachment #1: 006-hva-s390.diff --]
[-- Type: text/plain, Size: 21764 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj

s390 uses the milli-coded ESSA instruction to set the page state. The
page state is formed by four guest page states called block usage states
and three host page states called block content states.

The guest states are:
 - stable (S): there is essential content in the page
 - unused (U): there is no useful content and any access to the page will
   cause an addressing exception
 - volatile (V): there is useful content in the page. The host system is
   allowed to discard the content anytime, but has to deliver a discard
   fault with the absolute address of the page if the guest tries to
   access it.
 - potential volatile (P): the page has useful content. The host system
   is allowed to discard the content after it has checked the dirty bit
   of the page. It has to deliver a discard fault with the absolute
   address of the page if the guest tries to access it.

The host states are:
 - resident: the page is present in real memory.
 - preserved: the page is not present in real memory but the content is
   preserved elsewhere by the machine, e.g. on the paging device.
 - zero: the page is not present in real memory. The content of the page
   is logically-zero.

There are 12 combinations of guest and host state; currently only 8 are
valid page states:
 Sr: a stable, resident page.
 Sp: a stable, preserved page.
 Sz: a stable, logically zero page. A page filled with zeroes will be
     allocated on first access.
 Ur: an unused but resident page. The host could make it Uz anytime but
     it doesn't have to.
 Uz: an unused, logically zero page.
 Vr: a volatile, resident page. The guest can access it normally.
 Vz: a volatile, logically zero page. This is a discarded page. The host
     will deliver a discard fault for any access to the page.
 Pr: a potential volatile, resident page. The guest can access it normally.

The remaining 4 combinations can't occur:
 Up: an unused, preserved page. If the host tries to get rid of a Ur page
     it will remove it without writing the page content to disk and set
     the page to Uz.
 Vp: a volatile, preserved page. If the host picks a Vr page for eviction
     it will discard it and set the page state to Vz.
 Pp: a potential volatile, preserved page. There are two cases for page out:
     1) if the page is dirty then the host will preserve the page and set
     it to Sp or 2) if the page is clean then the host will discard it and
     set the page state to Vz.
 Pz: a potential volatile, logically zero page. The host system will always
     use Vz instead of Pz.
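
The table of valid and invalid combinations above can be written down
as a small lookup. The enum and function names are made up for this
sketch; the patch itself encodes the states through the ESSA
instruction.

```c
#include <stdbool.h>

/* Guest side: block usage states. */
enum usage_state { USAGE_S, USAGE_U, USAGE_V, USAGE_P, NR_USAGE };
/* Host side: block content states. */
enum content_state { CONTENT_R, CONTENT_P, CONTENT_Z, NR_CONTENT };

/* True for the 8 combinations listed as valid above. */
static bool state_valid(enum usage_state u, enum content_state c)
{
	static const bool valid[NR_USAGE][NR_CONTENT] = {
		/*             r      p      z    */
		[USAGE_S] = { true,  true,  true  },	/* Sr Sp Sz */
		[USAGE_U] = { true,  false, true  },	/* Ur    Uz */
		[USAGE_V] = { true,  false, true  },	/* Vr    Vz */
		[USAGE_P] = { true,  false, false },	/* Pr       */
	};

	return valid[u][c];
}
```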

The state transitions (a diagram would be nicer but that is too hard
to do in ascii art...):
{Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
     guest requests it with page_set_{unused,stable,volatile}.
{Uz,Sz,Vz}: a logically zero page will change its block usage state if the
     guest requests it with page_set_{unused,stable,volatile}. The
     guest can't create the Pz state, the state will be Vz instead.
Ur -> Uz: the host system can remove an unused, resident page from memory
Sz -> Sr: on first access a stable, logically zero page will become resident
Sr -> Sp: the host system can swap a stable page to disk
Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
Vr -> Vz: the host can discard a volatile page
Sp -> Uz: a page preserved by the host will be removed if the guest sets 
     the block usage state to unused.
Sp -> Vz: a page preserved by the host will be discarded if the guest sets
     the block usage state to volatile.
Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
     page is dirty while trying to discard the page. The page content is
     written to the paging device.
Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
     Vz state.
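
The host-initiated transitions from the list above (Ur -> Uz,
Sr -> Sp, Vr -> Vz, Pr -> Sp or Vz) can be sketched as one function.
This is a model of the description only, with hypothetical names; it
is not code from the host and says nothing about when the host
actually picks a page.

```c
#include <stdbool.h>

enum usage { USAGE_STABLE, USAGE_UNUSED, USAGE_VOLATILE, USAGE_POTENTIAL };
enum content { CONTENT_RESIDENT, CONTENT_PRESERVED, CONTENT_ZERO };

struct page_state {
	enum usage usage;
	enum content content;
};

/* Model of the host removing a page from real memory. */
static struct page_state host_evict(struct page_state s, bool dirty)
{
	if (s.content != CONTENT_RESIDENT)
		return s;			/* nothing to evict */

	switch (s.usage) {
	case USAGE_STABLE:			/* Sr -> Sp: write to paging device */
		s.content = CONTENT_PRESERVED;
		break;
	case USAGE_UNUSED:			/* Ur -> Uz: no write needed */
	case USAGE_VOLATILE:			/* Vr -> Vz: discard */
		s.content = CONTENT_ZERO;
		break;
	case USAGE_POTENTIAL:
		if (dirty) {			/* Pr -> Sp: content must survive */
			s.usage = USAGE_STABLE;
			s.content = CONTENT_PRESERVED;
		} else {			/* Pr -> Vz: Pz is never used */
			s.usage = USAGE_VOLATILE;
			s.content = CONTENT_ZERO;
		}
		break;
	}
	return s;
}
```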

There are some hazards the code has to deal with:
1) For potential volatile pages the transfer of the hardware dirty bit to
the software dirty bit needs to make sure that the page gets into the
stable state before the hardware dirty bit is cleared. Between the
page_test_dirty and the page_clear_dirty call a page_make_stable is
required.

2) Since the access of unused pages causes addressing exceptions we need
to take care with /dev/mem. The copy_{from,to}_user functions need to
be able to cope with addressing exceptions for the kernel address space.

3) The discard fault on a s390 machine delivers the absolute address of
the page that caused the fault instead of the virtual address. With the
virtual address we could have used the page table entry of the current
process to safely get a reference to the discarded page. We can get to
the struct page from the absolute page address but it is rather hard to
get to a proper page reference. The page that caused the fault could
already have been freed and reused for a different purpose. None of the
fields in the struct page would be reliable to use. The freeing of
discarded pages therefore has to be postponed until all pending discard
faults for this page have been dealt with. The discard fault handler
is called with interrupts disabled and tries to get a page reference
with get_page_unless_zero. A discarded page is only freed after all
cpus have been enabled for interrupts at least once since the detection
of the discarded page. This is done using the timer interrupts and the
cpu-idle notifier.
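
The delayed freeing described in 3) can be sketched as a small model.
This is not the code from the patch: page counts stand in for the
list_heads, a plain bitmask stands in for the cpumask_t, NR_CPUS is
fixed at 4, and the nohz_cpu_mask handling is left out. The names
model_free_discarded() and model_shrink() are made up for illustration.

```c
#include <assert.h>

#define NR_CPUS 4	/* assumption for this model only */

/* Hypothetical model state: counters instead of page lists. */
static unsigned int cpu_pending[NR_CPUS];	/* per-cpu discard lists */
static unsigned int gather, signoff;		/* the two global lists */
static unsigned int signoff_mask;		/* cpus that still have to sign off */
static const unsigned int online_mask = (1u << NR_CPUS) - 1;

/* Model of page_free_discarded(): queue the page instead of freeing it. */
static void model_free_discarded(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_pending[cpu]++;
}

/*
 * Model of page_shrink_discard_list() as run from the tick of @cpu.
 * Returns the number of pages that may really be freed now.
 */
static unsigned int model_shrink(int cpu)
{
	unsigned int freed = 0;

	assert(cpu >= 0 && cpu < NR_CPUS);
	gather += cpu_pending[cpu];		/* stage 1: per-cpu -> gather */
	cpu_pending[cpu] = 0;
	signoff_mask &= ~(1u << cpu);		/* this cpu signs off */
	if (signoff_mask == 0) {
		freed = signoff;		/* stage 3: all cpus signed off */
		signoff = gather;		/* stage 2: gather -> signoff */
		gather = 0;
		if (signoff)
			signoff_mask = online_mask & ~(1u << cpu);
		if (signoff && signoff_mask == 0) {
			freed += signoff;	/* no other cpu to wait for */
			signoff = 0;
		}
	}
	return freed;
}
```

In this model a page queued on cpu 0 is only reported as freeable once
cpus 1, 2 and 3 have each gone through model_shrink() as well, which
corresponds to every cpu having taken a timer tick or gone idle.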

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

---

 arch/s390/Kconfig              |    3 
 arch/s390/kernel/time.c        |   11 ++
 arch/s390/kernel/traps.c       |    4 
 arch/s390/lib/uaccess_mvcos.c  |   10 +
 arch/s390/lib/uaccess_std.c    |    7 -
 arch/s390/mm/fault.c           |  210 +++++++++++++++++++++++++++++++++++++++++
 include/asm-s390/page-states.h |  117 ++++++++++++++++++++++
 mm/rmap.c                      |    9 +
 8 files changed, 364 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -411,6 +411,9 @@ config CMM_IUCV
 	  Select this option to enable the special message interface to
 	  the cooperative memory management.
 
+config PAGE_STATES
+	bool "Enable support for guest page hinting."
+
 config VIRT_TIMER
 	bool "Virtual CPU timer support"
 	help
Index: linux-2.6/arch/s390/kernel/time.c
===================================================================
--- linux-2.6.orig/arch/s390/kernel/time.c
+++ linux-2.6/arch/s390/kernel/time.c
@@ -30,6 +30,7 @@
 #include <linux/timex.h>
 #include <linux/notifier.h>
 #include <linux/clocksource.h>
+#include <linux/page-states.h>
 
 #include <asm/uaccess.h>
 #include <asm/delay.h>
@@ -222,6 +223,9 @@ static int nohz_idle_notify(struct notif
 	switch (action) {
 	case S390_CPU_IDLE:
 		stop_hz_timer();
+#ifdef CONFIG_PAGE_STATES
+		page_shrink_discard_list();
+#endif
 		break;
 	case S390_CPU_NOT_IDLE:
 		start_hz_timer();
@@ -270,6 +274,9 @@ void init_cpu_timer(void)
 
 static void clock_comparator_interrupt(__u16 code)
 {
+#ifdef CONFIG_PAGE_STATES
+	page_shrink_discard_list();
+#endif
 	/* set clock comparator for next tick */
 	set_clock_comparator(S390_lowcore.jiffy_timer + CPU_DEVIATION);
 }
@@ -349,6 +356,10 @@ void __init time_init(void)
 #ifdef CONFIG_VIRT_TIMER
 	vtime_init();
 #endif
+
+#ifdef CONFIG_PAGE_STATES
+	page_discard_init();
+#endif
 }
 
 /*
Index: linux-2.6/arch/s390/kernel/traps.c
===================================================================
--- linux-2.6.orig/arch/s390/kernel/traps.c
+++ linux-2.6/arch/s390/kernel/traps.c
@@ -61,6 +61,7 @@ extern pgm_check_handler_t do_protection
 extern pgm_check_handler_t do_dat_exception;
 extern pgm_check_handler_t do_monitor_call;
 extern pgm_check_handler_t do_asce_exception;
+extern pgm_check_handler_t do_discard_fault;
 
 #define stack_pointer ({ void **sp; asm("la %0,0(15)" : "=&d" (sp)); sp; })
 
@@ -740,5 +741,8 @@ void __init trap_init(void)
         pgm_check_table[0x1C] = &space_switch_exception;
         pgm_check_table[0x1D] = &hfp_sqrt_exception;
 	pgm_check_table[0x40] = &do_monitor_call;
+#ifdef CONFIG_PAGE_STATES
+	pgm_check_table[0x1a] = &do_discard_fault;
+#endif
 	pfault_irq_init();
 }
Index: linux-2.6/arch/s390/lib/uaccess_mvcos.c
===================================================================
--- linux-2.6.orig/arch/s390/lib/uaccess_mvcos.c
+++ linux-2.6/arch/s390/lib/uaccess_mvcos.c
@@ -36,7 +36,7 @@ static size_t copy_from_user_mvcos(size_
 	tmp1 = -4096UL;
 	asm volatile(
 		"0: .insn ss,0xc80000000000,0(%0,%2),0(%1),0\n"
-		"   jz    7f\n"
+		"10:jz    7f\n"
 		"1:"ALR"  %0,%3\n"
 		"  "SLR"  %1,%3\n"
 		"  "SLR"  %2,%3\n"
@@ -47,7 +47,7 @@ static size_t copy_from_user_mvcos(size_
 		"  "CLR"  %0,%4\n"	/* copy crosses next page boundary? */
 		"   jnh   4f\n"
 		"3: .insn ss,0xc80000000000,0(%4,%2),0(%1),0\n"
-		"  "SLR"  %0,%4\n"
+		"11:"SLR"  %0,%4\n"
 		"  "ALR"  %2,%4\n"
 		"4:"LHI"  %4,-1\n"
 		"  "ALR"  %4,%0\n"	/* copy remaining size, subtract 1 */
@@ -62,6 +62,7 @@ static size_t copy_from_user_mvcos(size_
 		"7:"SLR"  %0,%0\n"
 		"8: \n"
 		EX_TABLE(0b,2b) EX_TABLE(3b,4b)
+		EX_TABLE(10b,8b) EX_TABLE(11b,8b)
 		: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
 		: "d" (reg0) : "cc", "memory");
 	return size;
@@ -82,7 +83,7 @@ static size_t copy_to_user_mvcos(size_t 
 	tmp1 = -4096UL;
 	asm volatile(
 		"0: .insn ss,0xc80000000000,0(%0,%1),0(%2),0\n"
-		"   jz    4f\n"
+		"6: jz    4f\n"
 		"1:"ALR"  %0,%3\n"
 		"  "SLR"  %1,%3\n"
 		"  "SLR"  %2,%3\n"
@@ -93,11 +94,12 @@ static size_t copy_to_user_mvcos(size_t 
 		"  "CLR"  %0,%4\n"	/* copy crosses next page boundary? */
 		"   jnh   5f\n"
 		"3: .insn ss,0xc80000000000,0(%4,%1),0(%2),0\n"
-		"  "SLR"  %0,%4\n"
+		"7:"SLR"  %0,%4\n"
 		"   j     5f\n"
 		"4:"SLR"  %0,%0\n"
 		"5: \n"
 		EX_TABLE(0b,2b) EX_TABLE(3b,5b)
+		EX_TABLE(6b,5b) EX_TABLE(7b,5b)
 		: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
 		: "d" (reg0) : "cc", "memory");
 	return size;
Index: linux-2.6/arch/s390/lib/uaccess_std.c
===================================================================
--- linux-2.6.orig/arch/s390/lib/uaccess_std.c
+++ linux-2.6/arch/s390/lib/uaccess_std.c
@@ -36,12 +36,12 @@ size_t copy_from_user_std(size_t size, c
 	tmp1 = -256UL;
 	asm volatile(
 		"0: mvcp  0(%0,%2),0(%1),%3\n"
-		"   jz    8f\n"
+		"10:jz    8f\n"
 		"1:"ALR"  %0,%3\n"
 		"   la    %1,256(%1)\n"
 		"   la    %2,256(%2)\n"
 		"2: mvcp  0(%0,%2),0(%1),%3\n"
-		"   jnz   1b\n"
+		"11:jnz   1b\n"
 		"   j     8f\n"
 		"3: la    %4,255(%1)\n"	/* %4 = ptr + 255 */
 		"  "LHI"  %3,-4096\n"
@@ -50,7 +50,7 @@ size_t copy_from_user_std(size_t size, c
 		"  "CLR"  %0,%4\n"	/* copy crosses next page boundary? */
 		"   jnh   5f\n"
 		"4: mvcp  0(%4,%2),0(%1),%3\n"
-		"  "SLR"  %0,%4\n"
+		"12:"SLR"  %0,%4\n"
 		"  "ALR"  %2,%4\n"
 		"5:"LHI"  %4,-1\n"
 		"  "ALR"  %4,%0\n"	/* copy remaining size, subtract 1 */
@@ -65,6 +65,7 @@ size_t copy_from_user_std(size_t size, c
 		"8:"SLR"  %0,%0\n"
 		"9: \n"
 		EX_TABLE(0b,3b) EX_TABLE(2b,3b) EX_TABLE(4b,5b)
+		EX_TABLE(10b,9b) EX_TABLE(11b,9b) EX_TABLE(12b,9b)
 		: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
 		: : "cc", "memory");
 	return size;
Index: linux-2.6/arch/s390/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/fault.c
+++ linux-2.6/arch/s390/mm/fault.c
@@ -19,6 +19,8 @@
 #include <linux/ptrace.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/kdebug.h>
 #include <linux/smp_lock.h>
@@ -28,11 +30,13 @@
 #include <linux/hardirq.h>
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
+#include <linux/page-states.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
 #include <asm/s390_ext.h>
 #include <asm/mmu_context.h>
+#include <asm/io.h>
 
 #ifndef CONFIG_64BIT
 #define __FAIL_ADDR_MASK 0x7ffff000
@@ -615,4 +619,210 @@ void __init pfault_irq_init(void)
 	unregister_early_external_interrupt(0x2603, pfault_interrupt,
 					    &ext_int_pfault);
 }
+
+#endif
+
+#ifdef CONFIG_PAGE_STATES
+
+int cmma_flag;
+
+static inline int machine_has_essa(void)
+{
+	register unsigned long tmp asm("0") = 0;
+	register int rc asm("1") = 0;
+	asm volatile(
+		"	.insn rrf,0xb9ab0000,%1,%1,0,0\n"
+		"0:	la	%0,1\n"
+		"1:\n"
+		EX_TABLE(0b,1b)
+		: "+&d" (rc), "+&d" (tmp));
+	return rc;
+}
+
+static int __init cmma(char *str)
+{
+	char *parm;
+
+	parm = strstrip(str);
+	if (strcmp(parm, "yes") == 0 || strcmp(parm, "on") == 0) {
+		cmma_flag = machine_has_essa();
+		return 1;
+	}
+	if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) {
+		cmma_flag = 0;
+		return 1;
+	}
+	return 0;
+}
+
+__setup("cmma=", cmma);
+
+static inline void fixup_user_copy(struct pt_regs *regs,
+				   unsigned long address, unsigned short rx)
+{
+	const struct exception_table_entry *fixup;
+	unsigned long kaddr;
+
+	kaddr = (regs->gprs[rx >> 12] + (rx & 0xfff)) & __FAIL_ADDR_MASK;
+	if (virt_to_phys((void *) kaddr) != address)
+		return;
+
+	fixup = search_exception_tables(regs->psw.addr & PSW_ADDR_INSN);
+	if (fixup)
+		regs->psw.addr = fixup->fixup | PSW_ADDR_AMODE;
+	else
+		die("discard fault", regs, SIGSEGV);
+}
+
+/*
+ * Discarded pages with a page_count() of zero are placed on
+ * the page_discarded_list until all cpus have been at
+ * least once in enabled code. That closes the race of page
+ * free vs. discard faults.
+ */
+void do_discard_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	unsigned long address;
+	struct page *page;
+
+	/*
+	 * get the real address that caused the block validity
+	 * exception.
+	 */
+	address = S390_lowcore.trans_exc_code & __FAIL_ADDR_MASK;
+	page = pfn_to_page(address >> PAGE_SHIFT);
+
+	/*
+	 * Check for the special case of a discard fault in
+	 * copy_{from,to}_user. User copy is done using one of
+	 * three special instructions: mvcp, mvcs or mvcos.
+	 */
+	if (!(regs->psw.mask & PSW_MASK_PSTATE)) {
+		switch (*(unsigned char *) regs->psw.addr) {
+		case 0xda:	/* mvcp */
+			fixup_user_copy(regs, address,
+					*(__u16 *)(regs->psw.addr + 2));
+			break;
+		case 0xdb:	/* mvcs */
+			fixup_user_copy(regs, address,
+					*(__u16 *)(regs->psw.addr + 4));
+			break;
+		case 0xc8:	/* mvcos */
+			if (regs->gprs[0] == 0x81)
+				fixup_user_copy(regs, address,
+						*(__u16*)(regs->psw.addr + 2));
+			else if (regs->gprs[0] == 0x810000)
+				fixup_user_copy(regs, address,
+						*(__u16*)(regs->psw.addr + 4));
+			break;
+		default:
+			break;
+		}
+	}
+
+	if (likely(get_page_unless_zero(page))) {
+		local_irq_enable();
+		page_discard(page);
+	}
+}
+
+static DEFINE_PER_CPU(struct list_head, page_discard_list);
+static struct list_head page_gather_list = LIST_HEAD_INIT(page_gather_list);
+static struct list_head page_signoff_list = LIST_HEAD_INIT(page_signoff_list);
+static cpumask_t page_signoff_cpumask = CPU_MASK_NONE;
+static DEFINE_SPINLOCK(page_discard_lock);
+
+/*
+ * page_free_discarded
+ *
+ * free_hot_cold_page calls this function if it is about to free a
+ * page that has PG_discarded set. Since there might be pending
+ * discard faults on other cpus on s390 we have to postpone the
+ * freeing of the page until each cpu has "signed-off" the page.
+ *
+ * returns 1 to stop free_hot_cold_page from freeing the page.
+ */
+int page_free_discarded(struct page *page)
+{
+	local_irq_disable();
+	list_add_tail(&page->lru, &__get_cpu_var(page_discard_list));
+	local_irq_enable();
+	return 1;
+}
+
+/*
+ * page_shrink_discard_list
+ *
+ * This function is called from the timer tick for an active cpu or
+ * from the idle notifier. It frees discarded pages in three stages.
+ * In the first stage it moves the pages from the per-cpu discard
+ * list to a global list. From the global list the pages are moved
+ * to the signoff list in a second step. The third step is to free
+ * the pages after all cpus have acknowledged the signoff. That
+ * prevents a page from being freed while a cpu still has a pending
+ * discard fault for the page.
+ */
+void page_shrink_discard_list(void)
+{
+	struct list_head *cpu_list = &__get_cpu_var(page_discard_list);
+	struct list_head free_list = LIST_HEAD_INIT(free_list);
+	struct page *page, *next;
+	int cpu = smp_processor_id();
+
+	if (list_empty(cpu_list) && !cpu_isset(cpu, page_signoff_cpumask))
+		return;
+	spin_lock(&page_discard_lock);
+	if (!list_empty(cpu_list))
+		list_splice_init(cpu_list, &page_gather_list);
+	cpu_clear(cpu, page_signoff_cpumask);
+	if (cpus_empty(page_signoff_cpumask)) {
+		list_splice_init(&page_signoff_list, &free_list);
+		list_splice_init(&page_gather_list, &page_signoff_list);
+		if (!list_empty(&page_signoff_list)) {
+			/* Take care of the nohz race.. */
+			page_signoff_cpumask = cpu_online_map;
+			smp_wmb();
+			cpus_andnot(page_signoff_cpumask,
+				    page_signoff_cpumask, nohz_cpu_mask);
+			cpu_clear(cpu, page_signoff_cpumask);
+			if (cpus_empty(page_signoff_cpumask))
+				list_splice_init(&page_signoff_list,
+						 &free_list);
+		}
+	}
+	spin_unlock(&page_discard_lock);
+	list_for_each_entry_safe(page, next, &free_list, lru) {
+		ClearPageDiscarded(page);
+		free_cold_page(page);
+	}
+}
+
+static int page_discard_cpu_notify(struct notifier_block *self,
+				   unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long) hcpu;
+
+	if (action == CPU_DEAD) {
+		local_irq_disable();
+		list_splice_init(&per_cpu(page_discard_list, cpu),
+				 &__get_cpu_var(page_discard_list));
+		local_irq_enable();
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block page_discard_cpu_notifier = {
+	.notifier_call = page_discard_cpu_notify,
+};
+
+void __init page_discard_init(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		INIT_LIST_HEAD(&per_cpu(page_discard_list, i));
+	if (register_cpu_notifier(&page_discard_cpu_notifier))
+		panic("Couldn't register page discard cpu notifier");
+}
+
 #endif
Index: linux-2.6/include/asm-s390/page-states.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-s390/page-states.h
@@ -0,0 +1,117 @@
+#ifndef _ASM_S390_PAGE_STATES_H
+#define _ASM_S390_PAGE_STATES_H
+
+#define ESSA_GET_STATE			0
+#define ESSA_SET_STABLE			1
+#define ESSA_SET_UNUSED			2
+#define ESSA_SET_VOLATILE		3
+#define ESSA_SET_PVOLATILE		4
+#define ESSA_SET_STABLE_MAKE_RESIDENT	5
+#define ESSA_SET_STABLE_IF_NOT_DISCARDED	6
+
+#define ESSA_USTATE_MASK		0x0c
+#define ESSA_USTATE_STABLE		0x00
+#define ESSA_USTATE_UNUSED		0x04
+#define ESSA_USTATE_PVOLATILE		0x08
+#define ESSA_USTATE_VOLATILE		0x0c
+
+#define ESSA_CSTATE_MASK		0x03
+#define ESSA_CSTATE_RESIDENT		0x00
+#define ESSA_CSTATE_PRESERVED		0x02
+#define ESSA_CSTATE_ZERO		0x03
+
+extern int cmma_flag;
+extern struct page *mem_map;
+
+/*
+ * ESSA <rc-reg>,<page-address-reg>,<command-immediate>
+ */
+#define page_essa(_page,_command) ({		       \
+	int _rc; \
+	asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" \
+		     : "=&d" (_rc) : "a" (((_page)-mem_map)<<PAGE_SHIFT), \
+		       "i" (_command)); \
+	_rc; \
+})
+
+static inline int page_host_discards(void)
+{
+	return cmma_flag;
+}
+
+static inline int page_discarded(struct page *page)
+{
+	int state;
+
+	if (!cmma_flag)
+		return 0;
+	state = page_essa(page, ESSA_GET_STATE);
+	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
+		(state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
+}
+
+static inline void page_set_unused(struct page *page, int order)
+{
+	int i;
+
+	if (!cmma_flag)
+		return;
+	for (i = 0; i < (1 << order); i++)
+		page_essa(page + i, ESSA_SET_UNUSED);
+}
+
+static inline void page_set_stable(struct page *page, int order)
+{
+	int i;
+
+	if (!cmma_flag)
+		return;
+	for (i = 0; i < (1 << order); i++)
+		page_essa(page + i, ESSA_SET_STABLE);
+}
+
+static inline void page_set_volatile(struct page *page, int writable)
+{
+	if (!cmma_flag)
+		return;
+	if (writable)
+		page_essa(page, ESSA_SET_PVOLATILE);
+	else
+		page_essa(page, ESSA_SET_VOLATILE);
+}
+
+static inline int page_set_stable_if_present(struct page *page)
+{
+	int rc;
+
+	if (!cmma_flag || PageReserved(page))
+		return 1;
+
+	rc = page_essa(page, ESSA_SET_STABLE_IF_NOT_DISCARDED);
+	return (rc & ESSA_USTATE_MASK) != ESSA_USTATE_VOLATILE ||
+		(rc & ESSA_CSTATE_MASK) != ESSA_CSTATE_ZERO;
+}
+
+/*
+ * Page locking is done with the architecture page bit PG_arch_1.
+ */
+static inline int page_test_set_state_change(struct page *page)
+{
+	return test_and_set_bit(PG_arch_1, &page->flags);
+}
+
+static inline void page_clear_state_change(struct page *page)
+{
+	clear_bit(PG_arch_1, &page->flags);
+}
+
+static inline int page_state_change(struct page *page)
+{
+	return test_bit(PG_arch_1, &page->flags);
+}
+
+int page_free_discarded(struct page *page);
+void page_shrink_discard_list(void);
+void page_discard_init(void);
+
+#endif /* _ASM_S390_PAGE_STATES_H */
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -686,6 +686,15 @@ void page_remove_rmap(struct page *page,
 		 * faster for those pages still in swapcache.
 		 */
 		if (page_test_dirty(page)) {
+			int stable = page_make_stable(page);
+			VM_BUG_ON(!stable);
+			/*
+			 * We decremented the mapcount so we now have an
+			 * extra reference for the page. That prevents
+			 * page_make_volatile from making the page
+			 * volatile again while the dirty bit is in
+			 * transit.
+			 */
 			page_clear_dirty(page);
 			set_page_dirty(page);
 		}

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 13:21 ` [patch 6/6] Guest page hinting: s390 support Martin Schwidefsky
@ 2008-03-12 16:19   ` Jeremy Fitzhardinge
  2008-03-12 16:28     ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-12 16:19 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin,
	frankeh, hugh

[-- Attachment #1: Type: text/plain, Size: 1474 bytes --]

Martin Schwidefsky wrote:
> The state transitions (a diagram would be nicer but that is too hard
> to do in ascii art...):
> {Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
>      guest requests it with page_set_{unused,stable,volatile}.
> {Uz,Sz,Vz}: a logically zero page will change its block usage state if the
>      guest requests it with page_set_{unused,stable,volatile}. The
>      guest can't create the Pz state, the state will be Vz instead.
> Ur -> Uz: the host system can remove an unused, resident page from memory
> Sz -> Sr: on first access a stable, logically zero page will become resident
> Sr -> Sp: the host system can swap a stable page to disk
> Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
> Vr -> Vz: the host can discard a volatile page
> Sp -> Uz: a page preserved by the host will be removed if the guest sets 
>      the block usage state to unused.
> Sp -> Vz: a page preserved by the host will be discarded if the guest sets
>      the block usage state to volatile.
> Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
>      page is dirty while trying to discard the page. The page content is
>      written to the paging device.
> Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
>      Vz state.

I created the attached .dot graph based purely on this description.  It 
looks reasonable, but I didn't see how a page enters a Pr state.

    J

[-- Attachment #2: gph.dot --]
[-- Type: text/plain, Size: 1039 bytes --]

digraph gph {
	Ur -> Sr [ label="page_set_stable" ];
	Ur -> Vr [ label="page_set_volatile" ];
	Ur -> Ur [ label="page_set_unused" ];

	Sr -> Sr [ label="page_set_stable" ];
	Sr -> Vr [ label="page_set_volatile" ];
	Sr -> Ur [ label="page_set_unused" ];

	Vr -> Sr [ label="page_set_stable" ];
	Vr -> Vr [ label="page_set_volatile" ];
	Vr -> Ur [ label="page_set_unused" ];

	Uz -> Sz [ label="page_set_stable" ];
	Uz -> Vz [ label="page_set_volatile" ];
	Uz -> Uz [ label="page_set_unused" ];

	Sz -> Sz [ label="page_set_stable" ];
	Sz -> Vz [ label="page_set_volatile" ];
	Sz -> Uz [ label="page_set_unused" ];

	Vz -> Sz [ label="page_set_stable" ];
	Vz -> Vz [ label="page_set_volatile" ];
	Vz -> Uz [ label="page_set_unused" ];

	Ur -> Uz [ label="host evict" ];

	Sz -> Sr [ label="guest write" ];
	Sr -> Sp [ label="host swap" ];
	Sp -> Sr [ label="guest access" ];

	Sp -> Uz [ label="guest discard" ];
	Sp -> Vz [ label="page_set_volatile" ];

	Pr -> Sp [ label="host discard dirty" ];
	Pr -> Vz [ label="host discard clean" ];
}

[-- Attachment #3: gph.pdf --]
[-- Type: application/pdf, Size: 14976 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 16:19   ` Jeremy Fitzhardinge
@ 2008-03-12 16:28     ` Martin Schwidefsky
  2008-03-12 16:44       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 16:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin,
	frankeh, hugh

On Wed, 2008-03-12 at 09:19 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > The state transitions (a diagram would be nicer but that is too hard
> > to do in ascii art...):
> > {Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
> >      guest requests it with page_set_{unused,stable,volatile}.
> > {Uz,Sz,Vz}: a logically zero page will change its block usage state if the
> >      guest requests it with page_set_{unused,stable,volatile}. The
> >      guest can't create the Pz state, the state will be Vz instead.
> > Ur -> Uz: the host system can remove an unused, resident page from memory
> > Sz -> Sr: on first access a stable, logically zero page will become resident
> > Sr -> Sp: the host system can swap a stable page to disk
> > Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
> > Vr -> Vz: the host can discard a volatile page
> > Sp -> Uz: a page preserved by the host will be removed if the guest sets 
> >      the block usage state to unused.
> > Sp -> Vz: a page preserved by the host will be discarded if the guest sets
> >      the block usage state to volatile.
> > Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
> >      page is dirty while trying to discard the page. The page content is
> >      written to the paging device.
> > Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
> >      Vz state.
> 
> I created the attached .dot graph based purely on this description.  It 
> looks reasonable, but I didn't see how a page enters a Pr state.

That is the first block of state transitions: {Ur,Sr,Vr,Pr}
You can go from any of the four states to any of the remaining three.
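
As a sketch of the guest-requested transitions for resident and
logically zero pages (the names below are hypothetical and not taken
from the patch; preserved pages are left out of this model):

```c
#include <assert.h>

/* Hypothetical model types; none of these names are in the patch. */
enum usage   { USAGE_UNUSED, USAGE_STABLE, USAGE_VOLATILE, USAGE_PVOLATILE };
enum content { CONTENT_RESIDENT, CONTENT_ZERO };

struct pstate {
	enum usage u;
	enum content c;
};

/*
 * Model of a guest request to change the block usage state.
 * A resident page can move between all four usage states.
 * A logically zero page cannot become Pz, it ends up as Vz.
 */
static void guest_set_usage(struct pstate *ps, enum usage want)
{
	assert(ps->c == CONTENT_RESIDENT || ps->c == CONTENT_ZERO);

	if (ps->c == CONTENT_ZERO && want == USAGE_PVOLATILE)
		want = USAGE_VOLATILE;	/* the guest can't create Pz */
	ps->u = want;			/* the content state is unchanged */
}
```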

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 16:28     ` Martin Schwidefsky
@ 2008-03-12 16:44       ` Jeremy Fitzhardinge
  2008-03-12 16:59         ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-12 16:44 UTC (permalink / raw)
  To: schwidefsky
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin,
	frankeh, hugh

[-- Attachment #1: Type: text/plain, Size: 924 bytes --]

Martin Schwidefsky wrote:
> That is the first block of state transitions: {Ur,Sr,Vr,Pr}
> You can go from any of the four states to any of the remaining three.
>   

You only mention page_set_{unused,stable,volatile}.  Is 
page_set_stable_if_present() the fourth?  And shouldn't that be 
"stable_if_clean":

     - potential volatile (P): the page has useful content. The host system
       is allowed to discard the content after it has checked the dirty bit
       of the page. It has to deliver a discard fault with the absolute
       address of the page if the guest tries to access it.
      

The use of "stable" in the function call and "volatile" in this 
description is a bit confusing.  My understanding is that a page in this 
state is either stable or volatile depending on whether it's dirty, which 
makes sense, but it would be good to consistently refer to it in the 
same way.

Updated .dot attached.

    J

[-- Attachment #2: gph.dot --]
[-- Type: text/plain, Size: 1229 bytes --]

digraph gph {
	Ur -> Sr [ label="set stable" ];
	Ur -> Vr [ label="set volatile" ];
	Ur -> Ur [ label="set unused" ];
	Ur -> Pr [ label="set stable_if_present" ];

	Sr -> Sr [ label="set stable" ];
	Sr -> Vr [ label="set volatile" ];
	Sr -> Ur [ label="set unused" ];
	Sr -> Pr [ label="set stable_if_present" ];

	Vr -> Sr [ label="set stable" ];
	Vr -> Vr [ label="set volatile" ];
	Vr -> Ur [ label="set unused" ];
	Vr -> Pr [ label="set stable_if_present" ];

	Pr -> Sr [ label="set stable" ];
	Pr -> Vr [ label="set volatile" ];
	Pr -> Ur [ label="set unused" ];
	Pr -> Pr [ label="set stable_if_present" ];

	Uz -> Sz [ label="set stable" ];
	Uz -> Vz [ label="set volatile" ];
	Uz -> Uz [ label="set unused" ];

	Sz -> Sz [ label="set stable" ];
	Sz -> Vz [ label="set volatile" ];
	Sz -> Uz [ label="set unused" ];

	Vz -> Sz [ label="set stable" ];
	Vz -> Vz [ label="set volatile" ];
	Vz -> Uz [ label="set unused" ];

	Ur -> Uz [ label="host evict" ];

	Sz -> Sr [ label="guest write" ];
	Sr -> Sp [ label="host swap" ];
	Sp -> Sr [ label="guest access" ];

	Sp -> Uz [ label="guest discard" ];
	Sp -> Vz [ label="set volatile" ];

	Pr -> Sp [ label="host discard dirty" ];
	Pr -> Vz [ label="host discard clean" ];
}

[-- Attachment #3: gph.pdf --]
[-- Type: application/pdf, Size: 16115 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 16:44       ` Jeremy Fitzhardinge
@ 2008-03-12 16:59         ` Martin Schwidefsky
  2008-03-12 17:48           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-12 16:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin,
	frankeh, hugh

On Wed, 2008-03-12 at 09:44 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > That is the first block of state transitions: {Ur,Sr,Vr,Pr}
> > You can go from any of the four states to any of the remaining three.
> >   
> 
> You only mention page_set_{unused,stable,volatile}.  Is 
> page_set_stable_if_present() the fourth?  And shouldn't that be 
> "stable_if_clean":

page_set_volatile has a "writable" argument. For writable==0 you get a
Vx page, for writable==1 you get a Px page.

With stable_if_clean you are referring to stable_if_present? If yes the
answer is that this operation is used to get a page from Vx/Px back to
Sx but only if the page has not been discarded. The operation will fail
if the page state is Vz/Pz. The dirty bit only matters for the host's
decision to discard the page, these are the state transitions from Vr/Pr
to Vz.
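
A small stand-alone sketch of the two points above. The constants are
copied from include/asm-s390/page-states.h in this patch; the helper
names volatile_command(), state_is_discarded() and
stable_if_present_ok() are made up for illustration, and no real ESSA
instruction is issued.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from include/asm-s390/page-states.h in this patch. */
#define ESSA_SET_VOLATILE	3
#define ESSA_SET_PVOLATILE	4
#define ESSA_USTATE_MASK	0x0c
#define ESSA_USTATE_VOLATILE	0x0c
#define ESSA_CSTATE_MASK	0x03
#define ESSA_CSTATE_ZERO	0x03

/* Hypothetical helper: the ESSA command page_set_volatile() selects. */
static int volatile_command(int writable)
{
	/* writable == 0 gives a Vx page, writable != 0 gives a Px page */
	return writable ? ESSA_SET_PVOLATILE : ESSA_SET_VOLATILE;
}

/* Hypothetical helper: true if a state code describes the Vz state. */
static bool state_is_discarded(int state)
{
	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
	       (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
}

/* Mirrors the return value check in page_set_stable_if_present(). */
static bool stable_if_present_ok(int rc)
{
	return !state_is_discarded(rc);
}

/* Usage example for the helpers above. */
void essa_model_selfcheck(void)
{
	assert(volatile_command(1) == ESSA_SET_PVOLATILE);
	assert(!stable_if_present_ok(ESSA_USTATE_VOLATILE | ESSA_CSTATE_ZERO));
}
```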

>      - potential volatile (P): the page has useful content. The host system
>        is allowed to discard the content after it has checked the dirty bit
>        of the page. It has to deliver a discard fault with the absolute
>        address of the page if the guest tries to access it.
>       
> 
> The use of "stable" in the function call and "volatile" in this 
> description is a bit confusing.  My understanding is that a page in this 
> state is either stable or volatile depending on whether it's dirty, which 
> makes sense, but it would be good to consistently refer to it in the 
> same way.

Your understanding is good, but how can I make this less confusing? A Px
page that is dirty may not be discarded which makes it basically stable.
The guest state still is potential volatile though as it does not have a
state of Sx.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 16:59         ` Martin Schwidefsky
@ 2008-03-12 17:48           ` Jeremy Fitzhardinge
  2008-03-12 20:04             ` Anthony Liguori
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-12 17:48 UTC (permalink / raw)
  To: schwidefsky
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin,
	frankeh, hugh

[-- Attachment #1: Type: text/plain, Size: 2397 bytes --]

Martin Schwidefsky wrote:
> On Wed, 2008-03-12 at 09:44 -0700, Jeremy Fitzhardinge wrote:
>   
>> Martin Schwidefsky wrote:
>>     
>>> That is the first block of state transitions: {Ur,Sr,Vr,Pr}
>>> You can go from any of the four states to any of the remaining three.
>>>   
>>>       
>> You only mention page_set_{unused,stable,volatile}.  Is 
>> page_set_stable_if_present() the fourth?  And shouldn't that be 
>> "stable_if_clean":
>>     
>
> page_set_volatile has a "writable" argument. For writable==0 you get a
> Vx page, for writable==1 you get a Px page.
>   

Hm.  But a Vx page is writable isn't it?  It's just that its contents 
can go away at any time.  Or does the kernel treat Vx pages as strictly 
RO cached copies of other things?

It also seems to me that given you are talking about "potentially volatile" 
as a distinct state, it would be best to have a distinct 
state-setting function associated with it, so there's a 1:1 
correspondence between the code and the description.


> With stable_if_clean you are referring to stable_if_present?

No.  I misunderstood and thought that stable_if_present sets the Px 
state.  I'd overlooked the writable flag on page_set_volatile().

>  If yes the
> answer is that this operation is used to get a page from Vx/Px back to
> Sx but only if the page has not been discarded.

So you mean it will change Vr/Pr to Sr but everything else will fail?  
Are there any other non-discarded states for Vx/Px?

>  The operation will fail
> if the page state is Vz/Pz.

Do you mean just Vz here?  You say that Pz is never used.

> Your understanding is good, but how can I make this less confusing? A Px
> page that is dirty may not be discarded which makes it basically stable.
> The guest state still is potential volatile though as it does not have a
> state of Sx.
>   

Mainly, use identical terminology in code and description so they can be 
easily compared.  I found the diagram was quite helpful in understanding 
what's going on; feel free to include it in your documentation.

Updated .dot attached; I've added the page_set_volatile writable 
argument and the stable_if_present transitions, commented it, and 
removed the self-edges which were cluttering things up.

Also, does a page go from Vz->Vr on guest memory write?  If so, does a 
clean page which goes from Pr->Vz->Vr lose its Px state in the process?

    J

[-- Attachment #2: gph.dot --]
[-- Type: text/plain, Size: 1274 bytes --]

digraph gph {
	/* Guest state changes on resident pages */
	Ur -> Sr [ label="set stable" ];
	Ur -> Vr [ label="set volatile\n(w=0)" ];
	Ur -> Pr [ label="set volatile\n(w=1)" ];

	Sr -> Ur [ label="set unused" ];
	Sr -> Vr [ label="set volatile\n(w=0)" ];
	Sr -> Pr [ label="set volatile\n(w=1)" ];

	Vr -> Sr [ label="set stable(_if_present)" ];
	Vr -> Ur [ label="set unused" ];
	Vr -> Pr [ label="set volatile\n(w=1)" ];

	Pr -> Sr [ label="set stable(_if_present)" ];
	Pr -> Vr [ label="set volatile\n(w=0)" ];
	Pr -> Ur [ label="set unused" ];

	/* Guest state changes on zero pages */
	Uz -> Sz [ label="set stable" ];
	Uz -> Vz [ label="set volatile" ];

	Sz -> Vz [ label="set volatile" ];
	Sz -> Uz [ label="set unused" ];

	Vz -> Sz [ label="set stable" ];
	Vz -> Uz [ label="set unused" ];

	/* Guest state changes on host-swapped pages */
	Sp -> Uz [ label="set unused" ];
	Sp -> Vz [ label="set volatile" ];

	/* Guest touches pages */
	Sz -> Sr [ label="guest write" ];
	Sp -> Sr [ label="guest access" ];
	Vz -> Vr [ label="guest write" ];

	/* Host actions */
	Sr -> Sp [ label="host swap" ];
	Ur -> Uz [ label="host discard" ];
	Vr -> Vz [ label="host discard" ];
	Pr -> Sp [ label="host discard\n(dirty)" ];
	Pr -> Vz [ label="host discard\n(clean)" ];
}

[-- Attachment #3: gph.pdf --]
[-- Type: application/pdf, Size: 16298 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 17:48           ` Jeremy Fitzhardinge
@ 2008-03-12 20:04             ` Anthony Liguori
  2008-03-12 20:45               ` Jeremy Fitzhardinge
  2008-03-13  9:32               ` Martin Schwidefsky
  0 siblings, 2 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-03-12 20:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: schwidefsky, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, frankeh, hugh

Jeremy Fitzhardinge wrote:
>> With stable_if_clean you are referring to stable_if_present?
> 
> No.  I misunderstood and thought that stable_if_present sets the Px 
> state.  I'd overlooked the writable flag on page_set_volatile().
> 
>>  If yes the
>> answer is that this operation is used to get a page from Vx/Px back to
>> Sx but only if the page has not been discarded.
> 
> So you mean it will change Vr/Pr to Sr but everything else will fail?  

Well presumably Vp/Pr => Sp?  Is it true that from the guest's 
perspective, all of the 'p' states are identical to the 'r' states?

Do the host states even really need visibility to the guest at all?  It 
may be useful for the guest to be able to distinguish between Ur and Uz 
but it doesn't seem necessary.

BTW Jeremy, the .dot was very useful!

Regards,

Anthony Liguori


* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:04             ` Anthony Liguori
@ 2008-03-12 20:45               ` Jeremy Fitzhardinge
  2008-03-12 20:56                 ` Anthony Liguori
  2008-03-13  9:36                 ` Martin Schwidefsky
  2008-03-13  9:32               ` Martin Schwidefsky
  1 sibling, 2 replies; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-12 20:45 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, schwidefsky, hugh

[-- Attachment #1: Type: text/plain, Size: 1680 bytes --]

Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>   
>>> With stable_if_clean you are referring to stable_if_present?
>>>       
>> No.  I misunderstood and thought that stable_if_present sets the Px 
>> state.  I'd overlooked the writable flag on page_set_volatile().
>>
>>     
>>>  If yes the
>>> answer is that this operation is used to get a page from Vx/Px back to
>>> Sx but only if the page has not been discarded.
>>>       
>> So you mean it will change Vr/Pr to Sr but everything else will fail?  
>>     
>
> Well presumably Vp/Pr => Sp?  Is it true that from the guest's 
> perspective, all of the 'p' states are identical to the 'r' states?
>   

Vp should never happen, since you'd never preserve a V page.  And surely 
it would be Pr -> Sr, since the hypervisor wouldn't push the page to 
backing store when you change the client state.

> Do the host states even really need visibility to the guest at all?  It 
> may be useful for the guest to be able to distinguish between Ur and Uz 
> but it doesn't seem necessary.

Well, you implicitly see the hypervisor state.  If you touch a [UV]z 
page then you get a fault telling you that the page has been taken away 
from you (I think).  And it would definitely help with debugging (seems 
likely there's lots of scope for race conditions if you prematurely tell 
the hypervisor you don't need the page any more...).

> BTW Jeremy, the .dot was very useful!
Yes, there's no way I'd be able to get my head around this otherwise.  
BTW, here's an updated one with the host-driven events as dashed lines, 
and a couple of extra transitions I think should be in there (but 
waiting for Martin's confirmation).

    J

[-- Attachment #2: gph.dot --]
[-- Type: text/plain, Size: 1344 bytes --]

digraph gph {
	/* Guest state changes on resident pages */
	Ur -> Sr [ label="set stable" ];
	Ur -> Vr [ label="set volatile\n(w=0)" ];
	Ur -> Pr [ label="set volatile\n(w=1)" ];

	Sr -> Ur [ label="set unused" ];
	Sr -> Vr [ label="set volatile\n(w=0)" ];
	Sr -> Pr [ label="set volatile\n(w=1)" ];

	Vr -> Sr [ label="set stable(_if_present)" ];
	Vr -> Ur [ label="set unused" ];
	Vr -> Pr [ label="set volatile\n(w=1)" ];

	Pr -> Sr [ label="set stable(_if_present)" ];
	Pr -> Vr [ label="set volatile\n(w=0)" ];
	Pr -> Ur [ label="set unused" ];

	/* Guest state changes on zero pages */
	Uz -> Sz [ label="set stable" ];
	Uz -> Vz [ label="set volatile" ];

	Sz -> Vz [ label="set volatile" ];
	Sz -> Uz [ label="set unused" ];

	Vz -> Sz [ label="set stable" ];
	Vz -> Uz [ label="set unused" ];

	/* Guest state changes on host-swapped pages */
	Sp -> Uz [ label="set unused" ];
	Sp -> Vz [ label="set volatile" ];

	/* Guest touches pages */
	Sz -> Sr [ label="guest write" ];
	Sp -> Sr [ label="guest access" ];
	Vz -> Vr [ label="guest write" ];

	/* Host actions */
	Sr -> Sp [ label="host swap", style=dashed ];
	Ur -> Uz [ label="host discard", style=dashed ];
	Vr -> Vz [ label="host discard", style=dashed ];
	Pr -> Sp [ label="host discard\n(dirty)", style=dashed ];
	Pr -> Vz [ label="host discard\n(clean)", style=dashed ];
}

[-- Attachment #3: gph.pdf --]
[-- Type: application/pdf, Size: 16441 bytes --]


* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:45               ` Jeremy Fitzhardinge
@ 2008-03-12 20:56                 ` Anthony Liguori
  2008-03-12 21:36                   ` Jeremy Fitzhardinge
  2008-03-13  9:42                   ` Martin Schwidefsky
  2008-03-13  9:36                 ` Martin Schwidefsky
  1 sibling, 2 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-03-12 20:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: akpm, linux-s390, frankeh, nickpiggin, linux-kernel,
	virtualization, schwidefsky, hugh

Jeremy Fitzhardinge wrote:
>>
>> Well presumably Vp/Pr => Sp?  Is it true that from the guest's 
>> perspective, all of the 'p' states are identical to the 'r' states?
>>   
>
> Vp should never happen, since you'd never preserve a V page.  And 
> surely it would be Pr -> Sr, since the hypervisor wouldn't push the 
> page to backing store when you change the client state.

You're right, I meant Vp/Pp but they are invalid states.  I think one of 
the things that keeps tripping me up is that the host can change both 
the host and guest page states.  My initial impression was that the host 
handled the host state and the guest handled the guest state.

>> Do the host states even really need visibility to the guest at all?  
>> It may be useful for the guest to be able to distinguish between Ur 
>> and Uz but it doesn't seem necessary.
>
> Well, you implicitly see the hypervisor state.  If you touch a [UV]z 
> page then you get a fault telling you that the page has been taken 
> away from you (I think).  And it would definitely help with debugging 
> (seems likely there's lots of scope for race conditions if you 
> prematurely tell the hypervisor you don't need the page any more...).

I was thinking that it may be useful to know a Ur versus a Uz when 
allocating memory.  In this case, you'd rather allocate Ur pages versus 
Uz to avoid the fault.  I don't read s390 arch code well, is the host 
state explicit to the guest?

>> BTW Jeremy, the .dot was very useful!
> Yes, there's no way I'd be able to get my head around this otherwise.  
> BTW, here's an updated one with the host-driven events as dashed 
> lines, and a couple of extra transitions I think should be in there 
> (but waiting for Martin's confirmation).

Excellent!

Regards,

Anthony Liguori

>    J



* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:56                 ` Anthony Liguori
@ 2008-03-12 21:36                   ` Jeremy Fitzhardinge
  2008-03-13  9:45                     ` Martin Schwidefsky
  2008-03-13  9:42                   ` Martin Schwidefsky
  1 sibling, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-12 21:36 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, schwidefsky, hugh

Anthony Liguori wrote:
>> Vp should never happen, since you'd never preserve a V page.  And 
>> surely it would be Pr -> Sr, since the hypervisor wouldn't push the 
>> page to backing store when you change the client state.
>>     
>
> You're right, I meant Vp/Pp but they are invalid states.  I think one of 
> the things that keeps tripping me up is that the host can change both 
> the host and guest page states.  My initial impression was that the host 
> handled the host state and the guest handled the guest state.
>   

Yes.  And it seems to me that you get unfortunate outcomes if you have a 
Pr->Vz->Vr transition.

>>> Do the host states even really need visibility to the guest at all?  
>>> It may be useful for the guest to be able to distinguish between Ur 
>>> and Uz but it doesn't seem necessary.
>>>       
>> Well, you implicitly see the hypervisor state.  If you touch a [UV]z 
>> page then you get a fault telling you that the page has been taken 
>> away from you (I think).  And it would definitely help with debugging 
>> (seems likely there's lots of scope for race conditions if you 
>> prematurely tell the hypervisor you don't need the page any more...).
>>     
>
> I was thinking that it may be useful to know a Ur versus a Uz when 
> allocating memory.  In this case, you'd rather allocate Ur pages versus 
> Uz to avoid the fault.  I don't read s390 arch code well, is the host 
> state explicit to the guest?
>   

Yes, reusing Ur pages might well be better, but who knows - they've 
probably got an instruction which makes Uz cheap...

Stuff like this suggests that both parts of the state are packed 
together, and are guest-visible:

+	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
+		(state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;


      J


* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
                   ` (5 preceding siblings ...)
  2008-03-12 13:21 ` [patch 6/6] Guest page hinting: s390 support Martin Schwidefsky
@ 2008-03-12 22:41 ` Rusty Russell
  2008-03-13  9:47   ` Martin Schwidefsky
  2008-03-13 16:57 ` Hugh Dickins
  7 siblings, 1 reply; 49+ messages in thread
From: Rusty Russell @ 2008-03-12 22:41 UTC (permalink / raw)
  To: virtualization
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, hugh

On Thursday 13 March 2008 00:21:32 Martin Schwidefsky wrote:
> My question now is how to proceed with the code. I sure
> would love to see the code going upstream some day but that depends on
> the mm developers as the code adds complexity that needs to be supported.

Well, I want this feature, but I agree about complexity.

AFAICT, the trivial subset of this is the hinting of Unused pages.  It seems 
that would buy us something, and perhaps be a stepping stone to full page 
hinting?

Cheers,
Rusty.


* Re: [patch 1/6] Guest page hinting: core + volatile page cache.
  2008-03-12 13:21 ` [patch 1/6] Guest page hinting: core + volatile page cache Martin Schwidefsky
@ 2008-03-12 23:12   ` Rusty Russell
  2008-03-13  9:24     ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Rusty Russell @ 2008-03-12 23:12 UTC (permalink / raw)
  To: virtualization
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, hugh

On Thursday 13 March 2008 00:21:33 Martin Schwidefsky wrote:
> @@ -957,6 +975,19 @@ struct page *follow_page(struct vm_area_
>
>  	if (flags & FOLL_GET)
>  		get_page(page);
> +
> +	if (flags & FOLL_GET) {
> +		/*
> +		 * The page is made stable if a reference is acquired.
> +		 * If the caller does not get a reference it implies that
> +		 * the caller can deal with page faults in case the page
> +		 * is swapped out. In this case the caller can deal with
> +		 * discard faults as well.
> +		 */
> +		if (unlikely(!page_make_stable(page)))
> +			goto out_discard;
> +	}

Dumb comment: seems like this if could be folded into the one above.

> + * Attempts to change the state of a page to volatile.
> + * If there is something preventing the state change the page stays
> + * int its current state.

Typo "int its current state".


>  		return NULL;
>
>  	pte = pte_offset_map(pmd, address);
> +	ptl = pte_lockptr(mm, pmd);
>  	/* Make a quick check before getting the lock */
> +#ifndef CONFIG_PAGE_STATES
> +	/*
> +	 * If the page table lock for this pte is taken we have to
> +	 * assume that someone might be mapping the page. To solve
> +	 * the race of a page discard vs. mapping the page we have
> +	 * to serialize the two operations by taking the lock,
> +	 * otherwise we end up with a pte for a page that has been
> +	 * removed from page cache by the discard fault handler.
> +	 */
> +	if (!spin_is_locked(ptl))
> +#endif
>  	if (!pte_present(*pte)) {
>  		pte_unmap(pte);
>  		return NULL;
>  	}
>
> -	ptl = pte_lockptr(mm, pmd);
>  	spin_lock(ptl);
>  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
>  		*ptlp = ptl;

Did you really mean ifndef here?

(BTW: I'm just reading through the code, not really understanding it, so
this is not a real review).

Cheers,
Rusty.


* Re: [patch 3/6] Guest page hinting: mlocked pages.
  2008-03-12 13:21 ` [patch 3/6] Guest page hinting: mlocked pages Martin Schwidefsky
@ 2008-03-12 23:27   ` Rusty Russell
  2008-03-13  9:13     ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Rusty Russell @ 2008-03-12 23:27 UTC (permalink / raw)
  To: virtualization
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, hugh

On Thursday 13 March 2008 00:21:35 Martin Schwidefsky wrote:
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -513,6 +513,9 @@ struct address_space {
>  	spinlock_t		private_lock;	/* for use by the address_space */
>  	struct list_head	private_list;	/* ditto */
>  	struct address_space	*assoc_mapping;	/* ditto */
> +#ifdef CONFIG_PAGE_STATES
> +	unsigned int		mlocked;	/* set if VM_LOCKED vmas present */
> +#endif
>  } __attribute__((aligned(sizeof(long))));

Minor nit: I think this would be better under private_lock where it wouldn't 
consume any extra space on 64-bit.

Cheers,
Rusty.


* Re: [patch 4/6] Guest page hinting: writable page table entries.
  2008-03-12 13:21 ` [patch 4/6] Guest page hinting: writable page table entries Martin Schwidefsky
@ 2008-03-12 23:35   ` Rusty Russell
  2008-03-13  9:11     ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Rusty Russell @ 2008-03-12 23:35 UTC (permalink / raw)
  To: virtualization
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, hugh

On Thursday 13 March 2008 00:21:36 Martin Schwidefsky wrote:
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -51,6 +51,7 @@
>  #include <linux/tsacct_kern.h>
>  #include <linux/cn_proc.h>
>  #include <linux/audit.h>
> +#include <linux/page-states.h>
>
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>

I haven't compile-tested, but this seems unnecessary; it's the only change to 
this file.

> +/**
> + * __page_reset_writable() - clear the PageWritable bit
> + *
> + * @page: the page
> + */
> +void __page_reset_writable(struct page *page)
> +{
> +	preempt_disable();
> +	if (!page_test_set_state_change(page)) {
> +		ClearPageWritable(page);
> +		page_clear_state_change(page);
> +	}
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(__page_reset_writable);

If I understand correctly, you don't bother resetting the writable bit if you 
don't get the state_change lock.  Is this best effort, or is there some 
correctness issue here?

Cheers,
Rusty.


* Re: [patch 4/6] Guest page hinting: writable page table entries.
  2008-03-12 23:35   ` Rusty Russell
@ 2008-03-13  9:11     ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, frankeh, hugh

On Thu, 2008-03-13 at 10:35 +1100, Rusty Russell wrote:
> On Thursday 13 March 2008 00:21:36 Martin Schwidefsky wrote:
> > Index: linux-2.6/fs/exec.c
> > ===================================================================
> > --- linux-2.6.orig/fs/exec.c
> > +++ linux-2.6/fs/exec.c
> > @@ -51,6 +51,7 @@
> >  #include <linux/tsacct_kern.h>
> >  #include <linux/cn_proc.h>
> >  #include <linux/audit.h>
> > +#include <linux/page-states.h>
> >
> >  #include <asm/uaccess.h>
> >  #include <asm/mmu_context.h>
> 
> I haven't compile-tested, but this seems unnecessary; it's the only change to 
> this file.

True. I removed the include.

> > +/**
> > + * __page_reset_writable() - clear the PageWritable bit
> > + *
> > + * @page: the page
> > + */
> > +void __page_reset_writable(struct page *page)
> > +{
> > +	preempt_disable();
> > +	if (!page_test_set_state_change(page)) {
> > +		ClearPageWritable(page);
> > +		page_clear_state_change(page);
> > +	}
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL(__page_reset_writable);
> 
> If I understand correctly, you don't bother resetting the writable bit if you 
> don't get the state_change lock.  Is this best effort, or is there some 
> correctness issue here?

It errs on the safe side. If the page writable bit is set then
the page state has to indicate to the host that the page dirty bit needs
to be checked. 
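
The best-effort pattern can be sketched in user space with C11 atomics.
This is only an illustration under stated assumptions: struct ex_page,
the EX_PG_* bits and the ex_* helpers are hypothetical stand-ins for the
page flags and for page_test_set_state_change() /
page_clear_state_change() quoted above, and preemption control is left
out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define EX_PG_STATE_CHANGE (1u << 0)	/* hypothetical bit lock */
#define EX_PG_WRITABLE     (1u << 1)	/* hypothetical "writable" flag */

struct ex_page {
	atomic_uint flags;
};

/* Returns true if the bit lock was already held by someone else. */
bool ex_test_set_state_change(struct ex_page *p)
{
	return atomic_fetch_or(&p->flags, EX_PG_STATE_CHANGE) &
	       EX_PG_STATE_CHANGE;
}

void ex_clear_state_change(struct ex_page *p)
{
	atomic_fetch_and(&p->flags, ~EX_PG_STATE_CHANGE);
}

/*
 * Best effort: if the bit lock is busy, leave the writable flag set.
 * A stale "writable" only makes the host check the dirty bit once
 * more than strictly needed, so skipping the update is conservative.
 * Returns true if the flag was actually cleared.
 */
bool ex_reset_writable(struct ex_page *p)
{
	assert(p);
	if (ex_test_set_state_change(p))
		return false;
	atomic_fetch_and(&p->flags, ~EX_PG_WRITABLE);
	ex_clear_state_change(p);
	return true;
}
```

A caller that must know whether the flag was cleared can look at the
return value; the function in the patch returns void and simply gives up.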

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 3/6] Guest page hinting: mlocked pages.
  2008-03-12 23:27   ` Rusty Russell
@ 2008-03-13  9:13     ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:13 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, frankeh, hugh

On Thu, 2008-03-13 at 10:27 +1100, Rusty Russell wrote:
> On Thursday 13 March 2008 00:21:35 Martin Schwidefsky wrote:
> > --- linux-2.6.orig/include/linux/fs.h
> > +++ linux-2.6/include/linux/fs.h
> > @@ -513,6 +513,9 @@ struct address_space {
> >  	spinlock_t		private_lock;	/* for use by the address_space */
> >  	struct list_head	private_list;	/* ditto */
> >  	struct address_space	*assoc_mapping;	/* ditto */
> > +#ifdef CONFIG_PAGE_STATES
> > +	unsigned int		mlocked;	/* set if VM_LOCKED vmas present */
> > +#endif
> >  } __attribute__((aligned(sizeof(long))));
> 
> Minor nit: I think this would be better under private_lock where it wouldn't 
> consume any extra space on 64-bit.

Yes, makes sense. I moved the #ifdef.
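
The padding effect behind this can be shown with a reduced layout. The
structs below are hypothetical and only mirror the tail of struct
address_space; they assume a 4-byte lock (spinlock_t without lock
debugging), which is modelled here as a plain unsigned int.

```c
#include <stddef.h>

struct ex_list_head {
	struct ex_list_head *next, *prev;
};

/* New field appended at the end, as in the posted hunk. */
struct ex_tail {
	unsigned int        private_lock;	/* stand-in for spinlock_t */
	struct ex_list_head private_list;
	void               *assoc_mapping;
	unsigned int        mlocked;
};

/* New field placed directly after the lock. */
struct ex_packed {
	unsigned int        private_lock;	/* stand-in for spinlock_t */
	unsigned int        mlocked;		/* fills the alignment hole */
	struct ex_list_head private_list;
	void               *assoc_mapping;
};

/* Bytes saved by the second layout on the current ABI. */
size_t ex_bytes_saved(void)
{
	return sizeof(struct ex_tail) - sizeof(struct ex_packed);
}
```

With 8-byte pointers the hole after the 4-byte lock absorbs the new
field, so the second layout is smaller; with 4-byte pointers both
layouts have the same size.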

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 1/6] Guest page hinting: core + volatile page cache.
  2008-03-12 23:12   ` Rusty Russell
@ 2008-03-13  9:24     ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:24 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, frankeh, hugh

On Thu, 2008-03-13 at 10:12 +1100, Rusty Russell wrote:
> On Thursday 13 March 2008 00:21:33 Martin Schwidefsky wrote:
> > @@ -957,6 +975,19 @@ struct page *follow_page(struct vm_area_
> >
> >  	if (flags & FOLL_GET)
> >  		get_page(page);
> > +
> > +	if (flags & FOLL_GET) {
> > +		/*
> > +		 * The page is made stable if a reference is acquired.
> > +		 * If the caller does not get a reference it implies that
> > +		 * the caller can deal with page faults in case the page
> > +		 * is swapped out. In this case the caller can deal with
> > +		 * discard faults as well.
> > +		 */
> > +		if (unlikely(!page_make_stable(page)))
> > +			goto out_discard;
> > +	}
> 
> Dumb comment: seems like this if could be folded into the one above.

In this patch yes, but a later patch adds a condition:

-       if (flags & FOLL_GET) {
+       if ((flags & FOLL_GET) || (vma->vm_flags & VM_LOCKED)) {


> > + * Attempts to change the state of a page to volatile.
> > + * If there is something preventing the state change the page stays
> > + * int its current state.
> 
> Typo "int its current state".

Fixed.

> >  		return NULL;
> >
> >  	pte = pte_offset_map(pmd, address);
> > +	ptl = pte_lockptr(mm, pmd);
> >  	/* Make a quick check before getting the lock */
> > +#ifndef CONFIG_PAGE_STATES
> > +	/*
> > +	 * If the page table lock for this pte is taken we have to
> > +	 * assume that someone might be mapping the page. To solve
> > +	 * the race of a page discard vs. mapping the page we have
> > +	 * to serialize the two operations by taking the lock,
> > +	 * otherwise we end up with a pte for a page that has been
> > +	 * removed from page cache by the discard fault handler.
> > +	 */
> > +	if (!spin_is_locked(ptl))
> > +#endif
> >  	if (!pte_present(*pte)) {
> >  		pte_unmap(pte);
> >  		return NULL;
> >  	}
> >
> > -	ptl = pte_lockptr(mm, pmd);
> >  	spin_lock(ptl);
> >  	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
> >  		*ptlp = ptl;
> 
> Did you really mean ifndef here?

That is a major nit. This should be an #ifdef. In previous versions the
complete "if (!pte_present(*pte)) { }" was ifdefed; the later versions
use the !spin_is_locked condition, only I forgot to invert the #ifndef.
Fixed.
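
The intended logic of the corrected quick check can be sketched as a
small predicate. This is an illustration only: ex_quick_check_bails_out()
is a hypothetical helper, its two bool arguments stand in for
spin_is_locked(ptl) and pte_present(*pte), and CONFIG_PAGE_STATES is
defined by hand for the sketch.

```c
#include <stdbool.h>

#define CONFIG_PAGE_STATES 1	/* assumption: feature enabled */

/*
 * Returns true if the lockless quick check may return early
 * without taking the page table lock.
 */
bool ex_quick_check_bails_out(bool ptl_locked, bool pte_is_present)
{
#ifdef CONFIG_PAGE_STATES
	/*
	 * Someone holding the lock may be mapping the page right now,
	 * so the unlocked pte value cannot be trusted: take the lock.
	 */
	if (ptl_locked)
		return false;
#endif
	return !pte_is_present;
}
```

With the original #ifndef the extra test would have been compiled in
only when page states are disabled, which is exactly the configuration
that does not need it.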

> (BTW: I'm just reading through the code, not really understanding it, so
> this is not a real review).

I'll take a small review anytime. It already found one major nit.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:04             ` Anthony Liguori
  2008-03-12 20:45               ` Jeremy Fitzhardinge
@ 2008-03-13  9:32               ` Martin Schwidefsky
  1 sibling, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:32 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, hugh

[-- Attachment #1: Type: text/plain, Size: 1573 bytes --]

On Wed, 2008-03-12 at 15:04 -0500, Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
> >> With stable_if_clean you are referring to stable_if_present?
> > 
> > No.  I misunderstood and thought that stable_if_present sets the Px 
> > state.  I'd overlooked the writable flag on page_set_volatile().
> > 
> >>  If yes the
> >> answer is that this operation is used to get a page from Vx/Px back to
> >> Sx but only if the page has not been discarded.
> > 
> > So you mean it will change Vr/Pr to Sr but everything else will fail?  

In the extended version Vp/Pp go to Sr as well, but the current z/VM code
simply discards a Vr/Pr page when the host picks it for swapping.
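
As a minimal sketch of the stable_if_present semantics, assuming a
simple two-field model of the page state: the enums, struct page_state
and make_stable_if_present() are hypothetical names for illustration
and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: guest (usage) state and host (content) state. */
enum usage   { USAGE_UNUSED, USAGE_STABLE, USAGE_VOLATILE, USAGE_POT_VOLATILE };
enum content { CONTENT_RESIDENT, CONTENT_PRESERVED, CONTENT_ZERO };

struct page_state {
	enum usage   usage;
	enum content content;
};

/*
 * Try to move a Vx/Px page back to Sx. The change succeeds only if
 * the host has not discarded the page. On failure the state is left
 * untouched and the caller has to recreate the page.
 */
bool make_stable_if_present(struct page_state *ps)
{
	assert(ps);
	if (ps->usage != USAGE_VOLATILE && ps->usage != USAGE_POT_VOLATILE)
		return false;	/* assumption: only Vx/Px are handled here */
	if (ps->content == CONTENT_ZERO)
		return false;	/* discarded by the host */
	ps->usage = USAGE_STABLE;
	return true;
}
```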

> Well presumably Vp/Pr => Sp?  Is it true that from the guest's 
> perspective, all of the 'p' states are identical to the 'r' states?

Basically yes. The guest doesn't care about the host state.

> Do the host states even really need visibility to the guest at all?  It 
> may be useful for the guest to be able to distinguish between Ur and Uz 
> but it doesn't seem necessary.

It is very useful for debugging to have the host state in the guest as
well. There is one possible optimization: if the guest finds a Uz page
in the free list, it can make it Sz and doesn't have to clear it because
the host will provide an already empty page (not yet implemented
though).
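
A rough sketch of that optimization, which is not part of the posted
patches: the state model and ex_alloc_needs_clear() are hypothetical,
and the function only decides whether the guest still has to clear the
page after taking it off the free list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: guest (usage) state and host (content) state. */
enum usage   { USAGE_UNUSED, USAGE_STABLE, USAGE_VOLATILE, USAGE_POT_VOLATILE };
enum content { CONTENT_RESIDENT, CONTENT_PRESERVED, CONTENT_ZERO };

struct page_state {
	enum usage   usage;
	enum content content;
};

/*
 * Make a free page stable (Uz -> Sz or Ur -> Sr).
 * Returns true if the guest has to clear the page itself, false if
 * the host will back it with an already empty page on first use.
 */
bool ex_alloc_needs_clear(struct page_state *ps)
{
	bool host_provides_empty_page;

	assert(ps && ps->usage == USAGE_UNUSED);
	host_provides_empty_page = (ps->content == CONTENT_ZERO);
	ps->usage = USAGE_STABLE;
	return !host_provides_empty_page;
}
```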

> BTW Jeremy, the .dot was very useful!

I've searched my disk and found the state diagrams we've used for the
OLS paper. You may find these useful as well.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


[-- Attachment #2: states.dia --]
[-- Type: application/x-dia-diagram, Size: 5973 bytes --]

[-- Attachment #3: states-simple.dia --]
[-- Type: application/x-dia-diagram, Size: 3966 bytes --]


* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:45               ` Jeremy Fitzhardinge
  2008-03-12 20:56                 ` Anthony Liguori
@ 2008-03-13  9:36                 ` Martin Schwidefsky
  1 sibling, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Anthony Liguori, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, hugh

On Wed, 2008-03-12 at 13:45 -0700, Jeremy Fitzhardinge wrote:
> Vp should never happen, since you'd never preserve a V page.  And surely 
> it would be Pr -> Sr, since the hypervisor wouldn't push the page to 
> backing store when you change the client state.

Vp does not happen in the current implementation. But it actually may be
useful. z/VM has multiple layers of paging, the first goes to expanded
storage which is very fast. If you make the page Vz and the guest needs
it you have to do a standard Linux I/O to retrieve the page. This
can be slower than a read and a write to expanded storage.

> > Do the host states even really need visibility to the guest at all?  It 
> > may be useful for the guest to be able to distinguish between Ur and Uz 
> > but it doesn't seem necessary.
> 
> Well, you implicitly see the hypervisor state.  If you touch a [UV]z 
> page then you get a fault telling you that the page has been taken away 
> from you (I think).  And it would definitely help with debugging (seems 
> likely there's lots of scope for race conditions if you prematurely tell 
> the hypervisor you don't need the page any more...).

You get an addressing exception if you touch a Uz page. This indicates a
BUG in the Linux code because this is a use after free. If the guest
touches a Vz page you get a discard fault.
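
The outcomes of touching a page whose content state is "z" can be
summarized in a small sketch. The enum names and
touch_zero_content_page() are hypothetical; the Sz case follows the
"guest write" edge of the .dot graph earlier in the thread.

```c
/* Hypothetical model of the guest (usage) state. */
enum usage { USAGE_UNUSED, USAGE_STABLE, USAGE_VOLATILE, USAGE_POT_VOLATILE };

enum touch_result {
	TOUCH_EMPTY_PAGE_SUPPLIED,	/* Sz: host backs the access, page becomes Sr */
	TOUCH_DISCARD_FAULT,		/* Vz: guest has to recreate the page */
	TOUCH_ADDRESSING_EXCEPTION,	/* Uz: use after free, a guest bug */
	TOUCH_INVALID_STATE		/* Pz does not occur in the state graph */
};

/* What the guest sees when it touches a page in content state "z". */
enum touch_result touch_zero_content_page(enum usage u)
{
	switch (u) {
	case USAGE_STABLE:
		return TOUCH_EMPTY_PAGE_SUPPLIED;
	case USAGE_VOLATILE:
		return TOUCH_DISCARD_FAULT;
	case USAGE_UNUSED:
		return TOUCH_ADDRESSING_EXCEPTION;
	default:
		return TOUCH_INVALID_STATE;
	}
}
```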

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 20:56                 ` Anthony Liguori
  2008-03-12 21:36                   ` Jeremy Fitzhardinge
@ 2008-03-13  9:42                   ` Martin Schwidefsky
  1 sibling, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:42 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, hugh

On Wed, 2008-03-12 at 15:56 -0500, Anthony Liguori wrote:
> > Vp should never happen, since you'd never preserve a V page.  And 
> > surely it would be Pr -> Sr, since the hypervisor wouldn't push the 
> > page to backing store when you change the client state.
> 
> You're right, I meant Vp/Pp but they are invalid states.  I think one of 
> the things that keeps tripping me up is that the host can change both 
> the host and guest page states.  My initial impression was that the host 
> handled the host state and the guest handled the guest state.

In principle only the guest changes the guest state and only the host
changes the host state. The simplified state diagram shows exceptions
for Pr->Sp and Pr->Vz.
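
The host-driven edges of the graph, including the two exceptions, can
be written down as a sketch. The state model and host_reclaim() are
hypothetical names; the transitions are the "host swap" and "host
discard" edges from the .dot file posted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: guest (usage) state and host (content) state. */
enum usage   { USAGE_UNUSED, USAGE_STABLE, USAGE_VOLATILE, USAGE_POT_VOLATILE };
enum content { CONTENT_RESIDENT, CONTENT_PRESERVED, CONTENT_ZERO };

struct page_state {
	enum usage   usage;
	enum content content;
};

/* The host reclaims a resident page; "dirty" only matters for Pr. */
void host_reclaim(struct page_state *ps, bool dirty)
{
	assert(ps && ps->content == CONTENT_RESIDENT);
	switch (ps->usage) {
	case USAGE_STABLE:		/* Sr -> Sp: host swap */
		ps->content = CONTENT_PRESERVED;
		break;
	case USAGE_UNUSED:		/* Ur -> Uz: host discard */
	case USAGE_VOLATILE:		/* Vr -> Vz: host discard */
		ps->content = CONTENT_ZERO;
		break;
	case USAGE_POT_VOLATILE:	/* the host also changes the guest state */
		if (dirty) {		/* Pr -> Sp */
			ps->usage = USAGE_STABLE;
			ps->content = CONTENT_PRESERVED;
		} else {		/* Pr -> Vz */
			ps->usage = USAGE_VOLATILE;
			ps->content = CONTENT_ZERO;
		}
		break;
	}
}
```

Only the USAGE_POT_VOLATILE branch writes ps->usage; every other branch
touches the content field alone.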

> >> Do the host states even really need visibility to the guest at all?  
> >> It may be useful for the guest to be able to distinguish between Ur 
> >> and Uz but it doesn't seem necessary.
> >
> > Well, you implicitly see the hypervisor state.  If you touch a [UV]z 
> > page then you get a fault telling you that the page has been taken 
> > away from you (I think).  And it would definitely help with debugging 
> > (seems likely there's lots of scope for race conditions if you 
> > prematurely tell the hypervisor you don't need the page any more...).
> 
> I was thinking that it may be useful to know a Ur versus a Uz when 
> allocating memory.  In this case, you'd rather allocate Ur pages versus 
> Uz to avoid the fault.  I don't read s390 arch code well, is the host 
> state explicit to the guest?

This is the second optimization you might want to think about. The other
is to avoid the page clearing for Uz.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-12 21:36                   ` Jeremy Fitzhardinge
@ 2008-03-13  9:45                     ` Martin Schwidefsky
  2008-03-13 16:07                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Anthony Liguori, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, hugh

On Wed, 2008-03-12 at 14:36 -0700, Jeremy Fitzhardinge wrote:
> Anthony Liguori wrote:
> >> Vp should never happen, since you'd never preserve a V page.  And 
> >> surely it would be Pr -> Sr, since the hypervisor wouldn't push the 
> >> page to backing store when you change the client state.
> >>     
> >
> > You're right, I meant Vp/Pp but they are invalid states.  I think one of 
> > the things that keeps tripping me up is that the host can change both 
> > the host and guest page states.  My initial impression was that the host 
> > handled the host state and the guest handled the guest state.
> >   
> 
> Yes.  And it seems to me that you get unfortunate outcomes if you have a 
> Pr->Vz->Vr transition.

Vz->Vr cannot happen. This would be a bug in the host.

> > I was thinking that it may be useful to know a Ur versus a Uz when 
> > allocating memory.  In this case, you'd rather allocate Ur pages versus 
> > Uz to avoid the fault.  I don't read s390 arch code well, is the host 
> > state explicit to the guest?
> >   
> 
> Yes, reusing Ur pages might well be better, but who knows - they've 
> probably got an instruction which makes Uz cheap...

Yes, faulting in a Uz page is cheap on s390. Isn't it a lovely
architecture :-)

> Stuff like this suggests that both parts of the state are packed 
> together, and are guest-visible:
> 
> +	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
> +		(state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
> 

Yes, the return value of the ESSA instruction has both the guest state
and the host state.
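
Decoding such a packed value could look like the sketch below. The
EX_* masks and values are invented for illustration and are not the
real ESSA_* constants from the patch; only the shape of the test is
the same as in the quoted line.

```c
#include <stdbool.h>

/*
 * Hypothetical bit layout: two bits of guest (usage) state and two
 * bits of host (content) state packed into one value.
 */
#define EX_USTATE_MASK		0x0cu
#define EX_USTATE_STABLE	0x00u
#define EX_USTATE_UNUSED	0x04u
#define EX_USTATE_VOLATILE	0x08u

#define EX_CSTATE_MASK		0x03u
#define EX_CSTATE_RESIDENT	0x00u
#define EX_CSTATE_ZERO		0x03u

/* True if the packed value describes a discarded volatile page (Vz). */
bool ex_page_discarded(unsigned int state)
{
	return (state & EX_USTATE_MASK) == EX_USTATE_VOLATILE &&
	       (state & EX_CSTATE_MASK) == EX_CSTATE_ZERO;
}
```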

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-12 22:41 ` [patch 0/6] Guest page hinting version 6 Rusty Russell
@ 2008-03-13  9:47   ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13  9:47 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, frankeh, hugh

On Thu, 2008-03-13 at 09:41 +1100, Rusty Russell wrote:
> On Thursday 13 March 2008 00:21:32 Martin Schwidefsky wrote:
> > My question now is how to proceed with the code. I sure
> > would love to see the code going upstream some day but that depends on
> > the mm developers as the code adds complexity that needs to be supported.
> 
> Well, I want this feature, but I agree about complexity.
> 
> AFAICT, the trivial subset of this is the hinting of Unused pages.  It seems 
> that would buy us something, and perhaps be a stepping stone to full page 
> hinting?

I've been there but the unused page thing is so small that it doesn't
make sense to separate it from the patches. If I don't see any progress
then I will come up with a patch that adds the Unused state transitions
to the arch files of s390.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.




* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-13  9:45                     ` Martin Schwidefsky
@ 2008-03-13 16:07                       ` Jeremy Fitzhardinge
  2008-03-13 16:17                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-13 16:07 UTC (permalink / raw)
  To: schwidefsky
  Cc: Jeremy Fitzhardinge, akpm, linux-s390, frankeh, nickpiggin,
	linux-kernel, virtualization, Anthony Liguori, hugh

Martin Schwidefsky wrote:
> Vz->Vr cannot happen. This would be a bug in the host.
>   

Does that mean that Vz is effectively identical to Uz?

    J

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-13 16:07                       ` Jeremy Fitzhardinge
@ 2008-03-13 16:17                         ` Jeremy Fitzhardinge
  2008-03-13 16:55                           ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-13 16:17 UTC (permalink / raw)
  To: schwidefsky
  Cc: akpm, linux-s390, frankeh, nickpiggin, linux-kernel,
	virtualization, Anthony Liguori, hugh

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
>> Vz->Vr cannot happen. This would be a bug in the host.
>>   
>
> Does that mean that Vz is effectively identical to Uz? 

Hm, on further thought:

If guest writes to Vz pages are disallowed, then the only way out of Vz
is if the guest sets it to something else (Uz,Sz).  If so, what's the 
point of using that state?  Why not make:

    Vr -> Uz      host discard
    Pr -> Uz      host discard clean
    Sp -> Uz      set volatile
    Uz -> Uz      set volatile


But given how you've described V-state pages, I really would expect 
writes to a Vz to work, or alternatively, all writes to V-state pages to 
be disallowed.  Are there any real uses for a writable Vr page?

On the other hand, removing Vz->Vr does clean up the dot graph a lot...

    J

[-- Attachment #2: gph.dot --]
[-- Type: text/plain, Size: 1309 bytes --]

digraph gph {
	/* Guest state changes on resident pages */
	Ur -> Sr [ label="set stable" ];
	Ur -> Vr [ label="set volatile\n(w=0)" ];
	Ur -> Pr [ label="set volatile\n(w=1)" ];

	Sr -> Ur [ label="set unused" ];
	Sr -> Vr [ label="set volatile\n(w=0)" ];
	Sr -> Pr [ label="set volatile\n(w=1)" ];

	Vr -> Sr [ label="set stable(_if_present)" ];
	Vr -> Ur [ label="set unused" ];
	Vr -> Pr [ label="set volatile\n(w=1)" ];

	Pr -> Sr [ label="set stable(_if_present)" ];
	Pr -> Vr [ label="set volatile\n(w=0)" ];
	Pr -> Ur [ label="set unused" ];

	/* Guest state changes on zero pages */
	Uz -> Sz [ label="set stable" ];
	Uz -> Vz [ label="set volatile" ];

	Sz -> Vz [ label="set volatile" ];
	Sz -> Uz [ label="set unused" ];

	Vz -> Sz [ label="set stable" ];
	Vz -> Uz [ label="set unused" ];

	/* Guest state changes on host-swapped pages */
	Sp -> Uz [ label="set unused" ];
	Sp -> Vz [ label="set volatile" ];

	/* Guest touches pages */
	Sz -> Sr [ label="guest write" ];
	Sp -> Sr [ label="guest access" ];

	/* Host actions */
	Sr -> Sp [ label="host swap", style=dashed ];
	Ur -> Uz [ label="host discard", style=dashed ];
	Vr -> Vz [ label="host discard", style=dashed ];
	Pr -> Sp [ label="host discard\n(dirty)", style=dashed ];
	Pr -> Vz [ label="host discard\n(clean)", style=dashed ];
}

[-- Attachment #3: gph.pdf --]
[-- Type: application/pdf, Size: 16517 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-13 16:17                         ` Jeremy Fitzhardinge
@ 2008-03-13 16:55                           ` Martin Schwidefsky
  2008-03-13 17:05                             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13 16:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: akpm, linux-s390, frankeh, nickpiggin, linux-kernel,
	virtualization, Anthony Liguori, hugh

On Thu, 2008-03-13 at 09:17 -0700, Jeremy Fitzhardinge wrote:
> Jeremy Fitzhardinge wrote:
> > Martin Schwidefsky wrote:
> >> Vz->Vr cannot happen. This would be a bug in the host.
> >>   
> >
> > Does that mean that Vz is effectively identical to Uz? 
> 
> Hm, on further thought:
> 
> If guest writes to Vz pages are disallowed, then the only way out of Vz
> is if the guest sets it to something else (Uz,Sz).  If so, what's the 
> point of using that state?  Why not make:
>
>     Vr -> Uz      host discard
>     Pr -> Uz      host discard clean
>     Sp -> Uz      set volatile
>     Uz -> Uz      set volatile

Vz is the page discarded state. The difference to Uz is slim, both
states will cause a program check on access. Vz generates a discard
fault, Uz generates an addressing exception which is nice for debugging.
But I don't see a reason why an implementation that uses Uz instead of
Vz shouldn't work.

> But given how you've described V-state pages, I really would expect 
> writes to a Vz to work, or alternatively, all writes to V-state pages to 
> be disallowed.  Are there any real uses for a writable Vr page?

You mean in the section that speaks about the guest states S/U/V/P?
Always keep in mind that you can access a V/P page only until it gets
discarded. Then the useful content of the page frame is lost and any
read or write to the now Vz page will be answered with a discard fault.

A Vr page is read-only. If a page gets mapped for writing it needs to
get into the Pr state. This is the hint for the host to look at the
dirty bit before it discards a page.
So yes, there is no use for a writable Vr page.
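
The host side of these rules can be sketched as a small state function.
This is only an illustration built from the host-action edges of the dot
graph quoted in this thread; the enum, host_reclaim() and its needs_io
parameter are hypothetical names, not part of the patches.

```c
#include <stdbool.h>

/* Page states as named in the thread: guest state (U/S/V/P) followed by
 * host state (r = resident, z = zero/discarded, p = preserved on the
 * paging device). */
enum page_state { UR, UZ, SR, SZ, SP, VR, VZ, PR };

/*
 * State of a page after the host has reclaimed its frame.
 * dirty is the hardware dirty bit; it only matters for Pr pages.
 * *needs_io is set when the host must write the page to the paging
 * device instead of simply throwing it away.
 */
static enum page_state host_reclaim(enum page_state s, bool dirty,
				    bool *needs_io)
{
	*needs_io = false;
	switch (s) {
	case UR:		/* unused: discard, no I/O */
		return UZ;
	case VR:		/* volatile: discard without checking dirty */
		return VZ;
	case PR:		/* potentially volatile: check dirty first */
		if (dirty) {
			*needs_io = true;
			return SP;
		}
		return VZ;
	case SR:		/* stable: ordinary host swap */
		*needs_io = true;
		return SP;
	default:		/* no resident frame left to reclaim */
		return s;
	}
}
```

The Pr case is the only one where the dirty bit decides the outcome,
which is why a page that gets mapped for writing has to leave Vr first.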

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
                   ` (6 preceding siblings ...)
  2008-03-12 22:41 ` [patch 0/6] Guest page hinting version 6 Rusty Russell
@ 2008-03-13 16:57 ` Hugh Dickins
  2008-03-13 17:14   ` Martin Schwidefsky
                     ` (3 more replies)
  7 siblings, 4 replies; 49+ messages in thread
From: Hugh Dickins @ 2008-03-13 16:57 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin, zach,
	frankeh, rusty, jeremy, andrea, clameter, a.p.zijlstra

On Wed, 12 Mar 2008, Martin Schwidefsky wrote:
> Greetings,
> I've dedusted the guest page hinting patches and ported them to today's
> upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
> that is easy to fix. The code still works as expected on my test system.
> 
> Our z/VM performance team recently published a report on guest page
> hinting vs. the ballooner approach on SLES10 for a farm of web servers.
> The code on SLES10 differs a bit from the upstream variant but the
> performance results should be still valid.  You will find the report
> here:
> 
>   http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html
> 
> (the VMRM-CMM the web page speaks about is the balloon approach,
>  CMMA is the guest page hinting).
> 
> Both approaches to the memory overcommit problem show comparable benefits
> for this workload, with an advantage for guest page hinting for large
> number of guests. For other workloads your mileage may vary.
> 
> The main benefit for guest page hinting vs. the ballooner is that there
> is no need for a monitor that keeps track of the memory usage of all the
> guests, a complex algorithm that calculates the working set sizes and for
> the calls into the guest kernel to control the size of the balloons.
> The host just does normal LRU based paging. If the host picks one of the
> pages the guest can recreate, the host can throw it away instead of writing
> it to the paging device. Simple and elegant.
> The main disadvantage is the added complexity that is introduced to the
> guests memory management code to do the page state changes and to deal
> with discard faults.
> 
> The last versions of the patches do not differ much, I consider the code
> to be stable. My question now is how to proceed with the code. I sure
> would love to see the code going upstream some day but that depends on
> the mm developers as the code adds complexity that needs to be supported.
> If the general feeling is that the advantages of this approach do not
> warrant the added complexity this will likely be the last time you
> will hear about guest page hinting.

Oh, that would be such a shame.  Your guest page hinting patches remind
me of that childhood thrill, when once a year the circus comes to town ;)

But seriously, I'm ashamed to see my name in the Cc list: it would
be very unfair if your patches never made it in, just because I've
failed to find the time to wrap my own puny brain around them.

It's very encouraging to see Jeremy and Rusty weighing in.  I hope
Zach will too, and I've added Andrea: their support would count a lot.
You have Nick on the list, good, I've added Christoph and Peter
(if you do resend, linux-mm might prove more useful than linux-kernel).

With support from rival virtualizers,
I do think you've a good chance of getting in.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-13 16:55                           ` Martin Schwidefsky
@ 2008-03-13 17:05                             ` Jeremy Fitzhardinge
  2008-03-13 17:23                               ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-13 17:05 UTC (permalink / raw)
  To: schwidefsky
  Cc: akpm, linux-s390, frankeh, nickpiggin, linux-kernel,
	virtualization, Anthony Liguori, hugh

Martin Schwidefsky wrote:
> Vz is the page discarded state. The difference to Uz is slim, both
> states will cause a program check on access. Vz generates a discard
> fault, Uz generates an addressing exception which is nice for debugging.
>   

How do you handle these different cases in Linux?  Do you use Vr pages 
in the pagecache, and then shoot down the pagecache entry if the host 
steals the page?

The Uz access exception presumably just generates a normal oops.

(I should probably make time to read the rest of the series.)

>> But given how you've described V-state pages, I really would expect 
>> writes to a Vz to work, or alternatively, all writes to V-state pages to 
>> be disallowed.  Are there any real uses for a writable Vr page?
>>     
>
> You mean in the section that speaks about the guests states S/U/V/P ?
> Always keep in mind that you can access a V/P page only until it gets
> discarded. Then the useful content of the page frame is lost and any
> read or write to the now Vz page will be answered with a discard fault.
>   

Presumably reads from a Vz page also generate a discard fault?

> A Vr page is read-only. If a page gets mapped for writing it needs to
> get into the Pr state. This is the hint for the host to look at the
> dirty bit before it discards a page.
> So yes, there is no use for a writable Vr page.
>   

OK, thanks, that clears things up.  I was assuming that Vr was 
technically writable but that writes could be discarded at any time (ie, 
allowing guests to merrily shoot themselves in the foot ;).  Making it 
forced RO is much more sensible.

    J

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 16:57 ` Hugh Dickins
@ 2008-03-13 17:14   ` Martin Schwidefsky
  2008-03-13 17:45   ` Zachary Amsden
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13 17:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin, zach,
	frankeh, rusty, jeremy, andrea, clameter, a.p.zijlstra

On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> > The last versions of the patches do not differ much, I consider the code
> > to be stable. My question now is how to proceed with the code. I sure
> > would love to see the code going upstream some day but that depends on
> > the mm developers as the code adds complexity that needs to be supported.
> > If the general feeling is that the advantages of this approach do not
> > warrant the added complexity this will likely be the last time you
> > will hear about guest page hinting.
> 
> Oh, that would be such a shame.  Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)
> 
> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.

It is an effort to get your head around it the first time. It gets
easier the more you talk about it :-)

> It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).

Grr, did I really forget to copy linux-mm?!? (..insert your favourite
four letter word here..). I absolutely intended to copy linux-mm but
somehow replaced it with linux-s390.

> With support from rival virtualizers,
> I do think you've a good chance of getting in.

Yes, it would be great if we can find another user for it.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 6/6] Guest page hinting: s390 support.
  2008-03-13 17:05                             ` Jeremy Fitzhardinge
@ 2008-03-13 17:23                               ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-13 17:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: akpm, linux-s390, frankeh, nickpiggin, linux-kernel,
	virtualization, Anthony Liguori, hugh

On Thu, 2008-03-13 at 10:05 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > Vz is the page discarded state. The difference to Uz is slim, both
> > states will cause a program check on access. Vz generates a discard
> > fault, Uz generates an addressing exception which is nice for debugging.
> >   
> 
> How do you handle these different cases in Linux?  Do you use Vr pages 
> in the pagecache, and then shoot down the pagecache entry if the host 
> steals the page?

The environment where we currently run all this is z/VM as the host and
Linux as the guest. We have two page tables on s390, a host page table
and a guest page table. If the host discards a page it simply removes
the entry for the page in the host page table. If the guest comes along
and accesses the page the host gets the fault and presents the
appropriate program check (discard fault or addressing exception) to
the guest.

> The Uz access exception presumably just generates a normal oops.

Yes, the handler for an addressing exception will call die() for a
kernel check without a fixup.
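
As a rough sketch of the distinction discussed above, the following
maps a page state to the program check the guest would see on access.
The enum names and fault_on_access() are hypothetical and only
summarize what is said in this thread; they are not taken from the
patches.

```c
/* Page states as named in the thread: guest state (U/S/V/P) followed by
 * host state (r = resident, z = zero/discarded, p = preserved). */
enum page_state { UR, UZ, SR, SZ, SP, VR, VZ, PR };

enum guest_fault {
	FAULT_NONE,		/* no program check reaches the guest */
	FAULT_DISCARD,		/* discard fault: the guest recreates the page */
	FAULT_ADDRESSING	/* addressing exception: die() without a fixup */
};

/* Program check presented to the guest when it reads or writes a page. */
static enum guest_fault fault_on_access(enum page_state s)
{
	switch (s) {
	case VZ:
		return FAULT_DISCARD;
	case UZ:
		return FAULT_ADDRESSING;
	default:
		/* Resident pages are simply accessed; for Sp the host
		 * brings the page back before the guest continues. */
		return FAULT_NONE;
	}
}
```

Keeping Vz and Uz apart therefore costs nothing at run time but tells
the guest whether it hit a legitimately discarded page or touched a
page it had declared unused.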

> (I should probably make time to read the rest of the series.)
> 
> >> But given how you've described V-state pages, I really would expect 
> >> writes to a Vz to work, or alternatively, all writes to V-state pages to 
> >> be disallowed.  Are there any real uses for a writable Vr page?
> >>     
> >
> > You mean in the section that speaks about the guests states S/U/V/P ?
> > Always keep in mind that you can access a V/P page only until it gets
> > discarded. Then the useful content of the page frame is lost and any
> > read or write to the now Vz page will be answered with a discard fault.
> >   
> 
> Presumably reads from a Vz page also generate a discard fault?

Yes.

> > A Vr page is read-only. If a page gets mapped for writing it needs to
> > get into the Pr state. This is the hint for the host to look at the
> > dirty bit before it discards a page.
> > So yes, there is no use for a writable Vr page.
> >   
> 
> OK, thanks, that clears things up.  I was assuming that Vr was 
> technically writable but that writes could be discarded at any time (ie, 
> allowing guests to merrily shoot themselves in the foot ;).  Making it 
> forced RO is much more sensible.

Well, technically you could write to a Vr page via the kernel address
space. The thing is that the host can just discard the page even if it
is dirty. The Vr state is used for page cache pages which do not have
any writable mapping.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 16:57 ` Hugh Dickins
  2008-03-13 17:14   ` Martin Schwidefsky
@ 2008-03-13 17:45   ` Zachary Amsden
  2008-03-13 19:45     ` Andrea Arcangeli
  2008-03-13 18:41   ` Jeremy Fitzhardinge
  2008-05-06 15:33   ` Martin Schwidefsky
  3 siblings, 1 reply; 49+ messages in thread
From: Zachary Amsden @ 2008-03-13 17:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, frankeh, rusty, jeremy, andrea, clameter,
	a.p.zijlstra


On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> Oh, that would be such a shame.  Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)

I like the circus too.

> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.

Bah!  So modest.

> It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).

I agree the page hinting technique is generally useful, even
cross-architecture.

What doesn't appear to be useful however, is support for this under
VMware.  It can be done, even without the writable pte support (yes,
really).  But due to us exploiting optimizations at lower layers, it
doesn't appear that it will gain us any performance - and we must
already have the complex working set algorithms to support
non-paravirtualized guests.

> With support from rival virtualizers,
> I do think you've a good chance of getting in.

I would say we support it, but I don't expect us to make use of the
infrastructure anytime soon.  For us it would make more sense to use the
swap-fault optimization, but this requires some significant design
changes in our monitor.

Either way, these are both great ideas and I would not want to be held
responsible for blocking their upstream progress.  Someday, with the
evolving x86 architecture (if we ever get per-page dirty bits), they
might make sense for us to do as well.

Zach


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 16:57 ` Hugh Dickins
  2008-03-13 17:14   ` Martin Schwidefsky
  2008-03-13 17:45   ` Zachary Amsden
@ 2008-03-13 18:41   ` Jeremy Fitzhardinge
  2008-03-13 18:55     ` Hugh Dickins
  2008-05-06 15:33   ` Martin Schwidefsky
  3 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-13 18:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, zach, frankeh, rusty, andrea, clameter,
	a.p.zijlstra, Keir Fraser, Ian Pratt

Hugh Dickins wrote:
> On Wed, 12 Mar 2008, Martin Schwidefsky wrote:
>   
>> Greetings,
>> I've dedusted the guest page hinting patches and ported them to today's
>> upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
>> that is easy to fix. The code still works as expected on my test system.
>>
>> Our z/VM performance team recently published a report on guest page
>> hinting vs. the ballooner approach on SLES10 for a farm of web servers.
>> The code on SLES10 differs a bit from the upstream variant but the
>> performance results should be still valid.  You will find the report
>> here:
>>
>>   http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html
>>
>> (the VMRM-CMM the web page speaks about is the balloon approach,
>>  CMMA is the guest page hinting).
>>
>> Both approaches to the memory overcommit problem show comparable benefits
>> for this workload, with an advantage for guest page hinting for large
>> number of guests. For other workloads your mileage may vary.
>>
>> The main benefit for guest page hinting vs. the ballooner is that there
>> is no need for a monitor that keeps track of the memory usage of all the
>> guests, a complex algorithm that calculates the working set sizes and for
>> the calls into the guest kernel to control the size of the balloons.
>> The host just does normal LRU based paging. If the host picks one of the
>> pages the guest can recreate, the host can throw it away instead of writing
>> it to the paging device. Simple and elegant.
>> The main disadvantage is the added complexity that is introduced to the
>> guests memory management code to do the page state changes and to deal
>> with discard faults.
>>
>> The last versions of the patches do not differ much, I consider the code
>> to be stable. My question now is how to proceed with the code. I sure
>> would love to see the code going upstream some day but that depends on
>> the mm developers as the code adds complexity that needs to be supported.
>> If the general feeling is that the advantages of this approach do not
>> warrant the added complexity this will likely be the last time you
>> will hear about guest page hinting.
>>     
>
> Oh, that would be such a shame.  Your guest page hinting patches remind
> me of that childhood thrill, when once a year the circus comes to town ;)
>
> But seriously, I'm ashamed to see my name in the Cc list: it would
> be very unfair if your patches never made it in, just because I've
> failed to find the time to wrap my own puny brain around them.
>
> It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).
>   

I like the idea and it seems basically sound, but unfortunately Xen 
won't be able to make use of it in the near term, because it doesn't 
support any kind of backing for guest domain memory.  There has been 
some thought about adding this kind of functionality to Xen.  Keir, Ian: 
do you think this kind of support in the kernel would be useful to us?

One concern I have is that 4k is really a very fine grain.  We're 
thinking about moving Xen's memory management to operate in 2M chunk 
units, which would allow guests to directly use large pages with the 
corresponding reduction in TLB pressure.  One side-effect of this is 
that we'd need to change ballooning to be in 2M rather than 4k units in 
order to prevent physical memory fragmentation.

Page hinting at 4k resolution poses the same problem.  Would this 
technique still be useful operating on 2M chunks?  Certainly it seems 
less likely that you could easily get a whole 2M area with the same 
fine-grained properties that these patches track.  Would some kind of 
page/sub-page tracking be useful?

My other concern is just correctness over time on the Linux side.  We 
already have enough trouble keeping things like the pte and page 
structure state in sync, with resulting rare data-loss bugs.  Adding 
another layer which only applies in specific environments raises the 
possibility for new bugs to be un-noticed for a long time.  How can we 
structure the VM changes to make sure that its robust in the face of 
maintenance?

    J

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 18:41   ` Jeremy Fitzhardinge
@ 2008-03-13 18:55     ` Hugh Dickins
  2008-03-13 19:53       ` Zachary Amsden
  0 siblings, 1 reply; 49+ messages in thread
From: Hugh Dickins @ 2008-03-13 18:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Martin Schwidefsky, linux-kernel, linux-s390, virtualization,
	akpm, nickpiggin, zach, frankeh, rusty, andrea, clameter,
	a.p.zijlstra, Keir Fraser, Ian Pratt

On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
> 
> My other concern is just correctness over time on the Linux side.  We already
> have enough trouble keeping things like the pte and page structure state in
> sync, with resulting rare data-loss bugs.  Adding another layer which only
> applies in specific environments raises the possibility for new bugs to go
> unnoticed for a long time.  How can we structure the VM changes to make sure
> that it's robust in the face of maintenance?

Yes, that's the main concern, as whenever lots of subtlety is added.
I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
run on anybody's x86 machine, without involving any virtualization, but
in which the PAGE_STATEs become essential to the correct working of the mm.

Hugh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 17:45   ` Zachary Amsden
@ 2008-03-13 19:45     ` Andrea Arcangeli
  2008-03-13 21:41       ` Zachary Amsden
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2008-03-13 19:45 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Hugh Dickins, Martin Schwidefsky, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, jeremy,
	clameter, a.p.zijlstra

On Thu, Mar 13, 2008 at 10:45:07AM -0700, Zachary Amsden wrote:
> What doesn't appear to be useful however, is support for this under
> VMware.  It can be done, even without the writable pte support (yes,
> really).  But due to us exploiting optimizations at lower layers, it
> doesn't appear that it will gain us any performance - and we must
> already have the complex working set algorithms to support
> non-paravirtualized guests.

With non-paravirt all you can do is to swap the guest physical memory
(mmu notifiers allow linux to do that) or share memory (mmu notifiers
+ ksm allow linux to do that too). We also have complex working set
algorithms that we use to find which parts of the guest physical
address space are best to swap first: the core linux VM.

What paravirt allows us to do (and that's the whole point of the paper
I guess), is to go one step further than just guest swapping and to
ask the guest if the page really needs to be swapped or if it can be
freed right away. So this would be an extension of the mmu notifiers
(this also shows how EMM API is too restrictive, while MMU notifiers
will allow that extension in the future) to avoid I/O sometimes if the
guest tells us through paravirt ops that it's not necessary to swap.

When talking with friends about ballooning I already once suggested to
auto inflate the balloon with pages in the freelist.

Now this paper goes well beyond the pages in the freelist (called
U/unused in the paper), this also covers cache and mapped-clean cache
in the guest. That would have been the next step.

Anyway plain ballooning remains useful as rss limiting or numa
compartments in the linux hypervisor, to provide unfairness to certain
guests.

I didn't read the patch yet, but I think paravirt knowledge about
U/unused pages is needed to avoid guest swapping. The cache and mapped
cache in the guest is a gray area, because linux as hypervisor will be
extremely efficient at swapping out and swapping in the guest cache
(host swapping guest cache, may be faster than re-issuing a read-I/O
to refill the cache by itself, clearly with guest using
paravirt). Let's say I'm mostly interested in page-hinting for the
U pages initially.

I'm currently busy with two other features and trying to get mmu
notifier #v9 into mainline which is orders of magnitude more important
than avoiding a few swapouts sometimes (without mmu notifiers
everything else is irrelevant, including guest page hinting and
including ballooning too because madvise(don't need) won't clear sptes
and invalidate guest tlbs).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 18:55     ` Hugh Dickins
@ 2008-03-13 19:53       ` Zachary Amsden
  2008-03-14 18:30         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 49+ messages in thread
From: Zachary Amsden @ 2008-03-13 19:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jeremy Fitzhardinge, Martin Schwidefsky, linux-kernel,
	linux-s390, virtualization, akpm, nickpiggin, frankeh, rusty,
	andrea, clameter, a.p.zijlstra, Keir Fraser, Ian Pratt


On Thu, 2008-03-13 at 18:55 +0000, Hugh Dickins wrote:
> On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
> > 
> > My other concern is just correctness over time on the Linux side.  We already
> > have enough trouble keeping things like the pte and page structure state in
> > sync, with resulting rare data-loss bugs.  Adding another layer which only
> > applies in specific environments raises the possibility for new bugs to go
> > unnoticed for a long time.  How can we structure the VM changes to make sure
> > that it's robust in the face of maintenance?
> 
> Yes, that's the main concern, as whenever lots of subtlety is added.
> I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
> run on anybody's x86 machine, without involving any virtualization, but
> in which the PAGE_STATEs become essential to the correct working of the mm.

How about a fake hypervisor, which is really just a random page evictor,
following the rules of CMM?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 19:45     ` Andrea Arcangeli
@ 2008-03-13 21:41       ` Zachary Amsden
  0 siblings, 0 replies; 49+ messages in thread
From: Zachary Amsden @ 2008-03-13 21:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Martin Schwidefsky, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, jeremy,
	clameter, a.p.zijlstra


On Thu, 2008-03-13 at 20:45 +0100, Andrea Arcangeli wrote:
> On Thu, Mar 13, 2008 at 10:45:07AM -0700, Zachary Amsden wrote:
> > What doesn't appear to be useful however, is support for this under
> > VMware.  It can be done, even without the writable pte support (yes,
> > really).  But due to us exploiting optimizations at lower layers, it
> > doesn't appear that it will gain us any performance - and we must
> > already have the complex working set algorithms to support
> > non-paravirtualized guests.
> 
> With non-paravirt all you can do is to swap the guest physical memory
> (mmu notifiers allow linux to do that) or share memory (mmu notifiers
> + ksm allow linux to do that too). We also have complex working set
> algorithms that we use to find which parts of the guest physical
> address space are best to swap first: the core linux VM.

We can tap into those algorithms just as effectively using ballooning,
and we've optimized the sharing and working set models from outside of
the guest.  So while CMM gives slightly better information for a random
forced page eviction, the savings don't appear to justify the
complexity for a VMware implementation.

Things would change quite a bit if we had hardware supported per-page
dirty bits.

> than avoiding a few swapouts sometimes (without mmu notifiers
> everything else is irrelevant, including guest page hinting and
> including ballooning too because madvise(don't need) won't clear sptes
> and invalidate guest tlbs).

Ballooning still works if you use a kernel based balloon driver.  Using
madvise wouldn't be a reliable way to balloon anyway.  Are you talking
about an API to manage working sets and such from userspace?

Cheers,

Zach


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 19:53       ` Zachary Amsden
@ 2008-03-14 18:30         ` Jeremy Fitzhardinge
  2008-03-14 21:32           ` Zachary Amsden
  0 siblings, 1 reply; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-14 18:30 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Hugh Dickins, Martin Schwidefsky, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, andrea,
	clameter, a.p.zijlstra, Keir Fraser, Ian Pratt

Zachary Amsden wrote:
> On Thu, 2008-03-13 at 18:55 +0000, Hugh Dickins wrote:
>   
>> On Thu, 13 Mar 2008, Jeremy Fitzhardinge wrote:
>>     
>>> My other concern is just correctness over time on the Linux side.  We already
>>> have enough trouble keeping things like the pte and page structure state in
>>> sync, with resulting rare data-loss bugs.  Adding another layer which only
>>> applies in specific environments raises the possibility for new bugs to be
>>> un-noticed for a long time.  How can we structure the VM changes to make sure
> >>> that it's robust in the face of maintenance?
>>>       
>> Yes, that's the main concern, as whenever lots of subtlety is added.
>> I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
>> run on anybody's x86 machine, without involving any virtualization, but
>> in which the PAGE_STATEs become essential to the correct working of the mm.
>>     
>
> How about a fake hypervisor, which is really just a random page evictor,
> following the rules of CMM?
>   

Probably simpler to just have variants of the page_set_* functions which 
simulate the worst-possible host action immediately (ie, stealing pages, 
logically swapping them, etc).  That wouldn't give you full coverage, 
but it would go some way.  An async variant which schedules a change in 
a few milliseconds would help too.

I guess that's equivalent to having a special-purpose hypervisor built 
into the kernel (hm, sounds familiar...).

    J
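
For illustration, a minimal userspace sketch of the debug-variant idea
described above. It is not taken from the patch series: the three-state
model, struct sim_page, and the debug_page_* helpers are hypothetical
names, and the "host" here simply discards a page the moment the guest
marks it volatile, which is the worst case the message refers to.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096

/* Hypothetical guest-visible page states; the real series has more. */
enum sim_state { SIM_STABLE, SIM_UNUSED, SIM_VOLATILE };

struct sim_page {
	enum sim_state state;
	bool discarded;			/* host has thrown the content away */
	unsigned char data[SIM_PAGE_SIZE];
};

/* Worst-case host: steal the page as soon as the guest allows it. */
static void sim_host_discard(struct sim_page *pg)
{
	assert(pg->state != SIM_STABLE);	/* never steal stable pages */
	memset(pg->data, 0xaa, sizeof(pg->data)); /* poison stale readers */
	pg->discarded = true;
}

/* Debug variant of a "set volatile" transition: discard immediately. */
static void debug_page_set_volatile(struct sim_page *pg)
{
	pg->state = SIM_VOLATILE;
	sim_host_discard(pg);
}

/*
 * Debug variant of a "make stable" transition. Returns false when the
 * content is gone, so the caller has to recreate the page (for example
 * by reading it from backing store again).
 */
static bool debug_page_make_stable(struct sim_page *pg)
{
	if (pg->discarded)
		return false;
	pg->state = SIM_STABLE;
	return true;
}
```

With this variant every make-stable attempt on a previously volatile
page fails, so any caller that forgets to handle the failure touches
poisoned data right away instead of once in a blue moon.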

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-14 18:30         ` Jeremy Fitzhardinge
@ 2008-03-14 21:32           ` Zachary Amsden
  2008-03-14 21:37             ` Jeremy Fitzhardinge
  2008-03-17  9:21             ` Martin Schwidefsky
  0 siblings, 2 replies; 49+ messages in thread
From: Zachary Amsden @ 2008-03-14 21:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Hugh Dickins, Martin Schwidefsky, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, andrea,
	clameter, a.p.zijlstra, Keir Fraser, Ian Pratt


On Fri, 2008-03-14 at 11:30 -0700, Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
> > How about a fake hypervisor, which is really just a random page evictor,
> > following the rules of CMM?
> >   
> 
> Probably simpler to just have variants of the page_set_* functions which 
> simulate the worst-possible host action immediately (ie, stealing pages, 
> logically swapping them, etc).  That wouldn't give you full coverage, 
> but it would go some way.  An async variant which schedules a change in 
> a few milliseconds would help too.
> 
> I guess that's equivalent to having a special-purpose hypervisor built 
> into the kernel (hm, sounds familiar...).

It needn't be that hard on s390, I believe you don't need to worry about
PTEs becoming asynchronous when stealing a page, since if I understand
the hypervisor architecture, there is a per-page mapping level
available, allowing you to generate discard faults on access.  It might
be possible to use this mapping layer without implementing a full blown
hypervisor.  Martin?

For x86, at discard time, you would have to manually walk and invalidate
any PTEs potentially mapping the discarded page, but there is already
this great thing called Xen paravirt-ops which actually does that for
completely different reasons (PT page protection).

I think a random exponential distribution for discard would be needed to
catch all the racy failure modes.

Zach
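
A rough sketch of how such randomized discard delays could be drawn, for
illustration only. It uses the discrete analogue of an exponential
distribution (a geometric one): on every tick the page is discarded with
probability 1/mean_ticks. The xorshift generator, the function names and
the max_ticks cap are assumptions made here, not anything from the
patches.

```c
#include <stdint.h>

/* Small xorshift PRNG; the state must be seeded with a nonzero value. */
static uint32_t xorshift32(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Number of ticks to wait before discarding a page. Each tick "fires"
 * with probability 1/mean_ticks, so short delays are most likely and
 * long ones become progressively rarer; the average is roughly
 * mean_ticks. The result is capped at max_ticks.
 */
static unsigned int discard_delay_ticks(uint32_t *rng,
					unsigned int mean_ticks,
					unsigned int max_ticks)
{
	unsigned int ticks = 0;

	if (mean_ticks <= 1)
		return 0;		/* discard immediately */
	while (ticks < max_ticks && xorshift32(rng) % mean_ticks != 0)
		ticks++;
	return ticks;
}
```

Mixing mostly short delays with the occasional long one exercises both
the "discarded right behind the state change" races and the "discarded
much later, while something else holds the page" ones.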


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-14 21:32           ` Zachary Amsden
@ 2008-03-14 21:37             ` Jeremy Fitzhardinge
  2008-03-17  9:21             ` Martin Schwidefsky
  1 sibling, 0 replies; 49+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-14 21:37 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Hugh Dickins, Martin Schwidefsky, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, andrea,
	clameter, a.p.zijlstra, Keir Fraser, Ian Pratt

Zachary Amsden wrote:
> It needn't be that hard on s390, I believe you don't need to worry about
> PTEs becoming asynchronous when stealing a page, since if I understand
> the hypervisor architecture, there is a per-page mapping level
> available, allowing you to generate discard faults on access.  It might
> be possible to use this mapping layer without implementing a full blown
> hypervisor.  Martin?
>   

Yes, I don't expect it's a problem for s390, but the point is making 
something workalike enough to make sure there's an evenly distributed 
number of explosions-in-face when things go wrong.

> For x86, at discard time, you would have to manually walk and invalidate
> any PTEs potentially mapping the discarded page, but there is already
> this great thing called Xen paravirt-ops which actually does that for
> completely different reasons (PT page protection).
>   

Not sure I follow.  Xen pvops pays attention to whether a particular 
page is being used as part of a pagetable, and changes its permissions 
accordingly.  But because pagetable pages are strictly kernel-only, we 
can get away with updating a single kernel-mapping pte which is shared 
across all processes.  In the guest page hinting case, we need to deal 
with general pages which can be mapped anywhere, so that really does 
require a full traversal of the pagetables.   Presumably rmap would be 
helpful here.

    J

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-14 21:32           ` Zachary Amsden
  2008-03-14 21:37             ` Jeremy Fitzhardinge
@ 2008-03-17  9:21             ` Martin Schwidefsky
  1 sibling, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-03-17  9:21 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Hugh Dickins, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, andrea,
	clameter, a.p.zijlstra, Keir Fraser, Ian Pratt

On Fri, 2008-03-14 at 14:32 -0700, Zachary Amsden wrote:
> On Fri, 2008-03-14 at 11:30 -0700, Jeremy Fitzhardinge wrote:
> > Zachary Amsden wrote:
> > > How about a fake hypervisor, which is really just a random page evictor,
> > > following the rules of CMM?
> > >   
> > 
> > Probably simpler to just have variants of the page_set_* functions which 
> > simulate the worst-possible host action immediately (ie, stealing pages, 
> > logically swapping them, etc).  That wouldn't give you full coverage, 
> > but it would go some way.  An async variant which schedules a change in 
> > a few milliseconds would help too.
> > 
> > I guess that's equivalent to having a special-purpose hypervisor built 
> > into the kernel (hm, sounds familiar...).
> 
> It needn't be that hard on s390, I believe you don't need to worry about
> PTEs becoming asynchronous when stealing a page, since if I understand
> the hypervisor architecture, there is a per-page mapping level
> available, allowing you to generate discard faults on access.  It might
> be possible to use this mapping layer without implementing a full blown
> hypervisor.  Martin?

Yes, on s390 the PTEs cannot be asynchronous because there is no need to
synchronize them in the first place. A mapping layer that offers all the
primitives without using the SIE instruction will be difficult. For one,
we cannot use the ESSA instruction, which isolates the state changes, and
the host page table is tied to SIE. The page state is stored in the page
table extension, and the discard state is basically a specially marked
invalid pte in the host table. A mapping layer with some restrictions is
certainly possible.

> For x86, at discard time, you would have to manually walk and invalidate
> any PTEs potentially mapping the discarded page, but there is already
> this great thing called Xen paravirt-ops which actually does that for
> completely different reasons (PT page protection).

If you have to walk the guest page tables you call into the guest, no?
I would characterize this more as a ballooner since you need guest
activity to do the page stealing. The trick with guest page hinting is
that you do NOT call into the guest to do the discard. You'll need a
nested page table for that, I'm afraid.

> I think a random exponential distribution for discard would be needed to
> catch all the racy failure modes.

We had quite a few of these racy failures. Nasty. Hard to find.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-03-13 16:57 ` Hugh Dickins
                     ` (2 preceding siblings ...)
  2008-03-13 18:41   ` Jeremy Fitzhardinge
@ 2008-05-06 15:33   ` Martin Schwidefsky
  2008-05-06 19:46     ` Rik van Riel
  3 siblings, 1 reply; 49+ messages in thread
From: Martin Schwidefsky @ 2008-05-06 15:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, linux-s390, virtualization, akpm, nickpiggin, zach,
	frankeh, rusty, jeremy, andrea, clameter, a.p.zijlstra

On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> Zach will too, and I've added Andrea: their support would count a lot.
> You have Nick on the list, good, I've added Christoph and Peter
> (if you do resend, linux-mm might prove more useful than linux-kernel).
> 
> With support from rival virtualizers,
> I do think you've a good chance of getting in.

Traffic on the guest page hinting patches died down again. Until another
user shows up I guess that's it for the full version. In the meantime I
am pushing the patch below, which is the poor man's version that can be
done without common code changes. It uses the arch_alloc_page/arch_free_page
hooks to do the stable/unused state transitions. Better than nothing.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] guest page hinting light

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Use the existing arch_alloc_page/arch_free_page callbacks to do
the guest page state transitions between stable and unused.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/s390/Kconfig          |    7 +++
 arch/s390/mm/Makefile      |    1 
 arch/s390/mm/init.c        |    3 +
 arch/s390/mm/page-states.c |   79 +++++++++++++++++++++++++++++++++++++++++++++
 include/asm-s390/page.h    |   11 ++++++
 include/asm-s390/system.h  |    6 +++
 6 files changed, 107 insertions(+)

diff -urpN linux-2.6/arch/s390/Kconfig linux-2.6-patched/arch/s390/Kconfig
--- linux-2.6/arch/s390/Kconfig	2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/Kconfig	2008-05-06 17:38:28.000000000 +0200
@@ -430,6 +430,13 @@ config CMM_IUCV
 	  Select this option to enable the special message interface to
 	  the cooperative memory management.
 
+config PAGE_STATES
+	bool "Unused page notification"
+	help
+	  This enables the notification of unused pages to the
+	  hypervisor. The ESSA instruction is used to do the state
+	  changes between a page that has content and the unused state.
+
 config VIRT_TIMER
 	bool "Virtual CPU timer support"
 	help
diff -urpN linux-2.6/arch/s390/mm/init.c linux-2.6-patched/arch/s390/mm/init.c
--- linux-2.6/arch/s390/mm/init.c	2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/mm/init.c	2008-05-06 17:38:28.000000000 +0200
@@ -126,6 +126,9 @@ void __init mem_init(void)
         /* clear the zero-page */
         memset(empty_zero_page, 0, PAGE_SIZE);
 
+	/* Setup guest page hinting */
+	cmma_init();
+
 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem();
 
diff -urpN linux-2.6/arch/s390/mm/Makefile linux-2.6-patched/arch/s390/mm/Makefile
--- linux-2.6/arch/s390/mm/Makefile	2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/mm/Makefile	2008-05-06 17:38:28.000000000 +0200
@@ -5,3 +5,4 @@
 obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
+obj-$(CONFIG_PAGE_STATES) += page-states.o
diff -urpN linux-2.6/arch/s390/mm/page-states.c linux-2.6-patched/arch/s390/mm/page-states.c
--- linux-2.6/arch/s390/mm/page-states.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6-patched/arch/s390/mm/page-states.c	2008-05-06 17:38:28.000000000 +0200
@@ -0,0 +1,79 @@
+/*
+ * arch/s390/mm/page-states.c
+ *
+ * (C) Copyright IBM Corp. 2008
+ *
+ * Guest page hinting for unused pages.
+ *
+ * Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+
+#define ESSA_SET_STABLE		1
+#define ESSA_SET_UNUSED		2
+
+int cmma_flag;
+
+static int __init cmma(char *str)
+{
+	char *parm;
+	parm = strstrip(str);
+	if (strcmp(parm, "yes") == 0 || strcmp(parm, "on") == 0) {
+		cmma_flag = 1;
+		return 1;
+	}
+	cmma_flag = 0;
+	if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0)
+		return 1;
+	return 0;
+}
+
+__setup("cmma=", cmma);
+
+void __init cmma_init(void)
+{
+	register unsigned long tmp asm("0") = 0;
+	register int rc asm("1") = -ENOSYS;
+
+	if (!cmma_flag)
+		return;
+	asm volatile(
+		"       .insn rrf,0xb9ab0000,%1,%1,0,0\n"
+		"0:     la      %0,0\n"
+		"1:\n"
+		EX_TABLE(0b,1b)
+		: "+&d" (rc), "+&d" (tmp));
+	if (rc)
+		cmma_flag = 0;
+}
+
+void arch_free_page(struct page *page, int order)
+{
+	int i, rc;
+
+	if (!cmma_flag)
+		return;
+	for (i = 0; i < (1 << order); i++)
+		asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0"
+			     : "=&d" (rc)
+			     : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT),
+			       "i" (ESSA_SET_UNUSED));
+}
+
+void arch_alloc_page(struct page *page, int order)
+{
+	int i, rc;
+
+	if (!cmma_flag)
+		return;
+	for (i = 0; i < (1 << order); i++)
+		asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0"
+			     : "=&d" (rc)
+			     : "a" ((page_to_pfn(page) + i) << PAGE_SHIFT),
+			       "i" (ESSA_SET_STABLE));
+}
diff -urpN linux-2.6/include/asm-s390/page.h linux-2.6-patched/include/asm-s390/page.h
--- linux-2.6/include/asm-s390/page.h	2008-05-06 17:38:17.000000000 +0200
+++ linux-2.6-patched/include/asm-s390/page.h	2008-05-06 17:38:28.000000000 +0200
@@ -125,6 +125,17 @@ page_get_storage_key(unsigned long addr)
 	return skey;
 }
 
+#ifdef CONFIG_PAGE_STATES
+
+struct page;
+void arch_free_page(struct page *page, int order);
+void arch_alloc_page(struct page *page, int order);
+
+#define HAVE_ARCH_FREE_PAGE
+#define HAVE_ARCH_ALLOC_PAGE
+
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 /* to align the pointer to the (next) page boundary */
diff -urpN linux-2.6/include/asm-s390/system.h linux-2.6-patched/include/asm-s390/system.h
--- linux-2.6/include/asm-s390/system.h	2008-05-06 17:38:17.000000000 +0200
+++ linux-2.6-patched/include/asm-s390/system.h	2008-05-06 17:38:28.000000000 +0200
@@ -116,6 +116,12 @@ extern void pfault_fini(void);
 #define pfault_fini()		do { } while (0)
 #endif /* CONFIG_PFAULT */
 
+#ifdef CONFIG_PAGE_STATES
+extern void cmma_init(void);
+#else
+#define cmma_init()		do { } while (0)
+#endif
+
 #define finish_arch_switch(prev) do {					     \
 	set_fs(current->thread.mm_segment);				     \
 	account_vtime(prev);						     \
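
The effect of the two hooks in the patch above can be modelled in
userspace roughly as follows. This is only a sketch: sim_page_state[],
sim_essa() and the sim_arch_* functions are stand-ins invented here, and
the array write replaces the real ESSA instruction, which the patch
issues once per 4K page frame of the 2^order block.

```c
#include <stddef.h>

#define SIM_NR_PAGES 64

/* Same operation codes as ESSA_SET_STABLE / ESSA_SET_UNUSED in the patch. */
enum sim_essa_op { SIM_SET_STABLE = 1, SIM_SET_UNUSED = 2 };

/* Stand-in for the host's per-page state; 0 means "never touched". */
static unsigned char sim_page_state[SIM_NR_PAGES];

/* Mirrors cmma_flag: hinting is a no-op when this is zero. */
static int sim_cmma_flag = 1;

/* Stand-in for one ESSA invocation on a single page frame. */
static void sim_essa(size_t pfn, enum sim_essa_op op)
{
	if (pfn < SIM_NR_PAGES)
		sim_page_state[pfn] = (unsigned char)op;
}

/* Freeing a 2^order block tells the host every frame in it is unused. */
static void sim_arch_free_page(size_t pfn, int order)
{
	size_t i;

	if (!sim_cmma_flag)
		return;
	for (i = 0; i < ((size_t)1 << order); i++)
		sim_essa(pfn + i, SIM_SET_UNUSED);
}

/* Allocating a block makes every frame stable before it gets content. */
static void sim_arch_alloc_page(size_t pfn, int order)
{
	size_t i;

	if (!sim_cmma_flag)
		return;
	for (i = 0; i < ((size_t)1 << order); i++)
		sim_essa(pfn + i, SIM_SET_STABLE);
}
```

Note that only two states are involved here; the volatile states of the
full series need the common code changes that this light version avoids.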



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-05-06 15:33   ` Martin Schwidefsky
@ 2008-05-06 19:46     ` Rik van Riel
  2008-05-07  3:49       ` Zachary Amsden
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2008-05-06 19:46 UTC (permalink / raw)
  To: schwidefsky
  Cc: Hugh Dickins, linux-kernel, linux-s390, virtualization, akpm,
	nickpiggin, zach, frankeh, rusty, jeremy, andrea, clameter,
	a.p.zijlstra

On Tue, 06 May 2008 17:33:02 +0200
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> > It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> > Zach will too, and I've added Andrea: their support would count a lot.
> > You have Nick on the list, good, I've added Christoph and Peter
> > (if you do resend, linux-mm might prove more useful than linux-kernel).
> > 
> > With support from rival virtualizers,
> > I do think you've a good chance of getting in.
> 
> Traffic on the guest page hinting patches died down again. Until another
> user shows up I guess that's it for the full version. 

I suspect one of the problems is that there are too many state transitions
to have it implemented with a low overhead on anything but S390, and even
there you need millicoded instructions to handle things.

If the number of transitions can be reduced, page hinting could be useful
for KVM, too.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-05-06 19:46     ` Rik van Riel
@ 2008-05-07  3:49       ` Zachary Amsden
  2008-05-07  7:00         ` Martin Schwidefsky
  0 siblings, 1 reply; 49+ messages in thread
From: Zachary Amsden @ 2008-05-07  3:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: schwidefsky, Hugh Dickins, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, jeremy, andrea,
	clameter, a.p.zijlstra


On Tue, 2008-05-06 at 15:46 -0400, Rik van Riel wrote:
> On Tue, 06 May 2008 17:33:02 +0200
> Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > On Thu, 2008-03-13 at 16:57 +0000, Hugh Dickins wrote:
> > > It's very encouraging to see Jeremy and Rusty weighing in.  I hope
> > > Zach will too, and I've added Andrea: their support would count a lot.
> > > You have Nick on the list, good, I've added Christoph and Peter
> > > (if you do resend, linux-mm might prove more useful than linux-kernel).
> > > 
> > > With support from rival virtualizers,
> > > I do think you've a good chance of getting in.
> > 
> > Traffic on the guest page hinting patches died down again. Until another
> > user shows up I guess that's it for the full version. 
> 
> I suspect one of the problems is that there are too many state transitions
> to have it implemented with a low overhead on anything but S390, and even
> > there you need millicoded instructions to handle things.
> 
> If the number of transitions can be reduced, page hinting could be useful
> for KVM, too.

Spot on Rik, if every transition becomes a hypercall (and a synchronous
one at that), it isn't workable for us.  If, on the other hand, you
share the state bits between the guest and hypervisor, you need a giant
(standalone) bit array for per-page state, which is neither convenient
for Linux nor the hypervisor.  I believe s390 has an 'instruction' to
migrate the state bits into the hypervisor per-physical-page data
without requiring a hypercall.

Zach
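
To give a feel for the standalone per-page state array mentioned above,
here is a small sketch that packs an assumed 2 bits of state per page
into a byte array. The 2-bit width, the names and the layout are
illustrative assumptions; a real guest/hypervisor shared array would
also need atomic updates, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_STATE_BITS		2	/* assumed: 4 states fit in 2 bits */
#define SIM_STATES_PER_BYTE	(8 / SIM_STATE_BITS)
#define SIM_STATE_MASK		((1u << SIM_STATE_BITS) - 1)

/* Bytes needed to hold the state of nr_pages pages. */
static size_t state_map_bytes(size_t nr_pages)
{
	return (nr_pages + SIM_STATES_PER_BYTE - 1) / SIM_STATES_PER_BYTE;
}

/* Read the state of page frame pfn. */
static unsigned int state_get(const uint8_t *map, size_t pfn)
{
	unsigned int shift =
		(unsigned int)(pfn % SIM_STATES_PER_BYTE) * SIM_STATE_BITS;

	return (map[pfn / SIM_STATES_PER_BYTE] >> shift) & SIM_STATE_MASK;
}

/* Overwrite the state of page frame pfn, leaving its neighbours alone. */
static void state_set(uint8_t *map, size_t pfn, unsigned int state)
{
	unsigned int shift =
		(unsigned int)(pfn % SIM_STATES_PER_BYTE) * SIM_STATE_BITS;
	uint8_t *byte = &map[pfn / SIM_STATES_PER_BYTE];

	*byte = (uint8_t)((*byte & ~(SIM_STATE_MASK << shift)) |
			  ((state & SIM_STATE_MASK) << shift));
}
```

Under these assumptions a guest with 1048576 4K pages (4 GiB) needs
state_map_bytes(1048576) = 262144 bytes, i.e. 256 KiB of array that
lives apart from both struct page and the hypervisor's own per-page
data, which is the inconvenience the message points at.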


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [patch 0/6] Guest page hinting version 6.
  2008-05-07  3:49       ` Zachary Amsden
@ 2008-05-07  7:00         ` Martin Schwidefsky
  0 siblings, 0 replies; 49+ messages in thread
From: Martin Schwidefsky @ 2008-05-07  7:00 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Rik van Riel, Hugh Dickins, linux-kernel, linux-s390,
	virtualization, akpm, nickpiggin, frankeh, rusty, jeremy, andrea,
	clameter, a.p.zijlstra

On Tue, 2008-05-06 at 20:49 -0700, Zachary Amsden wrote:
> > I suspect one of the problems is that there are too many state transitions
> > to have it implemented with a low overhead on anything but S390, and even
> > there you need millicoded instructions to handle things.
> > 
> > If the number of transitions can be reduced, page hinting could be useful
> > for KVM, too.
> 
> Spot on Rik, if every transition becomes a hypercall (and a synchronous
> one at that), it isn't workable for us.  If, on the other hand, you
> share the state bits between the guest and hypervisor, you need a giant
> (standalone) bit array for per-page state, which is neither convenient
> for Linux nor the hypervisor.  I believe s390 has an 'instruction' to
> migrate the state bits into the hypervisor per-physical-page data
> without requiring a hypercall.

That is why we invented the millicoded ESSA instruction on s390. We had
an emulation of the instruction to test things. It worked but was
awfully slow.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2008-05-07  7:12 UTC | newest]

Thread overview: 49+ messages
2008-03-12 13:21 [patch 0/6] Guest page hinting version 6 Martin Schwidefsky
2008-03-12 13:21 ` [patch 1/6] Guest page hinting: core + volatile page cache Martin Schwidefsky
2008-03-12 23:12   ` Rusty Russell
2008-03-13  9:24     ` Martin Schwidefsky
2008-03-12 13:21 ` [patch 2/6] Guest page hinting: volatile swap cache Martin Schwidefsky
2008-03-12 13:21 ` [patch 3/6] Guest page hinting: mlocked pages Martin Schwidefsky
2008-03-12 23:27   ` Rusty Russell
2008-03-13  9:13     ` Martin Schwidefsky
2008-03-12 13:21 ` [patch 4/6] Guest page hinting: writable page table entries Martin Schwidefsky
2008-03-12 23:35   ` Rusty Russell
2008-03-13  9:11     ` Martin Schwidefsky
2008-03-12 13:21 ` [patch 5/6] Guest page hinting: minor fault optimization Martin Schwidefsky
2008-03-12 13:21 ` [patch 6/6] Guest page hinting: s390 support Martin Schwidefsky
2008-03-12 16:19   ` Jeremy Fitzhardinge
2008-03-12 16:28     ` Martin Schwidefsky
2008-03-12 16:44       ` Jeremy Fitzhardinge
2008-03-12 16:59         ` Martin Schwidefsky
2008-03-12 17:48           ` Jeremy Fitzhardinge
2008-03-12 20:04             ` Anthony Liguori
2008-03-12 20:45               ` Jeremy Fitzhardinge
2008-03-12 20:56                 ` Anthony Liguori
2008-03-12 21:36                   ` Jeremy Fitzhardinge
2008-03-13  9:45                     ` Martin Schwidefsky
2008-03-13 16:07                       ` Jeremy Fitzhardinge
2008-03-13 16:17                         ` Jeremy Fitzhardinge
2008-03-13 16:55                           ` Martin Schwidefsky
2008-03-13 17:05                             ` Jeremy Fitzhardinge
2008-03-13 17:23                               ` Martin Schwidefsky
2008-03-13  9:42                   ` Martin Schwidefsky
2008-03-13  9:36                 ` Martin Schwidefsky
2008-03-13  9:32               ` Martin Schwidefsky
2008-03-12 22:41 ` [patch 0/6] Guest page hinting version 6 Rusty Russell
2008-03-13  9:47   ` Martin Schwidefsky
2008-03-13 16:57 ` Hugh Dickins
2008-03-13 17:14   ` Martin Schwidefsky
2008-03-13 17:45   ` Zachary Amsden
2008-03-13 19:45     ` Andrea Arcangeli
2008-03-13 21:41       ` Zachary Amsden
2008-03-13 18:41   ` Jeremy Fitzhardinge
2008-03-13 18:55     ` Hugh Dickins
2008-03-13 19:53       ` Zachary Amsden
2008-03-14 18:30         ` Jeremy Fitzhardinge
2008-03-14 21:32           ` Zachary Amsden
2008-03-14 21:37             ` Jeremy Fitzhardinge
2008-03-17  9:21             ` Martin Schwidefsky
2008-05-06 15:33   ` Martin Schwidefsky
2008-05-06 19:46     ` Rik van Riel
2008-05-07  3:49       ` Zachary Amsden
2008-05-07  7:00         ` Martin Schwidefsky
