* [patch][v2] swap: virtual swap readahead
@ 2009-06-02 22:37 ` Johannes Weiner
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2009-06-02 22:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Peter Zijlstra, Hugh Dickins, Andi Kleen, linux-mm,
	linux-kernel

Hi Andrew,

I redid the qsbench runs with a bigger page cluster (2^4).  It shows
improvement on both versions, with the patched one still performing
better.  Rik hinted to me that we can make the default even bigger
once we are better at avoiding reads of unrelated pages; I am
currently testing this.  Here are the timings for 2^4 (i.e. twice the
default number of) readahead pages:

vanilla:
1 x 2048M [20 runs]  user 101.41/101.06 [1.42] system 11.02/10.83 [0.92] real 368.44/361.31 [48.47]
2 x 1024M [20 runs]  user 101.42/101.23 [0.66] system 12.98/13.01 [0.56] real 338.45/338.56 [2.94]
4 x 540M [20 runs]  user 101.75/101.62 [1.03] system 10.05/9.52 [1.53] real 371.97/351.88 [77.69]
8 x 280M [20 runs]  user 103.35/103.33 [0.63] system 9.80/9.59 [1.72] real 453.48/473.21 [115.61]
16 x 128M [20 runs]  user 91.04/91.00 [0.86] system 8.95/9.41 [2.06] real 312.16/342.29 [100.53]

vswapra:
1 x 2048M [20 runs]  user 98.47/98.32 [1.33] system 9.85/9.90 [0.92] real 373.95/382.64 [26.77]
2 x 1024M [20 runs]  user 96.89/97.00 [0.44] system 9.52/9.48 [1.49] real 288.43/281.55 [53.12]
4 x 540M [20 runs]  user 98.74/98.70 [0.92] system 7.62/7.83 [1.25] real 291.15/296.94 [54.85]
8 x 280M [20 runs]  user 100.68/100.59 [0.53] system 7.59/7.62 [0.41] real 305.12/311.29 [26.09]
16 x 128M [20 runs]  user 88.67/88.50 [1.02] system 6.06/6.22 [0.72] real 205.29/221.65 [42.06]

Furthermore, I changed the patch to leave shmem alone for now, added
documentation for the new approach, and adjusted the changelog a bit.

Andi, I think the NUMA policy is already taken care of.  Can you have
another look at it?  Other than that your feedback was positive - can
I add your Acked-by?

	Hannes

---
The current swap readahead implementation reads a physically
contiguous group of swap slots around the faulting page to take
advantage of the disk head's position and in the hope that the
surrounding pages will be needed soon as well.

This works as long as the physical swap slot order approximates the
LRU order decently; otherwise it wastes memory and IO bandwidth by
reading in pages that are unlikely to be needed soon.

However, the physical swap slot layout diverges from the LRU order
with increasing swap activity, i.e. under high memory pressure, and
that is exactly the situation in which swapin should not waste any
memory or IO bandwidth, as both are the most contended resources at
that point.

Another approximation of the LRU relationship is the virtual (VMA)
order, as groups of VMA-related pages are usually used together.

This patch combines both the physical and the virtual hint to get a
good approximation of pages that are sensible to read ahead.

Where the two diverge, we can either read unrelated data, seek
heavily for related data, or, as this patch does, simply decrease the
readahead effort.

To achieve this, we have essentially two readahead windows of the same
size: one spans the virtual neighborhood of the faulting page, the
other its physical neighborhood.  We only read where the two overlap.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
---
 mm/swap_state.c |  115 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 99 insertions(+), 16 deletions(-)

version 2:
  o fall back to physical ra window for shmem
  o add documentation to the new ra algorithm

qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":

		vanilla				vswapra

1  x 2048M	391.25 ( 71.76) 384.56		445.55 (83.19) 415.41
2  x 1024M	384.25 ( 75.00) 423.08		290.26 (31.38) 299.51
4  x  540M	553.91 (100.02) 554.57		336.58 (52.49) 331.52
8  x  280M	561.08 ( 82.36) 583.12		319.13 (43.17) 307.69
16 x  128M	285.51 (113.20) 236.62		214.24 (62.37) 214.15

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
-/**
- * swapin_readahead - swap in pages in hope we need them soon
- * @entry: swap entry of this memory
- * @gfp_mask: memory allocation flags
- * @vma: user vma this address belongs to
- * @addr: target address for mempolicy
- *
- * Returns the struct page for entry and addr, after queueing swapin.
- *
+/*
  * Primitive swap readahead code. We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time.  We also make sure to queue
  * the 'original' request together with the readahead ones...
- *
- * This has been extended to use the NUMA policies from the mm triggering
- * the readahead.
- *
- * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
+				struct vm_area_struct *vma, unsigned long addr)
 {
 	int nr_pages;
 	struct page *page;
@@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
+
+/**
+ * swapin_readahead - swap in pages in hope we need them soon
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vma: user vma this address belongs to
+ * @addr: target address for mempolicy
+ *
+ * Returns the struct page for entry and addr, after queueing swapin.
+ *
+ * The readahead window is the virtual area around the faulting page,
+ * where the physical proximity of the swap slots is taken into
+ * account as well.
+ *
+ * While the swap allocation algorithm tries to keep LRU-related pages
+ * together on the swap backing, it is not reliable on heavy thrashing
+ * systems where concurrent reclaimers allocate swap slots and/or most
+ * anonymous memory pages are already in swap cache.
+ *
+ * On the virtual side, subgroups of VMA-related pages are usually
+ * used together, which gives another hint to LRU relationship.
+ *
+ * By taking both aspects into account, we get a good approximation of
+ * which pages are sensible to read together with the faulting one.
+ *
+ * This has been extended to use the NUMA policies from the mm
+ * triggering the readahead.
+ *
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
+ */
+struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start, pos, end;
+	unsigned long pmin, pmax;
+	int cluster, window;
+
+	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
+		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
+
+	cluster = 1 << page_cluster;
+	window = cluster << PAGE_SHIFT;
+
+	/* Physical range to read from */
+	pmin = swp_offset(entry) & ~(cluster - 1);
+	pmax = pmin + cluster;
+
+	/* Virtual range to read from */
+	start = addr & ~(window - 1);
+	end = start + window;
+
+	for (pos = start; pos < end; pos += PAGE_SIZE) {
+		struct page *page;
+		swp_entry_t swp;
+		spinlock_t *ptl;
+		pgd_t *pgd;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, pos);
+		if (!pgd_present(*pgd))
+			continue;
+		pud = pud_offset(pgd, pos);
+		if (!pud_present(*pud))
+			continue;
+		pmd = pmd_offset(pud, pos);
+		if (!pmd_present(*pmd))
+			continue;
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
+		if (!is_swap_pte(*pte)) {
+			pte_unmap_unlock(pte, ptl);
+			continue;
+		}
+		swp = pte_to_swp_entry(*pte);
+		pte_unmap_unlock(pte, ptl);
+
+		if (swp_type(swp) != swp_type(entry))
+			continue;
+		/*
+		 * Dont move the disk head too far away.  This also
+		 * throttles readahead while thrashing, where virtual
+		 * order diverges more and more from physical order.
+		 */
+		if (swp_offset(swp) > pmax)
+			continue;
+		if (swp_offset(swp) < pmin)
+			continue;
+		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
+		if (!page)
+			continue;
+		page_cache_release(page);
+	}
+	lru_add_drain();	/* Push any new pages onto the LRU now */
+	return read_swap_cache_async(entry, gfp_mask, vma, addr);
+}
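
[Editor's note: a userspace back-of-the-envelope sketch of the window
arithmetic described in the changelog above.  The page size, page_cluster
value and the faulting address/slot are made-up illustrative numbers, not
taken from the patch or the benchmarks.]

#include <stdio.h>

#define PAGE_SHIFT	12	/* assume 4K pages */

int main(void)
{
	unsigned long page_cluster = 3;			/* default tunable */
	unsigned long cluster = 1UL << page_cluster;	/* 8 swap slots */
	unsigned long window  = cluster << PAGE_SHIFT;	/* 8 pages = 32K */

	unsigned long addr = 0x7f1234567890UL;	/* hypothetical faulting address */
	unsigned long slot = 1234567UL;		/* hypothetical swap slot */

	/* Physical window: aligned block of swap slots around the faulting slot */
	unsigned long pmin = slot & ~(cluster - 1);
	unsigned long pmax = pmin + cluster;

	/* Virtual window: aligned block of addresses around the faulting page */
	unsigned long start = addr & ~(window - 1);
	unsigned long end   = start + window;

	printf("physical window: slots [%lu, %lu)\n", pmin, pmax);
	printf("virtual window:  addrs [%#lx, %#lx)\n", start, end);
	/*
	 * The patch walks the page tables over [start, end) and reads ahead
	 * only those pages whose swap slots also fall within the physical
	 * window, that is, where the two windows overlap.
	 */
	return 0;
}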


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-02 22:37 ` Johannes Weiner
@ 2009-06-02 23:34   ` Andi Kleen
  -1 siblings, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-06-02 23:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, linux-mm, linux-kernel

On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
> + *
> + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> + */
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +			struct vm_area_struct *vma, unsigned long addr)
> +{
> +	unsigned long start, pos, end;
> +	unsigned long pmin, pmax;
> +	int cluster, window;
> +
> +	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
> +		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> +	cluster = 1 << page_cluster;
> +	window = cluster << PAGE_SHIFT;
> +
> +	/* Physical range to read from */
> +	pmin = swp_offset(entry) & ~(cluster - 1);

Is cluster really properly sign extended on 64bit? Looks a little
dubious. long from the start would be safer

> +
> +	/* Virtual range to read from */
> +	start = addr & ~(window - 1);

Same.
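
[Editor's note: a small userspace illustration of the integer promotion
behind this concern.  With a signed int, ~(cluster - 1) is a negative int
and sign-extends to a full 64-bit mask, so the masking happens to work;
with an unsigned 32-bit type the mask would be zero-extended and silently
clear the upper half.  The value below is arbitrary.]

#include <stdio.h>

int main(void)
{
	unsigned long offset = 0xffff8800deadbeefUL;	/* arbitrary 64-bit value */
	int cluster_int = 16;		/* what the patch uses */
	unsigned int cluster_uint = 16;	/* hypothetical variant */
	long cluster_long = 16;		/* what is suggested above */

	printf("int:  %#lx\n", offset & ~(cluster_int - 1));	/* full mask */
	printf("uint: %#lx\n", offset & ~(cluster_uint - 1));	/* upper bits lost */
	printf("long: %#lx\n", offset & ~(cluster_long - 1));	/* full mask */
	return 0;
}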

> +		pgd = pgd_offset(vma->vm_mm, pos);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		pud = pud_offset(pgd, pos);
> +		if (!pud_present(*pud))
> +			continue;
> +		pmd = pmd_offset(pud, pos);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);

You could be more efficient here by using the standard mm/* nested loop
pattern that avoids relookup of everything in each iteration. I suppose
it would mainly make a difference with 32bit highpte where mapping a pte
can be somewhat costly. And you would take less locks this way.

> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> +		if (!page)
> +			continue;

That's out of memory, break would be better here because prefetch
while oom is usually harmful.

> +		page_cache_release(page);
> +	}
> +	lru_add_drain();	/* Push any new pages onto the LRU now */
> +	return read_swap_cache_async(entry, gfp_mask, vma, addr);

Shouldn't that page be already handled in the loop earlier? Why doing that
again? It would be better to remember it from there.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
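
[Editor's note: a rough sketch of the "remember it from the loop" idea
raised above, reusing only helpers already present in the patch.  It shows
only the tail of the loop and the function epilogue; hypothetical and
untested.]

	struct page *faulted = NULL;
	...
		/* inside the loop, instead of always dropping the reference: */
		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
		if (!page)
			break;
		if (swp.val == entry.val && !faulted)
			faulted = page;		/* keep it for the caller */
		else
			page_cache_release(page);
	}
	lru_add_drain();	/* Push any new pages onto the LRU now */
	if (faulted)
		return faulted;
	/* The faulting pte raced or fell outside the window: read it directly. */
	return read_swap_cache_async(entry, gfp_mask, vma, addr);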


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-02 23:34   ` Andi Kleen
@ 2009-06-03 13:27     ` Johannes Weiner
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2009-06-03 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
> > + *
> > + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> > + */
> > +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > +			struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +	unsigned long start, pos, end;
> > +	unsigned long pmin, pmax;
> > +	int cluster, window;
> > +
> > +	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
> > +		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> > +
> > +	cluster = 1 << page_cluster;
> > +	window = cluster << PAGE_SHIFT;
> > +
> > +	/* Physical range to read from */
> > +	pmin = swp_offset(entry) & ~(cluster - 1);
> 
> Is cluster really properly sign extended on 64bit? Looks a little
> dubious. long from the start would be safer

Fixed.

> > +	/* Virtual range to read from */
> > +	start = addr & ~(window - 1);
> 
> Same.

Fixed.

> > +		pgd = pgd_offset(vma->vm_mm, pos);
> > +		if (!pgd_present(*pgd))
> > +			continue;
> > +		pud = pud_offset(pgd, pos);
> > +		if (!pud_present(*pud))
> > +			continue;
> > +		pmd = pmd_offset(pud, pos);
> > +		if (!pmd_present(*pmd))
> > +			continue;
> > +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> 
> You could be more efficient here by using the standard mm/* nested loop
> pattern that avoids relookup of everything in each iteration. I suppose
> it would mainly make a difference with 32bit highpte where mapping a pte
> can be somewhat costly. And you would take less locks this way.

I ran into weird problems here.  The above version is actually faster
in the benchmarks than writing a nested level walker or using
walk_page_range().  Still digging but it can take some time.  Busy
week :(
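
[Editor's note: for reference, a condensed sketch of the mm/* nested-walk
shape referred to above; the outer levels advance with *_addr_end() and the
pte level maps and locks the page table once per pmd range.  The function
names are made up, the middle (pud/pmd) levels are elided, and this is
untested.]

static void vswapra_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
			      unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;

	/* One map/lock for the whole range covered by this pmd */
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	do {
		if (is_swap_pte(*pte)) {
			/* inspect the swap entry here */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
}

static void vswapra_walk(struct vm_area_struct *vma,
			 unsigned long addr, unsigned long end)
{
	unsigned long next;
	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);

	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		/* ...same do/while pattern for the pud and pmd levels,
		 * ending in vswapra_pte_range() for each present pmd... */
	} while (pgd++, addr = next, addr != end);
}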

> > +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> > +		if (!page)
> > +			continue;
> 
> That's out of memory, break would be better here because prefetch
> while oom is usually harmful.

It can also happen due to a race with something releasing the swap
slot (i.e. swap_duplicate() fails).  But the old version did a break
too and this patch shouldn't do it differently.  Fixed.

> > +		page_cache_release(page);
> > +	}
> > +	lru_add_drain();	/* Push any new pages onto the LRU now */
> > +	return read_swap_cache_async(entry, gfp_mask, vma, addr);
> 
> Shouldn't that page be already handled in the loop earlier? Why doing that
> again? It would be better to remember it from there.

When using a nested page-table walker, communicating even more state
back and forth gets pretty ugly.  I'll see what I can do.

Thanks for your input Andi,

	Hannes


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-03 13:27     ` Johannes Weiner
@ 2009-06-03 14:51       ` Rik van Riel
  -1 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2009-06-03 14:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrew Morton, Peter Zijlstra, Hugh Dickins,
	linux-mm, linux-kernel

Johannes Weiner wrote:
> On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
>> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:

>>> +		pgd = pgd_offset(vma->vm_mm, pos);
>>> +		if (!pgd_present(*pgd))
>>> +			continue;
>>> +		pud = pud_offset(pgd, pos);
>>> +		if (!pud_present(*pud))
>>> +			continue;
>>> +		pmd = pmd_offset(pud, pos);
>>> +		if (!pmd_present(*pmd))
>>> +			continue;
>>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
>> You could be more efficient here by using the standard mm/* nested loop
>> pattern that avoids relookup of everything in each iteration. I suppose
>> it would mainly make a difference with 32bit highpte where mapping a pte
>> can be somewhat costly. And you would take less locks this way.
> 
> I ran into weird problems here.  The above version is actually faster
> in the benchmarks than writing a nested level walker or using
> walk_page_range().  Still digging but it can take some time.  Busy
> week :(

I'm not too worried about not walking the page tables,
because swap is an extreme slow path anyway.

-- 
All rights reversed.


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-03 14:51       ` Rik van Riel
@ 2009-06-03 15:02         ` Andi Kleen
  -1 siblings, 0 replies; 19+ messages in thread
From: Andi Kleen @ 2009-06-03 15:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Andi Kleen, Andrew Morton, Peter Zijlstra,
	Hugh Dickins, linux-mm, linux-kernel

> I'm not too worried about not walking the page tables,
> because swap is an extreme slow path anyway.

It was more about taking fewer locks and doing fewer mappings.
Especially highmem pte mappings can be quite expensive, because
they have to flush parts of the TLB.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-02 22:37 ` Johannes Weiner
@ 2009-06-04  1:46   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-04  1:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, linux-mm, linux-kernel

On Wed, 3 Jun 2009 00:37:39 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> [...]
> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;

I wonder (I just wonder..) can we add code like following here ?

   /* we do _readahead_ here. Then, we don't want to add too much jobs to vm/IO*/
   if (swp != entry)
	gfp_mask &= ~__GFP_WAIT
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);

too slow ?

Bye.
-Kame


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-04  1:46   ` KAMEZAWA Hiroyuki
@ 2009-06-04 10:04     ` Johannes Weiner
  -1 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2009-06-04 10:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, linux-mm, linux-kernel

On Thu, Jun 04, 2009 at 10:46:28AM +0900, KAMEZAWA Hiroyuki wrote:

> I wonder (I just wonder..) can we add code like following here ?
> 
>    /* we do _readahead_ here. Then, we don't want to add too much jobs to vm/IO*/
>    if (swp != entry)
> 	gfp_mask &= ~__GFP_WAIT
> > +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> 
> too slow ?

Good idea, certainly worth evaluating.  But not in this patch, I don't
want to change _everything_ at once :-)

	Thanks, Hannes
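
[Editor's note: a compilable form of the idea discussed above, as a sketch
only; the speculative (non-faulting) reads drop __GFP_WAIT so they fail
quickly instead of stalling for reclaim.  Untested, and the swp_entry_t
comparison goes through .val since it is a wrapper struct.]

		gfp_t ra_gfp = gfp_mask;

		/* Only stall/reclaim for the faulting page, not the readahead ones */
		if (swp.val != entry.val)
			ra_gfp &= ~__GFP_WAIT;
		page = read_swap_cache_async(swp, ra_gfp, vma, pos);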


* Re: [patch][v2] swap: virtual swap readahead
  2009-06-03 13:27     ` Johannes Weiner
@ 2009-06-05 11:03       ` Minchan Kim
  -1 siblings, 0 replies; 19+ messages in thread
From: Minchan Kim @ 2009-06-05 11:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrew Morton, Rik van Riel, Peter Zijlstra,
	Hugh Dickins, linux-mm, linux-kernel

Hi, Hannes.

On Wed, Jun 3, 2009 at 10:27 PM, Johannes Weiner<hannes@cmpxchg.org> wrote:
> On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
>> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
[...]
>> > +           page = read_swap_cache_async(swp, gfp_mask, vma, pos);
>> > +           if (!page)
>> > +                   continue;
>>
>> That's out of memory, break would be better here because prefetch
>> while oom is usually harmful.
>
> It can also happen due to a race with something releasing the swap
> slot (i.e. swap_duplicate() fails).  But the old version did a break
> too and this patch shouldn't do it differently.  Fixed.

I think it would be better to read the fault page before the readahead
pages.  That's because:
1) Under memory pressure, the readahead pages could prevent the fault
   page itself from being read in.
2) If we can't get the fault page, the extra (readahead) pages are of
   no use anyway - they only waste memory and IO bandwidth, which is
   exactly what you want to avoid.
3) If we read the fault page first and hit OOM there, we can also stop
   the readahead right away.
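
(To make the ordering concrete, here is a rough sketch against the
plain physical readahead loop and the 2.6.30 read_swap_cache_async()
interface - hypothetical code, not the posted patch, and clamping of
the window to valid swap offsets is omitted:)

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr)
{
	unsigned long offset, end_offset;
	struct page *page;

	/*
	 * Read the faulting page first: if even this allocation
	 * fails, we are far too short on memory to speculate.
	 */
	page = read_swap_cache_async(entry, gfp_mask, vma, addr);
	if (!page)
		return NULL;

	offset = swp_offset(entry) & ~((1UL << page_cluster) - 1);
	end_offset = offset + (1UL << page_cluster);
	for (; offset < end_offset; offset++) {
		struct page *ra_page;

		ra_page = read_swap_cache_async(
				swp_entry(swp_type(entry), offset),
				gfp_mask, vma, addr);
		if (!ra_page)
			break;	/* allocation failed, stop readahead */
		page_cache_release(ra_page);
	}
	lru_add_drain();	/* push any new pages onto the LRU */
	return page;
}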

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch][v2] swap: virtual swap readahead
  2009-06-02 22:37 ` Johannes Weiner
                   ` (2 preceding siblings ...)
  (?)
@ 2009-06-07 17:55 ` Hugh Dickins
  2009-06-08 15:21     ` Johannes Weiner
  -1 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2009-06-07 17:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Hugh Dickins,
	Andi Kleen, linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 11705 bytes --]

On Wed, 3 Jun 2009, Johannes Weiner wrote:
> Hi Andrew,
> 
> I redid the qsbench runs with a bigger page cluster (2^4).  It shows
> improvement on both versions, the patched one still performing better.
> Rik hinted me that we can make the default even bigger when we are
> better at avoiding reading unrelated pages.  I am currently testing
> this.  Here are the timings for 2^4 (i.e. twice the) ra pages:
> 
> vanilla:
> 1 x 2048M [20 runs]  user 101.41/101.06 [1.42] system 11.02/10.83 [0.92] real 368.44/361.31 [48.47]
> 2 x 1024M [20 runs]  user 101.42/101.23 [0.66] system 12.98/13.01 [0.56] real 338.45/338.56 [2.94]
> 4 x 540M [20 runs]  user 101.75/101.62 [1.03] system 10.05/9.52 [1.53] real 371.97/351.88 [77.69]
> 8 x 280M [20 runs]  user 103.35/103.33 [0.63] system 9.80/9.59 [1.72] real 453.48/473.21 [115.61]
> 16 x 128M [20 runs]  user 91.04/91.00 [0.86] system 8.95/9.41 [2.06] real 312.16/342.29 [100.53]
> 
> vswapra:
> 1 x 2048M [20 runs]  user 98.47/98.32 [1.33] system 9.85/9.90 [0.92] real 373.95/382.64 [26.77]
> 2 x 1024M [20 runs]  user 96.89/97.00 [0.44] system 9.52/9.48 [1.49] real 288.43/281.55 [53.12]
> 4 x 540M [20 runs]  user 98.74/98.70 [0.92] system 7.62/7.83 [1.25] real 291.15/296.94 [54.85]
> 8 x 280M [20 runs]  user 100.68/100.59 [0.53] system 7.59/7.62 [0.41] real 305.12/311.29 [26.09]
> 16 x 128M [20 runs]  user 88.67/88.50 [1.02] system 6.06/6.22 [0.72] real 205.29/221.65 [42.06]
> 
> Furthermore I changed the patch to leave shmem alone for now and added
> documentation for the new approach.  And I adjusted the changelog a
> bit.
> 
> Andi, I think the NUMA policy is already taken care of.  Can you have
> another look at it?  Other than that you gave positive feedback - can
> I add your acked-by?
> 
> 	Hannes
> 
> ---
> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
> 
> This works as long as the physical swap slot order approximates the
> LRU order decently, otherwise it wastes memory and IO bandwidth to
> read in pages that are unlikely to be needed soon.
> 
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. high memory pressure situations,
> and this is exactly the situation where swapin should not waste any
> memory or IO bandwidth as both are the most contended resources at
> this point.
> 
> Another approximation for LRU-relation is the VMA order as groups of
> VMA-related pages are usually used together.
> 
> This patch combines both the physical and the virtual hint to get a
> good approximation of pages that are sensible to read ahead.
> 
> When both diverge, we either read unrelated data, seek heavily for
> related data, or, what this patch does, just decrease the readahead
> efforts.
> 
> To achieve this, we have essentially two readahead windows of the same
> size: one spans the virtual, the other one the physical neighborhood
> of the faulting page.  We only read where both areas overlap.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Andi Kleen <andi@firstfloor.org>

I think this is a great idea, a very promising approach.  I like it
so much better than Andrew's and others' proposals to dedicate
areas of swap space to distinct virtual objects: which, as you
rightly pointed out, condemn us to unnecessary seeking when
writing swap (and is even more of an issue if we're writing
to SSD rather than HDD).

It would be nice to get results from a wider set of benchmarks
than just qsbench; but I don't think qsbench is biased in your
favour, and I don't think you need go to too much trouble on
that - let's just aim to get your work into mmotm, then we can
all play around with it for a while.  I suppose what I'd most
like is to try it with shmem, which you've understandably left
out for now.

You'll be hating me for the way I made shmem_truncate_range() etc.
nigh incomprehensible when enabling highmem index pages there.
Christoph Rohland's original was much nicer.  Again and again and
again I've wanted to throw all that out, and keep swap entries in
the standard radix tree which keeps the struct page pointers; but
again and again and again, I've been unable to justify losing the
highmem index ability - for a while it seemed as if x86_64 was
going to make highmem a thing of the past, and it's certainly helped
us to ignore 32GB 32-bit; but I think 4GB 32-bit is here a long while.

Though I like the way you localized it all into swapin_readahead(),
I'd prefer to keep ptes out of swap_state.c, and think several issues
will fall away if you turn your patch around.  You'll avoid the pte code
in swap_state.c, you'll satisfy Andi's concerns about locking/kmapping
overhead, and you'll find shmem much much easier, if instead of peering
back at where you've come from in swapin_readahead(), you make the outer
levels (do_swap_page and shmem_getpage) pass a vector of swap entries to
swapin_readahead()?  That vector on stack, and copied from the one page
table or index page (don't bother to cross page table or index page
boundaries) while it was mapped.
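
(A rough sketch of how that interface could look - the names, the
fixed window size and the helper are all mine, nothing here is from
the patch:)

#define SWAP_RA_MAX	16	/* assumed window, 1 << page_cluster */

/*
 * Called by do_swap_page() with the page table mapped and the pte
 * lock held; collects the swap entries in the aligned window around
 * the fault so swapin_readahead() never has to look at ptes itself.
 * As long as SWAP_RA_MAX divides PTRS_PER_PTE, the window cannot
 * cross a page-table page.  Vma boundary checks are omitted here.
 */
static int swap_readahead_entries(pte_t *fault_pte, unsigned long addr,
				  swp_entry_t *entries)
{
	pte_t *pte = fault_pte - ((addr >> PAGE_SHIFT) & (SWAP_RA_MAX - 1));
	int i, nr = 0;

	for (i = 0; i < SWAP_RA_MAX; i++, pte++)
		if (is_swap_pte(*pte))
			entries[nr++] = pte_to_swp_entry(*pte);
	return nr;
}

/* swapin_readahead() would then take the vector instead of ptes: */
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr,
			swp_entry_t *entries, int nr);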

It's probably irrelevant to you, but I've attached an untested patch
of mine which stomps somewhat on this area: I was shocked to notice
shmem_getpage() in a list of deep stack offenders a few months back,
realized it was those unpleasant NUMA mempolicy pseudo-vmas, and made
a patch to get rid of them.  I've rebased it to 2.6.30-rc8 and checked
that the resulting kernel runs, but not really tested it; and I think
I didn't even try to get the mpol reference counting right (tends to
be an issue precisely in swapin_readahead, where one mpol ends up used
repeatedly) - mpol refcounting is an arcane art only Lee understands!
I've attached the patch because you may want to glance at it and
decide, either that it's something which is helpful to the direction
you're going in and you'd like to base upon it, or that it's a
distraction and you'd prefer me to keep it to myself until your
changes are in.

But your patch below is incomplete, isn't it?  The old swapin_readahead()
is now called swapin_readahead_phys(), and you want shmem_getpage() to be
using that for now: but no prototype for it and no change to mm/shmem.c.

Hugh

> ---
>  mm/swap_state.c |  115 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 99 insertions(+), 16 deletions(-)
> 
> version 2:
>   o fall back to physical ra window for shmem
>   o add documentation to the new ra algorithm
> 
> qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":
> 
> 		vanilla				vswapra
> 
> 1  x 2048M	391.25 ( 71.76) 384.56		445.55 (83.19) 415.41
> 2  x 1024M	384.25 ( 75.00) 423.08		290.26 (31.38) 299.51
> 4  x  540M	553.91 (100.02) 554.57		336.58 (52.49) 331.52
> 8  x  280M	561.08 ( 82.36) 583.12		319.13 (43.17) 307.69
> 16 x  128M	285.51 (113.20) 236.62		214.24 (62.37) 214.15
> 
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
>  	return found_page;
>  }
>  
> -/**
> - * swapin_readahead - swap in pages in hope we need them soon
> - * @entry: swap entry of this memory
> - * @gfp_mask: memory allocation flags
> - * @vma: user vma this address belongs to
> - * @addr: target address for mempolicy
> - *
> - * Returns the struct page for entry and addr, after queueing swapin.
> - *
> +/*
>   * Primitive swap readahead code. We simply read an aligned block of
>   * (1 << page_cluster) entries in the swap area. This method is chosen
>   * because it doesn't cost us any seek time.  We also make sure to queue
>   * the 'original' request together with the readahead ones...
> - *
> - * This has been extended to use the NUMA policies from the mm triggering
> - * the readahead.
> - *
> - * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
>   */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -			struct vm_area_struct *vma, unsigned long addr)
> +static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
> +				struct vm_area_struct *vma, unsigned long addr)
>  {
>  	int nr_pages;
>  	struct page *page;
> @@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  	return read_swap_cache_async(entry, gfp_mask, vma, addr);
>  }
> +
> +/**
> + * swapin_readahead - swap in pages in hope we need them soon
> + * @entry: swap entry of this memory
> + * @gfp_mask: memory allocation flags
> + * @vma: user vma this address belongs to
> + * @addr: target address for mempolicy
> + *
> + * Returns the struct page for entry and addr, after queueing swapin.
> + *
> + * The readahead window is the virtual area around the faulting page,
> + * where the physical proximity of the swap slots is taken into
> + * account as well.
> + *
> + * While the swap allocation algorithm tries to keep LRU-related pages
> + * together on the swap backing, it is not reliable on heavy thrashing
> + * systems where concurrent reclaimers allocate swap slots and/or most
> + * anonymous memory pages are already in swap cache.
> + *
> + * On the virtual side, subgroups of VMA-related pages are usually
> + * used together, which gives another hint to LRU relationship.
> + *
> + * By taking both aspects into account, we get a good approximation of
> + * which pages are sensible to read together with the faulting one.
> + *
> + * This has been extended to use the NUMA policies from the mm
> + * triggering the readahead.
> + *
> + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> + */
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +			struct vm_area_struct *vma, unsigned long addr)
> +{
> +	unsigned long start, pos, end;
> +	unsigned long pmin, pmax;
> +	int cluster, window;
> +
> +	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
> +		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> +	cluster = 1 << page_cluster;
> +	window = cluster << PAGE_SHIFT;
> +
> +	/* Physical range to read from */
> +	pmin = swp_offset(entry) & ~(cluster - 1);
> +	pmax = pmin + cluster;
> +
> +	/* Virtual range to read from */
> +	start = addr & ~(window - 1);
> +	end = start + window;
> +
> +	for (pos = start; pos < end; pos += PAGE_SIZE) {
> +		struct page *page;
> +		swp_entry_t swp;
> +		spinlock_t *ptl;
> +		pgd_t *pgd;
> +		pud_t *pud;
> +		pmd_t *pmd;
> +		pte_t *pte;
> +
> +		pgd = pgd_offset(vma->vm_mm, pos);
> +		if (!pgd_present(*pgd))
> +			continue;
> +		pud = pud_offset(pgd, pos);
> +		if (!pud_present(*pud))
> +			continue;
> +		pmd = pmd_offset(pud, pos);
> +		if (!pmd_present(*pmd))
> +			continue;
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> +		if (!is_swap_pte(*pte)) {
> +			pte_unmap_unlock(pte, ptl);
> +			continue;
> +		}
> +		swp = pte_to_swp_entry(*pte);
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (swp_type(swp) != swp_type(entry))
> +			continue;
> +		/*
> +		 * Dont move the disk head too far away.  This also
> +		 * throttles readahead while thrashing, where virtual
> +		 * order diverges more and more from physical order.
> +		 */
> +		if (swp_offset(swp) > pmax)
> +			continue;
> +		if (swp_offset(swp) < pmin)
> +			continue;
> +		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> +		if (!page)
> +			continue;
> +		page_cache_release(page);
> +	}
> +	lru_add_drain();	/* Push any new pages onto the LRU now */
> +	return read_swap_cache_async(entry, gfp_mask, vma, addr);
> +}

[-- Attachment #2: alloc_page_mpol patch --]
[-- Type: TEXT/PLAIN, Size: 13630 bytes --]

[PATCH] mm: alloc_page_mpol

Introduce alloc_page_mpol(), to get rid of those mpol pseudo-vmas from
shmem.c, which caused shmem_getpage() to show up in deep stack reports.

Not-yet-Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---

 include/linux/gfp.h       |    6 +++
 include/linux/mempolicy.h |   10 ++++++
 include/linux/swap.h      |    9 ++---
 mm/memory.c               |    5 +--
 mm/mempolicy.c            |   75 ++++++++++++++++++++++++++--------------------
 mm/shmem.c                |   66 ++++------------------------------------
 mm/swap_state.c           |   10 +++---
 mm/swapfile.c             |    4 +-
 8 files changed, 80 insertions(+), 105 deletions(-)

--- 2.6.30-rc8/include/linux/gfp.h	2009-04-08 18:26:14.000000000 +0100
+++ linux/include/linux/gfp.h	2009-06-07 13:56:58.000000000 +0100
@@ -7,6 +7,7 @@
 #include <linux/topology.h>
 
 struct vm_area_struct;
+struct mempolicy;
 
 /*
  * GFP bitmasks..
@@ -213,10 +214,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_mpol(gfp_t gfp_mask,
+			struct mempolicy *mpol, pgoff_t pgoff);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr)	alloc_pages(gfp_mask, 0)
+#define alloc_page_mpol(gfp_mask, mpol, pgoff)	alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
--- 2.6.30-rc8/include/linux/mempolicy.h	2008-10-09 23:13:53.000000000 +0100
+++ linux/include/linux/mempolicy.h	2009-06-07 13:56:58.000000000 +0100
@@ -62,6 +62,7 @@ enum {
 #include <linux/pagemap.h>
 
 struct mm_struct;
+struct vm_area_struct;
 
 #ifdef CONFIG_NUMA
 
@@ -147,6 +148,9 @@ static inline struct mempolicy *mpol_dup
 	return pol;
 }
 
+extern struct mempolicy *get_vma_policy(struct task_struct *task,
+			struct vm_area_struct *vma, unsigned long addr);
+
 #define vma_policy(vma) ((vma)->vm_policy)
 #define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
 
@@ -294,6 +298,12 @@ mpol_shared_policy_lookup(struct shared_
 {
 	return NULL;
 }
+
+static inline struct mempolicy *get_vma_policy(struct task_struct *task,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	return NULL;
+}
 
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
--- 2.6.30-rc8/include/linux/swap.h	2009-06-03 10:13:27.000000000 +0100
+++ linux/include/linux/swap.h	2009-06-07 13:56:58.000000000 +0100
@@ -12,9 +12,8 @@
 #include <asm/atomic.h>
 #include <asm/page.h>
 
-struct notifier_block;
-
 struct bio;
+struct mempolicy;
 
 #define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
 #define SWAP_FLAG_PRIO_MASK	0x7fff
@@ -290,9 +289,9 @@ extern void free_page_and_swap_cache(str
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page *lookup_swap_cache(swp_entry_t);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr);
+			struct mempolicy *mpol, pgoff_t pgoff);
 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr);
+			struct mempolicy *mpol, pgoff_t pgoff);
 
 /* linux/mm/swapfile.c */
 extern long nr_swap_pages;
@@ -377,7 +376,7 @@ static inline void swap_free(swp_entry_t
 }
 
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+					struct mempolicy *mpol, pgoff_t pgoff)
 {
 	return NULL;
 }
--- 2.6.30-rc8/mm/memory.c	2009-05-09 09:06:44.000000000 +0100
+++ linux/mm/memory.c	2009-06-07 13:56:58.000000000 +0100
@@ -2467,8 +2467,9 @@ static int do_swap_page(struct mm_struct
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
-		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+					get_vma_policy(current, vma, address),
+					linear_page_index(vma, address));
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
--- 2.6.30-rc8/mm/mempolicy.c	2009-03-23 23:12:14.000000000 +0000
+++ linux/mm/mempolicy.c	2009-06-07 13:56:58.000000000 +0100
@@ -1304,7 +1304,7 @@ asmlinkage long compat_sys_mbind(compat_
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct task_struct *task,
+struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = task->mempolicy;
@@ -1425,9 +1425,8 @@ unsigned slab_node(struct mempolicy *pol
 	}
 }
 
-/* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
-		struct vm_area_struct *vma, unsigned long off)
+/* Determine a node number for interleave */
+static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t pgoff)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
 	unsigned target;
@@ -1436,7 +1435,7 @@ static unsigned offset_il_node(struct me
 
 	if (!nnodes)
 		return numa_node_id();
-	target = (unsigned int)off % nnodes;
+	target = (unsigned int)pgoff % nnodes;
 	c = 0;
 	do {
 		nid = next_node(nid, pol->v.nodes);
@@ -1445,28 +1444,6 @@ static unsigned offset_il_node(struct me
 	return nid;
 }
 
-/* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
-		 struct vm_area_struct *vma, unsigned long addr, int shift)
-{
-	if (vma) {
-		unsigned long off;
-
-		/*
-		 * for small pages, there is no difference between
-		 * shift and PAGE_SHIFT, so the bit-shift is safe.
-		 * for huge pages, since vm_pgoff is in units of small
-		 * pages, we need to shift off the always 0 bits to get
-		 * a useful offset.
-		 */
-		BUG_ON(shift < PAGE_SHIFT);
-		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
-		off += (addr - vma->vm_start) >> shift;
-		return offset_il_node(pol, vma, off);
-	} else
-		return interleave_nodes(pol);
-}
-
 #ifdef CONFIG_HUGETLBFS
 /*
  * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
@@ -1491,8 +1468,9 @@ struct zonelist *huge_zonelist(struct vm
 	*nodemask = NULL;	/* assume !MPOL_BIND */
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
-		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-				huge_page_shift(hstate_vma(vma))), gfp_flags);
+		pgoff_t pgoff = linear_page_index(vma, addr);
+		pgoff >>= huge_page_shift(hstate_vma(vma));
+		zl = node_zonelist(interleave_nid(*mpol, pgoff), gfp_flags);
 	} else {
 		zl = policy_zonelist(gfp_flags, *mpol);
 		if ((*mpol)->mode == MPOL_BIND)
@@ -1550,7 +1528,40 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
+		if (vma)
+			nid = interleave_nid(pol, linear_page_index(vma, addr));
+		else
+			nid = interleave_nodes(pol);
+		mpol_cond_put(pol);
+		return alloc_page_interleave(gfp, 0, nid);
+	}
+	zl = policy_zonelist(gfp, pol);
+	if (unlikely(mpol_needs_cond_ref(pol))) {
+		/*
+		 * slow path: ref counted shared policy
+		 */
+		struct page *page =  __alloc_pages_nodemask(gfp, 0,
+						zl, policy_nodemask(gfp, pol));
+		__mpol_put(pol);
+		return page;
+	}
+	/*
+	 * fast path:  default or task policy
+	 */
+	return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
+}
+
+struct page *
+alloc_page_mpol(gfp_t gfp, struct mempolicy *pol, pgoff_t pgoff)
+{
+	struct zonelist *zl;
+
+	cpuset_update_task_memory_state();
+
+	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
+		unsigned int nid;
+
+		nid = interleave_nid(pol, pgoff);
 		mpol_cond_put(pol);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
@@ -1757,11 +1768,11 @@ static void sp_insert(struct shared_poli
 struct mempolicy *
 mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
 {
-	struct mempolicy *pol = NULL;
+	struct mempolicy *pol = &default_policy;
 	struct sp_node *sn;
 
 	if (!sp->root.rb_node)
-		return NULL;
+		return pol;
 	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
--- 2.6.30-rc8/mm/shmem.c	2009-05-09 09:06:44.000000000 +0100
+++ linux/mm/shmem.c	2009-06-07 13:56:58.000000000 +0100
@@ -1106,8 +1106,7 @@ redirty:
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
-#ifdef CONFIG_TMPFS
+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
 static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 	char buffer[64];
@@ -1131,64 +1130,11 @@ static struct mempolicy *shmem_get_sbmpo
 	}
 	return mpol;
 }
-#endif /* CONFIG_TMPFS */
-
-static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
-{
-	struct mempolicy mpol, *spol;
-	struct vm_area_struct pvma;
-	struct page *page;
-
-	spol = mpol_cond_copy(&mpol,
-				mpol_shared_policy_lookup(&info->policy, idx));
-
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
-	pvma.vm_policy = spol;
-	page = swapin_readahead(entry, gfp, &pvma, 0);
-	return page;
-}
-
-static struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
-{
-	struct vm_area_struct pvma;
-
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
-
-	/*
-	 * alloc_page_vma() will drop the shared policy reference
-	 */
-	return alloc_page_vma(gfp, &pvma, 0);
-}
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
+#else
 static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *p)
 {
 }
-#endif /* CONFIG_TMPFS */
-
-static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
-{
-	return swapin_readahead(entry, gfp, NULL, 0);
-}
-
-static inline struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
-{
-	return alloc_page(gfp);
-}
-#endif /* CONFIG_NUMA */
 
-#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
 static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 {
 	return NULL;
@@ -1268,7 +1214,9 @@ repeat:
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(swap, gfp, info, idx);
+			swappage = swapin_readahead(swap, gfp,
+				mpol_shared_policy_lookup(&info->policy, idx),
+						idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1395,7 +1343,9 @@ repeat:
 			int ret;
 
 			spin_unlock(&info->lock);
-			filepage = shmem_alloc_page(gfp, info, idx);
+			filepage = alloc_page_mpol(gfp,
+				mpol_shared_policy_lookup(&info->policy, idx),
+							idx);
 			if (!filepage) {
 				shmem_unacct_blocks(info->flags, 1);
 				shmem_free_blocks(inode, 1);
--- 2.6.30-rc8/mm/swap_state.c	2009-06-03 10:13:27.000000000 +0100
+++ linux/mm/swap_state.c	2009-06-07 13:56:58.000000000 +0100
@@ -266,7 +266,7 @@ struct page * lookup_swap_cache(swp_entr
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct mempolicy *mpol, pgoff_t pgoff)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
@@ -285,7 +285,7 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(gfp_mask, vma, addr);
+			new_page = alloc_page_mpol(gfp_mask, mpol, pgoff);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
@@ -345,7 +345,7 @@ struct page *read_swap_cache_async(swp_e
  * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct mempolicy *mpol, pgoff_t pgoff)
 {
 	int nr_pages;
 	struct page *page;
@@ -363,11 +363,11 @@ struct page *swapin_readahead(swp_entry_
 	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr);
+						gfp_mask, mpol, pgoff);
 		if (!page)
 			break;
 		page_cache_release(page);
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
-	return read_swap_cache_async(entry, gfp_mask, vma, addr);
+	return read_swap_cache_async(entry, gfp_mask, mpol, pgoff);
 }
--- 2.6.30-rc8/mm/swapfile.c	2009-03-23 23:12:14.000000000 +0000
+++ linux/mm/swapfile.c	2009-06-07 13:56:58.000000000 +0100
@@ -951,8 +951,8 @@ static int try_to_unuse(unsigned int typ
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
-		page = read_swap_cache_async(entry,
-					GFP_HIGHUSER_MOVABLE, NULL, 0);
+		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
+					get_vma_policy(current, NULL, 0), 0);
 		if (!page) {
 			/*
 			 * Either swap_duplicate() failed because entry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch][v2] swap: virtual swap readahead
  2009-06-07 17:55 ` Hugh Dickins
@ 2009-06-08 15:21     ` Johannes Weiner
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2009-06-08 15:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Rik van Riel, Peter Zijlstra, Andi Kleen,
	linux-mm, linux-kernel

On Sun, Jun 07, 2009 at 06:55:15PM +0100, Hugh Dickins wrote:
> On Wed, 3 Jun 2009, Johannes Weiner wrote:
> >
> > The current swap readahead implementation reads a physically
> > contiguous group of swap slots around the faulting page to take
> > advantage of the disk head's position and in the hope that the
> > surrounding pages will be needed soon as well.
> > 
> > This works as long as the physical swap slot order approximates the
> > LRU order decently, otherwise it wastes memory and IO bandwidth to
> > read in pages that are unlikely to be needed soon.
> > 
> > However, the physical swap slot layout diverges from the LRU order
> > with increasing swap activity, i.e. high memory pressure situations,
> > and this is exactly the situation where swapin should not waste any
> > memory or IO bandwidth as both are the most contended resources at
> > this point.
> > 
> > Another approximation for LRU-relation is the VMA order as groups of
> > VMA-related pages are usually used together.
> > 
> > This patch combines both the physical and the virtual hint to get a
> > good approximation of pages that are sensible to read ahead.
> > 
> > When both diverge, we either read unrelated data, seek heavily for
> > related data, or, what this patch does, just decrease the readahead
> > efforts.
> > 
> > To achieve this, we have essentially two readahead windows of the same
> > size: one spans the virtual, the other one the physical neighborhood
> > of the faulting page.  We only read where both areas overlap.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > Cc: Andi Kleen <andi@firstfloor.org>
> 
> I think this a great idea, a very promising approach.  I like it
> so much better than Andrew's and others' proposals to dedicate
> areas of swap space to distinct virtual objects: which, as you
> rightly pointed out, condemn us to unnecessary seeking when
> writing swap (and is even more of an issue if we're writing
> to SSD rather than HDD).
> 
> It would be nice to get results from a wider set of benchmarks
> than just qsbench; but I don't think qsbench is biased in your
> favour, and I don't think you need go to too much trouble on
> that - let's just aim to get your work into mmotm, then we can
> all play around with it for a while.  I suppose what I'd most
> like is to try it with shmem, which you've understandably left
> out for now.

I agree, it would be nice to have this in mm soonish and have it
exposed a bit more until .31 or .32.  And I'll continue to test other
loads.

> You'll be hating me for the way I made shmem_truncate_range() etc.
> nigh incomprehensible when enabling highmem index pages there.
> Christoph Rohland's original was much nicer.  Again and again and
> again I've wanted to throw all that out, and keep swap entries in
> the standard radix tree which keeps the struct page pointers; but
> again and again and again, I've been unable to justify losing the
> highmem index ability - for a while it seemed as if x86_64 was
> going to make highmem a thing of the past, and it's certainly helped
> us to ignore 32GB 32-bit; but I think 4GB 32-bit is here a long while.

Regarding highmem, I have only one 32-bit box with 1G of memory, and
highmem= seems to be broken right now.  The documentation says it
forces the size of the highmem zone.  It does not on x86; instead it
seems to set up Normal and then complain about the lack of remaining
pages to put into HighMem.

With only 128M of HighMem, testing for highpte mapping overhead is a
bit bogus, so I can't do that until I figure something out (either
fixing highmem= or no-opping GFP_HIGHMEM and using GFP_HIGHPTE for the
pte pages).

Or redoing the patch as you suggested below, which is probably what I
will opt for.

> Though I like the way you localized it all into swapin_readahead(),
> I'd prefer to keep ptes out of swap_state.c, and think several issues
> will fall away if you turn your patch around.  You'll avoid the pte code
> in swap_state.c, you'll satisfy Andi's concerns about locking/kmapping
> overhead, and you'll find shmem much much easier, if instead of peering
> back at where you've come from in swapin_readahead(), you make the outer
> levels (do_swap_page and shmem_getpage) pass a vector of swap entries to
> swapin_readahead()?  That vector on stack, and copied from the one page
> table or index page (don't bother to cross page table or index page
> boundaries) while it was mapped.

That is a nice suggestion.

> It's probably irrelevant to you, but I've attached an untested patch
> of mine which stomps somewhat on this area: I was shocked to notice
> shmem_getpage() in a list of deep stack offenders a few months back,
> realized it was those unpleasant NUMA mempolicy pseudo-vmas, and made
> a patch to get rid of them.  I've rebased it to 2.6.30-rc8 and checked
> that the resulting kernel runs, but not really tested it; and I think
> I didn't even try to get the mpol reference counting right (tends to
> be an issue precisely in swapin_readahead, where one mpol ends up used
> repeatedly) - mpol refcounting is an arcane art only Lee understands!
> I've attached the patch because you may want to glance at it and
> decide, either that it's something which is helpful to the direction
> you're going in and you'd like to base upon it, or that it's a
> distraction and you'd prefer me to keep it to myself until your
> changes are in.

I will try to make swapin_readahead() take an array of ptes, then your
patch shouldn't get in my way as I don't need the vma anymore.

> But your patch below is incomplete, isn't it?  The old swapin_readahead()
> is now called swapin_readahead_phys(), and you want shmem_getpage() to be
> using that for now: but no prototype for it and no change to mm/shmem.c.

You probably missed the no-vma / no-vma->vm_mm branch in
swapin_readahead().  shmem either passes a NULL vma or a dummy vma
that has no vma->vm_mm set.  Oops, that is of course a bug: on the
dummy vma, vma->vm_mm is uninitialized, not NULL.  But that will go
away as well.
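
(For illustration, the interim fix would simply be to zero the
pseudo-vma in shmem_swapin() so that the fallback test actually
fires - a sketch only, since this code path is going away anyway:)

static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
			struct shmem_inode_info *info, unsigned long idx)
{
	struct mempolicy mpol, *spol;
	struct vm_area_struct pvma;

	spol = mpol_cond_copy(&mpol,
			mpol_shared_policy_lookup(&info->policy, idx));

	/*
	 * Zero the pseudo vma so vm_mm is reliably NULL and
	 * swapin_readahead() takes the physical fallback path.
	 */
	memset(&pvma, 0, sizeof(pvma));
	pvma.vm_pgoff = idx;
	pvma.vm_policy = spol;
	return swapin_readahead(entry, gfp, &pvma, 0);
}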

Thanks,

	Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-06-08 15:24 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-02 22:37 [patch][v2] swap: virtual swap readahead Johannes Weiner
2009-06-02 22:37 ` Johannes Weiner
2009-06-02 23:34 ` Andi Kleen
2009-06-02 23:34   ` Andi Kleen
2009-06-03 13:27   ` Johannes Weiner
2009-06-03 13:27     ` Johannes Weiner
2009-06-03 14:51     ` Rik van Riel
2009-06-03 14:51       ` Rik van Riel
2009-06-03 15:02       ` Andi Kleen
2009-06-03 15:02         ` Andi Kleen
2009-06-05 11:03     ` Minchan Kim
2009-06-05 11:03       ` Minchan Kim
2009-06-04  1:46 ` KAMEZAWA Hiroyuki
2009-06-04  1:46   ` KAMEZAWA Hiroyuki
2009-06-04 10:04   ` Johannes Weiner
2009-06-04 10:04     ` Johannes Weiner
2009-06-07 17:55 ` Hugh Dickins
2009-06-08 15:21   ` Johannes Weiner
2009-06-08 15:21     ` Johannes Weiner
