linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: implement swap prefetching
@ 2006-02-06 23:28 Con Kolivas
  2006-02-07  0:38 ` Andrew Morton
  2006-02-07  3:08 ` [PATCH] mm: implement swap prefetching Nick Piggin
  0 siblings, 2 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-06 23:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Andrew Morton, ck

Andrew et al

I'm resubmitting the swap prefetching patch for inclusion in -mm and hopefully
mainline. After you removed it from -mm there were some people that described
the benefits it afforded their workloads. -mm being ever so slightly quieter
at the moment please reconsider.

Cheers,
Con
---
This patch implements swap prefetching when the vm is relatively idle and
there is free ram available. The code is based on some early work by Thomas
Schlichter.

This stores swapped entries in a list ordered by most recent use and in a
radix tree. It creates a low priority kernel thread running at nice 19 to do
the prefetching at a later stage.

Once pages have been added to the swapped list, a timer is started that tests
every 5 seconds for conditions suitable for prefetching swap pages. Suitable
conditions are defined as no pages being swapped in or out and no watermark
tests failing. Significant amounts of dirtied ram, or changes in free ram that
indicate disk writes or reads, also prevent prefetching.

It then checks that there is spare ram, looking for at least 3 * pages_high
free pages per zone, and if that succeeds it prefetches pages from swap. The
pages are prefetched in 128kB groups every second until the vm becomes busy
according to the tests above, the watermarks no longer show adequate free ram,
or the list is emptied. The pages are copied to the swap cache and kept on
backing store, which allows pressure on either physical ram or swap to readily
find free pages without further I/O.

The amount prefetched in each group is configurable via the tunable in
/proc/sys/vm/swap_prefetch. This is set to a value based on memory size. When
laptop_mode is enabled it prefetches in ten times larger blocks to minimise
the time spent reading.
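
As a rough worked example (assuming 4kB pages and SWAP_CLUSTER_MAX of 32;
swap_prefetch_bytes() is purely illustrative, the real code works in pages
via prefetch_pages() below):

	/*
	 * One unit of the tunable is SWAP_CLUSTER_MAX pages, ie 32 * 4kB =
	 * 128kB.  A 1GB machine (~262144 pages) defaults to swap_prefetch
	 * of 26, ie roughly 3MB per prefetching interval, and ten times
	 * that in laptop_mode.
	 */
	static unsigned long swap_prefetch_bytes(void)
	{
		return (unsigned long)SWAP_CLUSTER_MAX * PAGE_SIZE *
			swap_prefetch * (laptop_mode ? 10 : 1);
	}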

In testing on modern pc hardware this speeds up the wall-clock activation time
of the firefox browser 5-fold after a worst case complete swap-out of the
browser left on a static web page.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

 Documentation/sysctl/vm.txt |   11 +
 include/linux/swap.h        |   32 +++
 include/linux/sysctl.h      |    1 
 init/Kconfig                |   21 ++
 kernel/sysctl.c             |   12 +
 mm/Makefile                 |    1 
 mm/page_alloc.c             |   12 -
 mm/swap.c                   |    3 
 mm/swap_prefetch.c          |  431 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c             |   10 -
 mm/vmscan.c                 |    5 
 11 files changed, 535 insertions(+), 4 deletions(-)

Index: linux-2.6.16-rc2-ck1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc2-ck1.orig/Documentation/sysctl/vm.txt	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/Documentation/sysctl/vm.txt	2006-02-04 18:38:24.000000000 +1100
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
 - drop-caches
 - zone_reclaim_mode
 - zone_reclaim_interval
+- swap_prefetch
 
 ==============================================================
 
@@ -178,3 +179,13 @@ Time is set in seconds and set by defaul
 Reduce the interval if undesired off node allocations occur. However, too
 frequent scans will have a negative impact onoff node allocation performance.
 
+==============================================================
+
+swap_prefetch
+
+This is the amount of data prefetched per prefetching interval when
+swap prefetching is compiled in. The value means multiples of 128K,
+except when laptop_mode is enabled and then it is ten times larger.
+Setting it to 0 disables prefetching entirely.
+
+The default value is dependent on ram size.
Index: linux-2.6.16-rc2-ck1/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2-ck1.orig/include/linux/swap.h	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/include/linux/swap.h	2006-02-04 18:38:24.000000000 +1100
@@ -214,6 +214,37 @@ extern int shmem_unuse(swp_entry_t entry
 
 extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 
+#ifdef CONFIG_SWAP_PREFETCH
+/* only used by prefetch externally */
+/*	mm/swap_prefetch.c */
+extern void prepare_prefetch(void);
+extern void add_to_swapped_list(unsigned long index);
+extern void remove_from_swapped_list(unsigned long index);
+extern void delay_prefetch(void);
+/* linux/mm/page_alloc.c */
+extern struct page *buffered_rmqueue(struct zonelist *zonelist,
+			struct zone *zone, int order, gfp_t gfp_flags);
+extern int swap_prefetch;
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(unsigned long index)
+{
+}
+
+static inline void prepare_prefetch(void)
+{
+}
+
+static inline void remove_from_swapped_list(unsigned long index)
+{
+}
+
+static inline void delay_prefetch(void)
+{
+}
+
+#endif	/* CONFIG_SWAP_PREFETCH */
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
@@ -235,6 +266,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
Index: linux-2.6.16-rc2-ck1/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc2-ck1.orig/include/linux/sysctl.h	2006-02-04 18:38:19.000000000 +1100
+++ linux-2.6.16-rc2-ck1/include/linux/sysctl.h	2006-02-04 18:38:24.000000000 +1100
@@ -187,6 +187,7 @@ enum
 	VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+	VM_SWAP_PREFETCH=33,	/* int: amount to swap prefetch */
 };
 
 
Index: linux-2.6.16-rc2-ck1/init/Kconfig
===================================================================
--- linux-2.6.16-rc2-ck1.orig/init/Kconfig	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/init/Kconfig	2006-02-04 18:38:24.000000000 +1100
@@ -103,6 +103,27 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  What this will do on workstations is slowly bring back applications
+	  that have swapped out after memory intensive workloads back into
+	  physical ram if you have free ram at a later stage and the machine
+	  is relatively idle. This means that when you come back to your
+	  computer after leaving it idle for a while, applications will come
+	  to life faster. Note that your swap usage will appear to increase
+	  but these are cached pages, can be dropped freely by the vm, and it
+	  should stabilise around 50% swap usage.
+
+	  Desktop users will most likely want to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6.16-rc2-ck1/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc2-ck1.orig/kernel/sysctl.c	2006-02-04 18:38:19.000000000 +1100
+++ linux-2.6.16-rc2-ck1/kernel/sysctl.c	2006-02-04 18:38:24.000000000 +1100
@@ -916,6 +916,18 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec_jiffies,
 		.strategy	= &sysctl_jiffies,
 	},
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= VM_SWAP_PREFETCH,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+#endif
 #endif
 	{ .ctl_name = 0 }
 };
Index: linux-2.6.16-rc2-ck1/mm/Makefile
===================================================================
--- linux-2.6.16-rc2-ck1.orig/mm/Makefile	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/mm/Makefile	2006-02-04 18:38:24.000000000 +1100
@@ -13,6 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o util.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
Index: linux-2.6.16-rc2-ck1/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2-ck1.orig/mm/page_alloc.c	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/mm/page_alloc.c	2006-02-04 18:38:24.000000000 +1100
@@ -754,7 +754,7 @@ static inline void prep_zero_page(struct
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zonelist *zonelist,
+struct page *buffered_rmqueue(struct zonelist *zonelist,
 			struct zone *zone, int order, gfp_t gfp_flags)
 {
 	unsigned long flags;
@@ -833,7 +833,7 @@ int zone_watermark_ok(struct zone *z, in
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		goto out_failed;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -842,9 +842,15 @@ int zone_watermark_ok(struct zone *z, in
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			goto out_failed;
 	}
+
 	return 1;
+out_failed:
+	/* Swap prefetching is delayed if any watermark is low */
+	delay_prefetch();
+
+	return 0;
 }
 
 /*
Index: linux-2.6.16-rc2-ck1/mm/swap.c
===================================================================
--- linux-2.6.16-rc2-ck1.orig/mm/swap.c	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/mm/swap.c	2006-02-04 18:38:24.000000000 +1100
@@ -502,5 +502,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	prepare_prefetch();
+
 	hotcpu_notifier(cpu_swap_callback, 0);
 }
Index: linux-2.6.16-rc2-ck1/mm/swap_prefetch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc2-ck1/mm/swap_prefetch.c	2006-02-04 18:38:24.000000000 +1100
@@ -0,0 +1,431 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@kolivas.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+
+/* Time to delay prefetching if vm is busy or prefetching unsuccessful */
+#define PREFETCH_DELAY	(HZ * 5)
+/* Time between attempting prefetching when vm is idle */
+#define PREFETCH_INTERVAL (HZ)
+
+/* sysctl - how many SWAP_CLUSTER_MAX pages to prefetch at a time */
+int swap_prefetch __read_mostly;
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	kmem_cache_t		*cache;
+};
+
+struct swapped_entry {
+	swp_entry_t		swp_entry;
+	struct list_head	swapped_list;
+};
+
+static struct swapped_root swapped = {
+	.busy 		= 0,
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+	.count 		= 0,
+};
+
+static task_t *kprefetchd_task;
+
+/* Max mapped we will prefetch to */
+static unsigned long mapped_limit __read_mostly;
+/* Last total free pages */
+static unsigned long last_free = 0;
+static unsigned long temp_free = 0;
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_prefetch(void)
+{
+	long mem = nr_free_pagecache_pages();
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL, NULL);
+
+	/* Set max number of entries to size of physical ram */
+	swapped.maxcount = mem;
+	/* Set maximum amount of mapped pages to prefetch to 2/3 ram */
+	mapped_limit = mem / 3 * 2;
+
+	/* Set initial swap_prefetch value according to memory size */
+	swap_prefetch = mem / 10000 ? : 1;
+}
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_prefetch(void)
+{
+	__set_bit(0, &swapped.busy);
+}
+
+/*
+ * Accounting is sloppy on purpose. As adding and removing entries from the
+ * list happens during swapping in and out we don't want to be spinning on
+ * locks. It is cheaper to just miss adding an entry since having a reference
+ * to every entry is not critical.
+ */
+void add_to_swapped_list(unsigned long index)
+{
+	struct swapped_entry *entry;
+	int error;
+
+	if (unlikely(!spin_trylock(&swapped.lock)))
+		goto out;
+
+	if (swapped.count >= swapped.maxcount) {
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	entry->swp_entry.val = index;
+
+	error = radix_tree_preload(GFP_ATOMIC);
+	if (likely(!error)) {
+		error = radix_tree_insert(&swapped.swap_tree, index, entry);
+		if (likely(!error)) {
+			/*
+			 * If this is the first entry, kprefetchd needs to be
+			 * (re)started
+			 */
+			if (list_empty(&swapped.list))
+				wake_up_process(kprefetchd_task);
+			list_add(&entry->swapped_list, &swapped.list);
+			swapped.count++;
+		}
+		radix_tree_preload_end();
+	} else
+		kmem_cache_free(swapped.cache, entry);
+
+out_locked:
+	spin_unlock(&swapped.lock);
+out:
+	return;
+}
+
+/*
+ * Cheaper to not spin on the lock and remove the entry lazily via
+ * add_to_swap_cache when we hit it in trickle_swap_cache_async
+ */
+void remove_from_swapped_list(unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&swapped.lock, flags)))
+		return;
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del_init(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+static inline int high_zone(struct zone *zone)
+{
+	if (zone == NULL)
+		return 0;
+	return is_highmem(zone);
+}
+
+/*
+ * Find the zone with the most free pages, recheck the watermarks and
+ * then directly allocate the ram. We don't want prefetch to use
+ * __alloc_pages and go calling on reclaim.
+ */
+static struct page *prefetch_get_page(void)
+{
+	struct zone *zone = NULL, *z;
+	struct page *page = NULL;
+	struct zonelist *zonelist;
+	long most_free = 0;
+
+	for_each_zone(z) {
+		long free;
+
+		if (z->present_pages == 0)
+			continue;
+
+		/* We don't prefetch into DMA */
+		if (zone_idx(z) == ZONE_DMA)
+			continue;
+
+		free = z->free_pages;
+		/* Select the zone with the most free ram preferring high */
+		if ((free > most_free && (!high_zone(zone) || high_zone(z))) ||
+			(!high_zone(zone) && high_zone(z))) {
+				most_free = free;
+				zone = z;
+		}
+	}
+
+	if (zone == NULL)
+		goto out;
+
+	zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+		(GFP_HIGHUSER & GFP_ZONEMASK);
+	page = buffered_rmqueue(zonelist, zone, 0, GFP_HIGHUSER);
+out:
+	return page;
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(swp_entry_t entry)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	struct page *page = NULL;
+
+	if (unlikely(!read_trylock(&swapper_space.tree_lock))) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock(&swapper_space.tree_lock);
+	if (page) {
+		remove_from_swapped_list(entry.val);
+		goto out;
+	}
+
+	/* Get a new page to read from swap */
+	page = prefetch_get_page();
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry))
+		/* Failed to add to swap cache */
+		goto out_release;
+
+	lru_cache_add(page);
+	if (unlikely(swap_readpage(NULL, page))) {
+		ret = TRICKLE_DELAY;
+		goto out_release;
+	}
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	return ret;
+}
+
+/*
+ * How many pages to prefetch at a time. We prefetch SWAP_CLUSTER_MAX *
+ * swap_prefetch per PREFETCH_INTERVAL, but prefetch ten times as much at a
+ * time in laptop_mode to minimise the time we keep the disk spinning.
+ */
+static inline unsigned long prefetch_pages(void)
+{
+	return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+	struct page_state ps;
+	unsigned long limit;
+	struct zone *z;
+	int ret = 0;
+
+	/* Purposefully racy and might return false positive which is ok */
+	if (__test_and_clear_bit(0, &swapped.busy))
+		goto out;
+
+	temp_free = 0;
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		unsigned long free;
+
+		if (z->present_pages == 0)
+			continue;
+		free = z->free_pages;
+		if (z->pages_high * 3 > free)
+			goto out;
+		temp_free += free;
+	}
+
+	/*
+	 * We check to see that pages are not being allocated elsewhere
+	 * at any significant rate implying any degree of memory pressure
+	 * (eg during file reads)
+	 */
+	if (last_free) {
+		if (temp_free + SWAP_CLUSTER_MAX < last_free) {
+			last_free = temp_free;
+			goto out;
+		}
+	} else
+		last_free = temp_free;
+
+	get_page_state(&ps);
+
+	/* We shouldn't prefetch when we are doing writeback */
+	if (ps.nr_writeback)
+		goto out;
+
+	/*
+	 * >2/3 of the ram is mapped, swapcache or dirty, we need some free
+	 * for pagecache
+	 */
+	limit = ps.nr_mapped + ps.nr_slab + ps.nr_dirty + ps.nr_unstable +
+		total_swapcache_pages;
+	if (limit > mapped_limit)
+		goto out;
+
+	/* Survived all that? Hooray we can prefetch! */
+	ret = 1;
+out:
+	return ret;
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implied we are low on memory or swapping in currently.
+ * Otherwise it runs till prefetch_pages() are prefetched.
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct swapped_entry *entry;
+	int pages = 0;
+
+	while (pages < prefetch_pages()) {
+		enum trickle_return got_page;
+
+		if (!prefetch_suitable())
+			goto out;
+		/* Lock is held? We must be busy elsewhere */
+		if (unlikely(!spin_trylock(&swapped.lock)))
+			goto out;
+		if (list_empty(&swapped.list)) {
+			spin_unlock(&swapped.lock);
+			ret = TRICKLE_FAILED;
+			goto out;
+		}
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		spin_unlock(&swapped.lock);
+
+		got_page = trickle_swap_cache_async(entry->swp_entry);
+		switch (got_page) {
+		case TRICKLE_FAILED:
+			break;
+		case TRICKLE_SUCCESS:
+			last_free--;
+			pages++;
+			break;
+		case TRICKLE_DELAY:
+			goto out;
+		}
+	}
+	ret = TRICKLE_SUCCESS;
+
+out:
+	if (pages)
+		lru_add_drain();
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+	do {
+		enum trickle_return prefetched;
+
+		try_to_freeze();
+
+		/*
+		 * TRICKLE_FAILED implies no entries left - we do not schedule
+		 * a wakeup, and further delay the next one.
+		 */
+		prefetched = trickle_swap();
+		switch (prefetched) {
+		case TRICKLE_SUCCESS:
+			last_free = temp_free;
+			schedule_timeout_interruptible(PREFETCH_INTERVAL);
+			break;
+		case TRICKLE_DELAY:
+			last_free = 0;
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+			break;
+		case TRICKLE_FAILED:
+			last_free = 0;
+			schedule_timeout_interruptible(MAX_SCHEDULE_TIMEOUT);
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+			break;
+		}
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
Index: linux-2.6.16-rc2-ck1/mm/swap_state.c
===================================================================
--- linux-2.6.16-rc2-ck1.orig/mm/swap_state.c	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/mm/swap_state.c	2006-02-04 18:38:24.000000000 +1100
@@ -81,6 +81,7 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
+			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -94,11 +95,12 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
 	if (!swap_duplicate(entry)) {
+		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
@@ -147,6 +149,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_prefetch();
+
 	if (!PageLocked(page))
 		BUG();
 
@@ -320,6 +325,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
Index: linux-2.6.16-rc2-ck1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc2-ck1.orig/mm/vmscan.c	2006-02-04 18:15:28.000000000 +1100
+++ linux-2.6.16-rc2-ck1/mm/vmscan.c	2006-02-04 18:38:24.000000000 +1100
@@ -396,6 +396,7 @@ static int remove_mapping(struct address
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		add_to_swapped_list(swap.val);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
@@ -1406,6 +1407,8 @@ int try_to_free_pages(struct zone **zone
 	unsigned long lru_pages = 0;
 	int i;
 
+	delay_prefetch();
+
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = !laptop_mode;
 	sc.may_swap = 1;
@@ -1758,6 +1761,8 @@ int shrink_all_memory(int nr_pages)
 		.reclaimed_slab = 0,
 	};
 
+	delay_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 	for_each_pgdat(pgdat) {
 		int freed;


* Re: [PATCH] mm: implement swap prefetching
  2006-02-06 23:28 [PATCH] mm: implement swap prefetching Con Kolivas
@ 2006-02-07  0:38 ` Andrew Morton
  2006-02-07  1:29   ` Con Kolivas
                     ` (2 more replies)
  2006-02-07  3:08 ` [PATCH] mm: implement swap prefetching Nick Piggin
  1 sibling, 3 replies; 26+ messages in thread
From: Andrew Morton @ 2006-02-07  0:38 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, ck

Con Kolivas <kernel@kolivas.org> wrote:
>
> Andrew et al
> 
> I'm resubmitting the swap prefetching patch for inclusion in -mm and hopefully
> mainline.

Resubmitting is good, thanks.

> After you removed it from -mm there were some people that described
> the benefits it afforded their workloads. -mm being ever so slightly quieter
> at the moment please reconsider.
> 

I wish I could convince myself this is sufficiently beneficial..

I've been running 2.6.15-rc2-mm2 on my main workstation (x86, 2G) since
whenever.  (Am lazy, haven't gotten around to upgrading that machine).  It
has swap prefetch.

I can't say I noticed any difference, although I did turn it off in /proc a
few reboots ago because it was irritating me for some reason which I forget
(sorry).

One thing about 2.6.15-rc2-mm2 is that the `so' and `si' columns in
`vmstat' always read zero.  I don't know whether that bug is due to the
prefetch patch or not.

> 
> The amount prefetched in each group is configurable via the tunable in
> /proc/sys/vm/swap_prefetch. This is set to a value based on memory size. When
> laptop_mode is enabled it prefetches in ten times larger blocks to minimise
> the time spent reading.

That's incomprehensible, sorry.

I think it'd be much clearer if the thing was called swap_prefetch_kbytes
or swap_prefetch_mbytes or (worse) swap_prefetch_pages - putting the units in the
name really helps clarify things.

And if such a change is made, the internal variable should also be renamed.
 Right now it's "swap_prefetch", which sounds like a boolean.

> +swap_prefetch
> +
> +This is the amount of data prefetched per prefetching interval when
> +swap prefetching is compiled in. The value means multiples of 128K,
> +except when laptop_mode is enabled and then it is ten times larger.
> +Setting it to 0 disables prefetching entirely.

What does "ten times larger" mean?  If laptop_mode, this thing is in units
of 1280 kbytes and if !laptop_mode it's in units of 128 kbytes?

If so (or if not), this tunable is quite obscure and hard-to-understand. 
Can you find a way to make this more user-friendly?

> +/* only used by prefetch externally */
> +/*	mm/swap_prefetch.c */
> +extern void prepare_prefetch(void);
> +extern void add_to_swapped_list(unsigned long index);
> +extern void remove_from_swapped_list(unsigned long index);
> +extern void delay_prefetch(void);

I'd suggest that "prefetch" is too generic a term.  We prefetch lots of
things in the kernel.  Please rename all globally-visible identifiers with
s/prefetch/swap_prefetch/.


> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.16-rc2-ck1/mm/swap_prefetch.c	2006-02-04 18:38:24.000000000 +1100
> @@ -0,0 +1,431 @@
> +/*
> + * linux/mm/swap_prefetch.c
> + *
> + * Copyright (C) 2005 Con Kolivas
> + *
> + * Written by Con Kolivas <kernel@kolivas.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/swap.h>
> +#include <linux/ioprio.h>
> +#include <linux/kthread.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +#include <linux/writeback.h>
> +
> +/* Time to delay prefetching if vm is busy or prefetching unsuccessful */
> +#define PREFETCH_DELAY	(HZ * 5)
> +/* Time between attempting prefetching when vm is idle */
> +#define PREFETCH_INTERVAL (HZ)
> +
> +/* sysctl - how many SWAP_CLUSTER_MAX pages to prefetch at a time */
> +int swap_prefetch __read_mostly;
> +
> +struct swapped_root {
> +	unsigned long		busy;		/* vm busy */
> +	spinlock_t		lock;		/* protects all data */
> +	struct list_head	list;		/* MRU list of swapped pages */
> +	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
> +	unsigned int		count;		/* Number of entries */
> +	unsigned int		maxcount;	/* Maximum entries allowed */
> +	kmem_cache_t		*cache;
						/* Of struct swapped_entry */
> +};

> +struct swapped_entry {
> +	swp_entry_t		swp_entry;
> +	struct list_head	swapped_list;
> +};
> +
> +static struct swapped_root swapped = {
> +	.busy 		= 0,
> +	.lock		= SPIN_LOCK_UNLOCKED,
> +	.list  		= LIST_HEAD_INIT(swapped.list),
> +	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
> +	.count 		= 0,
> +};

Description of `busy' and `count'?

> +static task_t *kprefetchd_task;
> +
> +/* Max mapped we will prefetch to */
> +static unsigned long mapped_limit __read_mostly;
> +/* Last total free pages */
> +static unsigned long last_free = 0;
> +static unsigned long temp_free = 0;

Unneeded initialisation.

> +
> +/*
> + * Accounting is sloppy on purpose. As adding and removing entries from the
> + * list happens during swapping in and out we don't want to be spinning on
> + * locks. It is cheaper to just miss adding an entry since having a reference
> + * to every entry is not critical.
> + */
> +void add_to_swapped_list(unsigned long index)
> +{
> +	struct swapped_entry *entry;
> +	int error;
> +
> +	if (unlikely(!spin_trylock(&swapped.lock)))
> +		goto out;

hm, spin_trylock() should internally do unlikely(), but it doesn't.  (It's
a bit of a mess, too).

> +	if (swapped.count >= swapped.maxcount) {

		/*
		 * <comment about LRU>
		 */
> +		entry = list_entry(swapped.list.next,
> +			struct swapped_entry, swapped_list);
> +		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
> +		list_del(&entry->swapped_list);
> +		swapped.count--;
> +	} else {
> +		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
> +		if (unlikely(!entry))
> +			/* bad, can't allocate more mem */
> +			goto out_locked;
> +	}
> +
> +	entry->swp_entry.val = index;
> +
> +	error = radix_tree_preload(GFP_ATOMIC);

I suspect we don't need to preload here.  We can handle radix_tree_insert()
failure, so just go ahead and do it.
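
Something like this, perhaps (untested sketch, reusing the error path the
patch already has):

	error = radix_tree_insert(&swapped.swap_tree, index, entry);
	if (likely(!error)) {
		/* first entry means kprefetchd needs waking, as before */
		if (list_empty(&swapped.list))
			wake_up_process(kprefetchd_task);
		list_add(&entry->swapped_list, &swapped.list);
		swapped.count++;
	} else {
		/* -ENOMEM or -EEXIST: just drop the entry, it isn't critical */
		kmem_cache_free(swapped.cache, entry);
	}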

> +static inline int high_zone(struct zone *zone)
> +{
> +	if (zone == NULL)
> +		return 0;
> +	return is_highmem(zone);
> +}
> +
> +/*
> + * Find the zone with the most free pages, recheck the watermarks and
> + * then directly allocate the ram. We don't want prefetch to use
> + * __alloc_pages and go calling on reclaim.
> + */
> +static struct page *prefetch_get_page(void)
> +{
> +	struct zone *zone = NULL, *z;
> +	struct page *page = NULL;
> +	struct zonelist *zonelist;
> +	long most_free = 0;
> +
> +	for_each_zone(z) {
> +		long free;
> +
> +		if (z->present_pages == 0)
> +			continue;
> +
> +		/* We don't prefetch into DMA */
> +		if (zone_idx(z) == ZONE_DMA)
> +			continue;
> +
> +		free = z->free_pages;
> +		/* Select the zone with the most free ram preferring high */
> +		if ((free > most_free && (!high_zone(zone) || high_zone(z))) ||
> +			(!high_zone(zone) && high_zone(z))) {
> +				most_free = free;
> +				zone = z;
> +		}
> +	}

<stares at the above expression for three minutes>

I think it'll always select ZONE_HIGHMEM no matter what - the second || term
picks a highmem zone over a non-highmem one regardless of how much free ram it
has.  Users of 1G x86 boxes not happy.

> +/*
> + * How many pages to prefetch at a time. We prefetch SWAP_CLUSTER_MAX *
> + * swap_prefetch per PREFETCH_INTERVAL, but prefetch ten times as much at a
> + * time in laptop_mode to minimise the time we keep the disk spinning.
> + */
> +static inline unsigned long prefetch_pages(void)
> +{
> +	return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
> +}

I don't think this should be done in-kernel.  There's a nice script to
start and stop laptop mode.  We can make this decision in that script.

> +/*
> + * We want to be absolutely certain it's ok to start prefetching.
> + */
> +static int prefetch_suitable(void)
> +{
> +	struct page_state ps;
> +	unsigned long limit;
> +	struct zone *z;
> +	int ret = 0;
> +
> +	/* Purposefully racy and might return false positive which is ok */
> +	if (__test_and_clear_bit(0, &swapped.busy))
> +		goto out;
> +
> +	temp_free = 0;
> +	/*
> +	 * Have some hysteresis between where page reclaiming and prefetching
> +	 * will occur to prevent ping-ponging between them.
> +	 */
> +	for_each_zone(z) {
> +		unsigned long free;
> +
> +		if (z->present_pages == 0)
> +			continue;
> +		free = z->free_pages;
> +		if (z->pages_high * 3 > free)
> +			goto out;
> +		temp_free += free;
> +	}
> +
> +	/*
> +	 * We check to see that pages are not being allocated elsewhere
> +	 * at any significant rate implying any degree of memory pressure
> +	 * (eg during file reads)
> +	 */
> +	if (last_free) {
> +		if (temp_free + SWAP_CLUSTER_MAX < last_free) {
> +			last_free = temp_free;
> +			goto out;
> +		}
> +	} else
> +		last_free = temp_free;

What is the actual threshold rate here?
SWAP_CLUSTER_MAX/(how fast your CPU is)?  Seems a bit vague?

> +	get_page_state(&ps);

get_page_state() can be super-expensive.  How frequently is this called?

> +
> +static int kprefetchd(void *__unused)
> +{
> +	set_user_nice(current, 19);
> +	/* Set ioprio to lowest if supported by i/o scheduler */
> +	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
> +
> +	do {
> +		enum trickle_return prefetched;
> +
> +		try_to_freeze();
> +
> +		/*
> +		 * TRICKLE_FAILED implies no entries left - we do not schedule
> +		 * a wakeup, and further delay the next one.
> +		 */
> +		prefetched = trickle_swap();
> +		switch (prefetched) {
> +		case TRICKLE_SUCCESS:
> +			last_free = temp_free;

This `last_free' thing is really confusing.  It's central to the algorithms
yet its name is largely meaningless.  last_free *what*?  It seems to mean
"total number of free pages on the last prefetching pass", yes?  Wanna
think of a better name and a better comment for it?




* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  0:38 ` Andrew Morton
@ 2006-02-07  1:29   ` Con Kolivas
  2006-02-07  1:32     ` [ck] " Con Kolivas
  2006-02-07  1:39     ` Andrew Morton
  2006-02-07  1:37   ` Bernd Eckenfels
  2006-02-08  3:29   ` [PATCH] mm: implement swap prefetching v21 Con Kolivas
  2 siblings, 2 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-07  1:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, ck

On Tue, 7 Feb 2006 11:38 am, Andrew Morton wrote:
> Con Kolivas <kernel@kolivas.org> wrote:
> > After you removed it from -mm there were some people that described
> > the benefits it afforded their workloads. -mm being ever so slightly
> > quieter at the moment please reconsider.
>
> I wish I could convince myself this is sufficiently beneficial..
>
> I've been running 2.6.15-rc2-mm2 on my main workstation (x86, 2G) since
> whenever.  (Am lazy, haven't gotten around to upgrading that machine).  It
> has swap prefetch.

Tons of ram, no swapping maybe?

> I can't say I noticed any difference, although I did turn it off in /proc a
> few reboots ago because it was irritating me for some reason which I forget
> (sorry).

It could be the intermittent way it was actually doing any prefetching. It 
ticks the drive over every second. I'll change this. 

> One thing about 2.6.15-rc2-mm2 is that the `so' and `si' columns in
> `vmstat' always read zero.  I don't know whether that bug is due to the
> prefetch patch or not.

so and si work fine here and are the main way I watch if it has started 
prefetching.

>
> > The amount prefetched in each group is configurable via the tunable in
> > /proc/sys/vm/swap_prefetch. This is set to a value based on memory size.
> > When laptop_mode is enabled it prefetches in ten times larger blocks to
> > minimise the time spent reading.
>
> That's incomprehensible, sorry.
>
> I think it'd be much clearer if the thing was called swap_prefetch_kbytes
> or swap_prefetch_mbytes or (worse) swap_prefetch_pages - putting the units
> in the name really helps clarify things.
>
> And if such a change is made, the internal variable should also be renamed.
>  Right now it's "swap_prefetch", which sounds like a boolean.

I might just change it to a boolean and be done with.

>
> > +swap_prefetch
> > +
> > +This is the amount of data prefetched per prefetching interval when
> > +swap prefetching is compiled in. The value means multiples of 128K,
> > +except when laptop_mode is enabled and then it is ten times larger.
> > +Setting it to 0 disables prefetching entirely.
>
> What does "ten times larger" mean?  If laptop_mode, this thing is in units
> of 1280 kbytes and if !laptop_mode it's in units of 128 kbytes?
>
> If so (or if not), this tunable is quite obscure and hard-to-understand.
> Can you find a way to make this more user-friendly?

Changing it to on or off should fix that. I'll blow away the idea that you can 
set the amount to prefetch, and just change it to - on: prefetch till ram is 
sufficiently full or we run out of pages to prefetch.
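
Roughly, the main loop would then become something like this (sketch only;
trickle_one_page() is a made-up stand-in for the per-page work that
trickle_swap() already does):

	/*
	 * No fixed per-interval budget: keep going until the vm looks busy,
	 * free ram stops being plentiful, or the swapped list is emptied.
	 */
	while (swap_prefetch && prefetch_suitable()) {
		if (trickle_one_page() != TRICKLE_SUCCESS)
			break;
	}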

[ack snipped comments]

> > +/* Last total free pages */
> > +static unsigned long last_free = 0;
> > +static unsigned long temp_free = 0;
>
> Unneeded initialisation.

Very first use of both of these variables depends on them being initialised.

> > +	if (unlikely(!spin_trylock(&swapped.lock)))
> > +		goto out;
>
> hm, spin_trylock() should internally do unlikely(), but it doesn't.  (It's
> a bit of a mess, too).

Good point. Perhaps I should submit a separate patch for this instead.

[ack other snipped points]

> > +		/* Select the zone with the most free ram preferring high */
> > +		if ((free > most_free && (!high_zone(zone) || high_zone(z))) ||
> > +			(!high_zone(zone) && high_zone(z))) {
> > +				most_free = free;
> > +				zone = z;
> > +		}
> > +	}
>
> <stares at the above expression for three minutes>
>
> I think it'll always select ZONE_HIGHMEM no matter what.  Users of 1G x86
> boxes not happy.

Rethink in order..

> > +static inline unsigned long prefetch_pages(void)
> > +{
> > +	return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
> > +}
>
> I don't think this should be done in-kernel.  There's a nice script to
> start and stop laptop mode.  We can make this decision in that script.

Will be unnecessary as I'm blowing away the idea of scaling the amount of 
pages to prefetch.

> > +	if (last_free) {
> > +		if (temp_free + SWAP_CLUSTER_MAX < last_free) {
> > +			last_free = temp_free;
> > +			goto out;
> > +		}
> > +	} else
> > +		last_free = temp_free;
>
> What is the actual threshold rate here?
> SWAP_CLUSTER_MAX/(how fast your CPU is)?  Seems a bit vague?

temp_free is the total amount of free pages in all zones at the moment. 

If SWAP_CLUSTER_MAX pages have been allocated since we last called 
prefetch_suitable it returns that prefetching is unsuitable.
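
In code terms it is just this check in prefetch_suitable() (restating what is
in the patch):

	/*
	 * More than SWAP_CLUSTER_MAX pages were allocated somewhere since
	 * the last pass, so treat the vm as busy and don't prefetch.
	 */
	if (last_free && temp_free + SWAP_CLUSTER_MAX < last_free) {
		last_free = temp_free;
		return 0;
	}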

> > +	get_page_state(&ps);
>
> get_page_state() can be super-expensive.  How frequently is this called?

It will call this only if the system has been very idle for at least 5 seconds. 
However once it starts doing the actual prefetching it calls it once for every 
page that is prefetched, but only if the system remains idle. If there's pretty 
much anything else going on it won't hit this.

> > +		case TRICKLE_SUCCESS:
> > +			last_free = temp_free;
>
> This `last_free' thing is really confusing.  It's central to the algorithms
> yet its name is largely meaningless.  last_free *what*?  It seems to mean
> "total number of free pages on the last prefetching pass", yes?  Wanna
> think of a better name and a better comment for it?

Yes that's what it means. I'll have a renaming rethink and spray more comments 
around.

Thanks very much for your patch reviewing and comments!

I'll rework it, thrash it around a bit here and resubmit.

Cheers,
Con


* Re: [ck] Re: [PATCH] mm: implement swap prefetching
  2006-02-07  1:29   ` Con Kolivas
@ 2006-02-07  1:32     ` Con Kolivas
  2006-02-07  1:39     ` Andrew Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-07  1:32 UTC (permalink / raw)
  To: ck; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, 7 Feb 2006 12:29 pm, Con Kolivas wrote:
> On Tue, 7 Feb 2006 11:38 am, Andrew Morton wrote:
> > Con Kolivas <kernel@kolivas.org> wrote:
> > > +	if (unlikely(!spin_trylock(&swapped.lock)))
> > > +		goto out;
> >
> > hm, spin_trylock() should internally do unlikely(), but it doesn't. 
> > (It's a bit of a mess, too).
>
> Good point. Perhaps I should submit a separate patch for this instead.

A quick look at this code made me change my mind; there's heaps that could do 
with this treatment in spinlock.h. I'll let someone else tackle it.

Cheers,
Con


* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  0:38 ` Andrew Morton
  2006-02-07  1:29   ` Con Kolivas
@ 2006-02-07  1:37   ` Bernd Eckenfels
  2006-02-08  3:29   ` [PATCH] mm: implement swap prefetching v21 Con Kolivas
  2 siblings, 0 replies; 26+ messages in thread
From: Bernd Eckenfels @ 2006-02-07  1:37 UTC (permalink / raw)
  To: linux-kernel

Andrew Morton <akpm@osdl.org> wrote:
>> +/*
>> + * How many pages to prefetch at a time. We prefetch SWAP_CLUSTER_MAX *
>> + * swap_prefetch per PREFETCH_INTERVAL, but prefetch ten times as much at a
>> + * time in laptop_mode to minimise the time we keep the disk spinning.
>> + */
>> +static inline unsigned long prefetch_pages(void)
>> +{
>> +     return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
>> +}
> 
> I don't think this should be done in-kernel.  There's a nice script to
> start and stop laptop mode.  We can make this decision in that script.

I agree, the default could depend on laptop mode, but if a value is
specified or changed by sysctl, it should not be automatically tuned (in
that case).

Gruss
Bernd


* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  1:29   ` Con Kolivas
  2006-02-07  1:32     ` [ck] " Con Kolivas
@ 2006-02-07  1:39     ` Andrew Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2006-02-07  1:39 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, ck

Con Kolivas <kernel@kolivas.org> wrote:
>
>  > > +/* Last total free pages */
>  > > +static unsigned long last_free = 0;
>  > > +static unsigned long temp_free = 0;
>  >
>  > Unneeded initialisation.
> 
>  Very first use of both of these variables depends on them being initialised.

All bss is initialised to zero at bootup.  So all the `= 0' is doing here
is moving these variables from .bss to .data, and taking up extra space in
vmlinux.
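
In other words (illustration only):

	/*
	 * Both start life as zero; the explicit initialiser only moves the
	 * variable from .bss into .data and grows vmlinux.
	 */
	static unsigned long last_free;		/* .bss, zeroed at boot */
	static unsigned long temp_free = 0;	/* .data, stored in the image */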



* Re: [PATCH] mm: implement swap prefetching
  2006-02-06 23:28 [PATCH] mm: implement swap prefetching Con Kolivas
  2006-02-07  0:38 ` Andrew Morton
@ 2006-02-07  3:08 ` Nick Piggin
  2006-02-07  3:29   ` Nick Piggin
  2006-02-07  4:02   ` Con Kolivas
  1 sibling, 2 replies; 26+ messages in thread
From: Nick Piggin @ 2006-02-07  3:08 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

Con Kolivas wrote:
> Andrew et al
> 
> I'm resubmitting the swap prefetching patch for inclusion in -mm and hopefully
> mainline. After you removed it from -mm there were some people that described
> the benefits it afforded their workloads. -mm being ever so slightly quieter
> at the moment please reconsider.
> 

I have a few comments.

prefetch_get_page is doing funny things with zones and nodes / zonelists
(eg. 'We don't prefetch into DMA' meaning something like 'this only works
on i386 and x86-64').

buffered_rmqueue, zone_statistics, etc really should stay static to
page_alloc.

It is completely non NUMA or cpuset-aware so it will likely allocate memory
in the wrong node, and will cause cpuset tasks that have their memory swapped
out to get it swapped in again on other parts of the machine (ie. breaks
cpuset's memory partitioning stuff).

It introduces global cacheline bouncing in pagecache allocation and removal
and page reclaim paths, also low watermark failure is quite common in normal
operation, so that is another global cacheline write in page allocation path.

Why bother with the trylocks? On many architectures they'll RMW the cacheline
anyway, so scalability isn't going to be much improved (or do you see big
lock contention?)

Aside from those issues, I think the idea is pretty cool... but there are
a few things that get to me:

- it is far more common to reclaim pages from other mappings (not swap).
   Shouldn't they have the same treatment? Would that be more worthwhile?

- when is a system _really_ idle? what if we want it to stay idle (eg.
   laptops)? what if some block devices or swap devices are busy, or
   memory is continually being allocated and freed and/or pagecache is
   being created and truncated but we still want to prefetch?

- for all its efforts, it will still interact with page reclaim by
   putting pages on the LRU and causing them to be cycled.

   - on bursty loads, this cycling could happen a bit. and more reads on
     the swap devices.

- in a sense it papers over page reclaim problems that shouldn't be so
   bad in the first place (midnight cron). On the other hand, I can see
   how it solves this issue nicely.


> Cheers,
> Con
> ---
> This patch implements swap prefetching when the vm is relatively idle and
> there is free ram available. The code is based on some early work by Thomas
> Schlichter.
> 

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  3:08 ` [PATCH] mm: implement swap prefetching Nick Piggin
@ 2006-02-07  3:29   ` Nick Piggin
  2006-02-07  4:02   ` Con Kolivas
  1 sibling, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2006-02-07  3:29 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

Nick Piggin wrote:

> It introduces global cacheline bouncing in pagecache allocation and removal

Sorry, not regular pagecache but only swapcache, which already has global
cachelines. Ignore that bit ;)

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  3:08 ` [PATCH] mm: implement swap prefetching Nick Piggin
  2006-02-07  3:29   ` Nick Piggin
@ 2006-02-07  4:02   ` Con Kolivas
  2006-02-07  5:00     ` Nick Piggin
  2006-02-08  4:46     ` Paul Jackson
  1 sibling, 2 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-07  4:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

On Tue, 7 Feb 2006 02:08 pm, Nick Piggin wrote:
> Con Kolivas wrote:
> > Andrew et al
> >
> > I'm resubmitting the swap prefetching patch for inclusion in -mm and
> > hopefully mainline. After you removed it from -mm there were some people
> > that described the benefits it afforded their workloads. -mm being ever
> > so slightly quieter at the moment please reconsider.
>
> I have a few comments.

Thanks.

> prefetch_get_page is doing funny things with zones and nodes / zonelists
> (eg. 'We don't prefetch into DMA' meaning something like 'this only works
> on i386 and x86-64').

Hrm? It's just a generic thing to do; I'm not sure I follow why it's i386 and 
x86-64 only. Every architecture has ZONE_NORMAL so it will prefetch there.

> buffered_rmqueue, zone_statistics, etc really should stay static to
> page_alloc.

I can have an even simpler version of buffered_rmqueue specifically for swap 
prefetch, but I didn't want to reproduce code unnecessarily, nor did I want a 
page allocator outside page_alloc.c or swap_prefetch only code placed in 
page_alloc. The higher level page allocators do too much and they test to see 
if we should reclaim (which we never want to do) or allocate too many pages. 
It is the only code "cost" when swap prefetch is configured off. I'm open to 
suggestions?

> It is completely non NUMA or cpuset-aware so it will likely allocate memory
> in the wrong node, and will cause cpuset tasks that have their memory
> swapped out to get it swapped in again on other parts of the machine (ie.
> breaks cpuset's memory partitioning stuff).
>
> It introduces global cacheline bouncing in pagecache allocation and removal
> and page reclaim paths, also low watermark failure is quite common in
> normal operation, so that is another global cacheline write in page
> allocation path.

None of these issues is going to remotely affect the target audience. If the issue is 
how scalable such a change can be then I cannot advocate making the code 
smart and complex enough to be numa and cpuset aware.. but then that's never 
going to be the target audience. It affects a particular class of user which 
happens to be quite a large population not affected by complex memory 
hardware.

> Why bother with the trylocks? On many architectures they'll RMW the
> cacheline anyway, so scalability isn't going to be much improved (or do you
> see big lock contention?)

Rather than scalability concerns per se the trylock is used as yet another 
(admittedly rarely hit) way of defining busy.

> Aside from those issues, I think the idea is pretty cool... but there
> are a few things that get to me:
>
> - it is far more common to reclaim pages from other mappings (not swap).
>    Shouldn't they have the same treatment? Would that be more worthwhile?

I don't know. Swap is the one that affects ordinary desktop users in magnitudes 
that embarrass perceived performance beyond belief. I didn't have any other 
uses for this code in mind.

> - when is a system _really_ idle? what if we want it to stay idle (eg.
>    laptops)? what if some block devices or swap devices are busy, or
>    memory is continually being allocated and freed and/or pagecache is
>    being created and truncated but we still want to prefetch?

The code is pretty aggressive at defining busy. It looks for pretty much all 
of those and it prefetches till it stops, then allows idle to occur again. 
Opting out of prefetching whenever there is doubt seems reasonable to me.

> - for all its efforts, it will still interact with page reclaim by
>    putting pages on the LRU and causing them to be cycled.
>
>    - on bursty loads, this cycling could happen a bit. and more reads on
>      the swap devices.

Theoretically yes I agree. The definition of busy that prevents it prefetching 
is so broad that this is not significant.

> - in a sense it papers over page reclaim problems that shouldn't be so
>    bad in the first place (midnight cron). On the other hand, I can see
>    how it solves this issue nicely.

I doubt any audience that will care about scalability and complex memory 
configurations would knowingly enable it so it costs them virtually nothing 
for the relatively unintrusive code to be there. It's configurable and helps 
a unique problem that affects most users who are not in the complex hardware 
group. I was not advocating it being enabled by default, but last time it was 
in -mm akpm suggested doing that to increase its testing - while in -mm.

Cheers,
Con


* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  4:02   ` Con Kolivas
@ 2006-02-07  5:00     ` Nick Piggin
  2006-02-07  6:02       ` Con Kolivas
  2006-02-08  4:46     ` Paul Jackson
  1 sibling, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2006-02-07  5:00 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

Con Kolivas wrote:
> On Tue, 7 Feb 2006 02:08 pm, Nick Piggin wrote:
> 
>>I have a few comments.
> 
> 
> Thanks.
> 

No problem!

> 
>>prefetch_get_page is doing funny things with zones and nodes / zonelists
>>(eg. 'We don't prefetch into DMA' meaning something like 'this only works
>>on i386 and x86-64').
> 
> 
> Hrm? It's just a generic thing to do; I'm not sure I follow why it's i386 and 
> x86-64 only. Every architecture has ZONE_NORMAL so it will prefetch there.
> 

I don't think every architecture has ZONE_NORMAL.

> 
>>buffered_rmqueue, zone_statistics, etc really should stay static to
>>page_alloc.
> 
> 
> I can have an even simpler version of buffered_rmqueue specifically for swap 
> prefetch, but I didn't want to reproduce code unnecessarily, nor did I want a 
> page allocator outside page_alloc.c or swap_prefetch only code placed in 
> page_alloc. The higher level page allocators do too much and they test to see 
> if we should reclaim (which we never want to do) or allocate too many pages. 
> It is the only code "cost" when swap prefetch is configured off. I'm open to 
> suggestions?
> 

If you omit __GFP_WAIT and already test the watermarks yourself it should
be OK.
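
i.e. something along these lines (sketch only; the exact mask is a guess at
what "omit __GFP_WAIT" ends up looking like):

	/*
	 * Go through the normal allocator but clear __GFP_WAIT (and the I/O
	 * flags) so it can never enter reclaim; a NULL return simply means
	 * "skip prefetching this time".
	 */
	page = alloc_pages_node(numa_node_id(),
			(GFP_HIGHUSER | __GFP_NOWARN) &
			~(__GFP_WAIT | __GFP_IO | __GFP_FS), 0);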

> 
>>It is completely non NUMA or cpuset-aware so it will likely allocate memory
>>in the wrong node, and will cause cpuset tasks that have their memory
>>swapped out to get it swapped in again on other parts of the machine (ie.
>>breaks cpuset's memory partitioning stuff).
>>
>>It introduces global cacheline bouncing in pagecache allocation and removal
>>and page reclaim paths, also low watermark failure is quite common in
>>normal operation, so that is another global cacheline write in page
>>allocation path.
> 
> 
> None of these issues is going to remotely affect the target audience. If the issue is 
> how scalable such a change can be then I cannot advocate making the code 
> smart and complex enough to be numa and cpuset aware.. but then that's never 
> going to be the target audience. It affects a particular class of user which 
> happens to be quite a large population not affected by complex memory 
> hardware.
> 

Workstations can have 2 or more dual core CPUs with multiple threads or NUMA
these days. Desktops and laptops will probably eventually gain more cores and
threads too.

> 
>>Why bother with the trylocks? On many architectures they'll RMW the
>>cacheline anyway, so scalability isn't going to be much improved (or do you
>>see big lock contention?)
> 
> 
> Rather than scalability concerns per se the trylock is used as yet another 
> (admittedly rarely hit) way of defining busy.
> 

They just seem to complicate the code for apparently little gain.

> 
>>Aside from those issues, I think the idea is pretty cool... but there
>>are a few things that get to me:
>>
>>- it is far more common to reclaim pages from other mappings (not swap).
>>   Shouldn't they have the same treatment? Would that be more worthwhile?
> 
> 
> I don't know. Swap is the one that affects ordinary desktop users in magnitudes 
> that embarrass perceived performance beyond belief. I didn't have any other 
> uses for this code in mind.
> 
> 
>>- when is a system _really_ idle? what if we want it to stay idle (eg.
>>   laptops)? what if some block devices or swap devices are busy, or
>>   memory is continually being allocated and freed and/or pagecache is
>>   being created and truncated but we still want to prefetch?
> 
> 
> The code is pretty aggressive at defining busy. It looks for pretty much all 
> of those and it prefetches till it stops, then allows idle to occur again. 
> Opting out of prefetching whenever there is doubt seems reasonable to me.
> 

What if you want to prefetch when there is slight activity going on though?
What if your pagecache has filled memory with useless stuff (which would appear
to be the case with updatedb). What if you don't want to prefetch in laptop
mode at all?

> 
>>- for all its efforts, it will still interact with page reclaim by
>>   putting pages on the LRU and causing them to be cycled.
>>
>>   - on bursty loads, this cycling could happen a bit. and more reads on
>>     the swap devices.
> 
> 
> Theoretically yes I agree. The definition of busy that prevents it prefetching 
> is so broad that this is not significant.
> 

Not if the workload is very bursty.

> 
>>- in a sense it papers over page reclaim problems that shouldn't be so
>>   bad in the first place (midnight cron). On the other hand, I can see
>>   how it solves this issue nicely.
> 
> 
> I doubt any audience that will care about scalability and complex memory 
> configurations would knowingly enable it so it costs them virtually nothing 
> for the relatively unintrusive code to be there. It's configurable and helps 
> a unique problem that affects most users who are not in the complex hardware 
> group. I was not advocating it being enabled by default, but last time it was 
> in -mm akpm suggested doing that to increase its testing - while in -mm.
> 

If it is in core mm then I would very much like to see it adhere to how
everything else works, and attempt to be scalable and generalised.

Any code in a core system is intrusive by definition because it simply
adds to the amount of work that needs to be done when maintaining the
thing or trying to understand how things work, debugging people's badly
behaving workloads, etc.

If it is going to be off by default, why couldn't they
echo 10 > /proc/sys/vm/swappiness rather than turning it on?

Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  5:00     ` Nick Piggin
@ 2006-02-07  6:02       ` Con Kolivas
  2006-02-07  6:51         ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Con Kolivas @ 2006-02-07  6:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

On Tue, 7 Feb 2006 04:00 pm, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Tue, 7 Feb 2006 02:08 pm, Nick Piggin wrote:
> >>prefetch_get_page is doing funny things with zones and nodes / zonelists
> >>(eg. 'We don't prefetch into DMA' meaning something like 'this only works
> >>on i386 and x86-64').
> >
> > Hrm? It's just a generic thing to do; I'm not sure I follow why it's i386
> > and x86-64 only. Every architecture has ZONE_NORMAL so it will prefetch
> > there.
>
> I don't think every architecture has ZONE_NORMAL.

!ZONE_DMA they all have, no?

> >>buffered_rmqueue, zone_statistics, etc really should to stay static to
> >>page_alloc.
> >
> > I can have an even simpler version of buffered_rmqueue specifically for
> > swap prefetch, but I didn't want to reproduce code unnecessarily, nor did
> > I want a page allocator outside page_alloc.c or swap_prefetch only code
> > placed in page_alloc. The higher level page allocators do too much and
> > they test to see if we should reclaim (which we never want to do) or
> > allocate too many pages. It is the only code "cost" when swap prefetch is
> > configured off. I'm open to suggestions?
>
> If you omit __GFP_WAIT and already test the watermarks yourself it should
> be OK.

Ok.
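
For concreteness, a minimal sketch (not the posted patch) of an allocation
along those lines against the 2.6.16 allocator; the v21 patch later in this
thread does the equivalent inside trickle_swap_cache_async():

#include <linux/mm.h>
#include <linux/swap.h>

/*
 * Sketch only: allocate a page for prefetching without __GFP_WAIT so the
 * allocator can never enter direct reclaim.  The caller is assumed to have
 * already checked the zone watermarks.
 */
static struct page *prefetch_alloc_page(void)
{
        struct zonelist *zonelist = NODE_DATA(numa_node_id())->node_zonelists +
                                        (GFP_HIGHUSER & GFP_ZONEMASK);

        /* Fails instead of reclaiming if memory got short since the check */
        return __alloc_pages(GFP_HIGHUSER & ~__GFP_WAIT, 0, zonelist);
}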

> >>It is completely non NUMA or cpuset-aware so it will likely allocate
> >> memory in the wrong node, and will cause cpuset tasks that have their
> >> memory swapped out to get it swapped in again on other parts of the
> >> machine (ie. breaks cpuset's memory partitioning stuff).
> >>
> >>It introduces global cacheline bouncing in pagecache allocation and
> >> removal and page reclaim paths, also low watermark failure is quite
> >> common in normal operation, so that is another global cacheline write in
> >> page allocation path.
> >
> > None of these issues is going to remotely affect the target audience. If the
> > issue is how scalable such a change can be then I cannot advocate making
> > the code smart and complex enough to be numa and cpuset aware.. but then
> > that's never going to be the target audience. It affects a particular
> > class of user which happens to be quite a large population not affected
> > by complex memory hardware.
>
> Workstations can have 2 or more dual core CPUs with multiple threads or
> NUMA these days. Desktops and laptops will probably eventually gain more
> cores and threads too.

While I am aware of the hardware changes out there I still doubt the 
scalability issues you're concerned about affect a desktop. The code cost and 
complexity will increase substantially, yet I'm not sure that will bring any 
gain to the targeted users.

> >>Why bother with the trylocks? On many architectures they'll RMW the
> >>cacheline anyway, so scalability isn't going to be much improved (or do
> >> you see big lock contention?)
> >
> > Rather than scalability concerns per se the trylock is used as yet
> > another (admittedly rarely hit) way of defining busy.
>
> They just seem to complicate the code for apparently little gain.

No biggie; I'll drop them.

> > The code is pretty aggressive at defining busy. It looks for pretty much
> > all of those and it prefetches till it stops then allowing idle to occur
> > again. Opting out of prefetching whenever there is doubt seems reasonable
> > to me.
>
> What if you want to prefetch when there is slight activity going on though?

I don't. I want this to not cost us anything during any activity.

> What if your pagecache has filled memory with useless stuff (which would
> appear to be the case with updatedb). 

There is no way the vm will ever be smart enough to say "this is crap, throw 
it out and prefetch some good stuff", so it doesn't matter.

> What if you don't want to prefetch in 
> laptop mode at all?

No problem; make it part of the laptop mode scripts to disable the sysctl.

> >>- for all its efforts, it will still interact with page reclaim by
> >>   putting pages on the LRU and causing them to be cycled.
> >>
> >>   - on bursty loads, this cycling could happen a bit. and more reads on
> >>     the swap devices.
> >
> > Theoretically yes I agree. The definition of busy that prevents prefetching
> > is so broad that it is not significant.
>
> Not if the workload is very bursty.

It's an either/or for prefetching; I don't see a workaround, just some sane 
balance.

> >>- in a sense it papers over page reclaim problems that shouldn't be so
> >>   bad in the first place (midnight cron). On the other hand, I can see
> >>   how it solves this issue nicely.
> >
> > I doubt any audience that will care about scalability and complex memory
> > configurations would knowingly enable it so it costs them virtually
> > nothing for the relatively unintrusive code to be there. It's
> > configurable and helps a unique problem that affects most users who are
> > not in the complex hardware group. I was not advocating it being enabled
> > by default, but last time it was in -mm akpm suggested doing that to
> > increase its testing - while in -mm.
>
> If it is in core mm then I would very much like to see it adhere to how
> everything else works, and attempt to be scalable and generalised.

I'm trying.

> Any code in a core system is intrusive by definition because it simply
> adds to the amount of work that needs to be done when maintaining the
> thing or trying to understand how things work, debugging people's badly
> behaving workloads, etc.

I'm open to code suggestions and appreciate any outside help.

> If it is going to be off by default, why couldn't they
> echo 10 > /proc/sys/vm/swappiness rather than turning it on?

Because we still swap no matter what the sysctl setting is, which makes it 
even more useful in my opinion for those who aggressively set this tunable.

Thanks,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  6:02       ` Con Kolivas
@ 2006-02-07  6:51         ` Nick Piggin
  2006-02-07 10:54           ` Con Kolivas
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2006-02-07  6:51 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

Con Kolivas wrote:
> On Tue, 7 Feb 2006 04:00 pm, Nick Piggin wrote:
> 
>>Con Kolivas wrote:
>>
>>>On Tue, 7 Feb 2006 02:08 pm, Nick Piggin wrote:
>>>
>>>>prefetch_get_page is doing funny things with zones and nodes / zonelists
>>>>(eg. 'We don't prefetch into DMA' meaning something like 'this only works
>>>>on i386 and x86-64').
>>>
>>>Hrm? It's just a generic thing to do; I'm not sure I follow why it's i386
>>>and x86-64 only. Every architecture has ZONE_NORMAL so it will prefetch
>>>there.
>>
>>I don't think every architecture has ZONE_NORMAL.
> 
> 
> !ZONE_DMA they all have, no?
> 

Don't think so. IIRC ppc64 has only ZONE_DMA, although it may have picked up
DMA32 now (/me boots the G5). IA64 I think has a 4GB ZONE_DMA so smaller
systems won't have any other zones.

On small memory systems, ZONE_DMA will be a significant portion of memory
too (but maybe you're not targeting them either).

>>If you omit __GFP_WAIT and already test the watermarks yourself it should
>>be OK.
> 
> 
> Ok.
> 
> 

Note, it may dip lower than we would like, but the watermark checking is
already completely racy, so that could happen anyway.

>>Workstations can have 2 or more dual core CPUs with multiple threads or
>>NUMA these days. Desktops and laptops will probably eventually gain more
>>cores and threads too.
> 
> 
> While I am aware of the hardware changes out there I still doubt the 
> scalability issues you're concerned about affect a desktop. The code cost and 
> complexity will increase substantially, yet I'm not sure that will bring any 
> gain to the targeted users.
> 

Possibly. Why wouldn't you want swap prefetching on servers though?
Especially on some kind of shell server, or other internet server
where load could be really varied.

> 
>>>>Why bother with the trylocks? On many architectures they'll RMW the
>>>>cacheline anyway, so scalability isn't going to be much improved (or do
>>>>you see big lock contention?)
>>>
>>>Rather than scalability concerns per se the trylock is used as yet
>>>another (admittedly rarely hit) way of defining busy.
>>
>>They just seem to complicate the code for apparently little gain.
> 
> 
> No biggie; I'll drop them.
> 

That's what I'd do for now. A concurrent spin_lock could hit right after
the trylock takes the lock anyway...

> 
>>>The code is pretty aggressive at defining busy. It looks for pretty much
>>>all of those and it prefetches till it stops then allowing idle to occur
>>>again. Opting out of prefetching whenever there is doubt seems reasonable
>>>to me.
>>
>>What if you want to prefetch when there is slight activity going on though?
> 
> 
> I don't. I want this to not cost us anything during any activity.
> 

So if you have say some networking running (p2p or something), then it
may not ever prefetch?

> 
>>What if your pagecache has filled memory with useless stuff (which would
>>appear to be the case with updatedb). 
> 
> 
> There is no way the vm will ever be smart enough to say "this is crap, throw 
> it out and prefetch some good stuff", so it doesn't matter.
> 

It can do a lot better about throwing out updatedb type stuff.

Actually I had thought the point of this was to page in stuff after the
updatedb run, but it would appear that it won't do this because updatedb
will leave the pagecache full...

>>>>- for all its efforts, it will still interact with page reclaim by
>>>>  putting pages on the LRU and causing them to be cycled.
>>>>
>>>>  - on bursty loads, this cycling could happen a bit. and more reads on
>>>>    the swap devices.
>>>
>>>Theoretically yes I agree. The definition of busy that prevents prefetching
>>>is so broad that it is not significant.
>>
>>Not if the workload is very bursty.
> 
> 
> It's an either/or for prefetching; I don't see a workaround, just some sane 
> balance.
> 

It makes improving the rest of the VM for desktop users harder, no matter
how sane. Though I can't deny it is potentially an improvement in itself.

>>Any code in a core system is intrusive by definition because it simply
>>adds to the amount of work that needs to be done when maintaining the
>>thing or trying to understand how things work, debugging people's badly
>>behaving workloads, etc.
> 
> 
> I'm open to code suggestions and appreciate any outside help.
> 

Hopefully you have a bit to go on. I still see difficult problems that
I'm not sure how to solve.

> 
>>If it is going to be off by default, why couldn't they
>>echo 10 > /proc/sys/vm/swappiness rather than turning it on?
> 
> 
> Because we still swap no matter what the sysctl setting is, which makes it 
> even more useful in my opinion for those who aggressively set this tunable.
> 

Sounds like we need to do more basic VM tuning as well.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  6:51         ` Nick Piggin
@ 2006-02-07 10:54           ` Con Kolivas
  2006-02-07 17:14             ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Con Kolivas @ 2006-02-07 10:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-mm, Andrew Morton, ck

On Tuesday 07 February 2006 17:51, Nick Piggin wrote:
> Con Kolivas wrote:
> > On Tue, 7 Feb 2006 04:00 pm, Nick Piggin wrote:
> >>Con Kolivas wrote:
> >>>On Tue, 7 Feb 2006 02:08 pm, Nick Piggin wrote:
> >>>>prefetch_get_page is doing funny things with zones and nodes /
> >>>> zonelists (eg. 'We don't prefetch into DMA' meaning something like
> >>>> 'this only works on i386 and x86-64').
> >>>
> >>>Hrm? It's just a generic thing to do; I'm not sure I follow why it's
> >>> i386 and x86-64 only. Every architecture has ZONE_NORMAL so it will
> >>> prefetch there.
> >>
> >>I don't think every architecture has ZONE_NORMAL.
> >
> > !ZONE_DMA they all have, no?
>
> Don't think so. IIRC ppc64 has only ZONE_DMA, although it may have picked up
> DMA32 now (/me boots the G5). 

/me looks around desperately for some hardware he has that Nick might not and 
sees only a VIC20 and decides this won't represent him very well in the 
underwear bulge stakes

Andrew I think I see why your G5 didn't see any benefit with swap prefetching.

> >>Workstations can have 2 or more dual core CPUs with multiple threads or
> >>NUMA these days. Desktops and laptops will probably eventually gain more
> >>cores and threads too.
> >
> > While I am aware of the hardware changes out there I still doubt the
> > scalability issues you're concerned about affect a desktop. The code cost
> > and complexity will increase substantially, yet I'm not sure that will
> > bring any gain to the targeted users.
>
> Possibly. Why wouldn't you want swap prefetching on servers though?
> Especially on some kind of shell server, or other internet server
> where load could be really varied.

One of the users describing benefit from the current patch runs it on a 
multiuser X server where the time taken for login is substantially reduced 
after another user has logged out. The fact that the pages are back in ram, 
be it on the wrong numa node or not, far outweighs the speed disadvantage of 
reading from swap. I don't imagine that having per pgdat sets of 
swap_prefetch data is worth it. Once we've hit swap we really bugger it all 
up anyway. I'm not even sure what exactly you want swap prefetching to do 
here instead? I could make the accounting more numa aware but the accounting 
doesn't need to be remotely accurate, just conservative.

> >>What if you want to prefetch when there is slight activity going on
> >> though?
> >
> > I don't. I want this to not cost us anything during any activity.
>
> So if you have say some networking running (p2p or something), then it
> may not ever prefetch?

Where to draw the line could be debated to death. The difference between a p2p 
app running slowly in the background and a low bandwidth samba server would 
be indistinguishable. Avoiding swap prefetching with any activity is 
definitely going to do the least harm regardless of the workload.

> >>What if your pagecache has filled memory with useless stuff (which would
> >>appear to be the case with updatedb).
> >
> > There is no way the vm will ever be smart enough to say "this is crap,
> > throw it out and prefetch some good stuff", so it doesn't matter.
>
> It can do a lot better about throwing out updatedb type stuff.

Indeed some of the clock-pro patches certainly seem to be heading that way. A 
pipe dream is to get some variant of those included and have the clock-pro 
swapped-page data select the prefetched pages more intelligently than simple 
most-recently-used ordering. Clearly that can wait since even our VM doesn't 
have it yet.

> Actually I had thought the point of this was to page in stuff after the
> updatedb run, but it would appear that it won't do this because updatedb
> will leave the pagecache full...

The most common scenario where it does help is after running some memory 
intensive hog and then stopping the memory intensive work. Opening and 
closing openoffice is a beautiful example. Another real world one that hits 
me regularly is printing a high resolution picture, which needs heaps of ram 
but only for a short time. After that, most everything else has been swapped 
out, and I find swap prefetching helps massively here. Some people's machines 
seem to swap easily just copying an ISO image or encoding a video.

> > I'm open to code suggestions and appreciate any outside help.
>
> Hopefully you have a bit to go on. 

I do and thank you for your comments.

> I still see difficult problems that 
> I'm not sure how can be solved.

Haha I seem to _only_ ever hack on things that you describe as having 
difficult problems :P

> >>If it is going to be off by default, why couldn't they
> >>echo 10 > /proc/sys/vm/swappiness rather than turning it on?
> >
> > Because we still swap no matter what the sysctl setting is, which makes
> > it even more useful in my opinion for those who aggressively set this
> > tunable.
>
> Sounds like we need to do more basic VM tuning as well.

That would be a given at any time in the past, present or future.

Cheers,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-07 10:54           ` Con Kolivas
@ 2006-02-07 17:14             ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2006-02-07 17:14 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, linux-kernel, linux-mm, ck

Con Kolivas <kernel@kolivas.org> wrote:
>
>  Andrew I think I see why your G5 didn't see any benefit with swap prefetching.

No, this machine is x86 w/ 2GB.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] mm: implement swap prefetching v21
  2006-02-07  0:38 ` Andrew Morton
  2006-02-07  1:29   ` Con Kolivas
  2006-02-07  1:37   ` Bernd Eckenfels
@ 2006-02-08  3:29   ` Con Kolivas
  2006-02-08  3:49     ` [ck] " Con Kolivas
  2 siblings, 1 reply; 26+ messages in thread
From: Con Kolivas @ 2006-02-08  3:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, ck, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 22247 bytes --]

Ok here is a rewrite incorporating many of the suggested changes by Andrew and
Nick (thanks both for comments). The numa and cpuset issues Nick brought up
I have not tackled (yet?)

Cheers,
Con
---
This patch implements swap prefetching when the vm is relatively idle and
there is free ram available. The code is based on some early work by Thomas
Schlichter.

This stores swapped entries in a list ordered most recently used and in a
radix tree. It starts a low priority kernel thread running at nice 19 to do
the prefetching at a later stage.

Once pages have been added to the swapped list, a timer is started, testing
for conditions suitable to prefetch swap pages every 5 seconds. Suitable
conditions are defined as lack of swapping out or in any pages, and no
watermark tests failing. Significant amounts of dirtied ram and changes in
free ram representing disk writes or reads also prevent prefetching.

It then checks that we have spare ram, looking for at least 3 * pages_high free
per zone, and if that succeeds it will prefetch pages from swap into the swap
cache. Pages are prefetched until the list is empty or the vm is seen as busy
according to the previously described criteria.

The pages are copied to swap cache and kept on backing store. This allows
pressure on either physical ram or swap to readily find free pages without
further I/O.

Prefetching can be enabled/disabled via the tunable in 
/proc/sys/vm/swap_prefetch initially set to 1 (enabled).

In testing on modern pc hardware this results in wall-clock time activation of
the firefox browser speeding up 5 fold after a worst case complete swap-out
of the browser on a static web page.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
 
 Documentation/sysctl/vm.txt |    8
 include/linux/swap.h        |   29 +++
 include/linux/sysctl.h      |    1
 init/Kconfig                |   22 ++
 kernel/sysctl.c             |   10 +
 mm/Makefile                 |    1
 mm/page_alloc.c             |   10 -
 mm/swap.c                   |    3
 mm/swap_prefetch.c          |  378 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c             |   10 +
 mm/vmscan.c                 |    5
 11 files changed, 474 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc2.orig/Documentation/sysctl/vm.txt	2006-02-07 11:02:18.000000000 +1100
+++ linux-2.6.16-rc2/Documentation/sysctl/vm.txt	2006-02-08 11:23:03.000000000 +1100
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
 - drop-caches
 - zone_reclaim_mode
 - zone_reclaim_interval
+- swap_prefetch
 
 ==============================================================
 
@@ -178,3 +179,10 @@ Time is set in seconds and set by defaul
 Reduce the interval if undesired off node allocations occur. However, too
 frequent scans will have a negative impact onoff node allocation performance.
 
+==============================================================
+
+swap_prefetch
+
+This enables or disables the swap prefetching feature.
+
+The default value is 1.
Index: linux-2.6.16-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/swap.h	2006-02-07 11:02:38.000000000 +1100
+++ linux-2.6.16-rc2/include/linux/swap.h	2006-02-08 11:30:06.000000000 +1100
@@ -214,6 +214,34 @@ extern int shmem_unuse(swp_entry_t entry
 
 extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 
+#ifdef CONFIG_SWAP_PREFETCH
+/* only used by swap prefetch externally */
+/*	mm/swap_prefetch.c */
+extern void prepare_swap_prefetch(void);
+extern void add_to_swapped_list(unsigned long index);
+extern void remove_from_swapped_list(unsigned long index);
+extern void delay_swap_prefetch(void);
+extern int swap_prefetch;
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(unsigned long index)
+{
+}
+
+static inline void prepare_swap_prefetch(void)
+{
+}
+
+static inline void remove_from_swapped_list(unsigned long index)
+{
+}
+
+static inline void delay_swap_prefetch(void)
+{
+}
+
+#endif	/* CONFIG_SWAP_PREFETCH */
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
@@ -235,6 +263,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
Index: linux-2.6.16-rc2/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/sysctl.h	2006-02-07 11:02:38.000000000 +1100
+++ linux-2.6.16-rc2/include/linux/sysctl.h	2006-02-08 11:24:43.000000000 +1100
@@ -184,6 +184,7 @@ enum
 	VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+	VM_SWAP_PREFETCH=33,	/* swap prefetch */
 };
 
 
Index: linux-2.6.16-rc2/init/Kconfig
===================================================================
--- linux-2.6.16-rc2.orig/init/Kconfig	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/init/Kconfig	2006-02-08 11:26:24.000000000 +1100
@@ -103,6 +103,28 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  What this will do on workstations is slowly bring back applications
+	  that have swapped out after memory intensive workloads back into
+	  physical ram if you have free ram at a later stage and the machine
+	  is relatively idle. This means that when you come back to your
+	  computer after leaving it idle for a while, applications will come
+	  to life faster. Note that your swap usage will appear to increase
+	  but these are cached pages, can be dropped freely by the vm, and it
+	  should stabilise around 50% swap usage maximum.
+
+	  Workstations and multiuser workstation servers will most likely want
+	  to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6.16-rc2/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc2.orig/kernel/sysctl.c	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/kernel/sysctl.c	2006-02-08 13:51:10.000000000 +1100
@@ -891,6 +891,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= VM_SWAP_PREFETCH,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.16-rc2/mm/Makefile
===================================================================
--- linux-2.6.16-rc2.orig/mm/Makefile	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/mm/Makefile	2006-02-08 11:21:09.000000000 +1100
@@ -13,6 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o util.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 11:29:47.000000000 +1100
@@ -833,7 +833,7 @@ int zone_watermark_ok(struct zone *z, in
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		goto out_failed;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -842,9 +842,15 @@ int zone_watermark_ok(struct zone *z, in
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			goto out_failed;
 	}
+
 	return 1;
+out_failed:
+	/* Swap prefetching is delayed if any watermark is low */
+	delay_swap_prefetch();
+
+	return 0;
 }
 
 /*
Index: linux-2.6.16-rc2/mm/swap.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/swap.c	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/mm/swap.c	2006-02-08 11:30:28.000000000 +1100
@@ -502,5 +502,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	prepare_swap_prefetch();
+
 	hotcpu_notifier(cpu_swap_callback, 0);
 }
Index: linux-2.6.16-rc2/mm/swap_prefetch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc2/mm/swap_prefetch.c	2006-02-08 13:22:44.000000000 +1100
@@ -0,0 +1,378 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@kolivas.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+
+/* Time to delay prefetching if vm is busy or prefetching unsuccessful */
+#define PREFETCH_DELAY	(HZ * 5)
+
+/* sysctl - enable/disable swap prefetching */
+int swap_prefetch __read_mostly = 1;
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	kmem_cache_t		*cache;		/* Of struct swapped_entry */
+};
+
+struct swapped_entry {
+	swp_entry_t		swp_entry;
+	struct list_head	swapped_list;
+};
+
+static struct swapped_root swapped = {
+	.busy 		= 0,			/* Any vm activity */
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+	.count 		= 0,			/* Number of swapped entries */
+};
+
+static task_t *kprefetchd_task;
+
+/* Max mapped we will prefetch to */
+static unsigned long mapped_limit __read_mostly;
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_swap_prefetch(void)
+{
+	long mem = nr_free_pagecache_pages();
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL, NULL);
+
+	/* Set max number of entries to size of physical ram */
+	swapped.maxcount = mem;
+	/* Set maximum amount of mapped pages to prefetch to 2/3 ram */
+	mapped_limit = mem / 3 * 2;
+}
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_swap_prefetch(void)
+{
+	__set_bit(0, &swapped.busy);
+}
+
+/*
+ * Drop behind accounting which keeps a list of the most recently used swap
+ * entries.
+ */
+void add_to_swapped_list(unsigned long index)
+{
+	struct swapped_entry *entry;
+
+	spin_lock(&swapped.lock);
+	if (swapped.count >= swapped.maxcount) {
+		/*
+		 * We limit the number of entries to the size of physical ram.
+		 * Once the number of entries exceeds this we start removing
+		 * the least recently used entries.
+		 */
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	entry->swp_entry.val = index;
+
+	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
+		/*
+		 * If this is the first entry, kprefetchd needs to be
+		 * (re)started
+		 */
+		if (list_empty(&swapped.list))
+			wake_up_process(kprefetchd_task);
+		list_add(&entry->swapped_list, &swapped.list);
+		swapped.count++;
+	}
+
+out_locked:
+	spin_unlock(&swapped.lock);
+	return;
+}
+
+/*
+ * Cheaper to not spin on the lock and remove the entry lazily via
+ * add_to_swap_cache when we hit it in trickle_swap_cache_async
+ */
+void remove_from_swapped_list(unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&swapped.lock, flags)))
+		return;
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del_init(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(swp_entry_t entry)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	struct zonelist *zonelist;
+	struct page *page = NULL;
+
+	read_lock(&swapper_space.tree_lock);
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock(&swapper_space.tree_lock);
+	if (page) {
+		remove_from_swapped_list(entry.val);
+		goto out;
+	}
+
+	/*
+	 * Use a default numa policy; anywhere in ram is better than on swap.
+	 * Keeping track of the original process's policy would be
+	 * prohibitively expensive.
+	 */
+	zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+		(GFP_HIGHUSER & GFP_ZONEMASK);
+
+	/*
+	 * Get a new page to read from swap. We have already checked the
+	 * watermarks so __alloc_pages will not call on reclaim.
+	 */
+	page = __alloc_pages(GFP_HIGHUSER & ~__GFP_WAIT, 0, zonelist);
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry))
+		/* Failed to add to swap cache */
+		goto out_release;
+
+	lru_cache_add(page);
+	if (unlikely(swap_readpage(NULL, page))) {
+		ret = TRICKLE_DELAY;
+		goto out_release;
+	}
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	return ret;
+}
+
+/*
+ * last_prefetch_free is the amount of free ram after a cycle of prefetching
+ * current_free is the amount of free ram on this cycle of checking
+ * prefetch_suitable.
+ */
+static unsigned long last_prefetch_free, current_free, prefetched_pages;
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+	struct page_state ps;
+	unsigned long limit;
+	struct zone *z;
+	int ret = 0;
+
+	/* Purposefully racy and might return false positive which is ok */
+	if (__test_and_clear_bit(0, &swapped.busy))
+		goto out;
+
+	current_free = 0;
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		unsigned long free;
+
+		if (!populated_zone(z))
+			continue;
+		free = z->free_pages;
+		if (z->pages_high * 3 > free)
+			goto out;
+		current_free += free;
+	}
+
+	/*
+	 * We check to see that pages are not being allocated elsewhere
+	 * at any significant rate implying any degree of memory pressure
+	 * (eg during file reads)
+	 */
+	if (last_prefetch_free) {
+		if (current_free + SWAP_CLUSTER_MAX < last_prefetch_free) {
+			last_prefetch_free = current_free;
+			goto out;
+		}
+	} else
+		last_prefetch_free = current_free;
+
+	/*
+	 * get_page_state is super expensive so we only perform it every
+	 * SWAP_CLUSTER_MAX prefetched_pages
+	 */
+	if (prefetched_pages % SWAP_CLUSTER_MAX)
+		goto out_prefetch;
+
+	get_page_state(&ps);
+
+	/* We shouldn't prefetch when we are doing writeback */
+	if (ps.nr_writeback)
+		goto out;
+
+	/*
+	 * >2/3 of the ram is mapped, swapcache or dirty, we need some free
+	 * for pagecache
+	 */
+	limit = ps.nr_mapped + ps.nr_slab + ps.nr_dirty + ps.nr_unstable +
+		total_swapcache_pages;
+	if (limit > mapped_limit)
+		goto out;
+
+out_prefetch:
+	/* Survived all that? Hooray we can prefetch! */
+	ret = 1;
+out:
+	return ret;
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implied we are low on memory or swapping in currently.
+ * Otherwise it runs until prefetch_suitable fails which occurs when the
+ * vm is busy, we prefetch to the watermark, or the list is empty.
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct swapped_entry *entry;
+
+	for ( ; ; ) {
+		enum trickle_return got_page;
+
+		if (!prefetch_suitable())
+			goto out;
+		spin_lock(&swapped.lock);
+		if (list_empty(&swapped.list)) {
+			spin_unlock(&swapped.lock);
+			ret = TRICKLE_FAILED;
+			goto out;
+		}
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		spin_unlock(&swapped.lock);
+
+		got_page = trickle_swap_cache_async(entry->swp_entry);
+		switch (got_page) {
+		case TRICKLE_FAILED:
+			break;
+		case TRICKLE_SUCCESS:
+			last_prefetch_free--;
+			prefetched_pages++;
+			break;
+		case TRICKLE_DELAY:
+			goto out;
+		}
+	}
+
+out:
+	if (prefetched_pages) {
+		lru_add_drain();
+		prefetched_pages = 0;
+	}
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+	do {
+		enum trickle_return prefetched;
+
+		try_to_freeze();
+
+		/*
+		 * TRICKLE_FAILED implies no entries left - we do not schedule
+		 * a wakeup, and further delay the next one.
+		 */
+		prefetched = trickle_swap();
+		switch (prefetched) {
+		case TRICKLE_SUCCESS:
+		case TRICKLE_DELAY:
+			last_prefetch_free = 0;
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+			break;
+		case TRICKLE_FAILED:
+			last_prefetch_free = 0;
+			schedule_timeout_interruptible(MAX_SCHEDULE_TIMEOUT);
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+			break;
+		}
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
Index: linux-2.6.16-rc2/mm/swap_state.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/swap_state.c	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/mm/swap_state.c	2006-02-08 11:33:05.000000000 +1100
@@ -81,6 +81,7 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
+			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -94,11 +95,12 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
 	if (!swap_duplicate(entry)) {
+		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
@@ -147,6 +149,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_swap_prefetch();
+
 	if (!PageLocked(page))
 		BUG();
 
@@ -320,6 +325,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_swap_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
Index: linux-2.6.16-rc2/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/vmscan.c	2006-02-07 11:02:39.000000000 +1100
+++ linux-2.6.16-rc2/mm/vmscan.c	2006-02-08 11:23:32.000000000 +1100
@@ -396,6 +396,7 @@ static int remove_mapping(struct address
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		add_to_swapped_list(swap.val);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
@@ -1406,6 +1407,8 @@ int try_to_free_pages(struct zone **zone
 	unsigned long lru_pages = 0;
 	int i;
 
+	delay_swap_prefetch();
+
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = !laptop_mode;
 	sc.may_swap = 1;
@@ -1758,6 +1761,8 @@ int shrink_all_memory(int nr_pages)
 		.reclaimed_slab = 0,
 	};
 
+	delay_swap_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 	for_each_pgdat(pgdat) {
 		int freed;

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [ck] [PATCH] mm: implement swap prefetching v21
  2006-02-08  3:29   ` [PATCH] mm: implement swap prefetching v21 Con Kolivas
@ 2006-02-08  3:49     ` Con Kolivas
  0 siblings, 0 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-08  3:49 UTC (permalink / raw)
  To: ck; +Cc: Andrew Morton, linux-mm, linux-kernel, Nick Piggin

On Wed, 8 Feb 2006 02:29 pm, Con Kolivas wrote:
> Ok here is a rewrite incorporating many of the suggested changes by Andrew
> and Nick (thanks both for comments). The numa and cpuset issues Nick
> brought up I have not tackled (yet?)

> +/* sysctl - enable/disable swap prefetching */
> +int swap_prefetch __read_mostly = 1;

Err I seem to have forgotten to actually use the enable/disable tunable now. 
Patch works fine otherwise.

Cheers,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-07  4:02   ` Con Kolivas
  2006-02-07  5:00     ` Nick Piggin
@ 2006-02-08  4:46     ` Paul Jackson
  2006-02-08  5:06       ` Con Kolivas
  1 sibling, 1 reply; 26+ messages in thread
From: Paul Jackson @ 2006-02-08  4:46 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, linux-kernel, linux-mm, akpm, ck

Con, responding to Nick:
> > It introduces global cacheline bouncing in pagecache allocation and removal
> > and page reclaim paths, also low watermark failure is quite common in
> > normal operation, so that is another global cacheline write in page
> > allocation path.
> 
> None of these issues is going to remotely affect the target audience. If the issue is 
> how scalable such a change can be then I cannot advocate making the code 
> smart and complex enough to be numa and cpuset aware.. but then that's never 
> going to be the target audience. It affects a particular class of user which 
> happens to be quite a large population not affected by complex memory 
> hardware.

How about only moving memory back to the Memory Node (zone) that it
came from?  And providing some call that Christoph Lameter's migration
code can call, to disable or fix this up, so you don't end up bringing
back pages on their pre-migration nodes?

Just honoring the memory node placement should be sufficient.  No need
to wrap your head around cpusets.
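
In code terms, something like the sketch below (illustrative names only, not
anyone's posted code; compare the store_swap_entry_node() helper in the later
posting further down this thread):

#include <linux/mm.h>
#include <linux/swap.h>

struct swapped_entry {
        swp_entry_t             swp_entry;      /* the swap slot */
        struct list_head        swapped_list;   /* MRU list linkage */
        int                     node;           /* node the page lived on */
};

/* Remember the origin node as the page heads out to swap */
static inline void record_swap_node(struct swapped_entry *entry,
                                        struct page *page)
{
        entry->node = page_to_nid(page);
}

/* Prefetch back through the origin node's zonelist, honouring placement */
static inline struct zonelist *prefetch_zonelist(struct swapped_entry *entry)
{
        return NODE_DATA(entry->node)->node_zonelists +
                        (GFP_HIGHUSER & GFP_ZONEMASK);
}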

If you don't do that, then consider disabling this thing entirely
if CONFIG_NUMA is enabled.  This swap prefetching sounds like it
could be a loose cannon in a NUMA box.

As for non-NUMA boxes, like my humble desktop PC, I would -love-
to have Firefox come back up faster in the morning.  I have a nightly
cron job that pushes everything out to swap, and it is slow going getting it
back.

The day will come (it has already gotten there for some of my
colleagues who are using a small Altix system for their desktop
software) when we want this prefetching for NUMA boxes too.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-08  4:46     ` Paul Jackson
@ 2006-02-08  5:06       ` Con Kolivas
  2006-02-08  5:13         ` Paul Jackson
  2006-02-08  5:33         ` Con Kolivas
  0 siblings, 2 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-08  5:06 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, linux-kernel, linux-mm, akpm, ck

On Wed, 8 Feb 2006 03:46 pm, Paul Jackson wrote:
> Con, responding to Nick:
> > > It introduces global cacheline bouncing in pagecache allocation and
> > > removal and page reclaim paths, also low watermark failure is quite
> > > common in normal operation, so that is another global cacheline write
> > > in page allocation path.
> >
> > None of these issues is going to remotely affect the target audience. If the
> > issue is how scalable such a change can be then I cannot advocate making
> > the code smart and complex enough to be numa and cpuset aware.. but then
> > that's never going to be the target audience. It affects a particular
> > class of user which happens to be quite a large population not affected
> > by complex memory hardware.
>
> How about only moving memory back to the Memory Node (zone) that it
> came from?  And providing some call that Christoph Lameters migration
> code can call, to disable or fix this up, so you don't end up bringing
> back pages on their pre-migration nodes?

Sounds good, and this is what I was hoping to be able to do; first I need to 
see the best time and place to get this information (and learn some more 
about the code).

> Just honoring the memory node placement should be sufficient.  No need
> to wrap your head around cpusets.

Phew. 

> If you don't do that, then consider disabling this thing entirely
> if CONFIG_NUMA is enabled.  This swap prefetching sounds like it
> could be a loose canon ball in a NUMA box.

That's probably a less satisfactory option since NUMA isn't that rare with the 
light numa of commodity hardware.

> As for non-NUMA boxes, like my humble desktop PC, I would -love-
> to have Firefox come back up faster in the morning.  I have a nightly
> cron jobs push everything out to swap, and it is slow going getting it
> back.
>
> The day will come (it has already gotten there for some of my
> colleagues who are using a small Altix system for their desktop
> software) when we want this prefetching for NUMA boxes too.

I do see that now. Thanks for your comments too.

Cheers,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-08  5:06       ` Con Kolivas
@ 2006-02-08  5:13         ` Paul Jackson
  2006-02-08  5:33         ` Con Kolivas
  1 sibling, 0 replies; 26+ messages in thread
From: Paul Jackson @ 2006-02-08  5:13 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, linux-kernel, linux-mm, akpm, ck

Con wrote:
> > If you don't do that, then consider disabling this thing entirely
> > if CONFIG_NUMA is enabled.  This swap prefetching sounds like it
> > could be a loose cannon in a NUMA box.
> 
> That's probably a less satisfactory option since NUMA isn't that rare with the 
> light numa of commodity hardware.

You're right -- my suggestion was not a good one.

I expect that the main distros are or will be shipping their stock PC
kernel with NUMA enabled.  Most of these kernels end up on exactly the
kind of system that is the target audience for swap prefetching.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: implement swap prefetching
  2006-02-08  5:06       ` Con Kolivas
  2006-02-08  5:13         ` Paul Jackson
@ 2006-02-08  5:33         ` Con Kolivas
  1 sibling, 0 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-08  5:33 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, linux-kernel, linux-mm, akpm, ck

On Wed, 8 Feb 2006 04:06 pm, Con Kolivas wrote:
> On Wed, 8 Feb 2006 03:46 pm, Paul Jackson wrote:
> > Con, responding to Nick:
> > > > It introduces global cacheline bouncing in pagecache allocation and
> > > > removal and page reclaim paths, also low watermark failure is quite
> > > > common in normal operation, so that is another global cacheline write
> > > > in page allocation path.
> > >
> > > None of these issues is going to remotely affect the target audience. If the
> > > issue is how scalable such a change can be then I cannot advocate
> > > making the code smart and complex enough to be numa and cpuset aware..
> > > but then that's never going to be the target audience. It affects a
> > > particular class of user which happens to be quite a large population
> > > not affected by complex memory hardware.
> >
> > How about only moving memory back to the Memory Node (zone) that it
> > came from?  And providing some call that Christoph Lameters migration
> > code can call, to disable or fix this up, so you don't end up bringing
> > back pages on their pre-migration nodes?
>
> Sounds good, and this is what I was hoping to be able to do; first I need
> to see the best time and place to get this information (and learn some more
> about the code).

Actually it's looking an awful lot like I should just use one thread per pgdat 
and have per node data. Given that, I should probably just make this a task 
for kswapd, since those are already per-pgdat threads - and the name isn't 
wrong either.

Cheers,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] mm: Implement swap prefetching
@ 2006-02-24 11:23 Con Kolivas
  0 siblings, 0 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-24 11:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux list, linux-mm, ck list

Resync unchanged with 2.6.16-rc4-mm2.

Cheers,
Con
---
This patch implements swap prefetching when the vm is relatively idle and
there is free ram available. The code is based on some preliminary code by
Thomas Schlichter.

This stores swapped entries in a list ordered most recently used and in a
radix tree. It starts a low priority kernel thread running at nice 19 to do
the prefetching at a later stage.

Once pages have been added to the swapped list, a timer is started, testing
for conditions suitable to prefetch swap pages every 5 seconds. Suitable
conditions are defined as lack of swapping out or in any pages, and no
watermark tests failing. Significant amounts of dirtied ram and changes in
free ram representing disk writes or reads also prevent prefetching.

It then checks that we have spare ram, looking for at least 3 * pages_high free
per zone, and if that succeeds it will prefetch pages from swap into the swap
cache. The pages are added to the tail of the inactive list to preserve LRU
ordering.

Pages are prefetched until the list is empty or the vm is seen as busy
according to the previously described criteria. Node data on numa is stored
with the entries and an appropriate zonelist based on this is used when
allocating ram.
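
As an illustrative sketch of that allocation step (assuming the
get_swap_entry_node() helper added to swap.h below and the 2.6.16
__alloc_pages() interface; the real function in the patch does more):

static struct page *prefetch_alloc_on_node(struct swapped_entry *entry)
{
        int node = get_swap_entry_node(entry);
        struct zonelist *zonelist = NODE_DATA(node)->node_zonelists +
                                        (GFP_HIGHUSER & GFP_ZONEMASK);

        /* No __GFP_WAIT: prefetching must never push the vm into reclaim */
        return __alloc_pages(GFP_HIGHUSER & ~__GFP_WAIT, 0, zonelist);
}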

The pages are copied to swap cache and kept on backing store. This allows
pressure on either physical ram or swap to readily find free pages without
further I/O.

Prefetching can be enabled/disabled via the tunable in 
/proc/sys/vm/swap_prefetch initially set to 1 (enabled).

Enabling laptop_mode disables swap prefetching to prevent unnecessary
spin ups.

In testing on modern pc hardware this results in wall-clock time activation of
the firefox browser speeding up 5 fold after a worst case complete swap-out
of the browser on a static web page.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

 Documentation/sysctl/vm.txt |   11 
 include/linux/mm_inline.h   |    7 
 include/linux/swap.h        |   55 ++++
 include/linux/sysctl.h      |    1 
 init/Kconfig                |   22 +
 kernel/sysctl.c             |   10 
 mm/Makefile                 |    1 
 mm/swap.c                   |   43 +++
 mm/swap_prefetch.c          |  496 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c             |   10 
 mm/vmscan.c                 |    5 
 11 files changed, 660 insertions(+), 1 deletion(-)

Index: linux-2.6.16-rc4-mm2/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc4-mm2.orig/Documentation/sysctl/vm.txt	2006-02-18 10:36:52.000000000 +1100
+++ linux-2.6.16-rc4-mm2/Documentation/sysctl/vm.txt	2006-02-24 22:16:31.000000000 +1100
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
 - drop-caches
 - zone_reclaim_mode
 - zone_reclaim_interval
+- swap_prefetch
 
 ==============================================================
 
@@ -178,3 +179,13 @@ Time is set in seconds and set by defaul
 Reduce the interval if undesired off node allocations occur. However, too
 frequent scans will have a negative impact onoff node allocation performance.
 
+==============================================================
+
+swap_prefetch
+
+This enables or disables the swap prefetching feature. When the virtual
+memory subsystem has been extremely idle for at least 5 seconds it will start
+copying back pages from swap into the swapcache and keep a copy in swap. In
+practice it can take many minutes before the vm is idle enough.
+
+The default value is 1.
Index: linux-2.6.16-rc4-mm2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc4-mm2.orig/include/linux/swap.h	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/include/linux/swap.h	2006-02-24 22:16:31.000000000 +1100
@@ -7,6 +7,7 @@
 #include <linux/mmzone.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -164,6 +165,7 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -214,6 +216,58 @@ extern int shmem_unuse(swp_entry_t entry
 
 extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 
+#ifdef CONFIG_SWAP_PREFETCH
+/* mm/swap_prefetch.c */
+extern int swap_prefetch;
+struct swapped_entry {
+	swp_entry_t		swp_entry;	/* The actual swap entry */
+	struct list_head	swapped_list;	/* Linked list of entries */
+#if MAX_NUMNODES > 1
+	int			node;		/* Node id */
+#endif
+} __attribute__((packed));
+
+static inline void store_swap_entry_node(struct swapped_entry *entry,
+	struct page *page)
+{
+#if MAX_NUMNODES > 1
+	entry->node = page_to_nid(page);
+#endif
+}
+
+static inline int get_swap_entry_node(struct swapped_entry *entry)
+{
+#if MAX_NUMNODES > 1
+	return entry->node;
+#else
+	return 0;
+#endif
+}
+
+extern void add_to_swapped_list(struct page *page);
+extern void remove_from_swapped_list(const unsigned long index);
+extern void delay_swap_prefetch(void);
+extern void prepare_swap_prefetch(void);
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(struct page *__unused)
+{
+}
+
+static inline void prepare_swap_prefetch(void)
+{
+}
+
+static inline void remove_from_swapped_list(const unsigned long __unused)
+{
+}
+
+static inline void delay_swap_prefetch(void)
+{
+}
+
+#endif	/* CONFIG_SWAP_PREFETCH */
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
@@ -235,6 +289,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
Index: linux-2.6.16-rc4-mm2/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc4-mm2.orig/include/linux/sysctl.h	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/include/linux/sysctl.h	2006-02-24 22:16:31.000000000 +1100
@@ -185,6 +185,7 @@ enum
 	VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+	VM_SWAP_PREFETCH=33,	/* swap prefetch */
 };
 
 
Index: linux-2.6.16-rc4-mm2/init/Kconfig
===================================================================
--- linux-2.6.16-rc4-mm2.orig/init/Kconfig	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/init/Kconfig	2006-02-24 22:16:31.000000000 +1100
@@ -92,6 +92,28 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  On workstations this will slowly bring applications that were
+	  swapped out after memory intensive workloads back into physical ram,
+	  provided there is free ram at a later stage and the machine is
+	  relatively idle. This means that when you come back to your computer
+	  after leaving it idle for a while, applications will come to life
+	  faster. Note that your swap usage will appear to increase, but these
+	  are cached pages that can be dropped freely by the vm, and it should
+	  stabilise at around 50% swap usage maximum.
+
+	  Workstations and multiuser workstation servers will most likely want
+	  to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6.16-rc4-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/kernel/sysctl.c	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/kernel/sysctl.c	2006-02-24 22:16:31.000000000 +1100
@@ -901,6 +901,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= VM_SWAP_PREFETCH,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.16-rc4-mm2/mm/Makefile
===================================================================
--- linux-2.6.16-rc4-mm2.orig/mm/Makefile	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm2/mm/Makefile	2006-02-24 22:16:31.000000000 +1100
@@ -13,6 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o util.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
Index: linux-2.6.16-rc4-mm2/mm/swap.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/mm/swap.c	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/mm/swap.c	2006-02-24 22:16:31.000000000 +1100
@@ -384,6 +384,46 @@ void __pagevec_lru_add_active(struct pag
 	pagevec_reinit(pvec);
 }
 
+static inline void __pagevec_lru_add_tail(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		add_page_to_inactive_list_tail(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+/*
+ * Function used uniquely to put pages back to the lru at the end of the
+ * inactive list to preserve the lru order. Currently only used by swap
+ * prefetch.
+ */
+void fastcall lru_cache_add_tail(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_tail(pvec);
+	put_cpu_var(lru_add_pvecs);
+}
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
@@ -537,5 +577,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	prepare_swap_prefetch();
+
 	hotcpu_notifier(cpu_swap_callback, 0);
 }
Index: linux-2.6.16-rc4-mm2/mm/swap_prefetch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc4-mm2/mm/swap_prefetch.c	2006-02-24 22:16:31.000000000 +1100
@@ -0,0 +1,496 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005-2006 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@kolivas.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+
+/*
+ * Time to delay prefetching if vm is busy or prefetching unsuccessful. There
+ * needs to be at least this duration of idle time meaning in practice it can
+ * be much longer
+ */
+#define PREFETCH_DELAY	(HZ * 5)
+
+/* sysctl - enable/disable swap prefetching */
+int swap_prefetch __read_mostly = 1;
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	kmem_cache_t		*cache;		/* Of struct swapped_entry */
+};
+
+static struct swapped_root swapped = {
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+};
+
+static task_t *kprefetchd_task;
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_swap_prefetch(void)
+{
+	if (!test_bit(0, &swapped.busy))
+		__set_bit(0, &swapped.busy);
+}
+
+/*
+ * Drop behind accounting which keeps a list of the most recently used swap
+ * entries.
+ */
+void add_to_swapped_list(struct page *page)
+{
+	struct swapped_entry *entry;
+	unsigned long index;
+	int wakeup;
+
+	if (!swap_prefetch)
+		return;
+
+	wakeup = 0;
+
+	spin_lock(&swapped.lock);
+	if (swapped.count >= swapped.maxcount) {
+		/*
+		 * We limit the number of entries to 2/3 of physical ram.
+		 * Once the number of entries exceeds this we start removing
+		 * the least recently used entries.
+		 */
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	index = page_private(page);
+	entry->swp_entry.val = index;
+	/*
+	 * On numa we need to store the node id to ensure that we prefetch to
+	 * the same node it came from.
+	 */
+	store_swap_entry_node(entry, page);
+
+	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
+		/*
+		 * If this is the first entry, kprefetchd needs to be
+		 * (re)started.
+		 */
+		if (!swapped.count)
+			wakeup = 1;
+		list_add(&entry->swapped_list, &swapped.list);
+		swapped.count++;
+	}
+
+out_locked:
+	spin_unlock(&swapped.lock);
+
+	/* Do the wakeup outside the lock to shorten lock hold time. */
+	if (wakeup)
+		wake_up_process(kprefetchd_task);
+
+	return;
+}
+
+/*
+ * Removes entries from the swapped_list. The radix tree allows us to quickly
+ * look up the entry from the index without having to iterate over the whole
+ * list.
+ */
+void remove_from_swapped_list(const unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	if (list_empty(&swapped.list))
+		return;
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del_init(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+/*
+ * prefetch_stats stores the free ram data of each node and this is used to
+ * determine if a node is suitable for prefetching into.
+ */
+struct prefetch_stats {
+	unsigned long	last_free[MAX_NUMNODES];
+	/* Free ram after a cycle of prefetching */
+	unsigned long	current_free[MAX_NUMNODES];
+	/* Free ram on this cycle of checking prefetch_suitable */
+	unsigned long	prefetch_watermark[MAX_NUMNODES];
+	/* Maximum amount we will prefetch to */
+	nodemask_t	prefetch_nodes;
+	/* Which nodes are currently suited to prefetching */
+	unsigned long	prefetched_pages;
+	/* Total pages we've prefetched on this wakeup of kprefetchd */
+};
+
+static struct prefetch_stats sp_stat;
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(const swp_entry_t entry,
+	const int node)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	struct page *page;
+
+	read_lock_irq(&swapper_space.tree_lock);
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock_irq(&swapper_space.tree_lock);
+	if (page) {
+		remove_from_swapped_list(entry.val);
+		goto out;
+	}
+
+	/*
+	 * Get a new page to read from swap. We have already checked the
+	 * watermarks so __alloc_pages will not call on reclaim.
+	 */
+	page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry)) {
+		/* Failed to add to swap cache */
+		goto out_release;
+	}
+
+	/* Add them to the tail of the inactive list to preserve LRU order */
+	lru_cache_add_tail(page);
+	if (unlikely(swap_readpage(NULL, page))) {
+		ret = TRICKLE_DELAY;
+		goto out_release;
+	}
+
+	sp_stat.prefetched_pages++;
+	sp_stat.last_free[node]--;
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	return ret;
+}
+
+static void clear_last_prefetch_free(void)
+{
+	int node;
+
+	/*
+	 * Reset the nodes suitable for prefetching to all nodes. We could
+	 * update the data to take into account memory hotplug if desired..
+	 */
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.last_free[node] = 0;
+}
+
+static void clear_current_prefetch_free(void)
+{
+	int node;
+
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.current_free[node] = 0;
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+	struct page_state ps;
+	unsigned long limit;
+	struct zone *z;
+	int node, ret = 0;
+
+	/* Purposefully racy and might return false positive which is ok */
+	if (__test_and_clear_bit(0, &swapped.busy))
+		goto out;
+
+	clear_current_prefetch_free();
+
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		unsigned long free;
+
+		if (!populated_zone(z))
+			continue;
+		node = z->zone_pgdat->node_id;
+
+		free = z->free_pages;
+		if (z->pages_high * 3 + z->lowmem_reserve[zone_idx(z)] > free) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+		sp_stat.current_free[node] += free;
+	}
+
+	/*
+	 * We iterate over each node testing to see if it is suitable for
+	 * prefetching and clear the nodemask if it is not.
+	 */
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		/*
+		 * We check to see that pages are not being allocated
+		 * elsewhere at any significant rate implying any
+		 * degree of memory pressure (eg during file reads)
+		 */
+		if (sp_stat.last_free[node]) {
+			if (sp_stat.current_free[node] + SWAP_CLUSTER_MAX <
+				sp_stat.last_free[node]) {
+					sp_stat.last_free[node] =
+						sp_stat.current_free[node];
+					node_clear(node,
+						sp_stat.prefetch_nodes);
+					continue;
+			}
+		} else
+			sp_stat.last_free[node] = sp_stat.current_free[node];
+
+		/*
+		 * get_page_state is super expensive so we only perform it
+		 * every SWAP_CLUSTER_MAX prefetched_pages
+		 */
+		if (sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)
+			continue;
+
+		get_page_state_node(&ps, node);
+
+		/* We shouldn't prefetch when we are doing writeback */
+		if (ps.nr_writeback) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+
+		/*
+		 * >2/3 of the ram on this node is mapped, slab, swapcache or
+		 * dirty, we need to leave some free for pagecache.
+		 * Note that currently nr_slab is inaccurate on numa because
+		 * nr_slab is incremented on the node doing the accounting
+		 * even if the slab is being allocated on a remote node. This
+		 * would be expensive to fix and not of great significance.
+		 */
+		limit = ps.nr_mapped + ps.nr_slab + ps.nr_dirty +
+			ps.nr_unstable + total_swapcache_pages;
+		if (limit > sp_stat.prefetch_watermark[node]) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+	}
+
+	if (nodes_empty(sp_stat.prefetch_nodes))
+		goto out;
+
+	/* Survived all that? Hooray we can prefetch! */
+	ret = 1;
+out:
+	return ret;
+}
+
+/*
+ * Get previous swapped entry when iterating over all entries. swapped.lock
+ * should be held and we should already ensure that entry exists.
+ */
+static inline struct swapped_entry *prev_swapped_entry
+	(struct swapped_entry *entry)
+{
+	return list_entry(entry->swapped_list.prev->prev,
+		struct swapped_entry, swapped_list);
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implies we are low on memory or currently swapping in.
+ * Otherwise it runs until prefetch_suitable fails which occurs when the
+ * vm is busy, we prefetch to the watermark, or the list is empty or we have
+ * iterated over all entries
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct swapped_entry *entry;
+
+	/*
+	 * If laptop_mode is enabled don't prefetch to avoid hard drives
+	 * doing unnecessary spin-ups
+	 */
+	if (!swap_prefetch || laptop_mode)
+		return ret;
+
+	entry = NULL;
+
+	for ( ; ; ) {
+		swp_entry_t swp_entry;
+		int node;
+
+		if (!prefetch_suitable())
+			break;
+
+		spin_lock(&swapped.lock);
+		if (list_empty(&swapped.list)) {
+			ret = TRICKLE_FAILED;
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		if (!entry) {
+			/*
+			 * This sets the entry for the first iteration. It
+			 * also is a safeguard against the entry disappearing
+			 * while the lock is not held.
+			 */
+			entry = list_entry(swapped.list.prev,
+				struct swapped_entry, swapped_list);
+		} else if (entry->swapped_list.prev == swapped.list.next) {
+			/*
+			 * If we have iterated over all entries and there are
+			 * still entries that weren't swapped out there may
+			 * be a reason we could not swap them back in so
+			 * delay attempting further prefetching.
+			 */
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		node = get_swap_entry_node(entry);
+		if (!node_isset(node, sp_stat.prefetch_nodes)) {
+			/*
+			 * We found an entry that belongs to a node that is
+			 * not suitable for prefetching so skip it.
+			 */
+			entry = prev_swapped_entry(entry);
+			spin_unlock(&swapped.lock);
+			continue;
+		}
+		swp_entry = entry->swp_entry;
+		entry = prev_swapped_entry(entry);
+		spin_unlock(&swapped.lock);
+
+		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
+			break;
+	}
+
+	if (sp_stat.prefetched_pages) {
+		lru_add_drain();
+		sp_stat.prefetched_pages = 0;
+	}
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+	do {
+		try_to_freeze();
+
+		/*
+		 * TRICKLE_FAILED implies no entries left - we do not schedule
+		 * a wakeup, and further delay the next one.
+		 */
+		if (trickle_swap() == TRICKLE_FAILED) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+		}
+		clear_last_prefetch_free();
+		schedule_timeout_interruptible(PREFETCH_DELAY);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_swap_prefetch(void)
+{
+	pg_data_t *pgdat;
+	int node;
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL, NULL);
+
+	/*
+	 * Set max number of entries to 2/3 the size of physical ram  as we
+	 * only ever prefetch to consume 2/3 of the ram.
+	 */
+	swapped.maxcount = nr_free_pagecache_pages() / 3 * 2;
+
+	for_each_pgdat(pgdat) {
+		unsigned long present;
+
+		present = pgdat->node_present_pages;
+		if (!present)
+			continue;
+		node = pgdat->node_id;
+		sp_stat.prefetch_watermark[node] += present / 3 * 2;
+	}
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
Index: linux-2.6.16-rc4-mm2/mm/swap_state.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/mm/swap_state.c	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm2/mm/swap_state.c	2006-02-24 22:16:31.000000000 +1100
@@ -81,6 +81,7 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
+			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -94,11 +95,12 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
 	if (!swap_duplicate(entry)) {
+		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
@@ -147,6 +149,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_swap_prefetch();
+
 	if (!PageLocked(page))
 		BUG();
 
@@ -320,6 +325,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_swap_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
Index: linux-2.6.16-rc4-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/mm/vmscan.c	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/mm/vmscan.c	2006-02-24 22:16:31.000000000 +1100
@@ -390,6 +390,7 @@ static int remove_mapping(struct address
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
@@ -1451,6 +1452,8 @@ unsigned long try_to_free_pages(struct z
 		.may_swap = 1,
 	};
 
+	delay_swap_prefetch();
+
 	inc_page_state(allocstall);
 
 	for (i = 0; zones[i] != NULL; i++) {
@@ -1794,6 +1797,8 @@ int shrink_all_memory(unsigned long nr_p
 		.reclaimed_slab = 0,
 	};
 
+	delay_swap_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 	for_each_pgdat(pgdat) {
 		int freed;
Index: linux-2.6.16-rc4-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.16-rc4-mm2.orig/include/linux/mm_inline.h	2006-02-24 22:16:03.000000000 +1100
+++ linux-2.6.16-rc4-mm2/include/linux/mm_inline.h	2006-02-24 22:16:31.000000000 +1100
@@ -14,6 +14,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	zone->nr_inactive++;
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: Implement swap prefetching
  2006-02-20 20:58   ` Con Kolivas
@ 2006-02-20 23:45     ` Michal Piotrowski
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Piotrowski @ 2006-02-20 23:45 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Mattia Dongili, linux kernel mailing list, Andrew Morton, ck list

On 20/02/06, Con Kolivas <kernel@kolivas.org> wrote:
> On Tuesday 21 February 2006 06:08, Mattia Dongili wrote:
> > TestSetPageLRU is gone in -mm (see mm-pagelru-no-testset.patch), you
> > should probably change it to
>
> Here's a respin with that change.
>
> Cheers,
> Con

Thanks!

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: Implement swap prefetching
  2006-02-20 19:08 ` Mattia Dongili
  2006-02-20 20:47   ` Con Kolivas
@ 2006-02-20 20:58   ` Con Kolivas
  2006-02-20 23:45     ` Michal Piotrowski
  1 sibling, 1 reply; 26+ messages in thread
From: Con Kolivas @ 2006-02-20 20:58 UTC (permalink / raw)
  To: Mattia Dongili
  Cc: linux kernel mailing list, Andrew Morton, ck list, Michal Piotrowski

On Tuesday 21 February 2006 06:08, Mattia Dongili wrote:
> TestSetPageLRU is gone in -mm (see mm-pagelru-no-testset.patch), you
> should probably change it to

Here's a respin with that change.

Cheers,
Con
---
This patch implements swap prefetching when the vm is relatively idle and
there is free ram available. The code is based on some preliminary code by
Thomas Schlichter.

This stores swapped entries in a list ordered by most recent use and in a
radix tree. It creates a low priority kernel thread running at nice 19 to do
the prefetching at a later stage.

Once pages have been added to the swapped list, the kernel thread tests every
5 seconds for conditions suitable for prefetching swap pages. Suitable
conditions are defined as no pages being swapped in or out, and no watermark
tests failing. Significant amounts of dirtied ram and changes in free ram
representing disk writes or reads also prevent prefetching.
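
As a rough illustration of that last test (a hypothetical standalone helper,
not code from the patch; the real check lives in prefetch_suitable() and works
on the per-node counters in sp_stat), a node whose free page count has dropped
by more than SWAP_CLUSTER_MAX pages since the previous sample is treated as
busy and excluded from prefetching:

static int node_recently_active(unsigned long last_free,
				unsigned long current_free,
				unsigned long cluster_max)
{
	/* No previous sample yet, so nothing to compare against */
	if (!last_free)
		return 0;
	/* Free ram shrank by more than one cluster: assume disk activity */
	return current_free + cluster_max < last_free;
}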

It then checks for spare ram, looking for at least 3 * pages_high free pages
per zone, and if that succeeds it prefetches pages from swap into the swap
cache. The pages are added to the tail of the inactive list to preserve LRU
ordering.
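
The per-zone test amounts to requiring free pages well above the reclaim
watermark. A minimal sketch of that comparison (the helper name is invented
here; in the patch the check sits inline in prefetch_suitable() and uses the
zone's pages_high and lowmem_reserve values):

static int zone_has_prefetch_headroom(unsigned long free_pages,
				      unsigned long pages_high,
				      unsigned long lowmem_reserve)
{
	/* Need at least 3 * pages_high free, on top of the lowmem reserve */
	return free_pages >= pages_high * 3 + lowmem_reserve;
}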

Pages are prefetched until the list is empty or the vm is seen as busy
according to the previously described criteria. On numa, the node each page
came from is stored with its entry and an appropriate zonelist based on it is
used when allocating ram.

The pages are copied to swap cache and kept on backing store. This allows
pressure on either physical ram or swap to readily find free pages without
further I/O.

Prefetching can be enabled/disabled via the /proc/sys/vm/swap_prefetch
tunable, which is initially set to 1 (enabled).
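
For example, the feature can be toggled at runtime by writing to that file; a
hypothetical userspace snippet (equivalent to a plain
"echo 0 > /proc/sys/vm/swap_prefetch"), not part of the patch:

#include <stdio.h>

int main(void)
{
	/* 0 disables swap prefetching, 1 (the default) enables it */
	FILE *f = fopen("/proc/sys/vm/swap_prefetch", "w");

	if (!f) {
		perror("/proc/sys/vm/swap_prefetch");
		return 1;
	}
	fputs("0\n", f);
	fclose(f);
	return 0;
}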

Enabling laptop_mode disables swap prefetching to prevent unnecessary
spin-ups.

In testing on modern pc hardware this speeds up the wall-clock activation
time of the firefox browser 5-fold after a worst-case complete swap-out of
the browser left on a static web page.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

 Documentation/sysctl/vm.txt |   11 
 include/linux/mm_inline.h   |    7 
 include/linux/swap.h        |   55 ++++
 include/linux/sysctl.h      |    1 
 init/Kconfig                |   22 +
 kernel/sysctl.c             |   10 
 mm/Makefile                 |    1 
 mm/swap.c                   |   43 +++
 mm/swap_prefetch.c          |  496 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c             |   10 
 mm/vmscan.c                 |    5 
 11 files changed, 660 insertions(+), 1 deletion(-)

Index: linux-2.6.16-rc4-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc4-mm1.orig/Documentation/sysctl/vm.txt	2006-02-18 10:36:52.000000000 +1100
+++ linux-2.6.16-rc4-mm1/Documentation/sysctl/vm.txt	2006-02-21 00:39:25.000000000 +1100
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
 - drop-caches
 - zone_reclaim_mode
 - zone_reclaim_interval
+- swap_prefetch
 
 ==============================================================
 
@@ -178,3 +179,13 @@ Time is set in seconds and set by defaul
 Reduce the interval if undesired off node allocations occur. However, too
 frequent scans will have a negative impact onoff node allocation performance.
 
+==============================================================
+
+swap_prefetch
+
+This enables or disables the swap prefetching feature. When the virtual
+memory subsystem has been extremely idle for at least 5 seconds it will start
+copying back pages from swap into the swapcache and keep a copy in swap. In
+practice it can take many minutes before the vm is idle enough.
+
+The default value is 1.
Index: linux-2.6.16-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/swap.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/swap.h	2006-02-21 00:39:25.000000000 +1100
@@ -7,6 +7,7 @@
 #include <linux/mmzone.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -164,6 +165,7 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -214,6 +216,58 @@ extern int shmem_unuse(swp_entry_t entry
 
 extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 
+#ifdef CONFIG_SWAP_PREFETCH
+/* mm/swap_prefetch.c */
+extern int swap_prefetch;
+struct swapped_entry {
+	swp_entry_t		swp_entry;	/* The actual swap entry */
+	struct list_head	swapped_list;	/* Linked list of entries */
+#if MAX_NUMNODES > 1
+	int			node;		/* Node id */
+#endif
+} __attribute__((packed));
+
+static inline void store_swap_entry_node(struct swapped_entry *entry,
+	struct page *page)
+{
+#if MAX_NUMNODES > 1
+	entry->node = page_to_nid(page);
+#endif
+}
+
+static inline int get_swap_entry_node(struct swapped_entry *entry)
+{
+#if MAX_NUMNODES > 1
+	return entry->node;
+#else
+	return 0;
+#endif
+}
+
+extern void add_to_swapped_list(struct page *page);
+extern void remove_from_swapped_list(const unsigned long index);
+extern void delay_swap_prefetch(void);
+extern void prepare_swap_prefetch(void);
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(struct page *__unused)
+{
+}
+
+static inline void prepare_swap_prefetch(void)
+{
+}
+
+static inline void remove_from_swapped_list(const unsigned long __unused)
+{
+}
+
+static inline void delay_swap_prefetch(void)
+{
+}
+
+#endif	/* CONFIG_SWAP_PREFETCH */
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
@@ -235,6 +289,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
Index: linux-2.6.16-rc4-mm1/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/sysctl.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/sysctl.h	2006-02-21 00:39:25.000000000 +1100
@@ -185,6 +185,7 @@ enum
 	VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+	VM_SWAP_PREFETCH=33,	/* swap prefetch */
 };
 
 
Index: linux-2.6.16-rc4-mm1/init/Kconfig
===================================================================
--- linux-2.6.16-rc4-mm1.orig/init/Kconfig	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/init/Kconfig	2006-02-21 00:39:25.000000000 +1100
@@ -92,6 +92,28 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  On workstations this will slowly bring applications that were
+	  swapped out after memory intensive workloads back into physical ram,
+	  provided there is free ram at a later stage and the machine is
+	  relatively idle. This means that when you come back to your computer
+	  after leaving it idle for a while, applications will come to life
+	  faster. Note that your swap usage will appear to increase, but these
+	  are cached pages that can be dropped freely by the vm, and it should
+	  stabilise at around 50% swap usage maximum.
+
+	  Workstations and multiuser workstation servers will most likely want
+	  to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6.16-rc4-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/kernel/sysctl.c	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/kernel/sysctl.c	2006-02-21 00:39:25.000000000 +1100
@@ -901,6 +901,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= VM_SWAP_PREFETCH,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.16-rc4-mm1/mm/Makefile
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/Makefile	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/Makefile	2006-02-21 00:39:25.000000000 +1100
@@ -13,6 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o util.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
Index: linux-2.6.16-rc4-mm1/mm/swap.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/swap.c	2006-02-21 00:38:56.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/swap.c	2006-02-21 07:55:40.000000000 +1100
@@ -384,6 +384,46 @@ void __pagevec_lru_add_active(struct pag
 	pagevec_reinit(pvec);
 }
 
+static inline void __pagevec_lru_add_tail(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		add_page_to_inactive_list_tail(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+/*
+ * Function used uniquely to put pages back to the lru at the end of the
+ * inactive list to preserve the lru order. Currently only used by swap
+ * prefetch.
+ */
+void fastcall lru_cache_add_tail(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_tail(pvec);
+	put_cpu_var(lru_add_pvecs);
+}
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
@@ -537,5 +577,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	prepare_swap_prefetch();
+
 	hotcpu_notifier(cpu_swap_callback, 0);
 }
Index: linux-2.6.16-rc4-mm1/mm/swap_prefetch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc4-mm1/mm/swap_prefetch.c	2006-02-21 07:55:33.000000000 +1100
@@ -0,0 +1,496 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005-2006 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@kolivas.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+
+/*
+ * Time to delay prefetching if vm is busy or prefetching unsuccessful. There
+ * needs to be at least this duration of idle time meaning in practice it can
+ * be much longer
+ */
+#define PREFETCH_DELAY	(HZ * 5)
+
+/* sysctl - enable/disable swap prefetching */
+int swap_prefetch __read_mostly = 1;
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	kmem_cache_t		*cache;		/* Of struct swapped_entry */
+};
+
+static struct swapped_root swapped = {
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+};
+
+static task_t *kprefetchd_task;
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_swap_prefetch(void)
+{
+	if (!test_bit(0, &swapped.busy))
+		__set_bit(0, &swapped.busy);
+}
+
+/*
+ * Drop behind accounting which keeps a list of the most recently used swap
+ * entries.
+ */
+void add_to_swapped_list(struct page *page)
+{
+	struct swapped_entry *entry;
+	unsigned long index;
+	int wakeup;
+
+	if (!swap_prefetch)
+		return;
+
+	wakeup = 0;
+
+	spin_lock(&swapped.lock);
+	if (swapped.count >= swapped.maxcount) {
+		/*
+		 * We limit the number of entries to 2/3 of physical ram.
+		 * Once the number of entries exceeds this we start removing
+		 * the least recently used entries.
+		 */
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	index = page_private(page);
+	entry->swp_entry.val = index;
+	/*
+	 * On numa we need to store the node id to ensure that we prefetch to
+	 * the same node it came from.
+	 */
+	store_swap_entry_node(entry, page);
+
+	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
+		/*
+		 * If this is the first entry, kprefetchd needs to be
+		 * (re)started.
+		 */
+		if (!swapped.count)
+			wakeup = 1;
+		list_add(&entry->swapped_list, &swapped.list);
+		swapped.count++;
+	}
+
+out_locked:
+	spin_unlock(&swapped.lock);
+
+	/* Do the wakeup outside the lock to shorten lock hold time. */
+	if (wakeup)
+		wake_up_process(kprefetchd_task);
+
+	return;
+}
+
+/*
+ * Removes entries from the swapped_list. The radix tree allows us to quickly
+ * look up the entry from the index without having to iterate over the whole
+ * list.
+ */
+void remove_from_swapped_list(const unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	if (list_empty(&swapped.list))
+		return;
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del_init(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+/*
+ * prefetch_stats stores the free ram data of each node and this is used to
+ * determine if a node is suitable for prefetching into.
+ */
+struct prefetch_stats {
+	unsigned long	last_free[MAX_NUMNODES];
+	/* Free ram after a cycle of prefetching */
+	unsigned long	current_free[MAX_NUMNODES];
+	/* Free ram on this cycle of checking prefetch_suitable */
+	unsigned long	prefetch_watermark[MAX_NUMNODES];
+	/* Maximum amount we will prefetch to */
+	nodemask_t	prefetch_nodes;
+	/* Which nodes are currently suited to prefetching */
+	unsigned long	prefetched_pages;
+	/* Total pages we've prefetched on this wakeup of kprefetchd */
+};
+
+static struct prefetch_stats sp_stat;
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(const swp_entry_t entry,
+	const int node)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	struct page *page;
+
+	read_lock_irq(&swapper_space.tree_lock);
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock_irq(&swapper_space.tree_lock);
+	if (page) {
+		remove_from_swapped_list(entry.val);
+		goto out;
+	}
+
+	/*
+	 * Get a new page to read from swap. We have already checked the
+	 * watermarks so __alloc_pages will not call on reclaim.
+	 */
+	page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry)) {
+		/* Failed to add to swap cache */
+		goto out_release;
+	}
+
+	/* Add them to the tail of the inactive list to preserve LRU order */
+	lru_cache_add_tail(page);
+	if (unlikely(swap_readpage(NULL, page))) {
+		ret = TRICKLE_DELAY;
+		goto out_release;
+	}
+
+	sp_stat.prefetched_pages++;
+	sp_stat.last_free[node]--;
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	return ret;
+}
+
+static void clear_last_prefetch_free(void)
+{
+	int node;
+
+	/*
+	 * Reset the nodes suitable for prefetching to all nodes. We could
+	 * update the data to take into account memory hotplug if desired..
+	 */
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.last_free[node] = 0;
+}
+
+static void clear_current_prefetch_free(void)
+{
+	int node;
+
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.current_free[node] = 0;
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+	struct page_state ps;
+	unsigned long limit;
+	struct zone *z;
+	int node, ret = 0;
+
+	/* Purposefully racy and might return false positive which is ok */
+	if (__test_and_clear_bit(0, &swapped.busy))
+		goto out;
+
+	clear_current_prefetch_free();
+
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		unsigned long free;
+
+		if (!populated_zone(z))
+			continue;
+		node = z->zone_pgdat->node_id;
+
+		free = z->free_pages;
+		if (z->pages_high * 3 + z->lowmem_reserve[zone_idx(z)] > free) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+		sp_stat.current_free[node] += free;
+	}
+
+	/*
+	 * We iterate over each node testing to see if it is suitable for
+	 * prefetching and clear the nodemask if it is not.
+	 */
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		/*
+		 * We check to see that pages are not being allocated
+		 * elsewhere at any significant rate implying any
+		 * degree of memory pressure (eg during file reads)
+		 */
+		if (sp_stat.last_free[node]) {
+			if (sp_stat.current_free[node] + SWAP_CLUSTER_MAX <
+				sp_stat.last_free[node]) {
+					sp_stat.last_free[node] =
+						sp_stat.current_free[node];
+					node_clear(node,
+						sp_stat.prefetch_nodes);
+					continue;
+			}
+		} else
+			sp_stat.last_free[node] = sp_stat.current_free[node];
+
+		/*
+		 * get_page_state is super expensive so we only perform it
+		 * every SWAP_CLUSTER_MAX prefetched_pages
+		 */
+		if (sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)
+			continue;
+
+		get_page_state_node(&ps, node);
+
+		/* We shouldn't prefetch when we are doing writeback */
+		if (ps.nr_writeback) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+
+		/*
+		 * >2/3 of the ram on this node is mapped, slab, swapcache or
+		 * dirty, we need to leave some free for pagecache.
+		 * Note that currently nr_slab is inaccurate on numa because
+		 * nr_slab is incremented on the node doing the accounting
+		 * even if the slab is being allocated on a remote node. This
+		 * would be expensive to fix and not of great significance.
+		 */
+		limit = ps.nr_mapped + ps.nr_slab + ps.nr_dirty +
+			ps.nr_unstable + total_swapcache_pages;
+		if (limit > sp_stat.prefetch_watermark[node]) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+	}
+
+	if (nodes_empty(sp_stat.prefetch_nodes))
+		goto out;
+
+	/* Survived all that? Hooray we can prefetch! */
+	ret = 1;
+out:
+	return ret;
+}
+
+/*
+ * Get previous swapped entry when iterating over all entries. swapped.lock
+ * should be held and we should already ensure that entry exists.
+ */
+static inline struct swapped_entry *prev_swapped_entry
+	(struct swapped_entry *entry)
+{
+	return list_entry(entry->swapped_list.prev->prev,
+		struct swapped_entry, swapped_list);
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implies we are low on memory or currently swapping in.
+ * Otherwise it runs until prefetch_suitable fails which occurs when the
+ * vm is busy, we prefetch to the watermark, or the list is empty or we have
+ * iterated over all entries
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct swapped_entry *entry;
+
+	/*
+	 * If laptop_mode is enabled don't prefetch to avoid hard drives
+	 * doing unnecessary spin-ups
+	 */
+	if (!swap_prefetch || laptop_mode)
+		return ret;
+
+	entry = NULL;
+
+	for ( ; ; ) {
+		swp_entry_t swp_entry;
+		int node;
+
+		if (!prefetch_suitable())
+			break;
+
+		spin_lock(&swapped.lock);
+		if (list_empty(&swapped.list)) {
+			ret = TRICKLE_FAILED;
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		if (!entry) {
+			/*
+			 * This sets the entry for the first iteration. It
+			 * also is a safeguard against the entry disappearing
+			 * while the lock is not held.
+			 */
+			entry = list_entry(swapped.list.prev,
+				struct swapped_entry, swapped_list);
+		} else if (entry->swapped_list.prev == swapped.list.next) {
+			/*
+			 * If we have iterated over all entries and there are
+			 * still entries that weren't swapped out there may
+			 * be a reason we could not swap them back in so
+			 * delay attempting further prefetching.
+			 */
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		node = get_swap_entry_node(entry);
+		if (!node_isset(node, sp_stat.prefetch_nodes)) {
+			/*
+			 * We found an entry that belongs to a node that is
+			 * not suitable for prefetching so skip it.
+			 */
+			entry = prev_swapped_entry(entry);
+			spin_unlock(&swapped.lock);
+			continue;
+		}
+		swp_entry = entry->swp_entry;
+		entry = prev_swapped_entry(entry);
+		spin_unlock(&swapped.lock);
+
+		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
+			break;
+	}
+
+	if (sp_stat.prefetched_pages) {
+		lru_add_drain();
+		sp_stat.prefetched_pages = 0;
+	}
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+	do {
+		try_to_freeze();
+
+		/*
+		 * TRICKLE_FAILED implies no entries left - we do not schedule
+		 * a wakeup, and further delay the next one.
+		 */
+		if (trickle_swap() == TRICKLE_FAILED) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+		}
+		clear_last_prefetch_free();
+		schedule_timeout_interruptible(PREFETCH_DELAY);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_swap_prefetch(void)
+{
+	pg_data_t *pgdat;
+	int node;
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL, NULL);
+
+	/*
+	 * Set max number of entries to 2/3 the size of physical ram  as we
+	 * only ever prefetch to consume 2/3 of the ram.
+	 */
+	swapped.maxcount = nr_free_pagecache_pages() / 3 * 2;
+
+	for_each_pgdat(pgdat) {
+		unsigned long present;
+
+		present = pgdat->node_present_pages;
+		if (!present)
+			continue;
+		node = pgdat->node_id;
+		sp_stat.prefetch_watermark[node] += present / 3 * 2;
+	}
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
Index: linux-2.6.16-rc4-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/swap_state.c	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/swap_state.c	2006-02-21 00:39:25.000000000 +1100
@@ -81,6 +81,7 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
+			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -94,11 +95,12 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
 	if (!swap_duplicate(entry)) {
+		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
@@ -147,6 +149,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_swap_prefetch();
+
 	if (!PageLocked(page))
 		BUG();
 
@@ -320,6 +325,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_swap_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
Index: linux-2.6.16-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/vmscan.c	2006-02-21 00:38:56.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/vmscan.c	2006-02-21 00:40:26.000000000 +1100
@@ -390,6 +390,7 @@ static int remove_mapping(struct address
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
@@ -1451,6 +1452,8 @@ unsigned long try_to_free_pages(struct z
 		.may_swap = 1,
 	};
 
+	delay_swap_prefetch();
+
 	inc_page_state(allocstall);
 
 	for (i = 0; zones[i] != NULL; i++) {
@@ -1794,6 +1797,8 @@ int shrink_all_memory(unsigned long nr_p
 		.reclaimed_slab = 0,
 	};
 
+	delay_swap_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 	for_each_pgdat(pgdat) {
 		int freed;
Index: linux-2.6.16-rc4-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/mm_inline.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/mm_inline.h	2006-02-21 00:39:25.000000000 +1100
@@ -14,6 +14,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	zone->nr_inactive++;
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: Implement swap prefetching
  2006-02-20 19:08 ` Mattia Dongili
@ 2006-02-20 20:47   ` Con Kolivas
  2006-02-20 20:58   ` Con Kolivas
  1 sibling, 0 replies; 26+ messages in thread
From: Con Kolivas @ 2006-02-20 20:47 UTC (permalink / raw)
  To: Mattia Dongili; +Cc: linux kernel mailing list, Andrew Morton, ck list

On Tuesday 21 February 2006 06:08, Mattia Dongili wrote:
> Hello Con,
>
> On Tue, Feb 21, 2006 at 12:44:51AM +1100, Con Kolivas wrote:
> > Unchanged heavily tested v27 implementation of swap prefetching resynced
> > with 2.6.16-rc4-mm1.
>
> I used your patches in the last 2 or 3 -mm kernels (since s-p-v24). It's
> been working good until now.
>
> > Please consider for -mm.
>
> Just one minor note:
> [...]

> > +		if (TestSetPageLRU(page))
> > +			BUG();
>
> TestSetPageLRU is gone in -mm (see mm-pagelru-no-testset.patch), you
> should probably change it to
>
> 		BUG_ON(PageLRU(page));
> 		SetPageLRU(page);

Thanks!

Cheers,
Con

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] mm: Implement swap prefetching
  2006-02-20 13:44 [PATCH] mm: Implement " Con Kolivas
@ 2006-02-20 19:08 ` Mattia Dongili
  2006-02-20 20:47   ` Con Kolivas
  2006-02-20 20:58   ` Con Kolivas
  0 siblings, 2 replies; 26+ messages in thread
From: Mattia Dongili @ 2006-02-20 19:08 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, Andrew Morton, ck list

Hello Con,

On Tue, Feb 21, 2006 at 12:44:51AM +1100, Con Kolivas wrote:
> Unchanged heavily tested v27 implementation of swap prefetching resynced with
> 2.6.16-rc4-mm1.

I used your patches in the last 2 or 3 -mm kernels (since s-p-v24). It's
been working good until now.

> Please consider for -mm.

Just one minor note:
[...]
> Index: linux-2.6.16-rc4-mm1/mm/swap.c
> ===================================================================
> --- linux-2.6.16-rc4-mm1.orig/mm/swap.c	2006-02-21 00:38:56.000000000 +1100
> +++ linux-2.6.16-rc4-mm1/mm/swap.c	2006-02-21 00:39:25.000000000 +1100
> @@ -384,6 +384,46 @@ void __pagevec_lru_add_active(struct pag
>  	pagevec_reinit(pvec);
>  }
>  
> +static inline void __pagevec_lru_add_tail(struct pagevec *pvec)
> +{
> +	int i;
> +	struct zone *zone = NULL;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		struct zone *pagezone = page_zone(page);
> +
> +		if (pagezone != zone) {
> +			if (zone)
> +				spin_unlock_irq(&zone->lru_lock);
> +			zone = pagezone;
> +			spin_lock_irq(&zone->lru_lock);
> +		}
> +		if (TestSetPageLRU(page))
> +			BUG();

TestSetPageLRU is gone in -mm (see mm-pagelru-no-testset.patch), you
should probably change it to

		BUG_ON(PageLRU(page));
		SetPageLRU(page);

ciao
-- 
mattia
:wq!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] mm: Implement swap prefetching
@ 2006-02-20 13:44 Con Kolivas
  2006-02-20 19:08 ` Mattia Dongili
  0 siblings, 1 reply; 26+ messages in thread
From: Con Kolivas @ 2006-02-20 13:44 UTC (permalink / raw)
  To: linux kernel mailing list; +Cc: Andrew Morton, ck list

Unchanged heavily tested v27 implementation of swap prefetching resynced with
2.6.16-rc4-mm1.

Please consider for -mm.

Cheers,
Con
---
This patch implements swap prefetching when the vm is relatively idle and
there is free ram available. The code is based on some preliminary code by
Thomas Schlichter.

This stores swapped entries in a list ordered by most recent use and in a
radix tree. It creates a low priority kernel thread running at nice 19 to do
the prefetching at a later stage.

Once pages have been added to the swapped list, the kernel thread tests every
5 seconds for conditions suitable for prefetching swap pages. Suitable
conditions are defined as no pages being swapped in or out, and no watermark
tests failing. Significant amounts of dirtied ram and changes in free ram
representing disk writes or reads also prevent prefetching.

It then checks for spare ram, looking for at least 3 * pages_high free pages
per zone, and if that succeeds it prefetches pages from swap into the swap
cache. The pages are added to the tail of the inactive list to preserve LRU
ordering.

Pages are prefetched until the list is empty or the vm is seen as busy
according to the previously described criteria. On numa, the node each page
came from is stored with its entry and an appropriate zonelist based on it is
used when allocating ram.

The pages are copied to swap cache and kept on backing store. This allows
pressure on either physical ram or swap to readily find free pages without
further I/O.

Prefetching can be enabled/disabled via the /proc/sys/vm/swap_prefetch
tunable, which is initially set to 1 (enabled).

Enabling laptop_mode disables swap prefetching to prevent unnecessary
spin-ups.

In testing on modern pc hardware this speeds up the wall-clock activation
time of the firefox browser 5-fold after a worst-case complete swap-out of
the browser left on a static web page.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

 Documentation/sysctl/vm.txt |   11 
 include/linux/mm_inline.h   |    7 
 include/linux/swap.h        |   55 ++++
 include/linux/sysctl.h      |    1 
 init/Kconfig                |   22 +
 kernel/sysctl.c             |   10 
 mm/Makefile                 |    1 
 mm/swap.c                   |   43 +++
 mm/swap_prefetch.c          |  496 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c             |   10 
 mm/vmscan.c                 |    5 
 11 files changed, 660 insertions(+), 1 deletion(-)

Index: linux-2.6.16-rc4-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.16-rc4-mm1.orig/Documentation/sysctl/vm.txt	2006-02-18 10:36:52.000000000 +1100
+++ linux-2.6.16-rc4-mm1/Documentation/sysctl/vm.txt	2006-02-21 00:39:25.000000000 +1100
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
 - drop-caches
 - zone_reclaim_mode
 - zone_reclaim_interval
+- swap_prefetch
 
 ==============================================================
 
@@ -178,3 +179,13 @@ Time is set in seconds and set by defaul
 Reduce the interval if undesired off node allocations occur. However, too
 frequent scans will have a negative impact onoff node allocation performance.
 
+==============================================================
+
+swap_prefetch
+
+This enables or disables the swap prefetching feature. When the virtual
+memory subsystem has been extremely idle for at least 5 seconds it will start
+copying back pages from swap into the swapcache and keep a copy in swap. In
+practice it can take many minutes before the vm is idle enough.
+
+The default value is 1.
Index: linux-2.6.16-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/swap.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/swap.h	2006-02-21 00:39:25.000000000 +1100
@@ -7,6 +7,7 @@
 #include <linux/mmzone.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -164,6 +165,7 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_tail(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -214,6 +216,58 @@ extern int shmem_unuse(swp_entry_t entry
 
 extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 
+#ifdef CONFIG_SWAP_PREFETCH
+/* mm/swap_prefetch.c */
+extern int swap_prefetch;
+struct swapped_entry {
+	swp_entry_t		swp_entry;	/* The actual swap entry */
+	struct list_head	swapped_list;	/* Linked list of entries */
+#if MAX_NUMNODES > 1
+	int			node;		/* Node id */
+#endif
+} __attribute__((packed));
+
+static inline void store_swap_entry_node(struct swapped_entry *entry,
+	struct page *page)
+{
+#if MAX_NUMNODES > 1
+	entry->node = page_to_nid(page);
+#endif
+}
+
+static inline int get_swap_entry_node(struct swapped_entry *entry)
+{
+#if MAX_NUMNODES > 1
+	return entry->node;
+#else
+	return 0;
+#endif
+}
+
+extern void add_to_swapped_list(struct page *page);
+extern void remove_from_swapped_list(const unsigned long index);
+extern void delay_swap_prefetch(void);
+extern void prepare_swap_prefetch(void);
+
+#else	/* CONFIG_SWAP_PREFETCH */
+static inline void add_to_swapped_list(struct page *__unused)
+{
+}
+
+static inline void prepare_swap_prefetch(void)
+{
+}
+
+static inline void remove_from_swapped_list(const unsigned long __unused)
+{
+}
+
+static inline void delay_swap_prefetch(void)
+{
+}
+
+#endif	/* CONFIG_SWAP_PREFETCH */
+
 #ifdef CONFIG_SWAP
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
@@ -235,6 +289,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page * lookup_swap_cache(swp_entry_t);
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
Index: linux-2.6.16-rc4-mm1/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/sysctl.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/sysctl.h	2006-02-21 00:39:25.000000000 +1100
@@ -185,6 +185,7 @@ enum
 	VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
 	VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+	VM_SWAP_PREFETCH=33,	/* swap prefetch */
 };
 
 
Index: linux-2.6.16-rc4-mm1/init/Kconfig
===================================================================
--- linux-2.6.16-rc4-mm1.orig/init/Kconfig	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/init/Kconfig	2006-02-21 00:39:25.000000000 +1100
@@ -92,6 +92,28 @@ config SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config SWAP_PREFETCH
+	bool "Support for prefetching swapped memory"
+	depends on SWAP
+	default y
+	---help---
+	  This option will allow the kernel to prefetch swapped memory pages
+	  when idle. The pages will be kept on both swap and in swap_cache
+	  thus avoiding the need for further I/O if either ram or swap space
+	  is required.
+
+	  On workstations this will slowly bring applications that were
+	  swapped out after memory intensive workloads back into physical
+	  ram, provided free ram is available at a later stage and the
+	  machine is relatively idle. This means that when you come back to
+	  your computer after leaving it idle for a while, applications will
+	  come to life faster. Note that your swap usage will appear to
+	  increase, but these are cached pages that can be dropped freely by
+	  the vm, and it should stabilise at around 50% swap usage maximum.
+
+	  Workstations and multiuser workstation servers will most likely want
+	  to say Y.
+
 config SYSVIPC
 	bool "System V IPC"
 	---help---
Index: linux-2.6.16-rc4-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/kernel/sysctl.c	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/kernel/sysctl.c	2006-02-21 00:39:25.000000000 +1100
@@ -901,6 +901,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+#ifdef CONFIG_SWAP_PREFETCH
+	{
+		.ctl_name	= VM_SWAP_PREFETCH,
+		.procname	= "swap_prefetch",
+		.data		= &swap_prefetch,
+		.maxlen		= sizeof(swap_prefetch),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.16-rc4-mm1/mm/Makefile
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/Makefile	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/Makefile	2006-02-21 00:39:25.000000000 +1100
@@ -13,6 +13,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   prio_tree.o util.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_PREFETCH) += swap_prefetch.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
Index: linux-2.6.16-rc4-mm1/mm/swap.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/swap.c	2006-02-21 00:38:56.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/swap.c	2006-02-21 00:39:25.000000000 +1100
@@ -384,6 +384,46 @@ void __pagevec_lru_add_active(struct pag
 	pagevec_reinit(pvec);
 }
 
+static inline void __pagevec_lru_add_tail(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		if (TestSetPageLRU(page))
+			BUG();
+		add_page_to_inactive_list_tail(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+/*
+ * Function used uniquely to put pages back to the lru at the end of the
+ * inactive list to preserve the lru order. Currently only used by swap
+ * prefetch.
+ */
+void fastcall lru_cache_add_tail(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_tail(pvec);
+	put_cpu_var(lru_add_pvecs);
+}
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
@@ -537,5 +577,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	prepare_swap_prefetch();
+
 	hotcpu_notifier(cpu_swap_callback, 0);
 }
Index: linux-2.6.16-rc4-mm1/mm/swap_prefetch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc4-mm1/mm/swap_prefetch.c	2006-02-21 00:39:25.000000000 +1100
@@ -0,0 +1,496 @@
+/*
+ * linux/mm/swap_prefetch.c
+ *
+ * Copyright (C) 2005-2006 Con Kolivas
+ *
+ * Written by Con Kolivas <kernel@kolivas.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/ioprio.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/writeback.h>
+
+/*
+ * Time to delay prefetching if vm is busy or prefetching unsuccessful. There
+ * needs to be at least this duration of idle time, meaning in practice it
+ * can be much longer.
+ */
+#define PREFETCH_DELAY	(HZ * 5)
+
+/* sysctl - enable/disable swap prefetching */
+int swap_prefetch __read_mostly = 1;
+
+struct swapped_root {
+	unsigned long		busy;		/* vm busy */
+	spinlock_t		lock;		/* protects all data */
+	struct list_head	list;		/* MRU list of swapped pages */
+	struct radix_tree_root	swap_tree;	/* Lookup tree of pages */
+	unsigned int		count;		/* Number of entries */
+	unsigned int		maxcount;	/* Maximum entries allowed */
+	kmem_cache_t		*cache;		/* Of struct swapped_entry */
+};
+
+static struct swapped_root swapped = {
+	.lock		= SPIN_LOCK_UNLOCKED,
+	.list  		= LIST_HEAD_INIT(swapped.list),
+	.swap_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
+};
+
+static task_t *kprefetchd_task;
+
+/*
+ * We check to see no part of the vm is busy. If it is this will interrupt
+ * trickle_swap and wait another PREFETCH_DELAY. Purposefully racy.
+ */
+inline void delay_swap_prefetch(void)
+{
+	if (!test_bit(0, &swapped.busy))
+		__set_bit(0, &swapped.busy);
+}
+
+/*
+ * Drop behind accounting which keeps a list of the most recently used swap
+ * entries.
+ */
+void add_to_swapped_list(struct page *page)
+{
+	struct swapped_entry *entry;
+	unsigned long index;
+	int wakeup;
+
+	if (!swap_prefetch)
+		return;
+
+	wakeup = 0;
+
+	spin_lock(&swapped.lock);
+	if (swapped.count >= swapped.maxcount) {
+		/*
+		 * We limit the number of entries to 2/3 of physical ram.
+		 * Once the number of entries exceeds this we start removing
+		 * the least recently used entries.
+		 */
+		entry = list_entry(swapped.list.next,
+			struct swapped_entry, swapped_list);
+		radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
+		list_del(&entry->swapped_list);
+		swapped.count--;
+	} else {
+		entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
+		if (unlikely(!entry))
+			/* bad, can't allocate more mem */
+			goto out_locked;
+	}
+
+	index = page_private(page);
+	entry->swp_entry.val = index;
+	/*
+	 * On numa we need to store the node id to ensure that we prefetch to
+	 * the same node it came from.
+	 */
+	store_swap_entry_node(entry, page);
+
+	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
+		/*
+		 * If this is the first entry, kprefetchd needs to be
+		 * (re)started.
+		 */
+		if (!swapped.count)
+			wakeup = 1;
+		list_add(&entry->swapped_list, &swapped.list);
+		swapped.count++;
+	}
+
+out_locked:
+	spin_unlock(&swapped.lock);
+
+	/* Do the wakeup outside the lock to shorten lock hold time. */
+	if (wakeup)
+		wake_up_process(kprefetchd_task);
+
+	return;
+}
+
+/*
+ * Removes entries from the swapped_list. The radix tree allows us to quickly
+ * look up the entry from the index without having to iterate over the whole
+ * list.
+ */
+void remove_from_swapped_list(const unsigned long index)
+{
+	struct swapped_entry *entry;
+	unsigned long flags;
+
+	if (list_empty(&swapped.list))
+		return;
+
+	spin_lock_irqsave(&swapped.lock, flags);
+	entry = radix_tree_delete(&swapped.swap_tree, index);
+	if (likely(entry)) {
+		list_del_init(&entry->swapped_list);
+		swapped.count--;
+		kmem_cache_free(swapped.cache, entry);
+	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
+}
+
+enum trickle_return {
+	TRICKLE_SUCCESS,
+	TRICKLE_FAILED,
+	TRICKLE_DELAY,
+};
+
+/*
+ * prefetch_stats stores the free ram data of each node and this is used to
+ * determine if a node is suitable for prefetching into.
+ */
+struct prefetch_stats {
+	unsigned long	last_free[MAX_NUMNODES];
+	/* Free ram after a cycle of prefetching */
+	unsigned long	current_free[MAX_NUMNODES];
+	/* Free ram on this cycle of checking prefetch_suitable */
+	unsigned long	prefetch_watermark[MAX_NUMNODES];
+	/* Maximum amount we will prefetch to */
+	nodemask_t	prefetch_nodes;
+	/* Which nodes are currently suited to prefetching */
+	unsigned long	prefetched_pages;
+	/* Total pages we've prefetched on this wakeup of kprefetchd */
+};
+
+static struct prefetch_stats sp_stat;
+
+/*
+ * This tries to read a swp_entry_t into swap cache for swap prefetching.
+ * If it returns TRICKLE_DELAY we should delay further prefetching.
+ */
+static enum trickle_return trickle_swap_cache_async(const swp_entry_t entry,
+	const int node)
+{
+	enum trickle_return ret = TRICKLE_FAILED;
+	struct page *page;
+
+	read_lock_irq(&swapper_space.tree_lock);
+	/* Entry may already exist */
+	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
+	read_unlock_irq(&swapper_space.tree_lock);
+	if (page) {
+		remove_from_swapped_list(entry.val);
+		goto out;
+	}
+
+	/*
+	 * Get a new page to read from swap. We have already checked the
+	 * watermarks so __alloc_pages will not call on reclaim.
+	 */
+	page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+	if (unlikely(!page)) {
+		ret = TRICKLE_DELAY;
+		goto out;
+	}
+
+	if (add_to_swap_cache(page, entry)) {
+		/* Failed to add to swap cache */
+		goto out_release;
+	}
+
+	/* Add them to the tail of the inactive list to preserve LRU order */
+	lru_cache_add_tail(page);
+	if (unlikely(swap_readpage(NULL, page))) {
+		ret = TRICKLE_DELAY;
+		goto out_release;
+	}
+
+	sp_stat.prefetched_pages++;
+	sp_stat.last_free[node]--;
+
+	ret = TRICKLE_SUCCESS;
+out_release:
+	page_cache_release(page);
+out:
+	return ret;
+}
+
+static void clear_last_prefetch_free(void)
+{
+	int node;
+
+	/*
+	 * Reset the nodes suitable for prefetching to all nodes. We could
+	 * update the data to take into account memory hotplug if desired.
+	 */
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.last_free[node] = 0;
+}
+
+static void clear_current_prefetch_free(void)
+{
+	int node;
+
+	sp_stat.prefetch_nodes = node_online_map;
+	for_each_node_mask(node, sp_stat.prefetch_nodes)
+		sp_stat.current_free[node] = 0;
+}
+
+/*
+ * We want to be absolutely certain it's ok to start prefetching.
+ */
+static int prefetch_suitable(void)
+{
+	struct page_state ps;
+	unsigned long limit;
+	struct zone *z;
+	int node, ret = 0;
+
+	/* Purposefully racy and might return false positive which is ok */
+	if (__test_and_clear_bit(0, &swapped.busy))
+		goto out;
+
+	clear_current_prefetch_free();
+
+	/*
+	 * Have some hysteresis between where page reclaiming and prefetching
+	 * will occur to prevent ping-ponging between them.
+	 */
+	for_each_zone(z) {
+		unsigned long free;
+
+		if (!populated_zone(z))
+			continue;
+		node = z->zone_pgdat->node_id;
+
+		free = z->free_pages;
+		if (z->pages_high * 3 + z->lowmem_reserve[zone_idx(z)] > free) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+		sp_stat.current_free[node] += free;
+	}
+
+	/*
+	 * We iterate over each node testing to see if it is suitable for
+	 * prefetching and clear the nodemask if it is not.
+	 */
+	for_each_node_mask(node, sp_stat.prefetch_nodes) {
+		/*
+		 * We check to see that pages are not being allocated
+		 * elsewhere at any significant rate implying any
+		 * degree of memory pressure (eg during file reads)
+		 */
+		if (sp_stat.last_free[node]) {
+			if (sp_stat.current_free[node] + SWAP_CLUSTER_MAX <
+				sp_stat.last_free[node]) {
+					sp_stat.last_free[node] =
+						sp_stat.current_free[node];
+					node_clear(node,
+						sp_stat.prefetch_nodes);
+					continue;
+			}
+		} else
+			sp_stat.last_free[node] = sp_stat.current_free[node];
+
+		/*
+		 * get_page_state is super expensive so we only perform it
+		 * every SWAP_CLUSTER_MAX prefetched_pages
+		 */
+		if (sp_stat.prefetched_pages % SWAP_CLUSTER_MAX)
+			continue;
+
+		get_page_state_node(&ps, node);
+
+		/* We shouldn't prefetch when we are doing writeback */
+		if (ps.nr_writeback) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+
+		/*
+		 * >2/3 of the ram on this node is mapped, slab, swapcache or
+		 * dirty, we need to leave some free for pagecache.
+		 * Note that currently nr_slab is inaccurate on numa because
+		 * nr_slab is incremented on the node doing the accounting
+		 * even if the slab is being allocated on a remote node. This
+		 * would be expensive to fix and not of great significance.
+		 */
+		limit = ps.nr_mapped + ps.nr_slab + ps.nr_dirty +
+			ps.nr_unstable + total_swapcache_pages;
+		if (limit > sp_stat.prefetch_watermark[node]) {
+			node_clear(node, sp_stat.prefetch_nodes);
+			continue;
+		}
+	}
+
+	if (nodes_empty(sp_stat.prefetch_nodes))
+		goto out;
+
+	/* Survived all that? Hooray we can prefetch! */
+	ret = 1;
+out:
+	return ret;
+}
+
+/*
+ * Get previous swapped entry when iterating over all entries. swapped.lock
+ * should be held and we should already ensure that entry exists.
+ */
+static inline struct swapped_entry *prev_swapped_entry
+	(struct swapped_entry *entry)
+{
+	return list_entry(entry->swapped_list.prev->prev,
+		struct swapped_entry, swapped_list);
+}
+
+/*
+ * trickle_swap is the main function that initiates the swap prefetching. It
+ * first checks to see if the busy flag is set, and does not prefetch if it
+ * is, as the flag implies we are low on memory or currently swapping in.
+ * Otherwise it runs until prefetch_suitable fails, which occurs when the
+ * vm is busy, we have prefetched to the watermark, the list is empty, or
+ * we have iterated over all entries.
+ */
+static enum trickle_return trickle_swap(void)
+{
+	enum trickle_return ret = TRICKLE_DELAY;
+	struct swapped_entry *entry;
+
+	/*
+	 * If laptop_mode is enabled don't prefetch to avoid hard drives
+	 * doing unnecessary spin-ups
+	 */
+	if (!swap_prefetch || laptop_mode)
+		return ret;
+
+	entry = NULL;
+
+	for ( ; ; ) {
+		swp_entry_t swp_entry;
+		int node;
+
+		if (!prefetch_suitable())
+			break;
+
+		spin_lock(&swapped.lock);
+		if (list_empty(&swapped.list)) {
+			ret = TRICKLE_FAILED;
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		if (!entry) {
+			/*
+			 * This sets the entry for the first iteration. It
+			 * also is a safeguard against the entry disappearing
+			 * while the lock is not held.
+			 */
+			entry = list_entry(swapped.list.prev,
+				struct swapped_entry, swapped_list);
+		} else if (entry->swapped_list.prev == swapped.list.next) {
+			/*
+			 * If we have iterated over all entries and there are
+			 * still entries that weren't swapped out there may
+			 * be a reason we could not swap them back in so
+			 * delay attempting further prefetching.
+			 */
+			spin_unlock(&swapped.lock);
+			break;
+		}
+
+		node = get_swap_entry_node(entry);
+		if (!node_isset(node, sp_stat.prefetch_nodes)) {
+			/*
+			 * We found an entry that belongs to a node that is
+			 * not suitable for prefetching so skip it.
+			 */
+			entry = prev_swapped_entry(entry);
+			spin_unlock(&swapped.lock);
+			continue;
+		}
+		swp_entry = entry->swp_entry;
+		entry = prev_swapped_entry(entry);
+		spin_unlock(&swapped.lock);
+
+		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
+			break;
+	}
+
+	if (sp_stat.prefetched_pages) {
+		lru_add_drain();
+		sp_stat.prefetched_pages = 0;
+	}
+	return ret;
+}
+
+static int kprefetchd(void *__unused)
+{
+	set_user_nice(current, 19);
+	/* Set ioprio to lowest if supported by i/o scheduler */
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+
+	do {
+		try_to_freeze();
+
+		/*
+		 * TRICKLE_FAILED implies no entries left - we do not schedule
+		 * a wakeup, and further delay the next one.
+		 */
+		if (trickle_swap() == TRICKLE_FAILED) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+		}
+		clear_last_prefetch_free();
+		schedule_timeout_interruptible(PREFETCH_DELAY);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * Create kmem cache for swapped entries
+ */
+void __init prepare_swap_prefetch(void)
+{
+	pg_data_t *pgdat;
+	int node;
+
+	swapped.cache = kmem_cache_create("swapped_entry",
+		sizeof(struct swapped_entry), 0, SLAB_PANIC, NULL, NULL);
+
+	/*
+	 * Set max number of entries to 2/3 the size of physical ram as we
+	 * only ever prefetch to consume 2/3 of the ram.
+	 */
+	swapped.maxcount = nr_free_pagecache_pages() / 3 * 2;
+
+	for_each_pgdat(pgdat) {
+		unsigned long present;
+
+		present = pgdat->node_present_pages;
+		if (!present)
+			continue;
+		node = pgdat->node_id;
+		sp_stat.prefetch_watermark[node] += present / 3 * 2;
+	}
+}
+
+static int __init kprefetchd_init(void)
+{
+	kprefetchd_task = kthread_run(kprefetchd, NULL, "kprefetchd");
+
+	return 0;
+}
+
+static void __exit kprefetchd_exit(void)
+{
+	kthread_stop(kprefetchd_task);
+}
+
+module_init(kprefetchd_init);
+module_exit(kprefetchd_exit);
Index: linux-2.6.16-rc4-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/swap_state.c	2006-02-18 10:37:19.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/swap_state.c	2006-02-21 00:39:25.000000000 +1100
@@ -81,6 +81,7 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
+			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -94,11 +95,12 @@ static int __add_to_swap_cache(struct pa
 	return error;
 }
 
-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
 	int error;
 
 	if (!swap_duplicate(entry)) {
+		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
@@ -147,6 +149,9 @@ int add_to_swap(struct page * page, gfp_
 	swp_entry_t entry;
 	int err;
 
+	/* Swap prefetching is delayed if we're swapping pages */
+	delay_swap_prefetch();
+
 	if (!PageLocked(page))
 		BUG();
 
@@ -320,6 +325,9 @@ struct page *read_swap_cache_async(swp_e
 	struct page *found_page, *new_page = NULL;
 	int err;
 
+	/* Swap prefetching is delayed if we're already reading from swap */
+	delay_swap_prefetch();
+
 	do {
 		/*
 		 * First check the swap cache.  Since this is normally
Index: linux-2.6.16-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.16-rc4-mm1.orig/mm/vmscan.c	2006-02-21 00:38:56.000000000 +1100
+++ linux-2.6.16-rc4-mm1/mm/vmscan.c	2006-02-21 00:40:26.000000000 +1100
@@ -390,6 +390,7 @@ static int remove_mapping(struct address
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
@@ -1451,6 +1452,8 @@ unsigned long try_to_free_pages(struct z
 		.may_swap = 1,
 	};
 
+	delay_swap_prefetch();
+
 	inc_page_state(allocstall);
 
 	for (i = 0; zones[i] != NULL; i++) {
@@ -1794,6 +1797,8 @@ int shrink_all_memory(unsigned long nr_p
 		.reclaimed_slab = 0,
 	};
 
+	delay_swap_prefetch();
+
 	current->reclaim_state = &reclaim_state;
 	for_each_pgdat(pgdat) {
 		int freed;
Index: linux-2.6.16-rc4-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.16-rc4-mm1.orig/include/linux/mm_inline.h	2006-02-21 00:38:55.000000000 +1100
+++ linux-2.6.16-rc4-mm1/include/linux/mm_inline.h	2006-02-21 00:39:25.000000000 +1100
@@ -14,6 +14,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	zone->nr_inactive++;
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);

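For anyone wanting to watch kprefetchd at work, here is a rough userspace
sketch (illustrative only, not part of the patch): since prefetched pages
are added to the swap cache, the SwapCached figure in /proc/meminfo should
slowly rise while the machine sits idle after heavy swapping.

/*
 * Illustration only: print SwapCached from /proc/meminfo every few seconds.
 * While kprefetchd is prefetching, this number should slowly rise.
 */
#include <stdio.h>
#include <unistd.h>

static long swap_cached_kb(void)
{
	char line[128];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "SwapCached: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	for (;;) {
		printf("SwapCached: %ld kB\n", swap_cached_kb());
		sleep(5);
	}
	return 0;
}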

end of thread, other threads:[~2006-02-24 11:24 UTC | newest]

Thread overview: 26+ messages
2006-02-06 23:28 [PATCH] mm: implement swap prefetching Con Kolivas
2006-02-07  0:38 ` Andrew Morton
2006-02-07  1:29   ` Con Kolivas
2006-02-07  1:32     ` [ck] " Con Kolivas
2006-02-07  1:39     ` Andrew Morton
2006-02-07  1:37   ` Bernd Eckenfels
2006-02-08  3:29   ` [PATCH] mm: implement swap prefetching v21 Con Kolivas
2006-02-08  3:49     ` [ck] " Con Kolivas
2006-02-07  3:08 ` [PATCH] mm: implement swap prefetching Nick Piggin
2006-02-07  3:29   ` Nick Piggin
2006-02-07  4:02   ` Con Kolivas
2006-02-07  5:00     ` Nick Piggin
2006-02-07  6:02       ` Con Kolivas
2006-02-07  6:51         ` Nick Piggin
2006-02-07 10:54           ` Con Kolivas
2006-02-07 17:14             ` Andrew Morton
2006-02-08  4:46     ` Paul Jackson
2006-02-08  5:06       ` Con Kolivas
2006-02-08  5:13         ` Paul Jackson
2006-02-08  5:33         ` Con Kolivas
2006-02-20 13:44 [PATCH] mm: Implement " Con Kolivas
2006-02-20 19:08 ` Mattia Dongili
2006-02-20 20:47   ` Con Kolivas
2006-02-20 20:58   ` Con Kolivas
2006-02-20 23:45     ` Michal Piotrowski
2006-02-24 11:23 Con Kolivas
