linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -mm] swsusp: support creating bigger images
@ 2006-05-02 10:00 Rafael J. Wysocki
  2006-05-09  7:33 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-02 10:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Pavel Machek

Currently swsusp is only capable of creating suspend images that are not
bigger than 1/2 of the normal zone, because it needs to create a copy of every
page which should be included in the image.  This may hurt the system
responsiveness after resume, especially on systems with less that 1 GB of RAM.

To allow swsusp to create bigger images we can use the observation that
if some pages don't change after the snapshot image of the system has been
created and before the image is entirely saved, they can be included in the
image without copying.  Now if the mapped pages that are not mapped by the
current task are considered, it turns out that they would change only if they
were reclaimed by try_to_free_pages().  Thus if we take them out of reach
of try_to_free_pages(), for example by (temporarily) moving them out of their
respective LRU lists after creating the image, we will be able to include them
in the image without copying.

If these pages are included in the image without copying, the amount of free
memory needed by swsusp is usually much less than otherwise, so the size
of the image may be bigger.  This also makes swsusp use less memory during
suspend and saves us quite a lot of memory allocations and copyings.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 include/linux/rmap.h    |    8 +
 include/linux/swap.h    |    4 
 kernel/power/power.h    |   10 +-
 kernel/power/snapshot.c |  225 +++++++++++++++++++++++++++++++++---------------
 kernel/power/swsusp.c   |   15 +--
 kernel/power/user.c     |    2 
 mm/rmap.c               |   40 ++++++++
 mm/vmscan.c             |   63 +++++++++++++
 8 files changed, 291 insertions(+), 76 deletions(-)

Index: linux-2.6.17-rc3-mm1/mm/rmap.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/mm/rmap.c
+++ linux-2.6.17-rc3-mm1/mm/rmap.c
@@ -857,3 +857,43 @@ int try_to_unmap(struct page *page, int 
 	return ret;
 }
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int page_mapped_by_task(struct page *page, struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	spin_lock(&task->mm->page_table_lock);
+
+	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
+		if (page_address_in_vma(page, vma) != -EFAULT) {
+			ret = 1;
+			break;
+		}
+
+	spin_unlock(&task->mm->page_table_lock);
+
+	return ret;
+}
+
+/**
+ *	page_to_copy - determine if a page can be included in the
+ *	suspend image without copying (returns false if that's the case).
+ *
+ *	It is safe to include the page in the suspend image without
+ *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
+ *	(all tasks except for the current task should be frozen when it's
+ *	called).  Otherwise the page should be copied for this purpose
+ *	(compound pages are avoided for simplicity).
+ */
+
+int page_to_copy(struct page *page)
+{
+	if (!PageLRU(page) || PageCompound(page))
+		return 1;
+	if (page_mapped(page))
+		return page_mapped_by_task(page, current);
+
+	return 1;
+}
+#endif /* CONFIG_SOFTWARE_SUSPEND */
Index: linux-2.6.17-rc3-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/include/linux/rmap.h
+++ linux-2.6.17-rc3-mm1/include/linux/rmap.h
@@ -104,6 +104,14 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/*
+ * Used to determine if the page can be included in the suspend image without
+ * copying
+ */
+int page_to_copy(struct page *);
+#endif
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6.17-rc3-mm1/kernel/power/snapshot.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/snapshot.c
+++ linux-2.6.17-rc3-mm1/kernel/power/snapshot.c
@@ -13,6 +13,7 @@
 #include <linux/version.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/rmap.h>
 #include <linux/suspend.h>
 #include <linux/smp_lock.h>
 #include <linux/delay.h>
@@ -261,6 +262,14 @@ int restore_special_mem(void)
 	return ret;
 }
 
+/* Represents a stacked allocated page to be used in the future */
+struct res_page {
+	struct res_page *next;
+	char padding[PAGE_SIZE - sizeof(void *)];
+};
+
+static struct res_page *page_list;
+
 static int pfn_is_nosave(unsigned long pfn)
 {
 	unsigned long nosave_begin_pfn = __pa(&__nosave_begin) >> PAGE_SHIFT;
@@ -269,7 +278,7 @@ static int pfn_is_nosave(unsigned long p
 }
 
 /**
- *	saveable - Determine whether a page should be cloned or not.
+ *	saveable - Determine whether a page should be saved or not.
  *	@pfn:	The page
  *
  *	We save a page if it's Reserved, and not in the range of pages
@@ -277,9 +286,8 @@ static int pfn_is_nosave(unsigned long p
  *	isn't part of a free chunk of pages.
  */
 
-static int saveable(struct zone *zone, unsigned long *zone_pfn)
+static int saveable(unsigned long pfn)
 {
-	unsigned long pfn = *zone_pfn + zone->zone_start_pfn;
 	struct page *page;
 
 	if (!pfn_valid(pfn))
@@ -296,46 +304,103 @@ static int saveable(struct zone *zone, u
 	return 1;
 }
 
-unsigned int count_data_pages(void)
+/**
+ *	count_data_pages - returns the number of pages that need to be copied
+ *	in order to be included in the suspend snapshot image.  If @total is
+ *	not null, the total number of pages that should be included in the
+ *	snapshot image is stored in the location pointed to by it.
+ */
+
+unsigned int count_data_pages(unsigned int *total)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
-	unsigned int n = 0;
+	unsigned int n, m;
 
+	n = m = 0;
 	for_each_zone (zone) {
 		if (is_highmem(zone))
 			continue;
 		mark_free_pages(zone);
-		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
-			n += saveable(zone, &zone_pfn);
+		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
+			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
+
+			if (saveable(pfn)) {
+				n++;
+				if (!page_to_copy(pfn_to_page(pfn)))
+					m++;
+			}
+		}
 	}
-	return n;
+	if (total)
+		*total = n;
+
+	return n - m;
 }
 
+/**
+ *	copy_data_pages - populates the page backup list @pblist with
+ *	the addresses of the pages that should be included in the
+ *	suspend snapshot image.  The pages that cannot be included in the
+ *	image without copying are copied into empty page frames allocated
+ *	earlier and available from the list page_list (the addresses of
+ *	these page frames are also stored in the page backup list).
+ *
+ *	This function is only called from the critical section, ie. when
+ *	processes are frozen, there's only one CPU online and local IRQs
+ *	are disabled on it.
+ */
+
 static void copy_data_pages(struct pbe *pblist)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
 	struct pbe *pbe, *p;
+	struct res_page *ptr;
 
 	pbe = pblist;
 	for_each_zone (zone) {
 		if (is_highmem(zone))
 			continue;
+
 		mark_free_pages(zone);
-		/* This is necessary for swsusp_free() */
+		/* This is necessary for free_image() */
 		for_each_pb_page (p, pblist)
 			SetPageNosaveFree(virt_to_page(p));
+
 		for_each_pbe (p, pblist)
-			SetPageNosaveFree(virt_to_page(p->address));
+			if (p->address && p->orig_address != p->address)
+				SetPageNosaveFree(virt_to_page(p->address));
+
+		ptr = page_list;
+		while (ptr) {
+			SetPageNosaveFree(virt_to_page(ptr));
+			ptr = ptr->next;
+		}
+
 		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
-			if (saveable(zone, &zone_pfn)) {
+			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
+
+			if (saveable(pfn)) {
 				struct page *page;
-				page = pfn_to_page(zone_pfn + zone->zone_start_pfn);
+
 				BUG_ON(!pbe);
+				page = pfn_to_page(pfn);
 				pbe->orig_address = (unsigned long)page_address(page);
-				/* copy_page is not usable for copying task structs. */
-				memcpy((void *)pbe->address, (void *)pbe->orig_address, PAGE_SIZE);
+				if (page_to_copy(page)) {
+					BUG_ON(!page_list);
+					pbe->address = (unsigned long)page_list;
+					page_list = page_list->next;
+					/*
+					 * copy_page is not usable for copying
+					 * task structs.
+					 */
+					memcpy((void *)pbe->address,
+						(void *)pbe->orig_address,
+						PAGE_SIZE);
+				} else {
+					pbe->address = pbe->orig_address;
+				}
 				pbe = pbe->next;
 			}
 		}
@@ -419,7 +484,7 @@ static inline void *alloc_image_page(gfp
 	res = (void *)get_zeroed_page(gfp_mask);
 	if (safe_needed)
 		while (res && PageNosaveFree(virt_to_page(res))) {
-			/* The page is unsafe, mark it for swsusp_free() */
+			/* The page is unsafe, mark it for free_image() */
 			SetPageNosave(virt_to_page(res));
 			unsafe_pages++;
 			res = (void *)get_zeroed_page(gfp_mask);
@@ -474,11 +539,32 @@ static struct pbe *alloc_pagedir(unsigne
 }
 
 /**
+ *	protect_data_pages - move data pages that need to be protected from
+ *	being reclaimed (ie. those included in the suspend image without
+ *	copying) out of their respective LRU lists.  This is done after the
+ *	image has been created so the LRU lists only have to be restored if
+ *	the suspend fails.
+ *
+ *	This function is only called from the critical section, ie. when
+ *	processes are frozen, there's only one CPU online and local IRQs
+ *	are disabled on it.
+ */
+
+static void protect_data_pages(struct pbe *pblist)
+{
+	struct pbe *p;
+
+	for_each_pbe (p, pblist)
+		if (p->address == p->orig_address)
+			move_out_of_lru(virt_to_page(p->address));
+}
+
+/**
  * Free pages we allocated for suspend. Suspend pages are alocated
  * before atomic copy, so we need to free them after resume.
  */
 
-void swsusp_free(void)
+void free_image(void)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
@@ -501,6 +587,11 @@ void swsusp_free(void)
 	buffer = NULL;
 }
 
+void swsusp_free(void)
+{
+	free_image();
+	restore_active_inactive_lists();
+}
 
 /**
  *	enough_free_mem - Make sure we enough free memory to snapshot.
@@ -509,32 +600,33 @@ void swsusp_free(void)
  *	free pages.
  */
 
-static int enough_free_mem(unsigned int nr_pages)
+static int enough_free_mem(unsigned int nr_pages, unsigned int copy_pages)
 {
 	struct zone *zone;
-	unsigned int n = 0;
+	long n = 0;
+
+	pr_debug("swsusp: pages needed: %u + %lu + %lu\n",
+		 copy_pages,
+		 (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
+		 EXTRA_PAGES);
 
 	for_each_zone (zone)
-		if (!is_highmem(zone))
+		if (!is_highmem(zone) && populated_zone(zone)) {
 			n += zone->free_pages;
-	pr_debug("swsusp: available memory: %u pages\n", n);
-	return n > (nr_pages + PAGES_FOR_IO +
-		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
-}
-
-static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed)
-{
-	struct pbe *p;
+			/*
+			 * We're going to use atomic allocations, so we
+			 * shouldn't count the lowmem reserves in the lower
+			 * zones as available to us
+			 */
+			n -= zone->lowmem_reserve[ZONE_NORMAL];
+		}
 
-	for_each_pbe (p, pblist) {
-		p->address = (unsigned long)alloc_image_page(gfp_mask, safe_needed);
-		if (!p->address)
-			return -ENOMEM;
-	}
-	return 0;
+	pr_debug("swsusp: available memory: %ld pages\n", n);
+	return n > (long)(copy_pages + EXTRA_PAGES +
+		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
 }
 
-static struct pbe *swsusp_alloc(unsigned int nr_pages)
+static struct pbe *swsusp_alloc(unsigned int nr_pages, unsigned int copy_pages)
 {
 	struct pbe *pblist;
 
@@ -543,10 +635,19 @@ static struct pbe *swsusp_alloc(unsigned
 		return NULL;
 	}
 
-	if (alloc_data_pages(pblist, GFP_ATOMIC | __GFP_COLD, 0)) {
-		printk(KERN_ERR "suspend: Allocating image pages failed.\n");
-		swsusp_free();
-		return NULL;
+	page_list = NULL;
+	while (copy_pages--) {
+		struct res_page *ptr;
+
+		ptr = alloc_image_page(GFP_ATOMIC | __GFP_COLD, 0);
+		if (!ptr) {
+			printk(KERN_ERR
+				"suspend: Allocating image pages failed.\n");
+			free_image();
+			return NULL;
+		}
+		ptr->next = page_list;
+		page_list = ptr;
 	}
 
 	return pblist;
@@ -554,25 +655,21 @@ static struct pbe *swsusp_alloc(unsigned
 
 asmlinkage int swsusp_save(void)
 {
-	unsigned int nr_pages;
+	unsigned int nr_pages, copy_pages;
 
 	pr_debug("swsusp: critical section: \n");
 
 	drain_local_pages();
-	nr_pages = count_data_pages();
-	printk("swsusp: Need to copy %u pages\n", nr_pages);
-
-	pr_debug("swsusp: pages needed: %u + %lu + %u, free: %u\n",
-		 nr_pages,
-		 (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
-		 PAGES_FOR_IO, nr_free_pages());
+	copy_pages = count_data_pages(&nr_pages);
+	printk("swsusp: Need to save %u pages\n", nr_pages);
+	printk("swsusp: %u pages to copy\n", copy_pages);
 
-	if (!enough_free_mem(nr_pages)) {
+	if (!enough_free_mem(nr_pages, copy_pages)) {
 		printk(KERN_ERR "swsusp: Not enough free memory\n");
 		return -ENOMEM;
 	}
 
-	pagedir_nosave = swsusp_alloc(nr_pages);
+	pagedir_nosave = swsusp_alloc(nr_pages, copy_pages);
 	if (!pagedir_nosave)
 		return -ENOMEM;
 
@@ -582,6 +679,9 @@ asmlinkage int swsusp_save(void)
 	drain_local_pages();
 	copy_data_pages(pagedir_nosave);
 
+	/* Make sure the pages that we have not copied won't be reclaimed */
+	protect_data_pages(pagedir_nosave);
+
 	/*
 	 * End of critical section. From now on, we can write to memory,
 	 * but we should not touch disk. This specially means we must _not_
@@ -591,7 +691,7 @@ asmlinkage int swsusp_save(void)
 	nr_copy_pages = nr_pages;
 	nr_meta_pages = (nr_pages * sizeof(long) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
-	printk("swsusp: critical section/: done (%d pages copied)\n", nr_pages);
+	printk("swsusp: critical section/: done (%d pages)\n", nr_pages);
 	return 0;
 }
 
@@ -654,7 +754,7 @@ int snapshot_read_next(struct snapshot_h
 	if (handle->page > nr_meta_pages + nr_copy_pages)
 		return 0;
 	if (!buffer) {
-		/* This makes the buffer be freed by swsusp_free() */
+		/* This makes the buffer be freed by free_image() */
 		buffer = alloc_image_page(GFP_ATOMIC, 0);
 		if (!buffer)
 			return -ENOMEM;
@@ -810,13 +910,6 @@ static inline struct pbe *unpack_orig_ad
  *	of "safe" which will be used later
  */
 
-struct safe_page {
-	struct safe_page *next;
-	char padding[PAGE_SIZE - sizeof(void *)];
-};
-
-static struct safe_page *safe_pages;
-
 static int prepare_image(struct snapshot_handle *handle)
 {
 	int error = 0;
@@ -833,21 +926,21 @@ static int prepare_image(struct snapshot
 		if (!pblist)
 			error = -ENOMEM;
 	}
-	safe_pages = NULL;
+	page_list = NULL;
 	if (!error && nr_pages > unsafe_pages) {
 		nr_pages -= unsafe_pages;
 		while (nr_pages--) {
-			struct safe_page *ptr;
+			struct res_page *ptr;
 
-			ptr = (struct safe_page *)get_zeroed_page(GFP_ATOMIC);
+			ptr = (struct res_page *)get_zeroed_page(GFP_ATOMIC);
 			if (!ptr) {
 				error = -ENOMEM;
 				break;
 			}
 			if (!PageNosaveFree(virt_to_page(ptr))) {
 				/* The page is "safe", add it to the list */
-				ptr->next = safe_pages;
-				safe_pages = ptr;
+				ptr->next = page_list;
+				page_list = ptr;
 			}
 			/* Mark the page as allocated */
 			SetPageNosave(virt_to_page(ptr));
@@ -858,7 +951,7 @@ static int prepare_image(struct snapshot
 		pagedir_nosave = pblist;
 	} else {
 		handle->pbe = NULL;
-		swsusp_free();
+		free_image();
 	}
 	return error;
 }
@@ -882,8 +975,8 @@ static void *get_buffer(struct snapshot_
 	 * The "original" page frame has not been allocated and we have to
 	 * use a "safe" page frame to store the read page
 	 */
-	pbe->address = (unsigned long)safe_pages;
-	safe_pages = safe_pages->next;
+	pbe->address = (unsigned long)page_list;
+	page_list = page_list->next;
 	if (last)
 		last->next = pbe;
 	handle->last_pbe = pbe;
@@ -919,7 +1012,7 @@ int snapshot_write_next(struct snapshot_
 	if (handle->prev && handle->page > nr_meta_pages + nr_copy_pages)
 		return 0;
 	if (!buffer) {
-		/* This makes the buffer be freed by swsusp_free() */
+		/* This makes the buffer be freed by free_image() */
 		buffer = alloc_image_page(GFP_ATOMIC, 0);
 		if (!buffer)
 			return -ENOMEM;
Index: linux-2.6.17-rc3-mm1/kernel/power/power.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/power.h
+++ linux-2.6.17-rc3-mm1/kernel/power/power.h
@@ -48,7 +48,14 @@ extern dev_t swsusp_resume_device;
 extern asmlinkage int swsusp_arch_suspend(void);
 extern asmlinkage int swsusp_arch_resume(void);
 
-extern unsigned int count_data_pages(void);
+#if (defined(CONFIG_BLK_DEV_INITRD) && defined(CONFIG_BLK_DEV_RAM))
+#define EXTRA_PAGES	(PAGES_FOR_IO + \
+			(2 * CONFIG_BLK_DEV_RAM_SIZE * 1024) / PAGE_SIZE)
+#else
+#define EXTRA_PAGES	PAGES_FOR_IO
+#endif
+
+extern unsigned int count_data_pages(unsigned int *total);
 
 struct snapshot_handle {
 	loff_t		offset;
@@ -111,6 +118,7 @@ extern int restore_special_mem(void);
 
 extern int swsusp_check(void);
 extern int swsusp_shrink_memory(void);
+extern void free_image(void);
 extern void swsusp_free(void);
 extern int swsusp_suspend(void);
 extern int swsusp_resume(void);
Index: linux-2.6.17-rc3-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/swsusp.c
+++ linux-2.6.17-rc3-mm1/kernel/power/swsusp.c
@@ -177,28 +177,29 @@ int swsusp_shrink_memory(void)
 	long size, tmp;
 	struct zone *zone;
 	unsigned long pages = 0;
-	unsigned int i = 0;
+	unsigned int to_save, i = 0;
 	char *p = "-\\|/";
 
 	printk("Shrinking memory...  ");
 	do {
 		size = 2 * count_special_pages();
-		size += size / 50 + count_data_pages();
-		size += (size + PBES_PER_PAGE - 1) / PBES_PER_PAGE +
-			PAGES_FOR_IO;
+		size += size / 50 + count_data_pages(&to_save);
+		size += (to_save + PBES_PER_PAGE - 1) / PBES_PER_PAGE +
+			EXTRA_PAGES;
 		tmp = size;
 		for_each_zone (zone)
 			if (!is_highmem(zone) && populated_zone(zone)) {
 				tmp -= zone->free_pages;
 				tmp += zone->lowmem_reserve[ZONE_NORMAL];
 			}
+		size = to_save * (PAGE_SIZE + sizeof(long));
 		if (tmp > 0) {
 			tmp = __shrink_memory(tmp);
 			if (!tmp)
 				return -ENOMEM;
 			pages += tmp;
-		} else if (size > image_size / PAGE_SIZE) {
-			tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
+		} else if (size > image_size) {
+			tmp = __shrink_memory(size - image_size);
 			pages += tmp;
 		}
 		printk("\b%c", p[i++%4]);
@@ -261,7 +262,7 @@ int swsusp_resume(void)
 	 * very tight, so we have to free it as soon as we can to avoid
 	 * subsequent failures
 	 */
-	swsusp_free();
+	free_image();
 	restore_processor_state();
 	restore_special_mem();
 	touch_softlockup_watchdog();
Index: linux-2.6.17-rc3-mm1/kernel/power/user.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/user.c
+++ linux-2.6.17-rc3-mm1/kernel/power/user.c
@@ -153,6 +153,8 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_UNFREEZE:
 		if (!data->frozen)
 			break;
+		if (data->ready)
+			swsusp_free();
 		down(&pm_sem);
 		thaw_processes();
 		enable_nonboot_cpus();
Index: linux-2.6.17-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/mm/vmscan.c
+++ linux-2.6.17-rc3-mm1/mm/vmscan.c
@@ -1437,7 +1437,68 @@ out:
 
 	return ret;
 }
-#endif
+
+static LIST_HEAD(saved_active_pages);
+static LIST_HEAD(saved_inactive_pages);
+
+/**
+ *	move_out_of_lru - software suspend includes some pages in the
+ *	suspend snapshot image without copying and these pages should be
+ *	procected from being reclaimed, which can be done by (temporarily)
+ *	moving them out of their respective LRU lists
+ *
+ *	It is to be called with local IRQs disabled.
+ */
+
+void move_out_of_lru(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock(&zone->lru_lock);
+	if (PageActive(page)) {
+		del_page_from_active_list(zone, page);
+		list_add(&page->lru, &saved_active_pages);
+	} else {
+		del_page_from_inactive_list(zone, page);
+		list_add(&page->lru, &saved_inactive_pages);
+	}
+	ClearPageLRU(page);
+	spin_unlock(&zone->lru_lock);
+}
+
+
+/**
+ *	restore_active_inactive_lists - used by the software suspend to move
+ *	the pages taken out of the LRU by take_page_out_of_lru() back to
+ *	their respective active/inactive lists (if the suspend fails)
+ */
+
+void restore_active_inactive_lists(void)
+{
+	struct page *page;
+	struct zone *zone;
+
+	while(!list_empty(&saved_active_pages)) {
+		page = lru_to_page(&saved_active_pages);
+		zone = page_zone(page);
+		list_del(&page->lru);
+		spin_lock_irq(&zone->lru_lock);
+		SetPageLRU(page);
+		add_page_to_active_list(zone, page);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+
+	while(!list_empty(&saved_inactive_pages)) {
+		page = lru_to_page(&saved_inactive_pages);
+		zone = page_zone(page);
+		list_del(&page->lru);
+		spin_lock_irq(&zone->lru_lock);
+		SetPageLRU(page);
+		add_page_to_inactive_list(zone, page);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+}
+#endif /* CONFIG_PM */
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
Index: linux-2.6.17-rc3-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/include/linux/swap.h
+++ linux-2.6.17-rc3-mm1/include/linux/swap.h
@@ -184,7 +184,9 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zone **, gfp_t);
-extern unsigned long shrink_all_memory(unsigned long nr_pages);
+extern unsigned long shrink_all_memory(unsigned long);
+extern void move_out_of_lru(struct page *);
+extern void restore_active_inactive_lists(void);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-02 10:00 [PATCH -mm] swsusp: support creating bigger images Rafael J. Wysocki
@ 2006-05-09  7:33 ` Andrew Morton
  2006-05-09 10:19   ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-05-09  7:33 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, nickpiggin, pavel

"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> Currently swsusp is only capable of creating suspend images that are not
> bigger than 1/2 of the normal zone, because it needs to create a copy of every
> page which should be included in the image.  This may hurt the system
> responsiveness after resume, especially on systems with less that 1 GB of RAM.
> 
> To allow swsusp to create bigger images we can use the observation that
> if some pages don't change after the snapshot image of the system has been
> created and before the image is entirely saved, they can be included in the
> image without copying.  Now if the mapped pages that are not mapped by the
> current task are considered, it turns out that they would change only if they
> were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> of try_to_free_pages(), for example by (temporarily) moving them out of their
> respective LRU lists after creating the image, we will be able to include them
> in the image without copying.
> 
> If these pages are included in the image without copying, the amount of free
> memory needed by swsusp is usually much less than otherwise, so the size
> of the image may be bigger.  This also makes swsusp use less memory during
> suspend and saves us quite a lot of memory allocations and copyings.
> 

I have a host of minor problems with this patch.

> --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c
> +++ linux-2.6.17-rc3-mm1/mm/rmap.c
>  
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +static int page_mapped_by_task(struct page *page, struct task_struct *task)
> +{
> +	struct vm_area_struct *vma;
> +	int ret = 0;
> +
> +	spin_lock(&task->mm->page_table_lock);
> +
> +	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
> +		if (page_address_in_vma(page, vma) != -EFAULT) {
> +			ret = 1;
> +			break;
> +		}
> +
> +	spin_unlock(&task->mm->page_table_lock);
> +
> +	return ret;
> +}

task_struct.mm can sometimes be NULL.  This function assumes that it will
never be NULL.  That makes it a somewhat risky interface.  Are we sure it
can never be NULL?

> +/**
> + *	page_to_copy - determine if a page can be included in the
> + *	suspend image without copying (returns false if that's the case).
> + *
> + *	It is safe to include the page in the suspend image without
> + *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
> + *	(all tasks except for the current task should be frozen when it's
> + *	called).  Otherwise the page should be copied for this purpose
> + *	(compound pages are avoided for simplicity).
> + */
> +
> +int page_to_copy(struct page *page)
> +{
> +	if (!PageLRU(page) || PageCompound(page))
> +		return 1;
> +	if (page_mapped(page))
> +		return page_mapped_by_task(page, current);
> +
> +	return 1;
> +}

a) This is a poorly-chosen name for a globally-scoped function.  There's
   no indication in its name that it is a swsusp thing.

b) It hurts my brain.  "determine X and return false if it's true (!)",
   where "X" means "can it be included" whereas the name "page_to_copy"
   implies something seemingly unrelated - some semantic caller-defined
   consequence of X.

Shudder.  Please try again.

> +#endif /* CONFIG_SOFTWARE_SUSPEND */
> Index: linux-2.6.17-rc3-mm1/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.17-rc3-mm1.orig/include/linux/rmap.h
> +++ linux-2.6.17-rc3-mm1/include/linux/rmap.h
> @@ -104,6 +104,14 @@ pte_t *page_check_address(struct page *,
>   */
>  unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
>  
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +/*
> + * Used to determine if the page can be included in the suspend image without
> + * copying
> + */
> +int page_to_copy(struct page *);
> +#endif

Please remove the ifdef here.

> +++ linux-2.6.17-rc3-mm1/kernel/power/snapshot.c
> --- linux-2.6.17-rc3-mm1.orig/kernel/power/snapshot.c
> +
> +unsigned int count_data_pages(unsigned int *total)
>  {
>  	struct zone *zone;
>  	unsigned long zone_pfn;
> -	unsigned int n = 0;
> +	unsigned int n, m;
>  
> +	n = m = 0;

Please avoid the multiple assignment.  Kernel coding style is super-simple
- no tricksy stuff.

	unsigned int n = 0;
	unsigned int m = 0;

See, if we put each declaration on its own line we get a little bit of room
for a comment explaining what it does.  Which is pretty much compulsory if
one insists on using identifiers like `m' and `n'.

>  	for_each_zone (zone) {

Might as well remove the extranous space here.

>  		if (is_highmem(zone))
>  			continue;
>  		mark_free_pages(zone);
> -		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
> -			n += saveable(zone, &zone_pfn);
> +		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
> +			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
> +
> +			if (saveable(pfn)) {
> +				n++;
> +				if (!page_to_copy(pfn_to_page(pfn)))
> +					m++;
> +			}
> +		}
>  	}
> -	return n;
> +	if (total)
> +		*total = n;
> +
> +	return n - m;
>  }

See, I think `n' should be renamed to `pages_to_copy'.

I don't know what `m' should be renamed to, because that would require me
to understand `page_to_copy()', and I'm not smart enough for that.

>  /**
> + *	protect_data_pages - move data pages that need to be protected from
> + *	being reclaimed (ie. those included in the suspend image without
> + *	copying) out of their respective LRU lists.  This is done after the
> + *	image has been created so the LRU lists only have to be restored if
> + *	the suspend fails.
> + *
> + *	This function is only called from the critical section, ie. when
> + *	processes are frozen, there's only one CPU online and local IRQs
> + *	are disabled on it.
> + */
> +
> +static void protect_data_pages(struct pbe *pblist)
> +{
> +	struct pbe *p;
> +
> +	for_each_pbe (p, pblist)

extraneous space.

> -static int enough_free_mem(unsigned int nr_pages)
> +static int enough_free_mem(unsigned int nr_pages, unsigned int copy_pages)
>  {
>  	struct zone *zone;
> -	unsigned int n = 0;
> +	long n = 0;

Suggest you rename `n' to something useful while you're there.

> +	pr_debug("swsusp: pages needed: %u + %lu + %lu\n",
> +		 copy_pages,
> +		 (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
> +		 EXTRA_PAGES);

Use DIV_ROUND_UP() here.

>  	for_each_zone (zone)
> -		if (!is_highmem(zone))
> +		if (!is_highmem(zone) && populated_zone(zone)) {
>  			n += zone->free_pages;
> -	pr_debug("swsusp: available memory: %u pages\n", n);
> -	return n > (nr_pages + PAGES_FOR_IO +
> -		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
> -}
> -
> -static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed)
> -{
> -	struct pbe *p;
> +			/*
> +			 * We're going to use atomic allocations, so we
> +			 * shouldn't count the lowmem reserves in the lower
> +			 * zones as available to us
> +			 */
> +			n -= zone->lowmem_reserve[ZONE_NORMAL];
> +		}
>  
> -	for_each_pbe (p, pblist) {
> -		p->address = (unsigned long)alloc_image_page(gfp_mask, safe_needed);
> -		if (!p->address)
> -			return -ENOMEM;
> -	}
> -	return 0;
> +	pr_debug("swsusp: available memory: %ld pages\n", n);
> +	return n > (long)(copy_pages + EXTRA_PAGES +
> +		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);

DIV_ROUND_UP().

>  asmlinkage int swsusp_save(void)
>  {
> -	unsigned int nr_pages;
> +	unsigned int nr_pages, copy_pages;

	unsigned int nr_pages;
	unsigned int pages_to_copy;


> +#if (defined(CONFIG_BLK_DEV_INITRD) && defined(CONFIG_BLK_DEV_RAM))
> +#define EXTRA_PAGES	(PAGES_FOR_IO + \
> +			(2 * CONFIG_BLK_DEV_RAM_SIZE * 1024) / PAGE_SIZE)
> +#else
> +#define EXTRA_PAGES	PAGES_FOR_IO
> +#endif

What is the significance of the ramdisk to the swsusp code?  (Need commenting).

EXTRA_PAGES is not a well-chosen identifier.  Please choose something
within the swsusp "namespace", if there's such a thing.

> ===================================================================
> --- linux-2.6.17-rc3-mm1.orig/kernel/power/swsusp.c
> +++ linux-2.6.17-rc3-mm1/kernel/power/swsusp.c
> @@ -177,28 +177,29 @@ int swsusp_shrink_memory(void)
>  	long size, tmp;
>  	struct zone *zone;
>  	unsigned long pages = 0;
> -	unsigned int i = 0;
> +	unsigned int to_save, i = 0;

	unsigned int pages_to_save;
	unsigned int i = 0;		/* Maybe a comment */

> ===================================================================
> --- linux-2.6.17-rc3-mm1.orig/mm/vmscan.c
> +++ linux-2.6.17-rc3-mm1/mm/vmscan.c
> @@ -1437,7 +1437,68 @@ out:
>  
>  	return ret;
>  }
> -#endif
> +
> +static LIST_HEAD(saved_active_pages);
> +static LIST_HEAD(saved_inactive_pages);
> +
> +/**
> + *	move_out_of_lru - software suspend includes some pages in the
> + *	suspend snapshot image without copying and these pages should be
> + *	procected from being reclaimed, which can be done by (temporarily)
> + *	moving them out of their respective LRU lists
> + *
> + *	It is to be called with local IRQs disabled.
> + */
> +
> +void move_out_of_lru(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	spin_lock(&zone->lru_lock);

Normally we'd check that PageLRU() is still true after acquiring the lock.

> +	if (PageActive(page)) {
> +		del_page_from_active_list(zone, page);
> +		list_add(&page->lru, &saved_active_pages);
> +	} else {
> +		del_page_from_inactive_list(zone, page);
> +		list_add(&page->lru, &saved_inactive_pages);
> +	}
> +	ClearPageLRU(page);
> +	spin_unlock(&zone->lru_lock);
> +}
> +
> +
> +/**
> + *	restore_active_inactive_lists - used by the software suspend to move
> + *	the pages taken out of the LRU by take_page_out_of_lru() back to
> + *	their respective active/inactive lists (if the suspend fails)
> + */
> +
> +void restore_active_inactive_lists(void)
> +{
> +	struct page *page;
> +	struct zone *zone;
> +
> +	while(!list_empty(&saved_active_pages)) {

Add a space after `while'.

> +		page = lru_to_page(&saved_active_pages);
> +		zone = page_zone(page);
> +		list_del(&page->lru);
> +		spin_lock_irq(&zone->lru_lock);
> +		SetPageLRU(page);
> +		add_page_to_active_list(zone, page);
> +		spin_unlock_irq(&zone->lru_lock);
> +	}
> +
> +	while(!list_empty(&saved_inactive_pages)) {

Ditto.

> +		page = lru_to_page(&saved_inactive_pages);
> +		zone = page_zone(page);
> +		list_del(&page->lru);
> +		spin_lock_irq(&zone->lru_lock);
> +		SetPageLRU(page);
> +		add_page_to_inactive_list(zone, page);
> +		spin_unlock_irq(&zone->lru_lock);
> +	}
> +}
> +#endif /* CONFIG_PM */

All the above code is inside CONFIG_PM, but afaik it's only used by
CONFIG_SOFTWARE_SUSPEND.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09  7:33 ` Andrew Morton
@ 2006-05-09 10:19   ` Rafael J. Wysocki
  2006-05-09 11:22     ` Pavel Machek
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-09 10:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin, pavel

On Tuesday 09 May 2006 09:33, you wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > Currently swsusp is only capable of creating suspend images that are not
> > bigger than 1/2 of the normal zone, because it needs to create a copy of every
> > page which should be included in the image.  This may hurt the system
> > responsiveness after resume, especially on systems with less that 1 GB of RAM.
> > 
> > To allow swsusp to create bigger images we can use the observation that
> > if some pages don't change after the snapshot image of the system has been
> > created and before the image is entirely saved, they can be included in the
> > image without copying.  Now if the mapped pages that are not mapped by the
> > current task are considered, it turns out that they would change only if they
> > were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> > of try_to_free_pages(), for example by (temporarily) moving them out of their
> > respective LRU lists after creating the image, we will be able to include them
> > in the image without copying.
> > 
> > If these pages are included in the image without copying, the amount of free
> > memory needed by swsusp is usually much less than otherwise, so the size
> > of the image may be bigger.  This also makes swsusp use less memory during
> > suspend and saves us quite a lot of memory allocations and copyings.
> > 
> 
> I have a host of minor problems with this patch.
> 
> > --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c
> > +++ linux-2.6.17-rc3-mm1/mm/rmap.c
> >  
> > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > +static int page_mapped_by_task(struct page *page, struct task_struct *task)
> > +{
> > +	struct vm_area_struct *vma;
> > +	int ret = 0;
> > +
> > +	spin_lock(&task->mm->page_table_lock);
> > +
> > +	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
> > +		if (page_address_in_vma(page, vma) != -EFAULT) {
> > +			ret = 1;
> > +			break;
> > +		}
> > +
> > +	spin_unlock(&task->mm->page_table_lock);
> > +
> > +	return ret;
> > +}
> 
> task_struct.mm can sometimes be NULL.  This function assumes that it will
> never be NULL.  That makes it a somewhat risky interface.  Are we sure it
> can never be NULL?

Well, now it's only called for task == current, but I can add a check.

> > +/**
> > + *	page_to_copy - determine if a page can be included in the
> > + *	suspend image without copying (returns false if that's the case).
> > + *
> > + *	It is safe to include the page in the suspend image without
> > + *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
> > + *	(all tasks except for the current task should be frozen when it's
> > + *	called).  Otherwise the page should be copied for this purpose
> > + *	(compound pages are avoided for simplicity).
> > + */
> > +
> > +int page_to_copy(struct page *page)
> > +{
> > +	if (!PageLRU(page) || PageCompound(page))
> > +		return 1;
> > +	if (page_mapped(page))
> > +		return page_mapped_by_task(page, current);
> > +
> > +	return 1;
> > +}
> 
> a) This is a poorly-chosen name for a globally-scoped function.  There's
>    no indication in its name that it is a swsusp thing.
> 
> b) It hurts my brain.  "determine X and return false if it's true (!)",
>    where "X" means "can it be included" whereas the name "page_to_copy"
>    implies something seemingly unrelated - some semantic caller-defined
>    consequence of X.
> 
> Shudder.  Please try again.

OK (although it's not that easy ;-))

> > +#endif /* CONFIG_SOFTWARE_SUSPEND */
> > Index: linux-2.6.17-rc3-mm1/include/linux/rmap.h
> > ===================================================================
> > --- linux-2.6.17-rc3-mm1.orig/include/linux/rmap.h
> > +++ linux-2.6.17-rc3-mm1/include/linux/rmap.h
> > @@ -104,6 +104,14 @@ pte_t *page_check_address(struct page *,
> >   */
> >  unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
> >  
> > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > +/*
> > + * Used to determine if the page can be included in the suspend image without
> > + * copying
> > + */
> > +int page_to_copy(struct page *);
> > +#endif
> 
> Please remove the ifdef here.

OK

> > +++ linux-2.6.17-rc3-mm1/kernel/power/snapshot.c
> > --- linux-2.6.17-rc3-mm1.orig/kernel/power/snapshot.c
> > +
> > +unsigned int count_data_pages(unsigned int *total)
> >  {
> >  	struct zone *zone;
> >  	unsigned long zone_pfn;
> > -	unsigned int n = 0;
> > +	unsigned int n, m;
> >  
> > +	n = m = 0;
> 
> Please avoid the multiple assignment.  Kernel coding style is super-simple
> - no tricksy stuff.

OK

> 
> 	unsigned int n = 0;
> 	unsigned int m = 0;
> 
> See, if we put each declaration on its own line we get a little bit of room
> for a comment explaining what it does.  Which is pretty much compulsory if
> one insists on using identifiers like `m' and `n'.

OK, I'll change the names.

> 
> >  	for_each_zone (zone) {
> 
> Might as well remove the extranous space here.

OK

> >  		if (is_highmem(zone))
> >  			continue;
> >  		mark_free_pages(zone);
> > -		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
> > -			n += saveable(zone, &zone_pfn);
> > +		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
> > +			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
> > +
> > +			if (saveable(pfn)) {
> > +				n++;
> > +				if (!page_to_copy(pfn_to_page(pfn)))
> > +					m++;
> > +			}
> > +		}
> >  	}
> > -	return n;
> > +	if (total)
> > +		*total = n;
> > +
> > +	return n - m;
> >  }
> 
> See, I think `n' should be renamed to `pages_to_copy'.

Actually 'n' is the total number of data pages, and (n - m) is the number of
pages to copy ...

> I don't know what `m' should be renamed to, because that would require me
> to understand `page_to_copy()', and I'm not smart enough for that.

... so 'm' is the number of pages we don't want to be copied. :-)

> >  /**
> > + *	protect_data_pages - move data pages that need to be protected from
> > + *	being reclaimed (ie. those included in the suspend image without
> > + *	copying) out of their respective LRU lists.  This is done after the
> > + *	image has been created so the LRU lists only have to be restored if
> > + *	the suspend fails.
> > + *
> > + *	This function is only called from the critical section, ie. when
> > + *	processes are frozen, there's only one CPU online and local IRQs
> > + *	are disabled on it.
> > + */
> > +
> > +static void protect_data_pages(struct pbe *pblist)
> > +{
> > +	struct pbe *p;
> > +
> > +	for_each_pbe (p, pblist)
> 
> extraneous space.

Will fix.

> > -static int enough_free_mem(unsigned int nr_pages)
> > +static int enough_free_mem(unsigned int nr_pages, unsigned int copy_pages)
> >  {
> >  	struct zone *zone;
> > -	unsigned int n = 0;
> > +	long n = 0;
> 
> Suggest you rename `n' to something useful while you're there.

OK

> > +	pr_debug("swsusp: pages needed: %u + %lu + %lu\n",
> > +		 copy_pages,
> > +		 (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
> > +		 EXTRA_PAGES);
> 
> Use DIV_ROUND_UP() here.

OK

> >  	for_each_zone (zone)
> > -		if (!is_highmem(zone))
> > +		if (!is_highmem(zone) && populated_zone(zone)) {
> >  			n += zone->free_pages;
> > -	pr_debug("swsusp: available memory: %u pages\n", n);
> > -	return n > (nr_pages + PAGES_FOR_IO +
> > -		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
> > -}
> > -
> > -static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed)
> > -{
> > -	struct pbe *p;
> > +			/*
> > +			 * We're going to use atomic allocations, so we
> > +			 * shouldn't count the lowmem reserves in the lower
> > +			 * zones as available to us
> > +			 */
> > +			n -= zone->lowmem_reserve[ZONE_NORMAL];
> > +		}
> >  
> > -	for_each_pbe (p, pblist) {
> > -		p->address = (unsigned long)alloc_image_page(gfp_mask, safe_needed);
> > -		if (!p->address)
> > -			return -ENOMEM;
> > -	}
> > -	return 0;
> > +	pr_debug("swsusp: available memory: %ld pages\n", n);
> > +	return n > (long)(copy_pages + EXTRA_PAGES +
> > +		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
> 
> DIV_ROUND_UP().

OK

> >  asmlinkage int swsusp_save(void)
> >  {
> > -	unsigned int nr_pages;
> > +	unsigned int nr_pages, copy_pages;
> 
> 	unsigned int nr_pages;
> 	unsigned int pages_to_copy;

OK

> > +#if (defined(CONFIG_BLK_DEV_INITRD) && defined(CONFIG_BLK_DEV_RAM))
> > +#define EXTRA_PAGES	(PAGES_FOR_IO + \
> > +			(2 * CONFIG_BLK_DEV_RAM_SIZE * 1024) / PAGE_SIZE)
> > +#else
> > +#define EXTRA_PAGES	PAGES_FOR_IO
> > +#endif
> 
> What is the significance of the ramdisk to the swsusp code?  (Need commenting).

Well, yes.  If the userland suspend is used, it needs the extra space for
itself and the initrd on resume (I'll add a comment).

> EXTRA_PAGES is not a well-chosen identifier.  Please choose something
> within the swsusp "namespace", if there's such a thing.

Unfortunately there's not any, but I'll try to invent a better name. :-)

> > ===================================================================
> > --- linux-2.6.17-rc3-mm1.orig/kernel/power/swsusp.c
> > +++ linux-2.6.17-rc3-mm1/kernel/power/swsusp.c
> > @@ -177,28 +177,29 @@ int swsusp_shrink_memory(void)
> >  	long size, tmp;
> >  	struct zone *zone;
> >  	unsigned long pages = 0;
> > -	unsigned int i = 0;
> > +	unsigned int to_save, i = 0;
> 
> 	unsigned int pages_to_save;
> 	unsigned int i = 0;		/* Maybe a comment */
> 
> > ===================================================================
> > --- linux-2.6.17-rc3-mm1.orig/mm/vmscan.c
> > +++ linux-2.6.17-rc3-mm1/mm/vmscan.c
> > @@ -1437,7 +1437,68 @@ out:
> >  
> >  	return ret;
> >  }
> > -#endif
> > +
> > +static LIST_HEAD(saved_active_pages);
> > +static LIST_HEAD(saved_inactive_pages);
> > +
> > +/**
> > + *	move_out_of_lru - software suspend includes some pages in the
> > + *	suspend snapshot image without copying and these pages should be
> > + *	procected from being reclaimed, which can be done by (temporarily)
> > + *	moving them out of their respective LRU lists
> > + *
> > + *	It is to be called with local IRQs disabled.
> > + */
> > +
> > +void move_out_of_lru(struct page *page)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +
> > +	spin_lock(&zone->lru_lock);
> 
> Normally we'd check that PageLRU() is still true after acquiring the lock.

OK

> > +	if (PageActive(page)) {
> > +		del_page_from_active_list(zone, page);
> > +		list_add(&page->lru, &saved_active_pages);
> > +	} else {
> > +		del_page_from_inactive_list(zone, page);
> > +		list_add(&page->lru, &saved_inactive_pages);
> > +	}
> > +	ClearPageLRU(page);
> > +	spin_unlock(&zone->lru_lock);
> > +}
> > +
> > +
> > +/**
> > + *	restore_active_inactive_lists - used by the software suspend to move
> > + *	the pages taken out of the LRU by take_page_out_of_lru() back to
> > + *	their respective active/inactive lists (if the suspend fails)
> > + */
> > +
> > +void restore_active_inactive_lists(void)
> > +{
> > +	struct page *page;
> > +	struct zone *zone;
> > +
> > +	while(!list_empty(&saved_active_pages)) {
> 
> Add a space after `while'.

OK

> > +		page = lru_to_page(&saved_active_pages);
> > +		zone = page_zone(page);
> > +		list_del(&page->lru);
> > +		spin_lock_irq(&zone->lru_lock);
> > +		SetPageLRU(page);
> > +		add_page_to_active_list(zone, page);
> > +		spin_unlock_irq(&zone->lru_lock);
> > +	}
> > +
> > +	while(!list_empty(&saved_inactive_pages)) {
> 
> Ditto.

OK

> > +		page = lru_to_page(&saved_inactive_pages);
> > +		zone = page_zone(page);
> > +		list_del(&page->lru);
> > +		spin_lock_irq(&zone->lru_lock);
> > +		SetPageLRU(page);
> > +		add_page_to_inactive_list(zone, page);
> > +		spin_unlock_irq(&zone->lru_lock);
> > +	}
> > +}
> > +#endif /* CONFIG_PM */
> 
> All the above code is inside CONFIG_PM, but afaik it's only used by
> CONFIG_SOFTWARE_SUSPEND.

Yes, but CONFIG_PM has been used here for quite some time and I wanted
to change that in a separate patch, along with some header clean-ups.  Still
I can do it in this patch if that's better.

Thanks for the comments.  I'll do my best to fix all of the issues in the next
version.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09 10:19   ` Rafael J. Wysocki
@ 2006-05-09 11:22     ` Pavel Machek
  2006-05-09 12:18       ` Rafael J. Wysocki
  2006-05-09 12:30     ` Hugh Dickins
  2006-05-09 22:15     ` [PATCH -mm] swsusp: support creating bigger images (rev. 2) Rafael J. Wysocki
  2 siblings, 1 reply; 29+ messages in thread
From: Pavel Machek @ 2006-05-09 11:22 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, linux-kernel, nickpiggin

Hi!

> > EXTRA_PAGES is not a well-chosen identifier.  Please choose something
> > within the swsusp "namespace", if there's such a thing.
> 
> Unfortunately there's not any, but I'll try to invent a better
> name. :-)

PM_EXTRA_PAGES would probably be acceptable, as would
SUSPEND_EXTRA_PAGES be...
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09 11:22     ` Pavel Machek
@ 2006-05-09 12:18       ` Rafael J. Wysocki
  0 siblings, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-09 12:18 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, linux-kernel, nickpiggin

Hi,

On Tuesday 09 May 2006 13:22, Pavel Machek wrote:
> Hi!
> 
> > > EXTRA_PAGES is not a well-chosen identifier.  Please choose something
> > > within the swsusp "namespace", if there's such a thing.
> > 
> > Unfortunately there's not any, but I'll try to invent a better
> > name. :-)
> 
> PM_EXTRA_PAGES would probably be acceptable, as would
> SUSPEND_EXTRA_PAGES be...

Thanks, SUSPEND_EXTRA_PAGES sounds better to me.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09 10:19   ` Rafael J. Wysocki
  2006-05-09 11:22     ` Pavel Machek
@ 2006-05-09 12:30     ` Hugh Dickins
  2006-05-10  3:50       ` Nick Piggin
  2006-05-10 22:26       ` Rafael J. Wysocki
  2006-05-09 22:15     ` [PATCH -mm] swsusp: support creating bigger images (rev. 2) Rafael J. Wysocki
  2 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2006-05-09 12:30 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

On Tue, 9 May 2006, Rafael J. Wysocki wrote:
> On Tuesday 09 May 2006 09:33, Andrew Morton wrote:
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > 
> > I have a host of minor problems with this patch.
> > 
> > > --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c
> > > +++ linux-2.6.17-rc3-mm1/mm/rmap.c
> > >  
> > > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > > +static int page_mapped_by_task(struct page *page, struct task_struct *task)
> > > +{
> > > +	struct vm_area_struct *vma;
> > > +	int ret = 0;
> > > +
> > > +	spin_lock(&task->mm->page_table_lock);
> > > +
> > > +	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
> > > +		if (page_address_in_vma(page, vma) != -EFAULT) {
> > > +			ret = 1;
> > > +			break;
> > > +		}
> > > +
> > > +	spin_unlock(&task->mm->page_table_lock);
> > > +
> > > +	return ret;
> > > +}
> > 
> > task_struct.mm can sometimes be NULL.  This function assumes that it will
> > never be NULL.  That makes it a somewhat risky interface.  Are we sure it
> > can never be NULL?
> 
> Well, now it's only called for task == current, but I can add a check.

Better fold it into the (renamed and recommented) page_to_copy,
applying only to current.

The "use" of page_table_lock there is totally bogus.  Normally you
need down_read(&current->mm->mmap_sem) to walk that vma chain; but
I'm guessing you have everything sufficiently frozen here that you
don't need that.

But if it is sufficiently frozen, I'm puzzled as to why pages mapped
into the current process are (potentially) unsafe, while those mapped
into others are safe.  If the current process can get back to messing
with its mapped pages, what if it maps a page you earlier judged safe?

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-09 10:19   ` Rafael J. Wysocki
  2006-05-09 11:22     ` Pavel Machek
  2006-05-09 12:30     ` Hugh Dickins
@ 2006-05-09 22:15     ` Rafael J. Wysocki
  2006-05-09 22:27       ` Andrew Morton
  2 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-09 22:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin, pavel

On Tuesday 09 May 2006 12:19, Rafael J. Wysocki wrote:
> On Tuesday 09 May 2006 09:33, you wrote:
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
}-- snip --{
> 
> Thanks for the comments.  I'll do my best to fix all of the issues in the next
> version.

Corrected patch follows.

---
Currently swsusp is only capable of creating suspend images that are not
bigger than 1/2 of the normal zone, because it needs to create a copy of every
page which should be included in the image.  This may hurt the system
responsiveness after resume, especially on systems with less that 1 GB of RAM.

To allow swsusp to create bigger images we can use the observation that
if some pages don't change after the snapshot image of the system has been
created and before the image is entirely saved, they can be included in the
image without copying.  Now if the mapped pages that are not mapped by the
current task are considered, it turns out that they would change only if they
were reclaimed by try_to_free_pages().  Thus if we take them out of reach
of try_to_free_pages(), for example by (temporarily) moving them out of their
respective LRU lists after creating the image, we will be able to include them
in the image without copying.

If these pages are included in the image without copying, the amount of free
memory needed by swsusp is usually much less than otherwise, so the size
of the image may be bigger.  This also makes swsusp use less memory during
suspend and saves us quite a lot of memory allocations and copyings.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 include/linux/rmap.h    |    6 +
 include/linux/swap.h    |    4 
 kernel/power/power.h    |   14 ++
 kernel/power/snapshot.c |  230 +++++++++++++++++++++++++++++++++---------------
 kernel/power/swsusp.c   |   16 +--
 kernel/power/user.c     |    2 
 mm/rmap.c               |   44 +++++++++
 mm/vmscan.c             |   68 +++++++++++++-
 8 files changed, 306 insertions(+), 78 deletions(-)

Index: linux-2.6.17-rc3-mm1/mm/rmap.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/mm/rmap.c	2006-05-01 14:11:50.000000000 +0200
+++ linux-2.6.17-rc3-mm1/mm/rmap.c	2006-05-09 22:44:05.000000000 +0200
@@ -857,3 +857,47 @@ int try_to_unmap(struct page *page, int 
 	return ret;
 }
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int page_mapped_by_task(struct page *page, struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	if (!task)
+		return 0;
+
+	spin_lock(&task->mm->page_table_lock);
+
+	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
+		if (page_address_in_vma(page, vma) != -EFAULT) {
+			ret = 1;
+			break;
+		}
+
+	spin_unlock(&task->mm->page_table_lock);
+
+	return ret;
+}
+
+/**
+ *	suspend_safe_page - determine if a page can be included in the
+ *	suspend image without copying (returns true if so).
+ *
+ *	It is safe to include the page in the suspend image without
+ *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
+ *	(all tasks except for the current task should be frozen when it's
+ *	called).  Otherwise the page should be copied for this purpose
+ *	(compound pages are avoided for simplicity).
+ */
+
+int suspend_safe_page(struct page *page)
+{
+	if (!PageLRU(page) || PageCompound(page))
+		return 0;
+
+	if (page_mapped(page))
+		return !page_mapped_by_task(page, current);
+
+	return 0;
+}
+#endif /* CONFIG_SOFTWARE_SUSPEND */
Index: linux-2.6.17-rc3-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/include/linux/rmap.h	2006-05-01 14:11:47.000000000 +0200
+++ linux-2.6.17-rc3-mm1/include/linux/rmap.h	2006-05-09 22:43:40.000000000 +0200
@@ -104,6 +104,12 @@ pte_t *page_check_address(struct page *,
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
+/*
+ * Used to determine if the page can be included in the suspend image without
+ * copying
+ */
+int suspend_safe_page(struct page *);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux-2.6.17-rc3-mm1/kernel/power/snapshot.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/snapshot.c	2006-05-01 14:11:47.000000000 +0200
+++ linux-2.6.17-rc3-mm1/kernel/power/snapshot.c	2006-05-09 23:19:15.000000000 +0200
@@ -13,6 +13,7 @@
 #include <linux/version.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/rmap.h>
 #include <linux/suspend.h>
 #include <linux/smp_lock.h>
 #include <linux/delay.h>
@@ -261,6 +262,14 @@ int restore_special_mem(void)
 	return ret;
 }
 
+/* Represents a stacked allocated page to be used in the future */
+struct res_page {
+	struct res_page *next;
+	char padding[PAGE_SIZE - sizeof(void *)];
+};
+
+static struct res_page *page_list;
+
 static int pfn_is_nosave(unsigned long pfn)
 {
 	unsigned long nosave_begin_pfn = __pa(&__nosave_begin) >> PAGE_SHIFT;
@@ -269,7 +278,7 @@ static int pfn_is_nosave(unsigned long p
 }
 
 /**
- *	saveable - Determine whether a page should be cloned or not.
+ *	saveable - Determine whether a page should be saved or not.
  *	@pfn:	The page
  *
  *	We save a page if it's Reserved, and not in the range of pages
@@ -277,9 +286,8 @@ static int pfn_is_nosave(unsigned long p
  *	isn't part of a free chunk of pages.
  */
 
-static int saveable(struct zone *zone, unsigned long *zone_pfn)
+static int saveable(unsigned long pfn)
 {
-	unsigned long pfn = *zone_pfn + zone->zone_start_pfn;
 	struct page *page;
 
 	if (!pfn_valid(pfn))
@@ -296,46 +304,103 @@ static int saveable(struct zone *zone, u
 	return 1;
 }
 
-unsigned int count_data_pages(void)
+/**
+ *	count_data_pages - returns the number of pages that need to be copied
+ *	in order to be included in the suspend snapshot image.  If @total is
+ *	not null, the total number of pages that should be included in the
+ *	snapshot image is stored in the location pointed to by it.
+ */
+
+unsigned int count_data_pages(unsigned int *total)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
-	unsigned int n = 0;
+	unsigned int data_pages = 0;
+	unsigned int safe_pages = 0;
 
-	for_each_zone (zone) {
+	for_each_zone(zone) {
 		if (is_highmem(zone))
 			continue;
 		mark_free_pages(zone);
-		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
-			n += saveable(zone, &zone_pfn);
+		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
+			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
+
+			if (saveable(pfn)) {
+				data_pages++;
+				if (suspend_safe_page(pfn_to_page(pfn)))
+					safe_pages++;
+			}
+		}
 	}
-	return n;
+	if (total)
+		*total = data_pages;
+
+	return data_pages - safe_pages;
 }
 
+/**
+ *	copy_data_pages - populates the page backup list @pblist with
+ *	the addresses of the pages that should be included in the
+ *	suspend snapshot image.  The pages that cannot be included in the
+ *	image without copying are copied into empty page frames allocated
+ *	earlier and available from the list page_list (the addresses of
+ *	these page frames are also stored in the page backup list).
+ *
+ *	This function is only called from the critical section, ie. when
+ *	processes are frozen, there's only one CPU online and local IRQs
+ *	are disabled on it.
+ */
+
 static void copy_data_pages(struct pbe *pblist)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
 	struct pbe *pbe, *p;
+	struct res_page *ptr;
 
 	pbe = pblist;
 	for_each_zone (zone) {
 		if (is_highmem(zone))
 			continue;
+
 		mark_free_pages(zone);
-		/* This is necessary for swsusp_free() */
+		/* This is necessary for free_image() */
 		for_each_pb_page (p, pblist)
 			SetPageNosaveFree(virt_to_page(p));
+
 		for_each_pbe (p, pblist)
-			SetPageNosaveFree(virt_to_page(p->address));
+			if (p->address && p->orig_address != p->address)
+				SetPageNosaveFree(virt_to_page(p->address));
+
+		ptr = page_list;
+		while (ptr) {
+			SetPageNosaveFree(virt_to_page(ptr));
+			ptr = ptr->next;
+		}
+
 		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
-			if (saveable(zone, &zone_pfn)) {
+			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
+
+			if (saveable(pfn)) {
 				struct page *page;
-				page = pfn_to_page(zone_pfn + zone->zone_start_pfn);
+
 				BUG_ON(!pbe);
+				page = pfn_to_page(pfn);
 				pbe->orig_address = (unsigned long)page_address(page);
-				/* copy_page is not usable for copying task structs. */
-				memcpy((void *)pbe->address, (void *)pbe->orig_address, PAGE_SIZE);
+				if (suspend_safe_page(page)) {
+					pbe->address = pbe->orig_address;
+				} else {
+					BUG_ON(!page_list);
+					pbe->address = (unsigned long)page_list;
+					page_list = page_list->next;
+					/*
+					 * copy_page is not usable for copying
+					 * task structs.
+					 */
+					memcpy((void *)pbe->address,
+						(void *)pbe->orig_address,
+						PAGE_SIZE);
+				}
 				pbe = pbe->next;
 			}
 		}
@@ -419,7 +484,7 @@ static inline void *alloc_image_page(gfp
 	res = (void *)get_zeroed_page(gfp_mask);
 	if (safe_needed)
 		while (res && PageNosaveFree(virt_to_page(res))) {
-			/* The page is unsafe, mark it for swsusp_free() */
+			/* The page is unsafe, mark it for free_image() */
 			SetPageNosave(virt_to_page(res));
 			unsafe_pages++;
 			res = (void *)get_zeroed_page(gfp_mask);
@@ -474,11 +539,32 @@ static struct pbe *alloc_pagedir(unsigne
 }
 
 /**
+ *	protect_data_pages - move data pages that need to be protected from
+ *	being reclaimed (ie. those included in the suspend image without
+ *	copying) out of their respective LRU lists.  This is done after the
+ *	image has been created so the LRU lists only have to be restored if
+ *	the suspend fails.
+ *
+ *	This function is only called from the critical section, ie. when
+ *	processes are frozen, there's only one CPU online and local IRQs
+ *	are disabled on it.
+ */
+
+static void protect_data_pages(struct pbe *pblist)
+{
+	struct pbe *p;
+
+	for_each_pbe(p, pblist)
+		if (p->address == p->orig_address)
+			move_out_of_lru(virt_to_page(p->address));
+}
+
+/**
  * Free pages we allocated for suspend. Suspend pages are alocated
  * before atomic copy, so we need to free them after resume.
  */
 
-void swsusp_free(void)
+void free_image(void)
 {
 	struct zone *zone;
 	unsigned long zone_pfn;
@@ -501,6 +587,11 @@ void swsusp_free(void)
 	buffer = NULL;
 }
 
+void swsusp_free(void)
+{
+	free_image();
+	restore_active_inactive_lists();
+}
 
 /**
  *	enough_free_mem - Make sure we enough free memory to snapshot.
@@ -509,32 +600,33 @@ void swsusp_free(void)
  *	free pages.
  */
 
-static int enough_free_mem(unsigned int nr_pages)
+static int enough_free_mem(unsigned int nr_pages, unsigned int copy_pages)
 {
 	struct zone *zone;
-	unsigned int n = 0;
-
-	for_each_zone (zone)
-		if (!is_highmem(zone))
-			n += zone->free_pages;
-	pr_debug("swsusp: available memory: %u pages\n", n);
-	return n > (nr_pages + PAGES_FOR_IO +
-		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
-}
+	long avail_pages = 0;
 
-static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed)
-{
-	struct pbe *p;
+	pr_debug("swsusp: pages needed: %u + %lu + %lu\n",
+		 copy_pages,
+		 DIV_ROUND_UP(nr_pages, PBES_PER_PAGE),
+		 SUSPEND_EXTRA_PAGES);
+
+	for_each_zone(zone)
+		if (!is_highmem(zone) && populated_zone(zone)) {
+			avail_pages += zone->free_pages;
+			/*
+			 * We're going to use atomic allocations, so we
+			 * shouldn't count the lowmem reserves in the lower
+			 * zones as available to us
+			 */
+			avail_pages -= zone->lowmem_reserve[ZONE_NORMAL];
+		}
 
-	for_each_pbe (p, pblist) {
-		p->address = (unsigned long)alloc_image_page(gfp_mask, safe_needed);
-		if (!p->address)
-			return -ENOMEM;
-	}
-	return 0;
+	pr_debug("swsusp: available memory: %ld pages\n", avail_pages);
+	return avail_pages > (long)(copy_pages + SUSPEND_EXTRA_PAGES +
+		DIV_ROUND_UP(nr_pages, PBES_PER_PAGE));
 }
 
-static struct pbe *swsusp_alloc(unsigned int nr_pages)
+static struct pbe *swsusp_alloc(unsigned int nr_pages, unsigned int copy_pages)
 {
 	struct pbe *pblist;
 
@@ -543,10 +635,19 @@ static struct pbe *swsusp_alloc(unsigned
 		return NULL;
 	}
 
-	if (alloc_data_pages(pblist, GFP_ATOMIC | __GFP_COLD, 0)) {
-		printk(KERN_ERR "suspend: Allocating image pages failed.\n");
-		swsusp_free();
-		return NULL;
+	page_list = NULL;
+	while (copy_pages--) {
+		struct res_page *ptr;
+
+		ptr = alloc_image_page(GFP_ATOMIC | __GFP_COLD, 0);
+		if (!ptr) {
+			printk(KERN_ERR
+				"suspend: Allocating image pages failed.\n");
+			free_image();
+			return NULL;
+		}
+		ptr->next = page_list;
+		page_list = ptr;
 	}
 
 	return pblist;
@@ -555,24 +656,21 @@ static struct pbe *swsusp_alloc(unsigned
 asmlinkage int swsusp_save(void)
 {
 	unsigned int nr_pages;
+	unsigned int copy_pages;
 
 	pr_debug("swsusp: critical section: \n");
 
 	drain_local_pages();
-	nr_pages = count_data_pages();
-	printk("swsusp: Need to copy %u pages\n", nr_pages);
+	copy_pages = count_data_pages(&nr_pages);
+	printk("swsusp: Need to save %u pages\n", nr_pages);
+	printk("swsusp: %u pages to copy\n", copy_pages);
 
-	pr_debug("swsusp: pages needed: %u + %lu + %u, free: %u\n",
-		 nr_pages,
-		 (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
-		 PAGES_FOR_IO, nr_free_pages());
-
-	if (!enough_free_mem(nr_pages)) {
+	if (!enough_free_mem(nr_pages, copy_pages)) {
 		printk(KERN_ERR "swsusp: Not enough free memory\n");
 		return -ENOMEM;
 	}
 
-	pagedir_nosave = swsusp_alloc(nr_pages);
+	pagedir_nosave = swsusp_alloc(nr_pages, copy_pages);
 	if (!pagedir_nosave)
 		return -ENOMEM;
 
@@ -582,6 +680,9 @@ asmlinkage int swsusp_save(void)
 	drain_local_pages();
 	copy_data_pages(pagedir_nosave);
 
+	/* Make sure the pages that we have not copied won't be reclaimed */
+	protect_data_pages(pagedir_nosave);
+
 	/*
 	 * End of critical section. From now on, we can write to memory,
 	 * but we should not touch disk. This specially means we must _not_
@@ -591,7 +692,7 @@ asmlinkage int swsusp_save(void)
 	nr_copy_pages = nr_pages;
 	nr_meta_pages = (nr_pages * sizeof(long) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
-	printk("swsusp: critical section/: done (%d pages copied)\n", nr_pages);
+	printk("swsusp: critical section/: done (%d pages)\n", nr_pages);
 	return 0;
 }
 
@@ -654,7 +755,7 @@ int snapshot_read_next(struct snapshot_h
 	if (handle->page > nr_meta_pages + nr_copy_pages)
 		return 0;
 	if (!buffer) {
-		/* This makes the buffer be freed by swsusp_free() */
+		/* This makes the buffer be freed by free_image() */
 		buffer = alloc_image_page(GFP_ATOMIC, 0);
 		if (!buffer)
 			return -ENOMEM;
@@ -810,13 +911,6 @@ static inline struct pbe *unpack_orig_ad
  *	of "safe" which will be used later
  */
 
-struct safe_page {
-	struct safe_page *next;
-	char padding[PAGE_SIZE - sizeof(void *)];
-};
-
-static struct safe_page *safe_pages;
-
 static int prepare_image(struct snapshot_handle *handle)
 {
 	int error = 0;
@@ -833,21 +927,21 @@ static int prepare_image(struct snapshot
 		if (!pblist)
 			error = -ENOMEM;
 	}
-	safe_pages = NULL;
+	page_list = NULL;
 	if (!error && nr_pages > unsafe_pages) {
 		nr_pages -= unsafe_pages;
 		while (nr_pages--) {
-			struct safe_page *ptr;
+			struct res_page *ptr;
 
-			ptr = (struct safe_page *)get_zeroed_page(GFP_ATOMIC);
+			ptr = (struct res_page *)get_zeroed_page(GFP_ATOMIC);
 			if (!ptr) {
 				error = -ENOMEM;
 				break;
 			}
 			if (!PageNosaveFree(virt_to_page(ptr))) {
 				/* The page is "safe", add it to the list */
-				ptr->next = safe_pages;
-				safe_pages = ptr;
+				ptr->next = page_list;
+				page_list = ptr;
 			}
 			/* Mark the page as allocated */
 			SetPageNosave(virt_to_page(ptr));
@@ -858,7 +952,7 @@ static int prepare_image(struct snapshot
 		pagedir_nosave = pblist;
 	} else {
 		handle->pbe = NULL;
-		swsusp_free();
+		free_image();
 	}
 	return error;
 }
@@ -882,8 +976,8 @@ static void *get_buffer(struct snapshot_
 	 * The "original" page frame has not been allocated and we have to
 	 * use a "safe" page frame to store the read page
 	 */
-	pbe->address = (unsigned long)safe_pages;
-	safe_pages = safe_pages->next;
+	pbe->address = (unsigned long)page_list;
+	page_list = page_list->next;
 	if (last)
 		last->next = pbe;
 	handle->last_pbe = pbe;
@@ -919,7 +1013,7 @@ int snapshot_write_next(struct snapshot_
 	if (handle->prev && handle->page > nr_meta_pages + nr_copy_pages)
 		return 0;
 	if (!buffer) {
-		/* This makes the buffer be freed by swsusp_free() */
+		/* This makes the buffer be freed by free_image() */
 		buffer = alloc_image_page(GFP_ATOMIC, 0);
 		if (!buffer)
 			return -ENOMEM;
Index: linux-2.6.17-rc3-mm1/kernel/power/power.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/power.h	2006-05-01 14:11:47.000000000 +0200
+++ linux-2.6.17-rc3-mm1/kernel/power/power.h	2006-05-09 23:03:35.000000000 +0200
@@ -48,7 +48,18 @@ extern dev_t swsusp_resume_device;
 extern asmlinkage int swsusp_arch_suspend(void);
 extern asmlinkage int swsusp_arch_resume(void);
 
-extern unsigned int count_data_pages(void);
+/*
+ * If the userland suspend is used, some extra space will be needed during
+ * resume for the initrd, the resume binary and its data
+ */
+#if (defined(CONFIG_BLK_DEV_INITRD) && defined(CONFIG_BLK_DEV_RAM))
+#define SUSPEND_EXTRA_PAGES	(PAGES_FOR_IO + \
+				(2 * CONFIG_BLK_DEV_RAM_SIZE * 1024) / PAGE_SIZE)
+#else
+#define SUSPEND_EXTRA_PAGES	PAGES_FOR_IO
+#endif
+
+extern unsigned int count_data_pages(unsigned int *total);
 
 struct snapshot_handle {
 	loff_t		offset;
@@ -111,6 +122,7 @@ extern int restore_special_mem(void);
 
 extern int swsusp_check(void);
 extern int swsusp_shrink_memory(void);
+extern void free_image(void);
 extern void swsusp_free(void);
 extern int swsusp_suspend(void);
 extern int swsusp_resume(void);
Index: linux-2.6.17-rc3-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/swsusp.c	2006-05-01 14:11:47.000000000 +0200
+++ linux-2.6.17-rc3-mm1/kernel/power/swsusp.c	2006-05-09 23:17:32.000000000 +0200
@@ -177,28 +177,30 @@ int swsusp_shrink_memory(void)
 	long size, tmp;
 	struct zone *zone;
 	unsigned long pages = 0;
-	unsigned int i = 0;
+	unsigned int pages_to_save;
+	unsigned int i = 0; /* for the progress meter */
 	char *p = "-\\|/";
 
 	printk("Shrinking memory...  ");
 	do {
 		size = 2 * count_special_pages();
-		size += size / 50 + count_data_pages();
-		size += (size + PBES_PER_PAGE - 1) / PBES_PER_PAGE +
-			PAGES_FOR_IO;
+		size += size / 50 + count_data_pages(&pages_to_save);
+		size += DIV_ROUND_UP(pages_to_save, PBES_PER_PAGE) +
+			SUSPEND_EXTRA_PAGES;
 		tmp = size;
 		for_each_zone (zone)
 			if (!is_highmem(zone) && populated_zone(zone)) {
 				tmp -= zone->free_pages;
 				tmp += zone->lowmem_reserve[ZONE_NORMAL];
 			}
+		size = pages_to_save * (PAGE_SIZE + sizeof(long));
 		if (tmp > 0) {
 			tmp = __shrink_memory(tmp);
 			if (!tmp)
 				return -ENOMEM;
 			pages += tmp;
-		} else if (size > image_size / PAGE_SIZE) {
-			tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
+		} else if (size > image_size) {
+			tmp = __shrink_memory(size - image_size);
 			pages += tmp;
 		}
 		printk("\b%c", p[i++%4]);
@@ -261,7 +263,7 @@ int swsusp_resume(void)
 	 * very tight, so we have to free it as soon as we can to avoid
 	 * subsequent failures
 	 */
-	swsusp_free();
+	free_image();
 	restore_processor_state();
 	restore_special_mem();
 	touch_softlockup_watchdog();
Index: linux-2.6.17-rc3-mm1/kernel/power/user.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/kernel/power/user.c	2006-05-01 14:11:21.000000000 +0200
+++ linux-2.6.17-rc3-mm1/kernel/power/user.c	2006-05-01 16:00:00.000000000 +0200
@@ -153,6 +153,8 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_UNFREEZE:
 		if (!data->frozen)
 			break;
+		if (data->ready)
+			swsusp_free();
 		down(&pm_sem);
 		thaw_processes();
 		enable_nonboot_cpus();
Index: linux-2.6.17-rc3-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.17-rc3-mm1.orig/mm/vmscan.c	2006-05-01 14:11:50.000000000 +0200
+++ linux-2.6.17-rc3-mm1/mm/vmscan.c	2006-05-09 23:26:24.000000000 +0200
@@ -1437,7 +1437,73 @@ out:
 
 	return ret;
 }
-#endif
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static LIST_HEAD(saved_active_pages);
+static LIST_HEAD(saved_inactive_pages);
+
+/**
+ *	move_out_of_lru - software suspend includes some pages in the
+ *	suspend snapshot image without copying and these pages should be
+ *	procected from being reclaimed, which can be done by (temporarily)
+ *	moving them out of their respective LRU lists
+ *
+ *	It is to be called with local IRQs disabled.
+ */
+
+void move_out_of_lru(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock(&zone->lru_lock);
+	if (likely(PageLRU(page))) {
+		if (PageActive(page)) {
+			del_page_from_active_list(zone, page);
+			list_add(&page->lru, &saved_active_pages);
+		} else {
+			del_page_from_inactive_list(zone, page);
+			list_add(&page->lru, &saved_inactive_pages);
+		}
+		ClearPageLRU(page);
+	}
+	spin_unlock(&zone->lru_lock);
+}
+
+
+/**
+ *	restore_active_inactive_lists - used by the software suspend to move
+ *	the pages taken out of the LRU by take_page_out_of_lru() back to
+ *	their respective active/inactive lists (if the suspend fails)
+ */
+
+void restore_active_inactive_lists(void)
+{
+	struct page *page;
+	struct zone *zone;
+
+	while (!list_empty(&saved_active_pages)) {
+		page = lru_to_page(&saved_active_pages);
+		zone = page_zone(page);
+		list_del(&page->lru);
+		spin_lock_irq(&zone->lru_lock);
+		SetPageLRU(page);
+		add_page_to_active_list(zone, page);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+
+	while (!list_empty(&saved_inactive_pages)) {
+		page = lru_to_page(&saved_inactive_pages);
+		zone = page_zone(page);
+		list_del(&page->lru);
+		spin_lock_irq(&zone->lru_lock);
+		SetPageLRU(page);
+		add_page_to_inactive_list(zone, page);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+}
+#endif /* CONFIG_SOFTWARE_SUSPEND */
+
+#endif /* CONFIG_PM */
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
Index: linux-2.6.17-rc3-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.17-rc3-mm1.orig/include/linux/swap.h	2006-05-01 14:11:47.000000000 +0200
+++ linux-2.6.17-rc3-mm1/include/linux/swap.h	2006-05-07 14:17:36.000000000 +0200
@@ -184,7 +184,9 @@ extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zone **, gfp_t);
-extern unsigned long shrink_all_memory(unsigned long nr_pages);
+extern unsigned long shrink_all_memory(unsigned long);
+extern void move_out_of_lru(struct page *);
+extern void restore_active_inactive_lists(void);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-09 22:15     ` [PATCH -mm] swsusp: support creating bigger images (rev. 2) Rafael J. Wysocki
@ 2006-05-09 22:27       ` Andrew Morton
  2006-05-10 22:58         ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2006-05-09 22:27 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, nickpiggin, pavel

"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> Now if the mapped pages that are not mapped by the
>  current task are considered, it turns out that they would change only if they
>  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
>  of try_to_free_pages(), for example by (temporarily) moving them out of their
>  respective LRU lists after creating the image, we will be able to include them
>  in the image without copying.

I'm a bit curious about how this is true.  There are all sorts of way in
which there could be activity against these pages - interrupt-time
asynchronous network Tx completion, async interrupt-time direct-io
completion, tasklets, schedule_work(), etc, etc.

So...  could we check your homework on this please?  How come only page
reclaim can disturb these pages?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09 12:30     ` Hugh Dickins
@ 2006-05-10  3:50       ` Nick Piggin
  2006-05-10 22:26       ` Rafael J. Wysocki
  1 sibling, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2006-05-10  3:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Rafael J. Wysocki, Andrew Morton, linux-kernel, pavel

Hugh Dickins wrote:

>On Tue, 9 May 2006, Rafael J. Wysocki wrote:
>
>>On Tuesday 09 May 2006 09:33, Andrew Morton wrote:
>>    
>>
>>>task_struct.mm can sometimes be NULL.  This function assumes that it will
>>>never be NULL.  That makes it a somewhat risky interface.  Are we sure it
>>>can never be NULL?
>>>
>>Well, now it's only called for task == current, but I can add a check.
>>
>
>Better fold it into the (renamed and recommented) page_to_copy,
>applying only to current.
>
>The "use" of page_table_lock there is totally bogus.  Normally you
>need down_read(&current->mm->mmap_sem) to walk that vma chain; but
>I'm guessing you have everything sufficiently frozen here that you
>don't need that.
>

I have to admit that I suggested making this change, because the
function was sufficently generic looking. I guess the mm == NULL
case should logically return 0... unless you did that, making
page_mapped_by_task use current still leaves the burden of ensuring
->mm != NULL on the caller.

But I don't much mind which way the consensus goes.

---

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-09 12:30     ` Hugh Dickins
  2006-05-10  3:50       ` Nick Piggin
@ 2006-05-10 22:26       ` Rafael J. Wysocki
  2006-05-11 15:01         ` Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-10 22:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

On Tuesday 09 May 2006 14:30, Hugh Dickins wrote:
> On Tue, 9 May 2006, Rafael J. Wysocki wrote:
> > On Tuesday 09 May 2006 09:33, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > 
> > > I have a host of minor problems with this patch.
> > > 
> > > > --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c
> > > > +++ linux-2.6.17-rc3-mm1/mm/rmap.c
> > > >  
> > > > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > > > +static int page_mapped_by_task(struct page *page, struct task_struct *task)
> > > > +{
> > > > +	struct vm_area_struct *vma;
> > > > +	int ret = 0;
> > > > +
> > > > +	spin_lock(&task->mm->page_table_lock);
> > > > +
> > > > +	for (vma = task->mm->mmap; vma; vma = vma->vm_next)
> > > > +		if (page_address_in_vma(page, vma) != -EFAULT) {
> > > > +			ret = 1;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +	spin_unlock(&task->mm->page_table_lock);
> > > > +
> > > > +	return ret;
> > > > +}
> > > 
> > > task_struct.mm can sometimes be NULL.  This function assumes that it will
> > > never be NULL.  That makes it a somewhat risky interface.  Are we sure it
> > > can never be NULL?
> > 
> > Well, now it's only called for task == current, but I can add a check.
> 
> Better fold it into the (renamed and recommented) page_to_copy,
> applying only to current.
> 
> The "use" of page_table_lock there is totally bogus.  Normally you
> need down_read(&current->mm->mmap_sem) to walk that vma chain; but
> I'm guessing you have everything sufficiently frozen here that you
> don't need that.

So I think it can be changed into something like this:

--- linux-2.6.17-rc3-mm1.orig/mm/rmap.c	2006-05-01 14:11:50.000000000 +0200
+++ linux-2.6.17-rc3-mm1/mm/rmap.c	2006-05-10 23:10:58.000000000 +0200
@@ -857,3 +857,38 @@ int try_to_unmap(struct page *page, int 
 	return ret;
 }
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+/**
+ *	suspend_safe_page - determine if a page can be included in the
+ *	suspend image without copying (returns true if so).
+ *
+ *	It is safe to include the page in the suspend image without
+ *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
+ *	(all tasks except for the current task should be frozen when it's
+ *	called).  Otherwise the page should be copied for this purpose
+ *	(compound pages are avoided for simplicity).
+ */
+
+int suspend_safe_page(struct page *page)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	if (!PageLRU(page) || PageCompound(page))
+		return 0;
+
+	if (page_mapped(page)) {
+		ret = 1;
+		down_read(&current->mm->mmap_sem);
+		/* Return 0 if the page is mapped by the current task */
+		for (vma = current->mm->mmap; vma; vma = vma->vm_next)
+			if (page_address_in_vma(page, vma) != -EFAULT) {
+				ret = 0;
+				break;
+			}
+		up_read(&current->mm->mmap_sem);
+	}
+
+	return ret;
+}
+#endif /* CONFIG_SOFTWARE_SUSPEND */

I've left the locking, because the functions is called when we're freeing
memory and I'm not 100% sure it's safe to drop it.

> But if it is sufficiently frozen, I'm puzzled as to why pages mapped
> into the current process are (potentially) unsafe, while those mapped
> into others are safe.  If the current process can get back to messing
> with its mapped pages, what if it maps a page you earlier judged safe?

The current task is forbidden to map anything at this point.

Greetings,
Rafael


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-09 22:27       ` Andrew Morton
@ 2006-05-10 22:58         ` Rafael J. Wysocki
  2006-05-10 23:38           ` Andrew Morton
  2006-05-11 11:35           ` Pavel Machek
  0 siblings, 2 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-10 22:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin, pavel

On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > Now if the mapped pages that are not mapped by the
> >  current task are considered, it turns out that they would change only if they
> >  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> >  of try_to_free_pages(), for example by (temporarily) moving them out of their
> >  respective LRU lists after creating the image, we will be able to include them
> >  in the image without copying.
> 
> I'm a bit curious about how this is true.  There are all sorts of way in
> which there could be activity against these pages - interrupt-time
> asynchronous network Tx completion, async interrupt-time direct-io
> completion, tasklets, schedule_work(), etc, etc.

AFAIK, many of these things are waited for uninterruptibly, and uninterruptible
tasks cannot be frozen.  Theoretically we may have a problem if there's an
interruptible task that waits for the completion of an operation that gets
finished after snapshotting the system.  However that would have to survive the
syncing of filesystems, freezing of kernel threads, freeing of memory as well
as suspending and resuming all devices.  [In which case it would be starving
to death. :-)]

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-10 22:58         ` Rafael J. Wysocki
@ 2006-05-10 23:38           ` Andrew Morton
  2006-05-11  0:11             ` Nigel Cunningham
  2006-05-11 13:15             ` Rafael J. Wysocki
  2006-05-11 11:35           ` Pavel Machek
  1 sibling, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2006-05-10 23:38 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, nickpiggin, pavel

"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > Now if the mapped pages that are not mapped by the
> > >  current task are considered, it turns out that they would change only if they
> > >  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> > >  of try_to_free_pages(), for example by (temporarily) moving them out of their
> > >  respective LRU lists after creating the image, we will be able to include them
> > >  in the image without copying.
> > 
> > I'm a bit curious about how this is true.  There are all sorts of way in
> > which there could be activity against these pages - interrupt-time
> > asynchronous network Tx completion, async interrupt-time direct-io
> > completion, tasklets, schedule_work(), etc, etc.
> 
> AFAIK, many of these things are waited for uninterruptibly, and uninterruptible
> tasks cannot be frozen.

There can be situations where we won't be waiting on this IO at all. 
Network zero-copy transmit, for example.

Or maybe there's some async writeback going on against pagecache - we'll
end up looking at the page's LRU state within interrupt context at IO
completion.  (A sync would prevent this from happening).

One possibly problematic scenario is where task A is doing a direct-IO read
and task B truncates the same file - here, the page will be actually
removed from the LRU and freed in interrupt context.  The direct-IO read
process will be waiting on the IO in D state though.  It it was a
synchronous read - if it was an AIO read then it won't be waiting on the
IO.  Something else might save us here, but it's fragile.

>  Theoretically we may have a problem if there's an
> interruptible task that waits for the completion of an operation that gets
> finished after snapshotting the system.  However that would have to survive the
> syncing of filesystems, freezing of kernel threads, freeing of memory as well
> as suspending and resuming all devices.  [In which case it would be starving
> to death. :-)]

hm.  It's all a bit of a worry.  I don't understand what swsusp is trying
to do here sufficiently well to be able to advise, sorry.  I was rather
surprised to learn that it's presently taking copies of all these pages...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-10 23:38           ` Andrew Morton
@ 2006-05-11  0:11             ` Nigel Cunningham
  2006-05-11 13:20               ` Rafael J. Wysocki
  2006-05-11 13:15             ` Rafael J. Wysocki
  1 sibling, 1 reply; 29+ messages in thread
From: Nigel Cunningham @ 2006-05-11  0:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rafael J. Wysocki, linux-kernel, nickpiggin, pavel

[-- Attachment #1: Type: text/plain, Size: 3225 bytes --]

Hi Andrew et al.

On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > Now if the mapped pages that are not mapped by the
> > > >  current task are considered, it turns out that they would change
> > > > only if they were reclaimed by try_to_free_pages().  Thus if we take
> > > > them out of reach of try_to_free_pages(), for example by
> > > > (temporarily) moving them out of their respective LRU lists after
> > > > creating the image, we will be able to include them in the image
> > > > without copying.
> > >
> > > I'm a bit curious about how this is true.  There are all sorts of way
> > > in which there could be activity against these pages - interrupt-time
> > > asynchronous network Tx completion, async interrupt-time direct-io
> > > completion, tasklets, schedule_work(), etc, etc.
> >
> > AFAIK, many of these things are waited for uninterruptibly, and
> > uninterruptible tasks cannot be frozen.
>
> There can be situations where we won't be waiting on this IO at all.
> Network zero-copy transmit, for example.
>
> Or maybe there's some async writeback going on against pagecache - we'll
> end up looking at the page's LRU state within interrupt context at IO
> completion.  (A sync would prevent this from happening).

I believe more than a sync is needed in at least some cases. I've seen XFS 
continue to submit I/O (presumably on the sb or such like) after everything 
else has been frozen and data has been synced. Freezing bdevs addressed this.

> One possibly problematic scenario is where task A is doing a direct-IO read
> and task B truncates the same file - here, the page will be actually
> removed from the LRU and freed in interrupt context.  The direct-IO read
> process will be waiting on the IO in D state though.  It it was a
> synchronous read - if it was an AIO read then it won't be waiting on the
> IO.  Something else might save us here, but it's fragile.

Bdev freezing helps here too, right?

> >  Theoretically we may have a problem if there's an
> > interruptible task that waits for the completion of an operation that
> > gets finished after snapshotting the system.  However that would have to
> > survive the syncing of filesystems, freezing of kernel threads, freeing
> > of memory as well as suspending and resuming all devices.  [In which case
> > it would be starving to death. :-)]

(For Rafael/Pavel): The swsusp version of the refrigerator signals these 
processes to enter the freezer too, just in case the uninterruptible task 
does continue, right?

Regards,

Nigel

> hm.  It's all a bit of a worry.  I don't understand what swsusp is trying
> to do here sufficiently well to be able to advise, sorry.  I was rather
> surprised to learn that it's presently taking copies of all these pages...
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-10 22:58         ` Rafael J. Wysocki
  2006-05-10 23:38           ` Andrew Morton
@ 2006-05-11 11:35           ` Pavel Machek
  2006-05-11 12:10             ` Rafael J. Wysocki
  2006-05-11 23:49             ` Nigel Cunningham
  1 sibling, 2 replies; 29+ messages in thread
From: Pavel Machek @ 2006-05-11 11:35 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, linux-kernel, nickpiggin

On Čt 11-05-06 00:58:18, Rafael J. Wysocki wrote:
> On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > Now if the mapped pages that are not mapped by the
> > >  current task are considered, it turns out that they would change only if they
> > >  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> > >  of try_to_free_pages(), for example by (temporarily) moving them out of their
> > >  respective LRU lists after creating the image, we will be able to include them
> > >  in the image without copying.
> > 
> > I'm a bit curious about how this is true.  There are all sorts of way in
> > which there could be activity against these pages - interrupt-time
> > asynchronous network Tx completion, async interrupt-time direct-io
> > completion, tasklets, schedule_work(), etc, etc.
> 
> AFAIK, many of these things are waited for uninterruptibly, and
> uninterruptible

Well, "many of these things" makes me nervous.

> tasks cannot be frozen.  Theoretically we may have a problem if there's an
> interruptible task that waits for the completion of an operation that gets
> finished after snapshotting the system.  

I'd prefer not to have even theoretical problems. If we don't _know_
why patch is safe, I'd prefer not to have it.

Needing bdev freezing is bad sign, too.

We are talking 10% speedup here (on low-mem-machines, IIRC), but whole
design has just got way more complex. Previous snapshot was really
atomic, and apart from NMI, it was "independend" from the rest of the
system.

New design depends on bdev freezing (depending on XFS details we do
not understand), and depends on all the other parts of kernel using
uninteruptible (when we know that networking sleeps interruptibly).

Too much uncertainity for 10% speedup, I'm afraid. Yes, it was really
clever to get this fundamental change down to few hundred lines, but
design complexity remains. Could we drop that patch?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 11:35           ` Pavel Machek
@ 2006-05-11 12:10             ` Rafael J. Wysocki
  2006-05-11 23:49             ` Nigel Cunningham
  1 sibling, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-11 12:10 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, linux-kernel, nickpiggin

On Thursday 11 May 2006 13:35, Pavel Machek wrote:
> On Čt 11-05-06 00:58:18, Rafael J. Wysocki wrote:
> > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > >
> > > > Now if the mapped pages that are not mapped by the
> > > >  current task are considered, it turns out that they would change only if they
> > > >  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> > > >  of try_to_free_pages(), for example by (temporarily) moving them out of their
> > > >  respective LRU lists after creating the image, we will be able to include them
> > > >  in the image without copying.
> > > 
> > > I'm a bit curious about how this is true.  There are all sorts of way in
> > > which there could be activity against these pages - interrupt-time
> > > asynchronous network Tx completion, async interrupt-time direct-io
> > > completion, tasklets, schedule_work(), etc, etc.
> > 
> > AFAIK, many of these things are waited for uninterruptibly, and
> > uninterruptible
> 
> Well, "many of these things" makes me nervous.
> 
> > tasks cannot be frozen.  Theoretically we may have a problem if there's an
> > interruptible task that waits for the completion of an operation that gets
> > finished after snapshotting the system.  
> 
> I'd prefer not to have even theoretical problems. If we don't _know_
> why patch is safe, I'd prefer not to have it.

OK (still I think I'll be able to provide some more precise arguments for it,
but I'll need some more time).

> Needing bdev freezing is bad sign, too.

Well, acutually we should understand why it helps I think, which makes it
interesting anyway, because in theory even without the patch we may be
vulnerable to a leftover "completion" that causes data to be written to a
storage.

> We are talking 10% speedup here (on low-mem-machines, IIRC), but whole
> design has just got way more complex. Previous snapshot was really
> atomic, and apart from NMI, it was "independend" from the rest of the
> system.
> 
> New design depends on bdev freezing (depending on XFS details we do
> not understand), and depends on all the other parts of kernel using
> uninteruptible (when we know that networking sleeps interruptibly).
> 
> Too much uncertainity for 10% speedup, I'm afraid. Yes, it was really
> clever to get this fundamental change down to few hundred lines,

:-)

> but design complexity remains. Could we drop that patch?

Yes, Andrew please drop it.

[Nonetheless I'd appreciate it if someone could suggest a specific failing
scenario to me.]

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-10 23:38           ` Andrew Morton
  2006-05-11  0:11             ` Nigel Cunningham
@ 2006-05-11 13:15             ` Rafael J. Wysocki
  1 sibling, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-11 13:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin, pavel

On Thursday 11 May 2006 01:38, Andrew Morton wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > >
> > > > Now if the mapped pages that are not mapped by the
> > > >  current task are considered, it turns out that they would change only if they
> > > >  were reclaimed by try_to_free_pages().  Thus if we take them out of reach
> > > >  of try_to_free_pages(), for example by (temporarily) moving them out of their
> > > >  respective LRU lists after creating the image, we will be able to include them
> > > >  in the image without copying.
> > > 
> > > I'm a bit curious about how this is true.  There are all sorts of way in
> > > which there could be activity against these pages - interrupt-time
> > > asynchronous network Tx completion, async interrupt-time direct-io
> > > completion, tasklets, schedule_work(), etc, etc.
> > 
> > AFAIK, many of these things are waited for uninterruptibly, and uninterruptible
> > tasks cannot be frozen.
> 
> There can be situations where we won't be waiting on this IO at all. 
> Network zero-copy transmit, for example.
> 
> Or maybe there's some async writeback going on against pagecache - we'll
> end up looking at the page's LRU state within interrupt context at IO
> completion.  (A sync would prevent this from happening).
> 
> One possibly problematic scenario is where task A is doing a direct-IO read
> and task B truncates the same file - here, the page will be actually
> removed from the LRU and freed in interrupt context.  The direct-IO read
> process will be waiting on the IO in D state though.  It it was a
> synchronous read - if it was an AIO read then it won't be waiting on the
> IO.  Something else might save us here, but it's fragile.

Yes, that's one situation to be considered in detail (which I evidently failed
to do :-().

I think the problem is whether any modification agains these pages can
be triggered from interrupt context _after_ we have snapshotted the system and
resumed devices (in order to write the image).

> >  Theoretically we may have a problem if there's an
> > interruptible task that waits for the completion of an operation that gets
> > finished after snapshotting the system.  However that would have to survive the
> > syncing of filesystems, freezing of kernel threads, freeing of memory as well
> > as suspending and resuming all devices.  [In which case it would be starving
> > to death. :-)]
> 
> hm.  It's all a bit of a worry.  I don't understand what swsusp is trying
> to do here sufficiently well to be able to advise, sorry.  I was rather
> surprised to learn that it's presently taking copies of all these pages...

Well, I'm trying to avoid copying every single page in the system in the
process of creating the snapshot image.  The idea behind the patch was
to select a susbset of pages that could be safely included in the image
without copying and such that we could identify them easily.  [The LRU
pages obviously come to mind here.]

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11  0:11             ` Nigel Cunningham
@ 2006-05-11 13:20               ` Rafael J. Wysocki
  2006-05-11 23:45                 ` Nigel Cunningham
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-11 13:20 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

Hi,

On Thursday 11 May 2006 02:11, Nigel Cunningham wrote:
> Hi Andrew et al.
> 
> On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > Now if the mapped pages that are not mapped by the
> > > > >  current task are considered, it turns out that they would change
> > > > > only if they were reclaimed by try_to_free_pages().  Thus if we take
> > > > > them out of reach of try_to_free_pages(), for example by
> > > > > (temporarily) moving them out of their respective LRU lists after
> > > > > creating the image, we will be able to include them in the image
> > > > > without copying.
> > > >
> > > > I'm a bit curious about how this is true.  There are all sorts of way
> > > > in which there could be activity against these pages - interrupt-time
> > > > asynchronous network Tx completion, async interrupt-time direct-io
> > > > completion, tasklets, schedule_work(), etc, etc.
> > >
> > > AFAIK, many of these things are waited for uninterruptibly, and
> > > uninterruptible tasks cannot be frozen.
> >
> > There can be situations where we won't be waiting on this IO at all.
> > Network zero-copy transmit, for example.
> >
> > Or maybe there's some async writeback going on against pagecache - we'll
> > end up looking at the page's LRU state within interrupt context at IO
> > completion.  (A sync would prevent this from happening).
> 
> I believe more than a sync is needed in at least some cases. I've seen XFS 
> continue to submit I/O (presumably on the sb or such like) after everything 
> else has been frozen and data has been synced. Freezing bdevs addressed this.
> 
> > One possibly problematic scenario is where task A is doing a direct-IO read
> > and task B truncates the same file - here, the page will be actually
> > removed from the LRU and freed in interrupt context.  The direct-IO read
> > process will be waiting on the IO in D state though.  It it was a
> > synchronous read - if it was an AIO read then it won't be waiting on the
> > IO.  Something else might save us here, but it's fragile.
> 
> Bdev freezing helps here too, right?

Well, I'm not sure.  How exactly?

> > >  Theoretically we may have a problem if there's an
> > > interruptible task that waits for the completion of an operation that
> > > gets finished after snapshotting the system.  However that would have to
> > > survive the syncing of filesystems, freezing of kernel threads, freeing
> > > of memory as well as suspending and resuming all devices.  [In which case
> > > it would be starving to death. :-)]
> 
> (For Rafael/Pavel): The swsusp version of the refrigerator signals these 
> processes to enter the freezer too, just in case the uninterruptible task 
> does continue, right?

Uninterruptible tasks are not freezable with the swsusp's freezer at all.
The other tasks are signaled to enter the refrigerator - first user space,
then we sync filesystems and finally we freeze kernel threads.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-10 22:26       ` Rafael J. Wysocki
@ 2006-05-11 15:01         ` Hugh Dickins
  2006-05-11 21:19           ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2006-05-11 15:01 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

On Thu, 11 May 2006, Rafael J. Wysocki wrote:
> On Tuesday 09 May 2006 14:30, Hugh Dickins wrote:
> 
> --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c	2006-05-01 14:11:50.000000000 +0200
> +++ linux-2.6.17-rc3-mm1/mm/rmap.c	2006-05-10 23:10:58.000000000 +0200
> @@ -857,3 +857,38 @@ int try_to_unmap(struct page *page, int 
>  	return ret;
>  }
>  
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +/**
> + *	suspend_safe_page - determine if a page can be included in the

suspend_safe_page sounds like it's suspending a safe page,
and safe is unclear: suspend_knows_page_is_frozen?

> + *	suspend image without copying (returns true if so).
> + *
> + *	It is safe to include the page in the suspend image without
> + *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task

That's not quite what the code checks:
(b) it's mapped but not by the the current task

(Actually, page_address_in_vma doesn't really say whether the page
is mapped by the task, but -EFAULT does tell you that it can't be.)

I still find its checks very obscure: am I right to think that most
pages are "frozen" at this point, that it's very hard to determine
which are and which are not, but there happens to be this "little"
category of pages which you can be damn sure are frozen - and on
most active systems, this "little" category actually covers a
large number of pages which it's well worth avoiding copying?

A very loose heuristic which works well enough to make a big difference.

> + *	(all tasks except for the current task should be frozen when it's
> + *	called).  Otherwise the page should be copied for this purpose
> + *	(compound pages are avoided for simplicity).
> + */
> +
> +int suspend_safe_page(struct page *page)
> +{
> +	struct vm_area_struct *vma;
> +	int ret = 0;
> +
> +	if (!PageLRU(page) || PageCompound(page))
> +		return 0;
> +
> +	if (page_mapped(page)) {
> +		ret = 1;
> +		down_read(&current->mm->mmap_sem);
> +		/* Return 0 if the page is mapped by the current task */
> +		for (vma = current->mm->mmap; vma; vma = vma->vm_next)
> +			if (page_address_in_vma(page, vma) != -EFAULT) {
> +				ret = 0;
> +				break;
> +			}
> +		up_read(&current->mm->mmap_sem);
> +	}
> +
> +	return ret;
> +}
> +#endif /* CONFIG_SOFTWARE_SUSPEND */
> 
> I've left the locking, because the functions is called when we're freeing
> memory and I'm not 100% sure it's safe to drop it.

If the locking is necessary, then that down_read is liable to schedule
and wait for mmap_sem to be released.  But you have interrupts disabled?
And everything which might release mmap_sem already frozen?  My guess is
that it's a lot safer _not_ to lock here than to lock; but just how safe
that is I'm not at all sure.

> > But if it is sufficiently frozen, I'm puzzled as to why pages mapped
> > into the current process are (potentially) unsafe, while those mapped
> > into others are safe.  If the current process can get back to messing
> > with its mapped pages, what if it maps a page you earlier judged safe?
> 
> The current task is forbidden to map anything at this point.

Too bald a statement for me to judge (and by forbidden to map,
do you mean forbidden to mmap, or forbidden to fault?).

Perhaps I'd understand better if you explain why pages mapped into the
current task have to be excluded?  It just seems to me that if it can
interfere with those pages, then it is liable to interfere with other
pages too, including some mapped into other processes which you've
already judged to be safe/frozen.

Sorry if I'm wasting your time, forcing you to spell things out to
someone who hasn't a clue what you're up to; but it all smells a
little fishy to me.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images
  2006-05-11 15:01         ` Hugh Dickins
@ 2006-05-11 21:19           ` Rafael J. Wysocki
  0 siblings, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-11 21:19 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

On Thursday 11 May 2006 17:01, Hugh Dickins wrote:
> On Thu, 11 May 2006, Rafael J. Wysocki wrote:
> > On Tuesday 09 May 2006 14:30, Hugh Dickins wrote:
> > 
> > --- linux-2.6.17-rc3-mm1.orig/mm/rmap.c	2006-05-01 14:11:50.000000000 +0200
> > +++ linux-2.6.17-rc3-mm1/mm/rmap.c	2006-05-10 23:10:58.000000000 +0200
> > @@ -857,3 +857,38 @@ int try_to_unmap(struct page *page, int 
> >  	return ret;
> >  }
> >  
> > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > +/**
> > + *	suspend_safe_page - determine if a page can be included in the
> 
> suspend_safe_page sounds like it's suspending a safe page,
> and safe is unclear: suspend_knows_page_is_frozen?
> 
> > + *	suspend image without copying (returns true if so).
> > + *
> > + *	It is safe to include the page in the suspend image without
> > + *	copying if (a) it's on the LRU and (b) it's mapped by a frozen task
> 
> That's not quite what the code checks:
> (b) it's mapped but not by the the current task
> 
> (Actually, page_address_in_vma doesn't really say whether the page
> is mapped by the task, but -EFAULT does tell you that it can't be.)
> 
> I still find its checks very obscure: am I right to think that most
> pages are "frozen" at this point, that it's very hard to determine
> which are and which are not, but there happens to be this "little"
> category of pages which you can be damn sure are frozen - and on
> most active systems, this "little" category actually covers a
> large number of pages which it's well worth avoiding copying?

Yes.  Still I'm struggling to find some reliable+fast method of determining
which pages belong to this category.

> A very loose heuristic which works well enough to make a big difference.

I think so. :-)

> > + *	(all tasks except for the current task should be frozen when it's
> > + *	called).  Otherwise the page should be copied for this purpose
> > + *	(compound pages are avoided for simplicity).
> > + */
> > +
> > +int suspend_safe_page(struct page *page)
> > +{
> > +	struct vm_area_struct *vma;
> > +	int ret = 0;
> > +
> > +	if (!PageLRU(page) || PageCompound(page))
> > +		return 0;
> > +
> > +	if (page_mapped(page)) {
> > +		ret = 1;
> > +		down_read(&current->mm->mmap_sem);
> > +		/* Return 0 if the page is mapped by the current task */
> > +		for (vma = current->mm->mmap; vma; vma = vma->vm_next)
> > +			if (page_address_in_vma(page, vma) != -EFAULT) {
> > +				ret = 0;
> > +				break;
> > +			}
> > +		up_read(&current->mm->mmap_sem);
> > +	}
> > +
> > +	return ret;
> > +}
> > +#endif /* CONFIG_SOFTWARE_SUSPEND */
> > 
> > I've left the locking, because the functions is called when we're freeing
> > memory and I'm not 100% sure it's safe to drop it.
> 
> If the locking is necessary, then that down_read is liable to schedule
> and wait for mmap_sem to be released.  But you have interrupts disabled?

It also needs to be called with interrupts enabled, unfortunately.

> And everything which might release mmap_sem already frozen?  My guess is
> that it's a lot safer _not_ to lock here than to lock; but just how safe
> that is I'm not at all sure.

Well, me too. :-)

> > > But if it is sufficiently frozen, I'm puzzled as to why pages mapped
> > > into the current process are (potentially) unsafe, while those mapped
> > > into others are safe.  If the current process can get back to messing
> > > with its mapped pages, what if it maps a page you earlier judged safe?
> > 
> > The current task is forbidden to map anything at this point.
> 
> Too bald a statement for me to judge (and by forbidden to map,
> do you mean forbidden to mmap, or forbidden to fault?).

It's forbiddent to refer to any filesystems at all or else it could corrupt
them.

> Perhaps I'd understand better if you explain why pages mapped into the
> current task have to be excluded?  It just seems to me that if it can
> interfere with those pages, then it is liable to interfere with other
> pages too, including some mapped into other processes which you've
> already judged to be safe/frozen.

The current task may be a userland process that saves the image, ie.
calls write() to save the data.  It reads the image data from the kernel,
it can compress and/or encrypt them, compute checksums etc. and
then it saves the (possibly processed) data using write() directly to a
disk partition (ie. without touching any filesystems).  However it only
is allowed to (directly) modify its own memory.

This is a very special userland process which must adhere to some strict
rules (please have a look at Documentation/power/userland-swsusp.txt).

> Sorry if I'm wasting your time, forcing you to spell things out to
> someone who hasn't a clue what you're up to; but it all smells a
> little fishy to me.

You're absolutely right to do so, and it's a tricky stuff. :-)

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 13:20               ` Rafael J. Wysocki
@ 2006-05-11 23:45                 ` Nigel Cunningham
  2006-05-12  0:17                   ` Nathan Scott
  2006-05-13 22:32                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 29+ messages in thread
From: Nigel Cunningham @ 2006-05-11 23:45 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

[-- Attachment #1: Type: text/plain, Size: 3634 bytes --]

Hello.

On Thursday 11 May 2006 23:20, Rafael J. Wysocki wrote:
> Hi,
>
> On Thursday 11 May 2006 02:11, Nigel Cunningham wrote:
> > Hi Andrew et al.
> >
> > On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > Now if the mapped pages that are not mapped by the
> > > > > >  current task are considered, it turns out that they would change
> > > > > > only if they were reclaimed by try_to_free_pages().  Thus if we
> > > > > > take them out of reach of try_to_free_pages(), for example by
> > > > > > (temporarily) moving them out of their respective LRU lists after
> > > > > > creating the image, we will be able to include them in the image
> > > > > > without copying.
> > > > >
> > > > > I'm a bit curious about how this is true.  There are all sorts of
> > > > > way in which there could be activity against these pages -
> > > > > interrupt-time asynchronous network Tx completion, async
> > > > > interrupt-time direct-io completion, tasklets, schedule_work(),
> > > > > etc, etc.
> > > >
> > > > AFAIK, many of these things are waited for uninterruptibly, and
> > > > uninterruptible tasks cannot be frozen.
> > >
> > > There can be situations where we won't be waiting on this IO at all.
> > > Network zero-copy transmit, for example.
> > >
> > > Or maybe there's some async writeback going on against pagecache -
> > > we'll end up looking at the page's LRU state within interrupt context
> > > at IO completion.  (A sync would prevent this from happening).
> >
> > I believe more than a sync is needed in at least some cases. I've seen
> > XFS continue to submit I/O (presumably on the sb or such like) after
> > everything else has been frozen and data has been synced. Freezing bdevs
> > addressed this.
> >
> > > One possibly problematic scenario is where task A is doing a direct-IO
> > > read and task B truncates the same file - here, the page will be
> > > actually removed from the LRU and freed in interrupt context.  The
> > > direct-IO read process will be waiting on the IO in D state though.  It
> > > it was a synchronous read - if it was an AIO read then it won't be
> > > waiting on the IO.  Something else might save us here, but it's
> > > fragile.
> >
> > Bdev freezing helps here too, right?
>
> Well, I'm not sure.  How exactly?

I believe it will ensure that both operations will be completed and the waiter 
woken.

> > > >  Theoretically we may have a problem if there's an
> > > > interruptible task that waits for the completion of an operation that
> > > > gets finished after snapshotting the system.  However that would have
> > > > to survive the syncing of filesystems, freezing of kernel threads,
> > > > freeing of memory as well as suspending and resuming all devices. 
> > > > [In which case it would be starving to death. :-)]
> >
> > (For Rafael/Pavel): The swsusp version of the refrigerator signals these
> > processes to enter the freezer too, just in case the uninterruptible task
> > does continue, right?
>
> Uninterruptible tasks are not freezable with the swsusp's freezer at all.
> The other tasks are signaled to enter the refrigerator - first user space,
> then we sync filesystems and finally we freeze kernel threads.

Oooh. It would probably be good if you signalled them to enter the freezer, 
just in case, and had the freezer code handle a process entering the path 
when we no longer want to freeze.

Regards,

Nigel

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 11:35           ` Pavel Machek
  2006-05-11 12:10             ` Rafael J. Wysocki
@ 2006-05-11 23:49             ` Nigel Cunningham
  2006-05-12 10:30               ` Pavel Machek
  1 sibling, 1 reply; 29+ messages in thread
From: Nigel Cunningham @ 2006-05-11 23:49 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, Andrew Morton, linux-kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 2462 bytes --]

Hi.

On Thursday 11 May 2006 21:35, Pavel Machek wrote:
> On Čt 11-05-06 00:58:18, Rafael J. Wysocki wrote:
> > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > Now if the mapped pages that are not mapped by the
> > > >  current task are considered, it turns out that they would change
> > > > only if they were reclaimed by try_to_free_pages().  Thus if we take
> > > > them out of reach of try_to_free_pages(), for example by
> > > > (temporarily) moving them out of their respective LRU lists after
> > > > creating the image, we will be able to include them in the image
> > > > without copying.
> > >
> > > I'm a bit curious about how this is true.  There are all sorts of way
> > > in which there could be activity against these pages - interrupt-time
> > > asynchronous network Tx completion, async interrupt-time direct-io
> > > completion, tasklets, schedule_work(), etc, etc.
> >
> > AFAIK, many of these things are waited for uninterruptibly, and
> > uninterruptible
>
> Well, "many of these things" makes me nervous.
>
> > tasks cannot be frozen.  Theoretically we may have a problem if there's
> > an interruptible task that waits for the completion of an operation that
> > gets finished after snapshotting the system.
>
> I'd prefer not to have even theoretical problems. If we don't _know_
> why patch is safe, I'd prefer not to have it.
>
> Needing bdev freezing is bad sign, too.
>
> We are talking 10% speedup here (on low-mem-machines, IIRC), but whole
> design has just got way more complex. Previous snapshot was really
> atomic, and apart from NMI, it was "independend" from the rest of the
> system.
>
> New design depends on bdev freezing (depending on XFS details we do
> not understand), and depends on all the other parts of kernel using
> uninteruptible (when we know that networking sleeps interruptibly).
>
> Too much uncertainity for 10% speedup, I'm afraid. Yes, it was really
> clever to get this fundamental change down to few hundred lines, but
> design complexity remains. Could we drop that patch?

Could you provide justification for your claim that the speedup is only 10%?

Please also remember that you are introducing complexity in other ways, with 
that swap prefetching code and so on. Any comparison in speed should include 
the time to fault back in pages that have been discarded.

Regards,

Nigel

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 23:45                 ` Nigel Cunningham
@ 2006-05-12  0:17                   ` Nathan Scott
  2006-05-12 10:09                     ` Pavel Machek
  2006-05-13 22:32                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 29+ messages in thread
From: Nathan Scott @ 2006-05-12  0:17 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Rafael J. Wysocki, Andrew Morton, linux-kernel, nickpiggin, pavel

On Fri, May 12, 2006 at 09:45:48AM +1000, Nigel Cunningham wrote:
> On Thursday 11 May 2006 23:20, Rafael J. Wysocki wrote:
> > On Thursday 11 May 2006 02:11, Nigel Cunningham wrote:
> > > On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > >
> > > > There can be situations where we won't be waiting on this IO at all.
> > > > Network zero-copy transmit, for example.
> > > >
> > > > Or maybe there's some async writeback going on against pagecache -
> > > > we'll end up looking at the page's LRU state within interrupt context
> > > > at IO completion.  (A sync would prevent this from happening).
> > >
> > > I believe more than a sync is needed in at least some cases. I've seen
> > > XFS continue to submit I/O (presumably on the sb or such like) after
> > > everything else has been frozen and data has been synced. Freezing bdevs
> > > addressed this.

[just came across this, missed it before, sorry]

The above is correct - sync means get current state safe ondisk, it
doesn't mean flush all dirty metadata to its final resting place
(subtle difference).  XFS will flush and wait on its journal on
sync, which means theres a reconstructable state for all of the
currently-incore-dirty-metadata ondisk, so it does not also flush
and wait on that currently-incore-dirty-metadata.  It doesn't need
to, it has already ensured thats written elsewhere on disk, in the
journal.  And should the unthinkable happen, that metadata will be
correctly recovered on the next mount when the journal is replayed.

Block device freeze, unmount and/or remount,ro will all ensure that
all incore-dirty-metadata is also flushed and waited on.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-12  0:17                   ` Nathan Scott
@ 2006-05-12 10:09                     ` Pavel Machek
  0 siblings, 0 replies; 29+ messages in thread
From: Pavel Machek @ 2006-05-12 10:09 UTC (permalink / raw)
  To: Nathan Scott
  Cc: Nigel Cunningham, Rafael J. Wysocki, Andrew Morton, linux-kernel,
	nickpiggin

Hi!

On Pá 12-05-06 10:17:54, Nathan Scott wrote:
> On Fri, May 12, 2006 at 09:45:48AM +1000, Nigel Cunningham wrote:
> > On Thursday 11 May 2006 23:20, Rafael J. Wysocki wrote:
> > > On Thursday 11 May 2006 02:11, Nigel Cunningham wrote:
> > > > On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> > > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > > >
> > > > > There can be situations where we won't be waiting on this IO at all.
> > > > > Network zero-copy transmit, for example.
> > > > >
> > > > > Or maybe there's some async writeback going on against pagecache -
> > > > > we'll end up looking at the page's LRU state within interrupt context
> > > > > at IO completion.  (A sync would prevent this from happening).
> > > >
> > > > I believe more than a sync is needed in at least some cases. I've seen
> > > > XFS continue to submit I/O (presumably on the sb or such like) after
> > > > everything else has been frozen and data has been synced. Freezing bdevs
> > > > addressed this.
> 
> [just came across this, missed it before, sorry]
> 
> The above is correct - sync means get current state safe ondisk, it
> doesn't mean flush all dirty metadata to its final resting place
> (subtle difference).  XFS will flush and wait on its journal on
> sync, which means theres a reconstructable state for all of the
> currently-incore-dirty-metadata ondisk, so it does not also flush
> and wait on that currently-incore-dirty-metadata.  It doesn't need
> to, it has already ensured thats written elsewhere on disk, in the
> journal.  And should the unthinkable happen, that metadata will be
> correctly recovered on the next mount when the journal is replayed.
> 
> Block device freeze, unmount and/or remount,ro will all ensure that
> all incore-dirty-metadata is also flushed and waited on.

Who does the background writing? I'd prefer not to do blockdev
freezing in swsusp (and we do not currently do it).

Simple fix is probably to put proper refrigerator support into that
particular background writing thread.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 23:49             ` Nigel Cunningham
@ 2006-05-12 10:30               ` Pavel Machek
  2006-05-13 22:33                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Pavel Machek @ 2006-05-12 10:30 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Rafael J. Wysocki, Andrew Morton, linux-kernel, nickpiggin

> > Too much uncertainity for 10% speedup, I'm afraid. Yes, it was really
> > clever to get this fundamental change down to few hundred lines, but
> > design complexity remains. Could we drop that patch?
> 
> Could you provide justification for your claim that the speedup is
> only 10%?

10% was number Rafael provided, IIRC.

> Please also remember that you are introducing complexity in other ways, with 
> that swap prefetching code and so on. Any comparison in speed should include 
> the time to fault back in pages that have been discarded.

Well, swap prefetching is useful for other workloads, too; so it gets
developed/tested outside swsusp.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-11 23:45                 ` Nigel Cunningham
  2006-05-12  0:17                   ` Nathan Scott
@ 2006-05-13 22:32                   ` Rafael J. Wysocki
  1 sibling, 0 replies; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-13 22:32 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Andrew Morton, linux-kernel, nickpiggin, pavel

Hi,

Sorry for the slow response, I've had a lot of work to do recently.

On Friday 12 May 2006 01:45, Nigel Cunningham wrote:
> On Thursday 11 May 2006 23:20, Rafael J. Wysocki wrote:
> > On Thursday 11 May 2006 02:11, Nigel Cunningham wrote:
> > > On Thursday 11 May 2006 09:38, Andrew Morton wrote:
> > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > On Wednesday 10 May 2006 00:27, Andrew Morton wrote:
> > > > > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > > Now if the mapped pages that are not mapped by the
> > > > > > >  current task are considered, it turns out that they would change
> > > > > > > only if they were reclaimed by try_to_free_pages().  Thus if we
> > > > > > > take them out of reach of try_to_free_pages(), for example by
> > > > > > > (temporarily) moving them out of their respective LRU lists after
> > > > > > > creating the image, we will be able to include them in the image
> > > > > > > without copying.
> > > > > >
> > > > > > I'm a bit curious about how this is true.  There are all sorts of
> > > > > > way in which there could be activity against these pages -
> > > > > > interrupt-time asynchronous network Tx completion, async
> > > > > > interrupt-time direct-io completion, tasklets, schedule_work(),
> > > > > > etc, etc.
> > > > >
> > > > > AFAIK, many of these things are waited for uninterruptibly, and
> > > > > uninterruptible tasks cannot be frozen.
> > > >
> > > > There can be situations where we won't be waiting on this IO at all.
> > > > Network zero-copy transmit, for example.
> > > >
> > > > Or maybe there's some async writeback going on against pagecache -
> > > > we'll end up looking at the page's LRU state within interrupt context
> > > > at IO completion.  (A sync would prevent this from happening).
> > >
> > > I believe more than a sync is needed in at least some cases. I've seen
> > > XFS continue to submit I/O (presumably on the sb or such like) after
> > > everything else has been frozen and data has been synced. Freezing bdevs
> > > addressed this.

Do you know any other examples?

> > > > One possibly problematic scenario is where task A is doing a direct-IO
> > > > read and task B truncates the same file - here, the page will be
> > > > actually removed from the LRU and freed in interrupt context.  The
> > > > direct-IO read process will be waiting on the IO in D state though.  It
> > > > it was a synchronous read - if it was an AIO read then it won't be
> > > > waiting on the IO.  Something else might save us here, but it's
> > > > fragile.

In this particular case we are saved by the fact that the snapshotted kernel
memory reflects the state from before the operation, so the kernel will
perform it again after resume.  However if the suspend failed we'd add the
freed page to the LRU which would be a mistake (I think it might be prevented
by checking the page's refcount before putting in back on the LRU).

> > > Bdev freezing helps here too, right?
> >
> > Well, I'm not sure.  How exactly?
> 
> I believe it will ensure that both operations will be completed and the waiter 
> woken.

We're talking about the situation in which no one waits on the I/O that's
completed asynchronously in interrupt context.

I assume that the freezing of bdevs makes this impossible with respect to
filesystems, but there are other types of I/O, like the networking, which are
independent of that.  Also, if someone uses a block device or a partition
directly, will the freezing of bdevs save us?

> > > > >  Theoretically we may have a problem if there's an
> > > > > interruptible task that waits for the completion of an operation that
> > > > > gets finished after snapshotting the system.  However that would have
> > > > > to survive the syncing of filesystems, freezing of kernel threads,
> > > > > freeing of memory as well as suspending and resuming all devices. 
> > > > > [In which case it would be starving to death. :-)]
> > >
> > > (For Rafael/Pavel): The swsusp version of the refrigerator signals these
> > > processes to enter the freezer too, just in case the uninterruptible task
> > > does continue, right?
> >
> > Uninterruptible tasks are not freezable with the swsusp's freezer at all.
> > The other tasks are signaled to enter the refrigerator - first user space,
> > then we sync filesystems and finally we freeze kernel threads.
> 
> Oooh. It would probably be good if you signalled them to enter the freezer, 
> just in case, and had the freezer code handle a process entering the path 
> when we no longer want to freeze.

Could you please explain why you think so?  We just fail the suspend if there
still are any uninterruptible tasks after the timeout, so I don't see the
reason why these tasks should enter the refrigerator.

Greetings,
Rafael


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-12 10:30               ` Pavel Machek
@ 2006-05-13 22:33                 ` Rafael J. Wysocki
  2006-05-15  9:48                   ` Con Kolivas
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-13 22:33 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Nigel Cunningham, Andrew Morton, linux-kernel, nickpiggin

Hi,

Sorry for the slow response, I've had a lot of work to do recently.

On Friday 12 May 2006 12:30, Pavel Machek wrote:
> > > Too much uncertainity for 10% speedup, I'm afraid. Yes, it was really
> > > clever to get this fundamental change down to few hundred lines, but
> > > design complexity remains. Could we drop that patch?

I think the main problem with the patch was that the criterion by which
the pages were included in the image without copying appeared to be too simple
and therefore too general.  It well might be correct, but that's yet to be
shown.  OTOH it probably is possible to use a criterion that everybody will
agree with.  For example, we could chose pages that would be reclaimed if
there were memory pressure (the pages with I/O operations pending should
not belong to this class).
 
> > Could you provide justification for your claim that the speedup is
> > only 10%?
> 
> 10% was number Rafael provided, IIRC.

That's true, but I only measured the short-time effect of the suspend on the
system responsiveness.

Apart from this without the patch swsusp tends to leave quite o lot
of data in the swap and IMO this effect should not be neglected.  Currently we
only have a very general idea about what these data are and why it happens
actually and there's no warranty it wouldn't hamper performance in the
long run.

> > Please also remember that you are introducing complexity in other ways, with 
> > that swap prefetching code and so on. Any comparison in speed should include 
> > the time to fault back in pages that have been discarded.
> 
> Well, swap prefetching is useful for other workloads, too; so it gets
> developed/tested outside swsusp.

Still my experience indicates that it doesn't play very nice with swsusp and
unfortunately it hogs the I/O.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-13 22:33                 ` Rafael J. Wysocki
@ 2006-05-15  9:48                   ` Con Kolivas
  2006-05-15 19:52                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 29+ messages in thread
From: Con Kolivas @ 2006-05-15  9:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rafael J. Wysocki, Pavel Machek, Nigel Cunningham, Andrew Morton,
	nickpiggin

On Sunday 14 May 2006 08:33, Rafael J. Wysocki wrote:
> On Friday 12 May 2006 12:30, Pavel Machek wrote:
> > > Please also remember that you are introducing complexity in other ways,
> > > with that swap prefetching code and so on. Any comparison in speed
> > > should include the time to fault back in pages that have been
> > > discarded.
> >
> > Well, swap prefetching is useful for other workloads, too; so it gets
> > developed/tested outside swsusp.
>
> Still my experience indicates that it doesn't play very nice with swsusp
> and unfortunately it hogs the I/O.

There is no swap prefetching code linked in any way to swsusp suspend or 
resume on mainline or -mm. It was a preliminary experiment and Rafael lost 
interest in it so I never bothered pursuing it.

-- 
-ck

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-15  9:48                   ` Con Kolivas
@ 2006-05-15 19:52                     ` Rafael J. Wysocki
  2006-05-15 23:50                       ` Con Kolivas
  0 siblings, 1 reply; 29+ messages in thread
From: Rafael J. Wysocki @ 2006-05-15 19:52 UTC (permalink / raw)
  To: Con Kolivas
  Cc: linux-kernel, Pavel Machek, Nigel Cunningham, Andrew Morton, nickpiggin

Hi,

On Monday 15 May 2006 11:48, Con Kolivas wrote:
> On Sunday 14 May 2006 08:33, Rafael J. Wysocki wrote:
> > On Friday 12 May 2006 12:30, Pavel Machek wrote:
> > > > Please also remember that you are introducing complexity in other ways,
> > > > with that swap prefetching code and so on. Any comparison in speed
> > > > should include the time to fault back in pages that have been
> > > > discarded.
> > >
> > > Well, swap prefetching is useful for other workloads, too; so it gets
> > > developed/tested outside swsusp.
> >
> > Still my experience indicates that it doesn't play very nice with swsusp
> > and unfortunately it hogs the I/O.
> 
> There is no swap prefetching code linked in any way to swsusp suspend or 
> resume on mainline or -mm. It was a preliminary experiment and Rafael lost 
> interest in it so I never bothered pursuing it.

I'm referring to the code currently in -mm, where kprefetchd sometimes starts
prefetching like mad after resume which hurts the disk I/O really badly (unless
I set /proc/sys/vm/swap_prefetch to 0, that is).

I think the problem is related to the fact that swsusp tends to leave quite a lot
of pages in the swap, if they had to be swapped out before suspend, and that
makes kprefetchd believe it should get these pages back into RAM, which
usually is not the greatest idea.

The above is only a speculation, however, and I'd have to investigate it a bit
more to say something more certain.  Anyway, my experience indicates
that it usually is better to set /proc/sys/vm/swap_prefetch to 0 after resume,
but YMMV.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH -mm] swsusp: support creating bigger images (rev. 2)
  2006-05-15 19:52                     ` Rafael J. Wysocki
@ 2006-05-15 23:50                       ` Con Kolivas
  0 siblings, 0 replies; 29+ messages in thread
From: Con Kolivas @ 2006-05-15 23:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Pavel Machek, Nigel Cunningham, Andrew Morton, nickpiggin

On Tuesday 16 May 2006 05:52, Rafael J. Wysocki wrote:
> Hi,
>
> On Monday 15 May 2006 11:48, Con Kolivas wrote:
> > On Sunday 14 May 2006 08:33, Rafael J. Wysocki wrote:
> > > On Friday 12 May 2006 12:30, Pavel Machek wrote:
> > > > > Please also remember that you are introducing complexity in other
> > > > > ways, with that swap prefetching code and so on. Any comparison in
> > > > > speed should include the time to fault back in pages that have been
> > > > > discarded.
> > > >
> > > > Well, swap prefetching is useful for other workloads, too; so it gets
> > > > developed/tested outside swsusp.
> > >
> > > Still my experience indicates that it doesn't play very nice with
> > > swsusp and unfortunately it hogs the I/O.
> >
> > There is no swap prefetching code linked in any way to swsusp suspend or
> > resume on mainline or -mm. It was a preliminary experiment and Rafael
> > lost interest in it so I never bothered pursuing it.
>
> I'm referring to the code currently in -mm, where kprefetchd sometimes
> starts prefetching like mad after resume which hurts the disk I/O really
> badly (unless I set /proc/sys/vm/swap_prefetch to 0, that is).

Not my experience. It does lots of disk I/O after resume because as you say 
yourself there is heaps in swap, but I haven't seen that I/O hurt as it 
simply stops the instant I do anything with the machine even as trivial as 
moving the mouse.

> I think the problem is related to the fact that swsusp tends to leave quite
> a lot of pages in the swap, if they had to be swapped out before suspend,
> and that makes kprefetchd believe it should get these pages back into RAM,
> which usually is not the greatest idea.
>
> The above is only a speculation, however, and I'd have to investigate it a
> bit more to say something more certain.  Anyway, my experience indicates
> that it usually is better to set /proc/sys/vm/swap_prefetch to 0 after
> resume, but YMMV.

My mileage definitely varies here. I don't concentrate on examining the fact 
that the machine is doing any I/O, but how usable the machine is and it's my 
experience that it is much better as about 1/3 of my applications are not 
floundering entirely on swap. Fortunately there's a tunable there and it 
allows you to set and unset it how you see fit.

Anyway the original point of my response was to point out to Nigel that there 
is no added complexity on behalf of swsusp in the swap prefetch code.

-- 
-ck

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2006-05-15 23:51 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-02 10:00 [PATCH -mm] swsusp: support creating bigger images Rafael J. Wysocki
2006-05-09  7:33 ` Andrew Morton
2006-05-09 10:19   ` Rafael J. Wysocki
2006-05-09 11:22     ` Pavel Machek
2006-05-09 12:18       ` Rafael J. Wysocki
2006-05-09 12:30     ` Hugh Dickins
2006-05-10  3:50       ` Nick Piggin
2006-05-10 22:26       ` Rafael J. Wysocki
2006-05-11 15:01         ` Hugh Dickins
2006-05-11 21:19           ` Rafael J. Wysocki
2006-05-09 22:15     ` [PATCH -mm] swsusp: support creating bigger images (rev. 2) Rafael J. Wysocki
2006-05-09 22:27       ` Andrew Morton
2006-05-10 22:58         ` Rafael J. Wysocki
2006-05-10 23:38           ` Andrew Morton
2006-05-11  0:11             ` Nigel Cunningham
2006-05-11 13:20               ` Rafael J. Wysocki
2006-05-11 23:45                 ` Nigel Cunningham
2006-05-12  0:17                   ` Nathan Scott
2006-05-12 10:09                     ` Pavel Machek
2006-05-13 22:32                   ` Rafael J. Wysocki
2006-05-11 13:15             ` Rafael J. Wysocki
2006-05-11 11:35           ` Pavel Machek
2006-05-11 12:10             ` Rafael J. Wysocki
2006-05-11 23:49             ` Nigel Cunningham
2006-05-12 10:30               ` Pavel Machek
2006-05-13 22:33                 ` Rafael J. Wysocki
2006-05-15  9:48                   ` Con Kolivas
2006-05-15 19:52                     ` Rafael J. Wysocki
2006-05-15 23:50                       ` Con Kolivas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).