* [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
@ 2009-05-03 0:20 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:20 UTC (permalink / raw)
To: Andrew Morton
Cc: pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Saturday 02 May 2009, Andrew Morton wrote:
> On Sat, 2 May 2009 13:46:34 +0200 "Rafael J. Wysocki" <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
>
> > > Do we need the bitmap? I expect we can just string all these pages
> > > onto a local list via page.lru. Would need to check that - the
> > > pageframe fields are quite overloaded.
> >
> > This is the reason why we use the bitmaps for hibernation. :-)
>
> grep the tree for page->lru and you'll see that quite a few page
> consumers are using it. So you'd be pretty safe doing it this way.
>
> Whether it's _worth_ doing it this way is debatable, given that
> hibernation uses bitmaps elsewhere. But it would shrink the patch a
> bit I expect?
It probably would, but it turns out we need not create a new bitmap; we
can use the existing ones to mark the allocated pages. That also has
the benefit that we can use swsusp_free() to release them.
Modified patch series follows:
[1/4] - your patch introducing __GFP_NO_OOM_KILL (I decided it would be better
to do it this way in this particular case. The fact that the OOM killer
is not going to work after tasks have been frozen is a different issue.)
[2/4] - move swsusp_shrink_memory to snapshot.c, no major changes
[3/4] - use memory allocations to make room for the image (added
comments, used the existing bitmaps, cleaned up a bit)
[4/4] - new thing: do not release memory allocated by [3/4] and use it for
creating the image directly.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
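The "use the existing bitmaps" idea above can be illustrated with a small userspace sketch: every page preallocated to make room for the image is marked in a pfn bitmap, so a single walk over the bitmap (standing in for swsusp_free()) releases them all. mark_pfn(), free_marked() and MAX_PFN are invented names for illustration, not the kernel implementation.

```c
/* Hypothetical sketch of marking preallocated pages in a pfn bitmap so
 * that one free pass can release them, mirroring the "reuse the existing
 * bitmaps + swsusp_free()" approach described above. */
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_PFN 4096UL
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long bitmap[MAX_PFN / BITS_PER_LONG];

static void mark_pfn(unsigned long pfn)
{
	bitmap[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

static int pfn_is_marked(unsigned long pfn)
{
	return !!(bitmap[pfn / BITS_PER_LONG] & (1UL << (pfn % BITS_PER_LONG)));
}

/* Single free pass: count (in the kernel, __free_page()) every marked
 * page, then clear the bitmap. */
static unsigned long free_marked(void)
{
	unsigned long pfn, freed = 0;

	for (pfn = 0; pfn < MAX_PFN; pfn++)
		if (pfn_is_marked(pfn))
			freed++;
	memset(bitmap, 0, sizeof(bitmap));
	return freed;
}
```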
* [PATCH 1/4] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-03 0:22 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:22 UTC (permalink / raw)
To: Andrew Morton
Cc: pavel, torvalds, jens.axboe, alan-jenkins, linux-kernel,
kernel-testers, linux-pm
From: Andrew Morton <akpm@linux-foundation.org>
> > Remind me: why can't we just allocate N pages at suspend-time?
>
> We need half of memory free. The reason we can't "just allocate" is
> probably OOM killer; but my memories are quite weak :-(.
hm. You'd think that with our splendid range of __GFP_foo flags, there
would be some combo which would suit this requirement but I can't
immediately spot one.
We can always add another I guess. Something like...
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1620,7 +1620,8 @@ nofail_alloc:
}
/* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (order > PAGE_ALLOC_COSTLY_ORDER ||
+ (gfp_mask & __GFP_NO_OOM_KILL)) {
clear_zonelist_oom(zonelist, gfp_mask);
goto nopage;
}
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -51,8 +51,9 @@ struct vm_area_struct;
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
-#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 22 /* Number of__GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
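As a sanity check, the flag arithmetic in this patch can be modelled in plain userspace C: the new bit must fall inside __GFP_BITS_MASK, and the OOM killer is skipped either for costly orders or when __GFP_NO_OOM_KILL is passed. Constants are copied from the diff; skips_oom_kill() is a hypothetical stand-in for the modified condition in the page allocator, not kernel code.

```c
/* Userspace model of the patch's flag layout and the OOM-skip predicate;
 * skips_oom_kill() is an invented name for illustration. */
#include <assert.h>

typedef unsigned int gfp_t;

#define __GFP_NO_OOM_KILL ((gfp_t)0x200000u) /* Don't invoke out_of_memory() */
#define __GFP_BITS_SHIFT 22
#define __GFP_BITS_MASK ((gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Mirrors: order > PAGE_ALLOC_COSTLY_ORDER || (gfp_mask & __GFP_NO_OOM_KILL) */
static int skips_oom_kill(unsigned int order, gfp_t gfp_mask)
{
	return order > PAGE_ALLOC_COSTLY_ORDER ||
	       (gfp_mask & __GFP_NO_OOM_KILL);
}
```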
* Re: [PATCH 1/4] mm: Add __GFP_NO_OOM_KILL flag
2009-05-03 0:22 ` Rafael J. Wysocki
@ 2009-05-03 11:54 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-03 11:54 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds, linux-pm
On Sun, May 03, 2009 at 02:22:06AM +0200, Rafael J. Wysocki wrote:
> From: Andrew Morton <akpm@linux-foundation.org>
>
> > > Remind me: why can't we just allocate N pages at suspend-time?
> >
> > We need half of memory free. The reason we can't "just allocate" is
> > probably OOM killer; but my memories are quite weak :-(.
>
> hm. You'd think that with our splendid range of __GFP_foo falgs, there
> would be some combo which would suit this requirement but I can't
> immediately spot one.
>
> We can always add another I guess. Something like...
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> include/linux/gfp.h | 3 ++-
> mm/page_alloc.c | 3 ++-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1620,7 +1620,8 @@ nofail_alloc:
> }
>
> /* The OOM killer will not help higher order allocs so fail */
> - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> + (gfp_mask & __GFP_NO_OOM_KILL)) {
> clear_zonelist_oom(zonelist, gfp_mask);
> goto nopage;
> }
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h
> +++ linux-2.6/include/linux/gfp.h
> @@ -51,8 +51,9 @@ struct vm_area_struct;
> #define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
> #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
> #define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
> +#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
>
> -#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 22 /* Number of__GFP_FOO bits */
^ missed a white space :)
* [PATCH 2/4] PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
2009-05-03 0:20 ` Rafael J. Wysocki
@ 2009-05-03 0:23 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:23 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds
From: Rafael J. Wysocki <rjw@sisk.pl>
The next patch is going to modify the memory shrinking code so that
it frees memory by making memory allocations instead of using an
artificial memory shrinking mechanism. For this purpose it
is convenient to move swsusp_shrink_memory() from
kernel/power/swsusp.c to kernel/power/snapshot.c, because the new
memory-shrinking code is going to use things that are local to
kernel/power/snapshot.c.
[rev. 2: Make some functions static and remove their headers from
kernel/power/power.h]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/power.h | 4 --
kernel/power/snapshot.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/power/swsusp.c | 76 ---------------------------------------------
3 files changed, 79 insertions(+), 81 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -39,6 +39,14 @@ static int swsusp_page_is_free(struct pa
static void swsusp_set_page_forbidden(struct page *);
static void swsusp_unset_page_forbidden(struct page *);
+/*
+ * Preferred image size in bytes (tunable via /sys/power/image_size).
+ * When it is set to N, swsusp will do its best to ensure the image
+ * size will not exceed N bytes, but if that is impossible, it will
+ * try to create the smallest image possible.
+ */
+unsigned long image_size = 500 * 1024 * 1024;
+
/* List of PBEs needed for restoring the pages that were allocated before
* the suspend and included in the suspend image, but have also been
* allocated by the "resume" kernel, so their contents cannot be written
@@ -840,7 +848,7 @@ static struct page *saveable_highmem_pag
* pages.
*/
-unsigned int count_highmem_pages(void)
+static unsigned int count_highmem_pages(void)
{
struct zone *zone;
unsigned int n = 0;
@@ -902,7 +910,7 @@ static struct page *saveable_page(struct
* pages.
*/
-unsigned int count_data_pages(void)
+static unsigned int count_data_pages(void)
{
struct zone *zone;
unsigned long pfn, max_zone_pfn;
@@ -1058,6 +1066,74 @@ void swsusp_free(void)
buffer = NULL;
}
+/**
+ * swsusp_shrink_memory - Try to free as much memory as needed
+ *
+ * ... but do not OOM-kill anyone
+ *
+ * Notice: all userland should be stopped before it is called, or
+ * livelock is possible.
+ */
+
+#define SHRINK_BITE 10000
+static inline unsigned long __shrink_memory(long tmp)
+{
+ if (tmp > SHRINK_BITE)
+ tmp = SHRINK_BITE;
+ return shrink_all_memory(tmp);
+}
+
+int swsusp_shrink_memory(void)
+{
+ long tmp;
+ struct zone *zone;
+ unsigned long pages = 0;
+ unsigned int i = 0;
+ char *p = "-\\|/";
+ struct timeval start, stop;
+
+ printk(KERN_INFO "PM: Shrinking memory... ");
+ do_gettimeofday(&start);
+ do {
+ long size, highmem_size;
+
+ highmem_size = count_highmem_pages();
+ size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
+ tmp = size;
+ size += highmem_size;
+ for_each_populated_zone(zone) {
+ tmp += snapshot_additional_pages(zone);
+ if (is_highmem(zone)) {
+ highmem_size -=
+ zone_page_state(zone, NR_FREE_PAGES);
+ } else {
+ tmp -= zone_page_state(zone, NR_FREE_PAGES);
+ tmp += zone->lowmem_reserve[ZONE_NORMAL];
+ }
+ }
+
+ if (highmem_size < 0)
+ highmem_size = 0;
+
+ tmp += highmem_size;
+ if (tmp > 0) {
+ tmp = __shrink_memory(tmp);
+ if (!tmp)
+ return -ENOMEM;
+ pages += tmp;
+ } else if (size > image_size / PAGE_SIZE) {
+ tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
+ pages += tmp;
+ }
+ printk("\b%c", p[i++%4]);
+ } while (tmp > 0);
+ do_gettimeofday(&stop);
+ printk("\bdone (%lu pages freed)\n", pages);
+ swsusp_show_speed(&start, &stop, pages, "Freed");
+
+ return 0;
+}
+
#ifdef CONFIG_HIGHMEM
/**
* count_pages_for_highmem - compute the number of non-highmem pages
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -55,14 +55,6 @@
#include "power.h"
-/*
- * Preferred image size in bytes (tunable via /sys/power/image_size).
- * When it is set to N, swsusp will do its best to ensure the image
- * size will not exceed N bytes, but if that is impossible, it will
- * try to create the smallest image possible.
- */
-unsigned long image_size = 500 * 1024 * 1024;
-
int in_suspend __nosavedata = 0;
/**
@@ -195,74 +187,6 @@ void swsusp_show_speed(struct timeval *s
kps / 1000, (kps % 1000) / 10);
}
-/**
- * swsusp_shrink_memory - Try to free as much memory as needed
- *
- * ... but do not OOM-kill anyone
- *
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
- */
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
-{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
-}
-
-int swsusp_shrink_memory(void)
-{
- long tmp;
- struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
- struct timeval start, stop;
-
- printk(KERN_INFO "PM: Shrinking memory... ");
- do_gettimeofday(&start);
- do {
- long size, highmem_size;
-
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
-
- if (highmem_size < 0)
- highmem_size = 0;
-
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
- do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
-
- return 0;
-}
-
/*
* Platforms, like ACPI, may want us to save some memory used by them during
* hibernation and to restore the contents of this memory during the subsequent
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern unsigned int count_data_pages(void);
+extern int swsusp_shrink_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
@@ -149,7 +149,6 @@ extern int swsusp_swap_in_use(void);
/* kernel/power/disk.c */
extern int swsusp_check(void);
-extern int swsusp_shrink_memory(void);
extern void swsusp_free(void);
extern int swsusp_read(unsigned int *flags_p);
extern int swsusp_write(unsigned int flags);
@@ -176,7 +175,6 @@ extern int pm_notifier_call_chain(unsign
#endif
#ifdef CONFIG_HIGHMEM
-unsigned int count_highmem_pages(void);
int restore_highmem(void);
#else
static inline unsigned int count_highmem_pages(void) { return 0; }
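The do/while loop moved into snapshot.c above can be simulated in userspace. Here shrink_all_memory() is replaced by a fake reclaimer drawing from a counter, and shrink_until() is a simplified, hypothetical stand-in for the deficit loop in swsusp_shrink_memory(); only the SHRINK_BITE clamping and the "zero progress means -ENOMEM" rule are taken from the patch.

```c
/* Userspace simulation of the shrink loop: each pass reclaims at most
 * SHRINK_BITE pages, and a pass that frees nothing aborts the loop. */
#include <assert.h>

#define SHRINK_BITE 10000

static long reclaimable;	/* stand-in for pages the VM could free */

static unsigned long fake_shrink_all_memory(long nr)
{
	long got = nr < reclaimable ? nr : reclaimable;

	reclaimable -= got;
	return got;
}

static unsigned long __shrink_memory(long tmp)
{
	if (tmp > SHRINK_BITE)
		tmp = SHRINK_BITE;
	return fake_shrink_all_memory(tmp);
}

/* Loop until the deficit is covered, as the do/while in the patch does;
 * returns pages freed, or -1 (standing in for -ENOMEM) on no progress. */
static long shrink_until(long deficit)
{
	unsigned long pages = 0, got;

	while (deficit > 0) {
		got = __shrink_memory(deficit);
		if (!got)
			return -1;
		pages += got;
		deficit -= got;
	}
	return pages;
}
```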
* [PATCH 2/4] PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
@ 2009-05-03 0:23 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:23 UTC (permalink / raw)
To: Andrew Morton
Cc: pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
From: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
The next patch modifies the memory shrinking code so that it frees memory
by making memory allocations instead of using an artificial memory
shrinking mechanism for that. For this purpose it is convenient to move
swsusp_shrink_memory() from kernel/power/swsusp.c to
kernel/power/snapshot.c, because the new memory-shrinking code is going
to use things that are local to kernel/power/snapshot.c.
[rev. 2: Make some functions static and remove their headers from
kernel/power/power.h]
Signed-off-by: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
---
kernel/power/power.h | 4 --
kernel/power/snapshot.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/power/swsusp.c | 76 ---------------------------------------------
3 files changed, 79 insertions(+), 81 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -39,6 +39,14 @@ static int swsusp_page_is_free(struct pa
static void swsusp_set_page_forbidden(struct page *);
static void swsusp_unset_page_forbidden(struct page *);
+/*
+ * Preferred image size in bytes (tunable via /sys/power/image_size).
+ * When it is set to N, swsusp will do its best to ensure the image
+ * size will not exceed N bytes, but if that is impossible, it will
+ * try to create the smallest image possible.
+ */
+unsigned long image_size = 500 * 1024 * 1024;
+
/* List of PBEs needed for restoring the pages that were allocated before
* the suspend and included in the suspend image, but have also been
* allocated by the "resume" kernel, so their contents cannot be written
@@ -840,7 +848,7 @@ static struct page *saveable_highmem_pag
* pages.
*/
-unsigned int count_highmem_pages(void)
+static unsigned int count_highmem_pages(void)
{
struct zone *zone;
unsigned int n = 0;
@@ -902,7 +910,7 @@ static struct page *saveable_page(struct
* pages.
*/
-unsigned int count_data_pages(void)
+static unsigned int count_data_pages(void)
{
struct zone *zone;
unsigned long pfn, max_zone_pfn;
@@ -1058,6 +1066,74 @@ void swsusp_free(void)
buffer = NULL;
}
+/**
+ * swsusp_shrink_memory - Try to free as much memory as needed
+ *
+ * ... but do not OOM-kill anyone
+ *
+ * Notice: all userland should be stopped before it is called, or
+ * livelock is possible.
+ */
+
+#define SHRINK_BITE 10000
+static inline unsigned long __shrink_memory(long tmp)
+{
+ if (tmp > SHRINK_BITE)
+ tmp = SHRINK_BITE;
+ return shrink_all_memory(tmp);
+}
+
+int swsusp_shrink_memory(void)
+{
+ long tmp;
+ struct zone *zone;
+ unsigned long pages = 0;
+ unsigned int i = 0;
+ char *p = "-\\|/";
+ struct timeval start, stop;
+
+ printk(KERN_INFO "PM: Shrinking memory... ");
+ do_gettimeofday(&start);
+ do {
+ long size, highmem_size;
+
+ highmem_size = count_highmem_pages();
+ size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
+ tmp = size;
+ size += highmem_size;
+ for_each_populated_zone(zone) {
+ tmp += snapshot_additional_pages(zone);
+ if (is_highmem(zone)) {
+ highmem_size -=
+ zone_page_state(zone, NR_FREE_PAGES);
+ } else {
+ tmp -= zone_page_state(zone, NR_FREE_PAGES);
+ tmp += zone->lowmem_reserve[ZONE_NORMAL];
+ }
+ }
+
+ if (highmem_size < 0)
+ highmem_size = 0;
+
+ tmp += highmem_size;
+ if (tmp > 0) {
+ tmp = __shrink_memory(tmp);
+ if (!tmp)
+ return -ENOMEM;
+ pages += tmp;
+ } else if (size > image_size / PAGE_SIZE) {
+ tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
+ pages += tmp;
+ }
+ printk("\b%c", p[i++%4]);
+ } while (tmp > 0);
+ do_gettimeofday(&stop);
+ printk("\bdone (%lu pages freed)\n", pages);
+ swsusp_show_speed(&start, &stop, pages, "Freed");
+
+ return 0;
+}
+
#ifdef CONFIG_HIGHMEM
/**
* count_pages_for_highmem - compute the number of non-highmem pages
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -55,14 +55,6 @@
#include "power.h"
-/*
- * Preferred image size in bytes (tunable via /sys/power/image_size).
- * When it is set to N, swsusp will do its best to ensure the image
- * size will not exceed N bytes, but if that is impossible, it will
- * try to create the smallest image possible.
- */
-unsigned long image_size = 500 * 1024 * 1024;
-
int in_suspend __nosavedata = 0;
/**
@@ -195,74 +187,6 @@ void swsusp_show_speed(struct timeval *s
kps / 1000, (kps % 1000) / 10);
}
-/**
- * swsusp_shrink_memory - Try to free as much memory as needed
- *
- * ... but do not OOM-kill anyone
- *
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
- */
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
-{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
-}
-
-int swsusp_shrink_memory(void)
-{
- long tmp;
- struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
- struct timeval start, stop;
-
- printk(KERN_INFO "PM: Shrinking memory... ");
- do_gettimeofday(&start);
- do {
- long size, highmem_size;
-
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
-
- if (highmem_size < 0)
- highmem_size = 0;
-
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
- do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
-
- return 0;
-}
-
/*
* Platforms, like ACPI, may want us to save some memory used by them during
* hibernation and to restore the contents of this memory during the subsequent
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern unsigned int count_data_pages(void);
+extern int swsusp_shrink_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
@@ -149,7 +149,6 @@ extern int swsusp_swap_in_use(void);
/* kernel/power/disk.c */
extern int swsusp_check(void);
-extern int swsusp_shrink_memory(void);
extern void swsusp_free(void);
extern int swsusp_read(unsigned int *flags_p);
extern int swsusp_write(unsigned int flags);
@@ -176,7 +175,6 @@ extern int pm_notifier_call_chain(unsign
#endif
#ifdef CONFIG_HIGHMEM
-unsigned int count_highmem_pages(void);
int restore_highmem(void);
#else
static inline unsigned int count_highmem_pages(void) { return 0; }
^ permalink raw reply [flat|nested] 580+ messages in thread
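For readers comparing the two files, the bite-limited reclaim loop being moved by this patch can be modeled in a few lines of Python. This is only a sketch: SHRINK_BITE is taken from the patch, the `reclaim` callback stands in for shrink_all_memory(), and the single `pages_needed` argument simplifies the kernel loop, which recomputes the deficit from the page counts on every pass.

```python
SHRINK_BITE = 10000  # cap on pages requested from the reclaimer per pass


def shrink_memory(pages_needed, reclaim):
    """Model of the do/while loop in swsusp_shrink_memory(): keep asking
    `reclaim` for at most SHRINK_BITE pages until the deficit is covered.
    Returns the total number of pages freed, or None when the reclaimer
    makes no progress (the kernel returns -ENOMEM in that case)."""
    freed = 0
    deficit = pages_needed
    while deficit > 0:
        got = reclaim(min(deficit, SHRINK_BITE))
        if got == 0:
            return None  # no forward progress: kernel's -ENOMEM path
        freed += got
        deficit -= got
    return freed
```

In the kernel the deficit is recomputed from count_data_pages(), count_highmem_pages() and the per-zone free-page counts on each iteration; the fixed `pages_needed` here only illustrates the capping and termination behavior.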
* [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
@ 2009-05-03 0:24 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:24 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds
From: Rafael J. Wysocki <rjw@sisk.pl>
Modify the hibernation memory shrinking code so that it frees memory
by making memory allocations instead of using an artificial
memory shrinking mechanism for that. Remove the shrinking of
memory from the suspend-to-RAM code, where it is not really
necessary. Finally, remove the no longer used memory shrinking
functions from mm/vmscan.c.
[rev. 2: Use the existing memory bitmaps for marking preallocated
image pages and use swsusp_free() for releasing them, introduce
GFP_IMAGE, add comments describing the memory shrinking strategy.]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/main.c | 20 ------
kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
mm/vmscan.c | 142 ------------------------------------------------
3 files changed, 101 insertions(+), 193 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1066,41 +1066,97 @@ void swsusp_free(void)
buffer = NULL;
}
+/* Helper functions used for the shrinking of memory. */
+
+#ifdef CONFIG_HIGHMEM
+#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
+#else
+#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
+#endif
+
+#define SHRINK_BITE 10000
+
/**
- * swsusp_shrink_memory - Try to free as much memory as needed
+ * prealloc_pages - preallocate given number of pages and mark their PFNs
+ * @nr_pages: Number of pages to allocate.
*
- * ... but do not OOM-kill anyone
- *
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
+ * Allocate given number of pages and mark their PFNs in the hibernation memory
+ * bitmaps, so that they can be released by swsusp_free().
+ * Return value: The number of normal (ie. non-highmem) pages allocated or
+ * -ENOMEM on failure.
*/
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
+static long prealloc_pages(long nr_pages)
{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
+ long nr_normal = 0;
+
+ while (nr_pages-- > 0) {
+ struct page *page;
+
+ page = alloc_image_page(GFP_IMAGE);
+ if (!page)
+ return -ENOMEM;
+ if (!PageHighMem(page))
+ nr_normal++;
+ }
+
+ return nr_normal;
}
+/**
+ * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ *
+ * To create a hibernation image it is necessary to make a copy of every page
+ * frame in use. We also need a number of page frames to be free during
+ * hibernation for allocations made while saving the image and for device
+ * drivers, in case they need to allocate memory from their hibernation
+ * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
+ * respectively, both of which are rough estimates). To make this happen, we
+ * preallocate memory in SHRINK_BITE chunks in a loop until the following
+ * condition is satisfied:
+ *
+ * [number of preallocated page frames] >=
+ * (1/2) * ([total number of page frames in use] + PAGES_FOR_IO
+ * + SPARE_PAGES - [number of free page frames])
+ *
+ * because in that case, if all of the preallocated page frames are released,
+ * the total number of free page frames will be equal to or greater than the sum
+ * of the total number of page frames in use with PAGES_FOR_IO and SPARE_PAGES,
+ * which is what we need.
+ *
+ * If image_size is set below the number following from the above inequality,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size.
+ */
int swsusp_shrink_memory(void)
{
- long tmp;
- struct zone *zone;
- unsigned long pages = 0;
+ unsigned long pages = 0, alloc_normal = 0, alloc_highmem = 0;
unsigned int i = 0;
char *p = "-\\|/";
struct timeval start, stop;
+ int error = 0;
printk(KERN_INFO "PM: Shrinking memory... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
+ for (;;) {
+ struct zone *zone;
+ long size, highmem_size, tmp, ret;
+
+ /*
+ * Pages preallocated by this loop are not counted as data pages
+ * by count_data_pages() and count_highmem_pages(), so we only
+ * need to subtract their numbers once here to verify the
+ * satisfaction of the stop condition.
+ */
+ size = count_data_pages() - alloc_normal;
+ tmp = size + PAGES_FOR_IO + SPARE_PAGES;
+ highmem_size = count_highmem_pages() - alloc_highmem;
size += highmem_size;
+ /*
+ * Highmem is treated differently, because we prefer not to
+ * store copies of normal page frames in it during image
+ * creation.
+ */
for_each_populated_zone(zone) {
tmp += snapshot_additional_pages(zone);
if (is_highmem(zone)) {
@@ -1111,27 +1167,39 @@ int swsusp_shrink_memory(void)
tmp += zone->lowmem_reserve[ZONE_NORMAL];
}
}
-
if (highmem_size < 0)
highmem_size = 0;
-
tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
+
+ if (tmp <= 0 && size > image_size / PAGE_SIZE)
+ tmp = size - (image_size / PAGE_SIZE);
+
+ if (tmp > SHRINK_BITE)
+ tmp = SHRINK_BITE;
+ else if (tmp <= 0)
+ break;
+
+ ret = prealloc_pages(tmp);
+ if (ret < 0) {
+ error = -ENOMEM;
+ goto out;
}
+ alloc_normal += ret;
+ alloc_highmem += tmp - ret;
+ pages += tmp;
+
printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ }
+
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk("\bdone (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
- return 0;
+ out:
+ /* Release the preallocated page frames. */
+ swsusp_free();
+
+ return error;
}
#ifdef CONFIG_HIGHMEM
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2054,148 +2054,6 @@ unsigned long global_lru_pages(void)
+ global_page_state(NR_INACTIVE_FILE);
}
-#ifdef CONFIG_PM
-/*
- * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
- * from LRU lists system-wide, for given pass and priority.
- *
- * For pass > 3 we also try to shrink the LRU lists that contain a few pages
- */
-static void shrink_all_zones(unsigned long nr_pages, int prio,
- int pass, struct scan_control *sc)
-{
- struct zone *zone;
- unsigned long nr_reclaimed = 0;
-
- for_each_populated_zone(zone) {
- enum lru_list l;
-
- if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
- continue;
-
- for_each_evictable_lru(l) {
- enum zone_stat_item ls = NR_LRU_BASE + l;
- unsigned long lru_pages = zone_page_state(zone, ls);
-
- /* For pass = 0, we don't shrink the active list */
- if (pass == 0 && (l == LRU_ACTIVE_ANON ||
- l == LRU_ACTIVE_FILE))
- continue;
-
- zone->lru[l].nr_scan += (lru_pages >> prio) + 1;
- if (zone->lru[l].nr_scan >= nr_pages || pass > 3) {
- unsigned long nr_to_scan;
-
- zone->lru[l].nr_scan = 0;
- nr_to_scan = min(nr_pages, lru_pages);
- nr_reclaimed += shrink_list(l, nr_to_scan, zone,
- sc, prio);
- if (nr_reclaimed >= nr_pages) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
- }
- }
- }
- sc->nr_reclaimed += nr_reclaimed;
-}
-
-/*
- * Try to free `nr_pages' of memory, system-wide, and return the number of
- * freed pages.
- *
- * Rather than trying to age LRUs the aim is to preserve the overall
- * LRU order by reclaiming preferentially
- * inactive > active > active referenced > active mapped
- */
-unsigned long shrink_all_memory(unsigned long nr_pages)
-{
- unsigned long lru_pages, nr_slab;
- int pass;
- struct reclaim_state reclaim_state;
- struct scan_control sc = {
- .gfp_mask = GFP_KERNEL,
- .may_unmap = 0,
- .may_writepage = 1,
- .isolate_pages = isolate_pages_global,
- .nr_reclaimed = 0,
- };
-
- current->reclaim_state = &reclaim_state;
-
- lru_pages = global_lru_pages();
- nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
- /* If slab caches are huge, it's better to hit them first */
- while (nr_slab >= lru_pages) {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
- if (!reclaim_state.reclaimed_slab)
- break;
-
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- nr_slab -= reclaim_state.reclaimed_slab;
- }
-
- /*
- * We try to shrink LRUs in 5 passes:
- * 0 = Reclaim from inactive_list only
- * 1 = Reclaim from active list but don't reclaim mapped
- * 2 = 2nd pass of type 1
- * 3 = Reclaim mapped (normal reclaim)
- * 4 = 2nd pass of type 3
- */
- for (pass = 0; pass < 5; pass++) {
- int prio;
-
- /* Force reclaiming mapped pages in the passes #3 and #4 */
- if (pass > 2)
- sc.may_unmap = 1;
-
- for (prio = DEF_PRIORITY; prio >= 0; prio--) {
- unsigned long nr_to_scan = nr_pages - sc.nr_reclaimed;
-
- sc.nr_scanned = 0;
- sc.swap_cluster_max = nr_to_scan;
- shrink_all_zones(nr_to_scan, prio, pass, &sc);
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask,
- global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
- }
- }
-
- /*
- * If sc.nr_reclaimed = 0, we could not shrink LRUs, but there may be
- * something in slab caches
- */
- if (!sc.nr_reclaimed) {
- do {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- } while (sc.nr_reclaimed < nr_pages &&
- reclaim_state.reclaimed_slab > 0);
- }
-
-
-out:
- current->reclaim_state = NULL;
-
- return sc.nr_reclaimed;
-}
-#endif
-
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -188,9 +188,6 @@ static void suspend_test_finish(const ch
#endif
-/* This is just an arbitrary number */
-#define FREE_PAGE_NUMBER (100)
-
static struct platform_suspend_ops *suspend_ops;
/**
@@ -226,7 +223,6 @@ int suspend_valid_only_mem(suspend_state
static int suspend_prepare(void)
{
int error;
- unsigned int free_pages;
if (!suspend_ops || !suspend_ops->enter)
return -EPERM;
@@ -241,24 +237,10 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- if (suspend_freeze_processes()) {
- error = -EAGAIN;
- goto Thaw;
- }
-
- free_pages = global_page_state(NR_FREE_PAGES);
- if (free_pages < FREE_PAGE_NUMBER) {
- pr_debug("PM: free some memory\n");
- shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
- if (nr_free_pages() < FREE_PAGE_NUMBER) {
- error = -ENOMEM;
- printk(KERN_ERR "PM: No enough memory\n");
- }
- }
+ error = suspend_freeze_processes();
if (!error)
return 0;
- Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
^ permalink raw reply [flat|nested] 580+ messages in thread
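The per-pass target selection added by this patch (the `tmp` computation just before prealloc_pages() is called) can be sketched as a small Python function. The function name `next_bite` and the scalar arguments are illustrative, with `deficit` standing for the value of `tmp` after the zone loop:

```python
SHRINK_BITE = 10000  # maximum pages preallocated in a single pass


def next_bite(deficit, in_use_pages, image_size_pages):
    """Choose how many pages to preallocate in one pass of the rev. 2 loop:
    if the free-memory deficit is already met (<= 0) but the projected
    image is still larger than image_size, keep preallocating to push it
    down; cap the request at SHRINK_BITE; return 0 to stop the loop."""
    tmp = deficit
    if tmp <= 0 and in_use_pages > image_size_pages:
        tmp = in_use_pages - image_size_pages
    if tmp > SHRINK_BITE:
        return SHRINK_BITE
    if tmp <= 0:
        return 0
    return tmp
```

A return of 0 corresponds to the `break` in the kernel loop, and a positive return is the argument passed to prealloc_pages().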
+ * of the total number of page frames in use with PAGES_FOR_IO and SPARE_PAGES,
+ * which is what we need.
+ *
+ * If image_size is set below the number following from the above inequality,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size.
+ */
int swsusp_shrink_memory(void)
{
- long tmp;
- struct zone *zone;
- unsigned long pages = 0;
+ unsigned long pages = 0, alloc_normal = 0, alloc_highmem = 0;
unsigned int i = 0;
char *p = "-\\|/";
struct timeval start, stop;
+ int error = 0;
printk(KERN_INFO "PM: Shrinking memory... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
+ for (;;) {
+ struct zone *zone;
+ long size, highmem_size, tmp, ret;
+
+ /*
+ * Pages preallocated by this loop are not counted as data pages
+ * by count_data_pages() and count_highmem_pages(), so we only
+ * need to subtract their numbers once here to verify the
+ * satisfaction of the stop condition.
+ */
+ size = count_data_pages() - alloc_normal;
+ tmp = size + PAGES_FOR_IO + SPARE_PAGES;
+ highmem_size = count_highmem_pages() - alloc_highmem;
size += highmem_size;
+ /*
+ * Highmem is treated differently, because we prefer not to
+ * store copies of normal page frames in it during image
+ * creation.
+ */
for_each_populated_zone(zone) {
tmp += snapshot_additional_pages(zone);
if (is_highmem(zone)) {
@@ -1111,27 +1167,39 @@ int swsusp_shrink_memory(void)
tmp += zone->lowmem_reserve[ZONE_NORMAL];
}
}
-
if (highmem_size < 0)
highmem_size = 0;
-
tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
+
+ if (tmp <= 0 && size > image_size / PAGE_SIZE)
+ tmp = size - (image_size / PAGE_SIZE);
+
+ if (tmp > SHRINK_BITE)
+ tmp = SHRINK_BITE;
+ else if (tmp <= 0)
+ break;
+
+ ret = prealloc_pages(tmp);
+ if (ret < 0) {
+ error = -ENOMEM;
+ goto out;
}
+ alloc_normal += ret;
+ alloc_highmem += tmp - ret;
+ pages += tmp;
+
printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ }
+
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk("\bdone (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
- return 0;
+ out:
+ /* Release the preallocated page frames. */
+ swsusp_free();
+
+ return error;
}
#ifdef CONFIG_HIGHMEM
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2054,148 +2054,6 @@ unsigned long global_lru_pages(void)
+ global_page_state(NR_INACTIVE_FILE);
}
-#ifdef CONFIG_PM
-/*
- * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
- * from LRU lists system-wide, for given pass and priority.
- *
- * For pass > 3 we also try to shrink the LRU lists that contain a few pages
- */
-static void shrink_all_zones(unsigned long nr_pages, int prio,
- int pass, struct scan_control *sc)
-{
- struct zone *zone;
- unsigned long nr_reclaimed = 0;
-
- for_each_populated_zone(zone) {
- enum lru_list l;
-
- if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
- continue;
-
- for_each_evictable_lru(l) {
- enum zone_stat_item ls = NR_LRU_BASE + l;
- unsigned long lru_pages = zone_page_state(zone, ls);
-
- /* For pass = 0, we don't shrink the active list */
- if (pass == 0 && (l == LRU_ACTIVE_ANON ||
- l == LRU_ACTIVE_FILE))
- continue;
-
- zone->lru[l].nr_scan += (lru_pages >> prio) + 1;
- if (zone->lru[l].nr_scan >= nr_pages || pass > 3) {
- unsigned long nr_to_scan;
-
- zone->lru[l].nr_scan = 0;
- nr_to_scan = min(nr_pages, lru_pages);
- nr_reclaimed += shrink_list(l, nr_to_scan, zone,
- sc, prio);
- if (nr_reclaimed >= nr_pages) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
- }
- }
- }
- sc->nr_reclaimed += nr_reclaimed;
-}
-
-/*
- * Try to free `nr_pages' of memory, system-wide, and return the number of
- * freed pages.
- *
- * Rather than trying to age LRUs the aim is to preserve the overall
- * LRU order by reclaiming preferentially
- * inactive > active > active referenced > active mapped
- */
-unsigned long shrink_all_memory(unsigned long nr_pages)
-{
- unsigned long lru_pages, nr_slab;
- int pass;
- struct reclaim_state reclaim_state;
- struct scan_control sc = {
- .gfp_mask = GFP_KERNEL,
- .may_unmap = 0,
- .may_writepage = 1,
- .isolate_pages = isolate_pages_global,
- .nr_reclaimed = 0,
- };
-
- current->reclaim_state = &reclaim_state;
-
- lru_pages = global_lru_pages();
- nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
- /* If slab caches are huge, it's better to hit them first */
- while (nr_slab >= lru_pages) {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
- if (!reclaim_state.reclaimed_slab)
- break;
-
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- nr_slab -= reclaim_state.reclaimed_slab;
- }
-
- /*
- * We try to shrink LRUs in 5 passes:
- * 0 = Reclaim from inactive_list only
- * 1 = Reclaim from active list but don't reclaim mapped
- * 2 = 2nd pass of type 1
- * 3 = Reclaim mapped (normal reclaim)
- * 4 = 2nd pass of type 3
- */
- for (pass = 0; pass < 5; pass++) {
- int prio;
-
- /* Force reclaiming mapped pages in the passes #3 and #4 */
- if (pass > 2)
- sc.may_unmap = 1;
-
- for (prio = DEF_PRIORITY; prio >= 0; prio--) {
- unsigned long nr_to_scan = nr_pages - sc.nr_reclaimed;
-
- sc.nr_scanned = 0;
- sc.swap_cluster_max = nr_to_scan;
- shrink_all_zones(nr_to_scan, prio, pass, &sc);
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask,
- global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
- }
- }
-
- /*
- * If sc.nr_reclaimed = 0, we could not shrink LRUs, but there may be
- * something in slab caches
- */
- if (!sc.nr_reclaimed) {
- do {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- } while (sc.nr_reclaimed < nr_pages &&
- reclaim_state.reclaimed_slab > 0);
- }
-
-
-out:
- current->reclaim_state = NULL;
-
- return sc.nr_reclaimed;
-}
-#endif
-
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -188,9 +188,6 @@ static void suspend_test_finish(const ch
#endif
-/* This is just an arbitrary number */
-#define FREE_PAGE_NUMBER (100)
-
static struct platform_suspend_ops *suspend_ops;
/**
@@ -226,7 +223,6 @@ int suspend_valid_only_mem(suspend_state
static int suspend_prepare(void)
{
int error;
- unsigned int free_pages;
if (!suspend_ops || !suspend_ops->enter)
return -EPERM;
@@ -241,24 +237,10 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- if (suspend_freeze_processes()) {
- error = -EAGAIN;
- goto Thaw;
- }
-
- free_pages = global_page_state(NR_FREE_PAGES);
- if (free_pages < FREE_PAGE_NUMBER) {
- pr_debug("PM: free some memory\n");
- shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
- if (nr_free_pages() < FREE_PAGE_NUMBER) {
- error = -ENOMEM;
- printk(KERN_ERR "PM: No enough memory\n");
- }
- }
+ error = suspend_freeze_processes();
if (!error)
return 0;
- Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 0:24 ` Rafael J. Wysocki
@ 2009-05-03 3:06 ` Linus Torvalds
-1 siblings, 0 replies; 580+ messages in thread
From: Linus Torvalds @ 2009-05-03 3:06 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, linux-pm
On Sun, 3 May 2009, Rafael J. Wysocki wrote:
>
> Remove the shrinking of memory from the suspend-to-RAM code, where it is
> not really necessary.
Hmm. Shouldn't we do this _regardless_?
IOW, shouldn't this be a totally separate patch? It seems to be left-over
from when we shared the same code-paths, and before the split of the STR
and hibernate code?
IOW, shouldn't the very _first_ patch just be this part? That code doesn't
make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
arbitrary).
This part seems to be totally independent of all the other parts in your
patch-series. No?
Linus
---
kernel/power/main.c | 19 +------------------
1 files changed, 1 insertions(+), 18 deletions(-)
diff --git a/kernel/power/main.c b/kernel/power/main.c
index f99ed6a..e3197e9 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -188,9 +188,6 @@ static void suspend_test_finish(const char *label)
#endif
-/* This is just an arbitrary number */
-#define FREE_PAGE_NUMBER (100)
-
static struct platform_suspend_ops *suspend_ops;
/**
@@ -241,24 +238,10 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- if (suspend_freeze_processes()) {
- error = -EAGAIN;
- goto Thaw;
- }
-
- free_pages = global_page_state(NR_FREE_PAGES);
- if (free_pages < FREE_PAGE_NUMBER) {
- pr_debug("PM: free some memory\n");
- shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
- if (nr_free_pages() < FREE_PAGE_NUMBER) {
- error = -ENOMEM;
- printk(KERN_ERR "PM: No enough memory\n");
- }
- }
+ error = suspend_freeze_processes();
if (!error)
return 0;
- Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 3:06 ` Linus Torvalds
@ 2009-05-03 9:36 ` Pavel Machek
-1 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-03 9:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, linux-pm
Hi!
> > Remove the shrinking of memory from the suspend-to-RAM code, where it is
> > not really necessary.
>
> Hmm. Shouldn't we do this _regardless_?
>
> IOW, shouldn't this be a totally separate patch? It seems to be left-over
> from when we shared the same code-paths, and before the split of the STR
> and hibernate code?
>
> IOW, shouldn't the very _first_ patch just be this part? That code doesn't
> make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
> arbitrary).
>
> This part seems to be totally independent of all the other parts in your
> patch-series. No?
I'm not sure this one is a good idea: drivers will need to allocate
memory during suspend/resume, and when processes are frozen/disk
driver is suspended, normal memory management will no longer work.
So, freeing 4M of memory before starting suspend seems like a good
idea. That way those small allocations will not fail.
Pavel
> @@ -188,9 +188,6 @@ static void suspend_test_finish(const char *label)
>
> #endif
>
> -/* This is just an arbitrary number */
> -#define FREE_PAGE_NUMBER (100)
> -
> static struct platform_suspend_ops *suspend_ops;
>
> /**
> @@ -241,24 +238,10 @@ static int suspend_prepare(void)
> if (error)
> goto Finish;
>
> - if (suspend_freeze_processes()) {
> - error = -EAGAIN;
> - goto Thaw;
> - }
> -
> - free_pages = global_page_state(NR_FREE_PAGES);
> - if (free_pages < FREE_PAGE_NUMBER) {
> - pr_debug("PM: free some memory\n");
> - shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
> - if (nr_free_pages() < FREE_PAGE_NUMBER) {
> - error = -ENOMEM;
> - printk(KERN_ERR "PM: No enough memory\n");
> - }
> - }
> + error = suspend_freeze_processes();
> if (!error)
> return 0;
>
> - Thaw:
> suspend_thaw_processes();
> usermodehelper_enable();
> Finish:
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 9:36 ` Pavel Machek
@ 2009-05-03 16:35 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 16:35 UTC (permalink / raw)
To: Pavel Machek
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, Linus Torvalds, linux-pm
On Sunday 03 May 2009, Pavel Machek wrote:
> Hi!
Hi,
> > > Remove the shrinking of memory from the suspend-to-RAM code, where it is
> > > not really necessary.
> >
> > Hmm. Shouldn't we do this _regardless_?
> >
> > IOW, shouldn't this be a totally separate patch? It seems to be left-over
> > from when we shared the same code-paths, and before the split of the STR
> > and hibernate code?
> >
> > IOW, shouldn't the very _first_ patch just be this part? That code doesn't
> > make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
> > arbitrary).
> >
> > This part seems to be totally independent of all the other parts in your
> > patch-series. No?
>
> I'm not sure this one is a good idea: drivers will need to allocate
> memory during suspend/resume, and when processes are frozen/disk
> driver is suspended, normal memory management will no longer work.
>
> So, freeing 4M of memory before starting suspend seems like a good
> idea. That way those small allocations will not fail.
I don't think we've ever had problems with the drivers having too little
memory to suspend.
I'm opting for removing this code and seeing if that leads to any regressions.
If it does, we can still get some free memory by allocating and releasing it.
Thanks,
Rafael
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 16:35 ` Rafael J. Wysocki
@ 2009-05-04 9:36 ` Pavel Machek
-1 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 9:36 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, Linus Torvalds, linux-pm
On Sun 2009-05-03 18:35:06, Rafael J. Wysocki wrote:
> On Sunday 03 May 2009, Pavel Machek wrote:
> > Hi!
>
> Hi,
>
> > > > Remove the shrinking of memory from the suspend-to-RAM code, where it is
> > > > not really necessary.
> > >
> > > Hmm. Shouldn't we do this _regardless_?
> > >
> > > IOW, shouldn't this be a totally separate patch? It seems to be left-over
> > > from when we shared the same code-paths, and before the split of the STR
> > > and hibernate code?
> > >
> > > IOW, shouldn't the very _first_ patch just be this part? That code doesn't
> > > make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
> > > arbitrary).
> > >
> > > This part seems to be totally independent of all the other parts in your
> > > patch-series. No?
> >
> > I'm not sure this one is a good idea: drivers will need to allocate
> > memory during suspend/resume, and when processes are frozen/disk
> > driver is suspended, normal memory management will no longer work.
> >
> > So, freeing 4M of memory before starting suspend seems like a good
> > idea. That way those small allocations will not fail.
>
> I don't think we've ever had problems with the drivers having too little
> memory to suspend.
Well, we had the 4MB buffer there, so it is hardly surprising.
> I'm opting for removing this code and seeing if that leads to any regressions.
> If it does, we can still get some free memory by allocating and releasing it.
I believe we should. If we don't... we will not get any regression
reports, because it will probably just hang with black screen :-(, and
"being out of memory during suspend" is probably going to be hard to
reproduce.
Perhaps we should try to _eat_ all memory available during suspend to
test driver behaviour with 0 pages free?
while (kmalloc(100, GFP_ATOMIC))
;
in suspend path should just do it for testing.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
@ 2009-05-04 9:36 ` Pavel Machek
0 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 9:36 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linus Torvalds, Andrew Morton, jens.axboe, alan-jenkins,
linux-kernel, kernel-testers, linux-pm
On Sun 2009-05-03 18:35:06, Rafael J. Wysocki wrote:
> On Sunday 03 May 2009, Pavel Machek wrote:
> > Hi!
>
> Hi,
>
> > > > Remove the shrinking of memory from the suspend-to-RAM code, where it is
> > > > not really necessary.
> > >
> > > Hmm. Shouldn't we do this _regardless_?
> > >
> > > IOW, shouldn't this be a totally separate patch? It seems to be left-over
> > > from when we shared the same code-paths, and before the split of the STR
> > > and hibernate code?
> > >
> > > IOW, shouldn't the very _first_ patch just be this part? That code doesn't
> > > make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
> > > arbitrary).
> > >
> > > This part seems to be totally independent of all the other parts in your
> > > patch-series. No?
> >
> > I'm not sure this one is a good idea: drivers will need to allocate
> > memory during suspend/resume, and when processes are frozen/disk
> > driver is suspended, normal memory management will no longer work.
> >
> > So, freeing 4M of memory before starting suspend seems like a good
> > idea. That way those small allocations will not fail.
>
> I don't think we've ever had problems with the drivers having too little
> memory to suspend.
Well, we had the 4MB buffer there, so it is hardly surprising.
> I'm opting for removing this code and seeing if that leads to any regressions.
> If it does, we can still get some free memory by allocating and releasing it.
I believe we should. If we don't, we will not get any regression
reports, because the machine will probably just hang with a black
screen :-(, and "being out of memory during suspend" is going to be
hard to reproduce.
Perhaps we should try to _eat_ all memory available during suspend to
test driver behaviour with 0 pages free?
	while (kmalloc(100, GFP_ATOMIC))
		;

in the suspend path should just do it for testing.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 3:06 ` Linus Torvalds
@ 2009-05-03 16:15 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 16:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, linux-pm
On Sunday 03 May 2009, Linus Torvalds wrote:
>
> On Sun, 3 May 2009, Rafael J. Wysocki wrote:
> >
> > Remove the shrinking of memory from the suspend-to-RAM code, where it is
> > not really necessary.
>
> Hmm. Shouldn't we do this _regardless_?
>
> IOW, shouldn't this be a totally separate patch? It seems to be left-over
> from when we shared the same code-paths, and before the split of the STR
> and hibernate code?
>
> IOW, shouldn't the very _first_ patch just be this part? That code doesn't
> make any sense anyway (that FREE_PAGE_NUMBER really _is_ totally
> arbitrary).
>
> This part seems to be totally independent of all the other parts in your
> patch-series. No?
I'm removing this along with shrink_all_memory(), which it depends on, but I
can put that into a separate patch if you prefer.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 0:24 ` Rafael J. Wysocki
@ 2009-05-03 11:51 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-03 11:51 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds, linux-pm
On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rjw@sisk.pl>
>
> Modify the hibernation memory shrinking code so that it will make
> memory allocations to free memory instead of using an artificial
> memory shrinking mechanism for that. Remove the shrinking of
> memory from the suspend-to-RAM code, where it is not really
> necessary. Finally, remove the no longer used memory shrinking
> functions from mm/vmscan.c .
>
> [rev. 2: Use the existing memory bitmaps for marking preallocated
> image pages and use swsusp_free() for releasing them, introduce
> GFP_IMAGE, add comments describing the memory shrinking strategy.]
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> kernel/power/main.c | 20 ------
> kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> mm/vmscan.c | 142 ------------------------------------------------
> 3 files changed, 101 insertions(+), 193 deletions(-)
>
> Index: linux-2.6/kernel/power/snapshot.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/snapshot.c
> +++ linux-2.6/kernel/power/snapshot.c
> @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> buffer = NULL;
> }
>
> +/* Helper functions used for the shrinking of memory. */
> +
> +#ifdef CONFIG_HIGHMEM
> +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> +#else
> +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> +#endif
The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
> +#define SHRINK_BITE 10000
This is ~40MB. A full scan of (for example) 8GB of pages will be
time-consuming, not to mention we would have to do it
2*(8G-500M)/40M = 384 times!
Can we make it LONG_MAX?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 11:51 ` Wu Fengguang
@ 2009-05-03 16:22 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 16:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds, linux-pm
On Sunday 03 May 2009, Wu Fengguang wrote:
> On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rjw@sisk.pl>
> >
> > Modify the hibernation memory shrinking code so that it will make
> > memory allocations to free memory instead of using an artificial
> > memory shrinking mechanism for that. Remove the shrinking of
> > memory from the suspend-to-RAM code, where it is not really
> > necessary. Finally, remove the no longer used memory shrinking
> > functions from mm/vmscan.c .
> >
> > [rev. 2: Use the existing memory bitmaps for marking preallocated
> > image pages and use swsusp_free() for releasing them, introduce
> > GFP_IMAGE, add comments describing the memory shrinking strategy.]
> >
> > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > ---
> > kernel/power/main.c | 20 ------
> > kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> > mm/vmscan.c | 142 ------------------------------------------------
> > 3 files changed, 101 insertions(+), 193 deletions(-)
> >
> > Index: linux-2.6/kernel/power/snapshot.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/snapshot.c
> > +++ linux-2.6/kernel/power/snapshot.c
> > @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> > buffer = NULL;
> > }
> >
> > +/* Helper functions used for the shrinking of memory. */
> > +
> > +#ifdef CONFIG_HIGHMEM
> > +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> > +#else
> > +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> > +#endif
>
> The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
>
> > +#define SHRINK_BITE 10000
>
> This is ~40MB. A full scan of (for example) 8G pages will be time
> consuming, not to mention we have to do it 2*(8G-500M)/40M = 384 times!
>
> Can we make it a LONG_MAX?
No, I don't think so. The problem is that the number of pages we'll need to
copy generally shrinks as we allocate memory, so we can't do it in one shot.
We can make it a greater number, but I don't really think it would be a good
idea to make it greater than 100 MB.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
@ 2009-05-04 9:31 ` Pavel Machek
0 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 9:31 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Wu Fengguang, Andrew Morton, torvalds, jens.axboe, alan-jenkins,
linux-kernel, kernel-testers, linux-pm
On Sun 2009-05-03 18:22:54, Rafael J. Wysocki wrote:
> On Sunday 03 May 2009, Wu Fengguang wrote:
> > On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Modify the hibernation memory shrinking code so that it will make
> > > memory allocations to free memory instead of using an artificial
> > > memory shrinking mechanism for that. Remove the shrinking of
> > > memory from the suspend-to-RAM code, where it is not really
> > > necessary. Finally, remove the no longer used memory shrinking
> > > functions from mm/vmscan.c .
> > >
> > > [rev. 2: Use the existing memory bitmaps for marking preallocated
> > > image pages and use swsusp_free() for releasing them, introduce
> > > GFP_IMAGE, add comments describing the memory shrinking strategy.]
> > >
> > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > > ---
> > > kernel/power/main.c | 20 ------
> > > kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> > > mm/vmscan.c | 142 ------------------------------------------------
> > > 3 files changed, 101 insertions(+), 193 deletions(-)
> > >
> > > Index: linux-2.6/kernel/power/snapshot.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/power/snapshot.c
> > > +++ linux-2.6/kernel/power/snapshot.c
> > > @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> > > buffer = NULL;
> > > }
> > >
> > > +/* Helper functions used for the shrinking of memory. */
> > > +
> > > +#ifdef CONFIG_HIGHMEM
> > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> > > +#else
> > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> > > +#endif
> >
> > The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
> >
> > > +#define SHRINK_BITE 10000
> >
> > This is ~40MB. A full scan of (for example) 8G pages will be time
> > consuming, not to mention we have to do it 2*(8G-500M)/40M = 384 times!
> >
> > Can we make it a LONG_MAX?
>
> No, I don't think so. The problem is the number of pages we'll need to copy
> is generally shrinking as we allocate memory, so we can't do that in one shot.
>
> We can make it a greater number, but I don't really think it would be a good
> idea to make it greater than 100 MB.
Well, even 100MB is quite big: on a 128MB machine, that will probably
mean freeing all of the memory (instead of "as much as needed"). And that
memory then needs to go to disk, so it will be slow.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-04 9:31 ` Pavel Machek
@ 2009-05-04 19:52 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 19:52 UTC (permalink / raw)
To: Pavel Machek
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
On Monday 04 May 2009, Pavel Machek wrote:
> On Sun 2009-05-03 18:22:54, Rafael J. Wysocki wrote:
> > On Sunday 03 May 2009, Wu Fengguang wrote:
> > > On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > >
> > > > Modify the hibernation memory shrinking code so that it will make
> > > > memory allocations to free memory instead of using an artificial
> > > > memory shrinking mechanism for that. Remove the shrinking of
> > > > memory from the suspend-to-RAM code, where it is not really
> > > > necessary. Finally, remove the no longer used memory shrinking
> > > > functions from mm/vmscan.c .
> > > >
> > > > [rev. 2: Use the existing memory bitmaps for marking preallocated
> > > > image pages and use swsusp_free() for releasing them, introduce
> > > > GFP_IMAGE, add comments describing the memory shrinking strategy.]
> > > >
> > > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > > > ---
> > > > kernel/power/main.c | 20 ------
> > > > kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> > > > mm/vmscan.c | 142 ------------------------------------------------
> > > > 3 files changed, 101 insertions(+), 193 deletions(-)
> > > >
> > > > Index: linux-2.6/kernel/power/snapshot.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/kernel/power/snapshot.c
> > > > +++ linux-2.6/kernel/power/snapshot.c
> > > > @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> > > > buffer = NULL;
> > > > }
> > > >
> > > > +/* Helper functions used for the shrinking of memory. */
> > > > +
> > > > +#ifdef CONFIG_HIGHMEM
> > > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> > > > +#else
> > > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> > > > +#endif
> > >
> > > The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
> > >
> > > > +#define SHRINK_BITE 10000
> > >
> > > This is ~40MB. A full scan of (for example) 8G pages will be time
> > > consuming, not to mention we have to do it 2*(8G-500M)/40M = 384 times!
> > >
> > > Can we make it a LONG_MAX?
> >
> > No, I don't think so. The problem is the number of pages we'll need to copy
> > is generally shrinking as we allocate memory, so we can't do that in one shot.
> >
> > We can make it a greater number, but I don't really think it would be a good
> > idea to make it greater than 100 MB.
>
> Well, even 100MB is quite big: on 128MB machine, that will probably
> mean freeing all the memory (instead of "as much as needed"). And that
> memory needs to go to disk, so it will be slow.
But we're going to free it anyway?
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
@ 2009-05-04 19:52 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 19:52 UTC (permalink / raw)
To: Pavel Machek
Cc: Wu Fengguang, Andrew Morton, torvalds, jens.axboe, alan-jenkins,
linux-kernel, kernel-testers, linux-pm
On Monday 04 May 2009, Pavel Machek wrote:
> On Sun 2009-05-03 18:22:54, Rafael J. Wysocki wrote:
> > On Sunday 03 May 2009, Wu Fengguang wrote:
> > > On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > >
> > > > Modify the hibernation memory shrinking code so that it will make
> > > > memory allocations to free memory instead of using an artificial
> > > > memory shrinking mechanism for that. Remove the shrinking of
> > > > memory from the suspend-to-RAM code, where it is not really
> > > > necessary. Finally, remove the no longer used memory shrinking
> > > > functions from mm/vmscan.c.
> > > >
> > > > [rev. 2: Use the existing memory bitmaps for marking preallocated
> > > > image pages and use swsusp_free() for releasing them, introduce
> > > > GFP_IMAGE, add comments describing the memory shrinking strategy.]
> > > >
> > > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > > > ---
> > > > kernel/power/main.c | 20 ------
> > > > kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> > > > mm/vmscan.c | 142 ------------------------------------------------
> > > > 3 files changed, 101 insertions(+), 193 deletions(-)
> > > >
> > > > Index: linux-2.6/kernel/power/snapshot.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/kernel/power/snapshot.c
> > > > +++ linux-2.6/kernel/power/snapshot.c
> > > > @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> > > > buffer = NULL;
> > > > }
> > > >
> > > > +/* Helper functions used for the shrinking of memory. */
> > > > +
> > > > +#ifdef CONFIG_HIGHMEM
> > > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> > > > +#else
> > > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> > > > +#endif
> > >
> > > The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
> > >
> > > > +#define SHRINK_BITE 10000
> > >
> > > This is ~40MB. A full scan of (for example) 8G pages will be time
> > > consuming, not to mention we have to do it 2*(8G-500M)/40M = 384 times!
> > >
> > > Can we make it a LONG_MAX?
> >
> > No, I don't think so. The problem is the number of pages we'll need to copy
> > is generally shrinking as we allocate memory, so we can't do that in one shot.
> >
> > We can make it a greater number, but I don't really think it would be a good
> > idea to make it greater than 100 MB.
>
> Well, even 100MB is quite big: on 128MB machine, that will probably
> mean freeing all the memory (instead of "as much as needed"). And that
> memory needs to go to disk, so it will be slow.
But we're going to free it anyway?
Rafael
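The constraint Rafael describes — the number of pages still to be copied shrinks as memory is preallocated, so the target has to be re-evaluated between bounded bites rather than computed once up front — can be sketched as a userspace toy model. Everything below (the page size, the accounting rule, the function name) is illustrative and deliberately simplified; it is not the kernel's actual zone-based arithmetic:

```c
#include <assert.h>

#define SHRINK_BITE 10000	/* pages per pass; ~40MB with 4kB pages */

/*
 * Toy model: every saveable page needs a free page to hold its copy,
 * and each pass may preallocate at most SHRINK_BITE pages.  A page
 * preallocated for the image both adds to the room available and stops
 * being saveable itself, so the deficit shrinks by up to 2*SHRINK_BITE
 * per pass and must be recomputed on every iteration -- doing one huge
 * allocation against the initial deficit would overshoot.
 */
static int passes_to_converge(long saveable, long free_pages)
{
	int passes = 0;

	for (;;) {
		long deficit = saveable - free_pages;
		long bite;

		if (deficit <= 0)
			break;
		bite = deficit < SHRINK_BITE ? deficit : SHRINK_BITE;
		free_pages += bite;	/* room for copies */
		saveable -= bite;	/* no longer needs copying */
		passes++;
	}
	return passes;
}

/*
 * With Wu's example figures (8GB of saveable pages, 500MB initially
 * free, 4kB pages) this toy model converges in about a hundred passes.
 */
```

The real loop additionally re-derives the target from per-zone free-page statistics on each iteration, which this model abstracts away.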
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 3/4] PM/Hibernate: Use memory allocations to free memory (rev. 2)
2009-05-03 16:22 ` Rafael J. Wysocki
@ 2009-05-04 9:31 ` Pavel Machek
-1 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 9:31 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
On Sun 2009-05-03 18:22:54, Rafael J. Wysocki wrote:
> On Sunday 03 May 2009, Wu Fengguang wrote:
> > On Sun, May 03, 2009 at 02:24:20AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Modify the hibernation memory shrinking code so that it will make
> > > memory allocations to free memory instead of using an artificial
> > > memory shrinking mechanism for that. Remove the shrinking of
> > > memory from the suspend-to-RAM code, where it is not really
> > > necessary. Finally, remove the no longer used memory shrinking
> > > functions from mm/vmscan.c.
> > >
> > > [rev. 2: Use the existing memory bitmaps for marking preallocated
> > > image pages and use swsusp_free() for releasing them, introduce
> > > GFP_IMAGE, add comments describing the memory shrinking strategy.]
> > >
> > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > > ---
> > > kernel/power/main.c | 20 ------
> > > kernel/power/snapshot.c | 132 +++++++++++++++++++++++++++++++++-----------
> > > mm/vmscan.c | 142 ------------------------------------------------
> > > 3 files changed, 101 insertions(+), 193 deletions(-)
> > >
> > > Index: linux-2.6/kernel/power/snapshot.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/power/snapshot.c
> > > +++ linux-2.6/kernel/power/snapshot.c
> > > @@ -1066,41 +1066,97 @@ void swsusp_free(void)
> > > buffer = NULL;
> > > }
> > >
> > > +/* Helper functions used for the shrinking of memory. */
> > > +
> > > +#ifdef CONFIG_HIGHMEM
> > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_NO_OOM_KILL)
> > > +#else
> > > +#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
> > > +#endif
> >
> > The CONFIG_HIGHMEM test is not necessary: __GFP_HIGHMEM is always defined.
> >
> > > +#define SHRINK_BITE 10000
> >
> > This is ~40MB. A full scan of (for example) 8G pages will be time
> > consuming, not to mention we have to do it 2*(8G-500M)/40M = 384 times!
> >
> > Can we make it a LONG_MAX?
>
> No, I don't think so. The problem is the number of pages we'll need to copy
> is generally shrinking as we allocate memory, so we can't do that in one shot.
>
> We can make it a greater number, but I don't really think it would be a good
> idea to make it greater than 100 MB.
Well, even 100MB is quite big: on 128MB machine, that will probably
mean freeing all the memory (instead of "as much as needed"). And that
memory needs to go to disk, so it will be slow.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* [PATCH 4/4] PM/Hibernate: Do not release preallocated memory unnecessarily
2009-05-03 0:20 ` Rafael J. Wysocki
` (6 preceding siblings ...)
@ 2009-05-03 0:25 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:25 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds
From: Rafael J. Wysocki <rjw@sisk.pl>
Since the hibernation code is now going to use allocations of memory
to create enough room for the image, it can also use the page frames
allocated at this stage as image page frames. The low-level
hibernation code needs to be rearranged for this purpose, but it
allows us to avoid freeing a great number of pages and allocating
these same pages once again later, so it generally is worth doing.
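The rearrangement can be pictured with a small userspace sketch. The fixed-size pool, the names, and the malloc-backed "pages" are all illustrative stand-ins, not the kernel's bitmap-tracked page frames: pages grabbed while making room are kept and handed out as image pages, instead of being freed and then reallocated one by one:

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MAX  64
#define PAGE_SIZE 4096

/* Pages preallocated while "making room"; kept instead of freed. */
static void *pool[POOL_MAX];
static int pool_used;

/* Phase 1: preallocate up to n pages and remember each one. */
static int preallocate(int n)
{
	int got = 0;

	while (got < n && pool_used < POOL_MAX) {
		void *page = malloc(PAGE_SIZE);

		if (!page)
			break;
		pool[pool_used++] = page;
		got++;
	}
	return got;
}

/* Phase 2: image pages come from the pool first, fresh only if empty. */
static void *get_image_page(void)
{
	if (pool_used > 0)
		return pool[--pool_used];
	return malloc(PAGE_SIZE);
}
```

In the patch itself the same effect is achieved by marking preallocated frames in the existing copy_bm bitmap and teaching swsusp_alloc() to count them before allocating anything new.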
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/disk.c | 15 +++-
kernel/power/power.h | 2
kernel/power/snapshot.c | 151 +++++++++++++++++++++++-------------------------
3 files changed, 87 insertions(+), 81 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -783,21 +783,6 @@ void free_basic_memory_bitmaps(void)
pr_debug("PM: Basic memory bitmaps freed\n");
}
-/**
- * snapshot_additional_pages - estimate the number of additional pages
- * be needed for setting up the suspend image data structures for given
- * zone (usually the returned value is greater than the exact number)
- */
-
-unsigned int snapshot_additional_pages(struct zone *zone)
-{
- unsigned int res;
-
- res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
- res += DIV_ROUND_UP(res * sizeof(struct bm_block), PAGE_SIZE);
- return 2 * res;
-}
-
#ifdef CONFIG_HIGHMEM
/**
* count_free_highmem_pages - compute the total number of free highmem
@@ -1033,6 +1018,25 @@ copy_data_pages(struct memory_bitmap *co
static unsigned int nr_copy_pages;
/* Number of pages needed for saving the original pfns of the image pages */
static unsigned int nr_meta_pages;
+/*
+ * Numbers of normal and highmem page frames allocated for hibernation image
+ * before suspending devices.
+ */
+unsigned int alloc_normal, alloc_highmem;
+/*
+ * Memory bitmap used for marking saveable pages (during hibernation) or
+ * hibernation image pages (during restore)
+ */
+static struct memory_bitmap orig_bm;
+/*
+ * Memory bitmap used during hibernation for marking allocated page frames that
+ * will contain copies of saveable pages. During restore it is initially used
+ * for marking hibernation image pages, but then the set bits from it are
+ * duplicated in @orig_bm and it is released. On highmem systems it is next
+ * used for marking "safe" highmem pages, but it has to be reinitialized for
+ * this purpose.
+ */
+static struct memory_bitmap copy_bm;
/**
* swsusp_free - free pages allocated for the suspend.
@@ -1064,6 +1068,8 @@ void swsusp_free(void)
nr_meta_pages = 0;
restore_pblist = NULL;
buffer = NULL;
+ alloc_normal = 0;
+ alloc_highmem = 0;
}
/* Helper functions used for the shrinking of memory. */
@@ -1085,7 +1091,7 @@ void swsusp_free(void)
* Return value: The number of normal (ie. non-highmem) pages allocated or
* -ENOMEM on failure.
*/
-static long prealloc_pages(long nr_pages)
+static long prealloc_pages(struct memory_bitmap *bm, long nr_pages)
{
long nr_normal = 0;
@@ -1095,6 +1101,7 @@ static long prealloc_pages(long nr_pages
page = alloc_image_page(GFP_IMAGE);
if (!page)
return -ENOMEM;
+ memory_bm_set_bit(bm, page_to_pfn(page));
if (!PageHighMem(page))
nr_normal++;
}
@@ -1103,7 +1110,7 @@ static long prealloc_pages(long nr_pages
}
/**
- * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ * hibernate_preallocate_memory - Preallocate memory for hibernation image
*
* To create a hibernation image it is necessary to make a copy of every page
* frame in use. We also need a number of page frames to be free during
@@ -1127,17 +1134,29 @@ static long prealloc_pages(long nr_pages
* the preallocation of memory is continued until the total number of page
* frames in use is below the requested image size.
*/
-int swsusp_shrink_memory(void)
+int hibernate_preallocate_memory(void)
{
- unsigned long pages = 0, alloc_normal = 0, alloc_highmem = 0;
+ unsigned long pages = 0;
unsigned int i = 0;
char *p = "-\\|/";
struct timeval start, stop;
- int error = 0;
+ int error;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Preallocating image memory ... ");
do_gettimeofday(&start);
+ error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ alloc_normal = 0;
+ alloc_highmem = 0;
+ error = -ENOMEM;
+
for (;;) {
struct zone *zone;
long size, highmem_size, tmp, ret;
@@ -1158,7 +1177,6 @@ int swsusp_shrink_memory(void)
* creation.
*/
for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
if (is_highmem(zone)) {
highmem_size -=
zone_page_state(zone, NR_FREE_PAGES);
@@ -1179,11 +1197,9 @@ int swsusp_shrink_memory(void)
else if (tmp <= 0)
break;
- ret = prealloc_pages(tmp);
- if (ret < 0) {
- error = -ENOMEM;
- goto out;
- }
+ ret = prealloc_pages(&copy_bm, tmp);
+ if (ret < 0)
+ goto err_out;
alloc_normal += ret;
alloc_highmem += tmp - ret;
pages += tmp;
@@ -1192,13 +1208,13 @@ int swsusp_shrink_memory(void)
}
do_gettimeofday(&stop);
- printk("\bdone (preallocated %lu free pages)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
+ printk("\bdone (allocated %lu image pages)\n", pages);
+ swsusp_show_speed(&start, &stop, pages, "Allocated");
- out:
- /* Release the preallocated page frames. */
- swsusp_free();
+ return 0;
+ err_out:
+ swsusp_free();
return error;
}
@@ -1210,7 +1226,7 @@ int swsusp_shrink_memory(void)
static unsigned int count_pages_for_highmem(unsigned int nr_highmem)
{
- unsigned int free_highmem = count_free_highmem_pages();
+ unsigned int free_highmem = count_free_highmem_pages() + alloc_highmem;
if (free_highmem >= nr_highmem)
nr_highmem = 0;
@@ -1232,19 +1248,17 @@ count_pages_for_highmem(unsigned int nr_
static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
{
struct zone *zone;
- unsigned int free = 0, meta = 0;
+ unsigned int free = alloc_normal;
- for_each_zone(zone) {
- meta += snapshot_additional_pages(zone);
+ for_each_zone(zone)
if (!is_highmem(zone))
free += zone_page_state(zone, NR_FREE_PAGES);
- }
nr_pages += count_pages_for_highmem(nr_highmem);
- pr_debug("PM: Normal pages needed: %u + %u + %u, available pages: %u\n",
- nr_pages, PAGES_FOR_IO, meta, free);
+ pr_debug("PM: Normal pages needed: %u + %u, available pages: %u\n",
+ nr_pages, PAGES_FOR_IO, free);
- return free > nr_pages + PAGES_FOR_IO + meta;
+ return free > nr_pages + PAGES_FOR_IO;
}
#ifdef CONFIG_HIGHMEM
@@ -1266,7 +1280,7 @@ static inline int get_highmem_buffer(int
*/
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
{
unsigned int to_alloc = count_free_highmem_pages();
@@ -1277,7 +1291,7 @@ alloc_highmem_image_pages(struct memory_
while (to_alloc-- > 0) {
struct page *page;
- page = alloc_image_page(__GFP_HIGHMEM);
+ page = alloc_image_page(__GFP_HIGHMEM | __GFP_NO_OOM_KILL);
memory_bm_set_bit(bm, page_to_pfn(page));
}
return nr_highmem;
@@ -1286,7 +1300,7 @@ alloc_highmem_image_pages(struct memory_
static inline int get_highmem_buffer(int safe_needed) { return 0; }
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
#endif /* CONFIG_HIGHMEM */
/**
@@ -1305,51 +1319,36 @@ static int
swsusp_alloc(struct memory_bitmap *orig_bm, struct memory_bitmap *copy_bm,
unsigned int nr_pages, unsigned int nr_highmem)
{
- int error;
-
- error = memory_bm_create(orig_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
-
- error = memory_bm_create(copy_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
+ int error = 0;
if (nr_highmem > 0) {
error = get_highmem_buffer(PG_ANY);
if (error)
- goto Free;
-
- nr_pages += alloc_highmem_image_pages(copy_bm, nr_highmem);
+ goto err_out;
+ if (nr_highmem > alloc_highmem) {
+ nr_highmem -= alloc_highmem;
+ nr_pages += alloc_highmem_pages(copy_bm, nr_highmem);
+ }
}
- while (nr_pages-- > 0) {
- struct page *page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
-
- if (!page)
- goto Free;
+ if (nr_pages > alloc_normal) {
+ nr_pages -= alloc_normal;
+ while (nr_pages-- > 0) {
+ struct page *page;
- memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
+ if (!page)
+ goto err_out;
+ memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ }
}
+
return 0;
- Free:
+ err_out:
swsusp_free();
- return -ENOMEM;
+ return error;
}
-/* Memory bitmap used for marking saveable pages (during suspend) or the
- * suspend image pages (during resume)
- */
-static struct memory_bitmap orig_bm;
-/* Memory bitmap used on suspend for marking allocated pages that will contain
- * the copies of saveable pages. During resume it is initially used for
- * marking the suspend image pages, but then its set bits are duplicated in
- * @orig_bm and it is released. Next, on systems with high memory, it may be
- * used for marking "safe" highmem pages, but it has to be reinitialized for
- * this purpose.
- */
-static struct memory_bitmap copy_bm;
-
asmlinkage int swsusp_save(void)
{
unsigned int nr_pages, nr_highmem;
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern int swsusp_shrink_memory(void);
+extern int hibernate_preallocate_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -303,8 +303,8 @@ int hibernation_snapshot(int platform_mo
if (error)
return error;
- /* Free memory before shutting down devices. */
- error = swsusp_shrink_memory();
+ /* Preallocate image memory before shutting down devices. */
+ error = hibernate_preallocate_memory();
if (error)
goto Close;
@@ -320,6 +320,10 @@ int hibernation_snapshot(int platform_mo
/* Control returns here after successful restore */
Resume_devices:
+ /* We may need to release the preallocated image pages here. */
+ if (error || !in_suspend)
+ swsusp_free();
+
device_resume(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
resume_console();
@@ -593,7 +597,10 @@ int hibernate(void)
goto Thaw;
error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
- if (in_suspend && !error) {
+ if (error)
+ goto Thaw;
+
+ if (in_suspend) {
unsigned int flags = 0;
if (hibernation_mode == HIBERNATION_PLATFORM)
@@ -605,8 +612,8 @@ int hibernate(void)
power_down();
} else {
pr_debug("PM: Image restored successfully.\n");
- swsusp_free();
}
+
Thaw:
thaw_processes();
Finish:
* [PATCH 4/4] PM/Hibernate: Do not release preallocated memory unnecessarily
@ 2009-05-03 0:25 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 0:25 UTC (permalink / raw)
To: Andrew Morton
Cc: pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
From: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
Since the hibernation code is now going to use memory allocations to
create enough room for the image, it can also use the page frames
allocated at this stage as image page frames. The low-level
hibernation code needs to be rearranged for this purpose, but that
allows us to avoid freeing a great number of pages and then allocating
those same pages again later, so it is generally worth doing.
Signed-off-by: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
---
kernel/power/disk.c | 15 +++-
kernel/power/power.h | 2
kernel/power/snapshot.c | 151 +++++++++++++++++++++++-------------------------
3 files changed, 87 insertions(+), 81 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -783,21 +783,6 @@ void free_basic_memory_bitmaps(void)
pr_debug("PM: Basic memory bitmaps freed\n");
}
-/**
- * snapshot_additional_pages - estimate the number of additional pages
- * be needed for setting up the suspend image data structures for given
- * zone (usually the returned value is greater than the exact number)
- */
-
-unsigned int snapshot_additional_pages(struct zone *zone)
-{
- unsigned int res;
-
- res = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
- res += DIV_ROUND_UP(res * sizeof(struct bm_block), PAGE_SIZE);
- return 2 * res;
-}
-
#ifdef CONFIG_HIGHMEM
/**
* count_free_highmem_pages - compute the total number of free highmem
@@ -1033,6 +1018,25 @@ copy_data_pages(struct memory_bitmap *co
static unsigned int nr_copy_pages;
/* Number of pages needed for saving the original pfns of the image pages */
static unsigned int nr_meta_pages;
+/*
+ * Numbers of normal and highmem page frames allocated for hibernation image
+ * before suspending devices.
+ */
+unsigned int alloc_normal, alloc_highmem;
+/*
+ * Memory bitmap used for marking saveable pages (during hibernation) or
+ * hibernation image pages (during restore)
+ */
+static struct memory_bitmap orig_bm;
+/*
+ * Memory bitmap used during hibernation for marking allocated page frames that
+ * will contain copies of saveable pages. During restore it is initially used
+ * for marking hibernation image pages, but then the set bits from it are
+ * duplicated in @orig_bm and it is released. On highmem systems it is next
+ * used for marking "safe" highmem pages, but it has to be reinitialized for
+ * this purpose.
+ */
+static struct memory_bitmap copy_bm;
/**
* swsusp_free - free pages allocated for the suspend.
@@ -1064,6 +1068,8 @@ void swsusp_free(void)
nr_meta_pages = 0;
restore_pblist = NULL;
buffer = NULL;
+ alloc_normal = 0;
+ alloc_highmem = 0;
}
/* Helper functions used for the shrinking of memory. */
@@ -1085,7 +1091,7 @@ void swsusp_free(void)
* Return value: The number of normal (ie. non-highmem) pages allocated or
* -ENOMEM on failure.
*/
-static long prealloc_pages(long nr_pages)
+static long prealloc_pages(struct memory_bitmap *bm, long nr_pages)
{
long nr_normal = 0;
@@ -1095,6 +1101,7 @@ static long prealloc_pages(long nr_pages
page = alloc_image_page(GFP_IMAGE);
if (!page)
return -ENOMEM;
+ memory_bm_set_bit(bm, page_to_pfn(page));
if (!PageHighMem(page))
nr_normal++;
}
@@ -1103,7 +1110,7 @@ static long prealloc_pages(long nr_pages
}
/**
- * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ * hibernate_preallocate_memory - Preallocate memory for hibernation image
*
* To create a hibernation image it is necessary to make a copy of every page
* frame in use. We also need a number of page frames to be free during
@@ -1127,17 +1134,29 @@ static long prealloc_pages(long nr_pages
* the preallocation of memory is continued until the total number of page
* frames in use is below the requested image size.
*/
-int swsusp_shrink_memory(void)
+int hibernate_preallocate_memory(void)
{
- unsigned long pages = 0, alloc_normal = 0, alloc_highmem = 0;
+ unsigned long pages = 0;
unsigned int i = 0;
char *p = "-\\|/";
struct timeval start, stop;
- int error = 0;
+ int error;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Preallocating image memory ... ");
do_gettimeofday(&start);
+ error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ alloc_normal = 0;
+ alloc_highmem = 0;
+ error = -ENOMEM;
+
for (;;) {
struct zone *zone;
long size, highmem_size, tmp, ret;
@@ -1158,7 +1177,6 @@ int swsusp_shrink_memory(void)
* creation.
*/
for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
if (is_highmem(zone)) {
highmem_size -=
zone_page_state(zone, NR_FREE_PAGES);
@@ -1179,11 +1197,9 @@ int swsusp_shrink_memory(void)
else if (tmp <= 0)
break;
- ret = prealloc_pages(tmp);
- if (ret < 0) {
- error = -ENOMEM;
- goto out;
- }
+ ret = prealloc_pages(&copy_bm, tmp);
+ if (ret < 0)
+ goto err_out;
alloc_normal += ret;
alloc_highmem += tmp - ret;
pages += tmp;
@@ -1192,13 +1208,13 @@ int swsusp_shrink_memory(void)
}
do_gettimeofday(&stop);
- printk("\bdone (preallocated %lu free pages)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
+ printk("\bdone (allocated %lu image pages)\n", pages);
+ swsusp_show_speed(&start, &stop, pages, "Allocated");
- out:
- /* Release the preallocated page frames. */
- swsusp_free();
+ return 0;
+ err_out:
+ swsusp_free();
return error;
}
@@ -1210,7 +1226,7 @@ int swsusp_shrink_memory(void)
static unsigned int count_pages_for_highmem(unsigned int nr_highmem)
{
- unsigned int free_highmem = count_free_highmem_pages();
+ unsigned int free_highmem = count_free_highmem_pages() + alloc_highmem;
if (free_highmem >= nr_highmem)
nr_highmem = 0;
@@ -1232,19 +1248,17 @@ count_pages_for_highmem(unsigned int nr_
static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
{
struct zone *zone;
- unsigned int free = 0, meta = 0;
+ unsigned int free = alloc_normal;
- for_each_zone(zone) {
- meta += snapshot_additional_pages(zone);
+ for_each_zone(zone)
if (!is_highmem(zone))
free += zone_page_state(zone, NR_FREE_PAGES);
- }
nr_pages += count_pages_for_highmem(nr_highmem);
- pr_debug("PM: Normal pages needed: %u + %u + %u, available pages: %u\n",
- nr_pages, PAGES_FOR_IO, meta, free);
+ pr_debug("PM: Normal pages needed: %u + %u, available pages: %u\n",
+ nr_pages, PAGES_FOR_IO, free);
- return free > nr_pages + PAGES_FOR_IO + meta;
+ return free > nr_pages + PAGES_FOR_IO;
}
#ifdef CONFIG_HIGHMEM
@@ -1266,7 +1280,7 @@ static inline int get_highmem_buffer(int
*/
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
{
unsigned int to_alloc = count_free_highmem_pages();
@@ -1277,7 +1291,7 @@ alloc_highmem_image_pages(struct memory_
while (to_alloc-- > 0) {
struct page *page;
- page = alloc_image_page(__GFP_HIGHMEM);
+ page = alloc_image_page(__GFP_HIGHMEM | __GFP_NO_OOM_KILL);
memory_bm_set_bit(bm, page_to_pfn(page));
}
return nr_highmem;
@@ -1286,7 +1300,7 @@ alloc_highmem_image_pages(struct memory_
static inline int get_highmem_buffer(int safe_needed) { return 0; }
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
#endif /* CONFIG_HIGHMEM */
/**
@@ -1305,51 +1319,36 @@ static int
swsusp_alloc(struct memory_bitmap *orig_bm, struct memory_bitmap *copy_bm,
unsigned int nr_pages, unsigned int nr_highmem)
{
- int error;
-
- error = memory_bm_create(orig_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
-
- error = memory_bm_create(copy_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
+ int error = 0;
if (nr_highmem > 0) {
error = get_highmem_buffer(PG_ANY);
if (error)
- goto Free;
-
- nr_pages += alloc_highmem_image_pages(copy_bm, nr_highmem);
+ goto err_out;
+ if (nr_highmem > alloc_highmem) {
+ nr_highmem -= alloc_highmem;
+ nr_pages += alloc_highmem_pages(copy_bm, nr_highmem);
+ }
}
- while (nr_pages-- > 0) {
- struct page *page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
-
- if (!page)
- goto Free;
+ if (nr_pages > alloc_normal) {
+ nr_pages -= alloc_normal;
+ while (nr_pages-- > 0) {
+ struct page *page;
- memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
+ if (!page)
+ goto err_out;
+ memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ }
}
+
return 0;
- Free:
+ err_out:
swsusp_free();
- return -ENOMEM;
+ return error;
}
-/* Memory bitmap used for marking saveable pages (during suspend) or the
- * suspend image pages (during resume)
- */
-static struct memory_bitmap orig_bm;
-/* Memory bitmap used on suspend for marking allocated pages that will contain
- * the copies of saveable pages. During resume it is initially used for
- * marking the suspend image pages, but then its set bits are duplicated in
- * @orig_bm and it is released. Next, on systems with high memory, it may be
- * used for marking "safe" highmem pages, but it has to be reinitialized for
- * this purpose.
- */
-static struct memory_bitmap copy_bm;
-
asmlinkage int swsusp_save(void)
{
unsigned int nr_pages, nr_highmem;
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern int swsusp_shrink_memory(void);
+extern int hibernate_preallocate_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -303,8 +303,8 @@ int hibernation_snapshot(int platform_mo
if (error)
return error;
- /* Free memory before shutting down devices. */
- error = swsusp_shrink_memory();
+ /* Preallocate image memory before shutting down devices. */
+ error = hibernate_preallocate_memory();
if (error)
goto Close;
@@ -320,6 +320,10 @@ int hibernation_snapshot(int platform_mo
/* Control returns here after successful restore */
Resume_devices:
+ /* We may need to release the preallocated image pages here. */
+ if (error || !in_suspend)
+ swsusp_free();
+
device_resume(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
resume_console();
@@ -593,7 +597,10 @@ int hibernate(void)
goto Thaw;
error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
- if (in_suspend && !error) {
+ if (error)
+ goto Thaw;
+
+ if (in_suspend) {
unsigned int flags = 0;
if (hibernation_mode == HIBERNATION_PLATFORM)
@@ -605,8 +612,8 @@ int hibernate(void)
power_down();
} else {
pr_debug("PM: Image restored successfully.\n");
- swsusp_free();
}
+
Thaw:
thaw_processes();
Finish:
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
2009-05-03 0:20 ` Rafael J. Wysocki
@ 2009-05-03 13:08 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-03 13:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds, linux-pm
Hi Rafael,
I happened to be doing some benchmarks on the older shrink_all_memory(),
and hopefully they can be a useful reference point for the new design.
The current swsusp_shrink_memory()/shrink_all_memory() path is terribly
inefficient: it takes 7-9 seconds to free up 1.4 GB of memory:
[ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
[ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
Below are the logs I collected by injecting printks. There are
basically two major problems:
- swsusp_shrink_memory() scans the whole 2 GB of memory again and again;
- shrink_all_memory() is slow. It won't reclaim pages at all with
small priority values, because its batch size is 10000 pages.
I wonder if it's possible to free up the memory within 1s at all.
(Maybe the slowness is due to too many enabled debugging options...)
Thanks,
Fengguang
---
vanilla 2.6.30-rc2-next-20090417:
[ 124.516187] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 124.523087] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 124.530060] PM: Basic memory bitmaps created
[ 124.534421] PM: Syncing filesystems ... done.
[ 124.842282] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 124.849800] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 124.857571] PM: Shrinking memory... tmp=471584, size=491906, highmem_size=0
[ 124.939103] shrink_all_memory: pages=10000
[ 125.019543] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.027636]-tmp=451770, size=481986, highmem_size=0
[ 125.107571] shrink_all_memory: pages=10000
[ 125.139928] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=8500
[ 125.280940] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=1500, reclaimed=1500
[ 125.547990] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.556135]\tmp=411598, size=461898, highmem_size=0
[ 125.637414] shrink_all_memory: pages=10000
[ 125.716890] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=10000
[ 125.725092]|tmp=391507, size=451854, highmem_size=0
[ 125.806935] shrink_all_memory: pages=10000
[ 125.886317] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.894531]/tmp=371481, size=441841, highmem_size=0
[ 125.976823] shrink_all_memory: pages=10000
[ 126.104367] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 126.112572]-tmp=351715, size=431952, highmem_size=0
[ 126.195178] shrink_all_memory: pages=10000
[ 126.274586] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.282698]\tmp=331949, size=422063, highmem_size=0
[ 126.365743] shrink_all_memory: pages=10000
[ 126.445851] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.453968]|tmp=311858, size=412019, highmem_size=0
[ 126.537417] shrink_all_memory: pages=10000
[ 126.616980] shrink_all_zones: pass=0, prio=9, lru=Normal.2, pages=10000, reclaimed=10000
[ 126.625180]/tmp=291751, size=401975, highmem_size=0
[ 126.709066] shrink_all_memory: pages=10000
[ 126.788665] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.796833]-tmp=271725, size=391962, highmem_size=0
[ 126.880997] shrink_all_memory: pages=10000
[ 127.008443] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.016667]\tmp=251716, size=381949, highmem_size=0
[ 127.101581] shrink_all_memory: pages=10000
[ 127.181588] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.189728]|tmp=231673, size=371936, highmem_size=0
[ 127.275105] shrink_all_memory: pages=10000
[ 127.354799] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.363003]/tmp=211599, size=361892, highmem_size=0
[ 127.448750] shrink_all_memory: pages=10000
[ 127.528252] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.536385]-tmp=191621, size=351910, highmem_size=0
[ 127.622369] shrink_all_memory: pages=10000
[ 127.750093] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.758295]\tmp=171539, size=341866, highmem_size=0
[ 127.844867] shrink_all_memory: pages=10000
[ 127.925614] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.933758]|tmp=151465, size=331822, highmem_size=0
[ 128.020878] shrink_all_memory: pages=10000
[ 128.100580] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.108803]/tmp=131391, size=321778, highmem_size=0
[ 128.196312] shrink_all_memory: pages=10000
[ 128.275643] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 128.283769]-tmp=111413, size=311796, highmem_size=0
[ 128.371814] shrink_all_memory: pages=10000
[ 128.501803] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.510007]\tmp=91339, size=301752, highmem_size=0
[ 128.597726] shrink_all_memory: pages=10000
[ 128.677138] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 128.685277]|tmp=71296, size=291739, highmem_size=0
[ 128.774061] shrink_all_memory: pages=10000
[ 128.855940] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.864145]/tmp=51259, size=281726, highmem_size=0
[ 128.953486] shrink_all_memory: pages=10000
[ 129.033417] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.041553]-tmp=31172, size=271682, highmem_size=0
[ 129.131233] shrink_all_memory: pages=10000
[ 129.210693] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.218994]\tmp=11146, size=261669, highmem_size=0
[ 129.309142] shrink_all_memory: pages=10000
[ 129.388523] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.396648]|tmp=-8880, size=251656, highmem_size=0
[ 129.487193] shrink_all_memory: pages=10000
[ 129.614831] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.623059]/tmp=-28954, size=241612, highmem_size=0
[ 129.714055] shrink_all_memory: pages=10000
[ 129.794104] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.802246]-tmp=-48932, size=231630, highmem_size=0
[ 129.893893] shrink_all_memory: pages=10000
[ 129.973667] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.981892]\tmp=-69020, size=221586, highmem_size=0
[ 130.073916] shrink_all_memory: pages=10000
[ 130.154620] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.162853]|tmp=-89156, size=211511, highmem_size=0
[ 130.255274] shrink_all_memory: pages=10000
[ 130.334612] shrink_all_zones: pass=0, prio=8, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.342750]/tmp=-109182, size=201498, highmem_size=0
[ 130.435551] shrink_all_memory: pages=10000
[ 130.515074] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.523305]-tmp=-129273, size=191454, highmem_size=0
[ 130.616714] shrink_all_memory: pages=10000
[ 130.696350] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.704490]\tmp=-149299, size=181441, highmem_size=0
[ 130.798322] shrink_all_memory: pages=10000
[ 130.877834] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.886038]|tmp=-169325, size=171428, highmem_size=0
[ 130.980312] shrink_all_memory: pages=10000
[ 131.107844] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.115982]/tmp=-189351, size=161415, highmem_size=0
[ 131.210530] shrink_all_memory: pages=10000
[ 131.291223] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.299433]-tmp=-209459, size=151371, highmem_size=0
[ 131.394488] shrink_all_memory: pages=10000
[ 131.474123] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.482344]\tmp=-229420, size=141389, highmem_size=0
[ 131.577910] shrink_all_memory: pages=10000
[ 131.657376] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.665498]|tmp=-249511, size=131345, highmem_size=0
[ 131.761676] shrink_all_memory: pages=3345
[ 131.791048] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=3345, reclaimed=3345
[ 131.799085]/tmp=-256256, size=127966, highmem_size=0
[ 131.895290]done (353345 pages freed)
[ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
1/30 memory being mapped, vanilla 2.6.30-rc2-next-20090417:
AnonPages: 38684 kB
Mapped: 66940 kB
[ 722.944082] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 722.956215] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 722.963053] PM: Basic memory bitmaps created
[ 722.967365] PM: Syncing filesystems ... done.
[ 723.361274] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 723.369310] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 723.377342] PM: Shrinking memory... tmp=508165, size=510179, highmem_size=0
[ 723.563602] shrink_all_memory: pages=10000
[ 723.648921] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9766
[ 723.733064] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19581
[ 723.741225]-tmp=468972, size=490587, highmem_size=0
[ 723.821406] shrink_all_memory: pages=10000
[ 723.902912] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9804
[ 723.987433] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19617
[ 723.995565]\tmp=429714, size=470964, highmem_size=0
[ 724.077458] shrink_all_memory: pages=10000
[ 724.160394] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9808
[ 724.261056] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19610
[ 724.269482]|tmp=390489, size=451341, highmem_size=0
[ 724.353672] shrink_all_memory: pages=10000
[ 724.556153] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9806
[ 724.669591] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19636
[ 724.677770]/tmp=351365, size=431780, highmem_size=0
[ 724.762188] shrink_all_memory: pages=10000
[ 724.923372] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9805
[ 725.037897] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19620
[ 725.046501]-tmp=312189, size=412188, highmem_size=0
[ 725.133452] shrink_all_memory: pages=10000
[ 725.371199] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9781
[ 725.519983] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19585
[ 725.528233]\tmp=273061, size=392627, highmem_size=0
[ 725.616020] shrink_all_memory: pages=10000
[ 725.801211] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 725.954523] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19617
[ 725.962685]|tmp=233885, size=373035, highmem_size=0
[ 726.051775] shrink_all_memory: pages=10000
[ 726.296589] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 726.449150] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19682
[ 726.457342]/tmp=194507, size=353350, highmem_size=0
[ 726.548053] shrink_all_memory: pages=10000
[ 726.759180] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9803
[ 726.940475] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19561
[ 726.948638]-tmp=155396, size=333789, highmem_size=0
[ 727.040362] shrink_all_memory: pages=10000
[ 727.257478] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.442356] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19544
[ 727.450548]\tmp=116319, size=314259, highmem_size=0
[ 727.543609] shrink_all_memory: pages=10000
[ 727.755346] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.894707] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19555
[ 727.902910]|tmp=77256, size=294729, highmem_size=0
[ 727.997018] shrink_all_memory: pages=10000
[ 728.170973] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9799
[ 728.332426] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19545
[ 728.341152]/tmp=38210, size=275199, highmem_size=0
[ 728.437625] shrink_all_memory: pages=10000
[ 728.673862] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9773
[ 728.812572] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19517
[ 728.820738]-tmp=-852, size=255669, highmem_size=0
[ 728.917360] shrink_all_memory: pages=10000
[ 729.110178] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9839
[ 729.266243] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19610
[ 729.274407]\tmp=-40045, size=236077, highmem_size=0
[ 729.372371] shrink_all_memory: pages=10000
[ 729.553743] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=9741
[ 729.673174] shrink_all_zones: pass=0, prio=3, lru=DMA.2, pages=259, reclaimed=256
[ 729.681224] shrink_all_zones: pass=0, prio=3, lru=DMA32.0, pages=259, reclaimed=256
[ 729.693997] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=259, reclaimed=513
[ 730.006423] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=9487, reclaimed=9296
[ 730.177563] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=9487, reclaimed=18626
[ 730.185640]|tmp=-98138, size=207022, highmem_size=0
[ 730.285280] shrink_all_memory: pages=10000
[ 730.484499] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9807
[ 730.637792] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19613
[ 730.645975]/tmp=-137343, size=187430, highmem_size=0
[ 730.746709] shrink_all_memory: pages=10000
[ 730.754374] shrink_all_zones: pass=0, prio=5, lru=Normal.0, pages=10000, reclaimed=0
[ 731.101101] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9777
[ 731.257243] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19582
[ 731.265411]-tmp=-176567, size=167807, highmem_size=0
[ 731.367111] shrink_all_memory: pages=10000
[ 731.567779] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9811
[ 731.803019] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19615
[ 731.811189]\tmp=-215837, size=148184, highmem_size=0
[ 731.913738] shrink_all_memory: pages=10000
[ 732.123893] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=10000, reclaimed=9808
[ 732.312075] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=10000, reclaimed=19580
[ 732.320234]|tmp=-254948, size=128623, highmem_size=0
[ 732.423776] shrink_all_memory: pages=623
[ 732.432862] shrink_all_zones: pass=0, prio=12, lru=DMA.2, pages=623, reclaimed=617
[ 732.441782] shrink_all_zones: pass=0, prio=12, lru=Normal.0, pages=623, reclaimed=617
[ 732.453341] shrink_all_zones: pass=0, prio=11, lru=DMA.0, pages=6, reclaimed=0
[ 732.460712] shrink_all_zones: pass=0, prio=11, lru=DMA32.0, pages=6, reclaimed=0
[ 732.468390] shrink_all_zones: pass=0, prio=11, lru=DMA32.2, pages=6, reclaimed=6
[ 732.488091] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=623, reclaimed=617
[ 732.508256] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=623, reclaimed=1233
[ 732.516202]/tmp=-258774, size=126704, highmem_size=0
[ 732.753869]done (372529 pages freed)
[ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
^ permalink raw reply [flat|nested] 580+ messages in thread
[ 129.396648]|tmp=-8880, size=251656, highmem_size=0
[ 129.487193] shrink_all_memory: pages=10000
[ 129.614831] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.623059]/tmp=-28954, size=241612, highmem_size=0
[ 129.714055] shrink_all_memory: pages=10000
[ 129.794104] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.802246]-tmp=-48932, size=231630, highmem_size=0
[ 129.893893] shrink_all_memory: pages=10000
[ 129.973667] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.981892]\tmp=-69020, size=221586, highmem_size=0
[ 130.073916] shrink_all_memory: pages=10000
[ 130.154620] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.162853]|tmp=-89156, size=211511, highmem_size=0
[ 130.255274] shrink_all_memory: pages=10000
[ 130.334612] shrink_all_zones: pass=0, prio=8, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.342750]/tmp=-109182, size=201498, highmem_size=0
[ 130.435551] shrink_all_memory: pages=10000
[ 130.515074] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.523305]-tmp=-129273, size=191454, highmem_size=0
[ 130.616714] shrink_all_memory: pages=10000
[ 130.696350] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.704490]\tmp=-149299, size=181441, highmem_size=0
[ 130.798322] shrink_all_memory: pages=10000
[ 130.877834] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.886038]|tmp=-169325, size=171428, highmem_size=0
[ 130.980312] shrink_all_memory: pages=10000
[ 131.107844] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.115982]/tmp=-189351, size=161415, highmem_size=0
[ 131.210530] shrink_all_memory: pages=10000
[ 131.291223] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.299433]-tmp=-209459, size=151371, highmem_size=0
[ 131.394488] shrink_all_memory: pages=10000
[ 131.474123] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.482344]\tmp=-229420, size=141389, highmem_size=0
[ 131.577910] shrink_all_memory: pages=10000
[ 131.657376] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.665498]|tmp=-249511, size=131345, highmem_size=0
[ 131.761676] shrink_all_memory: pages=3345
[ 131.791048] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=3345, reclaimed=3345
[ 131.799085]/tmp=-256256, size=127966, highmem_size=0
[ 131.895290]done (353345 pages freed)
[ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
1/30 memory being mapped, vanilla 2.6.30-rc2-next-20090417:
AnonPages: 38684 kB
Mapped: 66940 kB
[ 722.944082] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 722.956215] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 722.963053] PM: Basic memory bitmaps created
[ 722.967365] PM: Syncing filesystems ... done.
[ 723.361274] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 723.369310] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 723.377342] PM: Shrinking memory... tmp=508165, size=510179, highmem_size=0
[ 723.563602] shrink_all_memory: pages=10000
[ 723.648921] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9766
[ 723.733064] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19581
[ 723.741225]-tmp=468972, size=490587, highmem_size=0
[ 723.821406] shrink_all_memory: pages=10000
[ 723.902912] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9804
[ 723.987433] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19617
[ 723.995565]\tmp=429714, size=470964, highmem_size=0
[ 724.077458] shrink_all_memory: pages=10000
[ 724.160394] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9808
[ 724.261056] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19610
[ 724.269482]|tmp=390489, size=451341, highmem_size=0
[ 724.353672] shrink_all_memory: pages=10000
[ 724.556153] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9806
[ 724.669591] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19636
[ 724.677770]/tmp=351365, size=431780, highmem_size=0
[ 724.762188] shrink_all_memory: pages=10000
[ 724.923372] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9805
[ 725.037897] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19620
[ 725.046501]-tmp=312189, size=412188, highmem_size=0
[ 725.133452] shrink_all_memory: pages=10000
[ 725.371199] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9781
[ 725.519983] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19585
[ 725.528233]\tmp=273061, size=392627, highmem_size=0
[ 725.616020] shrink_all_memory: pages=10000
[ 725.801211] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 725.954523] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19617
[ 725.962685]|tmp=233885, size=373035, highmem_size=0
[ 726.051775] shrink_all_memory: pages=10000
[ 726.296589] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 726.449150] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19682
[ 726.457342]/tmp=194507, size=353350, highmem_size=0
[ 726.548053] shrink_all_memory: pages=10000
[ 726.759180] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9803
[ 726.940475] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19561
[ 726.948638]-tmp=155396, size=333789, highmem_size=0
[ 727.040362] shrink_all_memory: pages=10000
[ 727.257478] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.442356] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19544
[ 727.450548]\tmp=116319, size=314259, highmem_size=0
[ 727.543609] shrink_all_memory: pages=10000
[ 727.755346] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.894707] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19555
[ 727.902910]|tmp=77256, size=294729, highmem_size=0
[ 727.997018] shrink_all_memory: pages=10000
[ 728.170973] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9799
[ 728.332426] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19545
[ 728.341152]/tmp=38210, size=275199, highmem_size=0
[ 728.437625] shrink_all_memory: pages=10000
[ 728.673862] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9773
[ 728.812572] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19517
[ 728.820738]-tmp=-852, size=255669, highmem_size=0
[ 728.917360] shrink_all_memory: pages=10000
[ 729.110178] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9839
[ 729.266243] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19610
[ 729.274407]\tmp=-40045, size=236077, highmem_size=0
[ 729.372371] shrink_all_memory: pages=10000
[ 729.553743] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=9741
[ 729.673174] shrink_all_zones: pass=0, prio=3, lru=DMA.2, pages=259, reclaimed=256
[ 729.681224] shrink_all_zones: pass=0, prio=3, lru=DMA32.0, pages=259, reclaimed=256
[ 729.693997] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=259, reclaimed=513
[ 730.006423] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=9487, reclaimed=9296
[ 730.177563] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=9487, reclaimed=18626
[ 730.185640]|tmp=-98138, size=207022, highmem_size=0
[ 730.285280] shrink_all_memory: pages=10000
[ 730.484499] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9807
[ 730.637792] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19613
[ 730.645975]/tmp=-137343, size=187430, highmem_size=0
[ 730.746709] shrink_all_memory: pages=10000
[ 730.754374] shrink_all_zones: pass=0, prio=5, lru=Normal.0, pages=10000, reclaimed=0
[ 731.101101] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9777
[ 731.257243] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19582
[ 731.265411]-tmp=-176567, size=167807, highmem_size=0
[ 731.367111] shrink_all_memory: pages=10000
[ 731.567779] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9811
[ 731.803019] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19615
[ 731.811189]\tmp=-215837, size=148184, highmem_size=0
[ 731.913738] shrink_all_memory: pages=10000
[ 732.123893] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=10000, reclaimed=9808
[ 732.312075] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=10000, reclaimed=19580
[ 732.320234]|tmp=-254948, size=128623, highmem_size=0
[ 732.423776] shrink_all_memory: pages=623
[ 732.432862] shrink_all_zones: pass=0, prio=12, lru=DMA.2, pages=623, reclaimed=617
[ 732.441782] shrink_all_zones: pass=0, prio=12, lru=Normal.0, pages=623, reclaimed=617
[ 732.453341] shrink_all_zones: pass=0, prio=11, lru=DMA.0, pages=6, reclaimed=0
[ 732.460712] shrink_all_zones: pass=0, prio=11, lru=DMA32.0, pages=6, reclaimed=0
[ 732.468390] shrink_all_zones: pass=0, prio=11, lru=DMA32.2, pages=6, reclaimed=6
[ 732.488091] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=623, reclaimed=617
[ 732.508256] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=623, reclaimed=1233
[ 732.516202]/tmp=-258774, size=126704, highmem_size=0
[ 732.753869]done (372529 pages freed)
[ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
@ 2009-05-03 13:08 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-03 13:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, pavel-+ZI9xUNit7I,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Hi Rafael,
I happened to be doing some benchmarks on the older shrink_all_memory();
hopefully they can serve as a useful reference point for the new design.
The current swsusp_shrink_memory()/shrink_all_memory() path is terribly
inefficient: it takes 7-9 s to free up 1.4 GB of memory:
[ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
[ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
Below are the logs I collected by injecting printks. They show two major
problems:
- swsusp_shrink_memory() scans the whole 2 GB of memory again and again;
- shrink_all_memory() is slow: it won't reclaim any pages at small
  priority values, because its batch size is 10000 pages.
I wonder if it's possible to free up the memory within 1 s at all.
(Maybe the slowness is due to too many enabled debugging options...)
Thanks,
Fengguang
---
vanilla 2.6.30-rc2-next-20090417:
[ 124.516187] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 124.523087] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 124.530060] PM: Basic memory bitmaps created
[ 124.534421] PM: Syncing filesystems ... done.
[ 124.842282] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 124.849800] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 124.857571] PM: Shrinking memory... tmp=471584, size=491906, highmem_size=0
[ 124.939103] shrink_all_memory: pages=10000
[ 125.019543] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.027636]-tmp=451770, size=481986, highmem_size=0
[ 125.107571] shrink_all_memory: pages=10000
[ 125.139928] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=8500
[ 125.280940] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=1500, reclaimed=1500
[ 125.547990] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.556135]\tmp=411598, size=461898, highmem_size=0
[ 125.637414] shrink_all_memory: pages=10000
[ 125.716890] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=10000
[ 125.725092]|tmp=391507, size=451854, highmem_size=0
[ 125.806935] shrink_all_memory: pages=10000
[ 125.886317] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 125.894531]/tmp=371481, size=441841, highmem_size=0
[ 125.976823] shrink_all_memory: pages=10000
[ 126.104367] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 126.112572]-tmp=351715, size=431952, highmem_size=0
[ 126.195178] shrink_all_memory: pages=10000
[ 126.274586] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.282698]\tmp=331949, size=422063, highmem_size=0
[ 126.365743] shrink_all_memory: pages=10000
[ 126.445851] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.453968]|tmp=311858, size=412019, highmem_size=0
[ 126.537417] shrink_all_memory: pages=10000
[ 126.616980] shrink_all_zones: pass=0, prio=9, lru=Normal.2, pages=10000, reclaimed=10000
[ 126.625180]/tmp=291751, size=401975, highmem_size=0
[ 126.709066] shrink_all_memory: pages=10000
[ 126.788665] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 126.796833]-tmp=271725, size=391962, highmem_size=0
[ 126.880997] shrink_all_memory: pages=10000
[ 127.008443] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.016667]\tmp=251716, size=381949, highmem_size=0
[ 127.101581] shrink_all_memory: pages=10000
[ 127.181588] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.189728]|tmp=231673, size=371936, highmem_size=0
[ 127.275105] shrink_all_memory: pages=10000
[ 127.354799] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.363003]/tmp=211599, size=361892, highmem_size=0
[ 127.448750] shrink_all_memory: pages=10000
[ 127.528252] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.536385]-tmp=191621, size=351910, highmem_size=0
[ 127.622369] shrink_all_memory: pages=10000
[ 127.750093] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 127.758295]\tmp=171539, size=341866, highmem_size=0
[ 127.844867] shrink_all_memory: pages=10000
[ 127.925614] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 127.933758]|tmp=151465, size=331822, highmem_size=0
[ 128.020878] shrink_all_memory: pages=10000
[ 128.100580] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.108803]/tmp=131391, size=321778, highmem_size=0
[ 128.196312] shrink_all_memory: pages=10000
[ 128.275643] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 128.283769]-tmp=111413, size=311796, highmem_size=0
[ 128.371814] shrink_all_memory: pages=10000
[ 128.501803] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.510007]\tmp=91339, size=301752, highmem_size=0
[ 128.597726] shrink_all_memory: pages=10000
[ 128.677138] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 128.685277]|tmp=71296, size=291739, highmem_size=0
[ 128.774061] shrink_all_memory: pages=10000
[ 128.855940] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 128.864145]/tmp=51259, size=281726, highmem_size=0
[ 128.953486] shrink_all_memory: pages=10000
[ 129.033417] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.041553]-tmp=31172, size=271682, highmem_size=0
[ 129.131233] shrink_all_memory: pages=10000
[ 129.210693] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.218994]\tmp=11146, size=261669, highmem_size=0
[ 129.309142] shrink_all_memory: pages=10000
[ 129.388523] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.396648]|tmp=-8880, size=251656, highmem_size=0
[ 129.487193] shrink_all_memory: pages=10000
[ 129.614831] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.623059]/tmp=-28954, size=241612, highmem_size=0
[ 129.714055] shrink_all_memory: pages=10000
[ 129.794104] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 129.802246]-tmp=-48932, size=231630, highmem_size=0
[ 129.893893] shrink_all_memory: pages=10000
[ 129.973667] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=10000, reclaimed=10000
[ 129.981892]\tmp=-69020, size=221586, highmem_size=0
[ 130.073916] shrink_all_memory: pages=10000
[ 130.154620] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.162853]|tmp=-89156, size=211511, highmem_size=0
[ 130.255274] shrink_all_memory: pages=10000
[ 130.334612] shrink_all_zones: pass=0, prio=8, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.342750]/tmp=-109182, size=201498, highmem_size=0
[ 130.435551] shrink_all_memory: pages=10000
[ 130.515074] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.523305]-tmp=-129273, size=191454, highmem_size=0
[ 130.616714] shrink_all_memory: pages=10000
[ 130.696350] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=10000
[ 130.704490]\tmp=-149299, size=181441, highmem_size=0
[ 130.798322] shrink_all_memory: pages=10000
[ 130.877834] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=10000
[ 130.886038]|tmp=-169325, size=171428, highmem_size=0
[ 130.980312] shrink_all_memory: pages=10000
[ 131.107844] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.115982]/tmp=-189351, size=161415, highmem_size=0
[ 131.210530] shrink_all_memory: pages=10000
[ 131.291223] shrink_all_zones: pass=0, prio=7, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.299433]-tmp=-209459, size=151371, highmem_size=0
[ 131.394488] shrink_all_memory: pages=10000
[ 131.474123] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=10000
[ 131.482344]\tmp=-229420, size=141389, highmem_size=0
[ 131.577910] shrink_all_memory: pages=10000
[ 131.657376] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=10000
[ 131.665498]|tmp=-249511, size=131345, highmem_size=0
[ 131.761676] shrink_all_memory: pages=3345
[ 131.791048] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=3345, reclaimed=3345
[ 131.799085]/tmp=-256256, size=127966, highmem_size=0
[ 131.895290]done (353345 pages freed)
[ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
1/30 memory being mapped, vanilla 2.6.30-rc2-next-20090417:
AnonPages: 38684 kB
Mapped: 66940 kB
[ 722.944082] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 722.956215] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 722.963053] PM: Basic memory bitmaps created
[ 722.967365] PM: Syncing filesystems ... done.
[ 723.361274] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 723.369310] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 723.377342] PM: Shrinking memory... tmp=508165, size=510179, highmem_size=0
[ 723.563602] shrink_all_memory: pages=10000
[ 723.648921] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9766
[ 723.733064] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19581
[ 723.741225]-tmp=468972, size=490587, highmem_size=0
[ 723.821406] shrink_all_memory: pages=10000
[ 723.902912] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9804
[ 723.987433] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19617
[ 723.995565]\tmp=429714, size=470964, highmem_size=0
[ 724.077458] shrink_all_memory: pages=10000
[ 724.160394] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9808
[ 724.261056] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19610
[ 724.269482]|tmp=390489, size=451341, highmem_size=0
[ 724.353672] shrink_all_memory: pages=10000
[ 724.556153] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9806
[ 724.669591] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19636
[ 724.677770]/tmp=351365, size=431780, highmem_size=0
[ 724.762188] shrink_all_memory: pages=10000
[ 724.923372] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9805
[ 725.037897] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19620
[ 725.046501]-tmp=312189, size=412188, highmem_size=0
[ 725.133452] shrink_all_memory: pages=10000
[ 725.371199] shrink_all_zones: pass=0, prio=5, lru=DMA32.2, pages=10000, reclaimed=9781
[ 725.519983] shrink_all_zones: pass=0, prio=5, lru=Normal.2, pages=10000, reclaimed=19585
[ 725.528233]\tmp=273061, size=392627, highmem_size=0
[ 725.616020] shrink_all_memory: pages=10000
[ 725.801211] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 725.954523] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19617
[ 725.962685]|tmp=233885, size=373035, highmem_size=0
[ 726.051775] shrink_all_memory: pages=10000
[ 726.296589] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 726.449150] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19682
[ 726.457342]/tmp=194507, size=353350, highmem_size=0
[ 726.548053] shrink_all_memory: pages=10000
[ 726.759180] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9803
[ 726.940475] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19561
[ 726.948638]-tmp=155396, size=333789, highmem_size=0
[ 727.040362] shrink_all_memory: pages=10000
[ 727.257478] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.442356] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19544
[ 727.450548]\tmp=116319, size=314259, highmem_size=0
[ 727.543609] shrink_all_memory: pages=10000
[ 727.755346] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9804
[ 727.894707] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19555
[ 727.902910]|tmp=77256, size=294729, highmem_size=0
[ 727.997018] shrink_all_memory: pages=10000
[ 728.170973] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9799
[ 728.332426] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19545
[ 728.341152]/tmp=38210, size=275199, highmem_size=0
[ 728.437625] shrink_all_memory: pages=10000
[ 728.673862] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9773
[ 728.812572] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19517
[ 728.820738]-tmp=-852, size=255669, highmem_size=0
[ 728.917360] shrink_all_memory: pages=10000
[ 729.110178] shrink_all_zones: pass=0, prio=4, lru=DMA32.2, pages=10000, reclaimed=9839
[ 729.266243] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=19610
[ 729.274407]\tmp=-40045, size=236077, highmem_size=0
[ 729.372371] shrink_all_memory: pages=10000
[ 729.553743] shrink_all_zones: pass=0, prio=4, lru=Normal.2, pages=10000, reclaimed=9741
[ 729.673174] shrink_all_zones: pass=0, prio=3, lru=DMA.2, pages=259, reclaimed=256
[ 729.681224] shrink_all_zones: pass=0, prio=3, lru=DMA32.0, pages=259, reclaimed=256
[ 729.693997] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=259, reclaimed=513
[ 730.006423] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=9487, reclaimed=9296
[ 730.177563] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=9487, reclaimed=18626
[ 730.185640]|tmp=-98138, size=207022, highmem_size=0
[ 730.285280] shrink_all_memory: pages=10000
[ 730.484499] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9807
[ 730.637792] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19613
[ 730.645975]/tmp=-137343, size=187430, highmem_size=0
[ 730.746709] shrink_all_memory: pages=10000
[ 730.754374] shrink_all_zones: pass=0, prio=5, lru=Normal.0, pages=10000, reclaimed=0
[ 731.101101] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9777
[ 731.257243] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19582
[ 731.265411]-tmp=-176567, size=167807, highmem_size=0
[ 731.367111] shrink_all_memory: pages=10000
[ 731.567779] shrink_all_zones: pass=0, prio=3, lru=DMA32.2, pages=10000, reclaimed=9811
[ 731.803019] shrink_all_zones: pass=0, prio=3, lru=Normal.2, pages=10000, reclaimed=19615
[ 731.811189]\tmp=-215837, size=148184, highmem_size=0
[ 731.913738] shrink_all_memory: pages=10000
[ 732.123893] shrink_all_zones: pass=0, prio=2, lru=DMA32.2, pages=10000, reclaimed=9808
[ 732.312075] shrink_all_zones: pass=0, prio=2, lru=Normal.2, pages=10000, reclaimed=19580
[ 732.320234]|tmp=-254948, size=128623, highmem_size=0
[ 732.423776] shrink_all_memory: pages=623
[ 732.432862] shrink_all_zones: pass=0, prio=12, lru=DMA.2, pages=623, reclaimed=617
[ 732.441782] shrink_all_zones: pass=0, prio=12, lru=Normal.0, pages=623, reclaimed=617
[ 732.453341] shrink_all_zones: pass=0, prio=11, lru=DMA.0, pages=6, reclaimed=0
[ 732.460712] shrink_all_zones: pass=0, prio=11, lru=DMA32.0, pages=6, reclaimed=0
[ 732.468390] shrink_all_zones: pass=0, prio=11, lru=DMA32.2, pages=6, reclaimed=6
[ 732.488091] shrink_all_zones: pass=0, prio=6, lru=DMA32.2, pages=623, reclaimed=617
[ 732.508256] shrink_all_zones: pass=0, prio=6, lru=Normal.2, pages=623, reclaimed=1233
[ 732.516202]/tmp=-258774, size=126704, highmem_size=0
[ 732.753869]done (372529 pages freed)
[ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
2009-05-03 13:08 ` Wu Fengguang
@ 2009-05-03 16:30 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-03 16:30 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds, linux-pm
On Sunday 03 May 2009, Wu Fengguang wrote:
>
> Hi Rafael,
Hi,
> I happened to be doing some benchmarks on the older shrink_all_memory();
> hopefully they can serve as a useful reference point for the new design.
>
> The current swsusp_shrink_memory()/shrink_all_memory() path is terribly
> inefficient: it takes 7-9 s to free up 1.4 GB of memory:
One reason may be that it takes too many steps to do it,
> [ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
> [ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
because the new way doesn't seem to do any better.
> Below are the logs I collected by injecting printks. They show two major
> problems:
> - swsusp_shrink_memory() scans the whole 2 GB of memory again and again;
> - shrink_all_memory() is slow: it won't reclaim any pages at small
>   priority values, because its batch size is 10000 pages.
I know that swsusp_shrink_memory() has problems; that's why I'd like to get
rid of it.
> I wonder if it's possible to free up the memory within 1s at all.
I'm not sure.
Apparently, the counting of saveable pages takes substantial time (0.5 s each
iteration on my 64-bit test box), so we can improve that by limiting the number
of iterations.
Well, perhaps we can do it all in one shot after all; I'll think about how to do that.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 0/5] PM: Drop shrink_all_memory (rev. 3)
2009-05-03 16:30 ` Rafael J. Wysocki
(?)
@ 2009-05-04 0:08 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:08 UTC (permalink / raw)
To: Wu Fengguang, linux-pm
Cc: linux-kernel, alan-jenkins, jens.axboe, Andrew Morton,
kernel-testers, torvalds
On Sunday 03 May 2009, Rafael J. Wysocki wrote:
> On Sunday 03 May 2009, Wu Fengguang wrote:
> >
> > Hi Rafael,
>
> Hi,
>
> > I happened to be doing some benchmarks on the older shrink_all_memory();
> > hopefully it can be a useful reference point for the new design.
> >
> > The current swsusp_shrink_memory()/shrink_all_memory() are terribly
> > inefficient: it takes 7-9s to free up 1.4G memory:
>
> One reason may be that it takes too many steps to do it,
>
> > [ 131.899389] PM: Freed 1413380 kbytes in 7.03 seconds (201.04 MB/s)
> > [ 732.757916] PM: Freed 1490116 kbytes in 9.37 seconds (159.03 MB/s)
>
> because the new way doesn't seem to do any better.
>
> > Below are the logs I collected by injecting printks. There are
> > basically two major problems:
> > - swsusp_shrink_memory() scans the whole 2G memory again and again;
> > - shrink_all_memory() is slow. It won't reclaim pages at all with
> > small priority values, because its batching size is 10000 pages.
>
> I know that swsusp_shrink_memory() has problems, that's why I'd like to get rid
> of it.
>
> > I wonder if it's possible to free up the memory within 1s at all.
>
> I'm not sure.
>
> Apparently, the counting of saveable pages takes substantial time (0.5 s each
> iteration on my 64-bit test box), so we can improve that by limiting the number
> of iterations.
>
> Well, perhaps we can do it all in one shot after all, I'll think how to do that.
I've changed swsusp_shrink_memory() to preallocate all of the pages in one
iteration. Although it doesn't seem to improve the speed of memory shrinking,
the function is simpler in this form.
Anyway, updated patch series follows:
[1/5] - Andrew's patch introducing __GFP_NO_OOM_KILL (I decided it would be
better to do it this way in this particular case. The fact that the OOM
killer is not going to work after tasks have been frozen is a different
issue.)
[2/5] - move swsusp_shrink_memory to snapshot.c, no major changes
[3/5] - remove the shrinking of memory from suspend code (in a separate patch
as requested by Linus)
[4/5] - use memory allocations for making room for the image
[5/5] - do not release all memory allocated by [4/5] and use it for
creating the image directly (some allocated memory is released).
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-04 0:10 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
From: Andrew Morton <akpm@linux-foundation.org>
> > Remind me: why can't we just allocate N pages at suspend-time?
>
> We need half of memory free. The reason we can't "just allocate" is
> probably OOM killer; but my memories are quite weak :-(.
hm. You'd think that with our splendid range of __GFP_foo flags, there
would be some combo which would suit this requirement but I can't
immediately spot one.
We can always add another I guess. Something like...
[rjw: fixed white space]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1620,7 +1620,8 @@ nofail_alloc:
}
/* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (order > PAGE_ALLOC_COSTLY_ORDER ||
+ (gfp_mask & __GFP_NO_OOM_KILL)) {
clear_zonelist_oom(zonelist, gfp_mask);
goto nopage;
}
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -51,8 +51,9 @@ struct vm_area_struct;
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
-#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 0:10 ` Rafael J. Wysocki
(?)
@ 2009-05-04 0:38 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-04 0:38 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Mon, 4 May 2009, Rafael J. Wysocki wrote:
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1620,7 +1620,8 @@ nofail_alloc:
> }
>
> /* The OOM killer will not help higher order allocs so fail */
> - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> + (gfp_mask & __GFP_NO_OOM_KILL)) {
> clear_zonelist_oom(zonelist, gfp_mask);
> goto nopage;
> }
This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
(the "goto nopage" above), but only for allocations with __GFP_FS set and
__GFP_NORETRY clear.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 0:38 ` David Rientjes
@ 2009-05-04 15:02 ` Rafael J. Wysocki
2009-05-04 15:02 ` Rafael J. Wysocki
1 sibling, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 15:02 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Monday 04 May 2009, David Rientjes wrote:
> On Mon, 4 May 2009, Rafael J. Wysocki wrote:
>
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c
> > +++ linux-2.6/mm/page_alloc.c
> > @@ -1620,7 +1620,8 @@ nofail_alloc:
> > }
> >
> > /* The OOM killer will not help higher order allocs so fail */
> > - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> > + (gfp_mask & __GFP_NO_OOM_KILL)) {
> > clear_zonelist_oom(zonelist, gfp_mask);
> > goto nopage;
> > }
>
> This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
> (the "goto nopage" above), but only for allocations with __GFP_FS set and
> __GFP_NORETRY clear.
Well, what would you suggest?
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 15:02 ` Rafael J. Wysocki
(?)
@ 2009-05-04 16:44 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-04 16:44 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Mon, 4 May 2009, Rafael J. Wysocki wrote:
> > > Index: linux-2.6/mm/page_alloc.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/page_alloc.c
> > > +++ linux-2.6/mm/page_alloc.c
> > > @@ -1620,7 +1620,8 @@ nofail_alloc:
> > > }
> > >
> > > /* The OOM killer will not help higher order allocs so fail */
> > > - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > > + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> > > + (gfp_mask & __GFP_NO_OOM_KILL)) {
> > > clear_zonelist_oom(zonelist, gfp_mask);
> > > goto nopage;
> > > }
> >
> > This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
> > (the "goto nopage" above), but only for allocations with __GFP_FS set and
> > __GFP_NORETRY clear.
>
> Well, what would you suggest?
>
A couple things:
- rebase this on mmotm so that it doesn't conflict with Mel Gorman's page
allocator speedup changes, and
- avoid the final call to get_page_from_freelist() for
!(gfp_mask & __GFP_NO_OOM_KILL) by adding a check for it alongside
(gfp_mask & __GFP_FS) and !(gfp_mask & __GFP_NORETRY) because it should
really only catch parallel oom killings which won't happen in your
suspend case since it uses ALLOC_WMARK_HIGH.
The latter is important to avoid unnecessary dependencies among low-level
__GFP_* flags (although all __GFP_NO_OOM_KILL allocations should really
all be passing __GFP_NORETRY too to avoid relying too heavily on direct
reclaim).
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 16:44 ` David Rientjes
(?)
@ 2009-05-04 19:51 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 19:51 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Monday 04 May 2009, David Rientjes wrote:
> On Mon, 4 May 2009, Rafael J. Wysocki wrote:
>
> > > > Index: linux-2.6/mm/page_alloc.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/mm/page_alloc.c
> > > > +++ linux-2.6/mm/page_alloc.c
> > > > @@ -1620,7 +1620,8 @@ nofail_alloc:
> > > > }
> > > >
> > > > /* The OOM killer will not help higher order allocs so fail */
> > > > - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > > > + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> > > > + (gfp_mask & __GFP_NO_OOM_KILL)) {
> > > > clear_zonelist_oom(zonelist, gfp_mask);
> > > > goto nopage;
> > > > }
> > >
> > > This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
> > > (the "goto nopage" above), but only for allocations with __GFP_FS set and
> > > __GFP_NORETRY clear.
> >
> > Well, what would you suggest?
> >
>
> A couple things:
>
> - rebase this on mmotm so that it doesn't conflict with Mel Gorman's page
> allocator speedup changes, and
I'm going to rebase the patchset on top of linux-next eventually.
> - avoid the final call to get_page_from_freelist() for
> !(gfp_mask & __GFP_NO_OOM_KILL) by adding a check for it alongside
> (gfp_mask & __GFP_FS) and !(gfp_mask & __GFP_NORETRY) because it should
> really only catch parallel oom killings which won't happen in your
> suspend case since it uses ALLOC_WMARK_HIGH.
>
> The latter is important to avoid unnecessary dependencies among low-level
> __GFP_* flags (although all __GFP_NO_OOM_KILL allocations should really
> all be passing __GFP_NORETRY too to avoid relying too heavily on direct
> reclaim).
OK, thanks.
Something like this?
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1599,7 +1599,8 @@ nofail_alloc:
zonelist, high_zoneidx, alloc_flags);
if (page)
goto got_pg;
- } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
+ && !(gfp_mask & __GFP_NO_OOM_KILL)) {
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -51,8 +51,9 @@ struct vm_area_struct;
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
-#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-04 19:51 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 19:51 UTC (permalink / raw)
To: David Rientjes
Cc: Wu Fengguang, linux-pm, Andrew Morton, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Monday 04 May 2009, David Rientjes wrote:
> On Mon, 4 May 2009, Rafael J. Wysocki wrote:
>
> > > > Index: linux-2.6/mm/page_alloc.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/mm/page_alloc.c
> > > > +++ linux-2.6/mm/page_alloc.c
> > > > @@ -1620,7 +1620,8 @@ nofail_alloc:
> > > > }
> > > >
> > > > /* The OOM killer will not help higher order allocs so fail */
> > > > - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > > > + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> > > > + (gfp_mask & __GFP_NO_OOM_KILL)) {
> > > > clear_zonelist_oom(zonelist, gfp_mask);
> > > > goto nopage;
> > > > }
> > >
> > > This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
> > > (the "goto nopage" above), but only for allocations with __GFP_FS set and
> > > __GFP_NORETRY clear.
> >
> > Well, what would you suggest?
> >
>
> A couple things:
>
> - rebase this on mmotm so that it doesn't conflict with Mel Gorman's page
> allocator speedup changes, and
I'm going to rebase the patchset on top of linux-next eventually.
> - avoid the final call to get_page_from_freelist() for
> !(gfp_mask & __GFP_NO_OOM_KILL) by adding a check for it alongside
> (gfp_mask & __GFP_FS) and !(gfp_mask & __GFP_NORETRY) because it should
> really only catch parallel oom killings which won't happen in your
> suspend case since it uses ALLOC_WMARK_HIGH.
>
> The latter is important to avoid unnecessary dependencies among low-level
> __GFP_* flags (although all __GFP_NO_OOM_KILL allocations should really
> all be passing __GFP_NORETRY too to avoid relying too heavily on direct
> reclaim).
OK, thanks.
Something like this?
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1599,7 +1599,8 @@ nofail_alloc:
zonelist, high_zoneidx, alloc_flags);
if (page)
goto got_pg;
- } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
+ && !(gfp_mask & __GFP_NO_OOM_KILL)) {
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -51,8 +51,9 @@ struct vm_area_struct;
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */

-#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* This equals 0, but use constants in case they ever change */
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 19:51 ` Rafael J. Wysocki
@ 2009-05-04 20:02 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-04 20:02 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Mon, 4 May 2009, Rafael J. Wysocki wrote:
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1599,7 +1599,8 @@ nofail_alloc:
> zonelist, high_zoneidx, alloc_flags);
> if (page)
> goto got_pg;
> - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> + } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
> + && !(gfp_mask & __GFP_NO_OOM_KILL)) {
> if (!try_set_zone_oom(zonelist, gfp_mask)) {
> schedule_timeout_uninterruptible(1);
> goto restart;
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h
> +++ linux-2.6/include/linux/gfp.h
> @@ -51,8 +51,9 @@ struct vm_area_struct;
> #define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
> #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
> #define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
> +#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
>
> -#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> /* This equals 0, but use constants in case they ever change */
>
Yeah, that's much better, thanks. There are currently concerns about adding
a new gfp flag in another thread (__GFP_PANIC), though, so you might find
some resistance to adding a flag with a very specific and limited use case.
I think you might have better luck in doing
struct zone *z;
for_each_populated_zone(z)
zone_set_flag(z, ZONE_OOM_LOCKED);
if all other tasks are really in D state at this point since oom killer
serialization is done with try locks in the page allocator. This is
equivalent to __GFP_NO_OOM_KILL.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 20:02 ` David Rientjes
@ 2009-05-04 22:23 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 22:23 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds
On Monday 04 May 2009, David Rientjes wrote:
> On Mon, 4 May 2009, Rafael J. Wysocki wrote:
>
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c
> > +++ linux-2.6/mm/page_alloc.c
> > @@ -1599,7 +1599,8 @@ nofail_alloc:
> > zonelist, high_zoneidx, alloc_flags);
> > if (page)
> > goto got_pg;
> > - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > + } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
> > + && !(gfp_mask & __GFP_NO_OOM_KILL)) {
> > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > schedule_timeout_uninterruptible(1);
> > goto restart;
> > Index: linux-2.6/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/gfp.h
> > +++ linux-2.6/include/linux/gfp.h
> > @@ -51,8 +51,9 @@ struct vm_area_struct;
> > #define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
> > #define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
> > #define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
> > +#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
> >
> > -#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
> > +#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
> > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> >
> > /* This equals 0, but use constants in case they ever change */
> >
>
> Yeah, that's much better, thanks. There's currently concerns about adding
> a new gfp flag in another thread (__GFP_PANIC), though, so you might find
> some resistance in adding a flag with a very specific and limited use cae.
Oh great. Andrew, what's your opinion?
> I think you might have better luck in doing
>
> struct zone *z;
>
> for_each_populated_zone(z)
> zone_set_flag(z, ZONE_OOM_LOCKED);
>
> if all other tasks are really in D state at this point since oom killer
> serialization is done with try locks in the page allocator.
Not all of them, actually. Some kernel threads are not freezable.
> This is equivalent to __GFP_NO_OOM_KILL.
In that case I think I'd go back to my initial idea of disabling the OOM
killer after freezing tasks.
Roughly, this. [The idea is that the OOM killer is not really going to work
while tasks are frozen, so we can just give up calling it in that case.]
---
include/linux/freezer.h | 2 ++
kernel/power/process.c | 12 ++++++++++++
mm/page_alloc.c | 4 +++-
3 files changed, 17 insertions(+), 1 deletion(-)
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -19,6 +19,8 @@
*/
#define TIMEOUT (20 * HZ)
+static bool tasks_frozen;
+
static inline int freezeable(struct task_struct * p)
{
if ((p == current) ||
@@ -120,6 +122,10 @@ int freeze_processes(void)
Exit:
BUG_ON(in_atomic());
printk("\n");
+
+ if (!error)
+ tasks_frozen = true;
+
return error;
}
@@ -145,6 +151,8 @@ static void thaw_tasks(bool nosig_only)
void thaw_processes(void)
{
+ tasks_frozen = false;
+
printk("Restarting tasks ... ");
thaw_tasks(true);
thaw_tasks(false);
@@ -152,3 +160,7 @@ void thaw_processes(void)
printk("done.\n");
}
+bool processes_are_frozen(void)
+{
+ return tasks_frozen;
+}
Index: linux-2.6/include/linux/freezer.h
===================================================================
--- linux-2.6.orig/include/linux/freezer.h
+++ linux-2.6/include/linux/freezer.h
@@ -50,6 +50,7 @@ extern int thaw_process(struct task_stru
extern void refrigerator(void);
extern int freeze_processes(void);
extern void thaw_processes(void);
+extern bool processes_are_frozen(void);
static inline int try_to_freeze(void)
{
@@ -170,6 +171,7 @@ static inline int thaw_process(struct ta
static inline void refrigerator(void) {}
static inline int freeze_processes(void) { BUG(); return 0; }
static inline void thaw_processes(void) {}
+static inline bool processes_are_frozen(void) { return false; }
static inline int try_to_freeze(void) { return 0; }
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -46,6 +46,7 @@
#include <linux/page-isolation.h>
#include <linux/page_cgroup.h>
#include <linux/debugobjects.h>
+#include <linux/freezer.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -1599,7 +1600,8 @@ nofail_alloc:
zonelist, high_zoneidx, alloc_flags);
if (page)
goto got_pg;
- } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
+ && !processes_are_frozen()) {
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-05 0:37 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-05 0:37 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, Wu Fengguang, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Tue, 5 May 2009, Rafael J. Wysocki wrote:
> Index: linux-2.6/kernel/power/process.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/process.c
> +++ linux-2.6/kernel/power/process.c
> @@ -19,6 +19,8 @@
> */
> #define TIMEOUT (20 * HZ)
>
> +static bool tasks_frozen;
> +
> static inline int freezeable(struct task_struct * p)
> {
> if ((p == current) ||
> @@ -120,6 +122,10 @@ int freeze_processes(void)
> Exit:
> BUG_ON(in_atomic());
> printk("\n");
> +
> + if (!error)
> + tasks_frozen = true;
> +
> return error;
> }
>
> @@ -145,6 +151,8 @@ static void thaw_tasks(bool nosig_only)
>
> void thaw_processes(void)
> {
> + tasks_frozen = false;
> +
> printk("Restarting tasks ... ");
> thaw_tasks(true);
> thaw_tasks(false);
> @@ -152,3 +160,7 @@ void thaw_processes(void)
> printk("done.\n");
> }
>
> +bool processes_are_frozen(void)
> +{
> + return tasks_frozen;
> +}
> Index: linux-2.6/include/linux/freezer.h
> ===================================================================
> --- linux-2.6.orig/include/linux/freezer.h
> +++ linux-2.6/include/linux/freezer.h
> @@ -50,6 +50,7 @@ extern int thaw_process(struct task_stru
> extern void refrigerator(void);
> extern int freeze_processes(void);
> extern void thaw_processes(void);
> +extern bool processes_are_frozen(void);
>
> static inline int try_to_freeze(void)
> {
> @@ -170,6 +171,7 @@ static inline int thaw_process(struct ta
> static inline void refrigerator(void) {}
> static inline int freeze_processes(void) { BUG(); return 0; }
> static inline void thaw_processes(void) {}
> +static inline bool processes_are_frozen(void) { return false; }
>
> static inline int try_to_freeze(void) { return 0; }
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -46,6 +46,7 @@
> #include <linux/page-isolation.h>
> #include <linux/page_cgroup.h>
> #include <linux/debugobjects.h>
> +#include <linux/freezer.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1599,7 +1600,8 @@ nofail_alloc:
> zonelist, high_zoneidx, alloc_flags);
> if (page)
> goto got_pg;
> - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> + } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
> + && !processes_are_frozen()) {
> if (!try_set_zone_oom(zonelist, gfp_mask)) {
> schedule_timeout_uninterruptible(1);
> goto restart;
Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
a new gfp flag. Thanks.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-05 0:37 ` David Rientjes
@ 2009-05-05 22:19 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-05 22:19 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
On Tuesday 05 May 2009, David Rientjes wrote:
> On Tue, 5 May 2009, Rafael J. Wysocki wrote:
>
> > Index: linux-2.6/kernel/power/process.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/process.c
> > +++ linux-2.6/kernel/power/process.c
> > @@ -19,6 +19,8 @@
> > */
> > #define TIMEOUT (20 * HZ)
> >
> > +static bool tasks_frozen;
> > +
> > static inline int freezeable(struct task_struct * p)
> > {
> > if ((p == current) ||
> > @@ -120,6 +122,10 @@ int freeze_processes(void)
> > Exit:
> > BUG_ON(in_atomic());
> > printk("\n");
> > +
> > + if (!error)
> > + tasks_frozen = true;
> > +
> > return error;
> > }
> >
> > @@ -145,6 +151,8 @@ static void thaw_tasks(bool nosig_only)
> >
> > void thaw_processes(void)
> > {
> > + tasks_frozen = false;
> > +
> > printk("Restarting tasks ... ");
> > thaw_tasks(true);
> > thaw_tasks(false);
> > @@ -152,3 +160,7 @@ void thaw_processes(void)
> > printk("done.\n");
> > }
> >
> > +bool processes_are_frozen(void)
> > +{
> > + return tasks_frozen;
> > +}
> > Index: linux-2.6/include/linux/freezer.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/freezer.h
> > +++ linux-2.6/include/linux/freezer.h
> > @@ -50,6 +50,7 @@ extern int thaw_process(struct task_stru
> > extern void refrigerator(void);
> > extern int freeze_processes(void);
> > extern void thaw_processes(void);
> > +extern bool processes_are_frozen(void);
> >
> > static inline int try_to_freeze(void)
> > {
> > @@ -170,6 +171,7 @@ static inline int thaw_process(struct ta
> > static inline void refrigerator(void) {}
> > static inline int freeze_processes(void) { BUG(); return 0; }
> > static inline void thaw_processes(void) {}
> > +static inline bool processes_are_frozen(void) { return false; }
> >
> > static inline int try_to_freeze(void) { return 0; }
> >
> > Index: linux-2.6/mm/page_alloc.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page_alloc.c
> > +++ linux-2.6/mm/page_alloc.c
> > @@ -46,6 +46,7 @@
> > #include <linux/page-isolation.h>
> > #include <linux/page_cgroup.h>
> > #include <linux/debugobjects.h>
> > +#include <linux/freezer.h>
> >
> > #include <asm/tlbflush.h>
> > #include <asm/div64.h>
> > @@ -1599,7 +1600,8 @@ nofail_alloc:
> > zonelist, high_zoneidx, alloc_flags);
> > if (page)
> > goto got_pg;
> > - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > + } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
> > + && !processes_are_frozen()) {
> > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > schedule_timeout_uninterruptible(1);
> > goto restart;
>
> Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> a new gfp flag. Thanks.
Well, you're welcome.
BTW, I think that Andrew was actually right when he asked if I checked whether
the existing __GFP_NORETRY would work as-is for __GFP_FS set and
__GFP_NORETRY unset. Namely, in that case we never reach the code before the
nopage: label that checks __GFP_NORETRY, do we?
So I think we shouldn't modify the 'else if' condition above and check for
!processes_are_frozen() at the beginning of the block below.
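To make the proposal concrete, here is an editor-added, untested sketch of the alternative placement against the same 2.6.30-era hunk; the goto nopage target and the surrounding context are assumptions, not part of the posted patch:

```diff
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+		/* Tasks are frozen: fail the allocation instead of OOM-killing. */
+		if (processes_are_frozen())
+			goto nopage;
 		if (!try_set_zone_oom(zonelist, gfp_mask)) {
 			schedule_timeout_uninterruptible(1);
 			goto restart;
```

The point is that __GFP_NORETRY allocations already bail out before this block, so the else-if condition need not change; only allocations that would otherwise reach the OOM killer need the frozen-state check.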
Thanks,
Rafael
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-05 22:37 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-05 22:37 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Wed, 6 May 2009 00:19:35 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > + && !processes_are_frozen()) {
> > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > schedule_timeout_uninterruptible(1);
> > > goto restart;
> >
> > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > a new gfp flag. Thanks.
>
> Well, you're welcome.
>
> BTW, I think that Andrew was actually right when he asked if I checked whether
> the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> __GFP_NORETRY unset. Namely, in that case we never reach the code before
> nopage: that checks __GFP_NORETRY, do we?
>
> So I think we shouldn't modify the 'else if' condition above and check for
> !processes_are_frozen() at the beginning of the block below.
Confused.
I'm suspecting that hibernation can allocate its pages with
__GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
will dtrt: no oom-killings.
In which case, processes_are_frozen() is not needed at all?
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-05 22:37 ` Andrew Morton
@ 2009-05-05 23:20 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-05 23:20 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Wednesday 06 May 2009, Andrew Morton wrote:
> On Wed, 6 May 2009 00:19:35 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > > > + && !processes_are_frozen()) {
> > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > schedule_timeout_uninterruptible(1);
> > > > goto restart;
> > >
> > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > a new gfp flag. Thanks.
> >
> > Well, you're welcome.
> >
> > BTW, I think that Andrew was actually right when he asked if I checked whether
> > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > nopage: that checks __GFP_NORETRY, do we?
> >
> > So I think we shouldn't modify the 'else if' condition above and check for
> > !processes_are_frozen() at the beginning of the block below.
>
> Confused.
>
> I'm suspecting that hibernation can allocate its pages with
> __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> will dtrt: no oom-killings.
>
> In which case, processes_are_frozen() is not needed at all?
__GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
the combination.
Anyway, even if the hibernation code itself doesn't trigger the OOM killer,
anyone else allocating memory in parallel, or after we've preallocated the
image memory, may still trigger it. So it seems processes_are_frozen()
may still be useful?
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-05 23:40 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-05 23:40 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Wed, 6 May 2009 01:20:34 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Wednesday 06 May 2009, Andrew Morton wrote:
> > On Wed, 6 May 2009 00:19:35 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > > > + && !processes_are_frozen()) {
> > > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > > schedule_timeout_uninterruptible(1);
> > > > > goto restart;
> > > >
> > > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > > a new gfp flag. Thanks.
> > >
> > > Well, you're welcome.
> > >
> > > BTW, I think that Andrew was actually right when he asked if I checked whether
> > > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > > nopage: that checks __GFP_NORETRY, do we?
> > >
> > > So I think we shouldn't modify the 'else if' condition above and check for
> > > !processes_are_frozen() at the beginning of the block below.
> >
> > Confused.
> >
> > I'm suspecting that hibernation can allocate its pages with
> > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > will dtrt: no oom-killings.
> >
> > In which case, processes_are_frozen() is not needed at all?
>
> __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> the combination.
OK. __GFP_WAIT is the big hammer.
> Anyway, even if the hibernation code itself doesn't trigger the OOM killer,
> but anyone else allocates memory in parallel or after we've preallocated the
> image memory, that may still trigger it. So it seems processes_are_frozen()
> may still be useful?
Could be. But only kernel threads are active at this time (yes?), and they
won't have much work to do because userspace is asleep.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-05 23:40 ` Andrew Morton
@ 2009-05-07 18:09 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 18:09 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Wednesday 06 May 2009, Andrew Morton wrote:
> On Wed, 6 May 2009 01:20:34 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Wednesday 06 May 2009, Andrew Morton wrote:
> > > On Wed, 6 May 2009 00:19:35 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > > > + && !processes_are_frozen()) {
> > > > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > > > schedule_timeout_uninterruptible(1);
> > > > > > goto restart;
> > > > >
> > > > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > > > a new gfp flag. Thanks.
> > > >
> > > > Well, you're welcome.
> > > >
> > > > BTW, I think that Andrew was actually right when he asked if I checked whether
> > > > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > > > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > > > nopage: that checks __GFP_NORETRY, do we?
> > > >
> > > > So I think we shouldn't modify the 'else if' condition above and check for
> > > > !processes_are_frozen() at the beginning of the block below.
> > >
> > > Confused.
> > >
> > > I'm suspecting that hibernation can allocate its pages with
> > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > will dtrt: no oom-killings.
> > >
> > > In which case, processes_are_frozen() is not needed at all?
> >
> > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > the combination.
>
> OK. __GFP_WAIT is the big hammer.
Unfortunately it fails too quickly with the combination as well, so it looks
like we can't use __GFP_NORETRY during hibernation.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 18:09 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 18:09 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Wednesday 06 May 2009, Andrew Morton wrote:
> On Wed, 6 May 2009 01:20:34 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Wednesday 06 May 2009, Andrew Morton wrote:
> > > On Wed, 6 May 2009 00:19:35 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > > > + && !processes_are_frozen()) {
> > > > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > > > schedule_timeout_uninterruptible(1);
> > > > > > goto restart;
> > > > >
> > > > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > > > a new gfp flag. Thanks.
> > > >
> > > > Well, you're welcome.
> > > >
> > > > BTW, I think that Andrew was actually right when he asked if I checked whether
> > > > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > > > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > > > nopage: that checks __GFP_NORETRY, do we?
> > > >
> > > > So I think we shouldn't modify the 'else if' condition above and check for
> > > > !processes_are_frozen() at the beginning of the block below.
> > >
> > > Confused.
> > >
> > > I'm suspecting that hibernation can allocate its pages with
> > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > will dtrt: no oom-killings.
> > >
> > > In which case, processes_are_frozen() is not needed at all?
> >
> > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > the combination.
>
> OK. __GFP_WAIT is the big hammer.
Unfortunately it fails too quickly with the combination as well, so it looks
like we can't use __GFP_NORETRY during hibernation.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 18:09 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 18:09 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes-hpIqsD4AKlfQT0dZR+AlfA,
fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA
On Wednesday 06 May 2009, Andrew Morton wrote:
> On Wed, 6 May 2009 01:20:34 +0200
> "Rafael J. Wysocki" <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
>
> > On Wednesday 06 May 2009, Andrew Morton wrote:
> > > On Wed, 6 May 2009 00:19:35 +0200
> > > "Rafael J. Wysocki" <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
> > >
> > > > > > + && !processes_are_frozen()) {
> > > > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > > > schedule_timeout_uninterruptible(1);
> > > > > > goto restart;
> > > > >
> > > > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > > > a new gfp flag. Thanks.
> > > >
> > > > Well, you're welcome.
> > > >
> > > > BTW, I think that Andrew was actually right when he asked if I checked whether
> > > > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > > > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > > > nopage: that checks __GFP_NORETRY, do we?
> > > >
> > > > So I think we shouldn't modify the 'else if' condition above and check for
> > > > !processes_are_frozen() at the beginning of the block below.
> > >
> > > Confused.
> > >
> > > I'm suspecting that hibernation can allocate its pages with
> > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > will dtrt: no oom-killings.
> > >
> > > In which case, processes_are_frozen() is not needed at all?
> >
> > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > the combination.
>
> OK. __GFP_WAIT is the big hammer.
Unfortunately it fails too quickly with the combination as well, so it looks
like we can't use __GFP_NORETRY during hibernation.
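The behavior Andrew and Rafael are discussing can be modeled schematically. The sketch below is a simplified userspace model of the allocator slow path's decision, not the actual mm/page_alloc.c code; the flag values and the `slowpath_outcome()` helper are illustrative stand-ins:

```c
#include <stdbool.h>

/* Illustrative flag values; the real masks live in include/linux/gfp.h. */
#define __GFP_WAIT    0x10u
#define __GFP_FS      0x80u
#define __GFP_NOWARN  0x200u
#define __GFP_NORETRY 0x1000u

enum outcome { ALLOC_FAILS_EARLY, ALLOC_MAY_OOM_KILL };

/* Hypothetical model of the slow path: __GFP_NORETRY makes the allocator
 * bail out before the retry loop that can invoke the OOM killer -- which is
 * also why it gives up "too quickly" for hibernation's purposes.  Without
 * it, a __GFP_FS|__GFP_WAIT allocation keeps retrying and may OOM-kill. */
static enum outcome slowpath_outcome(unsigned int gfp_mask)
{
    if (gfp_mask & __GFP_NORETRY)
        return ALLOC_FAILS_EARLY;
    if ((gfp_mask & __GFP_FS) && (gfp_mask & __GFP_WAIT))
        return ALLOC_MAY_OOM_KILL;
    return ALLOC_FAILS_EARLY;
}
```

Under this model, the suggested __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN combination never reaches the OOM-kill path, but it also fails before reclaim has made enough progress, which matches Rafael's test result.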
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 18:09 ` Rafael J. Wysocki
(?)
@ 2009-05-07 18:48 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 18:48 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Thu, 7 May 2009 20:09:52 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > I'm suspecting that hibernation can allocate its pages with
> > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > will dtrt: no oom-killings.
> > > >
> > > > In which case, processes_are_frozen() is not needed at all?
> > >
> > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > the combination.
> >
> > OK. __GFP_WAIT is the big hammer.
>
> Unfortunately it fails too quickly with the combination as well, so it looks
> like we can't use __GFP_NORETRY during hibernation.
hm.
So where do we stand now?
I'm not a big fan of the global application-specific state change
thing. Something like __GFP_NO_OOM_KILL has a better chance of being
reused by other subsystems in the future, which is a good indicator.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 19:33 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 19:33 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, Andrew Morton wrote:
> On Thu, 7 May 2009 20:09:52 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > > > > I'm suspecting that hibernation can allocate its pages with
> > > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > > will dtrt: no oom-killings.
> > > > >
> > > > > In which case, processes_are_frozen() is not needed at all?
> > > >
> > > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > > the combination.
> > >
> > > OK. __GFP_WAIT is the big hammer.
> >
> > Unfortunately it fails too quickly with the combination as well, so it looks
> > like we can't use __GFP_NORETRY during hibernation.
>
> hm.
>
> So where do we stand now?
>
> I'm not a big fan of the global application-specific state change
> thing. Something like __GFP_NO_OOM_KILL has a better chance of being
> reused by other subsystems in the future, which is a good indicator.
I'm not against __GFP_NO_OOM_KILL, but there's been some strong resistance to
adding new __GFP_FOO flags recently. Is there any likelihood that anyone else
will really need it any time soon?
The advantage of the freezer-based approach is that it disables the OOM killer
when it's not going to work anyway, so it looks like a reasonable thing to do
regardless. IMHO.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 19:33 ` Rafael J. Wysocki
(?)
@ 2009-05-07 20:02 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 20:02 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Thu, 7 May 2009 21:33:47 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Thursday 07 May 2009, Andrew Morton wrote:
> > On Thu, 7 May 2009 20:09:52 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > > > > I'm suspecting that hibernation can allocate its pages with
> > > > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > > > will dtrt: no oom-killings.
> > > > > >
> > > > > > In which case, processes_are_frozen() is not needed at all?
> > > > >
> > > > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > > > the combination.
> > > >
> > > > OK. __GFP_WAIT is the big hammer.
> > >
> > > Unfortunately it fails too quickly with the combination as well, so it looks
> > > like we can't use __GFP_NORETRY during hibernation.
> >
> > hm.
> >
> > So where do we stand now?
> >
> > I'm not a big fan of the global application-specific state change
> > thing. Something like __GFP_NO_OOM_KILL has a better chance of being
> > reused by other subsystems in the future, which is a good indicator.
>
> I'm not against __GFP_NO_OOM_KILL, but there's been some strong resistance to
> adding new __GFP_FOO flags recently.
We have six or seven left - hardly a crisis.
> Is there any likelihood that anyone else
> will really need it any time soon?
Dunno - people do all sorts of crazy things. But it's more likely to
be reused than a PM-specific global!
I have no strong feelings really, but slotting into the existing
technique with something which might be reusable is quite a bit tidier.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 20:18 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 20:18 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, Andrew Morton wrote:
> On Thu, 7 May 2009 21:33:47 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Thursday 07 May 2009, Andrew Morton wrote:
> > > On Thu, 7 May 2009 20:09:52 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > > > > I'm suspecting that hibernation can allocate its pages with
> > > > > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > > > > will dtrt: no oom-killings.
> > > > > > >
> > > > > > > In which case, processes_are_frozen() is not needed at all?
> > > > > >
> > > > > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > > > > the combination.
> > > > >
> > > > > OK. __GFP_WAIT is the big hammer.
> > > >
> > > > Unfortunately it fails too quickly with the combination as well, so it looks
> > > > like we can't use __GFP_NORETRY during hibernation.
> > >
> > > hm.
> > >
> > > So where do we stand now?
> > >
> > > I'm not a big fan of the global application-specific state change
> > > thing. Something like __GFP_NO_OOM_KILL has a better chance of being
> > > reused by other subsystems in the future, which is a good indicator.
> >
> > I'm not against __GFP_NO_OOM_KILL, but there's been some strong resistance to
> > adding new __GFP_FOO flags recently.
>
> We have six or seven left - hardly a crisis.
>
> > Is there any likelihood that anyone else
> > will really need it any time soon?
>
> Dunno - people do all sorts of crazy things. But it's more likely to
> be reused than a PM-specific global!
>
> I have no strong feelings really, but slotting into the existing
> technique with something which might be reusable is quite a bit tidier.
OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
I'll use the freezer-based approach instead.
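The gate the proposed flag would add to the allocation slow path is simple enough to sketch. This is a hypothetical userspace model: the bit value is made up (the real value would be assigned in include/linux/gfp.h if the patch were merged), and `should_invoke_oom_killer()` is an illustrative helper, not kernel code:

```c
#include <stdbool.h>

/* Hypothetical bit for the proposed flag. */
#define __GFP_NO_OOM_KILL 0x8000u

/* With the flag set, the allocator would skip out_of_memory() and simply
 * fail the allocation instead of killing a task. */
static bool should_invoke_oom_killer(unsigned int gfp_mask)
{
    return !(gfp_mask & __GFP_NO_OOM_KILL);
}
```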
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:18 ` Rafael J. Wysocki
(?)
@ 2009-05-07 20:25 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 20:25 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> I'll use the freezer-based approach instead.
>
Third time I'm going to suggest this, and I'd like a response on why it's
not possible instead of being ignored.
All of your tasks are in D state other than kthreads, right? That means
they won't be in the oom killer (thus no zones are oom locked), so you can
easily do this
struct zone *z;

for_each_populated_zone(z)
	zone_set_flag(z, ZONE_OOM_LOCKED);

and then

for_each_populated_zone(z)
	zone_clear_flag(z, ZONE_OOM_LOCKED);
The serialization is done with trylocks so this will never invoke the oom
killer because all zones in the allocator's zonelist will be oom locked.
Why does this not work for you?
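David's suggestion relies on the trylock-style serialization already in the OOM path. Below is a self-contained userspace sketch of that idea; `struct zone`, the flag value, the fixed-size zone array, and the helpers are stand-ins for the kernel's versions, with names mirroring the kernel API:

```c
#include <stdbool.h>

#define MAX_ZONES 4
#define ZONE_OOM_LOCKED 0x1u

/* Stand-in for the kernel's struct zone and for_each_populated_zone(). */
struct zone { unsigned int flags; };
static struct zone zones[MAX_ZONES];

static void zone_set_flag(struct zone *z, unsigned int f)   { z->flags |= f; }
static void zone_clear_flag(struct zone *z, unsigned int f) { z->flags &= ~f; }

/* Models try_set_zone_oom(): the OOM killer proceeds only if no zone in
 * the allocator's zonelist is already marked ZONE_OOM_LOCKED (the real
 * code serializes this with trylocks). */
static bool oom_killer_may_run(void)
{
    for (int i = 0; i < MAX_ZONES; i++)
        if (zones[i].flags & ZONE_OOM_LOCKED)
            return false;
    return true;
}

/* The bracketing David proposes around hibernation's allocations. */
static void hibernate_block_oom(void)
{
    for (int i = 0; i < MAX_ZONES; i++)
        zone_set_flag(&zones[i], ZONE_OOM_LOCKED);
}

static void hibernate_unblock_oom(void)
{
    for (int i = 0; i < MAX_ZONES; i++)
        zone_clear_flag(&zones[i], ZONE_OOM_LOCKED);
}
```

In this model, pre-setting ZONE_OOM_LOCKED on every populated zone makes any later OOM attempt back off, with no new gfp flag required.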
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 20:35 ` Pavel Machek
0 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-07 20:35 UTC (permalink / raw)
To: David Rientjes
Cc: Rafael J. Wysocki, Andrew Morton, fengguang.wu, linux-pm,
torvalds, jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu 2009-05-07 13:25:06, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > I'll use the freezer-based approach instead.
> >
>
> Third time I'm going to suggest this, and I'd like a response on why it's
> not possible instead of being ignored.
>
> All of your tasks are in D state other than kthreads, right? That means
> they won't be in the oom killer (thus no zones are oom locked), so you can
> easily do this
Well, the OOM killer may be running on behalf of some kthread at that
point...? Quite unlikely, but possible AFAICT.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 20:40 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 20:40 UTC (permalink / raw)
To: Pavel Machek
Cc: Rafael J. Wysocki, Andrew Morton, fengguang.wu, linux-pm,
torvalds, jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009, Pavel Machek wrote:
> > Third time I'm going to suggest this, and I'd like a response on why it's
> > not possible instead of being ignored.
> >
> > All of your tasks are in D state other than kthreads, right? That means
> > they won't be in the oom killer (thus no zones are oom locked), so you can
> > easily do this
>
> Well, OOM killer may be running on behalf of some kthread at that
> point....? Quite unlikely, but possible AFAICT.
The oom killer doesn't care about the task's state, so this will be a
genuine oom situation where it will kill a task (one in D state, since
kthreads are inherently immune) which will die when unfrozen. That would
have had to happen anyway when all tasks wake up, since the system is
completely out of memory (except for kswapd, which is given access to
memory reserves because of PF_MEMALLOC). So you needn't worry about
completely blocking out the oom killer: the next kthread to invoke it
in such a situation will end up being a no-op, because it finds a task
with TIF_MEMDIE set.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 20:38 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 20:38 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > I'll use the freezer-based approach instead.
> >
>
> Third time I'm going to suggest this, and I'd like a response on why it's
> not possible instead of being ignored.
>
> All of your tasks are in D state other than kthreads, right? That means
> they won't be in the oom killer (thus no zones are oom locked), so you can
> easily do this
>
> struct zone *z;
> for_each_populated_zone(z)
> zone_set_flag(z, ZONE_OOM_LOCKED);
>
> and then
>
> for_each_populated_zone(z)
> zone_clear_flag(z, ZONE_OOM_LOCKED);
>
> The serialization is done with trylocks so this will never invoke the oom
> killer because all zones in the allocator's zonelist will be oom locked.
>
> Why does this not work for you?
Well, it might work too, but why are you insisting? How's it better than
__GFP_NO_OOM_KILL, actually?
Andrew, what do you think about this?
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 20:42 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 20:42 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> > Third time I'm going to suggest this, and I'd like a response on why it's
> > not possible instead of being ignored.
> >
> > All of your tasks are in D state other than kthreads, right? That means
> > they won't be in the oom killer (thus no zones are oom locked), so you can
> > easily do this
> >
> > struct zone *z;
> > for_each_populated_zone(z)
> > zone_set_flag(z, ZONE_OOM_LOCKED);
> >
> > and then
> >
> > for_each_populated_zone(z)
> > zone_clear_flag(z, ZONE_OOM_LOCKED);
> >
> > The serialization is done with trylocks so this will never invoke the oom
> > killer because all zones in the allocator's zonelist will be oom locked.
> >
> > Why does this not work for you?
>
> Well, it might work too, but why are you insisting? How's it better than
> __GFP_NO_OOM_KILL, actually?
>
Because I agree with Christoph's concerns about needlessly adding
additional gfp flags; he was responding to the proposed addition of
__GFP_PANIC which could be handled in other much simpler ways just like
this flag can as I've shown.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:38 ` Rafael J. Wysocki
@ 2009-05-07 20:56 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 20:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Thu, 7 May 2009 22:38:13 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Thursday 07 May 2009, David Rientjes wrote:
> > On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> >
> > > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > > I'll use the freezer-based approach instead.
> > >
> >
> > Third time I'm going to suggest this, and I'd like a response on why it's
> > not possible instead of being ignored.
> >
> > All of your tasks are in D state other than kthreads, right? That means
> > they won't be in the oom killer (thus no zones are oom locked), so you can
> > easily do this
> >
> > struct zone *z;
> > for_each_populated_zone(z)
> > zone_set_flag(z, ZONE_OOM_LOCKED);
> >
> > and then
> >
> > for_each_populated_zone(z)
> > zone_clear_flag(z, ZONE_OOM_LOCKED);
> >
> > The serialization is done with trylocks so this will never invoke the oom
> > killer because all zones in the allocator's zonelist will be oom locked.
> >
> > Why does this not work for you?
>
> Well, it might work too, but why are you insisting? How's it better than
> __GFP_NO_OOM_KILL, actually?
>
> Andrew, what do you think about this?
I don't think I understand the proposal. Is it to provide a means by
which PM can go in and set a state bit against each and every zone? If
so, that's still a global boolean, only messier.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 21:25 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 21:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Rafael J. Wysocki, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009, Andrew Morton wrote:
> > > All of your tasks are in D state other than kthreads, right? That means
> > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > easily do this
> > >
> > > struct zone *z;
> > > for_each_populated_zone(z)
> > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > >
> > > and then
> > >
> > > for_each_populated_zone(z)
> > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > >
> > > The serialization is done with trylocks so this will never invoke the oom
> > > killer because all zones in the allocator's zonelist will be oom locked.
> > >
> > > Why does this not work for you?
> >
> > Well, it might work too, but why are you insisting? How's it better than
> > __GFP_NO_OOM_KILL, actually?
> >
> > Andrew, what do you think about this?
>
> I don't think I understand the proposal. Is it to provide a means by
> which PM can go in and set a state bit against each and every zone? If
> so, that's still a global boolean, only messier.
>
Why can't it be global while preallocating memory for hibernation since
nothing but kthreads could allocate at this point and if the system is oom
then the oom killer wouldn't be able to do anything anyway since it can't
kill them?
The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
whether it specifies it or not since the oom killer would simply kill a
task in D state which can't exit or free memory and subsequent allocations
would make the oom killer a no-op because there's an eligible task with
TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
calling the oom killer in the first place and killing an unresponsive task
but that would have to happen anyway when thawed since the system is oom
(or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 21:36 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 21:36 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > > > All of your tasks are in D state other than kthreads, right? That means
> > > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > > easily do this
> > > >
> > > > struct zone *z;
> > > > for_each_populated_zone(z)
> > > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > and then
> > > >
> > > > for_each_populated_zone(z)
> > > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > The serialization is done with trylocks so this will never invoke the oom
> > > > killer because all zones in the allocator's zonelist will be oom locked.
> > > >
> > > > Why does this not work for you?
> > >
> > > Well, it might work too, but why are you insisting? How's it better than
> > > __GFP_NO_OOM_KILL, actually?
> > >
> > > Andrew, what do you think about this?
> >
> > I don't think I understand the proposal. Is it to provide a means by
> > which PM can go in and set a state bit against each and every zone? If
> > so, that's still a global boolean, only messier.
> >
>
> Why can't it be global while preallocating memory for hibernation since
> nothing but kthreads could allocate at this point and if the system is oom
> then the oom killer wouldn't be able to do anything anyway since it can't
> kill them?
>
> The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> whether it specifies it or not since the oom killer would simply kill a
> task in D state which can't exit or free memory and subsequent allocations
> would make the oom killer a no-op because there's an eligible task with
> TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> calling the oom killer in a first place and killing an unresponsive task
That's exactly what we're trying to do. We don't want tasks to get killed just
because we're freeing memory for the hibernation image.
> but that would have to happen anyway when thawed since the system is oom
> (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
Are you sure? The image memory is freed before thawing tasks.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 21:46 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 21:46 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > whether it specifies it or not since the oom killer would simply kill a
> > task in D state which can't exit or free memory and subsequent allocations
> > would make the oom killer a no-op because there's an eligible task with
> > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > calling the oom killer in a first place and killing an unresponsive task
>
> That's exactly what we're trying to do. We don't want tasks to get killed just
> because we're freeing memory for hibernation image.
>
Then, again, why can't you just lock out the oom killer as I suggested if
__GFP_NO_OOM_KILL is actually implied for all allocations when
preallocating? It prevents adding an unnecessary gfp flag, sprinkling it
around in the hibernation code, and a comment would actually explain why
it's the right thing to do (i.e. no other threads other than kthreads
could possibly be executing the oom killer and if they are oom then we'll
have to kill a userspace task anyway when thawed).
> > but that would have to happen anyway when thawed since the system is oom
> > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
>
> Are you sure? The image memory is freed before thawing tasks.
>
If you try to allocate any non-__GFP_NORETRY memory such as GFP_KERNEL
with order < PAGE_ALLOC_COSTLY_ORDER and direct reclaim cannot free memory
(and the oom killer is implicitly a no-op whether you specify
__GFP_NO_OOM_KILL or not), then you could loop endlessly in the page
allocator. When allocating GFP_IMAGE you need to ensure that can't happen
and __GFP_NORETRY may not be your best option because it could fail
unnecessarily when reclaim could have helped.
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:46 ` David Rientjes
@ 2009-05-07 22:05 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 22:05 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > > whether it specifies it or not since the oom killer would simply kill a
> > > task in D state which can't exit or free memory and subsequent allocations
> > > would make the oom killer a no-op because there's an eligible task with
> > > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > > calling the oom killer in a first place and killing an unresponsive task
> >
> > That's exactly what we're trying to do. We don't want tasks to get killed just
> > because we're freeing memory for hibernation image.
> >
>
> Then, again, why can't you just lock out the oom killer as I suggested if
> __GFP_NO_OOM_KILL is actually implied for all allocations when
> preallocating? It prevents adding an unnecessary gfp flag, sprinkling it
> around in the hibernation code,
In one place really.
> and a comment would actually explain why it's the right thing to do (i.e. no
> other threads other than kthreads could possibly be executing the oom killer
> and if they are oom then we'll have to kill a userspace task anyway when
> thawed).
Quite frankly, I prefer my freezer-based patch to this. I'm not really
inclined to fiddle with the mm internals from within snapshot.c.
Still, I trust Andrew's experience and that's why I'm going to try the
__GFP_NO_OOM_KILL first, as I already said. If there is a problem with it,
I'm going to use the freezer-based approach.
> > > but that would have to happen anyway when thawed since the system is oom
> > > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
> >
> > Are you sure? The image memory is freed before thawing tasks.
> >
>
> If you try to allocate any non-__GFP_NORETRY memory such as GFP_KERNEL
> with order < PAGE_ALLOC_COSTLY_ORDER and direct reclaim cannot free memory
> (and the oom killer is implicitly a no-op whether you specify
> __GFP_NO_OOM_KILL or not), then you could loop endlessly in the page
> allocator. When allocating GFP_IMAGE you need to ensure that can't happen
> and __GFP_NORETRY may not be your best option because it could fail
> unnecessarily when reclaim could have helped.
I'm really unsure what you mean and how that is related to your previous remark
about what's going to happen after the thawing of tasks.
Thanks,
Rafael
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 22:05 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 22:05 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > > whether it specifies it or not since the oom killer would simply kill a
> > > task in D state which can't exit or free memory and subsequent allocations
> > > would make the oom killer a no-op because there's an eligible task with
> > > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > > calling the oom killer in a first place and killing an unresponsive task
> >
> > That's exactly what we're trying to do. We don't want tasks to get killed just
> > because we're freeing memory for hibernation image.
> >
>
> Then, again, why can't you just lock out the oom killer as I suggested if
> __GFP_NO_OOM_KILL is actually implied for all allocations when
> preallocating? It prevents adding an unnecessary gfp flag, sprinkling it
> around in the hibernation code,
In one place really.
> and a comment would actually explain why it's the right thing to do (i.e. no
> other threads other than kthreads could possibly be executing the oom killer
> and if they are oom then we'll have to kill a userspace task anyway when
> thawed).
Quite frankly, I prefer my freezer-based patch to this. I'm not really
inclined to fiddle with the mm internals from within snapshot.c .
Still, I trust the Andrew's experience and that's why I'm going to try the
__GFP_NO_OOM_KILL first, as I already said. If there is a problem with it,
I'm going to use the freezer-based approach.
> > > but that would have to happen anyway when thawed since the system is oom
> > > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
> >
> > Are you sure? The image memory is freed before thawing tasks.
> >
>
> If you try to allocate any non-__GFP_NORETRY memory such as GFP_KERNEL
> with order < PAGE_ALLOC_COSTLY_ORDER and direct reclaim cannot free memory
> (and the oom killer is implicitly a no-op whether you specify
> __GFP_NO_OOM_KILL or not), then you could loop endlessly in the page
> allocator. When allocating GFP_IMAGE you need to ensure that can't happen
> and __GFP_NORETRY may not be your best option because it could fail
> unnecessarily when reclaim could have helped.
I'm really not sure what you mean or how that relates to your previous remark
about what's going to happen after the thawing of tasks.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:36 ` Rafael J. Wysocki
(?)
(?)
@ 2009-05-07 21:46 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 21:46 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > whether it specifies it or not since the oom killer would simply kill a
> > task in D state which can't exit or free memory and subsequent allocations
> > would make the oom killer a no-op because there's an eligible task with
> > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > calling the oom killer in a first place and killing an unresponsive task
>
> That's exactly what we're trying to do. We don't want tasks to get killed just
> because we're freeing memory for hibernation image.
>
Then, again, why can't you just lock out the oom killer as I suggested if
__GFP_NO_OOM_KILL is actually implied for all allocations when
preallocating? It prevents adding an unnecessary gfp flag, sprinkling it
around in the hibernation code, and a comment would actually explain why
it's the right thing to do (i.e. no other threads other than kthreads
could possibly be executing the oom killer and if they are oom then we'll
have to kill a userspace task anyway when thawed).
> > but that would have to happen anyway when thawed since the system is oom
> > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
>
> Are you sure? The image memory is freed before thawing tasks.
>
If you try to allocate any non-__GFP_NORETRY memory such as GFP_KERNEL
with order < PAGE_ALLOC_COSTLY_ORDER and direct reclaim cannot free memory
(and the oom killer is implicitly a no-op whether you specify
__GFP_NO_OOM_KILL or not), then you could loop endlessly in the page
allocator. When allocating GFP_IMAGE you need to ensure that can't happen
and __GFP_NORETRY may not be your best option because it could fail
unnecessarily when reclaim could have helped.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:25 ` David Rientjes
(?)
(?)
@ 2009-05-07 21:36 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 21:36 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > > > All of your tasks are in D state other than kthreads, right? That means
> > > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > > easily do this
> > > >
> > > > struct zone *z;
> > > > for_each_populated_zone(z)
> > > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > and then
> > > >
> > > > for_each_populated_zone(z)
> > > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > The serialization is done with trylocks so this will never invoke the oom
> > > > killer because all zones in the allocator's zonelist will be oom locked.
> > > >
> > > > Why does this not work for you?
> > >
> > > Well, it might work too, but why are you insisting? How's it better than
> > > __GFP_NO_OOM_KILL, actually?
> > >
> > > Andrew, what do you think about this?
> >
> > I don't think I understand the proposal. Is it to provide a means by
> > which PM can go in and set a state bit against each and every zone? If
> > so, that's still a global boolean, only messier.
> >
>
> Why can't it be global while preallocating memory for hibernation since
> nothing but kthreads could allocate at this point and if the system is oom
> then the oom killer wouldn't be able to do anything anyway since it can't
> kill them?
>
> The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> whether it specifies it or not since the oom killer would simply kill a
> task in D state which can't exit or free memory and subsequent allocations
> would make the oom killer a no-op because there's an eligible task with
> TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> calling the oom killer in a first place and killing an unresponsive task
That's exactly what we're trying to do. We don't want tasks to get killed just
because we're freeing memory for the hibernation image.
> but that would have to happen anyway when thawed since the system is oom
> (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
Are you sure? The image memory is freed before thawing tasks.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:25 ` David Rientjes
` (2 preceding siblings ...)
(?)
@ 2009-05-07 21:50 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 21:50 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thu, 7 May 2009 14:25:23 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > > > All of your tasks are in D state other than kthreads, right? That means
> > > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > > easily do this
> > > >
> > > > struct zone *z;
> > > > for_each_populated_zone(z)
> > > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > and then
> > > >
> > > > for_each_populated_zone(z)
> > > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > > >
> > > > The serialization is done with trylocks so this will never invoke the oom
> > > > killer because all zones in the allocator's zonelist will be oom locked.
> > > >
> > > > Why does this not work for you?
> > >
> > > Well, it might work too, but why are you insisting? How's it better than
> > > __GFP_NO_OOM_KILL, actually?
> > >
> > > Andrew, what do you think about this?
> >
> > I don't think I understand the proposal. Is it to provide a means by
> > which PM can go in and set a state bit against each and every zone? If
> > so, that's still a global boolean, only messier.
> >
>
> Why can't it be global while preallocating memory for hibernation since
> nothing but kthreads could allocate at this point and if the system is oom
> then the oom killer wouldn't be able to do anything anyway since it can't
> kill them?
- globals are bad
- the standard way of controlling memory allocator behaviour is via
the gfp_t. Bypassing that is an unusual step and needs a higher
level of justification, which I'm not seeing here.
- if we do this via an unusual global, we reduce the chances that
another subsystem could use the new feature.
I don't know what subsystem that might be, but I bet they're out
there: checkpoint-restart, virtual machines, ballooning memory
drivers, kexec loading, etc.
> The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> whether it specifies it or not since the oom killer would simply kill a
> task in D state which can't exit or free memory and subsequent allocations
> would make the oom killer a no-op because there's an eligible task with
> TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> calling the oom killer in a first place and killing an unresponsive task
> but that would have to happen anyway when thawed since the system is oom
> (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
All the above is specific to the PM application only, when userspace
tasks are stopped.
It might well end up that stopping userspace (beforehand or before
oom-killing) is a hard requirement for reliably disabling the
oom-killer. Because the __GFP_NO_OOM_KILL user will be safe, but
random other allocations from other tasks will not be. So perhaps we
_do_ need a global, and random userspace processes should test and
sleep upon that global if they're heading in the direction of the
oom-killer.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:50 ` Andrew Morton
(?)
@ 2009-05-07 22:14 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 22:14 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, David Rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Thursday 07 May 2009, Andrew Morton wrote:
> On Thu, 7 May 2009 14:25:23 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
> > On Thu, 7 May 2009, Andrew Morton wrote:
> >
> > > > > All of your tasks are in D state other than kthreads, right? That means
> > > > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > > > easily do this
> > > > >
> > > > > struct zone *z;
> > > > > for_each_populated_zone(z)
> > > > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > > > >
> > > > > and then
> > > > >
> > > > > for_each_populated_zone(z)
> > > > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > > > >
> > > > > The serialization is done with trylocks so this will never invoke the oom
> > > > > killer because all zones in the allocator's zonelist will be oom locked.
> > > > >
> > > > > Why does this not work for you?
> > > >
> > > > Well, it might work too, but why are you insisting? How's it better than
> > > > __GFP_NO_OOM_KILL, actually?
> > > >
> > > > Andrew, what do you think about this?
> > >
> > > I don't think I understand the proposal. Is it to provide a means by
> > > which PM can go in and set a state bit against each and every zone? If
> > > so, that's still a global boolean, only messier.
> > >
> >
> > Why can't it be global while preallocating memory for hibernation since
> > nothing but kthreads could allocate at this point and if the system is oom
> > then the oom killer wouldn't be able to do anything anyway since it can't
> > kill them?
>
> - globals are bad
>
> - the standard way of controlling memory allocator behaviour is via
> the gfp_t. Bypassing that is an unusual step and needs a higher
> level of justification, which I'm not seeing here.
>
> - if we do this via an unusual global, we reduce the chances that
> another subsytem could use the new feature.
>
> I don't know what subsytem that might be, but I bet they're out
> there. checkpoint-restart, virtual machines, ballooning memory
> drivers, kexec loading, etc.
>
> > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > whether it specifies it or not since the oom killer would simply kill a
> > task in D state which can't exit or free memory and subsequent allocations
> > would make the oom killer a no-op because there's an eligible task with
> > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > calling the oom killer in a first place and killing an unresponsive task
> > but that would have to happen anyway when thawed since the system is oom
> > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
>
> All the above is specific to the PM application only, when userspace
> tasks are stopped.
>
>
> It might well end up that stopping userspace (beforehand or before
> oom-killing) is a hard requirement for reliably disabling the
> oom-killer.
In fact I think it is, and that's why I wanted to make that freezer-dependent.
IOW, you need to freeze user space totally before trying to disable the
OOM killer. Conversely, if you _have_ frozen user space totally, the OOM
killer won't really help, so why let it run at all in that situation?
FWIW, I've just posted an updated patchset with the first patch replaced with
the one introducing __GFP_NO_OOM_KILL, but perhaps I should use the
freezer-based one after all?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:14 ` Rafael J. Wysocki
(?)
@ 2009-05-07 22:38 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 22:38 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Fri, 8 May 2009 00:14:48 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> IOW, you need to freeze the user space totally before trying to disable the
> OOM killer.
Not necessarily. We only need to take action if a task is about to
start oom-killing - presumably by taking a nap.
If a process is sitting there happily computing pi then we can leave it
running.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:38 ` Andrew Morton
@ 2009-05-07 22:50 ` Rafael J. Wysocki
2009-05-07 23:15 ` Andrew Morton
2009-05-07 23:15 ` Andrew Morton
2009-05-07 22:50 ` Rafael J. Wysocki
1 sibling, 2 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 22:50 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Friday 08 May 2009, Andrew Morton wrote:
> On Fri, 8 May 2009 00:14:48 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > IOW, you need to freeze the user space totally before trying to disable the
> > OOM killer.
>
> Not necessarily. We only need to take action if a task is about to
> start oom-killing - presumably by taking a nap.
>
> If a process is sitting there happily computing pi then we can leave it
> running.
Well, the point is we don't really know what the task is going to do next.
Is it going to continue computing pi, or is it going to execl(huge_binary), for
example?
If we knew what tasks were going to do in advance, the whole freezing wouldn't
really be necessary. :-)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:50 ` Rafael J. Wysocki
@ 2009-05-07 23:15 ` Andrew Morton
2009-05-07 23:15 ` Andrew Morton
1 sibling, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 23:15 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Fri, 8 May 2009 00:50:41 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Friday 08 May 2009, Andrew Morton wrote:
> > On Fri, 8 May 2009 00:14:48 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > IOW, you need to freeze the user space totally before trying to disable the
> > > OOM killer.
> >
> > Not necessarily. We only need to take action if a task is about to
> > start oom-killing - presumably by taking a nap.
> >
> > If a process is sitting there happily computing pi then we can leave it
> > running.
>
> Well, the point is we don't really know what the task is going to do next.
> Is it going to continue computing pi, or is it going to execl(huge_binary), for
> example?
>
> If we knew what tasks were going to do in advance, the whole freezing wouldn't
> really be necessary. :-)
argh. Third time:
- if the task is computing pi, let it do so.
- if the task tries to allocate memory and succeeds, let it proceed.
- if the task tries to allocate memory and fails and then tries to invoke
the oom-killer, stop the task.
^ permalink raw reply [flat|nested] 580+ messages in thread
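Andrew's three rules above amount to intercepting a task only at the point
where it would invoke the OOM killer. A minimal user-space model of that
policy follows; all names here are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative model: tasks run freely; only a task that fails an
 * allocation, and would otherwise invoke the OOM killer, is put to
 * sleep ("frozen") instead. */

enum task_state { RUNNING, FROZEN };

static bool oom_killer_disabled;	/* set for the hibernation window */

/* Mimics the page allocator: returns NULL on failure. */
static void *try_alloc(size_t size, bool simulate_failure)
{
	return simulate_failure ? NULL : malloc(size);
}

/* Called at the point where the allocator would invoke the OOM killer. */
static enum task_state alloc_or_freeze(size_t size, bool simulate_failure,
				       void **out)
{
	*out = try_alloc(size, simulate_failure);
	if (*out)
		return RUNNING;		/* allocation succeeded: proceed */
	if (oom_killer_disabled)
		return FROZEN;		/* would OOM-kill: freeze instead */
	return RUNNING;			/* normal path: OOM killer runs */
}
```

A pi-computing task that never hits the failure path is never touched, which
is the point of the scheme.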
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 23:24 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 23:24 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Friday 08 May 2009, Andrew Morton wrote:
> On Fri, 8 May 2009 00:50:41 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Friday 08 May 2009, Andrew Morton wrote:
> > > On Fri, 8 May 2009 00:14:48 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > IOW, you need to freeze the user space totally before trying to disable the
> > > > OOM killer.
> > >
> > > Not necessarily. We only need to take action if a task is about to
> > > start oom-killing - presumably by taking a nap.
> > >
> > > If a process is sitting there happily computing pi then we can leave it
> > > running.
> >
> > Well, the point is we don't really know what the task is going to do next.
> > Is it going to continue computing pi, or is it going to execl(huge_binary), for
> > example?
> >
> > If we knew what tasks were going to do in advance, the whole freezing wouldn't
> > really be necessary. :-)
>
> argh. Third time:
>
> - if the task is computing pi, let it do so.
>
> - if the task tries to allocate memory and succeeds, let it proceed.
>
> - if the task tries to allocate memory and fails and then tries to invoke
> the oom-killer, stop the task.
Understood.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 21:50 ` Andrew Morton
` (2 preceding siblings ...)
(?)
@ 2009-05-07 22:16 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 22:16 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thu, 7 May 2009, Andrew Morton wrote:
> - the standard way of controlling memory allocator behaviour is via
> the gfp_t. Bypassing that is an unusual step and needs a higher
> level of justification, which I'm not seeing here.
>
The standard way of controlling the oom killer behavior for a zone is via
the ZONE_OOM_LOCKED bit.
> - if we do this via an unusual global, we reduce the chances that
> another subsystem could use the new feature.
>
> I don't know what subsystem that might be, but I bet they're out
> there. checkpoint-restart, virtual machines, ballooning memory
> drivers, kexec loading, etc.
>
There are two separate issues here: the use of ZONE_OOM_LOCKED to control
whether or not to invoke the oom killer for a specific zone (which is
already its only function), and the fact that in this case we're doing it
for all zones. It seems like you're concerned with the latter, but the
distinction in the hibernation case is that no memory freeing would be
possible as the result of the oom killer for _all_ zones, so it makes
sense to lock them all out.
> > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > whether it specifies it or not since the oom killer would simply kill a
> > task in D state which can't exit or free memory and subsequent allocations
> > would make the oom killer a no-op because there's an eligible task with
> > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > calling the oom killer in the first place and killing an unresponsive task
> > but that would have to happen anyway when thawed since the system is oom
> > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
>
> All the above is specific to the PM application only, when userspace
> tasks are stopped.
>
I'm not arguing that the only way we can ever implement __GFP_NO_OOM_KILL
is for the entire system: we can set ZONE_OOM_LOCKED for only the zones in
the zonelist that are passed to the page allocator. For this particular
purpose, that is naturally all zones; for other future use cases it may be
chosen only to lock out the zones we're allowed to allocate from in that
context.
> It might well end up that stopping userspace (beforehand or before
> oom-killing) is a hard requirement for reliably disabling the
> oom-killer.
Yes, globally, but future use cases may disable only specific zones such
as with memory hot-remove.
^ permalink raw reply [flat|nested] 580+ messages in thread
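The zonelist-wide lockout David describes can be sketched as follows. This is
a user-space model; the flag and helper names are illustrative, not the
kernel's actual zone-flag API, and the real kernel serializes these bit
operations under zone_scan_lock:

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of locking every zone out of the OOM killer via a
 * per-zone OOM-lock bit.  Names below are made up for illustration. */

#define MAX_ZONES 4

struct zone { bool oom_locked; };
static struct zone zones[MAX_ZONES];

/* Lock all zones: any subsequent would-be OOM kill backs off. */
static void oom_killer_disable_all(void)
{
	for (int i = 0; i < MAX_ZONES; i++)
		zones[i].oom_locked = true;
}

static void oom_killer_enable_all(void)
{
	for (int i = 0; i < MAX_ZONES; i++)
		zones[i].oom_locked = false;
}

/* Mirrors the check the OOM path would make against its zonelist:
 * a single locked zone in the list suppresses the kill. */
static bool oom_allowed(const struct zone *zonelist, int n)
{
	for (int i = 0; i < n; i++)
		if (zonelist[i].oom_locked)
			return false;
	return true;
}
```

For hibernation the lockout covers all zones; a future caller could pass a
narrower zonelist, which is the flexibility David argues for.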
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 22:45 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 22:45 UTC (permalink / raw)
To: David Rientjes
Cc: rjw, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009 15:16:17 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > - the standard way of controlling memory allocator behaviour is via
> > the gfp_t. Bypassing that is an unusual step and needs a higher
> > level of justification, which I'm not seeing here.
> >
>
> The standard way of controlling the oom killer behavior for a zone is via
> the ZONE_OOM_LOCKED bit.
oop, I didn't remember/realise that ZONE_OOM_LOCKED already exists.
> > - if we do this via an unusual global, we reduce the chances that
> > another subsystem could use the new feature.
> >
> > I don't know what subsystem that might be, but I bet they're out
> > there. checkpoint-restart, virtual machines, ballooning memory
> > drivers, kexec loading, etc.
> >
>
> There's two separate issues here: the use of ZONE_OOM_LOCKED to control
> whether or not to invoke the oom killer for a specific zone (which is
> already its only function), and the fact that in this case we're doing it
> for all zones. It seems like you're concerned with the latter, but the
> distinction in the hibernation case is that no memory freeing would be
> possible as the result of the oom killer for _all_ zones, so it makes
> sense to lock them all out.
OK.
> > > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > > whether it specifies it or not since the oom killer would simply kill a
> > > task in D state which can't exit or free memory and subsequent allocations
> > > would make the oom killer a no-op because there's an eligible task with
> > > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > > calling the oom killer in the first place and killing an unresponsive task
> > > but that would have to happen anyway when thawed since the system is oom
> > > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
> >
> > All the above is specific to the PM application only, when userspace
> > tasks are stopped.
> >
>
> I'm not arguing that the only way we can ever implement __GFP_NO_OOM_KILL
> is for the entire system: we can set ZONE_OOM_LOCKED for only the zones in
> the zonelist that are passed to the page allocator. For this particular
> purpose, that is naturally all zones; for other future use cases it may be
> chosen only to lock out the zones we're allowed to allocate from in that
> context.
OK.
> > It might well end up that stopping userspace (beforehand or before
> > oom-killing) is a hard requirement for reliably disabling the
> > oom-killer.
>
> Yes, globally, but future use cases may disable only specific zones such
> as with memory hot-remove.
<goes off to find out what ZONE_OOM_LOCKED does>
That took remarkably longer than one would have expected..
Yes, OK, I agree, globally setting ZONE_OOM_LOCKED would produce a
decent result.
The setting and clearing of that thing looks gruesomely racy..
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:45 ` Andrew Morton
(?)
@ 2009-05-07 22:59 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 22:59 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thu, 7 May 2009, Andrew Morton wrote:
> The setting and clearing of that thing looks gruesomely racy..
>
It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
gets test/set and cleared atomically for the entire zonelist (the clear
happens for the same zonelist that was test/set).
Using it for hibernation in the way I've proposed will open it up to the
race I earlier described: when a kthread is in the oom killer and
subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
frozen so they can't be in the oom killer). That's perfectly acceptable,
however, since the system is by definition already oom if kthreads can't
get memory so it will end up killing a user task even though it's stuck in
D state and will exit on thaw; we aren't concerned about killing
needlessly because the oom killer becomes a no-op when it finds a task
that has already been killed but hasn't exited by way of TIF_MEMDIE.
^ permalink raw reply [flat|nested] 580+ messages in thread
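The TIF_MEMDIE back-off David relies on can be modelled like this. It is a
hedged sketch with made-up structures and names, not the kernel's actual
mm/oom_kill.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: once one task has been marked for OOM death,
 * further invocations of the killer are no-ops until that task exits. */

struct task { bool memdie; };

/* Returns the index of the chosen victim, or -1 when the killer
 * backs off because a previous victim has not exited yet. */
static int oom_kill(struct task *tasks, int n)
{
	for (int i = 0; i < n; i++)
		if (tasks[i].memdie)
			return -1;	/* pending victim: no-op */
	tasks[0].memdie = true;		/* pick a victim (simplified) */
	return 0;
}
```

A frozen victim stuck in D state therefore makes the killer a no-op for the
rest of the hibernation window, which is why the race is tolerable.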
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 23:11 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 23:11 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Friday 08 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > The setting and clearing of that thing looks gruesomely racy..
> >
>
> It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> gets test/set and cleared atomically for the entire zonelist (the clear
> happens for the same zonelist that was test/set).
>
> Using it for hibernation in the way I've proposed will open it up to the
> race I earlier described: when a kthread is in the oom killer and
> subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> frozen so they can't be in the oom killer). That's perfectly acceptable,
> however, since the system is by definition already oom if kthreads can't
> get memory so it will end up killing a user task even though it's stuck in
> D state and will exit on thaw; we aren't concerned about killing
> needlessly because the oom killer becomes a no-op when it finds a task
> that has already been killed but hasn't exited by way of TIF_MEMDIE.
OK there.
So everyone seems to agree we can do something like in the patch below?
---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM/Hibernate: Rework shrinking of memory
Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
just once to make some room for the image and then allocates memory
to apply more pressure to the memory management subsystem, if
necessary.
Unfortunately, we don't seem to be able to drop shrink_all_memory()
entirely just yet, because that would lead to huge performance
regressions in some test cases.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
1 file changed, 104 insertions(+), 47 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1066,69 +1066,126 @@ void swsusp_free(void)
buffer = NULL;
}
+/* Helper functions used for the shrinking of memory. */
+
/**
- * swsusp_shrink_memory - Try to free as much memory as needed
- *
- * ... but do not OOM-kill anyone
+ * preallocate_image_memory - Allocate given number of page frames
+ * @nr_pages: Number of page frames to allocate
*
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
+ * Return value: Number of page frames actually allocated
*/
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
+static unsigned long preallocate_image_memory(unsigned long nr_pages)
{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
+ unsigned long nr_alloc = 0;
+
+ while (nr_pages > 0) {
+ if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
+ break;
+ nr_pages--;
+ nr_alloc++;
+ }
+
+ return nr_alloc;
}
+/**
+ * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ *
+ * To create a hibernation image it is necessary to make a copy of every page
+ * frame in use. We also need a number of page frames to be free during
+ * hibernation for allocations made while saving the image and for device
+ * drivers, in case they need to allocate memory from their hibernation
+ * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
+ * respectively, both of which are rough estimates). To make this happen, we
+ * compute the total number of available page frames and allocate at least
+ *
+ * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
+ *
+ * of them, which corresponds to the maximum size of a hibernation image.
+ *
+ * If image_size is set below the number following from the above formula,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size or it is impossible to
+ * allocate more memory, whichever happens first.
+ */
int swsusp_shrink_memory(void)
{
- long tmp;
struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
+ unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
+ int error = 0;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Shrinking memory ... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
+ /* Count the number of saveable data pages. */
+ saveable = count_data_pages() + count_highmem_pages();
- if (highmem_size < 0)
- highmem_size = 0;
+ /*
+ * Compute the total number of page frames we can use (count) and the
+ * number of pages needed for image metadata (size).
+ */
+ count = saveable;
+ size = 0;
+ for_each_populated_zone(zone) {
+ size += snapshot_additional_pages(zone);
+ count += zone_page_state(zone, NR_FREE_PAGES);
+ count -= zone->pages_min;
+ }
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ /* Compute the maximum number of saveable pages to leave in memory. */
+ max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
+ size = DIV_ROUND_UP(image_size, PAGE_SIZE);
+ if (size > max_size)
+ size = max_size;
+ /*
+ * If the maximum is not less than the current number of saveable pages
+ * in memory, we don't need to do anything more.
+ */
+ if (size >= saveable)
+ goto out;
+
+ /*
+ * Let the memory management subsystem know that we're going to need a
+ * large number of page frames to allocate and make it free some memory.
+ * NOTE: If this is not done, performance is heavily affected in some
+ * test cases.
+ */
+ shrink_all_memory(saveable - size);
+
+ /*
+ * Prevent the OOM killer from triggering while we're allocating image
+ * memory.
+ */
+ for_each_populated_zone(zone)
+ zone_set_flag(zone, ZONE_OOM_LOCKED);
+ /*
+ * The number of saveable pages in memory was too high, so apply some
+ * pressure to decrease it. First, make room for the largest possible
+ * image and fail if that doesn't work. Next, try to decrease the size
+ * of the image as much as indicated by image_size.
+ */
+ count -= max_size;
+ pages = preallocate_image_memory(count);
+ if (pages < count)
+ error = -ENOMEM;
+ else
+ pages += preallocate_image_memory(max_size - size);
+
+ for_each_populated_zone(zone)
+ zone_clear_flag(zone, ZONE_OOM_LOCKED);
+
+ /* Release all of the preallocated page frames. */
+ swsusp_free();
+
+ if (error) {
+ printk(KERN_CONT "\n");
+ return error;
+ }
+
+ out:
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
return 0;
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 23:11 ` Rafael J. Wysocki
(?)
@ 2009-05-08 1:16 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 580+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08 1:16 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, jens.axboe, linux-kernel, alan-jenkins,
David Rientjes, Andrew Morton, fengguang.wu, torvalds, linux-pm
On Fri, 8 May 2009 01:11:30 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> + for_each_populated_zone(zone)
> + zone_set_flag(zone, ZONE_OOM_LOCKED);
> + for_each_populated_zone(zone)
> + zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +
Wouldn't it be better to turn the above two loops into functions and move them to mm/oom_kill.c?
Thanks,
-Kame
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-08 13:42 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-08 13:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, Andrew Morton, fengguang.wu, linux-pm, pavel,
torvalds, jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Friday 08 May 2009, KAMEZAWA Hiroyuki wrote:
> On Fri, 8 May 2009 01:11:30 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > + for_each_populated_zone(zone)
> > + zone_set_flag(zone, ZONE_OOM_LOCKED);
>
> > + for_each_populated_zone(zone)
> > + zone_clear_flag(zone, ZONE_OOM_LOCKED);
> > +
>
> Isn't it better to make above 2 be functions and move to mm/oom_kill.c ?
Hmm, OK. I'll do it.
Well, in fact snapshot.c is all about memory management, so perhaps it's a good
idea to move it into mm as a whole. ;-)
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 23:11 ` Rafael J. Wysocki
` (2 preceding siblings ...)
(?)
@ 2009-05-08 9:50 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-08 9:50 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Rientjes, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, kernel-testers, torvalds, linux-pm
On Fri, May 08, 2009 at 07:11:30AM +0800, Rafael J. Wysocki wrote:
> On Friday 08 May 2009, David Rientjes wrote:
> > On Thu, 7 May 2009, Andrew Morton wrote:
> >
> > > The setting and clearing of that thing looks gruesomely racy..
> > >
> >
> > It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> > gets test/set and cleared atomically for the entire zonelist (the clear
> > happens for the same zonelist that was test/set).
> >
> > Using it for hibernation in the way I've proposed will open it up to the
> > race I earlier described: when a kthread is in the oom killer and
> > subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> > frozen so they can't be in the oom killer). That's perfectly acceptable,
> > however, since the system is by definition already oom if kthreads can't
> > get memory so it will end up killing a user task even though it's stuck in
> > D state and will exit on thaw; we aren't concerned about killing
> > needlessly because the oom killer becomes a no-op when it finds a task
> > that has already been killed but hasn't exited by way of TIF_MEMDIE.
>
> OK there.
>
> So everyone seems to agree we can do something like in the patch below?
>
> ---
> From: Rafael J. Wysocki <rjw@sisk.pl>
> Subject: PM/Hibernate: Rework shrinking of memory
>
> Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
> just once to make some room for the image and then allocates memory
> to apply more pressure to the memory management subsystem, if
> necessary.
Thanks! Reducing this to a single pass helps memory-abundant laptops considerably :)
> Unfortunately, we don't seem to be able to drop shrink_all_memory()
> entirely just yet, because that would lead to huge performance
> regressions in some test cases.
Yes, but that's not the fault of this patch. In fact some of those regressions
may even act as useful pressure on the page allocation/reclaim code ;)
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 104 insertions(+), 47 deletions(-)
>
> Index: linux-2.6/kernel/power/snapshot.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/snapshot.c
> +++ linux-2.6/kernel/power/snapshot.c
> @@ -1066,69 +1066,126 @@ void swsusp_free(void)
> buffer = NULL;
> }
>
> +/* Helper functions used for the shrinking of memory. */
> +
> /**
> - * swsusp_shrink_memory - Try to free as much memory as needed
> - *
> - * ... but do not OOM-kill anyone
> + * preallocate_image_memory - Allocate given number of page frames
> + * @nr_pages: Number of page frames to allocate
> *
> - * Notice: all userland should be stopped before it is called, or
> - * livelock is possible.
> + * Return value: Number of page frames actually allocated
> */
> -
> -#define SHRINK_BITE 10000
> -static inline unsigned long __shrink_memory(long tmp)
> +static unsigned long preallocate_image_memory(unsigned long nr_pages)
> {
> - if (tmp > SHRINK_BITE)
> - tmp = SHRINK_BITE;
> - return shrink_all_memory(tmp);
> + unsigned long nr_alloc = 0;
> +
> + while (nr_pages > 0) {
> + if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
> + break;
> + nr_pages--;
> + nr_alloc++;
> + }
> +
> + return nr_alloc;
> }
>
> +/**
> + * swsusp_shrink_memory - Make the kernel release as much memory as needed
> + *
> + * To create a hibernation image it is necessary to make a copy of every page
> + * frame in use. We also need a number of page frames to be free during
> + * hibernation for allocations made while saving the image and for device
> + * drivers, in case they need to allocate memory from their hibernation
> + * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
> + * respectively, both of which are rough estimates). To make this happen, we
> + * compute the total number of available page frames and allocate at least
> + *
> + * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
> + *
> + * of them, which corresponds to the maximum size of a hibernation image.
> + *
> + * If image_size is set below the number following from the above formula,
> + * the preallocation of memory is continued until the total number of page
> + * frames in use is below the requested image size or it is impossible to
> + * allocate more memory, whichever happens first.
> + */
> int swsusp_shrink_memory(void)
> {
> - long tmp;
> struct zone *zone;
> - unsigned long pages = 0;
> - unsigned int i = 0;
> - char *p = "-\\|/";
> + unsigned long saveable, size, max_size, count, pages = 0;
> struct timeval start, stop;
> + int error = 0;
>
> - printk(KERN_INFO "PM: Shrinking memory... ");
> + printk(KERN_INFO "PM: Shrinking memory ... ");
> do_gettimeofday(&start);
> - do {
> - long size, highmem_size;
>
> - highmem_size = count_highmem_pages();
> - size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
> - tmp = size;
> - size += highmem_size;
> - for_each_populated_zone(zone) {
> - tmp += snapshot_additional_pages(zone);
> - if (is_highmem(zone)) {
> - highmem_size -=
> - zone_page_state(zone, NR_FREE_PAGES);
> - } else {
> - tmp -= zone_page_state(zone, NR_FREE_PAGES);
> - tmp += zone->lowmem_reserve[ZONE_NORMAL];
> - }
> - }
> + /* Count the number of saveable data pages. */
> + saveable = count_data_pages() + count_highmem_pages();
>
> - if (highmem_size < 0)
> - highmem_size = 0;
> + /*
> + * Compute the total number of page frames we can use (count) and the
> + * number of pages needed for image metadata (size).
> + */
> + count = saveable;
> + size = 0;
> + for_each_populated_zone(zone) {
> + size += snapshot_additional_pages(zone);
> + count += zone_page_state(zone, NR_FREE_PAGES);
> + count -= zone->pages_min;
I'd prefer to be safer, by removing the above line...
> + }
...and add another line here:
count -= totalreserve_pages;
But note that 'count' counts "saveable+free" memory, while we don't have a
counter estimating "free+freeable" memory, i.e. the threshold above which
we can be sure preallocation must fail.
One applicable situation: there is 800M of anonymous memory, but only a
500M image_size and no swap space.
In that case we would otherwise go down the OOM code path. Sure, OOM is
(and shall be) reliably disabled during hibernation, but we should still be
cautious enough not to create a low-memory situation, which would hurt:
- hibernation speed
(vmscan goes mad trying to squeeze the last free page)
- user experiences after resume
(all *active* file data and metadata have to be reloaded)
The current code simply tries *too hard* to meet image_size.
I'd rather treat it as mild advice, and only free
"free+freeable-margin" pages when image_size is not reachable.
The safety margin can be totalreserve_pages, plus enough pages for
retaining the "hard core working set".
Thanks,
Fengguang
> - tmp += highmem_size;
> - if (tmp > 0) {
> - tmp = __shrink_memory(tmp);
> - if (!tmp)
> - return -ENOMEM;
> - pages += tmp;
> - } else if (size > image_size / PAGE_SIZE) {
> - tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
> - pages += tmp;
> - }
> - printk("\b%c", p[i++%4]);
> - } while (tmp > 0);
> + /* Compute the maximum number of saveable pages to leave in memory. */
> + max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
> + size = DIV_ROUND_UP(image_size, PAGE_SIZE);
> + if (size > max_size)
> + size = max_size;
> + /*
> + * If the maximum is not less than the current number of saveable pages
> + * in memory, we don't need to do anything more.
> + */
> + if (size >= saveable)
> + goto out;
> +
> + /*
> + * Let the memory management subsystem know that we're going to need a
> + * large number of page frames to allocate and make it free some memory.
> + * NOTE: If this is not done, performance is heavily affected in some
> + * test cases.
> + */
> + shrink_all_memory(saveable - size);
> +
> + /*
> + * Prevent the OOM killer from triggering while we're allocating image
> + * memory.
> + */
> + for_each_populated_zone(zone)
> + zone_set_flag(zone, ZONE_OOM_LOCKED);
> + /*
> + * The number of saveable pages in memory was too high, so apply some
> + * pressure to decrease it. First, make room for the largest possible
> + * image and fail if that doesn't work. Next, try to decrease the size
> + * of the image as much as indicated by image_size.
> + */
> + count -= max_size;
> + pages = preallocate_image_memory(count);
> + if (pages < count)
> + error = -ENOMEM;
> + else
> + pages += preallocate_image_memory(max_size - size);
> +
> + for_each_populated_zone(zone)
> + zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +
> + /* Release all of the preallocated page frames. */
> + swsusp_free();
> +
> + if (error) {
> + printk(KERN_CONT "\n");
> + return error;
> + }
> +
> + out:
> do_gettimeofday(&stop);
> - printk("\bdone (%lu pages freed)\n", pages);
> + printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
> swsusp_show_speed(&start, &stop, pages, "Freed");
>
> return 0;
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-08 9:50 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-08 9:50 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Rientjes, Andrew Morton, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Fri, May 08, 2009 at 07:11:30AM +0800, Rafael J. Wysocki wrote:
> On Friday 08 May 2009, David Rientjes wrote:
> > On Thu, 7 May 2009, Andrew Morton wrote:
> >
> > > The setting and clearing of that thing looks gruesomely racy..
> > >
> >
> > It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> > gets test/set and cleared atomically for the entire zonelist (the clear
> > happens for the same zonelist that was test/set).
> >
> > Using it for hibernation in the way I've proposed will open it up to the
> > race I earlier described: when a kthread is in the oom killer and
> > subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> > frozen so they can't be in the oom killer). That's perfectly acceptable,
> > however, since the system is by definition already oom if kthreads can't
> > get memory so it will end up killing a user task even though it's stuck in
> > D state and will exit on thaw; we aren't concerned about killing
> > needlessly because the oom killer becomes a no-op when it finds a task
> > that has already been killed but hasn't exited by way of TIF_MEMDIE.
>
> OK there.
>
> So everyone seems to agree we can do something like in the patch below?
>
> ---
> From: Rafael J. Wysocki <rjw@sisk.pl>
> Subject: PM/Hibernate: Rework shrinking of memory
>
> Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
> just once to make some room for the image and then allocates memory
> to apply more pressure to the memory management subsystem, if
> necessary.
Thanks! Reducing to single-pass helps memory bounty laptops considerably :)
> Unfortunately, we don't seem to be able to drop shrink_all_memory()
> entirely just yet, because that would lead to huge performance
> regressions in some test cases.
Yes, but that's not the fault of this patch. In fact, some regressions
may even be useful pressure on the page allocation/reclaim code ;)
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
> 1 file changed, 104 insertions(+), 47 deletions(-)
>
> Index: linux-2.6/kernel/power/snapshot.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/snapshot.c
> +++ linux-2.6/kernel/power/snapshot.c
> @@ -1066,69 +1066,126 @@ void swsusp_free(void)
> buffer = NULL;
> }
>
> +/* Helper functions used for the shrinking of memory. */
> +
> /**
> - * swsusp_shrink_memory - Try to free as much memory as needed
> - *
> - * ... but do not OOM-kill anyone
> + * preallocate_image_memory - Allocate given number of page frames
> + * @nr_pages: Number of page frames to allocate
> *
> - * Notice: all userland should be stopped before it is called, or
> - * livelock is possible.
> + * Return value: Number of page frames actually allocated
> */
> -
> -#define SHRINK_BITE 10000
> -static inline unsigned long __shrink_memory(long tmp)
> +static unsigned long preallocate_image_memory(unsigned long nr_pages)
> {
> - if (tmp > SHRINK_BITE)
> - tmp = SHRINK_BITE;
> - return shrink_all_memory(tmp);
> + unsigned long nr_alloc = 0;
> +
> + while (nr_pages > 0) {
> + if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
> + break;
> + nr_pages--;
> + nr_alloc++;
> + }
> +
> + return nr_alloc;
> }
>
> +/**
> + * swsusp_shrink_memory - Make the kernel release as much memory as needed
> + *
> + * To create a hibernation image it is necessary to make a copy of every page
> + * frame in use. We also need a number of page frames to be free during
> + * hibernation for allocations made while saving the image and for device
> + * drivers, in case they need to allocate memory from their hibernation
> + * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
> + * respectively, both of which are rough estimates). To make this happen, we
> + * compute the total number of available page frames and allocate at least
> + *
> + * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
> + *
> + * of them, which corresponds to the maximum size of a hibernation image.
> + *
> + * If image_size is set below the number following from the above formula,
> + * the preallocation of memory is continued until the total number of page
> + * frames in use is below the requested image size or it is impossible to
> + * allocate more memory, whichever happens first.
> + */
> int swsusp_shrink_memory(void)
> {
> - long tmp;
> struct zone *zone;
> - unsigned long pages = 0;
> - unsigned int i = 0;
> - char *p = "-\\|/";
> + unsigned long saveable, size, max_size, count, pages = 0;
> struct timeval start, stop;
> + int error = 0;
>
> - printk(KERN_INFO "PM: Shrinking memory... ");
> + printk(KERN_INFO "PM: Shrinking memory ... ");
> do_gettimeofday(&start);
> - do {
> - long size, highmem_size;
>
> - highmem_size = count_highmem_pages();
> - size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
> - tmp = size;
> - size += highmem_size;
> - for_each_populated_zone(zone) {
> - tmp += snapshot_additional_pages(zone);
> - if (is_highmem(zone)) {
> - highmem_size -=
> - zone_page_state(zone, NR_FREE_PAGES);
> - } else {
> - tmp -= zone_page_state(zone, NR_FREE_PAGES);
> - tmp += zone->lowmem_reserve[ZONE_NORMAL];
> - }
> - }
> + /* Count the number of saveable data pages. */
> + saveable = count_data_pages() + count_highmem_pages();
>
> - if (highmem_size < 0)
> - highmem_size = 0;
> + /*
> + * Compute the total number of page frames we can use (count) and the
> + * number of pages needed for image metadata (size).
> + */
> + count = saveable;
> + size = 0;
> + for_each_populated_zone(zone) {
> + size += snapshot_additional_pages(zone);
> + count += zone_page_state(zone, NR_FREE_PAGES);
> + count -= zone->pages_min;
I'd prefer to be safer by removing the above line...
> + }
...and add another line here:
count -= totalreserve_pages;
But hey, that 'count' counts "saveable+free" memory.
We don't have a counter that estimates "free+freeable" memory,
i.e. the threshold above which we are sure we cannot preallocate.
One situation where this matters is when there is 800M of anonymous
memory, but only a 500M image_size and no swap space.
In that case we will otherwise go down the OOM code path. Sure, OOM is
(and shall be) reliably disabled during hibernation, but we should still be
cautious enough not to create a low-memory situation, which will hurt:
- hibernation speed
(vmscan goes mad trying to squeeze out the last free page)
- user experience after resume
(all *active* file data and metadata have to be reloaded)
The current code simply tries *too hard* to meet image_size.
I'd rather take image_size as mild advice, and only free
"free+freeable-margin" pages when image_size is not attainable.
The safety margin could be totalreserve_pages, plus enough pages for
retaining the "hard core working set".
Thanks,
Fengguang
> - tmp += highmem_size;
> - if (tmp > 0) {
> - tmp = __shrink_memory(tmp);
> - if (!tmp)
> - return -ENOMEM;
> - pages += tmp;
> - } else if (size > image_size / PAGE_SIZE) {
> - tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
> - pages += tmp;
> - }
> - printk("\b%c", p[i++%4]);
> - } while (tmp > 0);
> + /* Compute the maximum number of saveable pages to leave in memory. */
> + max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
> + size = DIV_ROUND_UP(image_size, PAGE_SIZE);
> + if (size > max_size)
> + size = max_size;
> + /*
> + * If the maximum is not less than the current number of saveable pages
> + * in memory, we don't need to do anything more.
> + */
> + if (size >= saveable)
> + goto out;
> +
> + /*
> + * Let the memory management subsystem know that we're going to need a
> + * large number of page frames to allocate and make it free some memory.
> + * NOTE: If this is not done, performance is heavily affected in some
> + * test cases.
> + */
> + shrink_all_memory(saveable - size);
> +
> + /*
> + * Prevent the OOM killer from triggering while we're allocating image
> + * memory.
> + */
> + for_each_populated_zone(zone)
> + zone_set_flag(zone, ZONE_OOM_LOCKED);
> + /*
> + * The number of saveable pages in memory was too high, so apply some
> + * pressure to decrease it. First, make room for the largest possible
> + * image and fail if that doesn't work. Next, try to decrease the size
> + * of the image as much as indicated by image_size.
> + */
> + count -= max_size;
> + pages = preallocate_image_memory(count);
> + if (pages < count)
> + error = -ENOMEM;
> + else
> + pages += preallocate_image_memory(max_size - size);
> +
> + for_each_populated_zone(zone)
> + zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +
> + /* Release all of the preallocated page frames. */
> + swsusp_free();
> +
> + if (error) {
> + printk(KERN_CONT "\n");
> + return error;
> + }
> +
> + out:
> do_gettimeofday(&stop);
> - printk("\bdone (%lu pages freed)\n", pages);
> + printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
> swsusp_show_speed(&start, &stop, pages, "Freed");
>
> return 0;
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-08 9:50 ` Wu Fengguang
@ 2009-05-08 13:51 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-08 13:51 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, kernel-testers, torvalds, linux-pm
On Friday 08 May 2009, Wu Fengguang wrote:
> On Fri, May 08, 2009 at 07:11:30AM +0800, Rafael J. Wysocki wrote:
> > On Friday 08 May 2009, David Rientjes wrote:
> > > On Thu, 7 May 2009, Andrew Morton wrote:
> > >
> > > > The setting and clearing of that thing looks gruesomely racy..
> > > >
> > >
> > > It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> > > gets test/set and cleared atomically for the entire zonelist (the clear
> > > happens for the same zonelist that was test/set).
> > >
> > > Using it for hibernation in the way I've proposed will open it up to the
> > > race I earlier described: when a kthread is in the oom killer and
> > > subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> > > frozen so they can't be in the oom killer). That's perfectly acceptable,
> > > however, since the system is by definition already oom if kthreads can't
> > > get memory so it will end up killing a user task even though it's stuck in
> > > D state and will exit on thaw; we aren't concerned about killing
> > > needlessly because the oom killer becomes a no-op when it finds a task
> > > that has already been killed but hasn't exited by way of TIF_MEMDIE.
> >
> > OK there.
> >
> > So everyone seems to agree we can do something like in the patch below?
> >
> > ---
> > From: Rafael J. Wysocki <rjw@sisk.pl>
> > Subject: PM/Hibernate: Rework shrinking of memory
> >
> > Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
> > just once to make some room for the image and then allocates memory
> > to apply more pressure to the memory management subsystem, if
> > necessary.
>
> Thanks! Reducing to single-pass helps memory bounty laptops considerably :)
>
> > Unfortunately, we don't seem to be able to drop shrink_all_memory()
> > entirely just yet, because that would lead to huge performance
> > regressions in some test cases.
>
> Yes, but it's not the fault of this patch. In fact some regressions
> may even be positive pressures to the page allocate/reclaim code ;)
>
> > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> > ---
> > kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 104 insertions(+), 47 deletions(-)
> >
> > Index: linux-2.6/kernel/power/snapshot.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/snapshot.c
> > +++ linux-2.6/kernel/power/snapshot.c
> > @@ -1066,69 +1066,126 @@ void swsusp_free(void)
> > buffer = NULL;
> > }
> >
> > +/* Helper functions used for the shrinking of memory. */
> > +
> > /**
> > - * swsusp_shrink_memory - Try to free as much memory as needed
> > - *
> > - * ... but do not OOM-kill anyone
> > + * preallocate_image_memory - Allocate given number of page frames
> > + * @nr_pages: Number of page frames to allocate
> > *
> > - * Notice: all userland should be stopped before it is called, or
> > - * livelock is possible.
> > + * Return value: Number of page frames actually allocated
> > */
> > -
> > -#define SHRINK_BITE 10000
> > -static inline unsigned long __shrink_memory(long tmp)
> > +static unsigned long preallocate_image_memory(unsigned long nr_pages)
> > {
> > - if (tmp > SHRINK_BITE)
> > - tmp = SHRINK_BITE;
> > - return shrink_all_memory(tmp);
> > + unsigned long nr_alloc = 0;
> > +
> > + while (nr_pages > 0) {
> > + if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
> > + break;
> > + nr_pages--;
> > + nr_alloc++;
> > + }
> > +
> > + return nr_alloc;
> > }
> >
> > +/**
> > + * swsusp_shrink_memory - Make the kernel release as much memory as needed
> > + *
> > + * To create a hibernation image it is necessary to make a copy of every page
> > + * frame in use. We also need a number of page frames to be free during
> > + * hibernation for allocations made while saving the image and for device
> > + * drivers, in case they need to allocate memory from their hibernation
> > + * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
> > + * respectively, both of which are rough estimates). To make this happen, we
> > + * compute the total number of available page frames and allocate at least
> > + *
> > + * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
> > + *
> > + * of them, which corresponds to the maximum size of a hibernation image.
> > + *
> > + * If image_size is set below the number following from the above formula,
> > + * the preallocation of memory is continued until the total number of page
> > + * frames in use is below the requested image size or it is impossible to
> > + * allocate more memory, whichever happens first.
> > + */
> > int swsusp_shrink_memory(void)
> > {
> > - long tmp;
> > struct zone *zone;
> > - unsigned long pages = 0;
> > - unsigned int i = 0;
> > - char *p = "-\\|/";
> > + unsigned long saveable, size, max_size, count, pages = 0;
> > struct timeval start, stop;
> > + int error = 0;
> >
> > - printk(KERN_INFO "PM: Shrinking memory... ");
> > + printk(KERN_INFO "PM: Shrinking memory ... ");
> > do_gettimeofday(&start);
> > - do {
> > - long size, highmem_size;
> >
> > - highmem_size = count_highmem_pages();
> > - size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
> > - tmp = size;
> > - size += highmem_size;
> > - for_each_populated_zone(zone) {
> > - tmp += snapshot_additional_pages(zone);
> > - if (is_highmem(zone)) {
> > - highmem_size -=
> > - zone_page_state(zone, NR_FREE_PAGES);
> > - } else {
> > - tmp -= zone_page_state(zone, NR_FREE_PAGES);
> > - tmp += zone->lowmem_reserve[ZONE_NORMAL];
> > - }
> > - }
> > + /* Count the number of saveable data pages. */
> > + saveable = count_data_pages() + count_highmem_pages();
> >
> > - if (highmem_size < 0)
> > - highmem_size = 0;
> > + /*
> > + * Compute the total number of page frames we can use (count) and the
> > + * number of pages needed for image metadata (size).
> > + */
> > + count = saveable;
> > + size = 0;
> > + for_each_populated_zone(zone) {
> > + size += snapshot_additional_pages(zone);
> > + count += zone_page_state(zone, NR_FREE_PAGES);
> > + count -= zone->pages_min;
>
> I'd prefer to be more safe, by removing the above line...
>
> > + }
>
> ...and add another line here:
>
> count -= totalreserve_pages;
OK
> But hey, that 'count' counts "savable+free" memory.
> We don't have a counter for an estimation of "free+freeable" memory,
> ie. we are sure we cannot preallocate above that threshold.
>
> One applicable situation is, when there are 800M anonymous memory,
> but only 500M image_size and no swap space.
>
> In that case we will otherwise goto the oom code path. Sure oom is
> (and shall be) reliably disabled in hibernation, but still we shall be
> cautious enough not to create a low memory situation, which will hurt:
> - hibernation speed
> (vmscan goes mad trying to squeeze the last free page)
> - user experiences after resume
> (all *active* file data and metadata have to reloaded)
Strangely enough, my recent testing with this patch doesn't confirm the
theory. :-) Namely, I set image_size too low on purpose and it only caused
preallocate_image_memory() to return NULL at one point and that was it.
It didn't even took too much time.
I'll carry out more testing to verify this observation.
> The current code simply tries *too hard* to meet image_size.
> I'd rather take that as a mild advice, and to only free
> "free+freeable-margin" pages when image_size is not approachable.
>
> The safety margin can be totalreserve_pages, plus enough pages for
> retaining the "hard core working set".
How to compute the size of the "hard core working set", then?
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
> > > happens for the same zonelist that was test/set).
> > >
> > > Using it for hibernation in the way I've proposed will open it up to the
> > > race I earlier described: when a kthread is in the oom killer and
> > > subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> > > frozen so they can't be in the oom killer). That's perfectly acceptable,
> > > however, since the system is by definition already oom if kthreads can't
> > > get memory so it will end up killing a user task even though it's stuck in
> > > D state and will exit on thaw; we aren't concerned about killing
> > > needlessly because the oom killer becomes a no-op when it finds a task
> > > that has already been killed but hasn't exited by way of TIF_MEMDIE.
> >
> > OK there.
> >
> > So everyone seems to agree we can do something like in the patch below?
> >
> > ---
> > From: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
> > Subject: PM/Hibernate: Rework shrinking of memory
> >
> > Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
> > just once to make some room for the image and then allocates memory
> > to apply more pressure to the memory management subsystem, if
> > necessary.
>
> Thanks! Reducing to a single pass helps memory-rich laptops considerably :)
>
> > Unfortunately, we don't seem to be able to drop shrink_all_memory()
> > entirely just yet, because that would lead to huge performance
> > regressions in some test cases.
>
> Yes, but it's not the fault of this patch. In fact some regressions
> may even be positive pressures to the page allocate/reclaim code ;)
>
> > Signed-off-by: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
> > ---
> > kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 104 insertions(+), 47 deletions(-)
> >
> > Index: linux-2.6/kernel/power/snapshot.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/snapshot.c
> > +++ linux-2.6/kernel/power/snapshot.c
> > @@ -1066,69 +1066,126 @@ void swsusp_free(void)
> > buffer = NULL;
> > }
> >
> > +/* Helper functions used for the shrinking of memory. */
> > +
> > /**
> > - * swsusp_shrink_memory - Try to free as much memory as needed
> > - *
> > - * ... but do not OOM-kill anyone
> > + * preallocate_image_memory - Allocate given number of page frames
> > + * @nr_pages: Number of page frames to allocate
> > *
> > - * Notice: all userland should be stopped before it is called, or
> > - * livelock is possible.
> > + * Return value: Number of page frames actually allocated
> > */
> > -
> > -#define SHRINK_BITE 10000
> > -static inline unsigned long __shrink_memory(long tmp)
> > +static unsigned long preallocate_image_memory(unsigned long nr_pages)
> > {
> > - if (tmp > SHRINK_BITE)
> > - tmp = SHRINK_BITE;
> > - return shrink_all_memory(tmp);
> > + unsigned long nr_alloc = 0;
> > +
> > + while (nr_pages > 0) {
> > + if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
> > + break;
> > + nr_pages--;
> > + nr_alloc++;
> > + }
> > +
> > + return nr_alloc;
> > }
> >
> > +/**
> > + * swsusp_shrink_memory - Make the kernel release as much memory as needed
> > + *
> > + * To create a hibernation image it is necessary to make a copy of every page
> > + * frame in use. We also need a number of page frames to be free during
> > + * hibernation for allocations made while saving the image and for device
> > + * drivers, in case they need to allocate memory from their hibernation
> > + * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
> > + * respectively, both of which are rough estimates). To make this happen, we
> > + * compute the total number of available page frames and allocate at least
> > + *
> > + * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
> > + *
> > + * of them, which corresponds to the maximum size of a hibernation image.
> > + *
> > + * If image_size is set below the number following from the above formula,
> > + * the preallocation of memory is continued until the total number of page
> > + * frames in use is below the requested image size or it is impossible to
> > + * allocate more memory, whichever happens first.
> > + */
> > int swsusp_shrink_memory(void)
> > {
> > - long tmp;
> > struct zone *zone;
> > - unsigned long pages = 0;
> > - unsigned int i = 0;
> > - char *p = "-\\|/";
> > + unsigned long saveable, size, max_size, count, pages = 0;
> > struct timeval start, stop;
> > + int error = 0;
> >
> > - printk(KERN_INFO "PM: Shrinking memory... ");
> > + printk(KERN_INFO "PM: Shrinking memory ... ");
> > do_gettimeofday(&start);
> > - do {
> > - long size, highmem_size;
> >
> > - highmem_size = count_highmem_pages();
> > - size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
> > - tmp = size;
> > - size += highmem_size;
> > - for_each_populated_zone(zone) {
> > - tmp += snapshot_additional_pages(zone);
> > - if (is_highmem(zone)) {
> > - highmem_size -=
> > - zone_page_state(zone, NR_FREE_PAGES);
> > - } else {
> > - tmp -= zone_page_state(zone, NR_FREE_PAGES);
> > - tmp += zone->lowmem_reserve[ZONE_NORMAL];
> > - }
> > - }
> > + /* Count the number of saveable data pages. */
> > + saveable = count_data_pages() + count_highmem_pages();
> >
> > - if (highmem_size < 0)
> > - highmem_size = 0;
> > + /*
> > + * Compute the total number of page frames we can use (count) and the
> > + * number of pages needed for image metadata (size).
> > + */
> > + count = saveable;
> > + size = 0;
> > + for_each_populated_zone(zone) {
> > + size += snapshot_additional_pages(zone);
> > + count += zone_page_state(zone, NR_FREE_PAGES);
> > + count -= zone->pages_min;
>
> I'd prefer to be more safe, by removing the above line...
>
> > + }
>
> ...and add another line here:
>
> count -= totalreserve_pages;
OK
> But hey, that 'count' counts "savable+free" memory.
> We don't have a counter for an estimation of "free+freeable" memory,
> ie. we are sure we cannot preallocate above that threshold.
>
> One applicable situation is, when there are 800M anonymous memory,
> but only 500M image_size and no swap space.
>
> In that case we will otherwise goto the oom code path. Sure oom is
> (and shall be) reliably disabled in hibernation, but still we shall be
> cautious enough not to create a low memory situation, which will hurt:
> - hibernation speed
> (vmscan goes mad trying to squeeze the last free page)
> - user experiences after resume
> (all *active* file data and metadata have to be reloaded)
Strangely enough, my recent testing with this patch doesn't confirm the
theory. :-) Namely, I set image_size too low on purpose and it only caused
preallocate_image_memory() to return NULL at one point and that was it.
It didn't even take too much time.
I'll carry out more testing to verify this observation.
> The current code simply tries *too hard* to meet image_size.
> I'd rather take that as a mild advice, and to only free
> "free+freeable-margin" pages when image_size is not approachable.
>
> The safety margin can be totalreserve_pages, plus enough pages for
> retaining the "hard core working set".
How to compute the size of the "hard core working set", then?
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-08 13:51 ` Rafael J. Wysocki
@ 2009-05-09 0:08 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 0:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, Andrew Morton, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Friday 08 May 2009, Rafael J. Wysocki wrote:
> On Friday 08 May 2009, Wu Fengguang wrote:
[--snip--]
> > But hey, that 'count' counts "savable+free" memory.
> > We don't have a counter for an estimation of "free+freeable" memory,
> > ie. we are sure we cannot preallocate above that threshold.
> >
> > One applicable situation is, when there are 800M anonymous memory,
> > but only 500M image_size and no swap space.
> >
> > In that case we will otherwise goto the oom code path. Sure oom is
> > (and shall be) reliably disabled in hibernation, but still we shall be
> > cautious enough not to create a low memory situation, which will hurt:
> > - hibernation speed
> > (vmscan goes mad trying to squeeze the last free page)
> > - user experiences after resume
> > (all *active* file data and metadata have to reloaded)
>
> Strangely enough, my recent testing with this patch doesn't confirm the
> theory. :-) Namely, I set image_size too low on purpose and it only caused
> preallocate_image_memory() to return NULL at one point and that was it.
>
> It didn't even took too much time.
>
> I'll carry out more testing to verify this observation.
I can confirm that even if image_size is below the minimum we can get,
the second preallocate_image_memory() just returns after allocating fewer pages
than it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
approach, as I wrote in the previous message in this thread) and nothing bad
happens.
That may be because we freeze the mm kernel threads, but I've also tested
without freezing them and it still worked the same way.
> > The current code simply tries *too hard* to meet image_size.
> > I'd rather take that as a mild advice, and to only free
> > "free+freeable-margin" pages when image_size is not approachable.
> >
> > The safety margin can be totalreserve_pages, plus enough pages for
> > retaining the "hard core working set".
>
> How to compute the size of the "hard core working set", then?
Well, I'm still interested in the answer here. ;-)
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 0:08 ` Rafael J. Wysocki
@ 2009-05-09 7:34 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-09 7:34 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Rientjes, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, kernel-testers, torvalds, linux-pm
On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > On Friday 08 May 2009, Wu Fengguang wrote:
> [--snip--]
> > > But hey, that 'count' counts "savable+free" memory.
> > > We don't have a counter for an estimation of "free+freeable" memory,
> > > ie. we are sure we cannot preallocate above that threshold.
> > >
> > > One applicable situation is, when there are 800M anonymous memory,
> > > but only 500M image_size and no swap space.
> > >
> > > In that case we will otherwise goto the oom code path. Sure oom is
> > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > cautious enough not to create a low memory situation, which will hurt:
> > > - hibernation speed
> > > (vmscan goes mad trying to squeeze the last free page)
> > > - user experiences after resume
> > > (all *active* file data and metadata have to reloaded)
> >
> > Strangely enough, my recent testing with this patch doesn't confirm the
> > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > preallocate_image_memory() to return NULL at one point and that was it.
> >
> > It didn't even took too much time.
> >
> > I'll carry out more testing to verify this observation.
>
> I can confirm that even if image_size is below the minimum we can get,
Which minimum please?
> the second preallocate_image_memory() just returns after allocating fewer pages
> that it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> approach, as I wrote in the previous message in this thread) and nothing bad
> happens.
>
> That may be because we freeze the mm kernel threads, but I've also tested
> without freezing them and it's still worked the same way.
>
> > > The current code simply tries *too hard* to meet image_size.
> > > I'd rather take that as a mild advice, and to only free
> > > "free+freeable-margin" pages when image_size is not approachable.
> > >
> > > The safety margin can be totalreserve_pages, plus enough pages for
> > > retaining the "hard core working set".
> >
> > How to compute the size of the "hard core working set", then?
>
> Well, I'm still interested in the answer here. ;-)
A tough question ;-)
We can start with the following formula; it should be called *after*
the initial memory shrinking.

/* a typical desktop does not have more than 100MB of mapped pages */
#define MAX_MMAP_PAGES	(100 << (20 - PAGE_SHIFT))

unsigned long hard_core_working_set(void)
{
	unsigned long nr;

	/*
	 * Mapped pages are normally small and precious,
	 * but shall be bounded for safety.
	 */
	nr = global_page_state(NR_FILE_MAPPED);
	nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);

	/*
	 * If there is no swap space, this is a hard request;
	 * otherwise it is an optimization
	 * (the disk image I/O can be much faster than swap I/O).
	 */
	nr += global_page_state(NR_ACTIVE_ANON);
	nr += global_page_state(NR_INACTIVE_ANON);

	/* hard (but normally small) memory requests */
	nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
	nr += global_page_state(NR_UNEVICTABLE);
	nr += global_page_state(NR_PAGETABLE);

	return nr;
}
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 7:34 ` Wu Fengguang
@ 2009-05-09 19:22 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 19:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, kernel-testers, torvalds, linux-pm
On Saturday 09 May 2009, Wu Fengguang wrote:
> On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> > On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > > On Friday 08 May 2009, Wu Fengguang wrote:
> > [--snip--]
> > > > But hey, that 'count' counts "savable+free" memory.
> > > > We don't have a counter for an estimation of "free+freeable" memory,
> > > > ie. we are sure we cannot preallocate above that threshold.
> > > >
> > > > One applicable situation is, when there are 800M anonymous memory,
> > > > but only 500M image_size and no swap space.
> > > >
> > > > In that case we will otherwise goto the oom code path. Sure oom is
> > > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > > cautious enough not to create a low memory situation, which will hurt:
> > > > - hibernation speed
> > > > (vmscan goes mad trying to squeeze the last free page)
> > > > - user experiences after resume
> > > > (all *active* file data and metadata have to reloaded)
> > >
> > > Strangely enough, my recent testing with this patch doesn't confirm the
> > > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > > preallocate_image_memory() to return NULL at one point and that was it.
> > >
> > > It didn't even took too much time.
> > >
> > > I'll carry out more testing to verify this observation.
> >
> > I can confirm that even if image_size is below the minimum we can get,
>
> Which minimum please?
That was supposed to be an alternative way of saying "below any reasonable
value", but it wasn't very precise indeed.
I should have said that for a given system there was a minimum number of saveable
pages that hibernate_preallocate_memory() left in memory and it just couldn't
go below that limit. If image_size is set below this number, the
preallocate_image_memory(max_size - size) call returns fewer pages than it's
been requested to allocate and that's it. No disasters, nothing wrong.
> > the second preallocate_image_memory() just returns after allocating fewer pages
> > that it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> > approach, as I wrote in the previous message in this thread) and nothing bad
> > happens.
> >
> > That may be because we freeze the mm kernel threads, but I've also tested
> > without freezing them and it's still worked the same way.
> >
> > > > The current code simply tries *too hard* to meet image_size.
> > > > I'd rather take that as a mild advice, and to only free
> > > > "free+freeable-margin" pages when image_size is not approachable.
> > > >
> > > > The safety margin can be totalreserve_pages, plus enough pages for
> > > > retaining the "hard core working set".
> > >
> > > How to compute the size of the "hard core working set", then?
> >
> > Well, I'm still interested in the answer here. ;-)
>
> A tough question ;-)
>
> We can start with the following formula, this should be called *after*
> the initial memory shrinking.
OK
> /* a typical desktop do not have more than 100MB mapped pages */
> #define MAX_MMAP_PAGES (100 << (20 - PAGE_SHIFT))
> unsigned long hard_core_working_set(void)
> {
> unsigned long nr;
>
> /*
> * mapped pages are normally small and precious,
> * but shall be bounded for safety.
> */
> nr = global_page_state(NR_FILE_MAPPED);
> nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);
>
> /*
> * if no swap space, this is a hard request;
> * otherwise this is an optimization.
> * (the disk image IO can be much faster than swap IO)
Well, if there's no swap space at this point, we won't be able to save the
image anyway, so this is always an optimization IMO. :-)
> */
> nr += global_page_state(NR_ACTIVE_ANON);
> nr += global_page_state(NR_INACTIVE_ANON);
>
> /* hard (but normally small) memory requests */
> nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
> nr += global_page_state(NR_UNEVICTABLE);
> nr += global_page_state(NR_PAGETABLE);
>
> return nr;
> }
OK, thanks.
I'll create a separate patch adding this function and we'll see how it works.
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-09 19:22 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 19:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, Andrew Morton,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA
On Saturday 09 May 2009, Wu Fengguang wrote:
> On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> > On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > > On Friday 08 May 2009, Wu Fengguang wrote:
> > [--snip--]
> > > > But hey, that 'count' counts "savable+free" memory.
> > > > We don't have a counter for an estimation of "free+freeable" memory,
> > > > ie. we are sure we cannot preallocate above that threshold.
> > > >
> > > > One applicable situation is, when there are 800M anonymous memory,
> > > > but only 500M image_size and no swap space.
> > > >
> > > > In that case we will otherwise goto the oom code path. Sure oom is
> > > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > > cautious enough not to create a low memory situation, which will hurt:
> > > > - hibernation speed
> > > > (vmscan goes mad trying to squeeze the last free page)
> > > > - user experiences after resume
> > > > (all *active* file data and metadata have to be reloaded)
> > >
> > > Strangely enough, my recent testing with this patch doesn't confirm the
> > > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > > preallocate_image_memory() to return NULL at one point and that was it.
> > >
> > > It didn't even take too much time.
> > >
> > > I'll carry out more testing to verify this observation.
> >
> > I can confirm that even if image_size is below the minimum we can get,
>
> Which minimum please?
That was supposed to be an alternative way of saying "below any reasonable
value", but it wasn't very precise indeed.
I should have said that for a given system there is a minimum number of saveable
pages that hibernate_preallocate_memory() leaves in memory and it just can't
go below that limit. If image_size is set below this number, the
preallocate_image_memory(max_size - size) call returns fewer pages than it was
requested to allocate and that's it. No disasters, nothing wrong.
> > the second preallocate_image_memory() just returns after allocating fewer pages
> > than it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> > approach, as I wrote in the previous message in this thread) and nothing bad
> > happens.
> >
> > That may be because we freeze the mm kernel threads, but I've also tested
> > without freezing them and it's still worked the same way.
> >
> > > > The current code simply tries *too hard* to meet image_size.
> > > > I'd rather take that as a mild advice, and to only free
> > > > "free+freeable-margin" pages when image_size is not approachable.
> > > >
> > > > The safety margin can be totalreserve_pages, plus enough pages for
> > > > retaining the "hard core working set".
> > >
> > > How to compute the size of the "hard core working set", then?
> >
> > Well, I'm still interested in the answer here. ;-)
>
> A tough question ;-)
>
> We can start with the following formula, this should be called *after*
> the initial memory shrinking.
OK
> /* a typical desktop does not have more than 100MB of mapped pages */
> #define MAX_MMAP_PAGES (100 << (20 - PAGE_SHIFT))
> unsigned long hard_core_working_set(void)
> {
> unsigned long nr;
>
> /*
> * mapped pages are normally small and precious,
> * but shall be bounded for safety.
> */
> nr = global_page_state(NR_FILE_MAPPED);
> nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);
>
> /*
> * if no swap space, this is a hard request;
> * otherwise this is an optimization.
> * (the disk image IO can be much faster than swap IO)
Well, if there's no swap space at this point, we won't be able to save the
image anyway, so this always is an optimization IMO. :-)
> */
> nr += global_page_state(NR_ACTIVE_ANON);
> nr += global_page_state(NR_INACTIVE_ANON);
>
> /* hard (but normally small) memory requests */
> nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
> nr += global_page_state(NR_UNEVICTABLE);
> nr += global_page_state(NR_PAGETABLE);
>
> return nr;
> }
OK, thanks.
I'll create a separate patch adding this function and we'll see how it works.
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 19:22 ` Rafael J. Wysocki
(?)
@ 2009-05-10 4:52 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-10 4:52 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Rientjes, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, kernel-testers, torvalds, linux-pm
On Sun, May 10, 2009 at 03:22:57AM +0800, Rafael J. Wysocki wrote:
> On Saturday 09 May 2009, Wu Fengguang wrote:
> > On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> > > On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > > > On Friday 08 May 2009, Wu Fengguang wrote:
> > > [--snip--]
> > > > > But hey, that 'count' counts "savable+free" memory.
> > > > > We don't have a counter for an estimation of "free+freeable" memory,
> > > > > ie. we are sure we cannot preallocate above that threshold.
> > > > >
> > > > > One applicable situation is, when there are 800M anonymous memory,
> > > > > but only 500M image_size and no swap space.
> > > > >
> > > > > In that case we will otherwise goto the oom code path. Sure oom is
> > > > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > > > cautious enough not to create a low memory situation, which will hurt:
> > > > > - hibernation speed
> > > > > (vmscan goes mad trying to squeeze the last free page)
> > > > > - user experiences after resume
> > > > > (all *active* file data and metadata have to be reloaded)
> > > >
> > > > Strangely enough, my recent testing with this patch doesn't confirm the
> > > > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > > > preallocate_image_memory() to return NULL at one point and that was it.
> > > >
> > > > It didn't even take too much time.
> > > >
> > > > I'll carry out more testing to verify this observation.
> > >
> > > I can confirm that even if image_size is below the minimum we can get,
> >
> > Which minimum please?
>
> That was supposed to be an alternative way of saying "below any reasonable
> value", but it wasn't very precise indeed.
>
> I should have said that for a given system there is a minimum number of saveable
> pages that hibernate_preallocate_memory() leaves in memory and it just can't
> go below that limit. If image_size is set below this number, the
> preallocate_image_memory(max_size - size) call returns fewer pages than it was
> requested to allocate and that's it. No disasters, nothing wrong.
"preallocate_image_memory(max_size - size) returning fewer pages"
would better be avoided, and possibly can be avoided by checking
hard_core_working_set(), right?
> > > the second preallocate_image_memory() just returns after allocating fewer pages
> > > than it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> > > approach, as I wrote in the previous message in this thread) and nothing bad
> > > happens.
> > >
> > > That may be because we freeze the mm kernel threads, but I've also tested
> > > without freezing them and it's still worked the same way.
> > >
> > > > > The current code simply tries *too hard* to meet image_size.
> > > > > I'd rather take that as a mild advice, and to only free
> > > > > "free+freeable-margin" pages when image_size is not approachable.
> > > > >
> > > > > The safety margin can be totalreserve_pages, plus enough pages for
> > > > > retaining the "hard core working set".
> > > >
> > > > How to compute the size of the "hard core working set", then?
> > >
> > > Well, I'm still interested in the answer here. ;-)
> >
> > A tough question ;-)
> >
> > We can start with the following formula, this should be called *after*
> > the initial memory shrinking.
>
> OK
>
> > /* a typical desktop does not have more than 100MB of mapped pages */
> > #define MAX_MMAP_PAGES (100 << (20 - PAGE_SHIFT))
> > unsigned long hard_core_working_set(void)
> > {
> > unsigned long nr;
> >
> > /*
> > * mapped pages are normally small and precious,
> > * but shall be bounded for safety.
> > */
> > nr = global_page_state(NR_FILE_MAPPED);
> > nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);
> >
> > /*
> > * if no swap space, this is a hard request;
> > * otherwise this is an optimization.
> > * (the disk image IO can be much faster than swap IO)
>
> Well, if there's no swap space at this point, we won't be able to save the
> image anyway, so this always is an optimization IMO. :-)
Ah OK. Do you think the anonymous pages optimization should be limited?
My desktop normally consumes 200-400MB of anonymous pages, but when some
virtual machine is running, the anonymous pages can go beyond 1GB,
with mapped file pages going slightly beyond 100MB.
The image-write vs. swapout-write speeds should be equal, however the
hibernate tool may be able to compress the dataset.
The image-read will be much faster than swapin-read for *rotational*
disks. It may take more time to resume, however the user experience
after completion will be much better.
I don't think "populating memory with useless data" would be a major
concern, since we already freed up half of the total memory. It's all
about the speed one can get back to work.
>
> > */
> > nr += global_page_state(NR_ACTIVE_ANON);
> > nr += global_page_state(NR_INACTIVE_ANON);
> >
> > /* hard (but normally small) memory requests */
> > nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
> > nr += global_page_state(NR_UNEVICTABLE);
> > nr += global_page_state(NR_PAGETABLE);
> >
> > return nr;
> > }
>
> OK, thanks.
>
> I'll create a separate patch adding this function and we'll see how it works.
OK, thanks!
btw, if the shrink_all_memory() function cannot go away because of
performance problems, I can help clean it up. (FYI: I happened to be
doing so just before you submitted this patchset. :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-10 4:52 ` Wu Fengguang
@ 2009-05-10 12:52 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-10 12:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, Andrew Morton, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Sunday 10 May 2009, Wu Fengguang wrote:
> On Sun, May 10, 2009 at 03:22:57AM +0800, Rafael J. Wysocki wrote:
> > On Saturday 09 May 2009, Wu Fengguang wrote:
> > > On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> > > > On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > > > > On Friday 08 May 2009, Wu Fengguang wrote:
> > > > [--snip--]
> > > > > > But hey, that 'count' counts "savable+free" memory.
> > > > > > We don't have a counter for an estimation of "free+freeable" memory,
> > > > > > ie. we are sure we cannot preallocate above that threshold.
> > > > > >
> > > > > > One applicable situation is, when there are 800M anonymous memory,
> > > > > > but only 500M image_size and no swap space.
> > > > > >
> > > > > > In that case we will otherwise goto the oom code path. Sure oom is
> > > > > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > > > > cautious enough not to create a low memory situation, which will hurt:
> > > > > > - hibernation speed
> > > > > > (vmscan goes mad trying to squeeze the last free page)
> > > > > > - user experiences after resume
> > > > > > (all *active* file data and metadata have to be reloaded)
> > > > >
> > > > > Strangely enough, my recent testing with this patch doesn't confirm the
> > > > > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > > > > preallocate_image_memory() to return NULL at one point and that was it.
> > > > >
> > > > > It didn't even take too much time.
> > > > >
> > > > > I'll carry out more testing to verify this observation.
> > > >
> > > > I can confirm that even if image_size is below the minimum we can get,
> > >
> > > Which minimum please?
> >
> > That was supposed to be an alternative way of saying "below any reasonable
> > value", but it wasn't very precise indeed.
> >
> > I should have said that for a given system there is a minimum number of saveable
> > pages that hibernate_preallocate_memory() leaves in memory and it just can't
> > go below that limit. If image_size is set below this number, the
> > preallocate_image_memory(max_size - size) call returns fewer pages than it was
> > requested to allocate and that's it. No disasters, nothing wrong.
>
> "preallocate_image_memory(max_size - size) returning fewer pages"
> would better be avoided, and possibly can be avoided by checking
> hard_core_working_set(), right?
Yes, but your formula doesn't seem to be suitable for that, because the number
it returns is too low.
On an x86_64 test box the minimum image size I can get (by setting
image_size=1000) is about 24000 pages, while the formula for the hard core
working set size returns about 12000, so it is not very useful.
On an i386 test box it's even worse, as the minimum image size I can get is
about 48000 pages.
Besides, while testing this I noticed that on i386 preallocate_image_memory()
didn't allocate from highmem, so I changed it to do so. As a result of this I
had to change the patches, so I'm going to post the new patchset shortly.
> > > > the second preallocate_image_memory() just returns after allocating fewer pages
> > > > than it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> > > > approach, as I wrote in the previous message in this thread) and nothing bad
> > > > happens.
> > > >
> > > > That may be because we freeze the mm kernel threads, but I've also tested
> > > > without freezing them and it's still worked the same way.
> > > >
> > > > > > The current code simply tries *too hard* to meet image_size.
> > > > > > I'd rather take that as a mild advice, and to only free
> > > > > > "free+freeable-margin" pages when image_size is not approachable.
> > > > > >
> > > > > > The safety margin can be totalreserve_pages, plus enough pages for
> > > > > > retaining the "hard core working set".
> > > > >
> > > > > How to compute the size of the "hard core working set", then?
> > > >
> > > > Well, I'm still interested in the answer here. ;-)
> > >
> > > A tough question ;-)
> > >
> > > We can start with the following formula, this should be called *after*
> > > the initial memory shrinking.
> >
> > OK
> >
> > > /* a typical desktop does not have more than 100MB of mapped pages */
> > > #define MAX_MMAP_PAGES (100 << (20 - PAGE_SHIFT))
> > > unsigned long hard_core_working_set(void)
> > > {
> > > unsigned long nr;
> > >
> > > /*
> > > * mapped pages are normally small and precious,
> > > * but shall be bounded for safety.
> > > */
> > > nr = global_page_state(NR_FILE_MAPPED);
> > > nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);
> > >
> > > /*
> > > * if no swap space, this is a hard request;
> > > * otherwise this is an optimization.
> > > * (the disk image IO can be much faster than swap IO)
> >
> > Well, if there's no swap space at this point, we won't be able to save the
> > image anyway, so this always is an optimization IMO. :-)
>
> Ah OK. Do you think the anonymous pages optimization should be limited?
That depends.
> My desktop normally consumes 200-400MB anonymous pages, but when some
> virtual machine is running, the anonymous pages can go beyond 1GB,
That's too much IMO, so there should be a limit.
> with mapped file pages go slightly beyond 100MB.
>
> The image-write vs. swapout-write speeds should be equal,
They aren't, really. Image write is way faster, even without compression.
> however the hibernate tool may be able to compress the dataset.
Sure.
> The image-read will be much faster than swapin-read for *rotational*
> disks. It may take more time to resume, however the user experiences
> after completion will be much better.
Agreed.
> I don't think "populating memory with useless data" would be a major
> concern, since we already freed up half of the total memory. It's all
> about the speed one can get back to work.
Agreed again.
> >
> > > */
> > > nr += global_page_state(NR_ACTIVE_ANON);
> > > nr += global_page_state(NR_INACTIVE_ANON);
> > >
> > > /* hard (but normally small) memory requests */
> > > nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
> > > nr += global_page_state(NR_UNEVICTABLE);
> > > nr += global_page_state(NR_PAGETABLE);
> > >
> > > return nr;
> > > }
> >
> > OK, thanks.
> >
> > I'll create a separate patch adding this function and we'll see how it works.
>
> OK, thanks!
Actually it doesn't work too well as I said above. Arguably that's because the
number of anonymous pages was probably lower than average in my test cases,
but I also think that our hard core working set formula should be suitable for
all test cases.
> btw, if the shrink_all_memory() functions cannot go away because of
> performance problems, I can help clean it up. (FYI: I happen to be
> doing so just before you submitted this patchset.:)
That would be great, thanks a lot!
I'm going to post updated patchset in a while, let's move the discussion to
the new thread.
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-10 12:52 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-10 12:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Rientjes, Andrew Morton,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA
On Sunday 10 May 2009, Wu Fengguang wrote:
> On Sun, May 10, 2009 at 03:22:57AM +0800, Rafael J. Wysocki wrote:
> > On Saturday 09 May 2009, Wu Fengguang wrote:
> > > On Sat, May 09, 2009 at 08:08:43AM +0800, Rafael J. Wysocki wrote:
> > > > On Friday 08 May 2009, Rafael J. Wysocki wrote:
> > > > > On Friday 08 May 2009, Wu Fengguang wrote:
> > > > [--snip--]
> > > > > > But hey, that 'count' counts "savable+free" memory.
> > > > > > We don't have a counter for an estimation of "free+freeable" memory,
> > > > > > ie. we are sure we cannot preallocate above that threshold.
> > > > > >
> > > > > > One applicable situation is, when there are 800M anonymous memory,
> > > > > > but only 500M image_size and no swap space.
> > > > > >
> > > > > > In that case we will otherwise goto the oom code path. Sure oom is
> > > > > > (and shall be) reliably disabled in hibernation, but still we shall be
> > > > > > cautious enough not to create a low memory situation, which will hurt:
> > > > > > - hibernation speed
> > > > > > (vmscan goes mad trying to squeeze the last free page)
> > > > > > - user experiences after resume
> > > > > > (all *active* file data and metadata have to be reloaded)
> > > > >
> > > > > Strangely enough, my recent testing with this patch doesn't confirm the
> > > > > theory. :-) Namely, I set image_size too low on purpose and it only caused
> > > > > preallocate_image_memory() to return NULL at one point and that was it.
> > > > >
> > > > > It didn't even take too much time.
> > > > >
> > > > > I'll carry out more testing to verify this observation.
> > > >
> > > > I can confirm that even if image_size is below the minimum we can get,
> > >
> > > Which minimum please?
> >
> > That was supposed to be an alternative way of saying "below any reasonable
> > value", but it wasn't very precise indeed.
> >
> > I should have said that for a given system there was a minimum number of saveable
> > pages that hibernate_preallocate_memory() left in memory and it just couldn't
> > go below that limit. If image_size is set below this number, the
> > preallocate_image_memory(max_size - size) call returns fewer pages than it's
> > been requested to allocate and that's it. No disasters, nothing wrong.
>
> "preallocate_image_memory(max_size - size) returning fewer pages"
> would better be avoided, and possibly can be avoided by checking
> hard_core_working_set(), right?
Yes, but your formula doesn't seem to be suitable for that, because the number
it returns is too low.
On an x86_64 test box the minimum image size I can get (by setting
image_size=1000) is about 24000 pages, while the formula for the hard core
working set size returns about 12000, so it is not very useful.
On an i386 test box it's even worse, as the minimum image size I can get is
about 48000 pages.
Besides, while testing this I noticed that on i386 preallocate_image_memory()
didn't allocate from highmem, so I changed it to do so. As a result of this I
had to change the patches, so I'm going to post the new patchset shortly.
> > > > the second preallocate_image_memory() just returns after allocating fewer pages
> > > > than it's been asked for (that's with the original __GFP_NO_OOM_KILL-based
> > > > approach, as I wrote in the previous message in this thread) and nothing bad
> > > > happens.
> > > >
> > > > That may be because we freeze the mm kernel threads, but I've also tested
> > > > without freezing them and it's still worked the same way.
> > > >
> > > > > > The current code simply tries *too hard* to meet image_size.
> > > > > > I'd rather take that as a mild advice, and to only free
> > > > > > "free+freeable-margin" pages when image_size is not approachable.
> > > > > >
> > > > > > The safety margin can be totalreserve_pages, plus enough pages for
> > > > > > retaining the "hard core working set".
> > > > >
> > > > > How to compute the size of the "hard core working set", then?
> > > >
> > > > Well, I'm still interested in the answer here. ;-)
> > >
> > > A tough question ;-)
> > >
> > > We can start with the following formula, this should be called *after*
> > > the initial memory shrinking.
> >
> > OK
> >
> > > /* a typical desktop does not have more than 100MB of mapped pages */
> > > #define MAX_MMAP_PAGES (100 << (20 - PAGE_SHIFT))
> > > unsigned long hard_core_working_set(void)
> > > {
> > > unsigned long nr;
> > >
> > > /*
> > > * mapped pages are normally small and precious,
> > > * but shall be bounded for safety.
> > > */
> > > nr = global_page_state(NR_FILE_MAPPED);
> > > nr = min_t(unsigned long, nr, MAX_MMAP_PAGES);
> > >
> > > /*
> > > * if no swap space, this is a hard request;
> > > * otherwise this is an optimization.
> > > * (the disk image IO can be much faster than swap IO)
> >
> > Well, if there's no swap space at this point, we won't be able to save the
> > image anyway, so this always is an optimization IMO. :-)
>
> Ah OK. Do you think the anonymous pages optimization should be limited?
That depends.
> My desktop normally consumes 200-400MB anonymous pages, but when some
> virtual machine is running, the anonymous pages can go beyond 1GB,
That's too much IMO, so there should be a limit.
> with mapped file pages go slightly beyond 100MB.
>
> The image-write vs. swapout-write speeds should be equal,
They aren't, really. Image write is way faster, even without compression.
> however the hibernate tool may be able to compress the dataset.
Sure.
> The image-read will be much faster than swapin-read for *rotational*
> disks. It may take more time to resume, however the user experiences
> after completion will be much better.
Agreed.
> I don't think "populating memory with useless data" would be a major
> concern, since we already freed up half of the total memory. It's all
> about the speed one can get back to work.
Agreed again.
> >
> > > */
> > > nr += global_page_state(NR_ACTIVE_ANON);
> > > nr += global_page_state(NR_INACTIVE_ANON);
> > >
> > > /* hard (but normally small) memory requests */
> > > nr += global_page_state(NR_SLAB_UNRECLAIMABLE);
> > > nr += global_page_state(NR_UNEVICTABLE);
> > > nr += global_page_state(NR_PAGETABLE);
> > >
> > > return nr;
> > > }
> >
> > OK, thanks.
> >
> > I'll create a separate patch adding this function and we'll see how it works.
>
> OK, thanks!
Actually it doesn't work too well as I said above. Arguably that's because the
number of anonymous pages was probably lower than average in my test cases,
but I also think that our hard core working set formula should be suitable for
all test cases.
> btw, if the shrink_all_memory() functions cannot go away because of
> performance problems, I can help clean it up. (FYI: I happen to be
> doing so just before you submitted this patchset.:)
That would be great, thanks a lot!
I'm going to post an updated patchset in a while; let's move the discussion to
the new thread.
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:59 ` David Rientjes
(?)
(?)
@ 2009-05-07 23:11 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 23:11 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Friday 08 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > The setting and clearing of that thing looks gruesomely racy..
> >
>
> It's not racy currently because zone_scan_lock ensures ZONE_OOM_LOCKED
> gets test/set and cleared atomically for the entire zonelist (the clear
> happens for the same zonelist that was test/set).
>
> Using it for hibernation in the way I've proposed will open it up to the
> race I earlier described: when a kthread is in the oom killer and
> subsequently clears its zonelist of ZONE_OOM_LOCKED (all other tasks are
> frozen so they can't be in the oom killer). That's perfectly acceptable,
> however, since the system is by definition already oom if kthreads can't
> get memory so it will end up killing a user task even though it's stuck in
> D state and will exit on thaw; we aren't concerned about killing
> needlessly because the oom killer becomes a no-op when it finds a task
> that has already been killed but hasn't exited by way of TIF_MEMDIE.
OK there.
So everyone seems to agree we can do something like in the patch below?
---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM/Hibernate: Rework shrinking of memory
Rework swsusp_shrink_memory() so that it calls shrink_all_memory()
just once to make some room for the image and then allocates memory
to apply more pressure to the memory management subsystem, if
necessary.
Unfortunately, we don't seem to be able to drop shrink_all_memory()
entirely just yet, because that would lead to huge performance
regressions in some test cases.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/snapshot.c | 151 +++++++++++++++++++++++++++++++++---------------
1 file changed, 104 insertions(+), 47 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1066,69 +1066,126 @@ void swsusp_free(void)
buffer = NULL;
}
+/* Helper functions used for the shrinking of memory. */
+
/**
- * swsusp_shrink_memory - Try to free as much memory as needed
- *
- * ... but do not OOM-kill anyone
+ * preallocate_image_memory - Allocate given number of page frames
+ * @nr_pages: Number of page frames to allocate
*
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
+ * Return value: Number of page frames actually allocated
*/
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
+static unsigned long preallocate_image_memory(unsigned long nr_pages)
{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
+ unsigned long nr_alloc = 0;
+
+ while (nr_pages > 0) {
+ if (!alloc_image_page(GFP_KERNEL | __GFP_NOWARN))
+ break;
+ nr_pages--;
+ nr_alloc++;
+ }
+
+ return nr_alloc;
}
+/**
+ * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ *
+ * To create a hibernation image it is necessary to make a copy of every page
+ * frame in use. We also need a number of page frames to be free during
+ * hibernation for allocations made while saving the image and for device
+ * drivers, in case they need to allocate memory from their hibernation
+ * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
+ * respectively, both of which are rough estimates). To make this happen, we
+ * compute the total number of available page frames and allocate at least
+ *
+ * ([page frames total] + PAGES_FOR_IO + [metadata pages]) / 2 + 2 * SPARE_PAGES
+ *
+ * of them, which corresponds to the maximum size of a hibernation image.
+ *
+ * If image_size is set below the number following from the above formula,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size or it is impossible to
+ * allocate more memory, whichever happens first.
+ */
int swsusp_shrink_memory(void)
{
- long tmp;
struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
+ unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
+ int error = 0;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Shrinking memory ... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
+ /* Count the number of saveable data pages. */
+ saveable = count_data_pages() + count_highmem_pages();
- if (highmem_size < 0)
- highmem_size = 0;
+ /*
+ * Compute the total number of page frames we can use (count) and the
+ * number of pages needed for image metadata (size).
+ */
+ count = saveable;
+ size = 0;
+ for_each_populated_zone(zone) {
+ size += snapshot_additional_pages(zone);
+ count += zone_page_state(zone, NR_FREE_PAGES);
+ count -= zone->pages_min;
+ }
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ /* Compute the maximum number of saveable pages to leave in memory. */
+ max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
+ size = DIV_ROUND_UP(image_size, PAGE_SIZE);
+ if (size > max_size)
+ size = max_size;
+ /*
+ * If the maximum is not less than the current number of saveable pages
+ * in memory, we don't need to do anything more.
+ */
+ if (size >= saveable)
+ goto out;
+
+ /*
+ * Let the memory management subsystem know that we're going to need a
+ * large number of page frames to allocate and make it free some memory.
+ * NOTE: If this is not done, performance is heavily affected in some
+ * test cases.
+ */
+ shrink_all_memory(saveable - size);
+
+ /*
+ * Prevent the OOM killer from triggering while we're allocating image
+ * memory.
+ */
+ for_each_populated_zone(zone)
+ zone_set_flag(zone, ZONE_OOM_LOCKED);
+ /*
+ * The number of saveable pages in memory was too high, so apply some
+ * pressure to decrease it. First, make room for the largest possible
+ * image and fail if that doesn't work. Next, try to decrease the size
+ * of the image as much as indicated by image_size.
+ */
+ count -= max_size;
+ pages = preallocate_image_memory(count);
+ if (pages < count)
+ error = -ENOMEM;
+ else
+ pages += preallocate_image_memory(max_size - size);
+
+ for_each_populated_zone(zone)
+ zone_clear_flag(zone, ZONE_OOM_LOCKED);
+
+ /* Release all of the preallocated page frames. */
+ swsusp_free();
+
+ if (error) {
+ printk(KERN_CONT "\n");
+ return error;
+ }
+
+ out:
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
return 0;
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 22:16 ` David Rientjes
(?)
(?)
@ 2009-05-07 22:45 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-07 22:45 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thu, 7 May 2009 15:16:17 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Thu, 7 May 2009, Andrew Morton wrote:
>
> > - the standard way of controlling memory allocator behaviour is via
> > the gfp_t. Bypassing that is an unusual step and needs a higher
> > level of justification, which I'm not seeing here.
> >
>
> The standard way of controlling the oom killer behavior for a zone is via
> the ZONE_OOM_LOCKED bit.
oop, I didn't remember/realise that ZONE_OOM_LOCKED already exists.
> > - if we do this via an unusual global, we reduce the chances that
> > another subsytem could use the new feature.
> >
> > I don't know what subsytem that might be, but I bet they're out
> > there. checkpoint-restart, virtual machines, ballooning memory
> > drivers, kexec loading, etc.
> >
>
> There's two separate issues here: the use of ZONE_OOM_LOCKED to control
> whether or not to invoke the oom killer for a specific zone (which is
> already its only function), and the fact that in this case we're doing it
> for all zones. It seems like you're concerned with the latter, but the
> distinction in the hibernation case is that no memory freeing would be
> possible as the result of the oom killer for _all_ zones, so it makes
> sense to lock them all out.
OK.
> > > The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
> > > whether it specifies it or not since the oom killer would simply kill a
> > > task in D state which can't exit or free memory and subsequent allocations
> > > would make the oom killer a no-op because there's an eligible task with
> > > TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
> > > calling the oom killer in the first place and killing an unresponsive task
> > > but that would have to happen anyway when thawed since the system is oom
> > > (or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
> >
> > All the above is specific to the PM application only, when userspace
> > tasks are stopped.
> >
>
> I'm not arguing that the only way we can ever implement __GFP_NO_OOM_KILL
> is for the entire system: we can set ZONE_OOM_LOCKED for only the zones in
> the zonelist that are passed to the page allocator. For this particular
> purpose, that is naturally all zones; for other future use cases it may be
> chosen only to lock out the zones we're allowed to allocate from in that
> context.
OK.
> > It might well end up that stopping userspace (beforehand or before
> > oom-killing) is a hard requirement for reliably disabling the
> > oom-killer.
>
> Yes, globally, but future use cases may disable only specific zones such
> as with memory hot-remove.
<goes off to find out what ZONE_OOM_LOCKED does>
That took remarkably longer than one would have expected..
Yes, OK, I agree, globally setting ZONE_OOM_LOCKED would produce a
decent result.
The setting and clearing of that thing looks gruesomely racy..
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:56 ` Andrew Morton
(?)
(?)
@ 2009-05-07 21:25 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 21:25 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thu, 7 May 2009, Andrew Morton wrote:
> > > All of your tasks are in D state other than kthreads, right? That means
> > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > easily do this
> > >
> > > struct zone *z;
> > > for_each_populated_zone(z)
> > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > >
> > > and then
> > >
> > > for_each_populated_zone(z)
> > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > >
> > > The serialization is done with trylocks so this will never invoke the oom
> > > killer because all zones in the allocator's zonelist will be oom locked.
> > >
> > > Why does this not work for you?
> >
> > Well, it might work too, but why are you insisting? How's it better than
> > __GFP_NO_OOM_KILL, actually?
> >
> > Andrew, what do you think about this?
>
> I don't think I understand the proposal. Is it to provide a means by
> which PM can go in and set a state bit against each and every zone? If
> so, that's still a global boolean, only messier.
>
Why can't it be global while preallocating memory for hibernation since
nothing but kthreads could allocate at this point and if the system is oom
then the oom killer wouldn't be able to do anything anyway since it can't
kill them?
The fact is that _all_ allocations here are implicitly __GFP_NO_OOM_KILL
whether it specifies it or not since the oom killer would simply kill a
task in D state which can't exit or free memory and subsequent allocations
would make the oom killer a no-op because there's an eligible task with
TIF_MEMDIE set. The only thing you're saving with __GFP_NO_OOM_KILL is
calling the oom killer in the first place and killing an unresponsive task
but that would have to happen anyway when thawed since the system is oom
(or otherwise lockup for GFP_KERNEL with order < PAGE_ALLOC_COSTLY_ORDER).
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:25 ` David Rientjes
` (3 preceding siblings ...)
(?)
@ 2009-05-07 20:38 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 20:38 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
fengguang.wu, torvalds
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > I'll use the freezer-based approach instead.
> >
>
> Third time I'm going to suggest this, and I'd like a response on why it's
> not possible instead of being ignored.
>
> All of your tasks are in D state other than kthreads, right? That means
> they won't be in the oom killer (thus no zones are oom locked), so you can
> easily do this
>
> struct zone *z;
> for_each_populated_zone(z)
> zone_set_flag(z, ZONE_OOM_LOCKED);
>
> and then
>
> for_each_populated_zone(z)
> zone_clear_flag(z, ZONE_OOM_LOCKED);
>
> The serialization is done with trylocks so this will never invoke the oom
> killer because all zones in the allocator's zonelist will be oom locked.
>
> Why does this not work for you?
Well, it might work too, but why are you insisting? How's it better than
__GFP_NO_OOM_KILL, actually?
Andrew, what do you think about this?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:25 ` David Rientjes
` (4 preceding siblings ...)
(?)
@ 2009-05-08 23:55 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-08 23:55 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, fengguang.wu, torvalds, linux-pm
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > I'll use the freezer-based approach instead.
> >
>
> Third time I'm going to suggest this, and I'd like a response on why it's
> not possible instead of being ignored.
>
> All of your tasks are in D state other than kthreads, right? That means
> they won't be in the oom killer (thus no zones are oom locked), so you can
> easily do this
>
> struct zone *z;
> for_each_populated_zone(z)
> zone_set_flag(z, ZONE_OOM_LOCKED);
>
> and then
>
> for_each_populated_zone(z)
> zone_clear_flag(z, ZONE_OOM_LOCKED);
>
> The serialization is done with trylocks so this will never invoke the oom
> killer because all zones in the allocator's zonelist will be oom locked.
Well, that might have been a good idea if it actually had worked. :-(
> Why does this not work for you?
If I set image_size to something below "hard core working set" +
totalreserve_pages, preallocate_image_memory() hangs the
box (please refer to the last patch I sent,
http://patchwork.kernel.org/patch/22423/).
However, with the freezer-based disabling of the OOM killer it doesn't hang
under the same test conditions.
The difference appears to be that using your approach makes
__alloc_pages_internal() loop forever between the !try_set_zone_oom() test and
restart:, while it should go to nopage: in that situation.
So, I think I'll stick to Andrew's approach of using __GFP_NO_OOM_KILL.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-08 23:55 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-08 23:55 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thursday 07 May 2009, David Rientjes wrote:
> On Thu, 7 May 2009, Rafael J. Wysocki wrote:
>
> > OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
> > I'll use the freezer-based approach instead.
> >
>
> Third time I'm going to suggest this, and I'd like a response on why it's
> not possible instead of being ignored.
>
> All of your tasks are in D state other than kthreads, right? That means
> they won't be in the oom killer (thus no zones are oom locked), so you can
> easily do this
>
> struct zone *z;
> for_each_populated_zone(z)
> zone_set_flag(z, ZONE_OOM_LOCKED);
>
> and then
>
> for_each_populated_zone(z)
> zone_clear_flag(z, ZONE_OOM_LOCKED);
>
> The serialization is done with trylocks so this will never invoke the oom
> killer because all zones in the allocator's zonelist will be oom locked.
Well, that might have been a good idea if it actually had worked. :-(
> Why does this not work for you?
If I set image_size to something below "hard core working set" +
totalreserve_pages, preallocate_image_memory() hangs the
box (please refer to the last patch I sent,
http://patchwork.kernel.org/patch/22423/).
However, with the freezer-based disabling of the OOM killer it doesn't hang
under the same test conditions.
The difference appears to be that using your approach makes
__alloc_pages_internal() loop forever between the !try_set_zone_oom() test and
restart:, while it should go to nopage: in that situation.
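[Editor's sketch: the endless loop Rafael describes can be modelled in userspace. This is a simplified, hypothetical model of the 2.6.30-era slow path, not the kernel source; `alloc_model` and its parameters are invented for illustration, with the retry count bounded so the model terminates.]

```c
#include <stdbool.h>

enum alloc_result { ALLOC_OK, ALLOC_NOPAGE, ALLOC_LOOPED };

/* Userspace model (not kernel code) of the pre-mmotm slow path:
 * when every zone is OOM-locked, try_set_zone_oom() fails and the
 * allocator jumps back to restart: instead of falling through to
 * nopage:.  max_retries bounds the loop so the model terminates. */
static enum alloc_result
alloc_model(bool all_zones_oom_locked, bool reclaim_makes_progress,
            int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (reclaim_makes_progress)
            return ALLOC_OK;        /* direct reclaim freed a page */
        if (!all_zones_oom_locked)
            return ALLOC_NOPAGE;    /* oom path ran and could give up */
        /* try_set_zone_oom() failed: the old code went back to
         * restart: here unconditionally, never reaching nopage: */
    }
    return ALLOC_LOOPED;            /* the hang described above */
}
```

With all zones OOM-locked and no reclaim progress, the model never reaches the nopage path, matching the observed hang.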
So, I think I'll stick to Andrew's approach of using __GFP_NO_OOM_KILL.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-09 21:22 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-09 21:22 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, Linus Torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers,
Mel Gorman
On Sat, 9 May 2009, Rafael J. Wysocki wrote:
> > All of your tasks are in D state other than kthreads, right? That means
> > they won't be in the oom killer (thus no zones are oom locked), so you can
> > easily do this
> >
> > struct zone *z;
> > for_each_populated_zone(z)
> > zone_set_flag(z, ZONE_OOM_LOCKED);
> >
> > and then
> >
> > for_each_populated_zone(z)
> > zone_clear_flag(z, ZONE_OOM_LOCKED);
> >
> > The serialization is done with trylocks so this will never invoke the oom
> > killer because all zones in the allocator's zonelist will be oom locked.
>
> Well, that might have been a good idea if it actually had worked. :-(
>
> > Why does this not work for you?
>
> If I set image_size to something below "hard core working set" +
> totalreserve_pages, preallocate_image_memory() hangs the
> box (please refer to the last patch I sent,
> http://patchwork.kernel.org/patch/22423/).
>
This has been changed in the latest mmotm with Mel's page allocator
patches (and I think yours should be based on mmotm). Specifically,
page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
one of their zones would unconditionally goto restart. Now, if
order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
it does goto restart.
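[Editor's sketch: the order check David describes can be modelled as a single predicate. This is a hypothetical simplification for illustration, not the mmotm source.]

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* same threshold the kernel uses */

/* Models the mmotm behavior described above: with the zonelist
 * OOM-locked, a costly-order allocation now returns NULL (gives up),
 * while order <= PAGE_ALLOC_COSTLY_ORDER still does goto restart. */
static bool fails_instead_of_restarting(unsigned int order)
{
    return order > PAGE_ALLOC_COSTLY_ORDER;
}
```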
So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER, using the
ZONE_OOM_LOCKED approach to locking out the oom killer will work just fine
in mmotm.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 21:22 ` David Rientjes
@ 2009-05-09 21:37 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 21:37 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Saturday 09 May 2009, David Rientjes wrote:
> On Sat, 9 May 2009, Rafael J. Wysocki wrote:
>
> > > All of your tasks are in D state other than kthreads, right? That means
> > > they won't be in the oom killer (thus no zones are oom locked), so you can
> > > easily do this
> > >
> > > struct zone *z;
> > > for_each_populated_zone(z)
> > > zone_set_flag(z, ZONE_OOM_LOCKED);
> > >
> > > and then
> > >
> > > for_each_populated_zone(z)
> > > zone_clear_flag(z, ZONE_OOM_LOCKED);
> > >
> > > The serialization is done with trylocks so this will never invoke the oom
> > > killer because all zones in the allocator's zonelist will be oom locked.
> >
> > Well, that might have been a good idea if it actually had worked. :-(
> >
> > > Why does this not work for you?
> >
> > If I set image_size to something below "hard core working set" +
> > totalreserve_pages, preallocate_image_memory() hangs the
> > box (please refer to the last patch I sent,
> > http://patchwork.kernel.org/patch/22423/).
> >
>
> This has been changed in the latest mmotm with Mel's page allocator
> patches (and I think yours should be based on mmotm). Specifically,
> page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
>
> Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
> one of their zones would unconditionally goto restart. Now, if
> order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
> it does goto restart.
>
> So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER,
It doesn't. All of my allocations are of order 0.
> using the ZONE_OOM_LOCKED approach to locking out the oom killer will work
> just fine in mmotm.
No, it won't, AFAICT.
Best,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-09 22:39 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-09 22:39 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, Linus Torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers,
Mel Gorman
On Sat, 9 May 2009, Rafael J. Wysocki wrote:
> > This has been changed in the latest mmotm with Mel's page allocator
> > patches (and I think yours should be based on mmotm). Specifically,
> > page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
> >
> > Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
> > one of their zones would unconditionally goto restart. Now, if
> > order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
> > it does goto restart.
> >
> > So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER,
>
> It doesn't. All of my allocations are of order 0.
>
All order 0 allocations are implicitly __GFP_NOFAIL and will loop
endlessly unless they can't block. So if you want to simply prohibit the
oom killer from being invoked and not change the retry behavior, setting
ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
means nothing can be reclaimed and you can't free memory via oom killing,
so there's nothing else the page allocator can do.
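[Editor's sketch: the implicit-__GFP_NOFAIL behavior David describes can be expressed as a retry predicate. `should_retry` and its parameters are hypothetical names for illustration, simplified from the allocator logic being discussed.]

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Sketch of the retry decision described above: an allocation that may
 * block and is not costly-order is retried forever, i.e. order-0
 * requests behave as if __GFP_NOFAIL were set. */
static bool should_retry(unsigned int order, bool can_block, bool gfp_nofail)
{
    if (!can_block)
        return false;       /* e.g. atomic callers may simply fail */
    if (gfp_nofail)
        return true;        /* explicit __GFP_NOFAIL */
    return order <= PAGE_ALLOC_COSTLY_ORDER;    /* implicit nofail */
}
```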
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-09 23:03 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 23:03 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, Linus Torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers,
Mel Gorman
On Sunday 10 May 2009, David Rientjes wrote:
> On Sat, 9 May 2009, Rafael J. Wysocki wrote:
>
> > > This has been changed in the latest mmotm with Mel's page allocator
> > > patches (and I think yours should be based on mmotm). Specifically,
> > > page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
> > >
> > > Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
> > > one of their zones would unconditionally goto restart. Now, if
> > > order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
> > > it does goto restart.
> > >
> > > So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER,
> >
> > It doesn't. All of my allocations are of order 0.
> >
>
> All order 0 allocations are implicitly __GFP_NOFAIL and will loop
> endlessly unless they can't block. So if you want to simply prohibit the
> oom killer from being invoked and not change the retry behavior, setting
> ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
> means nothing can be reclaimed and you can't free memory via oom killing,
> so there's nothing else the page allocator can do.
But I want it to give up in this case instead of looping forever.
Look. I have a specific problem at hand that I want to solve and the approach
you suggested _clearly_ _doesn't_ _work_. I have also tried to explain to you
why it doesn't work, but you're ignoring it, so I really don't know what else
I can say.
OTOH, the approach suggested by Andrew _does_ _work_ regardless of your
opinion about it. It's been tested and it's done the job 100% of the time. Go
figure. And please stop beating the dead horse.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 23:03 ` Rafael J. Wysocki
@ 2009-05-11 20:11 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-11 20:11 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Sun, 10 May 2009, Rafael J. Wysocki wrote:
> > All order 0 allocations are implicitly __GFP_NOFAIL and will loop
> > endlessly unless they can't block. So if you want to simply prohibit the
> > oom killer from being invoked and not change the retry behavior, setting
> > ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
> > means nothing can be reclaimed and you can't free memory via oom killing,
> > so there's nothing else the page allocator can do.
>
> But I want it to give up in this case instead of looping forever.
>
> Look. I have a specific problem at hand that I want to solve and the approach
> you suggested _clearly_ _doesn't_ _work_. I have also tried to explain to you
> > why it doesn't work, but you're ignoring it, so I really don't know what else
> I can say.
>
> OTOH, the approach suggested by Andrew _does_ _work_ regardless of your
> opinion about it. It's been tested and it's done the job 100% of the time. Go
> figure. And please stop beating the dead horse.
>
Which implementation are you talking about? You've had several:
http://marc.info/?l=linux-kernel&m=124121728429113
http://marc.info/?l=linux-kernel&m=124131049223733
http://marc.info/?l=linux-kernel&m=124165031723627
http://marc.info/?l=linux-kernel&m=124146681311494
The issue with your approach is that it doesn't address the problem; the
problem is _not_ specific to individual page allocations; it is specific to
the STATE OF THE MACHINE.
If all userspace tasks are uninterruptible when trying to reserve this
memory and, thus, oom killing is negligent and not going to help, that
needs to be addressed in the page allocator. It is a bug for the
allocator to continuously retry the allocation unless __GFP_NOFAIL is set
if oom killing will not free memory.
Adding a new __GFP_NO_OOM_KILL flag to address that isn't helpful since it
has nothing at all to do with the specific allocation. It may certainly
be the easiest way to implement your patchset without doing VM work, but
it's not going to fix the problem for others.
I just posted a patch series[*] that would fix this problem for you
without even locking out the oom killer or adding any unnecessary gfp
flags. It is based on mmotm since it has Mel's page allocator speedups.
Any change you do to the allocator at this point should be based on that
to avoid nasty merge conflicts later, so try my series out and see how it
works.
Now, I won't engage in your personal attacks because (i) nobody else
cares, and (ii) it's not going to be productive. I'll let my code do the
talking.
[*] http://lkml.org/lkml/2009/5/10/118
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-11 22:44 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-11 22:44 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, Linus Torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers,
Mel Gorman
On Monday 11 May 2009, David Rientjes wrote:
> On Sun, 10 May 2009, Rafael J. Wysocki wrote:
>
> > > All order 0 allocations are implicitly __GFP_NOFAIL and will loop
> > > endlessly unless they can't block. So if you want to simply prohibit the
> > > oom killer from being invoked and not change the retry behavior, setting
> > > ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
> > > means nothing can be reclaimed and you can't free memory via oom killing,
> > > so there's nothing else the page allocator can do.
> >
> > But I want it to give up in this case instead of looping forever.
> >
> > Look. I have a specific problem at hand that I want to solve and the approach
> > you suggested _clearly_ _doesn't_ _work_. I have also tried to explain to you
> why it doesn't work, but you're ignoring it, so I really don't know what else
> > I can say.
> >
> > OTOH, the approach suggested by Andrew _does_ _work_ regardless of your
> > opinion about it. It's been tested and it's done the job 100% of the time. Go
> > figure. And please stop beating the dead horse.
> >
>
> Which implementation are you talking about? You've had several:
>
> http://marc.info/?l=linux-kernel&m=124121728429113
> http://marc.info/?l=linux-kernel&m=124131049223733
> http://marc.info/?l=linux-kernel&m=124165031723627
> http://marc.info/?l=linux-kernel&m=124146681311494
The second one. The first one was too much code, the third one was not
Andrew's favourite and the last one is wrong, because it changes the behaviour
related to __GFP_NORETRY incorrectly.
> The issue with your approach is that it doesn't address the problem; the
> problem is _not_ specific to individual page allocations it is specific to
> the STATE OF THE MACHINE.
Yes, it is, but have you followed my discussion with Andrew?
> If all userspace tasks are uninterruptible when trying to reserve this
> memory and, thus, oom killing is negligent and not going to help, that
> needs to be addressed in the page allocator. It is a bug for the
> allocator to continuously retry the allocation unless __GFP_NOFAIL is set
> if oom killing will not free memory.
That was my argument in the discussion with Andrew, actually.
> Adding a new __GFP_NO_OOM_KILL flag to address that isn't helpful since it
> has nothing at all to do with the specific allocation. It may certainly
> be the easiest way to implement your patchset without doing VM work, but
> it's not going to fix the problem for others.
I agree, but I didn't even want to fix the problem with OOM killing after
freezing tasks.
> I just posted a patch series[*] that would fix this problem for you
> without even locking out the oom killer or adding any unnecessary gfp
> flags. It is based on mmotm since it has Mel's page allocator speedups.
> Any change you do to the allocator at this point should be based on that
> to avoid nasty merge conflicts later, so try my series out and see how it
> works.
>
> Now, I won't engage in your personal attacks because (i) nobody else
> cares, and (ii) it's not going to be productive.
My previous message wasn't meant to be personal, so I'm sorry if it sounded
like it was.
> I'll let my code do the talking.
>
> [*] http://lkml.org/lkml/2009/5/10/118
OK, so the patch is http://lkml.org/lkml/2009/5/10/127, isn't it? I'm not
sure it will fly, given Andrew's reply.
In fact the problem is that processes in D state are only legitimately going
to stay in this state when they are _frozen_. So, the right approach seems to
be to avoid calling the OOM killer at all after freezing processes and instead
fail the allocations that would have triggered it. Which means this patch:
http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
one).
But Andrew says that it's better to have a __GFP_NO_OOM_KILL flag instead,
because someone else might presumably use it in future for something (I have
no idea who that might be, but whatever) and _surely_ no one else will use a
global switch related to the freezer.
Still _I_ think that since the freezer is the source of the problematic
situation (all tasks are persistently unkillable), using it should change the
behaviour of the page allocator, so that the OOM killer is not activated
while processes are frozen. And in fact that should not depend on what flags
are used by whoever tries to allocate memory.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-11 22:44 ` Rafael J. Wysocki
(?)
@ 2009-05-11 23:07 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-11 23:07 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, mel, rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Tue, 12 May 2009 00:44:36 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> Which means this patch:
> http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> one).
ho hum, I could live with that ;)
Would it make sense to turn it into something more general? Instead of
"tasks_frozen/processes_are_frozen()", present it as
"oom_killer_disabled/oom_killer_is_disabled()"?
That would invite other subsystems to use it, if they want to. Which
might well be a bad thing on their behalf, hard to say..
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-11 23:07 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-11 23:07 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers, mel
On Tue, 12 May 2009 00:44:36 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> Which means this patch:
> http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> one).
ho hum, I could live with that ;)
Would it make sense to turn it into something more general? Instead of
"tasks_frozen/processes_are_frozen()", present it as
"oom_killer_disabled/oom_killer_is_disabled()"?
That would invite other subsystems to use it, if they want to. Which
might well be a bad thing on their behalf, hard to say..
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-11 23:07 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-11 23:07 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes-hpIqsD4AKlfQT0dZR+AlfA,
fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
mel-wPRd99KPJ+uzQB+pC5nmwQ
On Tue, 12 May 2009 00:44:36 +0200
"Rafael J. Wysocki" <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
> Which means this patch:
> http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> one).
ho hum, I could live with that ;)
Would it make sense to turn it into something more general? Instead of
"tasks_frozen/processes_are_frozen()", present it as
"oom_killer_disabled/oom_killer_is_disabled()"?
That would invite other subsystems to use it, if they want to. Which
might well be a bad thing on their behalf, hard to say..
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-11 23:07 ` Andrew Morton
(?)
@ 2009-05-11 23:28 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-11 23:28 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, mel, rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 00:44:36 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > Which means this patch:
> > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > one).
>
> ho hum, I could live with that ;)
>
> Would it make sense to turn it into something more general? Instead of
> "tasks_frozen/processes_are_frozen()", present it as
> "oom_killer_disabled/oom_killer_is_disabled()"?
>
> That would invite other subsystems to use it, if they want to. Which
> might well be a bad thing on their behalf, hard to say..
I chose the names this way because the variable is defined in the freezer code.
Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
for manipulating it and call them from the freezer code. Do you think that
would be better?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-11 23:28 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-11 23:28 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers, mel
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 00:44:36 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > Which means this patch:
> > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > one).
>
> ho hum, I could live with that ;)
>
> Would it make sense to turn it into something more general? Instead of
> "tasks_frozen/processes_are_frozen()", present it as
> "oom_killer_disabled/oom_killer_is_disabled()"?
>
> That would invite other subsystems to use it, if they want to. Which
> might well be a bad thing on their behalf, hard to say..
I chose the names this way because the variable is defined in the freezer code.
Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
for manipulating it and call them from the freezer code. Do you think that
would be better?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-11 23:28 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-11 23:28 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes-hpIqsD4AKlfQT0dZR+AlfA,
fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
pavel-+ZI9xUNit7I, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA,
mel-wPRd99KPJ+uzQB+pC5nmwQ
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 00:44:36 +0200
> "Rafael J. Wysocki" <rjw-KKrjLPT3xs0@public.gmane.org> wrote:
>
> > Which means this patch:
> > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > one).
>
> ho hum, I could live with that ;)
>
> Would it make sense to turn it into something more general? Instead of
> "tasks_frozen/processes_are_frozen()", present it as
> "oom_killer_disabled/oom_killer_is_disabled()"?
>
> That would invite other subsystems to use it, if they want to. Which
> might well be a bad thing on their behalf, hard to say..
I chose the names this way because the variable is defined in the freezer code.
Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
for manipulating it and call them from the freezer code. Do you think that
would be better?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-12 0:11 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-12 0:11 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers, mel
On Tue, 12 May 2009 01:28:15 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Tuesday 12 May 2009, Andrew Morton wrote:
> > On Tue, 12 May 2009 00:44:36 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > Which means this patch:
> > > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > > one).
> >
> > ho hum, I could live with that ;)
> >
> > Would it make sense to turn it into something more general? Instead of
> > "tasks_frozen/processes_are_frozen()", present it as
> > "oom_killer_disabled/oom_killer_is_disabled()"?
> >
> > That would invite other subsystems to use it, if they want to. Which
> > might well be a bad thing on their behalf, hard to say..
>
> I chose the names this way because the variable is defined in the freezer code.
>
> Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
> for manipulating it and call them from the freezer code. Do you think that
> would be better?
The choice is:
a) put a general oom-killer interface function into the oom-killer
code, call that from swsusp.
b) put a swsusp-specific change into the oom-killer, call that from swsusp.
From a cleanliness POV, a) is way better. But it does need to be a
general function! If there's some hidden requirement which only makes
the function applicable to swsusp, such as "all tasks must be frozen" then
we'd be kidding ourselves by making it general-looking.
I have a bad feeling that after one week and 12^17 emails, we're back
to your original patch :)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-12 16:52 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-12 16:52 UTC (permalink / raw)
To: Andrew Morton
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers, mel
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 01:28:15 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Tuesday 12 May 2009, Andrew Morton wrote:
> > > On Tue, 12 May 2009 00:44:36 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > Which means this patch:
> > > > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > > > one).
> > >
> > > ho hum, I could live with that ;)
> > >
> > > Would it make sense to turn it into something more general? Instead of
> > > "tasks_frozen/processes_are_frozen()", present it as
> > > "oom_killer_disabled/oom_killer_is_disabled()"?
> > >
> > > That would invite other subsystems to use it, if they want to. Which
> > > might well be a bad thing on their behalf, hard to say..
> >
> > I chose the names this way because the variable is defined in the freezer code.
> >
> > Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
> > for manipulating it and call them from the freezer code. Do you think that
> > would be better?
>
> The choice is:
>
> a) put a general oom-killer interface function into the oom-killer
> code, call that from swsusp.
>
> b) put a swsusp-specific change into the oom-killer, call that from swsusp.
>
>
> From a cleanliess POV, a) is way better. But it does need to be a
> general function! If there's some hidden requirement which only makes
> the function applicable to swsusp, such as "all tasks must be frozen" then
> we'd be kidding ourselves by making it general-looking.
Hmm. I guess there may be other situations in which it's better to fail
memory allocations than to kill tasks.
> I have a bad feeling that after one week and 12^17 emails, we're back
> to your original patch :)
Well, what about the following?
---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: mm, PM/Freezer: Disable OOM killer when tasks are frozen
Currently, the following scenario appears to be possible in theory:
* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* A certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.
Although it is now hard to trigger during hibernation due to the
memory shrinking carried out by the hibernation code, it is
theoretically possible to trigger during suspend after the memory
shrinking has been removed from that code path. Moreover, since
memory allocations are going to be used for the hibernation memory
shrinking, it will be even more likely to happen during hibernation.
To prevent it from happening, introduce the oom_killer_disabled
switch that will cause __alloc_pages_internal() to fail in the
situations in which the OOM killer would have been called and make
the freezer set this switch after tasks have been successfully
frozen.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
include/linux/gfp.h | 12 ++++++++++++
kernel/power/process.c | 5 +++++
mm/page_alloc.c | 5 +++++
3 files changed, 22 insertions(+)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -175,6 +175,8 @@ static void set_pageblock_migratetype(st
PB_migrate, PB_migrate_end);
}
+bool oom_killer_disabled __read_mostly;
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1600,6 +1602,9 @@ nofail_alloc:
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ if (oom_killer_disabled)
+ goto nopage;
+
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -245,4 +245,16 @@ void drain_zone_pages(struct zone *zone,
void drain_all_pages(void);
void drain_local_pages(void *dummy);
+extern bool oom_killer_disabled;
+
+static inline void disable_oom_killer(void)
+{
+ oom_killer_disabled = true;
+}
+
+static inline void enable_oom_killer(void)
+{
+ oom_killer_disabled = false;
+}
+
#endif /* __LINUX_GFP_H */
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -117,9 +117,12 @@ int freeze_processes(void)
if (error)
goto Exit;
printk("done.");
+
+ disable_oom_killer();
Exit:
BUG_ON(in_atomic());
printk("\n");
+
return error;
}
@@ -145,6 +148,8 @@ static void thaw_tasks(bool nosig_only)
void thaw_processes(void)
{
+ enable_oom_killer();
+
printk("Restarting tasks ... ");
thaw_tasks(true);
thaw_tasks(false);
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-12 16:52 ` Rafael J. Wysocki
@ 2009-05-12 17:50 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-12 17:50 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers, mel
On Tue, 12 May 2009 18:52:36 +0200 "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> Well, what about the following?
It has the virtue of simplicity.
> +static inline void disable_oom_killer(void)
> +{
> + oom_killer_disabled = true;
> +}
> +
> +static inline void enable_oom_killer(void)
> +{
> + oom_killer_disabled = false;
> +}
I'll change these to oom_killer_disable() and oom_killer_enable(), OK?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-12 17:50 ` Andrew Morton
(?)
@ 2009-05-12 20:40 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-12 20:40 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, mel, rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 18:52:36 +0200 "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > Well, what about the following?
>
> It has the virtue of simplicity.
>
> > +static inline void disable_oom_killer(void)
> > +{
> > + oom_killer_disabled = true;
> > +}
> > +
> > +static inline void enable_oom_killer(void)
> > +{
> > + oom_killer_disabled = false;
> > +}
>
> I'll change these to oom_killer_disable() and oom_killer_enable(), OK?
Works for me, thanks!
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-12 0:11 ` Andrew Morton
@ 2009-05-12 16:52 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-12 16:52 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, mel, rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Tuesday 12 May 2009, Andrew Morton wrote:
> On Tue, 12 May 2009 01:28:15 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Tuesday 12 May 2009, Andrew Morton wrote:
> > > On Tue, 12 May 2009 00:44:36 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > Which means this patch:
> > > > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > > > one).
> > >
> > > ho hum, I could live with that ;)
> > >
> > > Would it make sense to turn it into something more general? Instead of
> > > "tasks_frozen/processes_are_frozen()", present it as
> > > "oom_killer_disabled/oom_killer_is_disabled()"?
> > >
> > > That would invite other subsystems to use it, if they want to. Which
> > > might well be a bad thing on their behalf, hard to say..
> >
> > I chose the names this way because the variable is defined in the freezer code.
> >
> > Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
> > for manipulating it and call them from the freezer code. Do you think that
> > would be better?
>
> The choice is:
>
> a) put a general oom-killer interface function into the oom-killer
> code, call that from swsusp.
>
> b) put a swsusp-specific change into the oom-killer, call that from swsusp.
>
>
> From a cleanliness POV, a) is way better. But it does need to be a
> general function! If there's some hidden requirement which only makes
> the function applicable to swsusp, such as "all tasks must be frozen" then
> we'd be kidding ourselves by making it general-looking.
Hmm. I guess there may be other situations in which it's better to fail
memory allocations than to kill tasks.
> I have a bad feeling that after one week and 12^17 emails, we're back
> to your original patch :)
Well, what about the following?
---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: mm, PM/Freezer: Disable OOM killer when tasks are frozen
Currently, the following scenario appears to be possible in theory:
* Tasks are frozen for hibernation or suspend.
* Free pages are almost exhausted.
* Certain piece of code in the suspend code path attempts to allocate
some memory using GFP_KERNEL and allocation order less than or
equal to PAGE_ALLOC_COSTLY_ORDER.
* __alloc_pages_internal() cannot find a free page so it invokes the
OOM killer.
* The OOM killer attempts to kill a task, but the task is frozen, so
it doesn't die immediately.
* __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
to find a free page and invokes the OOM killer.
* No progress can be made.
Although it is now hard to trigger during hibernation due to the
memory shrinking carried out by the hibernation code, it is
theoretically possible to trigger during suspend after the memory
shrinking has been removed from that code path. Moreover, since
memory allocations are going to be used for the hibernation memory
shrinking, it will be even more likely to happen during hibernation.
To prevent it from happening, introduce the oom_killer_disabled
switch that will cause __alloc_pages_internal() to fail in the
situations in which the OOM killer would have been called and make
the freezer set this switch after tasks have been successfully
frozen.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
include/linux/gfp.h | 12 ++++++++++++
kernel/power/process.c | 5 +++++
mm/page_alloc.c | 5 +++++
3 files changed, 22 insertions(+)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -175,6 +175,8 @@ static void set_pageblock_migratetype(st
PB_migrate, PB_migrate_end);
}
+bool oom_killer_disabled __read_mostly;
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1600,6 +1602,9 @@ nofail_alloc:
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ if (oom_killer_disabled)
+ goto nopage;
+
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -245,4 +245,16 @@ void drain_zone_pages(struct zone *zone,
void drain_all_pages(void);
void drain_local_pages(void *dummy);
+extern bool oom_killer_disabled;
+
+static inline void disable_oom_killer(void)
+{
+ oom_killer_disabled = true;
+}
+
+static inline void enable_oom_killer(void)
+{
+ oom_killer_disabled = false;
+}
+
#endif /* __LINUX_GFP_H */
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -117,9 +117,12 @@ int freeze_processes(void)
if (error)
goto Exit;
printk("done.");
+
+ disable_oom_killer();
Exit:
BUG_ON(in_atomic());
printk("\n");
+
return error;
}
@@ -145,6 +148,8 @@ static void thaw_tasks(bool nosig_only)
void thaw_processes(void)
{
+ enable_oom_killer();
+
printk("Restarting tasks ... ");
thaw_tasks(true);
thaw_tasks(false);
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-11 23:28 ` Rafael J. Wysocki
@ 2009-05-12 0:11 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-12 0:11 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, mel, rientjes, linux-kernel, alan-jenkins,
jens.axboe, linux-pm, fengguang.wu, torvalds
On Tue, 12 May 2009 01:28:15 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Tuesday 12 May 2009, Andrew Morton wrote:
> > On Tue, 12 May 2009 00:44:36 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > Which means this patch:
> > > http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
> > > one).
> >
> > ho hum, I could live with that ;)
> >
> > Would it make sense to turn it into something more general? Instead of
> > "tasks_frozen/processes_are_frozen()", present it as
> > "oom_killer_disabled/oom_killer_is_disabled()"?
> >
> > That would invite other subsystems to use it, if they want to. Which
> > might well be a bad thing on their behalf, hard to say..
>
> I chose the names this way because the variable is defined in the freezer code.
>
> Alternatively, I can define one in page_alloc.c, add [disable|enable]_oom_killer()
> for manipulating it and call them from the freezer code. Do you think that
> would be better?
The choice is:
a) put a general oom-killer interface function into the oom-killer
code, call that from swsusp.
b) put a swsusp-specific change into the oom-killer, call that from swsusp.
From a cleanliness POV, a) is way better. But it does need to be a
general function! If there's some hidden requirement which only makes
the function applicable to swsusp, such as "all tasks must be frozen" then
we'd be kidding ourselves by making it general-looking.
I have a bad feeling that after one week and 12^17 emails, we're back
to your original patch :)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-11 20:11 ` David Rientjes
@ 2009-05-11 22:44 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-11 22:44 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Monday 11 May 2009, David Rientjes wrote:
> On Sun, 10 May 2009, Rafael J. Wysocki wrote:
>
> > > All order 0 allocations are implicitly __GFP_NOFAIL and will loop
> > > endlessly unless they can't block. So if you want to simply prohibit the
> > > oom killer from being invoked and not change the retry behavior, setting
> > > ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
> > > means nothing can be reclaimed and you can't free memory via oom killing,
> > > so there's nothing else the page allocator can do.
> >
> > But I want it to give up in this case instead of looping forever.
> >
> > Look. I have a specific problem at hand that I want to solve and the approach
> > you suggested _clearly_ _doesn't_ _work_. I have also tried to explain to you
> why it doesn't work, but you're ignoring it, so I really don't know what else
> > I can say.
> >
> > OTOH, the approach suggested by Andrew _does_ _work_ regardless of your
> > opinion about it. It's been tested and it's done the job 100% of the time. Go
> > figure. And please stop beating the dead horse.
> >
>
> Which implementation are you talking about? You've had several:
>
> http://marc.info/?l=linux-kernel&m=124121728429113
> http://marc.info/?l=linux-kernel&m=124131049223733
> http://marc.info/?l=linux-kernel&m=124165031723627
> http://marc.info/?l=linux-kernel&m=124146681311494
The second one. The first one was too much code, the third one was not
Andrew's favourite, and the last one is wrong because it changes the behaviour
related to __GFP_NORETRY incorrectly.
> The issue with your approach is that it doesn't address the problem; the
> problem is _not_ specific to individual page allocations it is specific to
> the STATE OF THE MACHINE.
Yes, it is, but have you followed my discussion with Andrew?
> If all userspace tasks are uninterruptible when trying to reserve this
> memory and, thus, oom killing is negligent and not going to help, that
> needs to be addressed in the page allocator. It is a bug for the
> allocator to continuously retry the allocation unless __GFP_NOFAIL is set
> if oom killing will not free memory.
That was my argument in the discussion with Andrew, actually.
> Adding a new __GFP_NO_OOM_KILL flag to address that isn't helpful since it
> has nothing at all to do with the specific allocation. It may certainly
> be the easiest way to implement your patchset without doing VM work, but
> it's not going to fix the problem for others.
I agree, but I didn't even want to fix the problem with OOM killing after
freezing tasks.
> I just posted a patch series[*] that would fix this problem for you
> without even locking out the oom killer or adding any unnecessary gfp
> flags. It is based on mmotm since it has Mel's page allocator speedups.
> Any change you do to the allocator at this point should be based on that
> to avoid nasty merge conflicts later, so try my series out and see how it
> works.
>
> Now, I won't engage in your personal attacks because (i) nobody else
> cares, and (ii) it's not going to be productive.
My previous message wasn't meant to be personal, so I'm sorry if it sounded
like it was.
> I'll let my code do the talking.
>
> [*] http://lkml.org/lkml/2009/5/10/118
OK, so the patch is http://lkml.org/lkml/2009/5/10/127, isn't it? I'm not
sure it will fly, given Andrew's reply.
In fact the problem is that processes in D state are only legitimately going
to stay in this state when they are _frozen_. So, the right approach seems to
be to avoid calling the OOM killer at all after freezing processes and instead
fail the allocations that would have triggered it. Which means this patch:
http://marc.info/?l=linux-kernel&m=124165031723627 (it also is my favourite
one).
But Andrew says that it's better to have a __GFP_NO_OOM_KILL flag instead,
because someone else might presumably use it in future for something (I have
no idea who that might be, but whatever) and _surely_ no one else will use a
global switch related to the freezer.
Still _I_ think that since the freezer is the source of the problematic
situation (all tasks are persistently unkillable), using it should change the
behaviour of the page allocator, so that the OOM killer is not activated
while processes are frozen. And in fact that should not depend on what flags
are used by whoever tries to allocate memory.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 22:39 ` David Rientjes
@ 2009-05-09 23:03 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-09 23:03 UTC (permalink / raw)
To: David Rientjes
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Sunday 10 May 2009, David Rientjes wrote:
> On Sat, 9 May 2009, Rafael J. Wysocki wrote:
>
> > > This has been changed in the latest mmotm with Mel's page allocator
> > > patches (and I think yours should be based on mmotm). Specifically,
> > > page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
> > >
> > > Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
> > > one of their zones would unconditionally goto restart. Now, if
> > > order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
> > > it does goto restart.
> > >
> > > So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER,
> >
> > It doesn't. All of my allocations are of order 0.
> >
>
> All order 0 allocations are implicitly __GFP_NOFAIL and will loop
> endlessly unless they can't block. So if you want to simply prohibit the
> oom killer from being invoked and not change the retry behavior, setting
> ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
> means nothing can be reclaimed and you can't free memory via oom killing,
> so there's nothing else the page allocator can do.
But I want it to give up in this case instead of looping forever.
Look. I have a specific problem at hand that I want to solve and the approach
you suggested _clearly_ _doesn't_ _work_. I have also tried to explain to you
why it doesn't work, but you're ignoring it, so I really don't know what else
I can say.
OTOH, the approach suggested by Andrew _does_ _work_ regardless of your
opinion about it. It's been tested and it's done the job 100% of the time. Go
figure. And please stop beating the dead horse.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-09 21:37 ` Rafael J. Wysocki
@ 2009-05-09 22:39 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-09 22:39 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Sat, 9 May 2009, Rafael J. Wysocki wrote:
> > This has been changed in the latest mmotm with Mel's page allocator
> > patches (and I think yours should be based on mmotm). Specifically,
> > page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
> >
> > Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
> > one of their zones would unconditionally goto restart. Now, if
> > order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
> > it does goto restart.
> >
> > So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER,
>
> It doesn't. All of my allocations are of order 0.
>
All order 0 allocations are implicitly __GFP_NOFAIL and will loop
endlessly unless they can't block. So if you want to simply prohibit the
oom killer from being invoked and not change the retry behavior, setting
ZONE_OOM_LOCKED for all zones will do that. If your machine hangs, it
means nothing can be reclaimed and you can't free memory via oom killing,
so there's nothing else the page allocator can do.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-08 23:55 ` Rafael J. Wysocki
@ 2009-05-09 21:22 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-09 21:22 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, Mel Gorman, linux-kernel, alan-jenkins,
jens.axboe, Andrew Morton, fengguang.wu, Linus Torvalds,
linux-pm
On Sat, 9 May 2009, Rafael J. Wysocki wrote:
> > All of your tasks are in D state other than kthreads, right? That means
> > they won't be in the oom killer (thus no zones are oom locked), so you can
> > easily do this
> >
> > struct zone *z;
> > for_each_populated_zone(z)
> > zone_set_flag(z, ZONE_OOM_LOCKED);
> >
> > and then
> >
> > for_each_populated_zone(z)
> > zone_clear_flag(z, ZONE_OOM_LOCKED);
> >
> > The serialization is done with trylocks so this will never invoke the oom
> > killer because all zones in the allocator's zonelist will be oom locked.
>
> Well, that might have been a good idea if it actually had worked. :-(
>
> > Why does this not work for you?
>
> If I set image_size to something below "hard core working set" +
> totalreserve_pages, preallocate_image_memory() hangs the
> box (please refer to the last patch I sent,
> http://patchwork.kernel.org/patch/22423/).
>
This has been changed in the latest mmotm with Mel's page allocator
patches (and I think yours should be based on mmotm). Specifically,
page-allocator-break-up-the-allocator-entry-point-into-fast-and-slow-paths.patch.
Before his patchset, zonelists that had ZONE_OOM_LOCKED set for at least
one of their zones would unconditionally goto restart. Now, if
order > PAGE_ALLOC_COSTLY_ORDER, it gives up and returns NULL. Otherwise,
it does goto restart.
So if your allocation has order > PAGE_ALLOC_COSTLY_ORDER, using the
ZONE_OOM_LOCKED approach to locking out the oom killer will work just fine
in mmotm.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 20:02 ` Andrew Morton
@ 2009-05-07 20:18 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 20:18 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Thursday 07 May 2009, Andrew Morton wrote:
> On Thu, 7 May 2009 21:33:47 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > On Thursday 07 May 2009, Andrew Morton wrote:
> > > On Thu, 7 May 2009 20:09:52 +0200
> > > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > >
> > > > > > > I'm suspecting that hibernation can allocate its pages with
> > > > > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > > > > will dtrt: no oom-killings.
> > > > > > >
> > > > > > > In which case, processes_are_frozen() is not needed at all?
> > > > > >
> > > > > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > > > > the combination.
> > > > >
> > > > > OK. __GFP_WAIT is the big hammer.
> > > >
> > > > Unfortunately it fails too quickly with the combination as well, so it looks
> > > > like we can't use __GFP_NORETRY during hibernation.
> > >
> > > hm.
> > >
> > > So where do we stand now?
> > >
> > > I'm not a big fan of the global application-specific state change
> > > thing. Something like __GFP_NO_OOM_KILL has a better chance of being
> > > reused by other subsystems in the future, which is a good indicator.
> >
> > I'm not against __GFP_NO_OOM_KILL, but there's been some strong resistance to
> > adding new _GFP_FOO flags recently.
>
> We have six or seven left - hardly a crisis.
>
> > Is there any likelihood anyone else will
> > really need it any time soon?
>
> Dunno - people do all sorts of crazy things. But it's more likely to
> be reused than a PM-specific global!
>
> I have no strong feelings really, but slotting into the existing
> technique with something which might be reusable is quite a bit tidier.
OK, let's try with __GFP_NO_OOM_KILL first. If there's too much disagreement,
I'll use the freezer-based approach instead.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-07 18:48 ` Andrew Morton
@ 2009-05-07 19:33 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 19:33 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Thursday 07 May 2009, Andrew Morton wrote:
> On Thu, 7 May 2009 20:09:52 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
>
> > > > > I'm suspecting that hibernation can allocate its pages with
> > > > > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > > > > will dtrt: no oom-killings.
> > > > >
> > > > > In which case, processes_are_frozen() is not needed at all?
> > > >
> > > > __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> > > > the combination.
> > >
> > > OK. __GFP_WAIT is the big hammer.
> >
> > Unfortunately it fails too quickly with the combination as well, so it looks
> > like we can't use __GFP_NORETRY during hibernation.
>
> hm.
>
> So where do we stand now?
>
> I'm not a big fan of the global application-specific state change
> thing. Something like __GFP_NO_OOM_KILL has a better chance of being
> reused by other subsystems in the future, which is a good indicator.
I'm not against __GFP_NO_OOM_KILL, but there's been some strong resistance to
adding new _GFP_FOO flags recently. Is there any likelihood anyone else will
really need it any time soon?
The advantage of the freezer-based approach is that it disables the OOM killer
when it's not going to work anyway, so it looks like a reasonable thing to do
regardless. IMHO.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-07 18:50 ` David Rientjes
0 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-07 18:50 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Andrew Morton, fengguang.wu, linux-pm, pavel, torvalds,
jens.axboe, alan-jenkins, linux-kernel, kernel-testers
On Thu, 7 May 2009, Rafael J. Wysocki wrote:
> Unfortunately it fails too quickly with the combination as well, so it looks
> like we can't use __GFP_NORETRY during hibernation.
>
If you know that no other tasks are in the oom killer at suspend time, you
can do what I mentioned earlier:
struct zone *z;
for_each_populated_zone(z)
zone_set_flag(z, ZONE_OOM_LOCKED);
and then later
for_each_populated_zone(z)
zone_clear_flag(z, ZONE_OOM_LOCKED);
The only race there is if a task is currently in the oom killer and will
subsequently clear ZONE_OOM_LOCKED for its zonelist.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-05 23:20 ` Rafael J. Wysocki
@ 2009-05-05 23:40 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-05 23:40 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Wed, 6 May 2009 01:20:34 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Wednesday 06 May 2009, Andrew Morton wrote:
> > On Wed, 6 May 2009 00:19:35 +0200
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> >
> > > > > + && !processes_are_frozen()) {
> > > > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > > > schedule_timeout_uninterruptible(1);
> > > > > goto restart;
> > > >
> > > > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > > > a new gfp flag. Thanks.
> > >
> > > Well, you're welcome.
> > >
> > > BTW, I think that Andrew was actually right when he asked if I checked whether
> > > the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> > > __GFP_NORETRY unset. Namely, in that case we never reach the code before
> > > nopage: that checks __GFP_NORETRY, do we?
> > >
> > > So I think we shouldn't modify the 'else if' condition above and check for
> > > !processes_are_frozen() at the beginning of the block below.
> >
> > Confused.
> >
> > I'm suspecting that hibernation can allocate its pages with
> > __GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
> > will dtrt: no oom-killings.
> >
> > In which case, processes_are_frozen() is not needed at all?
>
> __GFP_NORETRY alone causes it to fail relatively quickly, but I'll try with
> the combination.
OK. __GFP_WAIT is the big hammer.
> Anyway, even if the hibernation code itself doesn't trigger the OOM killer,
> but anyone else allocates memory in parallel or after we've preallocated the
> image memory, that may still trigger it. So it seems processes_are_frozen()
> may still be useful?
Could be. But only kernel threads are active at this time (yes?), and they
won't have much work to do because userspace is asleep.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-05 22:19 ` Rafael J. Wysocki
@ 2009-05-05 22:37 ` Andrew Morton
-1 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-05 22:37 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, rientjes, linux-kernel, alan-jenkins, jens.axboe,
linux-pm, fengguang.wu, torvalds
On Wed, 6 May 2009 00:19:35 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > + && !processes_are_frozen()) {
> > > if (!try_set_zone_oom(zonelist, gfp_mask)) {
> > > schedule_timeout_uninterruptible(1);
> > > goto restart;
> >
> > Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
> > a new gfp flag. Thanks.
>
> Well, you're welcome.
>
> BTW, I think that Andrew was actually right when he asked if I checked whether
> the existing __GFP_NORETRY would work as-is for __GFP_FS set and
> __GFP_NORETRY unset. Namely, in that case we never reach the code before
> nopage: that checks __GFP_NORETRY, do we?
>
> So I think we shouldn't modify the 'else if' condition above and check for
> !processes_are_frozen() at the beginning of the block below.
Confused.
I'm suspecting that hibernation can allocate its pages with
__GFP_FS|__GFP_WAIT|__GFP_NORETRY|__GFP_NOWARN, and the page allocator
will dtrt: no oom-killings.
In which case, processes_are_frozen() is not needed at all?
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 22:23 ` Rafael J. Wysocki
@ 2009-05-05 0:37 ` David Rientjes
-1 siblings, 0 replies; 580+ messages in thread
From: David Rientjes @ 2009-05-05 0:37 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
On Tue, 5 May 2009, Rafael J. Wysocki wrote:
> Index: linux-2.6/kernel/power/process.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/process.c
> +++ linux-2.6/kernel/power/process.c
> @@ -19,6 +19,8 @@
> */
> #define TIMEOUT (20 * HZ)
>
> +static bool tasks_frozen;
> +
> static inline int freezeable(struct task_struct * p)
> {
> if ((p == current) ||
> @@ -120,6 +122,10 @@ int freeze_processes(void)
> Exit:
> BUG_ON(in_atomic());
> printk("\n");
> +
> + if (!error)
> + tasks_frozen = true;
> +
> return error;
> }
>
> @@ -145,6 +151,8 @@ static void thaw_tasks(bool nosig_only)
>
> void thaw_processes(void)
> {
> + tasks_frozen = false;
> +
> printk("Restarting tasks ... ");
> thaw_tasks(true);
> thaw_tasks(false);
> @@ -152,3 +160,7 @@ void thaw_processes(void)
> printk("done.\n");
> }
>
> +bool processes_are_frozen(void)
> +{
> + return tasks_frozen;
> +}
> Index: linux-2.6/include/linux/freezer.h
> ===================================================================
> --- linux-2.6.orig/include/linux/freezer.h
> +++ linux-2.6/include/linux/freezer.h
> @@ -50,6 +50,7 @@ extern int thaw_process(struct task_stru
> extern void refrigerator(void);
> extern int freeze_processes(void);
> extern void thaw_processes(void);
> +extern bool processes_are_frozen(void);
>
> static inline int try_to_freeze(void)
> {
> @@ -170,6 +171,7 @@ static inline int thaw_process(struct ta
> static inline void refrigerator(void) {}
> static inline int freeze_processes(void) { BUG(); return 0; }
> static inline void thaw_processes(void) {}
> +static inline bool processes_are_frozen(void) { return false; }
>
> static inline int try_to_freeze(void) { return 0; }
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -46,6 +46,7 @@
> #include <linux/page-isolation.h>
> #include <linux/page_cgroup.h>
> #include <linux/debugobjects.h>
> +#include <linux/freezer.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -1599,7 +1600,8 @@ nofail_alloc:
> zonelist, high_zoneidx, alloc_flags);
> if (page)
> goto got_pg;
> - } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> + } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)
> + && !processes_are_frozen()) {
> if (!try_set_zone_oom(zonelist, gfp_mask)) {
> schedule_timeout_uninterruptible(1);
> goto restart;
Cool, that looks like the semantics of __GFP_NO_OOM_KILL without requiring
a new gfp flag. Thanks.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
@ 2009-05-04 19:01 ` Andrew Morton
0 siblings, 0 replies; 580+ messages in thread
From: Andrew Morton @ 2009-05-04 19:01 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: rientjes, fengguang.wu, linux-pm, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Mon, 4 May 2009 17:02:22 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> On Monday 04 May 2009, David Rientjes wrote:
> > On Mon, 4 May 2009, Rafael J. Wysocki wrote:
> >
> > > Index: linux-2.6/mm/page_alloc.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/page_alloc.c
> > > +++ linux-2.6/mm/page_alloc.c
> > > @@ -1620,7 +1620,8 @@ nofail_alloc:
> > > }
> > >
> > > /* The OOM killer will not help higher order allocs so fail */
> > > - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> > > + if (order > PAGE_ALLOC_COSTLY_ORDER ||
> > > + (gfp_mask & __GFP_NO_OOM_KILL)) {
> > > clear_zonelist_oom(zonelist, gfp_mask);
> > > goto nopage;
> > > }
> >
> > This is inconsistent because __GFP_NO_OOM_KILL now implies __GFP_NORETRY
> > (the "goto nopage" above), but only for allocations with __GFP_FS set and
> > __GFP_NORETRY clear.
>
> Well, what would you suggest?
>
Did you check whether the existing __GFP_NORETRY will work as-is for
this requirement?
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 1/5] mm: Add __GFP_NO_OOM_KILL flag
2009-05-04 0:08 ` Rafael J. Wysocki
@ 2009-05-04 0:10 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
From: Andrew Morton <akpm@linux-foundation.org>
> > Remind me: why can't we just allocate N pages at suspend-time?
>
> We need half of memory free. The reason we can't "just allocate" is
> probably OOM killer; but my memories are quite weak :-(.
hm. You'd think that with our splendid range of __GFP_foo flags, there
would be some combo which would suit this requirement but I can't
immediately spot one.
We can always add another I guess. Something like...
[rjw: fixed white space]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
include/linux/gfp.h | 3 ++-
mm/page_alloc.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1620,7 +1620,8 @@ nofail_alloc:
}
/* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (order > PAGE_ALLOC_COSTLY_ORDER ||
+ (gfp_mask & __GFP_NO_OOM_KILL)) {
clear_zonelist_oom(zonelist, gfp_mask);
goto nopage;
}
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -51,8 +51,9 @@ struct vm_area_struct;
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_NO_OOM_KILL ((__force gfp_t)0x200000u) /* Don't invoke out_of_memory() */
-#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 22 /* Number of __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 2/5] PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
2009-05-04 0:08 ` Rafael J. Wysocki
` (2 preceding siblings ...)
@ 2009-05-04 0:11 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
From: Rafael J. Wysocki <rjw@sisk.pl>
The next patch is going to modify the memory shrinking code so that
it will make memory allocations to free memory instead of using an
artificial memory shrinking mechanism for that. For this purpose it
is convenient to move swsusp_shrink_memory() from
kernel/power/swsusp.c to kernel/power/snapshot.c, because the new
memory-shrinking code is going to use things that are local to
kernel/power/snapshot.c .
[rev. 2: Make some functions static and remove their headers from
kernel/power/power.h]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/power.h | 4 --
kernel/power/snapshot.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/power/swsusp.c | 76 ---------------------------------------------
3 files changed, 79 insertions(+), 81 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -39,6 +39,14 @@ static int swsusp_page_is_free(struct pa
static void swsusp_set_page_forbidden(struct page *);
static void swsusp_unset_page_forbidden(struct page *);
+/*
+ * Preferred image size in bytes (tunable via /sys/power/image_size).
+ * When it is set to N, swsusp will do its best to ensure the image
+ * size will not exceed N bytes, but if that is impossible, it will
+ * try to create the smallest image possible.
+ */
+unsigned long image_size = 500 * 1024 * 1024;
+
/* List of PBEs needed for restoring the pages that were allocated before
* the suspend and included in the suspend image, but have also been
* allocated by the "resume" kernel, so their contents cannot be written
@@ -840,7 +848,7 @@ static struct page *saveable_highmem_pag
* pages.
*/
-unsigned int count_highmem_pages(void)
+static unsigned int count_highmem_pages(void)
{
struct zone *zone;
unsigned int n = 0;
@@ -902,7 +910,7 @@ static struct page *saveable_page(struct
* pages.
*/
-unsigned int count_data_pages(void)
+static unsigned int count_data_pages(void)
{
struct zone *zone;
unsigned long pfn, max_zone_pfn;
@@ -1058,6 +1066,74 @@ void swsusp_free(void)
buffer = NULL;
}
+/**
+ * swsusp_shrink_memory - Try to free as much memory as needed
+ *
+ * ... but do not OOM-kill anyone
+ *
+ * Notice: all userland should be stopped before it is called, or
+ * livelock is possible.
+ */
+
+#define SHRINK_BITE 10000
+static inline unsigned long __shrink_memory(long tmp)
+{
+ if (tmp > SHRINK_BITE)
+ tmp = SHRINK_BITE;
+ return shrink_all_memory(tmp);
+}
+
+int swsusp_shrink_memory(void)
+{
+ long tmp;
+ struct zone *zone;
+ unsigned long pages = 0;
+ unsigned int i = 0;
+ char *p = "-\\|/";
+ struct timeval start, stop;
+
+ printk(KERN_INFO "PM: Shrinking memory... ");
+ do_gettimeofday(&start);
+ do {
+ long size, highmem_size;
+
+ highmem_size = count_highmem_pages();
+ size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
+ tmp = size;
+ size += highmem_size;
+ for_each_populated_zone(zone) {
+ tmp += snapshot_additional_pages(zone);
+ if (is_highmem(zone)) {
+ highmem_size -=
+ zone_page_state(zone, NR_FREE_PAGES);
+ } else {
+ tmp -= zone_page_state(zone, NR_FREE_PAGES);
+ tmp += zone->lowmem_reserve[ZONE_NORMAL];
+ }
+ }
+
+ if (highmem_size < 0)
+ highmem_size = 0;
+
+ tmp += highmem_size;
+ if (tmp > 0) {
+ tmp = __shrink_memory(tmp);
+ if (!tmp)
+ return -ENOMEM;
+ pages += tmp;
+ } else if (size > image_size / PAGE_SIZE) {
+ tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
+ pages += tmp;
+ }
+ printk("\b%c", p[i++%4]);
+ } while (tmp > 0);
+ do_gettimeofday(&stop);
+ printk("\bdone (%lu pages freed)\n", pages);
+ swsusp_show_speed(&start, &stop, pages, "Freed");
+
+ return 0;
+}
+
#ifdef CONFIG_HIGHMEM
/**
* count_pages_for_highmem - compute the number of non-highmem pages
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -55,14 +55,6 @@
#include "power.h"
-/*
- * Preferred image size in bytes (tunable via /sys/power/image_size).
- * When it is set to N, swsusp will do its best to ensure the image
- * size will not exceed N bytes, but if that is impossible, it will
- * try to create the smallest image possible.
- */
-unsigned long image_size = 500 * 1024 * 1024;
-
int in_suspend __nosavedata = 0;
/**
@@ -195,74 +187,6 @@ void swsusp_show_speed(struct timeval *s
kps / 1000, (kps % 1000) / 10);
}
-/**
- * swsusp_shrink_memory - Try to free as much memory as needed
- *
- * ... but do not OOM-kill anyone
- *
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
- */
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
-{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
-}
-
-int swsusp_shrink_memory(void)
-{
- long tmp;
- struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
- struct timeval start, stop;
-
- printk(KERN_INFO "PM: Shrinking memory... ");
- do_gettimeofday(&start);
- do {
- long size, highmem_size;
-
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
-
- if (highmem_size < 0)
- highmem_size = 0;
-
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
- do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
-
- return 0;
-}
-
/*
* Platforms, like ACPI, may want us to save some memory used by them during
* hibernation and to restore the contents of this memory during the subsequent
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern unsigned int count_data_pages(void);
+extern int swsusp_shrink_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
@@ -149,7 +149,6 @@ extern int swsusp_swap_in_use(void);
/* kernel/power/disk.c */
extern int swsusp_check(void);
-extern int swsusp_shrink_memory(void);
extern void swsusp_free(void);
extern int swsusp_read(unsigned int *flags_p);
extern int swsusp_write(unsigned int flags);
@@ -176,7 +175,6 @@ extern int pm_notifier_call_chain(unsign
#endif
#ifdef CONFIG_HIGHMEM
-unsigned int count_highmem_pages(void);
int restore_highmem(void);
#else
static inline unsigned int count_highmem_pages(void) { return 0; }
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 2/5] PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
@ 2009-05-04 13:35 ` Pavel Machek
0 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 13:35 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Wu Fengguang, linux-pm, Andrew Morton, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Mon 2009-05-04 02:11:02, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rjw@sisk.pl>
>
> The next patch is going to modify the memory shrinking code so that
> it will make memory allocations to free memory instead of using an
> artificial memory shrinking mechanism for that. For this purpose it
> is convenient to move swsusp_shrink_memory() from
> kernel/power/swsusp.c to kernel/power/snapshot.c, because the new
> memory-shrinking code is going to use things that are local to
> kernel/power/snapshot.c .
>
> [rev. 2: Make some functions static and remove their headers from
> kernel/power/power.h]
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 3/5] PM/Suspend: Do not shrink memory before suspend
@ 2009-05-04 0:12 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
From: Rafael J. Wysocki <rjw@sisk.pl>
Remove the shrinking of memory from the suspend-to-RAM code, where
it is not really necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/main.c | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -188,9 +188,6 @@ static void suspend_test_finish(const ch
#endif
-/* This is just an arbitrary number */
-#define FREE_PAGE_NUMBER (100)
-
static struct platform_suspend_ops *suspend_ops;
/**
@@ -226,7 +223,6 @@ int suspend_valid_only_mem(suspend_state
static int suspend_prepare(void)
{
int error;
- unsigned int free_pages;
if (!suspend_ops || !suspend_ops->enter)
return -EPERM;
@@ -241,24 +237,10 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- if (suspend_freeze_processes()) {
- error = -EAGAIN;
- goto Thaw;
- }
-
- free_pages = global_page_state(NR_FREE_PAGES);
- if (free_pages < FREE_PAGE_NUMBER) {
- pr_debug("PM: free some memory\n");
- shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
- if (nr_free_pages() < FREE_PAGE_NUMBER) {
- error = -ENOMEM;
- printk(KERN_ERR "PM: No enough memory\n");
- }
- }
+ error = suspend_freeze_processes();
if (!error)
return 0;
- Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 4/5] PM/Hibernate: Use memory allocations to free memory (rev. 3)
2009-05-04 0:08 ` Rafael J. Wysocki
@ 2009-05-04 0:20 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:20 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
From: Rafael J. Wysocki <rjw@sisk.pl>
Modify the hibernation memory shrinking code so that it will make
memory allocations to free memory instead of using an artificial
memory shrinking mechanism for that. Remove the no longer used memory
shrinking functions from mm/vmscan.c .
[rev. 2: Use the existing memory bitmaps for marking preallocated
image pages and use swsusp_free() for releasing them, add comments
describing the memory shrinking strategy.
rev. 3: change the memory shrinking strategy to preallocate as much
memory as needed to get the right image size in one shot.]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/snapshot.c | 119 +++++++++++++++++++++++-----------------
mm/vmscan.c | 142 ------------------------------------------------
2 files changed, 70 insertions(+), 191 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1066,69 +1066,90 @@ void swsusp_free(void)
buffer = NULL;
}
+/* Helper function used for the shrinking of memory. */
+
/**
- * swsusp_shrink_memory - Try to free as much memory as needed
+ * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ *
+ * To create a hibernation image it is necessary to make a copy of every page
+ * frame in use. We also need a number of page frames to be free during
+ * hibernation for allocations made while saving the image and for device
+ * drivers, in case they need to allocate memory from their hibernation
+ * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
+ * respectively, both of which are rough estimates). To make this happen, we
+ * compute the total number of available page frames and allocate at least
*
- * ... but do not OOM-kill anyone
+ * ([page frames total] + PAGES_FOR_IO + SPARE_PAGES + [metadata pages]) / 2
*
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
+ * of them, which corresponds to the maximum size of a hibernation image.
+ *
+ * If image_size is set below the number following from the above formula,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size or it is impossible to
+ * allocate more memory, whichever happens first.
*/
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
-{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
-}
-
int swsusp_shrink_memory(void)
{
- long tmp;
struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
+ unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
+ int error = 0;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Shrinking memory ... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
+ /* Count the number of saveable data pages. */
+ saveable = count_data_pages() + count_highmem_pages();
+
+ /*
+ * Compute the total number of page frames we can use (count) and the
+ * number of pages needed for image metadata (size).
+ */
+ count = saveable;
+ size = 0;
+ for_each_populated_zone(zone) {
+ size += snapshot_additional_pages(zone);
+ count += zone_page_state(zone, NR_FREE_PAGES);
+ if (!is_highmem(zone))
+ count -= zone->lowmem_reserve[ZONE_NORMAL];
+ }
+
+ /* Compute the maximum number of saveable pages to leave in memory. */
+ max_size = (count - (size + PAGES_FOR_IO + SPARE_PAGES)) / 2;
+ size = DIV_ROUND_UP(image_size, PAGE_SIZE);
+ if (size > max_size)
+ size = max_size;
+ /*
+ * If the current number of saveable pages is lesser than the maximum,
+ * we don't need to do anything more.
+ */
+ if (size > saveable)
+ goto out;
- if (highmem_size < 0)
- highmem_size = 0;
+ /* Preallocate memory. */
+ for (count -= size; count > 0; count--) {
+ struct page *page;
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ page = alloc_image_page(GFP_KERNEL | __GFP_NO_OOM_KILL);
+ if (!page)
+ break;
+ pages++;
+ }
+ /* If size < max_size, preallocating enough memory may be impossible. */
+ if (count > 0 && size == max_size)
+ error = -ENOMEM;
+
+ /* Release all of the preallocated page frames. */
+ swsusp_free();
+
+ if (error) {
+ printk(KERN_CONT "\n");
+ return error;
+ }
+
+ out:
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
return 0;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2054,148 +2054,6 @@ unsigned long global_lru_pages(void)
+ global_page_state(NR_INACTIVE_FILE);
}
-#ifdef CONFIG_PM
-/*
- * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
- * from LRU lists system-wide, for given pass and priority.
- *
- * For pass > 3 we also try to shrink the LRU lists that contain a few pages
- */
-static void shrink_all_zones(unsigned long nr_pages, int prio,
- int pass, struct scan_control *sc)
-{
- struct zone *zone;
- unsigned long nr_reclaimed = 0;
-
- for_each_populated_zone(zone) {
- enum lru_list l;
-
- if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
- continue;
-
- for_each_evictable_lru(l) {
- enum zone_stat_item ls = NR_LRU_BASE + l;
- unsigned long lru_pages = zone_page_state(zone, ls);
-
- /* For pass = 0, we don't shrink the active list */
- if (pass == 0 && (l == LRU_ACTIVE_ANON ||
- l == LRU_ACTIVE_FILE))
- continue;
-
- zone->lru[l].nr_scan += (lru_pages >> prio) + 1;
- if (zone->lru[l].nr_scan >= nr_pages || pass > 3) {
- unsigned long nr_to_scan;
-
- zone->lru[l].nr_scan = 0;
- nr_to_scan = min(nr_pages, lru_pages);
- nr_reclaimed += shrink_list(l, nr_to_scan, zone,
- sc, prio);
- if (nr_reclaimed >= nr_pages) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
- }
- }
- }
- sc->nr_reclaimed += nr_reclaimed;
-}
-
-/*
- * Try to free `nr_pages' of memory, system-wide, and return the number of
- * freed pages.
- *
- * Rather than trying to age LRUs the aim is to preserve the overall
- * LRU order by reclaiming preferentially
- * inactive > active > active referenced > active mapped
- */
-unsigned long shrink_all_memory(unsigned long nr_pages)
-{
- unsigned long lru_pages, nr_slab;
- int pass;
- struct reclaim_state reclaim_state;
- struct scan_control sc = {
- .gfp_mask = GFP_KERNEL,
- .may_unmap = 0,
- .may_writepage = 1,
- .isolate_pages = isolate_pages_global,
- .nr_reclaimed = 0,
- };
-
- current->reclaim_state = &reclaim_state;
-
- lru_pages = global_lru_pages();
- nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
- /* If slab caches are huge, it's better to hit them first */
- while (nr_slab >= lru_pages) {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
- if (!reclaim_state.reclaimed_slab)
- break;
-
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- nr_slab -= reclaim_state.reclaimed_slab;
- }
-
- /*
- * We try to shrink LRUs in 5 passes:
- * 0 = Reclaim from inactive_list only
- * 1 = Reclaim from active list but don't reclaim mapped
- * 2 = 2nd pass of type 1
- * 3 = Reclaim mapped (normal reclaim)
- * 4 = 2nd pass of type 3
- */
- for (pass = 0; pass < 5; pass++) {
- int prio;
-
- /* Force reclaiming mapped pages in the passes #3 and #4 */
- if (pass > 2)
- sc.may_unmap = 1;
-
- for (prio = DEF_PRIORITY; prio >= 0; prio--) {
- unsigned long nr_to_scan = nr_pages - sc.nr_reclaimed;
-
- sc.nr_scanned = 0;
- sc.swap_cluster_max = nr_to_scan;
- shrink_all_zones(nr_to_scan, prio, pass, &sc);
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask,
- global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
- }
- }
-
- /*
- * If sc.nr_reclaimed = 0, we could not shrink LRUs, but there may be
- * something in slab caches
- */
- if (!sc.nr_reclaimed) {
- do {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- } while (sc.nr_reclaimed < nr_pages &&
- reclaim_state.reclaimed_slab > 0);
- }
-
-
-out:
- current->reclaim_state = NULL;
-
- return sc.nr_reclaimed;
-}
-#endif
-
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 4/5] PM/Hibernate: Use memory allocations to free memory (rev. 3)
@ 2009-05-04 0:20 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:20 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, pavel-+ZI9xUNit7I,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA
From: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
Modify the hibernation memory shrinking code so that it will make
memory allocations to free memory instead of using an artificial
memory shrinking mechanism for that. Remove the no longer used memory
shrinking functions from mm/vmscan.c .
[rev. 2: Use the existing memory bitmaps for marking preallocated
image pages and use swsusp_free() from releasing them, add comments
describing the memory shrinking strategy.
rev. 3: change the memory shrinking strategy to preallocate as much
memory as needed to get the right image size in one shot.]
Signed-off-by: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
---
kernel/power/snapshot.c | 119 +++++++++++++++++++++++-----------------
mm/vmscan.c | 142 ------------------------------------------------
2 files changed, 70 insertions(+), 191 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1066,69 +1066,90 @@ void swsusp_free(void)
buffer = NULL;
}
+/* Helper function used for the shrinking of memory. */
+
/**
- * swsusp_shrink_memory - Try to free as much memory as needed
+ * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ *
+ * To create a hibernation image it is necessary to make a copy of every page
+ * frame in use. We also need a number of page frames to be free during
+ * hibernation for allocations made while saving the image and for device
+ * drivers, in case they need to allocate memory from their hibernation
+ * callbacks (these two numbers are given by PAGES_FOR_IO and SPARE_PAGES,
+ * respectively, both of which are rough estimates). To make this happen, we
+ * compute the total number of available page frames and allocate at least
*
- * ... but do not OOM-kill anyone
+ * ([page frames total] + PAGES_FOR_IO + SPARE_PAGES + [metadata pages]) / 2
*
- * Notice: all userland should be stopped before it is called, or
- * livelock is possible.
+ * of them, which corresponds to the maximum size of a hibernation image.
+ *
+ * If image_size is set below the number following from the above formula,
+ * the preallocation of memory is continued until the total number of page
+ * frames in use is below the requested image size or it is impossible to
+ * allocate more memory, whichever happens first.
*/
-
-#define SHRINK_BITE 10000
-static inline unsigned long __shrink_memory(long tmp)
-{
- if (tmp > SHRINK_BITE)
- tmp = SHRINK_BITE;
- return shrink_all_memory(tmp);
-}
-
int swsusp_shrink_memory(void)
{
- long tmp;
struct zone *zone;
- unsigned long pages = 0;
- unsigned int i = 0;
- char *p = "-\\|/";
+ unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
+ int error = 0;
- printk(KERN_INFO "PM: Shrinking memory... ");
+ printk(KERN_INFO "PM: Shrinking memory ... ");
do_gettimeofday(&start);
- do {
- long size, highmem_size;
- highmem_size = count_highmem_pages();
- size = count_data_pages() + PAGES_FOR_IO + SPARE_PAGES;
- tmp = size;
- size += highmem_size;
- for_each_populated_zone(zone) {
- tmp += snapshot_additional_pages(zone);
- if (is_highmem(zone)) {
- highmem_size -=
- zone_page_state(zone, NR_FREE_PAGES);
- } else {
- tmp -= zone_page_state(zone, NR_FREE_PAGES);
- tmp += zone->lowmem_reserve[ZONE_NORMAL];
- }
- }
+ /* Count the number of saveable data pages. */
+ saveable = count_data_pages() + count_highmem_pages();
+
+ /*
+ * Compute the total number of page frames we can use (count) and the
+ * number of pages needed for image metadata (size).
+ */
+ count = saveable;
+ size = 0;
+ for_each_populated_zone(zone) {
+ size += snapshot_additional_pages(zone);
+ count += zone_page_state(zone, NR_FREE_PAGES);
+ if (!is_highmem(zone))
+ count -= zone->lowmem_reserve[ZONE_NORMAL];
+ }
+
+ /* Compute the maximum number of saveable pages to leave in memory. */
+ max_size = (count - (size + PAGES_FOR_IO + SPARE_PAGES)) / 2;
+ size = DIV_ROUND_UP(image_size, PAGE_SIZE);
+ if (size > max_size)
+ size = max_size;
+ /*
+ * If the current number of saveable pages is less than the maximum,
+ * we don't need to do anything more.
+ */
+ if (size > saveable)
+ goto out;
- if (highmem_size < 0)
- highmem_size = 0;
+ /* Preallocate memory. */
+ for (count -= size; count > 0; count--) {
+ struct page *page;
- tmp += highmem_size;
- if (tmp > 0) {
- tmp = __shrink_memory(tmp);
- if (!tmp)
- return -ENOMEM;
- pages += tmp;
- } else if (size > image_size / PAGE_SIZE) {
- tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
- pages += tmp;
- }
- printk("\b%c", p[i++%4]);
- } while (tmp > 0);
+ page = alloc_image_page(GFP_KERNEL | __GFP_NO_OOM_KILL);
+ if (!page)
+ break;
+ pages++;
+ }
+ /* If size < max_size, preallocating enough memory may be impossible. */
+ if (count > 0 && size == max_size)
+ error = -ENOMEM;
+
+ /* Release all of the preallocated page frames. */
+ swsusp_free();
+
+ if (error) {
+ printk(KERN_CONT "\n");
+ return error;
+ }
+
+ out:
do_gettimeofday(&stop);
- printk("\bdone (%lu pages freed)\n", pages);
+ printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
swsusp_show_speed(&start, &stop, pages, "Freed");
return 0;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2054,148 +2054,6 @@ unsigned long global_lru_pages(void)
+ global_page_state(NR_INACTIVE_FILE);
}
-#ifdef CONFIG_PM
-/*
- * Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
- * from LRU lists system-wide, for given pass and priority.
- *
- * For pass > 3 we also try to shrink the LRU lists that contain a few pages
- */
-static void shrink_all_zones(unsigned long nr_pages, int prio,
- int pass, struct scan_control *sc)
-{
- struct zone *zone;
- unsigned long nr_reclaimed = 0;
-
- for_each_populated_zone(zone) {
- enum lru_list l;
-
- if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
- continue;
-
- for_each_evictable_lru(l) {
- enum zone_stat_item ls = NR_LRU_BASE + l;
- unsigned long lru_pages = zone_page_state(zone, ls);
-
- /* For pass = 0, we don't shrink the active list */
- if (pass == 0 && (l == LRU_ACTIVE_ANON ||
- l == LRU_ACTIVE_FILE))
- continue;
-
- zone->lru[l].nr_scan += (lru_pages >> prio) + 1;
- if (zone->lru[l].nr_scan >= nr_pages || pass > 3) {
- unsigned long nr_to_scan;
-
- zone->lru[l].nr_scan = 0;
- nr_to_scan = min(nr_pages, lru_pages);
- nr_reclaimed += shrink_list(l, nr_to_scan, zone,
- sc, prio);
- if (nr_reclaimed >= nr_pages) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
- }
- }
- }
- sc->nr_reclaimed += nr_reclaimed;
-}
-
-/*
- * Try to free `nr_pages' of memory, system-wide, and return the number of
- * freed pages.
- *
- * Rather than trying to age LRUs the aim is to preserve the overall
- * LRU order by reclaiming preferentially
- * inactive > active > active referenced > active mapped
- */
-unsigned long shrink_all_memory(unsigned long nr_pages)
-{
- unsigned long lru_pages, nr_slab;
- int pass;
- struct reclaim_state reclaim_state;
- struct scan_control sc = {
- .gfp_mask = GFP_KERNEL,
- .may_unmap = 0,
- .may_writepage = 1,
- .isolate_pages = isolate_pages_global,
- .nr_reclaimed = 0,
- };
-
- current->reclaim_state = &reclaim_state;
-
- lru_pages = global_lru_pages();
- nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
- /* If slab caches are huge, it's better to hit them first */
- while (nr_slab >= lru_pages) {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
- if (!reclaim_state.reclaimed_slab)
- break;
-
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- nr_slab -= reclaim_state.reclaimed_slab;
- }
-
- /*
- * We try to shrink LRUs in 5 passes:
- * 0 = Reclaim from inactive_list only
- * 1 = Reclaim from active list but don't reclaim mapped
- * 2 = 2nd pass of type 1
- * 3 = Reclaim mapped (normal reclaim)
- * 4 = 2nd pass of type 3
- */
- for (pass = 0; pass < 5; pass++) {
- int prio;
-
- /* Force reclaiming mapped pages in the passes #3 and #4 */
- if (pass > 2)
- sc.may_unmap = 1;
-
- for (prio = DEF_PRIORITY; prio >= 0; prio--) {
- unsigned long nr_to_scan = nr_pages - sc.nr_reclaimed;
-
- sc.nr_scanned = 0;
- sc.swap_cluster_max = nr_to_scan;
- shrink_all_zones(nr_to_scan, prio, pass, &sc);
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask,
- global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- if (sc.nr_reclaimed >= nr_pages)
- goto out;
-
- if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
- }
- }
-
- /*
- * If sc.nr_reclaimed = 0, we could not shrink LRUs, but there may be
- * something in slab caches
- */
- if (!sc.nr_reclaimed) {
- do {
- reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
- sc.nr_reclaimed += reclaim_state.reclaimed_slab;
- } while (sc.nr_reclaimed < nr_pages &&
- reclaim_state.reclaimed_slab > 0);
- }
-
-
-out:
- current->reclaim_state = NULL;
-
- return sc.nr_reclaimed;
-}
-#endif
-
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness. So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
* [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
@ 2009-05-04 0:22 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
From: Rafael J. Wysocki <rjw@sisk.pl>
Since the hibernation code is now going to use allocations of memory
to create enough room for the image, it can also use the page frames
allocated at this stage as image page frames. The low-level
hibernation code needs to be rearranged for this purpose, but it
allows us to avoid freeing a great number of pages and allocating
these same pages once again later, so it generally is worth doing.
[rev. 2: Change the strategy of preallocating memory to allocate as
many pages as needed to get the right image size in one shot (the
excessive allocated pages are released afterwards).]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/disk.c | 15 +++-
kernel/power/power.h | 2
kernel/power/snapshot.c | 157 ++++++++++++++++++++++++++++++------------------
3 files changed, 112 insertions(+), 62 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1033,6 +1033,25 @@ copy_data_pages(struct memory_bitmap *co
static unsigned int nr_copy_pages;
/* Number of pages needed for saving the original pfns of the image pages */
static unsigned int nr_meta_pages;
+/*
+ * Numbers of normal and highmem page frames allocated for hibernation image
+ * before suspending devices.
+ */
+unsigned int alloc_normal, alloc_highmem;
+/*
+ * Memory bitmap used for marking saveable pages (during hibernation) or
+ * hibernation image pages (during restore)
+ */
+static struct memory_bitmap orig_bm;
+/*
+ * Memory bitmap used during hibernation for marking allocated page frames that
+ * will contain copies of saveable pages. During restore it is initially used
+ * for marking hibernation image pages, but then the set bits from it are
+ * duplicated in @orig_bm and it is released. On highmem systems it is next
+ * used for marking "safe" highmem pages, but it has to be reinitialized for
+ * this purpose.
+ */
+static struct memory_bitmap copy_bm;
/**
* swsusp_free - free pages allocated for the suspend.
@@ -1064,12 +1083,16 @@ void swsusp_free(void)
nr_meta_pages = 0;
restore_pblist = NULL;
buffer = NULL;
+ alloc_normal = 0;
+ alloc_highmem = 0;
}
/* Helper function used for the shrinking of memory. */
+#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
+
/**
- * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ * hibernate_preallocate_memory - Preallocate memory for hibernation image
*
* To create a hibernation image it is necessary to make a copy of every page
* frame in use. We also need a number of page frames to be free during
@@ -1088,16 +1111,27 @@ void swsusp_free(void)
* frames in use is below the requested image size or it is impossible to
* allocate more memory, whichever happens first.
*/
-int swsusp_shrink_memory(void)
+int hibernate_preallocate_memory(void)
{
struct zone *zone;
unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
- int error = 0;
+ int error;
- printk(KERN_INFO "PM: Shrinking memory ... ");
+ printk(KERN_INFO "PM: Preallocating image memory ... ");
do_gettimeofday(&start);
+ error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ alloc_normal = 0;
+ alloc_highmem = 0;
+
/* Count the number of saveable data pages. */
saveable = count_data_pages() + count_highmem_pages();
@@ -1130,29 +1164,55 @@ int swsusp_shrink_memory(void)
for (count -= size; count > 0; count--) {
struct page *page;
- page = alloc_image_page(GFP_KERNEL | __GFP_NO_OOM_KILL);
+ page = alloc_image_page(GFP_IMAGE);
if (!page)
break;
- pages++;
+ memory_bm_set_bit(&copy_bm, page_to_pfn(page));
+ if (PageHighMem(page))
+ alloc_highmem++;
+ else
+ alloc_normal++;
}
/* If size < max_size, preallocating enough memory may be impossible. */
if (count > 0 && size == max_size)
error = -ENOMEM;
+ if (error)
+ goto err_out;
- /* Release all of the preallocated page frames. */
- swsusp_free();
+ /* Save the number of allocated pages for the statistics below. */
+ pages = alloc_normal + alloc_highmem;
- if (error) {
- printk(KERN_CONT "\n");
- return error;
+ /*
+ * We only need 'size' page frames for the image but we have allocated
+ * more. Release the excessive ones now.
+ */
+ memory_bm_position_reset(&copy_bm);
+ while (alloc_normal + alloc_highmem > size) {
+ unsigned long pfn = memory_bm_next_pfn(&copy_bm);
+ struct page *page = pfn_to_page(pfn);
+
+ memory_bm_clear_bit(&copy_bm, pfn);
+ if (PageHighMem(page))
+ alloc_highmem--;
+ else
+ alloc_normal--;
+ swsusp_unset_page_forbidden(page);
+ swsusp_unset_page_free(page);
+ __free_page(page);
}
out:
do_gettimeofday(&stop);
- printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
+ printk(KERN_CONT "done (allocated %lu pages, %lu image pages kept)\n",
+ pages, size);
+ swsusp_show_speed(&start, &stop, pages, "Allocated");
return 0;
+
+ err_out:
+ printk(KERN_CONT "\n");
+ swsusp_free();
+ return error;
}
#ifdef CONFIG_HIGHMEM
@@ -1163,7 +1223,7 @@ int swsusp_shrink_memory(void)
static unsigned int count_pages_for_highmem(unsigned int nr_highmem)
{
- unsigned int free_highmem = count_free_highmem_pages();
+ unsigned int free_highmem = count_free_highmem_pages() + alloc_highmem;
if (free_highmem >= nr_highmem)
nr_highmem = 0;
@@ -1185,19 +1245,17 @@ count_pages_for_highmem(unsigned int nr_
static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
{
struct zone *zone;
- unsigned int free = 0, meta = 0;
+ unsigned int free = alloc_normal;
- for_each_zone(zone) {
- meta += snapshot_additional_pages(zone);
+ for_each_zone(zone)
if (!is_highmem(zone))
free += zone_page_state(zone, NR_FREE_PAGES);
- }
nr_pages += count_pages_for_highmem(nr_highmem);
- pr_debug("PM: Normal pages needed: %u + %u + %u, available pages: %u\n",
- nr_pages, PAGES_FOR_IO, meta, free);
+ pr_debug("PM: Normal pages needed: %u + %u, available pages: %u\n",
+ nr_pages, PAGES_FOR_IO, free);
- return free > nr_pages + PAGES_FOR_IO + meta;
+ return free > nr_pages + PAGES_FOR_IO;
}
#ifdef CONFIG_HIGHMEM
@@ -1219,7 +1277,7 @@ static inline int get_highmem_buffer(int
*/
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
{
unsigned int to_alloc = count_free_highmem_pages();
@@ -1230,7 +1288,7 @@ alloc_highmem_image_pages(struct memory_
while (to_alloc-- > 0) {
struct page *page;
- page = alloc_image_page(__GFP_HIGHMEM);
+ page = alloc_image_page(__GFP_HIGHMEM | __GFP_NO_OOM_KILL);
memory_bm_set_bit(bm, page_to_pfn(page));
}
return nr_highmem;
@@ -1239,7 +1297,7 @@ alloc_highmem_image_pages(struct memory_
static inline int get_highmem_buffer(int safe_needed) { return 0; }
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
#endif /* CONFIG_HIGHMEM */
/**
@@ -1258,51 +1316,36 @@ static int
swsusp_alloc(struct memory_bitmap *orig_bm, struct memory_bitmap *copy_bm,
unsigned int nr_pages, unsigned int nr_highmem)
{
- int error;
-
- error = memory_bm_create(orig_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
-
- error = memory_bm_create(copy_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
+ int error = 0;
if (nr_highmem > 0) {
error = get_highmem_buffer(PG_ANY);
if (error)
- goto Free;
-
- nr_pages += alloc_highmem_image_pages(copy_bm, nr_highmem);
+ goto err_out;
+ if (nr_highmem > alloc_highmem) {
+ nr_highmem -= alloc_highmem;
+ nr_pages += alloc_highmem_pages(copy_bm, nr_highmem);
+ }
}
- while (nr_pages-- > 0) {
- struct page *page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
-
- if (!page)
- goto Free;
+ if (nr_pages > alloc_normal) {
+ nr_pages -= alloc_normal;
+ while (nr_pages-- > 0) {
+ struct page *page;
- memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
+ if (!page)
+ goto err_out;
+ memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ }
}
+
return 0;
- Free:
+ err_out:
swsusp_free();
- return -ENOMEM;
+ return error;
}
-/* Memory bitmap used for marking saveable pages (during suspend) or the
- * suspend image pages (during resume)
- */
-static struct memory_bitmap orig_bm;
-/* Memory bitmap used on suspend for marking allocated pages that will contain
- * the copies of saveable pages. During resume it is initially used for
- * marking the suspend image pages, but then its set bits are duplicated in
- * @orig_bm and it is released. Next, on systems with high memory, it may be
- * used for marking "safe" highmem pages, but it has to be reinitialized for
- * this purpose.
- */
-static struct memory_bitmap copy_bm;
-
asmlinkage int swsusp_save(void)
{
unsigned int nr_pages, nr_highmem;
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern int swsusp_shrink_memory(void);
+extern int hibernate_preallocate_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -303,8 +303,8 @@ int hibernation_snapshot(int platform_mo
if (error)
return error;
- /* Free memory before shutting down devices. */
- error = swsusp_shrink_memory();
+ /* Preallocate image memory before shutting down devices. */
+ error = hibernate_preallocate_memory();
if (error)
goto Close;
@@ -320,6 +320,10 @@ int hibernation_snapshot(int platform_mo
/* Control returns here after successful restore */
Resume_devices:
+ /* We may need to release the preallocated image pages here. */
+ if (error || !in_suspend)
+ swsusp_free();
+
device_resume(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
resume_console();
@@ -593,7 +597,10 @@ int hibernate(void)
goto Thaw;
error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
- if (in_suspend && !error) {
+ if (error)
+ goto Thaw;
+
+ if (in_suspend) {
unsigned int flags = 0;
if (hibernation_mode == HIBERNATION_PLATFORM)
@@ -605,8 +612,8 @@ int hibernate(void)
power_down();
} else {
pr_debug("PM: Image restored successfully.\n");
- swsusp_free();
}
+
Thaw:
thaw_processes();
Finish:
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-04 0:22 ` Rafael J. Wysocki
@ 2009-05-05 2:24 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-05 2:24 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rjw@sisk.pl>
>
> Since the hibernation code is now going to use allocations of memory
> to create enough room for the image, it can also use the page frames
> allocated at this stage as image page frames. The low-level
> hibernation code needs to be rearranged for this purpose, but it
> allows us to avoid freeing a great number of pages and allocating
> these same pages once again later, so it generally is worth doing.
>
> [rev. 2: Change the strategy of preallocating memory to allocate as
> many pages as needed to get the right image size in one shot (the
> excessive allocated pages are released afterwards).]
Rafael, I tried out your patches and found doubled memory shrink speed!
[ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
[ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
For you reference, here is the free memory before/after
hibernate_preallocate_memory():
# free
total used free shared buffers cached
Mem: 1933 1917 15 0 0 1845
-/+ buffers/cache: 72 1861
Swap: 0 0 0
# free
total used free shared buffers cached
Mem: 1933 920 1012 0 0 356
-/+ buffers/cache: 563 1369
Swap: 0 0 0
It seems that the preallocated memory is not freed on -ENOMEM.
+ error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
memory_bm_create() is called a number of times, each time it will
call create_mem_extents()/memory_bm_free(). Can they be optimized to
be called only once?
A side note: there is somewhat duplicated *_extent_*() logic in the
filesystems; would it be possible to abstract out some of the common code?
+ for_each_populated_zone(zone) {
+ size += snapshot_additional_pages(zone);
+ count += zone_page_state(zone, NR_FREE_PAGES);
+ if (!is_highmem(zone))
+ count -= zone->lowmem_reserve[ZONE_NORMAL];
+ }
Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone;
for example, my 4GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
+ /* If size < max_size, preallocating enough memory may be impossible. */
+ if (count > 0 && size == max_size)
+ error = -ENOMEM;
+ if (error)
+ goto err_out;
The two if()s can be merged.
Finally, I'd like to raise my major concern about the transition to
preallocation-based memory shrinking: will it lead to more random swap I/O?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 2:24 ` Wu Fengguang
@ 2009-05-05 2:46 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-05 2:46 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Tue, May 05, 2009 at 10:24:27AM +0800, Wu Fengguang wrote:
> On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rjw@sisk.pl>
> >
> > Since the hibernation code is now going to use allocations of memory
> > to create enough room for the image, it can also use the page frames
> > allocated at this stage as image page frames. The low-level
> > hibernation code needs to be rearranged for this purpose, but it
> > allows us to avoid freeing a great number of pages and allocating
> > these same pages once again later, so it generally is worth doing.
> >
> > [rev. 2: Change the strategy of preallocating memory to allocate as
> > many pages as needed to get the right image size in one shot (the
> > excessive allocated pages are released afterwards).]
>
> Rafael, I tried out your patches and found doubled memory shrink speed!
>
> [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> For you reference, here is the free memory before/after
> hibernate_preallocate_memory():
>
> # free
> total used free shared buffers cached
> Mem: 1933 1917 15 0 0 1845
> -/+ buffers/cache: 72 1861
> Swap: 0 0 0
>
> # free
> total used free shared buffers cached
> Mem: 1933 920 1012 0 0 356
> -/+ buffers/cache: 563 1369
> Swap: 0 0 0
>
> It seems that the preallocated memory is not freed on -ENOMEM.
Ah, this was my fault.
I had been using this debugging trick:
@@ -1207,7 +1207,7 @@ int hibernate_preallocate_memory(void)
pages, size);
swsusp_show_speed(&start, &stop, pages, "Allocated");
- return 0;
+ return -ENOMEM;
err_out:
printk(KERN_CONT "\n");
That "return -ENOMEM" should be "error = -ENOMEM" :-)
Here is one more run:
[ 194.016991] PM: Preallocating image memory ... done (allocated 383897 pages, 128000 image pages kept)
[ 196.505999] PM: Allocated 1535588 kbytes in 2.47 seconds (621.69 MB/s)
Now the free report is back to normal:
# free
total used free shared buffers cached
Mem: 1933 74 1858 0 0 15
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 2:46 ` Wu Fengguang
@ 2009-05-05 23:07 ` Rafael J. Wysocki
0 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-05 23:07 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Tuesday 05 May 2009, Wu Fengguang wrote:
> On Tue, May 05, 2009 at 10:24:27AM +0800, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> > For you reference, here is the free memory before/after
> > hibernate_preallocate_memory():
> >
> > # free
> > total used free shared buffers cached
> > Mem: 1933 1917 15 0 0 1845
> > -/+ buffers/cache: 72 1861
> > Swap: 0 0 0
> >
> > # free
> > total used free shared buffers cached
> > Mem: 1933 920 1012 0 0 356
> > -/+ buffers/cache: 563 1369
> > Swap: 0 0 0
> >
> > It seems that the preallocated memory is not freed on -ENOMEM.
>
> Ah, this was my fault.
>
> I used to do this debugging trick:
>
> @@ -1207,7 +1207,7 @@ int hibernate_preallocate_memory(void)
> pages, size);
> swsusp_show_speed(&start, &stop, pages, "Allocated");
>
> - return 0;
> + return -ENOMEM;
>
> err_out:
> printk(KERN_CONT "\n");
>
> That "return -ENOMEM" should be "error = -ENOMEM" :-)
>
> Here is one more run:
>
> [ 194.016991] PM: Preallocating image memory ... done (allocated 383897 pages, 128000 image pages kept)
> [ 196.505999] PM: Allocated 1535588 kbytes in 2.47 seconds (621.69 MB/s)
>
> Now the free report is back to normal:
>
> # free
> total used free shared buffers cached
> Mem: 1933 74 1858 0 0 15
Thanks for testing. The results look encouraging, but I'd also like to get rid
of the regression mentioned in my previous message.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 23:07 ` Rafael J. Wysocki
@ 2009-05-05 23:40 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-05 23:40 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Wed, May 06, 2009 at 07:07:44AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Tue, May 05, 2009 at 10:24:27AM +0800, Wu Fengguang wrote:
> > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > >
> > > > Since the hibernation code is now going to use allocations of memory
> > > > to create enough room for the image, it can also use the page frames
> > > > allocated at this stage as image page frames. The low-level
> > > > hibernation code needs to be rearranged for this purpose, but it
> > > > allows us to avoid freeing a great number of pages and allocating
> > > > these same pages once again later, so it generally is worth doing.
> > > >
> > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > many pages as needed to get the right image size in one shot (the
> > > > excessive allocated pages are released afterwards).]
> > >
> > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > >
> > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> >
> > > For you reference, here is the free memory before/after
> > > hibernate_preallocate_memory():
> > >
> > > # free
> > > total used free shared buffers cached
> > > Mem: 1933 1917 15 0 0 1845
> > > -/+ buffers/cache: 72 1861
> > > Swap: 0 0 0
> > >
> > > # free
> > > total used free shared buffers cached
> > > Mem: 1933 920 1012 0 0 356
> > > -/+ buffers/cache: 563 1369
> > > Swap: 0 0 0
> > >
> > > It seems that the preallocated memory is not freed on -ENOMEM.
> >
> > Ah, this was my fault.
> >
> > I used to do this debugging trick:
> >
> > @@ -1207,7 +1207,7 @@ int hibernate_preallocate_memory(void)
> > pages, size);
> > swsusp_show_speed(&start, &stop, pages, "Allocated");
> >
> > - return 0;
> > + return -ENOMEM;
> >
> > err_out:
> > printk(KERN_CONT "\n");
> >
> > That "return -ENOMEM" should be "error = -ENOMEM" :-)
> >
> > Here is one more run:
> >
> > [ 194.016991] PM: Preallocating image memory ... done (allocated 383897 pages, 128000 image pages kept)
> > [ 196.505999] PM: Allocated 1535588 kbytes in 2.47 seconds (621.69 MB/s)
> >
> > Now the free report is back to normal:
> >
> > # free
> > total used free shared buffers cached
> > Mem: 1933 74 1858 0 0 15
The above 'free' output still exposes something wrong: only 74M of memory is
left, instead of the expected image_size=500M. I'm ready to test your updated patches :-)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 2:24 ` Wu Fengguang
@ 2009-05-05 23:05 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-05 23:05 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Tuesday 05 May 2009, Wu Fengguang wrote:
> On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rjw@sisk.pl>
> >
> > Since the hibernation code is now going to use allocations of memory
> > to create enough room for the image, it can also use the page frames
> > allocated at this stage as image page frames. The low-level
> > hibernation code needs to be rearranged for this purpose, but it
> > allows us to avoid freeing a great number of pages and allocating
> > these same pages once again later, so it generally is worth doing.
> >
> > [rev. 2: Change the strategy of preallocating memory to allocate as
> > many pages as needed to get the right image size in one shot (the
> > excessive allocated pages are released afterwards).]
>
> Rafael, I tried out your patches and found doubled memory shrink speed!
>
> [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
Unfortunately, I'm observing a regression and a huge one.
On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
and that takes ~2 s with the old code and ~15 s with the new one.
It helps to call shrink_all_memory() once with a sufficiently large argument
before the preallocation.
> For your reference, here is the free memory before/after
> hibernate_preallocate_memory():
>
> # free
> total used free shared buffers cached
> Mem: 1933 1917 15 0 0 1845
> -/+ buffers/cache: 72 1861
> Swap: 0 0 0
>
> # free
> total used free shared buffers cached
> Mem: 1933 920 1012 0 0 356
> -/+ buffers/cache: 563 1369
> Swap: 0 0 0
>
> It seems that the preallocated memory is not freed on -ENOMEM.
>
> + error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
> + if (error)
> + goto err_out;
> +
> + error = memory_bm_create(©_bm, GFP_IMAGE, PG_ANY);
> + if (error)
> + goto err_out;
>
> memory_bm_create() is called a number of times, each time it will
> call create_mem_extents()/memory_bm_free(). Can they be optimized to
> be called only once?
Possibly, but not right now if you please? This is just moving code BTW.
> A side note: there are somehow duplicated *_extent_*() logics in the
> filesystems, is it possible that we abstract out some of the common code?
I think we can do it, but it really is low priority to me at the moment.
> + for_each_populated_zone(zone) {
> + size += snapshot_additional_pages(zone);
> + count += zone_page_state(zone, NR_FREE_PAGES);
> + if (!is_highmem(zone))
> + count -= zone->lowmem_reserve[ZONE_NORMAL];
> + }
>
> Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone;
> for example, my 4 GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
Ah, this is a leftover and it should be changed or even dropped. Can you
please remind me how exactly lowmem_reserve[] is supposed to work?
> + /* If size < max_size, preallocating enough memory may be impossible. */
> + if (count > 0 && size == max_size)
> + error = -ENOMEM;
> + if (error)
> + goto err_out;
>
> The two if()s can be merged.
Unfortunately, the first one is actually wrong. :-)
It's not present in the updated patchset I'm going to send tomorrow.
> Lastly, I'd like to express my major concern about the transition to
> preallocation-based memory shrinking: will it lead to more random swap I/O?
Hmm, I don't immediately see why it would. Maybe the regression I'm seeing
is related to that ...
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 23:05 ` Rafael J. Wysocki
@ 2009-05-06 13:30 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-06 13:30 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> Unfortunately, I'm observing a regression and a huge one.
>
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
>
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.
Yes, there are some strange behaviors. I tried populating the page
cache with 1/30 mapped file pages and the rest normal file pages, all
referenced once. I get this on "echo disk > /sys/power/state":
[ 462.820098] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 462.827161] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 462.834249] PM: Basic memory bitmaps created
[ 462.838631] PM: Syncing filesystems ... done.
[ 463.167805] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 463.175738] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 463.183834] PM: Preallocating image memory ... done (allocated 383898 pages, 128000 image pages kept)
[ 469.605741] PM: Allocated 1535592 kbytes in 6.41 seconds (239.56 MB/s)
[ 469.612325]
[ 469.768796] Restarting tasks ... done.
[ 469.775044] PM: Basic memory bitmaps freed
Immediately after that, I copied a big sparse file into memory, and got this:
[ 508.097913] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 508.104799] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 508.111702] PM: Basic memory bitmaps created
[ 508.116073] PM: Syncing filesystems ... done.
[ 509.208608] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 509.216692] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 509.224708] PM: Preallocating image memory ... done (allocated 383872 pages, 128000 image pages kept)
[ 520.951882] PM: Allocated 1535488 kbytes in 11.71 seconds (131.12 MB/s)
It's much worse.
Your patches are really interesting exercises for the vmscan code ;-)
> > + error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> > +
> > + error = memory_bm_create(©_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> >
> > memory_bm_create() is called a number of times, each time it will
> > call create_mem_extents()/memory_bm_free(). Can they be optimized to
> > be called only once?
>
> Possibly, but not right now if you please? This is just moving code BTW.
OK.
>
> > A side note: there are somehow duplicated *_extent_*() logics in the
> > filesystems, is it possible that we abstract out some of the common code?
>
> I think we can do it, but it really is low priority to me at the moment.
OK. Just was a wild thought.
>
> > + for_each_populated_zone(zone) {
> > + size += snapshot_additional_pages(zone);
> > + count += zone_page_state(zone, NR_FREE_PAGES);
> > + if (!is_highmem(zone))
> > + count -= zone->lowmem_reserve[ZONE_NORMAL];
> > + }
> >
> > Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone;
> > for example, my 4 GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
>
> Ah, this is a leftover and it should be changed or even dropped. Can you
> please remind me how exactly lowmem_reserve[] is supposed to work?
totalreserve_pages would be better. When free memory drops below that
threshold (it actually works per zone), kswapd wakes up and tries to
reclaim pages. If the total reclaimable+free pages are as low as
totalreserve_pages, that would drive kswapd mad: it would scan whole
zones, trying to squeeze the last pages out of them. Sure, kswapd will
stop somewhere, but the resulting scan:reclaim ratio would be pretty
high and would therefore hurt performance.
So we should stop preallocation when reclaimable pages go down to
something like (5 * totalreserve_pages). The vmscan madness may set in
earlier because of an unbalanced distribution of reclaimable pages
among the zones.
> > Lastly, I'd like to express my major concern about the transition to
> > preallocation-based memory shrinking: will it lead to more random swap I/O?
>
> Hmm, I don't immediately see why it would. Maybe the regression I'm seeing
> is related to that ...
OK. Anyway, a preallocation-based shrinking policy could be far from optimal.
I'd suggest switching to user-space-directed shrinking via fadvise(DONTNEED),
leaving the kernel one as a fail-safe path. The user-space tool could
gather page information from the filecache interface which I've been
maintaining out of tree, and drop inactive/active pages from large
files first. That should be a better policy, at least for rotational disks.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 23:05 ` Rafael J. Wysocki
` (2 preceding siblings ...)
(?)
@ 2009-05-06 13:52 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-06 13:52 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Wu Fengguang, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
kernel-testers, torvalds, Andrew Morton
On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> Unfortunately, I'm observing a regression and a huge one.
>
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
>
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.
Yes, there are some strange behaviors. I tried populating the page
cache with 1/30 mapped file pages and the rest normal file pages, all
referenced once. I get this on "echo disk > /sys/power/state":
[ 462.820098] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 462.827161] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 462.834249] PM: Basic memory bitmaps created
[ 462.838631] PM: Syncing filesystems ... done.
[ 463.167805] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 463.175738] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 463.183834] PM: Preallocating image memory ... done (allocated 383898 pages, 128000 image pages kept)
[ 469.605741] PM: Allocated 1535592 kbytes in 6.41 seconds (239.56 MB/s)
[ 469.612325]
[ 469.768796] Restarting tasks ... done.
[ 469.775044] PM: Basic memory bitmaps freed
Immediately after that, I copied a big sparse file into memory, and got this:
[ 508.097913] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 508.104799] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 508.111702] PM: Basic memory bitmaps created
[ 508.116073] PM: Syncing filesystems ... done.
[ 509.208608] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 509.216692] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 509.224708] PM: Preallocating image memory ... done (allocated 383872 pages, 128000 image pages kept)
[ 520.951882] PM: Allocated 1535488 kbytes in 11.71 seconds (131.12 MB/s)
It's much worse.
Your patches are really interesting exercises for the vmscan code ;-)
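As a side note on reading these log lines, the printed rate appears to be simply the allocated kbytes divided by 1000 and by the elapsed time (this is a reconstruction from the numbers above, not the kernel's actual code):

```c
/* Reconstructed from the log lines: "Allocated 1535592 kbytes in
 * 6.41 seconds (239.56 MB/s)" works out as kbytes / 1000 / seconds,
 * i.e. "MB" here means 10^6 bytes. */
static double rate_mb_per_s(unsigned long kbytes, double seconds)
{
	return (double)kbytes / 1000.0 / seconds;
}
```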
> > + error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> > +
> > + error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> >
> > memory_bm_create() is called a number of times, each time it will
> > call create_mem_extents()/memory_bm_free(). Can they be optimized to
> > be called only once?
>
> Possibly, but not right now if you please? This is just moving code BTW.
OK.
>
> > A side note: there is somewhat duplicated *_extent_*() logic in the
> > filesystems; is it possible to abstract out some of the common code?
>
> I think we can do it, but it really is low priority to me at the moment.
OK. Just was a wild thought.
>
> > + for_each_populated_zone(zone) {
> > + size += snapshot_additional_pages(zone);
> > + count += zone_page_state(zone, NR_FREE_PAGES);
> > + if (!is_highmem(zone))
> > + count -= zone->lowmem_reserve[ZONE_NORMAL];
> > + }
> >
> > Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone,
> > for example, My 4GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
>
> Ah, this is a leftover and it should be changed or even dropped. Can you
> please remind me how exactly lowmem_reserve[] is supposed to work?
totalreserve_pages could be a better reference. When free memory drops
below that threshold (it actually works per zone), kswapd wakes up and
tries to reclaim pages. If the total of reclaimable+free pages is as low
as totalreserve_pages, that will drive kswapd mad - scanning whole
zones, trying to squeeze the last pages out of them. Sure, kswapd will
stop somewhere, but the resulting scan:reclaim ratio would be pretty
high and therefore hurt performance.
So we should stop preallocation when reclaimable pages go down to
something like (5*totalreserve_pages). The vmscan madness may come earlier
because of an unbalanced distribution of reclaimable pages among the zones.
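The stopping rule described above could be sketched roughly as follows (an illustration only - the helper name and the factor of 5 come from this discussion, not from actual kernel code):

```c
#include <stdbool.h>

/* Illustrative sketch: stop image preallocation once the number of
 * reclaimable pages approaches a small multiple of totalreserve_pages,
 * instead of squeezing the zones dry and over-scanning them. */
#define RESERVE_SAFETY_FACTOR 5UL

static bool should_stop_preallocation(unsigned long reclaimable_pages,
				      unsigned long totalreserve_pages)
{
	/* Keep ~5x the reserve so the scan:reclaim ratio stays sane. */
	return reclaimable_pages <= RESERVE_SAFETY_FACTOR * totalreserve_pages;
}
```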
> > At last, I'd express my major concern about the transition to preallocate
> > based memory shrinking: will it lead to more random swapping IOs?
>
> Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> is related to that ...
OK. Anyway, a preallocation-based shrinking policy could be far from optimal.
I'd suggest switching to user-space-directed shrinking via fadvise(DONTNEED),
and keeping the kernel one as a fail-safe path. The user-space tool could
gather page information from the filecache interface, which I've been
maintaining out of tree, and drop inactive/active pages from large
files first. That should be a better policy, at least for rotational disks.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-05 23:05 ` Rafael J. Wysocki
` (4 preceding siblings ...)
(?)
@ 2009-05-06 13:56 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-06 13:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> Unfortunately, I'm observing a regression and a huge one.
>
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
>
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.
[snip]
> > At last, I'd express my major concern about the transition to preallocate
> > based memory shrinking: will it lead to more random swapping IOs?
>
> Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> is related to that ...
So you do have a swap file enabled? hibernate_preallocate_memory() will
first try to allocate as many pages as possible (savable+free), and
then free up (allocated-image_size) pages. That means *all*
swappable pages will be swapped out in the process - that's a major
performance regression! And the zones are likely to be *over scanned*
and to go into the *all unreclaimable* state! (Hopefully they may
already be small at that time.)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-06 13:56 ` Wu Fengguang
(?)
@ 2009-05-06 20:54 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-06 20:54 UTC (permalink / raw)
To: Wu Fengguang
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe, linux-pm,
Wu Fengguang, torvalds, Andrew Morton
On Wednesday 06 May 2009, Wu Fengguang wrote:
> On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > >
> > > > Since the hibernation code is now going to use allocations of memory
> > > > to create enough room for the image, it can also use the page frames
> > > > allocated at this stage as image page frames. The low-level
> > > > hibernation code needs to be rearranged for this purpose, but it
> > > > allows us to avoid freeing a great number of pages and allocating
> > > > these same pages once again later, so it generally is worth doing.
> > > >
> > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > many pages as needed to get the right image size in one shot (the
> > > > excessive allocated pages are released afterwards).]
> > >
> > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > >
> > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> >
> > Unfortunately, I'm observing a regression and a huge one.
> >
> > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > and that takes ~2 s with the old code and ~15 s with the new one.
> >
> > It helps to call shrink_all_memory() once with a sufficiently large argument
> > before the preallocation.
> [snip]
> > > At last, I'd express my major concern about the transition to preallocate
> > > based memory shrinking: will it lead to more random swapping IOs?
> >
> > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > is related to that ...
>
> So you do have a swap file enabled? hibernate_preallocate_memory() will
> first try to allocate as many pages as possible (savable+free), and
> then free up (allocated-image_size) pages.
No. It's going to allocate (total RAM - anticipated image size) and then free
up (allocated-image_size) pages.
If we consider maximum image sizes, that means allocating slightly more than
50% of RAM, so it really shouldn't regress that much IMO.
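The sizing being described can be sketched with a couple of illustrative helpers (hypothetical names, not the actual hibernate_preallocate_memory() code):

```c
/* Illustrative only: preallocate (total RAM - anticipated image size)
 * pages, then release the excess so that image_size pages stay pinned
 * for the image. */
static unsigned long pages_to_preallocate(unsigned long total_ram_pages,
					  unsigned long image_size_pages)
{
	return total_ram_pages - image_size_pages;
}

static unsigned long excess_to_release(unsigned long allocated_pages,
				       unsigned long image_size_pages)
{
	return allocated_pages - image_size_pages;
}
```

For example, on a 1 GB box (262144 pages of 4 KB) with a 128000-page image, this preallocates 134144 pages, slightly more than half of RAM, and then releases only 6144 of them.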
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-06 20:54 ` Rafael J. Wysocki
(?)
@ 2009-05-07 1:58 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-07 1:58 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Thu, May 07, 2009 at 04:54:09AM +0800, Rafael J. Wysocki wrote:
> On Wednesday 06 May 2009, Wu Fengguang wrote:
> > On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > > >
> > > > > Since the hibernation code is now going to use allocations of memory
> > > > > to create enough room for the image, it can also use the page frames
> > > > > allocated at this stage as image page frames. The low-level
> > > > > hibernation code needs to be rearranged for this purpose, but it
> > > > > allows us to avoid freeing a great number of pages and allocating
> > > > > these same pages once again later, so it generally is worth doing.
> > > > >
> > > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > > many pages as needed to get the right image size in one shot (the
> > > > > excessive allocated pages are released afterwards).]
> > > >
> > > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > > >
> > > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> > >
> > > Unfortunately, I'm observing a regression and a huge one.
> > >
> > > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > > and that takes ~2 s with the old code and ~15 s with the new one.
> > >
> > > It helps to call shrink_all_memory() once with a sufficiently large argument
> > > before the preallocation.
> > [snip]
> > > > At last, I'd express my major concern about the transition to preallocate
> > > > based memory shrinking: will it lead to more random swapping IOs?
> > >
> > > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > > is related to that ...
> >
> > So you do have swap file enabled? hibernate_preallocate_memory() will
> > firstly try to allocate as much pages as possible(savable+free), and
> > then to free up (allocated-image_size) pages.
>
> No. It's going to allocate (total RAM - anticipated image size) and then free
> up (allocated-image_size) pages.
Ah yes - I didn't notice that count was subtracted here:
for (count -= size; count > 0; count--) {
Make "count -= size" a standalone line to make that more obvious?
> If we consider maximum image sizes, that means allocating slightly more than
> 50% of RAM, so it really shouldn't regress that much IMO.
Right, that would be less of a problem.
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-07 1:58 ` Wu Fengguang
(?)
@ 2009-05-07 12:20 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-07 12:20 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Thursday 07 May 2009, Wu Fengguang wrote:
> On Thu, May 07, 2009 at 04:54:09AM +0800, Rafael J. Wysocki wrote:
> > On Wednesday 06 May 2009, Wu Fengguang wrote:
> > > On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > > > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > > > >
> > > > > > Since the hibernation code is now going to use allocations of memory
> > > > > > to create enough room for the image, it can also use the page frames
> > > > > > allocated at this stage as image page frames. The low-level
> > > > > > hibernation code needs to be rearranged for this purpose, but it
> > > > > > allows us to avoid freeing a great number of pages and allocating
> > > > > > these same pages once again later, so it generally is worth doing.
> > > > > >
> > > > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > > > many pages as needed to get the right image size in one shot (the
> > > > > > excessive allocated pages are released afterwards).]
> > > > >
> > > > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > > > >
> > > > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> > > >
> > > > Unfortunately, I'm observing a regression and a huge one.
> > > >
> > > > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > > > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > > > and that takes ~2 s with the old code and ~15 s with the new one.
> > > >
> > > > It helps to call shrink_all_memory() once with a sufficiently large argument
> > > > before the preallocation.
> > > [snip]
> > > > > At last, I'd express my major concern about the transition to preallocate
> > > > > based memory shrinking: will it lead to more random swapping IOs?
> > > >
> > > > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > > > is related to that ...
> > >
> > > So you do have swap file enabled? hibernate_preallocate_memory() will
> > > firstly try to allocate as much pages as possible(savable+free), and
> > > then to free up (allocated-image_size) pages.
> >
> > No. It's going to allocate (total RAM - anticipated image size) and then free
> > up (allocated-image_size) pages.
>
> Ah yes - I didn't notice that count was subtracted here:
>
> for (count -= size; count > 0; count--) {
>
> Make "count -= size" a standalone line to make that more obvious?
That should be clear in the new patches:
http://patchwork.kernel.org/patch/22193/
http://patchwork.kernel.org/patch/22191/
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
@ 2009-05-07 12:34 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-07 12:34 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Thu, May 07, 2009 at 02:20:42PM +0200, Rafael J. Wysocki wrote:
> On Thursday 07 May 2009, Wu Fengguang wrote:
> > On Thu, May 07, 2009 at 04:54:09AM +0800, Rafael J. Wysocki wrote:
> > > On Wednesday 06 May 2009, Wu Fengguang wrote:
> > > > On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > > > > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > > > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > > > > >
> > > > > > > Since the hibernation code is now going to use allocations of memory
> > > > > > > to create enough room for the image, it can also use the page frames
> > > > > > > allocated at this stage as image page frames. The low-level
> > > > > > > hibernation code needs to be rearranged for this purpose, but it
> > > > > > > allows us to avoid freeing a great number of pages and allocating
> > > > > > > these same pages once again later, so it generally is worth doing.
> > > > > > >
> > > > > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > > > > many pages as needed to get the right image size in one shot (the
> > > > > > > excessive allocated pages are released afterwards).]
> > > > > >
> > > > > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > > > > >
> > > > > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > > > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> > > > >
> > > > > Unfortunately, I'm observing a regression and a huge one.
> > > > >
> > > > > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > > > > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > > > > and that takes ~2 s with the old code and ~15 s with the new one.
> > > > >
> > > > > It helps to call shrink_all_memory() once with a sufficiently large argument
> > > > > before the preallocation.
> > > > [snip]
> > > > > > At last, I'd express my major concern about the transition to preallocate
> > > > > > based memory shrinking: will it lead to more random swapping IOs?
> > > > >
> > > > > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > > > > is related to that ...
> > > >
> > > > So you do have swap file enabled? hibernate_preallocate_memory() will
> > > > firstly try to allocate as much pages as possible(savable+free), and
> > > > then to free up (allocated-image_size) pages.
> > >
> > > No. It's going to allocate (total RAM - anticipated image size) and then free
> > > up (allocated-image_size) pages.
> >
> > Ah yes - I didn't notice that count was subtracted here:
> >
> > for (count -= size; count > 0; count--) {
> >
> > Make "count -= size" a standalone line to make that more obvious?
>
> That should be clear in the new patches:
> http://patchwork.kernel.org/patch/22193/
> http://patchwork.kernel.org/patch/22191/
Yes, thanks! That's much better :)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
@ 2009-05-07 12:34 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-07 12:34 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-pm-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Andrew Morton, pavel-+ZI9xUNit7I,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
alan-jenkins-cCz0Lq7MMjm9FHfhHBbuYA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-testers-u79uwXL29TY76Z2rM5mHXA
On Thu, May 07, 2009 at 02:20:42PM +0200, Rafael J. Wysocki wrote:
> On Thursday 07 May 2009, Wu Fengguang wrote:
> > On Thu, May 07, 2009 at 04:54:09AM +0800, Rafael J. Wysocki wrote:
> > > On Wednesday 06 May 2009, Wu Fengguang wrote:
> > > > On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > > > > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > > > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > > > > From: Rafael J. Wysocki <rjw-KKrjLPT3xs0@public.gmane.org>
> > > > > > >
> > > > > > > Since the hibernation code is now going to use allocations of memory
> > > > > > > to create enough room for the image, it can also use the page frames
> > > > > > > allocated at this stage as image page frames. The low-level
> > > > > > > hibernation code needs to be rearranged for this purpose, but it
> > > > > > > allows us to avoid freeing a great number of pages and allocating
> > > > > > > these same pages once again later, so it generally is worth doing.
> > > > > > >
> > > > > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > > > > many pages as needed to get the right image size in one shot (the
> > > > > > > excessive allocated pages are released afterwards).]
> > > > > >
> > > > > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > > > > >
> > > > > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > > > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> > > > >
> > > > > Unfortunately, I'm observing a regression and a huge one.
> > > > >
> > > > > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > > > > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > > > > and that takes ~2 s with the old code and ~15 s with the new one.
> > > > >
> > > > > It helps to call shrink_all_memory() once with a sufficiently large argument
> > > > > before the preallocation.
> > > > [snip]
> > > > > > At last, I'd express my major concern about the transition to preallocate
> > > > > > based memory shrinking: will it lead to more random swapping IOs?
> > > > >
> > > > > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > > > > is related to that ...
> > > >
> > > > So you do have swap file enabled? hibernate_preallocate_memory() will
> > > > firstly try to allocate as much pages as possible(savable+free), and
> > > > then to free up (allocated-image_size) pages.
> > >
> > > No. It's going to allocate (total RAM - anticipated image size) and then free
> > > up (allocated-image_size) pages.
> >
> > Ah yes - I didn't notice that count was subtracted here:
> >
> > for (count -= size; count > 0; count--) {
> >
> > Make "count -= size" a standalone line to make that more obvious?
>
> That should be clear in the new patches:
> http://patchwork.kernel.org/patch/22193/
> http://patchwork.kernel.org/patch/22191/
Yes, thanks! That's much better :)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-07 12:20 ` Rafael J. Wysocki
(?)
(?)
@ 2009-05-07 12:34 ` Wu Fengguang
-1 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-05-07 12:34 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
On Thu, May 07, 2009 at 02:20:42PM +0200, Rafael J. Wysocki wrote:
> On Thursday 07 May 2009, Wu Fengguang wrote:
> > On Thu, May 07, 2009 at 04:54:09AM +0800, Rafael J. Wysocki wrote:
> > > On Wednesday 06 May 2009, Wu Fengguang wrote:
> > > > On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > > > > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > > > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > > > > >
> > > > > > > Since the hibernation code is now going to use allocations of memory
> > > > > > > to create enough room for the image, it can also use the page frames
> > > > > > > allocated at this stage as image page frames. The low-level
> > > > > > > hibernation code needs to be rearranged for this purpose, but it
> > > > > > > allows us to avoid freeing a great number of pages and allocating
> > > > > > > these same pages once again later, so it generally is worth doing.
> > > > > > >
> > > > > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > > > > many pages as needed to get the right image size in one shot (the
> > > > > > > excessive allocated pages are released afterwards).]
> > > > > >
> > > > > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > > > > >
> > > > > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > > > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> > > > >
> > > > > Unfortunately, I'm observing a regression and a huge one.
> > > > >
> > > > > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > > > > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > > > > and that takes ~2 s with the old code and ~15 s with the new one.
> > > > >
> > > > > It helps to call shrink_all_memory() once with a sufficiently large argument
> > > > > before the preallocation.
> > > > [snip]
> > > > > > At last, I'd express my major concern about the transition to preallocate
> > > > > > based memory shrinking: will it lead to more random swapping IOs?
> > > > >
> > > > > Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> > > > > is related to that ...
> > > >
> > > > So you do have a swap file enabled? hibernate_preallocate_memory() will
> > > > first try to allocate as many pages as possible (savable+free), and
> > > > then free up (allocated-image_size) pages.
> > >
> > > No. It's going to allocate (total RAM - anticipated image size) and then free
> > > up (allocated-image_size) pages.
> >
> > Ah yes - I didn't notice that count was subtracted here:
> >
> > for (count -= size; count > 0; count--) {
> >
> > Make "count -= size" a standalone line to make that more obvious?
>
> That should be clear in the new patches:
> http://patchwork.kernel.org/patch/22193/
> http://patchwork.kernel.org/patch/22191/
Yes, thanks! That's much better :)
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
@ 2009-08-16 13:46 ` Wu Fengguang
0 siblings, 0 replies; 580+ messages in thread
From: Wu Fengguang @ 2009-08-16 13:46 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> Unfortunately, I'm observing a regression and a huge one.
>
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
>
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.
The 10-fold slowdown may be related to swapping IO:
shrink_all_memory() tends to reclaim fewer anon pages.
Is this box running on an SSD? (Which can be slow on random writes.)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-08-16 13:46 ` Wu Fengguang
@ 2009-08-16 22:48 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-08-16 22:48 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-pm, Andrew Morton, pavel, torvalds, jens.axboe,
alan-jenkins, linux-kernel, kernel-testers
On Sunday 16 August 2009, Wu Fengguang wrote:
> On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> > On Tuesday 05 May 2009, Wu Fengguang wrote:
> > > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rjw@sisk.pl>
> > > >
> > > > Since the hibernation code is now going to use allocations of memory
> > > > to create enough room for the image, it can also use the page frames
> > > > allocated at this stage as image page frames. The low-level
> > > > hibernation code needs to be rearranged for this purpose, but it
> > > > allows us to avoid freeing a great number of pages and allocating
> > > > these same pages once again later, so it generally is worth doing.
> > > >
> > > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > > many pages as needed to get the right image size in one shot (the
> > > > excessive allocated pages are released afterwards).]
> > >
> > > Rafael, I tried out your patches and found doubled memory shrink speed!
> > >
> > > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> >
> > Unfortunately, I'm observing a regression and a huge one.
> >
> > On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> > with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> > and that takes ~2 s with the old code and ~15 s with the new one.
> >
> > It helps to call shrink_all_memory() once with a sufficiently large argument
> > before the preallocation.
>
> The 10-fold slowdown may be related to swapping IO:
I guess it is.
> shrink_all_memory() tends to reclaim fewer anon pages.
>
> Is this box running on an SSD? (Which can be slow on random writes.)
No, on a normal spinning-plate HDD (2.5'').
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)
2009-05-04 0:08 ` Rafael J. Wysocki
` (9 preceding siblings ...)
@ 2009-05-04 0:22 ` Rafael J. Wysocki
-1 siblings, 0 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 0:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, alan-jenkins, jens.axboe, linux-pm, kernel-testers,
torvalds, Andrew Morton
From: Rafael J. Wysocki <rjw@sisk.pl>
Since the hibernation code is now going to use allocations of memory
to create enough room for the image, it can also use the page frames
allocated at this stage as image page frames. The low-level
hibernation code needs to be rearranged for this purpose, but it
allows us to avoid freeing a great number of pages and allocating
these same pages once again later, so it generally is worth doing.
[rev. 2: Change the strategy of preallocating memory to allocate as
many pages as needed to get the right image size in one shot (the
excessive allocated pages are released afterwards).]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
kernel/power/disk.c | 15 +++-
kernel/power/power.h | 2
kernel/power/snapshot.c | 157 ++++++++++++++++++++++++++++++------------------
3 files changed, 112 insertions(+), 62 deletions(-)
Index: linux-2.6/kernel/power/snapshot.c
===================================================================
--- linux-2.6.orig/kernel/power/snapshot.c
+++ linux-2.6/kernel/power/snapshot.c
@@ -1033,6 +1033,25 @@ copy_data_pages(struct memory_bitmap *co
static unsigned int nr_copy_pages;
/* Number of pages needed for saving the original pfns of the image pages */
static unsigned int nr_meta_pages;
+/*
+ * Numbers of normal and highmem page frames allocated for hibernation image
+ * before suspending devices.
+ */
+unsigned int alloc_normal, alloc_highmem;
+/*
+ * Memory bitmap used for marking saveable pages (during hibernation) or
+ * hibernation image pages (during restore)
+ */
+static struct memory_bitmap orig_bm;
+/*
+ * Memory bitmap used during hibernation for marking allocated page frames that
+ * will contain copies of saveable pages. During restore it is initially used
+ * for marking hibernation image pages, but then the set bits from it are
+ * duplicated in @orig_bm and it is released. On highmem systems it is next
+ * used for marking "safe" highmem pages, but it has to be reinitialized for
+ * this purpose.
+ */
+static struct memory_bitmap copy_bm;
/**
* swsusp_free - free pages allocated for the suspend.
@@ -1064,12 +1083,16 @@ void swsusp_free(void)
nr_meta_pages = 0;
restore_pblist = NULL;
buffer = NULL;
+ alloc_normal = 0;
+ alloc_highmem = 0;
}
/* Helper function used for the shrinking of memory. */
+#define GFP_IMAGE (GFP_KERNEL | __GFP_NO_OOM_KILL)
+
/**
- * swsusp_shrink_memory - Make the kernel release as much memory as needed
+ * hibernate_preallocate_memory - Preallocate memory for hibernation image
*
* To create a hibernation image it is necessary to make a copy of every page
* frame in use. We also need a number of page frames to be free during
@@ -1088,16 +1111,27 @@ void swsusp_free(void)
* frames in use is below the requested image size or it is impossible to
* allocate more memory, whichever happens first.
*/
-int swsusp_shrink_memory(void)
+int hibernate_preallocate_memory(void)
{
struct zone *zone;
unsigned long saveable, size, max_size, count, pages = 0;
struct timeval start, stop;
- int error = 0;
+ int error;
- printk(KERN_INFO "PM: Shrinking memory ... ");
+ printk(KERN_INFO "PM: Preallocating image memory ... ");
do_gettimeofday(&start);
+ error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
+ if (error)
+ goto err_out;
+
+ alloc_normal = 0;
+ alloc_highmem = 0;
+
/* Count the number of saveable data pages. */
saveable = count_data_pages() + count_highmem_pages();
@@ -1130,29 +1164,55 @@ int swsusp_shrink_memory(void)
for (count -= size; count > 0; count--) {
struct page *page;
- page = alloc_image_page(GFP_KERNEL | __GFP_NO_OOM_KILL);
+ page = alloc_image_page(GFP_IMAGE);
if (!page)
break;
- pages++;
+ memory_bm_set_bit(©_bm, page_to_pfn(page));
+ if (PageHighMem(page))
+ alloc_highmem++;
+ else
+ alloc_normal++;
}
/* If size < max_size, preallocating enough memory may be impossible. */
if (count > 0 && size == max_size)
error = -ENOMEM;
+ if (error)
+ goto err_out;
- /* Release all of the preallocated page frames. */
- swsusp_free();
+ /* Save the number of allocated pages for the statistics below. */
+ pages = alloc_normal + alloc_highmem;
- if (error) {
- printk(KERN_CONT "\n");
- return error;
+ /*
+ * We only need 'size' page frames for the image but we have allocated
+ * more. Release the excessive ones now.
+ */
+ memory_bm_position_reset(&copy_bm);
+ while (alloc_normal + alloc_highmem > size) {
+ unsigned long pfn = memory_bm_next_pfn(&copy_bm);
+ struct page *page = pfn_to_page(pfn);
+
+ memory_bm_clear_bit(&copy_bm, pfn);
+ if (PageHighMem(page))
+ alloc_highmem--;
+ else
+ alloc_normal--;
+ swsusp_unset_page_forbidden(page);
+ swsusp_unset_page_free(page);
+ __free_page(page);
}
out:
do_gettimeofday(&stop);
- printk(KERN_CONT "done (preallocated %lu free pages)\n", pages);
- swsusp_show_speed(&start, &stop, pages, "Freed");
+ printk(KERN_CONT "done (allocated %lu pages, %lu image pages kept)\n",
+ pages, size);
+ swsusp_show_speed(&start, &stop, pages, "Allocated");
return 0;
+
+ err_out:
+ printk(KERN_CONT "\n");
+ swsusp_free();
+ return error;
}
#ifdef CONFIG_HIGHMEM
@@ -1163,7 +1223,7 @@ int swsusp_shrink_memory(void)
static unsigned int count_pages_for_highmem(unsigned int nr_highmem)
{
- unsigned int free_highmem = count_free_highmem_pages();
+ unsigned int free_highmem = count_free_highmem_pages() + alloc_highmem;
if (free_highmem >= nr_highmem)
nr_highmem = 0;
@@ -1185,19 +1245,17 @@ count_pages_for_highmem(unsigned int nr_
static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem)
{
struct zone *zone;
- unsigned int free = 0, meta = 0;
+ unsigned int free = alloc_normal;
- for_each_zone(zone) {
- meta += snapshot_additional_pages(zone);
+ for_each_zone(zone)
if (!is_highmem(zone))
free += zone_page_state(zone, NR_FREE_PAGES);
- }
nr_pages += count_pages_for_highmem(nr_highmem);
- pr_debug("PM: Normal pages needed: %u + %u + %u, available pages: %u\n",
- nr_pages, PAGES_FOR_IO, meta, free);
+ pr_debug("PM: Normal pages needed: %u + %u, available pages: %u\n",
+ nr_pages, PAGES_FOR_IO, free);
- return free > nr_pages + PAGES_FOR_IO + meta;
+ return free > nr_pages + PAGES_FOR_IO;
}
#ifdef CONFIG_HIGHMEM
@@ -1219,7 +1277,7 @@ static inline int get_highmem_buffer(int
*/
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
{
unsigned int to_alloc = count_free_highmem_pages();
@@ -1230,7 +1288,7 @@ alloc_highmem_image_pages(struct memory_
while (to_alloc-- > 0) {
struct page *page;
- page = alloc_image_page(__GFP_HIGHMEM);
+ page = alloc_image_page(__GFP_HIGHMEM | __GFP_NO_OOM_KILL);
memory_bm_set_bit(bm, page_to_pfn(page));
}
return nr_highmem;
@@ -1239,7 +1297,7 @@ alloc_highmem_image_pages(struct memory_
static inline int get_highmem_buffer(int safe_needed) { return 0; }
static inline unsigned int
-alloc_highmem_image_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
+alloc_highmem_pages(struct memory_bitmap *bm, unsigned int n) { return 0; }
#endif /* CONFIG_HIGHMEM */
/**
@@ -1258,51 +1316,36 @@ static int
swsusp_alloc(struct memory_bitmap *orig_bm, struct memory_bitmap *copy_bm,
unsigned int nr_pages, unsigned int nr_highmem)
{
- int error;
-
- error = memory_bm_create(orig_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
-
- error = memory_bm_create(copy_bm, GFP_ATOMIC | __GFP_COLD, PG_ANY);
- if (error)
- goto Free;
+ int error = 0;
if (nr_highmem > 0) {
error = get_highmem_buffer(PG_ANY);
if (error)
- goto Free;
-
- nr_pages += alloc_highmem_image_pages(copy_bm, nr_highmem);
+ goto err_out;
+ if (nr_highmem > alloc_highmem) {
+ nr_highmem -= alloc_highmem;
+ nr_pages += alloc_highmem_pages(copy_bm, nr_highmem);
+ }
}
- while (nr_pages-- > 0) {
- struct page *page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
-
- if (!page)
- goto Free;
+ if (nr_pages > alloc_normal) {
+ nr_pages -= alloc_normal;
+ while (nr_pages-- > 0) {
+ struct page *page;
- memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ page = alloc_image_page(GFP_ATOMIC | __GFP_COLD);
+ if (!page)
+ goto err_out;
+ memory_bm_set_bit(copy_bm, page_to_pfn(page));
+ }
}
+
return 0;
- Free:
+ err_out:
swsusp_free();
- return -ENOMEM;
+ return error;
}
-/* Memory bitmap used for marking saveable pages (during suspend) or the
- * suspend image pages (during resume)
- */
-static struct memory_bitmap orig_bm;
-/* Memory bitmap used on suspend for marking allocated pages that will contain
- * the copies of saveable pages. During resume it is initially used for
- * marking the suspend image pages, but then its set bits are duplicated in
- * @orig_bm and it is released. Next, on systems with high memory, it may be
- * used for marking "safe" highmem pages, but it has to be reinitialized for
- * this purpose.
- */
-static struct memory_bitmap copy_bm;
-
asmlinkage int swsusp_save(void)
{
unsigned int nr_pages, nr_highmem;
Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -74,7 +74,7 @@ extern asmlinkage int swsusp_arch_resume
extern int create_basic_memory_bitmaps(void);
extern void free_basic_memory_bitmaps(void);
-extern int swsusp_shrink_memory(void);
+extern int hibernate_preallocate_memory(void);
/**
* Auxiliary structure used for reading the snapshot image data and
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -303,8 +303,8 @@ int hibernation_snapshot(int platform_mo
if (error)
return error;
- /* Free memory before shutting down devices. */
- error = swsusp_shrink_memory();
+ /* Preallocate image memory before shutting down devices. */
+ error = hibernate_preallocate_memory();
if (error)
goto Close;
@@ -320,6 +320,10 @@ int hibernation_snapshot(int platform_mo
/* Control returns here after successful restore */
Resume_devices:
+ /* We may need to release the preallocated image pages here. */
+ if (error || !in_suspend)
+ swsusp_free();
+
device_resume(in_suspend ?
(error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE);
resume_console();
@@ -593,7 +597,10 @@ int hibernate(void)
goto Thaw;
error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);
- if (in_suspend && !error) {
+ if (error)
+ goto Thaw;
+
+ if (in_suspend) {
unsigned int flags = 0;
if (hibernation_mode == HIBERNATION_PLATFORM)
@@ -605,8 +612,8 @@ int hibernate(void)
power_down();
} else {
pr_debug("PM: Image restored successfully.\n");
- swsusp_free();
}
+
Thaw:
thaw_processes();
Finish:
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
2009-05-03 16:30 ` Rafael J. Wysocki
` (2 preceding siblings ...)
@ 2009-05-04 9:33 ` Pavel Machek
-1 siblings, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 9:33 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
Hi!
> I know that swsusp_shrink_memory() has problems, that's why I'd like to get rid
> of it.
>
> > I wonder if it's possible to free up the memory within 1s at all.
>
> I'm not sure.
>
> Apparently, the counting of saveable pages takes substantial time (0.5 s each
> iteration on my 64-bit test box), so we can improve that by limiting the number
> of iterations.
We could increase the step size after each step: free in a 40MB step, then
an 80MB step, then a 160MB step, ...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
2009-05-04 9:33 ` Pavel Machek
@ 2009-05-04 19:53 ` Rafael J. Wysocki
2009-05-04 20:27 ` Pavel Machek
2009-05-04 20:27 ` Pavel Machek
-1 siblings, 2 replies; 580+ messages in thread
From: Rafael J. Wysocki @ 2009-05-04 19:53 UTC (permalink / raw)
To: Pavel Machek
Cc: Wu Fengguang, Andrew Morton, torvalds, jens.axboe, alan-jenkins,
linux-kernel, kernel-testers, linux-pm
On Monday 04 May 2009, Pavel Machek wrote:
> Hi!
>
> > I know that swsusp_shrink_memory() has problems, that's why I'd like to get rid
> > of it.
> >
> > > I wonder if it's possible to free up the memory within 1s at all.
> >
> > I'm not sure.
> >
> > Apparently, the counting of saveable pages takes substantial time (0.5 s each
> > iteration on my 64-bit test box), so we can improve that by limiting the number
> > of iterations.
>
> We could increase step size after each step. Free in 40MB step, then
> 80MB step, then 160MB step, ...
Why not just one step? It doesn't seem to hurt performance AFAICS.
Thanks,
Rafael
^ permalink raw reply [flat|nested] 580+ messages in thread
* Re: [PATCH 0/4] PM: Drop shrink_all_memory (rev. 2) (was: Re: [PATCH 3/3] PM/Hibernate: Use memory allocations to free memory)
2009-05-04 19:53 ` Rafael J. Wysocki
@ 2009-05-04 20:27 ` Pavel Machek
2009-05-04 20:27 ` Pavel Machek
1 sibling, 0 replies; 580+ messages in thread
From: Pavel Machek @ 2009-05-04 20:27 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: kernel-testers, linux-kernel, alan-jenkins, jens.axboe,
Andrew Morton, Wu Fengguang, torvalds, linux-pm
On Mon 2009-05-04 21:53:36, Rafael J. Wysocki wrote:
> On Monday 04 May 2009, Pavel Machek wrote:
> > Hi!
> >
> > > I know that swsusp_shrink_memory() has problems, that's why I'd like to get rid
> > > of it.
> > >
> > > > I wonder if it's possible to free up the memory within 1s at all.
> > >
> > > I'm not sure.
> > >
> > > Apparently, the counting of saveable pages takes substantial time (0.5 s each
> > > iteration on my 64-bit test box), so we can improve that by limiting the number
> > > of iterations.
> >
> > We could increase step size after each step. Free in 40MB step, then
> > 80MB step, then 160MB step, ...
>
> Why not just one step? It doesn't seem to hurt performance AFAICS.
One step is obviously fine, too.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html