linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Memory corruption during hibernation since 2.6.31
@ 2010-07-28 21:20 Ondrej Zary
  2010-07-28 21:34 ` Rafael J. Wysocki
  0 siblings, 1 reply; 45+ messages in thread
From: Ondrej Zary @ 2010-07-28 21:20 UTC (permalink / raw)
  To: Kernel development list

Hello,
after very long bisection, I finally found what's causing memory corruption 
during hibernation on my machine sice 2.6.31:
https://bugzilla.kernel.org/show_bug.cgi?id=15753

It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused swap 
entry if necessary).

I don't know anything about swapping in Linux so I don't have a clue what's
wrong with that commit.

-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-28 21:20 Memory corruption during hibernation since 2.6.31 Ondrej Zary
@ 2010-07-28 21:34 ` Rafael J. Wysocki
  2010-07-28 21:38   ` Ondrej Zary
  2010-07-29  4:23   ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 45+ messages in thread
From: Rafael J. Wysocki @ 2010-07-28 21:34 UTC (permalink / raw)
  To: Ondrej Zary
  Cc: Kernel development list, KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh

On Wednesday, July 28, 2010, Ondrej Zary wrote:
> Hello,
> after very long bisection, I finally found what's causing memory corruption 
> during hibernation on my machine sice 2.6.31:
> https://bugzilla.kernel.org/show_bug.cgi?id=15753
> 
> It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused swap 
> entry if necessary).
> 
> I don't know anything about swapping in Linux so I don't have a clue what's
> wrong with that commit.

Thanks for bisecting!

This looks rather serious.  I'd be grateful from any clues from the mm guys
involved (CCed).

Do you use s2disk or the built-in hibernation code?

Rafael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-28 21:34 ` Rafael J. Wysocki
@ 2010-07-28 21:38   ` Ondrej Zary
  2010-07-29  1:06     ` KAMEZAWA Hiroyuki
  2010-07-29  4:23   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 45+ messages in thread
From: Ondrej Zary @ 2010-07-28 21:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kernel development list, KAMEZAWA Hiroyuki, Andrew Morton, Balbir Singh

On Wednesday 28 July 2010 23:34:07 Rafael J. Wysocki wrote:
> On Wednesday, July 28, 2010, Ondrej Zary wrote:
> > Hello,
> > after very long bisection, I finally found what's causing memory
> > corruption during hibernation on my machine sice 2.6.31:
> > https://bugzilla.kernel.org/show_bug.cgi?id=15753
> >
> > It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused
> > swap entry if necessary).
> >
> > I don't know anything about swapping in Linux so I don't have a clue
> > what's wrong with that commit.
>
> Thanks for bisecting!
>
> This looks rather serious.  I'd be grateful from any clues from the mm guys
> involved (CCed).
>
> Do you use s2disk or the built-in hibernation code?

I use built-in code (echo disk >/sys/power/state). The machine has 256MB RAM 
and 256MB swap partition.

-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-28 21:38   ` Ondrej Zary
@ 2010-07-29  1:06     ` KAMEZAWA Hiroyuki
  2010-07-29  2:51       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-29  1:06 UTC (permalink / raw)
  To: Ondrej Zary
  Cc: Rafael J. Wysocki, Kernel development list, Andrew Morton, Balbir Singh

On Wed, 28 Jul 2010 23:38:09 +0200
Ondrej Zary <linux@rainbow-software.org> wrote:

> On Wednesday 28 July 2010 23:34:07 Rafael J. Wysocki wrote:
> > On Wednesday, July 28, 2010, Ondrej Zary wrote:
> > > Hello,
> > > after very long bisection, I finally found what's causing memory
> > > corruption during hibernation on my machine sice 2.6.31:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=15753
> > >
> > > It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused
> > > swap entry if necessary).
> > >
> > > I don't know anything about swapping in Linux so I don't have a clue
> > > what's wrong with that commit.
> >
> > Thanks for bisecting!
> >
> > This looks rather serious.  I'd be grateful from any clues from the mm guys
> > involved (CCed).
> >
> > Do you use s2disk or the built-in hibernation code?
> 
> I use built-in code (echo disk >/sys/power/state). The machine has 256MB RAM 
> and 256MB swap partition.
> 

I don't know much about hibernation but it seems my code break something, sorry.

The commit does
	if swap_map[] shows that there is only SwapCache, no real swap users,
	try to reuse it by detaching a page from SwapCache.
 
In usual cases,

	lock_page(page):
	add_to_swap(page); // assign swap offset and mark as SwapCache
	try_to_unmap();    // swap's usage count +1 (swap_duplicate())
	unlock_page(page);

Then, SwapCache will not be reused in usual cases.

What code should I look into ? kernel/power/swap.c ??

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  1:06     ` KAMEZAWA Hiroyuki
@ 2010-07-29  2:51       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-29  2:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ondrej Zary, Rafael J. Wysocki, Kernel development list,
	Andrew Morton, Balbir Singh

On Thu, 29 Jul 2010 10:06:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 28 Jul 2010 23:38:09 +0200
> Ondrej Zary <linux@rainbow-software.org> wrote:
> 
> > On Wednesday 28 July 2010 23:34:07 Rafael J. Wysocki wrote:
> > > On Wednesday, July 28, 2010, Ondrej Zary wrote:
> > > > Hello,
> > > > after very long bisection, I finally found what's causing memory
> > > > corruption during hibernation on my machine sice 2.6.31:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=15753
> > > >
> > > > It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused
> > > > swap entry if necessary).
> > > >
> > > > I don't know anything about swapping in Linux so I don't have a clue
> > > > what's wrong with that commit.
> > >
> > > Thanks for bisecting!
> > >
> > > This looks rather serious.  I'd be grateful from any clues from the mm guys
> > > involved (CCed).
> > >
> > > Do you use s2disk or the built-in hibernation code?
> > 
> > I use built-in code (echo disk >/sys/power/state). The machine has 256MB RAM 
> > and 256MB swap partition.
> > 

BTW, what happens at resume ? How to page-in and remap ?

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-28 21:34 ` Rafael J. Wysocki
  2010-07-28 21:38   ` Ondrej Zary
@ 2010-07-29  4:23   ` KAMEZAWA Hiroyuki
  2010-07-29  5:23     ` KOSAKI Motohiro
  2010-08-03 10:50     ` Andrea Gelmini
  1 sibling, 2 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-29  4:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ondrej Zary, Kernel development list, Andrew Morton, Balbir Singh

On Wed, 28 Jul 2010 23:34:07 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> On Wednesday, July 28, 2010, Ondrej Zary wrote:
> > Hello,
> > after very long bisection, I finally found what's causing memory corruption 
> > during hibernation on my machine sice 2.6.31:
> > https://bugzilla.kernel.org/show_bug.cgi?id=15753
> > 
> > It's commit c9e444103b5e7a5a3519f9913f59767f92e33baf (mm: reuse unused swap 
> > entry if necessary).
> > 
> > I don't know anything about swapping in Linux so I don't have a clue what's
> > wrong with that commit.
> 
> Thanks for bisecting!
> 
> This looks rather serious.  I'd be grateful from any clues from the mm guys
> involved (CCed).
> 

Considering possible cases...and here is a patch.
but I'm not fully sure. Could you clarify ?

But hmm...status of swap_map[] to be recovered at resume() seems to be just
based on luck. or hibernation has some tricks on swap_map[] ?

==
At hibernation, all pages-should-be-saved are written into a image (here, swap).
Then, swap_map[], memmap etcs are also saved into disks.

But, swap allocation happens one by one. So, the final image of swap_map[] is
different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
changes page's state while assiging swap. Because memory can be modified in
hibernation is only not-to-be-save memory. it's a breakage.

This patch fixes it by disabling swap entry reuse at hibernation.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/swapfile.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.34.org/mm/swapfile.c
===================================================================
--- linux-2.6.34.org.orig/mm/swapfile.c
+++ linux-2.6.34.org/mm/swapfile.c
@@ -316,7 +316,9 @@ checks:
 		scan_base = offset = si->lowest_bit;
 
 	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+	if (vm_swap_full()
+		&& usage == SWAP_HAS_CACHE
+		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
 		spin_unlock(&swap_lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);








^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  4:23   ` KAMEZAWA Hiroyuki
@ 2010-07-29  5:23     ` KOSAKI Motohiro
  2010-07-29  5:24       ` KAMEZAWA Hiroyuki
  2010-08-03 10:50     ` Andrea Gelmini
  1 sibling, 1 reply; 45+ messages in thread
From: KOSAKI Motohiro @ 2010-07-29  5:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh

> Index: linux-2.6.34.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.34.org.orig/mm/swapfile.c
> +++ linux-2.6.34.org/mm/swapfile.c
> @@ -316,7 +316,9 @@ checks:
>  		scan_base = offset = si->lowest_bit;
>  
>  	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	if (vm_swap_full()
> +		&& usage == SWAP_HAS_CACHE
> +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&swap_lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);

Can you please add explicit commenting in the code?




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  5:23     ` KOSAKI Motohiro
@ 2010-07-29  5:24       ` KAMEZAWA Hiroyuki
  2010-07-29  5:30         ` KOSAKI Motohiro
                           ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-29  5:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rafael J. Wysocki, Ondrej Zary, Kernel development list,
	Andrew Morton, Balbir Singh

On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Can you please add explicit commenting in the code?
> 
How about this ?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At hibernation, all pages-should-be-saved are written into a image (here, swap).
Then, swap_map[], memmap etcs are also saved into disks.

But, swap allocation happens one by one. So, the final image of swap_map[] is
different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
changes page's state while assiging swap. Because memory can be modified in
hibernation is only not-to-be-save memory. it's a breakage.

This patch fixes it by disabling swap entry reuse at hibernation.



Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/swapfile.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.34.org/mm/swapfile.c
===================================================================
--- linux-2.6.34.org.orig/mm/swapfile.c
+++ linux-2.6.34.org/mm/swapfile.c
@@ -315,8 +315,15 @@ checks:
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+	/*
+ 	 * reuse swap entry of cache-only swap if not busy &&
+ 	 * when we're called via pageout(). At hibernation, swap-reuse
+ 	 * is harmful because it changes memory status...which may
+ 	 * be saved already.
+ 	 */
+	if (vm_swap_full()
+		&& usage == SWAP_HAS_CACHE
+		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
 		spin_unlock(&swap_lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  5:24       ` KAMEZAWA Hiroyuki
@ 2010-07-29  5:30         ` KOSAKI Motohiro
  2010-07-29 17:33         ` Ondrej Zary
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2010-07-29  5:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh

> On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Can you please add explicit commenting in the code?
> > 
> How about this ?

Looks good to me. thanks.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  5:24       ` KAMEZAWA Hiroyuki
  2010-07-29  5:30         ` KOSAKI Motohiro
@ 2010-07-29 17:33         ` Ondrej Zary
  2010-07-29 18:44         ` Hugh Dickins
  2010-08-05 12:44         ` Ondrej Zary
  3 siblings, 0 replies; 45+ messages in thread
From: Ondrej Zary @ 2010-07-29 17:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Kernel development list,
	Andrew Morton, Balbir Singh

On Thursday 29 July 2010 07:24:29 KAMEZAWA Hiroyuki wrote:
> On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
>
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Can you please add explicit commenting in the code?
>
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> At hibernation, all pages-should-be-saved are written into a image (here,
> swap). Then, swap_map[], memmap etcs are also saved into disks.
>
> But, swap allocation happens one by one. So, the final image of swap_map[]
> is different from saved one and the commit
> c9e444103b5e7a5a3519f9913f59767f92e33baf changes page's state while
> assiging swap. Because memory can be modified in hibernation is only
> not-to-be-save memory. it's a breakage.
>
> This patch fixes it by disabling swap entry reuse at hibernation.


Thanks for the patch, I'm going to test it for a few days. It didn't compile 
with 2.6.32 (because of missing "usage" parameter) so I tried 2.6.35-rc6 only 
to find that there's regression since 2.6.34 that causes console font 
corruption with matroxfb. Almost every new kernel means new regressions on 
this machine. Looks like something is wrong with the development process...

So at least the patch compiles with 2.6.33 where the console works - so that's 
going to be tested.


> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/swapfile.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.34.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.34.org.orig/mm/swapfile.c
> +++ linux-2.6.34.org/mm/swapfile.c
> @@ -315,8 +315,15 @@ checks:
>  	if (offset > si->highest_bit)
>  		scan_base = offset = si->lowest_bit;
>
> -	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	/*
> + 	 * reuse swap entry of cache-only swap if not busy &&
> + 	 * when we're called via pageout(). At hibernation, swap-reuse
> + 	 * is harmful because it changes memory status...which may
> + 	 * be saved already.
> + 	 */
> +	if (vm_swap_full()
> +		&& usage == SWAP_HAS_CACHE
> +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&swap_lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  5:24       ` KAMEZAWA Hiroyuki
  2010-07-29  5:30         ` KOSAKI Motohiro
  2010-07-29 17:33         ` Ondrej Zary
@ 2010-07-29 18:44         ` Hugh Dickins
  2010-07-29 18:55           ` Andrea Arcangeli
                             ` (3 more replies)
  2010-08-05 12:44         ` Ondrej Zary
  3 siblings, 4 replies; 45+ messages in thread
From: Hugh Dickins @ 2010-07-29 18:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, 29 Jul 2010, KAMEZAWA Hiroyuki wrote:
> On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Can you please add explicit commenting in the code?
> > 
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> At hibernation, all pages-should-be-saved are written into a image (here, swap).
> Then, swap_map[], memmap etcs are also saved into disks.
> 
> But, swap allocation happens one by one. So, the final image of swap_map[] is
> different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
> changes page's state while assiging swap. Because memory can be modified in
> hibernation is only not-to-be-save memory. it's a breakage.
> 
> This patch fixes it by disabling swap entry reuse at hibernation.
> 
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/swapfile.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.34.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.34.org.orig/mm/swapfile.c
> +++ linux-2.6.34.org/mm/swapfile.c
> @@ -315,8 +315,15 @@ checks:
>  	if (offset > si->highest_bit)
>  		scan_base = offset = si->lowest_bit;
>  
> -	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	/*
> + 	 * reuse swap entry of cache-only swap if not busy &&
> + 	 * when we're called via pageout(). At hibernation, swap-reuse
> + 	 * is harmful because it changes memory status...which may
> + 	 * be saved already.
> + 	 */
> +	if (vm_swap_full()
> +		&& usage == SWAP_HAS_CACHE
> +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&swap_lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);
> 
> --

KAMEZAWA-San, that is a brilliant realization, I salute you.

So, in between snapshotting the image and actually hibernating, we have
two parallel universes, one frozen in the image, the other writing that
out to swap: with the danger that the latter (which is about to die)
will introduce fatal inconsistencies in the former by placing pages in
swap locations innocently reallocated from it.  (Excuse me while I go
write the movie script.)

I'd never got to think about that before.

Your fix is neat though hacky (it's somewhat of a coincidence that the
usage arg happens to distinguish the hibernation case), and should be
enough to fix "your" regression, but is it really enough?

I'm worrying about the try_to_free_swap() calls in shrink_page_list():
can't those get called from direct reclaim (even if GFP_NOIO), and can't
direct reclaim get invoked from somewhere in the I/O path below
swap_writepage(), used for writing out the hibernation image?

Direct reclaim because kswapd does set_freezable(), so should itself
be out of the picture.  But we cannot freeze writing the hibernation
image, and its occasional need for memory, so maybe a different approach
is required.

I've CC'ed Andrea because we were having an offline conversation about
whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
if this swap bug underlies his interest, though he was mainly worrying
about I/O in progress.

Despite reading Documentation/power/freezing-of-tasks.txt, I have no
clear idea of what really needs freezing, and whether freezing can
fully handle the issues.  Rafael, please can you advise?

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:44         ` Hugh Dickins
@ 2010-07-29 18:55           ` Andrea Arcangeli
  2010-07-29 23:40             ` Rafael J. Wysocki
  2010-08-09  7:26             ` Pavel Machek
  2010-07-29 23:29           ` Rafael J. Wysocki
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 45+ messages in thread
From: Andrea Arcangeli @ 2010-07-29 18:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rafael J. Wysocki,
	Ondrej Zary, Kernel development list, Andrew Morton,
	Balbir Singh

Hi everyone,

On Thu, Jul 29, 2010 at 11:44:31AM -0700, Hugh Dickins wrote:
> I've CC'ed Andrea because we were having an offline conversation about
> whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> if this swap bug underlies his interest, though he was mainly worrying
> about I/O in progress.

My opinion is that with current freezer model it is needed for suspend
to disk. The kthread_should_stop seems useless for kswapd/ksmd, but
I'm afraid it might get useful in the future so just to stay on the
safe side I added it to khugepaged as well, but it's contributing to
the pollution.

I've no idea why the freezing isn't preemptive (through the scheduler,
all these kernel threads are obviously lowlatency beasts) by default
(if kthread doesn't run set_freezable) and instead it requires
cooperative freezing. Now I can imagine a kernel thread not being
happy about preemptive freezing, and I also can imagine a kernel
thread not being happy about being frozen. But I think the default
shall be "preempting freezing by default" (removing all those
set_freezable/try_to_freeze and furthermore reducing the latency it
takes to freeze the task without having to add !freezing(current)
checks in the loops, by taking advantage of the existing cond_resched
or preempt=Y) and then introduce a set_freezable_cooperative() if it
wants to call try_to_freeze in a cooperative way in a well defined
point of the code, or set_not_freezable() if it doesn't want to
be frozen.

But for now I'm afraid the below is needed (only ksm.c part applies to
upstream).

------
Subject: freeze khugepaged and ksmd

From: Andrea Arcangeli <aarcange@redhat.com>

It's unclear why schedule friendly kernel threads can't be taken away by the
CPU through the scheduler itself. It's safer to stop them as they can trigger
memory allocation, if kswapd also freezes itself to avoid generating I/O they
have too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -15,6 +15,7 @@
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
 #include <linux/khugepaged.h>
+#include <linux/freezer.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -2053,6 +2054,9 @@ static void khugepaged_do_scan(struct pa
 			break;
 #endif
 
+		if (unlikely(kthread_should_stop() || freezing(current)))
+			break;
+
 		spin_lock(&khugepaged_mm_lock);
 		if (!khugepaged_scan.mm_slot)
 			pass_through_head++;
@@ -2115,6 +2119,9 @@ static void khugepaged_loop(void)
 		if (hpage)
 			put_page(hpage);
 #endif
+		try_to_freeze();
+		if (unlikely(kthread_should_stop()))
+			break;
 		if (khugepaged_has_work()) {
 			DEFINE_WAIT(wait);
 			if (!khugepaged_scan_sleep_millisecs)
@@ -2134,6 +2141,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 
+	set_freezable();
 	set_user_nice(current, 19);
 
 	/* serialize with start_khugepaged() */
@@ -2148,6 +2156,8 @@ static int khugepaged(void *none)
 		mutex_lock(&khugepaged_mutex);
 		if (!khugepaged_enabled())
 			break;
+		if (unlikely(kthread_should_stop()))
+			break;
 	}
 
 	spin_lock(&khugepaged_mm_lock);
diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -33,6 +33,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/swap.h>
 #include <linux/ksm.h>
+#include <linux/freezer.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -1386,7 +1387,7 @@ static void ksm_do_scan(unsigned int sca
 	struct rmap_item *rmap_item;
 	struct page *uninitialized_var(page);
 
-	while (scan_npages--) {
+	while (scan_npages-- && likely(!freezing(current))) {
 		cond_resched();
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
@@ -1412,6 +1413,8 @@ static int ksm_scan_thread(void *nothing
 			ksm_do_scan(ksm_thread_pages_to_scan);
 		mutex_unlock(&ksm_thread_mutex);
 
+		try_to_freeze();
+
 		if (ksmd_should_run()) {
 			schedule_timeout_interruptible(
 				msecs_to_jiffies(ksm_thread_sleep_millisecs));
@@ -1953,6 +1956,7 @@ static int __init ksm_init(void)
 	struct task_struct *ksm_thread;
 	int err;
 
+	set_freezable();
 	err = ksm_slab_init();
 	if (err)
 		goto out;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:44         ` Hugh Dickins
  2010-07-29 18:55           ` Andrea Arcangeli
@ 2010-07-29 23:29           ` Rafael J. Wysocki
  2010-07-30  3:36             ` KAMEZAWA Hiroyuki
  2010-07-30  3:54             ` Hugh Dickins
  2010-07-30  0:01           ` KAMEZAWA Hiroyuki
  2010-07-30  4:18           ` Memory corruption during hibernation since 2.6.31 Balbir Singh
  3 siblings, 2 replies; 45+ messages in thread
From: Rafael J. Wysocki @ 2010-07-29 23:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thursday, July 29, 2010, Hugh Dickins wrote:
> On Thu, 29 Jul 2010, KAMEZAWA Hiroyuki wrote:
> > On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Can you please add explicit commenting in the code?
> > > 
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > At hibernation, all pages-should-be-saved are written into a image (here, swap).
> > Then, swap_map[], memmap etcs are also saved into disks.
> > 
> > But, swap allocation happens one by one. So, the final image of swap_map[] is
> > different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
> > changes page's state while assiging swap. Because memory can be modified in
> > hibernation is only not-to-be-save memory. it's a breakage.
> > 
> > This patch fixes it by disabling swap entry reuse at hibernation.
> > 
> > 
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/swapfile.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6.34.org/mm/swapfile.c
> > ===================================================================
> > --- linux-2.6.34.org.orig/mm/swapfile.c
> > +++ linux-2.6.34.org/mm/swapfile.c
> > @@ -315,8 +315,15 @@ checks:
> >  	if (offset > si->highest_bit)
> >  		scan_base = offset = si->lowest_bit;
> >  
> > -	/* reuse swap entry of cache-only swap if not busy. */
> > -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > +	/*
> > + 	 * reuse swap entry of cache-only swap if not busy &&
> > + 	 * when we're called via pageout(). At hibernation, swap-reuse
> > + 	 * is harmful because it changes memory status...which may
> > + 	 * be saved already.
> > + 	 */
> > +	if (vm_swap_full()
> > +		&& usage == SWAP_HAS_CACHE
> > +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
> >  		int swap_was_freed;
> >  		spin_unlock(&swap_lock);
> >  		swap_was_freed = __try_to_reclaim_swap(si, offset);
> > 
> > --
> 
> KAMEZAWA-San, that is a brilliant realization, I salute you.
> 
> So, in between snapshotting the image and actually hibernating, we have
> two parallel universes, one frozen in the image, the other writing that
> out to swap: with the danger that the latter (which is about to die)
> will introduce fatal inconsistencies in the former by placing pages in
> swap locations innocently reallocated from it.  (Excuse me while I go
> write the movie script.)
> 
> I'd never got to think about that before.
> 
> Your fix is neat though hacky (it's somewhat of a coincidence that the
> usage arg happens to distinguish the hibernation case), and should be
> enough to fix "your" regression, but is it really enough?
> 
> I'm worrying about the try_to_free_swap() calls in shrink_page_list():
> can't those get called from direct reclaim (even if GFP_NOIO), and can't
> direct reclaim get invoked from somewhere in the I/O path below
> swap_writepage(), used for writing out the hibernation image?
> 
> Direct reclaim because kswapd does set_freezable(), so should itself
> be out of the picture.  But we cannot freeze writing the hibernation
> image, and its occasional need for memory, so maybe a different approach
> is required.

We preallocate memory for the I/O related to the writing of hibernation
images, so swapping things out shouldn't be necessary at that time just for
this purpose.  The I/O itself is done directly at the block layer level and it
will never be frozen, because it is done by the hibernate process itself.

> I've CC'ed Andrea because we were having an offline conversation about
> whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> if this swap bug underlies his interest, though he was mainly worrying
> about I/O in progress.
> 
> Despite reading Documentation/power/freezing-of-tasks.txt, I have no
> clear idea of what really needs freezing, and whether freezing can
> fully handle the issues.  Rafael, please can you advise?

Well, the rule of thumb (if it can be called this way) is that the contents of
the image has to be consistent with whatever is stored in permanent storage.
So, for example, filesystems that were mounted before creating the image
cannot be modified until the image is restored.  Consequently, if there are
any kernel threads that might cause that to happen, they need to be frozen.

Now, if I understand it correctly, the failure mode is that certain page had
been swapped out before the image was created and then it was swapped in
while we were writing the image out and the slot occupied by it was re-used.
In that case the image would contain wrong information on the state of the
swap and that would result in wrong data being loaded into memory on an attempt
to swap that page in after resume.

So, generally, we have to avoid doing things that would result in swapping
memory pages out after we have created a hibernation image.  If that can be
achieved by freezing certain kernel threads, that probably is the simplest
approach.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:55           ` Andrea Arcangeli
@ 2010-07-29 23:40             ` Rafael J. Wysocki
  2010-07-30  4:02               ` Hugh Dickins
  2010-08-09  7:26             ` Pavel Machek
  1 sibling, 1 reply; 45+ messages in thread
From: Rafael J. Wysocki @ 2010-07-29 23:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh

On Thursday, July 29, 2010, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Thu, Jul 29, 2010 at 11:44:31AM -0700, Hugh Dickins wrote:
> > I've CC'ed Andrea because we were having an offline conversation about
> > whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> > if this swap bug underlies his interest, though he was mainly worrying
> > about I/O in progress.
> 
> My opinion is that with current freezer model it is needed for suspend
> to disk. The kthread_should_stop seems useless for kswapd/ksmd, but
> I'm afraid it might get useful in the future so just to stay on the
> safe side I added it to khugepaged as well, but it's contributing to
> the pollution.
> 
> I've no idea why the freezing isn't preemptive (through the scheduler,
> all these kernel threads are obviously lowlatency beasts) by default
> (if kthread doesn't run set_freezable) and instead it requires
> cooperative freezing.

This is a result of several discussions on the LKML. :-)  Basically, the vast
majority of kernel threads need not be frozen and attempting to freeze them
all introduces race conditions that are extremely diffucult to resolve (eg.
with multithread workqueues).

> Now I can imagine a kernel thread not being
> happy about preemptive freezing, and I also can imagine a kernel
> thread not being happy about being frozen. But I think the default
> shall be "preempting freezing by default" (removing all those
> set_freezable/try_to_freeze and furthermore reducing the latency it
> takes to freeze the task without having to add !freezing(current)
> checks in the loops, by taking advantage of the existing cond_resched
> or preempt=Y) and then introduce a set_freezable_cooperative() if it
> wants to call try_to_freeze in a cooperative way in a well defined
> point of the code, or set_not_freezable() if it doesn't want to
> be frozen.

I'm afraid that would be difficult to achieve in general.  Besides, there's
no reason why kernel threads that need not be frozen should care about the
freezing thing at all.  It's much simpler to require the ones that need to be
frozen to cooperate.

> But for now I'm afraid the below is needed (only ksm.c part applies to
> upstream).

Looks good to me.

Can you please prepare a patch against mainline for Ondrej to try?

Thanks,
Rafael


> ------
> Subject: freeze khugepaged and ksmd
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> It's unclear why schedule friendly kernel threads can't be taken away by the
> CPU through the scheduler itself. It's safer to stop them as they can trigger
> memory allocation, if kswapd also freezes itself to avoid generating I/O they
> have too.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -15,6 +15,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/kthread.h>
>  #include <linux/khugepaged.h>
> +#include <linux/freezer.h>
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
>  #include "internal.h"
> @@ -2053,6 +2054,9 @@ static void khugepaged_do_scan(struct pa
>  			break;
>  #endif
>  
> +		if (unlikely(kthread_should_stop() || freezing(current)))
> +			break;
> +
>  		spin_lock(&khugepaged_mm_lock);
>  		if (!khugepaged_scan.mm_slot)
>  			pass_through_head++;
> @@ -2115,6 +2119,9 @@ static void khugepaged_loop(void)
>  		if (hpage)
>  			put_page(hpage);
>  #endif
> +		try_to_freeze();
> +		if (unlikely(kthread_should_stop()))
> +			break;
>  		if (khugepaged_has_work()) {
>  			DEFINE_WAIT(wait);
>  			if (!khugepaged_scan_sleep_millisecs)
> @@ -2134,6 +2141,7 @@ static int khugepaged(void *none)
>  {
>  	struct mm_slot *mm_slot;
>  
> +	set_freezable();
>  	set_user_nice(current, 19);
>  
>  	/* serialize with start_khugepaged() */
> @@ -2148,6 +2156,8 @@ static int khugepaged(void *none)
>  		mutex_lock(&khugepaged_mutex);
>  		if (!khugepaged_enabled())
>  			break;
> +		if (unlikely(kthread_should_stop()))
> +			break;
>  	}
>  
>  	spin_lock(&khugepaged_mm_lock);
> diff --git a/mm/ksm.c b/mm/ksm.c
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -33,6 +33,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/swap.h>
>  #include <linux/ksm.h>
> +#include <linux/freezer.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -1386,7 +1387,7 @@ static void ksm_do_scan(unsigned int sca
>  	struct rmap_item *rmap_item;
>  	struct page *uninitialized_var(page);
>  
> -	while (scan_npages--) {
> +	while (scan_npages-- && likely(!freezing(current))) {
>  		cond_resched();
>  		rmap_item = scan_get_next_rmap_item(&page);
>  		if (!rmap_item)
> @@ -1412,6 +1413,8 @@ static int ksm_scan_thread(void *nothing
>  			ksm_do_scan(ksm_thread_pages_to_scan);
>  		mutex_unlock(&ksm_thread_mutex);
>  
> +		try_to_freeze();
> +
>  		if (ksmd_should_run()) {
>  			schedule_timeout_interruptible(
>  				msecs_to_jiffies(ksm_thread_sleep_millisecs));
> @@ -1953,6 +1956,7 @@ static int __init ksm_init(void)
>  	struct task_struct *ksm_thread;
>  	int err;
>  
> +	set_freezable();
>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:44         ` Hugh Dickins
  2010-07-29 18:55           ` Andrea Arcangeli
  2010-07-29 23:29           ` Rafael J. Wysocki
@ 2010-07-30  0:01           ` KAMEZAWA Hiroyuki
  2010-07-30  4:10             ` Hugh Dickins
  2010-07-30  4:18           ` Memory corruption during hibernation since 2.6.31 Balbir Singh
  3 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-30  0:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, 29 Jul 2010 11:44:31 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:
 
> So, in between snapshotting the image and actually hibernating, we have
> two parallel universes, one frozen in the image, the other writing that
> out to swap: with the danger that the latter (which is about to die)
> will introduce fatal inconsistencies in the former by placing pages in
> swap locations innocently reallocated from it.  (Excuse me while I go
> write the movie script.)
> 
> I'd never got to think about that before.
> 
> Your fix is neat though hacky (it's somewhat of a coincidence that the
> usage arg happens to distinguish the hibernation case), and should be
> enough to fix "your" regression, but is it really enough?
> 
yes, it's hacky and this just hides a problem caused by the patch.

> I'm worrying about the try_to_free_swap() calls in shrink_page_list():
> can't those get called from direct reclaim (even if GFP_NOIO), and can't
> direct reclaim get invoked from somewhere in the I/O path below
> swap_writepage(), used for writing out the hibernation image?
> 
IIUC, before hibernation, shirnk_memory() is called and hibernation codes
expect there are enough memory. But you're right, there are no guarantee.

I think the best way is kexec(). But maybe rollback from hibernation failure
will be difficult. Considering how crash-dump works well and under maintainance
by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
reuse of kdump code, ...or, hibernation-resume code can eat kdump image
directly. Maybe the problem will be the speed of dump.

Otherwise, we need to take care of status changes in already-saved-memory.
Maybe anyone don't like to add COW method...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 23:29           ` Rafael J. Wysocki
@ 2010-07-30  3:36             ` KAMEZAWA Hiroyuki
  2010-07-30  3:54             ` Hugh Dickins
  1 sibling, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-30  3:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Fri, 30 Jul 2010 01:29:33 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:
 
> Well, the rule of thumb (if it can be called this way) is that the contents of
> the image has to be consistent with whatever is stored in permanent storage.
> So, for example, filesystems that were mounted before creating the image
> cannot be modified until the image is restored.  Consequently, if there are
> any kernel threads that might cause that to happen, they need to be frozen.
> 
> Now, if I understand it correctly, the failure mode is that certain page had
> been swapped out before the image was created and then it was swapped in
> while we were writing the image out and the slot occupied by it was re-used.
> In that case the image would contain wrong information on the state of the
> swap and that would result in wrong data being loaded into memory on an attempt
> to swap that page in after resume.
> 
> So, generally, we have to avoid doing things that would result in swapping
> memory pages out after we have created a hibernation image.  If that can be
> achieved by freezing certain kernel threads, that probably is the simplest
> approach.
> 

BTW, I can't understand what happens in following situation. 

1. start saving image.
2. at saving image, swap entry is used and swap_map[off] += 1.
3. At some point, swap_map[] itself is saved.
...

At restore, IIUC, hibernation don't call any swap_free().
Where does the +1 goes ?
I'm sorry if I misunderstand.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 23:29           ` Rafael J. Wysocki
  2010-07-30  3:36             ` KAMEZAWA Hiroyuki
@ 2010-07-30  3:54             ` Hugh Dickins
  1 sibling, 0 replies; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30  3:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, Jul 29, 2010 at 4:29 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday, July 29, 2010, Hugh Dickins wrote:
>>
>> Despite reading Documentation/power/freezing-of-tasks.txt, I have no
>> clear idea of what really needs freezing, and whether freezing can
>> fully handle the issues.  Rafael, please can you advise?
>
> Well, the rule of thumb (if it can be called this way) is that the contents of
> the image has to be consistent with whatever is stored in permanent storage.
> So, for example, filesystems that were mounted before creating the image
> cannot be modified until the image is restored.  Consequently, if there are
> any kernel threads that might cause that to happen, they need to be frozen.

Right, the filesystem part of it is easy to understand and to handle, I think.
But now we're worrying about potential for I/O to be interrupted by suspend
to RAM (or is that well handled by driver suspend methods?), and swap
getting misallocated during hibernation: what measures do we have
to prevent those?

In particular, is there or should there be some state global or test function
endangered code could use to for protection, without having to freeze?
For many threads, freezing would be easiest, but not possible for all.

>
> Now, if I understand it correctly, the failure mode is that certain page had
> been swapped out before the image was created and then it was swapped in
> while we were writing the image out and the slot occupied by it was re-used.

Not quite.  At some point in the past that certain page had been swapped out
and later swapped back in: it's correctly there in the image as swapped in,
but there's some code coming into play when allocating swap to write the
image, that might free its swap and reuse it for an unrelated page of image,
leaving a danger after resume that the original owner page might get freed
then wrong data swapped back in for it later.

> In that case the image would contain wrong information on the state of the
> swap and that would result in wrong data being loaded into memory on an attempt
> to swap that page in after resume.

Yes.

>
> So, generally, we have to avoid doing things that would result in swapping
> memory pages out after we have created a hibernation image.  If that can be
> achieved by freezing certain kernel threads, that probably is the simplest
> approach.

The vulnerable page isn't swapped out or in during or after creating the image,
yet it can still be vulnerable.  KAMEZAWA-san's patch should fix the recent
regression here, but I believe there remains a vulnerability, from swap cleanup
code in vmscan.c which page reclaim might pass through.  If there's some
"heading for hibernation" state we can test there, we can avoid it in those
cases.

I realize that snapshot.c does a lot of preparatory memory freeing e.g.
shrink_all_memory(), and that should make the chance of mis-reuse of
swap very tiny; but nonetheless your swap.c is doing memory allocations,
with the __GFP_WAIT flag, so could conceivably enter page reclaim.

We cannot freeze the hibernation, but we ought to make it swap-safe.

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 23:40             ` Rafael J. Wysocki
@ 2010-07-30  4:02               ` Hugh Dickins
  0 siblings, 0 replies; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30  4:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Ondrej Zary, Kernel development list, Andrew Morton,
	Balbir Singh

On Thu, Jul 29, 2010 at 4:40 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Thursday, July 29, 2010, Andrea Arcangeli wrote:
>
> I'm afraid that would be difficult to achieve in general.  Besides, there's
> no reason why kernel threads that need not be frozen should care about the
> freezing thing at all.  It's much simpler to require the ones that need to be
> frozen to cooperate.

But it's looking as if any thread which might have to allocate some memory
(and ksmd is on balance a freer of memory, but nonetheless has to do
slab allocations to get there) would need to be frozen.

Except that does not go far enough, because freezing the hibernating
thread (which itself makes memory allocations) won't work out well!

>
>> But for now I'm afraid the below is needed (only ksm.c part applies to
>> upstream).
>
> Looks good to me.
>
> Can you please prepare a patch against mainline for Ondrej to try?

Andrea wasn't proposing his patch for Ondrej's corruption, and I don't
suppose Ondrej even has ksmd running.  Any confusion there is my
fault, for linking Andrea's concerns about suspend and hibernation
with this particular thread on Ondrej's bug.

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  0:01           ` KAMEZAWA Hiroyuki
@ 2010-07-30  4:10             ` Hugh Dickins
  2010-07-30  4:14               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30  4:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:

> I think the best way is kexec(). But maybe rollback from hibernation failure
> will be difficult. Considering how crash-dump works well and under maintainance
> by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> directly. Maybe the problem will be the speed of dump.

I've no appetite for a total rework of hibernation, and I don't see
how that would
address the issue: I'm just looking for some protection against swap
reuse danger.

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:10             ` Hugh Dickins
@ 2010-07-30  4:14               ` KAMEZAWA Hiroyuki
  2010-07-30  4:46                 ` Hugh Dickins
  2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-30  4:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, 29 Jul 2010 21:10:10 -0700
Hugh Dickins <hughd@google.com> wrote:

> On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > I think the best way is kexec(). But maybe rollback from hibernation failure
> > will be difficult. Considering how crash-dump works well and under maintainance
> > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > directly. Maybe the problem will be the speed of dump.
> 
> I've no appetite for a total rework of hibernation, and I don't see
> how that would
> address the issue: I'm just looking for some protection against swap
> reuse danger.
> 
Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
It will be harmful for small device guys.)

I'll prepare a routine not-quick-fix.
BTW, after reading mail between you and Andrea, kernel threads are not guaranteed
to be stopped when memory-dump is running ?

-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:44         ` Hugh Dickins
                             ` (2 preceding siblings ...)
  2010-07-30  0:01           ` KAMEZAWA Hiroyuki
@ 2010-07-30  4:18           ` Balbir Singh
  2010-07-30  4:32             ` Hugh Dickins
  3 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2010-07-30  4:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rafael J. Wysocki,
	Ondrej Zary, Kernel development list, Andrew Morton,
	Andrea Arcangeli

* Hugh Dickins <hughd@google.com> [2010-07-29 11:44:31]:

> On Thu, 29 Jul 2010, KAMEZAWA Hiroyuki wrote:
> > On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Can you please add explicit commenting in the code?
> > > 
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > At hibernation, all pages-should-be-saved are written into a image (here, swap).
> > Then, swap_map[], memmap etcs are also saved into disks.
> > 
> > But, swap allocation happens one by one. So, the final image of swap_map[] is
> > different from saved one and the commit c9e444103b5e7a5a3519f9913f59767f92e33baf
> > changes page's state while assiging swap. Because memory can be modified in
> > hibernation is only not-to-be-save memory. it's a breakage.
> > 
> > This patch fixes it by disabling swap entry reuse at hibernation.
> > 
> > 
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/swapfile.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6.34.org/mm/swapfile.c
> > ===================================================================
> > --- linux-2.6.34.org.orig/mm/swapfile.c
> > +++ linux-2.6.34.org/mm/swapfile.c
> > @@ -315,8 +315,15 @@ checks:
> >  	if (offset > si->highest_bit)
> >  		scan_base = offset = si->lowest_bit;
> >  
> > -	/* reuse swap entry of cache-only swap if not busy. */
> > -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > +	/*
> > + 	 * reuse swap entry of cache-only swap if not busy &&
> > + 	 * when we're called via pageout(). At hibernation, swap-reuse
> > + 	 * is harmful because it changes memory status...which may
> > + 	 * be saved already.
> > + 	 */
> > +	if (vm_swap_full()
> > +		&& usage == SWAP_HAS_CACHE
> > +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
> >  		int swap_was_freed;
> >  		spin_unlock(&swap_lock);
> >  		swap_was_freed = __try_to_reclaim_swap(si, offset);
> > 
> > --
> 
> KAMEZAWA-San, that is a brilliant realization, I salute you.
> 
> So, in between snapshotting the image and actually hibernating, we have
> two parallel universes, one frozen in the image, the other writing that
> out to swap: with the danger that the latter (which is about to die)
> will introduce fatal inconsistencies in the former by placing pages in
> swap locations innocently reallocated from it.  (Excuse me while I go
> write the movie script.)
> 
> I'd never got to think about that before.
> 
> Your fix is neat though hacky (it's somewhat of a coincidence that the
> usage arg happens to distinguish the hibernation case), and should be
> enough to fix "your" regression, but is it really enough?
> 
> I'm worrying about the try_to_free_swap() calls in shrink_page_list():
> can't those get called from direct reclaim (even if GFP_NOIO), and can't
> direct reclaim get invoked from somewhere in the I/O path below
> swap_writepage(), used for writing out the hibernation image?
> 
> Direct reclaim because kswapd does set_freezable(), so should itself
> be out of the picture.  But we cannot freeze writing the hibernation
> image, and its occasional need for memory, so maybe a different approach
> is required.
> 
> I've CC'ed Andrea because we were having an offline conversation about
> whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> if this swap bug underlies his interest, though he was mainly worrying
> about I/O in progress.
> 
> Despite reading Documentation/power/freezing-of-tasks.txt, I have no
> clear idea of what really needs freezing, and whether freezing can
> fully handle the issues.  Rafael, please can you advise?
>

Couldn't we reuse PF_* flags to differentiate between the paths, if
that is what it eventually boils down to? On an unrelated note, I was
looking at shrink_all_memory() and wondering if swappiness really
mattered there. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:18           ` Memory corruption during hibernation since 2.6.31 Balbir Singh
@ 2010-07-30  4:32             ` Hugh Dickins
  2010-07-30  6:37               ` Balbir Singh
  0 siblings, 1 reply; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30  4:32 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rafael J. Wysocki,
	Ondrej Zary, Kernel development list, Andrew Morton,
	Andrea Arcangeli

On Thu, Jul 29, 2010 at 9:18 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> Couldn't we reuse PF_* flags to differentiate between the paths, if
> that is what it eventually boils down to? On an unrelated note, I was
> looking at shrink_all_memory() and wondering if swappiness really
> mattered there.

So far as the swap-reuse issue goes, I don't see that a PF_ flag
would help: the threads that already worry about such issues do
the set_freezable()/try_to_freeze() thing, and won't get into
trouble anyway; we don't want to force every other thread to
have to do something special now, better just check global state
in the very few places its needed.

On the unrelated note: better in an unrelated thread!

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:14               ` KAMEZAWA Hiroyuki
@ 2010-07-30  4:46                 ` Hugh Dickins
  2010-07-30 10:43                   ` KAMEZAWA Hiroyuki
  2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30  4:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, Jul 29, 2010 at 9:14 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I'll prepare a routine not-quick-fix.

Thank you!

> BTW, after reading mail between you and Andrea, kernel threads are not guaranteed
> to be stopped when memory-dump is running ?

Depends on what you mean by memory-dump, I think.

The hibernation sequence is quite confusing to follow, but Rafael can
correct me if I'm wrong.  create_image() in kernel/power/snapshot.c
does disable_nonboot_cpus() and local_irq_disable() and
sysdev_suspend() and save_processor_state() and dives into
swsusp_arch_suspend() to create the image in memory with everything
else stopped; and it's from there that the system will later resume.
But it also emerges from there in the parallel universe that goes on
to allocate swap and copy the memory image to disk - at that time,
you'll not be surprised to learn that interrupts are enabled, and I
suppose other threads may be running, in a rather futile way.

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:32             ` Hugh Dickins
@ 2010-07-30  6:37               ` Balbir Singh
  0 siblings, 0 replies; 45+ messages in thread
From: Balbir Singh @ 2010-07-30  6:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rafael J. Wysocki,
	Ondrej Zary, Kernel development list, Andrew Morton,
	Andrea Arcangeli

* Hugh Dickins <hughd@google.com> [2010-07-29 21:32:35]:

> On Thu, Jul 29, 2010 at 9:18 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > Couldn't we reuse PF_* flags to differentiate between the paths, if
> > that is what it eventually boils down to? On an unrelated note, I was
> > looking at shrink_all_memory() and wondering if swappiness really
> > mattered there.
> 
> So far as the swap-reuse issue goes, I don't see that a PF_ flag
> would help: the threads that already worry about such issues do
> the set_freezable()/try_to_freeze() thing, and won't get into
> trouble anyway; we don't want to force every other thread to
> have to do something special now, better just check global state
> in the very few places its needed.
>

We already do that with PF_MEMALLOC in several places. If the goal is
to avoid resuing the swap entry at all times after hibernate, the
solution needs to be a global state solution like you suggest.
 
> On the unrelated note: better in an unrelated thread!
>

Sure :)
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:46                 ` Hugh Dickins
@ 2010-07-30 10:43                   ` KAMEZAWA Hiroyuki
  2010-07-30 18:16                     ` Hugh Dickins
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-30 10:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Thu, 29 Jul 2010 21:46:10 -0700
Hugh Dickins <hughd@google.com> wrote:

> On Thu, Jul 29, 2010 at 9:14 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I'll prepare a routine not-quick-fix.
> 
> Thank you!
> 

plz wait until the next week..I found I can't post today ;(

Maybe trying my quick hack patch (posted) will be enough small for
backporting etc.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-30 10:43                   ` KAMEZAWA Hiroyuki
@ 2010-07-30 18:16                     ` Hugh Dickins
  0 siblings, 0 replies; 45+ messages in thread
From: Hugh Dickins @ 2010-07-30 18:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Fri, Jul 30, 2010 at 3:43 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> plz wait until the next week..I found I can't post today ;(
>
> Maybe trying my quick hack patch (posted) will be enough small for
> backporting etc.

Absolutely.  I didn't intend my broader anxieties to distract from
your immediate scan_swap_map() fix.  If Ondrej finds that it does in
fact fix the corruption he's been seeing since 2.6.31, then I'd like
your fix to go straight to Linus for 2.6.35: once swap is 50% full,
your case is rather too likely to occur; whereas I'm worrying about
correctness in unlikely hypothetical circumstances.  But please do
credit Ondrej, and make clear that you're fixing an observed
corrution; and feel free to add my
Acked-by: Hugh Dickins <hughd@google.com>

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-07-30  4:14               ` KAMEZAWA Hiroyuki
  2010-07-30  4:46                 ` Hugh Dickins
@ 2010-08-02  6:02                 ` KAMEZAWA Hiroyuki
  2010-08-02 14:27                   ` Rafael J. Wysocki
                                     ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-02  6:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Fri, 30 Jul 2010 13:14:32 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 29 Jul 2010 21:10:10 -0700
> Hugh Dickins <hughd@google.com> wrote:
> 
> > On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > I think the best way is kexec(). But maybe rollback from hibernation failure
> > > will be difficult. Considering how crash-dump works well and under maintainance
> > > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > > directly. Maybe the problem will be the speed of dump.
> > 
> > I've no appetite for a total rework of hibernation, and I don't see
> > how that would
> > address the issue: I'm just looking for some protection against swap
> > reuse danger.
> > 
> Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
> It will be harmful for small device guys.)
> 
> I'll prepare a routine not-quick-fix.

Ok, here. Passed easy tests as
# echo disk > /sys/power/state


Looks like a big hammer ? But I think following kind of patch is required.
About swap-reuse, it's only possible when a page is added to swap cache
but try_to_unmap() fails and the page remains in memory. IIUC, most of this
kind of pages will be backed to swap by shrink_all_memory(). So, reuse-swap
happens only when the user unlucky. This patch ignores reuse-swap but freeze
swap_map[] for saving swap_map[] to the disk in consistent way.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

swap_map[] is a one of objects which can be update while hibernation. I.E,
usage counter in swap_map[] is updated while hibernation is making a copy
of memory image to the disk.

At resume, hibenation code doesn't call swap_free() against the swaps they
used...So, the swap_map[] will turns to be an initial state.

With small consideration, the question is how swap_map[] updated before
dumping to the disk is treated. In swap-system view, there are no guarantee
that swap_map[] are properly reloaded and there is no leak in swap count.

This patch tries to freeze swap_map[] during hibernation.
By this, no updates will be happen to swap_map[] among save_image().
At load_image(), the swap_map[] has no record for swap entries used by
save_image(), we can simply forget it.

Note: I'm not a specialist of hibernation...so, I'm not sure the hooks
      to kernel/power/user.c is appropriate or not.
      And this disables swap-out once hibernation starts saving.
      Should we afraid of OOM ?

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    7 +++
 kernel/power/swap.c  |   15 ++++++--
 kernel/power/user.c  |    1 
 mm/swapfile.c        |   93 ++++++++++++++++++++++++++++++++++++++-------------
 4 files changed, 90 insertions(+), 26 deletions(-)

Index: mmotm-0727/include/linux/swap.h
===================================================================
--- mmotm-0727.orig/include/linux/swap.h
+++ mmotm-0727/include/linux/swap.h
@@ -316,7 +316,6 @@ extern long nr_swap_pages;
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
-extern swp_entry_t get_swap_page_of_type(int);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -333,6 +332,12 @@ extern int reuse_swap_page(struct page *
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+#ifdef CONFIG_HIBERNATION
+void hibernation_thaw_swap(void);
+swp_entry_t get_swap_for_hibernation(int type);
+void swap_free_for_hibernation(swp_entry_t val);
+#endif
+
 /* linux/mm/thrash.c */
 extern struct mm_struct *swap_token_mm;
 extern void grab_swap_token(struct mm_struct *);
Index: mmotm-0727/mm/swapfile.c
===================================================================
--- mmotm-0727.orig/mm/swapfile.c
+++ mmotm-0727/mm/swapfile.c
@@ -47,6 +47,8 @@ long nr_swap_pages;
 long total_swap_pages;
 static int least_priority;
 
+static bool swap_for_hibernation;
+
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
 static const char Bad_offset[] = "Bad swap offset entry ";
@@ -449,6 +451,8 @@ swp_entry_t get_swap_page(void)
 	spin_lock(&swap_lock);
 	if (nr_swap_pages <= 0)
 		goto noswap;
+	if (swap_for_hibernation)
+		goto noswap;
 	nr_swap_pages--;
 
 	for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
@@ -481,28 +485,6 @@ noswap:
 	return (swp_entry_t) {0};
 }
 
-/* The only caller of this function is now susupend routine */
-swp_entry_t get_swap_page_of_type(int type)
-{
-	struct swap_info_struct *si;
-	pgoff_t offset;
-
-	spin_lock(&swap_lock);
-	si = swap_info[type];
-	if (si && (si->flags & SWP_WRITEOK)) {
-		nr_swap_pages--;
-		/* This is called for allocating swap entry, not cache */
-		offset = scan_swap_map(si, 1);
-		if (offset) {
-			spin_unlock(&swap_lock);
-			return swp_entry(type, offset);
-		}
-		nr_swap_pages++;
-	}
-	spin_unlock(&swap_lock);
-	return (swp_entry_t) {0};
-}
-
 static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
@@ -762,6 +744,73 @@ int mem_cgroup_count_swap_user(swp_entry
 #endif
 
 #ifdef CONFIG_HIBERNATION
+
+static pgoff_t hibernation_offset[MAX_SWAPFILES];
+static void hibernation_freeze_swap(void)
+{
+	int i;
+
+	printk(KERN_INFO "PM: Freeze Swap\n");
+	swap_for_hibernation = true;
+	for (i = 0; i < MAX_SWAPFILES; i++)
+		hibernation_offset[i] = 1;
+}
+
+void hibernation_thaw_swap(void)
+{
+	spin_lock(&swap_lock);
+	if (swap_for_hibernation) {
+		printk(KERN_INFO "PM: Thaw Swap\n");
+		swap_for_hibernation = false;
+	}
+	spin_unlock(&swap_lock);
+}
+
+/*
+ * Because updateing swap_map[] can make not-saved-status-change,
+ * we use our own easy allocator.
+ * Please see kernel/power/swap.c, Used swaps are recorded into
+ * RB-tree.
+ */
+swp_entry_t get_swap_for_hibernation(int type)
+{
+	pgoff_t off;
+	swp_entry_t val = {0};
+	struct swap_info_struct *si;
+
+	spin_lock(&swap_lock);
+	/*
+	 * Once hibernation starts to use swap, we freeze swap_map[]. Otherwise,
+	 * saved swap_map[] image to the disk will be an incomplete because it's
+	 * changing without synchronization with hibernation snap shot.
+	 * At resume, we just make swap_for_hibernation=false. We can forget
+	 * used maps easily.
+	 */
+	if (!swap_for_hibernation)
+		hibernation_freeze_swap();
+
+	si = swap_info[type];
+	if (!si || !(si->flags & SWP_WRITEOK))
+		goto done;
+
+	for (off = hibernation_offset[type]; off < si->max; ++off) {
+		if (!si->swap_map[off])
+			break;
+	}
+	if (off < si->max) {
+		val = swp_entry(type, off);
+		hibernation_offset[type] = off + 1;
+	}
+done:
+	spin_unlock(&swap_lock);
+	return val;
+}
+
+void swap_free_for_hibernation(swp_entry_t ent)
+{
+	/* Nothing to do */
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
Index: mmotm-0727/kernel/power/swap.c
===================================================================
--- mmotm-0727.orig/kernel/power/swap.c
+++ mmotm-0727/kernel/power/swap.c
@@ -135,10 +135,10 @@ sector_t alloc_swapdev_block(int swap)
 {
 	unsigned long offset;
 
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(get_swap_for_hibernation(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_for_hibernation(swp_entry(swap, offset));
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -162,10 +162,11 @@ void free_all_swap_pages(int swap)
 		ext = container_of(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 		for (offset = ext->start; offset <= ext->end; offset++)
-			swap_free(swp_entry(swap, offset));
+			swap_free_for_hibernation(swp_entry(swap, offset));
 
 		kfree(ext);
 	}
+	hibernation_thaw_swap();
 }
 
 int swsusp_swap_in_use(void)
@@ -440,10 +441,12 @@ int swsusp_write(unsigned int flags)
 	error = get_swap_writer(&handle);
 	if (error) {
 		printk(KERN_ERR "PM: Cannot get swap writer\n");
+		hibernation_thaw_swap();
 		return error;
 	}
 	if (!enough_swap(pages)) {
 		printk(KERN_ERR "PM: Not enough free swap\n");
+		hibernation_thaw_swap();
 		error = -ENOSPC;
 		goto out_finish;
 	}
@@ -461,6 +464,8 @@ int swsusp_write(unsigned int flags)
 		error = save_image(&handle, &snapshot, pages - 1);
 out_finish:
 	error = swap_writer_finish(&handle, flags, error);
+	if (error)
+		hibernation_thaw_swap();
 	return error;
 }
 
@@ -614,6 +619,10 @@ int swsusp_read(unsigned int *flags_p)
 	if (!error)
 		error = load_image(&handle, &snapshot, header->pages - 1);
 	swap_reader_finish(&handle);
+	/*
+	 * When we saved kernel image, swap is freezed, thaw it.
+	 */
+	hibernation_thaw_swap();
 end:
 	if (!error)
 		pr_debug("PM: Image successfully loaded\n");
Index: mmotm-0727/kernel/power/user.c
===================================================================
--- mmotm-0727.orig/kernel/power/user.c
+++ mmotm-0727/kernel/power/user.c
@@ -135,6 +135,7 @@ static int snapshot_release(struct inode
 	free_basic_memory_bitmaps();
 	data = filp->private_data;
 	free_all_swap_pages(data->swap);
+	hibernation_thaw_swap();
 	if (data->frozen)
 		thaw_processes();
 	pm_notifier_call_chain(data->mode == O_WRONLY ?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
@ 2010-08-02 14:27                   ` Rafael J. Wysocki
  2010-08-02 15:59                   ` Balbir Singh
  2010-08-03 23:09                   ` Rafael J. Wysocki
  2 siblings, 0 replies; 45+ messages in thread
From: Rafael J. Wysocki @ 2010-08-02 14:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Monday, August 02, 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Jul 2010 13:14:32 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 29 Jul 2010 21:10:10 -0700
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > I think the best way is kexec(). But maybe rollback from hibernation failure
> > > > will be difficult. Considering how crash-dump works well and under maintainance
> > > > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > > > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > > > directly. Maybe the problem will be the speed of dump.
> > > 
> > > I've no appetite for a total rework of hibernation, and I don't see
> > > how that would
> > > address the issue: I'm just looking for some protection against swap
> > > reuse danger.
> > > 
> > Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
> > It will be harmful for small device guys.)
> > 
> > I'll prepare a routine not-quick-fix.
> 
> Ok, here. Passed easy tests as
> # echo disk > /sys/power/state
> 
> 
> Looks like a big hammer ? But I think following kind of patch is required.
> About swap-reuse, it's only possible when a page is added to swap cache
> but try_to_unmap() fails and the page remains in memory. IIUC, most of this
> kind of pages will be backed to swap by shrink_all_memory(). So, reuse-swap
> happens only when the user unlucky. This patch ignores reuse-swap but freeze
> swap_map[] for saving swap_map[] to the disk in consistent way.
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> swap_map[] is a one of objects which can be update while hibernation. I.E,
> usage counter in swap_map[] is updated while hibernation is making a copy
> of memory image to the disk.
> 
> At resume, hibenation code doesn't call swap_free() against the swaps they
> used...So, the swap_map[] will turns to be an initial state.
> 
> With small consideration, the question is how swap_map[] updated before
> dumping to the disk is treated. In swap-system view, there are no guarantee
> that swap_map[] are properly reloaded and there is no leak in swap count.
> 
> This patch tries to freeze swap_map[] during hibernation.
> By this, no updates will be happen to swap_map[] among save_image().
> At load_image(), the swap_map[] has no record for swap entries used by
> save_image(), we can simply forget it.
> 
> Note: I'm not a specialist of hibernation...so, I'm not sure the hooks
>       to kernel/power/user.c is appropriate or not.
>       And this disables swap-out once hibernation starts saving.
>       Should we afraid of OOM ?

OOM shoiuld not happen at that time, so no worry.

I'll have a deeper look into the patch later today.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
  2010-08-02 14:27                   ` Rafael J. Wysocki
@ 2010-08-02 15:59                   ` Balbir Singh
  2010-08-03  0:19                     ` KAMEZAWA Hiroyuki
  2010-08-03 23:09                   ` Rafael J. Wysocki
  2 siblings, 1 reply; 45+ messages in thread
From: Balbir Singh @ 2010-08-02 15:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Andrea Arcangeli

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-08-02 15:02:25]:

> +/*
> + * Because updateing swap_map[] can make not-saved-status-change,
> + * we use our own easy allocator.
> + * Please see kernel/power/swap.c, Used swaps are recorded into
> + * RB-tree.
> + */
> +swp_entry_t get_swap_for_hibernation(int type)
> +{
> +	pgoff_t off;
> +	swp_entry_t val = {0};
> +	struct swap_info_struct *si;
> +
> +	spin_lock(&swap_lock);
> +	/*
> +	 * Once hibernation starts to use swap, we freeze swap_map[]. Otherwise,
> +	 * saved swap_map[] image to the disk will be an incomplete because it's
> +	 * changing without synchronization with hibernation snap shot.
> +	 * At resume, we just make swap_for_hibernation=false. We can forget
> +	 * used maps easily.

I don't understand the consequences of this action. Once swap_map is
fixed, we get additional swapping because we need more free memory,
what happens to the swapped out contents, since resume will never see
the changes? How did this work before 2.6.31?

> +	 */
> +	if (!swap_for_hibernation)
> +		hibernation_freeze_swap();
> +
> +	si = swap_info[type];
> +	if (!si || !(si->flags & SWP_WRITEOK))
> +		goto done;
> +
> +	for (off = hibernation_offset[type]; off < si->max; ++off) {
> +		if (!si->swap_map[off])
> +			break;

So this is a linear scan for the first free entry, right?

> +	}
> +	if (off < si->max) {
> +		val = swp_entry(type, off);
> +		hibernation_offset[type] = off + 1;
> +	}
> +done:
> +	spin_unlock(&swap_lock);
> +	return val;
> +}
> +

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-02 15:59                   ` Balbir Singh
@ 2010-08-03  0:19                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-03  0:19 UTC (permalink / raw)
  To: balbir
  Cc: Hugh Dickins, KOSAKI Motohiro, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Andrea Arcangeli

On Mon, 2 Aug 2010 21:29:45 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-08-02 15:02:25]:
> 
> > +/*
> > + * Because updateing swap_map[] can make not-saved-status-change,
> > + * we use our own easy allocator.
> > + * Please see kernel/power/swap.c, Used swaps are recorded into
> > + * RB-tree.
> > + */
> > +swp_entry_t get_swap_for_hibernation(int type)
> > +{
> > +	pgoff_t off;
> > +	swp_entry_t val = {0};
> > +	struct swap_info_struct *si;
> > +
> > +	spin_lock(&swap_lock);
> > +	/*
> > +	 * Once hibernation starts to use swap, we freeze swap_map[]. Otherwise,
> > +	 * saved swap_map[] image to the disk will be an incomplete because it's
> > +	 * changing without synchronization with hibernation snap shot.
> > +	 * At resume, we just make swap_for_hibernation=false. We can forget
> > +	 * used maps easily.
> 
> I don't understand the consequences of this action. Once swap_map is
> fixed, we get additional swapping because we need more free memory,
> what happens to the swapped out contents, since resume will never see
> the changes? 

Sorry, I can't understand what you write. Why "we get additional swapping?"
before starting hibernation, shrink_memory() is called and hibernation codes
should have enough memory to work.

This patch does

	1. set swap_for_hibernation = true 
                   => After this, kswapd/direct reclaim will make no swap.
		   => But hibernation can make use of swap.
	2. this variable, swap_for_hibernation is saved to disk as it is.

At resume
	3. swap_for_hibernation is loaded and it's value is "true"
	4. hibernation_thaw_swap() is called and set swap_for_hibernation=false.
	

> How did this work before 2.6.31?
> 

hmm? Are you talking about regression itself ?

   Before 2.6.31
    - At scan_swap_map(), free swap_map[] was used.
   After 2.6.31
    - At scan_swap_map(), if "swapcache-only" swap entry is found,
      it's reused by try_to_free_swapcache(). Because this happens
      during saving image of system memory, the snapshot will have inconsitency
      between swap_map <=> swap cache (I think mem_map is saved firstly)
      Then, memory corruption happens.
   After this patch.
    - scan_swap_map() is never called while saving snapshot to the disk.


> > +	 */
> > +	if (!swap_for_hibernation)
> > +		hibernation_freeze_swap();
> > +
> > +	si = swap_info[type];
> > +	if (!si || !(si->flags & SWP_WRITEOK))
> > +		goto done;
> > +
> > +	for (off = hibernation_offset[type]; off < si->max; ++off) {
> > +		if (!si->swap_map[off])
> > +			break;
> 
> So this is a linear scan for the first free entry, right?
> 
yes. Maybe some clever code can be added but start from simple one.
The result will not be very different because "write" time is long.
Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  4:23   ` KAMEZAWA Hiroyuki
  2010-07-29  5:23     ` KOSAKI Motohiro
@ 2010-08-03 10:50     ` Andrea Gelmini
  2010-08-03 23:36       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 45+ messages in thread
From: Andrea Gelmini @ 2010-08-03 10:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rafael J. Wysocki, Ondrej Zary, Kernel development list,
	Andrew Morton, Balbir Singh, nigel

2010/7/29 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
> Considering possible cases...and here is a patch.
> but I'm not fully sure. Could you clarify ?

I'm using this patch (with the one for ksm.c by Arcangeli) since the publishing.
It seems all works pretty well, with TuxOnIce also.

Never had the need to reboot the system.

Thanks a lot,
Andrea

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
  2010-08-02 14:27                   ` Rafael J. Wysocki
  2010-08-02 15:59                   ` Balbir Singh
@ 2010-08-03 23:09                   ` Rafael J. Wysocki
  2010-08-03 23:31                     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 45+ messages in thread
From: Rafael J. Wysocki @ 2010-08-03 23:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Monday, August 02, 2010, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Jul 2010 13:14:32 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 29 Jul 2010 21:10:10 -0700
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > I think the best way is kexec(). But maybe rollback from hibernation failure
> > > > will be difficult. Considering how crash-dump works well and under maintainance
> > > > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > > > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > > > directly. Maybe the problem will be the speed of dump.
> > > 
> > > I've no appetite for a total rework of hibernation, and I don't see
> > > how that would
> > > address the issue: I'm just looking for some protection against swap
> > > reuse danger.
> > > 
> > Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
> > It will be harmful for small device guys.)
> > 
> > I'll prepare a routine not-quick-fix.
> 
> Ok, here. Passed easy tests as
> # echo disk > /sys/power/state
> 
> 
> Looks like a big hammer ? But I think following kind of patch is required.
> About swap-reuse, it's only possible when a page is added to swap cache
> but try_to_unmap() fails and the page remains in memory. IIUC, most of this
> kind of pages will be backed to swap by shrink_all_memory(). So, reuse-swap
> happens only when the user unlucky. This patch ignores reuse-swap but freeze
> swap_map[] for saving swap_map[] to the disk in consistent way.
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> swap_map[] is a one of objects which can be update while hibernation. I.E,
> usage counter in swap_map[] is updated while hibernation is making a copy
> of memory image to the disk.
> 
> At resume, hibenation code doesn't call swap_free() against the swaps they
> used...So, the swap_map[] will turns to be an initial state.
> 
> With small consideration, the question is how swap_map[] updated before
> dumping to the disk is treated. In swap-system view, there are no guarantee
> that swap_map[] are properly reloaded and there is no leak in swap count.
> 
> This patch tries to freeze swap_map[] during hibernation.
> By this, no updates will be happen to swap_map[] among save_image().
> At load_image(), the swap_map[] has no record for swap entries used by
> save_image(), we can simply forget it.
> 
> Note: I'm not a specialist of hibernation...so, I'm not sure the hooks
>       to kernel/power/user.c is appropriate or not.
>       And this disables swap-out once hibernation starts saving.
>       Should we afraid of OOM ?
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I have only one comment (below).

> ---
...
> Index: mmotm-0727/kernel/power/user.c
> ===================================================================
> --- mmotm-0727.orig/kernel/power/user.c
> +++ mmotm-0727/kernel/power/user.c
> @@ -135,6 +135,7 @@ static int snapshot_release(struct inode
>  	free_basic_memory_bitmaps();
>  	data = filp->private_data;
>  	free_all_swap_pages(data->swap);
> +	hibernation_thaw_swap();

free_all_swap_pages() calls hibernation_thaw_swap(), so it doesn't need to be
called again.

>  	if (data->frozen)
>  		thaw_processes();
>  	pm_notifier_call_chain(data->mode == O_WRONLY ?
> 
> 
> --

Rafael

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-03 23:09                   ` Rafael J. Wysocki
@ 2010-08-03 23:31                     ` KAMEZAWA Hiroyuki
  2010-08-04  2:26                       ` KAMEZAWA Hiroyuki
  2010-08-04  4:57                       ` [PATCH -mm] hibernation: freeze swap at hibernation v2 KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-03 23:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Wed, 4 Aug 2010 01:09:15 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> On Monday, August 02, 2010, KAMEZAWA Hiroyuki wrote:
> > On Fri, 30 Jul 2010 13:14:32 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Thu, 29 Jul 2010 21:10:10 -0700
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > > On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> > > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > I think the best way is kexec(). But maybe rollback from hibernation failure
> > > > > will be difficult. Considering how crash-dump works well and under maintainance
> > > > > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > > > > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > > > > directly. Maybe the problem will be the speed of dump.
> > > > 
> > > > I've no appetite for a total rework of hibernation, and I don't see
> > > > how that would
> > > > address the issue: I'm just looking for some protection against swap
> > > > reuse danger.
> > > > 
> > > Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
> > > It will be harmful for small device guys.)
> > > 
> > > I'll prepare a routine not-quick-fix.
> > 
> > Ok, here. Passed easy tests as
> > # echo disk > /sys/power/state
> > 
> > 
> > Looks like a big hammer ? But I think following kind of patch is required.
> > About swap-reuse, it's only possible when a page is added to swap cache
> > but try_to_unmap() fails and the page remains in memory. IIUC, most of this
> > kind of pages will be backed to swap by shrink_all_memory(). So, reuse-swap
> > happens only when the user unlucky. This patch ignores reuse-swap but freeze
> > swap_map[] for saving swap_map[] to the disk in consistent way.
> > 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > swap_map[] is a one of objects which can be update while hibernation. I.E,
> > usage counter in swap_map[] is updated while hibernation is making a copy
> > of memory image to the disk.
> > 
> > At resume, hibenation code doesn't call swap_free() against the swaps they
> > used...So, the swap_map[] will turns to be an initial state.
> > 
> > With small consideration, the question is how swap_map[] updated before
> > dumping to the disk is treated. In swap-system view, there are no guarantee
> > that swap_map[] are properly reloaded and there is no leak in swap count.
> > 
> > This patch tries to freeze swap_map[] during hibernation.
> > By this, no updates will be happen to swap_map[] among save_image().
> > At load_image(), the swap_map[] has no record for swap entries used by
> > save_image(), we can simply forget it.
> > 
> > Note: I'm not a specialist of hibernation...so, I'm not sure the hooks
> >       to kernel/power/user.c is appropriate or not.
> >       And this disables swap-out once hibernation starts saving.
> >       Should we afraid of OOM ?
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> I have only one comment (below).
> 
> > ---
> ...
> > Index: mmotm-0727/kernel/power/user.c
> > ===================================================================
> > --- mmotm-0727.orig/kernel/power/user.c
> > +++ mmotm-0727/kernel/power/user.c
> > @@ -135,6 +135,7 @@ static int snapshot_release(struct inode
> >  	free_basic_memory_bitmaps();
> >  	data = filp->private_data;
> >  	free_all_swap_pages(data->swap);
> > +	hibernation_thaw_swap();
> 
> free_all_swap_pages() calls hibernation_thaw_swap(), so it doesn't need to be
> called again.
> 

Ah, yes. Thank you for review.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-08-03 10:50     ` Andrea Gelmini
@ 2010-08-03 23:36       ` KAMEZAWA Hiroyuki
  2010-08-04  1:50         ` [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image KAMEZAWA Hiroyuki
  2010-08-05 11:41         ` Memory corruption during hibernation since 2.6.31 Andrea Gelmini
  0 siblings, 2 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-03 23:36 UTC (permalink / raw)
  To: Andrea Gelmini
  Cc: Rafael J. Wysocki, Ondrej Zary, Kernel development list,
	Andrew Morton, Balbir Singh, nigel

On Tue, 3 Aug 2010 12:50:43 +0200
Andrea Gelmini <andrea.gelmini@gmail.com> wrote:

> 2010/7/29 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
> > Considering possible cases...and here is a patch.
> > but I'm not fully sure. Could you clarify ?
> 
> I'm using this patch (with the one for ksm.c by Arcangeli) since the publishing.
> It seems all works pretty well, with TuxOnIce also.
> 
> Never had the need to reboot the system.
> 

Good, thank you for reporting !
I'd like to add your Tested-by:

Thanks,
-Kame

> Thanks a lot,
> Andrea
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image
  2010-08-03 23:36       ` KAMEZAWA Hiroyuki
@ 2010-08-04  1:50         ` KAMEZAWA Hiroyuki
  2010-08-04  2:31           ` KAMEZAWA Hiroyuki
  2010-08-05 11:41         ` Memory corruption during hibernation since 2.6.31 Andrea Gelmini
  1 sibling, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  1:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Gelmini, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh, nigel,
	stable, Hugh Dickins

This patch is created against 2.6.35. CC'ed stable.
Thank you for all helps.

=
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Since 2.6.31, swap_map[]'s refcounting was changed to show that
a used swap entry is just for swap-cache, can be reused.
Then, while scanning free entry in swap_map[], a swap entry may
be able to be reclaimed and reused. It was by the commit
c9e444103b5e7a5a3519f9913f59767f92e33baf.

But this caused deta corruption at hibernation. Considering how
the image is saved, the calls of try_to_reclaim_swap() changes the
status of memory and there will be inconsitency between saved-memory-image's
swap_map[] / memmap / swapper_space because memory is saved per page with
swap-allocation per page.

This patch is for avoiding bug by not reclaiming swap-entry at hibernation.
This is a quick fix for backporting.

Cc: stable@kernel.org
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ondreg Zary <linux@rainbow-software.org>
Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/swapfile.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6.35.org/mm/swapfile.c
===================================================================
--- linux-2.6.35.org.orig/mm/swapfile.c
+++ linux-2.6.35.org/mm/swapfile.c
@@ -318,8 +318,10 @@ checks:
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+	/* reuse swap entry of cache-only swap if not hibernation. */
+	if (vm_swap_full()
+		&& usage == SWAP_HAS_CACHE
+		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
 		spin_unlock(&swap_lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was Re: Memory corruption during hibernation since 2.6.31
  2010-08-03 23:31                     ` KAMEZAWA Hiroyuki
@ 2010-08-04  2:26                       ` KAMEZAWA Hiroyuki
  2010-08-04  4:57                       ` [PATCH -mm] hibernation: freeze swap at hibernation v2 KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  2:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rafael J. Wysocki, Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli

On Wed, 4 Aug 2010 08:31:19 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 4 Aug 2010 01:09:15 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > On Monday, August 02, 2010, KAMEZAWA Hiroyuki wrote:
> > > On Fri, 30 Jul 2010 13:14:32 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Thu, 29 Jul 2010 21:10:10 -0700
> > > > Hugh Dickins <hughd@google.com> wrote:
> > > > 
> > > > > On Thu, Jul 29, 2010 at 5:01 PM, KAMEZAWA Hiroyuki
> > > > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > I think the best way is kexec(). But maybe rollback from hibernation failure
> > > > > > will be difficult. Considering how crash-dump works well and under maintainance
> > > > > > by many enterprise guys, hibernation-by-kexec is a choice. I think. It can make
> > > > > > reuse of kdump code, ...or, hibernation-resume code can eat kdump image
> > > > > > directly. Maybe the problem will be the speed of dump.
> > > > > 
> > > > > I've no appetite for a total rework of hibernation, and I don't see
> > > > > how that would
> > > > > address the issue: I'm just looking for some protection against swap
> > > > > reuse danger.
> > > > > 
> > > > Okay ;) (And I forget that kexec has to prepare memory at boot time for 2nd kernel.
> > > > It will be harmful for small device guys.)
> > > > 
> > > > I'll prepare a routine not-quick-fix.
> > > 
> > > Ok, here. Passed easy tests as
> > > # echo disk > /sys/power/state
> > > 

Sorry, I found this patch is of no use. I should put
hibernate_free_swap() before hibernation_snapshot().

I'll retry.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image
  2010-08-04  1:50         ` [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image KAMEZAWA Hiroyuki
@ 2010-08-04  2:31           ` KAMEZAWA Hiroyuki
  2010-08-04  2:46             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  2:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Gelmini, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh, nigel,
	stable, Hugh Dickins

On Wed, 4 Aug 2010 10:50:54 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> This patch is created against 2.6.35. CC'ed stable.
> Thank you for all helps.
> 

I'm sorry I now doubt this patch is wrong.
What should be done is stop swap-reuse before hibernation_snapshot().
But, hmm, this one can't.

Maybe what really happens is someone (kswapd?) reuses swap while
swsusp_alloc() is on going.

please allow me to retry..

Thanks,
-Kame


> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Since 2.6.31, swap_map[]'s refcounting was changed to show that
> a used swap entry is just for swap-cache, can be reused.
> Then, while scanning free entry in swap_map[], a swap entry may
> be able to be reclaimed and reused. It was by the commit
> c9e444103b5e7a5a3519f9913f59767f92e33baf.
> 
> But this caused deta corruption at hibernation. Considering how
> the image is saved, the calls of try_to_reclaim_swap() changes the
> status of memory and there will be inconsitency between saved-memory-image's
> swap_map[] / memmap / swapper_space because memory is saved per page with
> swap-allocation per page.
> 
> This patch is for avoiding bug by not reclaiming swap-entry at hibernation.
> This is a quick fix for backporting.
> 
> Cc: stable@kernel.org
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Reported-by: Ondreg Zary <linux@rainbow-software.org>
> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/swapfile.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.35.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.35.org.orig/mm/swapfile.c
> +++ linux-2.6.35.org/mm/swapfile.c
> @@ -318,8 +318,10 @@ checks:
>  	if (offset > si->highest_bit)
>  		scan_base = offset = si->lowest_bit;
>  
> -	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	/* reuse swap entry of cache-only swap if not hibernation. */
> +	if (vm_swap_full()
> +		&& usage == SWAP_HAS_CACHE
> +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&swap_lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);
> 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image
  2010-08-04  2:31           ` KAMEZAWA Hiroyuki
@ 2010-08-04  2:46             ` KAMEZAWA Hiroyuki
  2010-08-05 19:12               ` Hugh Dickins
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  2:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Gelmini, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh, nigel,
	stable, Hugh Dickins

On Wed, 4 Aug 2010 11:31:22 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 4 Aug 2010 10:50:54 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > This patch is created against 2.6.35. CC'ed stable.
> > Thank you for all helps.
> > 
> 
> I'm sorry I now doubt this patch is wrong.

Changed the description.
Maybe the patch description/my understanding was wrong
but the patch itself should work.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Since 2.6.31, swap_map[]'s refcounting was changed to show that
a used swap entry is just for swap-cache, can be reused.
Then, while scanning free entry in swap_map[], a swap entry may
be able to be reclaimed and reused. It was by the commit
c9e444103b5e7a5a3519f9913f59767f92e33baf.

But this caused deta corruption at resume.
The scenario is
	- Assume a clean-swap cache, but mapped.
	- at hibernation_snapshot[], clean-swap-cache is saved as
	  clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE.
	- then, save_image() is called. And reuse SWAP_HAS_CACHE entry
	  to save image, and break the contents.
	After resume.
	- the memory reclaim runs and finds clean-not-referenced-swap-cache
	  and discards it because it's marked as clean.
	  But here, the contents on disk and swap-cache is inconsistent.

Here, memory is corrupted.


This patch is for avoiding bug by not reclaiming swap-entry at hibernation.
This is a quick fix for backporting.

Cc: stable@kernel.org
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ondreg Zary <linux@rainbow-software.org>
Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/swapfile.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6.35.org/mm/swapfile.c
===================================================================
--- linux-2.6.35.org.orig/mm/swapfile.c
+++ linux-2.6.35.org/mm/swapfile.c
@@ -318,8 +318,10 @@ checks:
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+	/* reuse swap entry of cache-only swap if not hibernation. */
+	if (vm_swap_full()
+		&& usage == SWAP_HAS_CACHE
+		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
 		spin_unlock(&swap_lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH -mm] hibernation: freeze swap at hibernation v2
  2010-08-03 23:31                     ` KAMEZAWA Hiroyuki
  2010-08-04  2:26                       ` KAMEZAWA Hiroyuki
@ 2010-08-04  4:57                       ` KAMEZAWA Hiroyuki
  2010-08-04 22:18                         ` Andrew Morton
  1 sibling, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  4:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rafael J. Wysocki, Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh,
	Andrea Arcangeli


Fixed description and points of hooks. maybe much clearer.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At taking memory snapshot in hibernate_snapshot(), all (directly called)
memory allocator uses GFP_ATOMIC. And it seems swap-misusage during
hibernation never occurs.

But, from pessimistic point of view, there is no guarantee to trust 
any page allcation doesn't have __GFP_WAIT. It's better to have an indication
"we enter hibernation, don't use swap!".

This patch tries to freeze new-swap-allocation during hibernation.
(We can trust all user processes are freezed, then, dont't take care of swapin)

By this, no updates will be happen to swap_map[] among hibernate_snapshot()
to save_image(). swap is thawed when swsusp_free() is called.
We can trust swap-corruption will never happen without any doubts.

Changelog: 2010-08-04
 - Fixed hibernation_freeze_swap/thaw_swap call points.
 - Rewrite the all description.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h     |    8 +++-
 kernel/power/hibernate.c |    1 
 kernel/power/snapshot.c  |    1 
 kernel/power/swap.c      |    6 +--
 mm/swapfile.c            |   94 ++++++++++++++++++++++++++++++++++++-----------
 5 files changed, 84 insertions(+), 26 deletions(-)

Index: mmotm-0727/include/linux/swap.h
===================================================================
--- mmotm-0727.orig/include/linux/swap.h
+++ mmotm-0727/include/linux/swap.h
@@ -316,7 +316,6 @@ extern long nr_swap_pages;
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
-extern swp_entry_t get_swap_page_of_type(int);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -333,6 +332,13 @@ extern int reuse_swap_page(struct page *
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+#ifdef CONFIG_HIBERNATION
+void hibernation_freeze_swap(void);
+void hibernation_thaw_swap(void);
+swp_entry_t get_swap_for_hibernation(int type);
+void swap_free_for_hibernation(swp_entry_t val);
+#endif
+
 /* linux/mm/thrash.c */
 extern struct mm_struct *swap_token_mm;
 extern void grab_swap_token(struct mm_struct *);
Index: mmotm-0727/mm/swapfile.c
===================================================================
--- mmotm-0727.orig/mm/swapfile.c
+++ mmotm-0727/mm/swapfile.c
@@ -47,6 +47,8 @@ long nr_swap_pages;
 long total_swap_pages;
 static int least_priority;
 
+static bool swap_for_hibernation;
+
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
 static const char Bad_offset[] = "Bad swap offset entry ";
@@ -449,6 +451,8 @@ swp_entry_t get_swap_page(void)
 	spin_lock(&swap_lock);
 	if (nr_swap_pages <= 0)
 		goto noswap;
+	if (swap_for_hibernation)
+		goto noswap;
 	nr_swap_pages--;
 
 	for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
@@ -481,28 +485,6 @@ noswap:
 	return (swp_entry_t) {0};
 }
 
-/* The only caller of this function is now susupend routine */
-swp_entry_t get_swap_page_of_type(int type)
-{
-	struct swap_info_struct *si;
-	pgoff_t offset;
-
-	spin_lock(&swap_lock);
-	si = swap_info[type];
-	if (si && (si->flags & SWP_WRITEOK)) {
-		nr_swap_pages--;
-		/* This is called for allocating swap entry, not cache */
-		offset = scan_swap_map(si, 1);
-		if (offset) {
-			spin_unlock(&swap_lock);
-			return swp_entry(type, offset);
-		}
-		nr_swap_pages++;
-	}
-	spin_unlock(&swap_lock);
-	return (swp_entry_t) {0};
-}
-
 static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
@@ -762,6 +744,74 @@ int mem_cgroup_count_swap_user(swp_entry
 #endif
 
 #ifdef CONFIG_HIBERNATION
+
+static pgoff_t hibernation_offset[MAX_SWAPFILES];
+/*
+ * Once hibernation starts to use swap, we freeze swap_map[]. Otherwise,
+ * saved swap_map[] image to the disk will be an incomplete because it's
+ * changing without synchronization with hibernation snap shot.
+ * At resume, we just make swap_for_hibernation=false. We can forget
+ * used maps easily.
+ */
+void hibernation_freeze_swap(void)
+{
+	int i;
+
+	spin_lock(&swap_lock);
+
+	printk(KERN_INFO "PM: Freeze Swap\n");
+	swap_for_hibernation = true;
+	for (i = 0; i < MAX_SWAPFILES; i++)
+		hibernation_offset[i] = 1;
+	spin_unlock(&swap_lock);
+}
+
+void hibernation_thaw_swap(void)
+{
+	spin_lock(&swap_lock);
+	if (swap_for_hibernation) {
+		printk(KERN_INFO "PM: Thaw Swap\n");
+		swap_for_hibernation = false;
+	}
+	spin_unlock(&swap_lock);
+}
+
+/*
+ * Because updateing swap_map[] can make not-saved-status-change,
+ * we use our own easy allocator.
+ * Please see kernel/power/swap.c, Used swaps are recorded into
+ * RB-tree.
+ */
+swp_entry_t get_swap_for_hibernation(int type)
+{
+	pgoff_t off;
+	swp_entry_t val = {0};
+	struct swap_info_struct *si;
+
+	spin_lock(&swap_lock);
+
+	si = swap_info[type];
+	if (!si || !(si->flags & SWP_WRITEOK))
+		goto done;
+
+	for (off = hibernation_offset[type]; off < si->max; ++off) {
+		if (!si->swap_map[off])
+			break;
+	}
+	if (off < si->max) {
+		val = swp_entry(type, off);
+		hibernation_offset[type] = off + 1;
+	}
+done:
+	spin_unlock(&swap_lock);
+	return val;
+}
+
+void swap_free_for_hibernation(swp_entry_t ent)
+{
+	/* Nothing to do */
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
Index: mmotm-0727/kernel/power/swap.c
===================================================================
--- mmotm-0727.orig/kernel/power/swap.c
+++ mmotm-0727/kernel/power/swap.c
@@ -135,10 +135,10 @@ sector_t alloc_swapdev_block(int swap)
 {
 	unsigned long offset;
 
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(get_swap_for_hibernation(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_for_hibernation(swp_entry(swap, offset));
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -162,7 +162,7 @@ void free_all_swap_pages(int swap)
 		ext = container_of(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 		for (offset = ext->start; offset <= ext->end; offset++)
-			swap_free(swp_entry(swap, offset));
+			swap_free_for_hibernation(swp_entry(swap, offset));
 
 		kfree(ext);
 	}
Index: mmotm-0727/kernel/power/hibernate.c
===================================================================
--- mmotm-0727.orig/kernel/power/hibernate.c
+++ mmotm-0727/kernel/power/hibernate.c
@@ -338,6 +338,7 @@ int hibernation_snapshot(int platform_mo
 		goto Close;
 
 	suspend_console();
+	hibernation_freeze_swap();
 	saved_mask = clear_gfp_allowed_mask(GFP_IOFS);
 	error = dpm_suspend_start(PMSG_FREEZE);
 	if (error)
Index: mmotm-0727/kernel/power/snapshot.c
===================================================================
--- mmotm-0727.orig/kernel/power/snapshot.c
+++ mmotm-0727/kernel/power/snapshot.c
@@ -1086,6 +1086,7 @@ void swsusp_free(void)
 	buffer = NULL;
 	alloc_normal = 0;
 	alloc_highmem = 0;
+	hibernation_thaw_swap();
 }
 
 /* Helper functions used for the shrinking of memory. */


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH -mm] hibernation: freeze swap at hibernation v2
  2010-08-04  4:57                       ` [PATCH -mm] hibernation: freeze swap at hibernation v2 KAMEZAWA Hiroyuki
@ 2010-08-04 22:18                         ` Andrew Morton
  2010-08-05  0:32                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 45+ messages in thread
From: Andrew Morton @ 2010-08-04 22:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rafael J. Wysocki, Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Balbir Singh, Andrea Arcangeli

On Wed, 4 Aug 2010 13:57:39 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> At taking memory snapshot in hibernate_snapshot(), all (directly called)
> memory allocator uses GFP_ATOMIC. And it seems swap-misusage during
> hibernation never occurs.
> 
> But, from pessimistic point of view, there is no guarantee to trust 
> any page allcation doesn't have __GFP_WAIT. It's better to have an indication
> "we enter hibernation, don't use swap!".
> 
> This patch tries to freeze new-swap-allocation during hibernation.
> (We can trust all user processes are freezed, then, dont't take care of swapin)
> 
> By this, no updates will be happen to swap_map[] among hibernate_snapshot()
> to save_image(). swap is thawed when swsusp_free() is called.
> We can trust swap-corruption will never happen without any doubts.
> 

Confused (as usual).

The above seems to be saying "there isn't a bug, but perhaps there is
one that we don't know about so let's fix it in case it's there".  Or
something.

But this email thread used to be called "Memory corruption during
hibernation since 2.6.31" which sounds a heck of a lot more serious.

Does this patch fix memory corruption?  If so, why was that corruption
occurring, and under which circumstances?

Once all this is known, let's decide whether -stable needs this patch.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH -mm] hibernation: freeze swap at hibernation v2
  2010-08-04 22:18                         ` Andrew Morton
@ 2010-08-05  0:32                           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rafael J. Wysocki, Hugh Dickins, KOSAKI Motohiro, Ondrej Zary,
	Kernel development list, Balbir Singh, Andrea Arcangeli

On Wed, 4 Aug 2010 15:18:39 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 4 Aug 2010 13:57:39 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > At taking memory snapshot in hibernate_snapshot(), all (directly called)
> > memory allocator uses GFP_ATOMIC. And it seems swap-misusage during
> > hibernation never occurs.
> > 
> > But, from pessimistic point of view, there is no guarantee to trust 
> > any page allcation doesn't have __GFP_WAIT. It's better to have an indication
> > "we enter hibernation, don't use swap!".
> > 
> > This patch tries to freeze new-swap-allocation during hibernation.
> > (We can trust all user processes are freezed, then, dont't take care of swapin)
> > 
> > By this, no updates will be happen to swap_map[] among hibernate_snapshot()
> > to save_image(). swap is thawed when swsusp_free() is called.
> > We can trust swap-corruption will never happen without any doubts.
> > 
> 
> Confused (as usual).
> 
Sorry
> The above seems to be saying "there isn't a bug, but perhaps there is
> one that we don't know about so let's fix it in case it's there".  Or
> something.
> 
> But this email thread used to be called "Memory corruption during
> hibernation since 2.6.31" which sounds a heck of a lot more serious.
> 
> Does this patch fix memory corruption?  If so, why was that corruption
> occurring, and under which circumstances?
> 
The corruption we see now is fixed by 
mm-fix-corruption-of-hibernation-caused-by-reusing-swap-during-image-saving.patch
This patch disables swap-reuse by the hibernation code itself and fixes the bug.

But, IIUC, comments from reviewer was "it looks like a temporary-fix. It will work
and be good for backporting...but can't we have a permanent fix which never be
broken by anyone ?" 

IOW, the patch doesn't stop get_swap_page() because we believe all
subsystems related to hibernation works well _now_ and everyone use GFP_ATOMIC.

However, hibernation is a set of complex features. It's coded very carefully and
works well but no one can't see the all of codes in subsystems. So, some safe
system is better rather than believing.

(BTW, one of reasons we can believe we are safe is that critical threads have
 to be frozen before creating snapshot.)

For example, for guaranteeing gfp_mask, we have 
	clear_gfp_allowed_mask()
	set_gfp_allowed_mask()
and clears __GFP_IO|__GFP_FS from all allocation requests via alloc_pages, slab,
slub etc even if someone use a gfp_mask other than GFP_ATOMIC.
These code is for notifying memory allocator that "hibernation is running!".

The idea of this patch is notifying swap-allocator that "hibernation is running!"
(With this, we can believe ourselves even if some tainted modules are loaded.)

This patch's purpose is adding a safe system rather than believing no one does
bad thing. Fortunately, disallowing using swap is easy.



Thanks,
-Kame







^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-08-03 23:36       ` KAMEZAWA Hiroyuki
  2010-08-04  1:50         ` [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image KAMEZAWA Hiroyuki
@ 2010-08-05 11:41         ` Andrea Gelmini
  1 sibling, 0 replies; 45+ messages in thread
From: Andrea Gelmini @ 2010-08-05 11:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rafael J. Wysocki, Ondrej Zary, Kernel development list,
	Andrew Morton, Balbir Singh, nigel

2010/8/4 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
> Good, thank you for reporting !
> I'd like to add your Tested-by:

No prob.

Thanks a lot for your work,
Andrea

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29  5:24       ` KAMEZAWA Hiroyuki
                           ` (2 preceding siblings ...)
  2010-07-29 18:44         ` Hugh Dickins
@ 2010-08-05 12:44         ` Ondrej Zary
  3 siblings, 0 replies; 45+ messages in thread
From: Ondrej Zary @ 2010-08-05 12:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Rafael J. Wysocki, Kernel development list,
	Andrew Morton, Balbir Singh

On Thursday 29 July 2010 07:24:29 KAMEZAWA Hiroyuki wrote:
> On Thu, 29 Jul 2010 14:23:33 +0900 (JST)
>
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Can you please add explicit commenting in the code?
>
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> At hibernation, all pages-should-be-saved are written into a image (here,
> swap). Then, swap_map[], memmap etcs are also saved into disks.
>
> But, swap allocation happens one by one. So, the final image of swap_map[]
> is different from saved one and the commit
> c9e444103b5e7a5a3519f9913f59767f92e33baf changes page's state while
> assiging swap. Because memory can be modified in hibernation is only
> not-to-be-save memory. it's a breakage.
>
> This patch fixes it by disabling swap entry reuse at hibernation.

6 days with no crash - so the patch seems to work.



> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/swapfile.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.34.org/mm/swapfile.c
> ===================================================================
> --- linux-2.6.34.org.orig/mm/swapfile.c
> +++ linux-2.6.34.org/mm/swapfile.c
> @@ -315,8 +315,15 @@ checks:
>  	if (offset > si->highest_bit)
>  		scan_base = offset = si->lowest_bit;
>
> -	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	/*
> + 	 * reuse swap entry of cache-only swap if not busy &&
> + 	 * when we're called via pageout(). At hibernation, swap-reuse
> + 	 * is harmful because it changes memory status...which may
> + 	 * be saved already.
> + 	 */
> +	if (vm_swap_full()
> +		&& usage == SWAP_HAS_CACHE
> +		&& si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&swap_lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);
>


-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [BUGFIX][PATCH] fix corruption of hibernation caused by reusing  swap at saving image
  2010-08-04  2:46             ` KAMEZAWA Hiroyuki
@ 2010-08-05 19:12               ` Hugh Dickins
  0 siblings, 0 replies; 45+ messages in thread
From: Hugh Dickins @ 2010-08-05 19:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Gelmini, Rafael J. Wysocki, Ondrej Zary,
	Kernel development list, Andrew Morton, Balbir Singh, nigel,
	stable

On Tue, Aug 3, 2010 at 7:46 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 4 Aug 2010 11:31:22 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Wed, 4 Aug 2010 10:50:54 +0900
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > This patch is created against 2.6.35. CC'ed stable.
>> > Thank you for all helps.
>> >
>>
>> I'm sorry I now doubt this patch is wrong.
>
> Changed the description.
> Maybe the patch description/my understanding was wrong
> but the patch itself should work.
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Since 2.6.31, swap_map[]'s refcounting was changed to show that
> a used swap entry is just for swap-cache, can be reused.
> Then, while scanning free entry in swap_map[], a swap entry may
> be able to be reclaimed and reused. It was by the commit
> c9e444103b5e7a5a3519f9913f59767f92e33baf.
>
> But this caused deta corruption at resume.
> The scenario is
>        - Assume a clean-swap cache, but mapped.
>        - at hibernation_snapshot[], clean-swap-cache is saved as
>          clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE.
>        - then, save_image() is called. And reuse SWAP_HAS_CACHE entry
>          to save image, and break the contents.
>        After resume.
>        - the memory reclaim runs and finds clean-not-referenced-swap-cache
>          and discards it because it's marked as clean.
>          But here, the contents on disk and swap-cache is inconsistent.
>
> Here, memory is corrupted.

Yes, that's the scenario: I thought that was what you meant by your
original description, but this more detailed description is a better
one.

Hugh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Memory corruption during hibernation since 2.6.31
  2010-07-29 18:55           ` Andrea Arcangeli
  2010-07-29 23:40             ` Rafael J. Wysocki
@ 2010-08-09  7:26             ` Pavel Machek
  1 sibling, 0 replies; 45+ messages in thread
From: Pavel Machek @ 2010-08-09  7:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rafael J. Wysocki, Ondrej Zary, Kernel development list,
	Andrew Morton, Balbir Singh

Hi!

> > I've CC'ed Andrea because we were having an offline conversation about
> > whether ksmd (and his khugepaged) need to set_freezable(); and I wonder
> > if this swap bug underlies his interest, though he was mainly worrying
> > about I/O in progress.
> 
> My opinion is that with current freezer model it is needed for suspend
> to disk. The kthread_should_stop seems useless for kswapd/ksmd, but
> I'm afraid it might get useful in the future so just to stay on the
> safe side I added it to khugepaged as well, but it's contributing to
> the pollution.
> 
> I've no idea why the freezing isn't preemptive (through the scheduler,
> all these kernel threads are obviously lowlatency beasts) by default

Hibernation woulld like all  tasks frozen with no locks held - so that
write to disk does not have to care about locking.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2010-08-09  7:26 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-28 21:20 Memory corruption during hibernation since 2.6.31 Ondrej Zary
2010-07-28 21:34 ` Rafael J. Wysocki
2010-07-28 21:38   ` Ondrej Zary
2010-07-29  1:06     ` KAMEZAWA Hiroyuki
2010-07-29  2:51       ` KAMEZAWA Hiroyuki
2010-07-29  4:23   ` KAMEZAWA Hiroyuki
2010-07-29  5:23     ` KOSAKI Motohiro
2010-07-29  5:24       ` KAMEZAWA Hiroyuki
2010-07-29  5:30         ` KOSAKI Motohiro
2010-07-29 17:33         ` Ondrej Zary
2010-07-29 18:44         ` Hugh Dickins
2010-07-29 18:55           ` Andrea Arcangeli
2010-07-29 23:40             ` Rafael J. Wysocki
2010-07-30  4:02               ` Hugh Dickins
2010-08-09  7:26             ` Pavel Machek
2010-07-29 23:29           ` Rafael J. Wysocki
2010-07-30  3:36             ` KAMEZAWA Hiroyuki
2010-07-30  3:54             ` Hugh Dickins
2010-07-30  0:01           ` KAMEZAWA Hiroyuki
2010-07-30  4:10             ` Hugh Dickins
2010-07-30  4:14               ` KAMEZAWA Hiroyuki
2010-07-30  4:46                 ` Hugh Dickins
2010-07-30 10:43                   ` KAMEZAWA Hiroyuki
2010-07-30 18:16                     ` Hugh Dickins
2010-08-02  6:02                 ` [RFC][PATCH -mm] hibernation: freeze swap at hibernation (Was " KAMEZAWA Hiroyuki
2010-08-02 14:27                   ` Rafael J. Wysocki
2010-08-02 15:59                   ` Balbir Singh
2010-08-03  0:19                     ` KAMEZAWA Hiroyuki
2010-08-03 23:09                   ` Rafael J. Wysocki
2010-08-03 23:31                     ` KAMEZAWA Hiroyuki
2010-08-04  2:26                       ` KAMEZAWA Hiroyuki
2010-08-04  4:57                       ` [PATCH -mm] hibernation: freeze swap at hibernation v2 KAMEZAWA Hiroyuki
2010-08-04 22:18                         ` Andrew Morton
2010-08-05  0:32                           ` KAMEZAWA Hiroyuki
2010-07-30  4:18           ` Memory corruption during hibernation since 2.6.31 Balbir Singh
2010-07-30  4:32             ` Hugh Dickins
2010-07-30  6:37               ` Balbir Singh
2010-08-05 12:44         ` Ondrej Zary
2010-08-03 10:50     ` Andrea Gelmini
2010-08-03 23:36       ` KAMEZAWA Hiroyuki
2010-08-04  1:50         ` [BUGFIX][PATCH] fix corruption of hibernation caused by reusing swap at saving image KAMEZAWA Hiroyuki
2010-08-04  2:31           ` KAMEZAWA Hiroyuki
2010-08-04  2:46             ` KAMEZAWA Hiroyuki
2010-08-05 19:12               ` Hugh Dickins
2010-08-05 11:41         ` Memory corruption during hibernation since 2.6.31 Andrea Gelmini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).