Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* poisoned pages do not play well in the buddy allocator
@ 2019-08-26 10:41 Oscar Salvador
  2019-08-26 11:14 ` Michal Hocko
  2019-08-27  1:34 ` Naoya Horiguchi
  0 siblings, 2 replies; 7+ messages in thread
From: Oscar Salvador @ 2019-08-26 10:41 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, vbabka

Hi,

When analyzing a problem reported by one of our customers, I stumbbled upon an issue
that origins from the fact that poisoned pages end up in the buddy allocator.

Let me break down the stepts that lie to the problem:

1) We soft-offline a page
2) Page gets flagged as HWPoison and is being sent to the buddy allocator.
   This is done through set_hwpoison_free_buddy_page().
3) Kcompactd wakes up in order to perform some compaction.
4) compact_zone() will call migrate_pages()
5) migrate_pages() will try to get a new page from compaction_alloc() to migrate to
6) if cc->freelist is empty, compaction_alloc() will call isolate_free_pagesblock()
7) isolate_free_pagesblock only checks for PageBuddy() to assume that a page is OK
   to be used to migrate to. Since HWPoisoned page are also PageBuddy, we add
   the page to the list. (same problem exists in fast_isolate_freepages()).

The outcome of that is that we end up happily handing poisoned pages in compaction_alloc,
so if we ever got a fault on that page through *_fault, we will return VM_FAULT_HWPOISON,
and the process will be killed.

I first though that I could get away with it by checking PageHWPoison in
{fast_isolate_freepages/isolate_free_pagesblock}, but not really.
It might be that the page we are checking is an order > 0 page, so the first page
might not be poisoned, but the one the follows might be, and we end up in the
same situation.

After some more thought, I really came to the conclusion that HWPoison pages should not
really be in the buddy allocator, as this is only asking for problems.
In this case it is only compaction code, but it could be happening somewhere else,
and one would expect that the pages you got from the buddy allocator are __ready__ to use.

I __think__ that we thought we were safe to put HWPoison pages in the buddy allocator as we
perform healthy checks when getting a page from there, so we skip poisoned pages

Of course, this is not the end of the story, now that someone got a page, if he frees it,
there is a high chance that this page ends up in a pcplist (I saw that).
Unless we are on CONFIG_VM_DEBUG, we do not check for the health of pages got from pcplist,
as we do when getting a page from the buddy allocator.

I checked [1], and it seems that [2] was going towards fixing this kind of issue.

I think it is about time to revamp the whole thing.

@Naoya: I could give it a try if you are busy.

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
[2] https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horiguchi@ah.jp.nec.com/

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-26 10:41 poisoned pages do not play well in the buddy allocator Oscar Salvador
@ 2019-08-26 11:14 ` Michal Hocko
  2019-08-27  1:34 ` Naoya Horiguchi
  1 sibling, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2019-08-26 11:14 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: n-horiguchi, mike.kravetz, linux-mm, linux-kernel, vbabka

On Mon 26-08-19 12:41:50, Oscar Salvador wrote:
[...]
> I checked [1], and it seems that [2] was going towards fixing this kind of issue.
> 
> I think it is about time to revamp the whole thing.

I completely agree. The current state of hwpoison is just too fragile to
be practically usable. We keep getting bug reports (as pointed out by
Oscar) when people try to test this via soft offlining. 

> @Naoya: I could give it a try if you are busy.

That would be more than appreciated. I feel guilty to have it slip
between cracks but I simply couldn't have found enough time to give it a
serious look. Sorry about that.

> [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horiguchi@ah.jp.nec.com/

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-26 10:41 poisoned pages do not play well in the buddy allocator Oscar Salvador
  2019-08-26 11:14 ` Michal Hocko
@ 2019-08-27  1:34 ` Naoya Horiguchi
  2019-08-27  7:28   ` Oscar Salvador
  1 sibling, 1 reply; 7+ messages in thread
From: Naoya Horiguchi @ 2019-08-27  1:34 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, vbabka

On Mon, Aug 26, 2019 at 12:41:50PM +0200, Oscar Salvador wrote:
> Hi,
> 
> When analyzing a problem reported by one of our customers, I stumbbled upon an issue
> that origins from the fact that poisoned pages end up in the buddy allocator.
> 
> Let me break down the stepts that lie to the problem:
> 
> 1) We soft-offline a page
> 2) Page gets flagged as HWPoison and is being sent to the buddy allocator.
>    This is done through set_hwpoison_free_buddy_page().
> 3) Kcompactd wakes up in order to perform some compaction.
> 4) compact_zone() will call migrate_pages()
> 5) migrate_pages() will try to get a new page from compaction_alloc() to migrate to
> 6) if cc->freelist is empty, compaction_alloc() will call isolate_free_pagesblock()
> 7) isolate_free_pagesblock only checks for PageBuddy() to assume that a page is OK
>    to be used to migrate to. Since HWPoisoned page are also PageBuddy, we add
>    the page to the list. (same problem exists in fast_isolate_freepages()).
> 
> The outcome of that is that we end up happily handing poisoned pages in compaction_alloc,
> so if we ever got a fault on that page through *_fault, we will return VM_FAULT_HWPOISON,
> and the process will be killed.
> 
> I first though that I could get away with it by checking PageHWPoison in
> {fast_isolate_freepages/isolate_free_pagesblock}, but not really.
> It might be that the page we are checking is an order > 0 page, so the first page
> might not be poisoned, but the one the follows might be, and we end up in the
> same situation.

Yes, this is a whole point of the current implementation.

> 
> After some more thought, I really came to the conclusion that HWPoison pages should not
> really be in the buddy allocator, as this is only asking for problems.
> In this case it is only compaction code, but it could be happening somewhere else,
> and one would expect that the pages you got from the buddy allocator are __ready__ to use.
> 
> I __think__ that we thought we were safe to put HWPoison pages in the buddy allocator as we
> perform healthy checks when getting a page from there, so we skip poisoned pages
> 
> Of course, this is not the end of the story, now that someone got a page, if he frees it,
> there is a high chance that this page ends up in a pcplist (I saw that).
> Unless we are on CONFIG_VM_DEBUG, we do not check for the health of pages got from pcplist,
> as we do when getting a page from the buddy allocator.
> 
> I checked [1], and it seems that [2] was going towards fixing this kind of issue.
> 
> I think it is about time to revamp the whole thing.
> 
> @Naoya: I could give it a try if you are busy.

Thanks for raising hand. That's really wonderful. I think that the series [1] is not
merge yet but not rejected yet, so feel free to reuse/update/revamp it.

> 
> [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/1541746035-13408-9-git-send-email-n-horiguchi@ah.jp.nec.com/
> 
> -- 
> Oscar Salvador
> SUSE L3
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-27  1:34 ` Naoya Horiguchi
@ 2019-08-27  7:28   ` Oscar Salvador
  2019-08-30 10:45     ` Oscar Salvador
  0 siblings, 1 reply; 7+ messages in thread
From: Oscar Salvador @ 2019-08-27  7:28 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, vbabka

On Tue, Aug 27, 2019 at 01:34:29AM +0000, Naoya Horiguchi wrote:
> > @Naoya: I could give it a try if you are busy.
> 
> Thanks for raising hand. That's really wonderful. I think that the series [1] is not
> merge yet but not rejected yet, so feel free to reuse/update/revamp it.

I will continue pursuing this then :-).

Thanks Naoya!

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-27  7:28   ` Oscar Salvador
@ 2019-08-30 10:45     ` Oscar Salvador
  2019-08-30 12:36       ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Oscar Salvador @ 2019-08-30 10:45 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, vbabka

On Tue, Aug 27, 2019 at 09:28:13AM +0200, Oscar Salvador wrote:
> On Tue, Aug 27, 2019 at 01:34:29AM +0000, Naoya Horiguchi wrote:
> > > @Naoya: I could give it a try if you are busy.
> > 
> > Thanks for raising hand. That's really wonderful. I think that the series [1] is not
> > merge yet but not rejected yet, so feel free to reuse/update/revamp it.
> 
> I will continue pursuing this then :-).

I have started implementing a fix for this.
Right now I only performed tests on normal pages (non-hugetlb).

I took the approach of:

- Free page: remove it from the buddy allocator and set it as PageReserved|PageHWPoison.
- Used page: migrate it and do not release it (skip put_page in unmap_and_move for MR_MEMORY_FAILURE
	     reason). Set it as PageReserved|PageHWPoison.

The routine that handles this also sets the refcount of these pages to 1, so the unpoison
machinery will only have to check for PageHWPoison and to a put_page() to send
the page to the buddy allocator.

The Reserved bit is used because these pages will now __only__ be accessible through
pfn walkers, and pfn walkers should respect Reserved pages.
The PageHWPoison bit is used to remember that this page is poisoned, so the unpoison
machinery knows that it is valid to unpoison it.

It should also let us get rid of some if not all of the PageHWPoison() checks.

Overall, it seems to work as I no longer see the issue our customer and I faced.

My goal is to go further and publish that fix along with several
cleanups/refactors for the soft-offline machinery (hard-poison will come later),
as I strongly think we do really need to re-work that a bit, to make it more simple.

Since it will take a bit to have everything ready, I just wanted to
let you know.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-30 10:45     ` Oscar Salvador
@ 2019-08-30 12:36       ` Michal Hocko
  2019-08-30 13:21         ` Oscar Salvador
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2019-08-30 12:36 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Naoya Horiguchi, mike.kravetz, linux-mm, linux-kernel, vbabka

On Fri 30-08-19 12:45:35, Oscar Salvador wrote:
> On Tue, Aug 27, 2019 at 09:28:13AM +0200, Oscar Salvador wrote:
> > On Tue, Aug 27, 2019 at 01:34:29AM +0000, Naoya Horiguchi wrote:
> > > > @Naoya: I could give it a try if you are busy.
> > > 
> > > Thanks for raising hand. That's really wonderful. I think that the series [1] is not
> > > merge yet but not rejected yet, so feel free to reuse/update/revamp it.
> > 
> > I will continue pursuing this then :-).
> 
> I have started implementing a fix for this.
> Right now I only performed tests on normal pages (non-hugetlb).
> 
> I took the approach of:
> 
> - Free page: remove it from the buddy allocator and set it as PageReserved|PageHWPoison.
> - Used page: migrate it and do not release it (skip put_page in unmap_and_move for MR_MEMORY_FAILURE
> 	     reason). Set it as PageReserved|PageHWPoison.

But this will only cover mapped pages. What about page cache in general?
Any reason why this cannot be handled in __free_one_page and simply skip
the whole freeing of the HWPoisoned parts of the freed page (in case of
higher order).

> The routine that handles this also sets the refcount of these pages to 1, so the unpoison
> machinery will only have to check for PageHWPoison and to a put_page() to send
> the page to the buddy allocator.
> 
> The Reserved bit is used because these pages will now __only__ be accessible through
> pfn walkers, and pfn walkers should respect Reserved pages.
> The PageHWPoison bit is used to remember that this page is poisoned, so the unpoison
> machinery knows that it is valid to unpoison it.

Do we really need both bits? pfn walkers in general shouldn't handle
pages they do not know about.

> It should also let us get rid of some if not all of the PageHWPoison() checks.

Although this sounds really appealing. The kernel code appart from the
hwpoison subsystem shouldn't really care about hwpoison status. With
some few isolated exceptions of course.
 
> Overall, it seems to work as I no longer see the issue our customer and I faced.

\o/

> My goal is to go further and publish that fix along with several
> cleanups/refactors for the soft-offline machinery (hard-poison will come later),
> as I strongly think we do really need to re-work that a bit, to make it more simple.
> 
> Since it will take a bit to have everything ready, I just wanted to
> let you know.

Thanks for the heads up and the work! The plan sounds great to me.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: poisoned pages do not play well in the buddy allocator
  2019-08-30 12:36       ` Michal Hocko
@ 2019-08-30 13:21         ` Oscar Salvador
  0 siblings, 0 replies; 7+ messages in thread
From: Oscar Salvador @ 2019-08-30 13:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, mike.kravetz, linux-mm, linux-kernel, vbabka

On Fri, Aug 30, 2019 at 02:36:56PM +0200, Michal Hocko wrote:
> > - Free page: remove it from the buddy allocator and set it as PageReserved|PageHWPoison.
> > - Used page: migrate it and do not release it (skip put_page in unmap_and_move for MR_MEMORY_FAILURE
> > 	     reason). Set it as PageReserved|PageHWPoison.
> 
> But this will only cover mapped pages. What about page cache in general?
> Any reason why this cannot be handled in __free_one_page and simply skip
> the whole freeing of the HWPoisoned parts of the freed page (in case of
> higher order).

I forgot to mention that part.
pages that are in the page cache and are not mapped are being handled in
invalidate_inode_page:


	/*
         * Try to invalidate first. This should work for
         * non dirty unmapped page cache pages.
         */
        ret = invalidate_inode_page(page);

Once done, we free the page to the buddy (which is wrong).

My approach would be as we do when migrate a poisoned page, simply not release
the page, so it does not end up in the buddy and we have full control of it.

I am still playing with the way to go here in general, to see which approach
is better and more simple.
The implementation I have right works well, but it is true that we could explore
a way to

1) Set PageHWPoison bit on the page
1) Hook into the free routine and ignore any poisoned page

so the overall code could be easier.

I just want to see both codes in place and decide which one feels better.

> > The routine that handles this also sets the refcount of these pages to 1, so the unpoison
> > machinery will only have to check for PageHWPoison and to a put_page() to send
> > the page to the buddy allocator.
> > 
> > The Reserved bit is used because these pages will now __only__ be accessible through
> > pfn walkers, and pfn walkers should respect Reserved pages.
> > The PageHWPoison bit is used to remember that this page is poisoned, so the unpoison
> > machinery knows that it is valid to unpoison it.
> 
> Do we really need both bits? pfn walkers in general shouldn't handle
> pages they do not know about.

Well, I went for setting the Reserved bit to just be overprotective here,
like a way of "stay away from this page", and most of the pfn walkers
skip over Reserved pages as these are not meant to be touched for anyone
but the owner.
Setting the Poison bit is just for the unpoison routine, to check that
the page we are asked to unpoison, was really poisoned in the first place.

So, if that goes as planned, PageHWPoison check should only be neded in the
hwpoison code.
Maybe in the free routine if we decide to hook into that.

All in all, it is something that we will have to discuss at the time I will
send the RFC, as I am pretty sure there will be things to polish and change.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26 10:41 poisoned pages do not play well in the buddy allocator Oscar Salvador
2019-08-26 11:14 ` Michal Hocko
2019-08-27  1:34 ` Naoya Horiguchi
2019-08-27  7:28   ` Oscar Salvador
2019-08-30 10:45     ` Oscar Salvador
2019-08-30 12:36       ` Michal Hocko
2019-08-30 13:21         ` Oscar Salvador

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git