All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	virtualization@lists.linux-foundation.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Juergen Gross <jgross@suse.com>,
	Pavel Tatashin <pavel.tatashin@microsoft.com>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Oscar Salvador <osalvador@suse.de>,
	Pingfan Liu <kernelfans@gmail.com>, Qian Cai <cai@lca.pw>,
	Dan Williams <dan.j.williams@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Wei Yang <richardw.yang@linux.intel.com>,
	Alexander Potapenko <glider@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>,
	Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>
Subject: Re: [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
Date: Tue, 22 Oct 2019 14:23:26 +0200	[thread overview]
Message-ID: <20191022122326.GL9379@dhcp22.suse.cz> (raw)
In-Reply-To: <709d39aa-a7ba-97aa-e66b-e2fec2fdf3c4@redhat.com>

On Fri 18-10-19 14:35:06, David Hildenbrand wrote:
> On 18.10.19 13:20, Michal Hocko wrote:
> > On Fri 18-10-19 10:50:24, David Hildenbrand wrote:
> > > On 18.10.19 10:15, Michal Hocko wrote:
[...]
> > > > for that - MEM_GOING_OFFLINE notification. This sounds like a good place
> > > > for the driver to decide whether it is safe to let the page go or not.
> > > 
> > > As I explained, this is too late and fragile. I post again what I posted
> > > before with some further explanations
> > > 
> > > __offline_pages() works like this:
> > > 
> > > 1) start_isolate_page_range()
> > > -> offline pages with a reference count of one will be detected as
> > > unmovable -> offlining aborted. (see below on the memory isolation notifier)
> > 
> > I am assuming that has_unmovable_pages would skip over those pages. Your
> > patch already does that, no?
> 
> Yes, this works IFF the reference count is 0 (IOW, this patch). Not with a
> reference count of 1 (unless the pages are movable, like with balloon
> compaction).

I am pretty sure that has_unmovable_pages can special these pages
regardless of the reference count for the memory hotplug. We already do
that for HWPoison pages.
 
> Please note that we have other users that use PG_offline + refcount >= 1
> (HyperV balloon, XEN). We should not affect these users (IOW,
> has_unmovable_pages() has to stop right there if we see one of these pages).

OK, this is exactly what I was worried about. I can see why you might
want to go an easier way and rule those users out but wouldn't be it
actually more reasonable to explicitly request PageOffline users to
implement MEM_GOING_OFFLINE and prepare their offlined pages for the
offlining operation or fail right there if that is not possible.
If you fail right there during the isolation phase then there is no way
to allow the offlining to proceed from that context.
 
> > > 2) memory_notify(MEM_GOING_OFFLINE, &arg);
> > > -> Here, we could release all pages to the buddy, clearing PG_offline
> > > -> PF_offline must not be cleared so dumping tools will not touch
> > >     these pages. There is a time where pages are !PageBuddy() and
> > >     !PageOffline().
> > 
> > Well, this is fully under control of the driver, no? Reference count
> > shouldn't play any role here AFAIU.
> 
> Yes, this is more a PG_offline issue. The reference count is an issue of
> reaching this call :) If we want to go via the buddy:
> 
> 1. Clear PG_offline
> 2. Free page (gets set PG_buddy)
> 
> Between 1 and 2, a dumping tool could not exclude these pages and therefore
> try to read from these pages. That is an issue IFF we want to return the
> pages back to the buddy instead of doing what I propose here.

If the driver is going to free page to the allocator then it would have
to claim the page back and so it is usable again. If it cannot free it
then it would simply set the reference count to 0. It can even keep the
PG_offline if necessary although I have to admit I am not really sure it
is necessary.

> > > 3) scan_movable_pages() ...
> 
> Please note that when we don't put the pages back to the buddy and don't
> implement something like I have in this patch, we'll loop/fail here.
> Especially if we have pages with PG_offline + refcount >= 1 .

You should have your reference count 0 at this stage as it is after
MEM_GOING_OFFLINE.
 
> > MEM_CANCEL_OFFLINE could gain the reference back to balance the
> > MEM_GOING_OFFLINE step.
> 
> The pages are already unisolated and could be used by the buddy. But again,
> I think you have an idea that tries to avoid putting pages to the buddy.

Yeah, set_page_count(page, 0) if you do not want to release that page
from the notifier context to reflect that the page is ok to be offlined
with the rest.
 
[...]

> > explicit control via the reference count which is the standard way to
> > control the struct page life cycle.
> > 
> > Anyway hooking into __put_page (which tends to be a hot path with
> > something that is barely used on most systems) doesn't sound nice to me.
> > This is the whole point which made me think about the whole reference
> > count approach in the first place.
> 
> Again, the race I think that is possible
> 
> somebody: get_page_unless_zero(page)
> virtio_mem: page_ref_dec(pfn_to_page(pfn)
> somebody: put_page() -> straight to the buddy

Who is that somebody? I thought that it is only the owner/driver to have
a control over the page. Also the above is not possible as long as the
owner/driver keeps a reference to the PageOffline page throughout the
time it is marked that way.
 
[...]

> > > > If you can let the page go then just drop the reference count. The page
> > > > is isolated already by that time. If you cannot let it go for whatever
> > > > reason you can fail the offlining.
> > > 
> > > We do have one hack in current MM code, which is the memory isolation
> > > notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages()
> > > to skip over unmovable pages. But quite frankly, I rather want to get rid of
> > > that crap (something I am working on right now) than introduce new users.
> > > This stuff is racy as hell and for CMM, if memory offlining fails, the
> > > ballooned pages are suddenly part of the buddy. Fragile.
> > 
> > Could you be more specific please?
> 
> Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it:
> 
> cmm_memory_isolate_cb() -> cmm_count_pages(arg):
> - Memory Isolation notifier callback
> - Count how many pages in the range to be isolated are in the ballooon
> - This makes has_unmovable_pages() succeed. Pages can be isolated.
> 
> cmm_memory_cb -> cmm_mem_going_offline(arg):
> - Memory notifier (online/offline)
> - Release all pages in the range to the buddy
> 
> If offlining fails, the pages are now in the buddy, no longer in the
> balloon. MEM_CANCEL_ONLINE is too late, because the range is already
> unisolated again and the pages might be in use.
> 
> For CMM it might not be that bad, because it can actually "reloan" any
> pages. In contrast, virtio-mem cannot simply go ahead and reuse random
> memory in unplugged. Any access to these pages would be evil. Giving them
> back to the buddy is dangerous.

Thanks, I was not aware of that code. But from what I understood this is
an outright bug in this code because cmm_mem_going_offline releases
pages to the buddy allocator which is something that is not recoverable
on a later failure.

-- 
Michal Hocko
SUSE Labs

WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Pingfan Liu <kernelfans@gmail.com>,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Alexander Potapenko <glider@google.com>,
	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Yu Zhao <yuzhao@google.com>, Matthew Wilcox <willy@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	Pavel Tatashin <pavel.tatashin@microsoft.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>, Qian Cai <cai@lca.pw>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Oscar Salvador <osalvador@suse.de>,
	Juergen Gross <jgross@suse.com>,
	Yang Shi <yang.shi@linux.alibaba.c>
Subject: Re: [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
Date: Tue, 22 Oct 2019 14:23:26 +0200	[thread overview]
Message-ID: <20191022122326.GL9379@dhcp22.suse.cz> (raw)
In-Reply-To: <709d39aa-a7ba-97aa-e66b-e2fec2fdf3c4@redhat.com>

On Fri 18-10-19 14:35:06, David Hildenbrand wrote:
> On 18.10.19 13:20, Michal Hocko wrote:
> > On Fri 18-10-19 10:50:24, David Hildenbrand wrote:
> > > On 18.10.19 10:15, Michal Hocko wrote:
[...]
> > > > for that - MEM_GOING_OFFLINE notification. This sounds like a good place
> > > > for the driver to decide whether it is safe to let the page go or not.
> > > 
> > > As I explained, this is too late and fragile. I post again what I posted
> > > before with some further explanations
> > > 
> > > __offline_pages() works like this:
> > > 
> > > 1) start_isolate_page_range()
> > > -> offline pages with a reference count of one will be detected as
> > > unmovable -> offlining aborted. (see below on the memory isolation notifier)
> > 
> > I am assuming that has_unmovable_pages would skip over those pages. Your
> > patch already does that, no?
> 
> Yes, this works IFF the reference count is 0 (IOW, this patch). Not with a
> reference count of 1 (unless the pages are movable, like with balloon
> compaction).

I am pretty sure that has_unmovable_pages can special these pages
regardless of the reference count for the memory hotplug. We already do
that for HWPoison pages.
 
> Please note that we have other users that use PG_offline + refcount >= 1
> (HyperV balloon, XEN). We should not affect these users (IOW,
> has_unmovable_pages() has to stop right there if we see one of these pages).

OK, this is exactly what I was worried about. I can see why you might
want to go an easier way and rule those users out but wouldn't be it
actually more reasonable to explicitly request PageOffline users to
implement MEM_GOING_OFFLINE and prepare their offlined pages for the
offlining operation or fail right there if that is not possible.
If you fail right there during the isolation phase then there is no way
to allow the offlining to proceed from that context.
 
> > > 2) memory_notify(MEM_GOING_OFFLINE, &arg);
> > > -> Here, we could release all pages to the buddy, clearing PG_offline
> > > -> PF_offline must not be cleared so dumping tools will not touch
> > >     these pages. There is a time where pages are !PageBuddy() and
> > >     !PageOffline().
> > 
> > Well, this is fully under control of the driver, no? Reference count
> > shouldn't play any role here AFAIU.
> 
> Yes, this is more a PG_offline issue. The reference count is an issue of
> reaching this call :) If we want to go via the buddy:
> 
> 1. Clear PG_offline
> 2. Free page (gets set PG_buddy)
> 
> Between 1 and 2, a dumping tool could not exclude these pages and therefore
> try to read from these pages. That is an issue IFF we want to return the
> pages back to the buddy instead of doing what I propose here.

If the driver is going to free page to the allocator then it would have
to claim the page back and so it is usable again. If it cannot free it
then it would simply set the reference count to 0. It can even keep the
PG_offline if necessary although I have to admit I am not really sure it
is necessary.

> > > 3) scan_movable_pages() ...
> 
> Please note that when we don't put the pages back to the buddy and don't
> implement something like I have in this patch, we'll loop/fail here.
> Especially if we have pages with PG_offline + refcount >= 1 .

You should have your reference count 0 at this stage as it is after
MEM_GOING_OFFLINE.
 
> > MEM_CANCEL_OFFLINE could gain the reference back to balance the
> > MEM_GOING_OFFLINE step.
> 
> The pages are already unisolated and could be used by the buddy. But again,
> I think you have an idea that tries to avoid putting pages to the buddy.

Yeah, set_page_count(page, 0) if you do not want to release that page
from the notifier context to reflect that the page is ok to be offlined
with the rest.
 
[...]

> > explicit control via the reference count which is the standard way to
> > control the struct page life cycle.
> > 
> > Anyway hooking into __put_page (which tends to be a hot path with
> > something that is barely used on most systems) doesn't sound nice to me.
> > This is the whole point which made me think about the whole reference
> > count approach in the first place.
> 
> Again, the race I think that is possible
> 
> somebody: get_page_unless_zero(page)
> virtio_mem: page_ref_dec(pfn_to_page(pfn)
> somebody: put_page() -> straight to the buddy

Who is that somebody? I thought that it is only the owner/driver to have
a control over the page. Also the above is not possible as long as the
owner/driver keeps a reference to the PageOffline page throughout the
time it is marked that way.
 
[...]

> > > > If you can let the page go then just drop the reference count. The page
> > > > is isolated already by that time. If you cannot let it go for whatever
> > > > reason you can fail the offlining.
> > > 
> > > We do have one hack in current MM code, which is the memory isolation
> > > notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages()
> > > to skip over unmovable pages. But quite frankly, I rather want to get rid of
> > > that crap (something I am working on right now) than introduce new users.
> > > This stuff is racy as hell and for CMM, if memory offlining fails, the
> > > ballooned pages are suddenly part of the buddy. Fragile.
> > 
> > Could you be more specific please?
> 
> Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it:
> 
> cmm_memory_isolate_cb() -> cmm_count_pages(arg):
> - Memory Isolation notifier callback
> - Count how many pages in the range to be isolated are in the ballooon
> - This makes has_unmovable_pages() succeed. Pages can be isolated.
> 
> cmm_memory_cb -> cmm_mem_going_offline(arg):
> - Memory notifier (online/offline)
> - Release all pages in the range to the buddy
> 
> If offlining fails, the pages are now in the buddy, no longer in the
> balloon. MEM_CANCEL_ONLINE is too late, because the range is already
> unisolated again and the pages might be in use.
> 
> For CMM it might not be that bad, because it can actually "reloan" any
> pages. In contrast, virtio-mem cannot simply go ahead and reuse random
> memory in unplugged. Any access to these pages would be evil. Giving them
> back to the buddy is dangerous.

Thanks, I was not aware of that code. But from what I understood this is
an outright bug in this code because cmm_mem_going_offline releases
pages to the buddy allocator which is something that is not recoverable
on a later failure.

-- 
Michal Hocko
SUSE Labs

  reply	other threads:[~2019-10-22 12:23 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-19 14:22 [PATCH RFC v3 0/9] virtio-mem: paravirtualized memory David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 1/9] ACPI: NUMA: export pxm_to_node David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-09-23 10:13   ` David Hildenbrand
2019-09-23 10:36     ` Michal Hocko
2019-09-23 10:36       ` Michal Hocko
2019-09-23 10:39       ` David Hildenbrand
2019-09-23 10:39       ` David Hildenbrand
2019-09-23 10:13   ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 2/9] virtio-mem: Paravirtualized memory hotplug David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 3/9] virtio-mem: Paravirtualized memory hotunplug part 1 David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 4/9] mm: Export alloc_contig_range() / free_contig_range() David Hildenbrand
2019-10-16 11:20   ` Michal Hocko
2019-10-16 12:31     ` David Hildenbrand
2019-10-16 12:31     ` David Hildenbrand
2019-10-16 11:20   ` Michal Hocko
2019-09-19 14:22 ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 5/9] virtio-mem: Paravirtualized memory hotunplug part 2 David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0 David Hildenbrand
2019-09-19 14:22   ` David Hildenbrand
2019-10-16 11:43   ` Michal Hocko
2019-10-16 12:50     ` David Hildenbrand
2019-10-16 12:50     ` David Hildenbrand
2019-10-16 13:45       ` Michal Hocko
2019-10-16 13:55         ` David Hildenbrand
2019-10-16 13:55         ` David Hildenbrand
2019-10-16 14:09           ` Michal Hocko
2019-10-16 14:16             ` David Hildenbrand
2019-10-16 14:16             ` David Hildenbrand
2019-10-16 14:09           ` Michal Hocko
2019-10-16 13:59         ` David Hildenbrand
2019-10-16 13:59           ` David Hildenbrand
2019-10-16 13:45       ` Michal Hocko
2019-10-16 13:45     ` David Hildenbrand
2019-10-16 13:45     ` David Hildenbrand
2019-10-16 14:03       ` Michal Hocko
2019-10-16 14:14         ` David Hildenbrand
2019-10-16 14:14           ` David Hildenbrand
2019-10-18  8:15           ` Michal Hocko
2019-10-18  8:50             ` David Hildenbrand
2019-10-18  8:50             ` David Hildenbrand
2019-10-18 11:20               ` Michal Hocko
2019-10-18 11:20               ` Michal Hocko
2019-10-18 12:35                 ` David Hildenbrand
2019-10-18 12:35                   ` David Hildenbrand
2019-10-22 12:23                   ` Michal Hocko [this message]
2019-10-22 12:23                     ` Michal Hocko
2019-10-22 14:02                     ` David Hildenbrand
2019-10-22 14:02                       ` David Hildenbrand
2019-10-23  9:43                       ` Michal Hocko
2019-10-23  9:43                         ` Michal Hocko
2019-10-23 10:03                         ` David Hildenbrand
2019-10-23 10:03                         ` David Hildenbrand
2019-10-24  8:42                           ` Michal Hocko
2019-10-24  8:42                           ` Michal Hocko
2019-10-24  8:51                             ` David Hildenbrand
2019-10-24  8:51                               ` David Hildenbrand
2019-10-25 11:28                               ` [PATCH RFC] mm: Allow to offline unmovable PageOffline() pages if the driver agrees David Hildenbrand
2019-10-25 11:28                                 ` David Hildenbrand
2019-10-18  8:15           ` [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0 Michal Hocko
2019-10-16 14:03       ` Michal Hocko
2019-10-16 11:43   ` Michal Hocko
2019-09-19 14:22 ` [PATCH RFC v3 7/9] virtio-mem: Allow to offline partially unplugged memory blocks David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 8/9] mm/memory_hotplug: Introduce offline_and_remove_memory() David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-10-16 11:47   ` Michal Hocko
2019-10-16 11:47     ` Michal Hocko
2019-10-16 12:57     ` David Hildenbrand
2019-10-16 12:57       ` David Hildenbrand
2019-09-19 14:22 ` [PATCH RFC v3 9/9] virtio-mem: Offline and remove completely unplugged memory blocks David Hildenbrand
2019-09-19 14:22 ` David Hildenbrand
2019-10-16  8:12 ` [PATCH RFC v3 0/9] virtio-mem: paravirtualized memory David Hildenbrand
2019-10-16  8:12 ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191022122326.GL9379@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=anshuman.khandual@arm.com \
    --cc=anthony.yznaga@oracle.com \
    --cc=aryabinin@virtuozzo.com \
    --cc=cai@lca.pw \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=glider@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=ira.weiny@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=jgross@suse.com \
    --cc=kernelfans@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mchehab+samsung@kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=minchan@kernel.org \
    --cc=osalvador@suse.de \
    --cc=pavel.tatashin@microsoft.com \
    --cc=richardw.yang@linux.intel.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=sfr@canb.auug.org.au \
    --cc=vbabka@suse.cz \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=willy@infradead.org \
    --cc=yang.shi@linux.alibaba.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.