From: Michal Hocko <mhocko@suse.com>
To: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	LKML <linux-kernel@vger.kernel.org>,
	David Hildenbrand <david@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	Tyler Hicks <tyhicks@linux.microsoft.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	sthemmin@microsoft.com
Subject: Re: Pinning ZONE_MOVABLE pages
Date: Mon, 23 Nov 2020 10:01:29 +0100
Message-ID: <20201123090129.GD27488@dhcp22.suse.cz>
In-Reply-To: <CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com>

On Fri 20-11-20 15:27:46, Pavel Tatashin wrote:
> Recently, I encountered a hang that is happening during memory hot
> remove operation. It turns out that the hang is caused by pinned user
> pages in ZONE_MOVABLE.
> 
> Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> this is not the case if a user application, such as one using dpdk
> libraries, pinned them via vfio dma map.

Long-term or effectively time-unbounded pinning on the movable zone is
fundamentally broken. The sole reason for ZONE_MOVABLE's existence is to
guarantee migratability. If the consumer of this memory cannot guarantee
that, then it shouldn't use __GFP_MOVABLE in the first place.

> Kernel keeps trying to
> hot-remove them, but refcnt never gets to zero, so we are looping
> until the hardware watchdog kicks in.

Yeah, the existing offlining behavior doesn't stop trying because the
current implementation of the migration cannot tell the difference between
short-term and long-term failures. Maybe the recent ref count for long-term
pinning can be used to help out there.
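
As a rough illustration of that idea only: the following is a userspace
model, not kernel code, and every name in it (struct model_page,
model_migrate_page, model_offline_range) is made up for the sketch. It
assumes a long-term pin can be detected per page, retries transient
failures, and gives up with -EBUSY once it sees a long-term pin.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct model_page {
	int refcount;		/* short-term references that drain over time */
	bool longterm_pinned;	/* assumed to be derivable from a pin count */
};

/* One migration attempt for a single page. */
static int model_migrate_page(struct model_page *page)
{
	if (page->longterm_pinned)
		return -EBUSY;	/* waiting will not make this go away */
	if (page->refcount > 0) {
		page->refcount--;	/* model a short-term user finishing */
		return -EAGAIN;
	}
	return 0;
}

/* Offline a range: retry transient failures, bail out on long-term pins. */
static int model_offline_range(struct model_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		int ret;

		do {
			ret = model_migrate_page(&pages[i]);
		} while (ret == -EAGAIN);

		if (ret)
			return ret;
	}
	return 0;
}
```

With such a distinction the caller gets an error back for a pinned range
instead of looping forever.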

Anyway, I am wondering what you mean by the watchdog firing. The
operation should not trigger any of the soft lockup, hard lockup, or
hung task detectors.

> We cannot do dma unmaps before hot-remove, because hot-remove is a
> slow operation, and we have thousands of network flows handled by
> dpdk that we just cannot suspend for the duration of hot-remove
> operation.
> 
> The solution is for dpdk to allocate pages from a zone below
> ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> There is no user interface that we have that allows applications to
> select what zone the memory should come from.

Our existing interface is __GFP_MOVABLE. It is the responsibility of the
driver to know whether the resulting memory is migratable. Users
shouldn't even have to think about that.

> I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> the direction of using transparent huge pages instead of HugeTLBs,
> which means that we need to allow at least anonymous, and anonymous
> transparent huge pages to come from non-movable zones on demand.

You can migrate before pinning.

> Here is what I am proposing:
> 1. Add a new flag that is passed through pin_user_pages_* down to
> fault handlers, and allow the fault handler to allocate from a
> non-movable zone.

gup already tries to deal with long-term pins on CMA regions by migrating
the pages to a non-CMA region. Have a look at __gup_longterm_locked().
Migrating out of the movable zone sounds like a reasonable solution to me.
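
To make the migrate-before-pin idea concrete, here is a simplified
userspace model. It is not the actual __gup_longterm_locked() logic, which
differs in detail, and all names in it (struct lt_page, enum model_zone,
model_pin_longterm) are invented for the sketch. Pages found in the
movable zone are moved out first, and only then is the pin taken.

```c
#include <stddef.h>

/* Hypothetical zone tags for the model; the kernel has more zones. */
enum model_zone { MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE };

/* Hypothetical stand-in for struct page. */
struct lt_page {
	enum model_zone zone;
	int pincount;
};

/*
 * Model of moving a page out of the movable zone. A real migration
 * would allocate a new page elsewhere and copy the contents.
 */
static void model_migrate_out_of_movable(struct lt_page *page)
{
	page->zone = MODEL_ZONE_NORMAL;
}

/*
 * Take a long-term pin on every page, migrating movable-zone pages
 * beforehand. Returns the number of pages that had to be migrated.
 */
static size_t model_pin_longterm(struct lt_page *pages, size_t nr)
{
	size_t migrated = 0;

	for (size_t i = 0; i < nr; i++) {
		if (pages[i].zone == MODEL_ZONE_MOVABLE) {
			model_migrate_out_of_movable(&pages[i]);
			migrated++;
		}
	}

	for (size_t i = 0; i < nr; i++)
		pages[i].pincount++;

	return migrated;
}
```

The migration cost is paid once at pin time, and afterwards no pinned page
remains in the zone that hot-remove depends on.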

> 2. Add an internal move_pages_zone() similar to move_pages() syscall
> but instead of migrating to a different NUMA node, migrate pages from
> ZONE_MOVABLE to another zone.
> Call move_pages_zone() on demand prior to pinning pages from
> vfio_pin_map_dma() for instance.

Why is the existing migration API insufficient?

> 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> pages from non-movable zone. When a user application knows that it
> will do DMA mapping, and pin pages for a long time, the memory that it
> allocates should never be migrated or hot-removed, so make sure that
> it comes from the appropriate place.
> The benefit of adding madvise() flag is that we won't have to deal
> with slow page migration during pin time, but the disadvantage is that
> we would need to change the user interface.

No, ZONE_MOVABLE, like the other zone types, is an internal implementation
detail of the MM. I do not think we want to expose that to userspace
and carve it in stone.

-- 
Michal Hocko
SUSE Labs
