Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
	David Hildenbrand <david@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Sasha Levin <sashal@kernel.org>,
	Tyler Hicks <tyhicks@linux.microsoft.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	mike.kravetz@oracle.com, Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>,
	John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone
Date: Fri, 4 Dec 2020 16:52:33 -0400
Message-ID: <20201204205233.GF5487@ziepe.ca> (raw)
In-Reply-To: <87360lnxph.fsf@oracle.com>

On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe <jgg@ziepe.ca> writes:
> 
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >> 
> >> vfio_pin_pages_remote
> >>    vaddr_get_pfn
> >>         ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >> FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one pages at a time
> >
> > I don't know why vfio does this, it is why it so ridiculously slow at
> > least.
> 
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.

iommu restrictions are not related to with gup. vfio needs to get the
page list from the page tables as efficiently as possible, then you
break it up into what you want to feed into the IOMMU how the iommu
wants.

vfio must maintain a page list to call unpin_user_pages() anyhow, so
it makes alot of sense to assemble the page list up front, then do the
iommu, instead of trying to do both things page at a time.

It would be smart to rebuild vfio to use scatter lists to store the
page list and then break the sgl into pages for iommu
configuration. SGLs will consume alot less memory for the usual case
of THPs backing the VFIO registrations.

ib_umem_get() has some example of how to code this, I've been thinking
we could make this some common API, and it could be further optimized.

> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.

Please don't just hack up vfio like this. Everyone needs faster gup,
we really need to solve this in the core code. Plus this is tricky,
vfio is already using follow_pfn wrongly, drivers should not be open
coding MM stuff.

> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

AFAIK there is no guarentee that just because you see a compound head
that the remaining pages in the page tables are actually the tail
pages. This is only true sometimes, for instance if an entire huge
page is placed in a page table level.

I belive Ralph pointed to some case where we might break a huge page
from PMD to PTEs then later COW one of the PTEs. In this case the
compound head will be visible but the page map will be non-contiguous
and the page flags on each 4k entry will be different.

Only GUP's page walkers know that the compound page is actually at a
PMD level and can safely apply the 'everything is the same'
optimization.

The solution here is to make core gup faster, espcially for the cases
where it is returning huge pages. We can approach this by:
 - Batching the compound & tail page acquisition for higher page
   levels, eg gup fast does this already, look at record_subpages()
   gup slow needs it too
 - Batching unpin for compound & tail page, the opposite of the 'refs'
   arg for try_grab_compound_head()
 - Devise some API where get_user_pages can directly return
   contiguous groups of pages to avoid memory traffic
 - Reduce the cost of a FOLL_LONGTERM pin eg here is a start:
    https://lore.kernel.org/linux-mm/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com
   And CMA should get some similar treatment. Scanning the output page
   list multiple times is slow.

I would like to get to a point where the main GUP walker functions can
output in more formats than just page array. For instance directly
constructing and chaining a biovec or sgl would dramatically improve
perfomance and decrease memory consumption. Being able to write in
hmm_range_fault's pfn&flags output format would delete a whole bunch
of duplicated code.

Jason


  parent reply index

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-02  5:23 [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
2020-12-02  5:23 ` [PATCH 1/6] mm/gup: perform check_dax_vmas only when FS_DAX is enabled Pavel Tatashin
2020-12-02 16:22   ` Ira Weiny
2020-12-02 18:15     ` Pavel Tatashin
2020-12-02 16:29   ` Jason Gunthorpe
2020-12-02 18:16     ` Pavel Tatashin
2020-12-03  7:59   ` John Hubbard
2020-12-03 14:52     ` Pavel Tatashin
2020-12-02  5:23 ` [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin
2020-12-02 16:31   ` David Hildenbrand
2020-12-02 18:17     ` Pavel Tatashin
2020-12-03  8:01   ` John Hubbard
2020-12-03  8:46   ` Michal Hocko
2020-12-03 14:58     ` Pavel Tatashin
2020-12-02  5:23 ` [PATCH 3/6] mm/gup: make __gup_longterm_locked common Pavel Tatashin
2020-12-02 16:31   ` Ira Weiny
2020-12-02 16:33     ` Ira Weiny
2020-12-02 18:19       ` Pavel Tatashin
2020-12-03  0:03         ` Pavel Tatashin
2020-12-03  8:03   ` John Hubbard
2020-12-03 15:02     ` Pavel Tatashin
2020-12-02  5:23 ` [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE Pavel Tatashin
2020-12-03  8:04   ` John Hubbard
2020-12-03 15:02     ` Pavel Tatashin
2020-12-03  8:57   ` Michal Hocko
2020-12-03 15:02     ` Pavel Tatashin
2020-12-02  5:23 ` [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations Pavel Tatashin
2020-12-03  8:17   ` John Hubbard
2020-12-03 15:06     ` Pavel Tatashin
2020-12-03 16:51       ` John Hubbard
2020-12-03  9:17   ` Michal Hocko
2020-12-03 15:15     ` Pavel Tatashin
2020-12-04  8:43       ` Michal Hocko
2020-12-04  8:54         ` Michal Hocko
2020-12-04 16:07           ` Pavel Tatashin
2020-12-02  5:23 ` [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone Pavel Tatashin
2020-12-02 16:35   ` Jason Gunthorpe
2020-12-03  0:19     ` Pavel Tatashin
2020-12-03  1:08       ` Jason Gunthorpe
2020-12-03  1:34         ` Pavel Tatashin
2020-12-03 14:17           ` Jason Gunthorpe
2020-12-03 16:40             ` Pavel Tatashin
2020-12-03 16:59               ` Jason Gunthorpe
2020-12-03 17:14                 ` Pavel Tatashin
2020-12-03 19:15                   ` Pavel Tatashin
2020-12-03 19:36                     ` Jason Gunthorpe
2020-12-04 16:24                       ` Pavel Tatashin
2020-12-04 17:06                         ` Jason Gunthorpe
2020-12-04 20:05             ` Daniel Jordan
2020-12-04 20:16               ` Pavel Tatashin
2020-12-08  2:27                 ` Daniel Jordan
2020-12-04 20:52               ` Jason Gunthorpe [this message]
2020-12-08  2:48                 ` Daniel Jordan
2020-12-08 13:24                   ` Jason Gunthorpe
2020-12-03  8:22   ` John Hubbard
2020-12-03 15:55     ` Pavel Tatashin
2020-12-04  4:13   ` Joonsoo Kim
2020-12-04 17:43     ` Pavel Tatashin
2020-12-07  7:13       ` Joonsoo Kim
2020-12-04  4:02 ` [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE Joonsoo Kim
2020-12-04 15:55   ` Pavel Tatashin
2020-12-04 16:10     ` Jason Gunthorpe
2020-12-04 17:50       ` Pavel Tatashin
2020-12-04 18:01         ` David Hildenbrand
2020-12-04 18:10           ` Pavel Tatashin
2020-12-07  7:12         ` Joonsoo Kim
2020-12-07 12:13           ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201204205233.GF5487@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=dan.j.williams@intel.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=david@redhat.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=mike.kravetz@oracle.com \
    --cc=mingo@redhat.com \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterz@infradead.org \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=sashal@kernel.org \
    --cc=tyhicks@linux.microsoft.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git