From: Jason Gunthorpe <jgg@ziepe.ca> To: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>, Alex Williamson <alex.williamson@redhat.com>, LKML <linux-kernel@vger.kernel.org>, linux-mm <linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>, Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>, David Hildenbrand <david@redhat.com>, Oscar Salvador <osalvador@suse.de>, Dan Williams <dan.j.williams@intel.com>, Sasha Levin <sashal@kernel.org>, Tyler Hicks <tyhicks@linux.microsoft.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, mike.kravetz@oracle.com, Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Mel Gorman <mgorman@suse.de>, Matthew Wilcox <willy@infradead.org>, David Rientjes <rientjes@google.com>, John Hubbard <jhubbard@nvidia.com> Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone Date: Fri, 4 Dec 2020 16:52:33 -0400 [thread overview] Message-ID: <20201204205233.GF5487@ziepe.ca> (raw) In-Reply-To: <87360lnxph.fsf@oracle.com> On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote: > Jason Gunthorpe <jgg@ziepe.ca> writes: > > > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote: > >> What I meant is the users of the interface do it incrementally not in > >> large chunks. For example: > >> > >> vfio_pin_pages_remote > >> vaddr_get_pfn > >> ret = pin_user_pages_remote(mm, vaddr, 1, flags | > >> FOLL_LONGTERM, page, NULL, NULL); > >> 1 -> pin only one pages at a time > > > > I don't know why vfio does this, it is why it so ridiculously slow at > > least. > > Well Alex can correct me, but I went digging and a comment from the > first type1 vfio commit says the iommu API didn't promise to unmap > subpages of previous mappings, so doing page at a time gave flexibility > at the cost of inefficiency. iommu restrictions are not related to with gup. vfio needs to get the page list from the page tables as efficiently as possible, then you break it up into what you want to feed into the IOMMU how the iommu wants. vfio must maintain a page list to call unpin_user_pages() anyhow, so it makes alot of sense to assemble the page list up front, then do the iommu, instead of trying to do both things page at a time. It would be smart to rebuild vfio to use scatter lists to store the page list and then break the sgl into pages for iommu configuration. SGLs will consume alot less memory for the usual case of THPs backing the VFIO registrations. ib_umem_get() has some example of how to code this, I've been thinking we could make this some common API, and it could be further optimized. > Yesterday I tried optimizing vfio to skip gup calls for tail pages after > Matthew pointed out this same issue to me by coincidence last week. Please don't just hack up vfio like this. Everyone needs faster gup, we really need to solve this in the core code. Plus this is tricky, vfio is already using follow_pfn wrongly, drivers should not be open coding MM stuff. > Currently debugging, but if there's a fundamental reason this won't work > on the vfio side, it'd be nice to know. AFAIK there is no guarentee that just because you see a compound head that the remaining pages in the page tables are actually the tail pages. This is only true sometimes, for instance if an entire huge page is placed in a page table level. I belive Ralph pointed to some case where we might break a huge page from PMD to PTEs then later COW one of the PTEs. In this case the compound head will be visible but the page map will be non-contiguous and the page flags on each 4k entry will be different. Only GUP's page walkers know that the compound page is actually at a PMD level and can safely apply the 'everything is the same' optimization. The solution here is to make core gup faster, espcially for the cases where it is returning huge pages. We can approach this by: - Batching the compound & tail page acquisition for higher page levels, eg gup fast does this already, look at record_subpages() gup slow needs it too - Batching unpin for compound & tail page, the opposite of the 'refs' arg for try_grab_compound_head() - Devise some API where get_user_pages can directly return contiguous groups of pages to avoid memory traffic - Reduce the cost of a FOLL_LONGTERM pin eg here is a start: https://lore.kernel.org/linux-mm/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com And CMA should get some similar treatment. Scanning the output page list multiple times is slow. I would like to get to a point where the main GUP walker functions can output in more formats than just page array. For instance directly constructing and chaining a biovec or sgl would dramatically improve perfomance and decrease memory consumption. Being able to write in hmm_range_fault's pfn&flags output format would delete a whole bunch of duplicated code. Jason
next prev parent reply other threads:[~2020-12-04 20:52 UTC|newest] Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-12-02 5:23 [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin 2020-12-02 5:23 ` [PATCH 1/6] mm/gup: perform check_dax_vmas only when FS_DAX is enabled Pavel Tatashin 2020-12-02 16:22 ` Ira Weiny 2020-12-02 18:15 ` Pavel Tatashin 2020-12-02 16:29 ` Jason Gunthorpe 2020-12-02 18:16 ` Pavel Tatashin 2020-12-03 7:59 ` John Hubbard 2020-12-03 14:52 ` Pavel Tatashin 2020-12-02 5:23 ` [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin 2020-12-02 16:31 ` David Hildenbrand 2020-12-02 18:17 ` Pavel Tatashin 2020-12-03 8:01 ` John Hubbard 2020-12-03 8:46 ` Michal Hocko 2020-12-03 14:58 ` Pavel Tatashin 2020-12-02 5:23 ` [PATCH 3/6] mm/gup: make __gup_longterm_locked common Pavel Tatashin 2020-12-02 16:31 ` Ira Weiny 2020-12-02 16:33 ` Ira Weiny 2020-12-02 18:19 ` Pavel Tatashin 2020-12-03 0:03 ` Pavel Tatashin 2020-12-03 8:03 ` John Hubbard 2020-12-03 15:02 ` Pavel Tatashin 2020-12-02 5:23 ` [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE Pavel Tatashin 2020-12-03 8:04 ` John Hubbard 2020-12-03 15:02 ` Pavel Tatashin 2020-12-03 8:57 ` Michal Hocko 2020-12-03 15:02 ` Pavel Tatashin 2020-12-02 5:23 ` [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations Pavel Tatashin 2020-12-03 8:17 ` John Hubbard 2020-12-03 15:06 ` Pavel Tatashin 2020-12-03 16:51 ` John Hubbard 2020-12-03 9:17 ` Michal Hocko 2020-12-03 15:15 ` Pavel Tatashin 2020-12-04 8:43 ` Michal Hocko 2020-12-04 8:54 ` Michal Hocko 2020-12-04 16:07 ` Pavel Tatashin 2020-12-02 5:23 ` [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone Pavel Tatashin 2020-12-02 16:35 ` Jason Gunthorpe 2020-12-03 0:19 ` Pavel Tatashin 2020-12-03 1:08 ` Jason Gunthorpe 2020-12-03 1:34 ` Pavel Tatashin 2020-12-03 14:17 ` Jason Gunthorpe 2020-12-03 16:40 ` Pavel Tatashin 2020-12-03 16:59 ` Jason Gunthorpe 2020-12-03 17:14 ` Pavel Tatashin 2020-12-03 19:15 ` Pavel Tatashin 2020-12-03 19:36 ` Jason Gunthorpe 2020-12-04 16:24 ` Pavel Tatashin 2020-12-04 17:06 ` Jason Gunthorpe 2020-12-04 20:05 ` Daniel Jordan 2020-12-04 20:16 ` Pavel Tatashin 2020-12-08 2:27 ` Daniel Jordan 2020-12-04 20:52 ` Jason Gunthorpe [this message] 2020-12-08 2:48 ` Daniel Jordan 2020-12-08 13:24 ` Jason Gunthorpe 2020-12-03 8:22 ` John Hubbard 2020-12-03 15:55 ` Pavel Tatashin 2020-12-04 4:13 ` Joonsoo Kim 2020-12-04 17:43 ` Pavel Tatashin 2020-12-07 7:13 ` Joonsoo Kim 2020-12-04 4:02 ` [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE Joonsoo Kim 2020-12-04 15:55 ` Pavel Tatashin 2020-12-04 16:10 ` Jason Gunthorpe 2020-12-04 17:50 ` Pavel Tatashin 2020-12-04 18:01 ` David Hildenbrand 2020-12-04 18:10 ` Pavel Tatashin 2020-12-07 7:12 ` Joonsoo Kim 2020-12-07 12:13 ` Michal Hocko
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20201204205233.GF5487@ziepe.ca \ --to=jgg@ziepe.ca \ --cc=akpm@linux-foundation.org \ --cc=alex.williamson@redhat.com \ --cc=dan.j.williams@intel.com \ --cc=daniel.m.jordan@oracle.com \ --cc=david@redhat.com \ --cc=iamjoonsoo.kim@lge.com \ --cc=jhubbard@nvidia.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mgorman@suse.de \ --cc=mhocko@suse.com \ --cc=mike.kravetz@oracle.com \ --cc=mingo@redhat.com \ --cc=osalvador@suse.de \ --cc=pasha.tatashin@soleen.com \ --cc=peterz@infradead.org \ --cc=rientjes@google.com \ --cc=rostedt@goodmis.org \ --cc=sashal@kernel.org \ --cc=tyhicks@linux.microsoft.com \ --cc=vbabka@suse.cz \ --cc=willy@infradead.org \ --subject='Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).