From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 7 Dec 2018 14:40:09 -0500
From: Jerome Glisse
To: Dan Williams
Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM,
 Jan Kara, tom@talpey.com, Al Viro, benve@cisco.com, Christoph Hellwig,
 Christopher Lameter, "Dalessandro, Dennis", Doug Ledford, Jason Gunthorpe,
 Michal Hocko, Mike Marciniszyn, rcampbell@nvidia.com,
 Linux Kernel Mailing List, linux-fsdevel
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20181207194009.GF3293@redhat.com>
References: <3c91d335-921c-4704-d159-2975ff3a5f20@nvidia.com>
 <20181205011519.GV10377@bombadil.infradead.org>
 <20181205014441.GA3045@redhat.com>
 <59ca5c4b-fd5b-1fc6-f891-c7986d91908e@nvidia.com>
 <7b4733be-13d3-c790-ff1b-ac51b505e9a6@nvidia.com>
 <20181207191620.GD3293@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:

On Fri, Dec 07, 2018 at 11:26:34AM -0800, Dan Williams wrote:
> On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse wrote:
> >
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > >>>>> does this proposal interact with those?
> > > >>>>
> > > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > > >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > > >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > > >>>>
> > > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > > >>>> LRU field approach is unusable.
> > > >>>
> > > >>> We just need to rearrange ZONE_DEVICE pages. Please excuse the whitespace
> > > >>> damage:
> > > >>>
> > > >>> +++ b/include/linux/mm_types.h
> > > >>> @@ -151,10 +151,12 @@ struct page {
> > > >>>  #endif
> > > >>>                 };
> > > >>>                 struct {        /* ZONE_DEVICE pages */
> > > >>> +                       unsigned long _zd_pad_2;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_3;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                         /** @pgmap: Points to the hosting device page map. */
> > > >>>                         struct dev_pagemap *pgmap;
> > > >>>                         unsigned long hmm_data;
> > > >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                 };
> > > >>>
> > > >>>  /** @rcu_head: You can use this to free a page by RCU. */
> > > >>>
> > > >>> You don't use page->private or page->index, do you Dan?
> > > >>
> > > >> page->private and page->index are used by HMM DEVICE pages.
> > > >>
> > > >
> > > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for
> > > > dma-pinned information. Which might work. To recap, we need:
> > > >
> > > > -- 1 bit for PageDmaPinned
> > > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > > -- N bits for a reference count
> > > >
> > > > Those *could* be packed into a single 64-bit field, if really necessary.
> > > >
> > >
> > > ...actually, this needs to work on 32-bit systems as well. And HMM is using a lot.
> > > However, it is still possible for this to work.
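
A minimal sketch of that packing (all names here -- dma_pinned_state, DMA_PINNED_* --
are hypothetical, just to show that two flag bits plus a counter fit in a single
unsigned long and therefore also work on 32-bit; for ZONE_DEVICE pages the WAS_LRU
bit would simply stay unused):

#define DMA_PINNED_FLAG         (1UL << 0)      /* PageDmaPinned */
#define DMA_PINNED_WAS_LRU      (1UL << 1)      /* only needed for LRU-backed pages */
#define DMA_PINNED_COUNT_SHIFT  2
#define DMA_PINNED_COUNT_MASK   (~0UL << DMA_PINNED_COUNT_SHIFT)

/* Extract the pin count from the packed word. */
static inline unsigned long dma_pinned_count(unsigned long state)
{
        return (state & DMA_PINNED_COUNT_MASK) >> DMA_PINNED_COUNT_SHIFT;
}

/* Return the packed word after taking one more pin. */
static inline unsigned long dma_pin_once(unsigned long state)
{
        return (state + (1UL << DMA_PINNED_COUNT_SHIFT)) | DMA_PINNED_FLAG;
}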
> > >
> > > Matthew, can I have that bit now please? I'm about out of options, and now it will
> > > actually solve the problem here.
> > >
> > > Given:
> > >
> > > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> > > That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is
> > > required for that case.
> > >
> > > 2) There is an independent bit available (according to Matthew).
> > >
> > > 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a
> > > counter in that case.
> >
> > To expand on this, HMM private pages are used for anonymous pages, so the index and
> > mapping fields have the values you would expect for such pages. Down the road I also
> > want to support file-backed pages with HMM private (mapping, private, index).
> >
> > For HMM public, both anonymous and file-backed pages are supported today (HMM public
> > is only useful on platforms with something like OpenCAPI, CCIX or NVLink ... so
> > PowerPC for now).
> >
> > > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> >
> > get_user_pages() only needs to work with HMM public pages, not the private ones, as
> > we cannot allow _anyone_ to pin HMM private pages.
>
> How does HMM enforce that? Because the kernel should not allow *any*
> memory management facility to arbitrarily fail direct-I/O operations.
> That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
> for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
> any arch that plans to properly support DAX. I would classify any
> memory management that can't support direct-I/O in the same
> "experimental" category.

It does not fail direct-I/O. GUP sees a swap entry for the private memory and behaves
just as if the page had been swapped to disk, so I am not introducing any new behavior
(rough sketch of that flow below). With HMM pages everything just works the way you
would expect from the CPU's point of view. It is just like swap.

Cheers,
Jérôme
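
Rough sketch of that GUP flow (illustrative only -- gup_one_pte() is a made-up helper,
error handling is omitted, and this is not the actual gup.c code): the page-table walker
finds a non-present device-private swap entry, falls back to the page-fault path exactly
as it would for a page swapped to disk, and the fault migrates the data back to a regular
page that can then be pinned on retry.

static struct page *gup_one_pte(struct vm_area_struct *vma,
                                unsigned long addr, pte_t pte)
{
        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                /*
                 * HMM private memory is represented by a device-private
                 * swap entry: treat it like any swapped-out page and let
                 * the fault handler migrate it back to system memory.
                 */
                if (is_device_private_entry(entry)) {
                        handle_mm_fault(vma, addr, FAULT_FLAG_ALLOW_RETRY);
                        return NULL;    /* caller retries the walk */
                }
                return NULL;            /* ordinary swap entry: also fault it in */
        }
        return pte_page(pte);           /* present page: pin it as usual */
}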