* [PATCH 0/3] mm/: 3 more put_user_page() conversions
From: john.hubbard @ 2019-08-05 22:20 UTC
To: Andrew Morton
Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi,

Here are a few more mm/ files that I wasn't ready to send with the larger
34-patch set.

John Hubbard (3):
  mm/mlock.c: convert put_page() to put_user_page*()
  mm/mempolicy.c: convert put_page() to put_user_page*()
  mm/ksm: convert put_page() to put_user_page*()

 mm/ksm.c       | 14 +++++++-------
 mm/mempolicy.c |  2 +-
 mm/mlock.c     |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

-- 
2.22.0
* [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-05 22:20 UTC
To: Andrew Morton
Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Daniel Black <daniel@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/mlock.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index a90099da4fb4..b980e6270e8a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -345,7 +345,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			get_page(page); /* for putback_lru_page() */
 			__munlock_isolated_page(page);
 			unlock_page(page);
-			put_page(page); /* from follow_page_mask() */
+			put_user_page(page); /* from follow_page_mask() */
 		}
 	}
 }
@@ -467,7 +467,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		if (page && !IS_ERR(page)) {
 			if (PageTransTail(page)) {
 				VM_BUG_ON_PAGE(PageMlocked(page), page);
-				put_page(page); /* follow_page_mask() */
+				put_user_page(page); /* follow_page_mask() */
 			} else if (PageTransHuge(page)) {
 				lock_page(page);
 				/*
@@ -478,7 +478,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				 */
 				page_mask = munlock_vma_page(page);
 				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
+				put_user_page(page); /* follow_page_mask() */
 			} else {
 				/*
 				 * Non-huge pages are handled in batches via
-- 
2.22.0
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: Michal Hocko @ 2019-08-07 11:01 UTC
To: john.hubbard
Cc: Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().

Hmm, this is an interesting code path. There seems to be a mix of pages
in the game. We get one page via follow_page_mask but then other pages
in the range are filled by __munlock_pagevec_fill and that does a direct
pte walk. Is using put_user_page correct in this case? Could you explain
why in the changelog?

> This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").

...

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-07 23:32 UTC
To: Michal Hocko, john.hubbard
Cc: Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On 8/7/19 4:01 AM, Michal Hocko wrote:
> On Mon 05-08-19 15:20:17, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> For pages that were retained via get_user_pages*(), release those pages
>> via the new put_user_page*() routines, instead of via put_page() or
>> release_pages().
>
> Hmm, this is an interesting code path. There seems to be a mix of pages
> in the game. We get one page via follow_page_mask but then other pages
> in the range are filled by __munlock_pagevec_fill and that does a direct
> pte walk. Is using put_user_page correct in this case? Could you explain
> why in the changelog?

Actually, I think follow_page_mask() gets all the pages, right? And the
get_page() in __munlock_pagevec_fill() is there to allow a
pagevec_release() later.

But I still think I might have missed an error case, because the
pvec_putback in __munlock_pagevec() is never doing put_user_page() on the
put-backed pages.

Let me sort through this one more time and maybe I'll need to actually
change the code. And either way, comments and changelog will need some
notes, agreed.

thanks,
-- 
John Hubbard
NVIDIA
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: Michal Hocko @ 2019-08-08 6:21 UTC
To: John Hubbard
Cc: john.hubbard, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On Wed 07-08-19 16:32:08, John Hubbard wrote:
> On 8/7/19 4:01 AM, Michal Hocko wrote:
>> Hmm, this is an interesting code path. There seems to be a mix of pages
>> in the game. We get one page via follow_page_mask but then other pages
>> in the range are filled by __munlock_pagevec_fill and that does a direct
>> pte walk. Is using put_user_page correct in this case? Could you explain
>> why in the changelog?
>
> Actually, I think follow_page_mask() gets all the pages, right? And the
> get_page() in __munlock_pagevec_fill() is there to allow a
> pagevec_release() later.

Maybe I am misreading the code (looking at Linus tree) but
munlock_vma_pages_range calls follow_page for the start address and then,
if not THP, tries to fill up the pagevec with a few more pages (up to end),
doing the shortcut via a manual pte walk as an optimization and using
generic get_page there.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: Vlastimil Babka @ 2019-08-08 11:09 UTC
To: Michal Hocko, John Hubbard
Cc: john.hubbard, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On 8/8/19 8:21 AM, Michal Hocko wrote:
> On Wed 07-08-19 16:32:08, John Hubbard wrote:
>> Actually, I think follow_page_mask() gets all the pages, right? And the
>> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release()
>> later.
>
> Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range
> calls follow_page for the start address and then if not THP tries to
> fill up the pagevec with few more pages (up to end), do the shortcut
> via manual pte walk as an optimization and use generic get_page there.

That's true. However, I'm not sure munlocking is where the
put_user_page() machinery is intended to be used anyway? These are
short-term pins for struct page manipulation, not e.g. dirtying of page
contents. Reading commit fc1d8e7cca2d I don't think this case falls
within the reasoning there. Perhaps not all GUP users should be
converted to the planned separate GUP tracking, and instead we should
have a GUP/follow_page_mask() variant that keeps using get_page/put_page?
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-08 19:20 UTC
To: Vlastimil Babka, Michal Hocko
Cc: Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On 8/8/19 4:09 AM, Vlastimil Babka wrote:
> On 8/8/19 8:21 AM, Michal Hocko wrote:
>> On Wed 07-08-19 16:32:08, John Hubbard wrote:
>>> Actually, I think follow_page_mask() gets all the pages, right? And the
>>> get_page() in __munlock_pagevec_fill() is there to allow a pagevec_release()
>>> later.
>>
>> Maybe I am misreading the code (looking at Linus tree) but munlock_vma_pages_range
>> calls follow_page for the start address and then if not THP tries to
>> fill up the pagevec with few more pages (up to end), do the shortcut
>> via manual pte walk as an optimization and use generic get_page there.

Yes, I see it finally, thanks. :)

> That's true. However, I'm not sure munlocking is where the
> put_user_page() machinery is intended to be used anyway? These are
> short-term pins for struct page manipulation, not e.g. dirtying of page
> contents. Reading commit fc1d8e7cca2d I don't think this case falls
> within the reasoning there. Perhaps not all GUP users should be
> converted to the planned separate GUP tracking, and instead we should
> have a GUP/follow_page_mask() variant that keeps using get_page/put_page?

Interesting. So far, the approach has been to get all the gup callers to
release via put_user_page(), but if we add in Jan's and Ira's
vaddr_pin_pages() wrapper, then maybe we could leave some sites unconverted.

However, in order to do so, we would have to change things so that we have
one set of APIs (gup) that do *not* increment a pin count, and another set
(vaddr_pin_pages) that do.

Is that where we want to go...?

I have a tracking patch that only deals with gup/pup. I could post it as an
RFC, but I think it might just muddy the waters at this point; anyway, it's
this one:

https://github.com/johnhubbard/linux/commit/a0fb73ce0a39c74f0d1fb6bd9d866f660f762eae

thanks,
-- 
John Hubbard
NVIDIA
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-08 22:59 UTC
To: Vlastimil Babka, Michal Hocko
Cc: Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On 8/8/19 12:20 PM, John Hubbard wrote:
> On 8/8/19 4:09 AM, Vlastimil Babka wrote:
>> That's true. However, I'm not sure munlocking is where the
>> put_user_page() machinery is intended to be used anyway? These are
>> short-term pins for struct page manipulation, not e.g. dirtying of page
>> contents. Reading commit fc1d8e7cca2d I don't think this case falls
>> within the reasoning there. Perhaps not all GUP users should be
>> converted to the planned separate GUP tracking, and instead we should
>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page?
>
> Interesting. So far, the approach has been to get all the gup callers to
> release via put_user_page(), but if we add in Jan's and Ira's
> vaddr_pin_pages() wrapper, then maybe we could leave some sites unconverted.
>
> However, in order to do so, we would have to change things so that we have
> one set of APIs (gup) that do *not* increment a pin count, and another set
> (vaddr_pin_pages) that do.
>
> Is that where we want to go...?

Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast()
instead of get_page(), and also fix the releasing code. So this incremental
patch, on top of the existing one, should do it:

diff --git a/mm/mlock.c b/mm/mlock.c
index b980e6270e8a..2ea272c6fee3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			/*
 			 * We won't be munlocking this page in the next phase
 			 * but we still need to release the follow_page_mask()
-			 * pin. We cannot do it under lru_lock however. If it's
-			 * the last pin, __page_cache_release() would deadlock.
+			 * pin.
 			 */
-			pagevec_add(&pvec_putback, pvec->pages[i]);
+			put_user_page(pages[i]);
 			pvec->pages[i] = NULL;
 		}
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
 	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 
-	/* Now we can release pins of pages that we are not munlocking */
-	pagevec_release(&pvec_putback);
-
 	/* Phase 2: page munlock */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
@@ -394,6 +390,8 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	start += PAGE_SIZE;
 	while (start < end) {
 		struct page *page = NULL;
+		int ret;
+
 		pte++;
 		if (pte_present(*pte))
 			page = vm_normal_page(vma, start, *pte);
@@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 		if (PageTransCompound(page))
 			break;
 
-		get_page(page);
+		/*
+		 * Use get_user_pages_fast(), instead of get_page() so that the
+		 * releasing code can unconditionally call put_user_page().
+		 */
+		ret = get_user_pages_fast(start, 1, 0, &page);
+		if (ret != 1)
+			break;
 		/*
 		 * Increase the address that will be returned *before* the
 		 * eventual break due to pvec becoming full by adding the page

thanks,
-- 
John Hubbard
NVIDIA
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: Ira Weiny @ 2019-08-08 23:41 UTC
To: John Hubbard
Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, Christoph Hellwig, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote:
> Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast()
> instead of get_page(), and also fix the releasing code. So this incremental
> patch, on top of the existing one, should do it:
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b980e6270e8a..2ea272c6fee3 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -318,18 +318,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  			/*
>  			 * We won't be munlocking this page in the next phase
>  			 * but we still need to release the follow_page_mask()
> -			 * pin. We cannot do it under lru_lock however. If it's
> -			 * the last pin, __page_cache_release() would deadlock.
> +			 * pin.
>  			 */
> -			pagevec_add(&pvec_putback, pvec->pages[i]);
> +			put_user_page(pages[i]);
>  			pvec->pages[i] = NULL;
>  		}
>  	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>  	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>
> -	/* Now we can release pins of pages that we are not munlocking */
> -	pagevec_release(&pvec_putback);
> -

I'm not an expert but this skips a call to lru_add_drain(). Is that ok?

> @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>  		if (PageTransCompound(page))
>  			break;
>
> -		get_page(page);
> +		/*
> +		 * Use get_user_pages_fast(), instead of get_page() so that the
> +		 * releasing code can unconditionally call put_user_page().
> +		 */
> +		ret = get_user_pages_fast(start, 1, 0, &page);
> +		if (ret != 1)
> +			break;

I like the idea of making this a get/put pair but I'm feeling uneasy about
how this is really supposed to work.

For sure the GUP/PUP was supposed to be separate from [get|put]_page.

Ira
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-08 23:57 UTC
To: Ira Weiny
Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, Christoph Hellwig, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz

On 8/8/19 4:41 PM, Ira Weiny wrote:
> On Thu, Aug 08, 2019 at 03:59:15PM -0700, John Hubbard wrote:
>> Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast()
>> instead of get_page(), and also fix the releasing code. So this
>> incremental patch, on top of the existing one, should do it:
...
>> -			pagevec_add(&pvec_putback, pvec->pages[i]);
>> +			put_user_page(pages[i]);

correction, make that:
	put_user_page(pvec->pages[i]);

(This is not fully tested yet.)

>> -	/* Now we can release pins of pages that we are not munlocking */
>> -	pagevec_release(&pvec_putback);
>> -
>
> I'm not an expert but this skips a call to lru_add_drain(). Is that ok?

Yes: unless I'm missing something, there is no reason to go through
lru_add_drain in this case. These are gup'd pages that are not going to get
any further processing.

>> -		get_page(page);
>> +		/*
>> +		 * Use get_user_pages_fast(), instead of get_page() so that the
>> +		 * releasing code can unconditionally call put_user_page().
>> +		 */
>> +		ret = get_user_pages_fast(start, 1, 0, &page);
>> +		if (ret != 1)
>> +			break;
>
> I like the idea of making this a get/put pair but I'm feeling uneasy about
> how this is really supposed to work.
>
> For sure the GUP/PUP was supposed to be separate from [get|put]_page.

Actually, they both take references on the page. And it is absolutely OK to
call them both on the same page.

But anyway, we're not mixing them up here. If you follow the code paths,
either gup or follow_page_mask() is used, and then put_user_page() releases.

So...you haven't actually pointed to a bug here, right? :)

thanks,
-- 
John Hubbard
NVIDIA
* RE: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*()
From: Weiny, Ira @ 2019-08-09 18:22 UTC
To: John Hubbard
Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, Christoph Hellwig, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Williams, Dan J, Daniel Black, Matthew Wilcox, Mike Kravetz

> On 8/8/19 4:41 PM, Ira Weiny wrote:
> > I like the idea of making this a get/put pair but I'm feeling uneasy
> > about how this is really supposed to work.
> >
> > For sure the GUP/PUP was supposed to be separate from [get|put]_page.
>
> Actually, they both take references on the page. And it is absolutely OK
> to call them both on the same page.
>
> But anyway, we're not mixing them up here. If you follow the code paths,
> either gup or follow_page_mask() is used, and then put_user_page()
> releases.
>
> So...you haven't actually pointed to a bug here, right? :)

No... no bug. Sorry, this was just a general comment on semantics.

But in keeping with the semantics discussion, it is further confusing that
follow_page_mask() is also mixed in here... which is where my comment was
driving toward. If you call GUP there should be a PUP.
Get_page/put_page... follow_page/unfollow_page... ??? ;-)

Ok, now I'm off the rails... but that was the point... I think Jan and
Michal are onto something here WRT internal vs. external interfaces.

Ira
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-08 22:59 ` John Hubbard 2019-08-08 23:41 ` Ira Weiny @ 2019-08-09 8:12 ` Vlastimil Babka 2019-08-09 8:23 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Vlastimil Babka @ 2019-08-09 8:12 UTC (permalink / raw) To: John Hubbard, Michal Hocko Cc: Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On 8/9/19 12:59 AM, John Hubbard wrote: >>> That's true. However, I'm not sure munlocking is where the >>> put_user_page() machinery is intended to be used anyway? These are >>> short-term pins for struct page manipulation, not e.g. dirtying of page >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls >>> within the reasoning there. Perhaps not all GUP users should be >>> converted to the planned separate GUP tracking, and instead we should >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? >>> >> >> Interesting. So far, the approach has been to get all the gup callers to >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() >> wrapper, then maybe we could leave some sites unconverted. >> >> However, in order to do so, we would have to change things so that we have >> one set of APIs (gup) that do *not* increment a pin count, and another set >> (vaddr_pin_pages) that do. >> >> Is that where we want to go...? >> We already have a FOLL_LONGTERM flag, isn't that somehow related? And if it's not exactly the same thing, perhaps a new gup flag to distinguish which kind of pinning to use? > Oh, and meanwhile, I'm leaning toward a cheap fix: just use gup_fast() instead > of get_page(), and also fix the releasing code. So this incremental patch, on > top of the existing one, should do it: > ... 
> @@ -411,7 +409,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, > if (PageTransCompound(page)) > break; > > - get_page(page); > + /* > + * Use get_user_pages_fast(), instead of get_page() so that the > + * releasing code can unconditionally call put_user_page(). > + */ > + ret = get_user_pages_fast(start, 1, 0, &page); Um the whole reason of __munlock_pagevec_fill() was to avoid the full page walk cost, which made a 14% difference, see 7a8010cd3627 ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") Replacing simple get_page() with page walk to satisfy API requirements seems rather suboptimal to me. > + if (ret != 1) > + break; > /* > * Increase the address that will be returned *before* the > * eventual break due to pvec becoming full by adding the page > > > thanks, > ^ permalink raw reply [flat|nested] 23+ messages in thread
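Vlastimil's cost argument can be made concrete with a toy model. This is illustrative userspace C only: the constants stand in for a 4-level page-table walk and the 512 PTEs covered by one PMD on x86-64, and the actual 14% figure came from the benchmarks cited in commit 7a8010cd3627, not from anything like this.

```c
#include <assert.h>

/* Toy cost model for the two lookup strategies in __munlock_pagevec_fill().
 * "Cost" here is just levels visited; it is a stand-in, not a benchmark. */
enum {
	LEVELS = 4,	/* pgd -> p4d -> pud -> pmd/pte, x86-64 style */
	NPAGES = 512	/* one PMD's worth of pte entries on x86-64 */
};

/* follow_page_mask()/gup-style: every page resolves from the top. */
static int full_walk_cost(int npages)
{
	return npages * LEVELS;
}

/* Manual pte walk (commit 7a8010cd3627): descend once, then step
 * through adjacent ptes linearly. */
static int manual_pte_walk_cost(int npages)
{
	return LEVELS + (npages - 1);
}
```

This is the asymmetry Vlastimil is pointing at: replacing a bare get_page() with a per-page gup call reintroduces the repeated top-down walks that the manual pte walk was added to avoid.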
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 8:12 ` Vlastimil Babka @ 2019-08-09 8:23 ` Michal Hocko 2019-08-09 9:05 ` John Hubbard 2019-08-09 13:58 ` Jan Kara 0 siblings, 2 replies; 23+ messages in thread From: Michal Hocko @ 2019-08-09 8:23 UTC (permalink / raw) To: John Hubbard Cc: Vlastimil Babka, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > On 8/9/19 12:59 AM, John Hubbard wrote: > >>> That's true. However, I'm not sure munlocking is where the > >>> put_user_page() machinery is intended to be used anyway? These are > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > >>> within the reasoning there. Perhaps not all GUP users should be > >>> converted to the planned separate GUP tracking, and instead we should > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > >>> > >> > >> Interesting. So far, the approach has been to get all the gup callers to > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > >> wrapper, then maybe we could leave some sites unconverted. > >> > >> However, in order to do so, we would have to change things so that we have > >> one set of APIs (gup) that do *not* increment a pin count, and another set > >> (vaddr_pin_pages) that do. > >> > >> Is that where we want to go...? > >> > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > it's not exactly the same thing, perhaps a new gup flag to distinguish > which kind of pinning to use? Agreed. This is a shiny example of how forcing all existing gup users into the new scheme is suboptimal at best. Not to mention the overall fragility mentioned elsewhere. I dislike the conversion even more now. 
Sorry if this was already discussed, but why is the new pinning not bound to FOLL_LONGTERM only (ideally hidden by an interface so that users do not have to care about the flag)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 8:23 ` Michal Hocko @ 2019-08-09 9:05 ` John Hubbard 2019-08-09 9:16 ` Michal Hocko 2019-08-09 13:58 ` Jan Kara 1 sibling, 1 reply; 23+ messages in thread From: John Hubbard @ 2019-08-09 9:05 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On 8/9/19 1:23 AM, Michal Hocko wrote: > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: >> On 8/9/19 12:59 AM, John Hubbard wrote: >>>>> That's true. However, I'm not sure munlocking is where the >>>>> put_user_page() machinery is intended to be used anyway? These are >>>>> short-term pins for struct page manipulation, not e.g. dirtying of page >>>>> contents. Reading commit fc1d8e7cca2d I don't think this case falls >>>>> within the reasoning there. Perhaps not all GUP users should be >>>>> converted to the planned separate GUP tracking, and instead we should >>>>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? >>>>> >>>> >>>> Interesting. So far, the approach has been to get all the gup callers to >>>> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() >>>> wrapper, then maybe we could leave some sites unconverted. >>>> >>>> However, in order to do so, we would have to change things so that we have >>>> one set of APIs (gup) that do *not* increment a pin count, and another set >>>> (vaddr_pin_pages) that do. >>>> >>>> Is that where we want to go...? >>>> >> >> We already have a FOLL_LONGTERM flag, isn't that somehow related? And if >> it's not exactly the same thing, perhaps a new gup flag to distinguish >> which kind of pinning to use? > > Agreed. This is a shiny example how forcing all existing gup users into > the new scheme is subotimal at best. Not the mention the overal > fragility mention elsewhere. 
I dislike the conversion even more now. > > Sorry if this was already discussed already but why the new pinning is > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > do not have to care about the flag) only? > Oh, it's been discussed alright, but given how some of the discussions have gone, I certainly am not surprised that there are still questions and criticisms! Especially since I may have misunderstood some of the points, along the way. It's been quite a merry go round. :) Anyway, what I'm hearing now is: for gup(FOLL_LONGTERM), apply the pinned tracking. And therefore only do put_user_page() on pages that were pinned with FOLL_LONGTERM. For short term pins, let the locking do what it will: things can briefly block and all will be well. Also, that may or may not come with a wrapper function, courtesy of Jan and Ira. Is that about right? It's late here, but I don't immediately recall any problems with doing it that way... thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 9:05 ` John Hubbard @ 2019-08-09 9:16 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2019-08-09 9:16 UTC (permalink / raw) To: John Hubbard Cc: Vlastimil Babka, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On Fri 09-08-19 02:05:15, John Hubbard wrote: > On 8/9/19 1:23 AM, Michal Hocko wrote: > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > > > > That's true. However, I'm not sure munlocking is where the > > > > > > put_user_page() machinery is intended to be used anyway? These are > > > > > > short-term pins for struct page manipulation, not e.g. dirtying of page > > > > > > contents. Reading commit fc1d8e7cca2d I don't think this case falls > > > > > > within the reasoning there. Perhaps not all GUP users should be > > > > > > converted to the planned separate GUP tracking, and instead we should > > > > > > have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > > > > > > > > > Interesting. So far, the approach has been to get all the gup callers to > > > > > release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > > > > wrapper, then maybe we could leave some sites unconverted. > > > > > > > > > > However, in order to do so, we would have to change things so that we have > > > > > one set of APIs (gup) that do *not* increment a pin count, and another set > > > > > (vaddr_pin_pages) that do. > > > > > > > > > > Is that where we want to go...? > > > > > > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > > which kind of pinning to use? > > > > Agreed. 
This is a shiny example how forcing all existing gup users into > > the new scheme is subotimal at best. Not the mention the overal > > fragility mention elsewhere. I dislike the conversion even more now. > > > > Sorry if this was already discussed already but why the new pinning is > > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > > do not have to care about the flag) only? > > > > Oh, it's been discussed alright, but given how some of the discussions have gone, > I certainly am not surprised that there are still questions and criticisms! > Especially since I may have misunderstood some of the points, along the way. > It's been quite a merry go round. :) Yeah, I've tried to follow them but just gave up at some point. > Anyway, what I'm hearing now is: for gup(FOLL_LONGTERM), apply the pinned tracking. > And therefore only do put_user_page() on pages that were pinned with > FOLL_LONGTERM. For short term pins, let the locking do what it will: > things can briefly block and all will be well. > > Also, that may or may not come with a wrapper function, courtesy of Jan > and Ira. > > Is that about right? It's late here, but I don't immediately recall any > problems with doing it that way... Yes that makes more sense to me. Whoever needs that tracking should opt-in for it. Otherwise you just risk problems like the one discussed in the mlock path (because we do a strange stuff in the name of performance) and a never ending whack a mole where new users do not follow the new API usage and that results in all sorts of weird issues. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 8:23 ` Michal Hocko 2019-08-09 9:05 ` John Hubbard @ 2019-08-09 13:58 ` Jan Kara 2019-08-09 17:52 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Jan Kara @ 2019-08-09 13:58 UTC (permalink / raw) To: Michal Hocko Cc: John Hubbard, Vlastimil Babka, Andrew Morton, Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On Fri 09-08-19 10:23:07, Michal Hocko wrote: > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > On 8/9/19 12:59 AM, John Hubbard wrote: > > >>> That's true. However, I'm not sure munlocking is where the > > >>> put_user_page() machinery is intended to be used anyway? These are > > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > > >>> within the reasoning there. Perhaps not all GUP users should be > > >>> converted to the planned separate GUP tracking, and instead we should > > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > >>> > > >> > > >> Interesting. So far, the approach has been to get all the gup callers to > > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > >> wrapper, then maybe we could leave some sites unconverted. > > >> > > >> However, in order to do so, we would have to change things so that we have > > >> one set of APIs (gup) that do *not* increment a pin count, and another set > > >> (vaddr_pin_pages) that do. > > >> > > >> Is that where we want to go...? > > >> > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > which kind of pinning to use? > > Agreed. 
This is a shiny example how forcing all existing gup users into > the new scheme is subotimal at best. Not the mention the overal > fragility mention elsewhere. I dislike the conversion even more now. > > Sorry if this was already discussed already but why the new pinning is > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > do not have to care about the flag) only? The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets page reference and then touches page data (e.g. direct IO) needs the new kind of tracking so that filesystem knows someone is messing with the page data. So what John is trying to address is a different (although related) problem to someone pinning a page for a long time. In principle, I'm not strongly opposed to a new FOLL flag to determine whether a pin or an ordinary page reference will be acquired at least as an internal implementation detail inside mm/gup.c. But I would really like to discourage new GUP users taking just page reference as the most clueless users (drivers) usually need a pin in the sense John implements. So in terms of API I'd strongly prefer to deprecate GUP as an API, provide vaddr_pin_pages() for drivers to get their buffer pages pinned and then for those few users who really know what they are doing (and who are not interested in page contents) we can have APIs like follow_page() to get a page reference from a virtual address. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 13:58 ` Jan Kara @ 2019-08-09 17:52 ` Michal Hocko 2019-08-09 18:14 ` Weiny, Ira 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2019-08-09 17:52 UTC (permalink / raw) To: Jan Kara Cc: John Hubbard, Vlastimil Babka, Andrew Morton, Christoph Hellwig, Ira Weiny, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz On Fri 09-08-19 15:58:13, Jan Kara wrote: > On Fri 09-08-19 10:23:07, Michal Hocko wrote: > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > >>> That's true. However, I'm not sure munlocking is where the > > > >>> put_user_page() machinery is intended to be used anyway? These are > > > >>> short-term pins for struct page manipulation, not e.g. dirtying of page > > > >>> contents. Reading commit fc1d8e7cca2d I don't think this case falls > > > >>> within the reasoning there. Perhaps not all GUP users should be > > > >>> converted to the planned separate GUP tracking, and instead we should > > > >>> have a GUP/follow_page_mask() variant that keeps using get_page/put_page? > > > >>> > > > >> > > > >> Interesting. So far, the approach has been to get all the gup callers to > > > >> release via put_user_page(), but if we add in Jan's and Ira's vaddr_pin_pages() > > > >> wrapper, then maybe we could leave some sites unconverted. > > > >> > > > >> However, in order to do so, we would have to change things so that we have > > > >> one set of APIs (gup) that do *not* increment a pin count, and another set > > > >> (vaddr_pin_pages) that do. > > > >> > > > >> Is that where we want to go...? > > > >> > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? And if > > > it's not exactly the same thing, perhaps a new gup flag to distinguish > > > which kind of pinning to use? > > > > Agreed. 
This is a shiny example how forcing all existing gup users into > > the new scheme is subotimal at best. Not the mention the overal > > fragility mention elsewhere. I dislike the conversion even more now. > > > > Sorry if this was already discussed already but why the new pinning is > > not bound to FOLL_LONGTERM (ideally hidden by an interface so that users > > do not have to care about the flag) only? > > The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets page > reference and then touches page data (e.g. direct IO) needs the new kind of > tracking so that filesystem knows someone is messing with the page data. > So what John is trying to address is a different (although related) problem > to someone pinning a page for a long time. OK, I see. Thanks for the clarification. > In principle, I'm not strongly opposed to a new FOLL flag to determine > whether a pin or an ordinary page reference will be acquired at least as an > internal implementation detail inside mm/gup.c. But I would really like to > discourage new GUP users taking just page reference as the most clueless > users (drivers) usually need a pin in the sense John implements. So in > terms of API I'd strongly prefer to deprecate GUP as an API, provide > vaddr_pin_pages() for drivers to get their buffer pages pinned and then for > those few users who really know what they are doing (and who are not > interested in page contents) we can have APIs like follow_page() to get a > page reference from a virtual address. Yes, going with a dedicated API sounds much better to me. Whether a dedicated FOLL flag is used internally is not that important. I am also for making the underlying gup to be really internal to the core kernel. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 17:52 ` Michal Hocko @ 2019-08-09 18:14 ` Weiny, Ira 2019-08-09 18:36 ` John Hubbard 0 siblings, 1 reply; 23+ messages in thread From: Weiny, Ira @ 2019-08-09 18:14 UTC (permalink / raw) To: Michal Hocko, Jan Kara Cc: John Hubbard, Vlastimil Babka, Andrew Morton, Christoph Hellwig, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Williams, Dan J, Daniel Black, Matthew Wilcox, Mike Kravetz > On Fri 09-08-19 15:58:13, Jan Kara wrote: > > On Fri 09-08-19 10:23:07, Michal Hocko wrote: > > > On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: > > > > On 8/9/19 12:59 AM, John Hubbard wrote: > > > > >>> That's true. However, I'm not sure munlocking is where the > > > > >>> put_user_page() machinery is intended to be used anyway? These > > > > >>> are short-term pins for struct page manipulation, not e.g. > > > > >>> dirtying of page contents. Reading commit fc1d8e7cca2d I don't > > > > >>> think this case falls within the reasoning there. Perhaps not > > > > >>> all GUP users should be converted to the planned separate GUP > > > > >>> tracking, and instead we should have a GUP/follow_page_mask() > variant that keeps using get_page/put_page? > > > > >>> > > > > >> > > > > >> Interesting. So far, the approach has been to get all the gup > > > > >> callers to release via put_user_page(), but if we add in Jan's > > > > >> and Ira's vaddr_pin_pages() wrapper, then maybe we could leave > some sites unconverted. > > > > >> > > > > >> However, in order to do so, we would have to change things so > > > > >> that we have one set of APIs (gup) that do *not* increment a > > > > >> pin count, and another set > > > > >> (vaddr_pin_pages) that do. > > > > >> > > > > >> Is that where we want to go...? > > > > >> > > > > > > > > We already have a FOLL_LONGTERM flag, isn't that somehow related? 
> > > > And if it's not exactly the same thing, perhaps a new gup flag to > > > > distinguish which kind of pinning to use? > > > > > > Agreed. This is a shiny example how forcing all existing gup users > > > into the new scheme is subotimal at best. Not the mention the overal > > > fragility mention elsewhere. I dislike the conversion even more now. > > > > > > Sorry if this was already discussed already but why the new pinning > > > is not bound to FOLL_LONGTERM (ideally hidden by an interface so > > > that users do not have to care about the flag) only? > > > > The new tracking cannot be bound to FOLL_LONGTERM. Anything that gets > > page reference and then touches page data (e.g. direct IO) needs the > > new kind of tracking so that filesystem knows someone is messing with the > page data. > > So what John is trying to address is a different (although related) > > problem to someone pinning a page for a long time. > > OK, I see. Thanks for the clarification. Not to beat a dead horse but FOLL_LONGTERM also has implications now for CMA pages which may or may not (I'm not an expert on those pages) need special tracking. > > > In principle, I'm not strongly opposed to a new FOLL flag to determine > > whether a pin or an ordinary page reference will be acquired at least > > as an internal implementation detail inside mm/gup.c. But I would > > really like to discourage new GUP users taking just page reference as > > the most clueless users (drivers) usually need a pin in the sense John > > implements. So in terms of API I'd strongly prefer to deprecate GUP as > > an API, provide > > vaddr_pin_pages() for drivers to get their buffer pages pinned and > > then for those few users who really know what they are doing (and who > > are not interested in page contents) we can have APIs like > > follow_page() to get a page reference from a virtual address. > > Yes, going with a dedicated API sounds much better to me. 
Whether a > dedicated FOLL flag is used internally is not that important. I am also for > making the underlying gup to be really internal to the core kernel. +1 I think GUP is too confusing. I've been working with the details for many months now and it continues to confuse me. :-( My patches should be posted soon (based on mmotm) and I'll have my flame suit on so we can debate the interface. Ira ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() 2019-08-09 18:14 ` Weiny, Ira @ 2019-08-09 18:36 ` John Hubbard 0 siblings, 0 replies; 23+ messages in thread From: John Hubbard @ 2019-08-09 18:36 UTC (permalink / raw) To: Weiny, Ira, Michal Hocko, Jan Kara Cc: Vlastimil Babka, Andrew Morton, Christoph Hellwig, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, Williams, Dan J, Daniel Black, Matthew Wilcox, Mike Kravetz On 8/9/19 11:14 AM, Weiny, Ira wrote: >> On Fri 09-08-19 15:58:13, Jan Kara wrote: >>> On Fri 09-08-19 10:23:07, Michal Hocko wrote: >>>> On Fri 09-08-19 10:12:48, Vlastimil Babka wrote: >>>>> On 8/9/19 12:59 AM, John Hubbard wrote: ... >>> In principle, I'm not strongly opposed to a new FOLL flag to determine >>> whether a pin or an ordinary page reference will be acquired at least >>> as an internal implementation detail inside mm/gup.c. But I would >>> really like to discourage new GUP users taking just page reference as >>> the most clueless users (drivers) usually need a pin in the sense John >>> implements. So in terms of API I'd strongly prefer to deprecate GUP as >>> an API, provide >>> vaddr_pin_pages() for drivers to get their buffer pages pinned and >>> then for those few users who really know what they are doing (and who >>> are not interested in page contents) we can have APIs like >>> follow_page() to get a page reference from a virtual address. >> >> Yes, going with a dedicated API sounds much better to me. Whether a >> dedicated FOLL flag is used internally is not that important. I am also for >> making the underlying gup to be really internal to the core kernel. > > +1 > > I think GUP is too confusing. I've been working with the details for many months now and it continues to confuse me. :-( > > My patches should be posted soon (based on mmotm) and I'll have my flame suit on so we can debate the interface. > OK, so: use FOLL_PIN as an internal gup flag. 
FOLL_PIN will get set by the new vaddr_pin_pages*() wrapper calls. Then, put_user_page*() shall only be invoked from call sites that use FOLL_PIN. With that approach in mind, I can sweep through my callsite conversion patchset and drop a few patches. There are actually quite a few patches that just want to find the page, not really operate on its data. And the conversion of the actual gup() calls can be done almost independently of the put_user_page*() conversions, if necessary (and it sounds like with your patchset, it is). btw, as part of the conversion, to make merging and call site conversion smoother, maybe it's OK to pass in FOLL_PIN to existing gup() calls, with the intent to convert them to use vaddr_pin_pages.) thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 23+ messages in thread
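The plan John summarizes can be sketched the same way. Again this is a userspace model under stated assumptions: FOLL_PIN is only a proposed internal flag at this point, the counters are stand-ins, and the eventual kernel accounting may differ.

```c
#include <assert.h>

#define FOLL_PIN 0x1	/* hypothetical internal gup flag */

struct page {
	int refcount;
	int pincount;
};

/* Single gup entry point: the flag decides whether the caller gets a
 * plain reference or a tracked pin. Callers would normally reach this
 * via a vaddr_pin_pages()-style wrapper rather than passing the flag. */
static void gup_one(struct page *p, unsigned int flags)
{
	p->refcount++;
	if (flags & FOLL_PIN)
		p->pincount++;
}

/* put_user_page() is only valid for FOLL_PIN acquisitions; the assert
 * models catching a mismatched release path (a put without a pin). */
static void put_user_page(struct page *p)
{
	assert(p->pincount > 0);
	p->pincount--;
	p->refcount--;
}

/* put_page() remains the release for plain references. */
static void put_page(struct page *p)
{
	p->refcount--;
}
```

This is the pairing rule from the thread in executable form: call sites that opted into FOLL_PIN release with put_user_page(), everyone else keeps get_page()/put_page(), and a mismatch trips the check instead of silently skewing the counts.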
* [PATCH 2/3] mm/mempolicy.c: convert put_page() to put_user_page*() 2019-08-05 22:20 [PATCH 0/3] mm/: 3 more put_user_page() conversions john.hubbard 2019-08-05 22:20 ` [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() john.hubbard @ 2019-08-05 22:20 ` john.hubbard 2019-08-05 22:20 ` [PATCH 3/3] mm/ksm: " john.hubbard 2019-08-06 21:59 ` [PATCH 0/3] mm/: 3 more put_user_page() conversions Andrew Morton 3 siblings, 0 replies; 23+ messages in thread From: john.hubbard @ 2019-08-05 22:20 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard, Andrea Arcangeli, Anshuman Khandual, David Rientjes, Dominik Brodowski, Kirill A. Shutemov, Michal Hocko, Vlastimil Babka, zhong jiang From: John Hubbard <jhubbard@nvidia.com> For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: zhong jiang <zhongjiang@huawei.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> --- mm/mempolicy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f48693f75b37..76a8e935e2e6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -832,7 +832,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr) err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked); if (err >= 0) { err = page_to_nid(p); - put_page(p); + put_user_page(p); } if (locked) up_read(&mm->mmap_sem); -- 2.22.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/3] mm/ksm: convert put_page() to put_user_page*() 2019-08-05 22:20 [PATCH 0/3] mm/: 3 more put_user_page() conversions john.hubbard 2019-08-05 22:20 ` [PATCH 1/3] mm/mlock.c: convert put_page() to put_user_page*() john.hubbard 2019-08-05 22:20 ` [PATCH 2/3] mm/mempolicy.c: " john.hubbard @ 2019-08-05 22:20 ` john.hubbard 2019-08-06 21:59 ` [PATCH 0/3] mm/: 3 more put_user_page() conversions Andrew Morton 3 siblings, 0 replies; 23+ messages in thread From: john.hubbard @ 2019-08-05 22:20 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard, Dan Williams, Daniel Black, Matthew Wilcox, Mike Kravetz From: John Hubbard <jhubbard@nvidia.com> For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Black <daniel@linux.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> --- mm/ksm.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index 3dc4346411e4..e10ee4d5fdd8 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -456,7 +456,7 @@ static inline bool ksm_test_exit(struct mm_struct *mm) * We use break_ksm to break COW on a ksm page: it's a stripped down * * if (get_user_pages(addr, 1, 1, 1, &page, NULL) == 1) - * put_page(page); + * put_user_page(page); * * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma, * in case the application has unmapped and remapped mm,addr meanwhile. 
@@ -483,7 +483,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE); else ret = VM_FAULT_WRITE; - put_page(page); + put_user_page(page); } while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM))); /* * We must loop because handle_mm_fault() may back out if there's @@ -568,7 +568,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item) flush_anon_page(vma, page, addr); flush_dcache_page(page); } else { - put_page(page); + put_user_page(page); out: page = NULL; } @@ -1974,10 +1974,10 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, parent = *new; if (ret < 0) { - put_page(tree_page); + put_user_page(tree_page); new = &parent->rb_left; } else if (ret > 0) { - put_page(tree_page); + put_user_page(tree_page); new = &parent->rb_right; } else if (!ksm_merge_across_nodes && page_to_nid(tree_page) != nid) { @@ -1986,7 +1986,7 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, * it will be flushed out and put in the right unstable * tree next time: only merge with it when across_nodes. */ - put_page(tree_page); + put_user_page(tree_page); return NULL; } else { *tree_pagep = tree_page; @@ -2328,7 +2328,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page) &rmap_item->rmap_list; ksm_scan.address += PAGE_SIZE; } else - put_page(*page); + put_user_page(*page); up_read(&mm->mmap_sem); return rmap_item; } -- 2.22.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] mm/: 3 more put_user_page() conversions 2019-08-05 22:20 [PATCH 0/3] mm/: 3 more put_user_page() conversions john.hubbard ` (2 preceding siblings ...) 2019-08-05 22:20 ` [PATCH 3/3] mm/ksm: " john.hubbard @ 2019-08-06 21:59 ` Andrew Morton 2019-08-06 22:05 ` John Hubbard 3 siblings, 1 reply; 23+ messages in thread From: Andrew Morton @ 2019-08-06 21:59 UTC (permalink / raw) To: john.hubbard Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel, John Hubbard On Mon, 5 Aug 2019 15:20:16 -0700 john.hubbard@gmail.com wrote: > Here are a few more mm/ files that I wasn't ready to send with the > larger 34-patch set. Seems that a v3 of "put_user_pages(): miscellaneous call sites" is in the works, so can we make that a 37 patch series? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] mm/: 3 more put_user_page() conversions 2019-08-06 21:59 ` [PATCH 0/3] mm/: 3 more put_user_page() conversions Andrew Morton @ 2019-08-06 22:05 ` John Hubbard 0 siblings, 0 replies; 23+ messages in thread From: John Hubbard @ 2019-08-06 22:05 UTC (permalink / raw) To: Andrew Morton, john.hubbard Cc: Christoph Hellwig, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, LKML, linux-mm, linux-fsdevel On 8/6/19 2:59 PM, Andrew Morton wrote: > On Mon, 5 Aug 2019 15:20:16 -0700 john.hubbard@gmail.com wrote: > >> Here are a few more mm/ files that I wasn't ready to send with the >> larger 34-patch set. > > Seems that a v3 of "put_user_pages(): miscellaneous call sites" is in > the works, so can we make that a 37 patch series? > Sure, I'll add them to that. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 23+ messages in thread