Linux-Fsdevel Archive on lore.kernel.org
* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:08                                             ` Jan Kara
@ 2018-12-20 10:54                                               ` John Hubbard
  2018-12-20 16:50                                                 ` Jerome Glisse
  2018-12-20 16:49                                               ` Jerome Glisse
  2019-01-03  1:55                                               ` Jerome Glisse
  2 siblings, 1 reply; 207+ messages in thread
From: John Hubbard @ 2018-12-20 10:54 UTC (permalink / raw)
  To: Jan Kara, Jerome Glisse
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/19/18 3:08 AM, Jan Kara wrote:
> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>> solution for that.  
>>>
>>> So as I understand it, this would use page->_mapcount to store both the real
>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>> file-backed (non-anonymous) pages:
>>>
>>>
>>> __get_user_pages()
>>> {
>>> 	...
>>> 	get_page(page);
>>>
>>> 	if (!PageAnon(page))
>>> 		atomic_inc(&page->_mapcount);
>>> 	...
>>> }
>>>
>>> put_user_page(struct page *page)
>>> {
>>> 	...
>>> 	if (!PageAnon(page))
>>> 		atomic_dec(&page->_mapcount);
>>>
>>> 	put_page(page);
>>> 	...
>>> }
>>>
>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>> had in mind?
>>
>> Mostly, with the extra two observations:
>>     [1] We only need to know the pin count when a write back kicks in
>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>         GUP is racing with a write back that might not see the
>>         elevated mapcount in time.
>>
>> So for [2]
>>
>> __get_user_pages()
>> {
>>     get_page(page);
>>
>>     if (!PageAnon(page)) {
>>         atomic_inc(&page->_mapcount);
>> +       if (PageWriteback(page)) {
>> +           // Assume we are racing and current write back will not see
>> +           // the elevated mapcount so wait for current write back and
>> +           // force page fault
>> +           wait_on_page_writeback(page);
>> +           // force slow path that will fault again
>> +       }
>>     }
>> }
> 
> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> should also increment page->_mapcount) under PTE lock. So at that point we
> are sure we have a writeable PTE that nobody can change. So page_mkclean() has to
> block on PTE lock to make PTE read-only and only after going through all
> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> enough synchronization.
> 
>> For [1] only needing pin count during write back turns page_mkclean into
>> the perfect spot to check for that so:
>>
>> int page_mkclean(struct page *page)
>> {
>>     int cleaned = 0;
>> +   int real_mapcount = 0;
>>     struct address_space *mapping;
>>     struct rmap_walk_control rwc = {
>>         .arg = (void *)&cleaned,
>>         .rmap_one = page_mkclean_one,
>>         .invalid_vma = invalid_mkclean_vma,
>> +       .mapcount = &real_mapcount,
>>     };
>>
>>     BUG_ON(!PageLocked(page));
>>
>>     if (!page_mapped(page))
>>         return 0;
>>
>>     mapping = page_mapping(page);
>>     if (!mapping)
>>         return 0;
>>
>>     // rmap_walk needs to change to count mappings and return the
>>     // value in .mapcount (an easy change)
>>     rmap_walk(page, &rwc);
>>
>>     // Big fat comment to explain what is going on
>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>> +       SetPageDMAPinned(page);
>> +   } else {
>> +       ClearPageDMAPinned(page);
>> +   }
> 
> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> check we do in page_mkclean() is wrong?

Right. This looks like a dead end after all. We can't lock a whole chunk
of "all these are mapped, hold still while we count you" pages; the rmap
machinery is simply not designed to allow that.

IMHO, we are now back to something like dynamic_page, which provides an
independent DMA pinned count.

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-02-08 10:32   ` Mike Rapoport
@ 2019-02-08 20:44     ` John Hubbard
  0 siblings, 0 replies; 207+ messages in thread
From: John Hubbard @ 2019-02-08 20:44 UTC (permalink / raw)
  To: Mike Rapoport, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel

On 2/8/19 2:32 AM, Mike Rapoport wrote:
> On Thu, Feb 07, 2019 at 11:56:48PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> +/**
>> + * put_user_page() - release a gup-pinned page
>> + * @page:            pointer to page to be released
>> + *
>> + * Pages that were pinned via get_user_pages*() must be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below. This is so that eventually, pages that are pinned via
>> + * get_user_pages*() can be separately tracked and uniquely handled. In
>> + * particular, interactions with RDMA and filesystems need special
>> + * handling.
>> + *
>> + * put_user_page() and put_page() are not interchangeable, despite this early
>> + * implementation that makes them look the same. put_user_page() calls must
> 
> I just hope we'll remember to update this comment when the real
> implementation is merged ;-)
> 
> Other than that, feel free to add
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>	# docs
> 

Thanks for the review!

Yes, the follow-on patch that turns this into a real implementation is
posted [1], and its documentation is updated accordingly.

(I've already changed "@Returns" to "@Return" locally in that patch, btw.)

[1] https://lore.kernel.org/r/20190204052135.25784-5-jhubbard@nvidia.com

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-02-08  7:56 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2019-02-08 10:32   ` Mike Rapoport
  2019-02-08 20:44     ` John Hubbard
  0 siblings, 1 reply; 207+ messages in thread
From: Mike Rapoport @ 2019-02-08 10:32 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Thu, Feb 07, 2019 at 11:56:48PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages are gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> The steps to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must

I just hope we'll remember to update this comment when the real
implementation is merged ;-)

Other than that, feel free to add

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>	# docs 

> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 4929bc1be60e..7c42ca45bb89 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/**
> + * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> + * variants called on that page.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/**
> + * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/**
> + * put_user_pages() - release an array of gup-pinned pages.
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, release the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.20.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 207+ messages in thread

* [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-02-08  7:56 [PATCH 0/2] mm: put_user_page() call site conversion first john.hubbard
@ 2019-02-08  7:56 ` john.hubbard
  2019-02-08 10:32   ` Mike Rapoport
  0 siblings, 1 reply; 207+ messages in thread
From: john.hubbard @ 2019-02-08  7:56 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Jan Kara, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing a problem (also described in [1] and
[2]) with interactions between get_user_pages ("gup") and filesystems.

Problem description: let's start with a bug report. Below is what happens
sometimes, under memory pressure, when a driver pins some pages via gup,
and then marks those pages dirty, and releases them. Note that the gup
documentation actually recommends that pattern. The problem is that the
filesystem may do a writeback while the pages are gup-pinned, and then the
filesystem believes that the pages are clean. So, when the driver later
marks the pages as dirty, that conflicts with the filesystem's page
tracking and results in a BUG(), like this one that I experienced:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: filesystems do not
actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that pattern has been
going on since about 2005 or so.

The steps to fix it are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 24 ++++++++++++++
 mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..809b7397d41e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ *
+ * put_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. put_user_page() calls must
+ * be perfectly matched up with get_user_page() calls.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index 4929bc1be60e..7c42ca45bb89 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/**
+ * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * "gup-pinned page" refers to a page that has had one of the get_user_pages()
+ * variants called on that page.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/**
+ * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/**
+ * put_user_pages() - release an array of gup-pinned pages.
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, release the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures
-- 
2.20.1


^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-29 10:12                                                                                               ` Jan Kara
@ 2019-01-30  2:21                                                                                                 ` John Hubbard
  0 siblings, 0 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-30  2:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jerome Glisse, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/29/19 2:12 AM, Jan Kara wrote:
> On Mon 28-01-19 22:41:41, John Hubbard wrote:
[...]
>> Here is the case I'm wondering about:
>>
>> thread A                             thread B
>> --------                             --------
>>                                      gup_fast
>> page_mkclean
>>     is page gup-pinned?(no)
>>                                          page_cache_get_speculative
>>                                              (gup-pins the page here)
>>                                          check pte_val unchanged (yes)
>>        set_pte_at()
>>
>> ...and now thread A has created a read-only PTE, after gup_fast walked
>> the page tables and found a writeable entry. And so far, thread A has
>> not seen that the page is pinned.
>>
>> What am I missing here? The above seems like a problem even before we
>> change anything.
> 
> Your implementation of page_mkclean() is wrong :) It needs to first call
> set_pte_at() and only after that ask "is page gup pinned?". In fact,
> page_mkclean() probably has no business in checking for page pins
> whatsoever. It is clear_page_dirty_for_io() that cares, so that should
> check for page pins after page_mkclean() has returned.
> 

Perfect, that was the missing piece for me: page_mkclean() internally doesn't
need the consistent view, just the caller does. The whole situation with
two distinct lock-free algorithms going on here actually seems clear at last. :)

Thanks (also to Jerome) for explaining this!

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-29  6:41                                                                                             ` John Hubbard
@ 2019-01-29 10:12                                                                                               ` Jan Kara
  2019-01-30  2:21                                                                                                 ` John Hubbard
  0 siblings, 1 reply; 207+ messages in thread
From: Jan Kara @ 2019-01-29 10:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Jan Kara, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 28-01-19 22:41:41, John Hubbard wrote:
> On 1/28/19 5:23 PM, Jerome Glisse wrote:
> > On Mon, Jan 28, 2019 at 04:22:16PM -0800, John Hubbard wrote:
> > > On 1/23/19 11:04 AM, Jerome Glisse wrote:
> > > > On Wed, Jan 23, 2019 at 07:02:30PM +0100, Jan Kara wrote:
> > > > > On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
> > > > > > On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
> > > > > > > On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> > > > > > > > On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> > > > > > > > > On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > > > > > > > > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > > > > > > > > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > > > > > > > > > Agreed. So with page lock it would actually look like:
> > > > > > > > > > > > 
> > > > > > > > > > > > get_page_pin()
> > > > > > > > > > > > 	lock_page(page);
> > > > > > > > > > > > 	wait_for_stable_page();
> > > > > > > > > > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > > > > > > > > 	unlock_page(page);
> > > > > > > > > > > > 
> > > > > > > > > > > > And if we perform page_pinned() check under page lock, then if
> > > > > > > > > > > > page_pinned() returned false, we are sure page is not and will not be
> > > > > > > > > > > > pinned until we drop the page lock (and also until page writeback is
> > > > > > > > > > > > completed if needed).
> > > > > > > > > > > 
> > > > > > > > > > > After some more though, why do we even need wait_for_stable_page() and
> > > > > > > > > > > lock_page() in get_page_pin()?
> > > > > > > > > > > 
> > > > > > > > > > > During writepage page_mkclean() will write protect all page tables. So
> > > > > > > > > > > there can be no new writeable GUP pins until we unlock the page as all such
> > > > > > > > > > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > > > > > > > > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > > > > > > > > > Am I just confused?
> > > > > > > > > > 
> > > > > > > > > > Yeah, with the page lock it should synchronize on the PTE, but you
> > > > > > > > > > still need to check for writeback: iirc the page is unlocked after
> > > > > > > > > > the file system has queued up the write, so the page can be unlocked
> > > > > > > > > > with write back still pending (PageWriteback() == true), and I am
> > > > > > > > > > not sure that in that state we can safely let anyone write to the
> > > > > > > > > > page. I am assuming that in some cases the block device also expects
> > > > > > > > > > stable page content (RAID stuff).
> > > > > > > > > > 
> > > > > > > > > > So the PageWriteback() test is not only for racing page_mkclean()/
> > > > > > > > > > test_set_page_writeback() and GUP but also for pending write back.
> > > > > > > > > 
> > > > > > > > > But this is prevented by wait_for_stable_page() that is already present in
> > > > > > > > > ->page_mkwrite() handlers. Look:
> > > > > > > > > 
> > > > > > > > > ->writepage()
> > > > > > > > >    /* Page is locked here */
> > > > > > > > >    clear_page_dirty_for_io(page)
> > > > > > > > >      page_mkclean(page)
> > > > > > > > >        -> page tables get writeprotected
> > > > > > > > >      /* The following line will be added by our patches */
> > > > > > > > >      if (page_pinned(page)) -> bounce
> > > > > > > > >      TestClearPageDirty(page)
> > > > > > > > >    set_page_writeback(page);
> > > > > > > > >    unlock_page(page);
> > > > > > > > >    ...submit_io...
> > > > > > > > > 
> > > > > > > > > IRQ
> > > > > > > > >    - IO completion
> > > > > > > > >    end_page_writeback()
> > > > > > > > > 
> > > > > > > > > So if GUP happens before page_mkclean() writeprotects corresponding PTE
> > > > > > > > > (and these two actions are synchronized on the PTE lock), page_pinned()
> > > > > > > > > will see the increment and report the page as pinned.
> > > > > > > > > 
> > > > > > > > > If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> > > > > > > > > will fault:
> > > > > > > > >    handle_mm_fault()
> > > > > > > > >      do_wp_page()
> > > > > > > > >        wp_page_shared()
> > > > > > > > >          do_page_mkwrite()
> > > > > > > > >            ->page_mkwrite() - that is block_page_mkwrite() or
> > > > > > > > > 	    iomap_page_mkwrite() or whatever filesystem provides
> > > > > > > > > 	  lock_page(page)
> > > > > > > > >            ... prepare page ...
> > > > > > > > > 	  wait_for_stable_page(page) -> this blocks until IO completes
> > > > > > > > > 	    if someone cares about pages not being modified while under IO.
> > > > > > > > 
> > > > > > > > The case I am worried about is GUP seeing a pte with the write flag set
> > > > > > > > but not having locked the page yet (GUP gets the pte first, then goes
> > > > > > > > from pte to page, then locks the page); it then locks the page, but the
> > > > > > > > page lock can make it wait for a racing page_mkclean()...writeback that
> > > > > > > > has not yet write-protected the pte GUP just read. So by the time GUP
> > > > > > > > has the page locked, the pte it read might no longer have the write flag
> > > > > > > > set. Hence you also need to check for writeback after taking the page
> > > > > > > > lock. Alternatively you could recheck the pte after a successful
> > > > > > > > try_lock on the page.
> > > > > > > 
> > > > > > > This isn't really possible. GUP does:
> > > > > > > 
> > > > > > > get_user_pages()
> > > > > > > ...
> > > > > > >    follow_page_mask()
> > > > > > >    ...
> > > > > > >      follow_page_pte()
> > > > > > >        ptep = pte_offset_map_lock()
> > > > > > >        check permissions and page sanity
> > > > > > >        if (flags & FOLL_GET)
> > > > > > >          get_page(page); -> this would become
> > > > > > > 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > > >        pte_unmap_unlock(ptep, ptl);
> > > > > > > 
> > > > > > > page_mkclean() on the other hand grabs the same pte lock to change the pte
> > > > > > > to write-protected. So after page_mkclean() has modified the PTE we are
> > > > > > > racing on for access, we are sure to either see increased _refcount or get
> > > > > > > page fault from GUP.
> > > > > > > 
> > > > > > > If we see increased _refcount, we bounce the page and are fine. If GUP
> > > > > > > faults, we will wait for page lock (so wait until page is prepared for IO
> > > > > > > and has PageWriteback set) while handling the fault, then enter
> > > > > > > ->page_mkwrite, which will do wait_for_stable_page() -> wait for
> > > > > > > outstanding writeback to complete.
> > > > > > > 
> > > > > > > So I still conclude - no need for page lock in the GUP path at all AFAICT.
> > > > > > > In fact we rely on the very same page fault vs page writeback synchronization
> > > > > > > for normal user faults as well. And normal user mmap access is even nastier
> > > > > > > than GUP access because the CPU reads page tables without taking PTE lock.
> > > > > > 
> > > > > > For the "slow" GUP path you are right that you do not need a lock, as
> > > > > > the page table lock gives you the ordering. For the GUP fast path you
> > > > > > would need either the lock or a memory barrier together with the test
> > > > > > for page writeback.
> > > > > >
> > > > > > Maybe an easier thing is to convert GUP fast to try to take the page
> > > > > > table lock; if it fails to take the page table lock, we fall back to
> > > > > > the slow GUP path. Otherwise we have the same guarantee as the slow
> > > > > > path.
> > > > > 
> > > > > You're right I was looking at the wrong place for GUP_fast() path. But I
> > > > > still don't think anything special (i.e. page lock or new barrier) is
> > > > > necessary. GUP_fast() takes care already now that it cannot race with page
> > > > > unmapping or write-protection (as there are other places in MM that rely on
> > > > > this). Look, gup_pte_range() has:
> > > > > 
> > > > >                  if (!page_cache_get_speculative(head))
> > > > >                          goto pte_unmap;
> > > > > 
> > > > >                  if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > > >                          put_page(head);
> > > > >                          goto pte_unmap;
> > > > >                  }
> > > > > 
> > > > > So that page_cache_get_speculative() will become
> > > > > page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
> > > > > of 1. That is atomic ordered operation so it cannot be reordered with the
> > > > > following check that PTE stayed same. So once page_mkclean() write-protects
> > > > > PTE, there can be no new pins from GUP_fast() and we are sure all
> > > > > succeeding pins are visible in page->_refcount after page_mkclean()
> > > > > completes. Again this is nothing new, other mm code already relies on
> > > > > either seeing page->_refcount incremented or GUP fast bailing out (e.g. DAX
> > > > > relies on this). Although strictly speaking I'm not 100% sure what prevents
> > > > > page->_refcount load from being speculatively reordered before the PTE update even
> > > > > in current places using this but there's so much stuff inbetween that
> > > > > there's probably something ;). But we could add smp_rmb() after
> > > > > page_mkclean() before changing page_pinned() for the peace of mind I guess.
> > > > 
> > > > Yeah, I think you are right; I missed the check for the same pte value,
> > > > and the atomic inc in page_cache_get_speculative() is a barrier.
> > > > I do not think the extra barrier would be necessary, as page_mkclean
> > > > is taking and dropping locks, so those should provide enough barriers.
> > > > 
> > > 
> > > Hi Jan, Jerome,
> > > 
> > > OK, this seems to be up and running locally, but while putting together
> > > documentation and polishing up things, I noticed that there is one last piece
> > > that I don't quite understand, after all. The page_cache_get_speculative()
> > > existing documentation explains how refcount synchronizes these things, but I
> > > don't see how that helps with synchronizing page_mkclean and gup, in this
> > > situation:
> > > 
> > >      gup_fast gets the refcount and rechecks the pte hasn't changed
> > > 
> > >      meanwhile, page_mkclean...wait, how does refcount come into play here?
> > >      page_mkclean can remove the mapping and insert a write-protected pte,
> > >      regardless of page refcount, correct?  Help? :)
> > 
> > Correct, page_mkclean() does not check the refcount and does not need
> > to check it. We need to check for the page pin after page_mkclean(),
> > when the code is done prepping the page for IO (clear_page_dirty_for_io).
> >
> > The race Jan and I were discussing was about whether we needed to lock
> > the page or not, and we do not. For the slow path, page_mkclean and
> > GUP_slow will synchronize on the page table lock. For GUP_fast, the fast
> > code will back off if the pte is not the same, and thus either we see
> > the pin after page_mkclean() or GUP_fast backs off. You will never have
> > code that misses the pin after page_mkclean() with a GUP_fast that did
> > not back off.
> 
> Here is the case I'm wondering about:
> 
> thread A                             thread B
> --------                             --------
>                                      gup_fast
> page_mkclean
>     is page gup-pinned?(no)
>                                          page_cache_get_speculative
>                                              (gup-pins the page here)
>                                          check pte_val unchanged (yes)
>        set_pte_at()
> 
> ...and now thread A has created a read-only PTE, after gup_fast walked
> the page tables and found a writeable entry. And so far, thread A has
> not seen that the page is pinned.
> 
> What am I missing here? The above seems like a problem even before we
> change anything.

Your implementation of page_mkclean() is wrong :) It needs to first call
set_pte_at() and only after that ask "is the page gup pinned?". In fact,
page_mkclean() probably has no business in checking for page pins
whatsoever. It is clear_page_dirty_for_io() that cares, so that should
check for page pins after page_mkclean() has returned.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-29  1:23                                                                                           ` Jerome Glisse
@ 2019-01-29  6:41                                                                                             ` John Hubbard
  2019-01-29 10:12                                                                                               ` Jan Kara
  0 siblings, 1 reply; 207+ messages in thread
From: John Hubbard @ 2019-01-29  6:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/28/19 5:23 PM, Jerome Glisse wrote:
> On Mon, Jan 28, 2019 at 04:22:16PM -0800, John Hubbard wrote:
>> On 1/23/19 11:04 AM, Jerome Glisse wrote:
>>> On Wed, Jan 23, 2019 at 07:02:30PM +0100, Jan Kara wrote:
>>>> On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
>>>>> On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
>>>>>> On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
>>>>>>> On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
>>>>>>>> On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
>>>>>>>>> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
>>>>>>>>>> On Tue 15-01-19 09:07:59, Jan Kara wrote:
>>>>>>>>>>> Agreed. So with page lock it would actually look like:
>>>>>>>>>>>
>>>>>>>>>>> get_page_pin()
>>>>>>>>>>> 	lock_page(page);
>>>>>>>>>>> 	wait_for_stable_page();
>>>>>>>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>>>>>>>> 	unlock_page(page);
>>>>>>>>>>>
>>>>>>>>>>> And if we perform page_pinned() check under page lock, then if
>>>>>>>>>>> page_pinned() returned false, we are sure page is not and will not be
>>>>>>>>>>> pinned until we drop the page lock (and also until page writeback is
>>>>>>>>>>> completed if needed).
>>>>>>>>>>
>>>>>>>>>> After some more though, why do we even need wait_for_stable_page() and
>>>>>>>>>> lock_page() in get_page_pin()?
>>>>>>>>>>
>>>>>>>>>> During writepage page_mkclean() will write protect all page tables. So
>>>>>>>>>> there can be no new writeable GUP pins until we unlock the page as all such
>>>>>>>>>> GUPs will have to first go through fault and ->page_mkwrite() handler. And
>>>>>>>>>> that will wait on page lock and do wait_for_stable_page() for us anyway.
>>>>>>>>>> Am I just confused?
>>>>>>>>>
>>>>>>>>> Yeah, with the page lock it should synchronize on the pte, but you
>>>>>>>>> still need to check for writeback. IIRC the page is unlocked after the
>>>>>>>>> file system has queued up the write, and thus the page can be unlocked
>>>>>>>>> with writeback pending (and PageWriteback() == true), and I am not sure
>>>>>>>>> that in that state we can safely let anyone write to that page. I am
>>>>>>>>> assuming that in some cases the block device also expects stable page
>>>>>>>>> content (RAID stuff).
>>>>>>>>>
>>>>>>>>> So the PageWriteback() test is not only for racing page_mkclean()/
>>>>>>>>> test_set_page_writeback() against GUP but also for pending writeback.
>>>>>>>>
>>>>>>>> But this is prevented by wait_for_stable_page() that is already present in
>>>>>>>> ->page_mkwrite() handlers. Look:
>>>>>>>>
>>>>>>>> ->writepage()
>>>>>>>>    /* Page is locked here */
>>>>>>>>    clear_page_dirty_for_io(page)
>>>>>>>>      page_mkclean(page)
>>>>>>>>        -> page tables get writeprotected
>>>>>>>>      /* The following line will be added by our patches */
>>>>>>>>      if (page_pinned(page)) -> bounce
>>>>>>>>      TestClearPageDirty(page)
>>>>>>>>    set_page_writeback(page);
>>>>>>>>    unlock_page(page);
>>>>>>>>    ...submit_io...
>>>>>>>>
>>>>>>>> IRQ
>>>>>>>>    - IO completion
>>>>>>>>    end_page_writeback()
>>>>>>>>
>>>>>>>> So if GUP happens before page_mkclean() writeprotects corresponding PTE
>>>>>>>> (and these two actions are synchronized on the PTE lock), page_pinned()
>>>>>>>> will see the increment and report the page as pinned.
>>>>>>>>
>>>>>>>> If GUP happens after page_mkclean() writeprotects corresponding PTE, it
>>>>>>>> will fault:
>>>>>>>>    handle_mm_fault()
>>>>>>>>      do_wp_page()
>>>>>>>>        wp_page_shared()
>>>>>>>>          do_page_mkwrite()
>>>>>>>>            ->page_mkwrite() - that is block_page_mkwrite() or
>>>>>>>> 	    iomap_page_mkwrite() or whatever filesystem provides
>>>>>>>> 	  lock_page(page)
>>>>>>>>            ... prepare page ...
>>>>>>>> 	  wait_for_stable_page(page) -> this blocks until IO completes
>>>>>>>> 	    if someone cares about pages not being modified while under IO.
>>>>>>>
>>>>>>> The case I am worried about is GUP seeing a pte with the write flag set
>>>>>>> but not having locked the page yet (GUP gets the pte first, then goes
>>>>>>> from pte to page, then locks the page); it then locks the page, but the
>>>>>>> page lock can make it wait for a racing page_mkclean()...writeback that
>>>>>>> has not yet write-protected the pte GUP just read. So by the time GUP
>>>>>>> has the page locked, the pte it read might no longer have the write flag
>>>>>>> set. Hence you also need to check for writeback after taking the page
>>>>>>> lock. Alternatively you could recheck the pte after a successful
>>>>>>> try_lock on the page.
>>>>>>
>>>>>> This isn't really possible. GUP does:
>>>>>>
>>>>>> get_user_pages()
>>>>>> ...
>>>>>>    follow_page_mask()
>>>>>>    ...
>>>>>>      follow_page_pte()
>>>>>>        ptep = pte_offset_map_lock()
>>>>>>        check permissions and page sanity
>>>>>>        if (flags & FOLL_GET)
>>>>>>          get_page(page); -> this would become
>>>>>> 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>>>        pte_unmap_unlock(ptep, ptl);
>>>>>>
>>>>>> page_mkclean() on the other hand grabs the same pte lock to change the pte
>>>>>> to write-protected. So after page_mkclean() has modified the PTE we are
>>>>>> racing on for access, we are sure to either see increased _refcount or get
>>>>>> page fault from GUP.
>>>>>>
>>>>>> If we see increased _refcount, we bounce the page and are fine. If GUP
>>>>>> faults, we will wait for page lock (so wait until page is prepared for IO
>>>>>> and has PageWriteback set) while handling the fault, then enter
>>>>>> ->page_mkwrite, which will do wait_for_stable_page() -> wait for
>>>>>> outstanding writeback to complete.
>>>>>>
>>>>>> So I still conclude - no need for page lock in the GUP path at all AFAICT.
>>>>>> In fact we rely on the very same page fault vs page writeback synchronization
>>>>>> for normal user faults as well. And normal user mmap access is even nastier
>>>>>> than GUP access because the CPU reads page tables without taking PTE lock.
>>>>>
>>>>> For the "slow" GUP path you are right that you do not need a lock, as
>>>>> the page table lock gives you the ordering. For the GUP fast path you
>>>>> would need either the lock or a memory barrier together with the test
>>>>> for page writeback.
>>>>>
>>>>> Maybe an easier thing is to convert GUP fast to try to take the page
>>>>> table lock; if it fails to take the page table lock, we fall back to
>>>>> the slow GUP path. Otherwise we have the same guarantee as the slow
>>>>> path.
>>>>
>>>> You're right I was looking at the wrong place for GUP_fast() path. But I
>>>> still don't think anything special (i.e. page lock or new barrier) is
>>>> necessary. GUP_fast() takes care already now that it cannot race with page
>>>> unmapping or write-protection (as there are other places in MM that rely on
>>>> this). Look, gup_pte_range() has:
>>>>
>>>>                  if (!page_cache_get_speculative(head))
>>>>                          goto pte_unmap;
>>>>
>>>>                  if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>                          put_page(head);
>>>>                          goto pte_unmap;
>>>>                  }
>>>>
>>>> So that page_cache_get_speculative() will become
>>>> page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
>>>> of 1. That is atomic ordered operation so it cannot be reordered with the
>>>> following check that PTE stayed same. So once page_mkclean() write-protects
>>>> PTE, there can be no new pins from GUP_fast() and we are sure all
>>>> succeeding pins are visible in page->_refcount after page_mkclean()
>>>> completes. Again this is nothing new, other mm code already relies on
>>>> either seeing page->_refcount incremented or GUP fast bailing out (e.g. DAX
>>>> relies on this). Although strictly speaking I'm not 100% sure what prevents
>>>> page->_refcount load from being speculatively reordered before the PTE update even
>>>> in current places using this but there's so much stuff inbetween that
>>>> there's probably something ;). But we could add smp_rmb() after
>>>> page_mkclean() before changing page_pinned() for the peace of mind I guess.
>>>
>>> Yeah, I think you are right; I missed the check for the same pte value,
>>> and the atomic inc in page_cache_get_speculative() is a barrier.
>>> I do not think the extra barrier would be necessary, as page_mkclean
>>> is taking and dropping locks, so those should provide enough barriers.
>>>
>>
>> Hi Jan, Jerome,
>>
>> OK, this seems to be up and running locally, but while putting together
>> documentation and polishing up things, I noticed that there is one last piece
>> that I don't quite understand, after all. The page_cache_get_speculative()
>> existing documentation explains how refcount synchronizes these things, but I
>> don't see how that helps with synchronizing page_mkclean and gup, in this
>> situation:
>>
>>      gup_fast gets the refcount and rechecks the pte hasn't changed
>>
>>      meanwhile, page_mkclean...wait, how does refcount come into play here?
>>      page_mkclean can remove the mapping and insert a write-protected pte,
>>      regardless of page refcount, correct?  Help? :)
> 
> Correct, page_mkclean() does not check the refcount and does not need
> to check it. We need to check for the page pin after page_mkclean(),
> when the code is done prepping the page for IO (clear_page_dirty_for_io).
> 
> The race Jan and I were discussing was about whether we needed to lock
> the page or not, and we do not. For the slow path, page_mkclean and
> GUP_slow will synchronize on the page table lock. For GUP_fast, the fast
> code will back off if the pte is not the same, and thus either we see
> the pin after page_mkclean() or GUP_fast backs off. You will never have
> code that misses the pin after page_mkclean() with a GUP_fast that did
> not back off.

Here is the case I'm wondering about:

thread A                             thread B
--------                             --------
                                      gup_fast
page_mkclean
     is page gup-pinned?(no)
                                          page_cache_get_speculative
                                              (gup-pins the page here)
                                          check pte_val unchanged (yes)
        set_pte_at()

...and now thread A has created a read-only PTE, after gup_fast walked
the page tables and found a writeable entry. And so far, thread A has
not seen that the page is pinned.

What am I missing here? The above seems like a problem even before we
change anything.

> 
> Now, page_cache_get_speculative() is for another race, when a page is
> freed concurrently: it only increments the refcount if the page is not
> already freed, i.e. refcount != 0. So GUP_fast has two exclusion
> mechanisms: one for racing modifications to the page table, like
> page_mkclean (pte unchanged after incrementing the refcount), and one
> for racing put_page (only increment the refcount if it is not 0). For
> what we want here, we just modify this second mechanism to add the bias
> value, not just 1, to the refcount. This keeps both mechanisms intact
> and gives us the page pin test through the refcount bias value.
> 
> Note that page_mkclean cannot race with put_page(), as whoever calls
> page_mkclean already holds a reference on the page, and thus no
> put_page can free the page.
> 
> Does that help?

Yes...getting close... :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-29  0:22                                                                                         ` John Hubbard
@ 2019-01-29  1:23                                                                                           ` Jerome Glisse
  2019-01-29  6:41                                                                                             ` John Hubbard
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2019-01-29  1:23 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Mon, Jan 28, 2019 at 04:22:16PM -0800, John Hubbard wrote:
> On 1/23/19 11:04 AM, Jerome Glisse wrote:
> > On Wed, Jan 23, 2019 at 07:02:30PM +0100, Jan Kara wrote:
> >> On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
> >>> On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
> >>>> On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> >>>>> On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> >>>>>> On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> >>>>>>> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> >>>>>>>> On Tue 15-01-19 09:07:59, Jan Kara wrote:
> >>>>>>>>> Agreed. So with page lock it would actually look like:
> >>>>>>>>>
> >>>>>>>>> get_page_pin()
> >>>>>>>>> 	lock_page(page);
> >>>>>>>>> 	wait_for_stable_page();
> >>>>>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>>>>>>> 	unlock_page(page);
> >>>>>>>>>
> >>>>>>>>> And if we perform page_pinned() check under page lock, then if
> >>>>>>>>> page_pinned() returned false, we are sure page is not and will not be
> >>>>>>>>> pinned until we drop the page lock (and also until page writeback is
> >>>>>>>>> completed if needed).
> >>>>>>>>
> >>>>>>>> After some more though, why do we even need wait_for_stable_page() and
> >>>>>>>> lock_page() in get_page_pin()?
> >>>>>>>>
> >>>>>>>> During writepage page_mkclean() will write protect all page tables. So
> >>>>>>>> there can be no new writeable GUP pins until we unlock the page as all such
> >>>>>>>> GUPs will have to first go through fault and ->page_mkwrite() handler. And
> >>>>>>>> that will wait on page lock and do wait_for_stable_page() for us anyway.
> >>>>>>>> Am I just confused?
> >>>>>>>
> >>>>>>> Yeah, with the page lock it should synchronize on the pte, but you
> >>>>>>> still need to check for writeback. IIRC the page is unlocked after the
> >>>>>>> file system has queued up the write, and thus the page can be unlocked
> >>>>>>> with writeback pending (and PageWriteback() == true), and I am not sure
> >>>>>>> that in that state we can safely let anyone write to that page. I am
> >>>>>>> assuming that in some cases the block device also expects stable page
> >>>>>>> content (RAID stuff).
> >>>>>>>
> >>>>>>> So the PageWriteback() test is not only for racing page_mkclean()/
> >>>>>>> test_set_page_writeback() against GUP but also for pending writeback.
> >>>>>>
> >>>>>> But this is prevented by wait_for_stable_page() that is already present in
> >>>>>> ->page_mkwrite() handlers. Look:
> >>>>>>
> >>>>>> ->writepage()
> >>>>>>   /* Page is locked here */
> >>>>>>   clear_page_dirty_for_io(page)
> >>>>>>     page_mkclean(page)
> >>>>>>       -> page tables get writeprotected
> >>>>>>     /* The following line will be added by our patches */
> >>>>>>     if (page_pinned(page)) -> bounce
> >>>>>>     TestClearPageDirty(page)
> >>>>>>   set_page_writeback(page);
> >>>>>>   unlock_page(page);
> >>>>>>   ...submit_io...
> >>>>>>
> >>>>>> IRQ
> >>>>>>   - IO completion
> >>>>>>   end_page_writeback()
> >>>>>>
> >>>>>> So if GUP happens before page_mkclean() writeprotects corresponding PTE
> >>>>>> (and these two actions are synchronized on the PTE lock), page_pinned()
> >>>>>> will see the increment and report the page as pinned.
> >>>>>>
> >>>>>> If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> >>>>>> will fault:
> >>>>>>   handle_mm_fault()
> >>>>>>     do_wp_page()
> >>>>>>       wp_page_shared()
> >>>>>>         do_page_mkwrite()
> >>>>>>           ->page_mkwrite() - that is block_page_mkwrite() or
> >>>>>> 	    iomap_page_mkwrite() or whatever filesystem provides
> >>>>>> 	  lock_page(page)
> >>>>>>           ... prepare page ...
> >>>>>> 	  wait_for_stable_page(page) -> this blocks until IO completes
> >>>>>> 	    if someone cares about pages not being modified while under IO.
> >>>>>
> >>>>> The case I am worried about is GUP seeing a pte with the write flag set
> >>>>> but not having locked the page yet (GUP gets the pte first, then goes
> >>>>> from pte to page, then locks the page); it then locks the page, but the
> >>>>> page lock can make it wait for a racing page_mkclean()...writeback that
> >>>>> has not yet write-protected the pte GUP just read. So by the time GUP
> >>>>> has the page locked, the pte it read might no longer have the write flag
> >>>>> set. Hence you also need to check for writeback after taking the page
> >>>>> lock. Alternatively you could recheck the pte after a successful
> >>>>> try_lock on the page.
> >>>>
> >>>> This isn't really possible. GUP does:
> >>>>
> >>>> get_user_pages()
> >>>> ...
> >>>>   follow_page_mask()
> >>>>   ...
> >>>>     follow_page_pte()
> >>>>       ptep = pte_offset_map_lock()
> >>>>       check permissions and page sanity
> >>>>       if (flags & FOLL_GET)
> >>>>         get_page(page); -> this would become
> >>>> 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>>       pte_unmap_unlock(ptep, ptl);
> >>>>
> >>>> page_mkclean() on the other hand grabs the same pte lock to change the pte
> >>>> to write-protected. So after page_mkclean() has modified the PTE we are
> >>>> racing on for access, we are sure to either see increased _refcount or get
> >>>> page fault from GUP.
> >>>>
> >>>> If we see increased _refcount, we bounce the page and are fine. If GUP
> >>>> faults, we will wait for page lock (so wait until page is prepared for IO
> >>>> and has PageWriteback set) while handling the fault, then enter
> >>>> ->page_mkwrite, which will do wait_for_stable_page() -> wait for
> >>>> outstanding writeback to complete.
> >>>>
> >>>> So I still conclude - no need for page lock in the GUP path at all AFAICT.
> >>>> In fact we rely on the very same page fault vs page writeback synchronization
> >>>> for normal user faults as well. And normal user mmap access is even nastier
> >>>> than GUP access because the CPU reads page tables without taking PTE lock.
> >>>
> >>> For the "slow" GUP path you are right that you do not need a lock, as
> >>> the page table lock gives you the ordering. For the GUP fast path you
> >>> would need either the lock or a memory barrier together with the test
> >>> for page writeback.
> >>>
> >>> Maybe an easier thing is to convert GUP fast to try to take the page
> >>> table lock; if it fails to take the page table lock, we fall back to
> >>> the slow GUP path. Otherwise we have the same guarantee as the slow
> >>> path.
> >>
> >> You're right I was looking at the wrong place for GUP_fast() path. But I
> >> still don't think anything special (i.e. page lock or new barrier) is
> >> necessary. GUP_fast() takes care already now that it cannot race with page
> >> unmapping or write-protection (as there are other places in MM that rely on
> >> this). Look, gup_pte_range() has:
> >>
> >>                 if (!page_cache_get_speculative(head))
> >>                         goto pte_unmap;
> >>
> >>                 if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >>                         put_page(head);
> >>                         goto pte_unmap;
> >>                 }
> >>
> >> So that page_cache_get_speculative() will become
> >> page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
> >> of 1. That is atomic ordered operation so it cannot be reordered with the
> >> following check that PTE stayed same. So once page_mkclean() write-protects
> >> PTE, there can be no new pins from GUP_fast() and we are sure all
> >> succeeding pins are visible in page->_refcount after page_mkclean()
> >> completes. Again this is nothing new, other mm code already relies on
> >> either seeing page->_refcount incremented or GUP fast bailing out (e.g. DAX
> >> relies on this). Although strictly speaking I'm not 100% sure what prevents
> >> page->_refcount load from being speculatively reordered before the PTE update even
> >> in current places using this but there's so much stuff inbetween that
> >> there's probably something ;). But we could add smp_rmb() after
> >> page_mkclean() before changing page_pinned() for the peace of mind I guess.
> > 
> > Yeah, I think you are right; I missed the check for the same pte value,
> > and the atomic inc in page_cache_get_speculative() is a barrier.
> > I do not think the extra barrier would be necessary, as page_mkclean
> > is taking and dropping locks, so those should provide enough barriers.
> > 
> 
> Hi Jan, Jerome,
> 
> OK, this seems to be up and running locally, but while putting together 
> documentation and polishing up things, I noticed that there is one last piece 
> that I don't quite understand, after all. The page_cache_get_speculative() 
> existing documentation explains how refcount synchronizes these things, but I
> don't see how that helps with synchronizing page_mkclean and gup, in this
> situation:
> 
>     gup_fast gets the refcount and rechecks the pte hasn't changed
> 
>     meanwhile, page_mkclean...wait, how does refcount come into play here?
>     page_mkclean can remove the mapping and insert a write-protected pte, 
>     regardless of page refcount, correct?  Help? :)

Correct, page_mkclean() does not check the refcount and does not need
to check it. We need to check for the page pin after page_mkclean(),
when the code is done prepping the page for IO (clear_page_dirty_for_io).

The race Jan and I were discussing was about whether we needed to lock
the page or not, and we do not. For the slow path, page_mkclean and
GUP_slow will synchronize on the page table lock. For GUP_fast, the fast
code will back off if the pte is not the same, and thus either we see
the pin after page_mkclean() or GUP_fast backs off. You will never have
code that misses the pin after page_mkclean() with a GUP_fast that did
not back off.

Now, page_cache_get_speculative() is for another race, when a page is
freed concurrently. page_cache_get_speculative() only increments the
refcount if the page is not already freed, i.e. refcount != 0. So GUP_fast
has 2 exclusion mechanisms: one for racing modifications to the page table
like page_mkclean() (pte unchanged after incrementing the refcount) and one
for racing put_page() (only increment the refcount if it is not 0). For
what we want here, we just modify this second mechanism to add the bias
value, not just 1, to the refcount. This keeps both mechanisms intact
and gives us the page pin test through the refcount bias value.

Note that page_mkclean() cannot race with a put_page() freeing the page,
as whoever calls page_mkclean() already holds a reference on the page and
thus no put_page() can free it.

Does that help?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-23 19:04                                                                                       ` Jerome Glisse
@ 2019-01-29  0:22                                                                                         ` John Hubbard
  2019-01-29  1:23                                                                                           ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: John Hubbard @ 2019-01-29  0:22 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/23/19 11:04 AM, Jerome Glisse wrote:
> On Wed, Jan 23, 2019 at 07:02:30PM +0100, Jan Kara wrote:
>> On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
>>> On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
>>>> On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
>>>>> On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
>>>>>> On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
>>>>>>> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
>>>>>>>> On Tue 15-01-19 09:07:59, Jan Kara wrote:
>>>>>>>>> Agreed. So with page lock it would actually look like:
>>>>>>>>>
>>>>>>>>> get_page_pin()
>>>>>>>>> 	lock_page(page);
>>>>>>>>> 	wait_for_stable_page();
>>>>>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>>>>>> 	unlock_page(page);
>>>>>>>>>
>>>>>>>>> And if we perform page_pinned() check under page lock, then if
>>>>>>>>> page_pinned() returned false, we are sure page is not and will not be
>>>>>>>>> pinned until we drop the page lock (and also until page writeback is
>>>>>>>>> completed if needed).
>>>>>>>>
>>>>>>>> After some more though, why do we even need wait_for_stable_page() and
>>>>>>>> lock_page() in get_page_pin()?
>>>>>>>>
>>>>>>>> During writepage page_mkclean() will write protect all page tables. So
>>>>>>>> there can be no new writeable GUP pins until we unlock the page as all such
>>>>>>>> GUPs will have to first go through fault and ->page_mkwrite() handler. And
>>>>>>>> that will wait on page lock and do wait_for_stable_page() for us anyway.
>>>>>>>> Am I just confused?
>>>>>>>
>>>>>>> Yeah with page lock it should synchronize on the pte but you still
>>>>>>> need to check for writeback iirc the page is unlocked after file
>>>>>>> system has queue up the write and thus the page can be unlock with
>>>>>>> write back pending (and PageWriteback() == trye) and i am not sure
>>>>>>> that in that states we can safely let anyone write to that page. I
>>>>>>> am assuming that in some case the block device also expect stable
>>>>>>> page content (RAID stuff).
>>>>>>>
>>>>>>> So the PageWriteback() test is not only for racing page_mkclean()/
>>>>>>> test_set_page_writeback() and GUP but also for pending write back.
>>>>>>
>>>>>> But this is prevented by wait_for_stable_page() that is already present in
>>>>>> ->page_mkwrite() handlers. Look:
>>>>>>
>>>>>> ->writepage()
>>>>>>   /* Page is locked here */
>>>>>>   clear_page_dirty_for_io(page)
>>>>>>     page_mkclean(page)
>>>>>>       -> page tables get writeprotected
>>>>>>     /* The following line will be added by our patches */
>>>>>>     if (page_pinned(page)) -> bounce
>>>>>>     TestClearPageDirty(page)
>>>>>>   set_page_writeback(page);
>>>>>>   unlock_page(page);
>>>>>>   ...submit_io...
>>>>>>
>>>>>> IRQ
>>>>>>   - IO completion
>>>>>>   end_page_writeback()
>>>>>>
>>>>>> So if GUP happens before page_mkclean() writeprotects corresponding PTE
>>>>>> (and these two actions are synchronized on the PTE lock), page_pinned()
>>>>>> will see the increment and report the page as pinned.
>>>>>>
>>>>>> If GUP happens after page_mkclean() writeprotects corresponding PTE, it
>>>>>> will fault:
>>>>>>   handle_mm_fault()
>>>>>>     do_wp_page()
>>>>>>       wp_page_shared()
>>>>>>         do_page_mkwrite()
>>>>>>           ->page_mkwrite() - that is block_page_mkwrite() or
>>>>>> 	    iomap_page_mkwrite() or whatever filesystem provides
>>>>>> 	  lock_page(page)
>>>>>>           ... prepare page ...
>>>>>> 	  wait_for_stable_page(page) -> this blocks until IO completes
>>>>>> 	    if someone cares about pages not being modified while under IO.
>>>>>
>>>>> The case i am worried is GUP see pte with write flag set but has not
>>>>> lock the page yet (GUP is get pte first, then pte to page then lock
>>>>> page), then it locks the page but the lock page can make it wait for a
>>>>> racing page_mkclean()...write back that have not yet write protected
>>>>> the pte the GUP just read. So by the time GUP has the page locked the
>>>>> pte it read might no longer have the write flag set. Hence why you need
>>>>> to also check for write back after taking the page lock. Alternatively
>>>>> you could recheck the pte after a successful try_lock on the page.
>>>>
>>>> This isn't really possible. GUP does:
>>>>
>>>> get_user_pages()
>>>> ...
>>>>   follow_page_mask()
>>>>   ...
>>>>     follow_page_pte()
>>>>       ptep = pte_offset_map_lock()
>>>>       check permissions and page sanity
>>>>       if (flags & FOLL_GET)
>>>>         get_page(page); -> this would become
>>>> 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>       pte_unmap_unlock(ptep, ptl);
>>>>
>>>> page_mkclean() on the other hand grabs the same pte lock to change the pte
>>>> to write-protected. So after page_mkclean() has modified the PTE we are
>>>> racing on for access, we are sure to either see increased _refcount or get
>>>> page fault from GUP.
>>>>
>>>> If we see increased _refcount, we bounce the page and are fine. If GUP
>>>> faults, we will wait for page lock (so wait until page is prepared for IO
>>>> and has PageWriteback set) while handling the fault, then enter
>>>> ->page_mkwrite, which will do wait_for_stable_page() -> wait for
>>>> outstanding writeback to complete.
>>>>
>>>> So I still conclude - no need for page lock in the GUP path at all AFAICT.
>>>> In fact we rely on the very same page fault vs page writeback synchronization
>>>> for normal user faults as well. And normal user mmap access is even nastier
>>>> than GUP access because the CPU reads page tables without taking PTE lock.
>>>
>>> For the "slow" GUP path you are right you do not need a lock as the
>>> page table lock give you the ordering. For the GUP fast path you
>>> would either need the lock or the memory barrier with the test for
>>> page write back.
>>>
>>> Maybe an easier thing is to convert GUP fast to try to take the page
>>> table lock if it fails taking the page table lock then we fall back
>>> to slow GUP path. Otherwise then we have the same garantee as the slow
>>> path.
>>
>> You're right I was looking at the wrong place for GUP_fast() path. But I
>> still don't think anything special (i.e. page lock or new barrier) is
>> necessary. GUP_fast() takes care already now that it cannot race with page
>> unmapping or write-protection (as there are other places in MM that rely on
>> this). Look, gup_pte_range() has:
>>
>>                 if (!page_cache_get_speculative(head))
>>                         goto pte_unmap;
>>
>>                 if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>                         put_page(head);
>>                         goto pte_unmap;
>>                 }
>>
>> So that page_cache_get_speculative() will become
>> page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
>> of 1. That is atomic ordered operation so it cannot be reordered with the
>> following check that PTE stayed same. So once page_mkclean() write-protects
>> PTE, there can be no new pins from GUP_fast() and we are sure all
>> succeeding pins are visible in page->_refcount after page_mkclean()
>> completes. Again this is nothing new, other mm code already relies on
>> either seeing page->_refcount incremented or GUP fast bailing out (e.g. DAX
>> relies on this). Although strictly speaking I'm not 100% sure what prevents
>> page->_refcount load to be speculatively reordered before PTE update even
>> in current places using this but there's so much stuff inbetween that
>> there's probably something ;). But we could add smp_rmb() after
>> page_mkclean() before changing page_pinned() for the peace of mind I guess.
> 
> Yeah i think you are right, i missed the check on same pte value
> and the atomic inc in page_cache_get_speculative() is a barrier.
> I do not think the barrier would be necessary as page_mkclean is
> taking and dropping locks so those should have enough barriering.
> 

Hi Jan, Jerome,

OK, this seems to be up and running locally, but while putting together 
documentation and polishing up things, I noticed that there is one last piece 
that I don't quite understand, after all. The existing
page_cache_get_speculative() documentation explains how the refcount
synchronizes these things, but I don't see how that helps with
synchronizing page_mkclean and gup, in this situation:

    gup_fast gets the refcount and rechecks the pte hasn't changed

    meanwhile, page_mkclean...wait, how does refcount come into play here?
    page_mkclean can remove the mapping and insert a write-protected pte, 
    regardless of page refcount, correct?  Help? :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-23 18:02                                                                                     ` Jan Kara
@ 2019-01-23 19:04                                                                                       ` Jerome Glisse
  2019-01-29  0:22                                                                                         ` John Hubbard
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2019-01-23 19:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 23, 2019 at 07:02:30PM +0100, Jan Kara wrote:
> On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
> > On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
> > > On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> > > > On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> > > > > On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > > > > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > > > > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > > > > > Agreed. So with page lock it would actually look like:
> > > > > > > > 
> > > > > > > > get_page_pin()
> > > > > > > > 	lock_page(page);
> > > > > > > > 	wait_for_stable_page();
> > > > > > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > > > > 	unlock_page(page);
> > > > > > > > 
> > > > > > > > And if we perform page_pinned() check under page lock, then if
> > > > > > > > page_pinned() returned false, we are sure page is not and will not be
> > > > > > > > pinned until we drop the page lock (and also until page writeback is
> > > > > > > > completed if needed).
> > > > > > > 
> > > > > > > After some more though, why do we even need wait_for_stable_page() and
> > > > > > > lock_page() in get_page_pin()?
> > > > > > > 
> > > > > > > During writepage page_mkclean() will write protect all page tables. So
> > > > > > > there can be no new writeable GUP pins until we unlock the page as all such
> > > > > > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > > > > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > > > > > Am I just confused?
> > > > > > 
> > > > > > Yeah with page lock it should synchronize on the pte but you still
> > > > > > need to check for writeback iirc the page is unlocked after file
> > > > > > system has queue up the write and thus the page can be unlock with
> > > > > > write back pending (and PageWriteback() == trye) and i am not sure
> > > > > > that in that states we can safely let anyone write to that page. I
> > > > > > am assuming that in some case the block device also expect stable
> > > > > > page content (RAID stuff).
> > > > > > 
> > > > > > So the PageWriteback() test is not only for racing page_mkclean()/
> > > > > > test_set_page_writeback() and GUP but also for pending write back.
> > > > > 
> > > > > But this is prevented by wait_for_stable_page() that is already present in
> > > > > ->page_mkwrite() handlers. Look:
> > > > > 
> > > > > ->writepage()
> > > > >   /* Page is locked here */
> > > > >   clear_page_dirty_for_io(page)
> > > > >     page_mkclean(page)
> > > > >       -> page tables get writeprotected
> > > > >     /* The following line will be added by our patches */
> > > > >     if (page_pinned(page)) -> bounce
> > > > >     TestClearPageDirty(page)
> > > > >   set_page_writeback(page);
> > > > >   unlock_page(page);
> > > > >   ...submit_io...
> > > > > 
> > > > > IRQ
> > > > >   - IO completion
> > > > >   end_page_writeback()
> > > > > 
> > > > > So if GUP happens before page_mkclean() writeprotects corresponding PTE
> > > > > (and these two actions are synchronized on the PTE lock), page_pinned()
> > > > > will see the increment and report the page as pinned.
> > > > > 
> > > > > If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> > > > > will fault:
> > > > >   handle_mm_fault()
> > > > >     do_wp_page()
> > > > >       wp_page_shared()
> > > > >         do_page_mkwrite()
> > > > >           ->page_mkwrite() - that is block_page_mkwrite() or
> > > > > 	    iomap_page_mkwrite() or whatever filesystem provides
> > > > > 	  lock_page(page)
> > > > >           ... prepare page ...
> > > > > 	  wait_for_stable_page(page) -> this blocks until IO completes
> > > > > 	    if someone cares about pages not being modified while under IO.
> > > > 
> > > > The case i am worried is GUP see pte with write flag set but has not
> > > > lock the page yet (GUP is get pte first, then pte to page then lock
> > > > page), then it locks the page but the lock page can make it wait for a
> > > > racing page_mkclean()...write back that have not yet write protected
> > > > the pte the GUP just read. So by the time GUP has the page locked the
> > > > pte it read might no longer have the write flag set. Hence why you need
> > > > to also check for write back after taking the page lock. Alternatively
> > > > you could recheck the pte after a successful try_lock on the page.
> > > 
> > > This isn't really possible. GUP does:
> > > 
> > > get_user_pages()
> > > ...
> > >   follow_page_mask()
> > >   ...
> > >     follow_page_pte()
> > >       ptep = pte_offset_map_lock()
> > >       check permissions and page sanity
> > >       if (flags & FOLL_GET)
> > >         get_page(page); -> this would become
> > > 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > >       pte_unmap_unlock(ptep, ptl);
> > > 
> > > page_mkclean() on the other hand grabs the same pte lock to change the pte
> > > to write-protected. So after page_mkclean() has modified the PTE we are
> > > racing on for access, we are sure to either see increased _refcount or get
> > > page fault from GUP.
> > > 
> > > If we see increased _refcount, we bounce the page and are fine. If GUP
> > > faults, we will wait for page lock (so wait until page is prepared for IO
> > > and has PageWriteback set) while handling the fault, then enter
> > > ->page_mkwrite, which will do wait_for_stable_page() -> wait for
> > > outstanding writeback to complete.
> > > 
> > > So I still conclude - no need for page lock in the GUP path at all AFAICT.
> > > In fact we rely on the very same page fault vs page writeback synchronization
> > > for normal user faults as well. And normal user mmap access is even nastier
> > > than GUP access because the CPU reads page tables without taking PTE lock.
> > 
> > For the "slow" GUP path you are right you do not need a lock as the
> > page table lock give you the ordering. For the GUP fast path you
> > would either need the lock or the memory barrier with the test for
> > page write back.
> > 
> > Maybe an easier thing is to convert GUP fast to try to take the page
> > table lock if it fails taking the page table lock then we fall back
> > to slow GUP path. Otherwise then we have the same garantee as the slow
> > path.
> 
> You're right I was looking at the wrong place for GUP_fast() path. But I
> still don't think anything special (i.e. page lock or new barrier) is
> necessary. GUP_fast() takes care already now that it cannot race with page
> unmapping or write-protection (as there are other places in MM that rely on
> this). Look, gup_pte_range() has:
> 
>                 if (!page_cache_get_speculative(head))
>                         goto pte_unmap;
> 
>                 if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>                         put_page(head);
>                         goto pte_unmap;
>                 }
> 
> So that page_cache_get_speculative() will become
> page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
> of 1. That is atomic ordered operation so it cannot be reordered with the
> following check that PTE stayed same. So once page_mkclean() write-protects
> PTE, there can be no new pins from GUP_fast() and we are sure all
> succeeding pins are visible in page->_refcount after page_mkclean()
> completes. Again this is nothing new, other mm code already relies on
> either seeing page->_refcount incremented or GUP fast bailing out (e.g. DAX
> relies on this). Although strictly speaking I'm not 100% sure what prevents
> page->_refcount load to be speculatively reordered before PTE update even
> in current places using this but there's so much stuff inbetween that
> there's probably something ;). But we could add smp_rmb() after
> page_mkclean() before changing page_pinned() for the peace of mind I guess.

Yeah i think you are right, i missed the check on same pte value
and the atomic inc in page_cache_get_speculative() is a barrier.
I do not think the barrier would be necessary as page_mkclean is
taking and dropping locks so those should have enough barriering.

> 
> > The issue is that i am not sure if the page table directory page and
> > it's associated spinlock can go in bad state if the directory is being
> > freed (like a racing munmap). This would need to be check. A scheme
> > that might protect against that is to take the above lock of each level
> > before going down one level. Once you are down one level you can unlock
> > the above level. So at any point in time GUP fast holds the lock to a
> > current and valid directory and thus no one could race to remove it.
> > 
> >     GUP_fast()
> >       gup_pgd_range()
> >         if (p4d_try_map_lock()) {
> >           gup_p4d_range()
> >           if (pud_try_map_lock()) {
> >             p4d_unlock();
> >             gup_pud_range();
> >               if (pmd_try_map_lock()) {
> >                 pud_unlock();
> >                 gup_pmd_range();
> >                   if (pte_try_map_lock()) {
> >                     pmd_unlock();
> >                     // Do gup
> >                   }
> >              }
> >           }
> >        }
> > 
> > Maybe this is worse than taking the mmap_sem and checking for vma.
> > 
> > 
> > > > > > > That actually touches on another question I wanted to get opinions on. GUP
> > > > > > > can be for read and GUP can be for write (that is one of GUP flags).
> > > > > > > Filesystems with page cache generally have issues only with GUP for write
> > > > > > > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > > > > > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > > > > > memory hotplug will just loop in kernel until the page gets unpinned). So
> > > > > > > we probably want to track both types of GUP pins and page-cache based
> > > > > > > filesystems will take the hit even if they don't have to for read-pins?
> > > > > > 
> > > > > > Yes the distinction between read and write would be nice. With the map
> > > > > > count solution you can only increment the mapcount for GUP(write=true).
> > > > > 
> > > > > Well, but if we track only pins for write, DAX or memory hotplug will not
> > > > > be able to use this mechanism. So at this point I'm more leaning towards
> > > > > tracking all pins. It will cost some performance needlessly for read pins
> > > > > and filesystems using page cache when bouncing such pages but it's not like
> > > > > writeback of pinned pages is some performance critical operation... But I
> > > > > wanted to spell this out so that people are aware of this.
> > > > 
> > > > No they would know for regular pin, it is just as page migrate code. If
> > > > the refcount + (extra_ref_by_the_code_checking) > mapcount then you know
> > > > someone has extra reference on your page.
> > > > 
> > > > Those extra references are either some regular fs event taking place (some
> > > > code doing find_get_page for instance) or a GUP reference (wether it is a
> > > > write pin or a read pin).
> > > > 
> > > > So the only issue is false positive, ie thinking the page is under GUP
> > > > while it has just elevated refcount because of some other regular fs/mm
> > > > event. To minimize false positive for a more accurate pin test (write or
> > > > read) you can enforce few thing:
> > > > 
> > > >     1 - first page lock
> > > >     2 - then freeze the page with expected counted
> > > > 
> > > > With that it should minimize false positive. In the end even with the bias
> > > > case you can also have false positive.
> > > 
> > > So this is basically what the code is currently doing. And for DAX it works
> > > well since the page is being truncated and so essentially nobody is
> > > touching it. But for hotplug it doesn't work quite well - hotplug would
> > > like to return EBUSY to userspace when the page is pinned but retry if the
> > > page reference is just transient.
> > 
> > I do not think there is anyway around transient refcount other
> > than periodicaly check. Maybe hot unplug (i am assuming we are
> > talking about unplug here) can set the reserved page flag and
> > we can change the page get ref to never inc refcount for page
> > with reserved flag. I see that they have been cleanup on going
> > around reserved page so it might or might not be possible.
> 
> Well, there's no problem with transient refcount. The code is fine with
> retrying in that case. The problem is when the refcount actually is *not*
> transient and so hot-unplug ends up retrying forever (or for very long
> time). In that case it should rather bail with EBUSY but currently there's
> no way to distinguish transient from longer term pin.

Yes, and I was trying to point out that we can first check for a
GUP read pin; if we have a doubt about a read-pin false positive,
we can do a heavier check the second time around, and if it is
still positive just return -EBUSY or something to user space so
that we do not retry forever in the kernel.

So not using the bias for read GUP looks ok to me (i.e. we should
still be able to differentiate a transient refcount from a read
GUP pin). But this can be done as an optimization if the page
bouncing for read GUP ever hurts performance too much.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-22 16:46                                                                                   ` Jerome Glisse
@ 2019-01-23 18:02                                                                                     ` Jan Kara
  2019-01-23 19:04                                                                                       ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: Jan Kara @ 2019-01-23 18:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 22-01-19 11:46:13, Jerome Glisse wrote:
> On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
> > On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> > > On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> > > > On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > > > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > > > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > > > > Agreed. So with page lock it would actually look like:
> > > > > > > 
> > > > > > > get_page_pin()
> > > > > > > 	lock_page(page);
> > > > > > > 	wait_for_stable_page();
> > > > > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > > > 	unlock_page(page);
> > > > > > > 
> > > > > > > And if we perform page_pinned() check under page lock, then if
> > > > > > > page_pinned() returned false, we are sure page is not and will not be
> > > > > > > pinned until we drop the page lock (and also until page writeback is
> > > > > > > completed if needed).
> > > > > > 
> > > > > > After some more though, why do we even need wait_for_stable_page() and
> > > > > > lock_page() in get_page_pin()?
> > > > > > 
> > > > > > During writepage page_mkclean() will write protect all page tables. So
> > > > > > there can be no new writeable GUP pins until we unlock the page as all such
> > > > > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > > > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > > > > Am I just confused?
> > > > > 
> > > > > Yeah with page lock it should synchronize on the pte but you still
> > > > > need to check for writeback iirc the page is unlocked after file
> > > > > system has queue up the write and thus the page can be unlock with
> > > > > write back pending (and PageWriteback() == trye) and i am not sure
> > > > > that in that states we can safely let anyone write to that page. I
> > > > > am assuming that in some case the block device also expect stable
> > > > > page content (RAID stuff).
> > > > > 
> > > > > So the PageWriteback() test is not only for racing page_mkclean()/
> > > > > test_set_page_writeback() and GUP but also for pending write back.
> > > > 
> > > > But this is prevented by wait_for_stable_page() that is already present in
> > > > ->page_mkwrite() handlers. Look:
> > > > 
> > > > ->writepage()
> > > >   /* Page is locked here */
> > > >   clear_page_dirty_for_io(page)
> > > >     page_mkclean(page)
> > > >       -> page tables get writeprotected
> > > >     /* The following line will be added by our patches */
> > > >     if (page_pinned(page)) -> bounce
> > > >     TestClearPageDirty(page)
> > > >   set_page_writeback(page);
> > > >   unlock_page(page);
> > > >   ...submit_io...
> > > > 
> > > > IRQ
> > > >   - IO completion
> > > >   end_page_writeback()
> > > > 
> > > > So if GUP happens before page_mkclean() writeprotects corresponding PTE
> > > > (and these two actions are synchronized on the PTE lock), page_pinned()
> > > > will see the increment and report the page as pinned.
> > > > 
> > > > If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> > > > will fault:
> > > >   handle_mm_fault()
> > > >     do_wp_page()
> > > >       wp_page_shared()
> > > >         do_page_mkwrite()
> > > >           ->page_mkwrite() - that is block_page_mkwrite() or
> > > > 	    iomap_page_mkwrite() or whatever filesystem provides
> > > > 	  lock_page(page)
> > > >           ... prepare page ...
> > > > 	  wait_for_stable_page(page) -> this blocks until IO completes
> > > > 	    if someone cares about pages not being modified while under IO.
> > > 
> > > The case i am worried is GUP see pte with write flag set but has not
> > > lock the page yet (GUP is get pte first, then pte to page then lock
> > > page), then it locks the page but the lock page can make it wait for a
> > > racing page_mkclean()...write back that have not yet write protected
> > > the pte the GUP just read. So by the time GUP has the page locked the
> > > pte it read might no longer have the write flag set. Hence why you need
> > > to also check for write back after taking the page lock. Alternatively
> > > you could recheck the pte after a successful try_lock on the page.
> > 
> > This isn't really possible. GUP does:
> > 
> > get_user_pages()
> > ...
> >   follow_page_mask()
> >   ...
> >     follow_page_pte()
> >       ptep = pte_offset_map_lock()
> >       check permissions and page sanity
> >       if (flags & FOLL_GET)
> >         get_page(page); -> this would become
> > 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >       pte_unmap_unlock(ptep, ptl);
> > 
> > page_mkclean() on the other hand grabs the same pte lock to change the pte
> > to write-protected. So after page_mkclean() has modified the PTE we are
> > racing on for access, we are sure to either see increased _refcount or get
> > page fault from GUP.
> > 
> > If we see increased _refcount, we bounce the page and are fine. If GUP
> > faults, we will wait for page lock (so wait until page is prepared for IO
> > and has PageWriteback set) while handling the fault, then enter
> > ->page_mkwrite, which will do wait_for_stable_page() -> wait for
> > outstanding writeback to complete.
> > 
> > So I still conclude - no need for page lock in the GUP path at all AFAICT.
> > In fact we rely on the very same page fault vs page writeback synchronization
> > for normal user faults as well. And normal user mmap access is even nastier
> > than GUP access because the CPU reads page tables without taking PTE lock.
> 
> For the "slow" GUP path you are right you do not need a lock as the
> page table lock give you the ordering. For the GUP fast path you
> would either need the lock or the memory barrier with the test for
> page write back.
> 
> Maybe an easier thing is to convert GUP fast to try to take the page
> table lock if it fails taking the page table lock then we fall back
> to slow GUP path. Otherwise then we have the same garantee as the slow
> path.

You're right I was looking at the wrong place for GUP_fast() path. But I
still don't think anything special (i.e. page lock or new barrier) is
necessary. GUP_fast() takes care already now that it cannot race with page
unmapping or write-protection (as there are other places in MM that rely on
this). Look, gup_pte_range() has:

                if (!page_cache_get_speculative(head))
                        goto pte_unmap;

                if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                        put_page(head);
                        goto pte_unmap;
                }

So that page_cache_get_speculative() will become
page_cache_pin_speculative() to increment refcount by PAGE_PIN_BIAS instead
of 1. That is an ordered atomic operation, so it cannot be reordered with the
following check that the PTE stayed the same. So once page_mkclean()
write-protects the PTE, there can be no new pins from GUP_fast() and we are
sure all pins that succeeded are visible in page->_refcount after
page_mkclean() completes. Again this is nothing new; other mm code already
relies on either seeing page->_refcount incremented or GUP fast bailing out
(e.g. DAX relies on this). Although strictly speaking I'm not 100% sure what
prevents the page->_refcount load from being speculatively reordered before
the PTE update even in current places using this, there's so much stuff in
between that there's probably something ;). But we could add smp_rmb() after
page_mkclean() before checking page_pinned() for peace of mind I guess.

> The issue is that I am not sure if the page table directory page and
> its associated spinlock can go into a bad state if the directory is being
> freed (like a racing munmap). This would need to be checked. A scheme
> that might protect against that is to take the lock of the level above
> before going down one level. Once you are down one level you can unlock
> the level above. So at any point in time GUP fast holds the lock of a
> current and valid directory and thus no one could race to remove it.
> 
>     GUP_fast()
>       gup_pgd_range()
>         if (p4d_try_map_lock()) {
>           gup_p4d_range()
>           if (pud_try_map_lock()) {
>             p4d_unlock();
>             gup_pud_range();
>               if (pmd_try_map_lock()) {
>                 pud_unlock();
>                 gup_pmd_range();
>                   if (pte_try_map_lock()) {
>                     pmd_unlock();
>                     // Do gup
>                   }
>              }
>           }
>        }
> 
> Maybe this is worse than taking the mmap_sem and checking for vma.
> 
> 
> > > > > > That actually touches on another question I wanted to get opinions on. GUP
> > > > > > can be for read and GUP can be for write (that is one of GUP flags).
> > > > > > Filesystems with page cache generally have issues only with GUP for write
> > > > > > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > > > > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > > > > memory hotplug will just loop in kernel until the page gets unpinned). So
> > > > > > we probably want to track both types of GUP pins and page-cache based
> > > > > > filesystems will take the hit even if they don't have to for read-pins?
> > > > > 
> > > > > Yes the distinction between read and write would be nice. With the map
> > > > > count solution you can only increment the mapcount for GUP(write=true).
> > > > 
> > > > Well, but if we track only pins for write, DAX or memory hotplug will not
> > > > be able to use this mechanism. So at this point I'm more leaning towards
> > > > tracking all pins. It will cost some performance needlessly for read pins
> > > > and filesystems using page cache when bouncing such pages but it's not like
> > > > writeback of pinned pages is some performance critical operation... But I
> > > > wanted to spell this out so that people are aware of this.
> > > 
> > > No, they would know for a regular pin; it is just like the page migrate
> > > code. If refcount + (extra_ref_by_the_code_checking) > mapcount then you
> > > know someone has an extra reference on your page.
> > > 
> > > Those extra references are either some regular fs event taking place (some
> > > code doing find_get_page for instance) or a GUP reference (whether it is a
> > > write pin or a read pin).
> > > 
> > > So the only issue is false positives, i.e. thinking the page is under GUP
> > > while it just has an elevated refcount because of some other regular fs/mm
> > > event. To minimize false positives for a more accurate pin test (write or
> > > read) you can enforce a few things:
> > > 
> > >     1 - first take the page lock
> > >     2 - then freeze the page with the expected count
> > > 
> > > With that it should minimize false positives. In the end even with the bias
> > > case you can also have false positives.
> > 
> > So this is basically what the code is currently doing. And for DAX it works
> > well since the page is being truncated and so essentially nobody is
> > touching it. But for hotplug it doesn't work quite well - hotplug would
> > like to return EBUSY to userspace when the page is pinned but retry if the
> > page reference is just transient.
> 
> I do not think there is any way around transient refcounts other
> than to periodically check. Maybe hot unplug (I am assuming we are
> talking about unplug here) can set the reserved page flag and
> we can change the page get-ref path to never increment the refcount
> for a page with the reserved flag. I see that there has been ongoing
> cleanup around the reserved page flag, so it might or might not be
> possible.

Well, there's no problem with transient refcount. The code is fine with
retrying in that case. The problem is when the refcount actually is *not*
transient and so hot-unplug ends up retrying forever (or for very long
time). In that case it should rather bail with EBUSY but currently there's
no way to distinguish transient from longer term pin.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-22 15:24                                                                                 ` Jan Kara
@ 2019-01-22 16:46                                                                                   ` Jerome Glisse
  2019-01-23 18:02                                                                                     ` Jan Kara
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2019-01-22 16:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 22, 2019 at 04:24:59PM +0100, Jan Kara wrote:
> On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> > On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> > > On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > > > Agreed. So with page lock it would actually look like:
> > > > > > 
> > > > > > get_page_pin()
> > > > > > 	lock_page(page);
> > > > > > 	wait_for_stable_page();
> > > > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > > 	unlock_page(page);
> > > > > > 
> > > > > > And if we perform page_pinned() check under page lock, then if
> > > > > > page_pinned() returned false, we are sure page is not and will not be
> > > > > > pinned until we drop the page lock (and also until page writeback is
> > > > > > completed if needed).
> > > > > 
> > > > > After some more though, why do we even need wait_for_stable_page() and
> > > > > lock_page() in get_page_pin()?
> > > > > 
> > > > > During writepage page_mkclean() will write protect all page tables. So
> > > > > there can be no new writeable GUP pins until we unlock the page as all such
> > > > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > > > Am I just confused?
> > > > 
> > > > Yeah, with the page lock it should synchronize on the pte, but you still
> > > > need to check for writeback. IIRC the page is unlocked after the file
> > > > system has queued up the write, and thus the page can be unlocked with
> > > > writeback pending (and PageWriteback() == true), and I am not sure
> > > > that in that state we can safely let anyone write to that page. I
> > > > am assuming that in some cases the block device also expects stable
> > > > page content (RAID stuff).
> > > > 
> > > > So the PageWriteback() test is not only for racing page_mkclean()/
> > > > test_set_page_writeback() and GUP but also for pending write back.
> > > 
> > > But this is prevented by wait_for_stable_page() that is already present in
> > > ->page_mkwrite() handlers. Look:
> > > 
> > > ->writepage()
> > >   /* Page is locked here */
> > >   clear_page_dirty_for_io(page)
> > >     page_mkclean(page)
> > >       -> page tables get writeprotected
> > >     /* The following line will be added by our patches */
> > >     if (page_pinned(page)) -> bounce
> > >     TestClearPageDirty(page)
> > >   set_page_writeback(page);
> > >   unlock_page(page);
> > >   ...submit_io...
> > > 
> > > IRQ
> > >   - IO completion
> > >   end_page_writeback()
> > > 
> > > So if GUP happens before page_mkclean() writeprotects corresponding PTE
> > > (and these two actions are synchronized on the PTE lock), page_pinned()
> > > will see the increment and report the page as pinned.
> > > 
> > > If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> > > will fault:
> > >   handle_mm_fault()
> > >     do_wp_page()
> > >       wp_page_shared()
> > >         do_page_mkwrite()
> > >           ->page_mkwrite() - that is block_page_mkwrite() or
> > > 	    iomap_page_mkwrite() or whatever filesystem provides
> > > 	  lock_page(page)
> > >           ... prepare page ...
> > > 	  wait_for_stable_page(page) -> this blocks until IO completes
> > > 	    if someone cares about pages not being modified while under IO.
> > 
> > The case I am worried about is GUP seeing a pte with the write flag set
> > but not having locked the page yet (GUP gets the pte first, then goes from
> > pte to page, then locks the page). It then locks the page, but the page
> > lock can make it wait for a racing page_mkclean()/writeback that has not
> > yet write-protected the pte the GUP just read. So by the time GUP has the
> > page locked, the pte it read might no longer have the write flag set.
> > Hence why you also need to check for writeback after taking the page lock.
> > Alternatively you could recheck the pte after a successful try_lock on the
> > page.
> 
> This isn't really possible. GUP does:
> 
> get_user_pages()
> ...
>   follow_page_mask()
>   ...
>     follow_page_pte()
>       ptep = pte_offset_map_lock()
>       check permissions and page sanity
>       if (flags & FOLL_GET)
>         get_page(page); -> this would become
> 	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>       pte_unmap_unlock(ptep, ptl);
> 
> page_mkclean() on the other hand grabs the same pte lock to change the pte
> to write-protected. So after page_mkclean() has modified the PTE we are
> racing on for access, we are sure to either see increased _refcount or get
> page fault from GUP.
> 
> If we see increased _refcount, we bounce the page and are fine. If GUP
> faults, we will wait for page lock (so wait until page is prepared for IO
> and has PageWriteback set) while handling the fault, then enter
> ->page_mkwrite, which will do wait_for_stable_page() -> wait for
> outstanding writeback to complete.
> 
> So I still conclude - no need for page lock in the GUP path at all AFAICT.
> In fact we rely on the very same page fault vs page writeback synchronization
> for normal user faults as well. And normal user mmap access is even nastier
> than GUP access because the CPU reads page tables without taking PTE lock.

For the "slow" GUP path you are right you do not need a lock, as the
page table lock gives you the ordering. For the GUP fast path you
would either need the lock or a memory barrier together with the test
for page writeback.

Maybe an easier thing is to convert GUP fast to try to take the page
table lock; if it fails to take the page table lock then we fall back
to the slow GUP path. That way we have the same guarantee as the slow
path.

The issue is that I am not sure if the page table directory page and
its associated spinlock can go into a bad state if the directory is being
freed (like a racing munmap). This would need to be checked. A scheme
that might protect against that is to take the lock of the level above
before going down one level. Once you are down one level you can unlock
the level above. So at any point in time GUP fast holds the lock of a
current and valid directory and thus no one could race to remove it.

    GUP_fast()
      gup_pgd_range()
        if (p4d_try_map_lock()) {
          gup_p4d_range()
          if (pud_try_map_lock()) {
            p4d_unlock();
            gup_pud_range();
              if (pmd_try_map_lock()) {
                pud_unlock();
                gup_pmd_range();
                  if (pte_try_map_lock()) {
                    pmd_unlock();
                    // Do gup
                  }
             }
          }
       }

Maybe this is worse than taking the mmap_sem and checking for vma.


> > > > > That actually touches on another question I wanted to get opinions on. GUP
> > > > > can be for read and GUP can be for write (that is one of GUP flags).
> > > > > Filesystems with page cache generally have issues only with GUP for write
> > > > > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > > > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > > > memory hotplug will just loop in kernel until the page gets unpinned). So
> > > > > we probably want to track both types of GUP pins and page-cache based
> > > > > filesystems will take the hit even if they don't have to for read-pins?
> > > > 
> > > > Yes the distinction between read and write would be nice. With the map
> > > > count solution you can only increment the mapcount for GUP(write=true).
> > > 
> > > Well, but if we track only pins for write, DAX or memory hotplug will not
> > > be able to use this mechanism. So at this point I'm more leaning towards
> > > tracking all pins. It will cost some performance needlessly for read pins
> > > and filesystems using page cache when bouncing such pages but it's not like
> > > writeback of pinned pages is some performance critical operation... But I
> > > wanted to spell this out so that people are aware of this.
> > 
> > No, they would know for a regular pin; it is just like the page migrate
> > code. If refcount + (extra_ref_by_the_code_checking) > mapcount then you
> > know someone has an extra reference on your page.
> > 
> > Those extra references are either some regular fs event taking place (some
> > code doing find_get_page for instance) or a GUP reference (whether it is a
> > write pin or a read pin).
> > 
> > So the only issue is false positives, i.e. thinking the page is under GUP
> > while it just has an elevated refcount because of some other regular fs/mm
> > event. To minimize false positives for a more accurate pin test (write or
> > read) you can enforce a few things:
> > 
> >     1 - first take the page lock
> >     2 - then freeze the page with the expected count
> > 
> > With that it should minimize false positives. In the end even with the bias
> > case you can also have false positives.
> 
> So this is basically what the code is currently doing. And for DAX it works
> well since the page is being truncated and so essentially nobody is
> touching it. But for hotplug it doesn't work quite well - hotplug would
> like to return EBUSY to userspace when the page is pinned but retry if the
> page reference is just transient.

I do not think there is any way around transient refcounts other
than to periodically check. Maybe hot unplug (I am assuming we are
talking about unplug here) can set the reserved page flag and
we can change the page get-ref path to never increment the refcount
for a page with the reserved flag. I see that there has been ongoing
cleanup around the reserved page flag, so it might or might not be
possible.

Also it would mean that get_page() could now fail, and we would
need to update all paths that do that to handle this case. Then
you know that if the freeze fails it must be because of a read
GUP (no transient refcount can happen).

Another way is to record the page refcount at time t and then
compare it at time t+timeout; if it matches, consider the refcount
to be from a GUP read and not a transient refcount. This should at
the very least drastically reduce the likelihood of a false GUP
positive caused by a transient refcount increment.

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17 15:17                                                                               ` Jerome Glisse
  2019-01-17 15:17                                                                                 ` Jerome Glisse
@ 2019-01-22 15:24                                                                                 ` Jan Kara
  2019-01-22 16:46                                                                                   ` Jerome Glisse
  1 sibling, 1 reply; 207+ messages in thread
From: Jan Kara @ 2019-01-22 15:24 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Thu 17-01-19 10:17:59, Jerome Glisse wrote:
> On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> > On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > > Agreed. So with page lock it would actually look like:
> > > > > 
> > > > > get_page_pin()
> > > > > 	lock_page(page);
> > > > > 	wait_for_stable_page();
> > > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > > 	unlock_page(page);
> > > > > 
> > > > > And if we perform page_pinned() check under page lock, then if
> > > > > page_pinned() returned false, we are sure page is not and will not be
> > > > > pinned until we drop the page lock (and also until page writeback is
> > > > > completed if needed).
> > > > 
> > > > After some more though, why do we even need wait_for_stable_page() and
> > > > lock_page() in get_page_pin()?
> > > > 
> > > > During writepage page_mkclean() will write protect all page tables. So
> > > > there can be no new writeable GUP pins until we unlock the page as all such
> > > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > > Am I just confused?
> > > 
> > > Yeah, with the page lock it should synchronize on the pte, but you still
> > > need to check for writeback. IIRC the page is unlocked after the file
> > > system has queued up the write, and thus the page can be unlocked with
> > > writeback pending (and PageWriteback() == true), and I am not sure
> > > that in that state we can safely let anyone write to that page. I
> > > am assuming that in some cases the block device also expects stable
> > > page content (RAID stuff).
> > > 
> > > So the PageWriteback() test is not only for racing page_mkclean()/
> > > test_set_page_writeback() and GUP but also for pending write back.
> > 
> > But this is prevented by wait_for_stable_page() that is already present in
> > ->page_mkwrite() handlers. Look:
> > 
> > ->writepage()
> >   /* Page is locked here */
> >   clear_page_dirty_for_io(page)
> >     page_mkclean(page)
> >       -> page tables get writeprotected
> >     /* The following line will be added by our patches */
> >     if (page_pinned(page)) -> bounce
> >     TestClearPageDirty(page)
> >   set_page_writeback(page);
> >   unlock_page(page);
> >   ...submit_io...
> > 
> > IRQ
> >   - IO completion
> >   end_page_writeback()
> > 
> > So if GUP happens before page_mkclean() writeprotects corresponding PTE
> > (and these two actions are synchronized on the PTE lock), page_pinned()
> > will see the increment and report the page as pinned.
> > 
> > If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> > will fault:
> >   handle_mm_fault()
> >     do_wp_page()
> >       wp_page_shared()
> >         do_page_mkwrite()
> >           ->page_mkwrite() - that is block_page_mkwrite() or
> > 	    iomap_page_mkwrite() or whatever filesystem provides
> > 	  lock_page(page)
> >           ... prepare page ...
> > 	  wait_for_stable_page(page) -> this blocks until IO completes
> > 	    if someone cares about pages not being modified while under IO.
> 
> The case I am worried about is GUP seeing a pte with the write flag set
> but not having locked the page yet (GUP gets the pte first, then goes from
> pte to page, then locks the page). It then locks the page, but the page
> lock can make it wait for a racing page_mkclean()/writeback that has not
> yet write-protected the pte the GUP just read. So by the time GUP has the
> page locked, the pte it read might no longer have the write flag set.
> Hence why you also need to check for writeback after taking the page lock.
> Alternatively you could recheck the pte after a successful try_lock on the
> page.

This isn't really possible. GUP does:

get_user_pages()
...
  follow_page_mask()
  ...
    follow_page_pte()
      ptep = pte_offset_map_lock()
      check permissions and page sanity
      if (flags & FOLL_GET)
        get_page(page); -> this would become
	  atomic_add(&page->_refcount, PAGE_PIN_BIAS);
      pte_unmap_unlock(ptep, ptl);

page_mkclean() on the other hand grabs the same pte lock to change the pte
to write-protected. So after page_mkclean() has modified the PTE we are
racing on for access, we are sure to either see increased _refcount or get
page fault from GUP.

If we see increased _refcount, we bounce the page and are fine. If GUP
faults, we will wait for page lock (so wait until page is prepared for IO
and has PageWriteback set) while handling the fault, then enter
->page_mkwrite, which will do wait_for_stable_page() -> wait for
outstanding writeback to complete.

So I still conclude - no need for page lock in the GUP path at all AFAICT.
In fact we rely on the very same page fault vs page writeback synchronization
for normal user faults as well. And normal user mmap access is even nastier
than GUP access because the CPU reads page tables without taking PTE lock.

> > > > That actually touches on another question I wanted to get opinions on. GUP
> > > > can be for read and GUP can be for write (that is one of GUP flags).
> > > > Filesystems with page cache generally have issues only with GUP for write
> > > > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > > memory hotplug will just loop in kernel until the page gets unpinned). So
> > > > we probably want to track both types of GUP pins and page-cache based
> > > > filesystems will take the hit even if they don't have to for read-pins?
> > > 
> > > Yes the distinction between read and write would be nice. With the map
> > > count solution you can only increment the mapcount for GUP(write=true).
> > 
> > Well, but if we track only pins for write, DAX or memory hotplug will not
> > be able to use this mechanism. So at this point I'm more leaning towards
> > tracking all pins. It will cost some performance needlessly for read pins
> > and filesystems using page cache when bouncing such pages but it's not like
> > writeback of pinned pages is some performance critical operation... But I
> > wanted to spell this out so that people are aware of this.
> 
> No, they would know for a regular pin; it is just like the page migrate
> code. If refcount + (extra_ref_by_the_code_checking) > mapcount then you
> know someone has an extra reference on your page.
> 
> Those extra references are either some regular fs event taking place (some
> code doing find_get_page for instance) or a GUP reference (whether it is a
> write pin or a read pin).
> 
> So the only issue is false positives, i.e. thinking the page is under GUP
> while it just has an elevated refcount because of some other regular fs/mm
> event. To minimize false positives for a more accurate pin test (write or
> read) you can enforce a few things:
> 
>     1 - first take the page lock
>     2 - then freeze the page with the expected count
> 
> With that it should minimize false positives. In the end even with the bias
> case you can also have false positives.

So this is basically what the code is currently doing. And for DAX it works
well since the page is being truncated and so essentially nobody is
touching it. But for hotplug it doesn't work quite well - hotplug would
like to return EBUSY to userspace when the page is pinned but retry if the
page reference is just transient.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-18  0:16                                                                                 ` Dave Chinner
@ 2019-01-18  1:59                                                                                   ` Jerome Glisse
  0 siblings, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-18  1:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 18, 2019 at 11:16:08AM +1100, Dave Chinner wrote:
> On Thu, Jan 17, 2019 at 10:21:08AM -0500, Jerome Glisse wrote:
> > On Wed, Jan 16, 2019 at 09:42:25PM -0800, John Hubbard wrote:
> > > On 1/16/19 5:08 AM, Jerome Glisse wrote:
> > > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > >> That actually touches on another question I wanted to get opinions on. GUP
> > > >> can be for read and GUP can be for write (that is one of GUP flags).
> > > >> Filesystems with page cache generally have issues only with GUP for write
> > > >> as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > >> hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > >> memory hotplug will just loop in kernel until the page gets unpinned). So
> > > >> we probably want to track both types of GUP pins and page-cache based
> > > >> filesystems will take the hit even if they don't have to for read-pins?
> > > > 
> > > > Yes the distinction between read and write would be nice. With the map
> > > > count solution you can only increment the mapcount for GUP(write=true).
> > > > With pin bias the issue is that a big number of read pins can trigger
> > > > a false positive, i.e. you would do:
> > > >     GUP(vaddr, write)
> > > >         ...
> > > >         if (write)
> > > >             atomic_add(page->refcount, PAGE_PIN_BIAS)
> > > >         else
> > > >             atomic_inc(page->refcount)
> > > > 
> > > >     PUP(page, write)
> > > >         if (write)
> > > >             atomic_add(page->refcount, -PAGE_PIN_BIAS)
> > > >         else
> > > >             atomic_dec(page->refcount)
> > > > 
> > > > I am guessing false positives because of too many read GUPs are OK, as
> > > > they should be unlikely, and when they happen we take the hit.
> > > > 
> > > 
> > > I'm also intrigued by the point that read-only GUP is harmless, and we 
> > > could just focus on the writeable case.
> > 
> > For a filesystem, anybody that just looks at the page is fine, as they
> > would not change its content and thus the page would stay stable.
> 
> Other processes can access and dirty the page cache page while there
> is a GUP reference.  It's unclear to me whether that changes what
> GUP needs to do here, but we can't assume a page referenced for
> read-only GUP will be clean and unchanging for the duration of the
> GUP reference. It may even be dirty at the time of the read-only
> GUP pin...
> 

Yes, and it is fine. GUP read-only users do not assume that the page
is read-only for everyone; it just means that the GUP user swears
it will only read from the page, not write to it.

So for GUP read-only we do not need to synchronize with anything
writing to the page.

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17 15:21                                                                               ` Jerome Glisse
  2019-01-17 15:21                                                                                 ` Jerome Glisse
@ 2019-01-18  0:16                                                                                 ` Dave Chinner
  2019-01-18  1:59                                                                                   ` Jerome Glisse
  1 sibling, 1 reply; 207+ messages in thread
From: Dave Chinner @ 2019-01-18  0:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 17, 2019 at 10:21:08AM -0500, Jerome Glisse wrote:
> On Wed, Jan 16, 2019 at 09:42:25PM -0800, John Hubbard wrote:
> > On 1/16/19 5:08 AM, Jerome Glisse wrote:
> > > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > >> That actually touches on another question I wanted to get opinions on. GUP
> > >> can be for read and GUP can be for write (that is one of GUP flags).
> > >> Filesystems with page cache generally have issues only with GUP for write
> > >> as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > >> hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > >> memory hotplug will just loop in kernel until the page gets unpinned). So
> > >> we probably want to track both types of GUP pins and page-cache based
> > >> filesystems will take the hit even if they don't have to for read-pins?
> > > 
> > > Yes the distinction between read and write would be nice. With the map
> > > count solution you can only increment the mapcount for GUP(write=true).
> > > With pin bias the issue is that a big number of read pins can trigger
> > > a false positive, i.e. you would do:
> > >     GUP(vaddr, write)
> > >         ...
> > >         if (write)
> > >             atomic_add(page->refcount, PAGE_PIN_BIAS)
> > >         else
> > >             atomic_inc(page->refcount)
> > > 
> > >     PUP(page, write)
> > >         if (write)
> > >             atomic_add(page->refcount, -PAGE_PIN_BIAS)
> > >         else
> > >             atomic_dec(page->refcount)
> > > 
> > > I am guessing false positives because of too many read GUPs are OK, as
> > > they should be unlikely, and when they happen we take the hit.
> > > 
> > 
> > I'm also intrigued by the point that read-only GUP is harmless, and we 
> > could just focus on the writeable case.
> 
> For a filesystem, anybody that just looks at the page is fine, as they
> would not change its content and thus the page would stay stable.

Other processes can access and dirty the page cache page while there
is a GUP reference.  It's unclear to me whether that changes what
GUP needs to do here, but we can't assume a page referenced for
read-only GUP will be clean and unchanging for the duration of the
GUP reference. It may even be dirty at the time of the read-only
GUP pin...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17  5:42                                                                             ` John Hubbard
  2019-01-17  5:42                                                                               ` John Hubbard
@ 2019-01-17 15:21                                                                               ` Jerome Glisse
  2019-01-17 15:21                                                                                 ` Jerome Glisse
  2019-01-18  0:16                                                                                 ` Dave Chinner
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-17 15:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 16, 2019 at 09:42:25PM -0800, John Hubbard wrote:
> On 1/16/19 5:08 AM, Jerome Glisse wrote:
> > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> >> On Tue 15-01-19 09:07:59, Jan Kara wrote:
> >>> Agreed. So with page lock it would actually look like:
> >>>
> >>> get_page_pin()
> >>> 	lock_page(page);
> >>> 	wait_for_stable_page();
> >>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>> 	unlock_page(page);
> >>>
> >>> And if we perform page_pinned() check under page lock, then if
> >>> page_pinned() returned false, we are sure page is not and will not be
> >>> pinned until we drop the page lock (and also until page writeback is
> >>> completed if needed).
> >>
> >> After some more thought, why do we even need wait_for_stable_page() and
> >> lock_page() in get_page_pin()?
> >>
> >> During writepage page_mkclean() will write protect all page tables. So
> >> there can be no new writeable GUP pins until we unlock the page as all such
> >> GUPs will have to first go through fault and ->page_mkwrite() handler. And
> >> that will wait on page lock and do wait_for_stable_page() for us anyway.
> >> Am I just confused?
> > 
> > Yeah with the page lock it should synchronize on the pte, but you still
> > need to check for writeback; iirc the page is unlocked after the file
> > system has queued up the write, and thus the page can be unlocked with
> > writeback pending (and PageWriteback() == true), and I am not sure
> > that in that state we can safely let anyone write to that page. I
> > am assuming that in some cases the block device also expects stable
> > page content (RAID stuff).
> > 
> > So the PageWriteback() test is not only for racing page_mkclean()/
> > test_set_page_writeback() and GUP but also for pending write back.
> 
> 
> That was how I thought it worked too: page_mkclean and a few other things
> like page migration take the page lock, but writeback takes the lock, 
> queues it up, then drops the lock, and writeback actually happens outside
> that lock. 
> 
> So on the GUP end, some combination of taking the page lock, and 
> wait_on_page_writeback(), is required in order to flush out the writebacks.
> I think I just rephrased what Jerome said, actually. :)
> 
> 
> > 
> > 
> >> That actually touches on another question I wanted to get opinions on. GUP
> >> can be for read and GUP can be for write (that is one of GUP flags).
> >> Filesystems with page cache generally have issues only with GUP for write
> >> as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> >> hotplug have issues with both (DAX cannot truncate page pinned in any way,
> >> memory hotplug will just loop in kernel until the page gets unpinned). So
> >> we probably want to track both types of GUP pins and page-cache based
> >> filesystems will take the hit even if they don't have to for read-pins?
> > 
> > Yes the distinction between read and write would be nice. With the map
> > count solution you can only increment the mapcount for GUP(write=true).
> > With pin bias the issue is that a big number of read pins can trigger
> > false positives, ie you would do:
> >     GUP(vaddr, write)
> >         ...
> >         if (write)
> >             atomic_add(page->refcount, PAGE_PIN_BIAS)
> >         else
> >             atomic_inc(page->refcount)
> > 
> >     PUP(page, write)
> >         if (write)
> >             atomic_add(page->refcount, -PAGE_PIN_BIAS)
> >         else
> >             atomic_dec(page->refcount)
> > 
> > I am guessing false positives because of too many read GUPs are ok as
> > they should be unlikely, and when they happen then we take the hit.
> > 
> 
> I'm also intrigued by the point that read-only GUP is harmless, and we 
> could just focus on the writeable case.

For filesystems, anybody who just looks at the page is fine: reading does
not change its content, so the page stays stable.

> 
> However, I'm rather worried about actually attempting it, because remember
> that so far, each call site does no special tracking of each struct page. 
> It just remembers that it needs to do a put_page(), not whether or
> not that particular page was set up with writeable or read-only GUP. I mean,
> sure, they often call set_page_dirty before put_page, indicating that it might
> have been a writeable GUP call, but it seems sketchy to rely on that.
> 
> So actually doing this could go from merely lots of work, to K*(lots_of_work)...

I did a quick scan and most GUP users know whether they did a write GUP
or not by the time they do put_page(); for instance, every device driver
knows, because it uses that very information for dma_unmap_page().

So whether the GUP was write or read-only is available at the time of PUP.

If you do not feel comfortable with it, you can leave it out for now.
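As a sketch of the point Jerome makes, a typical driver teardown path already holds the write flag it needs, because the DMA unmap direction depends on it (illustrative kernel-style pseudocode; `release_user_pages()` and `put_user_page_write()` are hypothetical names, and `put_user_page()` is the helper this patch series introduces):

```c
/*
 * Hypothetical driver teardown path. The write/read information
 * needed at PUP time is already in hand, because it also selects
 * the DMA direction (bidirectional mappings would use
 * DMA_BIDIRECTIONAL instead).
 */
static void release_user_pages(struct device *dev, struct page **pages,
			       dma_addr_t *addrs, int npages, bool write)
{
	int i;

	for (i = 0; i < npages; i++) {
		dma_unmap_page(dev, addrs[i], PAGE_SIZE,
			       write ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
		if (write)
			put_user_page_write(pages[i]); /* assumed variant */
		else
			put_user_page(pages[i]);
	}
}
```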

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17  9:30                                                                             ` Jan Kara
  2019-01-17  9:30                                                                               ` Jan Kara
@ 2019-01-17 15:17                                                                               ` Jerome Glisse
  2019-01-17 15:17                                                                                 ` Jerome Glisse
  2019-01-22 15:24                                                                                 ` Jan Kara
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-17 15:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 17, 2019 at 10:30:47AM +0100, Jan Kara wrote:
> On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> > On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > > Agreed. So with page lock it would actually look like:
> > > > 
> > > > get_page_pin()
> > > > 	lock_page(page);
> > > > 	wait_for_stable_page();
> > > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > > 	unlock_page(page);
> > > > 
> > > > And if we perform page_pinned() check under page lock, then if
> > > > page_pinned() returned false, we are sure page is not and will not be
> > > > pinned until we drop the page lock (and also until page writeback is
> > > > completed if needed).
> > > 
> > > After some more thought, why do we even need wait_for_stable_page() and
> > > lock_page() in get_page_pin()?
> > > 
> > > During writepage page_mkclean() will write protect all page tables. So
> > > there can be no new writeable GUP pins until we unlock the page as all such
> > > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > > Am I just confused?
> > 
> > Yeah with the page lock it should synchronize on the pte, but you still
> > need to check for writeback; iirc the page is unlocked after the file
> > system has queued up the write, and thus the page can be unlocked with
> > writeback pending (and PageWriteback() == true), and I am not sure
> > that in that state we can safely let anyone write to that page. I
> > am assuming that in some cases the block device also expects stable
> > page content (RAID stuff).
> > 
> > So the PageWriteback() test is not only for racing page_mkclean()/
> > test_set_page_writeback() and GUP but also for pending write back.
> 
> But this is prevented by wait_for_stable_page() that is already present in
> ->page_mkwrite() handlers. Look:
> 
> ->writepage()
>   /* Page is locked here */
>   clear_page_dirty_for_io(page)
>     page_mkclean(page)
>       -> page tables get writeprotected
>     /* The following line will be added by our patches */
>     if (page_pinned(page)) -> bounce
>     TestClearPageDirty(page)
>   set_page_writeback(page);
>   unlock_page(page);
>   ...submit_io...
> 
> IRQ
>   - IO completion
>   end_page_writeback()
> 
> So if GUP happens before page_mkclean() writeprotects corresponding PTE
> (and these two actions are synchronized on the PTE lock), page_pinned()
> will see the increment and report the page as pinned.
> 
> If GUP happens after page_mkclean() writeprotects corresponding PTE, it
> will fault:
>   handle_mm_fault()
>     do_wp_page()
>       wp_page_shared()
>         do_page_mkwrite()
>           ->page_mkwrite() - that is block_page_mkwrite() or
> 	    iomap_page_mkwrite() or whatever filesystem provides
> 	  lock_page(page)
>           ... prepare page ...
> 	  wait_for_stable_page(page) -> this blocks until IO completes
> 	    if someone cares about pages not being modified while under IO.

The case I am worried about is GUP seeing a pte with the write flag set
before it has locked the page (GUP gets the pte first, then goes from
pte to page, then locks the page). When it then locks the page, the page
lock can make it wait for a racing page_mkclean()/writeback that has not
yet write-protected the pte GUP just read. So by the time GUP has the
page locked, the pte it read might no longer have the write flag set.
Hence you also need to check for writeback after taking the page lock.
Alternatively, you could recheck the pte after a successful trylock on
the page.

You can optimize by also checking before trying to lock the page, to
bail out early and avoid an unnecessary lock wait. But that is only an
optimization; for correctness you need to check after taking the page
lock.
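A kernel-style sketch of the ordering argued for here (illustrative pseudocode only, not a tested patch; get_page_pin() and PAGE_PIN_BIAS are the names used earlier in the thread):

```c
/*
 * Illustrative pseudocode. The point is the ordering: the writeback
 * (or pte) re-check must happen after lock_page(), because the pte
 * walked before the lock may race with page_mkclean().
 */
static void get_page_pin(struct page *page)
{
	lock_page(page);
	/*
	 * A racing page_mkclean()/writeback may have started between
	 * our pte walk and this lock; the filesystem may have already
	 * unlocked the page with PageWriteback() still true.
	 */
	wait_on_page_writeback(page);
	/* alternatively: re-walk and re-check the pte here */
	atomic_add(PAGE_PIN_BIAS, &page->_refcount);
	unlock_page(page);
}
```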

> 
> > > That actually touches on another question I wanted to get opinions on. GUP
> > > can be for read and GUP can be for write (that is one of GUP flags).
> > > Filesystems with page cache generally have issues only with GUP for write
> > > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > > memory hotplug will just loop in kernel until the page gets unpinned). So
> > > we probably want to track both types of GUP pins and page-cache based
> > > filesystems will take the hit even if they don't have to for read-pins?
> > 
> > Yes the distinction between read and write would be nice. With the map
> > count solution you can only increment the mapcount for GUP(write=true).
> 
> Well, but if we track only pins for write, DAX or memory hotplug will not
> be able to use this mechanism. So at this point I'm more leaning towards
> tracking all pins. It will cost some performance needlessly for read pins
> and filesystems using page cache when bouncing such pages but it's not like
> writeback of pinned pages is some performance critical operation... But I
> wanted to spell this out so that people are aware of this.

No, they would still know about regular pins; it works just like the
page migration code. If refcount - (extra_refs_held_by_the_checking_code)
> mapcount, then you know someone has an extra reference on your page.

Those extra references are either some regular fs event taking place
(some code doing find_get_page(), for instance) or a GUP reference
(whether a write pin or a read pin).

So the only issue is false positives, i.e. thinking the page is under
GUP while it merely has an elevated refcount because of some other
regular fs/mm event. To minimize false positives for a more accurate
pin test (write or read) you can enforce a few things:

    1 - first lock the page
    2 - then freeze the page with the expected count

With that, false positives should be minimized. In the end, even the
bias scheme can give false positives.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16 13:08                                                                           ` Jerome Glisse
  2019-01-16 13:08                                                                             ` Jerome Glisse
  2019-01-17  5:42                                                                             ` John Hubbard
@ 2019-01-17  9:30                                                                             ` Jan Kara
  2019-01-17  9:30                                                                               ` Jan Kara
  2019-01-17 15:17                                                                               ` Jerome Glisse
  2 siblings, 2 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-17  9:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > Agreed. So with page lock it would actually look like:
> > > 
> > > get_page_pin()
> > > 	lock_page(page);
> > > 	wait_for_stable_page();
> > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > 	unlock_page(page);
> > > 
> > > And if we perform page_pinned() check under page lock, then if
> > > page_pinned() returned false, we are sure page is not and will not be
> > > pinned until we drop the page lock (and also until page writeback is
> > > completed if needed).
> > 
> > After some more thought, why do we even need wait_for_stable_page() and
> > lock_page() in get_page_pin()?
> > 
> > During writepage page_mkclean() will write protect all page tables. So
> > there can be no new writeable GUP pins until we unlock the page as all such
> > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > Am I just confused?
> 
> Yeah with the page lock it should synchronize on the pte, but you still
> need to check for writeback; iirc the page is unlocked after the file
> system has queued up the write, and thus the page can be unlocked with
> writeback pending (and PageWriteback() == true), and I am not sure
> that in that state we can safely let anyone write to that page. I
> am assuming that in some cases the block device also expects stable
> page content (RAID stuff).
> 
> So the PageWriteback() test is not only for racing page_mkclean()/
> test_set_page_writeback() and GUP but also for pending write back.

But this is prevented by wait_for_stable_page() that is already present in
->page_mkwrite() handlers. Look:

->writepage()
  /* Page is locked here */
  clear_page_dirty_for_io(page)
    page_mkclean(page)
      -> page tables get writeprotected
    /* The following line will be added by our patches */
    if (page_pinned(page)) -> bounce
    TestClearPageDirty(page)
  set_page_writeback(page);
  unlock_page(page);
  ...submit_io...

IRQ
  - IO completion
  end_page_writeback()

So if GUP happens before page_mkclean() writeprotects corresponding PTE
(and these two actions are synchronized on the PTE lock), page_pinned()
will see the increment and report the page as pinned.

If GUP happens after page_mkclean() writeprotects corresponding PTE, it
will fault:
  handle_mm_fault()
    do_wp_page()
      wp_page_shared()
        do_page_mkwrite()
          ->page_mkwrite() - that is block_page_mkwrite() or
	    iomap_page_mkwrite() or whatever filesystem provides
	  lock_page(page)
          ... prepare page ...
	  wait_for_stable_page(page) -> this blocks until IO completes
	    if someone cares about pages not being modified while under IO.

> > That actually touches on another question I wanted to get opinions on. GUP
> > can be for read and GUP can be for write (that is one of GUP flags).
> > Filesystems with page cache generally have issues only with GUP for write
> > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > memory hotplug will just loop in kernel until the page gets unpinned). So
> > we probably want to track both types of GUP pins and page-cache based
> > filesystems will take the hit even if they don't have to for read-pins?
> 
> Yes the distinction between read and write would be nice. With the map
> count solution you can only increment the mapcount for GUP(write=true).

Well, but if we track only pins for write, DAX or memory hotplug will not
be able to use this mechanism. So at this point I'm more leaning towards
tracking all pins. It will cost some performance needlessly for read pins
and filesystems using page cache when bouncing such pages but it's not like
writeback of pinned pages is some performance critical operation... But I
wanted to spell this out so that people are aware of this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17  9:30                                                                             ` Jan Kara
@ 2019-01-17  9:30                                                                               ` Jan Kara
  2019-01-17 15:17                                                                               ` Jerome Glisse
  1 sibling, 0 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-17  9:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 16-01-19 08:08:14, Jerome Glisse wrote:
> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> > On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > > Agreed. So with page lock it would actually look like:
> > > 
> > > get_page_pin()
> > > 	lock_page(page);
> > > 	wait_for_stable_page();
> > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > 	unlock_page(page);
> > > 
> > > And if we perform page_pinned() check under page lock, then if
> > > page_pinned() returned false, we are sure page is not and will not be
> > > pinned until we drop the page lock (and also until page writeback is
> > > completed if needed).
> > 
> > After some more thought, why do we even need wait_for_stable_page() and
> > lock_page() in get_page_pin()?
> > 
> > During writepage page_mkclean() will write protect all page tables. So
> > there can be no new writeable GUP pins until we unlock the page as all such
> > GUPs will have to first go through fault and ->page_mkwrite() handler. And
> > that will wait on page lock and do wait_for_stable_page() for us anyway.
> > Am I just confused?
> 
> Yeah, with the page lock it should synchronize on the pte, but you still
> need to check for writeback. IIRC the page is unlocked after the file
> system has queued up the write, and thus the page can be unlocked with
> write back pending (and PageWriteback() == true), and I am not sure
> that in that state we can safely let anyone write to that page. I
> am assuming that in some cases the block device also expects stable
> page content (RAID stuff).
> 
> So the PageWriteback() test is not only for racing page_mkclean()/
> test_set_page_writeback() and GUP but also for pending write back.

But this is prevented by wait_for_stable_page() that is already present in
->page_mkwrite() handlers. Look:

->writepage()
  /* Page is locked here */
  clear_page_dirty_for_io(page)
    page_mkclean(page)
      -> page tables get writeprotected
    /* The following line will be added by our patches */
    if (page_pinned(page)) -> bounce
    TestClearPageDirty(page)
  set_page_writeback(page);
  unlock_page(page);
  ...submit_io...

IRQ
  - IO completion
  end_page_writeback()

So if GUP happens before page_mkclean() writeprotects corresponding PTE
(and these two actions are synchronized on the PTE lock), page_pinned()
will see the increment and report the page as pinned.

If GUP happens after page_mkclean() writeprotects corresponding PTE, it
will fault:
  handle_mm_fault()
    do_wp_page()
      wp_page_shared()
        do_page_mkwrite()
          ->page_mkwrite() - that is block_page_mkwrite() or
	    iomap_page_mkwrite() or whatever filesystem provides
	  lock_page(page)
          ... prepare page ...
	  wait_for_stable_page(page) -> this blocks until IO completes
	    if someone cares about pages not being modified while under IO.

> > That actually touches on another question I wanted to get opinions on. GUP
> > can be for read and GUP can be for write (that is one of GUP flags).
> > Filesystems with page cache generally have issues only with GUP for write
> > as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> > hotplug have issues with both (DAX cannot truncate page pinned in any way,
> > memory hotplug will just loop in kernel until the page gets unpinned). So
> > we probably want to track both types of GUP pins and page-cache based
> > filesystems will take the hit even if they don't have to for read-pins?
> 
> Yes the distinction between read and write would be nice. With the map
> count solution you can only increment the mapcount for GUP(write=true).

Well, but if we track only pins for write, DAX or memory hotplug will not
be able to use this mechanism. So at this point I'm leaning more towards
tracking all pins. It will cost some performance needlessly for read pins
and filesystems using the page cache when bouncing such pages, but it's not
like writeback of pinned pages is a performance-critical operation... But I
wanted to spell this out so that people are aware of it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-17  5:25                                                                         ` John Hubbard
  2019-01-17  5:25                                                                           ` John Hubbard
@ 2019-01-17  9:04                                                                           ` Jan Kara
  2019-01-17  9:04                                                                             ` Jan Kara
  1 sibling, 1 reply; 207+ messages in thread
From: Jan Kara @ 2019-01-17  9:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Jerome Glisse, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 16-01-19 21:25:05, John Hubbard wrote:
> On 1/15/19 12:07 AM, Jan Kara wrote:
> >>>>> [...]
> >>> Also there is one more idea I had how to record number of pins in the page:
> >>>
> >>> #define PAGE_PIN_BIAS	1024
> >>>
> >>> get_page_pin()
> >>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>
> >>> put_page_pin();
> >>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>
> >>> page_pinned(page)
> >>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>
> >>> This is pretty trivial scheme. It still gives us 22-bits for page pins
> >>> which should be plenty (but we should check for that and bail with error if
> >>> it would overflow). Also there will be no false negatives and false
> >>> positives only if there are more than 1024 non-page-table references to the
> >>> page which I expect to be rare (we might want to also subtract
> >>> hpage_nr_pages() for radix tree references to avoid excessive false
> >>> positives for huge pages although at this point I don't think they would
> >>> matter). Thoughts?
> 
> Some details, sorry I'm not fully grasping your plan without more
> explanation:
> 
> Do I read it correctly that this uses the lower 10 bits for the original
> page->_refcount, and the upper 22 bits for gup-pinned counts? If so, I'm
> surprised, because gup-pinned is going to be less than or equal to the
> normal (get_page-based) pin count. And 1024 seems like it might be
> reached in a large system with lots of processes and IPC.
> 
> Are you just allowing the lower 10 bits to overflow, and that's why the 
> subtraction of mapcount? Wouldn't it be better to allow more than 10 bits, 
> instead?

I'm not really dividing the page->_refcount counter; that's the wrong way
to think about it, I believe. Normal get_page() simply increments the
_refcount by 1; get_page_pin() will increment it by 1024 (or 999 or whatever -
that's PAGE_PIN_BIAS). The choice of value for PAGE_PIN_BIAS is essentially
a tradeoff between how many page pins you allow and how likely
page_pinned() is to return a false positive. A large PAGE_PIN_BIAS means
fewer false positives but also fewer page pins allowed for the page
before _refcount would overflow.

Now the trick with subtracting page_mapcount() is the following: We know
that certain places hold references to the page. Common holders of
page references are page table entries. So if we subtract page_mapcount()
from _refcount, we get a more accurate view of how many other references
(including pins) are there and thus reduce the number of false positives.

> Another question: do we just allow other kernel code to observe this biased
> _refcount, or do we attempt to filter it out?  In other words, do you expect 
> problems due to some kernel code checking the _refcount and finding a large 
> number there, when it expected, say, 3? I recall some code tries to do 
> that...in fact, ZONE_DEVICE is 1-based, instead of zero-based, with respect 
> to _refcount, right?

I would just allow other places to observe the biased refcount. Sure, there
are places that do comparisons on the exact refcount value, but if such a
place does not exclude page pins, it cannot really depend on whether there's
just one or a thousand of them. Generally such places try to detect whether
they are the only owner of the page (besides the page cache radix tree, LRU,
etc.). So they want to bail if any page pin exists, and that check remains
the same regardless of whether we increment _refcount by 1 or by 1024 when
pinning the page.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16 13:08                                                                           ` Jerome Glisse
  2019-01-16 13:08                                                                             ` Jerome Glisse
@ 2019-01-17  5:42                                                                             ` John Hubbard
  2019-01-17  5:42                                                                               ` John Hubbard
  2019-01-17 15:21                                                                               ` Jerome Glisse
  2019-01-17  9:30                                                                             ` Jan Kara
  2 siblings, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-17  5:42 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/16/19 5:08 AM, Jerome Glisse wrote:
> On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
>> On Tue 15-01-19 09:07:59, Jan Kara wrote:
>>> Agreed. So with page lock it would actually look like:
>>>
>>> get_page_pin()
>>> 	lock_page(page);
>>> 	wait_for_stable_page();
>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>> 	unlock_page(page);
>>>
>>> And if we perform page_pinned() check under page lock, then if
>>> page_pinned() returned false, we are sure page is not and will not be
>>> pinned until we drop the page lock (and also until page writeback is
>>> completed if needed).
>>
>> After some more thought, why do we even need wait_for_stable_page() and
>> lock_page() in get_page_pin()?
>>
>> During writepage page_mkclean() will write protect all page tables. So
>> there can be no new writeable GUP pins until we unlock the page as all such
>> GUPs will have to first go through fault and ->page_mkwrite() handler. And
>> that will wait on page lock and do wait_for_stable_page() for us anyway.
>> Am I just confused?
> 
> Yeah, with the page lock it should synchronize on the pte, but you still
> need to check for writeback. IIRC the page is unlocked after the file
> system has queued up the write, and thus the page can be unlocked with
> write back pending (and PageWriteback() == true), and I am not sure
> that in that state we can safely let anyone write to that page. I
> am assuming that in some cases the block device also expects stable
> page content (RAID stuff).
> 
> So the PageWriteback() test is not only for racing page_mkclean()/
> test_set_page_writeback() and GUP but also for pending write back.


That was how I thought it worked too: page_mkclean and a few other things
like page migration take the page lock, but writeback takes the lock, 
queues it up, then drops the lock, and writeback actually happens outside
that lock. 

So on the GUP end, some combination of taking the page lock, and 
wait_on_page_writeback(), is required in order to flush out the writebacks.
I think I just rephrased what Jerome said, actually. :)


> 
> 
>> That actually touches on another question I wanted to get opinions on. GUP
>> can be for read and GUP can be for write (that is one of GUP flags).
>> Filesystems with page cache generally have issues only with GUP for write
>> as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
>> hotplug have issues with both (DAX cannot truncate page pinned in any way,
>> memory hotplug will just loop in kernel until the page gets unpinned). So
>> we probably want to track both types of GUP pins and page-cache based
>> filesystems will take the hit even if they don't have to for read-pins?
> 
> Yes the distinction between read and write would be nice. With the map
> count solution you can only increment the mapcount for GUP(write=true).
> With pin bias the issue is that a big number of read pin can trigger
> false positive ie you would do:
>     GUP(vaddr, write)
>         ...
>         if (write)
>             atomic_add(page->refcount, PAGE_PIN_BIAS)
>         else
>             atomic_inc(page->refcount)
> 
>     PUP(page, write)
>         if (write)
>             atomic_add(page->refcount, -PAGE_PIN_BIAS)
>         else
>             atomic_dec(page->refcount)
> 
> I am guessing false positive because of too many read GUP is ok as
> it should be unlikely and when it happens then we take the hit.
> 

I'm also intrigued by the point that read-only GUP is harmless, and we 
could just focus on the writeable case.

However, I'm rather worried about actually attempting it, because remember
that so far, each call site does no special tracking of each struct page. 
It just remembers that it needs to do a put_page(), not whether or
not that particular page was set up with writeable or read-only GUP. I mean,
sure, they often call set_page_dirty before put_page, indicating that it might
have been a writeable GUP call, but it seems sketchy to rely on that.

So actually doing this could go from merely lots of work, to K*(lots_of_work)...


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15  8:07                                                                       ` Jan Kara
                                                                                           ` (2 preceding siblings ...)
  2019-01-16 11:38                                                                         ` Jan Kara
@ 2019-01-17  5:25                                                                         ` John Hubbard
  2019-01-17  5:25                                                                           ` John Hubbard
  2019-01-17  9:04                                                                           ` Jan Kara
  3 siblings, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-17  5:25 UTC (permalink / raw)
  To: Jan Kara, Jerome Glisse
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/15/19 12:07 AM, Jan Kara wrote:
>>>>> [...]
>>> Also there is one more idea I had how to record number of pins in the page:
>>>
>>> #define PAGE_PIN_BIAS	1024
>>>
>>> get_page_pin()
>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>
>>> put_page_pin();
>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>
>>> page_pinned(page)
>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>
>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
>>> which should be plenty (but we should check for that and bail with error if
>>> it would overflow). Also there will be no false negatives and false
>>> positives only if there are more than 1024 non-page-table references to the
>>> page which I expect to be rare (we might want to also subtract
>>> hpage_nr_pages() for radix tree references to avoid excessive false
>>> positives for huge pages although at this point I don't think they would
>>> matter). Thoughts?

Hi Jan,

Some details, sorry I'm not fully grasping your plan without more explanation:

Do I read it correctly that this uses the lower 10 bits for the original
page->_refcount, and the upper 22 bits for gup-pinned counts? If so, I'm surprised,
because gup-pinned is going to be less than or equal to the normal (get_page-based)
pin count. And 1024 seems like it might be reached in a large system with lots
of processes and IPC.

Are you just allowing the lower 10 bits to overflow, and that's why the 
subtraction of mapcount? Wouldn't it be better to allow more than 10 bits, 
instead?

Another question: do we just allow other kernel code to observe this biased
_refcount, or do we attempt to filter it out?  In other words, do you expect 
problems due to some kernel code checking the _refcount and finding a large 
number there, when it expected, say, 3? I recall some code tries to do 
that...in fact, ZONE_DEVICE is 1-based, instead of zero-based, with respect 
to _refcount, right?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16 14:50                                                                                         ` Jerome Glisse
  2019-01-16 14:50                                                                                           ` Jerome Glisse
@ 2019-01-16 22:51                                                                                           ` Dave Chinner
  2019-01-16 22:51                                                                                             ` Dave Chinner
  1 sibling, 1 reply; 207+ messages in thread
From: Dave Chinner @ 2019-01-16 22:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, John Hubbard, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 16, 2019 at 09:50:16AM -0500, Jerome Glisse wrote:
> On Wed, Jan 16, 2019 at 03:34:55PM +1100, Dave Chinner wrote:
> > On Tue, Jan 15, 2019 at 09:23:12PM -0500, Jerome Glisse wrote:
> > > On Tue, Jan 15, 2019 at 06:01:09PM -0800, Dan Williams wrote:
> > > > On Tue, Jan 15, 2019 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> > > > [..]
> > > > > To make it clear.
> > > > >
> > > > > Lock code:
> > > > >     GUP()
> > > > >         ...
> > > > >         lock_page(page);
> > > > >         if (PageWriteback(page)) {
> > > > >             unlock_page(page);
> > > > >             wait_stable_page(page);
> > > > >             goto retry;
> > > > >         }
> > > > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > > > >         unlock_page(page);
> > > > >
> > > > >     test_set_page_writeback()
> > > > >         bool pinned = false;
> > > > >         ...
> > > > >         pinned = page_is_pin(page); // could be after TestSetPageWriteback
> > > > >         TestSetPageWriteback(page);
> > > > >         ...
> > > > >         return pinned;
> > > > >
> > > > > Memory barrier:
> > > > >     GUP()
> > > > >         ...
> > > > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > > > >         smp_mb();
> > > > >         if (PageWriteback(page)) {
> > > > >             atomic_add(page->refcount, -PAGE_PIN_BIAS);
> > > > >             wait_stable_page(page);
> > > > >             goto retry;
> > > > >         }
> > > > >
> > > > >     test_set_page_writeback()
> > > > >         bool pinned = false;
> > > > >         ...
> > > > >         TestSetPageWriteback(page);
> > > > >         smp_wmb();
> > > > >         pinned = page_is_pin(page);
> > > > >         ...
> > > > >         return pinned;
> > > > >
> > > > >
> > > > > One is not more complex than the other. One can contend, the other
> > > > > will _never_ contend.
> > > > 
> > > > The complexity is in the validation of lockless algorithms. It's
> > > > easier to reason about locks than barriers for the long term
> > > > maintainability of this code. I'm with Jan and John on wanting to
> > > > explore lock_page() before a barrier-based scheme.
> > > 
> > > How is the above hard to validate ?
> > 
> > Well, if you think it's so easy, then please write the test cases so
> > we can add them to fstests and make sure that we don't break it in
> > future.
> > 
> > If you can't write filesystem test cases that exercise these race
> > conditions reliably, then the answer to your question is "it is
> > extremely hard to validate" and the correct thing to do is to start
> > with the simple lock_page() based algorithm.
> > 
> > Premature optimisation in code this complex is something we really,
> > really need to avoid.
> 
> Litmus test shows that this never happens, i am attaching 2 litmus
> test one with barrier and one without. Without barrier we can see
> the double negative !PageWriteback in GUP and !page_pinned() in
> test_set_page_writeback() (0:EAX = 0; 1:EAX = 0; below)

That's not a regression test, nor does it actually test the code
that the kernel runs. It's just an extremely simplified model of
a small part of the algorithm. Sure, that specific interaction is
fine, but that in no way reflects the complexity of the code or the
interactions with other code that interacts with that state. And
it's not something we can use to detect that some future change has
broken gup vs writeback synchronisation.

Memory barriers might be fast, but they are hell for anyone but the
person who wrote the algorithm to understand.  If a simple page lock
is good enough and doesn't cause performance problems, then use the
simple page lock mechanism and stop trying to be clever.

Other people who will have to understand and debug issues in this
code are much less accepting of such clever algorithms. We've
been badly burnt on repeated occasions by broken memory barriers in
code heavily optimised for performance (*cough* rwsems *cough*), so
I'm extremely wary of using memory ordering dependent algorithms in
places where they are not necessary.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  4:34                                                                                       ` Dave Chinner
  2019-01-16  4:34                                                                                         ` Dave Chinner
@ 2019-01-16 14:50                                                                                         ` Jerome Glisse
  2019-01-16 14:50                                                                                           ` Jerome Glisse
  2019-01-16 22:51                                                                                           ` Dave Chinner
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-16 14:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, John Hubbard, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3794 bytes --]

On Wed, Jan 16, 2019 at 03:34:55PM +1100, Dave Chinner wrote:
> On Tue, Jan 15, 2019 at 09:23:12PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 15, 2019 at 06:01:09PM -0800, Dan Williams wrote:
> > > On Tue, Jan 15, 2019 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> > > [..]
> > > > To make it clear.
> > > >
> > > > Lock code:
> > > >     GUP()
> > > >         ...
> > > >         lock_page(page);
> > > >         if (PageWriteback(page)) {
> > > >             unlock_page(page);
> > > >             wait_stable_page(page);
> > > >             goto retry;
> > > >         }
> > > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > > >         unlock_page(page);
> > > >
> > > >     test_set_page_writeback()
> > > >         bool pinned = false;
> > > >         ...
> > > >         pinned = page_is_pin(page); // could be after TestSetPageWriteback
> > > >         TestSetPageWriteback(page);
> > > >         ...
> > > >         return pinned;
> > > >
> > > > Memory barrier:
> > > >     GUP()
> > > >         ...
> > > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > > >         smp_mb();
> > > >         if (PageWriteback(page)) {
> > > >             atomic_add(page->refcount, -PAGE_PIN_BIAS);
> > > >             wait_stable_page(page);
> > > >             goto retry;
> > > >         }
> > > >
> > > >     test_set_page_writeback()
> > > >         bool pinned = false;
> > > >         ...
> > > >         TestSetPageWriteback(page);
> > > >         smp_wmb();
> > > >         pinned = page_is_pin(page);
> > > >         ...
> > > >         return pinned;
> > > >
> > > >
> > > > One is not more complex than the other. One can contend, the other
> > > > will _never_ contend.
> > > 
> > > The complexity is in the validation of lockless algorithms. It's
> > > easier to reason about locks than barriers for the long term
> > > maintainability of this code. I'm with Jan and John on wanting to
> > > explore lock_page() before a barrier-based scheme.
> > 
> > How is the above hard to validate ?
> 
> Well, if you think it's so easy, then please write the test cases so
> we can add them to fstests and make sure that we don't break it in
> future.
> 
> If you can't write filesystem test cases that exercise these race
> conditions reliably, then the answer to your question is "it is
> extremely hard to validate" and the correct thing to do is to start
> with the simple lock_page() based algorithm.
> 
> Premature optimisation in code this complex is something we really,
> really need to avoid.

A litmus test shows that this never happens; I am attaching 2 litmus
tests, one with a barrier and one without. Without the barrier we can
see the double negative: !PageWriteback in GUP and !page_pinned() in
test_set_page_writeback() (0:EAX = 0; 1:EAX = 0; below).


    ~/local/bin/litmus7 -r 100 gup.litmus

    ...

    Histogram (3 states)
    2     *>0:EAX=0; 1:EAX=0; x=1; y=1;
    4999999:>0:EAX=1; 1:EAX=0; x=1; y=1;
    4999999:>0:EAX=0; 1:EAX=1; x=1; y=1;
    Ok

    Witnesses
    Positive: 2, Negative: 9999998
    Condition exists (0:EAX=0 /\ 1:EAX=0) is validated
    Hash=2d53e83cd627ba17ab11c875525e078b
    Observation SB Sometimes 2 9999998
    Time SB 3.24



With the barrier this never happens:
    ~/local/bin/litmus7 -r 10000 gup-mb.litmus

    ...

    Histogram (3 states)
    499579828:>0:EAX=1; 1:EAX=0; x=1; y=1;
    499540152:>0:EAX=0; 1:EAX=1; x=1; y=1;
    880020:>0:EAX=1; 1:EAX=1; x=1; y=1;
    No

    Witnesses
    Positive: 0, Negative: 1000000000
    Condition exists (0:EAX=0 /\ 1:EAX=0) is NOT validated
    Hash=0dd48258687c8f737921f907c093c316
    Observation SB Never 0 1000000000


I do not know of any better test than litmus for this kind of thing.

Cheers,
Jérôme

[-- Attachment #2: gup.litmus --]
[-- Type: text/plain, Size: 159 bytes --]

X86 SB
"GUP"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EAX,[x] ;
locations [x;y;]
exists (0:EAX=0 /\ 1:EAX=0)

[-- Attachment #3: gup-mb.litmus --]
[-- Type: text/plain, Size: 201 bytes --]

X86 SB
"GUP with barrier"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MFENCE      | MFENCE      ;
 MOV EAX,[y] | MOV EAX,[x] ;
locations [x;y;]
exists (0:EAX=0 /\ 1:EAX=0)
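
The two attachments above are the classic SB (store-buffering) shape. For
anyone without litmus7 to hand, the same experiment can be sketched as a
user-space C11 program (illustrative only, not kernel code: the seq_cst
fence stands in for MFENCE, and all function names here are made up):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int x, y;
static int r0, r1;
static bool fenced;

/* P0: MOV [x],$1 ; (MFENCE) ; MOV EAX,[y] */
static void *p0(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (fenced)
        atomic_thread_fence(memory_order_seq_cst); /* MFENCE on x86 */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

/* P1: MOV [y],$1 ; (MFENCE) ; MOV EAX,[x] */
static void *p1(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (fenced)
        atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Run the test n times; count how often the outcome that the fenced
 * version forbids (r0 == 0 && r1 == 0) is observed. */
static int sb_forbidden_count(int n, bool use_fence)
{
    int hits = 0;

    fenced = use_fence;
    for (int i = 0; i < n; i++) {
        pthread_t t0, t1;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&t0, NULL, p0, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r0 == 0 && r1 == 0)
            hits++;
    }
    return hits;
}
```

With use_fence=true the 0:EAX=0 /\ 1:EAX=0 outcome is forbidden by the
C11 memory model, matching the "Observation SB Never" result above; with
use_fence=false it may show up occasionally, matching "Observation SB
Sometimes".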

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16 11:38                                                                         ` Jan Kara
  2019-01-16 11:38                                                                           ` Jan Kara
@ 2019-01-16 13:08                                                                           ` Jerome Glisse
  2019-01-16 13:08                                                                             ` Jerome Glisse
                                                                                               ` (2 more replies)
  1 sibling, 3 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-16 13:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 16, 2019 at 12:38:19PM +0100, Jan Kara wrote:
> On Tue 15-01-19 09:07:59, Jan Kara wrote:
> > Agreed. So with page lock it would actually look like:
> > 
> > get_page_pin()
> > 	lock_page(page);
> > 	wait_for_stable_page();
> > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > 	unlock_page(page);
> > 
> > And if we perform page_pinned() check under page lock, then if
> > page_pinned() returned false, we are sure page is not and will not be
> > pinned until we drop the page lock (and also until page writeback is
> > completed if needed).
> 
> After some more though, why do we even need wait_for_stable_page() and
> lock_page() in get_page_pin()?
> 
> During writepage page_mkclean() will write protect all page tables. So
> there can be no new writeable GUP pins until we unlock the page as all such
> GUPs will have to first go through fault and ->page_mkwrite() handler. And
> that will wait on page lock and do wait_for_stable_page() for us anyway.
> Am I just confused?

Yeah, with the page lock it should synchronize on the pte, but you still
need to check for writeback. IIRC the page is unlocked after the file
system has queued up the write, so the page can be unlocked with
writeback still pending (and PageWriteback() == true), and I am not sure
that in that state we can safely let anyone write to that page. I
am assuming that in some cases the block device also expects stable
page content (RAID stuff).

So the PageWriteback() test is not only for racing page_mkclean()/
test_set_page_writeback() against GUP but also for pending writeback.
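
A user-space C11 model of the barrier variant quoted above may make the
two orderings easier to see side by side (a sketch only: the struct and
helper names are invented, a single full fence stands in for both
smp_mb() and smp_wmb(), and the pin check is simplified to a biased
refcount):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024

/* Toy stand-in for struct page: just the two fields the race involves. */
struct wb_page {
    atomic_int refcount;     /* models page->_refcount */
    atomic_bool writeback;   /* models PageWriteback() */
};

/* GUP side: optimistically add the bias, then re-check writeback with a
 * full fence in between (Jerome's smp_mb()). If writeback has already
 * started, back out; the kernel would wait_stable_page() and retry. */
static bool wb_try_pin(struct wb_page *p)
{
    atomic_fetch_add(&p->refcount, PAGE_PIN_BIAS);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load(&p->writeback)) {
        atomic_fetch_sub(&p->refcount, PAGE_PIN_BIAS);
        return false;
    }
    return true;
}

/* Writeback side: set the flag first, fence, then look for pins,
 * mirroring TestSetPageWriteback() followed by page_is_pin(). */
static bool wb_set_writeback_page_pinned(struct wb_page *p)
{
    atomic_store(&p->writeback, true);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load(&p->refcount) >= PAGE_PIN_BIAS;
}
```

The point of the symmetric ordering is that at least one side always
observes the other: either GUP sees writeback and backs off, or the
writeback path sees the pin.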


> That actually touches on another question I wanted to get opinions on. GUP
> can be for read and GUP can be for write (that is one of GUP flags).
> Filesystems with page cache generally have issues only with GUP for write
> as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
> hotplug have issues with both (DAX cannot truncate page pinned in any way,
> memory hotplug will just loop in kernel until the page gets unpinned). So
> we probably want to track both types of GUP pins and page-cache based
> filesystems will take the hit even if they don't have to for read-pins?

Yes, the distinction between read and write would be nice. With the
mapcount solution you can only increment the mapcount for
GUP(write=true). With the pin bias, the issue is that a large number of
read pins can trigger a false positive, i.e. you would do:
    GUP(vaddr, write)
        ...
        if (write)
            atomic_add(PAGE_PIN_BIAS, &page->_refcount)
        else
            atomic_inc(&page->_refcount)

    PUP(page, write)
        if (write)
            atomic_sub(PAGE_PIN_BIAS, &page->_refcount)
        else
            atomic_dec(&page->_refcount)

I am guessing a false positive because of too many read GUPs is OK, as
it should be unlikely, and when it happens we take the hit.
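
As a sanity check of that accounting, here is a minimal user-space model
(all names here are invented for illustration; the real code would
operate on page->_refcount and page_mapcount()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024

struct pin_page {
    atomic_int refcount;   /* models page->_refcount */
    int mapcount;          /* models page_mapcount(page) */
};

/* Only write pins add the large bias; read pins are ordinary gets. */
static void pin_gup(struct pin_page *p, bool write)
{
    atomic_fetch_add(&p->refcount, write ? PAGE_PIN_BIAS : 1);
}

static void pin_pup(struct pin_page *p, bool write)
{
    atomic_fetch_sub(&p->refcount, write ? PAGE_PIN_BIAS : 1);
}

/* Write-pinned iff the refcount exceeds what mappings plus ordinary
 * references could explain. PAGE_PIN_BIAS or more simultaneous read
 * pins produce exactly the false positive discussed above. */
static bool pin_page_is_pinned(struct pin_page *p)
{
    return atomic_load(&p->refcount) - p->mapcount >= PAGE_PIN_BIAS;
}
```

With this scheme, PAGE_PIN_BIAS or more concurrent read pins become
indistinguishable from a write pin, which is the hit accepted above.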

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15  8:07                                                                       ` Jan Kara
  2019-01-15  8:07                                                                         ` Jan Kara
  2019-01-15 17:15                                                                         ` Jerome Glisse
@ 2019-01-16 11:38                                                                         ` Jan Kara
  2019-01-16 11:38                                                                           ` Jan Kara
  2019-01-16 13:08                                                                           ` Jerome Glisse
  2019-01-17  5:25                                                                         ` John Hubbard
  3 siblings, 2 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-16 11:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 15-01-19 09:07:59, Jan Kara wrote:
> Agreed. So with page lock it would actually look like:
> 
> get_page_pin()
> 	lock_page(page);
> 	wait_for_stable_page();
> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> 	unlock_page(page);
> 
> And if we perform page_pinned() check under page lock, then if
> page_pinned() returned false, we are sure page is not and will not be
> pinned until we drop the page lock (and also until page writeback is
> completed if needed).

After some more thought, why do we even need wait_for_stable_page() and
lock_page() in get_page_pin()?

During writepage page_mkclean() will write protect all page tables. So
there can be no new writeable GUP pins until we unlock the page as all such
GUPs will have to first go through fault and ->page_mkwrite() handler. And
that will wait on page lock and do wait_for_stable_page() for us anyway.
Am I just confused?

That actually touches on another question I wanted to get opinions on. GUP
can be for read and GUP can be for write (that is one of GUP flags).
Filesystems with page cache generally have issues only with GUP for write
as it can currently corrupt data, unexpectedly dirty page etc.. DAX & memory
hotplug have issues with both (DAX cannot truncate page pinned in any way,
memory hotplug will just loop in kernel until the page gets unpinned). So
we probably want to track both types of GUP pins and page-cache based
filesystems will take the hit even if they don't have to for read-pins?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  2:23                                                                                     ` Jerome Glisse
  2019-01-16  2:23                                                                                       ` Jerome Glisse
@ 2019-01-16  4:34                                                                                       ` Dave Chinner
  2019-01-16  4:34                                                                                         ` Dave Chinner
  2019-01-16 14:50                                                                                         ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: Dave Chinner @ 2019-01-16  4:34 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, John Hubbard, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 09:23:12PM -0500, Jerome Glisse wrote:
> On Tue, Jan 15, 2019 at 06:01:09PM -0800, Dan Williams wrote:
> > On Tue, Jan 15, 2019 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> > [..]
> > > To make it clear.
> > >
> > > Lock code:
> > >     GUP()
> > >         ...
> > >         lock_page(page);
> > >         if (PageWriteback(page)) {
> > >             unlock_page(page);
> > >             wait_stable_page(page);
> > >             goto retry;
> > >         }
> > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > >         unlock_page(page);
> > >
> > >     test_set_page_writeback()
> > >         bool pinned = false;
> > >         ...
> > >         pinned = page_is_pin(page); // could be after TestSetPageWriteback
> > >         TestSetPageWriteback(page);
> > >         ...
> > >         return pinned;
> > >
> > > Memory barrier:
> > >     GUP()
> > >         ...
> > >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> > >         smp_mb();
> > >         if (PageWriteback(page)) {
> > >             atomic_add(page->refcount, -PAGE_PIN_BIAS);
> > >             wait_stable_page(page);
> > >             goto retry;
> > >         }
> > >
> > >     test_set_page_writeback()
> > >         bool pinned = false;
> > >         ...
> > >         TestSetPageWriteback(page);
> > >         smp_wmb();
> > >         pinned = page_is_pin(page);
> > >         ...
> > >         return pinned;
> > >
> > >
> > > One is not more complex than the other. One can contend, the other
> > > will _never_ contend.
> > 
> > The complexity is in the validation of lockless algorithms. It's
> > easier to reason about locks than barriers for the long term
> > maintainability of this code. I'm with Jan and John on wanting to
> > explore lock_page() before a barrier-based scheme.
> 
> How is the above hard to validate ?

Well, if you think it's so easy, then please write the test cases so
we can add them to fstests and make sure that we don't break it in
future.

If you can't write filesystem test cases that exercise these race
conditions reliably, then the answer to your question is "it is
extremely hard to validate" and the correct thing to do is to start
with the simple lock_page() based algorithm.

Premature optimisation in code this complex is something we really,
really need to avoid.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  2:01                                                                                   ` Dan Williams
  2019-01-16  2:01                                                                                     ` Dan Williams
@ 2019-01-16  2:23                                                                                     ` Jerome Glisse
  2019-01-16  2:23                                                                                       ` Jerome Glisse
  2019-01-16  4:34                                                                                       ` Dave Chinner
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-16  2:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 06:01:09PM -0800, Dan Williams wrote:
> On Tue, Jan 15, 2019 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> [..]
> > To make it clear.
> >
> > Lock code:
> >     GUP()
> >         ...
> >         lock_page(page);
> >         if (PageWriteback(page)) {
> >             unlock_page(page);
> >             wait_stable_page(page);
> >             goto retry;
> >         }
> >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> >         unlock_page(page);
> >
> >     test_set_page_writeback()
> >         bool pinned = false;
> >         ...
> >         pinned = page_is_pin(page); // could be after TestSetPageWriteback
> >         TestSetPageWriteback(page);
> >         ...
> >         return pinned;
> >
> > Memory barrier:
> >     GUP()
> >         ...
> >         atomic_add(page->refcount, PAGE_PIN_BIAS);
> >         smp_mb();
> >         if (PageWriteback(page)) {
> >             atomic_add(page->refcount, -PAGE_PIN_BIAS);
> >             wait_stable_page(page);
> >             goto retry;
> >         }
> >
> >     test_set_page_writeback()
> >         bool pinned = false;
> >         ...
> >         TestSetPageWriteback(page);
> >         smp_wmb();
> >         pinned = page_is_pin(page);
> >         ...
> >         return pinned;
> >
> >
> > One is not more complex than the other. One can contend, the other
> > will _never_ contend.
> 
> The complexity is in the validation of lockless algorithms. It's
> easier to reason about locks than barriers for the long term
> maintainability of this code. I'm with Jan and John on wanting to
> explore lock_page() before a barrier-based scheme.

How is the above hard to validate? Either GUP sees the racing
test_set_page_writeback() because it tests writeback after
incrementing the refcount, or test_set_page_writeback() sees
GUP because it checks for the pin after setting the writeback
bit.

So if GUP sees !PageWriteback(), then test_set_page_writeback()
sees page_pin(page) as true. If test_set_page_writeback() sees
page_pin(page) as false, then GUP did see PageWriteback() as
true.

You _never_ get !PageWriteback() in GUP and !page_pin() in
test_set_page_writeback() when they race. That scenario is
impossible because of the memory barriers.
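The argument above is the classic store-buffering pattern. A user-space
sketch can exercise it (C11 seq_cst atomics stand in for the kernel's
smp_mb()/smp_wmb(); `pin` and `wb` stand in for the biased refcount and
the writeback bit — this is a model, not the kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space sketch of the store-buffering pattern argued above.
 * C11 seq_cst atomics stand in for the kernel's smp_mb()/smp_wmb();
 * "pin" stands in for the biased refcount, "wb" for PageWriteback. */
static atomic_int pin, wb;
static int gup_saw_wb, wbk_saw_pin;

static void *gup_side(void *arg)
{
	(void)arg;
	atomic_fetch_add(&pin, 1);	/* take the pin ...        */
	gup_saw_wb = atomic_load(&wb);	/* ... then test writeback */
	return NULL;
}

static void *writeback_side(void *arg)
{
	(void)arg;
	atomic_store(&wb, 1);		/* set writeback ...         */
	wbk_saw_pin = atomic_load(&pin);/* ... then test for the pin */
	return NULL;
}

/* Run one race; true iff both sides missed each other, which is the
 * outcome the barriers are supposed to forbid. */
static bool forbidden_outcome(void)
{
	pthread_t a, b;

	atomic_store(&pin, 0);
	atomic_store(&wb, 0);
	gup_saw_wb = wbk_saw_pin = 0;
	pthread_create(&a, NULL, gup_side, NULL);
	pthread_create(&b, NULL, writeback_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return gup_saw_wb == 0 && wbk_saw_pin == 0;
}
```

Running the race in a loop never observes the forbidden outcome; with
weaker orderings (or the fences removed) it eventually would.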

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  1:56                                                                                 ` Jerome Glisse
  2019-01-16  1:56                                                                                   ` Jerome Glisse
@ 2019-01-16  2:01                                                                                   ` Dan Williams
  2019-01-16  2:01                                                                                     ` Dan Williams
  2019-01-16  2:23                                                                                     ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: Dan Williams @ 2019-01-16  2:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 5:56 PM Jerome Glisse <jglisse@redhat.com> wrote:
> On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
[..]
> To make it clear.
>
> Lock code:
>     GUP()
>         ...
>         lock_page(page);
>         if (PageWriteback(page)) {
>             unlock_page(page);
>             wait_stable_page(page);
>             goto retry;
>         }
>         atomic_add(page->refcount, PAGE_PIN_BIAS);
>         unlock_page(page);
>
>     test_set_page_writeback()
>         bool pinned = false;
>         ...
>         pinned = page_is_pin(page); // could be after TestSetPageWriteback
>         TestSetPageWriteback(page);
>         ...
>         return pinned;
>
> Memory barrier:
>     GUP()
>         ...
>         atomic_add(page->refcount, PAGE_PIN_BIAS);
>         smp_mb();
>         if (PageWriteback(page)) {
>             atomic_add(page->refcount, -PAGE_PIN_BIAS);
>             wait_stable_page(page);
>             goto retry;
>         }
>
>     test_set_page_writeback()
>         bool pinned = false;
>         ...
>         TestSetPageWriteback(page);
>         smp_wmb();
>         pinned = page_is_pin(page);
>         ...
>         return pinned;
>
>
> One is not more complex than the other. One can contend, the other
> will _never_ contend.

The complexity is in the validation of lockless algorithms. It's
easier to reason about locks than barriers for the long term
maintainability of this code. I'm with Jan and John on wanting to
explore lock_page() before a barrier-based scheme.

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  0:44                                                                               ` John Hubbard
  2019-01-16  0:44                                                                                 ` John Hubbard
@ 2019-01-16  1:56                                                                                 ` Jerome Glisse
  2019-01-16  1:56                                                                                   ` Jerome Glisse
  2019-01-16  2:01                                                                                   ` Dan Williams
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-16  1:56 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> On 1/15/19 2:12 PM, Jerome Glisse wrote:
> > On Tue, Jan 15, 2019 at 01:56:51PM -0800, John Hubbard wrote:
> >> On 1/15/19 9:15 AM, Jerome Glisse wrote:
> >>> On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
> >>>> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> >>>>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> >>>>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
> >>>>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> >>>>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> >>>>>>>> [...]
> >>>>>>>>
> >>>>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> >>>>>>>>>>> option, which we'll certainly need in order to safely convert all the call
> >>>>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >>>>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
> >>>>>>>>>>> a separate patchset, as you recommended.
> >>>>>>>>>>>
> >>>>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
> >>>>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >>>>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
> >>>>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >>>>>>>>>>> then I'd recommend using another page bit to do the same thing.)
> >>>>>>>>>>
> >>>>>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
> >>>>>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
> >>>>>>>>>> ordering and i do not see any issues.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Why is it that page lock cannot be used for gup fast, btw?
> >>>>>>>>
> >>>>>>>> Well it can not happen within the preempt disable section. But after
> >>>>>>>> as a post pass before GUP_fast return and after reenabling preempt then
> >>>>>>>> it is fine like it would be for regular GUP. But locking page for GUP
> >>>>>>>> is also likely to slow down some workload (with direct-IO).
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right, and so to crux of the matter: taking an uncontended page lock
> >>>>>>> involves pretty much the same set of operations that your approach does.
> >>>>>>> (If gup ends up contended with the page lock for other reasons than these
> >>>>>>> paths, that seems surprising.) I'd expect very similar performance.
> >>>>>>>
> >>>>>>> But the page lock approach leads to really dramatically simpler code (and
> >>>>>>> code reviews, let's not forget). Any objection to my going that
> >>>>>>> direction, and keeping this idea as a Plan B? I think the next step will
> >>>>>>> be, once again, to gather some performance metrics, so maybe that will
> >>>>>>> help us decide.
> >>>>>>
> >>>>>> FWIW I agree that using page lock for protecting page pinning (and thus
> >>>>>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
> >>>>>> convinced there will be measurable difference to the more complex scheme
> >>>>>> with barriers Jerome suggests unless that page lock contended. Jerome is
> >>>>>> right that you cannot just do lock_page() in gup_fast() path. There you
> >>>>>> have to do trylock_page() and if that fails just bail out to the slow gup
> >>>>>> path.
> >>>>>>
> >>>>>> Regarding places other than page_mkclean() that need to check pinned state:
> >>>>>> Definitely page migration will want to check whether the page is pinned or
> >>>>>> not so that it can deal differently with short-term page references vs
> >>>>>> longer-term pins.
> >>>>>>
> >>>>>> Also there is one more idea I had how to record number of pins in the page:
> >>>>>>
> >>>>>> #define PAGE_PIN_BIAS	1024
> >>>>>>
> >>>>>> get_page_pin()
> >>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>>>>
> >>>>>> put_page_pin();
> >>>>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>>>>
> >>>>>> page_pinned(page)
> >>>>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>>>>
> >>>>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
> >>>>>> which should be plenty (but we should check for that and bail with error if
> >>>>>> it would overflow). Also there will be no false negatives and false
> >>>>>> positives only if there are more than 1024 non-page-table references to the
> >>>>>> page which I expect to be rare (we might want to also subtract
> >>>>>> hpage_nr_pages() for radix tree references to avoid excessive false
> >>>>>> positives for huge pages although at this point I don't think they would
> >>>>>> matter). Thoughts?
> >>>>>
> >>>>> Racing PUP are as likely to cause issues:
> >>>>>
> >>>>> CPU0                        | CPU1       | CPU2
> >>>>>                             |            |
> >>>>>                             | PUP()      |
> >>>>>     page_pinned(page)       |            |
> >>>>>       (page_count(page) -   |            |
> >>>>>        page_mapcount(page)) |            |
> >>>>>                             |            | GUP()
> >>>>>
> >>>>> So here the refcount snap-shot does not include the second GUP and
> >>>>> we can have a false negative ie the page_pinned() will return false
> >>>>> because of the PUP happening just before on CPU1 despite the racing
> >>>>> GUP on CPU2 just after.
> >>>>>
> >>>>> I believe only either lock or memory ordering with barrier can
> >>>>> guarantee that we do not miss GUP ie no false negative. Still the
> >>>>> bias idea might be usefull as with it we should not need a flag.
> >>>>
> >>>> Right. We need similar synchronization (i.e., page lock or careful checks
> >>>> with memory barriers) if we want to get a reliable page pin information.
> >>>>
> >>>>> So to make the above safe it would still need the page write back
> >>>>> double check that i described so that GUP back-off if it raced with
> >>>>> page_mkclean,clear_page_dirty_for_io and the fs write page call back
> >>>>> which call test_set_page_writeback() (yes it is very unlikely but
> >>>>> might still happen).
> >>>>
> >>>> Agreed. So with page lock it would actually look like:
> >>>>
> >>>> get_page_pin()
> >>>> 	lock_page(page);
> >>>> 	wait_for_stable_page();
> >>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>> 	unlock_page(page);
> >>>>
> >>>> And if we perform page_pinned() check under page lock, then if
> >>>> page_pinned() returned false, we are sure page is not and will not be
> >>>> pinned until we drop the page lock (and also until page writeback is
> >>>> completed if needed).
> >>>>
> >>
> >> OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
> >> compensation steps, is a pretty major selling point. And if we do the above
> >> locking, that does look correct to me. I wasn't able to visualize the
> >> locking you had in mind, until just now (above), but now it is clear, 
> >> thanks for spelling it out.
> >>
> >>>
> >>> So i still can't see anything wrong with that idea, i had similar
> >>> one in the past and diss-missed and i can't remember why :( But
> >>> thinking over and over i do not see any issue beside refcount wrap
> >>> around. Which is something that can happens today thought i don't
> >>> think it can be use in an evil way and we can catch it and be
> >>> loud about it.
> >>>
> >>> So i think the following would be bullet proof:
> >>>
> >>>
> >>> get_page_pin()
> >>>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>     smp_wmb();
> >>>     if (PageWriteback(page)) {
> >>>         // back off
> >>>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>         // re-enable preempt if in fast
> >>>         wait_on_page_writeback(page);
> >>>         goto retry;
> >>>     }
> >>>
> >>> put_page_pin();
> >>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>
> >>> page_pinned(page)
> >>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>
> >>> test_set_page_writeback()
> >>>     ...
> >>>     wb = TestSetPageWriteback(page)
> >>
> >> Minor point, but using PageWriteback for synchronization may rule out using
> >> wait_for_stable_page(), because wait_for_stable_page() might not actually 
> >> wait_on_page_writeback. Jan pointed out in the other thread, that we should
> >> prefer wait_for_stable_page(). 
> > 
> > Yes, but wait_for_stable_page() has no page flag so nothing we can
> > synchronize against. So my advice would be:
> >     if (PageWriteback(page)) {
> >         wait_for_stable_page(page);
> >         if (PageWriteback(page))
> >             wait_for_write_back(page);
> >     }
> > 
> > wait_for_stable_page() can optimize out the wait_for_write_back()
> > if it is safe to do so. So we can improve the above slightly too.
> > 
> >>
> >>>     smp_mb();
> >>>     if (page_pinned(page)) {
> >>>         // report page as pinned to caller of test_set_page_writeback()
> >>>     }
> >>>     ...
> >>>
> >>> This is text book memory barrier. Either get_page_pin() see racing
> >>> test_set_page_writeback() or test_set_page_writeback() see racing GUP
> >>>
> >>>
> >>
> >> This approach is probably workable, but again, it's more complex and comes
> >> without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
> >> to use it as either "do this after everything is up and running and stable", 
> >> or else as Plan B, if there is some performance implication from the page lock.
> >>
> >> Simple and correct first, then performance optimization, *if* necessary.
> > 
> > I do not like taking page lock while they are no good reasons to do so.
> 
> There actually are very good reasons to do so! These include:
> 
> 1) Simpler code that is less likely to have subtle bugs in the initial 
>    implementations.

It is not simpler; a memory barrier is one line of code ...

> 
> 2) Pre-existing, known locking constructs that include instrumentation and
>    visibility.

Like I said, I don't think the page lock benefits from those, as it is
very struct-page specific. I need to check what is available, but you
definitely do not get all the bells and whistles you get with a
regular lock.

> 
> 3) ...and all of the other goodness that comes from smaller and simpler code.
> 
> I'm not saying that those reasons necessarily prevail here, but it's not
> fair to say "there are no good reasons". Less code is still worth something,
> even in the kernel.

Again, a memory barrier is just one line of code; I do not see a lock
as something simpler than that.

> 
> > The above is textbook memory barrier as explain in Documentations/
> > Forcing page lock for GUP will inevitably slow down some workload and
> 
> Such as?
> 
> Here's the thing: if a workload is taking the page lock for some
> reason, and also competing with GUP, that's actually something that I worry
> about: what is changing in page state, while we're setting up GUP? Either
> we audit for that, or we let runtime locking rules (taking the page lock)
> keep us out of trouble in the first place.
> 
> In other words, if there is a performance hit, it might very likely be
> due to a required synchronization that is taking place.

You need to take the page lock for several things, off the top of my head:
inserting a mapping, migration, truncate, swapping, reverse mapping, mlock,
cgroup, madvise, ... so if GUP now also needs it, you force synchronization
with all of that for direct-IO.

You do not need to synchronize with most of the above, as they do not care
about GUP. In fact only the writeback path needs synchronization; I cannot
think of anything else that would need to synchronize with GUP.

> > report for such can takes time to trickle down to mailing list and it
> > can takes time for people to actualy figure out that this are the GUP
> > changes that introduce such regression.
> > 
> > So if we could minimize performance regression with something like
> > memory barrier we should definitly do that.
> 
> We do not yet know that the more complex memory barrier approach is actually
> faster. That's worth repeating.

I would be surprised if a memory barrier were slower than a lock. A lock
can contend; a memory barrier does not. A lock requires an atomic
operation and thus an implied barrier, so a lock should translate into
something slower than a memory barrier alone.


> > Also i do not think that page lock has lock dep (as it is not using
> > any of the usual locking function) but that's just my memory of that
> > code.
> > 
> 
> Lock page is pretty thoroughly instrumented. It uses wait_on_page_bit_common(),
> which in turn uses spin locks and more.

It does not give you all the bells and whistles you get with spinlock
debugging. The spinlock taken in wait_on_page_bit_common() is for the
waitqueue the page belongs to, so you only get debugging on that, not
on the individual page-lock bit. So I do not think there is anything
that would help debug the page lock itself, such as catching a double
unlock or a deadlock.


> The more I think about this, the more I want actual performance data to 
> justify anything involving the more complicated custom locking. So I think
> it's best to build the page lock based version, do some benchmarks, and see
> where we stand.

This is not custom locking; we already employ memory barriers in several
places. Memory barriers are quite common in the kernel and we should
favor them when there is no need for a lock.

A memory barrier never contends, so you know you will never see lock
contention ... so a memory barrier can only be faster than anything
involving a lock. The contrary would surprise me.

Using a lock and believing it will be as fast as a memory barrier is
hoping that you will never contend on that lock. So I would rather see
proof that GUP will never contend on the page lock.


To make it clear.

Lock code:
    GUP()
        ...
        lock_page(page);
        if (PageWriteback(page)) {
            unlock_page(page);
            wait_for_stable_page(page);
            goto retry;
        }
        atomic_add(&page->_refcount, PAGE_PIN_BIAS);
        unlock_page(page);

    test_set_page_writeback()
        bool pinned = false;
        ...
        pinned = page_pinned(page); // could be after TestSetPageWriteback
        TestSetPageWriteback(page);
        ...
        return pinned;

Memory barrier:
    GUP()
        ...
        atomic_add(&page->_refcount, PAGE_PIN_BIAS);
        smp_mb();
        if (PageWriteback(page)) {
            atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
            wait_for_stable_page(page);
            goto retry;
        }

    test_set_page_writeback()
        bool pinned = false;
        ...
        TestSetPageWriteback(page);
        smp_mb(); // full barrier: orders the flag store before the pin read
        pinned = page_pinned(page);
        ...
        return pinned;


One is not more complex than the other. One can contend, the other
will _never_ contend.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  1:56                                                                                 ` Jerome Glisse
@ 2019-01-16  1:56                                                                                   ` Jerome Glisse
  2019-01-16  2:01                                                                                   ` Dan Williams
  1 sibling, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-16  1:56 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 04:44:41PM -0800, John Hubbard wrote:
> On 1/15/19 2:12 PM, Jerome Glisse wrote:
> > On Tue, Jan 15, 2019 at 01:56:51PM -0800, John Hubbard wrote:
> >> On 1/15/19 9:15 AM, Jerome Glisse wrote:
> >>> On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
> >>>> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> >>>>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> >>>>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
> >>>>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> >>>>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> >>>>>>>> [...]
> >>>>>>>>
> >>>>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> >>>>>>>>>>> option, which we'll certainly need in order to safely convert all the call
> >>>>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >>>>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
> >>>>>>>>>>> a separate patchset, as you recommended.
> >>>>>>>>>>>
> >>>>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
> >>>>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >>>>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
> >>>>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >>>>>>>>>>> then I'd recommend using another page bit to do the same thing.)
> >>>>>>>>>>
> >>>>>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
> >>>>>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
> >>>>>>>>>> ordering and i do not see any issues.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Why is it that page lock cannot be used for gup fast, btw?
> >>>>>>>>
> >>>>>>>> Well it can not happen within the preempt disable section. But after
> >>>>>>>> as a post pass before GUP_fast return and after reenabling preempt then
> >>>>>>>> it is fine like it would be for regular GUP. But locking page for GUP
> >>>>>>>> is also likely to slow down some workload (with direct-IO).
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right, and so to crux of the matter: taking an uncontended page lock
> >>>>>>> involves pretty much the same set of operations that your approach does.
> >>>>>>> (If gup ends up contended with the page lock for other reasons than these
> >>>>>>> paths, that seems surprising.) I'd expect very similar performance.
> >>>>>>>
> >>>>>>> But the page lock approach leads to really dramatically simpler code (and
> >>>>>>> code reviews, let's not forget). Any objection to my going that
> >>>>>>> direction, and keeping this idea as a Plan B? I think the next step will
> >>>>>>> be, once again, to gather some performance metrics, so maybe that will
> >>>>>>> help us decide.
> >>>>>>
> >>>>>> FWIW I agree that using page lock for protecting page pinning (and thus
> >>>>>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
> >>>>>> convinced there will be measurable difference to the more complex scheme
> >>>>>> with barriers Jerome suggests unless that page lock contended. Jerome is
> >>>>>> right that you cannot just do lock_page() in gup_fast() path. There you
> >>>>>> have to do trylock_page() and if that fails just bail out to the slow gup
> >>>>>> path.
> >>>>>>
> >>>>>> Regarding places other than page_mkclean() that need to check pinned state:
> >>>>>> Definitely page migration will want to check whether the page is pinned or
> >>>>>> not so that it can deal differently with short-term page references vs
> >>>>>> longer-term pins.
> >>>>>>
> >>>>>> Also there is one more idea I had how to record number of pins in the page:
> >>>>>>
> >>>>>> #define PAGE_PIN_BIAS	1024
> >>>>>>
> >>>>>> get_page_pin()
> >>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>>>>
> >>>>>> put_page_pin();
> >>>>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>>>>
> >>>>>> page_pinned(page)
> >>>>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>>>>
> >>>>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
> >>>>>> which should be plenty (but we should check for that and bail with error if
> >>>>>> it would overflow). Also there will be no false negatives and false
> >>>>>> positives only if there are more than 1024 non-page-table references to the
> >>>>>> page which I expect to be rare (we might want to also subtract
> >>>>>> hpage_nr_pages() for radix tree references to avoid excessive false
> >>>>>> positives for huge pages although at this point I don't think they would
> >>>>>> matter). Thoughts?
> >>>>>
> >>>>> Racing PUP are as likely to cause issues:
> >>>>>
> >>>>> CPU0                        | CPU1       | CPU2
> >>>>>                             |            |
> >>>>>                             | PUP()      |
> >>>>>     page_pinned(page)       |            |
> >>>>>       (page_count(page) -   |            |
> >>>>>        page_mapcount(page)) |            |
> >>>>>                             |            | GUP()
> >>>>>
> >>>>> So here the refcount snap-shot does not include the second GUP and
> >>>>> we can have a false negative ie the page_pinned() will return false
> >>>>> because of the PUP happening just before on CPU1 despite the racing
> >>>>> GUP on CPU2 just after.
> >>>>>
> >>>>> I believe only either lock or memory ordering with barrier can
> >>>>> guarantee that we do not miss GUP ie no false negative. Still the
> >>>>> bias idea might be usefull as with it we should not need a flag.
> >>>>
> >>>> Right. We need similar synchronization (i.e., page lock or careful checks
> >>>> with memory barriers) if we want to get a reliable page pin information.
> >>>>
> >>>>> So to make the above safe it would still need the page write back
> >>>>> double check that i described so that GUP back-off if it raced with
> >>>>> page_mkclean,clear_page_dirty_for_io and the fs write page call back
> >>>>> which call test_set_page_writeback() (yes it is very unlikely but
> >>>>> might still happen).
> >>>>
> >>>> Agreed. So with page lock it would actually look like:
> >>>>
> >>>> get_page_pin()
> >>>> 	lock_page(page);
> >>>> 	wait_for_stable_page();
> >>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>> 	unlock_page(page);
> >>>>
> >>>> And if we perform page_pinned() check under page lock, then if
> >>>> page_pinned() returned false, we are sure page is not and will not be
> >>>> pinned until we drop the page lock (and also until page writeback is
> >>>> completed if needed).
> >>>>
> >>
> >> OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
> >> compensation steps, is a pretty major selling point. And if we do the above
> >> locking, that does look correct to me. I wasn't able to visualize the
> >> locking you had in mind, until just now (above), but now it is clear, 
> >> thanks for spelling it out.
> >>
> >>>
> >>> So i still can't see anything wrong with that idea, i had similar
> >>> one in the past and diss-missed and i can't remember why :( But
> >>> thinking over and over i do not see any issue beside refcount wrap
> >>> around. Which is something that can happens today thought i don't
> >>> think it can be use in an evil way and we can catch it and be
> >>> loud about it.
> >>>
> >>> So i think the following would be bullet proof:
> >>>
> >>>
> >>> get_page_pin()
> >>>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>     smp_wmb();
> >>>     if (PageWriteback(page)) {
> >>>         // back off
> >>>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>         // re-enable preempt if in fast
> >>>         wait_on_page_writeback(page);
> >>>         goto retry;
> >>>     }
> >>>
> >>> put_page_pin();
> >>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>
> >>> page_pinned(page)
> >>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>
> >>> test_set_page_writeback()
> >>>     ...
> >>>     wb = TestSetPageWriteback(page)
> >>
> >> Minor point, but using PageWriteback for synchronization may rule out using
> >> wait_for_stable_page(), because wait_for_stable_page() might not actually 
> >> wait_on_page_writeback. Jan pointed out in the other thread, that we should
> >> prefer wait_for_stable_page(). 
> > 
> > Yes, but wait_for_stable_page() has no page flag so nothing we can
> > synchronize against. So my advice would be:
> >     if (PageWriteback(page)) {
> >         wait_for_stable_page(page);
> >         if (PageWriteback(page))
> >             wait_for_write_back(page);
> >     }
> > 
> > wait_for_stable_page() can optimize out the wait_for_write_back()
> > if it is safe to do so. So we can improve the above slightly too.
> > 
> >>
> >>>     smp_mb();
> >>>     if (page_pinned(page)) {
> >>>         // report page as pinned to caller of test_set_page_writeback()
> >>>     }
> >>>     ...
> >>>
> >>> This is text book memory barrier. Either get_page_pin() see racing
> >>> test_set_page_writeback() or test_set_page_writeback() see racing GUP
> >>>
> >>>
> >>
> >> This approach is probably workable, but again, it's more complex and comes
> >> without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
> >> to use it as either "do this after everything is up and running and stable", 
> >> or else as Plan B, if there is some performance implication from the page lock.
> >>
> >> Simple and correct first, then performance optimization, *if* necessary.
> > 
> > I do not like taking page lock while they are no good reasons to do so.
> 
> There actually are very good reasons to do so! These include:
> 
> 1) Simpler code that is less likely to have subtle bugs in the initial 
>    implementations.

It is not simpler; a memory barrier is one line of code ...

> 
> 2) Pre-existing, known locking constructs that include instrumentation and
>    visibility.

Like I said, I don't think the page lock benefits from those, as it is
very struct-page specific. I need to check what is available, but you
definitely do not get all the bells and whistles you get with a
regular lock.

> 
> 3) ...and all of the other goodness that comes from smaller and simpler code.
> 
> I'm not saying that those reasons necessarily prevail here, but it's not
> fair to say "there are no good reasons". Less code is still worth something,
> even in the kernel.

Again, a memory barrier is just one line of code; I do not see a lock
as something simpler than that.

> 
> > The above is textbook memory barrier as explain in Documentations/
> > Forcing page lock for GUP will inevitably slow down some workload and
> 
> Such as?
> 
> Here's the thing: if a workload is taking the page lock for some
> reason, and also competing with GUP, that's actually something that I worry
> about: what is changing in page state, while we're setting up GUP? Either
> we audit for that, or we let runtime locking rules (taking the page lock)
> keep us out of trouble in the first place.
> 
> In other words, if there is a performance hit, it might very likely be
> due to a required synchronization that is taking place.

You need to take the page lock for several things; off the top of my head:
inserting a mapping, migration, truncation, swapping, reverse mapping,
mlock, cgroup, madvise, ... so if GUP now also needs it, then you force
direct-IO to synchronize with all of that.

You do not need to synchronize with most of the above, as they do not care
about GUP. In fact, only the writeback path needs synchronization; I cannot
think of anything else that would need to synchronize with GUP.

> > report for such can takes time to trickle down to mailing list and it
> > can takes time for people to actualy figure out that this are the GUP
> > changes that introduce such regression.
> > 
> > So if we could minimize performance regression with something like
> > memory barrier we should definitly do that.
> 
> We do not yet know that the more complex memory barrier approach is actually
> faster. That's worth repeating.

I would be surprised if a memory barrier were slower than a lock. Locks
can contend; memory barriers do not. A lock requires an atomic operation,
and thus an implied barrier, so a lock should translate into something
slower than a memory barrier alone.


> > Also i do not think that page lock has lock dep (as it is not using
> > any of the usual locking function) but that's just my memory of that
> > code.
> > 
> 
> Lock page is pretty thoroughly instrumented. It uses wait_on_page_bit_common(),
> which in turn uses spin locks and more.

It does not give you all the bells and whistles you get with spinlock
debugging. The spinlock taken in wait_on_page_bit_common() is for the
waitqueue the page belongs to, so you only get debugging on that, not
on the individual page lock bit. So I do not think there is anything
there that would help debug page lock problems like a double unlock or
a deadlock.


> The more I think about this, the more I want actual performance data to 
> justify anything involving the more complicated custom locking. So I think
> it's best to build the page lock based version, do some benchmarks, and see
> where we stand.

This is not custom locking; we already employ memory barriers in several
places. Memory barriers are quite common in the kernel, and we should
favor them when there is no need for a lock.

Memory barriers never contend, so you know you will never see lock
contention ... a memory barrier can only be faster than anything
involving a lock. The contrary would surprise me.

Using a lock and believing it will be as fast as a memory barrier is
hoping that you will never contend on that lock. So I would rather
see proof that GUP will never contend on the page lock.


To make it clear.

Lock code:
    GUP()
        ...
        lock_page(page);
        if (PageWriteback(page)) {
            unlock_page(page);
            wait_for_stable_page(page);
            goto retry;
        }
        atomic_add(&page->_refcount, PAGE_PIN_BIAS);
        unlock_page(page);

    test_set_page_writeback()
        bool pinned = false;
        ...
        pinned = page_pinned(page); // could be after TestSetPageWriteback
        TestSetPageWriteback(page);
        ...
        return pinned;

Memory barrier:
    GUP()
        ...
        atomic_add(&page->_refcount, PAGE_PIN_BIAS);
        smp_mb();
        if (PageWriteback(page)) {
            atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
            wait_for_stable_page(page);
            goto retry;
        }

    test_set_page_writeback()
        bool pinned = false;
        ...
        TestSetPageWriteback(page);
        smp_mb(); // full barrier: orders the flag store before the pin read
        pinned = page_pinned(page);
        ...
        return pinned;


One is not more complex than the other. One can contend, the other
will _never_ contend.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15 22:12                                                                             ` Jerome Glisse
  2019-01-15 22:12                                                                               ` Jerome Glisse
@ 2019-01-16  0:44                                                                               ` John Hubbard
  2019-01-16  0:44                                                                                 ` John Hubbard
  2019-01-16  1:56                                                                                 ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-16  0:44 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/15/19 2:12 PM, Jerome Glisse wrote:
> On Tue, Jan 15, 2019 at 01:56:51PM -0800, John Hubbard wrote:
>> On 1/15/19 9:15 AM, Jerome Glisse wrote:
>>> On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
>>>> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
>>>>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
>>>>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
>>>>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
>>>>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>>>>>>>>> option, which we'll certainly need in order to safely convert all the call
>>>>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>>>>>>>>> a separate patchset, as you recommended.
>>>>>>>>>>>
>>>>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>>>>>>>>> then I'd recommend using another page bit to do the same thing.)
>>>>>>>>>>
>>>>>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
>>>>>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
>>>>>>>>>> ordering and i do not see any issues.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>>>>>>
>>>>>>>> Well it can not happen within the preempt disable section. But after
>>>>>>>> as a post pass before GUP_fast return and after reenabling preempt then
>>>>>>>> it is fine like it would be for regular GUP. But locking page for GUP
>>>>>>>> is also likely to slow down some workload (with direct-IO).
>>>>>>>>
>>>>>>>
>>>>>>> Right, and so to crux of the matter: taking an uncontended page lock
>>>>>>> involves pretty much the same set of operations that your approach does.
>>>>>>> (If gup ends up contended with the page lock for other reasons than these
>>>>>>> paths, that seems surprising.) I'd expect very similar performance.
>>>>>>>
>>>>>>> But the page lock approach leads to really dramatically simpler code (and
>>>>>>> code reviews, let's not forget). Any objection to my going that
>>>>>>> direction, and keeping this idea as a Plan B? I think the next step will
>>>>>>> be, once again, to gather some performance metrics, so maybe that will
>>>>>>> help us decide.
>>>>>>
>>>>>> FWIW I agree that using page lock for protecting page pinning (and thus
>>>>>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
>>>>>> convinced there will be measurable difference to the more complex scheme
>>>>>> with barriers Jerome suggests unless that page lock contended. Jerome is
>>>>>> right that you cannot just do lock_page() in gup_fast() path. There you
>>>>>> have to do trylock_page() and if that fails just bail out to the slow gup
>>>>>> path.
>>>>>>
>>>>>> Regarding places other than page_mkclean() that need to check pinned state:
>>>>>> Definitely page migration will want to check whether the page is pinned or
>>>>>> not so that it can deal differently with short-term page references vs
>>>>>> longer-term pins.
>>>>>>
>>>>>> Also there is one more idea I had how to record number of pins in the page:
>>>>>>
>>>>>> #define PAGE_PIN_BIAS	1024
>>>>>>
>>>>>> get_page_pin()
>>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>>>
>>>>>> put_page_pin();
>>>>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>>>>
>>>>>> page_pinned(page)
>>>>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>>>>
>>>>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
>>>>>> which should be plenty (but we should check for that and bail with error if
>>>>>> it would overflow). Also there will be no false negatives and false
>>>>>> positives only if there are more than 1024 non-page-table references to the
>>>>>> page which I expect to be rare (we might want to also subtract
>>>>>> hpage_nr_pages() for radix tree references to avoid excessive false
>>>>>> positives for huge pages although at this point I don't think they would
>>>>>> matter). Thoughts?
>>>>>
>>>>> Racing PUP are as likely to cause issues:
>>>>>
>>>>> CPU0                        | CPU1       | CPU2
>>>>>                             |            |
>>>>>                             | PUP()      |
>>>>>     page_pinned(page)       |            |
>>>>>       (page_count(page) -   |            |
>>>>>        page_mapcount(page)) |            |
>>>>>                             |            | GUP()
>>>>>
>>>>> So here the refcount snap-shot does not include the second GUP and
>>>>> we can have a false negative ie the page_pinned() will return false
>>>>> because of the PUP happening just before on CPU1 despite the racing
>>>>> GUP on CPU2 just after.
>>>>>
>>>>> I believe only either lock or memory ordering with barrier can
>>>>> guarantee that we do not miss GUP ie no false negative. Still the
>>>>> bias idea might be usefull as with it we should not need a flag.
>>>>
>>>> Right. We need similar synchronization (i.e., page lock or careful checks
>>>> with memory barriers) if we want to get a reliable page pin information.
>>>>
>>>>> So to make the above safe it would still need the page write back
>>>>> double check that i described so that GUP back-off if it raced with
>>>>> page_mkclean,clear_page_dirty_for_io and the fs write page call back
>>>>> which call test_set_page_writeback() (yes it is very unlikely but
>>>>> might still happen).
>>>>
>>>> Agreed. So with page lock it would actually look like:
>>>>
>>>> get_page_pin()
>>>> 	lock_page(page);
>>>> 	wait_for_stable_page();
>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>> 	unlock_page(page);
>>>>
>>>> And if we perform page_pinned() check under page lock, then if
>>>> page_pinned() returned false, we are sure page is not and will not be
>>>> pinned until we drop the page lock (and also until page writeback is
>>>> completed if needed).
>>>>
>>
>> OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
>> compensation steps, is a pretty major selling point. And if we do the above
>> locking, that does look correct to me. I wasn't able to visualize the
>> locking you had in mind, until just now (above), but now it is clear, 
>> thanks for spelling it out.
>>
>>>
>>> So i still can't see anything wrong with that idea, i had similar
>>> one in the past and diss-missed and i can't remember why :( But
>>> thinking over and over i do not see any issue beside refcount wrap
>>> around. Which is something that can happens today thought i don't
>>> think it can be use in an evil way and we can catch it and be
>>> loud about it.
>>>
>>> So i think the following would be bullet proof:
>>>
>>>
>>> get_page_pin()
>>>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>     smp_wmb();
>>>     if (PageWriteback(page)) {
>>>         // back off
>>>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>         // re-enable preempt if in fast
>>>         wait_on_page_writeback(page);
>>>         goto retry;
>>>     }
>>>
>>> put_page_pin();
>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>
>>> page_pinned(page)
>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>
>>> test_set_page_writeback()
>>>     ...
>>>     wb = TestSetPageWriteback(page)
>>
>> Minor point, but using PageWriteback for synchronization may rule out using
>> wait_for_stable_page(), because wait_for_stable_page() might not actually 
>> wait_on_page_writeback. Jan pointed out in the other thread, that we should
>> prefer wait_for_stable_page(). 
> 
> Yes, but wait_for_stable_page() has no page flag so nothing we can
> synchronize against. So my advice would be:
>     if (PageWriteback(page)) {
>         wait_for_stable_page(page);
>         if (PageWriteback(page))
>             wait_for_write_back(page);
>     }
> 
> wait_for_stable_page() can optimize out the wait_for_write_back()
> if it is safe to do so. So we can improve the above slightly too.
> 
>>
>>>     smp_mb();
>>>     if (page_pinned(page)) {
>>>         // report page as pinned to caller of test_set_page_writeback()
>>>     }
>>>     ...
>>>
>>> This is text book memory barrier. Either get_page_pin() see racing
>>> test_set_page_writeback() or test_set_page_writeback() see racing GUP
>>>
>>>
>>
>> This approach is probably workable, but again, it's more complex and comes
>> without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
>> to use it as either "do this after everything is up and running and stable", 
>> or else as Plan B, if there is some performance implication from the page lock.
>>
>> Simple and correct first, then performance optimization, *if* necessary.
> 
> I do not like taking page lock while they are no good reasons to do so.

There actually are very good reasons to do so! These include:

1) Simpler code that is less likely to have subtle bugs in the initial 
   implementations.

2) Pre-existing, known locking constructs that include instrumentation and
   visibility.

3) ...and all of the other goodness that comes from smaller and simpler code.

I'm not saying that those reasons necessarily prevail here, but it's not
fair to say "there are no good reasons". Less code is still worth something,
even in the kernel.

> The above is textbook memory barrier as explain in Documentations/
> Forcing page lock for GUP will inevitably slow down some workload and

Such as?

Here's the thing: if a workload is taking the page lock for some
reason, and also competing with GUP, that's actually something that I worry
about: what is changing in page state, while we're setting up GUP? Either
we audit for that, or we let runtime locking rules (taking the page lock)
keep us out of trouble in the first place.

In other words, if there is a performance hit, it might very likely be
due to a required synchronization that is taking place.

> report for such can takes time to trickle down to mailing list and it
> can takes time for people to actualy figure out that this are the GUP
> changes that introduce such regression.
> 
> So if we could minimize performance regression with something like
> memory barrier we should definitly do that.

We do not yet know that the more complex memory barrier approach is actually
faster. That's worth repeating.

> 
> Also i do not think that page lock has lock dep (as it is not using
> any of the usual locking function) but that's just my memory of that
> code.
> 

Lock page is pretty thoroughly instrumented. It uses wait_on_page_bit_common(),
which in turn uses spin locks and more.

The more I think about this, the more I want actual performance data to 
justify anything involving the more complicated custom locking. So I think
it's best to build the page lock based version, do some benchmarks, and see
where we stand.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-16  0:44                                                                               ` John Hubbard
@ 2019-01-16  0:44                                                                                 ` John Hubbard
  2019-01-16  1:56                                                                                 ` Jerome Glisse
  1 sibling, 0 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-16  0:44 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/15/19 2:12 PM, Jerome Glisse wrote:
> On Tue, Jan 15, 2019 at 01:56:51PM -0800, John Hubbard wrote:
>> On 1/15/19 9:15 AM, Jerome Glisse wrote:
>>> On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
>>>> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
>>>>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
>>>>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
>>>>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
>>>>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>>>>>>>>> option, which we'll certainly need in order to safely convert all the call
>>>>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>>>>>>>>> a separate patchset, as you recommended.
>>>>>>>>>>>
>>>>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>>>>>>>>> then I'd recommend using another page bit to do the same thing.)
>>>>>>>>>>
>>>>>>>>>> Please, the page lock is pointless and it will not work for GUP fast. The
>>>>>>>>>> above scheme does work and is fine. I spent the day again thinking about
>>>>>>>>>> all the memory ordering and I do not see any issues.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>>>>>>
>>>>>>>> Well, it cannot happen within the preempt-disable section. But afterwards,
>>>>>>>> as a post pass before GUP_fast returns and after re-enabling preemption,
>>>>>>>> it is fine, as it would be for regular GUP. But locking the page for GUP
>>>>>>>> is also likely to slow down some workloads (with direct-IO).
>>>>>>>>
>>>>>>>
>>>>>>> Right, and so to the crux of the matter: taking an uncontended page lock
>>>>>>> involves pretty much the same set of operations that your approach does.
>>>>>>> (If gup ends up contended with the page lock for other reasons than these
>>>>>>> paths, that seems surprising.) I'd expect very similar performance.
>>>>>>>
>>>>>>> But the page lock approach leads to really dramatically simpler code (and
>>>>>>> code reviews, let's not forget). Any objection to my going that
>>>>>>> direction, and keeping this idea as a Plan B? I think the next step will
>>>>>>> be, once again, to gather some performance metrics, so maybe that will
>>>>>>> help us decide.
>>>>>>
>>>>>> FWIW I agree that using the page lock for protecting page pinning (and thus
>>>>>> avoiding races with page_mkclean()) looks simpler to me as well, and I'm not
>>>>>> convinced there will be a measurable difference versus the more complex scheme
>>>>>> with barriers Jerome suggests, unless that page lock is contended. Jerome is
>>>>>> right that you cannot just do lock_page() in the gup_fast() path. There you
>>>>>> have to do trylock_page() and if that fails just bail out to the slow gup
>>>>>> path.
>>>>>>
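The trylock-with-fallback rule above can be sketched in plain userspace C. This is a hypothetical model only: a pthread mutex stands in for the page lock, and `gup_fast_pin()`/`gup_slow_pin()`/`pin_page()` are illustrative names, not kernel functions:

```c
#include <pthread.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical stand-in for the page and its lock. */
struct fake_page {
    pthread_mutex_t lock;   /* models the page lock */
    int pin_count;          /* models the pin accounting */
};

/* Slow path: may sleep, so taking the lock outright is fine. */
static void gup_slow_pin(struct fake_page *page)
{
    pthread_mutex_lock(&page->lock);
    page->pin_count++;
    pthread_mutex_unlock(&page->lock);
}

/* Fast path: runs with preemption disabled, so it must not sleep.
 * Try the lock; on contention, report failure so the caller can
 * fall back to the slow path. */
static bool gup_fast_pin(struct fake_page *page)
{
    if (pthread_mutex_trylock(&page->lock) != 0)
        return false;
    page->pin_count++;
    pthread_mutex_unlock(&page->lock);
    return true;
}

static void pin_page(struct fake_page *page)
{
    if (!gup_fast_pin(page))
        gup_slow_pin(page);
}
```

The point of the split is only that the non-sleeping path never blocks on the lock; contention is handed to the path that is allowed to sleep.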
>>>>>> Regarding places other than page_mkclean() that need to check pinned state:
>>>>>> Definitely page migration will want to check whether the page is pinned or
>>>>>> not so that it can deal differently with short-term page references vs
>>>>>> longer-term pins.
>>>>>>
>>>>>> Also there is one more idea I had how to record number of pins in the page:
>>>>>>
>>>>>> #define PAGE_PIN_BIAS	1024
>>>>>>
>>>>>> get_page_pin()
>>>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>>>
>>>>>> put_page_pin();
>>>>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>>>>
>>>>>> page_pinned(page)
>>>>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>>>>
>>>>>> This is a pretty trivial scheme. It still gives us 22 bits for page pins,
>>>>>> which should be plenty (but we should check for that and bail with an error
>>>>>> if it would overflow). Also there will be no false negatives, and false
>>>>>> positives only if there are more than 1024 non-page-table references to the
>>>>>> page, which I expect to be rare (we might want to also subtract
>>>>>> hpage_nr_pages() for radix tree references to avoid excessive false
>>>>>> positives for huge pages, although at this point I don't think they would
>>>>>> matter). Thoughts?
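The bias arithmetic quoted above can be modeled in userspace C (a hypothetical sketch: a C11 `atomic_int` stands in for `page->_refcount`, `mapcount` is a plain field, and `struct fake_page` and these helpers are not kernel APIs):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

#define PAGE_PIN_BIAS 1024

/* Hypothetical stand-in: not the kernel's struct page. */
struct fake_page {
    atomic_int refcount;    /* models page->_refcount */
    int mapcount;           /* models page_mapcount(page) */
};

static void get_page_pin(struct fake_page *page)
{
    atomic_fetch_add(&page->refcount, PAGE_PIN_BIAS);
}

static void put_page_pin(struct fake_page *page)
{
    atomic_fetch_sub(&page->refcount, PAGE_PIN_BIAS);
}

/* No false negatives (absent races); false positives only once more
 * than PAGE_PIN_BIAS non-page-table references exist. */
static bool page_pinned(struct fake_page *page)
{
    return (atomic_load(&page->refcount) - page->mapcount) > PAGE_PIN_BIAS;
}
```

With one pin, refcount minus mapcount lands at roughly PAGE_PIN_BIAS plus the few ordinary references, which is what the `>` comparison detects.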
>>>>>
>>>>> A racing PUP is just as likely to cause issues:
>>>>>
>>>>> CPU0                        | CPU1       | CPU2
>>>>>                             |            |
>>>>>                             | PUP()      |
>>>>>     page_pinned(page)       |            |
>>>>>       (page_count(page) -   |            |
>>>>>        page_mapcount(page)) |            |
>>>>>                             |            | GUP()
>>>>>
>>>>> So here the refcount snapshot does not include the second GUP, and
>>>>> we can have a false negative, i.e. page_pinned() will return false
>>>>> because of the PUP happening just before on CPU1, despite the racing
>>>>> GUP on CPU2 just after.
>>>>>
>>>>> I believe only either a lock or memory ordering with barriers can
>>>>> guarantee that we do not miss a GUP, i.e. no false negatives. Still, the
>>>>> bias idea might be useful, as with it we should not need a flag.
>>>>
>>>> Right. We need similar synchronization (i.e., page lock or careful checks
>>>> with memory barriers) if we want to get reliable page pin information.
>>>>
>>>>> So to make the above safe it would still need the page writeback
>>>>> double check that I described, so that GUP backs off if it raced with
>>>>> page_mkclean(), clear_page_dirty_for_io() and the fs writepage callback,
>>>>> which calls test_set_page_writeback() (yes, it is very unlikely but
>>>>> might still happen).
>>>>
>>>> Agreed. So with page lock it would actually look like:
>>>>
>>>> get_page_pin()
>>>> 	lock_page(page);
>>>> 	wait_for_stable_page();
>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>> 	unlock_page(page);
>>>>
>>>> And if we perform page_pinned() check under page lock, then if
>>>> page_pinned() returned false, we are sure page is not and will not be
>>>> pinned until we drop the page lock (and also until page writeback is
>>>> completed if needed).
>>>>
>>
>> OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
>> compensation steps, is a pretty major selling point. And if we do the above
>> locking, that does look correct to me. I wasn't able to visualize the
>> locking you had in mind, until just now (above), but now it is clear, 
>> thanks for spelling it out.
>>
>>>
>>> So I still can't see anything wrong with that idea; I had a similar
>>> one in the past and dismissed it, and I can't remember why :( But
>>> thinking it over and over, I do not see any issue besides refcount
>>> wrap-around. That is something that can happen today, though I don't
>>> think it can be used in an evil way, and we can catch it and be
>>> loud about it.
>>>
>>> So I think the following would be bulletproof:
>>>
>>>
>>> get_page_pin()
>>>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>     smp_wmb();
>>>     if (PageWriteback(page)) {
>>>         // back off
>>>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>         // re-enable preempt if in fast
>>>         wait_on_page_writeback(page);
>>>         goto retry;
>>>     }
>>>
>>> put_page_pin();
>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>
>>> page_pinned(page)
>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>
>>> test_set_page_writeback()
>>>     ...
>>>     wb = TestSetPageWriteback(page)
>>
>> Minor point, but using PageWriteback for synchronization may rule out using
>> wait_for_stable_page(), because wait_for_stable_page() might not actually
>> call wait_on_page_writeback(). Jan pointed out in the other thread that we
>> should prefer wait_for_stable_page().
> 
> Yes, but wait_for_stable_page() is not tied to any page flag, so there is
> nothing we can synchronize against. So my advice would be:
>     if (PageWriteback(page)) {
>         wait_for_stable_page(page);
>         if (PageWriteback(page))
>             wait_on_page_writeback(page);
>     }
> 
> wait_for_stable_page() can optimize out the wait_on_page_writeback()
> if it is safe to do so, so we can improve the above slightly too.
> 
>>
>>>     smp_mb();
>>>     if (page_pinned(page)) {
>>>         // report page as pinned to caller of test_set_page_writeback()
>>>     }
>>>     ...
>>>
>>> This is a textbook memory barrier. Either get_page_pin() sees the racing
>>> test_set_page_writeback(), or test_set_page_writeback() sees the racing GUP.
>>>
>>>
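The barrier pairing above can be modeled with C11 atomics (hypothetical sketch: `atomic_thread_fence()` plays the role of smp_wmb()/smp_mb(), an `atomic_bool` models the writeback flag, and the names are illustrative; a single-threaded test can exercise the back-off logic, though not the ordering itself):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

#define PAGE_PIN_BIAS 1024

/* Hypothetical stand-in: not the kernel's struct page. */
struct fake_page {
    atomic_int refcount;     /* models page->_refcount */
    atomic_bool writeback;   /* models the PG_writeback flag */
    int mapcount;            /* models page_mapcount(page) */
};

static bool page_pinned(struct fake_page *page)
{
    return (atomic_load(&page->refcount) - page->mapcount) > PAGE_PIN_BIAS;
}

/* GUP side: publish the pin first, then check for writeback.
 * Returns false when it raced with writeback and backed off; the
 * real code would wait for writeback to finish and retry. */
static bool try_get_page_pin(struct fake_page *page)
{
    atomic_fetch_add(&page->refcount, PAGE_PIN_BIAS);
    atomic_thread_fence(memory_order_seq_cst);   /* plays the smp_wmb() role */
    if (atomic_load(&page->writeback)) {
        atomic_fetch_sub(&page->refcount, PAGE_PIN_BIAS);
        return false;
    }
    return true;
}

/* Writeback side: set the flag first, then look for pins.  The
 * fence pairing means at least one side sees the other's store. */
static bool set_writeback_and_check_pinned(struct fake_page *page)
{
    atomic_store(&page->writeback, true);        /* TestSetPageWriteback() */
    atomic_thread_fence(memory_order_seq_cst);   /* plays the smp_mb() role */
    return page_pinned(page);
}
```

Each side writes before it reads the other side's state, so the race cannot be lost by both: either the pin is visible to the writeback path, or the writeback flag is visible to GUP, which then backs off.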
>>
>> This approach is probably workable, but again, it's more complex and comes
>> without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
>> to use it as either "do this after everything is up and running and stable", 
>> or else as Plan B, if there is some performance implication from the page lock.
>>
>> Simple and correct first, then performance optimization, *if* necessary.
> 
> I do not like taking the page lock when there is no good reason to do so.

There actually are very good reasons to do so! These include:

1) Simpler code that is less likely to have subtle bugs in the initial 
   implementations.

2) Pre-existing, known locking constructs that include instrumentation and
   visibility.

3) ...and all of the other goodness that comes from smaller and simpler code.

I'm not saying that those reasons necessarily prevail here, but it's not
fair to say "there are no good reasons". Less code is still worth something,
even in the kernel.

> The above is a textbook memory barrier, as explained in Documentation/.
> Forcing the page lock for GUP will inevitably slow down some workloads, and

Such as?

Here's the thing: if a workload is taking the page lock for some
reason, and also competing with GUP, that's actually something that I worry
about: what is changing in page state, while we're setting up GUP? Either
we audit for that, or we let runtime locking rules (taking the page lock)
keep us out of trouble in the first place.

In other words, if there is a performance hit, it might very likely be
due to a required synchronization that is taking place.

> reports of that can take time to trickle down to the mailing list, and it
> can take time for people to actually figure out that it is the GUP
> changes that introduced the regression.
> 
> So if we can minimize the performance regression with something like
> a memory barrier, we should definitely do that.

We do not yet know that the more complex memory barrier approach is actually
faster. That's worth repeating.

> 
> Also I do not think that the page lock has lockdep coverage (as it is not
> using any of the usual locking functions), but that's just my memory of
> that code.
> 

The page lock is pretty thoroughly instrumented. It uses wait_on_page_bit_common(),
which in turn uses spinlocks and more.

The more I think about this, the more I want actual performance data to 
justify anything involving the more complicated custom locking. So I think
it's best to build the page lock based version, do some benchmarks, and see
where we stand.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15 21:56                                                                           ` John Hubbard
  2019-01-15 21:56                                                                             ` John Hubbard
@ 2019-01-15 22:12                                                                             ` Jerome Glisse
  2019-01-15 22:12                                                                               ` Jerome Glisse
  2019-01-16  0:44                                                                               ` John Hubbard
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-15 22:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 01:56:51PM -0800, John Hubbard wrote:
> On 1/15/19 9:15 AM, Jerome Glisse wrote:
> > On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
> >> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> >>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> >>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
> >>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> >>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> >>>>>> [...]
> >>>>>>
> >>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> >>>>>>>>> option, which we'll certainly need in order to safely convert all the call
> >>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
> >>>>>>>>> a separate patchset, as you recommended.
> >>>>>>>>>
> >>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
> >>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
> >>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >>>>>>>>> then I'd recommend using another page bit to do the same thing.)
> >>>>>>>>
> >>>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
> >>>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
> >>>>>>>> ordering and i do not see any issues.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Why is it that page lock cannot be used for gup fast, btw?
> >>>>>>
> >>>>>> Well it can not happen within the preempt disable section. But after
> >>>>>> as a post pass before GUP_fast return and after reenabling preempt then
> >>>>>> it is fine like it would be for regular GUP. But locking page for GUP
> >>>>>> is also likely to slow down some workload (with direct-IO).
> >>>>>>
> >>>>>
> >>>>> Right, and so to crux of the matter: taking an uncontended page lock
> >>>>> involves pretty much the same set of operations that your approach does.
> >>>>> (If gup ends up contended with the page lock for other reasons than these
> >>>>> paths, that seems surprising.) I'd expect very similar performance.
> >>>>>
> >>>>> But the page lock approach leads to really dramatically simpler code (and
> >>>>> code reviews, let's not forget). Any objection to my going that
> >>>>> direction, and keeping this idea as a Plan B? I think the next step will
> >>>>> be, once again, to gather some performance metrics, so maybe that will
> >>>>> help us decide.
> >>>>
> >>>> FWIW I agree that using page lock for protecting page pinning (and thus
> >>>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
> >>>> convinced there will be measurable difference to the more complex scheme
> >>>> with barriers Jerome suggests unless that page lock contended. Jerome is
> >>>> right that you cannot just do lock_page() in gup_fast() path. There you
> >>>> have to do trylock_page() and if that fails just bail out to the slow gup
> >>>> path.
> >>>>
> >>>> Regarding places other than page_mkclean() that need to check pinned state:
> >>>> Definitely page migration will want to check whether the page is pinned or
> >>>> not so that it can deal differently with short-term page references vs
> >>>> longer-term pins.
> >>>>
> >>>> Also there is one more idea I had how to record number of pins in the page:
> >>>>
> >>>> #define PAGE_PIN_BIAS	1024
> >>>>
> >>>> get_page_pin()
> >>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>>>
> >>>> put_page_pin();
> >>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>>>
> >>>> page_pinned(page)
> >>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>>
> >>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
> >>>> which should be plenty (but we should check for that and bail with error if
> >>>> it would overflow). Also there will be no false negatives and false
> >>>> positives only if there are more than 1024 non-page-table references to the
> >>>> page which I expect to be rare (we might want to also subtract
> >>>> hpage_nr_pages() for radix tree references to avoid excessive false
> >>>> positives for huge pages although at this point I don't think they would
> >>>> matter). Thoughts?
> >>>
> >>> Racing PUP are as likely to cause issues:
> >>>
> >>> CPU0                        | CPU1       | CPU2
> >>>                             |            |
> >>>                             | PUP()      |
> >>>     page_pinned(page)       |            |
> >>>       (page_count(page) -   |            |
> >>>        page_mapcount(page)) |            |
> >>>                             |            | GUP()
> >>>
> >>> So here the refcount snap-shot does not include the second GUP and
> >>> we can have a false negative ie the page_pinned() will return false
> >>> because of the PUP happening just before on CPU1 despite the racing
> >>> GUP on CPU2 just after.
> >>>
> >>> I believe only either lock or memory ordering with barrier can
> >>> guarantee that we do not miss GUP ie no false negative. Still the
> >>> bias idea might be usefull as with it we should not need a flag.
> >>
> >> Right. We need similar synchronization (i.e., page lock or careful checks
> >> with memory barriers) if we want to get a reliable page pin information.
> >>
> >>> So to make the above safe it would still need the page write back
> >>> double check that i described so that GUP back-off if it raced with
> >>> page_mkclean,clear_page_dirty_for_io and the fs write page call back
> >>> which call test_set_page_writeback() (yes it is very unlikely but
> >>> might still happen).
> >>
> >> Agreed. So with page lock it would actually look like:
> >>
> >> get_page_pin()
> >> 	lock_page(page);
> >> 	wait_for_stable_page();
> >> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >> 	unlock_page(page);
> >>
> >> And if we perform page_pinned() check under page lock, then if
> >> page_pinned() returned false, we are sure page is not and will not be
> >> pinned until we drop the page lock (and also until page writeback is
> >> completed if needed).
> >>
> 
> OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
> compensation steps, is a pretty major selling point. And if we do the above
> locking, that does look correct to me. I wasn't able to visualize the
> locking you had in mind, until just now (above), but now it is clear, 
> thanks for spelling it out.
> 
> > 
> > So i still can't see anything wrong with that idea, i had similar
> > one in the past and diss-missed and i can't remember why :( But
> > thinking over and over i do not see any issue beside refcount wrap
> > around. Which is something that can happens today thought i don't
> > think it can be use in an evil way and we can catch it and be
> > loud about it.
> > 
> > So i think the following would be bullet proof:
> > 
> > 
> > get_page_pin()
> >     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >     smp_wmb();
> >     if (PageWriteback(page)) {
> >         // back off
> >         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >         // re-enable preempt if in fast
> >         wait_on_page_writeback(page);
> >         goto retry;
> >     }
> > 
> > put_page_pin();
> > 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> > 
> > page_pinned(page)
> > 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> > 
> > test_set_page_writeback()
> >     ...
> >     wb = TestSetPageWriteback(page)
> 
> Minor point, but using PageWriteback for synchronization may rule out using
> wait_for_stable_page(), because wait_for_stable_page() might not actually 
> wait_on_page_writeback. Jan pointed out in the other thread, that we should
> prefer wait_for_stable_page(). 

Yes, but wait_for_stable_page() is not tied to any page flag, so there is
nothing we can synchronize against. So my advice would be:
    if (PageWriteback(page)) {
        wait_for_stable_page(page);
        if (PageWriteback(page))
            wait_on_page_writeback(page);
    }

wait_for_stable_page() can optimize out the wait_on_page_writeback()
if it is safe to do so, so we can improve the above slightly too.

> 
> >     smp_mb();
> >     if (page_pinned(page)) {
> >         // report page as pinned to caller of test_set_page_writeback()
> >     }
> >     ...
> > 
> > This is text book memory barrier. Either get_page_pin() see racing
> > test_set_page_writeback() or test_set_page_writeback() see racing GUP
> > 
> > 
> 
> This approach is probably workable, but again, it's more complex and comes
> without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
> to use it as either "do this after everything is up and running and stable", 
> or else as Plan B, if there is some performance implication from the page lock.
> 
> Simple and correct first, then performance optimization, *if* necessary.

I do not like taking the page lock when there is no good reason to do so.
The above is a textbook memory barrier, as explained in Documentation/.
Forcing the page lock for GUP will inevitably slow down some workloads,
reports of that can take time to trickle down to the mailing list, and it
can take time for people to actually figure out that it is the GUP
changes that introduced the regression.

So if we can minimize the performance regression with something like
a memory barrier, we should definitely do that.

Also I do not think that the page lock has lockdep coverage (as it is not
using any of the usual locking functions), but that's just my memory of
that code.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15 17:15                                                                         ` Jerome Glisse
  2019-01-15 17:15                                                                           ` Jerome Glisse
@ 2019-01-15 21:56                                                                           ` John Hubbard
  2019-01-15 21:56                                                                             ` John Hubbard
  2019-01-15 22:12                                                                             ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-15 21:56 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/15/19 9:15 AM, Jerome Glisse wrote:
> On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
>> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
>>> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
>>>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
>>>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
>>>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>>>>>> [...]
>>>>>>
>>>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>>>>>>> option, which we'll certainly need in order to safely convert all the call
>>>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>>>>>>> a separate patchset, as you recommended.
>>>>>>>>>
>>>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>>>>>>> then I'd recommend using another page bit to do the same thing.)
>>>>>>>>
>>>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
>>>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
>>>>>>>> ordering and i do not see any issues.
>>>>>>>>
>>>>>>>
>>>>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>>>>
>>>>>> Well it can not happen within the preempt disable section. But after
>>>>>> as a post pass before GUP_fast return and after reenabling preempt then
>>>>>> it is fine like it would be for regular GUP. But locking page for GUP
>>>>>> is also likely to slow down some workload (with direct-IO).
>>>>>>
>>>>>
>>>>> Right, and so to crux of the matter: taking an uncontended page lock
>>>>> involves pretty much the same set of operations that your approach does.
>>>>> (If gup ends up contended with the page lock for other reasons than these
>>>>> paths, that seems surprising.) I'd expect very similar performance.
>>>>>
>>>>> But the page lock approach leads to really dramatically simpler code (and
>>>>> code reviews, let's not forget). Any objection to my going that
>>>>> direction, and keeping this idea as a Plan B? I think the next step will
>>>>> be, once again, to gather some performance metrics, so maybe that will
>>>>> help us decide.
>>>>
>>>> FWIW I agree that using page lock for protecting page pinning (and thus
>>>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
>>>> convinced there will be measurable difference to the more complex scheme
>>>> with barriers Jerome suggests unless that page lock contended. Jerome is
>>>> right that you cannot just do lock_page() in gup_fast() path. There you
>>>> have to do trylock_page() and if that fails just bail out to the slow gup
>>>> path.
>>>>
>>>> Regarding places other than page_mkclean() that need to check pinned state:
>>>> Definitely page migration will want to check whether the page is pinned or
>>>> not so that it can deal differently with short-term page references vs
>>>> longer-term pins.
>>>>
>>>> Also there is one more idea I had how to record number of pins in the page:
>>>>
>>>> #define PAGE_PIN_BIAS	1024
>>>>
>>>> get_page_pin()
>>>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>>>
>>>> put_page_pin();
>>>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>>>
>>>> page_pinned(page)
>>>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>>>
>>>> This is pretty trivial scheme. It still gives us 22-bits for page pins
>>>> which should be plenty (but we should check for that and bail with error if
>>>> it would overflow). Also there will be no false negatives and false
>>>> positives only if there are more than 1024 non-page-table references to the
>>>> page which I expect to be rare (we might want to also subtract
>>>> hpage_nr_pages() for radix tree references to avoid excessive false
>>>> positives for huge pages although at this point I don't think they would
>>>> matter). Thoughts?
>>>
>>> Racing PUP are as likely to cause issues:
>>>
>>> CPU0                        | CPU1       | CPU2
>>>                             |            |
>>>                             | PUP()      |
>>>     page_pinned(page)       |            |
>>>       (page_count(page) -   |            |
>>>        page_mapcount(page)) |            |
>>>                             |            | GUP()
>>>
>>> So here the refcount snap-shot does not include the second GUP and
>>> we can have a false negative ie the page_pinned() will return false
>>> because of the PUP happening just before on CPU1 despite the racing
>>> GUP on CPU2 just after.
>>>
>>> I believe only either lock or memory ordering with barrier can
>>> guarantee that we do not miss GUP ie no false negative. Still the
>>> bias idea might be usefull as with it we should not need a flag.
>>
>> Right. We need similar synchronization (i.e., page lock or careful checks
>> with memory barriers) if we want to get a reliable page pin information.
>>
>>> So to make the above safe it would still need the page write back
>>> double check that i described so that GUP back-off if it raced with
>>> page_mkclean,clear_page_dirty_for_io and the fs write page call back
>>> which call test_set_page_writeback() (yes it is very unlikely but
>>> might still happen).
>>
>> Agreed. So with page lock it would actually look like:
>>
>> get_page_pin()
>> 	lock_page(page);
>> 	wait_for_stable_page();
>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>> 	unlock_page(page);
>>
>> And if we perform page_pinned() check under page lock, then if
>> page_pinned() returned false, we are sure page is not and will not be
>> pinned until we drop the page lock (and also until page writeback is
>> completed if needed).
>>

OK. Avoiding a new page flag, *and* avoiding the _mapcount auditing and
compensation steps, is a pretty major selling point. And if we do the above
locking, that does look correct to me. I wasn't able to visualize the
locking you had in mind, until just now (above), but now it is clear, 
thanks for spelling it out.
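
To convince myself about the arithmetic, here is a userspace model of
the bias-count scheme quoted above. This is only a sketch built on C11
atomics with a stand-in struct page, not kernel code, and it assumes
the usual invariant that _refcount exceeds _mapcount (by at least the
page cache reference) for a mapped pagecache page:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the PAGE_PIN_BIAS scheme; "struct page" here is a
 * stand-in, not the kernel structure. */
#define PAGE_PIN_BIAS 1024

struct page {
	atomic_int _refcount;   /* models page->_refcount */
	atomic_int _mapcount;   /* models page_mapcount(page) */
};

static void get_page_pin(struct page *page)
{
	atomic_fetch_add(&page->_refcount, PAGE_PIN_BIAS);
}

static void put_page_pin(struct page *page)
{
	atomic_fetch_sub(&page->_refcount, PAGE_PIN_BIAS);
}

/* True if at least one pin is outstanding.  Extra non-page-table
 * references can only push this toward a false positive (and only if
 * there are more than ~1024 of them); there are no false negatives as
 * long as _refcount > _mapcount holds for an unpinned page. */
static bool page_pinned(struct page *page)
{
	return (atomic_load(&page->_refcount) -
		atomic_load(&page->_mapcount)) > PAGE_PIN_BIAS;
}
```

(Whether the snapshot read in page_pinned() is safe against racing
pins is exactly the synchronization question discussed above.)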

> 
> So I still can't see anything wrong with that idea. I had a similar
> one in the past and dismissed it, and I can't remember why :( But
> thinking it over and over, I do not see any issue besides refcount
> wrap-around. That is something that can already happen today, though I
> don't think it can be used in an evil way, and we can catch it and be
> loud about it.
> 
> So I think the following would be bulletproof:
> 
> 
> get_page_pin()
>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>     smp_wmb();
>     if (PageWriteback(page)) {
>         // back off
>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>         // re-enable preempt if in fast
>         wait_on_page_writeback(page);
>         goto retry;
>     }
> 
> put_page_pin();
> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> 
> page_pinned(page)
> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> 
> test_set_page_writeback()
>     ...
>     wb = TestSetPageWriteback(page)

Minor point, but using PageWriteback for synchronization may rule out
using wait_for_stable_page(), because wait_for_stable_page() might not
actually call wait_on_page_writeback(). Jan pointed out in the other
thread that we should prefer wait_for_stable_page().


>     smp_mb();
>     if (page_pinned(page)) {
>         // report page as pinned to caller of test_set_page_writeback()
>     }
>     ...
> 
> This is text book memory barrier. Either get_page_pin() see racing
> test_set_page_writeback() or test_set_page_writeback() see racing GUP
> 
> 

This approach is probably workable, but again, it's more complex and comes
without any lockdep support. Maybe it's faster, maybe not. Therefore, I want 
to use it as either "do this after everything is up and running and stable", 
or else as Plan B, if there is some performance implication from the page lock.

Simple and correct first, then performance optimization, *if* necessary.
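
For comparison, the barrier-based scheme above can be modeled in
userspace roughly as follows. This is a sketch, not kernel code:
seq_cst C11 atomics stand in for smp_wmb()/smp_mb(), an atomic_bool
stands in for PageWriteback(), and the model backs off by returning
false instead of waiting on writeback and retrying:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024

struct page {
	atomic_int  _refcount;
	atomic_int  _mapcount;
	atomic_bool writeback;   /* models PageWriteback() */
};

static bool page_pinned(struct page *page)
{
	return (atomic_load(&page->_refcount) -
		atomic_load(&page->_mapcount)) > PAGE_PIN_BIAS;
}

/* Try to take a pin; if the page is (or just went) under writeback,
 * drop the bias and back off (the kernel sketch would instead
 * wait_on_page_writeback() and retry). */
static bool try_get_page_pin(struct page *page)
{
	atomic_fetch_add(&page->_refcount, PAGE_PIN_BIAS);
	if (atomic_load(&page->writeback)) {
		atomic_fetch_sub(&page->_refcount, PAGE_PIN_BIAS);
		return false;
	}
	return true;
}

/* Models test_set_page_writeback(): set the flag, then check for pins.
 * The claimed barrier property is that a racing pin either added its
 * bias before this check (so we see it here) or will observe the
 * writeback flag and back off. */
static bool set_writeback_and_check_pinned(struct page *page)
{
	atomic_store(&page->writeback, true);
	return page_pinned(page);
}
```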


> An optimization for GUP:
> get_page_pin()
>     pwp = PageWriteback(page);
>     smp_rmb();
>     waspinned = page_pinned(page);
>     if (!waspinned && pwp) {
>         // backoff
>     }
> 
>     atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>     smp_wmb();
>     if (PageWriteback(page)) {
>         // back off
>         atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>         // re-enable preempt if in fast
>         wait_on_page_writeback(page);
>         goto retry;
>     }
> 
> If the page was not pinned prior to this GUP then we can back off
> early.
> 
> 
> Anyway I think this is better than mapcount. I started an analysis of
> all the places that look at mapcount; a few of them would have needed
> an update if we were to increment mapcount with GUP.
> 
> I will go take a look at THP and hugetlbfs with respect to this, just
> to check for ways to mitigate false positives.
> 

Awesome. I still have a hard time with the details of THP and hugetlbfs,
so it's good to have someone who understands it, taking a closer look.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15  8:34                                                                         ` Jan Kara
  2019-01-15  8:34                                                                           ` Jan Kara
@ 2019-01-15 21:39                                                                           ` John Hubbard
  2019-01-15 21:39                                                                             ` John Hubbard
  1 sibling, 1 reply; 207+ messages in thread
From: John Hubbard @ 2019-01-15 21:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jerome Glisse, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/15/19 12:34 AM, Jan Kara wrote:
> On Mon 14-01-19 11:09:20, John Hubbard wrote:
>> On 1/14/19 9:21 AM, Jerome Glisse wrote:
>>>>
[...]
> 
>> For example, the following already survives a basic boot to graphics mode.
>> It requires a bunch of callsite conversions, and a page flag (neither of which
>> is shown here), and may also have "a few" gross conceptual errors, but take a 
>> peek:
> 
> Thanks for writing this down! Some comments inline.
> 

I appreciate your taking a look at this, Jan. I'm still pretty new to gup.c, 
so it's really good to get an early review.


>> +/*
>> + * Manages the PG_gup_pinned flag.
>> + *
>> + * Note that page->_mapcount counting part of managing that flag, because the
>> + * _mapcount is used to determine if PG_gup_pinned can be cleared, in
>> + * page_mkclean().
>> + */
>> +static void track_gup_page(struct page *page)
>> +{
>> +	page = compound_head(page);
>> +
>> +	lock_page(page);
>> +
>> +	wait_on_page_writeback(page);
> 
> ^^ I'd use wait_for_stable_page() here. That is the standard waiting
> mechanism to use before you allow page modification.

OK, will do. In fact, I initially wanted to use wait_for_stable_page(), but
hesitated when I saw that it won't necessarily do wait_on_page_writeback(),
and then I also remembered that Dave Chinner recently mentioned that the
policy decision needed some thought in the future (maybe something about
block device vs. filesystem policy):

void wait_for_stable_page(struct page *page)
{
	if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
		wait_on_page_writeback(page);
}

...but like you say, it's the standard way that fs does this, so we should
just use it.

> 
>> +
>> +	atomic_inc(&page->_mapcount);
>> +	SetPageGupPinned(page);
>> +
>> +	unlock_page(page);
>> +}
>> +
>> +/*
>> + * A variant of track_gup_page() that returns -EBUSY, instead of waiting.
>> + */
>> +static int track_gup_page_atomic(struct page *page)
>> +{
>> +	page = compound_head(page);
>> +
>> +	if (PageWriteback(page) || !trylock_page(page))
>> +		return -EBUSY;
>> +
>> +	if (PageWriteback(page)) {
>> +		unlock_page(page);
>> +		return -EBUSY;
>> +	}
> 
> Here you'd need some helper that would return whether
> wait_for_stable_page() is going to wait. Like would_wait_for_stable_page()
> but maybe you can come up with a better name.

Yes, in order to use wait_for_stable_page() there, such a helper seems
necessary; I agree.
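
A sketch of what such a helper might look like, as a userspace model;
the name would_wait_for_stable_page() is hypothetical (Jan's
suggestion, no such kernel helper exists yet), and the two bool fields
stand in for bdi_cap_stable_pages_required() and PageWriteback():

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's struct page plus its backing device info. */
struct page {
	bool stable_pages_required;  /* backing device needs stable pages */
	bool writeback;              /* page currently under writeback */
};

/* Mirrors wait_for_stable_page(): it would block only when the backing
 * device requires stable pages AND the page is under writeback, so a
 * gup_fast()-style caller could bail out (-EBUSY) exactly when this
 * returns true. */
static bool would_wait_for_stable_page(const struct page *page)
{
	return page->stable_pages_required && page->writeback;
}
```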


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15  8:07                                                                       ` Jan Kara
  2019-01-15  8:07                                                                         ` Jan Kara
@ 2019-01-15 17:15                                                                         ` Jerome Glisse
  2019-01-15 17:15                                                                           ` Jerome Glisse
  2019-01-15 21:56                                                                           ` John Hubbard
  2019-01-16 11:38                                                                         ` Jan Kara
  2019-01-17  5:25                                                                         ` John Hubbard
  3 siblings, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-15 17:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jan 15, 2019 at 09:07:59AM +0100, Jan Kara wrote:
> On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> > On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> > > On Fri 11-01-19 19:06:08, John Hubbard wrote:
> > > > On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > > > > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> > > > > [...]
> > > > > 
> > > > >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> > > > >>>> option, which we'll certainly need in order to safely convert all the call
> > > > >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> > > > >>>> and put_user_page() can verify that the right call was made.)  That will be
> > > > >>>> a separate patchset, as you recommended.
> > > > >>>>
> > > > >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> > > > >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> > > > >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> > > > >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> > > > >>>> then I'd recommend using another page bit to do the same thing.)
> > > > >>>
> > > > >>> Please page lock is pointless and it will not work for GUP fast. The above
> > > > >>> scheme do work and is fine. I spend the day again thinking about all memory
> > > > >>> ordering and i do not see any issues.
> > > > >>>
> > > > >>
> > > > >> Why is it that page lock cannot be used for gup fast, btw?
> > > > > 
> > > > > Well it can not happen within the preempt disable section. But after
> > > > > as a post pass before GUP_fast return and after reenabling preempt then
> > > > > it is fine like it would be for regular GUP. But locking page for GUP
> > > > > is also likely to slow down some workload (with direct-IO).
> > > > > 
> > > > 
> > > > Right, and so to crux of the matter: taking an uncontended page lock
> > > > involves pretty much the same set of operations that your approach does.
> > > > (If gup ends up contended with the page lock for other reasons than these
> > > > paths, that seems surprising.) I'd expect very similar performance.
> > > > 
> > > > But the page lock approach leads to really dramatically simpler code (and
> > > > code reviews, let's not forget). Any objection to my going that
> > > > direction, and keeping this idea as a Plan B? I think the next step will
> > > > be, once again, to gather some performance metrics, so maybe that will
> > > > help us decide.
> > > 
> > > FWIW I agree that using page lock for protecting page pinning (and thus
> > > avoid races with page_mkclean()) looks simpler to me as well and I'm not
> > > convinced there will be measurable difference to the more complex scheme
> > > with barriers Jerome suggests unless that page lock contended. Jerome is
> > > right that you cannot just do lock_page() in gup_fast() path. There you
> > > have to do trylock_page() and if that fails just bail out to the slow gup
> > > path.
> > > 
> > > Regarding places other than page_mkclean() that need to check pinned state:
> > > Definitely page migration will want to check whether the page is pinned or
> > > not so that it can deal differently with short-term page references vs
> > > longer-term pins.
> > > 
> > > Also there is one more idea I had how to record number of pins in the page:
> > > 
> > > #define PAGE_PIN_BIAS	1024
> > > 
> > > get_page_pin()
> > > 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> > > 
> > > put_page_pin();
> > > 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> > > 
> > > page_pinned(page)
> > > 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> > > 
> > > This is pretty trivial scheme. It still gives us 22-bits for page pins
> > > which should be plenty (but we should check for that and bail with error if
> > > it would overflow). Also there will be no false negatives and false
> > > positives only if there are more than 1024 non-page-table references to the
> > > page which I expect to be rare (we might want to also subtract
> > > hpage_nr_pages() for radix tree references to avoid excessive false
> > > positives for huge pages although at this point I don't think they would
> > > matter). Thoughts?
> > 
> > Racing PUP are as likely to cause issues:
> > 
> > CPU0                        | CPU1       | CPU2
> >                             |            |
> >                             | PUP()      |
> >     page_pinned(page)       |            |
> >       (page_count(page) -   |            |
> >        page_mapcount(page)) |            |
> >                             |            | GUP()
> > 
> > So here the refcount snap-shot does not include the second GUP and
> > we can have a false negative ie the page_pinned() will return false
> > because of the PUP happening just before on CPU1 despite the racing
> > GUP on CPU2 just after.
> > 
> > I believe only either lock or memory ordering with barrier can
> > guarantee that we do not miss GUP ie no false negative. Still the
> > bias idea might be usefull as with it we should not need a flag.
> 
> Right. We need similar synchronization (i.e., page lock or careful checks
> with memory barriers) if we want to get a reliable page pin information.
> 
> > So to make the above safe it would still need the page write back
> > double check that i described so that GUP back-off if it raced with
> > page_mkclean,clear_page_dirty_for_io and the fs write page call back
> > which call test_set_page_writeback() (yes it is very unlikely but
> > might still happen).
> 
> Agreed. So with page lock it would actually look like:
> 
> get_page_pin()
> 	lock_page(page);
> 	wait_for_stable_page();
> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> 	unlock_page(page);
> 
> And if we perform page_pinned() check under page lock, then if
> page_pinned() returned false, we are sure page is not and will not be
> pinned until we drop the page lock (and also until page writeback is
> completed if needed).
> 

So I still can't see anything wrong with that idea; I had a similar
one in the past and dismissed it, but I can't remember why :( Thinking
it over and over, I do not see any issue besides refcount wrap-around,
which is something that can already happen today, though I don't think
it can be used in an evil way, and we can catch it and be loud about
it.

So I think the following would be bulletproof:


get_page_pin()
    atomic_add(PAGE_PIN_BIAS, &page->_refcount);
    smp_wmb();
    if (PageWriteback(page)) {
        // back off
        atomic_add(-PAGE_PIN_BIAS, &page->_refcount);
        // re-enable preempt if in fast
        wait_on_page_writeback(page);
        goto retry;
    }

put_page_pin();
	atomic_add(-PAGE_PIN_BIAS, &page->_refcount);

page_pinned(page)
	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS

test_set_page_writeback()
    ...
    wb = TestSetPageWriteback(page);
    smp_mb();
    if (page_pinned(page)) {
        // report page as pinned to caller of test_set_page_writeback()
    }
    ...

This is a textbook memory barrier pattern: either get_page_pin() sees the
racing test_set_page_writeback(), or test_set_page_writeback() sees the
racing GUP.


An optimization for GUP:
get_page_pin()
    pwp = PageWriteback(page);
    smp_rmb();
    waspinned = page_pinned(page);
    if (!waspinned && pwp) {
        // backoff
    }

    atomic_add(PAGE_PIN_BIAS, &page->_refcount);
    smp_wmb();
    if (PageWriteback(page)) {
        // back off
        atomic_add(-PAGE_PIN_BIAS, &page->_refcount);
        // re-enable preempt if in fast
        wait_on_page_writeback(page);
        goto retry;
    }

If the page was not pinned prior to this GUP, then we can back off early.
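As a rough userspace model of the biased-refcount scheme above (a sketch, not kernel code: struct page, the writeback flag, and the barriers are all emulated with C11 atomics here, and a real caller would wait and retry instead of just failing on back-off):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024

/* Userspace stand-in for the few struct page fields the scheme touches. */
struct page {
	atomic_int _refcount;	/* page->_refcount */
	atomic_int _mapcount;	/* what page_mapcount() reads */
	atomic_bool writeback;	/* PG_writeback */
};

static bool page_pinned(struct page *page)
{
	return (atomic_load(&page->_refcount) -
		atomic_load(&page->_mapcount)) > PAGE_PIN_BIAS;
}

/* Returns false when it had to back off because writeback was running. */
static bool get_page_pin(struct page *page)
{
	atomic_fetch_add(&page->_refcount, PAGE_PIN_BIAS);
	/* stands in for the smp_wmb()/smp_mb() pairing discussed above */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&page->writeback)) {
		/* back off; the kernel version would wait and retry */
		atomic_fetch_sub(&page->_refcount, PAGE_PIN_BIAS);
		return false;
	}
	return true;
}

static void put_page_pin(struct page *page)
{
	atomic_fetch_sub(&page->_refcount, PAGE_PIN_BIAS);
}
```

With one base reference (refcount 1) the page reads as unpinned; a pin pushes the refcount past the bias so page_pinned() flips to true, and a pin attempt racing an already-set writeback flag takes the back-off path and leaves the count unchanged.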


Anyway, I think this is better than mapcount. I started an analysis of
all the places that look at mapcount; a few of them would need an
update if we were to increment mapcount with GUP.

I will go take a look at THP and hugetlbfs with respect to this, just
to check for ways to mitigate false positives.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-14 19:09                                                                       ` John Hubbard
  2019-01-14 19:09                                                                         ` John Hubbard
@ 2019-01-15  8:34                                                                         ` Jan Kara
  2019-01-15  8:34                                                                           ` Jan Kara
  2019-01-15 21:39                                                                           ` John Hubbard
  1 sibling, 2 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-15  8:34 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Jan Kara, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 14-01-19 11:09:20, John Hubbard wrote:
> On 1/14/19 9:21 AM, Jerome Glisse wrote:
> >>
> >> Also there is one more idea I had how to record number of pins in the page:
> >>
> >> #define PAGE_PIN_BIAS	1024
> >>
> >> get_page_pin()
> >> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> >>
> >> put_page_pin();
> >> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> >>
> >> page_pinned(page)
> >> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>
> >> This is pretty trivial scheme. It still gives us 22-bits for page pins
> >> which should be plenty (but we should check for that and bail with error if
> >> it would overflow). Also there will be no false negatives and false
> >> positives only if there are more than 1024 non-page-table references to the
> >> page which I expect to be rare (we might want to also subtract
> >> hpage_nr_pages() for radix tree references to avoid excessive false
> >> positives for huge pages although at this point I don't think they would
> >> matter). Thoughts?
> > 
> > Racing PUP are as likely to cause issues:
> > 
> > CPU0                        | CPU1       | CPU2
> >                             |            |
> >                             | PUP()      |
> >     page_pinned(page)       |            |
> >       (page_count(page) -   |            |
> >        page_mapcount(page)) |            |
> >                             |            | GUP()
> > 
> > So here the refcount snap-shot does not include the second GUP and
> > we can have a false negative ie the page_pinned() will return false
> > because of the PUP happening just before on CPU1 despite the racing
> > GUP on CPU2 just after.
> > 
> > I believe only either lock or memory ordering with barrier can
> > guarantee that we do not miss GUP ie no false negative. Still the
> > bias idea might be usefull as with it we should not need a flag.
> > 
> > So to make the above safe it would still need the page write back
> > double check that i described so that GUP back-off if it raced with
> > page_mkclean,clear_page_dirty_for_io and the fs write page call back
> > which call test_set_page_writeback() (yes it is very unlikely but
> > might still happen).
> > 
> > 
> > I still need to ponder some more on all the races.
> > 
> 
> Tentatively, so far I prefer the _mapcount scheme, because it seems more
> accurate to add mapcounts than to overload the _refcount field. And the 
> implementation is going to be cleaner. And we've already figured out the
> races.

I think there's no difference WRT the races when using _mapcount or _count
bias to identify page pins. In fact the difference between what I suggested
and what you did is just that you update _count instead of _mapcount, and
you can drop the rmap walk code and the page flag.

There are two reasons why I like using _count bias more:

1) I'm still not 100% convinced that some page_mapped() or page_mapcount()
check that starts to be true due to page being unmapped but pinned does not
confuse some code with bad consequences. The fact that the kernel boots
indicates that there's no common check that would get confused but full
audit of page_mapped() and page_mapcount() checks is needed to confirm
there isn't some corner case missed, and that is tedious. There are
definitely places that e.g. assert that page_mapcount() == 0 after all page
tables are unmapped and that is not necessarily true after your changes.

2) If the page gets pinned, we will report it as pinned until the next
page_mkclean() call. That can be quite a long time after page has been
really unpinned. In particular if the page was never dirtied (e.g. because
it was gup'ed only for read but there can be other reasons), it may never
happen that page_mkclean() is called and we won't be able to ever reclaim
such page. So we would have to also add some mechanism to eventually get
such pages cleaned up and that involves rmap walk for each such page which
is not quite cheap.

> For example, the following already survives a basic boot to graphics mode.
> It requires a bunch of callsite conversions, and a page flag (neither of which
> is shown here), and may also have "a few" gross conceptual errors, but take a 
> peek:

Thanks for writing this down! Some comments inline.

> +/*
> + * Manages the PG_gup_pinned flag.
> + *
> + * Note that page->_mapcount counting is part of managing that flag, because the
> + * _mapcount is used to determine if PG_gup_pinned can be cleared, in
> + * page_mkclean().
> + */
> +static void track_gup_page(struct page *page)
> +{
> +	page = compound_head(page);
> +
> +	lock_page(page);
> +
> +	wait_on_page_writeback(page);

^^ I'd use wait_for_stable_page() here. That is the standard waiting
mechanism to use before you allow page modification.

> +
> +	atomic_inc(&page->_mapcount);
> +	SetPageGupPinned(page);
> +
> +	unlock_page(page);
> +}
> +
> +/*
> + * A variant of track_gup_page() that returns -EBUSY, instead of waiting.
> + */
> +static int track_gup_page_atomic(struct page *page)
> +{
> +	page = compound_head(page);
> +
> +	if (PageWriteback(page) || !trylock_page(page))
> +		return -EBUSY;
> +
> +	if (PageWriteback(page)) {
> +		unlock_page(page);
> +		return -EBUSY;
> +	}

Here you'd need some helper that would return whether
wait_for_stable_page() is going to wait. Like would_wait_for_stable_page()
but maybe you can come up with a better name.

> +	atomic_inc(&page->_mapcount);
> +	SetPageGupPinned(page);
> +
> +	unlock_page(page);
> +	return 0;
> +}
> +
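The helper suggested above (the name would_wait_for_stable_page() is only Jan's placeholder) can be modelled in userspace. The sketch below assumes the behavior wait_for_stable_page() had at the time: it blocks only when the backing device requires stable pages *and* the page is under writeback; the structs are stand-ins, not kernel types:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-ins for the kernel state the helper would consult. */
struct backing_dev {
	bool stable_pages_required;	/* bdi_cap_stable_pages_required() */
};

struct page {
	bool writeback;			/* PG_writeback */
	struct backing_dev *bdi;	/* inode_to_bdi(page->mapping->host) */
};

/*
 * Mirrors the decision wait_for_stable_page() makes: it only waits when
 * the backing device demands stable pages AND the page is under writeback,
 * so a trylock-style caller can bail out exactly in that case.
 */
static bool would_wait_for_stable_page(const struct page *page)
{
	return page->bdi->stable_pages_required && page->writeback;
}
```

track_gup_page_atomic() could then return -EBUSY when this predicate is true, instead of unconditionally bailing on PageWriteback().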

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-14 17:21                                                                     ` Jerome Glisse
  2019-01-14 17:21                                                                       ` Jerome Glisse
  2019-01-14 19:09                                                                       ` John Hubbard
@ 2019-01-15  8:07                                                                       ` Jan Kara
  2019-01-15  8:07                                                                         ` Jan Kara
                                                                                           ` (3 more replies)
  2 siblings, 4 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-15  8:07 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> > On Fri 11-01-19 19:06:08, John Hubbard wrote:
> > > On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > > > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> > > > [...]
> > > > 
> > > >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> > > >>>> option, which we'll certainly need in order to safely convert all the call
> > > >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> > > >>>> and put_user_page() can verify that the right call was made.)  That will be
> > > >>>> a separate patchset, as you recommended.
> > > >>>>
> > > >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> > > >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> > > >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> > > >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> > > >>>> then I'd recommend using another page bit to do the same thing.)
> > > >>>
> > > >>> Please page lock is pointless and it will not work for GUP fast. The above
> > > >>> scheme do work and is fine. I spend the day again thinking about all memory
> > > >>> ordering and i do not see any issues.
> > > >>>
> > > >>
> > > >> Why is it that page lock cannot be used for gup fast, btw?
> > > > 
> > > > Well it can not happen within the preempt disable section. But after
> > > > as a post pass before GUP_fast return and after reenabling preempt then
> > > > it is fine like it would be for regular GUP. But locking page for GUP
> > > > is also likely to slow down some workload (with direct-IO).
> > > > 
> > > 
> > > Right, and so to crux of the matter: taking an uncontended page lock
> > > involves pretty much the same set of operations that your approach does.
> > > (If gup ends up contended with the page lock for other reasons than these
> > > paths, that seems surprising.) I'd expect very similar performance.
> > > 
> > > But the page lock approach leads to really dramatically simpler code (and
> > > code reviews, let's not forget). Any objection to my going that
> > > direction, and keeping this idea as a Plan B? I think the next step will
> > > be, once again, to gather some performance metrics, so maybe that will
> > > help us decide.
> > 
> > FWIW I agree that using page lock for protecting page pinning (and thus
> > avoid races with page_mkclean()) looks simpler to me as well and I'm not
> > convinced there will be measurable difference to the more complex scheme
> > with barriers Jerome suggests unless that page lock contended. Jerome is
> > right that you cannot just do lock_page() in gup_fast() path. There you
> > have to do trylock_page() and if that fails just bail out to the slow gup
> > path.
> > 
> > Regarding places other than page_mkclean() that need to check pinned state:
> > Definitely page migration will want to check whether the page is pinned or
> > not so that it can deal differently with short-term page references vs
> > longer-term pins.
> > 
* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-15  8:07                                                                       ` Jan Kara
@ 2019-01-15  8:07                                                                         ` Jan Kara
  2019-01-15 17:15                                                                         ` Jerome Glisse
                                                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-15  8:07 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 14-01-19 12:21:25, Jerome Glisse wrote:
> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> > On Fri 11-01-19 19:06:08, John Hubbard wrote:
> > > On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > > > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> > > > [...]
> > > > 
> > > >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> > > >>>> option, which we'll certainly need in order to safely convert all the call
> > > >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> > > >>>> and put_user_page() can verify that the right call was made.)  That will be
> > > >>>> a separate patchset, as you recommended.
> > > >>>>
> > > >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> > > >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> > > >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> > > >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> > > >>>> then I'd recommend using another page bit to do the same thing.)
> > > >>>
> > > >>> Please, the page lock is pointless and it will not work for GUP fast. The
> > > >>> above scheme does work and is fine. I spent the day again thinking about
> > > >>> all the memory ordering and I do not see any issues.
> > > >>>
> > > >>
> > > >> Why is it that page lock cannot be used for gup fast, btw?
> > > > 
> > > > Well, it cannot happen within the preempt-disable section. But afterwards,
> > > > as a post pass before GUP_fast returns and after re-enabling preemption,
> > > > it is fine, as it would be for regular GUP. But locking the page for GUP
> > > > is also likely to slow down some workloads (with direct I/O).
> > > > 
> > > 
> > > Right, and so to the crux of the matter: taking an uncontended page lock
> > > involves pretty much the same set of operations that your approach does.
> > > (If gup ends up contended with the page lock for other reasons than these
> > > paths, that seems surprising.) I'd expect very similar performance.
> > > 
> > > But the page lock approach leads to really dramatically simpler code (and
> > > code reviews, let's not forget). Any objection to my going that
> > > direction, and keeping this idea as a Plan B? I think the next step will
> > > be, once again, to gather some performance metrics, so maybe that will
> > > help us decide.
> > 
> > FWIW I agree that using page lock for protecting page pinning (and thus
> > avoid races with page_mkclean()) looks simpler to me as well and I'm not
> > convinced there will be measurable difference to the more complex scheme
> > with barriers Jerome suggests, unless that page lock is contended. Jerome is
> > right that you cannot just do lock_page() in the gup_fast() path. There you
> > have to do trylock_page(), and if that fails, just bail out to the slow gup
> > path.
> > 
> > Regarding places other than page_mkclean() that need to check pinned state:
> > Definitely page migration will want to check whether the page is pinned or
> > not so that it can deal differently with short-term page references vs
> > longer-term pins.
> > 
> > Also there is one more idea I had how to record number of pins in the page:
> > 
> > #define PAGE_PIN_BIAS	1024
> > 
> > get_page_pin()
> > 	atomic_add(PAGE_PIN_BIAS, &page->_refcount);
> > 
> > put_page_pin()
> > 	atomic_sub(PAGE_PIN_BIAS, &page->_refcount);
> > 
> > page_pinned(page)
> > 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> > 
> > This is a pretty trivial scheme. It still gives us 22 bits for page pins,
> > which should be plenty (but we should check for that and bail out with an
> > error if it would overflow). Also, there will be no false negatives, and
> > false positives only if there are more than 1024 non-page-table references
> > to the page, which I expect to be rare (we might also want to subtract
> > hpage_nr_pages() for radix tree references to avoid excessive false
> > positives for huge pages, although at this point I don't think they would
> > matter). Thoughts?
> 
> Racing PUPs are just as likely to cause issues:
> 
> CPU0                        | CPU1       | CPU2
>                             |            |
>                             | PUP()      |
>     page_pinned(page)       |            |
>       (page_count(page) -   |            |
>        page_mapcount(page)) |            |
>                             |            | GUP()
> 
> So here the refcount snapshot does not include the second GUP, and
> we can have a false negative, i.e. page_pinned() will return false
> because of the PUP happening just before on CPU1, despite the racing
> GUP on CPU2 just after.
> 
> I believe only a lock, or memory ordering with barriers, can
> guarantee that we do not miss a GUP, i.e. no false negatives. Still, the
> bias idea might be useful, as with it we should not need a flag.

Right. We need similar synchronization (i.e., page lock or careful checks
with memory barriers) if we want to get reliable page pin information.

> So to make the above safe, it would still need the page writeback
> double-check that I described, so that GUP backs off if it raced with
> page_mkclean(), clear_page_dirty_for_io(), and the fs writepage callback,
> which calls test_set_page_writeback() (yes, it is very unlikely but
> might still happen).

Agreed. So with page lock it would actually look like:

get_page_pin()
	lock_page(page);
	wait_for_stable_page(page);
	atomic_add(PAGE_PIN_BIAS, &page->_refcount);
	unlock_page(page);

And if we perform the page_pinned() check under the page lock, then if
page_pinned() returned false, we are sure the page is not, and will not be,
pinned until we drop the page lock (and also until page writeback is
completed, if needed).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-14 17:21                                                                     ` Jerome Glisse
  2019-01-14 17:21                                                                       ` Jerome Glisse
@ 2019-01-14 19:09                                                                       ` John Hubbard
  2019-01-14 19:09                                                                         ` John Hubbard
  2019-01-15  8:34                                                                         ` Jan Kara
  2019-01-15  8:07                                                                       ` Jan Kara
  2 siblings, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-14 19:09 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/14/19 9:21 AM, Jerome Glisse wrote:
> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>>>> [...]
>>>>
>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>>>>> option, which we'll certainly need in order to safely convert all the call
>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>>>>> a separate patchset, as you recommended.
>>>>>>>
>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>>>>> then I'd recommend using another page bit to do the same thing.)
>>>>>>
>>>>>> Please, the page lock is pointless and it will not work for GUP fast. The
>>>>>> above scheme does work and is fine. I spent the day again thinking about
>>>>>> all the memory ordering and I do not see any issues.
>>>>>>
>>>>>
>>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>>
>>>> Well, it cannot happen within the preempt-disable section. But afterwards,
>>>> as a post pass before GUP_fast returns and after re-enabling preemption,
>>>> it is fine, as it would be for regular GUP. But locking the page for GUP
>>>> is also likely to slow down some workloads (with direct I/O).
>>>>
>>>
>>> Right, and so to the crux of the matter: taking an uncontended page lock
>>> involves pretty much the same set of operations that your approach does.
>>> (If gup ends up contended with the page lock for other reasons than these
>>> paths, that seems surprising.) I'd expect very similar performance.
>>>
>>> But the page lock approach leads to really dramatically simpler code (and
>>> code reviews, let's not forget). Any objection to my going that
>>> direction, and keeping this idea as a Plan B? I think the next step will
>>> be, once again, to gather some performance metrics, so maybe that will
>>> help us decide.
>>
>> FWIW I agree that using page lock for protecting page pinning (and thus
>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
>> convinced there will be measurable difference to the more complex scheme
>> with barriers Jerome suggests, unless that page lock is contended. Jerome is
>> right that you cannot just do lock_page() in the gup_fast() path. There you
>> have to do trylock_page(), and if that fails, just bail out to the slow gup
>> path.
>>

Yes, understood about gup fast.

>> Regarding places other than page_mkclean() that need to check pinned state:
>> Definitely page migration will want to check whether the page is pinned or
>> not so that it can deal differently with short-term page references vs
>> longer-term pins.

OK.

>>
>> Also there is one more idea I had how to record number of pins in the page:
>>
>> #define PAGE_PIN_BIAS	1024
>>
>> get_page_pin()
>> 	atomic_add(PAGE_PIN_BIAS, &page->_refcount);
>> 
>> put_page_pin()
>> 	atomic_sub(PAGE_PIN_BIAS, &page->_refcount);
>>
>> page_pinned(page)
>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>
>> This is a pretty trivial scheme. It still gives us 22 bits for page pins,
>> which should be plenty (but we should check for that and bail out with an
>> error if it would overflow). Also, there will be no false negatives, and
>> false positives only if there are more than 1024 non-page-table references
>> to the page, which I expect to be rare (we might also want to subtract
>> hpage_nr_pages() for radix tree references to avoid excessive false
>> positives for huge pages, although at this point I don't think they would
>> matter). Thoughts?
> 
> Racing PUPs are just as likely to cause issues:
> 
> CPU0                        | CPU1       | CPU2
>                             |            |
>                             | PUP()      |
>     page_pinned(page)       |            |
>       (page_count(page) -   |            |
>        page_mapcount(page)) |            |
>                             |            | GUP()
> 
> So here the refcount snapshot does not include the second GUP, and
> we can have a false negative, i.e. page_pinned() will return false
> because of the PUP happening just before on CPU1, despite the racing
> GUP on CPU2 just after.
> 
> I believe only a lock, or memory ordering with barriers, can
> guarantee that we do not miss a GUP, i.e. no false negatives. Still, the
> bias idea might be useful, as with it we should not need a flag.
> 
> So to make the above safe, it would still need the page writeback
> double-check that I described, so that GUP backs off if it raced with
> page_mkclean(), clear_page_dirty_for_io(), and the fs writepage callback,
> which calls test_set_page_writeback() (yes, it is very unlikely but
> might still happen).
> 
> 
> I still need to ponder some more on all the races.
> 

Tentatively, so far I prefer the _mapcount scheme, because it seems more
accurate to add mapcounts than to overload the _refcount field. And the 
implementation is going to be cleaner. And we've already figured out the
races.

For example, the following already survives a basic boot to graphics mode.
It requires a bunch of callsite conversions, and a page flag (neither of which
is shown here), and may also have "a few" gross conceptual errors, but take a 
peek:

From 1b6e611238a45badda7e63d3ffc089cefb621cb2 Mon Sep 17 00:00:00 2001
From: John Hubbard <jhubbard@nvidia.com>
Date: Sun, 13 Jan 2019 15:10:31 -0800
Subject: [PATCH 2/2] mm: track gup-pinned pages
X-NVConfidentiality: public
Cc: John Hubbard <jhubbard@nvidia.com>

Track GUP-pinned pages.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  8 ++++---
 mm/gup.c           | 59 +++++++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c          | 23 ++++++++++++++----
 3 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 809b7397d41e..3221a13b4891 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1004,12 +1004,14 @@ static inline void put_page(struct page *page)
  * particular, interactions with RDMA and filesystems need special
  * handling.
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * put_user_page() and put_page() are not interchangeable. put_user_page()
+ * calls must be perfectly matched up with get_user_page() calls.
  */
 static inline void put_user_page(struct page *page)
 {
+	page = compound_head(page);
+
+	atomic_dec(&page->_mapcount);
 	put_page(page);
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..af3909814be7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -615,6 +615,48 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+/*
+ * Manages the PG_gup_pinned flag.
+ *
+ * Note that page->_mapcount counting is part of managing that flag, because
+ * _mapcount is used to determine whether PG_gup_pinned can be cleared, in
+ * page_mkclean().
+ */
+static void track_gup_page(struct page *page)
+{
+	page = compound_head(page);
+
+	lock_page(page);
+
+	wait_on_page_writeback(page);
+
+	atomic_inc(&page->_mapcount);
+	SetPageGupPinned(page);
+
+	unlock_page(page);
+}
+
+/*
+ * A variant of track_gup_page() that returns -EBUSY, instead of waiting.
+ */
+static int track_gup_page_atomic(struct page *page)
+{
+	page = compound_head(page);
+
+	if (PageWriteback(page) || !trylock_page(page))
+		return -EBUSY;
+
+	if (PageWriteback(page)) {
+		unlock_page(page);
+		return -EBUSY;
+	}
+	atomic_inc(&page->_mapcount);
+	SetPageGupPinned(page);
+
+	unlock_page(page);
+	return 0;
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -761,6 +803,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = PTR_ERR(page);
 			goto out;
 		}
+
+		track_gup_page(page);
+
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
@@ -1439,6 +1484,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		if (track_gup_page_atomic(page)) {
+			put_page(head);
+			goto pte_unmap;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -1574,7 +1624,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
@@ -1612,7 +1663,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+	if (unlikely(pud_val(orig) != pud_val(*pudp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
@@ -1649,7 +1701,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
+	if (unlikely(pgd_val(orig) != pgd_val(*pgdp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..434283898bb0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -880,6 +880,11 @@ int page_referenced(struct page *page,
 	return pra.referenced;
 }
 
+struct page_mkclean_args {
+	int cleaned;
+	int mapcount;
+};
+
 static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			    unsigned long address, void *arg)
 {
@@ -890,7 +895,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		.flags = PVMW_SYNC,
 	};
 	struct mmu_notifier_range range;
-	int *cleaned = arg;
+	struct page_mkclean_args *pma = arg;
 
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
@@ -940,6 +945,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 #endif
 		}
 
+		pma->mapcount++;
+
 		/*
 		 * No need to call mmu_notifier_invalidate_range() as we are
 		 * downgrading page table protection not changing it to point
@@ -948,7 +955,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		if (ret)
-			(*cleaned)++;
+			pma->cleaned++;
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
@@ -966,10 +973,13 @@ static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg)
 
 int page_mkclean(struct page *page)
 {
-	int cleaned = 0;
+	struct page_mkclean_args pma = {
+		.cleaned = 0,
+		.mapcount = 0
+	};
 	struct address_space *mapping;
 	struct rmap_walk_control rwc = {
-		.arg = (void *)&cleaned,
+		.arg = (void *)&pma,
 		.rmap_one = page_mkclean_one,
 		.invalid_vma = invalid_mkclean_vma,
 	};
@@ -985,7 +995,10 @@ int page_mkclean(struct page *page)
 
 	rmap_walk(page, &rwc);
 
-	return cleaned;
+	if (pma.mapcount == page_mapcount(page))
+		ClearPageGupPinned(page);
+
+	return pma.cleaned;
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
-- 
2.20.1



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-14 19:09                                                                       ` John Hubbard
@ 2019-01-14 19:09                                                                         ` John Hubbard
  2019-01-15  8:34                                                                         ` Jan Kara
  1 sibling, 0 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-14 19:09 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/14/19 9:21 AM, Jerome Glisse wrote:
> On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
>> On Fri 11-01-19 19:06:08, John Hubbard wrote:
>>> On 1/11/19 6:46 PM, Jerome Glisse wrote:
>>>> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>>>> [...]
>>>>
>>>>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>>>>> option, which we'll certainly need in order to safely convert all the call
>>>>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>>>>> a separate patchset, as you recommended.
>>>>>>>
>>>>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>>>>> then I'd recommend using another page bit to do the same thing.)
>>>>>>
>>>>>> Please page lock is pointless and it will not work for GUP fast. The above
>>>>>> scheme do work and is fine. I spend the day again thinking about all memory
>>>>>> ordering and i do not see any issues.
>>>>>>
>>>>>
>>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>>
>>>> Well it can not happen within the preempt disable section. But after
>>>> as a post pass before GUP_fast return and after reenabling preempt then
>>>> it is fine like it would be for regular GUP. But locking page for GUP
>>>> is also likely to slow down some workload (with direct-IO).
>>>>
>>>
>>> Right, and so to crux of the matter: taking an uncontended page lock
>>> involves pretty much the same set of operations that your approach does.
>>> (If gup ends up contended with the page lock for other reasons than these
>>> paths, that seems surprising.) I'd expect very similar performance.
>>>
>>> But the page lock approach leads to really dramatically simpler code (and
>>> code reviews, let's not forget). Any objection to my going that
>>> direction, and keeping this idea as a Plan B? I think the next step will
>>> be, once again, to gather some performance metrics, so maybe that will
>>> help us decide.
>>
>> FWIW I agree that using page lock for protecting page pinning (and thus
>> avoid races with page_mkclean()) looks simpler to me as well and I'm not
>> convinced there will be measurable difference to the more complex scheme
>> with barriers Jerome suggests unless that page lock contended. Jerome is
>> right that you cannot just do lock_page() in gup_fast() path. There you
>> have to do trylock_page() and if that fails just bail out to the slow gup
>> path.
>>

Yes, understood about gup fast.

>> Regarding places other than page_mkclean() that need to check pinned state:
>> Definitely page migration will want to check whether the page is pinned or
>> not so that it can deal differently with short-term page references vs
>> longer-term pins.

OK.

>>
>> Also there is one more idea I had how to record number of pins in the page:
>>
>> #define PAGE_PIN_BIAS	1024
>>
>> get_page_pin()
>> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
>>
>> put_page_pin();
>> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
>>
>> page_pinned(page)
>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
>>
>> This is pretty trivial scheme. It still gives us 22-bits for page pins
>> which should be plenty (but we should check for that and bail with error if
>> it would overflow). Also there will be no false negatives and false
>> positives only if there are more than 1024 non-page-table references to the
>> page which I expect to be rare (we might want to also subtract
>> hpage_nr_pages() for radix tree references to avoid excessive false
>> positives for huge pages although at this point I don't think they would
>> matter). Thoughts?
> 
> Racing PUP are as likely to cause issues:
> 
> CPU0                        | CPU1       | CPU2
>                             |            |
>                             | PUP()      |
>     page_pinned(page)       |            |
>       (page_count(page) -   |            |
>        page_mapcount(page)) |            |
>                             |            | GUP()
> 
> So here the refcount snap-shot does not include the second GUP and
> we can have a false negative ie the page_pinned() will return false
> because of the PUP happening just before on CPU1 despite the racing
> GUP on CPU2 just after.
> 
> I believe only either lock or memory ordering with barrier can
> guarantee that we do not miss GUP ie no false negative. Still the
> bias idea might be usefull as with it we should not need a flag.
> 
> So to make the above safe it would still need the page write back
> double check that i described so that GUP back-off if it raced with
> page_mkclean,clear_page_dirty_for_io and the fs write page call back
> which call test_set_page_writeback() (yes it is very unlikely but
> might still happen).
> 
> 
> I still need to ponder some more on all the races.
> 

Tentatively, so far I prefer the _mapcount scheme, because it seems more
accurate to add mapcounts than to overload the _refcount field. And the 
implementation is going to be cleaner. And we've already figured out the
races.

For example, the following already survives a basic boot to graphics mode.
It requires a bunch of callsite conversions, and a page flag (neither of which
is shown here), and may also have "a few" gross conceptual errors, but take a 
peek:

From 1b6e611238a45badda7e63d3ffc089cefb621cb2 Mon Sep 17 00:00:00 2001
From: John Hubbard <jhubbard@nvidia.com>
Date: Sun, 13 Jan 2019 15:10:31 -0800
Subject: [PATCH 2/2] mm: track gup-pinned pages
X-NVConfidentiality: public
Cc: John Hubbard <jhubbard@nvidia.com>

Track GUP-pinned pages.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  8 ++++---
 mm/gup.c           | 59 +++++++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c          | 23 ++++++++++++++----
 3 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 809b7397d41e..3221a13b4891 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1004,12 +1004,14 @@ static inline void put_page(struct page *page)
  * particular, interactions with RDMA and filesystems need special
  * handling.
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * put_user_page() and put_page() are not interchangeable. put_user_page()
+ * calls must be perfectly matched up with get_user_page() calls.
  */
 static inline void put_user_page(struct page *page)
 {
+	page = compound_head(page);
+
+	atomic_dec(&page->_mapcount);
 	put_page(page);
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..af3909814be7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -615,6 +615,48 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+/*
+ * Manages the PG_gup_pinned flag.
+ *
+ * Note that page->_mapcount counting part of managing that flag, because the
+ * _mapcount is used to determine if PG_gup_pinned can be cleared, in
+ * page_mkclean().
+ */
+static void track_gup_page(struct page *page)
+{
+	page = compound_head(page);
+
+	lock_page(page);
+
+	wait_on_page_writeback(page);
+
+	atomic_inc(&page->_mapcount);
+	SetPageGupPinned(page);
+
+	unlock_page(page);
+}
+
+/*
+ * A variant of track_gup_page() that returns -EBUSY, instead of waiting.
+ */
+static int track_gup_page_atomic(struct page *page)
+{
+	page = compound_head(page);
+
+	if (PageWriteback(page) || !trylock_page(page))
+		return -EBUSY;
+
+	if (PageWriteback(page)) {
+		unlock_page(page);
+		return -EBUSY;
+	}
+	atomic_inc(&page->_mapcount);
+	SetPageGupPinned(page);
+
+	unlock_page(page);
+	return 0;
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -761,6 +803,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = PTR_ERR(page);
 			goto out;
 		}
+
+		track_gup_page(page);
+
 		if (pages) {
 			pages[i] = page;
 			flush_anon_page(vma, page, start);
@@ -1439,6 +1484,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		if (track_gup_page_atomic(page)) {
+			put_page(head);
+			goto pte_unmap;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -1574,7 +1624,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
@@ -1612,7 +1663,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+	if (unlikely(pud_val(orig) != pud_val(*pudp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
@@ -1649,7 +1701,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
-	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
+	if (unlikely(pgd_val(orig) != pgd_val(*pgdp)) ||
+	    track_gup_page_atomic(head)) {
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..434283898bb0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -880,6 +880,11 @@ int page_referenced(struct page *page,
 	return pra.referenced;
 }
 
+struct page_mkclean_args {
+	int cleaned;
+	int mapcount;
+};
+
 static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			    unsigned long address, void *arg)
 {
@@ -890,7 +895,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		.flags = PVMW_SYNC,
 	};
 	struct mmu_notifier_range range;
-	int *cleaned = arg;
+	struct page_mkclean_args *pma = arg;
 
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
@@ -940,6 +945,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 #endif
 		}
 
+		pma->mapcount++;
+
 		/*
 		 * No need to call mmu_notifier_invalidate_range() as we are
 		 * downgrading page table protection not changing it to point
@@ -948,7 +955,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		if (ret)
-			(*cleaned)++;
+			pma->cleaned++;
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
@@ -966,10 +973,13 @@ static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg)
 
 int page_mkclean(struct page *page)
 {
-	int cleaned = 0;
+	struct page_mkclean_args pma = {
+		.cleaned = 0,
+		.mapcount = 0
+	};
 	struct address_space *mapping;
 	struct rmap_walk_control rwc = {
-		.arg = (void *)&cleaned,
+		.arg = (void *)&pma,
 		.rmap_one = page_mkclean_one,
 		.invalid_vma = invalid_mkclean_vma,
 	};
@@ -985,7 +995,10 @@ int page_mkclean(struct page *page)
 
 	rmap_walk(page, &rwc);
 
-	return cleaned;
+	if (pma.mapcount == page_mapcount(page))
+		ClearPageGupPinned(page);
+
+	return pma.cleaned;
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
-- 
2.20.1
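For clarity, the effect of the final rmap.c hunk in the patch above can be modeled in a few lines of plain C. This is only a sketch of the logic with simplified, hypothetical types standing in for struct page and the rmap walk, not the kernel code: page_mkclean() counts the mappings its rmap walk visits, and clears the GUP-pinned flag only when that count accounts for the whole of page_mapcount(), i.e. when no GUP pin (which, in this scheme, is folded into _mapcount) remains hidden in the count.

```c
#include <stdbool.h>

/* mirrors the struct added by the patch */
struct page_mkclean_args {
	int cleaned;	/* ptes actually write-protected by the walk */
	int mapcount;	/* ptes visited by the walk */
};

/* simplified stand-in for struct page */
struct page_model {
	int mapcount;	/* models page_mapcount(page) */
	bool gup_pinned;	/* models the PageGupPinned() bit */
};

/* models the tail of page_mkclean() after rmap_walk() filled *pma */
static int page_mkclean_finish(struct page_model *page,
			       const struct page_mkclean_args *pma)
{
	/*
	 * If every reference counted in _mapcount was a page-table
	 * mapping visited by the walk, no GUP pin is folded into the
	 * count, so it is safe to clear the pinned flag.
	 */
	if (pma->mapcount == page->mapcount)
		page->gup_pinned = false;
	return pma->cleaned;
}
```

When the walk sees fewer mappings than page_mapcount() reports, the difference must be pins, and the flag stays set.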



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-14 14:54                                                                   ` Jan Kara
  2019-01-14 14:54                                                                     ` Jan Kara
@ 2019-01-14 17:21                                                                     ` Jerome Glisse
  2019-01-14 17:21                                                                       ` Jerome Glisse
                                                                                         ` (2 more replies)
  1 sibling, 3 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-14 17:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Mon, Jan 14, 2019 at 03:54:47PM +0100, Jan Kara wrote:
> On Fri 11-01-19 19:06:08, John Hubbard wrote:
> > On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> > > [...]
> > > 
> > >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> > >>>> option, which we'll certainly need in order to safely convert all the call
> > >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> > >>>> and put_user_page() can verify that the right call was made.)  That will be
> > >>>> a separate patchset, as you recommended.
> > >>>>
> > >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> > >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> > >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> > >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> > >>>> then I'd recommend using another page bit to do the same thing.)
> > >>>
> > >>> Please page lock is pointless and it will not work for GUP fast. The above
> > >>> scheme do work and is fine. I spend the day again thinking about all memory
> > >>> ordering and i do not see any issues.
> > >>>
> > >>
> > >> Why is it that page lock cannot be used for gup fast, btw?
> > > 
> > > Well it can not happen within the preempt disable section. But after
> > > as a post pass before GUP_fast return and after reenabling preempt then
> > > it is fine like it would be for regular GUP. But locking page for GUP
> > > is also likely to slow down some workload (with direct-IO).
> > > 
> > 
> > Right, and so to crux of the matter: taking an uncontended page lock
> > involves pretty much the same set of operations that your approach does.
> > (If gup ends up contended with the page lock for other reasons than these
> > paths, that seems surprising.) I'd expect very similar performance.
> > 
> > But the page lock approach leads to really dramatically simpler code (and
> > code reviews, let's not forget). Any objection to my going that
> > direction, and keeping this idea as a Plan B? I think the next step will
> > be, once again, to gather some performance metrics, so maybe that will
> > help us decide.
> 
> FWIW I agree that using page lock for protecting page pinning (and thus
> avoid races with page_mkclean()) looks simpler to me as well and I'm not
> convinced there will be measurable difference to the more complex scheme
> with barriers Jerome suggests unless that page lock contended. Jerome is
> right that you cannot just do lock_page() in gup_fast() path. There you
> have to do trylock_page() and if that fails just bail out to the slow gup
> path.
> 
> Regarding places other than page_mkclean() that need to check pinned state:
> Definitely page migration will want to check whether the page is pinned or
> not so that it can deal differently with short-term page references vs
> longer-term pins.
> 
> Also there is one more idea I had how to record number of pins in the page:
> 
> #define PAGE_PIN_BIAS	1024
> 
> get_page_pin()
> 	atomic_add(&page->_refcount, PAGE_PIN_BIAS);
> 
> put_page_pin();
> 	atomic_add(&page->_refcount, -PAGE_PIN_BIAS);
> 
> page_pinned(page)
> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> 
> This is pretty trivial scheme. It still gives us 22-bits for page pins
> which should be plenty (but we should check for that and bail with error if
> it would overflow). Also there will be no false negatives and false
> positives only if there are more than 1024 non-page-table references to the
> page which I expect to be rare (we might want to also subtract
> hpage_nr_pages() for radix tree references to avoid excessive false
> positives for huge pages although at this point I don't think they would
> matter). Thoughts?

Racing PUPs are just as likely to cause issues:

CPU0                        | CPU1       | CPU2
                            |            |
                            | PUP()      |
    page_pinned(page)       |            |
      (page_count(page) -   |            |
       page_mapcount(page)) |            |
                            |            | GUP()

So here the refcount snapshot does not include the second GUP, and we
can have a false negative: page_pinned() will return false because of
the PUP that happened just before on CPU1, despite the racing GUP on
CPU2 landing just after.
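The interleaving above can be replayed deterministically in a single thread to see the false negative. This is an illustrative userspace model of the bias counting with hypothetical names, not kernel code:

```c
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024

struct page_snap {
	int refcount;	/* models page->_refcount (base reference included) */
	int mapcount;	/* models page_mapcount(page) */
};

static void gup(struct page_snap *p) { p->refcount += PAGE_PIN_BIAS; }
static void pup(struct page_snap *p) { p->refcount -= PAGE_PIN_BIAS; }

/* the unsynchronized snapshot check from the proposal */
static bool page_pinned(const struct page_snap *p)
{
	return (p->refcount - p->mapcount) > PAGE_PIN_BIAS;
}
```

Replaying PUP on CPU1, then the snapshot on CPU0, then GUP on CPU2 shows the check reporting "not pinned" even though a pin exists immediately afterwards.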

I believe only a lock or memory ordering with barriers can
guarantee that we do not miss a GUP, i.e. no false negatives. Still,
the bias idea might be useful, as with it we should not need a flag.

So to make the above safe, it would still need the page writeback
double check that I described, so that GUP backs off if it raced with
page_mkclean(), clear_page_dirty_for_io(), or the fs writepage callback
that calls test_set_page_writeback() (yes, it is very unlikely but
might still happen).


I still need to ponder some more on all the races.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:06                                                                   ` John Hubbard
  2019-01-12  3:25                                                                   ` Jerome Glisse
@ 2019-01-14 14:54                                                                   ` Jan Kara
  2019-01-14 14:54                                                                     ` Jan Kara
  2019-01-14 17:21                                                                     ` Jerome Glisse
  2 siblings, 2 replies; 207+ messages in thread
From: Jan Kara @ 2019-01-14 14:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Jan Kara, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Fri 11-01-19 19:06:08, John Hubbard wrote:
> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> > [...]
> > 
> >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> >>>> option, which we'll certainly need in order to safely convert all the call
> >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >>>> and put_user_page() can verify that the right call was made.)  That will be
> >>>> a separate patchset, as you recommended.
> >>>>
> >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >>>> then I'd recommend using another page bit to do the same thing.)
> >>>
> >>> Please page lock is pointless and it will not work for GUP fast. The above
> >>> scheme do work and is fine. I spend the day again thinking about all memory
> >>> ordering and i do not see any issues.
> >>>
> >>
> >> Why is it that page lock cannot be used for gup fast, btw?
> > 
> > Well it can not happen within the preempt disable section. But after
> > as a post pass before GUP_fast return and after reenabling preempt then
> > it is fine like it would be for regular GUP. But locking page for GUP
> > is also likely to slow down some workload (with direct-IO).
> > 
> 
> Right, and so to crux of the matter: taking an uncontended page lock
> involves pretty much the same set of operations that your approach does.
> (If gup ends up contended with the page lock for other reasons than these
> paths, that seems surprising.) I'd expect very similar performance.
> 
> But the page lock approach leads to really dramatically simpler code (and
> code reviews, let's not forget). Any objection to my going that
> direction, and keeping this idea as a Plan B? I think the next step will
> be, once again, to gather some performance metrics, so maybe that will
> help us decide.

FWIW I agree that using the page lock for protecting page pinning (and
thus avoiding races with page_mkclean()) looks simpler to me as well, and
I'm not convinced there will be a measurable difference compared to the
more complex scheme with barriers Jerome suggests, unless that page lock
is contended. Jerome is right that you cannot just do lock_page() in the
gup_fast() path. There you have to do trylock_page(), and if that fails,
just bail out to the slow gup path.
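That trylock-then-bail shape can be sketched in userspace as follows. A pthread mutex stands in for the page lock, and all names are illustrative assumptions, not the kernel API:

```c
#include <pthread.h>
#include <stdbool.h>

struct locked_page {
	pthread_mutex_t lock;	/* stands in for the page lock */
	int pin_count;
};

/* slow gup: may sleep, so it can block on the lock */
static void gup_slow_pin(struct locked_page *p)
{
	pthread_mutex_lock(&p->lock);
	p->pin_count++;
	pthread_mutex_unlock(&p->lock);
}

/*
 * gup_fast: runs with preemption/IRQs disabled in the kernel, so it
 * must not block; it may only try the lock and bail on contention.
 */
static bool gup_fast_pin(struct locked_page *p)
{
	if (pthread_mutex_trylock(&p->lock) != 0)
		return false;	/* caller falls back to the slow gup path */
	p->pin_count++;
	pthread_mutex_unlock(&p->lock);
	return true;
}
```

The point is only the control flow: the fast path never sleeps on the lock; contention means retrying via the slow path, which can.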

Regarding places other than page_mkclean() that need to check pinned state:
Definitely page migration will want to check whether the page is pinned or
not so that it can deal differently with short-term page references vs
longer-term pins.

Also, there is one more idea I had for how to record the number of pins in the page:

#define PAGE_PIN_BIAS	1024

get_page_pin()
	atomic_add(PAGE_PIN_BIAS, &page->_refcount);

put_page_pin()
	atomic_sub(PAGE_PIN_BIAS, &page->_refcount);

page_pinned(page)
	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS

This is a pretty trivial scheme. It still gives us 22 bits for page pins,
which should be plenty (but we should check for that and bail with an
error if it would overflow). Also, there will be no false negatives, and
false positives only if there are more than 1024 non-page-table references
to the page, which I expect to be rare (we might also want to subtract
hpage_nr_pages() for radix tree references to avoid excessive false
positives for huge pages, although at this point I don't think they would
matter). Thoughts?
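A userspace sketch of the scheme above, including the overflow bail-out, might look like this (C11 atomics; illustrative names and a hypothetical limit, not the kernel's atomic_t API):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_PIN_BIAS 1024
/* hypothetical cap: leave headroom below refcount saturation */
#define PIN_LIMIT     ((1 << 30) - PAGE_PIN_BIAS)

struct page_model {
	atomic_int refcount;	/* models page->_refcount */
	atomic_int mapcount;	/* models page_mapcount(page) */
};

/* add one pin, or fail rather than overflow the pin space */
static bool get_page_pin(struct page_model *p)
{
	if (atomic_load(&p->refcount) > PIN_LIMIT)
		return false;
	atomic_fetch_add(&p->refcount, PAGE_PIN_BIAS);
	return true;
}

static void put_page_pin(struct page_model *p)
{
	atomic_fetch_sub(&p->refcount, PAGE_PIN_BIAS);
}

/* pins (or >1024 stray references) make this true */
static bool bias_page_pinned(struct page_model *p)
{
	return (atomic_load(&p->refcount) - atomic_load(&p->mapcount))
		> PAGE_PIN_BIAS;
}
```

With a base refcount of 1 and no extra references, one pin raises the difference to 1025 and the check fires; dropping the pin returns it below the bias.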

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  3:25                                                                   ` Jerome Glisse
  2019-01-12  3:25                                                                     ` Jerome Glisse
@ 2019-01-12 20:46                                                                     ` John Hubbard
  2019-01-12 20:46                                                                       ` John Hubbard
  1 sibling, 1 reply; 207+ messages in thread
From: John Hubbard @ 2019-01-12 20:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 7:25 PM, Jerome Glisse wrote:
[...]
>>>> Why is it that page lock cannot be used for gup fast, btw?
>>>
>>> Well it can not happen within the preempt disable section. But after
>>> as a post pass before GUP_fast return and after reenabling preempt then
>>> it is fine like it would be for regular GUP. But locking page for GUP
>>> is also likely to slow down some workload (with direct-IO).
>>>
>>
>> Right, and so to crux of the matter: taking an uncontended page lock involves
>> pretty much the same set of operations that your approach does. (If gup ends up
>> contended with the page lock for other reasons than these paths, that seems
>> surprising.) I'd expect very similar performance.
>>
>> But the page lock approach leads to really dramatically simpler code (and code
>> reviews, let's not forget). Any objection to my going that direction, and keeping
>> this idea as a Plan B? I think the next step will be, once again, to gather some
>> performance metrics, so maybe that will help us decide.
> 
> There are already workloads that suffer from the page lock, so adding more
> code that needs it will only worsen those situations. I guess I will do a
> patchset with my solution as it is definitely lighter weight than having to
> take the page lock.
> 

Hi Jerome,

I expect that you're right, and in any case, having you code up the new 
synchronization parts is probably a smart idea--you understand it best. To avoid
duplicating work, may I propose these steps:

1. I'll post a new RFC, using your mapcount idea, but with a minor variation: 
using the page lock to synchronize gup() and page_mkclean(). 

   a) I'll also include a github path that has enough gup callsite conversions
   done, to allow performance testing. 

   b) And also, you and others have provided a lot of information that I want to
   turn into nice neat comments and documentation.

2. Then your proposed synchronization system would only need to replace probably
one or two of the patches, instead of duplicating the whole patchset. I dread
having two large, overlapping patchsets competing, and hope we can avoid that mess.

3. We can run performance tests on both approaches, hopefully finding some test
cases that will highlight whether page lock is a noticeable problem here.

Or, the other thing that could happen is someone will jump in here and NAK anything
involving the page lock, based on long experience, and we'll just go straight to
your scheme anyway.  I'm sorta expecting that any minute now. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:06                                                                   ` John Hubbard
@ 2019-01-12  3:25                                                                   ` Jerome Glisse
  2019-01-12  3:25                                                                     ` Jerome Glisse
  2019-01-12 20:46                                                                     ` John Hubbard
  2019-01-14 14:54                                                                   ` Jan Kara
  2 siblings, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-12  3:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 07:06:08PM -0800, John Hubbard wrote:
> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> >> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> >>> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >>>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>>>> [...]
> >>>>
> >>>> Hi Jerome,
> >>>>
> >>>> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> >>>> in case anyone spots a disastrous conceptual error (such as the lock_page
> >>>> point), while I'm putting together the revised patchset.
> >>>>
> >>>> I've studied this carefully, and I agree that using mapcount in 
> >>>> this way is viable, *as long* as we use a lock (or a construct that looks just 
> >>>> like one: your "memory barrier, check, retry" is really just a lock) in
> >>>> order to hold off gup() while page_mkclean() is in progress. In other words,
> >>>> nothing that increments mapcount may proceed while page_mkclean() is running.
> >>>
> >>> No, increments to page->_mapcount are fine while page_mkclean() is running.
> >>> The above solution works no matter what happens, thanks to the memory
> >>> barrier. By clearing the pin flag first and reading page->_mapcount
> >>> after (and doing the reverse in GUP), we know that a racing GUP will either
> >>> have its page pin cleared but the incremented mapcount taken into account
> >>> by page_mkclean(), or page_mkclean() will miss the incremented mapcount
> >>> but will also not clear the pin flag set concurrently by any GUP.
> >>>
> >>> Here are all the possible timelines:
> >>> [T1]:
> >>> GUP on CPU0                      | page_mkclean() on CPU1
> >>>                                  |
> >>> [G2] atomic_inc(&page->mapcount) |
> >>> [G3] smp_wmb();                  |
> >>> [G4] SetPagePin(page);           |
> >>>                                 ...
> >>>                                  | [C1] pined = TestClearPagePin(page);
> >>
> >> It appears that you're using the "page pin is clear" to indicate that
> >> page_mkclean() is running. The problem is, that approach leads to toggling
> >> the PagePin flag, and so an observer (other than gup or page_mkclean) will
> >> see intervals during which the PagePin flag is clear, when conceptually it
> >> should be set.
> >>
> >> Jan and other FS people, is it definitely the case that we only have to take
> >> action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
> >> Because I recall from earlier experiments that there were several places, not 
> >> just page_mkclean().
> > 
> > Yes, and it is fine for the pin flag to be temporarily unstable. Anything
> > that needs stable page content will have to lock the page, and so will have
> > to sync against any page_mkclean(); in the end, the only place where we
> > want to check the pin flag is when doing writeback, i.e. after
> > page_mkclean() while the page is still locked. If there is any other
> > place that needs to check the pin flag, it will need to lock the
> > page. But I cannot think of any other place right now.
> > 
> > 
> 
> OK. Yes, since the clearing and resetting happen under the page lock, that
> will suffice to synchronize it. That's a good point.
> 
> > [...]
> > 
> >>>> The other idea that you and Dan (and maybe others) pointed out was a debug
> >>>> option, which we'll certainly need in order to safely convert all the call
> >>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >>>> and put_user_page() can verify that the right call was made.)  That will be
> >>>> a separate patchset, as you recommended.
> >>>>
> >>>> I'll even go as far as recommending the page lock itself. I realize that this 
> >>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >>>> that this (below) has similar overhead to the notes above--but is *much* easier
> >>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >>>> then I'd recommend using another page bit to do the same thing.)
> >>>
> >>> Please, the page lock is pointless and it will not work for GUP fast. The
> >>> above scheme does work and is fine. I spent the day again thinking about
> >>> all the memory ordering and I do not see any issues.
> >>>
> >>
> >> Why is it that page lock cannot be used for gup fast, btw?
> > 
> > Well, it cannot happen within the preempt-disable section. But afterwards,
> > as a post pass before GUP_fast returns and after re-enabling preemption,
> > it is fine, just as it would be for regular GUP. But locking the page for
> > GUP is also likely to slow down some workloads (with direct-IO).
> > 
> 
> Right, and so to the crux of the matter: taking an uncontended page lock involves
> pretty much the same set of operations that your approach does. (If gup ends up
> contended with the page lock for other reasons than these paths, that seems
> surprising.) I'd expect very similar performance.
> 
> But the page lock approach leads to really dramatically simpler code (and code
> reviews, let's not forget). Any objection to my going that direction, and keeping
> this idea as a Plan B? I think the next step will be, once again, to gather some
> performance metrics, so maybe that will help us decide.

There are already workloads that suffer from the page lock, so adding more
code that needs it will only worsen those situations. I guess I will do a
patchset with my solution, as it is definitely lighter weight than having to
take the page lock.

Cheers,
Jérôme
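[As a reader's aid: the accounting idea this subthread builds on can be
modeled in portable C11. This is a minimal sketch of the concept only, not
kernel code; the struct and function names are invented for illustration. A
GUP pin is folded into page->_mapcount as an extra reference, so a page is
"maybe DMA-pinned" whenever its mapcount exceeds the number of real mappings
found by the rmap walk.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model only; names are hypothetical, not the kernel API. */
struct page_model {
    atomic_int mapcount;    /* stands in for page->_mapcount */
};

/* GUP side: each pin adds one extra mapcount reference. */
static void pin_user_page(struct page_model *p)
{
    atomic_fetch_add_explicit(&p->mapcount, 1, memory_order_relaxed);
}

/* put_user_page() side: drop the extra reference. */
static void unpin_user_page(struct page_model *p)
{
    atomic_fetch_sub_explicit(&p->mapcount, 1, memory_order_relaxed);
}

/* Consumer side: the page may be pinned iff mapcount exceeds the
 * number of real mappings counted by the rmap walk. */
static bool maybe_dma_pinned(struct page_model *p, int real_mappings)
{
    return atomic_load_explicit(&p->mapcount, memory_order_relaxed)
            > real_mappings;
}
```

The whole debate in this thread is about how to make the comparison in
`maybe_dma_pinned()` race-free against concurrent pins; the barrier scheme
and the page-lock scheme are two answers to that.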

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
  2019-01-12  2:46                                                               ` Jerome Glisse
@ 2019-01-12  3:14                                                               ` Jerome Glisse
  2019-01-12  3:14                                                                 ` Jerome Glisse
  2 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2019-01-12  3:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> [...]
> >>
> >> Hi Jerome,
> >>
> >> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> >> in case anyone spots a disastrous conceptual error (such as the lock_page
> >> point), while I'm putting together the revised patchset.
> >>
> >> I've studied this carefully, and I agree that using mapcount in 
> >> this way is viable, *as long* as we use a lock (or a construct that looks just 
> >> like one: your "memory barrier, check, retry" is really just a lock) in
> >> order to hold off gup() while page_mkclean() is in progress. In other words,
> >> nothing that increments mapcount may proceed while page_mkclean() is running.
> > 
> > No, increments to page->_mapcount are fine while page_mkclean() is running.
> > The above solution works no matter what happens, thanks to the memory
> > barrier. By clearing the pin flag first and reading page->_mapcount
> > after (and doing the reverse in GUP), we know that a racing GUP will either
> > have its page pin cleared but the incremented mapcount taken into account
> > by page_mkclean(), or page_mkclean() will miss the incremented mapcount
> > but will also not clear the pin flag set concurrently by any GUP.
> > 
> > Here are all the possible timelines:
> > [T1]:
> > GUP on CPU0                      | page_mkclean() on CPU1
> >                                  |
> > [G2] atomic_inc(&page->mapcount) |
> > [G3] smp_wmb();                  |
> > [G4] SetPagePin(page);           |
> >                                 ...
> >                                  | [C1] pined = TestClearPagePin(page);
> 
> It appears that you're using the "page pin is clear" to indicate that
> page_mkclean() is running. The problem is, that approach leads to toggling
> the PagePin flag, and so an observer (other than gup or page_mkclean) will
> see intervals during which the PagePin flag is clear, when conceptually it
> should be set.

Also, I forgot to stress that I am not using the pin flag to report that
page_mkclean() is running; I am clearing it first because clearing that bit
is the part that is racy. If we clear it first, then read the map and pin
counts, and then count the number of real mappings, we get a proper
ordering: we will always detect pinned pages and properly restore the pin
flag at the end of page_mkclean().

In fact, GUP and PUP never need to check whether the flag is clear. The
check in GUP in my pseudo-code is an optimization for the write-back
ordering (no ordering is needed if the pin flag was already set before the
current GUP).

Cheers,
Jérôme
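[As a reader's aid: the clear-first ordering described above can be modeled
in portable C11, with fences standing in for smp_wmb()/smp_mb(). This is a
sketch of the proposal only, not kernel code; all names are invented, and
the single-threaded assertions below exercise the bookkeeping, not the race
itself.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model: GUP increments the mapcount, issues a write
 * barrier, then sets the pin flag; page_mkclean() clears the pin flag
 * first, issues a full barrier, then reads the mapcount and restores
 * the flag if the page turns out to be pinned. */
struct page_model {
    atomic_int  mapcount;   /* real mappings + pins, as proposed */
    atomic_bool pinned;     /* the proposed PagePin bit */
};

/* GUP side: [G2] inc mapcount, [G3] smp_wmb(), [G4] SetPagePin(). */
static void gup_pin(struct page_model *p)
{
    atomic_fetch_add_explicit(&p->mapcount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);      /* smp_wmb() */
    atomic_store_explicit(&p->pinned, true, memory_order_relaxed);
}

/* page_mkclean() side: [C1] TestClearPagePin(), full barrier, then
 * compare mapcount with the real mapping count; restore the pin flag
 * if the page is pinned. Runs with the page locked, which is what
 * keeps the clear/restore cycle invisible to other flag readers. */
static bool mkclean_check(struct page_model *p, int real_mappings)
{
    bool was_pinned = atomic_exchange_explicit(&p->pinned, false,
                                               memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
    int mc = atomic_load_explicit(&p->mapcount, memory_order_relaxed);
    bool pinned = was_pinned || mc > real_mappings;
    if (pinned)
        atomic_store_explicit(&p->pinned, true, memory_order_relaxed);
    return pinned;
}
```

Because the two sides touch `pinned` and `mapcount` in opposite orders
around their barriers, a racing GUP is caught either via the flag it set or
via the mapcount it incremented, which is the whole point of the scheme.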

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:46                                                               ` Jerome Glisse
  2019-01-12  2:46                                                                 ` Jerome Glisse
@ 2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:06                                                                   ` John Hubbard
                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-12  3:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 6:46 PM, Jerome Glisse wrote:
> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>> On 1/11/19 6:02 PM, Jerome Glisse wrote:
>>> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
>>>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
>>>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>>>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>> [...]
>>>>
>>>> Hi Jerome,
>>>>
>>>> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
>>>> in case anyone spots a disastrous conceptual error (such as the lock_page
>>>> point), while I'm putting together the revised patchset.
>>>>
>>>> I've studied this carefully, and I agree that using mapcount in 
>>>> this way is viable, *as long* as we use a lock (or a construct that looks just 
>>>> like one: your "memory barrier, check, retry" is really just a lock) in
>>>> order to hold off gup() while page_mkclean() is in progress. In other words,
>>>> nothing that increments mapcount may proceed while page_mkclean() is running.
>>>
>>> No, increments to page->_mapcount are fine while page_mkclean() is running.
>>> The above solution works no matter what happens, thanks to the memory
>>> barrier. By clearing the pin flag first and reading page->_mapcount
>>> after (and doing the reverse in GUP), we know that a racing GUP will either
>>> have its page pin cleared but the incremented mapcount taken into account
>>> by page_mkclean(), or page_mkclean() will miss the incremented mapcount
>>> but will also not clear the pin flag set concurrently by any GUP.
>>>
>>> Here are all the possible timelines:
>>> [T1]:
>>> GUP on CPU0                      | page_mkclean() on CPU1
>>>                                  |
>>> [G2] atomic_inc(&page->mapcount) |
>>> [G3] smp_wmb();                  |
>>> [G4] SetPagePin(page);           |
>>>                                 ...
>>>                                  | [C1] pined = TestClearPagePin(page);
>>
>> It appears that you're using the "page pin is clear" to indicate that
>> page_mkclean() is running. The problem is, that approach leads to toggling
>> the PagePin flag, and so an observer (other than gup or page_mkclean) will
>> see intervals during which the PagePin flag is clear, when conceptually it
>> should be set.
>>
>> Jan and other FS people, is it definitely the case that we only have to take
>> action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
>> Because I recall from earlier experiments that there were several places, not 
>> just page_mkclean().
> 
> Yes, and it is fine for the pin flag to be temporarily unstable. Anything
> that needs stable page content will have to lock the page, and so will have
> to sync against any page_mkclean(); in the end, the only place where we
> want to check the pin flag is when doing writeback, i.e. after
> page_mkclean() while the page is still locked. If there is any other
> place that needs to check the pin flag, it will need to lock the
> page. But I cannot think of any other place right now.
> 
> 

OK. Yes, since the clearing and resetting happen under the page lock, that
will suffice to synchronize it. That's a good point.

> [...]
> 
>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>> option, which we'll certainly need in order to safely convert all the call
>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>> a separate patchset, as you recommended.
>>>>
>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>> then I'd recommend using another page bit to do the same thing.)
>>>
>>> Please, the page lock is pointless and it will not work for GUP fast. The
>>> above scheme does work and is fine. I spent the day again thinking about
>>> all the memory ordering and I do not see any issues.
>>>
>>
>> Why is it that page lock cannot be used for gup fast, btw?
> 
> Well, it cannot happen within the preempt-disable section. But afterwards,
> as a post pass before GUP_fast returns and after re-enabling preemption,
> it is fine, just as it would be for regular GUP. But locking the page for
> GUP is also likely to slow down some workloads (with direct-IO).
> 

Right, and so to the crux of the matter: taking an uncontended page lock involves
pretty much the same set of operations that your approach does. (If gup ends up
contended with the page lock for other reasons than these paths, that seems
surprising.) I'd expect very similar performance.

But the page lock approach leads to really dramatically simpler code (and code
reviews, let's not forget). Any objection to my going that direction, and keeping
this idea as a Plan B? I think the next step will be, once again, to gather some
performance metrics, so maybe that will help us decide.


thanks,
-- 
John Hubbard
NVIDIA
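
[As a reader's aid: the competing "Plan A" above, where pin accounting and
page_mkclean() simply serialize on the page lock, can be sketched in C11 as
well. This is illustrative only, not kernel code; the page lock is modeled
as a trivial spinlock and all names are invented.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model: with the lock held on both sides, the counts are
 * stable and no barriers or retry logic are needed. */
struct page_model {
    atomic_flag lock;       /* stands in for the page lock */
    int mapcount;           /* stable while the lock is held */
    int real_mappings;
};

static void page_lock(struct page_model *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock,
                                             memory_order_acquire))
        ;   /* spin */
}

static void page_unlock(struct page_model *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* GUP side: take the page lock around the pin accounting. */
static void gup_pin(struct page_model *p)
{
    page_lock(p);
    p->mapcount++;          /* a pin is an extra mapcount reference */
    page_unlock(p);
}

/* page_mkclean() side: under the lock, the check is a plain compare. */
static bool mkclean_sees_pin(struct page_model *p)
{
    page_lock(p);
    bool pinned = p->mapcount > p->real_mappings;
    page_unlock(p);
    return pinned;
}
```

The trade-off debated above is exactly this: the locked version is much
easier to review, but every gup()/put_user_page() now contends on the page
lock, which the barrier-based scheme avoids.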

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
@ 2019-01-12  2:46                                                               ` Jerome Glisse
  2019-01-12  2:46                                                                 ` Jerome Glisse
  2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:14                                                               ` Jerome Glisse
  2 siblings, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> [...]
> >>
> >> Hi Jerome,
> >>
> >> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> >> in case anyone spots a disastrous conceptual error (such as the lock_page
> >> point), while I'm putting together the revised patchset.
> >>
> >> I've studied this carefully, and I agree that using mapcount in 
> >> this way is viable, *as long* as we use a lock (or a construct that looks just 
> >> like one: your "memory barrier, check, retry" is really just a lock) in
> >> order to hold off gup() while page_mkclean() is in progress. In other words,
> >> nothing that increments mapcount may proceed while page_mkclean() is running.
> > 
> > No, increments to page->_mapcount are fine while page_mkclean() is running.
> > The above solution does work no matter what happens, thanks to the memory
> > barrier. By clearing the pin flag first and reading page->_mapcount
> > after (and doing the reverse in GUP), we know that a racing GUP will either
> > have its pin flag cleared but its incremented mapcount taken into account by
> > page_mkclean(), or page_mkclean() will miss the incremented mapcount but
> > will also not clear the pin flag set concurrently by any GUP.
> > 
> > Here are all the possible timelines:
> > [T1]:
> > GUP on CPU0                      | page_mkclean() on CPU1
> >                                  |
> > [G2] atomic_inc(&page->mapcount) |
> > [G3] smp_wmb();                  |
> > [G4] SetPagePin(page);           |
> >                                 ...
> >                                  | [C1] pined = TestClearPagePin(page);
> 
> It appears that you're using the "page pin is clear" to indicate that
> page_mkclean() is running. The problem is, that approach leads to toggling
> the PagePin flag, and so an observer (other than gup or page_mkclean) will
> see intervals during which the PagePin flag is clear, when conceptually it
> should be set.
> 
> Jan and other FS people, is it definitely the case that we only have to take
> action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
> Because I recall from earlier experiments that there were several places, not 
> just page_mkclean().

Yes, and it is fine to temporarily have the pin flag unstable. Anything
that needs stable page content will have to lock the page, so it will have
to sync against any page_mkclean(), and in the end the only place where
we want to check the pin flag is when doing writeback, i.e. after
page_mkclean() while the page is still locked. If there is any other
place that needs to check the pin flag then it will need to lock the
page. But I can not think of any other place right now.


[...]

> >> The other idea that you and Dan (and maybe others) pointed out was a debug
> >> option, which we'll certainly need in order to safely convert all the call
> >> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >> and put_user_page() can verify that the right call was made.)  That will be
> >> a separate patchset, as you recommended.
> >>
> >> I'll even go as far as recommending the page lock itself. I realize that this 
> >> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >> that this (below) has similar overhead to the notes above--but is *much* easier
> >> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >> then I'd recommend using another page bit to do the same thing.)
> > 
> > Please, the page lock is pointless and it will not work for GUP fast. The
> > above scheme does work and is fine. I spent the day again thinking about all
> > the memory ordering and I do not see any issues.
> > 
> 
> Why is it that page lock cannot be used for gup fast, btw?

Well, it can not happen within the preempt-disable section. But afterwards,
as a post pass before GUP_fast returns and after re-enabling preemption, it
is fine, just as it would be for regular GUP. But locking the page for GUP
is also likely to slow down some workloads (with direct-IO).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:02                                                           ` Jerome Glisse
  2019-01-12  2:02                                                             ` Jerome Glisse
@ 2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
                                                                                 ` (2 more replies)
  1 sibling, 3 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-12  2:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 6:02 PM, Jerome Glisse wrote:
> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>> [...]
>>
>> Hi Jerome,
>>
>> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
>> in case anyone spots a disastrous conceptual error (such as the lock_page
>> point), while I'm putting together the revised patchset.
>>
>> I've studied this carefully, and I agree that using mapcount in 
>> this way is viable, *as long* as we use a lock (or a construct that looks just 
>> like one: your "memory barrier, check, retry" is really just a lock) in
>> order to hold off gup() while page_mkclean() is in progress. In other words,
>> nothing that increments mapcount may proceed while page_mkclean() is running.
> 
> No, increments to page->_mapcount are fine while page_mkclean() is running.
> The above solution does work no matter what happens, thanks to the memory
> barrier. By clearing the pin flag first and reading page->_mapcount
> after (and doing the reverse in GUP), we know that a racing GUP will either
> have its pin flag cleared but its incremented mapcount taken into account by
> page_mkclean(), or page_mkclean() will miss the incremented mapcount but
> will also not clear the pin flag set concurrently by any GUP.
> 
> Here are all the possible timelines:
> [T1]:
> GUP on CPU0                      | page_mkclean() on CPU1
>                                  |
> [G2] atomic_inc(&page->mapcount) |
> [G3] smp_wmb();                  |
> [G4] SetPagePin(page);           |
>                                 ...
>                                  | [C1] pined = TestClearPagePin(page);

It appears that you're using the "page pin is clear" to indicate that
page_mkclean() is running. The problem is, that approach leads to toggling
the PagePin flag, and so an observer (other than gup or page_mkclean) will
see intervals during which the PagePin flag is clear, when conceptually it
should be set.

Jan and other FS people, is it definitely the case that we only have to take
action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
Because I recall from earlier experiments that there were several places, not 
just page_mkclean().

One more quick question below...

>                                  | [C2] smp_mb();
>                                  | [C3] map_and_pin_count =
>                                  |        atomic_read(&page->mapcount)
> 
> It is fine because page_mkclean() will read the correct page->mapcount,
> which includes the GUP that happened before [C1].
> 
> 
> [T2]:
> GUP on CPU0                      | page_mkclean() on CPU1
>                                  |
>                                  | [C1] pined = TestClearPagePin(page);
>                                  | [C2] smp_mb();
>                                  | [C3] map_and_pin_count =
>                                  |        atomic_read(&page->mapcount)
>                                 ...
> [G2] atomic_inc(&page->mapcount) |
> [G3] smp_wmb();                  |
> [G4] SetPagePin(page);           |
> 
> It is fine because [G4] sets the pin flag, so it does not matter that [C3]
> missed the mapcount increase from the GUP.
> 
> 
> [T3]:
> GUP on CPU0                      | page_mkclean() on CPU1
> [G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);
> 
> No matter which CPU ordering we get, i.e. either:
>     - [G4] is overwritten by [C1]; in that case [C3] will see the mapcount
>       that was incremented by [G2], so we will see map_count < map_and_pin_count
>       and we will set the pin flag again at the end of page_mkclean()
>     - [C1] is overwritten by [G4]; in that case the pin flag is set, and thus
>       it does not matter that [C3] also saw the mapcount that was incremented
>       by [G2]
> 
> 
> This is totally race free, i.e. at the end of page_mkclean() the pin flag
> will be set for all pages that are pinned, and for some pages that are no
> longer pinned. What matters is that there are no false negatives.
> 
> 
>> I especially am intrigued by your idea about a fuzzy count that allows
>> false positives but no false negatives. To do that, we need to put a hard
>> lock protecting the increment operation, but we can be loose (no lock) on
>> decrement. That turns out to be a perfect match for the problem here, because
>> as I recall from my earlier efforts, put_user_page() must *not* take locks--
>> and that's where we just decrement. Sweet! See below.
> 
> You do not need a lock; locks are easier to think with, but they are not
> always necessary, and in this case we do not need any lock. We can happily
> have any number of concurrent GUP, PUP, or pte zapping operations. The worst
> case is a false positive, i.e. reporting a page as pinned while it has just
> been unpinned concurrently by a PUP.
> 
>> The other idea that you and Dan (and maybe others) pointed out was a debug
>> option, which we'll certainly need in order to safely convert all the call
>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>> and put_user_page() can verify that the right call was made.)  That will be
>> a separate patchset, as you recommended.
>>
>> I'll even go as far as recommending the page lock itself. I realize that this 
>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>> that this (below) has similar overhead to the notes above--but is *much* easier
>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>> then I'd recommend using another page bit to do the same thing.)
> 
> Please, the page lock is pointless and it will not work for GUP fast. The
> above scheme does work and is fine. I spent the day again thinking about all
> the memory ordering and I do not see any issues.
> 

Why is it that page lock cannot be used for gup fast, btw?

> 
>> (Note that memory barriers will simply be built into the various Set|Clear|Read
>> operations, as is common with a few other page flags.)
>>
>> page_mkclean():
>> ===============
>> lock_page()
>>     page_mkclean()
>>         Count actual mappings
>>             if(mappings == atomic_read(&page->_mapcount))
>>                 ClearPageDmaPinned 
>>
>> gup_fast():
>> ===========
>> for each page {
>>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
>>
>>     atomic_inc(&page->_mapcount)
>>     SetPageDmaPinned()
>>
>>     /* details of gup vs gup_fast not shown here... */
>>
>>
>> put_user_page():
>> ================
>>     atomic_dec(&page->_mapcount); /* no locking! */
>>    
>>
>> try_to_unmap() and other consumers of the PageDmaPinned flag:
>> =============================================================
>> lock_page() /* not required, but already done by existing callers */
>>     if(PageDmaPinned) {
>>         ...take appropriate action /* future patchsets */
> 
> We can not block try_to_unmap() on a pinned page. What we want to block is
> the fs using a different page for the same file offset that the original
> pinned page was pinned at (modulo truncate, which we should not block).
> Everything else must keep working as if there were no pin. We can not fix
> that; drivers doing long-term GUP and not abiding by mmu notifiers are
> hopelessly broken in front of many regular syscalls (mremap, truncate,
> splice, ...), and we can not block those syscalls or fail them; doing so
> would mean breaking applications in a bad way.
> 
> The only thing we should do is avoid fs corruption and bugs due to
> dirtying pages after the fs believes they have been cleaned.
> 
> 
>> page freeing:
>> ============
>> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */
> 
> Yeah, this needs to happen when we sanitize the flags of a freed page.
> 


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  1:04                                                         ` John Hubbard
  2019-01-12  1:04                                                           ` John Hubbard
@ 2019-01-12  2:02                                                           ` Jerome Glisse
  2019-01-12  2:02                                                             ` Jerome Glisse
  2019-01-12  2:38                                                             ` John Hubbard
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> > On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > 
> > [...]
> > 
> >>>>> Now page_mkclean:
> >>>>>
> >>>>> int page_mkclean(struct page *page)
> >>>>> {
> >>>>>     int cleaned = 0;
> >>>>> +   int real_mapcount = 0;
> >>>>>     struct address_space *mapping;
> >>>>>     struct rmap_walk_control rwc = {
> >>>>>         .arg = (void *)&cleaned,
> >>>>>         .rmap_one = page_mkclean_one,
> >>>>>         .invalid_vma = invalid_mkclean_vma,
> >>>>> +       .mapcount = &real_mapcount,
> >>>>>     };
> >>>>> +   int mapcount1, mapcount2;
> >>>>>
> >>>>>     BUG_ON(!PageLocked(page));
> >>>>>
> >>>>>     if (!page_mapped(page))
> >>>>>         return 0;
> >>>>>
> >>>>>     mapping = page_mapping(page);
> >>>>>     if (!mapping)
> >>>>>         return 0;
> >>>>>
> >>>>> +   mapcount1 = page_mapcount(page);
> >>>>>     // rmap_walk need to change to count mapping and return value
> >>>>>     // in .mapcount easy one
> >>>>>     rmap_walk(page, &rwc);
> >>>>
> >>>> So what prevents GUP_fast() to grab reference here and the test below would
> >>>> think the page is not pinned? Or do you assume that every page_mkclean()
> >>>> call will be protected by PageWriteback (currently it is not) so that
> >>>> GUP_fast() blocks / bails out?
> >>
> >> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> >> for each page" question (ignoring, for now, what to actually *do* in response to 
> >> that flag being set):
> >>
> >> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> >> This is probably less troubling than the next point, but it does undermine all the 
> >> complicated schemes involving PageWriteback, that try to synchronize gup() with
> >> page_mkclean().
> >>
> >> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> >> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> >> can always jump in and increment the mapcount, while page_mkclean is in the middle
> >> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> >>
> >> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> > 
> > Both points are addressed by the solution at the end of this email.
> > 
> >>>
> >>> So GUP_fast() becomes:
> >>>
> >>> GUP_fast_existing() { ... }
> >>> GUP_fast()
> >>> {
> >>>     GUP_fast_existing();
> >>>
> >>>     for (i = 0; i < npages; ++i) {
> >>>         if (PageWriteback(pages[i])) {
> >>>             // need to force slow path for this page
> >>>         } else {
> >>>             SetPageDmaPinned(pages[i]);
> >>>             atomic_inc(pages[i]->mapcount);
> >>>         }
> >>>     }
> >>> }
> >>>
> >>> This is a minor slowdown for GUP_fast, and it takes care of a
> >>> writeback race on behalf of the caller. It means that page_mkclean()
> >>> cannot see a mapcount value that increases; that simplifies things,
> >>> and we can relax it later. Note that what this is doing is making
> >>> sure that GUP_fast never gets lucky :) ie never GUPs a page that is
> >>> in the process of being written back but has not yet had its pte
> >>> updated to reflect that.
> >>>
> >>>
> >>>> But I think that detecting pinned pages with small false positive rate is
> >>>> OK. The extra page bouncing will cost some performance but if it is rare,
> >>>> then we are OK. So I think we can go for the simple version of detecting
> >>>> pinned pages as you mentioned in some earlier email. We just have to be
> >>>> sure there are no false negatives.
> >>>
> >>
> >> Agree with that sentiment, but there are still false negatives and I'm not
> >> yet seeing any solutions for that.
> > 
> > So here is the solution:
> > 
> > 
> > Is a page pinned? With no false negatives:
> > =======================================
> > 
> > get_user_page*() aka GUP:
> >      if (!PageAnon(page)) {
> >         bool write_back = PageWriteback(page);
> >         bool page_is_pin = PagePin(page);
> >         if (write_back && !page_is_pin) {
> >             /* Wait for write back and re-try GUP */
> >             ...
> >             goto retry;
> >         }
> > [G1]    smp_rmb();
> > [G2]    atomic_inc(&page->_mapcount)
> > [G3]    smp_wmb();
> > [G4]    SetPagePin(page);
> > [G5]    smp_wmb();
> > [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
> >             /* Back-off as write back might have miss us */
> >             atomic_dec(&page->_mapcount);
> >             /* Wait for write back and re-try GUP */
> >             ...
> >             goto retry;
> >         }
> >      }
> > 
> > put_user_page() aka PUP:
> > [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> > [P2] put_page(page);
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page);
> > [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> > 
> > So with the above code we store the map and pin count inside the
> > struct page _mapcount field. The idea is that we can count the number
> > of page table entries that point to the page when reverse walking all
> > the page mappings in page_mkclean() [C4].
> > 
> > The issue is that GUP, PUP and page table entry zapping can all run
> > concurrently with page_mkclean() and thus we can not get the real
> > map and pin count and the real map count at a given point in time
> > ([C5] for instance in the above). However we only care about avoiding
> > false negatives, ie we do not want to report a page as unpinned if in
> > fact it is pinned (it has an active GUP). Avoiding false positives
> > would be nice but it would need more heavyweight synchronization
> > within GUP and PUP (we can mitigate it, see the section on that below).
> > 
> > With the above scheme a page is _not_ pinned if and only if we
> > have real_map_count == real_map_and_pin_count at a given point in
> > time. In the above pseudo code the page is locked within page_mkclean()
> > thus no new page table entry can be added and thus the number of page
> > mappings can only go down (because of concurrent pte zapping). So no
> > matter what happens at [C5] we have map_count <= real_map_count.
> > 
> > At [C3] we have two cases to consider:
> >  [R1] A concurrent GUP after [C3] then we do not care what happens at
> >       [C5] as the GUP would already have set the page pin flag. If it
> >       raced before [C3] at [C1] with TestClearPagePin() then we would
> >       have the map_and_pin_count reflect the GUP thanks to the memory
> >       barrier [G3] and [C2].
> >  [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
> >       worry about and thus the real_map_and_pin_count can only go down.
> >       So because we first snap shot that value at [C5] we have:
> >       real_map_and_pin_count <= map_and_pin_count.
> > 
> >       So at [C5] we end up with map_count <= real_map_count and with
> >       real_map_and_pin_count <= map_and_pin_count, but we also always
> >       have real_map_count <= real_map_and_pin_count, so it means we
> >       are in an a <= b <= c <= d scenario, and if a == d then b == c.
> >       So at [C5] if map_count == map_and_pin_count then we know for
> >       sure that real_map_count == real_map_and_pin_count and if that
> >       is the case then the page is no longer pinned. So at [C5] we
> >       will never miss a pinned page (no false negative).
> > 
> >       Another way to word this is that we always under-estimate the
> >       real map count and over-estimate the map and pin count, and
> >       thus we can never have a false negative (map count equal to map
> >       and pin count while in fact the real map count is smaller than
> >       the real map and pin count).
> > 
> > 
> > PageWriteback() test and ordering with page_mkclean()
> > =====================================================
> > 
> > In GUP we test the page writeback flag to avoid pinning a page that
> > is undergoing write back. That flag is set after page_mkclean() so
> > the filesystem code that will check for the pin flag needs some memory
> > barrier:
> >     int __test_set_page_writeback(struct page *page, bool keep_write,
> > +                                 bool *use_bounce_page)
> >     {
> >         ...
> >   [T1]  TestSetPageWriteback(page);
> > + [T2]  smp_wmb();
> > + [T3]  *use_bounce_page = PagePin(page);
> >         ...
> >     }
> > 
> > That way if there is a concurrent GUP we either have:
> >     [R1] GUP sees the write back flag set before [G1], so it backs off.
> >     [R2] GUP sees no write back before [G1]; here either GUP sees the
> >          write back flag at [G6], or [T3] sees the pin flag thanks to
> >          the memory barriers [G5] and [T2].
> > 
> > So in all cases we never miss a pin or a write back.
> > 
> > 
> > Mitigate false positive:
> > ========================
> > 
> > If false positives are ever an issue we can improve the situation and
> > properly account for concurrent pte zapping with the following changes:
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page, &page_mkclean_one());
> > [C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
> > [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> > [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> > [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
> >      }
> > 
> > page_map_count():
> > [M1] if (pte_valid(pte)) { map_count++;
> >      } else if (pte_special_zap(pte)) {
> > [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> > [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
> >      }
> > 
> > And pte zapping of a file-backed page will write a special pte entry
> > which holds the page map and pin count value at the time the pte is
> > zapped. Also page_mkclean_one() unconditionally replaces those special
> > ptes with pte none and ignores them altogether. We only want to detect
> > pte zapping that happens after [C6] and before [C7] is done.
> > 
> > With [M3] we are counting all page table entries that have been zapped
> > after the map_and_pin_count value we read at [C6]. Again we have two cases:
> >  [R1] A concurrent GUP after [C6] then we do not care what happens
> >       at [C8] as the GUP would already have set the page pin flag.
> >  [R2] No concurrent GUP then we only have concurrent PUP to worry
> >       about. If they happen before [C6] they are included in the [C6]
> >       map_and_pin_count value. If after [C6] then we might miss a
> >       page that is no longer pinned, ie we are over-estimating the
> >       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
> >       at [C8]). So no false negatives, just false positives.
> > 
> > Here we just get the accurate real_map_count at [C6] time, so if the
> > page was no longer pinned at [C6] time we will correctly detect it and
> > not set the flag at [C8]. If there is any concurrent GUP, that GUP
> > will set the flag properly.
> > 
> > There is one last thing to note about the above code: the MASK in [M3].
> > For the special pte entry we might not have enough bits to store the
> > whole map and pin count value (on 32 bit arch). So we might expose
> > ourselves to wrap around. Again we do not care about the [R1] case as
> > any concurrent GUP will set the pin flag. So we only care if the only
> > thing happening concurrently is either PUP or pte zapping. In both
> > cases it means that the map and pin count is going down, so if there
> > is a wrap around sometime within [C7]/page_map_count() we have:
> >   [t0] page_map_count() executed on some pte
> >   [t1] page_map_count() executed on another pte after [t0]
> > With:
> >     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> > While in fact:
> >     map_count_t0 > map_count_t1
> > 
> > So if that happens then we will under-estimate the map count, ie we
> > will ignore some of the concurrent pte zapping and not count it.
> > So again we are only exposing ourselves to false positives, not false
> > negatives.
> > 
> > 
> > ---------------------------------------------------------------------
> > 
> > 
> > Hope this proves that this solution does work. The false positives are
> > something that i believe is acceptable. We will get them only when
> > they race with GUP or PUP. For a racing GUP it is safer to have a
> > false positive. For a racing PUP it would be nice to catch it, but
> > hey, sometimes you just get unlucky.
> > 
> > Note that any other solution will also suffer from the false positive
> > situation, because either way you are testing the page pin status
> > at a given point in time, so it can always race with a PUP. The
> > only difference between the solutions is how long the false positive
> > race window is.
> > 
> 
> Hi Jerome,
> 
> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> in case anyone spots a disastrous conceptual error (such as the lock_page
> point), while I'm putting together the revised patchset.
> 
> I've studied this carefully, and I agree that using mapcount in 
> this way is viable, *as long* as we use a lock (or a construct that looks just 
> like one: your "memory barrier, check, retry" is really just a lock) in
> order to hold off gup() while page_mkclean() is in progress. In other words,
> nothing that increments mapcount may proceed while page_mkclean() is running.

No, increments to page->_mapcount are fine while page_mkclean() is running.
The above solution does work no matter what happens, thanks to the memory
barriers. By clearing the pin flag first and reading page->_mapcount
after (and doing the reverse in GUP) we know that a racing GUP will either
have its pin flag cleared but the incremented mapcount taken into account by
page_mkclean(), or page_mkclean() will miss the incremented mapcount but
it will also not clear the pin flag set concurrently by any GUP.

Here are all the possible timelines:
[T1]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |
                                ...
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)

It is fine because page_mkclean() will read the correct page->mapcount,
which includes the GUP that happens before [C1].


[T2]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)
                                ...
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |

It is fine because [G4] sets the pin flag, so it does not matter that [C3]
missed the mapcount increase from the GUP.


[T3]:
GUP on CPU0                      | page_mkclean() on CPU1
[G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);

No matter which CPU ordering we get, ie either:
    - [G4] is overwritten by [C1]; in that case [C3] will see the mapcount
      that was incremented by [G2], so we will have map_count <
      map_and_pin_count and we will set the pin flag again at the end of
      page_mkclean()
    - [C1] is overwritten by [G4]; in that case the pin flag is set, and
      thus it does not matter that [C3] also sees the mapcount that was
      incremented by [G2]


This is totally race free, ie at the end of page_mkclean() the pin flag will
be set for all pages that are pinned, and for some pages that are no longer
pinned. What matters is that there are no false negatives.


> I especially am intrigued by your idea about a fuzzy count that allows
> false positives but no false negatives. To do that, we need to put a hard
> lock protecting the increment operation, but we can be loose (no lock) on
> decrement. That turns out to be a perfect match for the problem here, because
> as I recall from my earlier efforts, put_user_page() must *not* take locks--
> and that's where we just decrement. Sweet! See below.

You do not need a lock; locks are easier to think with but they are not
always necessary, and in this case we do not need any lock. We can happily
have any number of concurrent GUP, PUP or pte zapping. The worst case is a
false positive, ie reporting a page as pinned while it has just been
unpinned concurrently by a PUP.

> The other idea that you and Dan (and maybe others) pointed out was a debug
> option, which we'll certainly need in order to safely convert all the call
> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> and put_user_page() can verify that the right call was made.)  That will be
> a separate patchset, as you recommended.
> 
> I'll even go as far as recommending the page lock itself. I realize that this 
> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> that this (below) has similar overhead to the notes above--but is *much* easier
> to verify correct. (If the page lock is unacceptable due to being so widely used,
> then I'd recommend using another page bit to do the same thing.)

Please, the page lock is pointless and it will not work for GUP fast. The
above scheme does work and is fine. I spent the day again thinking about
all the memory ordering and i do not see any issues.


> (Note that memory barriers will simply be built into the various Set|Clear|Read
> operations, as is common with a few other page flags.)
> 
> page_mkclean():
> ===============
> lock_page()
>     page_mkclean()
>         Count actual mappings
>             if(mappings == atomic_read(&page->_mapcount))
>                 ClearPageDmaPinned 
> 
> gup_fast():
> ===========
> for each page {
>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
> 
>     atomic_inc(&page->_mapcount)
>     SetPageDmaPinned()
> 
>     /* details of gup vs gup_fast not shown here... */
> 
> 
> put_user_page():
> ================
>     atomic_dec(&page->_mapcount); /* no locking! */
>    
> 
> try_to_unmap() and other consumers of the PageDmaPinned flag:
> =============================================================
> lock_page() /* not required, but already done by existing callers */
>     if(PageDmaPinned) {
>         ...take appropriate action /* future patchsets */

We can not block try_to_unmap() on a pinned page. What we want to block is
the fs using a different page for the same file offset the original pinned
page was pinned at (modulo truncate, which we should not block). Everything
else must keep working as if there was no pin. We can not fix that: drivers
doing long term GUP and not abiding by mmu notifiers are hopelessly broken
in the face of many regular syscalls (mremap, truncate, splice, ...); we
can not block those syscalls or fail them, as doing so would mean breaking
applications in a bad way.

The only thing we should do is avoid fs corruption and bugs due to
dirtying pages after the fs believes they have been cleaned.


> page freeing:
> ============
> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */

Yeah, this needs to happen when we sanitize the flags of a freed page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:02                                                           ` Jerome Glisse
@ 2019-01-12  2:02                                                             ` Jerome Glisse
  2019-01-12  2:38                                                             ` John Hubbard
  1 sibling, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> > On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > 
> > [...]
> > 
> >>>>> Now page_mkclean:
> >>>>>
> >>>>> int page_mkclean(struct page *page)
> >>>>> {
> >>>>>     int cleaned = 0;
> >>>>> +   int real_mapcount = 0;
> >>>>>     struct address_space *mapping;
> >>>>>     struct rmap_walk_control rwc = {
> >>>>>         .arg = (void *)&cleaned,
> >>>>>         .rmap_one = page_mkclean_one,
> >>>>>         .invalid_vma = invalid_mkclean_vma,
> >>>>> +       .mapcount = &real_mapcount,
> >>>>>     };
> >>>>> +   int mapcount1, mapcount2;
> >>>>>
> >>>>>     BUG_ON(!PageLocked(page));
> >>>>>
> >>>>>     if (!page_mapped(page))
> >>>>>         return 0;
> >>>>>
> >>>>>     mapping = page_mapping(page);
> >>>>>     if (!mapping)
> >>>>>         return 0;
> >>>>>
> >>>>> +   mapcount1 = page_mapcount(page);
> >>>>>     // rmap_walk need to change to count mapping and return value
> >>>>>     // in .mapcount easy one
> >>>>>     rmap_walk(page, &rwc);
> >>>>
> >>>> So what prevents GUP_fast() to grab reference here and the test below would
> >>>> think the page is not pinned? Or do you assume that every page_mkclean()
> >>>> call will be protected by PageWriteback (currently it is not) so that
> >>>> GUP_fast() blocks / bails out?
> >>
> >> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> >> for each page" question (ignoring, for now, what to actually *do* in response to 
> >> that flag being set):
> >>
> >> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> >> This is probably less troubling than the next point, but it does undermine all the 
> >> complicated schemes involving PageWriteback, that try to synchronize gup() with
> >> page_mkclean().
> >>
> >> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> >> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> >> can always jump in and increment the mapcount, while page_mkclean is in the middle
> >> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> >>
> >> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> > 
> > Both point is address by the solution at the end of this email.
> > 
> >>>
> >>> So GUP_fast() becomes:
> >>>
> >>> GUP_fast_existing() { ... }
> >>> GUP_fast()
> >>> {
> >>>     GUP_fast_existing();
> >>>
> >>>     for (i = 0; i < npages; ++i) {
> >>>         if (PageWriteback(pages[i])) {
> >>>             // need to force slow path for this page
> >>>         } else {
> >>>             SetPageDmaPinned(pages[i]);
> >>>             atomic_inc(pages[i]->mapcount);
> >>>         }
> >>>     }
> >>> }
> >>>
> >>> This is a minor slow down for GUP fast and it takes care of a
> >>> write back race on behalf of caller. This means that page_mkclean
> >>> can not see a mapcount value that increase. This simplify thing
> >>> we can relax that. Note that what this is doing is making sure
> >>> that GUP_fast never get lucky :) ie never GUP a page that is in
> >>> the process of being write back but has not yet had its pte
> >>> updated to reflect that.
> >>>
> >>>
> >>>> But I think that detecting pinned pages with small false positive rate is
> >>>> OK. The extra page bouncing will cost some performance but if it is rare,
> >>>> then we are OK. So I think we can go for the simple version of detecting
> >>>> pinned pages as you mentioned in some earlier email. We just have to be
> >>>> sure there are no false negatives.
> >>>
> >>
> >> Agree with that sentiment, but there are still false negatives and I'm not
> >> yet seeing any solutions for that.
> > 
> > So here is the solution:
> > 
> > 
> > Is a page pin ? With no false negative:
> > =======================================
> > 
> > get_user_page*() aka GUP:
> >      if (!PageAnon(page)) {
> >         bool write_back = PageWriteback(page);
> >         bool page_is_pin = PagePin(page);
> >         if (write_back && !page_is_pin) {
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> > [G1]    smp_rmb();
> > [G2]    atomic_inc(&page->_mapcount)
> > [G3]    smp_wmb();
> > [G4]    SetPagePin(page);
> > [G5]    smp_wmb();
> > [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
> >             /* Back-off as write back might have miss us */
> >             atomic_dec(&page->_mapcount);
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> >      }
> > 
> > put_user_page() aka PUP:
> > [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> > [P2] put_page(page);
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page);
> > [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> > 
> > So with above code we store the map and pin count inside struct page
> > _mapcount field. The idea is that we can count the number of page
> > table entry that point to the page when reverse walking all the page
> > mapping in page_mkclean() [C4].
> > 
> > The issue is that GUP, PUP and page table entry zapping can all run
> > concurrently with page_mkclean() and thus we can not get the real
> > map and pin count and the real map count at a given point in time
> > ([C5] for instance in the above). However we only care about avoiding
> > false negative ie we do not want to report a page as unpin if in fact
> > it is pin (it has active GUP). Avoiding false positive would be nice
> > but it would need more heavy weight synchronization within GUP and
> > PUP (we can mitigate it see the section on that below).
> > 
> > With the above scheme a page is _not_ pin (unpin) if and only if we
> > have real_map_count == real_map_and_pin_count at a given point in
> > time. In the above pseudo code the page is lock within page_mkclean()
> > thus no new page table entry can be added and thus the number of page
> > mapping can only go down (because of conccurent pte zapping). So no
> > matter what happens at [C5] we have map_count <= real_map_count.
> > 
> > At [C3] we have two cases to consider:
> >  [R1] A concurrent GUP after [C3] then we do not care what happens at
> >       [C5] as the GUP would already have set the page pin flag. If it
> >       raced before [C3] at [C1] with TestClearPagePin() then we would
> >       have the map_and_pin_count reflect the GUP thanks to the memory
> >       barrier [G3] and [C2].
> >  [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
> >       worry about and thus the real_map_and_pin_count can only go down.
> >       So because we first snap shot that value at [C5] we have:
> >       real_map_and_pin_count <= map_and_pin_count.
> > 
> >       So at [C5] we end up with map_count <= real_map_count and with
> >       real_map_and_pin_count <= map_pin_count but we also always have
> >       real_map_count <= real_map_and_pin_count so it means we are in a
> >       a <= b <= c <= d scenario and if a == d then b == c. So at [C5]
> >       if map_count == map_pin_count then we know for sure that we have
> >       real_map_count == real_map_and_pin_count and if that is the case
> >       then the page is no longer pin. So at [C5] we will never miss a
> >       pin page (no false negative).
> > 
> >       Another way to word this is that we always under-estimate the real
> >       map count and over estimate the map and pin count and thus we can
> >       never have false negative (map count equal to map and pin count
> >       while in fact real map count is inferior to real map and pin count).
> > 
> > 
> > PageWriteback() test and ordering with page_mkclean()
> > =====================================================
> > 
> > In GUP we test for page write back flag to avoid pining a page that
> > is under going write back. That flag is set after page_mkclean() so
> > the filesystem code that will check for the pin flag need some memory
> > barrier:
> >     int __test_set_page_writeback(struct page *page, bool keep_write,
> > +                                 bool *use_bounce_page)
> >     {
> >         ...
> >   [T1]  TestSetPageWriteback(page);
> > + [T2]  smp_wmb();
> > + [T3]  *use_bounce_page = PagePin(page);
> >         ...
> >     }
> > 
> > That way if there is a concurrent GUP we either have:
> >     [R1] GUP sees the write back flag set before [G1] so it back-off
> >     [R2] GUP sees no write back before [G1] here either we have GUP
> >          that sees the write back flag at [G6] or [T3] that sees the
> >          pin flag thanks to the memory barrier [G5] and [T2].
> > 
> > So in all cases we never miss a pin or a write back.
> > 
> > 
> > Mitigate false positive:
> > ========================
> > 
> > If false positive is ever an issue we can improve the situation and to
> > properly account conccurent pte zapping with the following changes:
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page, &page_mkclean_one());
> > [C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
> > [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> > [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> > [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
> >      }
> > 
> > page_map_count():
> > [M1] if (pte_valid(pte) { map_count++; }
> >      } else if (pte_special_zap(pte)) {
> > [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> > [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
> >      }
> > 
> > And pte zapping of file back page will write a special pte entry which
> > has the page map and pin count value at the time the pte is zap. Also
> > page_mkclean_one() unconditionaly replace those special pte with pte
> > none and ignore them altogether. We only want to detect pte zapping that
> > happens after [C6] and before [C7] is done.
> > 
> > With [M3] we are counting all page table entry that have been zap after
> > the map_and_pin_count value we read at [C6]. Again we have two cases:
> >  [R1] A concurrent GUP after [C6] then we do not care what happens
> >       at [C8] as the GUP would already have set the page pin flag.
> >  [R2] No concurrent GUP then we only have concurrent PUP to worry
> >       about. If they happen before [C6] they are included in [C6]
> >       map_and_pin_count value. If after [C6] then we might miss a
> >       page that is no longer pin ie we are over estimating the
> >       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
> >       at [C8]). So no false negative just false positive.
> > 
> > Here we just get the accurate real_map_count at [C6] time so if the
> > page was no longer pin at [C6] time we will correctly detect it and
> > not set the flag at [C8]. If there is any concurrent GUP that GUP
> > would set the flag properly.
> > 
> > There is one last thing to note about above code, the MASK in [M3].
> > For special pte entry we might not have enough bits to store the
> > whole map and pin count value (on 32bits arch). So we might expose
> > ourself to wrap around. Again we do not care about [R1] case as any
> > concurrent GUP will set the pin flag. So we only care if the only
> > thing happening concurrently is either PUP or pte zapping. In both
> > case its means that the map and pin count is going down so if there
> > is a wrap around sometimes within [C7]/page_map_count() we have:
> >   [t0] page_map_count() executed on some pte
> >   [t1] page_map_count() executed on another pte after [t1]
> > With:
> >     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> > While in fact:
> >     map_count_t0 > map_count_t1
> > 
> > So if that happens then we will under-estimate the map count ie we
> > will ignore some of the concurrent pte zapping and not count them.
> > So again we are only exposing our self to false positive not false
> > negative.
> > 
> > 
> > ---------------------------------------------------------------------
> > 
> > 
> > Hope this proves that this solution does work. The false positives are
> > something that I believe is acceptable. We will get them only when
> > racing GUP or PUP. For a racing GUP it is safer to have a false
> > positive. For a racing PUP it would be nice to catch them, but hey,
> > sometimes you just get unlucky.
> > 
> > Note that any other solution will also suffer from false positive
> > situation because anyway you are testing for the page pin status
> > at a given point in time so it can always race with a PUP. So the
> > only difference with any other solution would be how long is the
> > false positive race window.
> > 
> 
> Hi Jerome,
> 
> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> in case anyone spots a disastrous conceptual error (such as the lock_page
> point), while I'm putting together the revised patchset.
> 
> I've studied this carefully, and I agree that using mapcount in 
> this way is viable, *as long* as we use a lock (or a construct that looks just 
> like one: your "memory barrier, check, retry" is really just a lock) in
> order to hold off gup() while page_mkclean() is in progress. In other words,
> nothing that increments mapcount may proceed while page_mkclean() is running.

No, increments to page->_mapcount are fine while page_mkclean() is running.
The above solution works no matter what happens, thanks to the memory
barriers. By clearing the pin flag first and reading page->_mapcount
afterwards (and doing the reverse in GUP) we know that a racing GUP will
either have its pin flag cleared but its incremented mapcount taken into
account by page_mkclean(), or page_mkclean() will miss the incremented
mapcount but then it will also not clear the pin flag set concurrently by
that GUP.

Here are all the possible timelines:
[T1]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |
                                ...
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)

It is fine because page_mkclean() will read the correct page->mapcount,
which includes the GUP that happened before [C1].


[T2]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)
                                ...
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |

It is fine because [G4] sets the pin flag, so it does not matter that [C3]
missed the mapcount increase from the GUP.


[T3]:
GUP on CPU0                      | page_mkclean() on CPU1
[G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);

No matter which CPU ordering we get, i.e. either:
    - [G4] is overwritten by [C1]: in that case [C3] will see the mapcount
      that was incremented by [G2], so we will have map_count <
      map_and_pin_count and we will set the pin flag again at the end of
      page_mkclean()
    - [C1] is overwritten by [G4]: in that case the pin flag is set, and
      thus it does not matter that [C3] also sees the mapcount that was
      incremented by [G2]


This is totally race free, i.e. at the end of page_mkclean() the pin flag
will be set for every page that is pinned, and for some pages that are no
longer pinned. What matters is that there are no false negatives.


> I especially am intrigued by your idea about a fuzzy count that allows
> false positives but no false negatives. To do that, we need to put a hard
> lock protecting the increment operation, but we can be loose (no lock) on
> decrement. That turns out to be a perfect match for the problem here, because
> as I recall from my earlier efforts, put_user_page() must *not* take locks--
> and that's where we just decrement. Sweet! See below.

You do not need a lock. Locks are easier to think with, but they are not
always necessary, and in this case we do not need any. We can happily have
any number of concurrent GUPs, PUPs or pte zappings. The worst case is a
false positive, i.e. reporting a page as pinned while it has just been
unpinned concurrently by a PUP.

> The other idea that you and Dan (and maybe others) pointed out was a debug
> option, which we'll certainly need in order to safely convert all the call
> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> and put_user_page() can verify that the right call was made.)  That will be
> a separate patchset, as you recommended.
> 
> I'll even go as far as recommending the page lock itself. I realize that this 
> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> that this (below) has similar overhead to the notes above--but is *much* easier
> to verify correct. (If the page lock is unacceptable due to being so widely used,
> then I'd recommend using another page bit to do the same thing.)

Please, the page lock is pointless here and it will not work for GUP fast.
The above scheme does work and is fine. I spent the day thinking again
about all the memory ordering and I do not see any issues.


> (Note that memory barriers will simply be built into the various Set|Clear|Read
> operations, as is common with a few other page flags.)
> 
> page_mkclean():
> ===============
> lock_page()
>     page_mkclean()
>         Count actual mappings
>             if(mappings == atomic_read(&page->_mapcount))
>                 ClearPageDmaPinned 
> 
> gup_fast():
> ===========
> for each page {
>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
> 
>     atomic_inc(&page->_mapcount)
>     SetPageDmaPinned()
> 
>     /* details of gup vs gup_fast not shown here... */
> 
> 
> put_user_page():
> ================
>     atomic_dec(&page->_mapcount); /* no locking! */
>    
> 
> try_to_unmap() and other consumers of the PageDmaPinned flag:
> =============================================================
> lock_page() /* not required, but already done by existing callers */
>     if(PageDmaPinned) {
>         ...take appropriate action /* future patchsets */

We can not block try_to_unmap() on a pinned page. What we want to block is
the fs using a different page for the same file offset the original pinned
page was pinned at (modulo truncate, which we should not block). Everything
else must keep working as if there was no pin. We can not fix that: drivers
doing long term GUP and not abiding by mmu notifiers are hopelessly broken
in the face of many regular syscalls (mremap, truncate, splice, ...). We
can not block those syscalls or fail them; doing so would mean breaking
applications in a bad way.

The only thing we should do is avoid fs corruption and bugs due to
dirtying a page after the fs believes it has been cleaned.


> page freeing:
> ============
> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */

Yeah, this needs to happen when we sanitize the flags of a freed page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11 16:51                                                       ` Jerome Glisse
  2019-01-11 16:51                                                         ` Jerome Glisse
@ 2019-01-12  1:04                                                         ` John Hubbard
  2019-01-12  1:04                                                           ` John Hubbard
  2019-01-12  2:02                                                           ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-12  1:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 8:51 AM, Jerome Glisse wrote:
> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> 
> [...]
> 
>>>>> Now page_mkclean:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>> +   int mapcount1, mapcount2;
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>> +   mapcount1 = page_mapcount(page);
>>>>>     // rmap_walk need to change to count mapping and return value
>>>>>     // in .mapcount easy one
>>>>>     rmap_walk(page, &rwc);
>>>>
>>>> So what prevents GUP_fast() to grab reference here and the test below would
>>>> think the page is not pinned? Or do you assume that every page_mkclean()
>>>> call will be protected by PageWriteback (currently it is not) so that
>>>> GUP_fast() blocks / bails out?
>>
>> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
>> for each page" question (ignoring, for now, what to actually *do* in response to 
>> that flag being set):
>>
>> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
>> This is probably less troubling than the next point, but it does undermine all the 
>> complicated schemes involving PageWriteback, that try to synchronize gup() with
>> page_mkclean().
>>
>> 2. Also, the mapcount approach here still does not reliably avoid false negatives
>> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
>> can always jump in and increment the mapcount, while page_mkclean is in the middle
>> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
>>
>> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> 
> Both point is address by the solution at the end of this email.
> 
>>>
>>> So GUP_fast() becomes:
>>>
>>> GUP_fast_existing() { ... }
>>> GUP_fast()
>>> {
>>>     GUP_fast_existing();
>>>
>>>     for (i = 0; i < npages; ++i) {
>>>         if (PageWriteback(pages[i])) {
>>>             // need to force slow path for this page
>>>         } else {
>>>             SetPageDmaPinned(pages[i]);
>>>             atomic_inc(pages[i]->mapcount);
>>>         }
>>>     }
>>> }
>>>
>>> This is a minor slow down for GUP fast and it takes care of a
>>> write back race on behalf of caller. This means that page_mkclean
>>> can not see a mapcount value that increase. This simplify thing
>>> we can relax that. Note that what this is doing is making sure
>>> that GUP_fast never get lucky :) ie never GUP a page that is in
>>> the process of being write back but has not yet had its pte
>>> updated to reflect that.
>>>
>>>
>>>> But I think that detecting pinned pages with small false positive rate is
>>>> OK. The extra page bouncing will cost some performance but if it is rare,
>>>> then we are OK. So I think we can go for the simple version of detecting
>>>> pinned pages as you mentioned in some earlier email. We just have to be
>>>> sure there are no false negatives.
>>>
>>
>> Agree with that sentiment, but there are still false negatives and I'm not
>> yet seeing any solutions for that.
> 
> So here is the solution:
> 
> 
> Is a page pin ? With no false negative:
> =======================================
> 
> get_user_page*() aka GUP:
>      if (!PageAnon(page)) {
>         bool write_back = PageWriteback(page);
>         bool page_is_pin = PagePin(page);
>         if (write_back && !page_is_pin) {
>             /* Wait for write back a re-try GUP */
>             ...
>             goto retry;
>         }
> [G1]    smp_rmb();
> [G2]    atomic_inc(&page->_mapcount)
> [G3]    smp_wmb();
> [G4]    SetPagePin(page);
> [G5]    smp_wmb();
> [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
>             /* Back-off as write back might have miss us */
>             atomic_dec(&page->_mapcount);
>             /* Wait for write back a re-try GUP */
>             ...
>             goto retry;
>         }
>      }
> 
> put_user_page() aka PUP:
> [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> [P2] put_page(page);
> 
> page_mkclean():
> [C1] pined = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> [C4] map_count = rmap_walk(page);
> [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> 
> So with above code we store the map and pin count inside struct page
> _mapcount field. The idea is that we can count the number of page
> table entry that point to the page when reverse walking all the page
> mapping in page_mkclean() [C4].
> 
> The issue is that GUP, PUP and page table entry zapping can all run
> concurrently with page_mkclean() and thus we can not get the real
> map and pin count and the real map count at a given point in time
> ([C5] for instance in the above). However we only care about avoiding
> false negative ie we do not want to report a page as unpin if in fact
> it is pin (it has active GUP). Avoiding false positive would be nice
> but it would need more heavy weight synchronization within GUP and
> PUP (we can mitigate it see the section on that below).
> 
> With the above scheme a page is _not_ pin (unpin) if and only if we
> have real_map_count == real_map_and_pin_count at a given point in
> time. In the above pseudo code the page is lock within page_mkclean()
> thus no new page table entry can be added and thus the number of page
> mapping can only go down (because of conccurent pte zapping). So no
> matter what happens at [C5] we have map_count <= real_map_count.
> 
> At [C3] we have two cases to consider:
>  [R1] A concurrent GUP after [C3] then we do not care what happens at
>       [C5] as the GUP would already have set the page pin flag. If it
>       raced before [C3] at [C1] with TestClearPagePin() then we would
>       have the map_and_pin_count reflect the GUP thanks to the memory
>       barrier [G3] and [C2].
>  [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
>       worry about and thus the real_map_and_pin_count can only go down.
>       So because we first snap shot that value at [C5] we have:
>       real_map_and_pin_count <= map_and_pin_count.
> 
>       So at [C5] we end up with map_count <= real_map_count and with
>       real_map_and_pin_count <= map_pin_count but we also always have
>       real_map_count <= real_map_and_pin_count so it means we are in a
>       a <= b <= c <= d scenario and if a == d then b == c. So at [C5]
>       if map_count == map_pin_count then we know for sure that we have
>       real_map_count == real_map_and_pin_count and if that is the case
>       then the page is no longer pin. So at [C5] we will never miss a
>       pin page (no false negative).
> 
>       Another way to word this is that we always under-estimate the real
>       map count and over estimate the map and pin count and thus we can
>       never have false negative (map count equal to map and pin count
>       while in fact real map count is inferior to real map and pin count).
> 
> 
> PageWriteback() test and ordering with page_mkclean()
> =====================================================
> 
> In GUP we test for page write back flag to avoid pining a page that
> is under going write back. That flag is set after page_mkclean() so
> the filesystem code that will check for the pin flag need some memory
> barrier:
>     int __test_set_page_writeback(struct page *page, bool keep_write,
> +                                 bool *use_bounce_page)
>     {
>         ...
>   [T1]  TestSetPageWriteback(page);
> + [T2]  smp_wmb();
> + [T3]  *use_bounce_page = PagePin(page);
>         ...
>     }
> 
> That way if there is a concurrent GUP we either have:
>     [R1] GUP sees the write back flag set before [G1] so it back-off
>     [R2] GUP sees no write back before [G1] here either we have GUP
>          that sees the write back flag at [G6] or [T3] that sees the
>          pin flag thanks to the memory barrier [G5] and [T2].
> 
> So in all cases we never miss a pin or a write back.
> 
> 
> Mitigate false positive:
> ========================
> 
> If false positive is ever an issue we can improve the situation and to
> properly account conccurent pte zapping with the following changes:
> 
> page_mkclean():
> [C1] pined = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> [C4] map_count = rmap_walk(page, &page_mkclean_one());
> [C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
> [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
>      }
> 
> page_map_count():
> [M1] if (pte_valid(pte) { map_count++; }
>      } else if (pte_special_zap(pte)) {
> [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
>      }
> 
> And pte zapping of file back page will write a special pte entry which
> has the page map and pin count value at the time the pte is zap. Also
> page_mkclean_one() unconditionaly replace those special pte with pte
> none and ignore them altogether. We only want to detect pte zapping that
> happens after [C6] and before [C7] is done.
> 
> With [M3] we are counting all page table entry that have been zap after
> the map_and_pin_count value we read at [C6]. Again we have two cases:
>  [R1] A concurrent GUP after [C6] then we do not care what happens
>       at [C8] as the GUP would already have set the page pin flag.
>  [R2] No concurrent GUP then we only have concurrent PUP to worry
>       about. If they happen before [C6] they are included in [C6]
>       map_and_pin_count value. If after [C6] then we might miss a
>       page that is no longer pin ie we are over estimating the
>       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
>       at [C8]). So no false negative just false positive.
> 
> Here we just get the accurate real_map_count at [C6] time so if the
> page was no longer pin at [C6] time we will correctly detect it and
> not set the flag at [C8]. If there is any concurrent GUP that GUP
> would set the flag properly.
> 
> There is one last thing to note about above code, the MASK in [M3].
> For special pte entry we might not have enough bits to store the
> whole map and pin count value (on 32bits arch). So we might expose
> ourself to wrap around. Again we do not care about [R1] case as any
> concurrent GUP will set the pin flag. So we only care if the only
> thing happening concurrently is either PUP or pte zapping. In both
> case its means that the map and pin count is going down so if there
> is a wrap around sometimes within [C7]/page_map_count() we have:
>   [t0] page_map_count() executed on some pte
>   [t1] page_map_count() executed on another pte after [t0]
> With:
>     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> While in fact:
>     map_count_t0 > map_count_t1
> 
> So if that happens then we will under-estimate the map count ie we
> will ignore some of the concurrent pte zapping and not count them.
> So again we are only exposing our self to false positive not false
> negative.
> 
> 
> ---------------------------------------------------------------------
> 
> 
> Hope this proves that this solution does work. The false positives are
> something that I believe is acceptable. We will get them only when
> racing GUP or PUP. For a racing GUP it is safer to have a false
> positive. For a racing PUP it would be nice to catch them, but hey,
> sometimes you just get unlucky.
> 
> Note that any other solution will also suffer from false positive
> situation because anyway you are testing for the page pin status
> at a given point in time so it can always race with a PUP. So the
> only difference with any other solution would be how long is the
> false positive race window.
> 

Hi Jerome,

Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
in case anyone spots a disastrous conceptual error (such as the lock_page
point), while I'm putting together the revised patchset.

I've studied this carefully, and I agree that using mapcount in 
this way is viable, *as long* as we use a lock (or a construct that looks just 
like one: your "memory barrier, check, retry" is really just a lock) in
order to hold off gup() while page_mkclean() is in progress. In other words,
nothing that increments mapcount may proceed while page_mkclean() is running.

I especially am intrigued by your idea about a fuzzy count that allows
false positives but no false negatives. To do that, we need to put a hard
lock protecting the increment operation, but we can be loose (no lock) on
decrement. That turns out to be a perfect match for the problem here, because
as I recall from my earlier efforts, put_user_page() must *not* take locks--
and that's where we just decrement. Sweet! See below.

The other idea that you and Dan (and maybe others) pointed out was a debug
option, which we'll certainly need in order to safely convert all the call
sites. (Mirror the mappings at a different kernel offset, so that put_page()
and put_user_page() can verify that the right call was made.)  That will be
a separate patchset, as you recommended.

I'll even go as far as recommending the page lock itself. I realize that this 
adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
that this (below) has similar overhead to the notes above--but is *much* easier
to verify correct. (If the page lock is unacceptable due to being so widely used,
then I'd recommend using another page bit to do the same thing.)

(Note that memory barriers will simply be built into the various Set|Clear|Read
operations, as is common with a few other page flags.)

page_mkclean():
===============
lock_page()
    page_mkclean()
        Count actual mappings
            if(mappings == atomic_read(&page->_mapcount))
                ClearPageDmaPinned 

gup_fast():
===========
for each page {
    lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */

    atomic_inc(&page->_mapcount)
    SetPageDmaPinned()

    /* details of gup vs gup_fast not shown here... */


put_user_page():
================
    atomic_dec(&page->_mapcount); /* no locking! */
   

try_to_unmap() and other consumers of the PageDmaPinned flag:
=============================================================
lock_page() /* not required, but already done by existing callers */
    if(PageDmaPinned) {
        ...take appropriate action /* future patchsets */

page freeing:
============
ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  1:04                                                         ` John Hubbard
@ 2019-01-12  1:04                                                           ` John Hubbard
  2019-01-12  2:02                                                           ` Jerome Glisse
  1 sibling, 0 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-12  1:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 8:51 AM, Jerome Glisse wrote:
> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> 
> [...]
> 
>>>>> Now page_mkclean:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>> +   int mapcount1, mapcount2;
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>> +   mapcount1 = page_mapcount(page);
>>>>>     // rmap_walk need to change to count mapping and return value
>>>>>     // in .mapcount easy one
>>>>>     rmap_walk(page, &rwc);
>>>>
>>>> So what prevents GUP_fast() to grab reference here and the test below would
>>>> think the page is not pinned? Or do you assume that every page_mkclean()
>>>> call will be protected by PageWriteback (currently it is not) so that
>>>> GUP_fast() blocks / bails out?
>>
>> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
>> for each page" question (ignoring, for now, what to actually *do* in response to 
>> that flag being set):
>>
>> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
>> This is probably less troubling than the next point, but it does undermine all the 
>> complicated schemes involving PageWriteback, that try to synchronize gup() with
>> page_mkclean().
>>
>> 2. Also, the mapcount approach here still does not reliably avoid false negatives
>> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
>> can always jump in and increment the mapcount, while page_mkclean is in the middle
>> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
>>
>> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> 
> Both point is address by the solution at the end of this email.
> 
>>>
>>> So GUP_fast() becomes:
>>>
>>> GUP_fast_existing() { ... }
>>> GUP_fast()
>>> {
>>>     GUP_fast_existing();
>>>
>>>     for (i = 0; i < npages; ++i) {
>>>         if (PageWriteback(pages[i])) {
>>>             // need to force slow path for this page
>>>         } else {
>>>             SetPageDmaPinned(pages[i]);
>>>             atomic_inc(pages[i]->mapcount);
>>>         }
>>>     }
>>> }
>>>
>>> This is a minor slow down for GUP fast and it takes care of a
>>> write back race on behalf of caller. This means that page_mkclean
>>> can not see a mapcount value that increase. This simplify thing
>>> we can relax that. Note that what this is doing is making sure
>>> that GUP_fast never get lucky :) ie never GUP a page that is in
>>> the process of being write back but has not yet had its pte
>>> updated to reflect that.
>>>
>>>
>>>> But I think that detecting pinned pages with small false positive rate is
>>>> OK. The extra page bouncing will cost some performance but if it is rare,
>>>> then we are OK. So I think we can go for the simple version of detecting
>>>> pinned pages as you mentioned in some earlier email. We just have to be
>>>> sure there are no false negatives.
>>>
>>
>> Agree with that sentiment, but there are still false negatives and I'm not
>> yet seeing any solutions for that.
> 
> So here is the solution:
> 
> 
> Is a page pin ? With no false negative:
> =======================================
> 
> get_user_page*() aka GUP:
>      if (!PageAnon(page)) {
>         bool write_back = PageWriteback(page);
>         bool page_is_pin = PagePin(page);
>         if (write_back && !page_is_pin) {
>             /* Wait for write back a re-try GUP */
>             ...
>             goto retry;
>         }
> [G1]    smp_rmb();
> [G2]    atomic_inc(&page->_mapcount)
> [G3]    smp_wmb();
> [G4]    SetPagePin(page);
> [G5]    smp_wmb();
> [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
>             /* Back-off as write back might have miss us */
>             atomic_dec(&page->_mapcount);
>             /* Wait for write back a re-try GUP */
>             ...
>             goto retry;
>         }
>      }
> 
> put_user_page() aka PUP:
> [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> [P2] put_page(page);
> 
> page_mkclean():
> [C1] pined = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> [C4] map_count = rmap_walk(page);
> [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> 
> So with above code we store the map and pin count inside struct page
> _mapcount field. The idea is that we can count the number of page
> table entry that point to the page when reverse walking all the page
> mapping in page_mkclean() [C4].
> 
> The issue is that GUP, PUP and page table entry zapping can all run
> concurrently with page_mkclean() and thus we can not get the real
> map and pin count and the real map count at a given point in time
> ([C5] for instance in the above). However we only care about avoiding
> false negative ie we do not want to report a page as unpin if in fact
> it is pin (it has active GUP). Avoiding false positive would be nice
> but it would need more heavy weight synchronization within GUP and
> PUP (we can mitigate it see the section on that below).
> 
> With the above scheme a page is _not_ pin (unpin) if and only if we
> have real_map_count == real_map_and_pin_count at a given point in
> time. In the above pseudo code the page is lock within page_mkclean()
> thus no new page table entry can be added and thus the number of page
> mapping can only go down (because of conccurent pte zapping). So no
> matter what happens at [C5] we have map_count <= real_map_count.
> 
> At [C3] we have two cases to consider:
>  [R1] A concurrent GUP after [C3]: then we do not care what happens at
>       [C5] as the GUP would already have set the page pin flag. If it
>       raced before [C3] at [C1] with TestClearPagePin() then we would
>       have the map_and_pin_count reflect the GUP thanks to the memory
>       barriers [G3] and [C2].
>  [R2] No concurrent GUP after [C3]: then we only have concurrent PUP to
>       worry about and thus the real_map_and_pin_count can only go down.
>       So because we first snapshot that value at [C3] we have:
>       real_map_and_pin_count <= map_and_pin_count.
> 
>       So at [C5] we end up with map_count <= real_map_count and with
>       real_map_and_pin_count <= map_and_pin_count, but we also always
>       have real_map_count <= real_map_and_pin_count, so it means we are
>       in an a <= b <= c <= d scenario and if a == d then b == c. So at
>       [C5] if map_count == map_and_pin_count then we know for sure that
>       real_map_count == real_map_and_pin_count, and if that is the case
>       then the page is no longer pinned. So at [C5] we will never miss
>       a pinned page (no false negative).
> 
>       Another way to word this is that we always under-estimate the
>       real map count and over-estimate the map and pin count, and thus
>       we can never have a false negative (map count equal to map and
>       pin count while in fact the real map count is less than the real
>       map and pin count).
> 
> 
> PageWriteback() test and ordering with page_mkclean()
> =====================================================
> 
> In GUP we test for the page write back flag to avoid pinning a page
> that is undergoing write back. That flag is set after page_mkclean(),
> so the filesystem code that will check for the pin flag needs some
> memory barriers:
>     int __test_set_page_writeback(struct page *page, bool keep_write,
> +                                 bool *use_bounce_page)
>     {
>         ...
>   [T1]  TestSetPageWriteback(page);
> + [T2]  smp_wmb();
> + [T3]  *use_bounce_page = PagePin(page);
>         ...
>     }
> 
> That way if there is a concurrent GUP we either have:
>     [R1] GUP sees the write back flag set before [G1] so it backs off
>     [R2] GUP sees no write back before [G1]; here either GUP sees the
>          write back flag at [G6], or [T3] sees the pin flag thanks to
>          the memory barriers [G5] and [T2].
> 
> So in all cases we never miss a pin or a write back.
> 
> 
> Mitigating false positives:
> ===========================
> 
> If false positives are ever an issue we can improve the situation and
> properly account for concurrent pte zapping with the following changes:
> 
> page_mkclean():
> [C1] pinned = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> [C4] map_count = rmap_walk(page, &page_mkclean_one());
> [C5] if (pinned && !PagePin(page) && map_count < map_and_pin_count) {
> [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
>      }
> 
> page_map_count():
> [M1] if (pte_valid(pte)) { map_count++;
>      } else if (pte_special_zap(pte)) {
> [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
>      }
> 
> And pte zapping of a file-backed page will write a special pte entry
> which holds the page map and pin count value at the time the pte is
> zapped. Also page_mkclean_one() unconditionally replaces those special
> ptes with pte none and ignores them altogether. We only want to detect
> pte zapping that happens after [C6] and before [C7] is done.
> 
> With [M3] we are counting all page table entries that have been zapped
> after the map_and_pin_count value we read at [C6]. Again we have two
> cases:
>  [R1] A concurrent GUP after [C6]: then we do not care what happens
>       at [C8] as the GUP would already have set the page pin flag.
>  [R2] No concurrent GUP: then we only have concurrent PUP to worry
>       about. If they happen before [C6] they are included in the [C6]
>       map_and_pin_count value. If after [C6] then we might miss a
>       page that is no longer pinned, i.e. we are over-estimating the
>       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
>       at [C8]). So no false negatives, just false positives.
> 
> Here we just get an accurate real_map_count as of [C6], so if the
> page was no longer pinned at [C6] we will correctly detect it and
> not set the flag at [C8]. If there is any concurrent GUP, that GUP
> will set the flag properly.
> 
> There is one last thing to note about the above code: the MASK in [M3].
> For a special pte entry we might not have enough bits to store the
> whole map and pin count value (on 32-bit arches). So we might expose
> ourselves to wrap around. Again we do not care about the [R1] case as
> any concurrent GUP will set the pin flag. So we only care if the only
> thing happening concurrently is either PUP or pte zapping. In both
> cases it means that the map and pin count is going down, so if there
> is a wrap around sometime within [C7]/page_map_count() we can have:
>   [t0] page_map_count() executed on some pte
>   [t1] page_map_count() executed on another pte after [t0]
> With:
>     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> While in fact:
>     map_count_t0 > map_count_t1
> 
> So if that happens then we will under-estimate the map count, i.e. we
> will ignore some of the concurrent pte zapping and not count it.
> So again we are only exposing ourselves to false positives, not false
> negatives.
> 
> 
> ---------------------------------------------------------------------
> 
> 
> Hopefully this proves that this solution does work. The false
> positives are something that I believe is acceptable. We will get
> them only when racing with GUP or PUP. For a racing GUP it is safer
> to have a false positive. For a racing PUP it would be nice to catch
> it, but hey, sometimes you just get unlucky.
> 
> Note that any other solution will also suffer from false positives,
> because either way you are testing the page pin status at a given
> point in time, so it can always race with a PUP. So the only
> difference between any solutions is how long the false-positive race
> window is.
> 

Hi Jerome,

Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
in case anyone spots a disastrous conceptual error (such as the lock_page
point), while I'm putting together the revised patchset.

I've studied this carefully, and I agree that using mapcount in 
this way is viable, *as long* as we use a lock (or a construct that looks just 
like one: your "memory barrier, check, retry" is really just a lock) in
order to hold off gup() while page_mkclean() is in progress. In other words,
nothing that increments mapcount may proceed while page_mkclean() is running.

I especially am intrigued by your idea about a fuzzy count that allows
false positives but no false negatives. To do that, we need to put a hard
lock protecting the increment operation, but we can be loose (no lock) on
decrement. That turns out to be a perfect match for the problem here, because
as I recall from my earlier efforts, put_user_page() must *not* take locks--
and that's where we just decrement. Sweet! See below.

The other idea that you and Dan (and maybe others) pointed out was a debug
option, which we'll certainly need in order to safely convert all the call
sites. (Mirror the mappings at a different kernel offset, so that put_page()
and put_user_page() can verify that the right call was made.)  That will be
a separate patchset, as you recommended.

I'll even go as far as recommending the page lock itself. I realize that this 
adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
that this (below) has similar overhead to the notes above--but is *much* easier
to verify correct. (If the page lock is unacceptable due to being so widely used,
then I'd recommend using another page bit to do the same thing.)

(Note that memory barriers will simply be built into the various Set|Clear|Read
operations, as is common with a few other page flags.)

page_mkclean():
===============
lock_page()
    page_mkclean()
        Count actual mappings
            if(mappings == atomic_read(&page->_mapcount))
                ClearPageDmaPinned 

gup_fast():
===========
for each page {
    lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */

    atomic_inc(&page->_mapcount)
    SetPageDmaPinned()

    /* details of gup vs gup_fast not shown here... */


put_user_page():
================
    atomic_dec(&page->_mapcount); /* no locking! */
   

try_to_unmap() and other consumers of the PageDmaPinned flag:
=============================================================
lock_page() /* not required, but already done by existing callers */
    if(PageDmaPinned) {
        ...take appropriate action /* future patchsets */

page freeing:
============
ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */
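As a sanity check of the locking asymmetry described above, here is a minimal single-threaded userspace model (all names are hypothetical stand-ins, not kernel APIs): the combined-count increment happens only under a "page lock" stand-in, the put_user_page() decrement is lockless, and the mkclean-style check runs under the lock so no increment can race with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the scheme sketched above. The page
 * lock is modeled as a flag so the key invariant (nothing increments
 * the count while mkclean runs) can be asserted. */
struct mock_page {
    bool locked;            /* stand-in for the page lock */
    atomic_int mapcount;    /* stand-in for page->_mapcount */
    bool dma_pinned;        /* stand-in for PageDmaPinned */
};

static void mock_lock(struct mock_page *p)   { assert(!p->locked); p->locked = true; }
static void mock_unlock(struct mock_page *p) { p->locked = false; }

/* gup: increment only under the lock, then mark the page pinned */
static void mock_gup(struct mock_page *p)
{
    mock_lock(p);
    atomic_fetch_add(&p->mapcount, 1);
    p->dma_pinned = true;
    mock_unlock(p);
}

/* put_user_page: decrement with no locking, as required */
static void mock_put_user_page(struct mock_page *p)
{
    atomic_fetch_sub(&p->mapcount, 1);
}

/* page_mkclean: under the lock, compare the mapping count found by the
 * rmap walk with _mapcount; equality means no pin remains. Returns the
 * resulting pinned state. */
static bool mock_mkclean(struct mock_page *p, int walked_mappings)
{
    mock_lock(p);
    if (walked_mappings == atomic_load(&p->mapcount))
        p->dma_pinned = false;
    bool pinned = p->dma_pinned;
    mock_unlock(p);
    return pinned;
}
```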



thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11  2:59                                                     ` John Hubbard
  2019-01-11  2:59                                                       ` John Hubbard
@ 2019-01-11 16:51                                                       ` Jerome Glisse
  2019-01-11 16:51                                                         ` Jerome Glisse
  2019-01-12  1:04                                                         ` John Hubbard
  1 sibling, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-11 16:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> > On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:

[...]

> >>> Now page_mkclean:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>>     int cleaned = 0;
> >>> +   int real_mapcount = 0;
> >>>     struct address_space *mapping;
> >>>     struct rmap_walk_control rwc = {
> >>>         .arg = (void *)&cleaned,
> >>>         .rmap_one = page_mkclean_one,
> >>>         .invalid_vma = invalid_mkclean_vma,
> >>> +       .mapcount = &real_mapcount,
> >>>     };
> >>> +   int mapcount1, mapcount2;
> >>>
> >>>     BUG_ON(!PageLocked(page));
> >>>
> >>>     if (!page_mapped(page))
> >>>         return 0;
> >>>
> >>>     mapping = page_mapping(page);
> >>>     if (!mapping)
> >>>         return 0;
> >>>
> >>> +   mapcount1 = page_mapcount(page);
> >>>     // rmap_walk need to change to count mapping and return value
> >>>     // in .mapcount easy one
> >>>     rmap_walk(page, &rwc);
> >>
> >> So what prevents GUP_fast() to grab reference here and the test below would
> >> think the page is not pinned? Or do you assume that every page_mkclean()
> >> call will be protected by PageWriteback (currently it is not) so that
> >> GUP_fast() blocks / bails out?
> 
> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> for each page" question (ignoring, for now, what to actually *do* in response to 
> that flag being set):
> 
> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> This is probably less troubling than the next point, but it does undermine all the 
> complicated schemes involving PageWriteback, that try to synchronize gup() with
> page_mkclean().
> 
> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> can always jump in and increment the mapcount, while page_mkclean is in the middle
> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> 
> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

Both points are addressed by the solution at the end of this email.

> > 
> > So GUP_fast() becomes:
> > 
> > GUP_fast_existing() { ... }
> > GUP_fast()
> > {
> >     GUP_fast_existing();
> > 
> >     for (i = 0; i < npages; ++i) {
> >         if (PageWriteback(pages[i])) {
> >             // need to force slow path for this page
> >         } else {
> >             SetPageDmaPinned(pages[i]);
> >             atomic_inc(pages[i]->mapcount);
> >         }
> >     }
> > }
> > 
> > This is a minor slow down for GUP_fast and it takes care of a
> > write back race on behalf of the caller. This means that
> > page_mkclean can not see a mapcount value that increases. This
> > simplifies things; we can relax that. Note that what this is doing
> > is making sure that GUP_fast never gets lucky :) ie it never GUPs a
> > page that is in the process of being written back but has not yet
> > had its pte updated to reflect that.
> > 
> > 
> >> But I think that detecting pinned pages with small false positive rate is
> >> OK. The extra page bouncing will cost some performance but if it is rare,
> >> then we are OK. So I think we can go for the simple version of detecting
> >> pinned pages as you mentioned in some earlier email. We just have to be
> >> sure there are no false negatives.
> > 
> 
> Agree with that sentiment, but there are still false negatives and I'm not
> yet seeing any solutions for that.

So here is the solution:


Is a page pinned? With no false negatives:
==========================================

get_user_page*() aka GUP:
     if (!PageAnon(page)) {
        bool write_back = PageWriteback(page);
        bool page_is_pin = PagePin(page);
        if (write_back && !page_is_pin) {
            /* Wait for write back and re-try GUP */
            ...
            goto retry;
        }
[G1]    smp_rmb();
[G2]    atomic_inc(&page->_mapcount)
[G3]    smp_wmb();
[G4]    SetPagePin(page);
[G5]    smp_wmb();
[G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
            /* Back off as write back might have missed us */
            atomic_dec(&page->_mapcount);
            /* Wait for write back and re-try GUP */
            ...
            goto retry;
        }
     }

put_user_page() aka PUP:
[P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
[P2] put_page(page);

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount)
[C4] map_count = rmap_walk(page);
[C5] if (pinned && map_count < map_and_pin_count) SetPagePin(page);

So with the above code we store the map and pin count inside the
struct page _mapcount field. The idea is that we can count the number
of page table entries that point to the page when reverse walking all
the page mappings in page_mkclean() [C4].

The issue is that GUP, PUP and page table entry zapping can all run
concurrently with page_mkclean() and thus we can not get the real
map and pin count and the real map count at a given point in time
([C5] for instance in the above). However we only care about avoiding
false negatives, i.e. we do not want to report a page as unpinned if
in fact it is pinned (it has an active GUP). Avoiding false positives
would be nice but it would need more heavy-weight synchronization
within GUP and PUP (we can mitigate that; see the section on it below).

With the above scheme a page is _not_ pinned if and only if we have
real_map_count == real_map_and_pin_count at a given point in time.
In the above pseudo code the page is locked within page_mkclean(),
thus no new page table entry can be added and thus the number of page
mappings can only go down (because of concurrent pte zapping). So no
matter what happens at [C5] we have map_count <= real_map_count.

At [C3] we have two cases to consider:
 [R1] A concurrent GUP after [C3]: then we do not care what happens at
      [C5] as the GUP would already have set the page pin flag. If it
      raced before [C3] at [C1] with TestClearPagePin() then we would
      have the map_and_pin_count reflect the GUP thanks to the memory
      barriers [G3] and [C2].
 [R2] No concurrent GUP after [C3]: then we only have concurrent PUP to
      worry about and thus the real_map_and_pin_count can only go down.
      So because we first snapshot that value at [C3] we have:
      real_map_and_pin_count <= map_and_pin_count.

      So at [C5] we end up with map_count <= real_map_count and with
      real_map_and_pin_count <= map_and_pin_count, but we also always
      have real_map_count <= real_map_and_pin_count, so it means we are
      in an a <= b <= c <= d scenario and if a == d then b == c. So at
      [C5] if map_count == map_and_pin_count then we know for sure that
      real_map_count == real_map_and_pin_count, and if that is the case
      then the page is no longer pinned. So at [C5] we will never miss
      a pinned page (no false negative).

      Another way to word this is that we always under-estimate the
      real map count and over-estimate the map and pin count, and thus
      we can never have a false negative (map count equal to map and
      pin count while in fact the real map count is less than the real
      map and pin count).
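The a <= b <= c <= d argument above is small enough to check exhaustively with a hypothetical userspace model (not kernel code): enumerate every split of concurrent PUPs and pte zaps around the [C3] snapshot and verify that the equality test at [C5] never reports "unpinned" while a pin remains.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of one page_mkclean() pass under the page lock: `maps` and
 * `pins` are the real counts at the [C3] snapshot, `zaps_missed` is
 * how many ptes were zapped before the [C4] walk could count them. */
static bool report_unpinned(int maps, int pins, int zaps_missed)
{
    int snapshot = maps + pins;          /* [C3]: _mapcount snapshot    */
    int walked   = maps - zaps_missed;   /* [C4]: walk may miss zapped  */
    return walked == snapshot;           /* [C5]: equality => unpinned  */
}

/* True iff no interleaving of concurrent PUP and pte zapping (within
 * the given bounds) produces a false negative: a page reported
 * unpinned while pins remain at [C5] time. */
static bool no_false_negative(int max_maps, int max_pins)
{
    for (int maps = 0; maps <= max_maps; maps++)
        for (int pins = 0; pins <= max_pins; pins++)
            for (int zaps = 0; zaps <= maps; zaps++)
                for (int pups = 0; pups <= pins; pups++) {
                    int pins_left = pins - pups;
                    if (report_unpinned(maps, pins, zaps) && pins_left > 0)
                        return false;   /* false negative found */
                }
    return true;
}
```

False positives are still possible (a PUP that lands after the snapshot leaves the counts unequal), which matches the argument in the text.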


PageWriteback() test and ordering with page_mkclean()
=====================================================

In GUP we test for the page write back flag to avoid pinning a page
that is undergoing write back. That flag is set after page_mkclean(),
so the filesystem code that will check for the pin flag needs some
memory barriers:
    int __test_set_page_writeback(struct page *page, bool keep_write,
+                                 bool *use_bounce_page)
    {
        ...
  [T1]  TestSetPageWriteback(page);
+ [T2]  smp_wmb();
+ [T3]  *use_bounce_page = PagePin(page);
        ...
    }

That way if there is a concurrent GUP we either have:
    [R1] GUP sees the write back flag set before [G1] so it backs off
    [R2] GUP sees no write back before [G1]; here either GUP sees the
         write back flag at [G6], or [T3] sees the pin flag thanks to
         the memory barriers [G5] and [T2].

So in all cases we never miss a pin or a write back.
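The [G4]/[G6] versus [T1]/[T3] pairing is the classic "store-buffering" shape: each side stores its flag, then loads the other's, and the barriers are what forbid both sides missing each other. A hypothetical userspace model (not kernel code) can enumerate every sequentially consistent interleaving, which is the behavior the barrier pairing is meant to guarantee for these four accesses:

```c
#include <stdbool.h>

/* GUP thread:       [G4] pin = 1;        [G6] r_gup = writeback;
 * writeback thread: [T1] writeback = 1;  [T3] r_wb  = pin;
 * Claim: in every interleaving at least one side sees the other. */
static bool one_side_always_sees_other(void)
{
    /* All 6 interleavings of two events per thread (0 = GUP, 1 = WB) */
    static const int schedules[6][4] = {
        {0,0,1,1}, {0,1,0,1}, {0,1,1,0},
        {1,0,0,1}, {1,0,1,0}, {1,1,0,0},
    };
    for (int s = 0; s < 6; s++) {
        int pin = 0, writeback = 0, r_gup = -1, r_wb = -1;
        int step[2] = {0, 0};           /* next event index per thread */
        for (int i = 0; i < 4; i++) {
            if (schedules[s][i] == 0) {                 /* GUP thread  */
                if (step[0]++ == 0) pin = 1;            /* [G4] */
                else                r_gup = writeback;  /* [G6] */
            } else {                                    /* WB thread   */
                if (step[1]++ == 0) writeback = 1;      /* [T1] */
                else                r_wb = pin;         /* [T3] */
            }
        }
        if (r_gup == 0 && r_wb == 0)
            return false;   /* both missed: the pin would be lost */
    }
    return true;
}
```

On real hardware, without [G5]/[T2] the stores could be reordered after the loads and the forbidden "both missed" outcome would become reachable; that is exactly what the barrier pairing rules out.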


Mitigating false positives:
===========================

If false positives are ever an issue we can improve the situation and
properly account for concurrent pte zapping with the following changes:

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount)
[C4] map_count = rmap_walk(page, &page_mkclean_one());
[C5] if (pinned && !PagePin(page) && map_count < map_and_pin_count) {
[C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
[C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
[C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
     }

page_map_count():
[M1] if (pte_valid(pte)) { map_count++;
     } else if (pte_special_zap(pte)) {
[M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
[M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
     }

And pte zapping of a file-backed page will write a special pte entry
which holds the page map and pin count value at the time the pte is
zapped. Also page_mkclean_one() unconditionally replaces those special
ptes with pte none and ignores them altogether. We only want to detect
pte zapping that happens after [C6] and before [C7] is done.

With [M3] we are counting all page table entries that have been zapped
after the map_and_pin_count value we read at [C6]. Again we have two
cases:
 [R1] A concurrent GUP after [C6]: then we do not care what happens
      at [C8] as the GUP would already have set the page pin flag.
 [R2] No concurrent GUP: then we only have concurrent PUP to worry
      about. If they happen before [C6] they are included in the [C6]
      map_and_pin_count value. If after [C6] then we might miss a
      page that is no longer pinned, i.e. we are over-estimating the
      map_and_pin_count (real_map_and_pin_count < map_and_pin_count
      at [C8]). So no false negatives, just false positives.

Here we just get an accurate real_map_count as of [C6], so if the
page was no longer pinned at [C6] we will correctly detect it and
not set the flag at [C8]. If there is any concurrent GUP, that GUP
will set the flag properly.

There is one last thing to note about the above code: the MASK in [M3].
For a special pte entry we might not have enough bits to store the
whole map and pin count value (on 32-bit arches). So we might expose
ourselves to wrap around. Again we do not care about the [R1] case as
any concurrent GUP will set the pin flag. So we only care if the only
thing happening concurrently is either PUP or pte zapping. In both
cases it means that the map and pin count is going down, so if there
is a wrap around sometime within [C7]/page_map_count() we can have:
  [t0] page_map_count() executed on some pte
  [t1] page_map_count() executed on another pte after [t0]
With:
    (map_count_t0 & MASK) < (map_count_t1 & MASK)
While in fact:
    map_count_t0 > map_count_t1

So if that happens then we will under-estimate the map count, i.e. we
will ignore some of the concurrent pte zapping and not count it.
So again we are only exposing ourselves to false positives, not false
negatives.
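The wrap-around hazard can be made concrete with a hypothetical 8-bit MASK (the real mask width is an assumption here, not from the text): an earlier zap recorded a larger real count, but its masked value compares as smaller, so the [M3] test can miss a post-snapshot zap and under-count, which errs only toward false positives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 8-bit mask for the count stored in a special zap pte */
#define MOCK_MASK 0xfful

static unsigned long masked(unsigned long count)
{
    return count & MOCK_MASK;
}

/* The [M3] test as written: count a zap iff its recorded (masked)
 * count is at or below the masked [C6] snapshot. Counts only go down,
 * so without wrap-around this selects exactly the post-snapshot zaps. */
static bool m3_counts_zap(unsigned long count_at_zap, unsigned long snapshot)
{
    return masked(count_at_zap) <= masked(snapshot);
}
```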


---------------------------------------------------------------------


Hopefully this proves that this solution does work. The false
positives are something that I believe is acceptable. We will get
them only when racing with GUP or PUP. For a racing GUP it is safer
to have a false positive. For a racing PUP it would be nice to catch
it, but hey, sometimes you just get unlucky.

Note that any other solution will also suffer from false positives,
because either way you are testing the page pin status at a given
point in time, so it can always race with a PUP. So the only
difference between any solutions is how long the false-positive race
window is.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11 16:51                                                       ` Jerome Glisse
@ 2019-01-11 16:51                                                         ` Jerome Glisse
  2019-01-12  1:04                                                         ` John Hubbard
  1 sibling, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-11 16:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> > On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:

[...]

> >>> Now page_mkclean:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>>     int cleaned = 0;
> >>> +   int real_mapcount = 0;
> >>>     struct address_space *mapping;
> >>>     struct rmap_walk_control rwc = {
> >>>         .arg = (void *)&cleaned,
> >>>         .rmap_one = page_mkclean_one,
> >>>         .invalid_vma = invalid_mkclean_vma,
> >>> +       .mapcount = &real_mapcount,
> >>>     };
> >>> +   int mapcount1, mapcount2;
> >>>
> >>>     BUG_ON(!PageLocked(page));
> >>>
> >>>     if (!page_mapped(page))
> >>>         return 0;
> >>>
> >>>     mapping = page_mapping(page);
> >>>     if (!mapping)
> >>>         return 0;
> >>>
> >>> +   mapcount1 = page_mapcount(page);
> >>>     // rmap_walk need to change to count mapping and return value
> >>>     // in .mapcount easy one
> >>>     rmap_walk(page, &rwc);
> >>
> >> So what prevents GUP_fast() to grab reference here and the test below would
> >> think the page is not pinned? Or do you assume that every page_mkclean()
> >> call will be protected by PageWriteback (currently it is not) so that
> >> GUP_fast() blocks / bails out?
> 
> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> for each page" question (ignoring, for now, what to actually *do* in response to 
> that flag being set):
> 
> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> This is probably less troubling than the next point, but it does undermine all the 
> complicated schemes involving PageWriteback, that try to synchronize gup() with
> page_mkclean().
> 
> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> can always jump in and increment the mapcount, while page_mkclean is in the middle
> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> 
> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

Both point is address by the solution at the end of this email.

> > 
> > So GUP_fast() becomes:
> > 
> > GUP_fast_existing() { ... }
> > GUP_fast()
> > {
> >     GUP_fast_existing();
> > 
> >     for (i = 0; i < npages; ++i) {
> >         if (PageWriteback(pages[i])) {
> >             // need to force slow path for this page
> >         } else {
> >             SetPageDmaPinned(pages[i]);
> >             atomic_inc(pages[i]->mapcount);
> >         }
> >     }
> > }
> > 
> > This is a minor slow down for GUP fast and it takes care of a
> > write back race on behalf of caller. This means that page_mkclean
> > can not see a mapcount value that increase. This simplify thing
> > we can relax that. Note that what this is doing is making sure
> > that GUP_fast never get lucky :) ie never GUP a page that is in
> > the process of being write back but has not yet had its pte
> > updated to reflect that.
> > 
> > 
> >> But I think that detecting pinned pages with small false positive rate is
> >> OK. The extra page bouncing will cost some performance but if it is rare,
> >> then we are OK. So I think we can go for the simple version of detecting
> >> pinned pages as you mentioned in some earlier email. We just have to be
> >> sure there are no false negatives.
> > 
> 
> Agree with that sentiment, but there are still false negatives and I'm not
> yet seeing any solutions for that.

So here is the solution:


Is a page pin ? With no false negative:
=======================================

get_user_page*() aka GUP:
     if (!PageAnon(page)) {
        bool write_back = PageWriteback(page);
        bool page_is_pin = PagePin(page);
        if (write_back && !page_is_pin) {
            /* Wait for write back a re-try GUP */
            ...
            goto retry;
        }
[G1]    smp_rmb();
[G2]    atomic_inc(&page->_mapcount)
[G3]    smp_wmb();
[G4]    SetPagePin(page);
[G5]    smp_wmb();
[G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
            /* Back-off as write back might have miss us */
            atomic_dec(&page->_mapcount);
            /* Wait for write back a re-try GUP */
            ...
            goto retry;
        }
     }

put_user_page() aka PUP:
[P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
[P2] put_page(page);

page_mkclean():
[C1] pined = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount)
[C4] map_count = rmap_walk(page);
[C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);

So with above code we store the map and pin count inside struct page
_mapcount field. The idea is that we can count the number of page
table entry that point to the page when reverse walking all the page
mapping in page_mkclean() [C4].

The issue is that GUP, PUP and page table entry zapping can all run
concurrently with page_mkclean() and thus we can not get the real
map and pin count and the real map count at a given point in time
([C5] for instance in the above). However we only care about avoiding
false negative ie we do not want to report a page as unpin if in fact
it is pin (it has active GUP). Avoiding false positive would be nice
but it would need more heavy weight synchronization within GUP and
PUP (we can mitigate it see the section on that below).

With the above scheme a page is _not_ pin (unpin) if and only if we
have real_map_count == real_map_and_pin_count at a given point in
time. In the above pseudo code the page is lock within page_mkclean()
thus no new page table entry can be added and thus the number of page
mapping can only go down (because of conccurent pte zapping). So no
matter what happens at [C5] we have map_count <= real_map_count.

At [C3] we have two cases to consider:
 [R1] A concurrent GUP after [C3] then we do not care what happens at
      [C5] as the GUP would already have set the page pin flag. If it
      raced before [C3] at [C1] with TestClearPagePin() then we would
      have the map_and_pin_count reflect the GUP thanks to the memory
      barrier [G3] and [C2].
 [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
      worry about and thus the real_map_and_pin_count can only go down.
      So because we first snap shot that value at [C5] we have:
      real_map_and_pin_count <= map_and_pin_count.

      So at [C5] we end up with map_count <= real_map_count and with
      real_map_and_pin_count <= map_pin_count but we also always have
      real_map_count <= real_map_and_pin_count so it means we are in a
      a <= b <= c <= d scenario and if a == d then b == c. So at [C5]
      if map_count == map_pin_count then we know for sure that we have
      real_map_count == real_map_and_pin_count and if that is the case
      then the page is no longer pin. So at [C5] we will never miss a
      pin page (no false negative).

      Another way to word this is that we always under-estimate the real
      map count and over estimate the map and pin count and thus we can
      never have false negative (map count equal to map and pin count
      while in fact real map count is inferior to real map and pin count).


PageWriteback() test and ordering with page_mkclean()
=====================================================

In GUP we test for page write back flag to avoid pining a page that
is under going write back. That flag is set after page_mkclean() so
the filesystem code that will check for the pin flag need some memory
barrier:
    int __test_set_page_writeback(struct page *page, bool keep_write,
+                                 bool *use_bounce_page)
    {
        ...
  [T1]  TestSetPageWriteback(page);
+ [T2]  smp_wmb();
+ [T3]  *use_bounce_page = PagePin(page);
        ...
    }

That way if there is a concurrent GUP we either have:
    [R1] GUP sees the write back flag set before [G1] so it back-off
    [R2] GUP sees no write back before [G1] here either we have GUP
         that sees the write back flag at [G6] or [T3] that sees the
         pin flag thanks to the memory barrier [G5] and [T2].

So in all cases we never miss a pin or a write back.


Mitigate false positive:
========================

If false positive is ever an issue we can improve the situation and to
properly account conccurent pte zapping with the following changes:

page_mkclean():
[C1] pined = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount)
[C4] map_count = rmap_walk(page, &page_mkclean_one());
[C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
[C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
[C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
[C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
     }

page_map_count():
[M1] if (pte_valid(pte) { map_count++; }
     } else if (pte_special_zap(pte)) {
[M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
[M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
     }

Zapping a pte of a file-backed page now writes a special pte entry
that records the page's map-and-pin count at the time the pte is
zapped. Also, page_mkclean_one() unconditionally replaces those
special ptes with pte_none and otherwise ignores them. We only want
to detect pte zapping that happens after [C6] and before [C7] is done.

With [M3] we count every page table entry that was zapped after we
read the map_and_pin_count value at [C6]. Again we have two cases:
 [R1] A concurrent GUP after [C6]: then we do not care what happens
      at [C8], as that GUP will already have set the page pin flag.
 [R2] No concurrent GUP: then we only have concurrent PUPs to worry
      about. If they happen before [C6] they are included in the [C6]
      map_and_pin_count value. If after [C6], then we might miss a
      page that is no longer pinned, i.e. we over-estimate
      map_and_pin_count (real_map_and_pin_count < map_and_pin_count
      at [C8]). So no false negatives, just false positives.

Here we get an accurate real map count as of [C6], so if the page was
no longer pinned at [C6] we correctly detect that and do not set the
flag at [C8]. Any concurrent GUP will set the flag itself.

There is one last thing to note about the above code: the MASK in
[M3]. For a special pte entry we might not have enough bits to store
the whole map-and-pin count value (on 32-bit arches), so we are
exposed to wrap-around. Again, we do not care about the [R1] case, as
any concurrent GUP will set the pin flag. So we only care when the
only things happening concurrently are PUPs or pte zapping. In both
cases the map-and-pin count is going down, so if a wrap-around happens
sometime within [C7]/page_map_count() we have:
  [t0] page_map_count() executed on some pte
  [t1] page_map_count() executed on another pte after [t0]
With:
    (map_count_t0 & MASK) < (map_count_t1 & MASK)
While in fact:
    map_count_t0 > map_count_t1

If that happens we will under-estimate the map count, i.e. we will
ignore some of the concurrent pte zappings and not count them. So
again we only expose ourselves to false positives, not false
negatives.


---------------------------------------------------------------------


Hopefully this proves that the solution does work. The false
positives are something I believe is acceptable: we only get them
when racing with a GUP or a PUP. For a racing GUP it is safer to have
a false positive. For a racing PUP it would be nice to catch them,
but sometimes you just get unlucky.

Note that any other solution will also suffer from false positives,
because either way you are testing the page pin status at a given
point in time, so it can always race with a PUP. The only difference
between solutions is how long the false-positive race window is.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03 14:44                                                   ` Jerome Glisse
@ 2019-01-11  2:59                                                     ` John Hubbard
  2019-01-11  2:59                                                       ` John Hubbard
  2019-01-11 16:51                                                       ` Jerome Glisse
  0 siblings, 2 replies; 207+ messages in thread
From: John Hubbard @ 2019-01-11  2:59 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/3/19 6:44 AM, Jerome Glisse wrote:
> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>>>> solution for that.  
>>>>>>
>>>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>>>> file-backed (non-anonymous) pages:
>>>>>>
>>>>>>
>>>>>> __get_user_pages()
>>>>>> {
>>>>>> 	...
>>>>>> 	get_page(page);
>>>>>>
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_inc(page->_mapcount);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> put_user_page(struct page *page)
>>>>>> {
>>>>>> 	...
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_dec(&page->_mapcount);
>>>>>>
>>>>>> 	put_page(page);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>>>>> had in mind?
>>>>>
>>>>> Mostly, with the extra two observations:
>>>>>     [1] We only need to know the pin count when a write back kicks in
>>>>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>>>>         GUP is racing with a write back that might not the see the
>>>>>         elevated mapcount in time.
>>>>>
>>>>> So for [2]
>>>>>
>>>>> __get_user_pages()
>>>>> {
>>>>>     get_page(page);
>>>>>
>>>>>     if (!PageAnon) {
>>>>>         atomic_inc(page->_mapcount);
>>>>> +       if (PageWriteback(page)) {
>>>>> +           // Assume we are racing and curent write back will not see
>>>>> +           // the elevated mapcount so wait for current write back and
>>>>> +           // force page fault
>>>>> +           wait_on_page_writeback(page);
>>>>> +           // force slow path that will fault again
>>>>> +       }
>>>>>     }
>>>>> }
>>>>
>>>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>>>> should also increment page->_mapcount) under PTE lock. So at that point we
>>>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>>>> block on PTE lock to make PTE read-only and only after going through all
>>>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>>>> enough synchronization.
>>>>
>>>>> For [1] only needing pin count during write back turns page_mkclean into
>>>>> the perfect spot to check for that so:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>>     // rmap_walk need to change to count mapping and return value
>>>>>     // in .mapcount easy one
>>>>>     rmap_walk(page, &rwc);
>>>>>
>>>>>     // Big fat comment to explain what is going on
>>>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>>>> +       SetPageDMAPined(page);
>>>>> +   } else {
>>>>> +       ClearPageDMAPined(page);
>>>>> +   }
>>>>
>>>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>>>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>>>> check we do in page_mkclean() is wrong?
>>>>
>>>
>>> Ok so i found a solution for that. First GUP must wait for racing
>>> write back. If GUP see a valid write-able PTE and the page has
>>> write back flag set then it must back of as if the PTE was not
>>> valid to force fault. It is just a race with page_mkclean and we
>>> want ordering between the two. Note this is not strictly needed
>>> so we can relax that but i believe this ordering is better to do
>>> in GUP rather then having each single user of GUP test for this
>>> to avoid the race.
>>>
>>> GUP increase mapcount only after checking that it is not racing
>>> with writeback it also set a page flag (SetPageDMAPined(page)).
>>>
>>> When clearing a write-able pte we set a special entry inside the
>>> page table (might need a new special swap type for this) and change
>>> page_mkclean_one() to clear to 0 those special entry.
>>>
>>>
>>> Now page_mkclean:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>> +   int mapcount1, mapcount2;
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>> +   mapcount1 = page_mapcount(page);
>>>     // rmap_walk need to change to count mapping and return value
>>>     // in .mapcount easy one
>>>     rmap_walk(page, &rwc);
>>
>> So what prevents GUP_fast() to grab reference here and the test below would
>> think the page is not pinned? Or do you assume that every page_mkclean()
>> call will be protected by PageWriteback (currently it is not) so that
>> GUP_fast() blocks / bails out?

Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
for each page" question (ignoring, for now, what to actually *do* in response to 
that flag being set):

1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
This is probably less troubling than the next point, but it does undermine all the 
complicated schemes involving PageWriteback, that try to synchronize gup() with
page_mkclean().

2. Also, the mapcount approach here still does not reliably avoid false negatives
(that is, a page may have been gup'd, but page_mkclean could miss that): gup()
can always jump in and increment the mapcount, while page_mkclean is in the middle
of making (wrong) decisions based on that mapcount. There's no lock to prevent that.

Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

> 
> So GUP_fast() becomes:
> 
> GUP_fast_existing() { ... }
> GUP_fast()
> {
>     GUP_fast_existing();
> 
>     for (i = 0; i < npages; ++i) {
>         if (PageWriteback(pages[i])) {
>             // need to force slow path for this page
>         } else {
>             SetPageDmaPinned(pages[i]);
>             atomic_inc(pages[i]->mapcount);
>         }
>     }
> }
> 
> This is a minor slow down for GUP fast and it takes care of a
> write back race on behalf of caller. This means that page_mkclean
> can not see a mapcount value that increase. This simplify thing
> we can relax that. Note that what this is doing is making sure
> that GUP_fast never get lucky :) ie never GUP a page that is in
> the process of being write back but has not yet had its pte
> updated to reflect that.
> 
> 
>> But I think that detecting pinned pages with small false positive rate is
>> OK. The extra page bouncing will cost some performance but if it is rare,
>> then we are OK. So I think we can go for the simple version of detecting
>> pinned pages as you mentioned in some earlier email. We just have to be
>> sure there are no false negatives.
> 

Agree with that sentiment, but there are still false negatives and I'm not
yet seeing any solutions for that.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  3:27                                                 ` John Hubbard
@ 2019-01-03 14:57                                                   ` Jerome Glisse
  0 siblings, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-03 14:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 02, 2019 at 07:27:17PM -0800, John Hubbard wrote:
> On 1/2/19 5:55 PM, Jerome Glisse wrote:
> > On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> >>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> >>>> solution for that.  
> >>>>
> >>>> So as I understand it, this would use page->_mapcount to store both the real
> >>>> mapcount, and the dma pinned count (simply added together), but only do so for
> >>>> file-backed (non-anonymous) pages:
> >>>>
> >>>>
> >>>> __get_user_pages()
> >>>> {
> >>>> 	...
> >>>> 	get_page(page);
> >>>>
> >>>> 	if (!PageAnon)
> >>>> 		atomic_inc(page->_mapcount);
> >>>> 	...
> >>>> }
> >>>>
> >>>> put_user_page(struct page *page)
> >>>> {
> >>>> 	...
> >>>> 	if (!PageAnon)
> >>>> 		atomic_dec(&page->_mapcount);
> >>>>
> >>>> 	put_page(page);
> >>>> 	...
> >>>> }
> >>>>
> >>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> >>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> >>>> had in mind?
> >>>
> >>> Mostly, with the extra two observations:
> >>>     [1] We only need to know the pin count when a write back kicks in
> >>>     [2] We need to protect GUP code with wait_for_write_back() in case
> >>>         GUP is racing with a write back that might not the see the
> >>>         elevated mapcount in time.
> >>>
> >>> So for [2]
> >>>
> >>> __get_user_pages()
> >>> {
> >>>     get_page(page);
> >>>
> >>>     if (!PageAnon) {
> >>>         atomic_inc(page->_mapcount);
> >>> +       if (PageWriteback(page)) {
> >>> +           // Assume we are racing and curent write back will not see
> >>> +           // the elevated mapcount so wait for current write back and
> >>> +           // force page fault
> >>> +           wait_on_page_writeback(page);
> >>> +           // force slow path that will fault again
> >>> +       }
> >>>     }
> >>> }
> >>
> >> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> >> should also increment page->_mapcount) under PTE lock. So at that point we
> >> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> >> block on PTE lock to make PTE read-only and only after going through all
> >> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> >> enough synchronization.
> >>
> >>> For [1] only needing pin count during write back turns page_mkclean into
> >>> the perfect spot to check for that so:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>>     int cleaned = 0;
> >>> +   int real_mapcount = 0;
> >>>     struct address_space *mapping;
> >>>     struct rmap_walk_control rwc = {
> >>>         .arg = (void *)&cleaned,
> >>>         .rmap_one = page_mkclean_one,
> >>>         .invalid_vma = invalid_mkclean_vma,
> >>> +       .mapcount = &real_mapcount,
> >>>     };
> >>>
> >>>     BUG_ON(!PageLocked(page));
> >>>
> >>>     if (!page_mapped(page))
> >>>         return 0;
> >>>
> >>>     mapping = page_mapping(page);
> >>>     if (!mapping)
> >>>         return 0;
> >>>
> >>>     // rmap_walk need to change to count mapping and return value
> >>>     // in .mapcount easy one
> >>>     rmap_walk(page, &rwc);
> >>>
> >>>     // Big fat comment to explain what is going on
> >>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> >>> +       SetPageDMAPined(page);
> >>> +   } else {
> >>> +       ClearPageDMAPined(page);
> >>> +   }
> >>
> >> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> >> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> >> check we do in page_mkclean() is wrong?
> >>
> > 
> > Ok so i found a solution for that. First GUP must wait for racing
> > write back. If GUP see a valid write-able PTE and the page has
> > write back flag set then it must back of as if the PTE was not
> > valid to force fault. It is just a race with page_mkclean and we
> > want ordering between the two. Note this is not strictly needed
> > so we can relax that but i believe this ordering is better to do
> > in GUP rather then having each single user of GUP test for this
> > to avoid the race.
> > 
> > GUP increase mapcount only after checking that it is not racing
> > with writeback it also set a page flag (SetPageDMAPined(page)).
> > 
> > When clearing a write-able pte we set a special entry inside the
> > page table (might need a new special swap type for this) and change
> > page_mkclean_one() to clear to 0 those special entry.
> > 
> > 
> > Now page_mkclean:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > +   int mapcount1, mapcount2;
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> > +   mapcount1 = page_mapcount(page);
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> > +   if (PageDMAPined(page)) {
> > +       int rc2;
> > +
> > +       if (mapcount1 == real_count) {
> > +           /* Page is no longer pin, no zap pte race */
> > +           ClearPageDMAPined(page);
> > +           goto out;
> > +       }
> > +       /* No new mapping of the page so mp1 < rc is illegal. */
> > +       VM_BUG_ON(mapcount1 < real_count);
> > +       /* Page might be pin. */
> > +       mapcount2 = page_mapcount(page);
> > +       if (mapcount2 > real_count) {
> > +           /* Page is pin for sure. */
> > +           goto out;
> > +       }
> > +       /* We had a race with zap pte we need to rewalk again. */
> > +       rc2 = real_mapcount;
> > +       real_mapcount = 0;
> > +       rwc.rmap_one = page_pin_one;
> > +       rmap_walk(page, &rwc);
> > +       if (mapcount2 <= (real_count + rc2)) {
> > +           /* Page is no longer pin */
> > +           ClearPageDMAPined(page);
> > +       }
> > +       /* At this point the page pin flag reflect pin status of the page */
> 
> Until...what? In other words, what is providing synchronization here?

It can still race with put_user_page(), but this is fine: it means
that a racing put_user_page() will not be taken into account and that
the page will still be considered pinned for this round, even though
the last pin might just have been dropped.

It is all about getting the "real" mapcount value at one point in
time while racing with something that zaps ptes. What you want is to
be able to count the number of pte zaps that are racing with you. If
there are none, you know you have a stable real mapcount value; if
there are, you can account for them in the real mapcount and compare
it against the page's mapcount. The worst case is reporting a page as
pinned when it has just been released, but the next writeback will
catch that (unless the page is GUPed again).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  9:26                                                 ` Jan Kara
@ 2019-01-03 14:44                                                   ` Jerome Glisse
  2019-01-11  2:59                                                     ` John Hubbard
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2019-01-03 14:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> > On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > > > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > > > solution for that.  
> > > > > 
> > > > > So as I understand it, this would use page->_mapcount to store both the real
> > > > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > > > file-backed (non-anonymous) pages:
> > > > > 
> > > > > 
> > > > > __get_user_pages()
> > > > > {
> > > > > 	...
> > > > > 	get_page(page);
> > > > > 
> > > > > 	if (!PageAnon)
> > > > > 		atomic_inc(page->_mapcount);
> > > > > 	...
> > > > > }
> > > > > 
> > > > > put_user_page(struct page *page)
> > > > > {
> > > > > 	...
> > > > > 	if (!PageAnon)
> > > > > 		atomic_dec(&page->_mapcount);
> > > > > 
> > > > > 	put_page(page);
> > > > > 	...
> > > > > }
> > > > > 
> > > > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > > > had in mind?
> > > > 
> > > > Mostly, with the extra two observations:
> > > >     [1] We only need to know the pin count when a write back kicks in
> > > >     [2] We need to protect GUP code with wait_for_write_back() in case
> > > >         GUP is racing with a write back that might not the see the
> > > >         elevated mapcount in time.
> > > > 
> > > > So for [2]
> > > > 
> > > > __get_user_pages()
> > > > {
> > > >     get_page(page);
> > > > 
> > > >     if (!PageAnon) {
> > > >         atomic_inc(page->_mapcount);
> > > > +       if (PageWriteback(page)) {
> > > > +           // Assume we are racing and curent write back will not see
> > > > +           // the elevated mapcount so wait for current write back and
> > > > +           // force page fault
> > > > +           wait_on_page_writeback(page);
> > > > +           // force slow path that will fault again
> > > > +       }
> > > >     }
> > > > }
> > > 
> > > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > > should also increment page->_mapcount) under PTE lock. So at that point we
> > > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > > block on PTE lock to make PTE read-only and only after going through all
> > > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > > enough synchronization.
> > > 
> > > > For [1] only needing pin count during write back turns page_mkclean into
> > > > the perfect spot to check for that so:
> > > > 
> > > > int page_mkclean(struct page *page)
> > > > {
> > > >     int cleaned = 0;
> > > > +   int real_mapcount = 0;
> > > >     struct address_space *mapping;
> > > >     struct rmap_walk_control rwc = {
> > > >         .arg = (void *)&cleaned,
> > > >         .rmap_one = page_mkclean_one,
> > > >         .invalid_vma = invalid_mkclean_vma,
> > > > +       .mapcount = &real_mapcount,
> > > >     };
> > > > 
> > > >     BUG_ON(!PageLocked(page));
> > > > 
> > > >     if (!page_mapped(page))
> > > >         return 0;
> > > > 
> > > >     mapping = page_mapping(page);
> > > >     if (!mapping)
> > > >         return 0;
> > > > 
> > > >     // rmap_walk need to change to count mapping and return value
> > > >     // in .mapcount easy one
> > > >     rmap_walk(page, &rwc);
> > > > 
> > > >     // Big fat comment to explain what is going on
> > > > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > > > +       SetPageDMAPinned(page);
> > > > +   } else {
> > > > +       ClearPageDMAPinned(page);
> > > > +   }
> > > 
> > > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > > check we do in page_mkclean() is wrong?
> > > 
> > 
> > Ok so I found a solution for that. First, GUP must wait for racing
> > write back: if GUP sees a valid writeable PTE and the page has the
> > write back flag set, then it must back off as if the PTE were not
> > valid, to force a fault. It is just a race with page_mkclean() and
> > we want ordering between the two. Note this is not strictly needed,
> > so we could relax it, but I believe this ordering is better done in
> > GUP rather than having every single user of GUP test for it to
> > avoid the race.
> > 
> > GUP increases the mapcount only after checking that it is not
> > racing with write back; it also sets a page flag
> > (SetPageDMAPinned(page)).
> > 
> > When clearing a writeable PTE we set a special entry inside the
> > page table (we might need a new special swap type for this) and
> > change page_mkclean_one() to clear those special entries to 0.
> > 
> > 
> > Now page_mkclean:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > +   int mapcount1, mapcount2;
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> > +   mapcount1 = page_mapcount(page);
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> 
> So what prevents GUP_fast() from grabbing a reference here, so that the
> test below would think the page is not pinned? Or do you assume that
> every page_mkclean() call will be protected by PageWriteback (currently
> it is not) so that GUP_fast() blocks / bails out?

So GUP_fast() becomes:

GUP_fast_existing() { ... }
GUP_fast()
{
    GUP_fast_existing();

    for (i = 0; i < npages; ++i) {
        if (PageWriteback(pages[i])) {
            // need to force slow path for this page
        } else {
            SetPageDmaPinned(pages[i]);
            atomic_inc(&pages[i]->_mapcount);
        }
    }
}

This is a minor slowdown for GUP_fast(), and it takes care of the
write back race on behalf of the caller. It means that page_mkclean()
can never see a mapcount value that increases under it; this
simplifies things, and we can relax it later. Note that what this is
doing is making sure that GUP_fast() never gets lucky :) i.e. it
never GUPs a page that is in the process of being written back but
has not yet had its PTE updated to reflect that.


> But I think that detecting pinned pages with small false positive rate is
> OK. The extra page bouncing will cost some performance but if it is rare,
> then we are OK. So I think we can go for the simple version of detecting
> pinned pages as you mentioned in some earlier email. We just have to be
> sure there are no false negatives.

What worries me is that a page might stay with the DMA pinned flag
forever if it keeps getting unlucky, i.e. some process keeps mapping
it after the last write back and keeps zapping that mapping while
racing with page_mkclean(). This should be unlikely, but nothing
would prevent it. I am fine with living with this, but the page
might become a zombie GUP :)

Maybe we can start with the simple version, add a big fat comment,
and see if anyone complains about a zombie GUP ...

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  1:55                                               ` Jerome Glisse
  2019-01-03  3:27                                                 ` John Hubbard
@ 2019-01-03  9:26                                                 ` Jan Kara
  2019-01-03 14:44                                                   ` Jerome Glisse
  1 sibling, 1 reply; 207+ messages in thread
From: Jan Kara @ 2019-01-03  9:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > > solution for that.  
> > > > 
> > > > So as I understand it, this would use page->_mapcount to store both the real
> > > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > > file-backed (non-anonymous) pages:
> > > > 
> > > > 
> > > > __get_user_pages()
> > > > {
> > > > 	...
> > > > 	get_page(page);
> > > > 
> > > > 	if (!PageAnon)
> > > > 		atomic_inc(page->_mapcount);
> > > > 	...
> > > > }
> > > > 
> > > > put_user_page(struct page *page)
> > > > {
> > > > 	...
> > > > 	if (!PageAnon)
> > > > 		atomic_dec(&page->_mapcount);
> > > > 
> > > > 	put_page(page);
> > > > 	...
> > > > }
> > > > 
> > > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > > had in mind?
> > > 
> > > Mostly, with the extra two observations:
> > >     [1] We only need to know the pin count when a write back kicks in
> > >     [2] We need to protect GUP code with wait_for_write_back() in case
> > >         GUP is racing with a write back that might not see the
> > >         elevated mapcount in time.
> > > 
> > > So for [2]
> > > 
> > > __get_user_pages()
> > > {
> > >     get_page(page);
> > > 
> > >     if (!PageAnon) {
> > >         atomic_inc(page->_mapcount);
> > > +       if (PageWriteback(page)) {
> > > +           // Assume we are racing and current write back will not see
> > > +           // the elevated mapcount so wait for current write back and
> > > +           // force page fault
> > > +           wait_on_page_writeback(page);
> > > +           // force slow path that will fault again
> > > +       }
> > >     }
> > > }
> > 
> > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > should also increment page->_mapcount) under PTE lock. So at that point we
> > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > block on PTE lock to make PTE read-only and only after going through all
> > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > enough synchronization.
> > 
> > > For [1] only needing pin count during write back turns page_mkclean into
> > > the perfect spot to check for that so:
> > > 
> > > int page_mkclean(struct page *page)
> > > {
> > >     int cleaned = 0;
> > > +   int real_mapcount = 0;
> > >     struct address_space *mapping;
> > >     struct rmap_walk_control rwc = {
> > >         .arg = (void *)&cleaned,
> > >         .rmap_one = page_mkclean_one,
> > >         .invalid_vma = invalid_mkclean_vma,
> > > +       .mapcount = &real_mapcount,
> > >     };
> > > 
> > >     BUG_ON(!PageLocked(page));
> > > 
> > >     if (!page_mapped(page))
> > >         return 0;
> > > 
> > >     mapping = page_mapping(page);
> > >     if (!mapping)
> > >         return 0;
> > > 
> > >     // rmap_walk need to change to count mapping and return value
> > >     // in .mapcount easy one
> > >     rmap_walk(page, &rwc);
> > > 
> > >     // Big fat comment to explain what is going on
> > > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > > +       SetPageDMAPinned(page);
> > > +   } else {
> > > +       ClearPageDMAPinned(page);
> > > +   }
> > 
> > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > check we do in page_mkclean() is wrong?
> > 
> 
> Ok so I found a solution for that. First, GUP must wait for racing
> write back: if GUP sees a valid writeable PTE and the page has the
> write back flag set, then it must back off as if the PTE were not
> valid, to force a fault. It is just a race with page_mkclean() and
> we want ordering between the two. Note this is not strictly needed,
> so we could relax it, but I believe this ordering is better done in
> GUP rather than having every single user of GUP test for it to
> avoid the race.
> 
> GUP increases the mapcount only after checking that it is not
> racing with write back; it also sets a page flag
> (SetPageDMAPinned(page)).
> 
> When clearing a writeable PTE we set a special entry inside the
> page table (we might need a new special swap type for this) and
> change page_mkclean_one() to clear those special entries to 0.
> 
> 
> Now page_mkclean:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> +   int mapcount1, mapcount2;
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
> +   mapcount1 = page_mapcount(page);
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);

So what prevents GUP_fast() from grabbing a reference here, so that the
test below would think the page is not pinned? Or do you assume that
every page_mkclean() call will be protected by PageWriteback (currently
it is not) so that GUP_fast() blocks / bails out?

But I think that detecting pinned pages with small false positive rate is
OK. The extra page bouncing will cost some performance but if it is rare,
then we are OK. So I think we can go for the simple version of detecting
pinned pages as you mentioned in some earlier email. We just have to be
sure there are no false negatives.

								Honza

> +   if (PageDMAPinned(page)) {
> +       int rc2;
> +
> +       if (mapcount1 == real_mapcount) {
> +           /* Page is no longer pinned, no zap pte race */
> +           ClearPageDMAPinned(page);
> +           goto out;
> +       }
> +       /* No new mapping of the page, so mapcount1 < real_mapcount is illegal. */
> +       VM_BUG_ON(mapcount1 < real_mapcount);
> +       /* Page might be pinned. */
> +       mapcount2 = page_mapcount(page);
> +       if (mapcount2 > real_mapcount) {
> +           /* Page is pinned for sure. */
> +           goto out;
> +       }
> +       /* We had a race with zap pte; we need to walk again. */
> +       rc2 = real_mapcount;
> +       real_mapcount = 0;
> +       rwc.rmap_one = page_pin_one;
> +       rmap_walk(page, &rwc);
> +       if (mapcount2 <= (real_mapcount + rc2)) {
> +           /* Page is no longer pinned */
> +           ClearPageDMAPinned(page);
> +       }
> +       /* At this point the page pin flag reflects the pin status of the page */
> +   }
> +
> +out:
>     ...
> }
> 
> The page_pin_one() function counts the number of special PTE
> entries, which matches the number of PTEs that have been zapped
> since the first reverse map walk.
> 
> So in the worst case a page that was pinned by GUP needs 2 reverse
> map walks during page_mkclean(). Moreover, this is only needed if
> we race with something that clears PTEs. I believe this is an
> acceptable worst case. I will work on an RFC patchset next week
> (once I am done with email catch-up).
> 
> 
> I do not think I made a mistake here; I have been torturing my mind
> trying to think of any race scenario, and I believe it holds up
> against any racing zap and page_mkclean().
> 
> Cheers,
> Jérôme
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  1:55                                               ` Jerome Glisse
@ 2019-01-03  3:27                                                 ` John Hubbard
  2019-01-03 14:57                                                   ` Jerome Glisse
  2019-01-03  9:26                                                 ` Jan Kara
  1 sibling, 1 reply; 207+ messages in thread
From: John Hubbard @ 2019-01-03  3:27 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/2/19 5:55 PM, Jerome Glisse wrote:
> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>> solution for that.  
>>>>
>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>> file-backed (non-anonymous) pages:
>>>>
>>>>
>>>> __get_user_pages()
>>>> {
>>>> 	...
>>>> 	get_page(page);
>>>>
>>>> 	if (!PageAnon)
>>>> 		atomic_inc(page->_mapcount);
>>>> 	...
>>>> }
>>>>
>>>> put_user_page(struct page *page)
>>>> {
>>>> 	...
>>>> 	if (!PageAnon)
>>>> 		atomic_dec(&page->_mapcount);
>>>>
>>>> 	put_page(page);
>>>> 	...
>>>> }
>>>>
>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>>> had in mind?
>>>
>>> Mostly, with the extra two observations:
>>>     [1] We only need to know the pin count when a write back kicks in
>>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>>         GUP is racing with a write back that might not see the
>>>         elevated mapcount in time.
>>>
>>> So for [2]
>>>
>>> __get_user_pages()
>>> {
>>>     get_page(page);
>>>
>>>     if (!PageAnon) {
>>>         atomic_inc(page->_mapcount);
>>> +       if (PageWriteback(page)) {
>>> +           // Assume we are racing and current write back will not see
>>> +           // the elevated mapcount so wait for current write back and
>>> +           // force page fault
>>> +           wait_on_page_writeback(page);
>>> +           // force slow path that will fault again
>>> +       }
>>>     }
>>> }
>>
>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>> should also increment page->_mapcount) under PTE lock. So at that point we
>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>> block on PTE lock to make PTE read-only and only after going through all
>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>> enough synchronization.
>>
>>> For [1] only needing pin count during write back turns page_mkclean into
>>> the perfect spot to check for that so:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>>     // rmap_walk need to change to count mapping and return value
>>>     // in .mapcount easy one
>>>     rmap_walk(page, &rwc);
>>>
>>>     // Big fat comment to explain what is going on
>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>> +       SetPageDMAPinned(page);
>>> +   } else {
>>> +       ClearPageDMAPinned(page);
>>> +   }
>>
>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>> check we do in page_mkclean() is wrong?
>>
> 
> Ok so I found a solution for that. First, GUP must wait for racing
> write back: if GUP sees a valid writeable PTE and the page has the
> write back flag set, then it must back off as if the PTE were not
> valid, to force a fault. It is just a race with page_mkclean() and
> we want ordering between the two. Note this is not strictly needed,
> so we could relax it, but I believe this ordering is better done in
> GUP rather than having every single user of GUP test for it to
> avoid the race.
> 
> GUP increases the mapcount only after checking that it is not
> racing with write back; it also sets a page flag
> (SetPageDMAPinned(page)).
> 
> When clearing a writeable PTE we set a special entry inside the
> page table (we might need a new special swap type for this) and
> change page_mkclean_one() to clear those special entries to 0.
> 
> 
> Now page_mkclean:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> +   int mapcount1, mapcount2;
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
> +   mapcount1 = page_mapcount(page);
> 
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);
> 
> +   if (PageDMAPinned(page)) {
> +       int rc2;
> +
> +       if (mapcount1 == real_mapcount) {
> +           /* Page is no longer pinned, no zap pte race */
> +           ClearPageDMAPinned(page);
> +           goto out;
> +       }
> +       /* No new mapping of the page, so mapcount1 < real_mapcount is illegal. */
> +       VM_BUG_ON(mapcount1 < real_mapcount);
> +       /* Page might be pinned. */
> +       mapcount2 = page_mapcount(page);
> +       if (mapcount2 > real_mapcount) {
> +           /* Page is pinned for sure. */
> +           goto out;
> +       }
> +       /* We had a race with zap pte; we need to walk again. */
> +       rc2 = real_mapcount;
> +       real_mapcount = 0;
> +       rwc.rmap_one = page_pin_one;
> +       rmap_walk(page, &rwc);
> +       if (mapcount2 <= (real_mapcount + rc2)) {
> +           /* Page is no longer pinned */
> +           ClearPageDMAPinned(page);
> +       }
> +       /* At this point the page pin flag reflects the pin status of the page */

Until...what? In other words, what is providing synchronization here?

thanks,
-- 
John Hubbard
NVIDIA

> +   }
> +
> +out:
>     ...
> }
> 
> The page_pin_one() function counts the number of special PTE
> entries, which matches the number of PTEs that have been zapped
> since the first reverse map walk.
> 
> So in the worst case a page that was pinned by GUP needs 2 reverse
> map walks during page_mkclean(). Moreover, this is only needed if
> we race with something that clears PTEs. I believe this is an
> acceptable worst case. I will work on an RFC patchset next week
> (once I am done with email catch-up).
> 
> 
> I do not think I made a mistake here; I have been torturing my mind
> trying to think of any race scenario, and I believe it holds up
> against any racing zap and page_mkclean().
> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
  2018-12-20 16:49                                               ` Jerome Glisse
@ 2019-01-03  1:55                                               ` Jerome Glisse
  2019-01-03  3:27                                                 ` John Hubbard
  2019-01-03  9:26                                                 ` Jan Kara
  2 siblings, 2 replies; 207+ messages in thread
From: Jerome Glisse @ 2019-01-03  1:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > solution for that.  
> > > 
> > > So as I understand it, this would use page->_mapcount to store both the real
> > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > file-backed (non-anonymous) pages:
> > > 
> > > 
> > > __get_user_pages()
> > > {
> > > 	...
> > > 	get_page(page);
> > > 
> > > 	if (!PageAnon)
> > > 		atomic_inc(page->_mapcount);
> > > 	...
> > > }
> > > 
> > > put_user_page(struct page *page)
> > > {
> > > 	...
> > > 	if (!PageAnon)
> > > 		atomic_dec(&page->_mapcount);
> > > 
> > > 	put_page(page);
> > > 	...
> > > }
> > > 
> > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > had in mind?
> > 
> > Mostly, with the extra two observations:
> >     [1] We only need to know the pin count when a write back kicks in
> >     [2] We need to protect GUP code with wait_for_write_back() in case
> >         GUP is racing with a write back that might not see the
> >         elevated mapcount in time.
> > 
> > So for [2]
> > 
> > __get_user_pages()
> > {
> >     get_page(page);
> > 
> >     if (!PageAnon) {
> >         atomic_inc(page->_mapcount);
> > +       if (PageWriteback(page)) {
> > +           // Assume we are racing and current write back will not see
> > +           // the elevated mapcount so wait for current write back and
> > +           // force page fault
> > +           wait_on_page_writeback(page);
> > +           // force slow path that will fault again
> > +       }
> >     }
> > }
> 
> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> should also increment page->_mapcount) under PTE lock. So at that point we
> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> block on PTE lock to make PTE read-only and only after going through all
> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> enough synchronization.
> 
> > For [1] only needing pin count during write back turns page_mkclean into
> > the perfect spot to check for that so:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> >     // Big fat comment to explain what is going on
> > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > +       SetPageDMAPinned(page);
> > +   } else {
> > +       ClearPageDMAPinned(page);
> > +   }
> 
> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> check we do in page_mkclean() is wrong?
> 

Ok so I found a solution for that. First, GUP must wait for racing
write back: if GUP sees a valid writeable PTE and the page has the
write back flag set, then it must back off as if the PTE were not
valid, to force a fault. It is just a race with page_mkclean() and
we want ordering between the two. Note this is not strictly needed,
so we could relax it, but I believe this ordering is better done in
GUP rather than having every single user of GUP test for it to
avoid the race.

GUP increases the mapcount only after checking that it is not
racing with write back; it also sets a page flag
(SetPageDMAPinned(page)).

When clearing a writeable PTE we set a special entry inside the
page table (we might need a new special swap type for this) and
change page_mkclean_one() to clear those special entries to 0.


Now page_mkclean:

int page_mkclean(struct page *page)
{
    int cleaned = 0;
+   int real_mapcount = 0;
    struct address_space *mapping;
    struct rmap_walk_control rwc = {
        .arg = (void *)&cleaned,
        .rmap_one = page_mkclean_one,
        .invalid_vma = invalid_mkclean_vma,
+       .mapcount = &real_mapcount,
    };
+   int mapcount1, mapcount2;

    BUG_ON(!PageLocked(page));

    if (!page_mapped(page))
        return 0;

    mapping = page_mapping(page);
    if (!mapping)
        return 0;

+   mapcount1 = page_mapcount(page);

    // rmap_walk need to change to count mapping and return value
    // in .mapcount easy one
    rmap_walk(page, &rwc);

+   if (PageDMAPinned(page)) {
+       int rc2;
+
+       if (mapcount1 == real_mapcount) {
+           /* Page is no longer pinned, no zap pte race */
+           ClearPageDMAPinned(page);
+           goto out;
+       }
+       /* No new mapping of the page, so mapcount1 < real_mapcount is illegal. */
+       VM_BUG_ON(mapcount1 < real_mapcount);
+       /* Page might be pinned. */
+       mapcount2 = page_mapcount(page);
+       if (mapcount2 > real_mapcount) {
+           /* Page is pinned for sure. */
+           goto out;
+       }
+       /* We had a race with zap pte; we need to walk again. */
+       rc2 = real_mapcount;
+       real_mapcount = 0;
+       rwc.rmap_one = page_pin_one;
+       rmap_walk(page, &rwc);
+       if (mapcount2 <= (real_mapcount + rc2)) {
+           /* Page is no longer pinned */
+           ClearPageDMAPinned(page);
+       }
+       /* At this point the page pin flag reflects the pin status of the page */
+   }
+
+out:
    ...
}

The page_pin_one() function counts the number of special PTE
entries, which matches the number of PTEs that have been zapped
since the first reverse map walk.

So in the worst case a page that was pinned by GUP needs 2 reverse
map walks during page_mkclean(). Moreover, this is only needed if
we race with something that clears PTEs. I believe this is an
acceptable worst case. I will work on an RFC patchset next week
(once I am done with email catch-up).


I do not think I made a mistake here; I have been torturing my mind
trying to think of any race scenario, and I believe it holds up
against any racing zap and page_mkclean().

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-20 16:50                                                 ` Jerome Glisse
@ 2018-12-20 16:57                                                   ` Dan Williams
  0 siblings, 0 replies; 207+ messages in thread
From: Dan Williams @ 2018-12-20 16:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 20, 2018 at 8:50 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> > On 12/19/18 3:08 AM, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > >>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > >>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> > >>> solution for that.
> > >>>
> > >>> So as I understand it, this would use page->_mapcount to store both the real
> > >>> mapcount, and the dma pinned count (simply added together), but only do so for
> > >>> file-backed (non-anonymous) pages:
> > >>>
> > >>>
> > >>> __get_user_pages()
> > >>> {
> > >>>   ...
> > >>>   get_page(page);
> > >>>
> > >>>   if (!PageAnon)
> > >>>           atomic_inc(page->_mapcount);
> > >>>   ...
> > >>> }
> > >>>
> > >>> put_user_page(struct page *page)
> > >>> {
> > >>>   ...
> > >>>   if (!PageAnon)
> > >>>           atomic_dec(&page->_mapcount);
> > >>>
> > >>>   put_page(page);
> > >>>   ...
> > >>> }
> > >>>
> > >>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > >>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you
> > >>> had in mind?
> > >>
> > >> Mostly, with the extra two observations:
> > >>     [1] We only need to know the pin count when a write back kicks in
> > >>     [2] We need to protect GUP code with wait_for_write_back() in case
> > >>         GUP is racing with a write back that might not see the
> > >>         elevated mapcount in time.
> > >>
> > >> So for [2]
> > >>
> > >> __get_user_pages()
> > >> {
> > >>     get_page(page);
> > >>
> > >>     if (!PageAnon) {
> > >>         atomic_inc(page->_mapcount);
> > >> +       if (PageWriteback(page)) {
> > >> +           // Assume we are racing and current write back will not see
> > >> +           // the elevated mapcount so wait for current write back and
> > >> +           // force page fault
> > >> +           wait_on_page_writeback(page);
> > >> +           // force slow path that will fault again
> > >> +       }
> > >>     }
> > >> }
> > >
> > > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > > should also increment page->_mapcount) under PTE lock. So at that point we
> > > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > > block on PTE lock to make PTE read-only and only after going through all
> > > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > > enough synchronization.
> > >
> > >> For [1] only needing pin count during write back turns page_mkclean into
> > >> the perfect spot to check for that so:
> > >>
> > >> int page_mkclean(struct page *page)
> > >> {
> > >>     int cleaned = 0;
> > >> +   int real_mapcount = 0;
> > >>     struct address_space *mapping;
> > >>     struct rmap_walk_control rwc = {
> > >>         .arg = (void *)&cleaned,
> > >>         .rmap_one = page_mkclean_one,
> > >>         .invalid_vma = invalid_mkclean_vma,
> > >> +       .mapcount = &real_mapcount,
> > >>     };
> > >>
> > >>     BUG_ON(!PageLocked(page));
> > >>
> > >>     if (!page_mapped(page))
> > >>         return 0;
> > >>
> > >>     mapping = page_mapping(page);
> > >>     if (!mapping)
> > >>         return 0;
> > >>
> > >>     // rmap_walk need to change to count mapping and return value
> > >>     // in .mapcount easy one
> > >>     rmap_walk(page, &rwc);
> > >>
> > >>     // Big fat comment to explain what is going on
> > >> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > >> +       SetPageDMAPined(page);
> > >> +   } else {
> > >> +       ClearPageDMAPined(page);
> > >> +   }
> > >
> > > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > > check we do in page_mkclean() is wrong?
> >
> > Right. This looks like a dead end, after all. We can't lock a whole chunk
> > of "all these are mapped, hold still while we count you" pages. It's not
> > designed to allow that at all.
> >
> > IMHO, we are now back to something like dynamic_page, which provides an
> > independent dma pinned count.
>
> I will keep looking, because allocating a structure for every GUP is
> insane to me; there are users out there that GUP gigabytes of data

This is not the common case.

> and it is going to waste tons of memory just to fix crappy hardware.

This is the common case.

Please refrain from the hyperbolic assessments.
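
For intuition, the shared-counter scheme under discussion can be sketched as a toy model (plain Python, not kernel code; every name below is illustrative). The kernel's page->_mapcount starts at -1 for an unmapped page and page_mapped() is essentially "_mapcount >= 0", so folding GUP pins into the same counter makes a pinned page look mapped even after every real mapping is gone:

```python
class Page:
    """Toy model: file-backed pages fold the DMA pin count into
    _mapcount, so page_mapped() also reports pinned pages."""
    def __init__(self, anon=False):
        self.anon = anon
        self._mapcount = -1   # kernel convention: -1 == no mappings
        self._refcount = 0

def page_mapped(page):
    return page._mapcount >= 0

def map_page(page):       # stand-in for page_add_file_rmap()
    page._mapcount += 1

def unmap_page(page):     # stand-in for page_remove_rmap()
    page._mapcount -= 1

def gup(page):            # __get_user_pages() per the snippet above
    page._refcount += 1            # get_page()
    if not page.anon:
        page._mapcount += 1        # the extra, indistinguishable count

def put_user_page(page):
    if not page.anon:
        page._mapcount -= 1
    page._refcount -= 1            # put_page()

page = Page()
map_page(page)
gup(page)
unmap_page(page)                   # all real mappings gone...
print(page_mapped(page))           # ...the pin keeps it "mapped": True
put_user_page(page)
print(page_mapped(page))           # False
```

The model also makes the weakness visible: a consumer reading the counter alone cannot tell one real mapping plus one pin apart from two real mappings, which is exactly what the page_mkclean() accounting discussed in this thread tries to resolve.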

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 22:33                                             ` Dave Chinner
  2018-12-20  9:07                                               ` Jan Kara
@ 2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jason Gunthorpe, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Thu, Dec 20, 2018 at 09:33:12AM +1100, Dave Chinner wrote:
> On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> > On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > > 
> > > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just burn it with napalm and thermite
> > > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > > the underlying storage doesn't already require it.
> > > > 
> > > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > > then just do it.
> > > 
> > > O_DIRECT IO *isn't the problem*.
> > 
> > That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> > than the problem with RDMA but currently O_DIRECT IO can crash your machine
> > or corrupt data the same way RDMA can.
> 
> It's not O_DIRECT - it's a "transient page pin". Yes, there are
> problems with that right now, but as we've discussed the issues can
> be avoided by:
> 
> 	a) stable pages always blocking in ->page_mkwrite;
> 	b) blocking in write_cache_pages() on an elevated map count
> 	when WB_SYNC_ALL is set; and
> 	c) blocking in truncate_pagecache() on an elevated map
> 	count.
> 
> That prevents:
> 	a) gup pinning a page that is currently under writeback and
> 	modifying it while IO is in flight;
> 	b) a dirty page being written back while it is pinned by
> 	GUP, thereby turning it clean before the gup reference calls
> 	set_page_dirty() on DMA completion; and
> 	c) truncate/hole punch for pulling the page out from under
> 	the gup operation that is ongoing.
> 
> This is an adequate solution for a short term transient pins. It
> doesn't break fsync(), it doesn't change how truncate works and it
> fixes the problem where a mapped file is the buffer for an O_DIRECT
> IO rather than the open fd and that buffer file gets truncated.
> IOWs, transient pins (and hence O_DIRECT) is not really the problem
> here.
> 
> The problem with this is that blocking on elevated map count does
> not work for long term pins (i.e. gup_longterm()) which are defined
> as:
> 
>  * "longterm" == userspace controlled elevated page count lifetime.
>  * Contrast this to iov_iter_get_pages() usages which are transient.
> 
> It's the "userspace controlled" part of the long term gup pin that
> is the problem we need to solve. If we treat them the same as a
> transient pin, then this leads to fsync() and truncate either
> blocking for a long time waiting for userspace to drop its gup
> reference, or having to be failed with something like EBUSY or
> EAGAIN.
> 
> This is the problem revokable file layout leases solve. The NFS
> server is already using this for revoking delegations from remote
> clients. Userspace holding long term GUP references is essentially
> the same thing - it's a delegation of file ownership to userspace
> that the filesystem must be able to revoke when it needs to run
> internal and/or 3rd-party requested operations on that delegated
> file.
> 
> If the hardware supports page faults, then we can further optimise
> the long term pin case to relax stable page requirements and allow
> page cleaning to occur while there are long term pins. In this case,
> the hardware will write-fault the clean pages appropriately before
> DMA is initiated, and hence avoid the need for data integrity
> operations like fsync() to trigger lease revocation. However,
> truncate/hole punch still requires lease revocation to work sanely,
> especially when we consider DAX *must* ensure there are no remaining
> references to the physical pmem page after the space has been freed.

truncate does not require lease revocation for faulting hardware:
truncate will trigger an mmu notifier callback which will invalidate
the hardware page table. On the next access the hardware will fault and
this will turn into a regular page fault from the kernel's point of
view.

So truncate/reflink and all fs expectations do hold for faulting
hardware. It is exactly like the CPU page table: if the CPU page table
is properly updated then so will be the hardware one.

Note that such hardware also abides by munmap(), so a hardware mapping
does not outlive the vma.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-20 10:54                                               ` John Hubbard
@ 2018-12-20 16:50                                                 ` Jerome Glisse
  2018-12-20 16:57                                                   ` Dan Williams
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> On 12/19/18 3:08 AM, Jan Kara wrote:
> > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> >>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> >>> solution for that.  
> >>>
> >>> So as I understand it, this would use page->_mapcount to store both the real
> >>> mapcount, and the dma pinned count (simply added together), but only do so for
> >>> file-backed (non-anonymous) pages:
> >>>
> >>>
> >>> __get_user_pages()
> >>> {
> >>> 	...
> >>> 	get_page(page);
> >>>
> >>> 	if (!PageAnon)
> >>> 		atomic_inc(page->_mapcount);
> >>> 	...
> >>> }
> >>>
> >>> put_user_page(struct page *page)
> >>> {
> >>> 	...
> >>> 	if (!PageAnon)
> >>> 		atomic_dec(&page->_mapcount);
> >>>
> >>> 	put_page(page);
> >>> 	...
> >>> }
> >>>
> >>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> >>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> >>> had in mind?
> >>
> >> Mostly, with the extra two observations:
> >>     [1] We only need to know the pin count when a write back kicks in
> >>     [2] We need to protect GUP code with wait_for_write_back() in case
> >>         GUP is racing with a write back that might not see the
> >>         elevated mapcount in time.
> >>
> >> So for [2]
> >>
> >> __get_user_pages()
> >> {
> >>     get_page(page);
> >>
> >>     if (!PageAnon) {
> >>         atomic_inc(page->_mapcount);
> >> +       if (PageWriteback(page)) {
> >> +           // Assume we are racing and current write back will not see
> >> +           // the elevated mapcount so wait for current write back and
> >> +           // force page fault
> >> +           wait_on_page_writeback(page);
> >> +           // force slow path that will fault again
> >> +       }
> >>     }
> >> }
> > 
> > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > should also increment page->_mapcount) under PTE lock. So at that point we
> > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > block on PTE lock to make PTE read-only and only after going through all
> > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > enough synchronization.
> > 
> >> For [1] only needing pin count during write back turns page_mkclean into
> >> the perfect spot to check for that so:
> >>
> >> int page_mkclean(struct page *page)
> >> {
> >>     int cleaned = 0;
> >> +   int real_mapcount = 0;
> >>     struct address_space *mapping;
> >>     struct rmap_walk_control rwc = {
> >>         .arg = (void *)&cleaned,
> >>         .rmap_one = page_mkclean_one,
> >>         .invalid_vma = invalid_mkclean_vma,
> >> +       .mapcount = &real_mapcount,
> >>     };
> >>
> >>     BUG_ON(!PageLocked(page));
> >>
> >>     if (!page_mapped(page))
> >>         return 0;
> >>
> >>     mapping = page_mapping(page);
> >>     if (!mapping)
> >>         return 0;
> >>
> >>     // rmap_walk need to change to count mapping and return value
> >>     // in .mapcount easy one
> >>     rmap_walk(page, &rwc);
> >>
> >>     // Big fat comment to explain what is going on
> >> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> >> +       SetPageDMAPined(page);
> >> +   } else {
> >> +       ClearPageDMAPined(page);
> >> +   }
> > 
> > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > check we do in page_mkclean() is wrong?
> 
> Right. This looks like a dead end, after all. We can't lock a whole chunk 
> of "all these are mapped, hold still while we count you" pages. It's not
> designed to allow that at all.
> 
> IMHO, we are now back to something like dynamic_page, which provides an
> independent dma pinned count. 

I will keep looking because allocating a structure for every GUP is
insane to me they are user out there that are GUPin GigaBytes of data
and it gonna waste tons of memory just to fix crappy hardware.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
@ 2018-12-20 16:49                                               ` Jerome Glisse
  2019-01-03  1:55                                               ` Jerome Glisse
  2 siblings, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > solution for that.  
> > > 
> > > So as I understand it, this would use page->_mapcount to store both the real
> > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > file-backed (non-anonymous) pages:
> > > 
> > > 
> > > __get_user_pages()
> > > {
> > > 	...
> > > 	get_page(page);
> > > 
> > > 	if (!PageAnon)
> > > 		atomic_inc(page->_mapcount);
> > > 	...
> > > }
> > > 
> > > put_user_page(struct page *page)
> > > {
> > > 	...
> > > 	if (!PageAnon)
> > > 		atomic_dec(&page->_mapcount);
> > > 
> > > 	put_page(page);
> > > 	...
> > > }
> > > 
> > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > had in mind?
> > 
> > Mostly, with the extra two observations:
> >     [1] We only need to know the pin count when a write back kicks in
> >     [2] We need to protect GUP code with wait_for_write_back() in case
> >         GUP is racing with a write back that might not see the
> >         elevated mapcount in time.
> > 
> > So for [2]
> > 
> > __get_user_pages()
> > {
> >     get_page(page);
> > 
> >     if (!PageAnon) {
> >         atomic_inc(page->_mapcount);
> > +       if (PageWriteback(page)) {
> > +           // Assume we are racing and current write back will not see
> > +           // the elevated mapcount so wait for current write back and
> > +           // force page fault
> > +           wait_on_page_writeback(page);
> > +           // force slow path that will fault again
> > +       }
> >     }
> > }
> 
> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> should also increment page->_mapcount) under PTE lock. So at that point we
> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> block on PTE lock to make PTE read-only and only after going through all
> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> enough synchronization.

This is needed: a file-backed page can be mapped in any number of page
tables, and thus no single PTE lock is going to protect anything in the
end. Moreover, with GUP-fast we really have to assume there is no lock
that forces ordering.

In fact, in the above snippet the mapcount increment should not happen
at all if there is an ongoing write back.


> > For [1] only needing pin count during write back turns page_mkclean into
> > the perfect spot to check for that so:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> >     // Big fat comment to explain what is going on
> > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > +       SetPageDMAPined(page);
> > +   } else {
> > +       ClearPageDMAPined(page);
> > +   }
> 
> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> check we do in page_mkclean() is wrong?

Ok, so I thought about this; here is what we have:
    mp1 = page_mapcount(page);
    // let's name rc1 the number of real mappings at mp1 time (this is
    // an ideal value that we can not observe)

    rmap_walk(page, &rwc);
    // let's name frc the number of real mappings actually counted by
    // rmap_walk

    mp2 = page_mapcount(page);
    // let's name rc2 the number of real mappings at mp2 time (again an
    // ideal value that we can not observe)


So we have:
    rc1 >= frc >= rc2
    pc1 = mp1 - rc1     // pin count at mp1 time
    pc2 = mp2 - rc2     // pin count at mp2 time

and therefore:
    mp1 - rc1 <= mp1 - frc
    mp2 - rc2 >= mp2 - frc

From the above:
    mp1 - frc <  0 impossible value, mapcount can only go down so
                   frc <= mp1
    mp1 - frc == 0 -> the page is not pinned
U1  mp1 - frc >  0 -> the page might be pinned

U2  mp2 - frc <= 0 -> the page might be pinned
    mp2 - frc >  0 -> the page is pinned

There are two unknowns, [U1] and [U2]:
    [U1]    a zap raced before rmap_walk() could account the zapped
            mapping (frc < rc1)
    [U2]    a zap raced after rmap_walk() accounted the zapped
            mapping (frc > rc2)

In both cases we can detect the race, but we can not ascertain whether
the page is pinned or not.

So we can do 2 things here:
    - try to recount the real mappings (this is bound to terminate, as
      no new mapping can be added and thus mapcount can only go down)
    - assume a false positive and uselessly bounce a page that would
      not need bouncing if we were not unlucky

We could mitigate this with a page flag: GUP unconditionally sets it
and page_mkclean() clears it when mp1 - frc == 0. This way we never
bounce pages that were never GUPed, but we might keep bouncing a page
that was GUPed once in its lifetime until page_mkclean() sees no race
for it.

I will ponder a bit more and see if I can get an idea on how to close
that race, i.e. either close U1 or close U2.
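
To make the window concrete, here is a small model of that arithmetic (plain Python, not kernel code): classify() applies the inequalities above to the two mapcount snapshots and the rmap_walk() count, and the last call shows the [U1] race forcing the ambiguous answer:

```python
def classify(mp1, frc, mp2):
    """mp1/mp2: page_mapcount() before/after the walk (real maps + pins);
    frc: real mappings actually counted by rmap_walk()."""
    assert frc <= mp1           # mapcount can only go down
    if mp1 - frc == 0:
        return "not pinned"     # every count at mp1 was a real mapping
    if mp2 - frc > 0:
        return "pinned"         # even the late snapshot exceeds real maps
    return "ambiguous"          # U1/U2: a zap raced with the walk

print(classify(2, 2, 2))  # 2 real maps, quiescent    -> not pinned
print(classify(2, 1, 2))  # 1 real map + 1 pin        -> pinned
# No pins, but a zap of one of the 2 mappings lands before rmap_walk()
# can count it (U1): mp1 saw 2, the walk finds 1, mp2 is 1.
print(classify(2, 1, 1))  # -> ambiguous
```

The mitigation described above then amounts to treating "ambiguous" as pinned and letting a later, race-free page_mkclean() pass downgrade it.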


> >     // Maybe we want to leverage the int nature of return value so that
> >     // we can express more than cleaned/truncated and express cleaned/
> >     // truncated/pinned for benefit of caller and that way we do not
> >     // even need one bit as page flags above.
> > 
> >     return cleaned;
> > }
> > 
> > You do not want to change page_mapped() i do not see a need for that.
> > 
> > Then the whole discussion between Jan and Dave seems to indicate that
> > the bounce mechanism will need to be in the fs layer and that we can
> > not reuse the bio bounce mechanism. This means that more work is needed
> > at the fs level for that (so that fs do not freak on bounce page).
> > 
> > Note that there are a few gotchas where we need to preserve the pin count
> > ie mostly in truncate code path that can remove page from page cache
> > and overwrite the mapcount in the process, this would need to be fixed
> > to not overwrite mapcount so that put_user_page does not set the map
> > count to an invalid value turning the page into a bad state that will
> > at one point trigger kernel BUG_ON();
> >
> > I am not saying block truncate, I am saying make sure it does not
> > erase pin count and keep truncating happily. The how to handle truncate
> > is a per existing GUP user discussion to see what they want to do for
> > that.
> > 
> > Obviously a bit deeper analysis of all spot that use mapcount is needed
> > to check that we are not breaking anything but from the top of my head
> > i can not think of anything bad (migrate will abort and other things will
> > assume the page is mapped even it is only in hardware page table, ...).
> 
> Hum, grepping for page_mapped() and page_mapcount(), this is actually going
> to be non-trivial to get right AFAICT.

No, that's not that scary: a good chunk of all those are for anonymous
memory, and many are obvious (like migrate, ksm, ...).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 22:33                                             ` Dave Chinner
@ 2018-12-20  9:07                                               ` Jan Kara
  2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 0 replies; 207+ messages in thread
From: Jan Kara @ 2018-12-20  9:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jason Gunthorpe, Jerome Glisse, John Hubbard,
	Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Thu 20-12-18 09:33:12, Dave Chinner wrote:
> On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> > On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > > 
> > > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just burn it with napalm and thermite
> > > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > > the underlying storage doesn't already require it.
> > > > 
> > > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > > then just do it.
> > > 
> > > O_DIRECT IO *isn't the problem*.
> > 
> > That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> > than the problem with RDMA but currently O_DIRECT IO can crash your machine
> > or corrupt data the same way RDMA can.
> 
> It's not O_DIRECT - it's a "transient page pin". Yes, there are
> problems with that right now, but as we've discussed the issues can
> be avoided by:
> 
> 	a) stable pages always blocking in ->page_mkwrite;
> 	b) blocking in write_cache_pages() on an elevated map count
> 	when WB_SYNC_ALL is set; and
> 	c) blocking in truncate_pagecache() on an elevated map
> 	count.
> 
> That prevents:
> 	a) gup pinning a page that is currently under writeback and
> 	modifying it while IO is in flight;
> 	b) a dirty page being written back while it is pinned by
> 	GUP, thereby turning it clean before the gup reference calls
> 	set_page_dirty() on DMA completion; and

This is not prevented by what you wrote above as currently GUP does not
increase page->_mapcount. Currently, there's no way to distinguish GUP page
reference from any other page reference - GUP simply does get_page() - and
big part of this thread as I see it is exactly about how to introduce this
distinction and how to convert all GUP users to the new convention safely
(as currently they just pass struct page * pointers around and eventually
do put_page() on them). Increasing page->_mapcount in GUP and trying to
deduce the pin count from that is one option Jerome suggested. At this
point I'm not 100% sure this is going to work but we'll see.

> 	c) truncate/hole punch for pulling the page out from under
> 	the gup operation that is ongoing.
> 
> This is an adequate solution for a short term transient pins. It
> doesn't break fsync(), it doesn't change how truncate works and it
> fixes the problem where a mapped file is the buffer for an O_DIRECT
> IO rather than the open fd and that buffer file gets truncated.
> IOWs, transient pins (and hence O_DIRECT) is not really the problem
> here.

For now let's assume that the mechanism how to detect page pinned by GUP is
actually somehow solved and we have already page_pinned() implemented. Then
what you suggest can actually create a deadlock AFAICS:

Process 1:						Process 2:

							fsync("file")
/* Evil memory buffer with page order reversed */
addr1 = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, "file", 4096);
addr2 = mmap(addr1+4096, 4096, PROT_WRITE, MAP_SHARED, "file", 0);

/* Fault in pages */
*addr1 = 0;
*addr2 = 0;
							adds page with index 0
							  to bio

fd = open("file2", O_RDWR | O_DIRECT);
read(fd, addr1, 8192)
  -> eventually gets to iov_iter_get_pages() and then to
     get_user_pages_fast().
     -> pins "file" page with index 1
							blocks on pin for
							  page with index 1
     -> blocks in PageWriteback for page with index 0

Possibility of deadlocks like this is why I've decided it will be easier
to just bounce the page for writeback we cannot avoid rather than block the
writeback...
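
The cycle in that scenario is a classic ABBA pattern, just expressed through pages instead of locks. A hypothetical wait-for-graph sketch (plain Python; the task and resource names are invented for illustration):

```python
# Jan's scenario: fsync holds "writeback in flight on page 0" and waits
# for the GUP pin on page 1 to drop; the O_DIRECT read holds the pin on
# page 1 and waits for the writeback of page 0 to finish.
holds = {
    "fsync":        {"page0.writeback"},
    "odirect_read": {"page1.pin"},
}
waits_for = {
    "fsync":        "page1.pin",
    "odirect_read": "page0.writeback",
}

def has_deadlock(holds, waits_for):
    """Follow wait-for edges between tasks; revisiting a task is a cycle."""
    owner = {r: task for task, rs in holds.items() for r in rs}
    for start in waits_for:
        seen, task = set(), start
        while task in waits_for and waits_for[task] in owner:
            if task in seen:
                return True
            seen.add(task)
            task = owner[waits_for[task]]
    return False

print(has_deadlock(holds, waits_for))  # True: fsync -> read -> fsync
```

Bouncing the page instead of blocking removes the "fsync waits for the pin" edge, which breaks the cycle.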

> The problem with this is that blocking on elevated map count does
> not work for long term pins (i.e. gup_longterm()) which are defined
> as:
> 
>  * "longterm" == userspace controlled elevated page count lifetime.
>  * Contrast this to iov_iter_get_pages() usages which are transient.
> 
> It's the "userspace controlled" part of the long term gup pin that
> is the problem we need to solve. If we treat them the same as a
> transient pin, then this leads to fsync() and truncate either
> blocking for a long time waiting for userspace to drop its gup
> reference, or having to be failed with something like EBUSY or
> EAGAIN.

I agree. "userspace controlled" pins are another big problem to solve.

> This is the problem revokable file layout leases solve. The NFS
> server is already using this for revoking delegations from remote
> clients. Userspace holding long term GUP references is essentially
> the same thing - it's a delegation of file ownership to userspace
> that the filesystem must be able to revoke when it needs to run
> internal and/or 3rd-party requested operations on that delegated
> file.
> 
> If the hardware supports page faults, then we can further optimise
> the long term pin case to relax stable page requirements and allow
> page cleaning to occur while there are long term pins. In this case,
> the hardware will write-fault the clean pages appropriately before
> DMA is initiated, and hence avoid the need for data integrity
> operations like fsync() to trigger lease revocation. However,
> truncate/hole punch still requires lease revocation to work sanely,
> especially when we consider DAX *must* ensure there are no remaining
> references to the physical pmem page after the space has been freed.
> 
> i.e. conflating the transient and long term gup pins as the same
> problem doesn't help anyone. If we fix the short term pin problems,
> then the long term pin problem become tractable by adding a layer
> over the top (i.e.  hardware page fault capability and/or file lease
> requirements).  Existing apps and hardware will continue to work -
> external operations on the pinned file will simply hang rather than
> causing corruption or kernel crashes.  New (or updated) applications
> will play nicely with lease revocation and at that point the "long
> term pin" basically becomes a transient pin where the unpin latency
> is determined by how quickly the app responds to the lease
> revocation. And page fault capable hardware will reduce the
> occurrence of lease revocations due to data writeback/integrity
> operations and behave almost identically to cpu-based mmap accesses
> to file backed pages.

Agreed. I think we are on the same page wrt this. Just at this point I'm
trying to solve the "transient pin" problem...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:35                                           ` Jan Kara
  2018-12-19 16:56                                             ` Jason Gunthorpe
@ 2018-12-19 22:33                                             ` Dave Chinner
  2018-12-20  9:07                                               ` Jan Kara
  2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 2 replies; 207+ messages in thread
From: Dave Chinner @ 2018-12-19 22:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > 
> > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just burn it with napalm and thermite
> > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > the underlying storage doesn't already require it.
> > > 
> > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > then just do it.
> > 
> > O_DIRECT IO *isn't the problem*.
> 
> That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> than the problem with RDMA but currently O_DIRECT IO can crash your machine
> or corrupt data the same way RDMA can.

It's not O_DIRECT - it's a "transient page pin". Yes, there are
problems with that right now, but as we've discussed the issues can
be avoided by:

	a) stable pages always blocking in ->page_mkwrite;
	b) blocking in write_cache_pages() on an elevated map count
	when WB_SYNC_ALL is set; and
	c) blocking in truncate_pagecache() on an elevated map
	count.

That prevents:
	a) gup pinning a page that is currently under writeback and
	modifying it while IO is in flight;
	b) a dirty page being written back while it is pinned by
	GUP, thereby turning it clean before the gup reference calls
	set_page_dirty() on DMA completion; and
	c) truncate/hole punch for pulling the page out from under
	the gup operation that is ongoing.

This is an adequate solution for a short term transient pins. It
doesn't break fsync(), it doesn't change how truncate works and it
fixes the problem where a mapped file is the buffer for an O_DIRECT
IO rather than the open fd and that buffer file gets truncated.
IOWs, transient pins (and hence O_DIRECT) is not really the problem
here.

The problem with this is that blocking on elevated map count does
not work for long term pins (i.e. gup_longterm()) which are defined
as:

 * "longterm" == userspace controlled elevated page count lifetime.
 * Contrast this to iov_iter_get_pages() usages which are transient.

It's the "userspace controlled" part of the long term gup pin that
is the problem we need to solve. If we treat them the same as a
transient pin, then this leads to fsync() and truncate either
blocking for a long time waiting for userspace to drop its gup
reference, or having to be failed with something like EBUSY or
EAGAIN.

This is the problem revokable file layout leases solve. The NFS
server is already using this for revoking delegations from remote
clients. Userspace holding long term GUP references is essentially
the same thing - it's a delegation of file ownership to userspace
that the filesystem must be able to revoke when it needs to run
internal and/or 3rd-party requested operations on that delegated
file.

If the hardware supports page faults, then we can further optimise
the long term pin case to relax stable page requirements and allow
page cleaning to occur while there are long term pins. In this case,
the hardware will write-fault the clean pages appropriately before
DMA is initiated, and hence avoid the need for data integrity
operations like fsync() to trigger lease revocation. However,
truncate/hole punch still requires lease revocation to work sanely,
especially when we consider DAX *must* ensure there are no remaining
references to the physical pmem page after the space has been freed.

i.e. conflating the transient and long term gup pins as the same
problem doesn't help anyone. If we fix the short term pin problems,
then the long term pin problem becomes tractable by adding a layer
over the top (i.e.  hardware page fault capability and/or file lease
requirements).  Existing apps and hardware will continue to work -
external operations on the pinned file will simply hang rather than
causing corruption or kernel crashes.  New (or updated) applications
will play nicely with lease revocation and at that point the "long
term pin" basically becomes a transient pin where the unpin latency
is determined by how quickly the app responds to the lease
revocation. And page fault capable hardware will reduce the
occurrence of lease revocations due to data writeback/integrity
operations and behave almost identically to cpu-based mmap accesses
to file backed pages.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:35                                           ` Jan Kara
@ 2018-12-19 16:56                                             ` Jason Gunthorpe
  2018-12-19 22:33                                             ` Dave Chinner
  1 sibling, 0 replies; 207+ messages in thread
From: Jason Gunthorpe @ 2018-12-19 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > 
> > > > Essentially, what we are talking about is how to handle broken
> > > > hardware. I say we should just burn it with napalm and thermite
> > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > the underlying storage doesn't already require it.
> > > 
> > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > then just do it.
> > 
> > O_DIRECT IO *isn't the problem*.
> 
> That is not true. O_DIRECT IO is a problem. In some aspects it is
> easier than the problem with RDMA but currently O_DIRECT IO can
> crash your machine or corrupt data the same way RDMA can. Just the
> race window is much smaller. So we have to fix the generic GUP
> infrastructure to make O_DIRECT IO work. I agree that fixing RDMA
> will likely require even more work like revokable leases or what
> not.

This is what I've understood, talking to all the experts. Dave? Why do
you think O_DIRECT is actually OK?

I agree the duration issue with RDMA is different, but don't forget,
O_DIRECT goes out to the network too and has potentially very long
timeouts as well.

If O_DIRECT works fine then let's use the same approach in RDMA??

Jason

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:42                                     ` Dave Chinner
  2018-12-19  3:03                                       ` Jason Gunthorpe
@ 2018-12-19 13:24                                       ` Jan Kara
  1 sibling, 0 replies; 207+ messages in thread
From: Jan Kara @ 2018-12-19 13:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 19-12-18 10:42:54, Dave Chinner wrote:
> On Tue, Dec 18, 2018 at 11:33:06AM +0100, Jan Kara wrote:
> > On Mon 17-12-18 08:58:19, Dave Chinner wrote:
> > > On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> > > > Yes, for filesystem it is too late. But the plan we figured back in October
> > > > was to do the bouncing in the block layer. I.e., mark the bio (or just the
> > > > particular page) as needing bouncing and then use the existing page
> > > > bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> > > > it was still a separate fs driver) has been using a mechanism like this to
> > > > make DIF/DIX work with its metadata.
> > > 
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to its metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > >
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > The scenario you describe here cannot happen exactly because of the
> > wait_for_stable_page() in ->page_mkwrite() you mention below.
> 
> In general, no, because stable pages are controlled by block
> devices.
> 
> void wait_for_stable_page(struct page *page)
> {
>         if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
>                 wait_on_page_writeback(page);
> }
> 
> 
> I have previously advocated for the filesystem to be in control of
> stable pages but, well, too many people shouted "but performance!"
> and so we still have all these holes I wanted to close in our
> code...
> 
> > If someone
> > will try to GUP a page that is under writeback (has already PageWriteback
> > set), GUP will have to do a write fault because the page is writeprotected
> > in page tables and go into ->page_mkwrite() which will wait.
> 
> Correct, but that doesn't close the problem down because stable
> pages are something we cannot rely on right now. We need to fix
> wait_for_stable_page() to always block on page writeback before
> this specific race condition goes away.

Right, all I said was assuming that someone actually cares about stable
pages so bdi_cap_stable_pages_required() is set. I agree with the filesystem
having the ability to control whether stable pages are required or not. But
when stable pages get enforced seems like a separate problem to me.

> > The problem rather is with someone mapping the page *before* writeback
> > starts, giving the page to HW. Then clear_page_dirty_for_io() writeprotects
> > the page in PTEs but the HW gives a damn about that. Then, after we add the
> > page to the bio but before the page gets bounced by the block layer, the HW
> > can still modify it.
> 
> Sure, that's yet another aspect of the same problem - not getting a
> write fault when the page is being written to. If we got a write
> fault, then the wait_for_stable_page() call in ->page_mkwrite would
> then solve the problem.
> 
> Essentially, what we are talking about is how to handle broken
> hardware. I say we should just burn it with napalm and thermite
> (i.e. taint the kernel with "unsupportable hardware") and force
> wait_for_stable_page() to trigger when there are GUP mappings if
> the underlying storage doesn't already require it.

As I wrote in another email, this is also about direct IO using a file mapping
as a data buffer. So burning it with napalm can hardly be a complete solution...
I agree that for the hardware that cannot support revoking of access /
fault on access and uses long-term page pins, we may just have to put up
with weird behavior in some corner cases.

> > > If it's permanently dirty, how do we trigger new COW operations
> > > after writeback has "cleaned" the page? i.e. we still need a
> > > ->page_mkwrite call to run before we allow the next write to the
> > > page to be done, regardless of whether the page is "permanently
> > > dirty" or not....
> > 
> > Interaction with COW is certainly an interesting problem. When the page
> > gets pinned, GUP will make sure the page is writeably mapped and trigger a
> > write fault if not. So at the moment the page is pinned, we are sure the
> > page is COWed. Now the question is what should happen when the file A
> > containing this pinned page gets reflinked to file B while the page is still
> > pinned.
> > 
> > Options I can see are:
> > 
> > 1) Fail the reflink.
> >   - difficult for sysadmin to discover the source of failure
> >
> > 2) Block reflink until the pin of the page is released.
> >   - can last for a long time, again difficult to discover
> > 
> > 3) Don't do anything special.
> >   - can corrupt data as read accesses through file B won't see
> >     modifications done to the page (and thus eventually the corresponding disk
> >     block) by the HW.
> > 
> > 4) Immediately COW the block during reflink when the corresponding page
> >    cache page is pinned.
> >   - seems as the best solution at this point, although sadly also requires
> >     the most per-filesystem work
> 
> None of the above are acceptable solutions - they all have nasty
> corner cases which are going to be difficult to get right, test,
> etc. IMO, the robust, reliable, testable solution is this:
> 
> 5) The reflink breaks the file lease, the userspace app releases the
> pinned pages on the file and drops the lease. The reflink proceeds,
> does it's work, and then the app gets a new lease on the file. When
> the app pins the pages again, it triggers new ->page_mkwrite calls
> to break any sharing that the reflink created. And if the app fails
> to drop the lease, then we can either fail with a lease related
> error or kill it....

This is certainly fine for the GUP users that are going to support leases.
But do you want GUP in direct IO to create a lease if the pages are from a
file mapping? I believe we need another option at least for GUP references
that are short-term in nature and sometimes also performance critical.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 10:28                                         ` Dave Chinner
@ 2018-12-19 11:35                                           ` Jan Kara
  2018-12-19 16:56                                             ` Jason Gunthorpe
  2018-12-19 22:33                                             ` Dave Chinner
  0 siblings, 2 replies; 207+ messages in thread
From: Jan Kara @ 2018-12-19 11:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Jerome Glisse, John Hubbard,
	Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > 
> > > Essentially, what we are talking about is how to handle broken
> > > hardware. I say we should just burn it with napalm and thermite
> > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > the underlying storage doesn't already require it.
> > 
> > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > then just do it.
> 
> O_DIRECT IO *isn't the problem*.

That is not true. O_DIRECT IO is a problem. In some aspects it is easier
than the problem with RDMA but currently O_DIRECT IO can crash your machine
or corrupt data the same way RDMA can. Just the race window is much
smaller. So we have to fix the generic GUP infrastructure to make O_DIRECT
IO work. I agree that fixing RDMA will likely require even more work like
revokable leases or what not.

> O_DIRECT IO uses a short term pin that the existing prefaulting
> during GUP works just fine for. The problem we have is the long term
> pins where pages can be cleaned while the pages are pinned. i.e. the
> use case we currently have to disable for DAX because *we can't make
> it work sanely* without either revokable file leases and/or hardware
> that is able to trigger page faults when they need write access to a
> clean page.

I would like to find a solution to the O_DIRECT IO problem while making the
infrastructure reusable also for solving the problems with RDMA... Because
nobody wants to go through those couple hundred get_user_pages() users in
the kernel twice...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  5:26                                         ` Dan Williams
@ 2018-12-19 11:19                                           ` Jan Kara
  0 siblings, 0 replies; 207+ messages in thread
From: Jan Kara @ 2018-12-19 11:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Jan Kara, Jerome Glisse,
	John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue 18-12-18 21:26:28, Dan Williams wrote:
> On Tue, Dec 18, 2018 at 7:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> >
> > > Essentially, what we are talking about is how to handle broken
> > > hardware. I say we should just burn it with napalm and thermite
> > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > the underlying storage doesn't already require it.
> >
> > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > then just do it.
> >
> > Otherwise I'm not sure demanding some unrealistic HW design is
> > reasonable. ie nvme drives are not likely to add page faulting to
> > their IO path any time soon.
> >
> > A SW architecture that relies on page faulting is just not going to
> > support real world block IO devices.
> >
> > GPUs and one RDMA are about the only things that can do this today,
> > and they are basically irrelevant to O_DIRECT.
> 
> Yes.
> 
> I'm missing why a bounce buffer is needed. If writeback hits a
> DMA-writable page why can't that path just turn around and trigger
> another mkwrite notification on behalf of hardware that will never send
> it? "Nice try writeback, this page is dirty again".

You are conflating two things here. Bounce buffer (or a way to stop DMA
from happening) is needed because think what happens when RAID5 computes
its stripe checksum while someone modifies the data through DMA. Checksum
mismatch and all fun arising from that.

Notifying filesystem about the fact that the page didn't get cleaned by the
writeback and still can be modified by the DMA is a different thing.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  2:07                                           ` Jerome Glisse
@ 2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 207+ messages in thread
From: Jan Kara @ 2018-12-19 11:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > solution for that.  
> > 
> > So as I understand it, this would use page->_mapcount to store both the real
> > mapcount, and the dma pinned count (simply added together), but only do so for
> > file-backed (non-anonymous) pages:
> > 
> > 
> > __get_user_pages()
> > {
> > 	...
> > 	get_page(page);
> > 
> > 	if (!PageAnon)
> > 	atomic_inc(&page->_mapcount);
> > 	...
> > }
> > 
> > put_user_page(struct page *page)
> > {
> > 	...
> > 	if (!PageAnon)
> > 		atomic_dec(&page->_mapcount);
> > 
> > 	put_page(page);
> > 	...
> > }
> > 
> > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > had in mind?
> 
> Mostly, with the extra two observations:
>     [1] We only need to know the pin count when a write back kicks in
>     [2] We need to protect GUP code with wait_for_write_back() in case
>         GUP is racing with a write back that might not see the
>         elevated mapcount in time.
> 
> So for [2]
> 
> __get_user_pages()
> {
>     get_page(page);
> 
>     if (!PageAnon) {
>         atomic_inc(&page->_mapcount);
> +       if (PageWriteback(page)) {
> +           // Assume we are racing and current write back will not see
> +           // the elevated mapcount so wait for current write back and
> +           // force page fault
> +           wait_on_page_writeback(page);
> +           // force slow path that will fault again
> +       }
>     }
> }

This is not needed AFAICT. __get_user_pages() gets page reference (and it
should also increment page->_mapcount) under PTE lock. So at that point we
are sure we have a writeable PTE nobody can change. So page_mkclean() has to
block on PTE lock to make PTE read-only and only after going through all
PTEs like this, it can check page->_mapcount. So the PTE lock provides
enough synchronization.

> For [1] only needing pin count during write back turns page_mkclean into
> the perfect spot to check for that so:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);
> 
>     // Big fat comment to explain what is going on
> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> +       SetPageDMAPinned(page);
> +   } else {
> +       ClearPageDMAPinned(page);
> +   }

This is the detail I'm not sure about: Why cannot rmap_walk_file() race
with e.g. zap_pte_range() which decrements page->_mapcount and thus the
check we do in page_mkclean() is wrong?

> 
>     // Maybe we want to leverage the int nature of return value so that
>     // we can express more than cleaned/truncated and express cleaned/
>     // truncated/pinned for benefit of caller and that way we do not
>     // even need one bit as page flags above.
> 
>     return cleaned;
> }
> 
> You do not want to change page_mapped(), i do not see a need for that.
> 
> Then the whole discussion between Jan and Dave seems to indicate that
> the bounce mechanism will need to be in the fs layer and that we can
> not reuse the bio bounce mechanism. This means that more work is needed
> at the fs level for that (so that fs do not freak on bounce page).
> 
> Note that there are a few gotchas where we need to preserve the pin count
> ie mostly in truncate code path that can remove page from page cache
> and overwrite the mapcount in the process, this would need to be fixed
> to not overwrite mapcount so that put_user_page does not set the map
> count to an invalid value turning the page into a bad state that will
> at one point trigger kernel BUG_ON();
>
> I am not saying block truncate, i am saying make sure it does not
> erase pin count and keep truncating happily. The how to handle truncate
> is a per existing GUP user discussion to see what they want to do for
> that.
> 
> Obviously a bit deeper analysis of all spots that use mapcount is needed
> to check that we are not breaking anything but from the top of my head
> i can not think of anything bad (migrate will abort and other things will
> assume the page is mapped even if it is only in the hardware page table, ...).

Hum, grepping for page_mapped() and page_mapcount(), this is actually going
to be non-trivial to get right AFAICT.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  3:03                                       ` Jason Gunthorpe
  2018-12-19  5:26                                         ` Dan Williams
@ 2018-12-19 10:28                                         ` Dave Chinner
  2018-12-19 11:35                                           ` Jan Kara
  1 sibling, 1 reply; 207+ messages in thread
From: Dave Chinner @ 2018-12-19 10:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> 
> > Essentially, what we are talking about is how to handle broken
> > hardware. I say we should just burn it with napalm and thermite
> > (i.e. taint the kernel with "unsupportable hardware") and force
> > wait_for_stable_page() to trigger when there are GUP mappings if
> > the underlying storage doesn't already require it.
> 
> If you want to ban O_DIRECT/etc from writing to file backed pages,
> then just do it.

O_DIRECT IO *isn't the problem*.


O_DIRECT IO uses a short term pin that the existing prefaulting
during GUP works just fine for. The problem we have is the long term
pins where pages can be cleaned while the pages are pinned. i.e. the
use case we currently have to disable for DAX because *we can't make
it work sanely* without either revokable file leases and/or hardware
that is able to trigger page faults when they need write access to a
clean page.

> Otherwise I'm not sure demanding some unrealistic HW design is
> reasonable. ie nvme drives are not likely to add page faulting to
> their IO path any time soon.

Direct IO on nvme drives is not the problem. It's RDMA pinning
pages for hours or days and expecting everyone else to jump through
hoops to support their broken page access model.

> A SW architecture that relies on page faulting is just not going to
> support real world block IO devices.

The existing software architecture for file backed pages has been
based around page faulting for write notifications since ~2005. That
horse bolted many, many years ago.

> GPUs and one RDMA are about the only things that can do this today,
> and they are basically irrelevant to O_DIRECT.

It's RDMA that we need these changes for, not O_DIRECT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  3:03                                       ` Jason Gunthorpe
@ 2018-12-19  5:26                                         ` Dan Williams
  2018-12-19 11:19                                           ` Jan Kara
  2018-12-19 10:28                                         ` Dave Chinner
  1 sibling, 1 reply; 207+ messages in thread
From: Dan Williams @ 2018-12-19  5:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Jerome Glisse, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue, Dec 18, 2018 at 7:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
>
> > Essentially, what we are talking about is how to handle broken
> > hardware. I say we should just burn it with napalm and thermite
> > (i.e. taint the kernel with "unsupportable hardware") and force
> > wait_for_stable_page() to trigger when there are GUP mappings if
> > the underlying storage doesn't already require it.
>
> If you want to ban O_DIRECT/etc from writing to file backed pages,
> then just do it.
>
> Otherwise I'm not sure demanding some unrealistic HW design is
> reasonable. ie nvme drives are not likely to add page faulting to
> their IO path any time soon.
>
> A SW architecture that relies on page faulting is just not going to
> support real world block IO devices.
>
> GPUs and one RDMA are about the only things that can do this today,
> and they are basically irrelevant to O_DIRECT.

Yes.

I'm missing why a bounce buffer is needed. If writeback hits a
DMA-writable page why can't that path just turn around and trigger
another mkwrite notification on behalf of hardware that will never send
it? "Nice try writeback, this page is dirty again".

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:42                                     ` Dave Chinner
@ 2018-12-19  3:03                                       ` Jason Gunthorpe
  2018-12-19  5:26                                         ` Dan Williams
  2018-12-19 10:28                                         ` Dave Chinner
  2018-12-19 13:24                                       ` Jan Kara
  1 sibling, 2 replies; 207+ messages in thread
From: Jason Gunthorpe @ 2018-12-19  3:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:

> Essentially, what we are talking about is how to handle broken
> hardware. I say we should just burn it with napalm and thermite
> (i.e. taint the kernel with "unsupportable hardware") and force
> wait_for_stable_page() to trigger when there are GUP mappings if
> the underlying storage doesn't already require it.

If you want to ban O_DIRECT/etc from writing to file backed pages,
then just do it.

Otherwise I'm not sure demanding some unrealistic HW design is
reasonable. ie nvme drives are not likely to add page faulting to
their IO path any time soon.

A SW architecture that relies on page faulting is just not going to
support real world block IO devices.

GPUs and one RDMA are about the only things that can do this today,
and they are basically irrelevant to O_DIRECT.

Jason

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:29                                         ` John Hubbard
@ 2018-12-19  2:07                                           ` Jerome Glisse
  2018-12-19 11:08                                             ` Jan Kara
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2018-12-19  2:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> On 12/18/18 1:30 AM, Jan Kara wrote:
> > On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
> >> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> >>> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> >>>> Sure, that's a possibility, but that doesn't close off any race
> >>>> conditions because there can be DMA into the page in progress while
> >>>> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> >>>> different in that there is no 3rd-party access to the page while it
> >>>> is under IO (ext3 arbitrates all access to its metadata), and so
> >>>> nothing can actually race for modification of the page between
> >>>> submission and bouncing at the block layer.
> >>>>
> >>>> In this case, the moment the page is unlocked, anyone else can map
> >>>> it and start (R)DMA on it, and that can happen before the bio is
> >>>> bounced by the block layer. So AFAICT, block layer bouncing doesn't
> >>>> solve the problem of racing writeback and DMA direct to the page we
> >>>> are doing IO on. Yes, it reduces the race window substantially, but
> >>>> it doesn't get rid of it.
> >>>
> >>> So the event flow is:
> >>>     - userspace create object that match a range of virtual address
> >>>       against a given kernel sub-system (let's say infiniband) and
> >>>       let's assume that the range is an mmap() of a regular file
> >>>     - device driver do GUP on the range (let's assume it is a write
> >>>       GUP) so if the page is not already map with write permission
> >>>       in the page table than a page fault is trigger and page_mkwrite
> >>>       happens
> >>>     - Once GUP return the page to the device driver and once the
> >>>       device driver has updated the hardware states to allow access
> >>>       to this page then from that point on hardware can write to the
> >>>       page at _any_ time, it is fully disconnected from any fs event
> >>>       like write back, it fully ignore things like page_mkclean
> >>>
> >>> This is how it is to day, we allowed people to push upstream such
> >>> users of GUP. This is a fact we have to live with, we can not stop
> >>> hardware access to the page, we can not force the hardware to follow
> >>> page_mkclean and force a page_mkwrite once write back ends. This is
> >>> the situation we are inheriting (and i am personnaly not happy with
> >>> that).
> >>>
> >>> >From my point of view we are left with 2 choices:
> >>>     [C1] break all drivers that do not abide by the page_mkclean and
> >>>          page_mkwrite
> >>>     [C2] mitigate as much as possible the issue
> >>>
> >>> For [C2] the idea is to keep track of GUP per page so we know if we
> >>> can expect the page to be written to at any time. Here is the event
> >>> flow:
> >>>     - driver GUP the page and program the hardware, page is mark as
> >>>       GUPed
> >>>     ...
> >>>     - write back kicks in on the dirty page, lock the page and every
> >>>       thing as usual , sees it is GUPed and inform the block layer to
> >>>       use a bounce page
> >>
> >> No.  The solution John, Dan & I have been looking at is to take the
> >> dirty page off the LRU while it is pinned by GUP.  It will never be
> >> found for writeback.
> >>
> >> That's not the end of the story though.  Other parts of the kernel (eg
> >> msync) also need to be taught to stay away from pages which are pinned
> >> by GUP.  But the idea is that no page gets written back to storage while
> >> it's pinned by GUP.  Only when the last GUP ends is the page returned
> >> to the list of dirty pages.
> > 
> > We've been through this in:
> > 
> > https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/
> > 
> > back in July. You cannot just skip pages for fsync(2). So as I wrote above -
> > memory cleaning writeback can skip pinned pages. Data integrity writeback
> > must be able to write pinned pages. And bouncing is one reasonable way how
> > to do that.
> > 
> > This writeback decision is pretty much independent from the mechanism by
> > which we are going to identify pinned pages. Whether that's going to be
> > separate counter in struct page, using page->_mapcount, or separately
> > allocated data structure as you know promote.
> > 
> > I currently like the most the _mapcount suggestion from Jerome but I'm not
> > really attached to any solution as long as it performs reasonably and
> > someone can make it working :) as I don't have time to implement it at
> > least till January.
> > 
> 
> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> *only* the tracking pinned pages aspect), given that it is the lightest weight
> solution for that.  
> 
> So as I understand it, this would use page->_mapcount to store both the real
> mapcount, and the dma pinned count (simply added together), but only do so for
> file-backed (non-anonymous) pages:
> 
> 
> __get_user_pages()
> {
> 	...
> 	get_page(page);
> 
> 	if (!PageAnon)
> 		atomic_inc(page->_mapcount);
> 	...
> }
> 
> put_user_page(struct page *page)
> {
> 	...
> 	if (!PageAnon)
> 		atomic_dec(&page->_mapcount);
> 
> 	put_page(page);
> 	...
> }
> 
> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> had in mind?

Mostly, with two extra observations:
    [1] We only need to know the pin count when writeback kicks in
    [2] We need to protect the GUP code with wait_on_page_writeback() in
        case GUP is racing with a writeback that might not see the
        elevated mapcount in time.

So for [2]

__get_user_pages()
{
    get_page(page);

    if (!PageAnon(page)) {
        atomic_inc(&page->_mapcount);
+       if (PageWriteback(page)) {
+           // Assume we are racing and the current writeback will not
+           // see the elevated mapcount, so wait for it to finish and
+           // force a page fault
+           wait_on_page_writeback(page);
+           // force slow path that will fault again
+       }
    }
}

For [1], needing the pin count only during writeback makes page_mkclean()
the perfect spot for the check:

int page_mkclean(struct page *page)
{
    int cleaned = 0;
+   int real_mapcount = 0;
    struct address_space *mapping;
    struct rmap_walk_control rwc = {
        .arg = (void *)&cleaned,
        .rmap_one = page_mkclean_one,
        .invalid_vma = invalid_mkclean_vma,
+       .mapcount = &real_mapcount,
    };

    BUG_ON(!PageLocked(page));

    if (!page_mapped(page))
        return 0;

    mapping = page_mapping(page);
    if (!mapping)
        return 0;

    // rmap_walk() needs to change to count mappings and return the
    // value through .mapcount -- an easy change
    rmap_walk(page, &rwc);

    // Big fat comment to explain what is going on
+   if ((page_mapcount(page) - real_mapcount) > 0) {
+       SetPageDMAPinned(page);
+   } else {
+       ClearPageDMAPinned(page);
+   }

    // Maybe we want to leverage the int nature of the return value so
    // that we can express more than cleaned/truncated, i.e. cleaned/
    // truncated/pinned, for the benefit of the caller; that way we
    // would not even need the page flag bit above.

    return cleaned;
}

You do not want to change page_mapped(); I do not see a need for that.

Then the whole discussion between Jan and Dave seems to indicate that
the bounce mechanism will need to be in the fs layer and that we can
not reuse the bio bounce mechanism. This means more work is needed at
the fs level (so that filesystems do not freak out over a bounce page).

Note that there are a few gotchas where we need to preserve the pin
count, mostly in the truncate code path, which can remove a page from
the page cache and overwrite the mapcount in the process. This would
need to be fixed so it does not overwrite the mapcount, so that
put_user_page() does not set the mapcount to an invalid value, turning
the page into a bad state that would at some point trigger a kernel
BUG_ON().

I am not saying we should block truncate; I am saying we should make
sure it does not erase the pin count and keeps truncating happily. How
to handle truncate is a discussion to have per existing GUP user, to
see what each of them wants to do about it.

Obviously a somewhat deeper analysis of every spot that uses the
mapcount is needed to check that we are not breaking anything, but off
the top of my head I cannot think of anything bad (migrate will abort,
and other things will assume the page is mapped even if it is only in a
hardware page table, ...).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 10:33                                   ` Jan Kara
@ 2018-12-18 23:42                                     ` Dave Chinner
  2018-12-19  3:03                                       ` Jason Gunthorpe
  2018-12-19 13:24                                       ` Jan Kara
  0 siblings, 2 replies; 207+ messages in thread
From: Dave Chinner @ 2018-12-18 23:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jerome Glisse, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 18, 2018 at 11:33:06AM +0100, Jan Kara wrote:
> On Mon 17-12-18 08:58:19, Dave Chinner wrote:
> > On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> > > Hi!
> > > 
> > > On Thu 13-12-18 08:46:41, Dave Chinner wrote:
> > > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > > it may be easier said than done...
> > > > > 
> > > > > So i want back to the drawing board and first i would like to ascertain
> > > > > that we all agree on what the objectives are:
> > > > > 
> > > > >     [O1] Avoid write back from a page still being written by either a
> > > > >          device or some direct I/O or any other existing user of GUP.
> > > > >          This would avoid possible file system corruption.
> > > > > 
> > > > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > > > >          considered clean by core mm (buffer head have been remove and
> > > > >          with some file system this turns into an ugly mess).
> > > > 
> > > > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > > > "prevent data and/or filesystem corruption" case. The primary goal
> > > > we have here is removing our exposure to potential corruption, which
> > > > has the secondary effect of avoiding the crash/panics that currently
> > > > occur as a result of inconsistent page/filesystem state.
> > > > 
> > > > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > > > /before/ the file-backed page is marked dirty, and hence we don't
> > > > expose ourselves to potential corruption or crashes that are a
> > > > result of inappropriately calling set_page_dirty() on clean
> > > > file-backed pages.
> > > 
> > > I agree that [O1] - i.e., avoid corrupting fs data - is more important and
> > > [O2] is just one consequence of [O1].
> > > 
> > > > > For [O1] and [O2] i believe a solution with mapcount would work. So
> > > > > no new struct, no fake vma, nothing like that. In GUP for file back
> > > > > pages we increment both refcount and mapcount (we also need a special
> > > > > put_user_page to decrement mapcount when GUP user are done with the
> > > > > page).
> > > > 
> > > > I don't see how a mapcount can prevent anyone from calling
> > > > set_page_dirty() inappropriately.
> > > > 
> > > > > Now for [O1] the write back have to call page_mkclean() to go through
> > > > > all reverse mapping of the page and map read only. This means that
> > > > > we can count the number of real mapping and see if the mapcount is
> > > > > bigger than that. If mapcount is bigger than page is pin and we need
> > > > > to use a bounce page to do the writeback.
> > > > 
> > > > Doesn't work. Generally filesystems have already mapped the page
> > > > into bios before they call clear_page_dirty_for_io(), so it's too
> > > > late for the filesystem to bounce the page at that point.
> > > 
> > > Yes, for filesystem it is too late. But the plan we figured back in October
> > > was to do the bouncing in the block layer. I.e., mark the bio (or just the
> > > particular page) as needing bouncing and then use the existing page
> > > bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> > > it was still a separate fs driver) has been using a mechanism like this to
> > > make DIF/DIX work with its metadata.
> > 
> > Sure, that's a possibility, but that doesn't close off any race
> > conditions because there can be DMA into the page in progress while
> > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > different in that there is no 3rd-party access to the page while it
> > is under IO (ext3 arbitrates all access to it's metadata), and so
> > nothing can actually race for modification of the page between
> > submission and bouncing at the block layer.
> >
> > In this case, the moment the page is unlocked, anyone else can map
> > it and start (R)DMA on it, and that can happen before the bio is
> > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > solve the problem of racing writeback and DMA direct to the page we
> > are doing IO on. Yes, it reduces the race window substantially, but
> > it doesn't get rid of it.
> 
> The scenario you describe here cannot happen exactly because of the
> wait_for_stable_page() in ->page_mkwrite() you mention below.

In general, no, because stable pages are controlled by block
devices.

void wait_for_stable_page(struct page *page)
{
        if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
                wait_on_page_writeback(page);
}


I have previously advocated for the filesystem to be in control of
stable pages but, well, too many people shouted "but performance!"
and so we still have all these holes I wanted to close in our
code...

> If someone
> will try to GUP a page that is under writeback (has already PageWriteback
> set), GUP will have to do a write fault because the page is writeprotected
> in page tables and go into ->page_mkwrite() which will wait.

Correct, but that doesn't close the problem down because stable
pages are something we cannot rely on right now. We need to fix
wait_for_stable_page() to always block on page writeback before
this specific race condition goes away.

> The problem rather is with someone mapping the page *before* writeback
> starts, giving the page to HW. Then clear_page_dirty_for_io() writeprotects
> the page in PTEs but the HW gives a damn about that. Then, after we add the
> page to the bio but before the page gets bounced by the block layer, the HW
> can still modify it.

Sure, that's yet another aspect of the same problem - not getting a
write fault when the page is being written to. If we got a write
fault, then the wait_for_stable_page() call in ->page_mkwrite would
then solve the problem.

Essentially, what we are talking about is how to handle broken
hardware. I say we should just burn it with napalm and thermite
(i.e. taint the kernel with "unsupportable hardware") and force
wait_for_stable_page() to trigger when there are GUP mappings if
the underlying storage doesn't already require it.

> So for anything in the block layer and below, doing the bouncing in the
> block layer is enough. So DIF/DIX will work, device mapper targets will
> work etc.  But you are right that if the filesystem itself needs stable
> data, then bouncing in the block layer is not enough. Thanks for catching
> this hole, I didn't think of that. Ext4 does not need stable data during
> writeback, btrfs probably does as it needs to compute data checksums during
> writeout. So we'll probably need something like clear_page_dirty_for_io()
> returning a bounced page for data integrity writeback of a pinned page. And
> the filesystem would then need to submit this page for IO.
> 
> > /me points to wait_for_stable_page() in ->page_mkwrite as the
> > mechanism we already have to avoid races between dirtying mapped
> > pages and page writeback....
> 
> Yes, the problem is that some RDMA hardware does not (and AFAIU cannot be
> made to) follow the wait_for_stable_page() rules. And we have a userspace
> visible API by which application can ask the kernel to give mapped file
> pages to the HW as a buffer and the HW can fill in the contents at its will
> until userspace tells the kernel to remove these mapped pages from the HW.
> The interval how long userspace leaves these pages to the HW is upto
> userspace and in practice it is often counted in hours.

Yes, and as we've discussed in previous threads and at LSFMM, that
needs to die and be replaced with file leases that allow the kernel
to revoke the GUP mappings by breaking the lease.

> If you want to tell me this is broken and could always corrupt data and
> kill the kernel and what not, I agree. The goal is to avoid at least the
> worst issues without breaking userspace too badly.

My goal is to create infrastructure that is sane, reliable,
verifiable, and free of gaping data integrity holes, and that
performs as well as the current nasty kernel-bypass hacks we have
now. i.e. I don't care about the current mess as it's largely
unfixable - we're at the point where we need to fix the
architecture, not keep trying to slap band-aids over the worst of
the symptoms of the current mess.

> > If it's permanently dirty, how do we trigger new COW operations
> > after writeback has "cleaned" the page? i.e. we still need a
> > ->page_mkwrite call to run before we allow the next write to the
> > page to be done, regardless of whether the page is "permanently
> > dirty" or not....
> 
> Interaction with COW is certainly an interesting problem. When the page
> gets pinned, GUP will make sure the page is writeably mapped and trigger a
> write fault if not. So at the moment the page is pinned, we are sure the
> page is COWed. Now the question is what should happen when the file A
> containing this pinned page gets reflinked to file B while the page is still
> pinned.
> 
> Options I can see are:
> 
> 1) Fail the reflink.
>   - difficult for sysadmin to discover the source of failure
>
> 2) Block reflink until the pin of the page is released.
>   - can last for a long time, again difficult to discover
> 
> 3) Don't do anything special.
>   - can corrupt data as read accesses through file B won't see
>     modifications done to the page (and thus eventually the corresponding disk
>     block) by the HW.
> 
> 4) Immediately COW the block during reflink when the corresponding page
>    cache page is pinned.
>   - seems as the best solution at this point, although sadly also requires
>     the most per-filesystem work

None of the above are acceptable solutions - they all have nasty
corner cases which are going to be difficult to get right, test,
etc. IMO, the robust, reliable, testable solution is this:

5) The reflink breaks the file lease, the userspace app releases the
pinned pages on the file and drops the lease. The reflink proceeds,
does its work, and then the app gets a new lease on the file. When
the app pins the pages again, it triggers new ->page_mkwrite calls
to break any sharing that the reflink created. And if the app fails
to drop the lease, then we can either fail with a lease-related
error or kill it....

XFS already has all the file lease breaking stuff in it, we just
need userspace interfaces to expose them and userspace apps to start
using them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18  9:30                                       ` Jan Kara
@ 2018-12-18 23:29                                         ` John Hubbard
  2018-12-19  2:07                                           ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: John Hubbard @ 2018-12-18 23:29 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/18/18 1:30 AM, Jan Kara wrote:
> On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
>> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
>>> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
>>>> Sure, that's a possibility, but that doesn't close off any race
>>>> conditions because there can be DMA into the page in progress while
>>>> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
>>>> different in that there is no 3rd-party access to the page while it
>>>> is under IO (ext3 arbitrates all access to it's metadata), and so
>>>> nothing can actually race for modification of the page between
>>>> submission and bouncing at the block layer.
>>>>
>>>> In this case, the moment the page is unlocked, anyone else can map
>>>> it and start (R)DMA on it, and that can happen before the bio is
>>>> bounced by the block layer. So AFAICT, block layer bouncing doesn't
>>>> solve the problem of racing writeback and DMA direct to the page we
>>>> are doing IO on. Yes, it reduces the race window substantially, but
>>>> it doesn't get rid of it.
>>>
>>> So the event flow is:
>>>     - userspace create object that match a range of virtual address
>>>       against a given kernel sub-system (let's say infiniband) and
>>>       let's assume that the range is an mmap() of a regular file
>>>     - device driver do GUP on the range (let's assume it is a write
>>>       GUP) so if the page is not already map with write permission
>>>       in the page table than a page fault is trigger and page_mkwrite
>>>       happens
>>>     - Once GUP return the page to the device driver and once the
>>>       device driver as updated the hardware states to allow access
>>>       to this page then from that point on hardware can write to the
>>>       page at _any_ time, it is fully disconnected from any fs event
>>>       like write back, it fully ignore things like page_mkclean
>>>
>>> This is how it is to day, we allowed people to push upstream such
>>> users of GUP. This is a fact we have to live with, we can not stop
>>> hardware access to the page, we can not force the hardware to follow
>>> page_mkclean and force a page_mkwrite once write back ends. This is
>>> the situation we are inheriting (and i am personnaly not happy with
>>> that).
>>>
>>> >From my point of view we are left with 2 choices:
>>>     [C1] break all drivers that do not abide by the page_mkclean and
>>>          page_mkwrite
>>>     [C2] mitigate as much as possible the issue
>>>
>>> For [C2] the idea is to keep track of GUP per page so we know if we
>>> can expect the page to be written to at any time. Here is the event
>>> flow:
>>>     - driver GUP the page and program the hardware, page is mark as
>>>       GUPed
>>>     ...
>>>     - write back kicks in on the dirty page, lock the page and every
>>>       thing as usual , sees it is GUPed and inform the block layer to
>>>       use a bounce page
>>
>> No.  The solution John, Dan & I have been looking at is to take the
>> dirty page off the LRU while it is pinned by GUP.  It will never be
>> found for writeback.
>>
>> That's not the end of the story though.  Other parts of the kernel (eg
>> msync) also need to be taught to stay away from pages which are pinned
>> by GUP.  But the idea is that no page gets written back to storage while
>> it's pinned by GUP.  Only when the last GUP ends is the page returned
>> to the list of dirty pages.
> 
> We've been through this in:
> 
> https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/
> 
> back in July. You cannot just skip pages for fsync(2). So as I wrote above -
> memory cleaning writeback can skip pinned pages. Data integrity writeback
> must be able to write pinned pages. And bouncing is one reasonable way how
> to do that.
> 
> This writeback decision is pretty much independent from the mechanism by
> which we are going to identify pinned pages. Whether that's going to be
> separate counter in struct page, using page->_mapcount, or separately
> allocated data structure as you know promote.
> 
> I currently like the most the _mapcount suggestion from Jerome but I'm not
> really attached to any solution as long as it performs reasonably and
> someone can make it working :) as I don't have time to implement it at
> least till January.
> 

OK, so let's take another look at Jerome's _mapcount idea all by itself (using
*only* the tracking pinned pages aspect), given that it is the lightest weight
solution for that.  

So as I understand it, this would use page->_mapcount to store both the real
mapcount, and the dma pinned count (simply added together), but only do so for
file-backed (non-anonymous) pages:


__get_user_pages()
{
	...
	get_page(page);

	if (!PageAnon(page))
		atomic_inc(&page->_mapcount);
	...
}

put_user_page(struct page *page)
{
	...
	if (!PageAnon(page))
		atomic_dec(&page->_mapcount);

	put_page(page);
	...
}

...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
had in mind?


-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-16 21:58                                 ` Dave Chinner
  2018-12-17 18:11                                   ` Jerome Glisse
@ 2018-12-18 10:33                                   ` Jan Kara
  2018-12-18 23:42                                     ` Dave Chinner
  1 sibling, 1 reply; 207+ messages in thread
From: Jan Kara @ 2018-12-18 10:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 17-12-18 08:58:19, Dave Chinner wrote:
> On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> > Hi!
> > 
> > On Thu 13-12-18 08:46:41, Dave Chinner wrote:
> > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > it may be easier said than done...
> > > > 
> > > > So i want back to the drawing board and first i would like to ascertain
> > > > that we all agree on what the objectives are:
> > > > 
> > > >     [O1] Avoid write back from a page still being written by either a
> > > >          device or some direct I/O or any other existing user of GUP.
> > > >          This would avoid possible file system corruption.
> > > > 
> > > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > > >          considered clean by core mm (buffer head have been remove and
> > > >          with some file system this turns into an ugly mess).
> > > 
> > > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > > "prevent data and/or filesystem corruption" case. The primary goal
> > > we have here is removing our exposure to potential corruption, which
> > > has the secondary effect of avoiding the crash/panics that currently
> > > occur as a result of inconsistent page/filesystem state.
> > > 
> > > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > > /before/ the file-backed page is marked dirty, and hence we don't
> > > expose ourselves to potential corruption or crashes that are a
> > > result of inappropriately calling set_page_dirty() on clean
> > > file-backed pages.
> > 
> > I agree that [O1] - i.e., avoid corrupting fs data - is more important and
> > [O2] is just one consequence of [O1].
> > 
> > > > For [O1] and [O2] i believe a solution with mapcount would work. So
> > > > no new struct, no fake vma, nothing like that. In GUP for file back
> > > > pages we increment both refcount and mapcount (we also need a special
> > > > put_user_page to decrement mapcount when GUP user are done with the
> > > > page).
> > > 
> > > I don't see how a mapcount can prevent anyone from calling
> > > set_page_dirty() inappropriately.
> > > 
> > > > Now for [O1] the write back have to call page_mkclean() to go through
> > > > all reverse mapping of the page and map read only. This means that
> > > > we can count the number of real mapping and see if the mapcount is
> > > > bigger than that. If mapcount is bigger than page is pin and we need
> > > > to use a bounce page to do the writeback.
> > > 
> > > Doesn't work. Generally filesystems have already mapped the page
> > > into bios before they call clear_page_dirty_for_io(), so it's too
> > > late for the filesystem to bounce the page at that point.
> > 
> > Yes, for filesystem it is too late. But the plan we figured back in October
> > was to do the bouncing in the block layer. I.e., mark the bio (or just the
> > particular page) as needing bouncing and then use the existing page
> > bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> > it was still a separate fs driver) has been using a mechanism like this to
> > make DIF/DIX work with its metadata.
> 
> Sure, that's a possibility, but that doesn't close off any race
> conditions because there can be DMA into the page in progress while
> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> different in that there is no 3rd-party access to the page while it
> is under IO (ext3 arbitrates all access to it's metadata), and so
> nothing can actually race for modification of the page between
> submission and bouncing at the block layer.
>
> In this case, the moment the page is unlocked, anyone else can map
> it and start (R)DMA on it, and that can happen before the bio is
> bounced by the block layer. So AFAICT, block layer bouncing doesn't
> solve the problem of racing writeback and DMA direct to the page we
> are doing IO on. Yes, it reduces the race window substantially, but
> it doesn't get rid of it.

The scenario you describe here cannot happen exactly because of the
wait_for_stable_page() in ->page_mkwrite() you mention below. If someone
tries to GUP a page that is under writeback (already has PageWriteback
set), GUP will have to do a write fault because the page is writeprotected
in the page tables, and will go into ->page_mkwrite(), which will wait.

The problem rather is with someone mapping the page *before* writeback
starts, giving the page to the HW. Then clear_page_dirty_for_io()
writeprotects the page in the PTEs, but the HW does not give a damn about
that. Then, after we add the page to the bio but before the page gets
bounced by the block layer, the HW can still modify it.

So for anything in the block layer and below, doing the bouncing in the
block layer is enough. So DIF/DIX will work, device mapper targets will
work etc.  But you are right that if the filesystem itself needs stable
data, then bouncing in the block layer is not enough. Thanks for catching
this hole, I didn't think of that. Ext4 does not need stable data during
writeback, btrfs probably does as it needs to compute data checksums during
writeout. So we'll probably need something like clear_page_dirty_for_io()
returning a bounced page for data integrity writeback of a pinned page. And
the filesystem would then need to submit this page for IO.

> /me points to wait_for_stable_page() in ->page_mkwrite as the
> mechanism we already have to avoid races between dirtying mapped
> pages and page writeback....

Yes, the problem is that some RDMA hardware does not (and AFAIU cannot be
made to) follow the wait_for_stable_page() rules. And we have a userspace
visible API by which an application can ask the kernel to give mapped file
pages to the HW as a buffer, and the HW can fill in the contents at will
until userspace tells the kernel to remove these mapped pages from the HW.
How long userspace leaves these pages with the HW is up to userspace, and
in practice it is often counted in hours.

If you want to tell me this is broken and could always corrupt data and
kill the kernel and what not, I agree. The goal is to avoid at least the
worst issues without breaking userspace too badly.

> > > > For [O2] i believe we can handle that case in the put_user_page()
> > > > function to properly dirty the page without causing filesystem
> > > > freak out.
> > > 
> > > I'm pretty sure you can't call ->page_mkwrite() from
> > > put_user_page(), so I don't think this is workable at all.
> > 
> > Yes, calling ->page_mkwrite() in put_user_page() is not only technically
> > complicated but also too late - DMA has already modified page contents.
> > What we planned to do (again discussed back in October) was to never allow
> > the pinned page to become clean. I.e., clear_page_dirty_for_io() would
> > leave pinned pages dirty. Also we would skip pinned pages for WB_SYNC_NONE
> > writeback as there's no point in that really. That way MM and filesystems
> > would be aware of the real page state - i.e., what's in memory is not in
> > sync (potentially) with what's on disk. I was thinking whether this
> > permanently-dirty state couldn't confuse filesystem in some way but I
> > didn't find anything serious - the worst I could think of are places that
> > do filemap_write_and_wait() and then invalidate page cache e.g. before hole
> > punching or extent shifting.
> 
> If it's permanently dirty, how do we trigger new COW operations
> after writeback has "cleaned" the page? i.e. we still need a
> ->page_mkwrite call to run before we allow the next write to the
> page to be done, regardless of whether the page is "permanently
> dirty" or not....

Interaction with COW is certainly an interesting problem. When the page
gets pinned, GUP will make sure the page is writeably mapped and trigger a
write fault if not. So at the moment the page is pinned, we are sure the
page is COWed. Now the question is what should happen when the file A
containing this pinned page gets reflinked to file B while the page is still
pinned.

Options I can see are:

1) Fail the reflink.
  - difficult for sysadmin to discover the source of failure

2) Block reflink until the pin of the page is released.
  - can last for a long time, again difficult to discover

3) Don't do anything special.
  - can corrupt data as read accesses through file B won't see
    modifications done to the page (and thus eventually the corresponding disk
    block) by the HW.

4) Immediately COW the block during reflink when the corresponding page
   cache page is pinned.
  - seems like the best solution at this point, although sadly it also
    requires the most per-filesystem work

> > But these should work fine as is (page cache
> > invalidation will just happily truncate dirty pages). DIO might get
> > confused by the inability to invalidate dirty pages but then user combining
> > RDMA with DIO on the same file at one moment gets what he deserves...
> 
> I'm almost certain this is a combination that will occur, i.e.
> permanently mapped RDMA file, filesystem backup program uses DIO....

IMO this falls into the same category as two independent processes doing
DIO overwrites of the same file block, or combining mmap and DIO access.
The kernel must not crash and the filesystem must stay consistent, but
the user is responsible for putting together the bits of data that are
left...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
                                                         ` (2 preceding siblings ...)
  2018-12-18  6:12                                       ` Darrick J. Wong
@ 2018-12-18  9:30                                       ` Jan Kara
  2018-12-18 23:29                                         ` John Hubbard
  3 siblings, 1 reply; 207+ messages in thread
From: Jan Kara @ 2018-12-18  9:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Jan Kara, John Hubbard,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
> 
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.

We've been through this in:

https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/

back in July. You cannot just skip pages for fsync(2). So as I wrote
above: memory-cleaning writeback can skip pinned pages, but data
integrity writeback must be able to write pinned pages, and bouncing is
one reasonable way to do that.

This writeback decision is pretty much independent of the mechanism by
which we are going to identify pinned pages - whether that's going to be
a separate counter in struct page, using page->_mapcount, or a
separately allocated data structure as you now promote.

Currently I like Jerome's _mapcount suggestion the most, but I'm not
really attached to any solution as long as it performs reasonably and
someone can make it work :) as I don't have time to implement it at
least till January.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-17 19:48                                       ` Jerome Glisse
  2018-12-18  1:09                                       ` Dave Chinner
@ 2018-12-18  6:12                                       ` Darrick J. Wong
  2018-12-18  9:30                                       ` Jan Kara
  3 siblings, 0 replies; 207+ messages in thread
From: Darrick J. Wong @ 2018-12-18  6:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Jan Kara, John Hubbard,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
> 
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.

Errr... what does fsync do in the meantime?  Not write the page?
That would seem to break what fsync() is supposed to do.

--D

> >     - block layer copy the page to a bounce page effectively creating
> >       a snapshot of what is the content of the real page. This allows
> >       everything in block layer that need stable content to work on
> >       the bounce page (raid, stripping, encryption, ...)
> >     - once write back is done the page is not marked clean but stays
> >       dirty, this effectively disable things like COW for filesystem
> >       and other feature that expect page_mkwrite between write back.
> >       AFAIK it is believe that it is something acceptable
> 
> So none of this is necessary.
> 

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-17 19:48                                       ` Jerome Glisse
@ 2018-12-18  1:09                                       ` Dave Chinner
  2018-12-18  6:12                                       ` Darrick J. Wong
  2018-12-18  9:30                                       ` Jan Kara
  3 siblings, 0 replies; 207+ messages in thread
From: Dave Chinner @ 2018-12-18  1:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Jan Kara, John Hubbard, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.

Pages are found for writeback by mapping tree lookup, not by page LRU
scans (i.e. write_cache_pages() from background writeback).

Are you suggesting that pages pinned by GUP are going to be removed from
the page cache *and* the mapping tree while they are pinned?

> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP. But the idea is that no page gets written back to storage while
> it's pinned by GUP. Only when the last GUP ends is the page returned
> to the list of dirty pages.

I think playing fast and loose with data integrity like this is
fundamentally wrong. If this gets implemented, then I'll be sending
every "I ran sync and then two hours later the system crashed but
the data was lost when the system came back up" bug report directly
to you.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 21:03                                                 ` Matthew Wilcox
@ 2018-12-17 21:15                                                   ` Jerome Glisse
  0 siblings, 0 replies; 207+ messages in thread
From: Jerome Glisse @ 2018-12-17 21:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 01:03:58PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 03:55:01PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > > > found for writeback.
> > > > > > 
> > > > > > With the solution you are proposing we loose GUP fast and we have to
> > > > > > allocate a structure for each page that is under GUP, and the LRU
> > > > > > changes too. Moreover by not writing back there is a greater chance
> > > > > > of data loss.
> > > > > 
> > > > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > > > have to be in struct page?
> > > > 
> > > > hmm_data is not even the issue here, we can have a pincount without
> > > > moving things around. So i do not see the need to complexify any of
> > > > the existing code to add new structure and consume more memory for
> > > > no good reasons. I do not see any benefit in that.
> > > 
> > > You said "we have to allocate a structure for each page that is under
> > > GUP".  The only reason to do that is if we want to keep hmm_data in
> > > struct page.  If we ditch hmm_data, there's no need to allocate a
> > > structure, and we don't lose GUP fast either.
> > 
> > And I have proposed a way that does not need to ditch hmm_data nor
> > remove pages from the LRU. What is it you do not like about that?
> 
> I don't like bounce buffering.  I don't like "end of writeback doesn't
> mark page as clean".  I don't like pages being on the LRU that aren't
> actually removable.  I don't like writing pages back which we know we're
> going to have to write back again.

And my solution allows picking whichever point you prefer: you can
decide to abort writeback if you feel it is better, you can remove the
page from the LRU on the first writeback abort, and so on. So you can do
everything you want with my solution; it is just as flexible. Right now
I am finishing a couple of patchsets; once I am done I will post an RFC.
In the RFC I will keep writeback and bouncing, but it can easily be
turned into no writeback plus removal from the LRU. My feeling is that
not writing back means data loss; at the same time, if the page is under
continuous writes, one can argue that whatever snapshot we write back
might be pointless. I do not see a strong argument either way.

Cheers.
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 20:55                                               ` Jerome Glisse
@ 2018-12-17 21:03                                                 ` Matthew Wilcox
  2018-12-17 21:15                                                   ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: Matthew Wilcox @ 2018-12-17 21:03 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 03:55:01PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > > found for writeback.
> > > > > 
> > > > > With the solution you are proposing we loose GUP fast and we have to
> > > > > allocate a structure for each page that is under GUP, and the LRU
> > > > > changes too. Moreover by not writing back there is a greater chance
> > > > > of data loss.
> > > > 
> > > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > > have to be in struct page?
> > > 
> > > hmm_data is not even the issue here, we can have a pincount without
> > > moving things around. So i do not see the need to complexify any of
> > > the existing code to add new structure and consume more memory for
> > > no good reasons. I do not see any benefit in that.
> > 
> > You said "we have to allocate a structure for each page that is under
> > GUP".  The only reason to do that is if we want to keep hmm_data in
> > struct page.  If we ditch hmm_data, there's no need to allocate a
> > structure, and we don't lose GUP fast either.
> 
> And I have proposed a way that does not need to ditch hmm_data nor
> remove pages from the LRU. What is it you do not like about that?

I don't like bounce buffering.  I don't like "end of writeback doesn't
mark page as clean".  I don't like pages being on the LRU that aren't
actually removable.  I don't like writing pages back which we know we're
going to have to write back again.

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:59                                             ` Matthew Wilcox
@ 2018-12-17 20:55                                               ` Jerome Glisse
  2018-12-17 21:03                                                 ` Matthew Wilcox
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2018-12-17 20:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > found for writeback.
> > > > 
> > > > With the solution you are proposing we loose GUP fast and we have to
> > > > allocate a structure for each page that is under GUP, and the LRU
> > > > changes too. Moreover by not writing back there is a greater chance
> > > > of data loss.
> > > 
> > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > have to be in struct page?
> > 
> > hmm_data is not even the issue here, we can have a pincount without
> > moving things around. So i do not see the need to complexify any of
> > the existing code to add new structure and consume more memory for
> > no good reasons. I do not see any benefit in that.
> 
> You said "we have to allocate a structure for each page that is under
> GUP".  The only reason to do that is if we want to keep hmm_data in
> struct page.  If we ditch hmm_data, there's no need to allocate a
> structure, and we don't lose GUP fast either.

And I have proposed a way that does not need to ditch hmm_data nor
remove pages from the LRU. What is it you do not like about that?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:54                                           ` Jerome Glisse
@ 2018-12-17 19:59                                             ` Matthew Wilcox
  2018-12-17 20:55                                               ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: Matthew Wilcox @ 2018-12-17 19:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > found for writeback.
> > > 
> > > With the solution you are proposing we loose GUP fast and we have to
> > > allocate a structure for each page that is under GUP, and the LRU
> > > changes too. Moreover by not writing back there is a greater chance
> > > of data loss.
> > 
> > Why can't you store the hmm_data in a side data structure?  Why does it
> > have to be in struct page?
> 
> hmm_data is not even the issue here, we can have a pincount without
> moving things around. So i do not see the need to complexify any of
> the existing code to add new structure and consume more memory for
> no good reasons. I do not see any benefit in that.

You said "we have to allocate a structure for each page that is under
GUP".  The only reason to do that is if we want to keep hmm_data in
struct page.  If we ditch hmm_data, there's no need to allocate a
structure, and we don't lose GUP fast either.

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:51                                         ` Matthew Wilcox
@ 2018-12-17 19:54                                           ` Jerome Glisse
  2018-12-17 19:59                                             ` Matthew Wilcox
  0 siblings, 1 reply; 207+ messages in thread
From: Jerome Glisse @ 2018-12-17 19:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > > > Sure, that's a possibility, but that doesn't close off any race
> > > > > conditions because there can be DMA into the page in progress while
> > > > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > > > different in that there is no 3rd-party access to the page while it
> > > > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > > > nothing can actually race for modification of the page between
> > > > > submission and bouncing at the block layer.
> > > > > 
> > > > > In this case, the moment the page is unlocked, anyone else can map
> > > > > it and start (R)DMA on it, and that can happen before the bio is
> > > > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > > > solve the problem of racing writeback and DMA direct to the page we
> > > > > are doing IO on. Yes, it reduces the race window substantially, but
> > > > > it doesn't get rid of it.
> > > > 
> > > > So the event flow is:
> > > >     - userspace creates an object that matches a range of virtual
> > > >       addresses against a given kernel sub-system (let's say
> > > >       infiniband), and let's assume that the range is an mmap()
> > > >       of a regular file
> > > >     - the device driver does GUP on the range (let's assume it is
> > > >       a write GUP), so if the page is not already mapped with write
> > > >       permission in the page table then a page fault is triggered
> > > >       and page_mkwrite happens
> > > >     - once GUP returns the page to the device driver, and once the
> > > >       device driver has updated the hardware state to allow access
> > > >       to this page, then from that point on the hardware can write
> > > >       to the page at _any_ time; it is fully disconnected from any
> > > >       fs event like writeback, and it fully ignores things like
> > > >       page_mkclean
> > > > 
> > > > This is how it is today; we allowed people to push such users of
> > > > GUP upstream. This is a fact we have to live with: we cannot stop
> > > > hardware access to the page, and we cannot force the hardware to
> > > > follow page_mkclean and force a page_mkwrite once writeback ends.
> > > > This is the situation we are inheriting (and I am personally not
> > > > happy with that).
> > > > 
> > > > From my point of view we are left with 2 choices:
> > > >     [C1] break all drivers that do not abide by page_mkclean and
> > > >          page_mkwrite
> > > >     [C2] mitigate the issue as much as possible
> > > > 
> > > > For [C2] the idea is to keep track of GUP per page so we know
> > > > whether to expect the page to be written to at any time. Here is
> > > > the event flow:
> > > >     - the driver GUPs the page and programs the hardware; the page
> > > >       is marked as GUPed
> > > >     ...
> > > >     - writeback kicks in on the dirty page, locks the page and
> > > >       everything as usual, sees it is GUPed, and informs the block
> > > >       layer to use a bounce page
> > > 
> > > No.  The solution John, Dan & I have been looking at is to take the
> > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > found for writeback.
> > > 
> > > That's not the end of the story though.  Other parts of the kernel (eg
> > > msync) also need to be taught to stay away from pages which are pinned
> > > by GUP.  But the idea is that no page gets written back to storage while
> > > it's pinned by GUP.  Only when the last GUP ends is the page returned
> > > to the list of dirty pages.
> > > 
> > > >     - the block layer copies the page to a bounce page, effectively
> > > >       creating a snapshot of the content of the real page. This
> > > >       allows everything in the block layer that needs stable
> > > >       content (RAID, striping, encryption, ...) to work on the
> > > >       bounce page
> > > >     - once writeback is done the page is not marked clean but stays
> > > >       dirty; this effectively disables things like COW for
> > > >       filesystems and other features that expect page_mkwrite
> > > >       between writebacks. AFAIK it is believed that this is
> > > >       acceptable
> > > 
> > > So none of this is necessary.
> > 
> > With the solution you are proposing we lose GUP-fast, we have to
> > allocate a structure for each page that is under GUP, and the LRU
> > changes too. Moreover, by not writing back there is a greater chance
> > of data loss.
> 
> Why can't you store the hmm_data in a side data structure?  Why does it
> have to be in struct page?

hmm_data is not even the issue here; we can have a pincount without
moving things around. So I do not see the need to complicate any of
the existing code by adding a new structure and consuming more memory
for no good reason. I do not see any benefit in that.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:48                                       ` Jerome Glisse
@ 2018-12-17 19:51                                         ` Matthew Wilcox
  2018-12-17 19:54                                           ` Jerome Glisse
  0 siblings, 1 reply; 207+ messages in thread
From: Matthew Wilcox @ 2018-12-17 19:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > > Sure, that's a possibility, but that doesn't close off any race
> > > > conditions because there can be DMA into the page in progress while
> > > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > > different in that there is no 3rd-party access to the page while it
> > > > is under IO (ext3 arbitrates all access to its metadata), and so
> > > > nothing can actually race for modification of the page between
> > > > submission and bouncing at the block layer.
> > > > 
> > > > In this case, the moment the page is unlocked, anyone else can map
> > > > it and start (R)DMA on it, and that can happen before the bio is
> > > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > > solve the problem of racing writeback and DMA direct to the page we
> > > > are doing IO on. Yes, it reduces the race window substantially, but
> > > > it doesn't get rid of it.
> > > 
> > > So the event flow is:
> > >     - userspace creates an object that matches a range of virtual
> > >       addresses against a given kernel sub-system (let's say
> > >       infiniband), and let's assume that the range is an mmap()
> > >       of a regular file
> > >     - the device driver does GUP on the range (let's assume it is
> > >       a write GUP), so if the page is not already mapped with write
> > >       permission in the page table then a page fault is triggered
> > >       and page_mkwrite happens
> > >     - once GUP returns the page to the device driver, and once the
> > >       device driver has updated the hardware state to allow access
> > >       to this page, then from that point on the hardware can write
> > >       to the page at _any_ time; it is fully disconnected from any
> > >       fs event like writeback, and it fully ignores things like
> > >       page_mkclean
> > > 
> > > This is how it is today; we allowed people to push such users of
> > > GUP upstream. This is a fact we have to live with: we cannot stop
> > > hardware access to the page, and we cannot force the hardware to
> > > follow page_mkclean and force a page_mkwrite once writeback ends.
> > > This is the situation we are inheriting (and I am personally not
> > > happy with that).
> > > 
> > > From my point of view we are left with 2 choices:
> > >     [C1] break all drivers that do not abide by page_mkclean and
> > >          page_mkwrite
> > >     [C2] mitigate the issue as much as possible
> > > 
> > > For [C2] the idea is to keep track of GUP per page so we know
> > > whether to expect the page to be written to at any time. Here is
> > > the event flow:
> > >     - the driver GUPs the page and programs the hardware; the page
> > >       is marked as GUPed
> > >     ...
> > >     - writeback kicks in on the dirty page, locks the page and
> > >       everything as usual, sees it is GUPed, and informs the block
> > >       layer to use a bounce page
> > 
> > No.  The solution John, Dan & I have been looking at is to take the
> > dirty page off the LRU while it is pinned by GUP.  It will never be
> > found for writeback.
> > 
> > That's not the end of the story though.  Other parts of the kernel (eg
> > msync) also need to be taught to stay away from pages which are pinned
> > by GUP.  But the idea is that no page gets written back to storage while
> > it's pinned by GUP.  Only when the last GUP ends is the page returned
> > to the list of dirty pages.
> > 
> > >     - the block layer copies the page to a bounce page, effectively
> > >       creating a snapshot of the content of the real page. This
> > >       allows everything in the block layer that needs stable
> > >       content (RAID, striping, encryption, ...) to work on the
> > >       bounce page
> > >     - once writeback is done the page is not marked clean but stays
> > >       dirty; this effectively disables things like COW for
> > >       filesystems and other features that expect page_mkwrite
> > >       between writebacks. AFAIK it is believed that this is
> > >       acceptable
> > 
> > So none of this is necessary.
> 
> With the solution you are proposing we lose GUP-fast, we have to
> allocate a structure for each page that is under GUP, and the LRU
> changes too. Moreover, by not writing back there is a greater chance
> of data loss.

Why can't you store the hmm_data in a side data structure?  Why does it
have to be in struct page?
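
For illustration, the "side data structure" alternative being asked about can
be sketched without touching struct page at all: a lookup table keyed by page
frame number that only allocates an entry for pages that are actually pinned.
This is a hedged userspace model (a toy chained hash table standing in for
something like an xarray keyed by pfn); `pin_get`, `pin_put`, and the bucket
count are made-up names for the sketch, and a real kernel version would need
locking and its own allocation strategy.

```c
#include <assert.h>
#include <stdlib.h>

/* Side table: pin counts live outside the page itself, keyed by pfn. */
#define PIN_BUCKETS 64

struct pin_entry {
	unsigned long pfn;
	long pincount;
	struct pin_entry *next;
};

static struct pin_entry *pin_table[PIN_BUCKETS];

static struct pin_entry *pin_lookup(unsigned long pfn)
{
	struct pin_entry *e = pin_table[pfn % PIN_BUCKETS];

	while (e && e->pfn != pfn)
		e = e->next;
	return e;
}

/* Called on GUP: memory is consumed only for pages that get pinned. */
static void pin_get(unsigned long pfn)
{
	struct pin_entry *e = pin_lookup(pfn);

	if (!e) {
		e = calloc(1, sizeof(*e));
		e->pfn = pfn;
		e->next = pin_table[pfn % PIN_BUCKETS];
		pin_table[pfn % PIN_BUCKETS] = e;
	}
	e->pincount++;
}

/* Called on put_user_page(): drop the count, free the entry at zero. */
static void pin_put(unsigned long pfn)
{
	struct pin_entry **pp = &pin_table[pfn % PIN_BUCKETS];

	while (*pp && (*pp)->pfn != pfn)
		pp = &(*pp)->next;
	if (*pp && --(*pp)->pincount == 0) {
		struct pin_entry *e = *pp;

		*pp = e->next;
		free(e);
	}
}

static int pin_is_pinned(unsigned long pfn)
{
	return pin_lookup(pfn) != NULL;
}
```

The trade-off the thread is debating is visible in the sketch: the side table
costs an allocation and a lookup per pinned page (and would cost GUP-fast its
lockless path), while the in-page counter costs nothing extra but overloads an
existing field.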

^ permalink raw reply	[flat|nested] 207+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions