linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
@ 2019-03-06 23:54 john.hubbard
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: john.hubbard @ 2019-03-06 23:54 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi Andrew and all,

Can we please apply this (destined for 5.2) once the time is right?
(I see that -mm just got merged into the main tree today.)

We seem to have pretty solid consensus on the concept and details of the
put_user_pages() approach. Or at least, if we don't, someone please speak
up now. Christopher Lameter, especially, since you had some concerns
recently.

Therefore, here is the first patch--only. This allows us to begin
converting the get_user_pages() call sites to use put_user_page(), instead
of put_page(). This is in order to implement tracking of get_user_page()
pages.

Normally I'd include a user of this code, but in this case, I think we have
examples of how it will work in the RFC and related discussions [1]. What
matters more at this point is unblocking the ability to start fixing up
various subsystems, through git trees other than linux-mm. For example, the
Infiniband example conversion now needs to pick up some prerequisite
patches via the RDMA tree. It seems likely that other call sites may need
similar attention, and so having put_user_pages() available would really
make this go more quickly.
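
For reference, here is a minimal sketch of what a single call-site
conversion looks like (hypothetical driver code, not taken from any
particular subsystem; it assumes this patch is applied):

    /* Hypothetical example only -- not from any real call site. */
    static int example_pin_and_dirty(unsigned long start)
    {
            struct page *page;
            long ret;

            ret = get_user_pages(start, 1, FOLL_WRITE, &page, NULL);
            if (ret != 1)
                    return ret < 0 ? ret : -EFAULT;

            /* ...device writes to the page here... */

            set_page_dirty_lock(page);
            put_user_page(page);    /* was: put_page(page); */
            return 0;
    }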

Previous cover letter follows:
==============================

A discussion of the overall problem is below.

As mentioned in patch 0001, the steps to fix the problem are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

Overview
========

Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time, the
API to achieve that was get_user_pages ("GUP") and its variations. However,
GUP has critical limitations that have been overlooked; in particular, GUP
does not interact correctly with filesystems in all situations. That means
that file-backed memory + GUP is a recipe for potential problems, some of
which have already occurred in the field.

GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
to get the struct page behind a virtual address and to let storage hardware
perform a direct copy to or from that page. This is a short-lived access
pattern, and as such, the window for a concurrent writeback of GUP'd page
was small enough that there were not (we think) any reported problems.
Also, userspace was expected to understand and accept that Direct IO was
not synchronized with memory-mapped access to that data, nor with any
process address space changes such as munmap(), mremap(), etc.

Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning makes
an underlying design problem more obvious.

In fact, there are a number of key problems inherent to GUP:

Interactions with file systems
==============================

File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: filesystems do not
actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that pattern has been
going on since about 2005 or so.

Long term GUP
=============

Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can lead
to filesystem corruption. What happens is that when a file-backed page is
being written back, it is first mapped read-only in all of the CPU page
tables; the file system then assumes that nobody can write to the page, and
that the page content is therefore stable. Unfortunately, the GUP callers
generally do not monitor changes to the CPU page tables; they instead
assume that the following pattern is safe (it's not):

    get_user_pages()

    Hardware can keep a reference to those pages for a very long time,
    and write to them at any time. Because "hardware" here means "devices
    that are not a CPU", this activity occurs without any interaction
    with the kernel's file system code.

    for each page
        set_page_dirty
        put_page()

In fact, the GUP documentation even recommends that pattern.

Anyway, the file system assumes that the page is stable (nothing is writing
to the page), and that is a problem: stable page content is necessary for
many filesystem actions during writeback, such as checksum, encryption,
RAID striping, etc. Furthermore, filesystem features like COW (copy on
write) or snapshot also rely on being able to use a new page as the memory
for that memory range inside the file.

Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a bounce
page to write stable data to the filesystem. The filesystem would work
on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but note
that it is only part of the overall solution, because other problems
remain.
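
Conceptually (this is only a sketch of the idea, not code from any
proposed patch), such a bounce-page writeback would look roughly like:

    /*
     * Conceptual sketch: snapshot the gup-pinned page into a bounce
     * page, and write back the bounce page, while hardware may keep
     * writing to the original page.
     */
    bounce = alloc_page(GFP_NOFS);
    if (bounce) {
            copy_highpage(bounce, page);
            /* ...submit 'bounce' for I/O instead of 'page'... */
    }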

Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be to
require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.

Direct IO
=========

Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file.  The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place.  Here, Direct IO races with a write back: it calls
GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to doing anything with
the page contents.  However, it's still a real problem.  The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages.  That way, GUP will
properly synchronize with the active write back.  This does not change the
required GUP behavior, it just avoids that race.
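
As a purely conceptual sketch (not actual kernel code, and not part of
this series), the idea amounts to something like this in the GUP path:

    /*
     * Conceptual only: refuse to hand out a page that is under
     * writeback when a writable pin is requested, so the caller falls
     * back to the fault path and ->page_mkwrite() runs first.
     */
    if ((gup_flags & FOLL_WRITE) && PageWriteback(page))
            return NULL;    /* retried via the normal fault path */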

Changes since v2:

 * Reduced down to just one patch, in order to avoid dependencies between
   subsystem git repos.

 * Rebased to latest linux.git: commit afe6fe7036c6 ("Merge tag
   'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")

 * Added Ira's review tag, based on
   https://lore.kernel.org/lkml/20190215002312.GC7512@iweiny-DESK2.sc.intel.com/


[1] https://lore.kernel.org/r/20190208075649.3025-3-jhubbard@nvidia.com
    (RFC v2: mm: gup/dma tracking)

Cc: Christian Benvenuti <benve@cisco.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Tom Talpey <tom@talpey.com>


John Hubbard (1):
  mm: introduce put_user_page*(), placeholder versions

 include/linux/mm.h | 24 ++++++++++++++
 mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2019-03-06 23:54 ` john.hubbard
  2019-03-08  2:58   ` Christopher Lameter
                     ` (2 more replies)
  2019-03-07  8:37 ` [PATCH v3 0/1] " Ira Weiny
  2019-03-08  3:08 ` Christopher Lameter
  2 siblings, 3 replies; 33+ messages in thread
From: john.hubbard @ 2019-03-06 23:54 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.
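
For illustration only (hypothetical driver code, not part of this
patch), the kind of open-coded loop that these helpers replace is:

    /* Open-coded release loop, as found in many gup call sites today: */
    for (i = 0; i < npages; i++) {
            set_page_dirty_lock(pages[i]);
            put_page(pages[i]);
    }

    /* With this patch, the same call site can instead use: */
    put_user_pages_dirty_lock(pages, npages);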

This is the first step of fixing a problem (also described in [1] and
[2]) with interactions between get_user_pages ("gup") and filesystems.

Problem description: let's start with a bug report. Below is what happens
sometimes, under memory pressure, when a driver pins some pages via gup,
and then marks those pages dirty, and releases them. Note that the gup
documentation actually recommends that pattern. The problem is that the
filesystem may do a writeback while the pages were gup-pinned, and then the
filesystem believes that the pages are clean. So, when the driver later
marks the pages as dirty, that conflicts with the filesystem's page
tracking and results in a BUG(), like this one that I experienced:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: filesystems do not
actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that pattern has been
going on since about 2005 or so.

The steps to fix it are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 24 ++++++++++++++
 mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..809b7397d41e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ *
+ * put_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. put_user_page() calls must
+ * be perfectly matched up with get_user_page() calls.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index 4d7d37eb3c40..a6b4f693f46d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/**
+ * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * "gup-pinned page" refers to a page that has had one of the get_user_pages()
+ * variants called on that page.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/**
+ * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/**
+ * put_user_pages() - release an array of gup-pinned pages.
+ * @pages:  array of pages to be released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, release the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
@ 2019-03-07  8:37 ` Ira Weiny
  2019-03-08  3:08 ` Christopher Lameter
  2 siblings, 0 replies; 33+ messages in thread
From: Ira Weiny @ 2019-03-07  8:37 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, Mar 06, 2019 at 03:54:54PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Hi Andrew and all,
> 
> Can we please apply this (destined for 5.2) once the time is right?
> (I see that -mm just got merged into the main tree today.)
> 
> We seem to have pretty solid consensus on the concept and details of the
> put_user_pages() approach. Or at least, if we don't, someone please speak
> up now. Christopher Lameter, especially, since you had some concerns
> recently.
> 
> Therefore, here is the first patch--only. This allows us to begin
> converting the get_user_pages() call sites to use put_user_page(), instead
> of put_page(). This is in order to implement tracking of get_user_page()
> pages.
> 
> Normally I'd include a user of this code, but in this case, I think we have
> examples of how it will work in the RFC and related discussions [1]. What
> matters more at this point is unblocking the ability to start fixing up
> various subsystems, through git trees other than linux-mm. For example, the
> Infiniband example conversion now needs to pick up some prerequisite
> patches via the RDMA tree. It seems likely that other call sites may need
> similar attention, and so having put_user_pages() available would really
> make this go more quickly.
>

FWIW I agree with John.

Ira

> 
> Previous cover letter follows:
> ==============================
> 
> A discussion of the overall problem is below.
> 
> As mentioned in patch 0001, the steps to fix the problem are:
> 
> 1) Provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> Overview
> ========
> 
> Some kernel components (file systems, device drivers) need to access
> memory that is specified via process virtual address. For a long time, the
> API to achieve that was get_user_pages ("GUP") and its variations. However,
> GUP has critical limitations that have been overlooked; in particular, GUP
> does not interact correctly with filesystems in all situations. That means
> that file-backed memory + GUP is a recipe for potential problems, some of
> which have already occurred in the field.
> 
> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.
> 
> Over the years, more GUP uses have appeared (virtualization, device
> drivers, RDMA) that can keep the pages they get via GUP for a long period
> of time (seconds, minutes, hours, days, ...). This long-term pinning makes
> an underlying design problem more obvious.
> 
> In fact, there are a number of key problems inherent to GUP:
> 
> Interactions with file systems
> ==============================
> 
> File systems expect to be able to write back data, both to reclaim pages,
> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> Long term GUP
> =============
> 
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU page tables; they instead
> assume that the following pattern is safe (it's not):
> 
>     get_user_pages()
> 
>     Hardware can keep a reference to those pages for a very long time,
>     and write to them at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
> 
>     for each page
>         set_page_dirty
>         put_page()
> 
> In fact, the GUP documentation even recommends that pattern.
> 
> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page as the memory
> for that memory range inside the file.
> 
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem. The filesystem would work
> on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.
> 
> Other filesystem features that need to replace the page with a new one can
> be inhibited for pages that are GUP-pinned. This will, however, alter and
> limit some of those filesystem features. The only fix for that would be to
> require GUP users to monitor and respond to CPU page table updates.
> Subsystems such as ODP and HMM do this, for example. This aspect of the
> problem is still under discussion.
> 
> Direct IO
> =========
> 
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file.  The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place.  Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to doing anything with
> the page contents.  However, it's still a real problem.  The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages.  That way, GUP will
> properly synchronize with the active write back.  This does not change the
> required GUP behavior, it just avoids that race.
> 
> Changes since v2:
> 
>  * Reduced down to just one patch, in order to avoid dependencies between
>    subsystem git repos.
> 
>  * Rebased to latest linux.git: commit afe6fe7036c6 ("Merge tag
>    'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")
> 
>  * Added Ira's review tag, based on
>    https://lore.kernel.org/lkml/20190215002312.GC7512@iweiny-DESK2.sc.intel.com/
> 
> 
> [1] https://lore.kernel.org/r/20190208075649.3025-3-jhubbard@nvidia.com
>     (RFC v2: mm: gup/dma tracking)
> 
> Cc: Christian Benvenuti <benve@cisco.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Tom Talpey <tom@talpey.com>
> 
> 
> John Hubbard (1):
>   mm: introduce put_user_page*(), placeholder versions
> 
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
@ 2019-03-08  2:58   ` Christopher Lameter
  2019-03-08  3:15     ` John Hubbard
  2019-03-08 17:57   ` Jerome Glisse
  2019-03-12 15:30   ` Ira Weiny
  2 siblings, 1 reply; 33+ messages in thread
From: Christopher Lameter @ 2019-03-08  2:58 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:

> Dave Chinner's description of this is very clear:
>
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
>
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.

Can we distinguish between real filesystems that actually write to a
backing device and the special filesystems (like hugetlbfs, shm and
friends) that are like anonymous memory and do not require
->page_mkwrite() in the same way as regular filesystems?

The use that I have seen in my section of the world has been restricted to
RDMA and get_user_pages being limited to anonymous memory and those
special filesystems. And if the RDMA memory is of such type then the use
in the past and present is safe.

So another logical approach would be to simply not allow the use of
long term get_user_page() on real filesystem pages. I hope this patch
supports that?
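
Purely for illustration (this is not something the patch implements),
such a restriction might look roughly like this in a long-term gup path:

    /*
     * Illustrative only: reject long-term pins of pages backed by a
     * "real" filesystem, while still allowing anonymous memory,
     * hugetlbfs and shmem pages.
     */
    mapping = page_mapping(page);
    if (mapping && !shmem_mapping(mapping) && !PageHuge(page))
            return -EOPNOTSUPP;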

It is customary after all that a file read or write operation involves one
single file(!) and that what is written either comes from or goes to
memory (anonymous or special memory filesystem).

If you have an mmapped memory segment with a regular device backed file
then you already have one file associated with a memory segment and a
filesystem that does take care of synchronizing the contents of the memory
segment to a backing device.

If you now perform RDMA or device I/O on such a memory segment then you
will have *two* different devices interacting with that memory segment. I
think that ought not to happen and not be supported out of the box. It
will be difficult to handle and the semantics will be hard for users to
understand.

What could happen is that the filesystem could agree on request to allow
third party I/O to go to such a memory segment. But that needs to be well
defined and clearly and explicitly handled by some mechanism in user space
that has well defined semantics for data integrity for the filesystem as
well as the RDMA or device I/O.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
  2019-03-07  8:37 ` [PATCH v3 0/1] " Ira Weiny
@ 2019-03-08  3:08 ` Christopher Lameter
  2019-03-08 19:07   ` Jerome Glisse
  2019-03-10 22:47   ` Dave Chinner
  2 siblings, 2 replies; 33+ messages in thread
From: Christopher Lameter @ 2019-03-08  3:08 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:


> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.

It would be good if that understanding would be enforced somehow given the problems
that we see.

> Interactions with file systems
> ==============================
>
> File systems expect to be able to write back data, both to reclaim pages,

Regular filesystems do that. But usually those are not used with GUP
pinning AFAICT.

> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:

> Long term GUP
> =============
>
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU page tables; they instead
> assume that the following pattern is safe (it's not):
>
>     get_user_pages()
>
>     Hardware can keep a reference to those pages for a very long time,
>     and write to them at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
>
>     for each page
>         set_page_dirty
>         put_page()
>
> In fact, the GUP documentation even recommends that pattern.

Isn't that pattern safe for anonymous memory and memory filesystems like
hugetlbfs etc? Which is the common use case.

> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page as the memory
> for that memory range inside the file.
>
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem. The filesystem would work
> on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.

Yes, you now have the filesystem as well as the GUP pinner claiming
authority over the contents of a single memory segment. Maybe it is better
not to allow that?

> Direct IO
> =========
>
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file.  The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place.  Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to doing anything with
> the page contents.  However, it's still a real problem.  The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages.  That way, GUP will
> properly synchronize with the active write back.  This does not change the
> required GUP behavior, it just avoids that race.

Direct IO on an mmapped, file-backed page doesn't make any sense. The direct
I/O write syscall already specifies one file handle of a filesystem that
the data is to be written onto.  Plus, mmap has already established a
second file handle and another filesystem that is also in charge of that
memory segment.

Two filesystems trying to sync one memory segment, both believing they have
exclusive access, and we want to sort this out. Why? Don't allow this.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08  2:58   ` Christopher Lameter
@ 2019-03-08  3:15     ` John Hubbard
  2019-03-08 17:43       ` Weiny, Ira
  0 siblings, 1 reply; 33+ messages in thread
From: John Hubbard @ 2019-03-08  3:15 UTC (permalink / raw)
  To: Christopher Lameter, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel

On 3/7/19 6:58 PM, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> 
>> Dave Chinner's description of this is very clear:
>>
>>     "The fundamental issue is that ->page_mkwrite must be called on every
>>     write access to a clean file backed page, not just the first one.
>>     How long the GUP reference lasts is irrelevant, if the page is clean
>>     and you need to dirty it, you must call ->page_mkwrite before it is
>>     marked writeable and dirtied. Every. Time."
>>
>> This is just one symptom of the larger design problem: filesystems do not
>> actually support get_user_pages() being called on their pages, and letting
>> hardware write directly to those pages--even though that pattern has been
>> going on since about 2005 or so.
> 
> Can we distinguish between real filesystems that actually write to a
> backing device and the special filesystems (like hugetlbfs, shm and
> friends) that are like anonymous memory and do not require
> ->page_mkwrite() in the same way as regular filesystems?

Yes. I'll change the wording in the commit message to say "real filesystems
that actually write to a backing device", instead of "filesystems". That
does help, thanks.

> 
> The use that I have seen in my section of the world has been restricted to
> RDMA and get_user_pages being limited to anonymous memory and those
> special filesystems. And if the RDMA memory is of such type then the use
> in the past and present is safe.

Agreed.

> 
> So a logical other approach would be to simply not allow the use of
> long term get_user_page() on real filesystem pages. I hope this patch
> supports that?

This patch neither prevents nor provides that. What this patch does is
provide a prerequisite to clear identification of pages that have had
get_user_pages() called on them.


> 
> It is customary after all that a file read or write operation involve one
> single file(!) and that what is written either comes from or goes to
> memory (anonymous or special memory filesystem).
> 
> If you have an mmapped memory segment with a regular device backed file
> then you already have one file associated with a memory segment and a
> filesystem that does take care of synchronizing the contents of the memory
> segment to a backing device.
> 
> If you now perform RDMA or device I/O on such a memory segment then you
> will have *two* different devices interacting with that memory segment. I
> think that ought not to happen and not be supported out of the box. It
> will be difficult to handle and the semantics will be hard for users to
> understand.
> 
> What could happen is that the filesystem could agree on request to allow
> third party I/O to go to such a memory segment. But that needs to be well
> defined and clearly and explicitly handled by some mechanism in user space
> that has well defined semantics for data integrity for the filesystem as
> well as the RDMA or device I/O.
> 

Those discussions are underway. Dave Chinner and others have been talking
about filesystem leases, for example. The key point here is that we'll still
need, in any of these approaches, to be able to identify the gup-pinned
pages. And there are lots (100+) of call sites to change. So I figure we'd
better get that started.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08  3:15     ` John Hubbard
@ 2019-03-08 17:43       ` Weiny, Ira
  0 siblings, 0 replies; 33+ messages in thread
From: Weiny, Ira @ 2019-03-08 17:43 UTC (permalink / raw)
  To: John Hubbard, Christopher Lameter, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Williams, Dan J, Dave Chinner, Dalessandro,
	Dennis, Doug Ledford, Jan Kara, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Marciniszyn, Mike,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel

> Subject: Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder
> versions
> 
> On 3/7/19 6:58 PM, Christopher Lameter wrote:
> > On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> >
> >> Dave Chinner's description of this is very clear:
> >>
> >>     "The fundamental issue is that ->page_mkwrite must be called on every
> >>     write access to a clean file backed page, not just the first one.
> >>     How long the GUP reference lasts is irrelevant, if the page is clean
> >>     and you need to dirty it, you must call ->page_mkwrite before it is
> >>     marked writeable and dirtied. Every. Time."
> >>
> >> This is just one symptom of the larger design problem: filesystems do
> >> not actually support get_user_pages() being called on their pages,
> >> and letting hardware write directly to those pages--even though that
> >> pattern has been going on since about 2005 or so.
> >
> > Can we distinguish between real filesystems that actually write to a
> > backing device and the special filesystems (like hugetlbfs, shm and
> > friends) that are like anonymous memory and do not require
> > ->page_mkwrite() in the same way as regular filesystems?
> 
> Yes. I'll change the wording in the commit message to say "real filesystems
> that actually write to a backing device", instead of "filesystems". That does
> help, thanks.
> 
> >
> > The use that I have seen in my section of the world has been
> > restricted to RDMA and get_user_pages being limited to anonymous
> > memory and those special filesystems. And if the RDMA memory is of
> > such type then the use in the past and present is safe.
> 
> Agreed.
> 
> >
> > So another logical approach would be to simply not allow the use of
> > long term get_user_page() on real filesystem pages. I hope this patch
> > supports that?
> 
> This patch neither prevents nor provides that. What this patch does is
> provide a prerequisite to clear identification of pages that have had
> get_user_pages() called on them.
> 
> 
> >
> > It is customary after all that a file read or write operation involves
> > one single file(!) and that what is written either comes from or goes
> > to memory (anonymous or special memory filesystem).
> >
> > If you have an mmapped memory segment with a regular device backed
> > file then you already have one file associated with a memory segment
> > and a filesystem that does take care of synchronizing the contents of
> > the memory segment to a backing device.
> >
> > If you now perform RDMA or device I/O on such a memory segment then
> > you will have *two* different devices interacting with that memory
> > segment. I think that ought not to happen and not be supported out of
> > the box. It will be difficult to handle and the semantics will be hard
> > for users to understand.
> >
> > What could happen is that the filesystem could agree on request to
> > allow third party I/O to go to such a memory segment. But that needs
> > to be well defined and clearly and explicitly handled by some
> > mechanism in user space that has well defined semantics for data
> > integrity for the filesystem as well as the RDMA or device I/O.
> >
> 
> Those discussions are underway. Dave Chinner and others have been talking
> about filesystem leases, for example. The key point here is that we'll still
> need, in any of these approaches, to be able to identify the gup-pinned pages.
> And there are lots (100+) of call sites to change. So I figure we'd better get
> that started.
>

+ 1

I'm exploring patch sets like this.  Having this interface available will, IMO, allow for better review of those patches rather than saying "go over to John's tree to get the prerequisite patches".  :-D

Also I think it will be easier for users to get things right by calling [get|put]_user_pages() rather than get_user_pages() followed by put_page().

Ira

> thanks,
> --
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
  2019-03-08  2:58   ` Christopher Lameter
@ 2019-03-08 17:57   ` Jerome Glisse
  2019-03-08 21:27     ` John Hubbard
  2019-03-12 15:30   ` Ira Weiny
  2 siblings, 1 reply; 33+ messages in thread
From: Jerome Glisse @ 2019-03-08 17:57 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages were gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> The steps to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Just a small comment below that would help my life :)

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++

Why not put those functions in gup.c instead of swap.c?

>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 4d7d37eb3c40..a6b4f693f46d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +typedef int (*set_dirty_func)(struct page *page);

set_dirty_func_t would be better, as the rule for typedefs is to append
the _t suffix; it also makes it easier for coccinelle patches.
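
For example, just to spell out the suggestion:

    typedef int (*set_dirty_func_t)(struct page *page);

    static void __put_user_pages_dirty(struct page **pages,
                                       unsigned long npages,
                                       set_dirty_func_t sdf)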


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08  3:08 ` Christopher Lameter
@ 2019-03-08 19:07   ` Jerome Glisse
  2019-03-12  4:52     ` Christopher Lameter
  2019-03-10 22:47   ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Jerome Glisse @ 2019-03-08 19:07 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Fri, Mar 08, 2019 at 03:08:40AM +0000, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> 
> 
> > GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> > to get the struct page behind a virtual address and to let storage hardware
> > perform a direct copy to or from that page. This is a short-lived access
> > pattern, and as such, the window for a concurrent writeback of GUP'd page
> > was small enough that there were not (we think) any reported problems.
> > Also, userspace was expected to understand and accept that Direct IO was
> > not synchronized with memory-mapped access to that data, nor with any
> > process address space changes such as munmap(), mremap(), etc.
> 
> It would be good if that understanding were enforced somehow, given the
> problems that we see.

This has been discussed extensively already. GUP usage is now widespread in
multiple drivers; removing it would regress userspace, i.e. break existing
applications. We all know what the rules for that are.

> 
> > Interactions with file systems
> > ==============================
> >
> > File systems expect to be able to write back data, both to reclaim pages,
> 
> Regular filesystems do that. But usually those are not used with GUP
> pinning AFAICT.
> 
> > and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> > write access to the file memory pages means that such hardware can dirty
> > the pages, without the filesystem being aware. This can, in some cases
> > (depending on filesystem, filesystem options, block device, block device
> > options, and other variables), lead to data corruption, and also to kernel
> > bugs of the form:
> 
> > Long term GUP
> > =============
> >
> > Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> > writeable mapping is created), and the pages are file-backed. That can lead
> > to filesystem corruption. What happens is that when a file-backed page is
> > being written back, it is first mapped read-only in all of the CPU page
> > tables; the file system then assumes that nobody can write to the page, and
> > that the page content is therefore stable. Unfortunately, the GUP callers
> > generally do not monitor changes to the CPU pages tables; they instead
> > assume that the following pattern is safe (it's not):
> >
> >     get_user_pages()
> >
> >     Hardware can keep a reference to those pages for a very long time,
> >     and write to it at any time. Because "hardware" here means "devices
> >     that are not a CPU", this activity occurs without any interaction
> >     with the kernel's file system code.
> >
> >     for each page
> >         set_page_dirty
> >         put_page()
> >
> > In fact, the GUP documentation even recommends that pattern.
> 
> Isn't that pattern safe for anonymous memory and memory filesystems like
> hugetlbfs etc? Which is the common use case.

Still an issue with respect to swapout: if an anon/shmem page was mapped
read-only in preparation for swapout and we do not report the page as
dirty, what ends up in swap might lack what was last written through
GUP.

> 
> > Anyway, the file system assumes that the page is stable (nothing is writing
> > to the page), and that is a problem: stable page content is necessary for
> > many filesystem actions during writeback, such as checksum, encryption,
> > RAID striping, etc. Furthermore, filesystem features like COW (copy on
> > write) or snapshot also rely on being able to use a new page as the memory
> > for that memory range inside the file.
> >
> > Corruption during write back is clearly possible here. To solve that, one
> > idea is to identify pages that have active GUP, so that we can use a bounce
> > page to write stable data to the filesystem. The filesystem would work
> > on the bounce page, while any of the active GUP might write to the
> > original page. This would avoid the stable page violation problem, but note
> > that it is only part of the overall solution, because other problems
> > remain.
> 
> Yes you now have the filesystem as well as the GUP pinner claiming
> authority over the contents of a single memory segment. Maybe better not
> allow that?

This goes back to regressing existing driver with existing users.

> 
> > Direct IO
> > =========
> >
> > Direct IO can cause corruption, if userspace does Direct-IO that writes to
> > a range of virtual addresses that are mmap'd to a file.  The pages written
> > to are file-backed pages that can be under write back, while the Direct IO
> > is taking place.  Here, Direct IO races with a write back: it calls
> > GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> > The race window is pretty small, which is probably why years have gone by
> > before we noticed this problem: Direct IO is generally very quick, and
> > tends to finish up before the filesystem gets around to do anything with
> > the page contents.  However, it's still a real problem.  The solution is
> > to never let GUP return pages that are under write back, but instead,
> > force GUP to take a write fault on those pages.  That way, GUP will
> > properly synchronize with the active write back.  This does not change the
> > required GUP behavior, it just avoids that race.
> 
> Direct IO on a mmapped file backed page doesn't make any sense. The direct
> I/O write syscall already specifies one file handle of a filesystem that
> the data is to be written onto.  Plus mmap already established a second
> file handle and another filesystem that is also in charge of that
> memory segment.
> 
> Two filesystems trying to sync one memory segment, both believing they have
> exclusive access, and we want to sort this out. Why? Don't allow this.

This is allowed, and it always was; forbidding that case now would regress
existing applications, and it would also mean that we are modifying the
API we expose to userspace. So again, this is not something we can block
without regressing existing users.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 17:57   ` Jerome Glisse
@ 2019-03-08 21:27     ` John Hubbard
  0 siblings, 0 replies; 33+ messages in thread
From: John Hubbard @ 2019-03-08 21:27 UTC (permalink / raw)
  To: Jerome Glisse, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel

On 3/8/19 9:57 AM, Jerome Glisse wrote:
[snip] 
> Just a small comment below that would help my life :)
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 

Thanks for the review! 

>> ---
>>  include/linux/mm.h | 24 ++++++++++++++
>>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
> 
> Why not put those functions in gup.c instead of swap.c?

Yes, gup.c is better for these. And it passes the various cross compiler and
tinyconfig builds locally, so I think I'm not missing any cases. (The swap.c 
location was an artifact of very early approaches, pre-dating the
put_user_pages() name.) 

[snip]

>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4d7d37eb3c40..a6b4f693f46d 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>  
>> +typedef int (*set_dirty_func)(struct page *page);
> 
> set_dirty_func_t would be better, since the convention is to append _t to
> typedef names; it also makes writing coccinelle patches easier.
> 

Done. I'm posting a v4 in a moment, with both of the above, plus
Christopher's "real filesystems" wording change, and your reviewed-by
tag.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08  3:08 ` Christopher Lameter
  2019-03-08 19:07   ` Jerome Glisse
@ 2019-03-10 22:47   ` Dave Chinner
  2019-03-12  5:23     ` Christopher Lameter
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-03-10 22:47 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Fri, Mar 08, 2019 at 03:08:40AM +0000, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> > Direct IO
> > =========
> >
> > Direct IO can cause corruption, if userspace does Direct-IO that writes to
> > a range of virtual addresses that are mmap'd to a file.  The pages written
> > to are file-backed pages that can be under write back, while the Direct IO
> > is taking place.  Here, Direct IO races with a write back: it calls
> > GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> > The race window is pretty small, which is probably why years have gone by
> > before we noticed this problem: Direct IO is generally very quick, and
> > tends to finish up before the filesystem gets around to do anything with
> > the page contents.  However, it's still a real problem.  The solution is
> > to never let GUP return pages that are under write back, but instead,
> > force GUP to take a write fault on those pages.  That way, GUP will
> > properly synchronize with the active write back.  This does not change the
> > required GUP behavior, it just avoids that race.
> 
> Direct IO on a mmapped file backed page doesnt make any sense.

People have used it for many, many years as a zero-copy data movement
pattern, i.e. mmap the destination file, use direct IO to DMA directly
into the destination file's page cache pages, then fdatasync() to force
writeback of the destination file.
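
A minimal userspace sketch of that pattern (the function, file handling and
error handling are simplified assumptions for illustration, not taken from
any particular application; len is assumed suitably aligned for O_DIRECT):

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/*
	 * Copy len bytes from src into dst without an intermediate buffer:
	 * the destination file's page cache pages are the direct IO buffer.
	 */
	static int copy_via_mmap_dio(const char *src, const char *dst, size_t len)
	{
		int sfd = open(src, O_RDONLY | O_DIRECT);
		int dfd = open(dst, O_RDWR | O_CREAT, 0644);
		void *map;

		if (sfd < 0 || dfd < 0 || ftruncate(dfd, len) < 0)
			return -1;

		/* mmap the destination file... */
		map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dfd, 0);
		if (map == MAP_FAILED)
			return -1;

		/* ...let direct IO DMA straight into its page cache pages... */
		if (read(sfd, map, len) != (ssize_t)len)
			return -1;

		/* ...and force writeback of the destination file. */
		fdatasync(dfd);

		munmap(map, len);
		close(sfd);
		close(dfd);
		return 0;
	}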

Now that we have copy_file_range() to optimise this sort of data
movement, the need for games with mmap+direct IO largely goes away.
However, we still can't just remove that functionality, as doing so
would break lots of random userspace stuff...
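
And a rough sketch of the copy_file_range() equivalent (same file
descriptors and includes as the sketch above, but opened without O_DIRECT
and with no mmap or page pinning involved):

	static int copy_via_cfr(int sfd, int dfd, size_t len)
	{
		loff_t off_in = 0, off_out = 0;

		while (len > 0) {
			/* The kernel (or offload-capable hardware) moves the bytes. */
			ssize_t n = copy_file_range(sfd, &off_in, dfd, &off_out,
						    len, 0);
			if (n <= 0)
				return -1;
			len -= n;
		}
		return 0;
	}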

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 19:07   ` Jerome Glisse
@ 2019-03-12  4:52     ` Christopher Lameter
  2019-03-12 15:35       ` Jerome Glisse
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Lameter @ 2019-03-12  4:52 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Fri, 8 Mar 2019, Jerome Glisse wrote:

> >
> > It would good if that understanding would be enforced somehow given the problems
> > that we see.
>
> This has been discuss extensively already. GUP usage is now widespread in
> multiple drivers, removing that would regress userspace ie break existing
> application. We all know what the rules for that is.

The applications that work are using anonymous memory and memory
filesystems. I have never seen use cases with a real filesystem and would
have objected if someone tried something crazy like that.

That someone was able to get away with weird ways of abusing the system
is not an argument that we should continue to allow such things. In fact,
we have repeatedly ensured that the kernel works reliably by improving the
kernel so that a proper failure occurs.


> > > In fact, the GUP documentation even recommends that pattern.
> >
> > Isnt that pattern safe for anonymous memory and memory filesystems like
> > hugetlbfs etc? Which is the common use case.
>
> Still an issue in respect to swapout ie if anon/shmem page was map
> read only in preparation for swapout and we do not report the page
> as dirty what endup in swap might lack what was written last through
> GUP.

Well swapout cannot occur if the page is pinned and those pages are also
often mlocked.

> >
> > Yes you now have the filesystem as well as the GUP pinner claiming
> > authority over the contents of a single memory segment. Maybe better not
> > allow that?
>
> This goes back to regressing existing driver with existing users.

There is no regression if that behavior never really worked.

> > Two filesystem trying to sync one memory segment both believing to have
> > exclusive access and we want to sort this out. Why? Dont allow this.
>
> This is allowed, it always was, forbidding that case now would regress
> existing application and it would also means that we are modifying the
> API we expose to userspace. So again this is not something we can block
> without regressing existing user.

We have always stopped the user from doing obviously stupid and risky
things. It would be logical to do it here as well.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-10 22:47   ` Dave Chinner
@ 2019-03-12  5:23     ` Christopher Lameter
  2019-03-12 10:39       ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Lameter @ 2019-03-12  5:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Mon, 11 Mar 2019, Dave Chinner wrote:

> > Direct IO on a mmapped file backed page doesnt make any sense.
>
> People have used it for many, many years as zero-copy data movement
> pattern. i.e. mmap the destination file, use direct IO to DMA direct
> into the destination file page cache pages, fdatasync() to force
> writeback of the destination file.

Well, we could make that safer through a special API that designates a
range of pages in a file, in the same way as for RDMA. This is inherently
not reliable, as we found out.

> Now we have copy_file_range() to optimise this sort of data
> movement, the need for games with mmap+direct IO largely goes away.
> However, we still can't just remove that functionality as it will
> break lots of random userspace stuff...

It is already broken and unreliable. Are there really "lots" of these
things around? Can we test this by adding a warning in the kernel and see
where it actually crops up?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12  5:23     ` Christopher Lameter
@ 2019-03-12 10:39       ` Ira Weiny
  2019-03-12 22:11         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-03-12 10:39 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Dave Chinner, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jan Kara, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Tue, Mar 12, 2019 at 05:23:21AM +0000, Christopher Lameter wrote:
> On Mon, 11 Mar 2019, Dave Chinner wrote:
> 
> > > Direct IO on a mmapped file backed page doesnt make any sense.
> >
> > People have used it for many, many years as zero-copy data movement
> > pattern. i.e. mmap the destination file, use direct IO to DMA direct
> > into the destination file page cache pages, fdatasync() to force
> > writeback of the destination file.
> 
> Well we could make that more safe through a special API that designates a
> range of pages in a file in the same way as for RDMA. This is inherently
> not reliable as we found out.

I'm not following.  What API was not reliable?  In [2] we had ideas on such an
API, but AFAIK these have not been tried.

From what I have seen the above is racy and is prone to the issues John has
seen.  The difference is that Direct IO has a smaller window than RDMA.  (Or at
least I thought we already established that?)

	"And also remember that while RDMA might be the case at least some
	people care about here it really isn't different from any of the other
	gup + I/O cases, including doing direct I/O to a mmap area.  The only
	difference in the various cases is how long the area should be pinned
	down..."

		-- Christoph Hellwig : https://lkml.org/lkml/2018/10/1/591

> 
> > Now we have copy_file_range() to optimise this sort of data
> > movement, the need for games with mmap+direct IO largely goes away.
> > However, we still can't just remove that functionality as it will
> > break lots of random userspace stuff...
> 
> It is already broken and unreliable. Are there really "lots" of these
> things around? Can we test this by adding a warning in the kernel and see
> where it actually crops up?

IMHO I don't think that the copy_file_range() is going to carry us through the
next wave of user performance requirements.  RDMA, while the first, is not the
only technology which is looking to have direct access to files.  XDP is
another.[1]

Ira

[1] https://www.kernel.org/doc/html/v4.19-rc1/networking/af_xdp.html
[2] https://lore.kernel.org/lkml/20190205175059.GB21617@iweiny-DESK2.sc.intel.com/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 22:11         ` Dave Chinner
@ 2019-03-12 15:23           ` Ira Weiny
  2019-03-13 16:03           ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Ira Weiny @ 2019-03-12 15:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christopher Lameter, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jan Kara, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
> On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> > IMHO I don't think that the copy_file_range() is going to carry us through the
> > next wave of user performance requirements.  RDMA, while the first, is not the
> > only technology which is looking to have direct access to files.  XDP is
> > another.[1]
> 
> Sure, all I was doing here was demonstrating that people have been
> trying to get local direct access to file mappings to DMA directly
> into them for a long time. Direct IO games like these are now
> largely unnecessary because we now have much better APIs to do
> zero-copy data transfer between files (which can do hardware offload
> if it is available!).
> 
> It's the long term pins that RDMA does that are the problem here.
> I'm assuming that for XDP, you're talking about userspace zero copy
> from files to the network hardware and vice versa? transmit is
> simple (read-only mapping), but receive probably requires bpf
> programs to ensure that data (minus headers) in the incoming packet
> stream is correctly placed into the UMEM region?

Yes, exactly.

> 
> XDP receive seems pretty much like the same problem as RDMA writes
> into the file. i.e.  the incoming write DMAs are going to have to
> trigger page faults if the UMEM is a long term pin so the filesystem
> behaves correctly with this remote data placement.  I'd suggest that
> RDMA, XDP and anything other hardware that is going to pin
> file-backed mappings for the long term need to use the same "inform
> the fs of a write operation into it's mapping" mechanisms...

Yes agreed.  I have a hack patch I'm testing right now which allows the user to
take a LAYOUT lease from user space and GUP triggers on that, either allowing
or rejecting the pin based on the lease.  I think this is the first step of
what Jan suggested.[1]  There is a lot more detail to work out with what
happens if that lease needs to be broken.

> 
> And if we start talking about wanting to do peer-to-peer DMA from
> network/GPU device to storage device without going through a
> file-backed CPU mapping, we still need to have the filesystem
> involved to translate file offsets to storage locations the
> filesystem has allocated for the data and to lock them down for as
> long as the peer-to-peer DMA offload is in place.  In effect, this
> is the same problem as RDMA+FS-DAXs - the filesystem owns the file
> offset to storage location mapping and manages storage access
> arbitration, not the mm/vma mapping presented to userspace....

I've only daydreamed about peer-to-peer transfers.  But yes, I think this is the
direction we need to go.  But the details of doing a

GPU -> RDMA -> {network } -> RDMA -> FS DAX

And back again... without CPU/OS involvement are only a twinkle in my eye...
If that.

Ira

[1] https://lore.kernel.org/lkml/20190212160707.GA19076@quack2.suse.cz/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
  2019-03-08  2:58   ` Christopher Lameter
  2019-03-08 17:57   ` Jerome Glisse
@ 2019-03-12 15:30   ` Ira Weiny
  2019-03-13  0:38     ` John Hubbard
  2 siblings, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-03-12 15:30 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().

So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] ------------[ cut here ]------------
[ 1355.563436] get_user_pages pin count overflowed
[ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
[ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transpo
rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscs
i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbyp
ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel
crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd s
nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd
_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_m
e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag
200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ah
ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
[ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
[ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
[ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.1023201411
38 10/23/2014
[ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
[ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 4
4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49
bf 00 00
[ 1355.733244] RSP: 0018:ffffc90005a23b30 EFLAGS: 00010286
[ 1355.739536] RAX: 0000000000000000 RBX: ffffea0014220000 RCX: 0000000000000000
[ 1355.748005] RDX: 0000000000000003 RSI: ffffffff827d94a3 RDI: 0000000000000246
[ 1355.756453] RBP: ffffea0014220000 R08: 0000000000000002 R09: 0000000000022400
[ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0000000000000001 R12: 0000000000010207
[ 1355.773369] R13: ffff8884130b7040 R14: fff0000000000fff R15: 000fffffffe00000
[ 1355.781836] FS:  00007f2680d0d740(0000) GS:ffff88842e840000(0000) knlGS:0000000000000000
[ 1355.791384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1355.798319] CR2: 0000000000589000 CR3: 000000040b05e004 CR4: 00000000000606e0
[ 1355.806809] Call Trace:
[ 1355.810078]  follow_page_pte+0x4f3/0x5c0
[ 1355.814987]  __get_user_pages+0x1eb/0x730
[ 1355.820020]  get_user_pages+0x3e/0x50
[ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
[ 1355.830340]  ? _cond_resched+0x15/0x30
[ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
[ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
[ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
[ 1355.853473]  __vfs_write+0x36/0x1b0
[ 1355.857904]  ? selinux_file_permission+0xf0/0x130
[ 1355.863702]  ? security_file_permission+0x2e/0xe0
[ 1355.869503]  vfs_write+0xa5/0x1a0
[ 1355.873751]  ksys_write+0x4f/0xb0
[ 1355.878009]  do_syscall_64+0x5b/0x180
[ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1355.888862] RIP: 0033:0x7f2680ec3ed8
[ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0
d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49
89 d4 55
[ 1355.915573] RSP: 002b:00007ffe65d50bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1355.924621] RAX: ffffffffffffffda RBX: 00007ffe65d50c74 RCX: 00007f2680ec3ed8
[ 1355.933195] RDX: 0000000000000030 RSI: 00007ffe65d50c80 RDI: 0000000000000003
[ 1355.941760] RBP: 0000000000000030 R08: 0000000000000007 R09: 0000000000581260
[ 1355.950326] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000581930
[ 1355.958885] R13: 000000000000000c R14: 0000000000581260 R15: 0000000000000000
[ 1355.967430] ---[ end trace bc771ac6189977a2 ]---


I'm not sure what I did to cause this, and I'm going to work on a reproducer.  At
the time of the warning I only had 1 GUP user?!?!?!?!

I'm not using ODP, so I don't think the changes we have discussed there are a
problem.

Ira

> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below, is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages were gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> The steps to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 4d7d37eb3c40..a6b4f693f46d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/**
> + * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> + * variants called on that page.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/**
> + * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/**
> + * put_user_pages() - release an array of gup-pinned pages.
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, release the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.21.0
> 
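
To make the conversion step described above concrete, here is a sketch of
what a typical call-site change would look like (the surrounding driver
code, the pages[] array, npages and the index variable are assumptions for
illustration; the helpers are the ones introduced by the patch):

	/* Before: the open-coded loop that the GUP documentation recommended. */
	for (i = 0; i < npages; i++) {
		set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}

	/* After: dirty (only if not already dirty) and release via the new helper. */
	put_user_pages_dirty_lock(pages, npages);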

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12  4:52     ` Christopher Lameter
@ 2019-03-12 15:35       ` Jerome Glisse
  2019-03-12 15:53         ` Jason Gunthorpe
  2019-03-13 19:16         ` Christopher Lameter
  0 siblings, 2 replies; 33+ messages in thread
From: Jerome Glisse @ 2019-03-12 15:35 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Tue, Mar 12, 2019 at 04:52:07AM +0000, Christopher Lameter wrote:
> On Fri, 8 Mar 2019, Jerome Glisse wrote:
> 
> > >
> > > It would good if that understanding would be enforced somehow given the problems
> > > that we see.
> >
> > This has been discuss extensively already. GUP usage is now widespread in
> > multiple drivers, removing that would regress userspace ie break existing
> > application. We all know what the rules for that is.
> 
> The applications that work are using anonymous memory and memory
> filesystems. I have never seen use cases with a real filesystem and would
> have objected if someone tried something crazy like that.
> 
> Because someone was able to get away with weird ways of abusing the system
> it not an argument that we should continue to allow such things. In fact
> we have repeatedly ensured that the kernel works reliably by improving the
> kernel so that a proper failure is occurring.

Drivers doing GUP on mmaps of regular files is something that already
seems to have widespread users (in the RDMA devices at least). So they
are active users, and they were never told that what they are doing
was illegal.

Note that I am personally fine with breaking device drivers that cannot
abide by mmu notifiers, but the consensus seems to be that it is not
fine to do so.

> > > > In fact, the GUP documentation even recommends that pattern.
> > >
> > > Isnt that pattern safe for anonymous memory and memory filesystems like
> > > hugetlbfs etc? Which is the common use case.
> >
> > Still an issue in respect to swapout ie if anon/shmem page was map
> > read only in preparation for swapout and we do not report the page
> > as dirty what endup in swap might lack what was written last through
> > GUP.
> 
> Well swapout cannot occur if the page is pinned and those pages are also
> often mlocked.

I would need to check the swapout code, but I believe the write to disk
can happen before the pin check happens. I believe the event flow is:
map read-only, allocate swap, write to disk, try to free the page (which
checks for the pin). So you could write stale data to disk, with the GUP
pin going away before you perform the pin checks.

There are other things to take into account that need proper page
dirtying, like soft dirtiness for instance.


> > >
> > > Yes you now have the filesystem as well as the GUP pinner claiming
> > > authority over the contents of a single memory segment. Maybe better not
> > > allow that?
> >
> > This goes back to regressing existing driver with existing users.
> 
> There is no regression if that behavior never really worked.

Well, the RDMA driver maintainers seem to report that this has been a valid
and working workload for their users.


> > > Two filesystem trying to sync one memory segment both believing to have
> > > exclusive access and we want to sort this out. Why? Dont allow this.
> >
> > This is allowed, it always was, forbidding that case now would regress
> > existing application and it would also means that we are modifying the
> > API we expose to userspace. So again this is not something we can block
> > without regressing existing user.
> 
> We have always stopped the user from doing obviously stupid and risky
> things. It would be logical to do it here as well.

While I would rather only allow devices that can handle mmu notifiers,
it is just not acceptable to regress existing users, and they do seem
to exist and have had working setups going for a while.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 15:35       ` Jerome Glisse
@ 2019-03-12 15:53         ` Jason Gunthorpe
  2019-03-13 19:16         ` Christopher Lameter
  1 sibling, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2019-03-12 15:53 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christopher Lameter, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Tue, Mar 12, 2019 at 11:35:29AM -0400, Jerome Glisse wrote:
> > > > Yes you now have the filesystem as well as the GUP pinner claiming
> > > > authority over the contents of a single memory segment. Maybe better not
> > > > allow that?
> > >
> > > This goes back to regressing existing driver with existing users.
> > 
> > There is no regression if that behavior never really worked.
> 
> Well RDMA driver maintainer seems to report that this has been a valid
> and working workload for their users.

I think it is more O_DIRECT that is the history here..

In RDMA land long term GUPs of file backed pages tend to crash the
kernel (what John is trying to fix here) so I'm not sure there are
actual real & tested users, only people that wish they could do this..

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 10:39       ` Ira Weiny
@ 2019-03-12 22:11         ` Dave Chinner
  2019-03-12 15:23           ` Ira Weiny
  2019-03-13 16:03           ` Christoph Hellwig
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2019-03-12 22:11 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Christopher Lameter, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jan Kara, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> IMHO I don't think that the copy_file_range() is going to carry us through the
> next wave of user performance requirements.  RDMA, while the first, is not the
> only technology which is looking to have direct access to files.  XDP is
> another.[1]

Sure, all I was doing here was demonstrating that people have been
trying to get local direct access to file mappings to DMA directly
into them for a long time. Direct IO games like these are now
largely unnecessary because we now have much better APIs to do
zero-copy data transfer between files (which can do hardware offload
if it is available!).

It's the long term pins that RDMA does that are the problem here.
I'm assuming that for XDP, you're talking about userspace zero copy
from files to the network hardware and vice versa? transmit is
simple (read-only mapping), but receive probably requires bpf
programs to ensure that data (minus headers) in the incoming packet
stream is correctly placed into the UMEM region?

XDP receive seems pretty much like the same problem as RDMA writes
into the file. i.e.  the incoming write DMAs are going to have to
trigger page faults if the UMEM is a long term pin so the filesystem
behaves correctly with this remote data placement.  I'd suggest that
RDMA, XDP and anything other hardware that is going to pin
file-backed mappings for the long term need to use the same "inform
the fs of a write operation into its mapping" mechanisms...

And if we start talking about wanting to do peer-to-peer DMA from
network/GPU device to storage device without going through a
file-backed CPU mapping, we still need to have the filesystem
involved to translate file offsets to storage locations the
filesystem has allocated for the data and to lock them down for as
long as the peer-to-peer DMA offload is in place.  In effect, this
is the same problem as RDMA+FS-DAXs - the filesystem owns the file
offset to storage location mapping and manages storage access
arbitration, not the mm/vma mapping presented to userspace....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 15:30   ` Ira Weiny
@ 2019-03-13  0:38     ` John Hubbard
  2019-03-13 14:49       ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: John Hubbard @ 2019-03-13  0:38 UTC (permalink / raw)
  To: Ira Weiny, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel

On 3/12/19 8:30 AM, Ira Weiny wrote:
> On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
> 
> So I've been running with these patches for a while but today while ramping up
> my testing I hit the following:
> 
> [ 1355.557819] ------------[ cut here ]------------
> [ 1355.563436] get_user_pages pin count overflowed

Hi Ira,

Thanks for reporting this. That overflow, at face value, means that we've
used more than the 22 bits worth of gup pin counts, so about 4 million pins
of the same page...

[snip]
> 
> 
> I'm not sure what I did to do this and I'm going to work on a reproducer.  At
> the time of the Warning I only had 1 GUP user?!?!?!?!

If there is a get_user_pages() call that lacks a corresponding put_user_pages()
call, then the count could start working its way up, and up. Either that, or a
bug in my patches here, could cause this. The basic counting works correctly
in fio runs on an NVMe driver with Direct IO, when I dump out
`cat /proc/vmstat | grep gup`: the counts match up, but that is a simple test.

One way to force a faster repro is to increase the GUP_PIN_COUNTING_BIAS, so
that the gup pin count runs into the max much sooner.

I'd really love to create a test setup that would generate this failure, so
anything you discover on how to repro (including what hardware is required--I'm
sure I can scrounge up some IB gear in a pinch) is of great interest.

Also, I'm just now starting on the DEBUG_USER_PAGE_REFERENCES idea that Jerome,
Jan, and Dan floated some months ago. It's clearly a prerequisite to converting
the call sites properly--just our relatively small IB driver is showing that.
This feature will provide a different mapping of the struct pages, if we get
them via get_user_pages(). That will allow easily asserting that put_user_page()
and put_page() are not swapped, in either direction.


> 
> I'm not using ODP, so I don't think the changes we have discussed there are a
> problem.
> 
> Ira
> 


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13  0:38     ` John Hubbard
@ 2019-03-13 14:49       ` Ira Weiny
  2019-03-14  3:19         ` John Hubbard
  0 siblings, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-03-13 14:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:
> On 3/12/19 8:30 AM, Ira Weiny wrote:
> > On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > > Introduces put_user_page(), which simply calls put_page().
> > > This provides a way to update all get_user_pages*() callers,
> > > so that they call put_user_page(), instead of put_page().
> > 
> > So I've been running with these patches for a while but today while ramping up
> > my testing I hit the following:
> > 
> > [ 1355.557819] ------------[ cut here ]------------
> > [ 1355.563436] get_user_pages pin count overflowed
> 
> Hi Ira,
> 
> Thanks for reporting this. That overflow, at face value, means that we've
> used more than the 22 bits worth of gup pin counts, so about 4 million pins
> of the same page...

This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
than gets...  I'm not sure how but this is for sure my problem.

Backing off to your patch set the numbers are good.

Sorry for the noise.

With the testing I've done today I feel comfortable adding

Tested-by: Ira Weiny <ira.weiny@intel.com>

For the main GUP and InfiniBand patches.

Ira

> 
[snip]
> > 
> > 
> > I'm not sure what I did to cause this, and I'm going to work on a reproducer.  At
> > the time of the warning I only had 1 GUP user?!?!?!?!
> 
> If there is a get_user_pages() call that lacks a corresponding put_user_pages()
> call, then the count could start working its way up, and up. Either that, or a
> bug in my patches here, could cause this. The basic counting works correctly
> in fio runs on an NVMe driver with Direct IO, when I dump out
> `cat /proc/vmstat | grep gup`: the counts match up, but that is a simple test.
> 
> One way to force a faster repro is to increase the GUP_PIN_COUNTING_BIAS, so
> that the gup pin count runs into the max much sooner.
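To make that concrete, here is a minimal standalone sketch of the arithmetic
involved. The 22-bit width matches the "about 4 million pins" figure quoted
later on this page; the macro names and the specific bias values are
assumptions for illustration only, not the RFC's actual code.

/*
 * Each get_user_pages() call that is never balanced by a put_user_page*()
 * leaks 'bias' counts out of a fixed-width field, so a larger bias hits
 * the ceiling after far fewer calls.
 */
#include <stdio.h>

#define GUP_PIN_BITS	22
#define GUP_PIN_MAX	((1u << GUP_PIN_BITS) - 1)

static unsigned long gets_until_overflow(unsigned int bias)
{
	return GUP_PIN_MAX / bias;
}

int main(void)
{
	printf("bias 1:    overflow after %lu unbalanced gets\n",
	       gets_until_overflow(1));
	printf("bias 1024: overflow after %lu unbalanced gets\n",
	       gets_until_overflow(1024));
	return 0;
}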
> 
> I'd really love to create a test setup that would generate this failure, so
> anything you discover on how to repro (including what hardware is required--I'm
> sure I can scrounge up some IB gear in a pinch) is of great interest.
> 
> Also, I'm just now starting on the DEBUG_USER_PAGE_REFERENCES idea that Jerome,
> Jan, and Dan floated some months ago. It's clearly a prerequisite to converting
> the call sites properly--just our relatively small IB driver is showing that.
> This feature will provide a different mapping of the struct pages when they
> are obtained via get_user_pages(). That will make it easy to assert that
> put_user_page() and put_page() are not swapped, in either direction.
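Here is a rough sketch of the kind of assertion such a debug option could
enable; the alias-range mechanism and all helper names below are assumptions
for illustration, and only the config option name comes from the paragraph
above.

#ifdef CONFIG_DEBUG_USER_PAGE_REFERENCES
/*
 * Assumed: pages returned by get_user_pages() are handed out via an
 * alternate mapping of struct pages that lives in this address range.
 */
extern unsigned long gup_debug_map_start, gup_debug_map_end;

static inline bool page_is_gup_mapped(struct page *page)
{
	unsigned long addr = (unsigned long)page;

	return addr >= gup_debug_map_start && addr < gup_debug_map_end;
}

#define assert_put_user_page(page)				\
	VM_WARN_ONCE(!page_is_gup_mapped(page),			\
		     "put_user_page() on a non-GUP page\n")

#define assert_put_page(page)					\
	VM_WARN_ONCE(page_is_gup_mapped(page),			\
		     "put_page() on a get_user_pages() page\n")
#endif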
> 
> 
> > 
> > I'm not using ODP, so I don't think the changes we have discussed there are a
> > problem.
> > 
> > Ira
> > 
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 22:11         ` Dave Chinner
  2019-03-12 15:23           ` Ira Weiny
@ 2019-03-13 16:03           ` Christoph Hellwig
  2019-03-13 19:21             ` Christopher Lameter
  1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2019-03-13 16:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, Christopher Lameter, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
> On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> > IMHO I don't think that the copy_file_range() is going to carry us through the
> > next wave of user performance requirements.  RDMA, while the first, is not the
> > only technology which is looking to have direct access to files.  XDP is
> > another.[1]
> 
> Sure, all I was doing here was demonstrating that people have been
> trying to get local direct access to file mappings to DMA directly
> into them for a long time. Direct IO games like these are now
> largely unnecessary because we now have much better APIs to do
> zero-copy data transfer between files (which can do hardware offload
> if it is available!).

And that is just the file to file case.  There are tons of other
users of get_user_pages, including various drivers that do large
amounts of I/O like video capture.  For them it makes tons of sense
to transfer directly to/from a mmap()ed file.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-12 15:35       ` Jerome Glisse
  2019-03-12 15:53         ` Jason Gunthorpe
@ 2019-03-13 19:16         ` Christopher Lameter
  2019-03-13 19:33           ` Jerome Glisse
  2019-03-14  9:03           ` Jan Kara
  1 sibling, 2 replies; 33+ messages in thread
From: Christopher Lameter @ 2019-03-13 19:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Tue, 12 Mar 2019, Jerome Glisse wrote:

> > > This has been discussed extensively already. GUP usage is now widespread in
> > > multiple drivers; removing that would regress userspace, i.e. break existing
> > > applications. We all know what the rules for that are.

You are still misstating the issue. In RDMA land GUP is widely used for
anonymous memory and memory-based filesystems. *Not* for real filesystems.

> > That someone was able to get away with weird ways of abusing the system
> > is not an argument that we should continue to allow such things. In fact
> > we have repeatedly ensured that the kernel works reliably by improving the
> > kernel so that a proper failure occurs.
>
> Drivers doing GUP on an mmap of a regular file already seem to have
> widespread users (in the RDMA devices at least). So they are active users,
> and they were never told that what they are doing was illegal.

Not true. Again, please differentiate the use cases between regular
filesystems and anonymous mappings.

> > Well swapout cannot occur if the page is pinned and those pages are also
> > often mlocked.
>
> I would need to check the swapout code, but I believe the write to disk
> can happen before the pin check happens. I believe the event flow is:
> map read-only, allocate swap, write to disk, try to free the page (which
> checks for pins). So you could write stale data to disk, with the GUP
> going away before you perform the pin checks.

Allocate swap is a separate step that associates a swap entry to an
anonymous page.

> There are other things to take into account that need proper page
> dirtying, like soft dirtiness for instance.

RDMA mapped pages are all dirty all the time.

> Well, the RDMA driver maintainers seem to report that this has been a
> valid and working workload for their users.

No, they don't.

Could you please get up to date on the discussion before posting?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13 16:03           ` Christoph Hellwig
@ 2019-03-13 19:21             ` Christopher Lameter
  2019-03-14  9:06               ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Lameter @ 2019-03-13 19:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ira Weiny, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Jan Kara, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Wed, 13 Mar 2019, Christoph Hellwig wrote:

> On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
> > On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> > > IMHO I don't think that the copy_file_range() is going to carry us through the
> > > next wave of user performance requirements.  RDMA, while the first, is not the
> > > only technology which is looking to have direct access to files.  XDP is
> > > another.[1]
> >
> > Sure, all I was doing here was demonstrating that people have been
> > trying to get local direct access to file mappings to DMA directly
> > into them for a long time. Direct IO games like these are now
> > largely unnecessary because we now have much better APIs to do
> > zero-copy data transfer between files (which can do hardware offload
> > if it is available!).
>
> And that is just the file to file case.  There are tons of other
> users of get_user_pages, including various drivers that do large
> amounts of I/O like video capture.  For them it makes tons of sense
> to transfer directly to/from a mmap()ed file.

That is very similar to the RDMA case and DAX etc. We need to have a way
to tell a filesystem that this is going to happen and that things need to
be set up for this to work properly.

But if that has not been done, then I think it's proper to fail a long-term
pin operation on page cache pages. That way the regular filesystems
maintain control of what's happening with their pages.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13 19:16         ` Christopher Lameter
@ 2019-03-13 19:33           ` Jerome Glisse
  2019-03-14  9:03           ` Jan Kara
  1 sibling, 0 replies; 33+ messages in thread
From: Jerome Glisse @ 2019-03-13 19:33 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed, Mar 13, 2019 at 07:16:51PM +0000, Christopher Lameter wrote:
> On Tue, 12 Mar 2019, Jerome Glisse wrote:
> 
> > > > This has been discussed extensively already. GUP usage is now widespread in
> > > > multiple drivers; removing that would regress userspace, i.e. break existing
> > > > applications. We all know what the rules for that are.
> 
> You are still misstating the issue. In RDMA land GUP is widely used for
> anonymous memory and memory-based filesystems. *Not* for real filesystems.

Then why are there bug reports, like the one pointed out in the cover
letter? It means someone is doing GUP on filesystem-backed memory. Moreover,
looking at the RDMA driver I do not see anything that checks that the VA
passed to GUP belongs to a vma that is not backed by a regular file.
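For what it's worth, the check being described as missing could look
something like the sketch below; the helper name and where it would be
called from are assumptions for illustration, and nothing like it is implied
to exist in the drivers today.

/*
 * Sketch only: reject long-term GUP on mappings backed by a real
 * filesystem.  A driver would run this on the vma found via find_vma()
 * (under mmap_sem) before pinning the pages.
 */
static bool vma_ok_for_longterm_gup(struct vm_area_struct *vma)
{
	/* anonymous memory and shmem/tmpfs mappings are the easy cases */
	if (vma_is_anonymous(vma) || vma_is_shmem(vma))
		return true;

	/* anything backed by a regular filesystem gets refused */
	return false;
}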

> 
> > > That someone was able to get away with weird ways of abusing the system
> > > is not an argument that we should continue to allow such things. In fact
> > > we have repeatedly ensured that the kernel works reliably by improving the
> > > kernel so that a proper failure occurs.
> >
> > Drivers doing GUP on an mmap of a regular file already seem to have
> > widespread users (in the RDMA devices at least). So they are active users,
> > and they were never told that what they are doing was illegal.
> 
> Not true. Again, please differentiate the use cases between regular
> filesystems and anonymous mappings.

Again, where does the bug come from? Where in RDMA is the check that the
VA belongs to a vma that is not backed by a file?

> 
> > > Well swapout cannot occur if the page is pinned and those pages are also
> > > often mlocked.
> >
> > I would need to check the swapout code, but I believe the write to disk
> > can happen before the pin check happens. I believe the event flow is:
> > map read-only, allocate swap, write to disk, try to free the page (which
> > checks for pins). So you could write stale data to disk, with the GUP
> > going away before you perform the pin checks.
> 
> Allocate swap is a separate step that associates a swap entry to an
> anonymous page.
> 
> > There are other things to take into account that need proper page
> > dirtying, like soft dirtiness for instance.
> 
> RDMA mapped pages are all dirty all the time.

The point is that the pte dirty bit, and the soft dirty bit, might not be
accurate, because the GUP user does not update those bits; thus the GUP user
needs to call set_page_dirty() or similar to properly report page dirtiness.
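A minimal sketch of the pattern being described, for a GUP user whose device
may have written into the pages; error handling is omitted, and with this
series the release side would presumably become put_user_pages_dirty_lock()
or similar.

static void gup_buffer_release(struct page **pages, unsigned long npages,
			       bool dirtied_by_device)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		if (dirtied_by_device)
			/* report the dirtiness the CPU ptes never saw */
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);	/* put_user_page() once converted */
	}
}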

> > Well, the RDMA driver maintainers seem to report that this has been a
> > valid and working workload for their users.
> 
> No, they don't.
> 
> Could you please get up to date on the discussion before posting?

Again, why are there bug reports? Where is the code in RDMA that checks
that the VA does not belong to a vma that is backed by a file?

As much as I would like this use case not to exist, I fear it does, and it
has been upstream for a while. This also very much applies to O_DIRECT,
whether you like it or not.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13 14:49       ` Ira Weiny
@ 2019-03-14  3:19         ` John Hubbard
  0 siblings, 0 replies; 33+ messages in thread
From: John Hubbard @ 2019-03-14  3:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On 3/13/19 7:49 AM, Ira Weiny wrote:
> On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:
>> On 3/12/19 8:30 AM, Ira Weiny wrote:
>>> On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
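For readers joining the thread here, a minimal sketch of the placeholder
described in the text quoted just above, together with a typical call-site
conversion; the array helper and the exact signatures are simplified, so see
the patch itself for the real definitions.

/* Placeholder: semantically identical to put_page() for now. */
static inline void put_user_page(struct page *page)
{
	put_page(page);
}

/* Simplified sketch of the convenience helper for page arrays. */
static inline void put_user_pages(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		put_user_page(pages[i]);
}

/*
 * Typical call-site conversion: a caller that did
 *
 *	get_user_pages(addr, nr, FOLL_WRITE, pages, NULL);
 *	...
 *	for (i = 0; i < nr; i++)
 *		put_page(pages[i]);
 *
 * would instead release the pages with
 *
 *	put_user_pages(pages, nr);
 */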
>>>
>>> So I've been running with these patches for a while but today while ramping up
>>> my testing I hit the following:
>>>
>>> [ 1355.557819] ------------[ cut here ]------------
>>> [ 1355.563436] get_user_pages pin count overflowed
>>
>> Hi Ira,
>>
>> Thanks for reporting this. That overflow, at face value, means that we've
>> used more than the 22 bits worth of gup pin counts, so about 4 million pins
>> of the same page...
> 
> This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
> than gets...  I'm not sure how but this is for sure my problem.
> 
> Backing off to your patch set the numbers are good.

Now that's a welcome bit of good news!

> 
> Sorry for the noise.
> 
> With the testing I've done today I feel comfortable adding
> 
> Tested-by: Ira Weiny <ira.weiny@intel.com>
> 
> For the main GUP and InfiniBand patches.
> 
> Ira
> 

OK, I'll add your tested-by tag to patches 1, 2, 4, 5 (the numbering refers
to the "RFC v2: mm: gup/dma tracking" posting [1]) in my repo [2], and they'll 
show up in the next posting. (Patch 3 is already upstream, and patch 6 is
documentation that needs to be rewritten entirely.)

[1] https://lore.kernel.org/r/20190204052135.25784-1-jhubbard@nvidia.com

[2] https://github.com/johnhubbard/linux/tree/gup_dma_core

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13 19:16         ` Christopher Lameter
  2019-03-13 19:33           ` Jerome Glisse
@ 2019-03-14  9:03           ` Jan Kara
  2019-03-14 12:57             ` Jason Gunthorpe
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-03-14  9:03 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jerome Glisse, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed 13-03-19 19:16:51, Christopher Lameter wrote:
> On Tue, 12 Mar 2019, Jerome Glisse wrote:
> 
> > > > This has been discuss extensively already. GUP usage is now widespread in
> > > > multiple drivers, removing that would regress userspace ie break existing
> > > > application. We all know what the rules for that is.
> 
> You are still misstating the issue. In RDMA land GUP is widely used for
> anonymous memory and memory-based filesystems. *Not* for real filesystems.

Maybe in your RDMA land. But there are apparently other users which do use
mmap of a file on a normal filesystem (e.g. ext4) as a buffer for DMA
(Infiniband does not prohibit this if nothing else; video capture devices
also use a very similar pattern of gup-ing pages and using them as video
buffers). And these users are reporting occasional kernel crashes. That's
how this whole effort started. Sadly, DMA to a file mmap is working well
enough that people started using it, so at this point we cannot just say:
Sorry, it was a mistake to allow this, just rewrite your applications.

Plus we have O_DIRECT I/O, which can use a file mmap as a buffer, and as Dave
Chinner mentioned there are real applications using this.

So no, we are not going to get away with "just forbid GUP for file-backed
pages", which seems to be what you suggest. We might get away with that for
*some* GUP users, and you are welcome to do that in the drivers you care
about, but definitely not for all.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-13 19:21             ` Christopher Lameter
@ 2019-03-14  9:06               ` Jan Kara
  2019-03-18 20:12                 ` John Hubbard
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-03-14  9:06 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Christoph Hellwig, Dave Chinner, Ira Weiny, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Wed 13-03-19 19:21:37, Christopher Lameter wrote:
> On Wed, 13 Mar 2019, Christoph Hellwig wrote:
> 
> > On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> > > > IMHO I don't think that the copy_file_range() is going to carry us through the
> > > > next wave of user performance requirements.  RDMA, while the first, is not the
> > > > only technology which is looking to have direct access to files.  XDP is
> > > > another.[1]
> > >
> > > Sure, all I was doing here was demonstrating that people have been
> > > trying to get local direct access to file mappings to DMA directly
> > > into them for a long time. Direct IO games like these are now
> > > largely unnecessary because we now have much better APIs to do
> > > zero-copy data transfer between files (which can do hardware offload
> > > if it is available!).
> >
> > And that is just the file to file case.  There are tons of other
> > users of get_user_pages, including various drivers that do large
> > amounts of I/O like video capture.  For them it makes tons of sense
> > to transfer directly to/from a mmap()ed file.
> 
> That is very similar to the RDMA case and DAX etc. We need to have a way
> to tell a filesystem that this is going to happen and that things need to
> be set up for this to work properly.

The way to tell filesystem what's happening is exactly what we are working
on with these patches...

> But if that has not been done, then I think it's proper to fail a long-term
> pin operation on page cache pages. That way the regular filesystems
> maintain control of what's happening with their pages.

And as I mentioned in my other email, we cannot just fail the pin for
pagecache pages as that would regress existing applications.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-14  9:03           ` Jan Kara
@ 2019-03-14 12:57             ` Jason Gunthorpe
  2019-03-14 13:30               ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2019-03-14 12:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christopher Lameter, Jerome Glisse, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, John Hubbard

On Thu, Mar 14, 2019 at 10:03:45AM +0100, Jan Kara wrote:
> On Wed 13-03-19 19:16:51, Christopher Lameter wrote:
> > On Tue, 12 Mar 2019, Jerome Glisse wrote:
> > 
> > > > > This has been discussed extensively already. GUP usage is now widespread in
> > > > > multiple drivers; removing that would regress userspace, i.e. break existing
> > > > > applications. We all know what the rules for that are.
> > 
> > You are still misstating the issue. In RDMA land GUP is widely used for
> > anonymous memory and memory-based filesystems. *Not* for real filesystems.
> 
> Maybe in your RDMA land. But there are apparently other users which do use
> mmap of a file on a normal filesystem (e.g. ext4) as a buffer for DMA
> (Infiniband does not prohibit this if nothing else; video capture devices
> also use a very similar pattern of gup-ing pages and using them as video
> buffers). And these users are reporting occasional kernel crashes. That's
> how this whole effort started. Sadly, DMA to a file mmap is working well
> enough that people started using it, so at this point we cannot just say:
> Sorry, it was a mistake to allow this, just rewrite your applications.

This is where we are in RDMA too. People are trying it, and the ones
that do enough load testing find that their kernel oopses.

So it is not clear at all whether this has graduated to real use, or is
just an experiment. Perhaps there are some system configurations that
don't trigger crashes.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-14 12:57             ` Jason Gunthorpe
@ 2019-03-14 13:30               ` Jan Kara
  2019-03-14 20:25                 ` William Kucharski
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-03-14 13:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Christopher Lameter, Jerome Glisse, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Thu 14-03-19 09:57:18, Jason Gunthorpe wrote:
> On Thu, Mar 14, 2019 at 10:03:45AM +0100, Jan Kara wrote:
> > On Wed 13-03-19 19:16:51, Christopher Lameter wrote:
> > > On Tue, 12 Mar 2019, Jerome Glisse wrote:
> > > 
> > > > > > This has been discussed extensively already. GUP usage is now widespread in
> > > > > > multiple drivers; removing that would regress userspace, i.e. break existing
> > > > > > applications. We all know what the rules for that are.
> > > 
> > > You are still misstating the issue. In RDMA land GUP is widely used for
> > > anonymous memory and memory-based filesystems. *Not* for real filesystems.
> > 
> > Maybe in your RDMA land. But there are apparently other users which do use
> > mmap of a file on a normal filesystem (e.g. ext4) as a buffer for DMA
> > (Infiniband does not prohibit this if nothing else; video capture devices
> > also use a very similar pattern of gup-ing pages and using them as video
> > buffers). And these users are reporting occasional kernel crashes. That's
> > how this whole effort started. Sadly, DMA to a file mmap is working well
> > enough that people started using it, so at this point we cannot just say:
> > Sorry, it was a mistake to allow this, just rewrite your applications.
> 
> This is where we are in RDMA too. People are trying it, and the ones
> that do enough load testing find that their kernel oopses.
> 
> So it is not clear at all whether this has graduated to real use, or is
> just an experiment. Perhaps there are some system configurations that
> don't trigger crashes.

Well, I have some crash reports a couple of years old, and they are not
from QA departments. So I'm pretty confident there are real users that use
this in production... and just reboot their machine in case it crashes.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-14 13:30               ` Jan Kara
@ 2019-03-14 20:25                 ` William Kucharski
  2019-03-14 20:37                   ` John Hubbard
  0 siblings, 1 reply; 33+ messages in thread
From: William Kucharski @ 2019-03-14 20:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Christopher Lameter, Jerome Glisse,
	john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard



> On Mar 14, 2019, at 7:30 AM, Jan Kara <jack@suse.cz> wrote:
> 
> Well, I have some crash reports a couple of years old, and they are not
> from QA departments. So I'm pretty confident there are real users that use
> this in production... and just reboot their machine in case it crashes.

Do you know what the use case in those crashes actually was?

I'm curious to know whether they were actually cases of, say, DMA from a
video capture card, or if the uses posited to date are simply theoretical.

It's always good to know who might be doing this and why if for no other
reason than as something to keep in mind when designing future interfaces.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-14 20:25                 ` William Kucharski
@ 2019-03-14 20:37                   ` John Hubbard
  0 siblings, 0 replies; 33+ messages in thread
From: John Hubbard @ 2019-03-14 20:37 UTC (permalink / raw)
  To: William Kucharski, Jan Kara
  Cc: Jason Gunthorpe, Christopher Lameter, Jerome Glisse,
	john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel

On 3/14/19 1:25 PM, William Kucharski wrote:
> 
> 
>> On Mar 14, 2019, at 7:30 AM, Jan Kara <jack@suse.cz> wrote:
>>
>> Well, I have some crash reports a couple of years old, and they are not
>> from QA departments. So I'm pretty confident there are real users that use
>> this in production... and just reboot their machine in case it crashes.
> 
> Do you know what the use case in those crashes actually was?
> 
> I'm curious to know whether they were actually cases of, say, DMA from a
> video capture card, or if the uses posited to date are simply theoretical.


It's not merely theoretical. In addition to Jan's bug reports, I've
personally investigated a bug that involved an GPU (acting basically as
an AI accelerator in this case) that was doing DMA to memory that turned
out to be file backed.

The backtrace for that is in the commit description.

As others have mentioned, this works well enough to lure people into
using it, but then fails when you load down a powerful system (and put
it under memory pressure).

I think that as systems get larger, and more highly threaded, we might
see more such failures--maybe even in the Direct IO case someday,
although so far that race window is so small that that one truly is
still theoretical (or, we just haven't been in communication with
anyone who hit it).

thanks,
-- 
John Hubbard
NVIDIA

> 
> It's always good to know who might be doing this and why if for no other
> reason than as something to keep in mind when designing future interfaces.
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-14  9:06               ` Jan Kara
@ 2019-03-18 20:12                 ` John Hubbard
  0 siblings, 0 replies; 33+ messages in thread
From: John Hubbard @ 2019-03-18 20:12 UTC (permalink / raw)
  To: Jan Kara, Christopher Lameter
  Cc: Christoph Hellwig, Dave Chinner, Ira Weiny, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel

On 3/14/19 2:06 AM, Jan Kara wrote:
> On Wed 13-03-19 19:21:37, Christopher Lameter wrote:
>> On Wed, 13 Mar 2019, Christoph Hellwig wrote:
>>
>>> On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
>>>> On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
>>>>> IMHO I don't think that the copy_file_range() is going to carry us through the
>>>>> next wave of user performance requirements.  RDMA, while the first, is not the
>>>>> only technology which is looking to have direct access to files.  XDP is
>>>>> another.[1]
>>>>
>>>> Sure, all I was doing here was demonstrating that people have been
>>>> trying to get local direct access to file mappings to DMA directly
>>>> into them for a long time. Direct IO games like these are now
>>>> largely unnecessary because we now have much better APIs to do
>>>> zero-copy data transfer between files (which can do hardware offload
>>>> if it is available!).
>>>
>>> And that is just the file to file case.  There are tons of other
>>> users of get_user_pages, including various drivers that do large
>>> amounts of I/O like video capture.  For them it makes tons of sense
>>> to transfer directly to/from a mmap()ed file.
>>
>> That is very similar to the RDMA case and DAX etc. We need to have a way
>> to tell a filesystem that this is going to happen and that things need to
>> be set up for this to work properly.
> 
> The way to tell filesystem what's happening is exactly what we are working
> on with these patches...
> 
>> But if that has not been done, then I think it's proper to fail a long-term
>> pin operation on page cache pages. That way the regular filesystems
>> maintain control of what's happening with their pages.
> 
> And as I mentioned in my other email, we cannot just fail the pin for
> pagecache pages as that would regress existing applications.
> 
> 								Honza
> 

Christopher L,

Are you OK with this approach now? If so, I'd like to collect any additional
ACKs people are willing to provide, and ask Andrew to consider this first 
patch for 5.2, so we can get started.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2019-03-18 20:12 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-06 23:54 [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
2019-03-06 23:54 ` [PATCH v3 1/1] " john.hubbard
2019-03-08  2:58   ` Christopher Lameter
2019-03-08  3:15     ` John Hubbard
2019-03-08 17:43       ` Weiny, Ira
2019-03-08 17:57   ` Jerome Glisse
2019-03-08 21:27     ` John Hubbard
2019-03-12 15:30   ` Ira Weiny
2019-03-13  0:38     ` John Hubbard
2019-03-13 14:49       ` Ira Weiny
2019-03-14  3:19         ` John Hubbard
2019-03-07  8:37 ` [PATCH v3 0/1] " Ira Weiny
2019-03-08  3:08 ` Christopher Lameter
2019-03-08 19:07   ` Jerome Glisse
2019-03-12  4:52     ` Christopher Lameter
2019-03-12 15:35       ` Jerome Glisse
2019-03-12 15:53         ` Jason Gunthorpe
2019-03-13 19:16         ` Christopher Lameter
2019-03-13 19:33           ` Jerome Glisse
2019-03-14  9:03           ` Jan Kara
2019-03-14 12:57             ` Jason Gunthorpe
2019-03-14 13:30               ` Jan Kara
2019-03-14 20:25                 ` William Kucharski
2019-03-14 20:37                   ` John Hubbard
2019-03-10 22:47   ` Dave Chinner
2019-03-12  5:23     ` Christopher Lameter
2019-03-12 10:39       ` Ira Weiny
2019-03-12 22:11         ` Dave Chinner
2019-03-12 15:23           ` Ira Weiny
2019-03-13 16:03           ` Christoph Hellwig
2019-03-13 19:21             ` Christopher Lameter
2019-03-14  9:06               ` Jan Kara
2019-03-18 20:12                 ` John Hubbard
