linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
@ 2019-03-08 21:36 john.hubbard
  2019-03-08 21:36 ` [PATCH v4 1/1] " john.hubbard
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: john.hubbard @ 2019-03-08 21:36 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi Andrew and all,

Can we please apply this (destined for 5.2) once the time is right?
(I see that -mm just got merged into the main tree today.)

We seem to have pretty solid consensus on the concept and details of the
put_user_pages() approach. Or at least, if we don't, someone please speak
up now. Christopher Lameter, especially, since you had some concerns
recently.

Therefore, here is the first patch--only. This allows us to begin
converting the get_user_pages() call sites to use put_user_page(), instead
of put_page(). This is in order to implement tracking of get_user_page()
pages.

Normally I'd include a user of this code, but in this case, I think we have
examples of how it will work in the RFC and related discussions [1]. What
matters more at this point is unblocking the ability to start fixing up
various subsystems, through git trees other than linux-mm. For example, the
Infiniband example conversion now needs to pick up some prerequisite
patches via the RDMA tree. It seems likely that other call sites may need
similar attention, and so having put_user_pages() available would really
make this go more quickly.

Previous cover letter follows:
==============================

A discussion of the overall problem is below.

As mentioned in patch 0001, the steps to fix the problem are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

Overview
========

Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time, the
API to achieve that was get_user_pages ("GUP") and its variations. However,
GUP has critical limitations that have been overlooked; in particular, GUP
does not interact correctly with filesystems in all situations. That means
that file-backed memory + GUP is a recipe for potential problems, some of
which have already occurred in the field.

GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
to get the struct page behind a virtual address and to let storage hardware
perform a direct copy to or from that page. This is a short-lived access
pattern, and as such, the window for a concurrent writeback of a GUP'd page
was small enough that there were not (we think) any reported problems.
Also, userspace was expected to understand and accept that Direct IO was
not synchronized with memory-mapped access to that data, nor with any
process address space changes such as munmap(), mremap(), etc.

Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning makes
an underlying design problem more obvious.

In fact, there are a number of key problems inherent to GUP:

Interactions with file systems
==============================

File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.


Long term GUP
=============

Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can lead
to filesystem corruption. What happens is that when a file-backed page is
being written back, it is first mapped read-only in all of the CPU page
tables; the file system then assumes that nobody can write to the page, and
that the page content is therefore stable. Unfortunately, the GUP callers
generally do not monitor changes to the CPU page tables; they instead
assume that the following pattern is safe (it's not):

    get_user_pages()

    Hardware can keep a reference to those pages for a very long time,
    and write to them at any time. Because "hardware" here means "devices
    that are not a CPU", this activity occurs without any interaction
    with the kernel's file system code.

    for each page
        set_page_dirty
        put_page()

In fact, the GUP documentation even recommends that pattern.

Anyway, the file system assumes that the page is stable (nothing is writing
to the page), and that is a problem: stable page content is necessary for
many filesystem actions during writeback, such as checksum, encryption,
RAID striping, etc. Furthermore, filesystem features like COW (copy on
write) or snapshot also rely on being able to use a new page as the memory
for that range inside the file.

Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a bounce
page to write stable data to the filesystem. The filesystem would work
on the bounce page, while any active GUP users might write to the
original page. This would avoid the stable page violation problem, but note
that it is only part of the overall solution, because other problems
remain.

Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be to
require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.

Direct IO
=========

Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file.  The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place.  Here, Direct IO races with a write back: it calls
GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to doing anything with
the page contents.  However, it's still a real problem.  The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages.  That way, GUP will
properly synchronize with the active write back.  This does not change the
required GUP behavior, it just avoids that race.

Changes since v3:

 * Moved put_user_page*() implementation from swap.c to gup.c, as per
   Jerome's review recommendation.

 * Updated wording in patch #1 (and in this cover letter) to refer to real
   filesystems with a backing store, as per Christopher Lameter's feedback.

 * Rebased to latest linux.git: commit 3601fe43e816 ("Merge tag
   'gpio-v5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio")

Changes since v2:

 * Reduced down to just one patch, in order to avoid dependencies between
   subsystem git repos.

 * Rebased to latest linux.git: commit afe6fe7036c6 ("Merge tag
   'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")

 * Added Ira's review tag, based on
   https://lore.kernel.org/lkml/20190215002312.GC7512@iweiny-DESK2.sc.intel.com/


[1] https://lore.kernel.org/r/20190208075649.3025-3-jhubbard@nvidia.com
    (RFC v2: mm: gup/dma tracking)

Cc: Christian Benvenuti <benve@cisco.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Tom Talpey <tom@talpey.com>

John Hubbard (1):
  mm: introduce put_user_page*(), placeholder versions

 include/linux/mm.h | 24 ++++++++++++++
 mm/gup.c           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 21:36 [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2019-03-08 21:36 ` john.hubbard
  2019-03-19 12:04   ` Kirill A. Shutemov
  2019-03-08 23:21 ` [PATCH v4 0/1] " John Hubbard
  2019-03-19 18:12 ` Christopher Lameter
  2 siblings, 1 reply; 34+ messages in thread
From: john.hubbard @ 2019-03-08 21:36 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing a problem (also described in [1] and
[2]) with interactions between get_user_pages ("gup") and filesystems.

Problem description: let's start with a bug report. Below is what happens
sometimes, under memory pressure, when a driver pins some pages via gup,
and then marks those pages dirty, and releases them. Note that the gup
documentation actually recommends that pattern. The problem is that the
filesystem may do a writeback while the pages were gup-pinned, and then the
filesystem believes that the pages are clean. So, when the driver later
marks the pages as dirty, that conflicts with the filesystem's page
tracking and results in a BUG(), like this one that I experienced:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.

The steps to fix it are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 24 ++++++++++++++
 mm/gup.c           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5801ee849f36..353035c8b115 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ *
+ * put_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. put_user_page() calls must
+ * be perfectly matched up with get_user_page() calls.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/gup.c b/mm/gup.c
index f84e22685aaa..37085b8163b1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -28,6 +28,88 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+typedef int (*set_dirty_func_t)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func_t sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/**
+ * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * "gup-pinned page" refers to a page that has had one of the get_user_pages()
+ * variants called on that page.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/**
+ * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if it was previously listed as clean. Then, release
+ * the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/**
+ * put_user_pages() - release an array of gup-pinned pages.
+ * @pages:  array of pages to be released.
+ * @npages: number of pages in the @pages array.
+ *
+ * For each page in the @pages array, release the page using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 		unsigned int flags)
 {
-- 
2.21.0



* Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 21:36 [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2019-03-08 21:36 ` [PATCH v4 1/1] " john.hubbard
@ 2019-03-08 23:21 ` John Hubbard
  2019-03-19 18:12 ` Christopher Lameter
  2 siblings, 0 replies; 34+ messages in thread
From: John Hubbard @ 2019-03-08 23:21 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton, linux-mm
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel

On 3/8/19 1:36 PM, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Hi Andrew and all,
> 
> Can we please apply this (destined for 5.2) once the time is right?
> (I see that -mm just got merged into the main tree today.)
> 
> We seem to have pretty solid consensus on the concept and details of the
> put_user_pages() approach. Or at least, if we don't, someone please speak
> up now. Christopher Lameter, especially, since you had some concerns
> recently.

I forgot to update the above bit for this v4 patch. In fact, Christopher
already commented on v3. He disagrees that this design should be done at
all; his objection is a high-level point about kernel requirements. I think that
the "we've been supporting this for years" argument needs to prevail,
though, so I'm still pushing this patch for 5.2, and hoping that 
Christopher eventually agrees.

(The changelog is at the end of this cover letter, btw.)

thanks,
-- 
John Hubbard
NVIDIA

> 
> Therefore, here is the first patch--only. This allows us to begin
> converting the get_user_pages() call sites to use put_user_page(), instead
> of put_page(). This is in order to implement tracking of get_user_page()
> pages.
> 
> Normally I'd include a user of this code, but in this case, I think we have
> examples of how it will work in the RFC and related discussions [1]. What
> matters more at this point is unblocking the ability to start fixing up
> various subsystems, through git trees other than linux-mm. For example, the
> Infiniband example conversion now needs to pick up some prerequisite
> patches via the RDMA tree. It seems likely that other call sites may need
> similar attention, and so having put_user_pages() available would really
> make this go more quickly.
> 
> Previous cover letter follows:
> ==============================
> 
> A discussion of the overall problem is below.
> 
> As mentioned in patch 0001, the steps to fix the problem are:
> 
> 1) Provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> Overview
> ========
> 
> Some kernel components (file systems, device drivers) need to access
> memory that is specified via process virtual address. For a long time, the
> API to achieve that was get_user_pages ("GUP") and its variations. However,
> GUP has critical limitations that have been overlooked; in particular, GUP
> does not interact correctly with filesystems in all situations. That means
> that file-backed memory + GUP is a recipe for potential problems, some of
> which have already occurred in the field.
> 
> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of a GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.
> 
> Over the years, more GUP uses have appeared (virtualization, device
> drivers, RDMA) that can keep the pages they get via GUP for a long period
> of time (seconds, minutes, hours, days, ...). This long-term pinning makes
> an underlying design problem more obvious.
> 
> In fact, there are a number of key problems inherent to GUP:
> 
> Interactions with file systems
> ==============================
> 
> File systems expect to be able to write back data, both to reclaim pages,
> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: real filesystems
> that actually write to a backing device, do not actually support
> get_user_pages() being called on their pages, and letting hardware write
> directly to those pages--even though that pattern has been going on since
> about 2005 or so.
> 
> 
> Long term GUP
> =============
> 
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU page tables; they instead
> assume that the following pattern is safe (it's not):
> 
>     get_user_pages()
> 
>     Hardware can keep a reference to those pages for a very long time,
>     and write to them at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
> 
>     for each page
>         set_page_dirty
>         put_page()
> 
> In fact, the GUP documentation even recommends that pattern.
> 
> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page as the memory
> for that range inside the file.
> 
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem. The filesystem would work
> on the bounce page, while any active GUP users might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.
> 
> Other filesystem features that need to replace the page with a new one can
> be inhibited for pages that are GUP-pinned. This will, however, alter and
> limit some of those filesystem features. The only fix for that would be to
> require GUP users to monitor and respond to CPU page table updates.
> Subsystems such as ODP and HMM do this, for example. This aspect of the
> problem is still under discussion.
> 
> Direct IO
> =========
> 
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file.  The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place.  Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to doing anything with
> the page contents.  However, it's still a real problem.  The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages.  That way, GUP will
> properly synchronize with the active write back.  This does not change the
> required GUP behavior, it just avoids that race.
> 
> Changes since v3:
> 
>  * Moved put_user_page*() implementation from swap.c to gup.c, as per
>    Jerome's review recommendation.
> 
>  * Updated wording in patch #1 (and in this cover letter) to refer to real
>    filesystems with a backing store, as per Christopher Lameter's feedback.
> 
>  * Rebased to latest linux.git: commit 3601fe43e816 ("Merge tag
>    'gpio-v5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio")
> 
> Changes since v2:
> 
>  * Reduced down to just one patch, in order to avoid dependencies between
>    subsystem git repos.
> 
>  * Rebased to latest linux.git: commit afe6fe7036c6 ("Merge tag
>    'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")
> 
>  * Added Ira's review tag, based on
>    https://lore.kernel.org/lkml/20190215002312.GC7512@iweiny-DESK2.sc.intel.com/
> 
> 
> [1] https://lore.kernel.org/r/20190208075649.3025-3-jhubbard@nvidia.com
>     (RFC v2: mm: gup/dma tracking)
> 
> Cc: Christian Benvenuti <benve@cisco.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Tom Talpey <tom@talpey.com>
> 
> John Hubbard (1):
>   mm: introduce put_user_page*(), placeholder versions
> 
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/gup.c           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 15:36           ` Jan Kara
@ 2019-03-19  9:03             ` Ira Weiny
  2019-03-19 20:43               ` Tom Talpey
  2019-03-19 19:02             ` John Hubbard
  1 sibling, 1 reply; 34+ messages in thread
From: Ira Weiny @ 2019-03-19  9:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Jerome Glisse, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:
> On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
> > On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > 
> > > > [...]
> > > > 
> > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > > --- a/mm/gup.c
> > > > > > +++ b/mm/gup.c
> > > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > > >  	unsigned int page_mask;
> > > > > >  };
> > > > > >  
> > > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > > +
> > > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > > +				   unsigned long npages,
> > > > > > +				   set_dirty_func_t sdf)
> > > > > > +{
> > > > > > +	unsigned long index;
> > > > > > +
> > > > > > +	for (index = 0; index < npages; index++) {
> > > > > > +		struct page *page = compound_head(pages[index]);
> > > > > > +
> > > > > > +		if (!PageDirty(page))
> > > > > > +			sdf(page);
> > > > > 
> > > > > How is this safe? What prevents the page from being cleared under you?
> > > > > 
> > > > > If it's safe to race with clear_page_dirty*(), that has to be stated
> > > > > explicitly, with a reason why. It's not very clear to me as it is.
> > > > 
> > > > The PageDirty() optimization above is fine to race with clearing of
> > > > the page flag, because that race can only happen after a
> > > > page_mkclean(), when the GUP user is done with the page and the page
> > > > is about to be written back. If the !PageDirty(page) test sees the
> > > > page as dirty and we skip the sdf() call, and a split second later
> > > > TestClearPageDirty() happens, it just means the racing clear is about
> > > > to write the page back, so all is fine (the page was dirty and it is
> > > > being cleaned for writeback).
> > > > 
> > > > If sdf() does get called while racing with writeback, then we have
> > > > merely redirtied the page, just like clear_page_dirty_for_io() would
> > > > do if page_mkclean() failed, so nothing harmful comes of that either.
> > > > The page stays dirty despite the writeback; it just means the page
> > > > might be written back twice in a row.
> > > 
> > > Forgot to mention one thing: we had a discussion with Andrea and Jan
> > > about set_page_dirty(), and Andrea had the good idea of maybe doing
> > > the set_page_dirty() at GUP time (when GUP is called with write), not
> > > when the GUP user calls put_page(). We can do that by setting the
> > > dirty bit in the pte, for instance. There are a few bonuses to doing
> > > things that way:
> > >     - it amortizes the cost of calling set_page_dirty() (ie one call
> > >       for GUP and page_mkclean())
> > >     - it is always safe to do so at GUP time (ie the pte has write
> > >       permission and thus the page is in the correct state)
> > >     - it is safe from truncate races
> > >     - there is no need to ever lock the page
> > > 
> > > An extra bonus from my point of view: it simplifies things for my
> > > generic page protection patchset (KSM for file-backed pages).
> > > 
> > > So maybe we should explore that? It would also be a lot less code.
> > 
> > Yes, please. It sounds more sensible to me to dirty the page on get, not
> > on put.
> 
> I fully agree this is a desirable final state of affairs.

I'm glad to see this presented, because it has crossed my mind more than once
that, effectively, a GUP-pinned page should be considered "dirty" at all times
until the pin is removed.  This is especially true in the RDMA case.

> And with changes
> to how we treat pinned pages during writeback, there won't have to be any
> explicit dirtying at all in the end, because the page is guaranteed to be
> dirty after a write page fault, and the pin will make sure it stays dirty
> until unpinned. However, initially I want the helpers to be as close to the
> code they are replacing as possible, because it will be hard to catch all
> the bugs due to driver conversions even in that situation. So I still think
> that these helpers, as they are, are a good first step. Then we need to
> convert GUP users to use them, after which it is much easier to modify the
> behavior, since it is no longer open-coded in two hundred or however many
> places...

Agreed.  I continue to test with these patches and RDMA and have not seen any
problems thus far.

Ira

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 21:36 ` [PATCH v4 1/1] " john.hubbard
@ 2019-03-19 12:04   ` Kirill A. Shutemov
  2019-03-19 13:47     ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Kirill A. Shutemov @ 2019-03-19 12:04 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages are gup-pinned, after which
> the filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
>     kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
>     backtrace:
>         ext4_writepage
>         __writepage
>         write_cache_pages
>         ext4_writepages
>         do_writepages
>         __writeback_single_inode
>         writeback_sb_inodes
>         __writeback_inodes_wb
>         wb_writeback
>         wb_workfn
>         process_one_work
>         worker_thread
>         kthread
>         ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
>         ({                                                      \
>                 BUG_ON(!PagePrivate(page));                     \
>                 ((struct buffer_head *)page_private(page));     \
>         })
> 
> Dave Chinner's description of this is very clear:
> 
>     "The fundamental issue is that ->page_mkwrite must be called on every
>     write access to a clean file backed page, not just the first one.
>     How long the GUP reference lasts is irrelevant, if the page is clean
>     and you need to dirty it, you must call ->page_mkwrite before it is
>     marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: real filesystems
> that actually write to a backing device do not support get_user_pages()
> being called on their pages, nor letting hardware write directly to those
> pages--even though that pattern has been going on since about 2005 or so.
> 
> The steps to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
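
[Editorial note: the two-step conversion described in the list above is
mechanical at each call site. A minimal userspace sketch of what a converted
call site looks like follows; the mock struct page, the refcount values, and
the driver_release_pages() helper are illustrative stand-ins, not the kernel
implementation.]

```c
#include <assert.h>
#include <stddef.h>

/* Mock page with an explicit refcount; in the kernel this is the page's
 * _refcount, manipulated by put_page() and (now) put_user_page(). */
struct page { int refcount; };

static void put_page(struct page *page)
{
	page->refcount--;
}

/* Step 1 of the series: put_user_page() simply calls put_page(), but
 * gives gup-pinned pages their own release entry point for later
 * tracking. */
static void put_user_page(struct page *page)
{
	put_page(page);
}

/* Step 2: a call site that used to loop over put_page() is converted to
 * release its gup-pinned pages through put_user_page() instead. */
static void driver_release_pages(struct page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		put_user_page(pages[i]);	/* was: put_page(pages[i]); */
}
```

Behavior is identical for now; the point is that every gup pin is dropped
through the new entry point.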
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>    # docs
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 24 ++++++++++++++
>  mm/gup.c           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5801ee849f36..353035c8b115 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/gup.c b/mm/gup.c
> index f84e22685aaa..37085b8163b1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -28,6 +28,88 @@ struct follow_page_context {
>  	unsigned int page_mask;
>  };
>  
> +typedef int (*set_dirty_func_t)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func_t sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);

How is this safe? What prevents the page from being cleared under you?

If it's safe to race with clear_page_dirty*(), that has to be stated
explicitly, with a reason why. It's not very clear to me as it is.

> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/**
> + * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> + * variants called on that page.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);

Have you checked whether the compiler is clever enough to eliminate the
indirect function call here? Maybe it's better to go with an open-coded
approach and get rid of the callbacks?


> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/**
> + * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if it was previously listed as clean. Then, release
> + * the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/**
> + * put_user_pages() - release an array of gup-pinned pages.
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + * For each page in the @pages array, release the page using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);

I believe there's room for improvement for compound pages.

If there are multiple consecutive pages in the array that belong to the
same compound page, we can get away with a single atomic operation to
handle them all.

> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  static struct page *no_page_table(struct vm_area_struct *vma,
>  		unsigned int flags)
>  {
> -- 
> 2.21.0
> 

-- 
 Kirill A. Shutemov
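
[Editorial note: a minimal userspace sketch of the batching Kirill suggests
above -- coalescing consecutive array entries that share a compound head so
one refcount operation covers them all. The struct layout, the mock
compound_head(), and put_user_pages_batched() are illustrative assumptions,
not the kernel implementation.]

```c
#include <assert.h>
#include <stddef.h>

/* Mock compound page: the refcount lives on the head page, and every
 * tail page points at its head (the head points at itself). */
struct page {
	struct page *head;
	int refcount;
};

static struct page *compound_head(struct page *page)
{
	return page->head;
}

/* Release all pages; returns the number of refcount operations
 * performed, so the saving from batching is visible. */
static int put_user_pages_batched(struct page **pages, size_t npages)
{
	int ops = 0;
	size_t i = 0;

	while (i < npages) {
		struct page *head = compound_head(pages[i]);
		size_t batch = 1;

		/* Coalesce consecutive entries sharing the same head. */
		while (i + batch < npages &&
		       compound_head(pages[i + batch]) == head)
			batch++;

		head->refcount -= (int)batch;	/* one op covers the batch */
		ops++;
		i += batch;
	}
	return ops;
}
```

With three consecutive subpages of one compound page plus one standalone
page, this performs two refcount operations instead of four.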

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 12:04   ` Kirill A. Shutemov
@ 2019-03-19 13:47     ` Jerome Glisse
  2019-03-19 14:06       ` Kirill A. Shutemov
                         ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jerome Glisse @ 2019-03-19 13:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > From: John Hubbard <jhubbard@nvidia.com>

[...]

> > diff --git a/mm/gup.c b/mm/gup.c
> > index f84e22685aaa..37085b8163b1 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -28,6 +28,88 @@ struct follow_page_context {
> >  	unsigned int page_mask;
> >  };
> >  
> > +typedef int (*set_dirty_func_t)(struct page *page);
> > +
> > +static void __put_user_pages_dirty(struct page **pages,
> > +				   unsigned long npages,
> > +				   set_dirty_func_t sdf)
> > +{
> > +	unsigned long index;
> > +
> > +	for (index = 0; index < npages; index++) {
> > +		struct page *page = compound_head(pages[index]);
> > +
> > +		if (!PageDirty(page))
> > +			sdf(page);
> 
> How is this safe? What prevents the page from being cleared under you?
> 
> If it's safe to race with clear_page_dirty*(), that has to be stated
> explicitly, with a reason why. It's not very clear to me as it is.

The PageDirty() optimization above is fine to race with clearing of the
page flag, because that race can only happen after a page_mkclean(),
when the GUP user is done with the page and the page is about to be
written back. If the !PageDirty(page) test sees the page as dirty and
we skip the sdf() call, and a split second later TestClearPageDirty()
happens, it just means the racing clear is about to write the page
back, so all is fine (the page was dirty and it is being cleaned for
writeback).

If sdf() does get called while racing with writeback, then we have
merely redirtied the page, just like clear_page_dirty_for_io() would
do if page_mkclean() failed, so nothing harmful comes of that either.
The page stays dirty despite the writeback; it just means the page
might be written back twice in a row.

> > +
> > +		put_user_page(page);
> > +	}
> > +}
> > +
> > +/**
> > + * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> > + * @pages:  array of pages to be marked dirty and released.
> > + * @npages: number of pages in the @pages array.
> > + *
> > + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> > + * variants called on that page.
> > + *
> > + * For each page in the @pages array, make that page (or its head page, if a
> > + * compound page) dirty, if it was previously listed as clean. Then, release
> > + * the page using put_user_page().
> > + *
> > + * Please see the put_user_page() documentation for details.
> > + *
> > + * set_page_dirty(), which does not lock the page, is used here.
> > + * Therefore, it is the caller's responsibility to ensure that this is
> > + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> > + *
> > + */
> > +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> > +{
> > +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> 
> Have you checked whether the compiler is clever enough to eliminate the
> indirect function call here? Maybe it's better to go with an open-coded
> approach and get rid of the callbacks?
> 

Good point, dunno if John did check that.

> 
> > +}
> > +EXPORT_SYMBOL(put_user_pages_dirty);
> > +
> > +/**
> > + * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> > + * @pages:  array of pages to be marked dirty and released.
> > + * @npages: number of pages in the @pages array.
> > + *
> > + * For each page in the @pages array, make that page (or its head page, if a
> > + * compound page) dirty, if it was previously listed as clean. Then, release
> > + * the page using put_user_page().
> > + *
> > + * Please see the put_user_page() documentation for details.
> > + *
> > + * This is just like put_user_pages_dirty(), except that it invokes
> > + * set_page_dirty_lock(), instead of set_page_dirty().
> > + *
> > + */
> > +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> > +{
> > +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> > +}
> > +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> > +
> > +/**
> > + * put_user_pages() - release an array of gup-pinned pages.
> > + * @pages:  array of pages to be marked dirty and released.
> > + * @npages: number of pages in the @pages array.
> > + *
> > + * For each page in the @pages array, release the page using put_user_page().
> > + *
> > + * Please see the put_user_page() documentation for details.
> > + */
> > +void put_user_pages(struct page **pages, unsigned long npages)
> > +{
> > +	unsigned long index;
> > +
> > +	for (index = 0; index < npages; index++)
> > +		put_user_page(pages[index]);
> 
> I believe there's room for improvement for compound pages.
> 
> If there are multiple consecutive pages in the array that belong to the
> same compound page, we can get away with a single atomic operation to
> handle them all.

Yes, maybe just add a comment about that for now and leave this kind of
optimization for later?
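
[Editorial note: the two benign interleavings Jerome argues for above can be
sketched in userspace. The page is reduced to a single dirty flag:
put_path() models the !PageDirty -> sdf() step in __put_user_pages_dirty(),
and writeback_pass() models clear_page_dirty_for_io() plus the IO. Purely
illustrative -- there is no real locking or kernel code here.]

```c
#include <assert.h>
#include <stdbool.h>

struct mock_page { bool dirty; };

/* The put_user_pages_dirty() path: the racy PageDirty() test, then the
 * set-dirty callback only if the page looked clean. */
static void put_path(struct mock_page *p)
{
	if (!p->dirty)		/* the racy PageDirty() check */
		p->dirty = true;	/* sdf() == set_page_dirty() */
}

/* One writeback pass: clear the dirty flag and submit the page for IO.
 * Returns 1 if the page was written back, 0 if there was nothing to do. */
static int writeback_pass(struct mock_page *p)
{
	if (!p->dirty)
		return 0;
	p->dirty = false;	/* TestClearPageDirty() */
	return 1;
}
```

In the first ordering the put path sees the page still dirty and skips the
callback; in the second, writeback clears first and the put path redirties,
so the page is merely written back twice -- harmless, as argued above.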


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 13:47     ` Jerome Glisse
@ 2019-03-19 14:06       ` Kirill A. Shutemov
  2019-03-19 14:15         ` Jerome Glisse
  2019-03-19 20:01         ` John Hubbard
  2019-03-19 14:14       ` Jerome Glisse
  2019-03-19 19:24       ` John Hubbard
  2 siblings, 2 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2019-03-19 14:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> 
> [...]
> 
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index f84e22685aaa..37085b8163b1 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > >  	unsigned int page_mask;
> > >  };
> > >  
> > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > +
> > > +static void __put_user_pages_dirty(struct page **pages,
> > > +				   unsigned long npages,
> > > +				   set_dirty_func_t sdf)
> > > +{
> > > +	unsigned long index;
> > > +
> > > +	for (index = 0; index < npages; index++) {
> > > +		struct page *page = compound_head(pages[index]);
> > > +
> > > +		if (!PageDirty(page))
> > > +			sdf(page);
> > 
> > How is this safe? What prevents the page from being cleared under you?
> > 
> > If it's safe to race with clear_page_dirty*(), that has to be stated
> > explicitly, with a reason why. It's not very clear to me as it is.
> 
> The PageDirty() optimization above is fine to race with clearing of the
> page flag, because that race can only happen after a page_mkclean(),
> when the GUP user is done with the page and the page is about to be
> written back. If the !PageDirty(page) test sees the page as dirty and
> we skip the sdf() call, and a split second later TestClearPageDirty()
> happens, it just means the racing clear is about to write the page
> back, so all is fine (the page was dirty and it is being cleaned for
> writeback).
> 
> If sdf() does get called while racing with writeback, then we have
> merely redirtied the page, just like clear_page_dirty_for_io() would
> do if page_mkclean() failed, so nothing harmful comes of that either.
> The page stays dirty despite the writeback; it just means the page
> might be written back twice in a row.

Fair enough. Should we get it into a comment here?

> > > +void put_user_pages(struct page **pages, unsigned long npages)
> > > +{
> > > +	unsigned long index;
> > > +
> > > +	for (index = 0; index < npages; index++)
> > > +		put_user_page(pages[index]);
> > 
> > I believe there's room for improvement for compound pages.
> > 
> > If there are multiple consecutive pages in the array that belong to the
> > same compound page, we can get away with a single atomic operation to
> > handle them all.
> 
> Yes, maybe just add a comment about that for now and leave this kind of
> optimization for later?

Sounds good to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 13:47     ` Jerome Glisse
  2019-03-19 14:06       ` Kirill A. Shutemov
@ 2019-03-19 14:14       ` Jerome Glisse
  2019-03-19 14:29         ` Kirill A. Shutemov
  2019-03-19 21:23         ` Dave Chinner
  2019-03-19 19:24       ` John Hubbard
  2 siblings, 2 replies; 34+ messages in thread
From: Jerome Glisse @ 2019-03-19 14:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard, Andrea Arcangeli

On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> 
> [...]
> 
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index f84e22685aaa..37085b8163b1 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > >  	unsigned int page_mask;
> > >  };
> > >  
> > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > +
> > > +static void __put_user_pages_dirty(struct page **pages,
> > > +				   unsigned long npages,
> > > +				   set_dirty_func_t sdf)
> > > +{
> > > +	unsigned long index;
> > > +
> > > +	for (index = 0; index < npages; index++) {
> > > +		struct page *page = compound_head(pages[index]);
> > > +
> > > +		if (!PageDirty(page))
> > > +			sdf(page);
> > 
> > How is this safe? What prevents the page to be cleared under you?
> > 
> > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > with a reason why. It's not very clear to me as it is.
> 
> The PageDirty() optimization above is fine to race with clearing of the
> page flag, because that race can only happen after a page_mkclean(),
> when the GUP user is done with the page and the page is about to be
> written back. If the !PageDirty(page) test sees the page as dirty and
> we skip the sdf() call, and a split second later TestClearPageDirty()
> happens, it just means the racing clear is about to write the page
> back, so all is fine (the page was dirty and it is being cleaned for
> writeback).
> 
> If sdf() does get called while racing with writeback, then we have
> merely redirtied the page, just like clear_page_dirty_for_io() would
> do if page_mkclean() failed, so nothing harmful comes of that either.
> The page stays dirty despite the writeback; it just means the page
> might be written back twice in a row.

Forgot to mention one thing: we had a discussion with Andrea and Jan
about set_page_dirty(), and Andrea had the good idea of maybe doing
the set_page_dirty() at GUP time (when GUP is called with write), not
when the GUP user calls put_page(). We can do that by setting the
dirty bit in the pte, for instance. There are a few bonuses to doing
things that way:
    - it amortizes the cost of calling set_page_dirty() (ie one call
      for GUP and page_mkclean())
    - it is always safe to do so at GUP time (ie the pte has write
      permission and thus the page is in the correct state)
    - it is safe from truncate races
    - there is no need to ever lock the page

An extra bonus from my point of view: it simplifies things for my
generic page protection patchset (KSM for file-backed pages).

So maybe we should explore that? It would also be a lot less code.

Cheers,
Jérôme
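
[Editorial note: a minimal userspace sketch of the "dirty the pte at GUP
time" idea relayed above. The bit layout, the PTE_* constants, and
gup_dirty_on_pin() are mock names chosen for illustration; they only mirror
the shape of the real pte helpers, not their implementation.]

```c
#include <assert.h>

/* Mock pte with just a write bit and a dirty bit. */
#define PTE_WRITE 0x1ul
#define PTE_DIRTY 0x2ul

static unsigned long pte_mkdirty(unsigned long pte)
{
	return pte | PTE_DIRTY;
}

/* GUP with write only succeeds on a writable pte, so dirtying at pin
 * time can never mark a page the process was not allowed to write --
 * which is what makes this safe from truncate and mkclean races, and
 * needs no page lock. */
static unsigned long gup_dirty_on_pin(unsigned long pte, int write)
{
	if (write && (pte & PTE_WRITE))
		pte = pte_mkdirty(pte);
	return pte;
}
```

A read-only pin, or a pin of a non-writable pte, leaves the dirty bit
untouched; only a legitimate write pin dirties.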

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 14:06       ` Kirill A. Shutemov
@ 2019-03-19 14:15         ` Jerome Glisse
  2019-03-19 20:01         ` John Hubbard
  1 sibling, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2019-03-19 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard

On Tue, Mar 19, 2019 at 05:06:23PM +0300, Kirill A. Shutemov wrote:
> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > From: John Hubbard <jhubbard@nvidia.com>
> > 
> > [...]
> > 
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f84e22685aaa..37085b8163b1 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > >  	unsigned int page_mask;
> > > >  };
> > > >  
> > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > +
> > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > +				   unsigned long npages,
> > > > +				   set_dirty_func_t sdf)
> > > > +{
> > > > +	unsigned long index;
> > > > +
> > > > +	for (index = 0; index < npages; index++) {
> > > > +		struct page *page = compound_head(pages[index]);
> > > > +
> > > > +		if (!PageDirty(page))
> > > > +			sdf(page);
> > > 
> > > How is this safe? What prevents the page from being cleared under you?
> > > 
> > > If it's safe to race with clear_page_dirty*(), that has to be stated
> > > explicitly, with a reason why. It's not very clear to me as it is.
> > 
> > The PageDirty() optimization above is fine to race with clearing of the
> > page flag, because that race can only happen after a page_mkclean(),
> > when the GUP user is done with the page and the page is about to be
> > written back. If the !PageDirty(page) test sees the page as dirty and
> > we skip the sdf() call, and a split second later TestClearPageDirty()
> > happens, it just means the racing clear is about to write the page
> > back, so all is fine (the page was dirty and it is being cleaned for
> > writeback).
> > 
> > If sdf() does get called while racing with writeback, then we have
> > merely redirtied the page, just like clear_page_dirty_for_io() would
> > do if page_mkclean() failed, so nothing harmful comes of that either.
> > The page stays dirty despite the writeback; it just means the page
> > might be written back twice in a row.
> 
> Fair enough. Should we get it into a comment here?

Yes, definitely. Also, I just sent an email with an alternative that is
slightly better from my POV, as it simplifies my life for other things :)

Cheers,
Jérôme
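
For readers following along, the helper under discussion can be modeled in
plain user-space C. This is only an illustrative sketch of the loop's logic;
struct page, page_dirty(), and set_dirty() below are stand-ins, not the
kernel's types or API:

```c
#include <assert.h>

/* Stand-in for struct page: just a dirty flag. */
struct page {
	int dirty;
};

/* Stand-in for PageDirty(). */
static int page_dirty(const struct page *page)
{
	return page->dirty;
}

/* Stand-in for a set_dirty_func_t target such as set_page_dirty(). */
static int set_dirty(struct page *page)
{
	page->dirty = 1;
	return 1;
}

typedef int (*set_dirty_func_t)(struct page *page);

/*
 * Models __put_user_pages_dirty(): walk the array and dirty each page,
 * skipping pages that already look dirty.  Returns the number of sdf()
 * calls made, so the skip is observable.
 */
static unsigned long put_pages_dirty(struct page **pages,
				     unsigned long npages,
				     set_dirty_func_t sdf)
{
	unsigned long index, calls = 0;

	for (index = 0; index < npages; index++) {
		if (!page_dirty(pages[index])) {
			sdf(pages[index]);
			calls++;
		}
	}
	return calls;
}
```

As Jerome's explanation argues, the unlocked dirty check is an optimization:
racing with a concurrent clear is benign either way.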

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 14:14       ` Jerome Glisse
@ 2019-03-19 14:29         ` Kirill A. Shutemov
  2019-03-19 15:36           ` Jan Kara
  2019-03-19 21:23         ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Kirill A. Shutemov @ 2019-03-19 14:29 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard, Andrea Arcangeli

On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > From: John Hubbard <jhubbard@nvidia.com>
> > 
> > [...]
> > 
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f84e22685aaa..37085b8163b1 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > >  	unsigned int page_mask;
> > > >  };
> > > >  
> > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > +
> > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > +				   unsigned long npages,
> > > > +				   set_dirty_func_t sdf)
> > > > +{
> > > > +	unsigned long index;
> > > > +
> > > > +	for (index = 0; index < npages; index++) {
> > > > +		struct page *page = compound_head(pages[index]);
> > > > +
> > > > +		if (!PageDirty(page))
> > > > +			sdf(page);
> > > 
> > > How is this safe? What prevents the page to be cleared under you?
> > > 
> > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > with a reason why. It's not very clear to me as it is.
> > 
> > The PageDirty() optimization above is fine to race with clearing of
> > the page flag, because such a race can only happen after a
> > page_mkclean(), when the GUP user is done with the page and the page
> > is about to be written back. If the !PageDirty(page) check sees the
> > page as dirty and skips the sdf() call, while a split second later
> > TestClearPageDirty() runs, that means the racing clear is about to
> > write the page back, so all is fine (the page was dirty and it is
> > being cleared for writeback).
> > 
> > If sdf() does get called while racing with writeback, then we have
> > just redirtied the page, exactly as clear_page_dirty_for_io() would do
> > if page_mkclean() failed, so nothing harmful comes of that either. The
> > page stays dirty despite the writeback; it just means the page might
> > be written back twice in a row.
> 
> Forgot to mention one thing: we had a discussion with Andrea and Jan
> about set_page_dirty(), and Andrea had the good idea of maybe doing the
> set_page_dirty() at GUP time (when GUP is done with write), not when
> the GUP user calls put_page(). We can do that by setting the dirty bit
> in the pte, for instance. There are a few bonuses to doing things that
> way:
>     - it amortizes the cost of calling set_page_dirty() (ie. one call
>       covers both GUP and page_mkclean())
>     - it is always safe to do so at GUP time (ie. the pte has write
>       permission and thus the page is in the correct state)
>     - it is safe from the truncate race
>     - there is no need to ever lock the page
> 
> Extra bonus from my point of view: it simplifies things for my generic
> page protection patchset (KSM for file-backed pages).
> 
> So maybe we should explore that? It would also be a lot less code.

Yes, please. It sounds more sensible to me to dirty the page on get, not
on put.
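
A rough user-space model of the proposed split, with all names invented for
illustration (the real mechanism would set the dirty bit in the pte during a
writable GUP, and real pinning semantics are far more involved):

```c
#include <assert.h>

/* Toy page: a dirty flag plus a reference count. */
struct page {
	int dirty;
	int refcount;
};

/*
 * Models GUP with write intent: take a reference and dirty the page up
 * front, once, instead of at release time.
 */
static void gup_one_write(struct page *page)
{
	page->refcount++;
	page->dirty = 1;	/* dirtied at get time */
}

/* The put side then becomes a plain reference drop: no set_page_dirty(). */
static void put_user_page_model(struct page *page)
{
	page->refcount--;
}
```

The bonuses Jerome lists follow from doing the dirtying while the pte is
known to be writable, rather than at some arbitrary later put.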

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 14:29         ` Kirill A. Shutemov
@ 2019-03-19 15:36           ` Jan Kara
  2019-03-19  9:03             ` Ira Weiny
  2019-03-19 19:02             ` John Hubbard
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Kara @ 2019-03-19 15:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel, John Hubbard, Andrea Arcangeli

On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > > [...]
> > > 
> > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > --- a/mm/gup.c
> > > > > +++ b/mm/gup.c
> > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > >  	unsigned int page_mask;
> > > > >  };
> > > > >  
> > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > +
> > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > +				   unsigned long npages,
> > > > > +				   set_dirty_func_t sdf)
> > > > > +{
> > > > > +	unsigned long index;
> > > > > +
> > > > > +	for (index = 0; index < npages; index++) {
> > > > > +		struct page *page = compound_head(pages[index]);
> > > > > +
> > > > > +		if (!PageDirty(page))
> > > > > +			sdf(page);
> > > > 
> > > > How is this safe? What prevents the page to be cleared under you?
> > > > 
> > > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > > with a reason why. It's not very clear to me as it is.
> > > 
> > > The PageDirty() optimization above is fine to race with clearing of
> > > the page flag, because such a race can only happen after a
> > > page_mkclean(), when the GUP user is done with the page and the page
> > > is about to be written back. If the !PageDirty(page) check sees the
> > > page as dirty and skips the sdf() call, while a split second later
> > > TestClearPageDirty() runs, that means the racing clear is about to
> > > write the page back, so all is fine (the page was dirty and it is
> > > being cleared for writeback).
> > > 
> > > If sdf() does get called while racing with writeback, then we have
> > > just redirtied the page, exactly as clear_page_dirty_for_io() would
> > > do if page_mkclean() failed, so nothing harmful comes of that either.
> > > The page stays dirty despite the writeback; it just means the page
> > > might be written back twice in a row.
> > 
> > Forgot to mention one thing: we had a discussion with Andrea and Jan
> > about set_page_dirty(), and Andrea had the good idea of maybe doing the
> > set_page_dirty() at GUP time (when GUP is done with write), not when
> > the GUP user calls put_page(). We can do that by setting the dirty bit
> > in the pte, for instance. There are a few bonuses to doing things that
> > way:
> >     - it amortizes the cost of calling set_page_dirty() (ie. one call
> >       covers both GUP and page_mkclean())
> >     - it is always safe to do so at GUP time (ie. the pte has write
> >       permission and thus the page is in the correct state)
> >     - it is safe from the truncate race
> >     - there is no need to ever lock the page
> > 
> > Extra bonus from my point of view: it simplifies things for my generic
> > page protection patchset (KSM for file-backed pages).
> > 
> > So maybe we should explore that? It would also be a lot less code.
> 
> Yes, please. It sounds more sensible to me to dirty the page on get, not
> on put.

I fully agree this is a desirable final state of affairs. And with changes
to how we treat pinned pages during writeback, there won't have to be any
explicit dirtying at all in the end, because the page is guaranteed to be
dirty after a write page fault, and the pin will make sure it stays dirty
until unpinned. However, initially I want the helpers to be as close to the
code they are replacing as possible, because it will be hard to catch all
the bugs from the driver conversions even in that situation. So I still
think that these helpers, as they are, are a good first step. Then we need
to convert the GUP users to use them, and after that it is much easier to
modify the behavior, since it is no longer open-coded in two hundred or
however many places...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-08 21:36 [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2019-03-08 21:36 ` [PATCH v4 1/1] " john.hubbard
  2019-03-08 23:21 ` [PATCH v4 0/1] " John Hubbard
@ 2019-03-19 18:12 ` Christopher Lameter
  2019-03-19 19:24   ` John Hubbard
  2 siblings, 1 reply; 34+ messages in thread
From: Christopher Lameter @ 2019-03-19 18:12 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard

On Fri, 8 Mar 2019, john.hubbard@gmail.com wrote:

> We seem to have pretty solid consensus on the concept and details of the
> put_user_pages() approach. Or at least, if we don't, someone please speak
> up now. Christopher Lameter, especially, since you had some concerns
> recently.

My concerns do not affect this patchset which just marks the get/put for
the pagecache. The problem was that the description was making claims that
were a bit misleading and seemed to prescribe a solution.

So let's get this merged. Whatever the solution turns out to be, we will
need this markup.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 15:36           ` Jan Kara
  2019-03-19  9:03             ` Ira Weiny
@ 2019-03-19 19:02             ` John Hubbard
  1 sibling, 0 replies; 34+ messages in thread
From: John Hubbard @ 2019-03-19 19:02 UTC (permalink / raw)
  To: Jan Kara, Kirill A. Shutemov
  Cc: Jerome Glisse, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, Andrea Arcangeli

On 3/19/19 8:36 AM, Jan Kara wrote:
> On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>> [...]
>>> Forgot to mention one thing: we had a discussion with Andrea and Jan
>>> about set_page_dirty(), and Andrea had the good idea of maybe doing the
>>> set_page_dirty() at GUP time (when GUP is done with write), not when
>>> the GUP user calls put_page(). We can do that by setting the dirty bit
>>> in the pte, for instance. There are a few bonuses to doing things that
>>> way:
>>>     - it amortizes the cost of calling set_page_dirty() (ie. one call
>>>       covers both GUP and page_mkclean())
>>>     - it is always safe to do so at GUP time (ie. the pte has write
>>>       permission and thus the page is in the correct state)
>>>     - it is safe from the truncate race
>>>     - there is no need to ever lock the page
>>>
>>> Extra bonus from my point of view: it simplifies things for my generic
>>> page protection patchset (KSM for file-backed pages).
>>>
>>> So maybe we should explore that? It would also be a lot less code.
>>
>> Yes, please. It sounds more sensible to me to dirty the page on get, not
>> on put.
> 
> I fully agree this is a desirable final state of affairs. And with changes
> to how we treat pinned pages during writeback, there won't have to be any
> explicit dirtying at all in the end, because the page is guaranteed to be
> dirty after a write page fault, and the pin will make sure it stays dirty
> until unpinned. However, initially I want the helpers to be as close to
> the code they are replacing as possible, because it will be hard to catch
> all the bugs from the driver conversions even in that situation. So I
> still think that these helpers, as they are, are a good first step. Then
> we need to convert the GUP users to use them, and after that it is much
> easier to modify the behavior, since it is no longer open-coded in two
> hundred or however many places...
> 
> 								Honza

In fact, we had this very same question come up last month [1]: I was also
wondering if we should just jump directly to the final step and not do the
dirtying call, but it is true that during the conversion process (which
effectively wraps put_page() without changing anything else), it's safer
to avoid changing things.

The whole system is fragile because it's running code that has some latent
bugs in this area, so it's probably best to do it the way Jan says, and
avoid creating any new instances of this problem, even though a bit more
churn is involved.


[1] https://lore.kernel.org/r/20190205112107.GB3872@quack2.suse.cz

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 13:47     ` Jerome Glisse
  2019-03-19 14:06       ` Kirill A. Shutemov
  2019-03-19 14:14       ` Jerome Glisse
@ 2019-03-19 19:24       ` John Hubbard
  2019-03-20  9:40         ` Kirill A. Shutemov
  2 siblings, 1 reply; 34+ messages in thread
From: John Hubbard @ 2019-03-19 19:24 UTC (permalink / raw)
  To: Jerome Glisse, Kirill A. Shutemov
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On 3/19/19 6:47 AM, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>> From: John Hubbard <jhubbard@nvidia.com>
> 
> [...]
>>> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
>>> +{
>>> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
>>
>> Have you checked if compiler is clever enough eliminate indirect function
>> call here? Maybe it's better to go with an opencodded approach and get rid
>> of callbacks?
>>
> 
> Good point, dunno if John did check that.

Hi Kirill, Jerome,

The compiler does *not* eliminate the indirect function call, at least
unless I'm misunderstanding things. The __put_user_pages_dirty() function
makes the appropriate set_page_dirty*() call via __x86_indirect_thunk_r12,
which seems pretty definitive.

ffffffff81a00ef0 <__x86_indirect_thunk_r12>:
ffffffff81a00ef0:	41 ff e4             	jmpq   *%r12
ffffffff81a00ef3:	90                   	nop
ffffffff81a00ef4:	90                   	nop
ffffffff81a00ef5:	90                   	nop
ffffffff81a00ef6:	90                   	nop
ffffffff81a00ef7:	90                   	nop
ffffffff81a00ef8:	90                   	nop
ffffffff81a00ef9:	90                   	nop
ffffffff81a00efa:	90                   	nop
ffffffff81a00efb:	90                   	nop
ffffffff81a00efc:	90                   	nop
ffffffff81a00efd:	90                   	nop
ffffffff81a00efe:	90                   	nop
ffffffff81a00eff:	90                   	nop
ffffffff81a00f00:	90                   	nop
ffffffff81a00f01:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
ffffffff81a00f08:	00 00 00 00 
ffffffff81a00f0c:	0f 1f 40 00          	nopl   0x0(%rax)
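
For comparison, the open-coded alternative Kirill suggested would give the
compiler direct call targets. A user-space sketch follows (stand-in types;
the kernel's dirtying functions would be set_page_dirty() and
set_page_dirty_lock()):

```c
#include <assert.h>

struct page {
	int dirty;
};

/* Stand-ins for set_page_dirty() and set_page_dirty_lock(). */
static int set_dirty(struct page *page)
{
	page->dirty = 1;
	return 1;
}

static int set_dirty_lock(struct page *page)
{
	page->dirty = 1;
	return 1;
}

/*
 * Open-coded loops: each call target is known at compile time, so no
 * indirect branch (and thus no retpoline thunk) is emitted.
 */
static void put_pages_dirty(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		if (!pages[index]->dirty)
			set_dirty(pages[index]);
}

static void put_pages_dirty_lock(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		if (!pages[index]->dirty)
			set_dirty_lock(pages[index]);
}
```

The cost is a small amount of duplicated loop code, which is why the
measurement matters.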

However, there is no visible overhead to doing so, at a macro level. An fio
O_DIRECT run with and without the full conversion patchset shows the same 
numbers:

cat fio.conf 
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

=====================
Before (baseline):
=====================

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process

reader: (groupid=0, jobs=1): err= 0: pid=1828: Mon Mar 18 14:56:22 2019
   read: IOPS=192k, BW=751MiB/s (787MB/s)(1024MiB/1364msec)
    slat (nsec): min=1274, max=42375, avg=1564.12, stdev=682.65
    clat (usec): min=168, max=12209, avg=331.01, stdev=184.95
     lat (usec): min=171, max=12215, avg=332.61, stdev=185.11
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  519], 99.50th=[  523], 99.90th=[  537], 99.95th=[  594],
     | 99.99th=[12125]
   bw (  KiB/s): min=755280, max=783016, per=100.00%, avg=769148.00, stdev=19612.31, samples=2
   iops        : min=188820, max=195754, avg=192287.00, stdev=4903.08, samples=2
  lat (usec)   : 250=0.14%, 500=98.59%, 750=1.25%
  lat (msec)   : 20=0.02%
  cpu          : usr=12.69%, sys=48.20%, ctx=248836, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=751MiB/s (787MB/s), 751MiB/s-751MiB/s (787MB/s-787MB/s), io=1024MiB (1074MB), run=1364-1364msec

Disk stats (read/write):
  nvme0n1: ios=220106/0, merge=0/0, ticks=70136/0, in_queue=704, util=91.19%

==================================================
After (with enough call sites converted to run fio):
==================================================

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process

reader: (groupid=0, jobs=1): err= 0: pid=2026: Mon Mar 18 14:35:07 2019
   read: IOPS=192k, BW=751MiB/s (787MB/s)(1024MiB/1364msec)
    slat (nsec): min=1263, max=41861, avg=1591.99, stdev=692.09
    clat (usec): min=154, max=12205, avg=330.82, stdev=184.98
     lat (usec): min=157, max=12212, avg=332.45, stdev=185.14
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  502], 99.50th=[  510], 99.90th=[  523], 99.95th=[  570],
     | 99.99th=[12125]
   bw (  KiB/s): min=746848, max=783088, per=99.51%, avg=764968.00, stdev=25625.55, samples=2
   iops        : min=186712, max=195772, avg=191242.00, stdev=6406.39, samples=2
  lat (usec)   : 250=0.09%, 500=98.88%, 750=1.01%
  lat (msec)   : 20=0.02%
  cpu          : usr=14.38%, sys=48.64%, ctx=248037, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=751MiB/s (787MB/s), 751MiB/s-751MiB/s (787MB/s-787MB/s), io=1024MiB (1074MB), run=1364-1364msec

Disk stats (read/write):
  nvme0n1: ios=220228/0, merge=0/0, ticks=70426/0, in_queue=704, util=91.27%


So, I could be persuaded either way. But given the lack of any visible perf
effects, and given that this code will likely get removed anyway because
we'll probably end up with set_page_dirty() called at GUP time instead, it
seems OK to just leave it as is.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 18:12 ` Christopher Lameter
@ 2019-03-19 19:24   ` John Hubbard
  2019-03-20  1:09     ` Christopher Lameter
  0 siblings, 1 reply; 34+ messages in thread
From: John Hubbard @ 2019-03-19 19:24 UTC (permalink / raw)
  To: Christopher Lameter, john.hubbard
  Cc: Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel

On 3/19/19 11:12 AM, Christopher Lameter wrote:
> On Fri, 8 Mar 2019, john.hubbard@gmail.com wrote:
> 
>> We seem to have pretty solid consensus on the concept and details of the
>> put_user_pages() approach. Or at least, if we don't, someone please speak
>> up now. Christopher Lameter, especially, since you had some concerns
>> recently.
> 
> My concerns do not affect this patchset which just marks the get/put for
> the pagecache. The problem was that the description was making claims that
> were a bit misleading and seemed to prescribe a solution.
> 
> So let's get this merged. Whatever the solution turns out to be, we will
> need this markup.
> 

Sounds good. Do you care to promote that thought into a formal ACK for me? :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 14:06       ` Kirill A. Shutemov
  2019-03-19 14:15         ` Jerome Glisse
@ 2019-03-19 20:01         ` John Hubbard
  2019-03-20  9:28           ` Kirill A. Shutemov
  1 sibling, 1 reply; 34+ messages in thread
From: John Hubbard @ 2019-03-19 20:01 UTC (permalink / raw)
  To: Kirill A. Shutemov, Jerome Glisse
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On 3/19/19 7:06 AM, Kirill A. Shutemov wrote:
> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> [...]
>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index f84e22685aaa..37085b8163b1 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -28,6 +28,88 @@ struct follow_page_context {
>>>>  	unsigned int page_mask;
>>>>  };
>>>>  
>>>> +typedef int (*set_dirty_func_t)(struct page *page);
>>>> +
>>>> +static void __put_user_pages_dirty(struct page **pages,
>>>> +				   unsigned long npages,
>>>> +				   set_dirty_func_t sdf)
>>>> +{
>>>> +	unsigned long index;
>>>> +
>>>> +	for (index = 0; index < npages; index++) {
>>>> +		struct page *page = compound_head(pages[index]);
>>>> +
>>>> +		if (!PageDirty(page))
>>>> +			sdf(page);
>>>
>>> How is this safe? What prevents the page to be cleared under you?
>>>
>>> If it's safe to race clear_page_dirty*() it has to be stated explicitly
>>> with a reason why. It's not very clear to me as it is.
>>
>> The PageDirty() optimization above is fine to race with clearing of
>> the page flag, because such a race can only happen after a
>> page_mkclean(), when the GUP user is done with the page and the page
>> is about to be written back. If the !PageDirty(page) check sees the
>> page as dirty and skips the sdf() call, while a split second later
>> TestClearPageDirty() runs, that means the racing clear is about to
>> write the page back, so all is fine (the page was dirty and it is
>> being cleared for writeback).
>>
>> If sdf() does get called while racing with writeback, then we have
>> just redirtied the page, exactly as clear_page_dirty_for_io() would do
>> if page_mkclean() failed, so nothing harmful comes of that either. The
>> page stays dirty despite the writeback; it just means the page might
>> be written back twice in a row.
> 
> Fair enough. Should we get it into a comment here?

How does this read to you? I reworded and slightly expanded Jerome's
description:

diff --git a/mm/gup.c b/mm/gup.c
index d1df7b8ba973..86397ae23922 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,6 +61,24 @@ static void __put_user_pages_dirty(struct page **pages,
        for (index = 0; index < npages; index++) {
                struct page *page = compound_head(pages[index]);
 
+               /*
+                * Checking PageDirty at this point may race with
+                * clear_page_dirty_for_io(), but that's OK. Two key cases:
+                *
+                * 1) This code sees the page as already dirty, so it skips
+                * the call to sdf(). That could happen because
+                * clear_page_dirty_for_io() called page_mkclean(),
+                * followed by set_page_dirty(). However, now the page is
+                * going to get written back, which meets the original
+                * intention of setting it dirty, so all is well:
+                * clear_page_dirty_for_io() goes on to call
+                * TestClearPageDirty(), and write the page back.
+                *
+                * 2) This code sees the page as clean, so it calls sdf().
+                * The page stays dirty, despite being written back, so it
+                * gets written back again in the next writeback cycle.
+                * This is harmless.
+                */
                if (!PageDirty(page))
                        sdf(page);
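
The two cases in the comment can be exercised with a toy user-space model
(invented types, not kernel code) that replays each interleaving:

```c
#include <assert.h>

/* Toy model of one page's dirty flag and its writeback history. */
struct page {
	int dirty;
	int writeback_count;
};

/* Models clear_page_dirty_for_io(): clean the flag, then write back. */
static void clear_dirty_for_io(struct page *page)
{
	if (page->dirty) {
		page->dirty = 0;
		page->writeback_count++;
	}
}

/* Models the put side: dirty the page only if it looks clean. */
static void put_side_dirty(struct page *page)
{
	if (!page->dirty)
		page->dirty = 1;
}
```

Replaying case 1 (check first, then clear) writes the page back once; case 2
(clear first, then redirty) just writes it back a second time in the next
cycle, which is harmless, matching the comment.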

> 
>>>> +void put_user_pages(struct page **pages, unsigned long npages)
>>>> +{
>>>> +	unsigned long index;
>>>> +
>>>> +	for (index = 0; index < npages; index++)
>>>> +		put_user_page(pages[index]);
>>>
>>> I believe there's room for improvement for compound pages.
>>>
>>> If there are multiple consecutive pages in the array that belong to the
>>> same compound page, we can get away with a single atomic operation to
>>> handle them all.
>>
>> Yes, maybe just add a comment noting that for now, and leave this kind
>> of optimization for later?
> 
> Sounds good to me.
> 

Here's a comment for that:

@@ -127,6 +145,11 @@ void put_user_pages(struct page **pages, unsigned long npages)
 {
        unsigned long index;
 
+       /*
+        * TODO: this can be optimized for huge pages: if a series of pages is
+        * physically contiguous and part of the same compound page, then a
+        * single operation to the head page should suffice.
+        */
        for (index = 0; index < npages; index++)
                put_user_page(pages[index]);
 }
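
As a rough illustration of that TODO, here is a user-space sketch of
batching consecutive entries that share a head page (invented names; the
kernel version would use compound_head() and a single atomic refcount drop
on the head page):

```c
#include <assert.h>

/* Toy page: a pointer to its head page and a reference count. */
struct page {
	struct page *head;
	int refcount;
};

/* Stand-in for compound_head(). */
static struct page *head_of(struct page *page)
{
	return page->head;
}

/*
 * Batch consecutive array entries that share the same head page and
 * drop all of their references with one operation on the head.
 */
static void put_pages_batched(struct page **pages, unsigned long npages)
{
	unsigned long index = 0;

	while (index < npages) {
		struct page *head = head_of(pages[index]);
		unsigned long run = 1;

		while (index + run < npages &&
		       head_of(pages[index + run]) == head)
			run++;
		head->refcount -= run;	/* one atomic op in the kernel */
		index += run;
	}
}
```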


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19  9:03             ` Ira Weiny
@ 2019-03-19 20:43               ` Tom Talpey
  2019-03-19 20:45                 ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: Tom Talpey @ 2019-03-19 20:43 UTC (permalink / raw)
  To: Ira Weiny, Jan Kara
  Cc: Kirill A. Shutemov, Jerome Glisse, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dave Chinner,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On 3/19/2019 4:03 AM, Ira Weiny wrote:
> On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:
>> On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>
>>>>> [...]
>>>>>
>>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>>> index f84e22685aaa..37085b8163b1 100644
>>>>>>> --- a/mm/gup.c
>>>>>>> +++ b/mm/gup.c
>>>>>>> @@ -28,6 +28,88 @@ struct follow_page_context {
>>>>>>>   	unsigned int page_mask;
>>>>>>>   };
>>>>>>>   
>>>>>>> +typedef int (*set_dirty_func_t)(struct page *page);
>>>>>>> +
>>>>>>> +static void __put_user_pages_dirty(struct page **pages,
>>>>>>> +				   unsigned long npages,
>>>>>>> +				   set_dirty_func_t sdf)
>>>>>>> +{
>>>>>>> +	unsigned long index;
>>>>>>> +
>>>>>>> +	for (index = 0; index < npages; index++) {
>>>>>>> +		struct page *page = compound_head(pages[index]);
>>>>>>> +
>>>>>>> +		if (!PageDirty(page))
>>>>>>> +			sdf(page);
>>>>>>
>>>>>> How is this safe? What prevents the page to be cleared under you?
>>>>>>
>>>>>> If it's safe to race clear_page_dirty*() it has to be stated explicitly
>>>>>> with a reason why. It's not very clear to me as it is.
>>>>>
>>>>> The PageDirty() optimization above is fine to race with clearing of
>>>>> the page flag, because such a race can only happen after a
>>>>> page_mkclean(), when the GUP user is done with the page and the page
>>>>> is about to be written back. If the !PageDirty(page) check sees the
>>>>> page as dirty and skips the sdf() call, while a split second later
>>>>> TestClearPageDirty() runs, that means the racing clear is about to
>>>>> write the page back, so all is fine (the page was dirty and it is
>>>>> being cleared for writeback).
>>>>>
>>>>> If sdf() does get called while racing with writeback, then we have
>>>>> just redirtied the page, exactly as clear_page_dirty_for_io() would
>>>>> do if page_mkclean() failed, so nothing harmful comes of that either.
>>>>> The page stays dirty despite the writeback; it just means the page
>>>>> might be written back twice in a row.
>>>>
>>>> Forgot to mention one thing: we had a discussion with Andrea and Jan
>>>> about set_page_dirty(), and Andrea had the good idea of maybe doing
>>>> the set_page_dirty() at GUP time (when GUP is done with write), not
>>>> when the GUP user calls put_page(). We can do that by setting the
>>>> dirty bit in the pte, for instance. There are a few bonuses to doing
>>>> things that way:
>>>>      - it amortizes the cost of calling set_page_dirty() (ie. one
>>>>        call covers both GUP and page_mkclean())
>>>>      - it is always safe to do so at GUP time (ie. the pte has write
>>>>        permission and thus the page is in the correct state)
>>>>      - it is safe from the truncate race
>>>>      - there is no need to ever lock the page
>>>>
>>>> Extra bonus from my point of view: it simplifies things for my
>>>> generic page protection patchset (KSM for file-backed pages).
>>>>
>>>> So maybe we should explore that? It would also be a lot less code.
>>>
>>> Yes, please. It sounds more sensible to me to dirty the page on get, not
>>> on put.
>>
>> I fully agree this is a desirable final state of affairs.
> 
> I'm glad to see this presented because it has crossed my mind more than once
> that effectively a GUP pinned page should be considered "dirty" at all times
> until the pin is removed.  This is especially true in the RDMA case.

But, what if the RDMA registration is read-only? That's not uncommon, and
marking dirty unconditionally would add needless overhead to such pages.

Tom.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 20:43               ` Tom Talpey
@ 2019-03-19 20:45                 ` Jerome Glisse
  2019-03-19 20:55                   ` Tom Talpey
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2019-03-19 20:45 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Ira Weiny, Jan Kara, Kirill A. Shutemov, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Tue, Mar 19, 2019 at 03:43:44PM -0500, Tom Talpey wrote:
> On 3/19/2019 4:03 AM, Ira Weiny wrote:
> > On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:
> > > On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
> > > > On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > > > > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > > > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > > > > --- a/mm/gup.c
> > > > > > > > +++ b/mm/gup.c
> > > > > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > > > > >   	unsigned int page_mask;
> > > > > > > >   };
> > > > > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > > > > +
> > > > > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > > > > +				   unsigned long npages,
> > > > > > > > +				   set_dirty_func_t sdf)
> > > > > > > > +{
> > > > > > > > +	unsigned long index;
> > > > > > > > +
> > > > > > > > +	for (index = 0; index < npages; index++) {
> > > > > > > > +		struct page *page = compound_head(pages[index]);
> > > > > > > > +
> > > > > > > > +		if (!PageDirty(page))
> > > > > > > > +			sdf(page);
> > > > > > > 
> > > > > > > How is this safe? What prevents the page to be cleared under you?
> > > > > > > 
> > > > > > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > > > > > with a reason why. It's not very clear to me as it is.
> > > > > > 
> > > > > > The PageDirty() optimization above is fine to race with clear the
> > > > > > page flag as it means it is racing after a page_mkclean() and the
> > > > > > GUP user is done with the page so page is about to be write back
> > > > > > ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> > > > > > call while a split second after TestClearPageDirty() happens then
> > > > > > it means the racing clear is about to write back the page so all
> > > > > > is fine (the page was dirty and it is being clear for write back).
> > > > > > 
> > > > > > If it does call the sdf() while racing with write back then we
> > > > > > just redirtied the page just like clear_page_dirty_for_io() would
> > > > > > do if page_mkclean() failed so nothing harmful will come of that
> > > > > > neither. Page stays dirty despite write back it just means that
> > > > > > the page might be write back twice in a row.
> > > > > 
> > > > > Forgot to mention one thing, we had a discussion with Andrea and Jan
> > > > > about set_page_dirty() and Andrea had the good idea of maybe doing
> > > > > the set_page_dirty() at GUP time (when GUP with write) not when the
> > > > > GUP user calls put_page(). We can do that by setting the dirty bit
> > > > > in the pte for instance. They are few bonus of doing things that way:
> > > > >      - amortize the cost of calling set_page_dirty() (ie one call for
> > > > >        GUP and page_mkclean()
> > > > >      - it is always safe to do so at GUP time (ie the pte has write
> > > > >        permission and thus the page is in correct state)
> > > > >      - safe from truncate race
> > > > >      - no need to ever lock the page
> > > > > 
> > > > > Extra bonus from my point of view, it simplify thing for my generic
> > > > > page protection patchset (KSM for file back page).
> > > > > 
> > > > > So maybe we should explore that ? It would also be a lot less code.
> > > > 
> > > > Yes, please. It sounds more sensible to me to dirty the page on get, not
> > > > on put.
> > > 
> > > I fully agree this is a desirable final state of affairs.
> > 
> > I'm glad to see this presented because it has crossed my mind more than once
> > that effectively a GUP pinned page should be considered "dirty" at all times
> > until the pin is removed.  This is especially true in the RDMA case.
> 
> But, what if the RDMA registration is read-only? That's not uncommon, and
> marking dirty unconditionally would add needless overhead to such pages.

Yes, and this only happens when FOLL_WRITE is set, i.e. when you do GUP
asking for write access. Doing GUP asking for read is always safe.
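To make the distinction concrete, here is a tiny userspace model of the "dirty at GUP time" idea (all names here are hypothetical, not the kernel's): the dirty flag is set at pin time only when write access is requested, so a read-only pin such as Tom's registrations never touches it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace model only: the page is marked dirty while
 * pinning, and only when FOLL_WRITE-style access is requested.
 * Read-only pins never set the dirty flag. */
#define MOCK_FOLL_WRITE 0x1

struct mock_page {
	bool dirty;
	int  pincount;
};

static void mock_gup_one(struct mock_page *page, unsigned int gup_flags)
{
	page->pincount++;
	if (gup_flags & MOCK_FOLL_WRITE)
		page->dirty = true;	/* dirty on get, not on put */
}

static void mock_put_user_page(struct mock_page *page)
{
	page->pincount--;		/* no set_page_dirty() needed here */
}
```

This is a sketch of the proposal's shape, not of any actual kernel interface.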

Cheers,
Jérôme


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 20:45                 ` Jerome Glisse
@ 2019-03-19 20:55                   ` Tom Talpey
  0 siblings, 0 replies; 34+ messages in thread
From: Tom Talpey @ 2019-03-19 20:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ira Weiny, Jan Kara, Kirill A. Shutemov, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On 3/19/2019 3:45 PM, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 03:43:44PM -0500, Tom Talpey wrote:
>> On 3/19/2019 4:03 AM, Ira Weiny wrote:
>>> On Tue, Mar 19, 2019 at 04:36:44PM +0100, Jan Kara wrote:
>>>> On Tue 19-03-19 17:29:18, Kirill A. Shutemov wrote:
>>>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
>>>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>>>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>>>>> index f84e22685aaa..37085b8163b1 100644
>>>>>>>>> --- a/mm/gup.c
>>>>>>>>> +++ b/mm/gup.c
>>>>>>>>> @@ -28,6 +28,88 @@ struct follow_page_context {
>>>>>>>>>    	unsigned int page_mask;
>>>>>>>>>    };
>>>>>>>>> +typedef int (*set_dirty_func_t)(struct page *page);
>>>>>>>>> +
>>>>>>>>> +static void __put_user_pages_dirty(struct page **pages,
>>>>>>>>> +				   unsigned long npages,
>>>>>>>>> +				   set_dirty_func_t sdf)
>>>>>>>>> +{
>>>>>>>>> +	unsigned long index;
>>>>>>>>> +
>>>>>>>>> +	for (index = 0; index < npages; index++) {
>>>>>>>>> +		struct page *page = compound_head(pages[index]);
>>>>>>>>> +
>>>>>>>>> +		if (!PageDirty(page))
>>>>>>>>> +			sdf(page);
>>>>>>>>
>>>>>>>> How is this safe? What prevents the page to be cleared under you?
>>>>>>>>
>>>>>>>> If it's safe to race clear_page_dirty*() it has to be stated explicitly
>>>>>>>> with a reason why. It's not very clear to me as it is.
>>>>>>>
>>>>>>> The PageDirty() optimization above is fine to race with clear the
>>>>>>> page flag as it means it is racing after a page_mkclean() and the
>>>>>>> GUP user is done with the page so page is about to be write back
>>>>>>> ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
>>>>>>> call while a split second after TestClearPageDirty() happens then
>>>>>>> it means the racing clear is about to write back the page so all
>>>>>>> is fine (the page was dirty and it is being clear for write back).
>>>>>>>
>>>>>>> If it does call the sdf() while racing with write back then we
>>>>>>> just redirtied the page just like clear_page_dirty_for_io() would
>>>>>>> do if page_mkclean() failed so nothing harmful will come of that
>>>>>>> neither. Page stays dirty despite write back it just means that
>>>>>>> the page might be write back twice in a row.
>>>>>>
>>>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
>>>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
>>>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
>>>>>> GUP user calls put_page(). We can do that by setting the dirty bit
>>>>>> in the pte for instance. They are few bonus of doing things that way:
>>>>>>       - amortize the cost of calling set_page_dirty() (ie one call for
>>>>>>         GUP and page_mkclean()
>>>>>>       - it is always safe to do so at GUP time (ie the pte has write
>>>>>>         permission and thus the page is in correct state)
>>>>>>       - safe from truncate race
>>>>>>       - no need to ever lock the page
>>>>>>
>>>>>> Extra bonus from my point of view, it simplify thing for my generic
>>>>>> page protection patchset (KSM for file back page).
>>>>>>
>>>>>> So maybe we should explore that ? It would also be a lot less code.
>>>>>
>>>>> Yes, please. It sounds more sensible to me to dirty the page on get, not
>>>>> on put.
>>>>
>>>> I fully agree this is a desirable final state of affairs.
>>>
>>> I'm glad to see this presented because it has crossed my mind more than once
>>> that effectively a GUP pinned page should be considered "dirty" at all times
>>> until the pin is removed.  This is especially true in the RDMA case.
>>
>> But, what if the RDMA registration is read-only? That's not uncommon, and
>> marking dirty unconditionally would add needless overhead to such pages.
> 
> Yes and this is only when FOLL_WRITE is set ie when you are doing GUP and
> asking for write. Doing GUP and asking for read is always safe.

Aha, ok great.

I guess it does introduce something for callers to be aware of if
they GUP very large regions. I suppose, if they're sufficiently
aware of the situation (e.g. via pNFS LAYOUT_COMMIT notifications),
they could walk their lists and clear the dirty flag on untouched
pages before releasing. That's their issue though, and agreed, it's
safest for the GUP layer to mark the pages.
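As a purely hypothetical sketch of that caller-side cleanup (nothing like this exists in the patch; all names are invented): a caller that knows which pages in a large pinned region were actually written could clear the dirty flag on the rest before dropping the pins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_page { bool dirty; };

/* Illustrative only: 'touched' marks pages the device actually wrote
 * (e.g. derived from a LAYOUT_COMMIT range); untouched pages have the
 * blanket dirty-on-GUP flag undone before release. */
static void mock_release_region(struct mock_page *pages, size_t npages,
				const bool *touched)
{
	for (size_t i = 0; i < npages; i++) {
		if (!touched[i])
			pages[i].dirty = false;
		/* ...then drop the pin on pages[i] */
	}
}
```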

Tom.


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 14:14       ` Jerome Glisse
  2019-03-19 14:29         ` Kirill A. Shutemov
@ 2019-03-19 21:23         ` Dave Chinner
  2019-03-19 22:06           ` Jerome Glisse
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2019-03-19 21:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > From: John Hubbard <jhubbard@nvidia.com>
> > 
> > [...]
> > 
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f84e22685aaa..37085b8163b1 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > >  	unsigned int page_mask;
> > > >  };
> > > >  
> > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > +
> > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > +				   unsigned long npages,
> > > > +				   set_dirty_func_t sdf)
> > > > +{
> > > > +	unsigned long index;
> > > > +
> > > > +	for (index = 0; index < npages; index++) {
> > > > +		struct page *page = compound_head(pages[index]);
> > > > +
> > > > +		if (!PageDirty(page))
> > > > +			sdf(page);
> > > 
> > > How is this safe? What prevents the page to be cleared under you?
> > > 
> > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > with a reason why. It's not very clear to me as it is.
> > 
> > The PageDirty() optimization above is fine to race with clear the
> > page flag as it means it is racing after a page_mkclean() and the
> > GUP user is done with the page so page is about to be write back
> > ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> > call while a split second after TestClearPageDirty() happens then
> > it means the racing clear is about to write back the page so all
> > is fine (the page was dirty and it is being clear for write back).
> > 
> > If it does call the sdf() while racing with write back then we
> > just redirtied the page just like clear_page_dirty_for_io() would
> > do if page_mkclean() failed so nothing harmful will come of that
> > neither. Page stays dirty despite write back it just means that
> > the page might be write back twice in a row.
> 
> Forgot to mention one thing, we had a discussion with Andrea and Jan
> about set_page_dirty() and Andrea had the good idea of maybe doing
> the set_page_dirty() at GUP time (when GUP with write) not when the
> GUP user calls put_page(). We can do that by setting the dirty bit
> in the pte for instance. They are few bonus of doing things that way:
>     - amortize the cost of calling set_page_dirty() (ie one call for
>       GUP and page_mkclean()
>     - it is always safe to do so at GUP time (ie the pte has write
>       permission and thus the page is in correct state)
>     - safe from truncate race
>     - no need to ever lock the page

I seem to have missed this conversation, so please excuse me for
asking a stupid question: if it's a file backed page, what prevents
background writeback from cleaning the dirty page ~30s into a long
term pin? i.e. I don't see anything in this proposal that prevents
the page from being cleaned by writeback and putting us straight
back into the situation where a long term RDMA is writing to a clean
page....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 21:23         ` Dave Chinner
@ 2019-03-19 22:06           ` Jerome Glisse
  2019-03-19 23:57             ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2019-03-19 22:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > > [...]
> > > 
> > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > --- a/mm/gup.c
> > > > > +++ b/mm/gup.c
> > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > >  	unsigned int page_mask;
> > > > >  };
> > > > >  
> > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > +
> > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > +				   unsigned long npages,
> > > > > +				   set_dirty_func_t sdf)
> > > > > +{
> > > > > +	unsigned long index;
> > > > > +
> > > > > +	for (index = 0; index < npages; index++) {
> > > > > +		struct page *page = compound_head(pages[index]);
> > > > > +
> > > > > +		if (!PageDirty(page))
> > > > > +			sdf(page);
> > > > 
> > > > How is this safe? What prevents the page to be cleared under you?
> > > > 
> > > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > > with a reason why. It's not very clear to me as it is.
> > > 
> > > The PageDirty() optimization above is fine to race with clear the
> > > page flag as it means it is racing after a page_mkclean() and the
> > > GUP user is done with the page so page is about to be write back
> > > ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> > > call while a split second after TestClearPageDirty() happens then
> > > it means the racing clear is about to write back the page so all
> > > is fine (the page was dirty and it is being clear for write back).
> > > 
> > > If it does call the sdf() while racing with write back then we
> > > just redirtied the page just like clear_page_dirty_for_io() would
> > > do if page_mkclean() failed so nothing harmful will come of that
> > > neither. Page stays dirty despite write back it just means that
> > > the page might be write back twice in a row.
> > 
> > Forgot to mention one thing, we had a discussion with Andrea and Jan
> > about set_page_dirty() and Andrea had the good idea of maybe doing
> > the set_page_dirty() at GUP time (when GUP with write) not when the
> > GUP user calls put_page(). We can do that by setting the dirty bit
> > in the pte for instance. They are few bonus of doing things that way:
> >     - amortize the cost of calling set_page_dirty() (ie one call for
> >       GUP and page_mkclean()
> >     - it is always safe to do so at GUP time (ie the pte has write
> >       permission and thus the page is in correct state)
> >     - safe from truncate race
> >     - no need to ever lock the page
> 
> I seem to have missed this conversation, so please excuse me for

The set_page_dirty()-at-GUP-time idea came up in a private discussion
(it started on another topic and drifted toward set_page_dirty()).

> asking a stupid question: if it's a file backed page, what prevents
> background writeback from cleaning the dirty page ~30s into a long
> term pin? i.e. I don't see anything in this proposal that prevents
> the page from being cleaned by writeback and putting us straight
> back into the situation where a long term RDMA is writing to a clean
> page....

So this patchset does not solve that issue. The plan is multi-step
(with a different patchset for each step):
    [1] Convert all places that do gup() and then put_page() to
        use put_user_page() instead. This is what the present
        patchset is about.
    [2] Use a biased pin count so that we can distinguish GUPed
        pages from non-GUPed pages: instead of incrementing the
        page refcount by 1 on GUP we increment it by GUP_BIAS,
        and in put_user_page() we decrement it by GUP_BIAS. A
        page with a refcount > GUP_BIAS can then be considered
        GUPed (more on false positives below).
    [3..N] Decide what to do with GUPed pages. So far the plan
         seems to be to keep the page always dirty and never let
         writeback restore the page to a clean state. This does
         disable things like COW and other fs features, but it
         seems to be the best we can do.

As for the race between clear_page_dirty_for_io() and GUP, that was
extensively discussed and IIRC you were on that thread. Basically we
can only race with page_mkclean() (as GUP can only succeed if there
is a pte with write permission), page cleaning happens after
page_mkclean(), and there are barriers between page_mkclean() and
what follows. Hence we will see the page as GUPed before cleaning
it, without any race.

As for false positives, I think we agreed they are something we can
live with. One can only happen to a page that is shared more than
GUP_BIAS times, which should be rare enough, and a false positive
just means the page gets the same treatment as a GUPed page.
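A rough userspace sketch of the step [2] biasing scheme (the GUP_BIAS value and every function name below are made up for illustration; the real work would operate on struct page's atomic refcount):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: GUP adds GUP_BIAS instead of 1, so any page whose
 * refcount exceeds GUP_BIAS can be treated as (possibly) GUP-pinned.
 * A page mapped more than GUP_BIAS times yields a false positive,
 * which the thread treats as a rare, tolerable case. */
#define GUP_BIAS 1024

struct mock_page { int refcount; };

static void mock_get_page(struct mock_page *p)     { p->refcount += 1; }
static void mock_put_page(struct mock_page *p)     { p->refcount -= 1; }
static void mock_gup_pin(struct mock_page *p)      { p->refcount += GUP_BIAS; }
static void mock_put_user_page(struct mock_page *p){ p->refcount -= GUP_BIAS; }

static bool mock_page_maybe_gup_pinned(const struct mock_page *p)
{
	return p->refcount > GUP_BIAS;
}
```

Again, this is only a model of the bookkeeping, not the eventual kernel implementation.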

Cheers,
Jérôme


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 22:06           ` Jerome Glisse
@ 2019-03-19 23:57             ` Dave Chinner
  2019-03-20  0:08               ` Jerome Glisse
                                 ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Dave Chinner @ 2019-03-19 23:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> > On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > 
> > > > [...]
> > > > 
> > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > > --- a/mm/gup.c
> > > > > > +++ b/mm/gup.c
> > > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > > >  	unsigned int page_mask;
> > > > > >  };
> > > > > >  
> > > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > > +
> > > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > > +				   unsigned long npages,
> > > > > > +				   set_dirty_func_t sdf)
> > > > > > +{
> > > > > > +	unsigned long index;
> > > > > > +
> > > > > > +	for (index = 0; index < npages; index++) {
> > > > > > +		struct page *page = compound_head(pages[index]);
> > > > > > +
> > > > > > +		if (!PageDirty(page))
> > > > > > +			sdf(page);
> > > > > 
> > > > > How is this safe? What prevents the page to be cleared under you?
> > > > > 
> > > > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > > > with a reason why. It's not very clear to me as it is.
> > > > 
> > > > The PageDirty() optimization above is fine to race with clear the
> > > > page flag as it means it is racing after a page_mkclean() and the
> > > > GUP user is done with the page so page is about to be write back
> > > > ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> > > > call while a split second after TestClearPageDirty() happens then
> > > > it means the racing clear is about to write back the page so all
> > > > is fine (the page was dirty and it is being clear for write back).
> > > > 
> > > > If it does call the sdf() while racing with write back then we
> > > > just redirtied the page just like clear_page_dirty_for_io() would
> > > > do if page_mkclean() failed so nothing harmful will come of that
> > > > neither. Page stays dirty despite write back it just means that
> > > > the page might be write back twice in a row.
> > > 
> > > Forgot to mention one thing, we had a discussion with Andrea and Jan
> > > about set_page_dirty() and Andrea had the good idea of maybe doing
> > > the set_page_dirty() at GUP time (when GUP with write) not when the
> > > GUP user calls put_page(). We can do that by setting the dirty bit
> > > in the pte for instance. They are few bonus of doing things that way:
> > >     - amortize the cost of calling set_page_dirty() (ie one call for
> > >       GUP and page_mkclean()
> > >     - it is always safe to do so at GUP time (ie the pte has write
> > >       permission and thus the page is in correct state)
> > >     - safe from truncate race
> > >     - no need to ever lock the page
> > 
> > I seem to have missed this conversation, so please excuse me for
> 
> The set_page_dirty() at GUP was in a private discussion (it started
> on another topic and drifted away to set_page_dirty()).
> 
> > asking a stupid question: if it's a file backed page, what prevents
> > background writeback from cleaning the dirty page ~30s into a long
> > term pin? i.e. I don't see anything in this proposal that prevents
> > the page from being cleaned by writeback and putting us straight
> > back into the situation where a long term RDMA is writing to a clean
> > page....
> 
> So this patchset does not solve this issue.

OK, so it just kicks the can further down the road.

>     [3..N] decide what to do for GUPed page, so far the plans seems
>          to be to keep the page always dirty and never allow page
>          write back to restore the page in a clean state. This does
>          disable thing like COW and other fs feature but at least
>          it seems to be the best thing we can do.

So the plan for GUP vs writeback so far is "break fsync()"? :)

We might need to work on that a bit more...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 23:57             ` Dave Chinner
@ 2019-03-20  0:08               ` Jerome Glisse
  2019-03-20  1:43                 ` John Hubbard
  2019-03-20  0:15               ` John Hubbard
  2019-03-20  1:01               ` Christopher Lameter
  2 siblings, 1 reply; 34+ messages in thread
From: Jerome Glisse @ 2019-03-20  0:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel, John Hubbard,
	Andrea Arcangeli

On Wed, Mar 20, 2019 at 10:57:52AM +1100, Dave Chinner wrote:
> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > > > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > > > > On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > > > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > > index f84e22685aaa..37085b8163b1 100644
> > > > > > > --- a/mm/gup.c
> > > > > > > +++ b/mm/gup.c
> > > > > > > @@ -28,6 +28,88 @@ struct follow_page_context {
> > > > > > >  	unsigned int page_mask;
> > > > > > >  };
> > > > > > >  
> > > > > > > +typedef int (*set_dirty_func_t)(struct page *page);
> > > > > > > +
> > > > > > > +static void __put_user_pages_dirty(struct page **pages,
> > > > > > > +				   unsigned long npages,
> > > > > > > +				   set_dirty_func_t sdf)
> > > > > > > +{
> > > > > > > +	unsigned long index;
> > > > > > > +
> > > > > > > +	for (index = 0; index < npages; index++) {
> > > > > > > +		struct page *page = compound_head(pages[index]);
> > > > > > > +
> > > > > > > +		if (!PageDirty(page))
> > > > > > > +			sdf(page);
> > > > > > 
> > > > > > How is this safe? What prevents the page to be cleared under you?
> > > > > > 
> > > > > > If it's safe to race clear_page_dirty*() it has to be stated explicitly
> > > > > > with a reason why. It's not very clear to me as it is.
> > > > > 
> > > > > The PageDirty() optimization above is fine to race with clear the
> > > > > page flag as it means it is racing after a page_mkclean() and the
> > > > > GUP user is done with the page so page is about to be write back
> > > > > ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> > > > > call while a split second after TestClearPageDirty() happens then
> > > > > it means the racing clear is about to write back the page so all
> > > > > is fine (the page was dirty and it is being clear for write back).
> > > > > 
> > > > > If it does call the sdf() while racing with write back then we
> > > > > just redirtied the page just like clear_page_dirty_for_io() would
> > > > > do if page_mkclean() failed so nothing harmful will come of that
> > > > > neither. Page stays dirty despite write back it just means that
> > > > > the page might be write back twice in a row.
> > > > 
> > > > Forgot to mention one thing, we had a discussion with Andrea and Jan
> > > > about set_page_dirty() and Andrea had the good idea of maybe doing
> > > > the set_page_dirty() at GUP time (when GUP with write) not when the
> > > > GUP user calls put_page(). We can do that by setting the dirty bit
> > > > in the pte for instance. They are few bonus of doing things that way:
> > > >     - amortize the cost of calling set_page_dirty() (ie one call for
> > > >       GUP and page_mkclean()
> > > >     - it is always safe to do so at GUP time (ie the pte has write
> > > >       permission and thus the page is in correct state)
> > > >     - safe from truncate race
> > > >     - no need to ever lock the page
> > > 
> > > I seem to have missed this conversation, so please excuse me for
> > 
> > The set_page_dirty() at GUP was in a private discussion (it started
> > on another topic and drifted away to set_page_dirty()).
> > 
> > > asking a stupid question: if it's a file backed page, what prevents
> > > background writeback from cleaning the dirty page ~30s into a long
> > > term pin? i.e. I don't see anything in this proposal that prevents
> > > the page from being cleaned by writeback and putting us straight
> > > back into the situation where a long term RDMA is writing to a clean
> > > page....
> > 
> > So this patchset does not solve this issue.
> 
> OK, so it just kicks the can further down the road.
> 
> >     [3..N] decide what to do for GUPed page, so far the plans seems
> >          to be to keep the page always dirty and never allow page
> >          write back to restore the page in a clean state. This does
> >          disable thing like COW and other fs feature but at least
> >          it seems to be the best thing we can do.
> 
> So the plan for GUP vs writeback so far is "break fsync()"? :)
> 
> We might need to work on that a bit more...

Sorry, I forgot to say that we would still do writeback using a
bounce page, so that every time writeback kicks in (whether through
radix-tree dirty-page writeback, fsync(), or any other sync event)
we at least write to disk a snapshot of the GUPed page. So many
little details that I forgot the big chunk :)
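The bounce-page idea can be modeled in a few lines of userspace C (names are illustrative, not the kernel's): writeback copies a stable snapshot of the pinned page into a bounce buffer and submits that, so RDMA writes that land after the snapshot cannot race with the device I/O, and the page itself stays dirty for as long as the pin lives.

```c
#include <assert.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096

struct mock_page {
	unsigned char data[MOCK_PAGE_SIZE];
	int dirty;
};

/* Illustrative model: "write back" a GUP-pinned page by snapshotting
 * its contents into a bounce buffer; the I/O then targets the bounce
 * buffer, and page->dirty is deliberately left set. */
static void mock_writeback_pinned(struct mock_page *page,
				  unsigned char *bounce /* MOCK_PAGE_SIZE */)
{
	memcpy(bounce, page->data, MOCK_PAGE_SIZE);	/* stable snapshot */
	/* ...submit 'bounce' to the device here... */
	/* page->dirty stays set while the pin lives */
}
```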

Cheers,
Jérôme


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 23:57             ` Dave Chinner
  2019-03-20  0:08               ` Jerome Glisse
@ 2019-03-20  0:15               ` John Hubbard
  2019-03-20  1:01               ` Christopher Lameter
  2 siblings, 0 replies; 34+ messages in thread
From: John Hubbard @ 2019-03-20  0:15 UTC (permalink / raw)
  To: Dave Chinner, Jerome Glisse
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel,
	Andrea Arcangeli

On 3/19/19 4:57 PM, Dave Chinner wrote:
> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
>> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>> [...]
>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
>>>> GUP user calls put_page(). We can do that by setting the dirty bit
>>>> in the pte for instance. They are few bonus of doing things that way:
>>>>     - amortize the cost of calling set_page_dirty() (ie one call for
>>>>       GUP and page_mkclean()
>>>>     - it is always safe to do so at GUP time (ie the pte has write
>>>>       permission and thus the page is in correct state)
>>>>     - safe from truncate race
>>>>     - no need to ever lock the page
>>>
>>> I seem to have missed this conversation, so please excuse me for
>>
>> The set_page_dirty() at GUP was in a private discussion (it started
>> on another topic and drifted away to set_page_dirty()).
>>
>>> asking a stupid question: if it's a file backed page, what prevents
>>> background writeback from cleaning the dirty page ~30s into a long
>>> term pin? i.e. I don't see anything in this proposal that prevents
>>> the page from being cleaned by writeback and putting us straight
>>> back into the situation where a long term RDMA is writing to a clean
>>> page....
>>
>> So this patchset does not solve this issue.
> 
> OK, so it just kicks the can further down the road.

Hi Dave,

My take on this is that all of the viable solution proposals so far require
tracking of gup-pinned pages. That's why I'm trying to get started now on 
at least the tracking aspects: it seems like the tracking part is now well
understood. And it does have some lead time, because I expect the call site
conversions probably have to go through various maintainers' trees.

However, if you are thinking that this is unwise, and that it's smarter
to wait until the entire design is completely worked out, I'm open to that,
too.

Thoughts?

thanks,
-- 
John Hubbard
NVIDIA

> 
>>     [3..N] decide what to do for GUPed page, so far the plans seems
>>          to be to keep the page always dirty and never allow page
>>          write back to restore the page in a clean state. This does
>>          disable thing like COW and other fs feature but at least
>>          it seems to be the best thing we can do.
> 
> So the plan for GUP vs writeback so far is "break fsync()"? :)
> 
> We might need to work on that a bit more...
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 23:57             ` Dave Chinner
  2019-03-20  0:08               ` Jerome Glisse
  2019-03-20  0:15               ` John Hubbard
@ 2019-03-20  1:01               ` Christopher Lameter
  2 siblings, 0 replies; 34+ messages in thread
From: Christopher Lameter @ 2019-03-20  1:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, Kirill A. Shutemov, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey,
	LKML, linux-fsdevel, John Hubbard, Andrea Arcangeli

On Wed, 20 Mar 2019, Dave Chinner wrote:

> So the plan for GUP vs writeback so far is "break fsync()"? :)

Well, if it's an anonymous page and not a file-backed page then the
semantics are preserved. Disallow GUP long-term pinning (marking stuff like in this
patchset may make that possible) and it's clean.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 19:24   ` John Hubbard
@ 2019-03-20  1:09     ` Christopher Lameter
  2019-03-20  1:18       ` John Hubbard
  0 siblings, 1 reply; 34+ messages in thread
From: Christopher Lameter @ 2019-03-20  1:09 UTC (permalink / raw)
  To: John Hubbard
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On Tue, 19 Mar 2019, John Hubbard wrote:

> >
> > My concerns do not affect this patchset which just marks the get/put for
> > the pagecache. The problem was that the description was making claims that
> > were a bit misleading and seemed to prescribe a solution.
> >
> > So lets get this merged. Whatever the solution will be, we will need this
> > markup.
> >
>
> Sounds good. Do you care to promote that thought into a formal ACK for me? :)

Reviewed-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20  1:09     ` Christopher Lameter
@ 2019-03-20  1:18       ` John Hubbard
  0 siblings, 0 replies; 34+ messages in thread
From: John Hubbard @ 2019-03-20  1:18 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On 3/19/19 6:09 PM, Christopher Lameter wrote:
> On Tue, 19 Mar 2019, John Hubbard wrote:
> 
>>>
>>> My concerns do not affect this patchset which just marks the get/put for
>>> the pagecache. The problem was that the description was making claims that
>>> were a bit misleading and seemed to prescribe a solution.
>>>
>>> So lets get this merged. Whatever the solution will be, we will need this
>>> markup.
>>>
>>
>> Sounds good. Do you care to promote that thought into a formal ACK for me? :)
> 
> Reviewed-by: Christoph Lameter <cl@linux.com>
> 

Awesome! I've added that tag and it will show up in the next posting.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20  0:08               ` Jerome Glisse
@ 2019-03-20  1:43                 ` John Hubbard
  2019-03-20  4:33                   ` Jerome Glisse
  0 siblings, 1 reply; 34+ messages in thread
From: John Hubbard @ 2019-03-20  1:43 UTC (permalink / raw)
  To: Jerome Glisse, Dave Chinner
  Cc: Kirill A. Shutemov, john.hubbard, Andrew Morton, linux-mm,
	Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel,
	Andrea Arcangeli

On 3/19/19 5:08 PM, Jerome Glisse wrote:
> On Wed, Mar 20, 2019 at 10:57:52AM +1100, Dave Chinner wrote:
>> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
>>> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
>>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
>>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
>>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
>>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
>>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>> [...]
>>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
>>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
>>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
>>>>> GUP user calls put_page(). We can do that by setting the dirty bit
>>>>> in the pte for instance. They are few bonus of doing things that way:
>>>>>     - amortize the cost of calling set_page_dirty() (ie one call for
>>>>>       GUP and page_mkclean()
>>>>>     - it is always safe to do so at GUP time (ie the pte has write
>>>>>       permission and thus the page is in correct state)
>>>>>     - safe from truncate race
>>>>>     - no need to ever lock the page
>>>>
>>>> I seem to have missed this conversation, so please excuse me for
>>>
>>> The set_page_dirty() at GUP was in a private discussion (it started
>>> on another topic and drifted away to set_page_dirty()).
>>>
>>>> asking a stupid question: if it's a file backed page, what prevents
>>>> background writeback from cleaning the dirty page ~30s into a long
>>>> term pin? i.e. I don't see anything in this proposal that prevents
>>>> the page from being cleaned by writeback and putting us straight
>>>> back into the situation where a long term RDMA is writing to a clean
>>>> page....
>>>
>>> So this patchset does not solve this issue.
>>
>> OK, so it just kicks the can further down the road.
>>
>>>     [3..N] decide what to do for GUPed page, so far the plans seems
>>>          to be to keep the page always dirty and never allow page
>>>          write back to restore the page in a clean state. This does
>>>          disable thing like COW and other fs feature but at least
>>>          it seems to be the best thing we can do.
>>
>> So the plan for GUP vs writeback so far is "break fsync()"? :)
>>
>> We might need to work on that a bit more...
> 
> Sorry forgot to say that we still do write back using a bounce page
> so that at least we write something to disk that is just a snapshot
> of the GUPed page everytime writeback kicks in (so either through
> radix tree dirty page write back or fsync or any other sync events).
> So many little details that i forgot the big chunk :)
> 
> Cheers,
> Jérôme
> 

Dave, Jan, Jerome,

Bounce pages for periodic data integrity still seem viable. But for the
question of things like fsync or truncate, I think we were zeroing in
on file leases as a nice building block.

Can we revive the file lease discussion? By going all the way out to user
space and requiring file leases to be coordinated at a high level in the
software call chain, it seems like we could routinely avoid some of the
worst conflicts that the kernel code has to resolve.

For example:

Process A
=========
    gets a lease on file_a that allows gup 
        usage on a range within file_a

    sets up writable DMA:
        get_user_pages() on the file_a range
        start DMA (independent hardware ops)
            hw is reading and writing to range

                                                    Process B
                                                    =========
                                                    truncate(file_a)
                                                       ...
                                                       __break_lease()
    
    handle SIGIO from __break_lease
         if unhandled, process gets killed
         and put_user_pages should get called
         at some point here

...and so this way, user space gets to decide the proper behavior,
instead of leaving the kernel in the dark with an impossible decision.
(Kill process A? Block process B? User space knows the preference,
per app, but the kernel does not.)
        

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20  1:43                 ` John Hubbard
@ 2019-03-20  4:33                   ` Jerome Glisse
  2019-03-20  9:08                     ` Ira Weiny
  2019-03-20 14:55                     ` William Kucharski
  0 siblings, 2 replies; 34+ messages in thread
From: Jerome Glisse @ 2019-03-20  4:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dave Chinner, Kirill A. Shutemov, john.hubbard, Andrew Morton,
	linux-mm, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel,
	Andrea Arcangeli

On Tue, Mar 19, 2019 at 06:43:45PM -0700, John Hubbard wrote:
> On 3/19/19 5:08 PM, Jerome Glisse wrote:
> > On Wed, Mar 20, 2019 at 10:57:52AM +1100, Dave Chinner wrote:
> >> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
> >>> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> >>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> >>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> >>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> >>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> >>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
> >>>>>> [...]
> >>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
> >>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
> >>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
> >>>>> GUP user calls put_page(). We can do that by setting the dirty bit
> >>>>> in the pte for instance. They are few bonus of doing things that way:
> >>>>>     - amortize the cost of calling set_page_dirty() (ie one call for
> >>>>>       GUP and page_mkclean()
> >>>>>     - it is always safe to do so at GUP time (ie the pte has write
> >>>>>       permission and thus the page is in correct state)
> >>>>>     - safe from truncate race
> >>>>>     - no need to ever lock the page
> >>>>
> >>>> I seem to have missed this conversation, so please excuse me for
> >>>
> >>> The set_page_dirty() at GUP was in a private discussion (it started
> >>> on another topic and drifted away to set_page_dirty()).
> >>>
> >>>> asking a stupid question: if it's a file backed page, what prevents
> >>>> background writeback from cleaning the dirty page ~30s into a long
> >>>> term pin? i.e. I don't see anything in this proposal that prevents
> >>>> the page from being cleaned by writeback and putting us straight
> >>>> back into the situation where a long term RDMA is writing to a clean
> >>>> page....
> >>>
> >>> So this patchset does not solve this issue.
> >>
> >> OK, so it just kicks the can further down the road.
> >>
> >>>     [3..N] decide what to do for GUPed page, so far the plans seems
> >>>          to be to keep the page always dirty and never allow page
> >>>          write back to restore the page in a clean state. This does
> >>>          disable thing like COW and other fs feature but at least
> >>>          it seems to be the best thing we can do.
> >>
> >> So the plan for GUP vs writeback so far is "break fsync()"? :)
> >>
> >> We might need to work on that a bit more...
> > 
> > Sorry forgot to say that we still do write back using a bounce page
> > so that at least we write something to disk that is just a snapshot
> > of the GUPed page everytime writeback kicks in (so either through
> > radix tree dirty page write back or fsync or any other sync events).
> > So many little details that i forgot the big chunk :)
> > 
> > Cheers,
> > Jérôme
> > 
> 
> Dave, Jan, Jerome,
> 
> Bounce pages for periodic data integrity still seem viable. But for the
> question of things like fsync or truncate, I think we were zeroing in
> on file leases as a nice building block.
> 
> Can we revive the file lease discussion? By going all the way out to user
> space and requiring file leases to be coordinated at a high level in the
> software call chain, it seems like we could routinely avoid some of the
> worst conflicts that the kernel code has to resolve.
> 
> For example:
> 
> Process A
> =========
>     gets a lease on file_a that allows gup 
>         usage on a range within file_a
> 
>     sets up writable DMA:
>         get_user_pages() on the file_a range
>         start DMA (independent hardware ops)
>             hw is reading and writing to range
> 
>                                                     Process B
>                                                     =========
>                                                     truncate(file_a)
>                                                        ...
>                                                        __break_lease()
>     
>     handle SIGIO from __break_lease
>          if unhandled, process gets killed
>          and put_user_pages should get called
>          at some point here
> 
> ...and so this way, user space gets to decide the proper behavior,
> instead of leaving the kernel in the dark with an impossible decision
> (kill process A? Block process B? User space knows the preference,
> per app, but kernel does not.)

There is no need to kill anything here ... if truncate happens then
the GUP user is just GUPing pages that do not correspond to anything
anymore. This is the current behavior and it is what GUP always has
been. By the time you get the pages from GUP there is no guarantee
that they correspond to anything.

If a device really wants to mirror a process address space faithfully
then the hardware needs to make a little effort: either have something
like ATS/PASID or be able to abide by MMU notifiers.

If we start blocking existing syscalls just because someone is doing a
GUP we are opening a Pandora's box. It is not just truncate, it is a
whole range of syscalls that deal with either files or virtual addresses.

The semantics of GUP are really the semantics of direct I/O on the
virtual addresses you are direct-I/O-ing to/from, and the rule there is:
do not do anything stupid to those virtual addresses while you are
doing direct I/O with them (no munmap, mremap, madvise, truncate, ...).

The same logic applies to files: when two processes do things to the
same file, the kernel never gets in the way of one process doing
something the other process did not expect. For instance, with one
process mmaping the file and the other process truncating it, if the
first process tries to access the file through the mmap after the
truncation, it will get a SIGBUS.

So I believe the best we could do is send a SIGBUS to the process that
has GUPed a range of a file that is being truncated; this would match
what we do for CPU access. There is no reason access through GUP should
be handled any differently.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20  4:33                   ` Jerome Glisse
@ 2019-03-20  9:08                     ` Ira Weiny
  2019-03-20 14:55                     ` William Kucharski
  1 sibling, 0 replies; 34+ messages in thread
From: Ira Weiny @ 2019-03-20  9:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Dave Chinner, Kirill A. Shutemov, john.hubbard,
	Andrew Morton, linux-mm, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jan Kara, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn,
	Ralph Campbell, Tom Talpey, LKML, linux-fsdevel,
	Andrea Arcangeli

On Wed, Mar 20, 2019 at 12:33:20AM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 06:43:45PM -0700, John Hubbard wrote:
> > On 3/19/19 5:08 PM, Jerome Glisse wrote:
> > > On Wed, Mar 20, 2019 at 10:57:52AM +1100, Dave Chinner wrote:
> > >> On Tue, Mar 19, 2019 at 06:06:55PM -0400, Jerome Glisse wrote:
> > >>> On Wed, Mar 20, 2019 at 08:23:46AM +1100, Dave Chinner wrote:
> > >>>> On Tue, Mar 19, 2019 at 10:14:16AM -0400, Jerome Glisse wrote:
> > >>>>> On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> > >>>>>> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> > >>>>>>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> > >>>>>>>> From: John Hubbard <jhubbard@nvidia.com>
> > >>>>>> [...]
> > >>>>> Forgot to mention one thing, we had a discussion with Andrea and Jan
> > >>>>> about set_page_dirty() and Andrea had the good idea of maybe doing
> > >>>>> the set_page_dirty() at GUP time (when GUP with write) not when the
> > >>>>> GUP user calls put_page(). We can do that by setting the dirty bit
> > >>>>> in the pte for instance. They are few bonus of doing things that way:
> > >>>>>     - amortize the cost of calling set_page_dirty() (ie one call for
> > >>>>>       GUP and page_mkclean()
> > >>>>>     - it is always safe to do so at GUP time (ie the pte has write
> > >>>>>       permission and thus the page is in correct state)
> > >>>>>     - safe from truncate race
> > >>>>>     - no need to ever lock the page
> > >>>>
> > >>>> I seem to have missed this conversation, so please excuse me for
> > >>>
> > >>> The set_page_dirty() at GUP was in a private discussion (it started
> > >>> on another topic and drifted away to set_page_dirty()).
> > >>>
> > >>>> asking a stupid question: if it's a file backed page, what prevents
> > >>>> background writeback from cleaning the dirty page ~30s into a long
> > >>>> term pin? i.e. I don't see anything in this proposal that prevents
> > >>>> the page from being cleaned by writeback and putting us straight
> > >>>> back into the situation where a long term RDMA is writing to a clean
> > >>>> page....
> > >>>
> > >>> So this patchset does not solve this issue.
> > >>
> > >> OK, so it just kicks the can further down the road.
> > >>
> > >>>     [3..N] decide what to do for GUPed page, so far the plans seems
> > >>>          to be to keep the page always dirty and never allow page
> > >>>          write back to restore the page in a clean state. This does
> > >>>          disable thing like COW and other fs feature but at least
> > >>>          it seems to be the best thing we can do.
> > >>
> > >> So the plan for GUP vs writeback so far is "break fsync()"? :)
> > >>
> > >> We might need to work on that a bit more...
> > > 
> > > Sorry forgot to say that we still do write back using a bounce page
> > > so that at least we write something to disk that is just a snapshot
> > > of the GUPed page everytime writeback kicks in (so either through
> > > radix tree dirty page write back or fsync or any other sync events).
> > > So many little details that i forgot the big chunk :)
> > > 
> > > Cheers,
> > > Jérôme
> > > 
> > 
> > Dave, Jan, Jerome,
> > 
> > Bounce pages for periodic data integrity still seem viable. But for the
> > question of things like fsync or truncate, I think we were zeroing in
> > on file leases as a nice building block.
> > 
> > Can we revive the file lease discussion? By going all the way out to user
> > space and requiring file leases to be coordinated at a high level in the
> > software call chain, it seems like we could routinely avoid some of the
> > worst conflicts that the kernel code has to resolve.
> > 
> > For example:
> > 
> > Process A
> > =========
> >     gets a lease on file_a that allows gup 
> >         usage on a range within file_a
> > 
> >     sets up writable DMA:
> >         get_user_pages() on the file_a range
> >         start DMA (independent hardware ops)
> >             hw is reading and writing to range
> > 
> >                                                     Process B
> >                                                     =========
> >                                                     truncate(file_a)
> >                                                        ...
> >                                                        __break_lease()
> >     
> >     handle SIGIO from __break_lease
> >          if unhandled, process gets killed
> >          and put_user_pages should get called
> >          at some point here
> > 
> > ...and so this way, user space gets to decide the proper behavior,
> > instead of leaving the kernel in the dark with an impossible decision
> > (kill process A? Block process B? User space knows the preference,
> > per app, but kernel does not.)
> 
> There is no need to kill anything here ... if truncate happens then
> the GUP user is just GUPing page that do not correspond to anything
> anymore. This is the current behavior and it is what GUP always has
> been. By the time you get the page from GUP there is no garantee that
> they correspond to anything.
> 
> If a device really want to mirror process address faithfully then the
> hardware need to make little effort either have something like ATS/
> PASID or be able to abide mmu notifier.
> 
> If we start blocking existing syscall just because someone is doing a
> GUP we are opening a pandora box. It is not just truncate, it is a
> whole range of syscall that deals with either file or virtual address.
> 
> The semantic of GUP is really the semantic of direct I/O and the
> virtual address you are direct I/O-ing to/from and the rule there is:
> do not do anything stupid to those virtual addresses while you are
> doing direct I/O with them (no munmap, mremap, madvise, truncate, ...).
> 
> 
> Same logic apply to file, when two process do thing to same file there
> the kernel never get in the way of one process doing something the
> other process did not expect. For instance one process mmaping the file
> the other process truncating the file, if the first process try to access
> the file through the mmap after the truncation it will get a sigbus.
> 
> So i believe best we could do is send a SIGBUS to the process that has
> GUPed a range of a file that is being truncated this would match what
> we do for CPU acces. There is no reason access through GUP should be
> handled any differently.

I agree with sending SIGBUS, but the fact is that most "Process A"s will not
be handling SIGBUS, and that will then result in the process dying.

Ira

>
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 20:01         ` John Hubbard
@ 2019-03-20  9:28           ` Kirill A. Shutemov
  0 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2019-03-20  9:28 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On Tue, Mar 19, 2019 at 01:01:01PM -0700, John Hubbard wrote:
> On 3/19/19 7:06 AM, Kirill A. Shutemov wrote:
> > On Tue, Mar 19, 2019 at 09:47:24AM -0400, Jerome Glisse wrote:
> >> On Tue, Mar 19, 2019 at 03:04:17PM +0300, Kirill A. Shutemov wrote:
> >>> On Fri, Mar 08, 2019 at 01:36:33PM -0800, john.hubbard@gmail.com wrote:
> >>>> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> [...]
> >>
> >>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>> index f84e22685aaa..37085b8163b1 100644
> >>>> --- a/mm/gup.c
> >>>> +++ b/mm/gup.c
> >>>> @@ -28,6 +28,88 @@ struct follow_page_context {
> >>>>  	unsigned int page_mask;
> >>>>  };
> >>>>  
> >>>> +typedef int (*set_dirty_func_t)(struct page *page);
> >>>> +
> >>>> +static void __put_user_pages_dirty(struct page **pages,
> >>>> +				   unsigned long npages,
> >>>> +				   set_dirty_func_t sdf)
> >>>> +{
> >>>> +	unsigned long index;
> >>>> +
> >>>> +	for (index = 0; index < npages; index++) {
> >>>> +		struct page *page = compound_head(pages[index]);
> >>>> +
> >>>> +		if (!PageDirty(page))
> >>>> +			sdf(page);
> >>>
> >>> How is this safe? What prevents the page to be cleared under you?
> >>>
> >>> If it's safe to race clear_page_dirty*() it has to be stated explicitly
> >>> with a reason why. It's not very clear to me as it is.
> >>
> >> The PageDirty() optimization above is fine to race with clear the
> >> page flag as it means it is racing after a page_mkclean() and the
> >> GUP user is done with the page so page is about to be write back
> >> ie if (!PageDirty(page)) see the page as dirty and skip the sdf()
> >> call while a split second after TestClearPageDirty() happens then
> >> it means the racing clear is about to write back the page so all
> >> is fine (the page was dirty and it is being clear for write back).
> >>
> >> If it does call the sdf() while racing with write back then we
> >> just redirtied the page just like clear_page_dirty_for_io() would
> >> do if page_mkclean() failed so nothing harmful will come of that
> >> neither. Page stays dirty despite write back it just means that
> >> the page might be write back twice in a row.
> > 
> > Fair enough. Should we get it into a comment here?
> 
> How's this read to you? I reworded and slightly expanded Jerome's 
> description:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index d1df7b8ba973..86397ae23922 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -61,6 +61,24 @@ static void __put_user_pages_dirty(struct page **pages,
>         for (index = 0; index < npages; index++) {
>                 struct page *page = compound_head(pages[index]);
>  
> +               /*
> +                * Checking PageDirty at this point may race with
> +                * clear_page_dirty_for_io(), but that's OK. Two key cases:
> +                *
> +                * 1) This code sees the page as already dirty, so it skips
> +                * the call to sdf(). That could happen because
> +                * clear_page_dirty_for_io() called page_mkclean(),
> +                * followed by set_page_dirty(). However, now the page is
> +                * going to get written back, which meets the original
> +                * intention of setting it dirty, so all is well:
> +                * clear_page_dirty_for_io() goes on to call
> +                * TestClearPageDirty(), and write the page back.
> +                *
> +                * 2) This code sees the page as clean, so it calls sdf().
> +                * The page stays dirty, despite being written back, so it
> +                * gets written back again in the next writeback cycle.
> +                * This is harmless.
> +                */
>                 if (!PageDirty(page))
>                         sdf(page);

Looks good to me.

Other nit: effectively the same type of callback is called 'spd' in
set_page_dirty(). Should we rename 'sdf' to 'spd' here too?

> >>>> +void put_user_pages(struct page **pages, unsigned long npages)
> >>>> +{
> >>>> +	unsigned long index;
> >>>> +
> >>>> +	for (index = 0; index < npages; index++)
> >>>> +		put_user_page(pages[index]);
> >>>
> >>> I believe there's an room for improvement for compound pages.
> >>>
> >>> If there's multiple consequential pages in the array that belong to the
> >>> same compound page we can get away with a single atomic operation to
> >>> handle them all.
> >>
> >> Yes maybe just add a comment with that for now and leave this kind of
> >> optimization to latter ?
> > 
> > Sounds good to me.
> > 
> 
> Here's a comment for that:
> 
> @@ -127,6 +145,11 @@ void put_user_pages(struct page **pages, unsigned long npages)
>  {
>         unsigned long index;
>  
> +       /*
> +        * TODO: this can be optimized for huge pages: if a series of pages is
> +        * physically contiguous and part of the same compound page, then a

Compound pages are always physically contiguous. I initially meant that the
optimization makes sense if they are next to each other in the 'pages' array.

> +        * single operation to the head page should suffice.
> +        */
>         for (index = 0; index < npages; index++)
>                 put_user_page(pages[index]);
>  }
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

-- 
 Kirill A. Shutemov
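
The batching Kirill describes, collapsing consecutive 'pages' entries that
share the same compound head into a single reference-count operation, could
be sketched roughly as follows. This is a userspace simulation with
hypothetical names ('sim_page', 'put_run'), not the real kernel API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace simulation of the batching idea: each "page" records the
 * head of its compound page, and releasing a run of consecutive array
 * entries that share the same head costs a single operation on that
 * head instead of one per entry.
 */
struct sim_page {
	struct sim_page *head;	/* compound head; points to itself for order-0 */
	int refcount;		/* stands in for the head page's refcount */
};

/*
 * One batched release: count how many consecutive entries share the
 * same compound head, then drop them all with a single subtraction
 * (in the kernel this would be a single atomic op on the head page).
 */
static size_t put_run(struct sim_page **pages, size_t npages)
{
	struct sim_page *head = pages[0]->head;
	size_t n = 1;

	while (n < npages && pages[n]->head == head)
		n++;

	head->refcount -= (int)n;	/* one operation instead of n */
	return n;
}

static void put_user_pages_batched(struct sim_page **pages, size_t npages)
{
	size_t index = 0;

	while (index < npages)
		index += put_run(pages + index, npages - index);
}
```

With two compound pages whose tails appear consecutively in the array, this
does two refcount operations total rather than one per array entry.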

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-19 19:24       ` John Hubbard
@ 2019-03-20  9:40         ` Kirill A. Shutemov
  0 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2019-03-20  9:40 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, john.hubbard, Andrew Morton, linux-mm, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell,
	Tom Talpey, LKML, linux-fsdevel

On Tue, Mar 19, 2019 at 12:24:00PM -0700, John Hubbard wrote:
> So, I could be persuaded either way. But given the lack of any visible
> perf effects, and given that this code will get removed anyway because
> we'll likely end up with set_page_dirty() called at GUP time instead...
> it seems like it's probably OK to just leave it as is.

Apart from the ugly code generated, another argument might be Spectre-like
attacks on these calls. I would rather avoid indirect function calls
whenever possible. And I don't think open-coded versions of these
functions would look much worse.
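
For illustration, the difference between the two styles might look like this
minimal userspace sketch. The names ('sim_set_page_dirty', 'put_pages_dirty')
are hypothetical stand-ins for the kernel helpers under discussion:

```c
#include <assert.h>
#include <stdbool.h>

static int dirty_count;	/* counts how many pages were "dirtied" */

static void sim_set_page_dirty(void)      { dirty_count++; }
static void sim_set_page_dirty_lock(void) { dirty_count++; }

/*
 * Indirect-call style: the caller passes a callback, and each iteration
 * goes through a function pointer (an indirect branch, the kind of call
 * Spectre-v2 mitigations have to worry about).
 */
static void put_pages_dirty_cb(void (*sdf)(void), int npages)
{
	int i;

	for (i = 0; i < npages; i++)
		sdf();
}

/*
 * Open-coded style: a boolean selects between two direct calls, so the
 * compiler emits no indirect branch and can inline either helper.
 */
static void put_pages_dirty(bool need_lock, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		if (need_lock)
			sim_set_page_dirty_lock();
		else
			sim_set_page_dirty();
	}
}
```

Both versions behave identically; the open-coded one just trades a function
pointer parameter for a flag and two direct call sites.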

-- 
 Kirill A. Shutemov


* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20  4:33                   ` Jerome Glisse
  2019-03-20  9:08                     ` Ira Weiny
@ 2019-03-20 14:55                     ` William Kucharski
  2019-03-20 14:59                       ` Jerome Glisse
  1 sibling, 1 reply; 34+ messages in thread
From: William Kucharski @ 2019-03-20 14:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Dave Chinner, Kirill A. Shutemov, john.hubbard,
	Andrew Morton, Linux MM, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, Andrea Arcangeli



> On Mar 19, 2019, at 10:33 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> 
> So I believe the best we could do is send a SIGBUS to the process that
> has GUPed a range of a file that is being truncated; this would match
> what we do for CPU access. There is no reason access through GUP should
> be handled any differently.

This should be done lazily, as there's no need to send the SIGBUS unless
the GUPed page is actually accessed post-truncate.



* Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
  2019-03-20 14:55                     ` William Kucharski
@ 2019-03-20 14:59                       ` Jerome Glisse
  0 siblings, 0 replies; 34+ messages in thread
From: Jerome Glisse @ 2019-03-20 14:59 UTC (permalink / raw)
  To: William Kucharski
  Cc: John Hubbard, Dave Chinner, Kirill A. Shutemov, john.hubbard,
	Andrew Morton, Linux MM, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
	linux-fsdevel, Andrea Arcangeli

On Wed, Mar 20, 2019 at 08:55:17AM -0600, William Kucharski wrote:
> 
> 
> > On Mar 19, 2019, at 10:33 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > 
> > So I believe the best we could do is send a SIGBUS to the process that
> > has GUPed a range of a file that is being truncated; this would match
> > what we do for CPU access. There is no reason access through GUP should
> > be handled any differently.
> 
> This should be done lazily, as there's no need to send the SIGBUS unless
> the GUPed page is actually accessed post-truncate.

The issue is that, unlike CPU access, we might not be able to detect device
access, and thus it is not something we can do lazily for everyone.

Cheers,
Jérôme


end of thread, other threads:[~2019-03-20 17:10 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2019-03-08 21:36 [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions john.hubbard
2019-03-08 21:36 ` [PATCH v4 1/1] " john.hubbard
2019-03-19 12:04   ` Kirill A. Shutemov
2019-03-19 13:47     ` Jerome Glisse
2019-03-19 14:06       ` Kirill A. Shutemov
2019-03-19 14:15         ` Jerome Glisse
2019-03-19 20:01         ` John Hubbard
2019-03-20  9:28           ` Kirill A. Shutemov
2019-03-19 14:14       ` Jerome Glisse
2019-03-19 14:29         ` Kirill A. Shutemov
2019-03-19 15:36           ` Jan Kara
2019-03-19  9:03             ` Ira Weiny
2019-03-19 20:43               ` Tom Talpey
2019-03-19 20:45                 ` Jerome Glisse
2019-03-19 20:55                   ` Tom Talpey
2019-03-19 19:02             ` John Hubbard
2019-03-19 21:23         ` Dave Chinner
2019-03-19 22:06           ` Jerome Glisse
2019-03-19 23:57             ` Dave Chinner
2019-03-20  0:08               ` Jerome Glisse
2019-03-20  1:43                 ` John Hubbard
2019-03-20  4:33                   ` Jerome Glisse
2019-03-20  9:08                     ` Ira Weiny
2019-03-20 14:55                     ` William Kucharski
2019-03-20 14:59                       ` Jerome Glisse
2019-03-20  0:15               ` John Hubbard
2019-03-20  1:01               ` Christopher Lameter
2019-03-19 19:24       ` John Hubbard
2019-03-20  9:40         ` Kirill A. Shutemov
2019-03-08 23:21 ` [PATCH v4 0/1] " John Hubbard
2019-03-19 18:12 ` Christopher Lameter
2019-03-19 19:24   ` John Hubbard
2019-03-20  1:09     ` Christopher Lameter
2019-03-20  1:18       ` John Hubbard
