Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 0/2] put_user_page*(): start converting the call sites
@ 2018-12-04  0:17 john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
                   ` (2 more replies)
  0 siblings, 3 replies; 206+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi,

Summary: I'd like these two patches to go into the next convenient cycle.
I *think* that means 4.21.

Details

At the Linux Plumbers Conference, we talked about this approach [1], and
the primary lingering concern was over performance. Tom Talpey helped me
through a much more accurate run of the fio performance test, and now
it's looking like an under 1% performance cost, to add and remove pages
from the LRU (this is only paid when dealing with get_user_pages) [2]. So
we should be fine to start converting call sites.

This patchset gets the conversion started. Both patches already had a fair
amount of review.

(Tom, I'll add your Tested-by to the actual implementation that moves
pages on and off the LRU. These first two patches don't do that.)

[1] https://linuxplumbersconf.org/event/2/contributions/126/
    "RDMA and get_user_pages"

[2] https://lore.kernel.org/r/79d1ee27-9ea0-3d15-3fc4-97c1bd79c990@talpey.com

John Hubbard (2):
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c              |  7 +-
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 +-
 include/linux/mm.h                          | 20 ++++++
 mm/swap.c                                   | 80 +++++++++++++++++++++
 9 files changed, 123 insertions(+), 27 deletions(-)

-- 
2.19.2

* [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
@ 2018-12-04  0:17 ` john.hubbard
  2018-12-04  7:53   ` Mike Rapoport
  2018-12-04 20:28   ` Dan Williams
  2018-12-04  0:17 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
  2 siblings, 2 replies; 206+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 20 ++++++++++++
 mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..09fbb2c81aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..bb8c32595e5f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/*
+ * put_user_pages() - for each page in the @pages array, release the page
+ * using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * @pages:  array of pages to be released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures
-- 
2.19.2
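
A minimal userspace sketch of the helper pattern introduced above, for readers who want to poke at the semantics outside the kernel. Everything here is a stand-in: `struct mock_page`, `mock_put_user_page()`, `mock_set_page_dirty()` and `mock_put_user_pages_dirty()` are hypothetical names, not kernel APIs, and the model ignores compound pages and page locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only a refcount and a dirty flag. */
struct mock_page {
	int refcount;
	bool dirty;
	int set_dirty_calls;	/* counts how often the dirty callback ran */
};

/* Same shape as the set_dirty_func typedef in the patch. */
typedef int (*mock_set_dirty_func)(struct mock_page *page);

static int mock_set_page_dirty(struct mock_page *page)
{
	page->dirty = true;
	page->set_dirty_calls++;
	return 1;
}

/* Placeholder release: in this model it only drops one reference. */
static void mock_put_user_page(struct mock_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Mirrors __put_user_pages_dirty(): dirty the clean pages, release each one. */
static void mock_put_user_pages_dirty(struct mock_page **pages,
				      unsigned long npages,
				      mock_set_dirty_func sdf)
{
	unsigned long index;

	for (index = 0; index < npages; index++) {
		struct mock_page *page = pages[index];

		if (!page->dirty)
			sdf(page);

		mock_put_user_page(page);
	}
}
```

Passing the dirtying routine as a function pointer is what lets the patch share one loop between the locked and unlocked variants; the model keeps that shape so the callback can be swapped in the same way.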

* [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*()
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-12-04  0:17 ` john.hubbard
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
  2 siblings, 0 replies; 206+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>

Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  7 ++++---
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++++-------
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++++-------
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 ++++---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c6144df47ea4..c2898bc7b3b2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (!PageDirty(page) && umem->writable && dirty)
-			set_page_dirty_lock(page);
-		put_page(page);
+		if (umem->writable && dirty)
+			put_user_pages_dirty_lock(&page, 1);
+		else
+			put_user_page(page);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 676c1fd1119d..99715049cd3b 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -659,7 +659,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 					ret = -EFAULT;
 					break;
 				}
-				put_page(local_page_list[j]);
+				put_user_page(local_page_list[j]);
 				continue;
 			}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	size_t i;
-
-	for (i = 0; i < npages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, npages);
+	else
+		put_user_pages(p, npages);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_page(pages[0]);
+		put_user_page(pages[0]);
 		goto out;
 	}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_page(sg_page(&db_tab->page[i].mem));
+		put_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_page(sg_page(&db_tab->page[i].mem));
+			put_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..1a5c64c8695f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,10 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	size_t i;
-
-	for (i = 0; i < num_pages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, num_pages);
+	else
+		put_user_pages(p, num_pages);
 }
 
 /*
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..4a4b802b011f 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_page(page);
+			put_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_page(pkt->addr[i].page);
+			put_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_page(pages[i++]);
+		put_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 49275a548751..2ef8d31dc838 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -77,9 +77,10 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (!PageDirty(page) && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
+			if (dirty)
+				put_user_pages_dirty_lock(&page, 1);
+			else
+				put_user_page(page);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
-- 
2.19.2
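
The hfi1 and qib hunks above replace an open-coded loop with one batched call. The sketch below models both forms in userspace so the before/after can be compared directly. All names (`struct mock_page`, `release_pages_open_coded()`, `release_pages_converted()`, the `mock_*` helpers) are hypothetical and the model has no locking. One difference is visible in the diff: the removed hfi1/qib loops called set_page_dirty_lock() on every page when `dirty` was set, while the new helper skips pages that are already dirty. In this simplified model the final page state is the same either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct mock_page {
	int refcount;
	bool dirty;
};

static void mock_set_page_dirty_lock(struct mock_page *page)
{
	page->dirty = true;
}

static void mock_put_page(struct mock_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Before: the open-coded loop, shaped like the removed hfi1/qib lines. */
static void release_pages_open_coded(struct mock_page **p, size_t npages,
				     bool dirty)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		if (dirty)
			mock_set_page_dirty_lock(p[i]);
		mock_put_page(p[i]);
	}
}

/* Stand-in for put_user_pages(): release every page in the array. */
static void mock_put_user_pages(struct mock_page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		mock_put_page(pages[index]);
}

/* Stand-in for put_user_pages_dirty_lock(): dirty clean pages, then release. */
static void mock_put_user_pages_dirty_lock(struct mock_page **pages,
					   unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++) {
		if (!pages[index]->dirty)
			mock_set_page_dirty_lock(pages[index]);
		mock_put_page(pages[index]);
	}
}

/* After: the converted caller picks one batched helper. */
static void release_pages_converted(struct mock_page **p, size_t npages,
				    bool dirty)
{
	if (dirty)
		mock_put_user_pages_dirty_lock(p, npages);
	else
		mock_put_user_pages(p, npages);
}
```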

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-12-04  7:53   ` Mike Rapoport
  2018-12-05  1:40     ` John Hubbard
  2018-12-04 20:28   ` Dan Williams
  1 sibling, 1 reply; 206+ messages in thread
From: Mike Rapoport @ 2018-12-04  7:53 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Jan Kara, Tom Talpey, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel, John Hubbard

Hi John,

Thanks for having documentation as a part of the patch. Some kernel-doc
nits below.

On Mon, Dec 03, 2018 at 04:17:19PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing the problem described in [1]. The steps
> are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem. Again, [1] provides details as to why that is
>    desirable.
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 20 ++++++++++++
>  mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 100 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..09fbb2c81aba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
> 
> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.

Please add @page parameter description, otherwise kernel-doc is unhappy

> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index aa483719922e..bb8c32595e5f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
> 
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/*
> + * put_user_pages_dirty() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.

Please put the parameters description next to the brief function
description, as described in [1]

[1] https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation


> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/*
> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.

Ditto

> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/*
> + * put_user_pages() - for each page in the @pages array, release the page
> + * using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * @pages:  array of pages to be released.
> + * @npages: number of pages in the @pages array.
> + *

And here as well :)

> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.19.2
> 

-- 
Sincerely yours,
Mike.
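
For reference, the layout requested in this review, following the kernel-doc guide linked above, puts the `@parameter` lines directly after the brief description and before the longer text. Kernel-doc comments also open with `/**` rather than `/*`. Below is a sketch using put_user_pages(), with a hypothetical userspace `struct page` stub and a stub put_user_page() so the fragment compiles on its own. The comment wording is illustrative and is not taken from any later revision of the patch.

```c
#include <assert.h>

/* Hypothetical stub so this fragment builds outside the kernel. */
struct page {
	int refcount;
};

/* Stub release: drops one reference in this userspace model. */
static void put_user_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/**
 * put_user_pages() - release an array of gup-pinned pages
 * @pages:  array of pages to be released.
 * @npages: number of pages in the @pages array.
 *
 * For each page in the @pages array, release the page using
 * put_user_page().
 *
 * Please see the put_user_page() documentation for details.
 */
void put_user_pages(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		put_user_page(pages[index]);
}
```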

* RE: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2018-12-04  0:17 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
@ 2018-12-04 17:10 ` David Laight
  2018-12-05  1:05   ` John Hubbard
  2 siblings, 1 reply; 206+ messages in thread
From: David Laight @ 2018-12-04 17:10 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: john.hubbard@gmail.com
> Sent: 04 December 2018 00:17
> 
> Summary: I'd like these two patches to go into the next convenient cycle.
> I *think* that means 4.21.
> 
> Details
> 
> At the Linux Plumbers Conference, we talked about this approach [1], and
> the primary lingering concern was over performance. Tom Talpey helped me
> through a much more accurate run of the fio performance test, and now
> it's looking like an under 1% performance cost, to add and remove pages
> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
> we should be fine to start converting call sites.
> 
> This patchset gets the conversion started. Both patches already had a fair
> amount of review.

Shouldn't the commit message contain actual details of the change?

	David

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2018-12-04  7:53   ` Mike Rapoport
@ 2018-12-04 20:28   ` Dan Williams
  2018-12-04 21:56     ` John Hubbard
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-04 20:28 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, John Hubbard

On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>
> From: John Hubbard <jhubbard@nvidia.com>
>
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
>
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
>
> This is the first step of fixing the problem described in [1]. The steps
> are:
>
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
>
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
>
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
>
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem. Again, [1] provides details as to why that is
>    desirable.

I thought at Plumbers we talked about using a page bit to tag pages
that have had their reference count elevated by get_user_pages()? That
way there is no need to distinguish put_page() from put_user_page() it
just happens internally to put_page(). At the conference Matthew was
offering to free up a page bit for this purpose.

> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>
> Reviewed-by: Jan Kara <jack@suse.cz>

Wish you could have been there, Jan. I'm missing why it's safe to
assume that a single put_user_page() is paired with a get_user_page()?

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 20:28   ` Dan Williams
@ 2018-12-04 21:56     ` John Hubbard
  2018-12-04 23:03       ` Dan Williams
  2018-12-05 11:16       ` Jan Kara
  0 siblings, 2 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-04 21:56 UTC (permalink / raw)
  To: Dan Williams, John Hubbard
  Cc: Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 12:28 PM, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>    for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>    call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>    implement tracking of these pages. This tracking will be separate from
>>    the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>    special handling (especially in writeback paths) when the pages are
>>    backed by a filesystem. Again, [1] provides details as to why that is
>>    desirable.
> 
> I thought at Plumbers we talked about using a page bit to tag pages
> that have had their reference count elevated by get_user_pages()? That
> way there is no need to distinguish put_page() from put_user_page() it
> just happens internally to put_page(). At the conference Matthew was
> offering to free up a page bit for this purpose.
> 

...but then, upon further discussion in that same session, we realized that
that doesn't help. You need a reference count. Otherwise a random put_page
could affect your dma-pinned pages, etc, etc.

I was not able to actually find any place where a single additional page
bit would help our situation, which is why this still uses LRU fields for
both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
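
As a rough illustration of why a count is needed rather than a single bit,
here is a minimal user-space sketch in C. It is not kernel code: struct
toy_page and every toy_*() helper are hypothetical names invented for this
example, and only dma_pinned_count echoes a name from the RFC. The sketch
assumes a page can be pinned more than once at the same time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a page; not the kernel's struct page. */
struct toy_page {
	int refcount;		/* ordinary references */
	bool pinned_bit;	/* single-bit scheme */
	int dma_pinned_count;	/* counting scheme */
};

/* Single-bit scheme: a pin sets the flag, an unpin clears it. */
static void toy_pin_bit(struct toy_page *p)
{
	p->refcount++;
	p->pinned_bit = true;
}

static void toy_unpin_bit(struct toy_page *p)
{
	p->refcount--;
	p->pinned_bit = false;	/* cannot know whether other pins remain */
}

/* Counting scheme: each pin and unpin is accounted for separately. */
static void toy_pin_count(struct toy_page *p)
{
	p->refcount++;
	p->dma_pinned_count++;
}

static void toy_unpin_count(struct toy_page *p)
{
	assert(p->dma_pinned_count > 0);
	p->refcount--;
	p->dma_pinned_count--;
}

static bool toy_is_pinned_bit(const struct toy_page *p)
{
	return p->pinned_bit;
}

static bool toy_is_pinned_count(const struct toy_page *p)
{
	return p->dma_pinned_count > 0;
}
```

With two pins and one unpin, the bit-based page already looks unpinned even
though one pin is still outstanding, while the counted page stays
identifiable as pinned until the last unpin.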


[1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com



>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Wish, you could have been there Jan. I'm missing why it's safe to
> assume that a single put_user_page() is paired with a get_user_page()?
> 

A put_user_page() per page, or a put_user_pages() for an array of pages. See
patch 0002 for several examples.
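
For readers without patch 0002 at hand, the pairing can be sketched in
user-space C as follows. This is only a toy model of the placeholder
behavior described in the patch description (put_user_page() simply calls
put_page()); struct toy_page and the toy_*() names are hypothetical, and
the real kernel signatures may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a page; not the kernel's struct page. */
struct toy_page {
	int refcount;
};

/* Stand-in for put_page(): drop one reference. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* Placeholder version: simply forwards to the ordinary put. */
static void toy_put_user_page(struct toy_page *page)
{
	toy_put_page(page);
}

/* Array variant: one put_user_page() per pinned page. */
static void toy_put_user_pages(struct toy_page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		toy_put_user_page(pages[i]);
}
```

A caller that pinned N pages releases them with one toy_put_user_pages()
call on the same array, or with N individual toy_put_user_page() calls.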

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 21:56     ` John Hubbard
@ 2018-12-04 23:03       ` Dan Williams
  2018-12-05  0:36         ` Jerome Glisse
  2018-12-05  0:58         ` John Hubbard
  2018-12-05 11:16       ` Jan Kara
  1 sibling, 2 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-04 23:03 UTC (permalink / raw)
  To: John Hubbard
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/18 12:28 PM, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Introduces put_user_page(), which simply calls put_page().
> >> This provides a way to update all get_user_pages*() callers,
> >> so that they call put_user_page(), instead of put_page().
> >>
> >> Also introduces put_user_pages(), and a few dirty/locked variations,
> >> as a replacement for release_pages(), and also as a replacement
> >> for open-coded loops that release multiple pages.
> >> These may be used for subsequent performance improvements,
> >> via batching of pages to be released.
> >>
> >> This is the first step of fixing the problem described in [1]. The steps
> >> are:
> >>
> >> 1) (This patch): provide put_user_page*() routines, intended to be used
> >>    for releasing pages that were pinned via get_user_pages*().
> >>
> >> 2) Convert all of the call sites for get_user_pages*(), to
> >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> >>    call sites, and will take some time.
> >>
> >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> >>    implement tracking of these pages. This tracking will be separate from
> >>    the existing struct page refcounting.
> >>
> >> 4) Use the tracking and identification of these pages, to implement
> >>    special handling (especially in writeback paths) when the pages are
> >>    backed by a filesystem. Again, [1] provides details as to why that is
> >>    desirable.
> >
> > I thought at Plumbers we talked about using a page bit to tag pages
> > that have had their reference count elevated by get_user_pages()? That
> > way there is no need to distinguish put_page() from put_user_page() it
> > just happens internally to put_page(). At the conference Matthew was
> > offering to free up a page bit for this purpose.
> >
>
> ...but then, upon further discussion in that same session, we realized that
> that doesn't help. You need a reference count. Otherwise a random put_page
> could affect your dma-pinned pages, etc, etc.

Ok, sorry, I mis-remembered. So, you're effectively trying to capture
the end of the page pin event separate from the final 'put' of the
page? Makes sense.

> I was not able to actually find any place where a single additional page
> bit would help our situation, which is why this still uses LRU fields for
> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.

Except the LRU fields are already in use for ZONE_DEVICE pages... how
does this proposal interact with those?

> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
>
> >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > Wish, you could have been there Jan. I'm missing why it's safe to
> > assume that a single put_user_page() is paired with a get_user_page()?
> >
>
> A put_user_page() per page, or a put_user_pages() for an array of pages. See
> patch 0002 for several examples.

Yes, however I was more concerned about validation and trying to
locate missed places where put_page() is used instead of
put_user_page().

It would be interesting to see if we could have a debug mode where
get_user_pages() returned dynamically allocated pages from a known
address range and catch drivers that operate on a user-pinned page
without using the proper helper to 'put' it. I think we might also
need a ref_user_page() for drivers that may do their own get_page()
and expect the dma_pinned_count to also increase.
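
To make the ref_user_page() suggestion concrete, here is a minimal
user-space sketch in C. No such helper exists in the posted patches; the
semantics shown (take an extra reference and also bump dma_pinned_count)
are an assumption about what it might do, and struct toy_page and the
toy_*() names are hypothetical.

```c
#include <assert.h>

/* Hypothetical toy model of a page; not the kernel's struct page. */
struct toy_page {
	int refcount;
	int dma_pinned_count;
};

/* Stand-in for get_user_pages() pinning a single page. */
static void toy_gup_pin(struct toy_page *p)
{
	p->refcount++;
	p->dma_pinned_count++;
}

/* Stand-in for a plain get_page(): a reference, but not a pin. */
static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

/* Assumed ref_user_page(): an extra reference that also counts as a pin. */
static void toy_ref_user_page(struct toy_page *p)
{
	assert(p->dma_pinned_count > 0);	/* page must already be pinned */
	p->refcount++;
	p->dma_pinned_count++;
}

/* Stand-in for put_user_page() once pin tracking exists. */
static void toy_put_user_page(struct toy_page *p)
{
	assert(p->dma_pinned_count > 0);
	p->dma_pinned_count--;
	p->refcount--;
}
```

A driver that takes its extra reference with toy_get_page() and later
releases both references with toy_put_user_page() would underflow the pin
count on the second release; taking the extra reference with
toy_ref_user_page() keeps the two counters in step.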


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 23:03       ` Dan Williams
@ 2018-12-05  0:36         ` Jerome Glisse
  2018-12-05  0:40           ` Dan Williams
  2018-12-05  0:58         ` John Hubbard
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-05  0:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 12/4/18 12:28 PM, Dan Williams wrote:
> > > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> > >>
> > >> From: John Hubbard <jhubbard@nvidia.com>
> > >>
> > >> Introduces put_user_page(), which simply calls put_page().
> > >> This provides a way to update all get_user_pages*() callers,
> > >> so that they call put_user_page(), instead of put_page().
> > >>
> > >> Also introduces put_user_pages(), and a few dirty/locked variations,
> > >> as a replacement for release_pages(), and also as a replacement
> > >> for open-coded loops that release multiple pages.
> > >> These may be used for subsequent performance improvements,
> > >> via batching of pages to be released.
> > >>
> > >> This is the first step of fixing the problem described in [1]. The steps
> > >> are:
> > >>
> > >> 1) (This patch): provide put_user_page*() routines, intended to be used
> > >>    for releasing pages that were pinned via get_user_pages*().
> > >>
> > >> 2) Convert all of the call sites for get_user_pages*(), to
> > >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> > >>    call sites, and will take some time.
> > >>
> > >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> > >>    implement tracking of these pages. This tracking will be separate from
> > >>    the existing struct page refcounting.
> > >>
> > >> 4) Use the tracking and identification of these pages, to implement
> > >>    special handling (especially in writeback paths) when the pages are
> > >>    backed by a filesystem. Again, [1] provides details as to why that is
> > >>    desirable.
> > >
> > > I thought at Plumbers we talked about using a page bit to tag pages
> > > that have had their reference count elevated by get_user_pages()? That
> > > way there is no need to distinguish put_page() from put_user_page() it
> > > just happens internally to put_page(). At the conference Matthew was
> > > offering to free up a page bit for this purpose.
> > >
> >
> > ...but then, upon further discussion in that same session, we realized that
> > that doesn't help. You need a reference count. Otherwise a random put_page
> > could affect your dma-pinned pages, etc, etc.
> 
> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> the end of the page pin event separate from the final 'put' of the
> page? Makes sense.
> 
> > I was not able to actually find any place where a single additional page
> > bit would help our situation, which is why this still uses LRU fields for
> > both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> 
> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> does this proposal interact with those?
> 
> > [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
> >
> > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> > >>
> > >> Reviewed-by: Jan Kara <jack@suse.cz>
> > >
> > > Wish, you could have been there Jan. I'm missing why it's safe to
> > > assume that a single put_user_page() is paired with a get_user_page()?
> > >
> >
> > A put_user_page() per page, or a put_user_pages() for an array of pages. See
> > patch 0002 for several examples.
> 
> Yes, however I was more concerned about validation and trying to
> locate missed places where put_page() is used instead of
> put_user_page().
> 
> It would be interesting to see if we could have a debug mode where
> get_user_pages() returned dynamically allocated pages from a known
> address range and catch drivers that operate on a user-pinned page
> without using the proper helper to 'put' it. I think we might also
> need a ref_user_page() for drivers that may do their own get_page()
> and expect the dma_pinned_count to also increase.

Totally crazy idea for this, but this is the right time of day
for it (for me at least it is beer time :)). What about mapping
all struct pages in two different ranges of kernel virtual address
space, so that get_user_pages() returns a pointer to the struct page
from the second range? Then in put_page() you know for sure whether
the code putting the page got it from GUP or from somewhere else.
page_to_pfn() would need some trickery to handle that.

Dunno if we are running out of kernel virtual address space (outside
of 32-bit, which I believe we are trying to shoot down quietly behind
the bar).

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:36         ` Jerome Glisse
@ 2018-12-05  0:40           ` Dan Williams
  2018-12-05  0:59             ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-05  0:40 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 12/4/18 12:28 PM, Dan Williams wrote:
> > > > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> > > >>
> > > >> From: John Hubbard <jhubbard@nvidia.com>
> > > >>
> > > >> Introduces put_user_page(), which simply calls put_page().
> > > >> This provides a way to update all get_user_pages*() callers,
> > > >> so that they call put_user_page(), instead of put_page().
> > > >>
> > > >> Also introduces put_user_pages(), and a few dirty/locked variations,
> > > >> as a replacement for release_pages(), and also as a replacement
> > > >> for open-coded loops that release multiple pages.
> > > >> These may be used for subsequent performance improvements,
> > > >> via batching of pages to be released.
> > > >>
> > > >> This is the first step of fixing the problem described in [1]. The steps
> > > >> are:
> > > >>
> > > >> 1) (This patch): provide put_user_page*() routines, intended to be used
> > > >>    for releasing pages that were pinned via get_user_pages*().
> > > >>
> > > >> 2) Convert all of the call sites for get_user_pages*(), to
> > > >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > >>    call sites, and will take some time.
> > > >>
> > > >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> > > >>    implement tracking of these pages. This tracking will be separate from
> > > >>    the existing struct page refcounting.
> > > >>
> > > >> 4) Use the tracking and identification of these pages, to implement
> > > >>    special handling (especially in writeback paths) when the pages are
> > > >>    backed by a filesystem. Again, [1] provides details as to why that is
> > > >>    desirable.
> > > >
> > > > I thought at Plumbers we talked about using a page bit to tag pages
> > > > that have had their reference count elevated by get_user_pages()? That
> > > > way there is no need to distinguish put_page() from put_user_page() it
> > > > just happens internally to put_page(). At the conference Matthew was
> > > > offering to free up a page bit for this purpose.
> > > >
> > >
> > > ...but then, upon further discussion in that same session, we realized that
> > > that doesn't help. You need a reference count. Otherwise a random put_page
> > > could affect your dma-pinned pages, etc, etc.
> >
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
> > > I was not able to actually find any place where a single additional page
> > > bit would help our situation, which is why this still uses LRU fields for
> > > both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
> >
> > > [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
> > >
> > > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> > > >>
> > > >> Reviewed-by: Jan Kara <jack@suse.cz>
> > > >
> > > > Wish, you could have been there Jan. I'm missing why it's safe to
> > > > assume that a single put_user_page() is paired with a get_user_page()?
> > > >
> > >
> > > A put_user_page() per page, or a put_user_pages() for an array of pages. See
> > > patch 0002 for several examples.
> >
> > Yes, however I was more concerned about validation and trying to
> > locate missed places where put_page() is used instead of
> > put_user_page().
> >
> > It would be interesting to see if we could have a debug mode where
> > get_user_pages() returned dynamically allocated pages from a known
> > address range and catch drivers that operate on a user-pinned page
> > without using the proper helper to 'put' it. I think we might also
> > need a ref_user_page() for drivers that may do their own get_page()
> > and expect the dma_pinned_count to also increase.
>
> Totally crazy idea for this, but this is the right time of day
> for it (for me at least it is beer time :)). What about mapping
> all struct pages in two different ranges of kernel virtual address
> space, so that get_user_pages() returns a pointer to the struct page
> from the second range? Then in put_page() you know for sure whether
> the code putting the page got it from GUP or from somewhere else.
> page_to_pfn() would need some trickery to handle that.

Yes, exactly what I was thinking, if only as a debug mode since
instrumenting every pfn/page translation would be expensive.

> Dunno if we are running out of kernel virtual address space (outside
> of 32-bit, which I believe we are trying to shoot down quietly behind
> the bar).

There's room, KASAN is in a roughly similar place.
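
The two-range debug idea can be tried out in user space. The sketch below
is only an analogy, not kernel code: it maps the same backing store at two
virtual addresses with mmap(), hands out pointers from the second mapping
for "GUP" references, and lets the put helpers check which range a pointer
came from. struct toy_page, the TOY_* constants, and all toy_*() functions
are hypothetical, and a temporary file is assumed to be available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical toy model of a page; not the kernel's struct page. */
struct toy_page {
	int refcount;
};

#define TOY_NPAGES	16
#define TOY_MAP_SIZE	4096	/* >= TOY_NPAGES * sizeof(struct toy_page) */

static char *toy_primary;	/* normal view of the page array */
static char *toy_alias;		/* second view, handed out by "GUP" */

/* Map one temporary file twice so both views share the same memory. */
static int toy_map_init(void)
{
	FILE *f = tmpfile();

	if (!f)
		return -1;
	if (ftruncate(fileno(f), TOY_MAP_SIZE) != 0)
		return -1;
	toy_primary = mmap(NULL, TOY_MAP_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fileno(f), 0);
	toy_alias = mmap(NULL, TOY_MAP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fileno(f), 0);
	if (toy_primary == MAP_FAILED || toy_alias == MAP_FAILED)
		return -1;
	return 0;
}

/* True if the pointer lies inside the alias ("GUP") range. */
static bool toy_in_alias(const void *p)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t base = (uintptr_t)toy_alias;

	return addr >= base && addr < base + TOY_MAP_SIZE;
}

static struct toy_page *toy_primary_page(size_t idx)
{
	assert(idx < TOY_NPAGES);
	return (struct toy_page *)toy_primary + idx;
}

/* Stand-in for GUP: take a reference, return the alias-range pointer. */
static struct toy_page *toy_gup(size_t idx)
{
	struct toy_page *p = toy_primary_page(idx);

	p->refcount++;
	return (struct toy_page *)(toy_alias + ((char *)p - toy_primary));
}

/* Debug put_page(): returns false when handed a GUP pointer. */
static bool toy_put_page(struct toy_page *p)
{
	bool ok = !toy_in_alias(p);

	p->refcount--;	/* both views reach the same counter */
	return ok;
}

/* Debug put_user_page(): returns false when handed a non-GUP pointer. */
static bool toy_put_user_page(struct toy_page *p)
{
	bool ok = toy_in_alias(p);

	p->refcount--;
	return ok;
}
```

In this toy the check works in both directions: a GUP pointer passed to
toy_put_page() and an ordinary pointer passed to toy_put_user_page() are
both reported. The kernel version would also need the page_to_pfn()
handling mentioned above, which this sketch does not model.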


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 23:03       ` Dan Williams
  2018-12-05  0:36         ` Jerome Glisse
@ 2018-12-05  0:58         ` John Hubbard
  2018-12-05  1:00           ` Dan Williams
  2018-12-05  1:15           ` Matthew Wilcox
  1 sibling, 2 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-05  0:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 3:03 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>>>
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
>>>>
>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>> as a replacement for release_pages(), and also as a replacement
>>>> for open-coded loops that release multiple pages.
>>>> These may be used for subsequent performance improvements,
>>>> via batching of pages to be released.
>>>>
>>>> This is the first step of fixing the problem described in [1]. The steps
>>>> are:
>>>>
>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>    for releasing pages that were pinned via get_user_pages*().
>>>>
>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>>>    call sites, and will take some time.
>>>>
>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>    implement tracking of these pages. This tracking will be separate from
>>>>    the existing struct page refcounting.
>>>>
>>>> 4) Use the tracking and identification of these pages, to implement
>>>>    special handling (especially in writeback paths) when the pages are
>>>>    backed by a filesystem. Again, [1] provides details as to why that is
>>>>    desirable.
>>>
>>> I thought at Plumbers we talked about using a page bit to tag pages
>>> that have had their reference count elevated by get_user_pages()? That
>>> way there is no need to distinguish put_page() from put_user_page() it
>>> just happens internally to put_page(). At the conference Matthew was
>>> offering to free up a page bit for this purpose.
>>>
>>
>> ...but then, upon further discussion in that same session, we realized that
>> that doesn't help. You need a reference count. Otherwise a random put_page
>> could affect your dma-pinned pages, etc, etc.
> 
> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> the end of the page pin event separate from the final 'put' of the
> page? Makes sense.
> 

Yes, that's it exactly.

>> I was not able to actually find any place where a single additional page
>> bit would help our situation, which is why this still uses LRU fields for
>> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> 
> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> does this proposal interact with those?

Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?

If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
LRU field approach is unusable.


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:40           ` Dan Williams
@ 2018-12-05  0:59             ` John Hubbard
  0 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-05  0:59 UTC (permalink / raw)
  To: Dan Williams, Jérôme Glisse
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 4:40 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse <jglisse@redhat.com> wrote:
>>
>> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
>>> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>
>>>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>>>> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>>>>>
>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>>
>>>>>> Introduces put_user_page(), which simply calls put_page().
>>>>>> This provides a way to update all get_user_pages*() callers,
>>>>>> so that they call put_user_page(), instead of put_page().
>>>>>>
>>>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>>>> as a replacement for release_pages(), and also as a replacement
>>>>>> for open-coded loops that release multiple pages.
>>>>>> These may be used for subsequent performance improvements,
>>>>>> via batching of pages to be released.
>>>>>>
>>>>>> This is the first step of fixing the problem described in [1]. The steps
>>>>>> are:
>>>>>>
>>>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>>>    for releasing pages that were pinned via get_user_pages*().
>>>>>>
>>>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>>>>>    call sites, and will take some time.
>>>>>>
>>>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>>>    implement tracking of these pages. This tracking will be separate from
>>>>>>    the existing struct page refcounting.
>>>>>>
>>>>>> 4) Use the tracking and identification of these pages, to implement
>>>>>>    special handling (especially in writeback paths) when the pages are
>>>>>>    backed by a filesystem. Again, [1] provides details as to why that is
>>>>>>    desirable.
>>>>>
>>>>> I thought at Plumbers we talked about using a page bit to tag pages
>>>>> that have had their reference count elevated by get_user_pages()? That
>>>>> way there is no need to distinguish put_page() from put_user_page() it
>>>>> just happens internally to put_page(). At the conference Matthew was
>>>>> offering to free up a page bit for this purpose.
>>>>>
>>>>
>>>> ...but then, upon further discussion in that same session, we realized that
>>>> that doesn't help. You need a reference count. Otherwise a random put_page
>>>> could affect your dma-pinned pages, etc, etc.
>>>
>>> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
>>> the end of the page pin event separate from the final 'put' of the
>>> page? Makes sense.
>>>
>>>> I was not able to actually find any place where a single additional page
>>>> bit would help our situation, which is why this still uses LRU fields for
>>>> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
>>>
>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>> does this proposal interact with those?
>>>
>>>> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
>>>>
>>>>>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>>>>>
>>>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>>>
>>>>> Wish, you could have been there Jan. I'm missing why it's safe to
>>>>> assume that a single put_user_page() is paired with a get_user_page()?
>>>>>
>>>>
>>>> A put_user_page() per page, or a put_user_pages() for an array of pages. See
>>>> patch 0002 for several examples.
>>>
>>> Yes, however I was more concerned about validation and trying to
>>> locate missed places where put_page() is used instead of
>>> put_user_page().
>>>
>>> It would be interesting to see if we could have a debug mode where
>>> get_user_pages() returned dynamically allocated pages from a known
>>> address range and catch drivers that operate on a user-pinned page
>>> without using the proper helper to 'put' it. I think we might also
>>> need a ref_user_page() for drivers that may do their own get_page()
>>> and expect the dma_pinned_count to also increase.

Good idea about a new ref_user_page() call. It's going to be hard to find
those places at all of the call sites, btw.

>>
>> Totally crazy idea for this, but this is the right time of day
>> for it (for me at least it is beer time :)). What about mapping
>> all struct pages in two different ranges of kernel virtual address
>> space, so that get_user_pages() returns a pointer to the struct page
>> from the second range? Then in put_page() you know for sure whether
>> the code putting the page got it from GUP or from somewhere else.
>> page_to_pfn() would need some trickery to handle that.
> 
> Yes, exactly what I was thinking, if only as a debug mode since
> instrumenting every pfn/page translation would be expensive.
> 

That does sound viable as a debug mode. I'll try it out. A reliable way
(in both directions) of sorting out put_page() vs. put_user_page() 
would be a huge improvement, even if just in debug mode.

>> Dunno if we are running out of kernel virtual address space (outside
>> of 32-bit, which I believe we are trying to shoot down quietly behind
>> the bar).
> 
> There's room, KASAN is in a roughly similar place.
> 

Looks like I'd better post a new version of the entire RFC, rather than just
these two patches. It's still less fully-baked than I'd hoped. :)

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:58         ` John Hubbard
@ 2018-12-05  1:00           ` Dan Williams
  2018-12-05  1:15           ` Matthew Wilcox
  1 sibling, 0 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-05  1:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 4:58 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/18 3:03 PM, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
[..]
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
>
> Yes, that's it exactly.
>
> >> I was not able to actually find any place where a single additional page
> >> bit would help our situation, which is why this still uses LRU fields for
> >> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
>
> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>
> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> LRU field approach is unusable.

Unfortunately, the entire motivation for ZONE_DEVICE was to support
get_user_pages() for persistent memory.


* Re: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
@ 2018-12-05  1:05   ` John Hubbard
  2018-12-05 14:08     ` David Laight
  0 siblings, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-05  1:05 UTC (permalink / raw)
  To: David Laight, john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel

On 12/4/18 9:10 AM, David Laight wrote:
> From: john.hubbard@gmail.com
>> Sent: 04 December 2018 00:17
>>
>> Summary: I'd like these two patches to go into the next convenient cycle.
>> I *think* that means 4.21.
>>
>> Details
>>
>> At the Linux Plumbers Conference, we talked about this approach [1], and
>> the primary lingering concern was over performance. Tom Talpey helped me
>> through a much more accurate run of the fio performance test, and now
>> it's looking like an under 1% performance cost, to add and remove pages
>> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
>> we should be fine to start converting call sites.
>>
>> This patchset gets the conversion started. Both patches already had a fair
>> amount of review.
> 
> Shouldn't the commit message contain actual details of the change?
> 

Hi David,

This "patch 0000" is not a commit message, as it never shows up in git log.
Each of the follow-up patches does have details about the changes it makes.

But maybe you are really asking for more background information, which I
should have added in this cover letter. Here's a start:

https://lore.kernel.org/r/20181110085041.10071-1-jhubbard@nvidia.com

...and it looks like this small patch series is not going to work out--I'm
going to have to fall back to another RFC spin. So I'll be sure to include 
you and everyone on that. Hope that helps.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:58         ` John Hubbard
  2018-12-05  1:00           ` Dan Williams
@ 2018-12-05  1:15           ` Matthew Wilcox
  2018-12-05  1:44             ` Jerome Glisse
  2018-12-05  5:52             ` Dan Williams
  1 sibling, 2 replies; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-05  1:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> On 12/4/18 3:03 PM, Dan Williams wrote:
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
> 
> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> 
> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> LRU field approach is unusable.

We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
damage:

+++ b/include/linux/mm_types.h
@@ -151,10 +151,12 @@ struct page {
 #endif
                };
                struct {        /* ZONE_DEVICE pages */
+                       unsigned long _zd_pad_2;        /* LRU */
+                       unsigned long _zd_pad_3;        /* LRU */
+                       unsigned long _zd_pad_1;        /* uses mapping */
                        /** @pgmap: Points to the hosting device page map. */
                        struct dev_pagemap *pgmap;
                        unsigned long hmm_data;
-                       unsigned long _zd_pad_1;        /* uses mapping */
                };
 
                /** @rcu_head: You can use this to free a page by RCU. */

You don't use page->private or page->index, do you Dan?

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  7:53   ` Mike Rapoport
@ 2018-12-05  1:40     ` John Hubbard
  0 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-05  1:40 UTC (permalink / raw)
  To: Mike Rapoport, john.hubbard
  Cc: Andrew Morton, linux-mm, Jan Kara, Tom Talpey, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel

On 12/3/18 11:53 PM, Mike Rapoport wrote:
> Hi John,
> 
> Thanks for having documentation as a part of the patch. Some kernel-doc
> nits below.
> 
> On Mon, Dec 03, 2018 at 04:17:19PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>    for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>    call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>    implement tracking of these pages. This tracking will be separate from
>>    the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>    special handling (especially in writeback paths) when the pages are
>>    backed by a filesystem. Again, [1] provides details as to why that is
>>    desirable.
>>
>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Christopher Lameter <cl@linux.com>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Jerome Glisse <jglisse@redhat.com>
>> Cc: Christoph Hellwig <hch@infradead.org>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>  include/linux/mm.h | 20 ++++++++++++
>>  mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5411de93a363..09fbb2c81aba 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
>>  		__put_page(page);
>>  }
>>
>> +/*
>> + * put_user_page() - release a page that had previously been acquired via
>> + * a call to one of the get_user_pages*() functions.
> 
> Please add @page parameter description, otherwise kernel-doc is unhappy

Hi Mike,

Sorry I missed these kerneldoc points from your earlier review! I'll fix it
up now and it will show up in the next posting.

> 
>> + *
>> + * Pages that were pinned via get_user_pages*() must be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below. This is so that eventually, pages that are pinned via
>> + * get_user_pages*() can be separately tracked and uniquely handled. In
>> + * particular, interactions with RDMA and filesystems need special
>> + * handling.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +	put_page(page);
>> +}
>> +
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
>> +void put_user_pages(struct page **pages, unsigned long npages);
>> +
>>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index aa483719922e..bb8c32595e5f 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>
>> +typedef int (*set_dirty_func)(struct page *page);
>> +
>> +static void __put_user_pages_dirty(struct page **pages,
>> +				   unsigned long npages,
>> +				   set_dirty_func sdf)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
>> +		struct page *page = compound_head(pages[index]);
>> +
>> +		if (!PageDirty(page))
>> +			sdf(page);
>> +
>> +		put_user_page(page);
>> +	}
>> +}
>> +
>> +/*
>> + * put_user_pages_dirty() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * set_page_dirty(), which does not lock the page, is used here.
>> + * Therefore, it is the caller's responsibility to ensure that this is
>> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
> 
> Please put the parameters description next to the brief function
> description, as described in [1]
> 
> [1] https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation
> 

OK. 

> 
>> + *
>> + */
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
>> +{
>> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty);
>> +
>> +/*
>> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * This is just like put_user_pages_dirty(), except that it invokes
>> + * set_page_dirty_lock(), instead of set_page_dirty().
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
> 
> Ditto

OK.

> 
>> + *
>> + */
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
>> +{
>> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
>> +
>> +/*
>> + * put_user_pages() - for each page in the @pages array, release the page
>> + * using put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
>> + *
> 
> And here as well :)

OK.


thanks,
-- 
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:15           ` Matthew Wilcox
@ 2018-12-05  1:44             ` Jerome Glisse
  2018-12-05  1:57               ` John Hubbard
  2018-12-05  5:52             ` Dan Williams
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-05  1:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > On 12/4/18 3:03 PM, Dan Williams wrote:
> > > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > does this proposal interact with those?
> > 
> > Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > 
> > If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> > LRU field approach is unusable.
> 
> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> damage:
> 
> +++ b/include/linux/mm_types.h
> @@ -151,10 +151,12 @@ struct page {
>  #endif
>                 };
>                 struct {        /* ZONE_DEVICE pages */
> +                       unsigned long _zd_pad_2;        /* LRU */
> +                       unsigned long _zd_pad_3;        /* LRU */
> +                       unsigned long _zd_pad_1;        /* uses mapping */
>                         /** @pgmap: Points to the hosting device page map. */
>                         struct dev_pagemap *pgmap;
>                         unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
>                 };
>  
>                 /** @rcu_head: You can use this to free a page by RCU. */
> 
> You don't use page->private or page->index, do you Dan?

page->private and page->index are used by HMM DEVICE pages.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:44             ` Jerome Glisse
@ 2018-12-05  1:57               ` John Hubbard
  2018-12-07  2:45                 ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-05  1:57 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/4/18 5:44 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>> does this proposal interact with those?
>>>
>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>
>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>> LRU field approach is unusable.
>>
>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>> damage:
>>
>> +++ b/include/linux/mm_types.h
>> @@ -151,10 +151,12 @@ struct page {
>>  #endif
>>                 };
>>                 struct {        /* ZONE_DEVICE pages */
>> +                       unsigned long _zd_pad_2;        /* LRU */
>> +                       unsigned long _zd_pad_3;        /* LRU */
>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>                         /** @pgmap: Points to the hosting device page map. */
>>                         struct dev_pagemap *pgmap;
>>                         unsigned long hmm_data;
>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>                 };
>>  
>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>
>> You don't use page->private or page->index, do you Dan?
> 
> page->private and page->index are used by HMM DEVICE pages.
> 

OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
dma-pinned information. Which might work. To recap, we need:

-- 1 bit for PageDmaPinned
-- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
-- N bits for a reference count

Those *could* be packed into a single 64-bit field, if really necessary.
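
To make the packing concrete, here is a minimal userspace sketch, assuming
the two flag bits sit at the bottom of the word and the remaining 62 bits
hold the count. The macro and function names are hypothetical and do not
come from any posted patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 = pinned, bit 1 = was-on-LRU, bits 2..63 = count. */
#define DMA_PINNED_BIT      (UINT64_C(1) << 0)
#define DMA_PINNED_WAS_LRU  (UINT64_C(1) << 1)
#define DMA_COUNT_SHIFT     2

/* Pack the two flags and the count into one 64-bit word.
 * Counts wider than 62 bits lose their top bits in the shift. */
static uint64_t dma_pack(bool pinned, bool was_lru, uint64_t count)
{
	uint64_t word = count << DMA_COUNT_SHIFT;

	if (pinned)
		word |= DMA_PINNED_BIT;
	if (was_lru)
		word |= DMA_PINNED_WAS_LRU;
	return word;
}

/* Accessors that undo the packing. */
static uint64_t dma_count(uint64_t word)
{
	return word >> DMA_COUNT_SHIFT;
}

static bool dma_pinned(uint64_t word)
{
	return (word & DMA_PINNED_BIT) != 0;
}

static bool dma_was_lru(uint64_t word)
{
	return (word & DMA_PINNED_WAS_LRU) != 0;
}
```

This only works when the field is 64 bits wide; on a 32-bit unsigned long
the same layout would leave 30 bits for the count.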


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:15           ` Matthew Wilcox
  2018-12-05  1:44             ` Jerome Glisse
@ 2018-12-05  5:52             ` Dan Williams
  1 sibling, 0 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-05  5:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 5:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > On 12/4/18 3:03 PM, Dan Williams wrote:
> > > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > does this proposal interact with those?
> >
> > Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >
> > If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > LRU field approach is unusable.
>
> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> damage:
>
> +++ b/include/linux/mm_types.h
> @@ -151,10 +151,12 @@ struct page {
>  #endif
>                 };
>                 struct {        /* ZONE_DEVICE pages */
> +                       unsigned long _zd_pad_2;        /* LRU */
> +                       unsigned long _zd_pad_3;        /* LRU */
> +                       unsigned long _zd_pad_1;        /* uses mapping */
>                         /** @pgmap: Points to the hosting device page map. */
>                         struct dev_pagemap *pgmap;
>                         unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
>                 };
>
>                 /** @rcu_head: You can use this to free a page by RCU. */
>
> You don't use page->private or page->index, do you Dan?

I don't use page->private, but page->index is used by the
memory-failure path to do an rmap.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 21:56     ` John Hubbard
  2018-12-04 23:03       ` Dan Williams
@ 2018-12-05 11:16       ` Jan Kara
  1 sibling, 0 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-05 11:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Matthew Wilcox, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 04-12-18 13:56:36, John Hubbard wrote:
> On 12/4/18 12:28 PM, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Introduces put_user_page(), which simply calls put_page().
> >> This provides a way to update all get_user_pages*() callers,
> >> so that they call put_user_page(), instead of put_page().
> >>
> >> Also introduces put_user_pages(), and a few dirty/locked variations,
> >> as a replacement for release_pages(), and also as a replacement
> >> for open-coded loops that release multiple pages.
> >> These may be used for subsequent performance improvements,
> >> via batching of pages to be released.
> >>
> >> This is the first step of fixing the problem described in [1]. The steps
> >> are:
> >>
> >> 1) (This patch): provide put_user_page*() routines, intended to be used
> >>    for releasing pages that were pinned via get_user_pages*().
> >>
> >> 2) Convert all of the call sites for get_user_pages*(), to
> >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> >>    call sites, and will take some time.
> >>
> >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> >>    implement tracking of these pages. This tracking will be separate from
> >>    the existing struct page refcounting.
> >>
> >> 4) Use the tracking and identification of these pages, to implement
> >>    special handling (especially in writeback paths) when the pages are
> >>    backed by a filesystem. Again, [1] provides details as to why that is
> >>    desirable.
> > 
> > I thought at Plumbers we talked about using a page bit to tag pages
> > that have had their reference count elevated by get_user_pages()? That
> > way there is no need to distinguish put_page() from put_user_page() it
> > just happens internally to put_page(). At the conference Matthew was
> > offering to free up a page bit for this purpose.
> > 
> 
> ...but then, upon further discussion in that same session, we realized that
> that doesn't help. You need a reference count. Otherwise a random put_page
> could affect your dma-pinned pages, etc, etc.

Exactly.

> I was not able to actually find any place where a single additional page
> bit would help our situation, which is why this still uses LRU fields for
> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.

So a single page bit could help you with performance. In 99% of cases there's
just one reference from GUP. So if you could store that info in page flags,
you could save yourself a relatively expensive removal from the LRU, and
putting it back, just to make space in struct page for a proper refcount. But
since you report that the performance isn't that horrible, I'd leave this idea
on the back burner. We can always implement it later if we find that we
need to improve the performance.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* RE: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-05  1:05   ` John Hubbard
@ 2018-12-05 14:08     ` David Laight
  2018-12-28  8:37       ` Pavel Machek
  0 siblings, 1 reply; 206+ messages in thread
From: David Laight @ 2018-12-05 14:08 UTC (permalink / raw)
  To: John Hubbard, john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel

From: John Hubbard
> Sent: 05 December 2018 01:06
> On 12/4/18 9:10 AM, David Laight wrote:
> > From: john.hubbard@gmail.com
> >> Sent: 04 December 2018 00:17
> >>
> >> Summary: I'd like these two patches to go into the next convenient cycle.
> >> I *think* that means 4.21.
> >>
> >> Details
> >>
> >> At the Linux Plumbers Conference, we talked about this approach [1], and
> >> the primary lingering concern was over performance. Tom Talpey helped me
> >> through a much more accurate run of the fio performance test, and now
> >> it's looking like an under 1% performance cost, to add and remove pages
> >> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
> >> we should be fine to start converting call sites.
> >>
> >> This patchset gets the conversion started. Both patches already had a fair
> >> amount of review.
> >
> > Shouldn't the commit message contain actual details of the change?
> >
> 
> Hi David,
> 
> This "patch 0000" is not a commit message, as it never shows up in git log.
> Each of the follow-up patches does have details about the changes it makes.

I think you should still describe the change - at least in summary.

The patch I looked at didn't really...
IIRC it still referred to external links.

> But maybe you are really asking for more background information, which I
> should have added in this cover letter. Here's a start:
> 
> https://lore.kernel.org/r/20181110085041.10071-1-jhubbard@nvidia.com

Yes, but links go stale....

> ...and it looks like this small patch series is not going to work out--I'm
> going to have to fall back to another RFC spin. So I'll be sure to include
> you and everyone on that. Hope that helps.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:57               ` John Hubbard
@ 2018-12-07  2:45                 ` John Hubbard
  2018-12-07 19:16                   ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-07  2:45 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/4/18 5:57 PM, John Hubbard wrote:
> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>> does this proposal interact with those?
>>>>
>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>>
>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>>> LRU field approach is unusable.
>>>
>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>> damage:
>>>
>>> +++ b/include/linux/mm_types.h
>>> @@ -151,10 +151,12 @@ struct page {
>>>  #endif
>>>                 };
>>>                 struct {        /* ZONE_DEVICE pages */
>>> +                       unsigned long _zd_pad_2;        /* LRU */
>>> +                       unsigned long _zd_pad_3;        /* LRU */
>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>>                         /** @pgmap: Points to the hosting device page map. */
>>>                         struct dev_pagemap *pgmap;
>>>                         unsigned long hmm_data;
>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>>                 };
>>>  
>>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>>
>>> You don't use page->private or page->index, do you Dan?
>>
>> page->private and page->index are used by HMM DEVICE pages.
>>
> 
> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> dma-pinned information. Which might work. To recap, we need:
> 
> -- 1 bit for PageDmaPinned
> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> -- N bits for a reference count
> 
> Those *could* be packed into a single 64-bit field, if really necessary.
> 

...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
However, it is still possible for this to work.

Matthew, can I have that bit now please? I'm about out of options, and now it will actually
solve the problem here.

Given:

1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
for that case. 

2) There is an independent bit available (according to Matthew). 

3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
   in that case.

4) get_user_pages() must work on ZONE_DEVICE and HMM pages.

5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
unsigned long field.

So that leads to the following approach:

-- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).

For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
available and reserved for this DMA pinned count. Basically _zd_pad_1 gets moved up and
optionally renamed:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 017ab82e36ca..b5dcd9398cae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -90,8 +90,8 @@ struct page {
                                 * are in use.
                                 */
                                struct {
-                                       unsigned long dma_pinned_flags;
-                                       atomic_t      dma_pinned_count;
+                                       unsigned long dma_pinned_flags; /* LRU.next */
+                                       atomic_t      dma_pinned_count; /* LRU.prev */
                                };
                        };
                        /* See page-flags.h for PAGE_MAPPING_FLAGS */
@@ -161,9 +161,9 @@ struct page {
                };
                struct {        /* ZONE_DEVICE pages */
                        /** @pgmap: Points to the hosting device page map. */
-                       struct dev_pagemap *pgmap;
-                       unsigned long hmm_data;
-                       unsigned long _zd_pad_1;        /* uses mapping */
+                       struct dev_pagemap *pgmap;      /* LRU.next */
+                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
+                       unsigned long hmm_data;         /* uses mapping */
                };
 
                /** @rcu_head: You can use this to free a page by RCU. */



-- Use an additional, fully independent page bit (from Matthew) for PageDmaPinned.
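
As a rough illustration of how the counter and the separate bit would fit
together, here is a userspace sketch using C11 atomics. It assumes a
stand-in struct (page_model) and a made-up bit position, both hypothetical.
It also updates the flag and the count as two separate atomic operations,
so it glosses over the races a real implementation would have to close:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of struct page. */
struct page_model {
	atomic_ulong flags;            /* models page->flags */
	atomic_ulong dma_pinned_count; /* models the second LRU word */
};

#define PG_DMA_PINNED_MODEL (1UL << 0) /* made-up bit position */

/* Take one pin: the first pin sets the flag, every pin bumps the count. */
static void model_pin(struct page_model *p)
{
	if (atomic_fetch_add(&p->dma_pinned_count, 1) == 0)
		atomic_fetch_or(&p->flags, PG_DMA_PINNED_MODEL);
}

/* Drop one pin; returns true if this was the last one and the flag was cleared. */
static bool model_unpin(struct page_model *p)
{
	unsigned long old = atomic_fetch_sub(&p->dma_pinned_count, 1);

	assert(old > 0); /* unpin without a matching pin is a caller bug */
	if (old == 1) {
		atomic_fetch_and(&p->flags, ~PG_DMA_PINNED_MODEL);
		return true;
	}
	return false;
}

/* Cheap test that needs only the flag word, not the counter. */
static bool model_is_pinned(struct page_model *p)
{
	return (atomic_load(&p->flags) & PG_DMA_PINNED_MODEL) != 0;
}
```

The point of keeping the bit outside the LRU words is that the pinned test
stays valid no matter which union member currently owns those words.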


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07  2:45                 ` John Hubbard
@ 2018-12-07 19:16                   ` Jerome Glisse
  2018-12-07 19:26                     ` Dan Williams
  2018-12-08  0:52                     ` John Hubbard
  0 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-07 19:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> On 12/4/18 5:57 PM, John Hubbard wrote:
> > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> >>>>> does this proposal interact with those?
> >>>>
> >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >>>>
> >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> >>>> LRU field approach is unusable.
> >>>
> >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> >>> damage:
> >>>
> >>> +++ b/include/linux/mm_types.h
> >>> @@ -151,10 +151,12 @@ struct page {
> >>>  #endif
> >>>                 };
> >>>                 struct {        /* ZONE_DEVICE pages */
> >>> +                       unsigned long _zd_pad_2;        /* LRU */
> >>> +                       unsigned long _zd_pad_3;        /* LRU */
> >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>                         /** @pgmap: Points to the hosting device page map. */
> >>>                         struct dev_pagemap *pgmap;
> >>>                         unsigned long hmm_data;
> >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>                 };
> >>>  
> >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> >>>
> >>> You don't use page->private or page->index, do you Dan?
> >>
> >> page->private and page->index are used by HMM DEVICE pages.
> >>
> > 
> > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> > dma-pinned information. Which might work. To recap, we need:
> > 
> > -- 1 bit for PageDmaPinned
> > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > -- N bits for a reference count
> > 
> > Those *could* be packed into a single 64-bit field, if really necessary.
> > 
> 
> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> However, it is still possible for this to work.
> 
> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> solve the problem here.
> 
> Given:
> 
> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
> for that case. 
> 
> 2) There is an independent bit available (according to Matthew). 
> 
> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
>    in that case.

To expand on this, HMM private pages are used for anonymous pages,
so the index and mapping fields have the values you expect for
such pages. Down the road I also want to support file-backed
pages with HMM private (mapping, private, index).

For HMM public, both anonymous and file-backed pages are supported
today (HMM public is only useful on platforms with something like
OpenCAPI, CCIX or NVLink ... so PowerPC for now).

> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.

get_user_pages() only needs to work with HMM public pages, not the
private ones, as we cannot allow _anyone_ to pin an HMM private page.
So on get_user_pages() on an HMM private page we get a page fault and
the page is migrated back to regular memory.


> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
> unsigned long field.
> 
> So that leads to the following approach:
> 
> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
> 
> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets moved up and
> optionally renamed:
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 017ab82e36ca..b5dcd9398cae 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -90,8 +90,8 @@ struct page {
>                                  * are in use.
>                                  */
>                                 struct {
> -                                       unsigned long dma_pinned_flags;
> -                                       atomic_t      dma_pinned_count;
> +                                       unsigned long dma_pinned_flags; /* LRU.next */
> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
>                                 };
>                         };
>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> @@ -161,9 +161,9 @@ struct page {
>                 };
>                 struct {        /* ZONE_DEVICE pages */
>                         /** @pgmap: Points to the hosting device page map. */
> -                       struct dev_pagemap *pgmap;
> -                       unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
> +                       struct dev_pagemap *pgmap;      /* LRU.next */
> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
> +                       unsigned long hmm_data;         /* uses mapping */

This breaks HMM today, as hmm_data would alias with the mapping field.
hmm_data can only be in LRU.prev.
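
To make the aliasing concrete, here is a minimal userspace sketch. It is
a toy model, not the real struct page: the type names (page_model,
page_model_proposed, list_head_model), the helper functions, and the
field priv (standing in for page->private) are all hypothetical, and it
assumes unsigned long is pointer-sized, as it is on Linux.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption of this sketch: unsigned long and pointers have equal size. */
static_assert(sizeof(unsigned long) == sizeof(void *),
              "sketch assumes pointer-sized unsigned long");

/* Simplified stand-in for struct list_head. */
struct list_head_model { void *next, *prev; };

/* Toy model of the current layout discussed in this thread. */
struct page_model {
    unsigned long flags;
    union {
        struct {                        /* page cache and anonymous pages */
            struct list_head_model lru;
            void *mapping;
            unsigned long index;
            unsigned long priv;         /* models page->private */
        };
        struct {                        /* ZONE_DEVICE pages, current order */
            void *pgmap;                /* overlays lru.next */
            unsigned long hmm_data;     /* overlays lru.prev */
            unsigned long zd_pad_1;     /* overlays mapping */
        };
    };
};

/* Same model with the field order from the quoted diff. */
struct page_model_proposed {
    unsigned long flags;
    union {
        struct {
            struct list_head_model lru;
            void *mapping;
            unsigned long index;
            unsigned long priv;
        };
        struct {
            void *pgmap;                /* overlays lru.next */
            unsigned long zd_pad_1;     /* overlays lru.prev */
            unsigned long hmm_data;     /* now overlays mapping */
        };
    };
};

/* True when hmm_data sits on lru.prev and the pad sits on mapping. */
bool model_current_layout_ok(void)
{
    size_t lru_prev = offsetof(struct page_model, lru) +
                      offsetof(struct list_head_model, prev);

    return offsetof(struct page_model, hmm_data) == lru_prev &&
           offsetof(struct page_model, zd_pad_1) ==
           offsetof(struct page_model, mapping);
}

/* True when the reordered layout puts hmm_data on top of mapping. */
bool model_proposed_hmm_data_aliases_mapping(void)
{
    return offsetof(struct page_model_proposed, hmm_data) ==
           offsetof(struct page_model_proposed, mapping);
}
```

In this model both helpers return true, which is the conflict described
above: after the reordering, hmm_data lands on the word that anonymous
and file-backed pages use for mapping.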

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:16                   ` Jerome Glisse
@ 2018-12-07 19:26                     ` Dan Williams
  2018-12-07 19:40                       ` Jerome Glisse
  2018-12-08  0:52                     ` John Hubbard
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-07 19:26 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > >>>>> does this proposal interact with those?
> > >>>>
> > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > >>>>
> > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > >>>> LRU field approach is unusable.
> > >>>
> > >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> > >>> damage:
> > >>>
> > >>> +++ b/include/linux/mm_types.h
> > >>> @@ -151,10 +151,12 @@ struct page {
> > >>>  #endif
> > >>>                 };
> > >>>                 struct {        /* ZONE_DEVICE pages */
> > >>> +                       unsigned long _zd_pad_2;        /* LRU */
> > >>> +                       unsigned long _zd_pad_3;        /* LRU */
> > >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> > >>>                         /** @pgmap: Points to the hosting device page map. */
> > >>>                         struct dev_pagemap *pgmap;
> > >>>                         unsigned long hmm_data;
> > >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> > >>>                 };
> > >>>
> > >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> > >>>
> > >>> You don't use page->private or page->index, do you Dan?
> > >>
> > >> page->private and page->index are use by HMM DEVICE page.
> > >>
> > >
> > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for
> > > dma-pinned information. Which might work. To recap, we need:
> > >
> > > -- 1 bit for PageDmaPinned
> > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > -- N bits for a reference count
> > >
> > > Those *could* be packed into a single 64-bit field, if really necessary.
> > >
> >
> > ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> > However, it is still possible for this to work.
> >
> > Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> > solve the problem here.
> >
> > Given:
> >
> > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> > That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required,
> > for that case.
> >
> > 2) There is an independent bit available (according to Matthew).
> >
> > 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter
> >    in that case.
>
> To expand on this: HMM private pages are used for anonymous pages,
> so the index and mapping fields have the values you would expect for
> such pages. Down the road I also want to support file-backed
> pages with HMM private (mapping, private, index).
>
> For HMM public, both anonymous and file-backed pages are supported
> today (HMM public is only useful on platforms with something like
> OpenCAPI, CCIX or NVLink ... so PowerPC for now).
>
> > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
>
> get_user_pages() only needs to work with HMM public pages, not the
> private ones, as we cannot allow _anyone_ to pin an HMM private page.

How does HMM enforce that? Because the kernel should not allow *any*
memory management facility to arbitrarily fail direct-I/O operations.
That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
any arch that plans to properly support DAX. I would classify any
memory management that can't support direct-I/O in the same
"experimental" category.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:26                     ` Dan Williams
@ 2018-12-07 19:40                       ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-07 19:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 11:26:34AM -0800, Dan Williams wrote:
> On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > >>>>> does this proposal interact with those?
> > > >>>>
> > > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > > >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > > >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > > >>>>
> > > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > > >>>> LRU field approach is unusable.
> > > >>>
> > > >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> > > >>> damage:
> > > >>>
> > > >>> +++ b/include/linux/mm_types.h
> > > >>> @@ -151,10 +151,12 @@ struct page {
> > > >>>  #endif
> > > >>>                 };
> > > >>>                 struct {        /* ZONE_DEVICE pages */
> > > >>> +                       unsigned long _zd_pad_2;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_3;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                         /** @pgmap: Points to the hosting device page map. */
> > > >>>                         struct dev_pagemap *pgmap;
> > > >>>                         unsigned long hmm_data;
> > > >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                 };
> > > >>>
> > > >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> > > >>>
> > > >>> You don't use page->private or page->index, do you Dan?
> > > >>
> > > >> page->private and page->index are use by HMM DEVICE page.
> > > >>
> > > >
> > > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for
> > > > dma-pinned information. Which might work. To recap, we need:
> > > >
> > > > -- 1 bit for PageDmaPinned
> > > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > > -- N bits for a reference count
> > > >
> > > > Those *could* be packed into a single 64-bit field, if really necessary.
> > > >
> > >
> > > ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> > > However, it is still possible for this to work.
> > >
> > > Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> > > solve the problem here.
> > >
> > > Given:
> > >
> > > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> > > That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required,
> > > for that case.
> > >
> > > 2) There is an independent bit available (according to Matthew).
> > >
> > > 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter
> > >    in that case.
> >
> > To expand on this: HMM private pages are used for anonymous pages,
> > so the index and mapping fields have the values you would expect for
> > such pages. Down the road I also want to support file-backed
> > pages with HMM private (mapping, private, index).
> >
> > For HMM public, both anonymous and file-backed pages are supported
> > today (HMM public is only useful on platforms with something like
> > OpenCAPI, CCIX or NVLink ... so PowerPC for now).
> >
> > > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> >
> > get_user_pages() only needs to work with HMM public pages, not the
> > private ones, as we cannot allow _anyone_ to pin an HMM private page.
> 
> How does HMM enforce that? Because the kernel should not allow *any*
> memory management facility to arbitrarily fail direct-I/O operations.
> That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
> for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
> any arch that plans to properly support DAX. I would classify any
> memory management that can't support direct-I/O in the same
> "experimental" category.

It does not fail direct-I/O. GUP sees a swap entry for the private
memory and behaves just as if the page had been swapped to disk, so I
am not introducing any new behavior.

With HMM pages everything just works as you would expect from the
CPU's point of view. It is just like swap.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:16                   ` Jerome Glisse
  2018-12-07 19:26                     ` Dan Williams
@ 2018-12-08  0:52                     ` John Hubbard
  2018-12-08  2:24                       ` Jerome Glisse
                                         ` (2 more replies)
  1 sibling, 3 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-08  0:52 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/7/18 11:16 AM, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
>> On 12/4/18 5:57 PM, John Hubbard wrote:
>>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>>>> does this proposal interact with those?
>>>>>>
>>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>>>>
>>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>>>>> LRU field approach is unusable.
>>>>>
>>>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>>>> damage:
>>>>>
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -151,10 +151,12 @@ struct page {
>>>>>  #endif
>>>>>                 };
>>>>>                 struct {        /* ZONE_DEVICE pages */
>>>>> +                       unsigned long _zd_pad_2;        /* LRU */
>>>>> +                       unsigned long _zd_pad_3;        /* LRU */
>>>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                         /** @pgmap: Points to the hosting device page map. */
>>>>>                         struct dev_pagemap *pgmap;
>>>>>                         unsigned long hmm_data;
>>>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                 };
>>>>>  
>>>>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>>>>
>>>>> You don't use page->private or page->index, do you Dan?
>>>>
>>>> page->private and page->index are use by HMM DEVICE page.
>>>>
>>>
>>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
>>> dma-pinned information. Which might work. To recap, we need:
>>>
>>> -- 1 bit for PageDmaPinned
>>> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
>>> -- N bits for a reference count
>>>
>>> Those *could* be packed into a single 64-bit field, if really necessary.
>>>
>>
>> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
>> However, it is still possible for this to work.
>>
>> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
>> solve the problem here.
>>
>> Given:
>>
>> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
>> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
>> for that case. 
>>
>> 2) There is an independent bit available (according to Matthew). 
>>
>> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
>>    in that case.
> 
> To expand on this: HMM private pages are used for anonymous pages,
> so the index and mapping fields have the values you would expect for
> such pages. Down the road I also want to support file-backed
> pages with HMM private (mapping, private, index).
> 
> For HMM public, both anonymous and file-backed pages are supported
> today (HMM public is only useful on platforms with something like
> OpenCAPI, CCIX or NVLink ... so PowerPC for now).
> 
>> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> 
> get_user_pages() only needs to work with HMM public pages, not the
> private ones, as we cannot allow _anyone_ to pin an HMM private page.
> So get_user_pages() on an HMM private page takes a page fault and
> the page is migrated back to regular memory.
> 
> 
>> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
>> unsigned long field.
>>
>> So that leads to the following approach:
>>
>> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
>> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
>>
>> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
>> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
>> optionally renamed:
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 017ab82e36ca..b5dcd9398cae 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -90,8 +90,8 @@ struct page {
>>                                  * are in use.
>>                                  */
>>                                 struct {
>> -                                       unsigned long dma_pinned_flags;
>> -                                       atomic_t      dma_pinned_count;
>> +                                       unsigned long dma_pinned_flags; /* LRU.next */
>> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
>>                                 };
>>                         };
>>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
>> @@ -161,9 +161,9 @@ struct page {
>>                 };
>>                 struct {        /* ZONE_DEVICE pages */
>>                         /** @pgmap: Points to the hosting device page map. */
>> -                       struct dev_pagemap *pgmap;
>> -                       unsigned long hmm_data;
>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>> +                       struct dev_pagemap *pgmap;      /* LRU.next */
>> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
>> +                       unsigned long hmm_data;         /* uses mapping */
> 
> This breaks HMM today as hmm_data would alias with mapping field.
> hmm_data can only be in LRU.prev
> 

I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
completely out of space. At this point, after thinking about it carefully, it seems clear
that it's time for a single, new field:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..1c789e324da8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -182,6 +182,9 @@ struct page {
        /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
        atomic_t _refcount;
 
+       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
+       atomic_t _dma_pinned_count;
+
 #ifdef CONFIG_MEMCG
        struct mem_cgroup *mem_cgroup;
 #endif


...because after all, the reason this is so difficult is that this fix has to work
in pretty much every configuration. get_user_pages() use is widespread, it's a very
general facility, and...it needs fixing.  And we're out of space. 

I'm going to send out an updated RFC that shows the latest, and I think it's going
to include the above.
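
As a rough illustration of what a dedicated counter would buy, here is a
minimal userspace sketch using C11 atomics. It is not the proposed
kernel implementation: page_model, model_get_user_page(),
model_put_user_page() and model_page_dma_pinned() are hypothetical
names, and the real get_user_pages()/put_user_page() paths do far more
than bump two counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the two counters in struct page. */
struct page_model {
    atomic_int refcount;          /* models _refcount */
    atomic_int dma_pinned_count;  /* models the new _dma_pinned_count */
};

/* Pin: take a normal reference and also record the DMA pin. */
void model_get_user_page(struct page_model *page)
{
    atomic_fetch_add(&page->refcount, 1);
    atomic_fetch_add(&page->dma_pinned_count, 1);
}

/* Unpin: drop the DMA pin first, then the normal reference. */
void model_put_user_page(struct page_model *page)
{
    int old = atomic_fetch_sub(&page->dma_pinned_count, 1);

    assert(old > 0);              /* catch an unbalanced put */
    atomic_fetch_sub(&page->refcount, 1);
}

/* A page is DMA-pinned while at least one pin is outstanding. */
bool model_page_dma_pinned(struct page_model *page)
{
    return atomic_load(&page->dma_pinned_count) > 0;
}
```

The point of the separate field in this model is that the "is this page
pinned for DMA?" question no longer depends on the LRU fields or on any
other part of the union, so it works the same way for every page type.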

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  0:52                     ` John Hubbard
@ 2018-12-08  2:24                       ` Jerome Glisse
  2018-12-10 10:28                         ` Jan Kara
  2018-12-08  5:18                       ` Matthew Wilcox
  2018-12-08  7:16                       ` Dan Williams
  2 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-08  2:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 04:52:42PM -0800, John Hubbard wrote:
> On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> >> On 12/4/18 5:57 PM, John Hubbard wrote:
> >>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
> >>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> >>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> >>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> >>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> >>>>>>> does this proposal interact with those?
> >>>>>>
> >>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> >>>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> >>>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >>>>>>
> >>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> >>>>>> LRU field approach is unusable.
> >>>>>
> >>>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> >>>>> damage:
> >>>>>
> >>>>> +++ b/include/linux/mm_types.h
> >>>>> @@ -151,10 +151,12 @@ struct page {
> >>>>>  #endif
> >>>>>                 };
> >>>>>                 struct {        /* ZONE_DEVICE pages */
> >>>>> +                       unsigned long _zd_pad_2;        /* LRU */
> >>>>> +                       unsigned long _zd_pad_3;        /* LRU */
> >>>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>>>                         /** @pgmap: Points to the hosting device page map. */
> >>>>>                         struct dev_pagemap *pgmap;
> >>>>>                         unsigned long hmm_data;
> >>>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>>>                 };
> >>>>>  
> >>>>>                 /** @rcu_head: You can use this to free a page by RCU. */
> >>>>>
> >>>>> You don't use page->private or page->index, do you Dan?
> >>>>
> >>>> page->private and page->index are use by HMM DEVICE page.
> >>>>
> >>>
> >>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> >>> dma-pinned information. Which might work. To recap, we need:
> >>>
> >>> -- 1 bit for PageDmaPinned
> >>> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> >>> -- N bits for a reference count
> >>>
> >>> Those *could* be packed into a single 64-bit field, if really necessary.
> >>>
> >>
> >> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> >> However, it is still possible for this to work.
> >>
> >> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> >> solve the problem here.
> >>
> >> Given:
> >>
> >> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> >> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
> >> for that case. 
> >>
> >> 2) There is an independent bit available (according to Matthew). 
> >>
> >> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
> >>    in that case.
> > 
> > To expand on this: HMM private pages are used for anonymous pages,
> > so the index and mapping fields have the values you would expect for
> > such pages. Down the road I also want to support file-backed
> > pages with HMM private (mapping, private, index).
> > 
> > For HMM public, both anonymous and file-backed pages are supported
> > today (HMM public is only useful on platforms with something like
> > OpenCAPI, CCIX or NVLink ... so PowerPC for now).
> > 
> >> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> > 
> > get_user_pages() only needs to work with HMM public pages, not the
> > private ones, as we cannot allow _anyone_ to pin an HMM private page.
> > So get_user_pages() on an HMM private page takes a page fault and
> > the page is migrated back to regular memory.
> > 
> > 
> >> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
> >> unsigned long field.
> >>
> >> So that leads to the following approach:
> >>
> >> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
> >> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
> >>
> >> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
> >> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
> >> optionally renamed:
> >>
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 017ab82e36ca..b5dcd9398cae 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -90,8 +90,8 @@ struct page {
> >>                                  * are in use.
> >>                                  */
> >>                                 struct {
> >> -                                       unsigned long dma_pinned_flags;
> >> -                                       atomic_t      dma_pinned_count;
> >> +                                       unsigned long dma_pinned_flags; /* LRU.next */
> >> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
> >>                                 };
> >>                         };
> >>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> >> @@ -161,9 +161,9 @@ struct page {
> >>                 };
> >>                 struct {        /* ZONE_DEVICE pages */
> >>                         /** @pgmap: Points to the hosting device page map. */
> >> -                       struct dev_pagemap *pgmap;
> >> -                       unsigned long hmm_data;
> >> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >> +                       struct dev_pagemap *pgmap;      /* LRU.next */
> >> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
> >> +                       unsigned long hmm_data;         /* uses mapping */
> > 
> > This breaks HMM today as hmm_data would alias with mapping field.
> > hmm_data can only be in LRU.prev
> > 
> 
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> completely out of space. At this point, after thinking about it carefully, it seems clear
> that it's time for a single, new field:
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
>         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>         atomic_t _refcount;
>  
> +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +       atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
>         struct mem_cgroup *mem_cgroup;
>  #endif
> 
> 
> ...because after all, the reason this is so difficult is that this fix has to work
> in pretty much every configuration. get_user_pages() use is widespread, it's a very
> general facility, and...it needs fixing.  And we're out of space. 
> 
> I'm going to send out an updated RFC that shows the latest, and I think it's going
> to include the above.

Another crazy idea: why not treat GUP as another mapping of the page?
The caller of GUP would have to provide either a fake anon_vma struct or
a fake vma struct (or both for a PRIVATE mapping of a file, where you can
have a mix of both private and file pages, thus only if it is a read-only
GUP) that would get added to the list of existing mappings.

So the flow would be:
    somefunction_thatuse_gup()
    {
        ...
        GUP(_fast)(vma, ..., fake_anon, fake_vma);
        ...
    }

    GUP(vma, ..., fake_anon, fake_vma)
    {
        if (vma->flags == ANON) {
            // Add the fake anon vma to the anon vma chain as a child
            // of current vma
        } else {
            // Add the fake vma to the mapping tree
        }

        // The existing GUP, except that it now increments mapcount
        // and not refcount
        GUP_old(..., &nanonymous, &nfiles);

        atomic_add(nanonymous, &fake_anon->refcount);
        atomic_add(nfiles, &fake_vma->refcount);

        return nanonymous + nfiles;
    }
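
The accounting in the flow above can be sketched in userspace as
follows. This is only a toy model under stated assumptions: page_model,
fake_mapping_model and model_gup() are hypothetical names, the anon flag
replaces the real vma/anon_vma checks, and nothing here touches the
rmap. It just shows that pages get a mapcount increment each, while the
fake structs are updated once per call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: only the fields this sketch needs. */
struct page_model {
    atomic_int mapcount;
    bool anon;                    /* true for anonymous, false for file */
};

/* Toy stand-in for the fake anon_vma / fake vma supplied by the caller. */
struct fake_mapping_model {
    atomic_int refcount;
};

/* Pin npages pages; returns the number of pages accounted for. */
size_t model_gup(struct page_model **pages, size_t npages,
                 struct fake_mapping_model *fake_anon,
                 struct fake_mapping_model *fake_vma)
{
    int nanonymous = 0, nfiles = 0;

    assert(fake_anon != NULL && fake_vma != NULL);

    /* Per-page work: bump mapcount instead of refcount. */
    for (size_t i = 0; i < npages; i++) {
        atomic_fetch_add(&pages[i]->mapcount, 1);
        if (pages[i]->anon)
            nanonymous++;
        else
            nfiles++;
    }

    /* Per-call work: one update for each fake struct. */
    atomic_fetch_add(&fake_anon->refcount, nanonymous);
    atomic_fetch_add(&fake_vma->refcount, nfiles);

    return (size_t)(nanonymous + nfiles);
}
```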

I believe all call sites of GUP could be updated. They fall into 2
categories:
    - fake_anon/fake_vma on the stack (direct I/O and a few others that
      just do GUP inside their work function and drop the reference
      there too)
    - fake_anon/fake_vma as part of an object they have, i.e. GUP
      users that have some kind of struct where they keep the result
      of the GUP around (most users in the drivers directory fall under
      that)


A few nice bonuses:
    [B1] GUP_pin <= (mapcount - refcount), i.e. it gives a bound on the
         number of GUPs on the page (some other part of the kernel might
         still temporarily inc the refcount without a mapcount increase)
    [B2] can add a revoke callback as part of the fake anon_vma/
         vma structure (if the existing GUP user can do that, or maybe
         something like an emergency revoke when the memory is poisoned)
    [B3] the extra cost is once per GUP call, not per page, so the impact
         on performance should definitely be better
    [B4] no need to modify the LRU or complicate the inner GUP code,
         only the preamble.

A few issues with that proposal:
    [I1] need to check mapcount in the page free code path to avoid
         freeing the page if refcount reaches 0 before all the GUP
         users unmap the page; the page is now in some zombie state,
         i.e. refcount = 0 and mapcount > 0
    [I2] KVM seems to use GUP for weird reasons; it might be better
         to convert KVM to use something other than GUP that has the
         same end result from KVM's point of view (I believe it uses it
         to force a page fault on the host page). Maybe we can work
         with the KVM folks and see if we can provide them with an API
         that actually does what they want, instead of them using GUP
         for its side effect
    [I3] ANON pages will need special handling, as this will confuse
         mm code paths that deal with COW pages ... the page is not
         COW but still has mapcount > 1
    [I4] GUP must be per vma (not an issue everywhere); we can provide
         helpers to iterate over virtual addresses by vma
    [I5] ascertaining that a page is under GUP might be costly; the
         code would look like:
            bool page_is_guped(struct page *page)
            {
                if (page_mapcount(page) > page_refcount(page)) {
                    return true;
                }
                // Unknown: we have to walk the reverse mapping to see
                // if there are any fake anon or fake vma, and even
                // if there are, we could not say for sure whether they
                // apply to the page under consideration. We would
                // have to assume so, unless:
                //
                // GUP users keep around the array they used to store
                // the GUP results; then we can check if the page is
                // in there.
            }
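
A self-contained userspace version of that check, including the
"users keep their result arrays" fallback, might look like the sketch
below. Everything here is hypothetical: page_model, gup_record_model and
model_page_is_guped() are illustrative names, plain ints replace the
kernel's atomic counters, and a flat array of records replaces the
reverse-mapping walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page with plain counters. */
struct page_model {
    int mapcount;
    int refcount;
};

/* Toy record of one GUP call: the page array the GUP user kept. */
struct gup_record_model {
    struct page_model **pages;
    size_t npages;
};

/* Returns true if the page is known to be under GUP in this model. */
bool model_page_is_guped(const struct page_model *page,
                         const struct gup_record_model *records,
                         size_t nrecords)
{
    assert(page != NULL);

    /* Cheap test: GUP raises mapcount without raising refcount. */
    if (page->mapcount > page->refcount)
        return true;

    /* Costly fallback: search every stored GUP result array. */
    for (size_t r = 0; r < nrecords; r++) {
        for (size_t i = 0; i < records[r].npages; i++) {
            if (records[r].pages[i] == page)
                return true;
        }
    }

    return false;
}
```

The fallback is linear in the total number of pinned pages, which is
the cost concern raised in [I5].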

There are probably other issues I cannot think of right now ...


Maybe even better would be to add a pointer to struct address_space
and re-arrange struct anon_vma to move the unsigned degree to the top and
define some flag in it (I don't think we want to grow the anon_vma struct),
so that we can differentiate a fake anon_vma from the others by looking at
the first word.

The patchset would probably look like:
    [1-N] Convert all put_page() to put_user_page() with an extra void
          pointer as a first step (to allow converting users one at a
          time)
    [N-M] convert every GUP user to provide the fake struct (from the
          stack or as part of their object that tracks the GUP result)
    [M-O] patches to add all the helpers and changes inside mm to
          handle the fake vma and fake anon_vma and the implications of
          having mapcount > refcount (not all the time)
    [P] convert GUP to add the fake to the anon_vma/mapping and convert
        GUP to inc mapcount and not refcount

Note that GUP users do not necessarily need to keep the fake anon
or vma struct as part of their own struct. They can use a key
system, i.e.:
    put_user_page_with_key(page, key);
    versus
    put_user_page(page, fake_anon/fake_vma);

Then put_user_page would walk the list of mappings of the page
until it finds the fake anon or fake vma that has the matching
key, dec the refcount of that fake struct, and free it once
it reaches zero ...
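
The key lookup could be sketched roughly as below. This is a userspace
model only: struct fake_mapping, struct model_page and add_fake() are
hypothetical names, and hanging the fake structs off a per-page list is a
simplification of the anon_vma chain / mapping tree described above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a fake anon_vma / fake vma. */
struct fake_mapping {
    unsigned long key;
    int refcount;               /* pins taken through this fake struct */
    struct fake_mapping *next;
};

/* Hypothetical page model: just the list head and a mapcount. */
struct model_page {
    struct fake_mapping *fakes;
    int mapcount;
};

/* Attach a fake struct holding 'refcount' pins; returns NULL on OOM. */
static struct fake_mapping *add_fake(struct model_page *page,
                                     unsigned long key, int refcount)
{
    struct fake_mapping *f = malloc(sizeof(*f));

    if (!f)
        return NULL;
    f->key = key;
    f->refcount = refcount;
    f->next = page->fakes;
    page->fakes = f;
    page->mapcount += refcount;
    return f;
}

/* Drop one pin taken under 'key'; returns false if no such key exists. */
static bool put_user_page_with_key(struct model_page *page, unsigned long key)
{
    for (struct fake_mapping **link = &page->fakes; *link;
         link = &(*link)->next) {
        struct fake_mapping *f = *link;

        if (f->key != key)
            continue;
        page->mapcount--;
        if (--f->refcount == 0) {
            *link = f->next;    /* unlink, then free once it hits zero */
            free(f);
        }
        return true;
    }
    return false;
}
```

The walk is linear in the number of fake structs attached to the page,
which is the price paid for not storing the pointer in the GUP user.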

Anyway, there are a few things we can do to alleviate the pain for the
GUP users.


Maybe this is crazy but this is what i have without needing to add
a field to struct page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  0:52                     ` John Hubbard
  2018-12-08  2:24                       ` Jerome Glisse
@ 2018-12-08  5:18                       ` Matthew Wilcox
  2018-12-12 19:13                         ` John Hubbard
  2018-12-08  7:16                       ` Dan Williams
  2 siblings, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-08  5:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 04:52:42PM -0800, John Hubbard wrote:
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> completely out of space. At this point, after thinking about it carefully, it seems clear
> that it's time for a single, new field:

Sorry for not replying earlier; I'm travelling and have had trouble
keeping on top of my mail.

Adding this field will grow struct page by 4-8 bytes, so it will no
longer be 64 bytes.  This isn't an acceptable answer.
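
To illustrate the size argument, here is a toy layout, assuming a 64-byte
baseline made of eight 8-byte words. struct model_page and
struct model_page_grown are hypothetical and do not reproduce the real
struct page; the point is only that a 4-byte counter costs 4 to 8 bytes
once tail padding is counted:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte layout: eight 8-byte words. */
struct model_page {
    uint64_t words[8];
};

/* Same layout plus a 4-byte counter, mirroring the proposed atomic_t. */
struct model_page_grown {
    uint64_t words[8];
    int32_t dma_pinned_count;
};

_Static_assert(sizeof(struct model_page) == 64, "baseline model is 64 bytes");

/* Bytes added per page, including any tail padding the compiler inserts. */
static size_t growth_per_page(void)
{
    return sizeof(struct model_page_grown) - sizeof(struct model_page);
}

/* Total extra metadata for 'ram_bytes' of memory in 'page_size' pages. */
static uint64_t growth_total(uint64_t ram_bytes, uint64_t page_size)
{
    return (ram_bytes / page_size) * growth_per_page();
}
```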

We have a few options for bits.  One is that we have (iirc) two
bits available in page->flags on 32-bit.  That'll force a few more
configurations into using _last_cpupid and/or page_ext.  I'm not a huge
fan of this approach.

The second is to use page->lru.next bit 1.  This requires some care
because m68k allows misaligned pointers.  If the list_head that it's
joined to is misaligned, we'll be in trouble.  This can get tricky because
some pages are attached to list_heads which are on the stack ... and I
don't think gcc guarantees __aligned attributes work for stack variables.

The third is to use page->lru.prev bit 0.  We'd want to switch pgmap
and hmm_data around to make this work, and we'd want to record this
in mm_types.h so nobody tries to use a field which aliases with
page->lru.prev and has bit 0 set on a page which can be mapped to
userspace (which I currently believe to be true).

The fourth is to use a bit in page->flags for 64-bit and a bit in
page_ext->flags for 32-bit.  Or we could get rid of page_ext and grow
struct page with a ->flags2 on 32-bit.

Fifth, it isn't clear to me how many bits might be left in ->_last_cpupid
at this point, and perhaps there's scope for using a bit in there.
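
For the pointer-bit options, the mechanics would look something like this
userspace sketch. struct model_list stands in for struct list_head, and
MODEL_PINNED_BIT and the helper names are made up for illustration. It only
works if every list node is at least 2-byte aligned, which is exactly the
m68k concern raised above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct list_head. */
struct model_list {
    struct model_list *next, *prev;
};

#define MODEL_PINNED_BIT ((uintptr_t)1)   /* bit 0 of ->prev; made-up name */

/* Stash the flag in the low bit of the prev pointer. */
static void set_pinned(struct model_list *n)
{
    n->prev = (struct model_list *)((uintptr_t)n->prev | MODEL_PINNED_BIT);
}

static void clear_pinned(struct model_list *n)
{
    n->prev = (struct model_list *)((uintptr_t)n->prev & ~MODEL_PINNED_BIT);
}

static bool is_pinned(const struct model_list *n)
{
    return ((uintptr_t)n->prev & MODEL_PINNED_BIT) != 0;
}

/* Every list walker must strip the flag before following the pointer. */
static struct model_list *real_prev(const struct model_list *n)
{
    return (struct model_list *)((uintptr_t)n->prev & ~MODEL_PINNED_BIT);
}
```

The cost of this trick is that any code reading ->prev directly has to go
through something like real_prev(), which is why it needs recording in
mm_types.h.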

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
>         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>         atomic_t _refcount;
>  
> +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +       atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
>         struct mem_cgroup *mem_cgroup;
>  #endif
> 
> 
> ...because after all, the reason this is so difficult is that this fix has to work
> in pretty much every configuration. get_user_pages() use is widespread, it's a very
> general facility, and...it needs fixing.  And we're out of space. 
> 
> I'm going to send out an updated RFC that shows the latest, and I think it's going
> to include the above.
> 
> -- 
> thanks,
> John Hubbard
> NVIDIA
> 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  7:16                       ` Dan Williams
@ 2018-12-08 16:33                         ` Jerome Glisse
  2018-12-08 16:48                           ` Christoph Hellwig
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-08 16:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 11:16:32PM -0800, Dan Williams wrote:
> On Fri, Dec 7, 2018 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> [..]
> > I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> > completely out of space. At this point, after thinking about it carefully, it seems clear
> > that it's time for a single, new field:
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5ed8f6292a53..1c789e324da8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -182,6 +182,9 @@ struct page {
> >         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> >         atomic_t _refcount;
> >
> > +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> > +       atomic_t _dma_pinned_count;
> > +
> >  #ifdef CONFIG_MEMCG
> >         struct mem_cgroup *mem_cgroup;
> >  #endif
> >
> >
> > ...because after all, the reason this is so difficult is that this fix has to work
> > in pretty much every configuration. get_user_pages() use is widespread, it's a very
> > general facility, and...it needs fixing.  And we're out of space.
> 
> HMM seems entirely too greedy in this regard. Especially with zero
> upstream users. When can we start to delete the pieces of HMM that
> have no upstream consumers? I would think that would be 4.21 / 5.0 as
> there needs to be some forcing function. We can always re-add pieces
> of HMM with it's users when / if they arrive.

A patchset to use HMM inside nouveau has already been posted; some
of the bits have already made it upstream and more are lined up for
the next merge window.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:33                         ` Jerome Glisse
@ 2018-12-08 16:48                           ` Christoph Hellwig
  2018-12-08 17:47                             ` Jerome Glisse
  2018-12-08 18:09                             ` Dan Williams
  0 siblings, 2 replies; 206+ messages in thread
From: Christoph Hellwig @ 2018-12-08 16:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> Patchset to use HMM inside nouveau have already been posted, some
> of the bits have already made upstream and more are line up for
> next merge window.

Even with that it is a relative fringe feature compared to making
something like get_user_pages(), which is literally used everywhere,
actually work properly.

So I think we need to kick out HMM here and just find another place for
it to store data.

And just to make clear that I'm not picking on just this: the same is
true, to just a slightly smaller extent, for the pgmap.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:48                           ` Christoph Hellwig
@ 2018-12-08 17:47                             ` Jerome Glisse
  2018-12-08 18:26                               ` Christoph Hellwig
  2018-12-08 18:09                             ` Dan Williams
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-08 17:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 08:48:25AM -0800, Christoph Hellwig wrote:
> On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > Patchset to use HMM inside nouveau have already been posted, some
> > of the bits have already made upstream and more are line up for
> > next merge window.
> 
> Even with that it is a relative fringe feature compared to making
> something like get_user_pages() that is literally used every to actually
> work properly.
> 
> So I think we need to kick out HMM here and just find another place for
> it to store data.
> 
> And just to make clear that I'm not picking just on this - the same is
> true to a just a little smaller extent for the pgmap..

Most of the users of GUP are well behaved (everything under drivers/gpu,
the mellanox driver and many others), i.e. they abide by the mmu notifier
invalidation callbacks. There are a handful of device drivers that thought
they could just do GUP and ignore the mmu notifier part, and those are the
ones being problematic. So to me it feels like bystanders are being shot
for no good reason.

I proposed an alternative solution to this GUP thing and I don't think it
is a crazy one. Thinking about it, we only need to do that for file-backed
pages, so we can leave the anonymous page case untouched. This would put a
small burden on the users of GUP (by the way, I am working on removing GUP
from drivers/gpu and other well-behaved drivers; patches are posted on
dri-devel for some of the GPUs already).

So why not explore my idea and see if there are any roadblocks on it?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:48                           ` Christoph Hellwig
  2018-12-08 17:47                             ` Jerome Glisse
@ 2018-12-08 18:09                             ` Dan Williams
  2018-12-08 18:12                               ` Christoph Hellwig
  2018-12-11  6:18                               ` Dave Chinner
  1 sibling, 2 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-08 18:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jérôme Glisse, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 8, 2018 at 8:48 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > Patchset to use HMM inside nouveau have already been posted, some
> > of the bits have already made upstream and more are line up for
> > next merge window.
>
> Even with that it is a relative fringe feature compared to making
> something like get_user_pages() that is literally used every to actually
> work properly.
>
> So I think we need to kick out HMM here and just find another place for
> it to store data.
>
> And just to make clear that I'm not picking just on this - the same is
> true to a just a little smaller extent for the pgmap..

Fair enough, I cringed as I took a full pointer for that use case, I'm
happy to look at ways of consolidating or dropping that usage.

Another fix that may put pressure on 'struct page' is resolving the
untenable situation of dax being incompatible with reflink, i.e.
reflink currently requires page-cache pages. Dave has talked about
silently establishing page-cache entries when a dax-page is cow'd for
reflink, but I wonder if we could go the other way and introduce the
mechanism of a page belonging to multiple mappings simultaneously and
managed by the filesystem.

Both HMM and ZONE_DEVICE in general are guilty of side-stepping the mm
and I'm in favor of undoing that as much as possible.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 18:09                             ` Dan Williams
@ 2018-12-08 18:12                               ` Christoph Hellwig
  2018-12-11  6:18                               ` Dave Chinner
  1 sibling, 0 replies; 206+ messages in thread
From: Christoph Hellwig @ 2018-12-08 18:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jérôme Glisse, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 10:09:26AM -0800, Dan Williams wrote:
> Another fix that may put pressure 'struct page' is resolving the
> untenable situation of dax being incompatible with reflink, i.e.
> reflink currently requires page-cache pages. Dave has talked about
> silently establishing page-cache entries when a dax-page is cow'd for
> reflink, but I wonder if we could go the other way and introduce the
> mechanism of a page belonging to multiple mappings simultaneously and
> managed by the filesystem.

FYI, I had a prototype for DAX + reflink that didn't require
the page cache, although it badly reimplemented parts of it.  But
that was a long time ago, before we started requiring struct page
for the DAX memory.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 17:47                             ` Jerome Glisse
@ 2018-12-08 18:26                               ` Christoph Hellwig
  2018-12-08 18:45                                 ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Christoph Hellwig @ 2018-12-08 18:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christoph Hellwig, Dan Williams, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 12:47:30PM -0500, Jerome Glisse wrote:
> Most of the user of GUP are well behave (everything under driver/gpu and
> so is mellanox driver and many other) ie they abide by mmu notifier
> invalidation call backs. They are a handfull of device driver that thought
> they could just do GUP and ignore the mmu notifier part and those are the
> one being problematic. So to me it feels like bystander are be shot for no
> good reasons.

get_user_pages is used by every single direct I/O, and while the race
windows in that case are small they very much exist.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 18:26                               ` Christoph Hellwig
@ 2018-12-08 18:45                                 ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-08 18:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 10:26:04AM -0800, Christoph Hellwig wrote:
> On Sat, Dec 08, 2018 at 12:47:30PM -0500, Jerome Glisse wrote:
> > Most of the user of GUP are well behave (everything under driver/gpu and
> > so is mellanox driver and many other) ie they abide by mmu notifier
> > invalidation call backs. They are a handfull of device driver that thought
> > they could just do GUP and ignore the mmu notifier part and those are the
> > one being problematic. So to me it feels like bystander are be shot for no
> > good reasons.
> 
> get_user_pages is used by every single direct I/O, and while the race
> windows in that case are small they very much exists.

Yes, and my proposal allows fixing that in an even better way than
the pin count would, i.e. by providing a callback so that write back
can wait on direct I/O. As you said, for direct I/O it is a small
window, so it would be fine to have write back wait on it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  2:24                       ` Jerome Glisse
@ 2018-12-10 10:28                         ` Jan Kara
  2018-12-12 15:03                           ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Jan Kara @ 2018-12-10 10:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Matthew Wilcox, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> Another crazy idea, why not treating GUP as another mapping of the page
> and caller of GUP would have to provide either a fake anon_vma struct or
> a fake vma struct (or both for PRIVATE mapping of a file where you can
> have a mix of both private and file page thus only if it is a read only
> GUP) that would get added to the list of existing mapping.
>
> So the flow would be:
>     somefunction_thatuse_gup()
>     {
>         ...
>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
>         ...
>     }
> 
>     GUP(vma, ..., fake_anon, fake_vma)
>     {
>         if (vma->flags == ANON) {
>             // Add the fake anon vma to the anon vma chain as a child
>             // of current vma
>         } else {
>             // Add the fake vma to the mapping tree
>         }
> 
>         // The existing GUP except that now it inc mapcount and not
>         // refcount
>         GUP_old(..., &nanonymous, &nfiles);
> 
>         atomic_add(&fake_anon->refcount, nanonymous);
>         atomic_add(&fake_vma->refcount, nfiles);
> 
>         return nanonymous + nfiles;
>     }

Thanks for your idea! This is actually something like I was suggesting back
at LSF/MM in Deer Valley. There were two downsides to this I remember
people pointing out:

1) This cannot really work with __get_user_pages_fast(). You're not allowed
to get necessary locks to insert new entry into the VMA tree in that
context. So essentially we'd lose get_user_pages_fast() functionality.

2) The overhead e.g. for direct IO may be noticeable. You need to allocate
the fake tracking VMA, get VMA interval tree lock, insert into the tree.
Then on IO completion you need to queue work to unpin the pages again as you
cannot remove the fake VMA directly from interrupt context where the IO is
completed.

You are right that the cost could be amortized if gup() is called for
multiple consecutive pages however for small IOs there's no help...

So this approach doesn't look like a win to me over using counter in struct
page and I'd rather try looking into squeezing HMM public page usage of
struct page so that we can fit that gup counter there as well. I know that
it may be easier said than done...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 18:09                             ` Dan Williams
  2018-12-08 18:12                               ` Christoph Hellwig
@ 2018-12-11  6:18                               ` Dave Chinner
  1 sibling, 0 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-11  6:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jérôme Glisse, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 10:09:26AM -0800, Dan Williams wrote:
> On Sat, Dec 8, 2018 at 8:48 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > > Patchset to use HMM inside nouveau have already been posted, some
> > > of the bits have already made upstream and more are line up for
> > > next merge window.
> >
> > Even with that it is a relative fringe feature compared to making
> > something like get_user_pages() that is literally used every to actually
> > work properly.
> >
> > So I think we need to kick out HMM here and just find another place for
> > it to store data.
> >
> > And just to make clear that I'm not picking just on this - the same is
> > true to a just a little smaller extent for the pgmap..
> 
> Fair enough, I cringed as I took a full pointer for that use case, I'm
> happy to look at ways of consolidating or dropping that usage.
> 
> Another fix that may put pressure 'struct page' is resolving the
> untenable situation of dax being incompatible with reflink, i.e.
> reflink currently requires page-cache pages. Dave has talked about
> silently establishing page-cache entries when a dax-page is cow'd for
> reflink,

I think you've got it the wrong way around there :)

Think of a set of files with the following physical block mappings:

index		0  1  2  3  4  5
inode W		A  B  C  D  E  F
inode X		B  C  D  E  F  A
inode Y		C  D  E  F  A  B
inode Z		D  E  F  A  B  C

Basically, each block has 4 references (one from each file), and
each reference to a block is from a different file offset. Now, with
DAX, each inode wants to put the same struct page into their own
address space mapping tree but have different page indexes.

i.e. for block A, inode W wants page->index = 0, X wants 5, Y wants
4 and Z wants 3.

This is not possible with a single struct page, and it is where the
problem with DAX, struct pages and physically shared data lies.
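
The table above can be reproduced with a few lines of arithmetic. This is
only a sketch: blocks A..F are numbered 0..5 and inodes W..Z are numbered
0..3, and block_at(), index_of() and distinct_indexes() are names invented
here.

```c
#define NBLOCKS 6   /* physical blocks A..F, numbered 0..5 */
#define NINODES 4   /* inodes W, X, Y, Z, numbered 0..3 */

/* Physical block stored at file offset 'index' of 'inode' (rotated layout). */
static int block_at(int inode, int index)
{
    return (index + inode) % NBLOCKS;
}

/* The page->index that 'inode' would want for physical 'block'. */
static int index_of(int inode, int block)
{
    return (block - inode + NBLOCKS) % NBLOCKS;
}

/* How many different page->index values one block would need at once. */
static int distinct_indexes(int block)
{
    int seen[NBLOCKS] = { 0 };
    int count = 0;

    for (int inode = 0; inode < NINODES; inode++) {
        int idx = index_of(inode, block);

        if (!seen[idx]) {
            seen[idx] = 1;
            count++;
        }
    }
    return count;
}
```

For block A (number 0), index_of() gives 0, 5, 4 and 3 for W, X, Y and Z,
so distinct_indexes(0) is 4 while a struct page has room for one index.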

This is where the page cache is currently required - each mapping
gets its own copy of the shared block in volatile RAM, but when
sharing is broken (by COW) we can toss the volatile copy and go back
to using DAX for the newly allocated, single owner {block, struct
page} tuple that replaces the shared page.

> but I wonder if we could go the other way and introduce the
> mechanism of a page belonging to multiple mappings simultaneously and
> managed by the filesystem.

That's pretty much what I suggested at LSFMM. We do lookups for
shared extent mappings through the filesystem buffer cache (which is
indexed by physical location) and hold the primary struct page in
the filesystem buffer cache. We then hand out dynamically allocated
struct pages back to the caller that point to the same physical page
and place them in each inode's address space. When a write fault
occurs, we allocate a new block, grab the physical struct page, copy
the data across, and release the dynamically allocated read-only
struct page and reference to the primary struct page held in the
filesystem buffer cache.

It's essentially the same model "cached page per inode address
space" as using volatile RAM copies via the page cache, except
the struct pages point back to the same physical location rather
than having their own temporary, volatile copy of the data.
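
As a rough userspace model of that scheme (struct primary_page,
struct alias_page, alias_get() and alias_put() are all hypothetical names,
and real code would need locking and the actual buffer cache), each inode
gets its own small alias carrying its own index while sharing one physical
page:

```c
#include <stdlib.h>

/* Hypothetical primary page held in the filesystem buffer cache. */
struct primary_page {
    unsigned long pfn;   /* physical location shared by all aliases */
    int refs;            /* buffer cache reference plus one per alias */
};

/* Hypothetical dynamically allocated per-inode page. */
struct alias_page {
    struct primary_page *primary;
    unsigned long index; /* offset in the owning inode's address space */
};

/* Hand out a read-only alias for one inode; returns NULL on OOM. */
static struct alias_page *alias_get(struct primary_page *primary,
                                    unsigned long index)
{
    struct alias_page *a = malloc(sizeof(*a));

    if (!a)
        return NULL;
    a->primary = primary;
    a->index = index;
    primary->refs++;
    return a;
}

/* Called when sharing is broken (e.g. on a write fault) for this inode. */
static void alias_put(struct alias_page *a)
{
    a->primary->refs--;
    free(a);
}
```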

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-10 10:28                         ` Jan Kara
@ 2018-12-12 15:03                           ` Jerome Glisse
  2018-12-12 16:27                             ` Dan Williams
  2018-12-12 21:46                             ` Dave Chinner
  0 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 15:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > Another crazy idea, why not treating GUP as another mapping of the page
> > and caller of GUP would have to provide either a fake anon_vma struct or
> > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > have a mix of both private and file page thus only if it is a read only
> > GUP) that would get added to the list of existing mapping.
> >
> > So the flow would be:
> >     somefunction_thatuse_gup()
> >     {
> >         ...
> >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> >         ...
> >     }
> > 
> >     GUP(vma, ..., fake_anon, fake_vma)
> >     {
> >         if (vma->flags == ANON) {
> >             // Add the fake anon vma to the anon vma chain as a child
> >             // of current vma
> >         } else {
> >             // Add the fake vma to the mapping tree
> >         }
> > 
> >         // The existing GUP except that now it inc mapcount and not
> >         // refcount
> >         GUP_old(..., &nanonymous, &nfiles);
> > 
> >         atomic_add(&fake_anon->refcount, nanonymous);
> >         atomic_add(&fake_vma->refcount, nfiles);
> > 
> >         return nanonymous + nfiles;
> >     }
> 
> Thanks for your idea! This is actually something like I was suggesting back
> at LSF/MM in Deer Valley. There were two downsides to this I remember
> people pointing out:
> 
> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> to get necessary locks to insert new entry into the VMA tree in that
> context. So essentially we'd loose get_user_pages_fast() functionality.
> 
> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> Then on IO completion you need to queue work to unpin the pages again as you
> cannot remove the fake VMA directly from interrupt context where the IO is
> completed.
> 
> You are right that the cost could be amortized if gup() is called for
> multiple consecutive pages however for small IOs there's no help...
> 
> So this approach doesn't look like a win to me over using counter in struct
> page and I'd rather try looking into squeezing HMM public page usage of
> struct page so that we can fit that gup counter there as well. I know that
> it may be easier said than done...

So I went back to the drawing board and first I would like to ascertain
that we all agree on what the objectives are:

    [O1] Avoid write back from a page still being written by either a
         device or some direct I/O or any other existing user of GUP.
         This would avoid possible file system corruption.

    [O2] Avoid a crash when set_page_dirty() is called on a page that is
         considered clean by core mm (buffer heads have been removed and
         with some file systems this turns into an ugly mess).

    [O3] DAX and the device block problems, i.e. with DAX the page mapped
         in userspace is the same as the block (persistent memory) and
         neither the filesystem nor the block device understands a page
         as a block or a pinned block.

For [O3] I don't think any pin count would help in any way. I believe
that the current long term GUP API that does not allow GUP of DAX is
the only sane solution for now. The real fix would be to teach the
filesystem about DAX/pinned blocks so that a pinned block is not
reused by the filesystem.


For [O1] and [O2] I believe a solution with mapcount would work. So
no new struct, no fake vma, nothing like that. In GUP for file-backed
pages we increment both refcount and mapcount (we also need a special
put_user_page to decrement mapcount when GUP users are done with the
page).

Now for [O1], write back has to call page_mkclean() to go through
all reverse mappings of the page and map them read only. This means
that we can count the number of real mappings and see if the mapcount
is bigger than that. If mapcount is bigger, then the page is pinned
and we need to use a bounce page to do the writeback.
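
A toy model of that accounting, under the assumption that the rmap walk
can report how many real mappings it found. struct model_page and every
model_* helper are hypothetical; in particular real_mappings is just a
parameter here, whereas the kernel would obtain it from the reverse-mapping
walk:

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page. */
struct model_page {
    int refcount;
    int mapcount;
};

/* A real userspace mapping takes a reference and a mapcount. */
static void model_map(struct model_page *p)
{
    p->refcount++;
    p->mapcount++;
}

/* Under the proposal, GUP on a file-backed page does the same. */
static void model_gup(struct model_page *p)
{
    p->refcount++;
    p->mapcount++;
}

/* The special put that undoes both increments. */
static void model_put_user_page(struct model_page *p)
{
    p->mapcount--;
    p->refcount--;
}

/* Pinned if mapcount exceeds what the rmap walk actually found. */
static bool page_is_pinned(const struct model_page *p, int real_mappings)
{
    return p->mapcount > real_mappings;
}

enum wb_action { WB_DIRECT, WB_BOUNCE };

/* Write back in place unless a GUP user may still be writing the page. */
static enum wb_action writeback_action(const struct model_page *p,
                                       int real_mappings)
{
    return page_is_pinned(p, real_mappings) ? WB_BOUNCE : WB_DIRECT;
}
```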

Note that there can be no concurrent new real mapping added as the
page is locked, thus we are protected on that front. So the only race
is with a GUP running before page_mkclean() has removed all ptes with
write permission. To close that race we should check the page write
back flag in GUP and either return ERR_PTR(-EBUSY) for the page, or
return the page and set some flag in the lower bits of the struct
page pointer so that the user of GUP can wait on write back to
finish before doing anything else.
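
Both ways of reporting the race could look roughly like this in a
userspace model. Everything here is an assumption made for illustration:
gup_one(), the policy enum and GUP_WAIT_BIT are invented names, and the
model_err_ptr() family merely imitates the kernel's ERR_PTR() convention
of encoding small negative errnos in the top of the address range:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page model; the int keeps alignment >= 2 so bit 0 is free. */
struct model_page {
    int refcount;
    bool writeback;
};

enum gup_policy { GUP_FAIL_BUSY, GUP_TAG_WAIT };

#define MODEL_MAX_ERRNO 4095
#define GUP_WAIT_BIT    ((uintptr_t)1)   /* made-up flag name */

static void *model_err_ptr(long err)      { return (void *)(intptr_t)err; }
static long  model_ptr_err(const void *p) { return (long)(intptr_t)p; }

static bool model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Pin one page, reporting "under write back" according to 'policy'. */
static struct model_page *gup_one(struct model_page *page,
                                  enum gup_policy policy)
{
    if (!page->writeback) {
        page->refcount++;
        return page;
    }
    if (policy == GUP_FAIL_BUSY)
        return model_err_ptr(-EBUSY);   /* no reference taken */

    page->refcount++;
    return (struct model_page *)((uintptr_t)page | GUP_WAIT_BIT);
}

/* Only meaningful after model_is_err() has returned false. */
static bool gup_must_wait(const struct model_page *ret)
{
    return ((uintptr_t)ret & GUP_WAIT_BIT) != 0;
}

static struct model_page *gup_untag(struct model_page *ret)
{
    return (struct model_page *)((uintptr_t)ret & ~GUP_WAIT_BIT);
}
```

A caller would test model_is_err() first, then gup_must_wait(), and only
dereference the result of gup_untag().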


For [O2] I believe we can handle that case in the put_user_page()
function to properly dirty the page without causing a filesystem
freak out.

Impact of that approach seems pretty small; direct IO would only
be affected for pages under active write back, which is one of the
things we are trying to fix anyway.

Did i miss anything ?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 15:03                           ` Jerome Glisse
@ 2018-12-12 16:27                             ` Dan Williams
  2018-12-12 17:02                               ` Jerome Glisse
  2018-12-12 21:30                               ` Jerome Glisse
  2018-12-12 21:46                             ` Dave Chinner
  1 sibling, 2 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-12 16:27 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > Another crazy idea, why not treating GUP as another mapping of the page
> > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > have a mix of both private and file page thus only if it is a read only
> > > GUP) that would get added to the list of existing mapping.
> > >
> > > So the flow would be:
> > >     somefunction_thatuse_gup()
> > >     {
> > >         ...
> > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > >         ...
> > >     }
> > >
> > >     GUP(vma, ..., fake_anon, fake_vma)
> > >     {
> > >         if (vma->flags == ANON) {
> > >             // Add the fake anon vma to the anon vma chain as a child
> > >             // of current vma
> > >         } else {
> > >             // Add the fake vma to the mapping tree
> > >         }
> > >
> > >         // The existing GUP except that now it inc mapcount and not
> > >         // refcount
> > >         GUP_old(..., &nanonymous, &nfiles);
> > >
> > >         atomic_add(&fake_anon->refcount, nanonymous);
> > >         atomic_add(&fake_vma->refcount, nfiles);
> > >
> > >         return nanonymous + nfiles;
> > >     }
> >
> > Thanks for your idea! This is actually something like I was suggesting back
> > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > people pointing out:
> >
> > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > to get necessary locks to insert new entry into the VMA tree in that
> > context. So essentially we'd loose get_user_pages_fast() functionality.
> >
> > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > Then on IO completion you need to queue work to unpin the pages again as you
> > cannot remove the fake VMA directly from interrupt context where the IO is
> > completed.
> >
> > You are right that the cost could be amortized if gup() is called for
> > multiple consecutive pages however for small IOs there's no help...
> >
> > So this approach doesn't look like a win to me over using counter in struct
> > page and I'd rather try looking into squeezing HMM public page usage of
> > struct page so that we can fit that gup counter there as well. I know that
> > it may be easier said than done...
>
> So i want back to the drawing board and first i would like to ascertain
> that we all agree on what the objectives are:
>
>     [O1] Avoid write back from a page still being written by either a
>          device or some direct I/O or any other existing user of GUP.
>          This would avoid possible file system corruption.
>
>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>          considered clean by core mm (buffer head have been remove and
>          with some file system this turns into an ugly mess).
>
>     [O3] DAX and the device block problems, ie with DAX the page map in
>          userspace is the same as the block (persistent memory) and no
>          filesystem nor block device understand page as block or pinned
>          block.
>
> For [O3] i don't think any pin count would help in anyway. I believe
> that the current long term GUP API that does not allow GUP of DAX is
> the only sane solution for now.

No, that's not a sane solution, it's an emergency hack.

> The real fix would be to teach file-
> system about DAX/pinned block so that a pinned block is not reuse
> by filesystem.

We already have taught filesystems about pinned dax pages, see
dax_layout_busy_page(). As much as possible I want to eliminate the
concept of "dax pages" as a special case that gets sprinkled
throughout the mm.

> For [O1] and [O2] i believe a solution with mapcount would work. So
> no new struct, no fake vma, nothing like that. In GUP for file back
> pages

With get_user_pages_fast() we don't know that we have a file-backed
page, because we don't have a vma.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 16:27                             ` Dan Williams
@ 2018-12-12 17:02                               ` Jerome Glisse
  2018-12-12 17:49                                 ` Dan Williams
  2018-12-12 21:30                               ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 17:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > Another crazy idea, why not treating GUP as another mapping of the page
> > > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > > have a mix of both private and file page thus only if it is a read only
> > > > GUP) that would get added to the list of existing mapping.
> > > >
> > > > So the flow would be:
> > > >     somefunction_thatuse_gup()
> > > >     {
> > > >         ...
> > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > >         ...
> > > >     }
> > > >
> > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > >     {
> > > >         if (vma->flags == ANON) {
> > > >             // Add the fake anon vma to the anon vma chain as a child
> > > >             // of current vma
> > > >         } else {
> > > >             // Add the fake vma to the mapping tree
> > > >         }
> > > >
> > > >         // The existing GUP except that now it inc mapcount and not
> > > >         // refcount
> > > >         GUP_old(..., &nanonymous, &nfiles);
> > > >
> > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > >
> > > >         return nanonymous + nfiles;
> > > >     }
> > >
> > > Thanks for your idea! This is actually something like I was suggesting back
> > > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > > people pointing out:
> > >
> > > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > > to get necessary locks to insert new entry into the VMA tree in that
> > > context. So essentially we'd loose get_user_pages_fast() functionality.
> > >
> > > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > > Then on IO completion you need to queue work to unpin the pages again as you
> > > cannot remove the fake VMA directly from interrupt context where the IO is
> > > completed.
> > >
> > > You are right that the cost could be amortized if gup() is called for
> > > multiple consecutive pages however for small IOs there's no help...
> > >
> > > So this approach doesn't look like a win to me over using counter in struct
> > > page and I'd rather try looking into squeezing HMM public page usage of
> > > struct page so that we can fit that gup counter there as well. I know that
> > > it may be easier said than done...
> >
> > So i want back to the drawing board and first i would like to ascertain
> > that we all agree on what the objectives are:
> >
> >     [O1] Avoid write back from a page still being written by either a
> >          device or some direct I/O or any other existing user of GUP.
> >          This would avoid possible file system corruption.
> >
> >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> >          considered clean by core mm (buffer head have been remove and
> >          with some file system this turns into an ugly mess).
> >
> >     [O3] DAX and the device block problems, ie with DAX the page map in
> >          userspace is the same as the block (persistent memory) and no
> >          filesystem nor block device understand page as block or pinned
> >          block.
> >
> > For [O3] i don't think any pin count would help in anyway. I believe
> > that the current long term GUP API that does not allow GUP of DAX is
> > the only sane solution for now.
> 
> No, that's not a sane solution, it's an emergency hack.

Then how do you want to solve it? Knowing the pin count does not help
you, or at least I do not see how it would. And if it does, then my
solution allows you to know the pin count: it is the difference
between the mapcount value and the number of real mappings.


> > The real fix would be to teach file-
> > system about DAX/pinned block so that a pinned block is not reuse
> > by filesystem.
> 
> We already have taught filesystems about pinned dax pages, see
> dax_layout_busy_page(). As much as possible I want to eliminate the
> concept of "dax pages" as a special case that gets sprinkled
> throughout the mm.
> 
> > For [O1] and [O2] i believe a solution with mapcount would work. So
> > no new struct, no fake vma, nothing like that. In GUP for file back
> > pages
> 
> With get_user_pages_fast() we don't know that we have a file-backed
> page, because we don't have a vma.

You do not need a vma to know that; we have PageAnon() for that. So my
solution is just about adding this to the core GUP page table walker:

    if (!PageAnon(page))
        atomic_inc(&page->mapcount);


Then in put_user_page() you do the opposite. In page_mkclean() you
count the number of real mappings and voilà ... you have an answer for
[O1]. You could use the same count of real mappings to get the pin
count in other places that care about it, but I fail to see why the
actual pin count value would matter to anyone.
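
To make the accounting above concrete, here is a minimal userspace
sketch. It is a toy model, not kernel code: struct toy_page and every
toy_*() helper are hypothetical names made up for illustration, the
counter starts at 0 (the kernel's own mapcount field does not), and the
"real mappings" number that page_mkclean() would obtain by walking the
mappings is simply kept in a plain field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; NOT the kernel's layout. */
struct toy_page {
    atomic_int mapcount;  /* real mappings plus GUP pins (file-backed only) */
    int real_mappings;    /* what a walk over the real mappings would count */
    bool anon;            /* stand-in for PageAnon() */
};

/* A real userspace mapping bumps both numbers. */
static void toy_map(struct toy_page *p)
{
    p->real_mappings++;
    atomic_fetch_add(&p->mapcount, 1);
}

/* GUP side: only file-backed pages take the extra mapcount. */
static void toy_gup(struct toy_page *p)
{
    if (!p->anon)
        atomic_fetch_add(&p->mapcount, 1);
}

/* put_user_page() side: the exact opposite of toy_gup(). */
static void toy_put_user_page(struct toy_page *p)
{
    if (!p->anon) {
        int old = atomic_fetch_sub(&p->mapcount, 1);
        assert(old > 0); /* catch an unbalanced put */
    }
}

/* Pin count is the mapcount minus the number of real mappings. */
static int toy_pin_count(struct toy_page *p)
{
    return atomic_load(&p->mapcount) - p->real_mappings;
}

/* [O1]: write back is only safe when nobody holds a GUP pin. */
static bool toy_may_write_back(struct toy_page *p)
{
    return toy_pin_count(p) == 0;
}
```

With one real mapping and one toy_gup() call, toy_pin_count() is 1 and
toy_may_write_back() is false; after the matching toy_put_user_page()
it is 0 and write back may proceed.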

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 17:02                               ` Jerome Glisse
@ 2018-12-12 17:49                                 ` Dan Williams
  2018-12-12 19:07                                   ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-12 17:49 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 9:02 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> > On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > Another crazy idea, why not treating GUP as another mapping of the page
> > > > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > > > have a mix of both private and file page thus only if it is a read only
> > > > > GUP) that would get added to the list of existing mapping.
> > > > >
> > > > > So the flow would be:
> > > > >     somefunction_thatuse_gup()
> > > > >     {
> > > > >         ...
> > > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > > >         ...
> > > > >     }
> > > > >
> > > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > > >     {
> > > > >         if (vma->flags == ANON) {
> > > > >             // Add the fake anon vma to the anon vma chain as a child
> > > > >             // of current vma
> > > > >         } else {
> > > > >             // Add the fake vma to the mapping tree
> > > > >         }
> > > > >
> > > > >         // The existing GUP except that now it inc mapcount and not
> > > > >         // refcount
> > > > >         GUP_old(..., &nanonymous, &nfiles);
> > > > >
> > > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > > >
> > > > >         return nanonymous + nfiles;
> > > > >     }
> > > >
> > > > Thanks for your idea! This is actually something like I was suggesting back
> > > > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > > > people pointing out:
> > > >
> > > > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > > > to get necessary locks to insert new entry into the VMA tree in that
> > > > context. So essentially we'd loose get_user_pages_fast() functionality.
> > > >
> > > > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > > > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > > > Then on IO completion you need to queue work to unpin the pages again as you
> > > > cannot remove the fake VMA directly from interrupt context where the IO is
> > > > completed.
> > > >
> > > > You are right that the cost could be amortized if gup() is called for
> > > > multiple consecutive pages however for small IOs there's no help...
> > > >
> > > > So this approach doesn't look like a win to me over using counter in struct
> > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > struct page so that we can fit that gup counter there as well. I know that
> > > > it may be easier said than done...
> > >
> > > So i want back to the drawing board and first i would like to ascertain
> > > that we all agree on what the objectives are:
> > >
> > >     [O1] Avoid write back from a page still being written by either a
> > >          device or some direct I/O or any other existing user of GUP.
> > >          This would avoid possible file system corruption.
> > >
> > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > >          considered clean by core mm (buffer head have been remove and
> > >          with some file system this turns into an ugly mess).
> > >
> > >     [O3] DAX and the device block problems, ie with DAX the page map in
> > >          userspace is the same as the block (persistent memory) and no
> > >          filesystem nor block device understand page as block or pinned
> > >          block.
> > >
> > > For [O3] i don't think any pin count would help in anyway. I believe
> > > that the current long term GUP API that does not allow GUP of DAX is
> > > the only sane solution for now.
> >
> > No, that's not a sane solution, it's an emergency hack.
>
> Then how do you want to solve it ? Knowing pin count does not help
> you, at least i do not see how that would help and if it does then
> my solution allow you to know pin count it is the difference between
> real mapping and mapcount value.

True, pin count doesn't help, and indefinite waits are intolerable, so
I think we need to make "long term" GUP revokable, but otherwise
hopefully use the put_user_page() scheme to replace the use of the pin
count for dax_layout_busy_page().

> > > The real fix would be to teach file-
> > > system about DAX/pinned block so that a pinned block is not reuse
> > > by filesystem.
> >
> > We already have taught filesystems about pinned dax pages, see
> > dax_layout_busy_page(). As much as possible I want to eliminate the
> > concept of "dax pages" as a special case that gets sprinkled
> > throughout the mm.
> >
> > > For [O1] and [O2] i believe a solution with mapcount would work. So
> > > no new struct, no fake vma, nothing like that. In GUP for file back
> > > pages
> >
> > With get_user_pages_fast() we don't know that we have a file-backed
> > page, because we don't have a vma.
>
> You do not need a vma to know that we have PageAnon() for that so my
> solution is just about adding to core GUP page table walker:
>
>     if (!PageAnon(page))
>         atomic_inc(&page->mapcount);

Ah, ok, would need to add proper mapcount manipulation for dax and
audit that nothing makes page-cache assumptions based on a non-zero
mapcount.

> Then in put_user_page() you add the opposite. In page_mkclean() you
> count the number of real mapping and voilà ... you got an answer for
> [O1]. You could use the same count real mapping to get the pin count
> in other place that cares about it but i fails to see why the actual
> pin count value would matter to any one.

Sounds like it could work... devil is in the details.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 17:49                                 ` Dan Williams
@ 2018-12-12 19:07                                   ` John Hubbard
  0 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-12 19:07 UTC (permalink / raw)
  To: Dan Williams, Jérôme Glisse
  Cc: Jan Kara, Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/12/18 9:49 AM, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 9:02 AM Jerome Glisse <jglisse@redhat.com> wrote:
>>
>> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>>>>
>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>>> Another crazy idea, why not treating GUP as another mapping of the page
>>>>>> and caller of GUP would have to provide either a fake anon_vma struct or
>>>>>> a fake vma struct (or both for PRIVATE mapping of a file where you can
>>>>>> have a mix of both private and file page thus only if it is a read only
>>>>>> GUP) that would get added to the list of existing mapping.
>>>>>>
>>>>>> So the flow would be:
>>>>>>     somefunction_thatuse_gup()
>>>>>>     {
>>>>>>         ...
>>>>>>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
>>>>>>         ...
>>>>>>     }
>>>>>>
>>>>>>     GUP(vma, ..., fake_anon, fake_vma)
>>>>>>     {
>>>>>>         if (vma->flags == ANON) {
>>>>>>             // Add the fake anon vma to the anon vma chain as a child
>>>>>>             // of current vma
>>>>>>         } else {
>>>>>>             // Add the fake vma to the mapping tree
>>>>>>         }
>>>>>>
>>>>>>         // The existing GUP except that now it inc mapcount and not
>>>>>>         // refcount
>>>>>>         GUP_old(..., &nanonymous, &nfiles);
>>>>>>
>>>>>>         atomic_add(&fake_anon->refcount, nanonymous);
>>>>>>         atomic_add(&fake_vma->refcount, nfiles);
>>>>>>
>>>>>>         return nanonymous + nfiles;
>>>>>>     }
>>>>>
>>>>> Thanks for your idea! This is actually something like I was suggesting back
>>>>> at LSF/MM in Deer Valley. There were two downsides to this I remember
>>>>> people pointing out:
>>>>>
>>>>> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
>>>>> to get necessary locks to insert new entry into the VMA tree in that
>>>>> context. So essentially we'd loose get_user_pages_fast() functionality.
>>>>>
>>>>> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
>>>>> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
>>>>> Then on IO completion you need to queue work to unpin the pages again as you
>>>>> cannot remove the fake VMA directly from interrupt context where the IO is
>>>>> completed.
>>>>>
>>>>> You are right that the cost could be amortized if gup() is called for
>>>>> multiple consecutive pages however for small IOs there's no help...
>>>>>
>>>>> So this approach doesn't look like a win to me over using counter in struct
>>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>>> struct page so that we can fit that gup counter there as well. I know that
>>>>> it may be easier said than done...
>>>>
>>>> So i want back to the drawing board and first i would like to ascertain
>>>> that we all agree on what the objectives are:
>>>>
>>>>     [O1] Avoid write back from a page still being written by either a
>>>>          device or some direct I/O or any other existing user of GUP.
>>>>          This would avoid possible file system corruption.
>>>>
>>>>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>>>>          considered clean by core mm (buffer head have been remove and
>>>>          with some file system this turns into an ugly mess).
>>>>
>>>>     [O3] DAX and the device block problems, ie with DAX the page map in
>>>>          userspace is the same as the block (persistent memory) and no
>>>>          filesystem nor block device understand page as block or pinned
>>>>          block.
>>>>
>>>> For [O3] i don't think any pin count would help in anyway. I believe
>>>> that the current long term GUP API that does not allow GUP of DAX is
>>>> the only sane solution for now.
>>>
>>> No, that's not a sane solution, it's an emergency hack.
>>
>> Then how do you want to solve it ? Knowing pin count does not help
>> you, at least i do not see how that would help and if it does then
>> my solution allow you to know pin count it is the difference between
>> real mapping and mapcount value.
> 
> True, pin count doesn't help, and indefinite waits are intolerable, so
> I think we need to make "long term" GUP revokable, but otherwise
> hopefully use the put_user_page() scheme to replace the use of the pin
> count for dax_layout_busy_page().
> 
>>>> The real fix would be to teach file-
>>>> system about DAX/pinned block so that a pinned block is not reuse
>>>> by filesystem.
>>>
>>> We already have taught filesystems about pinned dax pages, see
>>> dax_layout_busy_page(). As much as possible I want to eliminate the
>>> concept of "dax pages" as a special case that gets sprinkled
>>> throughout the mm.
>>>
>>>> For [O1] and [O2] i believe a solution with mapcount would work. So
>>>> no new struct, no fake vma, nothing like that. In GUP for file back
>>>> pages
>>>
>>> With get_user_pages_fast() we don't know that we have a file-backed
>>> page, because we don't have a vma.
>>
>> You do not need a vma to know that we have PageAnon() for that so my
>> solution is just about adding to core GUP page table walker:
>>
>>     if (!PageAnon(page))
>>         atomic_inc(&page->mapcount);
> 
> Ah, ok, would need to add proper mapcount manipulation for dax and
> audit that nothing makes page-cache assumptions based on a non-zero
> mapcount.
> 
>> Then in put_user_page() you add the opposite. In page_mkclean() you
>> count the number of real mapping and voilà ... you got an answer for
>> [O1]. You could use the same count real mapping to get the pin count
>> in other place that cares about it but i fails to see why the actual
>> pin count value would matter to any one.
> 
> Sounds like a could work... devil is in the details.
> 

I also like this solution. It uses existing information and existing pointers
(PageAnon, page_mapping) plus a tiny bit more (another pointer in struct 
address_space, probably, and that's not a size-contentious struct) to figure
out the DMA pinned count, which neatly avoids the struct page contention that I 
ran into. 

Thanks for coming up with this! I'll put together a patchset that shows the details
so we can all take a closer look. This is on the top of my list.


-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  5:18                       ` Matthew Wilcox
@ 2018-12-12 19:13                         ` John Hubbard
  0 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-12 19:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/7/18 9:18 PM, Matthew Wilcox wrote:
> On Fri, Dec 07, 2018 at 04:52:42PM -0800, John Hubbard wrote:
>> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
>> completely out of space. At this point, after thinking about it carefully, it seems clear
>> that it's time for a single, new field:
> 
> Sorry for not replying earlier; I'm travelling and have had trouble
> keeping on top of my mail.

Hi Matthew,

> 
> Adding this field will grow struct page by 4-8 bytes, so it will no
> longer be 64 bytes.  This isn't an acceptable answer.

I had to ask, though, just in case the historical rules might no longer
be as pressing. But OK.

> 
> We have a few options for bits.  One is that we have (iirc) two
> bits available in page->flags on 32-bit.  That'll force a few more
> configurations into using _last_cpupid and/or page_ext.  I'm not a huge
> fan of this approach.
> 
> The second is to use page->lru.next bit 1.  This requires some care
> because m68k allows misaligned pointers.  If the list_head that it's
> joined to is misaligned, we'll be in trouble.  This can get tricky because
> some pages are attached to list_heads which are on the stack ... and I
> don't think gcc guarantees __aligned attributes work for stack variables.
> 
> The third is to use page->lru.prev bit 0.  We'd want to switch pgmap
> and hmm_data around to make this work, and we'd want to record this
> in mm_types.h so nobody tries to use a field which aliases with
> page->lru.prev and has bit 0 set on a page which can be mapped to
> userspace (which I currently believe to be true).
> 
> The fourth is to use a bit in page->flags for 64-bit and a bit in
> page_ext->flags for 32-bit.  Or we could get rid of page_ext and grow
> struct page with a ->flags2 on 32-bit.
> 
> Fifth, it isn't clear to me how many bits might be left in ->_last_cpupid
> at this point, and perhaps there's scope for using a bit in there.
> 

Thanks for taking the time to collect and explain all of this, I'm stashing
it away as I'm sure it will come up again.

The latest approach to the gup/dma problem here might, or might not, actually
need a single page bit. I'll know in a day or two.

-- 
thanks,
John Hubbard
NVIDIA 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 16:27                             ` Dan Williams
  2018-12-12 17:02                               ` Jerome Glisse
@ 2018-12-12 21:30                               ` Jerome Glisse
  2018-12-12 21:40                                 ` Dan Williams
  2018-12-12 21:56                                 ` John Hubbard
  1 sibling, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 21:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > Another crazy idea, why not treating GUP as another mapping of the page
> > > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > > have a mix of both private and file page thus only if it is a read only
> > > > GUP) that would get added to the list of existing mapping.
> > > >
> > > > So the flow would be:
> > > >     somefunction_thatuse_gup()
> > > >     {
> > > >         ...
> > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > >         ...
> > > >     }
> > > >
> > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > >     {
> > > >         if (vma->flags == ANON) {
> > > >             // Add the fake anon vma to the anon vma chain as a child
> > > >             // of current vma
> > > >         } else {
> > > >             // Add the fake vma to the mapping tree
> > > >         }
> > > >
> > > >         // The existing GUP except that now it inc mapcount and not
> > > >         // refcount
> > > >         GUP_old(..., &nanonymous, &nfiles);
> > > >
> > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > >
> > > >         return nanonymous + nfiles;
> > > >     }
> > >
> > > Thanks for your idea! This is actually something like I was suggesting back
> > > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > > people pointing out:
> > >
> > > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > > to get necessary locks to insert new entry into the VMA tree in that
> > > context. So essentially we'd loose get_user_pages_fast() functionality.
> > >
> > > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > > Then on IO completion you need to queue work to unpin the pages again as you
> > > cannot remove the fake VMA directly from interrupt context where the IO is
> > > completed.
> > >
> > > You are right that the cost could be amortized if gup() is called for
> > > multiple consecutive pages however for small IOs there's no help...
> > >
> > > So this approach doesn't look like a win to me over using counter in struct
> > > page and I'd rather try looking into squeezing HMM public page usage of
> > > struct page so that we can fit that gup counter there as well. I know that
> > > it may be easier said than done...
> >
> > So i want back to the drawing board and first i would like to ascertain
> > that we all agree on what the objectives are:
> >
> >     [O1] Avoid write back from a page still being written by either a
> >          device or some direct I/O or any other existing user of GUP.
> >          This would avoid possible file system corruption.
> >
> >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> >          considered clean by core mm (buffer head have been remove and
> >          with some file system this turns into an ugly mess).
> >
> >     [O3] DAX and the device block problems, ie with DAX the page map in
> >          userspace is the same as the block (persistent memory) and no
> >          filesystem nor block device understand page as block or pinned
> >          block.
> >
> > For [O3] i don't think any pin count would help in anyway. I believe
> > that the current long term GUP API that does not allow GUP of DAX is
> > the only sane solution for now.
> 
> No, that's not a sane solution, it's an emergency hack.
> 
> > The real fix would be to teach file-
> > system about DAX/pinned block so that a pinned block is not reuse
> > by filesystem.
> 
> We already have taught filesystems about pinned dax pages, see
> dax_layout_busy_page(). As much as possible I want to eliminate the
> concept of "dax pages" as a special case that gets sprinkled
> throughout the mm.

So, thinking about the O3 issues, what about leveraging the recent
change I did to mmu notifier? Add an event for truncate, or for any
other file event that needs to invalidate the file->page for a range
of offsets.

Add an mmu notifier listener to GUP users (except direct I/O) so that
they invalidate their hardware mapping or switch the hardware mapping
to use a crappy page. When such an event happens, whatever the user
does to the page through that driver is broken anyway. So it is better
to be loud about it than to try to make it pass under the radar.

This will put the burden on broken users and allow you to properly
recycle your DAX pages.

Think of it as revoke through mmu notifier.

So patchset would be:
    enum mmu_notifier_event {
+       MMU_NOTIFY_TRUNCATE,
    };

+   Change truncate code path to emit MMU_NOTIFY_TRUNCATE

Then for each user of GUP (except direct I/O or other very short
term GUP):

    Patch 1: register mmu notifier
    Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP;
             when that happens, update the device page table or
             usage to point to a crappy page and do put_user_page
             on all previously held pages

So this would solve the revoke side of things without adding a burden
on GUP users like direct I/O. Many existing users of GUP already
listen to mmu notifier and already behave properly. It is just about
making everybody listen to that. Then we can even add the mmu notifier
pointer as an argument to GUP just to make sure no new user of GUP
forgets about registering a notifier (the argument as a teaching guide,
not as something actively used).


So does that sound like a plan to solve your concern with long term
GUP users? This does not depend on DAX or anything; it would apply to
any file-backed pages.
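
The revoke flow described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions, not kernel
code: the enum values, struct gup_user, gup_user_notify() and
model_put_user_page() are hypothetical names invented for this sketch,
and the "device page table" is just an array of slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event names modelled on the proposal; not a kernel API. */
enum notify_event {
    NOTIFY_UNMAP,
    NOTIFY_TRUNCATE,
    NOTIFY_OTHER,
};

/* Toy page: only tracks how many GUP-style pins are held on it. */
struct page {
    int pin_count;
};

#define MAX_SLOTS 8

/* One entry of the toy "device page table". */
struct slot {
    unsigned long index;   /* file offset, in pages */
    struct page *page;     /* page the "hardware" currently points at */
};

/* A long-term GUP user that has registered for notifications. */
struct gup_user {
    struct slot slots[MAX_SLOTS];
    size_t nr_slots;
    struct page *dummy;    /* the "crappy page" used after a revoke */
};

/* Stand-in for put_user_page(): drop one pin. */
static void model_put_user_page(struct page *page)
{
    assert(page->pin_count > 0);
    page->pin_count--;
}

/*
 * Notifier callback for the file range [start, end), in pages.
 * On truncate or unmap, every slot in the range is redirected to the
 * dummy page and the pin on the old page is dropped.
 * Returns the number of slots that were revoked.
 */
static size_t gup_user_notify(struct gup_user *u, enum notify_event ev,
                              unsigned long start, unsigned long end)
{
    size_t revoked = 0;

    if (ev != NOTIFY_TRUNCATE && ev != NOTIFY_UNMAP)
        return 0;

    for (size_t i = 0; i < u->nr_slots; i++) {
        struct slot *s = &u->slots[i];
        struct page *old = s->page;

        if (s->index < start || s->index >= end || old == u->dummy)
            continue;

        s->page = u->dummy;        /* redirect the "hardware" first */
        model_put_user_page(old);  /* then release the pinned page */
        revoked++;
    }
    return revoked;
}
```

Redirecting the slot before dropping the pin mirrors the order given in
the patch outline: the device mapping is updated first, and only then is
the previously held page released.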


Cheers,
Jérôme

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:30                               ` Jerome Glisse
@ 2018-12-12 21:40                                 ` Dan Williams
  2018-12-12 21:53                                   ` Jerome Glisse
  2018-12-12 21:56                                 ` John Hubbard
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-12 21:40 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 1:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> > On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > Another crazy idea, why not treating GUP as another mapping of the page
> > > > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > > > have a mix of both private and file page thus only if it is a read only
> > > > > GUP) that would get added to the list of existing mapping.
> > > > >
> > > > > So the flow would be:
> > > > >     somefunction_thatuse_gup()
> > > > >     {
> > > > >         ...
> > > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > > >         ...
> > > > >     }
> > > > >
> > > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > > >     {
> > > > >         if (vma->flags == ANON) {
> > > > >             // Add the fake anon vma to the anon vma chain as a child
> > > > >             // of current vma
> > > > >         } else {
> > > > >             // Add the fake vma to the mapping tree
> > > > >         }
> > > > >
> > > > >         // The existing GUP except that now it inc mapcount and not
> > > > >         // refcount
> > > > >         GUP_old(..., &nanonymous, &nfiles);
> > > > >
> > > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > > >
> > > > >         return nanonymous + nfiles;
> > > > >     }
> > > >
> > > > Thanks for your idea! This is actually something like I was suggesting back
> > > > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > > > people pointing out:
> > > >
> > > > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > > > to get necessary locks to insert new entry into the VMA tree in that
> > > > context. So essentially we'd loose get_user_pages_fast() functionality.
> > > >
> > > > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > > > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > > > Then on IO completion you need to queue work to unpin the pages again as you
> > > > cannot remove the fake VMA directly from interrupt context where the IO is
> > > > completed.
> > > >
> > > > You are right that the cost could be amortized if gup() is called for
> > > > multiple consecutive pages however for small IOs there's no help...
> > > >
> > > > So this approach doesn't look like a win to me over using counter in struct
> > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > struct page so that we can fit that gup counter there as well. I know that
> > > > it may be easier said than done...
> > >
> > > So i want back to the drawing board and first i would like to ascertain
> > > that we all agree on what the objectives are:
> > >
> > >     [O1] Avoid write back from a page still being written by either a
> > >          device or some direct I/O or any other existing user of GUP.
> > >          This would avoid possible file system corruption.
> > >
> > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > >          considered clean by core mm (buffer head have been remove and
> > >          with some file system this turns into an ugly mess).
> > >
> > >     [O3] DAX and the device block problems, ie with DAX the page map in
> > >          userspace is the same as the block (persistent memory) and no
> > >          filesystem nor block device understand page as block or pinned
> > >          block.
> > >
> > > For [O3] i don't think any pin count would help in anyway. I believe
> > > that the current long term GUP API that does not allow GUP of DAX is
> > > the only sane solution for now.
> >
> > No, that's not a sane solution, it's an emergency hack.
> >
> > > The real fix would be to teach file-
> > > system about DAX/pinned block so that a pinned block is not reuse
> > > by filesystem.
> >
> > We already have taught filesystems about pinned dax pages, see
> > dax_layout_busy_page(). As much as possible I want to eliminate the
> > concept of "dax pages" as a special case that gets sprinkled
> > throughout the mm.
>
> So thinking on O3 issues what about leveraging the recent change i
> did to mmu notifier. Add a event for truncate or any other file
> event that need to invalidate the file->page for a range of offset.
>
> Add mmu notifier listener to GUP user (except direct I/O) so that
> they invalidate they hardware mapping or switch the hardware mapping
> to use a crappy page. When such event happens what ever user do to
> the page through that driver is broken anyway. So it is better to
> be loud about it then trying to make it pass under the radar.
>
> This will put the burden on broken user and allow you to properly
> recycle your DAX page.
>
> Think of it as revoke through mmu notifier.
>
> So patchset would be:
>     enum mmu_notifier_event {
> +       MMU_NOTIFY_TRUNCATE,
>     };
>
> +   Change truncate code path to emit MMU_NOTIFY_TRUNCATE
>
> Then for each user of GUP (except direct I/O or other very short
> term GUP):
>
>     Patch 1: register mmu notifier
>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
>              when that happens update the device page table or
>              usage to point to a crappy page and do put_user_page
>              on all previously held page
>
> So this would solve the revoke side of thing without adding a burden
> on GUP user like direct I/O. Many existing user of GUP already do
> listen to mmu notifier and already behave properly. It is just about
> making every body list to that. Then we can even add the mmu notifier
> pointer as argument to GUP just to make sure no new user of GUP forget
> about registering a notifier (argument as a teaching guide not as a
> something actively use).
>
>
> So does that sounds like a plan to solve your concern with long term
> GUP user ? This does not depend on DAX or anything it would apply to
> any file back pages.

Almost, we need some safety around assuming that DMA to the page is
complete, so the notification would need to go all the way to userspace
with something like a file lease notification. It would also need to
be backstopped by an IOMMU in the case where the hardware does not /
can not stop in-flight DMA.

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 15:03                           ` Jerome Glisse
  2018-12-12 16:27                             ` Dan Williams
@ 2018-12-12 21:46                             ` Dave Chinner
  2018-12-12 21:59                               ` Jerome Glisse
  2018-12-14 15:43                               ` Jan Kara
  1 sibling, 2 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-12 21:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > So this approach doesn't look like a win to me over using counter in struct
> > page and I'd rather try looking into squeezing HMM public page usage of
> > struct page so that we can fit that gup counter there as well. I know that
> > it may be easier said than done...
> 
> So i want back to the drawing board and first i would like to ascertain
> that we all agree on what the objectives are:
> 
>     [O1] Avoid write back from a page still being written by either a
>          device or some direct I/O or any other existing user of GUP.
>          This would avoid possible file system corruption.
> 
>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>          considered clean by core mm (buffer head have been remove and
>          with some file system this turns into an ugly mess).

I think that's wrong. This isn't an "avoid a crash" case, this is a
"prevent data and/or filesystem corruption" case. The primary goal
we have here is removing our exposure to potential corruption, which
has the secondary effect of avoiding the crash/panics that currently
occur as a result of inconsistent page/filesystem state.

i.e. The goal is to have ->page_mkwrite() called on the clean page
/before/ the file-backed page is marked dirty, and hence we don't
expose ourselves to potential corruption or crashes that are a
result of inappropriately calling set_page_dirty() on clean
file-backed pages.

> For [O1] and [O2] i believe a solution with mapcount would work. So
> no new struct, no fake vma, nothing like that. In GUP for file back
> pages we increment both refcount and mapcount (we also need a special
> put_user_page to decrement mapcount when GUP user are done with the
> page).

I don't see how a mapcount can prevent anyone from calling
set_page_dirty() inappropriately.

> Now for [O1] the write back have to call page_mkclean() to go through
> all reverse mapping of the page and map read only. This means that
> we can count the number of real mapping and see if the mapcount is
> bigger than that. If mapcount is bigger than page is pin and we need
> to use a bounce page to do the writeback.

Doesn't work. Generally filesystems have already mapped the page
into bios before they call clear_page_dirty_for_io(), so it's too
late for the filesystem to bounce the page at that point.
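
For reference, the counting check in the quoted proposal amounts to the
comparison below. This is a minimal userspace sketch, assuming a
simplified struct page whose mapcount is a plain count (the kernel's
real field is biased and more involved); model_page_is_pinned() is a
hypothetical name. It only illustrates the arithmetic and does nothing
about the timing objection raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: mapcount holds real mappings plus one per GUP pin. */
struct page {
    int mapcount;
};

/*
 * real_mappings is the number of page table mappings found while
 * walking the reverse mappings (the page_mkclean() step in the
 * proposal). Any excess in mapcount is attributed to GUP pins.
 */
static bool model_page_is_pinned(const struct page *page, int real_mappings)
{
    assert(real_mappings >= 0 && real_mappings <= page->mapcount);
    return page->mapcount > real_mappings;
}
```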

> For [O2] i believe we can handle that case in the put_user_page()
> function to properly dirty the page without causing filesystem
> freak out.

I'm pretty sure you can't call ->page_mkwrite() from
put_user_page(), so I don't think this is workable at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:40                                 ` Dan Williams
@ 2018-12-12 21:53                                   ` Jerome Glisse
  2018-12-12 22:11                                     ` Matthew Wilcox
  2018-12-12 23:37                                     ` Jason Gunthorpe
  0 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 21:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 01:40:50PM -0800, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 1:30 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> > > On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > Another crazy idea, why not treating GUP as another mapping of the page
> > > > > > and caller of GUP would have to provide either a fake anon_vma struct or
> > > > > > a fake vma struct (or both for PRIVATE mapping of a file where you can
> > > > > > have a mix of both private and file page thus only if it is a read only
> > > > > > GUP) that would get added to the list of existing mapping.
> > > > > >
> > > > > > So the flow would be:
> > > > > >     somefunction_thatuse_gup()
> > > > > >     {
> > > > > >         ...
> > > > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > > > >         ...
> > > > > >     }
> > > > > >
> > > > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > > > >     {
> > > > > >         if (vma->flags == ANON) {
> > > > > >             // Add the fake anon vma to the anon vma chain as a child
> > > > > >             // of current vma
> > > > > >         } else {
> > > > > >             // Add the fake vma to the mapping tree
> > > > > >         }
> > > > > >
> > > > > >         // The existing GUP except that now it inc mapcount and not
> > > > > >         // refcount
> > > > > >         GUP_old(..., &nanonymous, &nfiles);
> > > > > >
> > > > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > > > >
> > > > > >         return nanonymous + nfiles;
> > > > > >     }
> > > > >
> > > > > Thanks for your idea! This is actually something like I was suggesting back
> > > > > at LSF/MM in Deer Valley. There were two downsides to this I remember
> > > > > people pointing out:
> > > > >
> > > > > 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> > > > > to get necessary locks to insert new entry into the VMA tree in that
> > > > > context. So essentially we'd loose get_user_pages_fast() functionality.
> > > > >
> > > > > 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> > > > > the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> > > > > Then on IO completion you need to queue work to unpin the pages again as you
> > > > > cannot remove the fake VMA directly from interrupt context where the IO is
> > > > > completed.
> > > > >
> > > > > You are right that the cost could be amortized if gup() is called for
> > > > > multiple consecutive pages however for small IOs there's no help...
> > > > >
> > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > it may be easier said than done...
> > > >
> > > > So i want back to the drawing board and first i would like to ascertain
> > > > that we all agree on what the objectives are:
> > > >
> > > >     [O1] Avoid write back from a page still being written by either a
> > > >          device or some direct I/O or any other existing user of GUP.
> > > >          This would avoid possible file system corruption.
> > > >
> > > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > > >          considered clean by core mm (buffer head have been remove and
> > > >          with some file system this turns into an ugly mess).
> > > >
> > > >     [O3] DAX and the device block problems, ie with DAX the page map in
> > > >          userspace is the same as the block (persistent memory) and no
> > > >          filesystem nor block device understand page as block or pinned
> > > >          block.
> > > >
> > > > For [O3] i don't think any pin count would help in anyway. I believe
> > > > that the current long term GUP API that does not allow GUP of DAX is
> > > > the only sane solution for now.
> > >
> > > No, that's not a sane solution, it's an emergency hack.
> > >
> > > > The real fix would be to teach file-
> > > > system about DAX/pinned block so that a pinned block is not reuse
> > > > by filesystem.
> > >
> > > We already have taught filesystems about pinned dax pages, see
> > > dax_layout_busy_page(). As much as possible I want to eliminate the
> > > concept of "dax pages" as a special case that gets sprinkled
> > > throughout the mm.
> >
> > So thinking on O3 issues what about leveraging the recent change i
> > did to mmu notifier. Add a event for truncate or any other file
> > event that need to invalidate the file->page for a range of offset.
> >
> > Add mmu notifier listener to GUP user (except direct I/O) so that
> > they invalidate they hardware mapping or switch the hardware mapping
> > to use a crappy page. When such event happens what ever user do to
> > the page through that driver is broken anyway. So it is better to
> > be loud about it then trying to make it pass under the radar.
> >
> > This will put the burden on broken user and allow you to properly
> > recycle your DAX page.
> >
> > Think of it as revoke through mmu notifier.
> >
> > So patchset would be:
> >     enum mmu_notifier_event {
> > +       MMU_NOTIFY_TRUNCATE,
> >     };
> >
> > +   Change truncate code path to emit MMU_NOTIFY_TRUNCATE
> >
> > Then for each user of GUP (except direct I/O or other very short
> > term GUP):
> >
> >     Patch 1: register mmu notifier
> >     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
> >              when that happens update the device page table or
> >              usage to point to a crappy page and do put_user_page
> >              on all previously held page
> >
> > So this would solve the revoke side of thing without adding a burden
> > on GUP user like direct I/O. Many existing user of GUP already do
> > listen to mmu notifier and already behave properly. It is just about
> > making every body list to that. Then we can even add the mmu notifier
> > pointer as argument to GUP just to make sure no new user of GUP forget
> > about registering a notifier (argument as a teaching guide not as a
> > something actively use).
> >
> >
> > So does that sounds like a plan to solve your concern with long term
> > GUP user ? This does not depend on DAX or anything it would apply to
> > any file back pages.
> 
> Almost, we need some safety around assuming that DMA is complete the
> page, so the notification would need to go all to way to userspace
> with something like a file lease notification. It would also need to
> be backstopped by an IOMMU in the case where the hardware does not /
> can not stop in-flight DMA.

You can always reprogram the hardware right away; it will redirect
any DMA to the crappy page. Notifying the user is a different
problem, handled through fs notify or something like that. Each driver
can also define new events for its users through its device
specific API.

What I am saying is that solving the user notification is an
orthogonal issue and I do not see a one-size-fits-all solution for that.

From my point of view the driver should listen for ftruncate before the
mmu notifier kicks in, send an event to userspace, and maybe wait
and block ftruncate (or move it to a worker thread).

The mmu notifier I put forward is the emergency revoke, i.e. the last
resort after the driver has done everything it could to inform
userspace and release the pages. So doing things brutally in it is
acceptable, like reprogramming the driver page table (which AFAIK is
something you can do on any hardware; whether the hardware will like
it or not is a different question).

Cheers,
Jérôme

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:30                               ` Jerome Glisse
  2018-12-12 21:40                                 ` Dan Williams
@ 2018-12-12 21:56                                 ` John Hubbard
  2018-12-12 22:04                                   ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-12 21:56 UTC (permalink / raw)
  To: Jerome Glisse, Dan Williams
  Cc: Jan Kara, Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/12/18 1:30 PM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>>>
>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>> Another crazy idea, why not treating GUP as another mapping of the page
>>>>> and caller of GUP would have to provide either a fake anon_vma struct or
>>>>> a fake vma struct (or both for PRIVATE mapping of a file where you can
>>>>> have a mix of both private and file page thus only if it is a read only
>>>>> GUP) that would get added to the list of existing mapping.
>>>>>
>>>>> So the flow would be:
>>>>>     somefunction_thatuse_gup()
>>>>>     {
>>>>>         ...
>>>>>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
>>>>>         ...
>>>>>     }
>>>>>
>>>>>     GUP(vma, ..., fake_anon, fake_vma)
>>>>>     {
>>>>>         if (vma->flags == ANON) {
>>>>>             // Add the fake anon vma to the anon vma chain as a child
>>>>>             // of current vma
>>>>>         } else {
>>>>>             // Add the fake vma to the mapping tree
>>>>>         }
>>>>>
>>>>>         // The existing GUP except that now it inc mapcount and not
>>>>>         // refcount
>>>>>         GUP_old(..., &nanonymous, &nfiles);
>>>>>
>>>>>         atomic_add(&fake_anon->refcount, nanonymous);
>>>>>         atomic_add(&fake_vma->refcount, nfiles);
>>>>>
>>>>>         return nanonymous + nfiles;
>>>>>     }
>>>>
>>>> Thanks for your idea! This is actually something like I was suggesting back
>>>> at LSF/MM in Deer Valley. There were two downsides to this I remember
>>>> people pointing out:
>>>>
>>>> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
>>>> to get necessary locks to insert new entry into the VMA tree in that
>>>> context. So essentially we'd loose get_user_pages_fast() functionality.
>>>>
>>>> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
>>>> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
>>>> Then on IO completion you need to queue work to unpin the pages again as you
>>>> cannot remove the fake VMA directly from interrupt context where the IO is
>>>> completed.
>>>>
>>>> You are right that the cost could be amortized if gup() is called for
>>>> multiple consecutive pages however for small IOs there's no help...
>>>>
>>>> So this approach doesn't look like a win to me over using counter in struct
>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>> struct page so that we can fit that gup counter there as well. I know that
>>>> it may be easier said than done...
>>>
>>> So i want back to the drawing board and first i would like to ascertain
>>> that we all agree on what the objectives are:
>>>
>>>     [O1] Avoid write back from a page still being written by either a
>>>          device or some direct I/O or any other existing user of GUP.
>>>          This would avoid possible file system corruption.
>>>
>>>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>>>          considered clean by core mm (buffer head have been remove and
>>>          with some file system this turns into an ugly mess).
>>>
>>>     [O3] DAX and the device block problems, ie with DAX the page map in
>>>          userspace is the same as the block (persistent memory) and no
>>>          filesystem nor block device understand page as block or pinned
>>>          block.
>>>
>>> For [O3] i don't think any pin count would help in anyway. I believe
>>> that the current long term GUP API that does not allow GUP of DAX is
>>> the only sane solution for now.
>>
>> No, that's not a sane solution, it's an emergency hack.
>>
>>> The real fix would be to teach file-
>>> system about DAX/pinned block so that a pinned block is not reuse
>>> by filesystem.
>>
>> We already have taught filesystems about pinned dax pages, see
>> dax_layout_busy_page(). As much as possible I want to eliminate the
>> concept of "dax pages" as a special case that gets sprinkled
>> throughout the mm.
> 
> So thinking on O3 issues what about leveraging the recent change i
> did to mmu notifier. Add a event for truncate or any other file
> event that need to invalidate the file->page for a range of offset.
> 
> Add mmu notifier listener to GUP user (except direct I/O) so that
> they invalidate they hardware mapping or switch the hardware mapping
> to use a crappy page. When such event happens what ever user do to
> the page through that driver is broken anyway. So it is better to
> be loud about it then trying to make it pass under the radar.
> 
> This will put the burden on broken user and allow you to properly
> recycle your DAX page.
> 
> Think of it as revoke through mmu notifier.
> 
> So patchset would be:
>     enum mmu_notifier_event {
> +       MMU_NOTIFY_TRUNCATE,
>     };
> 
> +   Change truncate code path to emit MMU_NOTIFY_TRUNCATE
> 

That part looks good.

> Then for each user of GUP (except direct I/O or other very short
> term GUP):

but, why is there a difference between how we handle long- and
short-term callers? Aren't we just leaving a harder-to-reproduce race
condition, if we ignore the short-term gup callers?

So, how does activity (including direct IO and other short-term callers)
get quiesced (stopped, and guaranteed not to restart or continue), so 
that truncate or umount can continue on?


> 
>     Patch 1: register mmu notifier
>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
>              when that happens update the device page table or
>              usage to point to a crappy page and do put_user_page
>              on all previously held page

Minor point, this sequence should be done within a wrapper around existing 
get_user_pages(), such as get_user_pages_revokable() or something.

thanks,
-- 
John Hubbard
NVIDIA

> 
> So this would solve the revoke side of thing without adding a burden
> on GUP user like direct I/O. Many existing user of GUP already do
> listen to mmu notifier and already behave properly. It is just about
> making every body list to that. Then we can even add the mmu notifier
> pointer as argument to GUP just to make sure no new user of GUP forget
> about registering a notifier (argument as a teaching guide not as a
> something actively use).
> 
> 
> So does that sounds like a plan to solve your concern with long term
> GUP user ? This does not depend on DAX or anything it would apply to
> any file back pages.
> 
> 
> Cheers,
> Jérôme
> 

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:46                             ` Dave Chinner
@ 2018-12-12 21:59                               ` Jerome Glisse
  2018-12-13  0:51                                 ` Dave Chinner
  2018-12-14 15:43                               ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 21:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > So this approach doesn't look like a win to me over using counter in struct
> > > page and I'd rather try looking into squeezing HMM public page usage of
> > > struct page so that we can fit that gup counter there as well. I know that
> > > it may be easier said than done...
> > 
> > So i want back to the drawing board and first i would like to ascertain
> > that we all agree on what the objectives are:
> > 
> >     [O1] Avoid write back from a page still being written by either a
> >          device or some direct I/O or any other existing user of GUP.
> >          This would avoid possible file system corruption.
> > 
> >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> >          considered clean by core mm (buffer head have been remove and
> >          with some file system this turns into an ugly mess).
> 
> I think that's wrong. This isn't an "avoid a crash" case, this is a
> "prevent data and/or filesystem corruption" case. The primary goal
> we have here is removing our exposure to potential corruption, which
> has the secondary effect of avoiding the crash/panics that currently
> occur as a result of inconsistent page/filesystem state.

This is O1; avoiding corruption is O1.

> 
> i.e. The goal is to have ->page_mkwrite() called on the clean page
> /before/ the file-backed page is marked dirty, and hence we don't
> expose ourselves to potential corruption or crashes that are a
> result of inappropriately calling set_page_dirty() on clean
> file-backed pages.

Yes, and this would be handled by put_user_page, i.e.:

put_user_page(struct page *page, bool dirty)
{
    if (!PageAnon(page)) {
        if (dirty) {
            // Do the whole dance ie page_mkwrite and all before
            // calling set_page_dirty()
        }
        ...
    }
    ...
}
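
The ordering in the outline above can be made concrete with a small
userspace model. This is a sketch under stated assumptions, not the
kernel implementation: struct page here is a toy, and
model_page_mkwrite(), model_set_page_dirty() and model_put_user_page()
are hypothetical stand-ins for the real operations. Dirtying anonymous
pages directly is also an assumption of the sketch; the outline leaves
that path elided.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state, just enough to show the ordering. */
struct page {
    bool anon;         /* anonymous memory, no filesystem involved */
    bool fs_writable;  /* filesystem was told the page will be written */
    bool dirty;
    int  pin_count;    /* pins taken by a GUP-like call */
};

/* Stand-in for the filesystem's ->page_mkwrite() step. */
static void model_page_mkwrite(struct page *page)
{
    page->fs_writable = true;
}

/* Stand-in for set_page_dirty(); checks the ordering requirement. */
static void model_set_page_dirty(struct page *page)
{
    /* A file-backed page must be prepared before it is dirtied. */
    assert(page->anon || page->fs_writable);
    page->dirty = true;
}

/* Release one pin, dirtying the page first if the caller wrote to it. */
static void model_put_user_page(struct page *page, bool dirty)
{
    if (dirty) {
        if (!page->anon)
            model_page_mkwrite(page);  /* "the whole dance" comes first */
        model_set_page_dirty(page);
    }
    assert(page->pin_count > 0);
    page->pin_count--;
}
```

The assertion in model_set_page_dirty() encodes the requirement stated
earlier in the thread: the filesystem is notified before a clean
file-backed page is marked dirty.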

> 
> > For [O1] and [O2] i believe a solution with mapcount would work. So
> > no new struct, no fake vma, nothing like that. In GUP for file back
> > pages we increment both refcount and mapcount (we also need a special
> > put_user_page to decrement mapcount when GUP user are done with the
> > page).
> 
> I don't see how a mapcount can prevent anyone from calling
> set_page_dirty() inappropriately.

See above.

> 
> > Now for [O1] the write back have to call page_mkclean() to go through
> > all reverse mapping of the page and map read only. This means that
> > we can count the number of real mapping and see if the mapcount is
> > bigger than that. If mapcount is bigger than page is pin and we need
> > to use a bounce page to do the writeback.
> 
> Doesn't work. Generally filesystems have already mapped the page
> into bios before they call clear_page_dirty_for_io(), so it's too
> late for the filesystem to bounce the page at that point.
> 
> > For [O2] i believe we can handle that case in the put_user_page()
> > function to properly dirty the page without causing filesystem
> > freak out.
> 
> I'm pretty sure you can't call ->page_mkwrite() from
> put_user_page(), so I don't think this is workable at all.

Huh, why? I cannot think of any reason why you could not. Users of
GUP have their put_user_page() calls in teardown code, and I do not see
why it would be an issue there. Even for direct I/O I cannot think
of anything that would block us from doing that. So put_user_page()
is not called while holding any other mm or fs locks.

Do you have some rough idea of what the issue would be?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:56                                 ` John Hubbard
@ 2018-12-12 22:04                                   ` Jerome Glisse
  2018-12-12 22:11                                     ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 22:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 01:56:00PM -0800, John Hubbard wrote:
> On 12/12/18 1:30 PM, Jerome Glisse wrote:
> > On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> >> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >>>
> >>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> >>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> >>>>> Another crazy idea, why not treating GUP as another mapping of the page
> >>>>> and caller of GUP would have to provide either a fake anon_vma struct or
> >>>>> a fake vma struct (or both for PRIVATE mapping of a file where you can
> >>>>> have a mix of both private and file page thus only if it is a read only
> >>>>> GUP) that would get added to the list of existing mapping.
> >>>>>
> >>>>> So the flow would be:
> >>>>>     somefunction_thatuse_gup()
> >>>>>     {
> >>>>>         ...
> >>>>>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> >>>>>         ...
> >>>>>     }
> >>>>>
> >>>>>     GUP(vma, ..., fake_anon, fake_vma)
> >>>>>     {
> >>>>>         if (vma->flags == ANON) {
> >>>>>             // Add the fake anon vma to the anon vma chain as a child
> >>>>>             // of current vma
> >>>>>         } else {
> >>>>>             // Add the fake vma to the mapping tree
> >>>>>         }
> >>>>>
> >>>>>         // The existing GUP except that now it inc mapcount and not
> >>>>>         // refcount
> >>>>>         GUP_old(..., &nanonymous, &nfiles);
> >>>>>
> >>>>>         atomic_add(&fake_anon->refcount, nanonymous);
> >>>>>         atomic_add(&fake_vma->refcount, nfiles);
> >>>>>
> >>>>>         return nanonymous + nfiles;
> >>>>>     }
> >>>>
> >>>> Thanks for your idea! This is actually something like I was suggesting back
> >>>> at LSF/MM in Deer Valley. There were two downsides to this I remember
> >>>> people pointing out:
> >>>>
> >>>> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
> >>>> to get necessary locks to insert new entry into the VMA tree in that
> >>>> context. So essentially we'd loose get_user_pages_fast() functionality.
> >>>>
> >>>> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
> >>>> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
> >>>> Then on IO completion you need to queue work to unpin the pages again as you
> >>>> cannot remove the fake VMA directly from interrupt context where the IO is
> >>>> completed.
> >>>>
> >>>> You are right that the cost could be amortized if gup() is called for
> >>>> multiple consecutive pages however for small IOs there's no help...
> >>>>
> >>>> So this approach doesn't look like a win to me over using counter in struct
> >>>> page and I'd rather try looking into squeezing HMM public page usage of
> >>>> struct page so that we can fit that gup counter there as well. I know that
> >>>> it may be easier said than done...
> >>>
> >>> So i want back to the drawing board and first i would like to ascertain
> >>> that we all agree on what the objectives are:
> >>>
> >>>     [O1] Avoid write back from a page still being written by either a
> >>>          device or some direct I/O or any other existing user of GUP.
> >>>          This would avoid possible file system corruption.
> >>>
> >>>     [O2] Avoid crash when set_page_dirty() is call on a page that is
> >>>          considered clean by core mm (buffer head have been remove and
> >>>          with some file system this turns into an ugly mess).
> >>>
> >>>     [O3] DAX and the device block problems, ie with DAX the page map in
> >>>          userspace is the same as the block (persistent memory) and no
> >>>          filesystem nor block device understand page as block or pinned
> >>>          block.
> >>>
> >>> For [O3] i don't think any pin count would help in anyway. I believe
> >>> that the current long term GUP API that does not allow GUP of DAX is
> >>> the only sane solution for now.
> >>
> >> No, that's not a sane solution, it's an emergency hack.
> >>
> >>> The real fix would be to teach file-
> >>> system about DAX/pinned block so that a pinned block is not reuse
> >>> by filesystem.
> >>
> >> We already have taught filesystems about pinned dax pages, see
> >> dax_layout_busy_page(). As much as possible I want to eliminate the
> >> concept of "dax pages" as a special case that gets sprinkled
> >> throughout the mm.
> > 
> > So thinking on O3 issues what about leveraging the recent change i
> > did to mmu notifier. Add a event for truncate or any other file
> > event that need to invalidate the file->page for a range of offset.
> > 
> > Add mmu notifier listener to GUP user (except direct I/O) so that
> > they invalidate they hardware mapping or switch the hardware mapping
> > to use a crappy page. When such event happens what ever user do to
> > the page through that driver is broken anyway. So it is better to
> > be loud about it then trying to make it pass under the radar.
> > 
> > This will put the burden on broken user and allow you to properly
> > recycle your DAX page.
> > 
> > Think of it as revoke through mmu notifier.
> > 
> > So patchset would be:
> >     enum mmu_notifier_event {
> > +       MMU_NOTIFY_TRUNCATE,
> >     };
> > 
> > +   Change truncate code path to emit MMU_NOTIFY_TRUNCATE
> > 
> 
> That part looks good.
> 
> > Then for each user of GUP (except direct I/O or other very short
> > term GUP):
> 
> but, why is there a difference between how we handle long- and
> short-term callers? Aren't we just leaving a harder-to-reproduce race
> condition, if we ignore the short-term gup callers?
> 
> So, how does activity (including direct IO and other short-term callers)
> get quiesced (stopped, and guaranteed not to restart or continue), so 
> that truncate or umount can continue on?

The fs would delay block reuse until after the refcount is gone, so it
would wait for that. It is OK to do that only for short-term users. In the
case of direct I/O this should really not happen, as it means that the
application is doing something really stupid. So waiting on short-term
users would be a rare event.
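
A minimal userspace sketch of "delay block reuse until the refcount is
gone", assuming a simple per-block pin counter. The toy_block type and its
helpers are hypothetical and only model the rule; a real filesystem would
sleep rather than poll state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a storage block that short-term GUP users may pin. */
struct toy_block {
    int pins;       /* outstanding short-term references */
    bool truncated; /* the file no longer owns this block */
    bool reusable;  /* the filesystem may hand the block out again */
};

/* The block becomes reusable only once truncated AND unpinned. */
static void toy_try_release(struct toy_block *b)
{
    if (b->truncated && b->pins == 0)
        b->reusable = true;
}

static void toy_pin(struct toy_block *b)
{
    b->pins++;
}

static void toy_unpin(struct toy_block *b)
{
    assert(b->pins > 0);
    b->pins--;
    toy_try_release(b);
}

static void toy_truncate(struct toy_block *b)
{
    b->truncated = true;
    toy_try_release(b);
}
```

Truncating a pinned block leaves it non-reusable; the last toy_unpin() is
what finally releases it.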


> >     Patch 1: register mmu notifier
> >     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
> >              when that happens update the device page table or
> >              usage to point to a crappy page and do put_user_page
> >              on all previously held page
> 
> Minor point, this sequence should be done within a wrapper around existing 
> get_user_pages(), such as get_user_pages_revokable() or something.

No, we want to teach everyone to abide by the rules. If we add yet another
GUP function prototype, people will use the one where they don't have to
say they abide by the rules. It is time we advertise the fact that GUP
should not be used willy-nilly for anything without worrying about the
implications it has :)

So I would rather see a consolidation in the number of GUP prototypes we
have than yet another one.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:53                                   ` Jerome Glisse
@ 2018-12-12 22:11                                     ` Matthew Wilcox
  2018-12-12 22:16                                       ` Jerome Glisse
  2018-12-12 23:37                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-12 22:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, John Hubbard, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> The mmu notifier i put forward is the emergency revoke ie last
> resort after driver have done everything it could to inform user-
> space and release the pages. So doing thing brutaly in it like
> reprogramming driver page table (which AFAIK is something you
> can do on any hardware wether the hardware will like it or not
> is a different question).

You can't do it to an NVMe device.  You submit the DMA addresses in
the command, and the device reads the command at submission time.
There's no way to change the DMA addresses for an in-flight command.
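
A toy illustration of that constraint, not a model of the NVMe protocol:
the hypothetical toy_device takes its own copy of the command at
submission, so later changes on the host side do not reach the in-flight
command.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command carrying the DMA target address. */
struct toy_cmd {
    uint64_t dma_addr;
};

/* Hypothetical device that copies the command when it is submitted. */
struct toy_device {
    struct toy_cmd inflight;
    int busy;
};

static void toy_submit(struct toy_device *dev, const struct toy_cmd *cmd)
{
    dev->inflight = *cmd; /* the device now works from its own copy */
    dev->busy = 1;
}
```

Rewriting cmd.dma_addr after toy_submit() leaves dev.inflight.dma_addr
unchanged, which is why redirecting to another page cannot help here.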

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 22:04                                   ` Jerome Glisse
@ 2018-12-12 22:11                                     ` John Hubbard
  2018-12-12 22:14                                       ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-12 22:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/12/18 2:04 PM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 01:56:00PM -0800, John Hubbard wrote:
>> On 12/12/18 1:30 PM, Jerome Glisse wrote:
>>> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>>>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>>>>>
>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
[...]
>>
>>> Then for each user of GUP (except direct I/O or other very short
>>> term GUP):
>>
>> but, why is there a difference between how we handle long- and
>> short-term callers? Aren't we just leaving a harder-to-reproduce race
>> condition, if we ignore the short-term gup callers?
>>
>> So, how does activity (including direct IO and other short-term callers)
>> get quiesced (stopped, and guaranteed not to restart or continue), so 
>> that truncate or umount can continue on?
> 
> The fs would delay block reuse to after refcount is gone so it would
> wait for that. It is ok to do that only for short term user in case of
> direct I/O this should really not happen as it means that the application
> is doing something really stupid. So the waiting on short term user
> would be a rare event.

OK, I think that sounds like there are no race conditions left.

> 
> 
>>>     Patch 1: register mmu notifier
>>>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
>>>              when that happens update the device page table or
>>>              usage to point to a crappy page and do put_user_page
>>>              on all previously held page
>>
>> Minor point, this sequence should be done within a wrapper around existing 
>> get_user_pages(), such as get_user_pages_revokable() or something.
> 
> No we want to teach everyone to abide by the rules, if we add yet another
> GUP function prototype people will use the one where they don;t have to
> say they abide by the rules. It is time we advertise the fact that GUP
> should not be use willy nilly for anything without worrying about the
> implication it has :)

Well, the best way to do that is to provide a named function call that 
implements the rules. That also makes it easy to grep around and see which
call sites still need upgrades, and which don't.

> 
> So i would rather see a consolidation in the number of GUP prototype we
> have than yet another one.

We could eventually get rid of the older GUP prototypes, once we're done
converting. Having a new, named function call will *without question* make
the call site conversion go much easier, and the end result is also better:
the common code is in a central function, rather than being at all the call
sites.
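
A rough userspace sketch of what such a central, named wrapper could look
like. Everything here is hypothetical: toy_get_user_pages() stands in for
the existing GUP call and toy_register_revoke_notifier() for the mmu
notifier registration; only the shape (one function that applies the rules
for every caller) is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-caller state tracked by the wrapper. */
struct toy_pin_ctx {
    int notifier_registered;
    size_t npinned;
};

/* Stand-in for the existing GUP call: pretends every page was pinned. */
static long toy_get_user_pages(size_t npages)
{
    return (long)npages;
}

/* Stand-in for registering the revoke (mmu notifier) callback. */
static int toy_register_revoke_notifier(struct toy_pin_ctx *ctx)
{
    ctx->notifier_registered = 1;
    return 0;
}

/* The common rules live here instead of being repeated at call sites. */
static long toy_get_user_pages_revokable(struct toy_pin_ctx *ctx,
                                         size_t npages)
{
    long ret = toy_register_revoke_notifier(ctx);

    if (ret)
        return ret;

    ret = toy_get_user_pages(npages);
    if (ret > 0)
        ctx->npinned += (size_t)ret;
    return ret;
}
```

Call sites that have been converted are then easy to find by grepping for
the wrapper's name.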

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 22:11                                     ` John Hubbard
@ 2018-12-12 22:14                                       ` Jerome Glisse
  2018-12-12 22:17                                         ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 22:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 02:11:58PM -0800, John Hubbard wrote:
> On 12/12/18 2:04 PM, Jerome Glisse wrote:
> > On Wed, Dec 12, 2018 at 01:56:00PM -0800, John Hubbard wrote:
> >> On 12/12/18 1:30 PM, Jerome Glisse wrote:
> >>> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> >>>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> >>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
[...]
> >>
> >>> Then for each user of GUP (except direct I/O or other very short
> >>> term GUP):
> >>
> >> but, why is there a difference between how we handle long- and
> >> short-term callers? Aren't we just leaving a harder-to-reproduce race
> >> condition, if we ignore the short-term gup callers?
> >>
> >> So, how does activity (including direct IO and other short-term callers)
> >> get quiesced (stopped, and guaranteed not to restart or continue), so 
> >> that truncate or umount can continue on?
> > 
> > The fs would delay block reuse to after refcount is gone so it would
> > wait for that. It is ok to do that only for short term user in case of
> > direct I/O this should really not happen as it means that the application
> > is doing something really stupid. So the waiting on short term user
> > would be a rare event.
> 
> OK, I think that sounds like there are no race conditions left.
> 
> > 
> > 
> >>>     Patch 1: register mmu notifier
> >>>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
> >>>              when that happens update the device page table or
> >>>              usage to point to a crappy page and do put_user_page
> >>>              on all previously held page
> >>
> >> Minor point, this sequence should be done within a wrapper around existing 
> >> get_user_pages(), such as get_user_pages_revokable() or something.
> > 
> > No we want to teach everyone to abide by the rules, if we add yet another
> > GUP function prototype people will use the one where they don;t have to
> > say they abide by the rules. It is time we advertise the fact that GUP
> > should not be use willy nilly for anything without worrying about the
> > implication it has :)
> 
> Well, the best way to do that is to provide a named function call that 
> implements the rules. That also makes it easy to grep around and see which
> call sites still need upgrades, and which don't.
> 
> > 
> > So i would rather see a consolidation in the number of GUP prototype we
> > have than yet another one.
> 
> We could eventually get rid of the older GUP prototypes, once we're done
> converting. Having a new, named function call will *without question* make
> the call site conversion go much easier, and the end result is also better:
> the common code is in a central function, rather than being at all the call
> sites.
> 

Then the last patch in the patchset must remove all GUP prototypes except
the ones with the right API :)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 22:11                                     ` Matthew Wilcox
@ 2018-12-12 22:16                                       ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-12 22:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dan Williams, Jan Kara, John Hubbard, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 02:11:57PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > The mmu notifier i put forward is the emergency revoke ie last
> > resort after driver have done everything it could to inform user-
> > space and release the pages. So doing thing brutaly in it like
> > reprogramming driver page table (which AFAIK is something you
> > can do on any hardware wether the hardware will like it or not
> > is a different question).
> 
> You can't do it to an NVMe device.  You submit the DMA addresses in
> the command, and the device reads the command at submission time.
> There's no way to change the DMA addresses for an in-flight command.

But like for GPUs, you can wait for in-flight commands, right? I.e.
you can wait for the queue to be done. This is how GPUs do GUP, i.e.
GPUs submit commands to a queue and wait in the mmu notifier for the
queue to be done.
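
A minimal sketch of that "drain, then release" pattern, assuming a queue
with simple submitted/completed counters. The toy_queue type and
toy_revoke() are hypothetical; in a real driver the loop would be a
blocking wait on hardware completion, not a direct call.

```c
#include <assert.h>

/* Toy command queue: counts submitted and completed commands. */
struct toy_queue {
    unsigned int submitted;
    unsigned int completed;
    int pinned_pages; /* pages still held on behalf of the queue */
};

static void toy_submit(struct toy_queue *q)
{
    q->submitted++;
}

/* Stand-in for the hardware finishing one command. */
static void toy_complete_one(struct toy_queue *q)
{
    assert(q->completed < q->submitted);
    q->completed++;
}

/*
 * Stand-in for the notifier callback: in-flight commands are not
 * rewritten, they are drained, and only then are the pages released.
 */
static void toy_revoke(struct toy_queue *q)
{
    while (q->completed < q->submitted)
        toy_complete_one(q);
    q->pinned_pages = 0;
}
```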

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 22:14                                       ` Jerome Glisse
@ 2018-12-12 22:17                                         ` John Hubbard
  0 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-12 22:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/12/18 2:14 PM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 02:11:58PM -0800, John Hubbard wrote:
>> On 12/12/18 2:04 PM, Jerome Glisse wrote:
>>> On Wed, Dec 12, 2018 at 01:56:00PM -0800, John Hubbard wrote:
>>>> On 12/12/18 1:30 PM, Jerome Glisse wrote:
>>>>> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>>>>>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>>>>>>>
>>>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
[...]
>>>
>>>>>     Patch 1: register mmu notifier
>>>>>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
>>>>>              when that happens update the device page table or
>>>>>              usage to point to a crappy page and do put_user_page
>>>>>              on all previously held page
>>>>
>>>> Minor point, this sequence should be done within a wrapper around existing 
>>>> get_user_pages(), such as get_user_pages_revokable() or something.
>>>
>>> No we want to teach everyone to abide by the rules, if we add yet another
>>> GUP function prototype people will use the one where they don;t have to
>>> say they abide by the rules. It is time we advertise the fact that GUP
>>> should not be use willy nilly for anything without worrying about the
>>> implication it has :)
>>
>> Well, the best way to do that is to provide a named function call that 
>> implements the rules. That also makes it easy to grep around and see which
>> call sites still need upgrades, and which don't.
>>
>>>
>>> So i would rather see a consolidation in the number of GUP prototype we
>>> have than yet another one.
>>
>> We could eventually get rid of the older GUP prototypes, once we're done
>> converting. Having a new, named function call will *without question* make
>> the call site conversion go much easier, and the end result is also better:
>> the common code is in a central function, rather than being at all the call
>> sites.
>>
> 
> Then last patch in the patchset must remove all GUP prototype except
> ones with the right API :)
> 

Yes, exactly.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:53                                   ` Jerome Glisse
  2018-12-12 22:11                                     ` Matthew Wilcox
@ 2018-12-12 23:37                                     ` Jason Gunthorpe
  2018-12-12 23:46                                       ` John Hubbard
                                                         ` (2 more replies)
  1 sibling, 3 replies; 206+ messages in thread
From: Jason Gunthorpe @ 2018-12-12 23:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > Almost, we need some safety around assuming that DMA is complete the
> > page, so the notification would need to go all to way to userspace
> > with something like a file lease notification. It would also need to
> > be backstopped by an IOMMU in the case where the hardware does not /
> > can not stop in-flight DMA.
> 
> You can always reprogram the hardware right away it will redirect
> any dma to the crappy page.

That causes silent data corruption for RDMA users - we can't do that.

The only way out for current hardware is to forcibly terminate the
RDMA activity somehow (and I'm not even sure this is possible, at
least it would be driver specific)

Even the IOMMU idea probably doesn't work, I doubt all current
hardware can handle a PCI-E error TLP properly. 

On some hardware it probably just protects DAX by causing data
corruption for RDMA - I fail to see how that is a win for system
stability if the user obviously wants to use DAX and RDMA together...

I think your approach with ODP only is the only one that meets your
requirements, the only other data-integrity-preserving approach is to
block/fail ftruncate/etc.

> From my point of view driver should listen to ftruncate before the
> mmu notifier kicks in and send event to userspace and maybe wait
> and block ftruncate (or move it to a worker thread).

We can do this, but we can't guarantee forward progress in userspace,
and the best way to cancel that we have which is portable to all RDMA
hardware is to kill the process(es).

So if that is acceptable then we could use user notifiers and allow
non-ODP users...

Jason


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 23:37                                     ` Jason Gunthorpe
@ 2018-12-12 23:46                                       ` John Hubbard
  2018-12-12 23:54                                       ` Dan Williams
  2018-12-13  0:01                                       ` Jerome Glisse
  2 siblings, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-12 23:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse
  Cc: Dan Williams, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On 12/12/18 3:37 PM, Jason Gunthorpe wrote:
> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
>>> Almost, we need some safety around assuming that DMA is complete the
>>> page, so the notification would need to go all to way to userspace
>>> with something like a file lease notification. It would also need to
>>> be backstopped by an IOMMU in the case where the hardware does not /
>>> can not stop in-flight DMA.
>>
>> You can always reprogram the hardware right away it will redirect
>> any dma to the crappy page.
> 
> That causes silent data corruption for RDMA users - we can't do that.
> 
> The only way out for current hardware is to forcibly terminate the
> RDMA activity somehow (and I'm not even sure this is possible, at
> least it would be driver specific)
> 
> Even the IOMMU idea probably doesn't work, I doubt all current
> hardware can handle a PCI-E error TLP properly. 

Very true.

> 
> On some hardware it probably just protects DAX by causing data
> corruption for RDMA - I fail to see how that is a win for system
> stability if the user obviously wants to use DAX and RDMA together...
> 
> I think your approach with ODP only is the only one that meets your
> requirements, the only other data-integrity-preserving approach is to
> block/fail ftruncate/etc.
> 
>> From my point of view driver should listen to ftruncate before the
>> mmu notifier kicks in and send event to userspace and maybe wait
>> and block ftruncate (or move it to a worker thread).
> 
> We can do this, but we can't guarantee forward progress in userspace
> and the best way we have to cancel that is portable to all RDMA
> hardware is to kill the process(es)..
> 
> So if that is acceptable then we could use user notifiers and allow
> non-ODP users...
> 

That is exactly the conclusion that some of us in the GPU world reached as
well, when chatting about how this would have to work, even on modern GPU 
hardware that can replay page faults, in many cases. 

I think as long as we specify that the acceptable consequence of doing, say,
umount on a filesystem that has active DMA happening is that the associated
processes get killed, then we're going to be OK.

What would worry me is if there was an expectation that processes could
continue working properly after such a scenario.


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 23:37                                     ` Jason Gunthorpe
  2018-12-12 23:46                                       ` John Hubbard
@ 2018-12-12 23:54                                       ` Dan Williams
  2018-12-13  0:01                                       ` Jerome Glisse
  2 siblings, 0 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-12 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jérôme Glisse, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 3:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > Almost, we need some safety around assuming that DMA is complete the
> > > page, so the notification would need to go all to way to userspace
> > > with something like a file lease notification. It would also need to
> > > be backstopped by an IOMMU in the case where the hardware does not /
> > > can not stop in-flight DMA.
> >
> > You can always reprogram the hardware right away it will redirect
> > any dma to the crappy page.
>
> That causes silent data corruption for RDMA users - we can't do that.
>
> The only way out for current hardware is to forcibly terminate the
> RDMA activity somehow (and I'm not even sure this is possible, at
> least it would be driver specific)
>
> Even the IOMMU idea probably doesn't work, I doubt all current
> hardware can handle a PCI-E error TLP properly.

My thinking here is that we would at least have the infrastructure for
userspace to opt-in to getting the callback, the threat of an IOMMU
forcibly tearing down mappings, and likely some identification for
pages that are revocable. With "long term" pins I would hope to move
any detection of incompatibility to the memory registration phase
rather than something unacceptable like injecting random truncate
failures.
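
To illustrate what checking at registration time could look like, here
is a rough userspace model (not kernel code). Every name in it
(reg_caps, reg_request, register_memory) and the choice of -EOPNOTSUPP
are assumptions made up for the sketch, not an existing API:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of what the device/driver can do. */
struct reg_caps {
	bool supports_revocation;	/* can give pages back on request */
};

/* Hypothetical description of one memory registration request. */
struct reg_request {
	bool long_term;			/* pin outlives a single I/O */
	bool file_backed;		/* pages belong to a filesystem/DAX file */
};

/*
 * Decide at registration time whether the pin is acceptable, so that
 * nothing has to fail later in the truncate path.
 * Returns 0 on success or a negative errno.
 */
int register_memory(const struct reg_caps *caps, const struct reg_request *req)
{
	assert(caps && req);

	/* Transient pins and non-file memory are not the problem case. */
	if (!req->long_term || !req->file_backed)
		return 0;

	/* Long-term pin of file pages: only allowed if it can be revoked. */
	if (!caps->supports_revocation)
		return -EOPNOTSUPP;

	return 0;
}
```

The point of the shape is only that the refusal happens once, up front,
where the application can handle it, instead of surfacing later as a
truncate failure.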


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 23:37                                     ` Jason Gunthorpe
  2018-12-12 23:46                                       ` John Hubbard
  2018-12-12 23:54                                       ` Dan Williams
@ 2018-12-13  0:01                                       ` Jerome Glisse
  2018-12-13  0:18                                         ` Dan Williams
  2018-12-13  3:20                                         ` Jason Gunthorpe
  2 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13  0:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > Almost, we need some safety around assuming that DMA is complete the
> > > page, so the notification would need to go all to way to userspace
> > > with something like a file lease notification. It would also need to
> > > be backstopped by an IOMMU in the case where the hardware does not /
> > > can not stop in-flight DMA.
> > 
> > You can always reprogram the hardware right away it will redirect
> > any dma to the crappy page.
> 
> That causes silent data corruption for RDMA users - we can't do that.
> 
> The only way out for current hardware is to forcibly terminate the
> RDMA activity somehow (and I'm not even sure this is possible, at
> least it would be driver specific)
> 
> Even the IOMMU idea probably doesn't work, I doubt all current
> hardware can handle a PCI-E error TLP properly. 

What I am saying is: reprogram the hardware to point at a crappy page,
i.e. a valid, DMA-mapped page that just has random content, as a last
resort to allow the filesystem to reuse the block. So there should be no
PCIe error unless the hardware freaks out on seeing its page table
reprogrammed at random.

> 
> On some hardware it probably just protects DAX by causing data
> corruption for RDMA - I fail to see how that is a win for system
> stability if the user obviously wants to use DAX and RDMA together...

The question is who do you want to punish? The RDMA user who pins stuff
and expects things to work forever without worrying about other fs
activity? Or the filesystem, which has to keep blocks pinned forever :)
I am not going to take sides here, but I don't think we can please both
sides: one will have to be mean to its user, i.e. either the RDMA user
or the filesystem, which also percolates to being mean to the end user.

> I think your approach with ODP only is the only one that meets your
> requirements, the only other data-integrity-preserving approach is to
> block/fail ftruncate/etc.

> 
> > From my point of view driver should listen to ftruncate before the
> > mmu notifier kicks in and send event to userspace and maybe wait
> > and block ftruncate (or move it to a worker thread).
> 
> We can do this, but we can't guarantee forward progress in userspace
> and the best way we have to cancel that is portable to all RDMA
> hardware is to kill the process(es)..
> 
> So if that is acceptable then we could use user notifiers and allow
> non-ODP users...

Yes, ODP with listening to _all_ mmu notifier events is the only
sane way. But for hardware not capable of doing that (GPUs are
capable, so is mlx5; I won't list the bad ones), we either keep
the status quo, i.e. today's behavior, or we do something that is
mean either to the RDMA user or to the filesystem. And in previous
discussions failing ftruncate was a no-no, I can't remember why.
In any case I am personally fine with whichever of these:
    S1: keep the block pinned until RDMA goes away, even if it
        means that the RDMA user is no longer really accessing
        anything that makes sense (i.e. the page is no longer part
        of the file or the original vma, so at this point it is
        fully disconnected from the original intent). This is
        today's status quo: we pin the block and annoy the
        filesystem while we pretend that everything is fine.
    S2: notify the userspace program through a device/sub-system
        specific API and delay ftruncate. After a while, if there
        is no answer, just be mean and force the hardware to use a
        crappy page, as this is what happens today anyway (note
        we can fully mirror today's behavior by allocating pages,
        copying the existing content there, and then switching
        the hardware over to point to those pages).
    S3: be mean to the filesystem and keep the block pinned for as
        long as there are active GUPs; this means failing ftruncate
        and/or possibly munmap().

S3 can be split into sub-choices. Do we want to take a vote? Or
is there a way that can please everyone?
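
To make the difference between S1, S2 and S3 concrete, here is a small
userspace model of what truncate would see under each option. It is
only a sketch: the enum and struct names, the timeout field and the
outcome values are all hypothetical and exist nowhere in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three options above. */
enum pin_policy {
	POLICY_S1_KEEP_PINNED,
	POLICY_S2_NOTIFY_THEN_REDIRECT,
	POLICY_S3_FAIL_TRUNCATE,
};

enum trunc_outcome {
	TRUNC_DONE_BLOCK_FREED,		/* truncate done, fs can reuse the block */
	TRUNC_DONE_BLOCK_PINNED,	/* truncate done, block still held */
	TRUNC_DELAYED,			/* still waiting for userspace to answer */
	TRUNC_FAILED,			/* truncate refused */
};

struct pinned_block {
	bool gup_active;		/* a GUP user still holds the page */
	bool user_released;		/* userspace answered the notification */
	unsigned int waited_ms;		/* how long truncate has waited so far */
	unsigned int timeout_ms;	/* S2 only: when to stop being nice */
	bool uses_crappy_page;		/* hardware was redirected to a scratch page */
};

enum trunc_outcome truncate_block(enum pin_policy policy,
				  struct pinned_block *b)
{
	assert(b);

	/* Nobody holds the page: every policy behaves the same. */
	if (!b->gup_active)
		return TRUNC_DONE_BLOCK_FREED;

	switch (policy) {
	case POLICY_S1_KEEP_PINNED:
		return TRUNC_DONE_BLOCK_PINNED;
	case POLICY_S2_NOTIFY_THEN_REDIRECT:
		if (!b->user_released && b->waited_ms < b->timeout_ms)
			return TRUNC_DELAYED;
		if (!b->user_released)
			b->uses_crappy_page = true;	/* last resort */
		b->gup_active = false;
		return TRUNC_DONE_BLOCK_FREED;
	case POLICY_S3_FAIL_TRUNCATE:
		return TRUNC_FAILED;
	}
	return TRUNC_FAILED;
}
```

Seen this way, S1 and S3 never free the block while the pin exists, and
S2 is the only one that eventually does, at the cost of the device
losing its view of the file.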

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:01                                       ` Jerome Glisse
@ 2018-12-13  0:18                                         ` Dan Williams
  2018-12-13  0:44                                           ` Jerome Glisse
  2018-12-13  3:20                                         ` Jason Gunthorpe
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-13  0:18 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Jason Gunthorpe, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 4:01 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > Almost, we need some safety around assuming that DMA is complete the
> > > > page, so the notification would need to go all to way to userspace
> > > > with something like a file lease notification. It would also need to
> > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > can not stop in-flight DMA.
> > >
> > > You can always reprogram the hardware right away it will redirect
> > > any dma to the crappy page.
> >
> > That causes silent data corruption for RDMA users - we can't do that.
> >
> > The only way out for current hardware is to forcibly terminate the
> > RDMA activity somehow (and I'm not even sure this is possible, at
> > least it would be driver specific)
> >
> > Even the IOMMU idea probably doesn't work, I doubt all current
> > hardware can handle a PCI-E error TLP properly.
>
> What i saying is reprogram hardware to crappy page ie valid page
> dma map but that just has random content as a last resort to allow
> filesystem to reuse block. So their should be no PCIE error unless
> hardware freak out to see its page table reprogram randomly.

Hardware has a hard enough time stopping I/O to an existing page, let
alone switching to a new one in the middle of a transaction. This is a
non-starter, but it's also a non-concern because the bulk of DMA is
transient. For non-transient DMA there is usually a registration
phase where the capability to support revocation can be validated.


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:18                                         ` Dan Williams
@ 2018-12-13  0:44                                           ` Jerome Glisse
  2018-12-13  3:26                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13  0:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 04:18:33PM -0800, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 4:01 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > > Almost, we need some safety around assuming that DMA is complete the
> > > > > page, so the notification would need to go all to way to userspace
> > > > > with something like a file lease notification. It would also need to
> > > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > > can not stop in-flight DMA.
> > > >
> > > > You can always reprogram the hardware right away it will redirect
> > > > any dma to the crappy page.
> > >
> > > That causes silent data corruption for RDMA users - we can't do that.
> > >
> > > The only way out for current hardware is to forcibly terminate the
> > > RDMA activity somehow (and I'm not even sure this is possible, at
> > > least it would be driver specific)
> > >
> > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > hardware can handle a PCI-E error TLP properly.
> >
> > What i saying is reprogram hardware to crappy page ie valid page
> > dma map but that just has random content as a last resort to allow
> > filesystem to reuse block. So their should be no PCIE error unless
> > hardware freak out to see its page table reprogram randomly.
> 
> Hardware has a hard enough time stopping I/O to existing page let
> alone switching to a new one in the middle of a transaction. This is a
> non-starter, but it's also a non-concern because the bulk of DMA is
> transient. For non-transient DMA there is a usually a registration
> phase where the capability to support revocation can be validated,

On many GPUs you can do that. It is hardware dependent and there are
steps to take, but it is something you can do (and GPUs can generate
continuous DMA traffic, as they have threads running that can
do continuous memory access). So I assume that other hardware
can do it too.

Any revocation mechanism is going to be device/sub-system specific, so
it would probably be better to talk case by case and see what we can
do. Like I said, I posted patches to remove GUP from GPU drivers; I
am working on improving some core code to make those patches even
simpler, and I will keep pushing for that in the subsystems I know.

Maybe we should grep for GUP in drivers/ and start a discussion within
each sub-system to see what can be done in each. If any common
patterns emerge, we can draw up common plans for those at least.

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:59                               ` Jerome Glisse
@ 2018-12-13  0:51                                 ` Dave Chinner
  2018-12-13  2:02                                   ` Jerome Glisse
  2018-12-14  3:52                                   ` John Hubbard
  0 siblings, 2 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-13  0:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > So this approach doesn't look like a win to me over using counter in struct
> > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > struct page so that we can fit that gup counter there as well. I know that
> > > > it may be easier said than done...
> > > 
> > > So i want back to the drawing board and first i would like to ascertain
> > > that we all agree on what the objectives are:
> > > 
> > >     [O1] Avoid write back from a page still being written by either a
> > >          device or some direct I/O or any other existing user of GUP.

IOWs, you need to mark pages being written to by a GUP as
PageWriteback, so all attempts to write the page will block on
wait_on_page_writeback() before trying to write the dirty page.

> > >          This would avoid possible file system corruption.

This isn't a filesystem corruption vector. At worst, it could cause
torn data writes due to updating the page while it is under IO. We
have a name for this: "stable pages". This is designed to prevent
updates to pages via mmap writes from causing corruption of things
like MD RAID due to modification of the data during RAID parity
calculations. Hence we have wait_for_stable_page() calls in all
->page_mkwrite implementations so that new mmap writes block until
writeback IO is complete on the devices that require stable pages
to prevent corruption.

IOWs, we already deal with this "delay new modification while
writeback is in progress" problem in the mmap/filesystem world and
have infrastructure to handle it. And the ->page_mkwrite code
already deals with it.

> > > 
> > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > >          considered clean by core mm (buffer head have been remove and
> > >          with some file system this turns into an ugly mess).
> > 
> > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > "prevent data and/or filesystem corruption" case. The primary goal
> > we have here is removing our exposure to potential corruption, which
> > has the secondary effect of avoiding the crash/panics that currently
> > occur as a result of inconsistent page/filesystem state.
> 
> This is O1 avoid corruption is O1

It's "avoid a specific instance of data corruption", not a general
mechanism for avoiding data/filesystem corruption.

Calling set_page_dirty() on a file backed page which has not been
correctly prepared can cause data corruption, filesystem corruption
and shutdowns, etc because we have dirty data over a region that is
not correctly mapped. Yes, it can also cause a crash (because we
really, really suck at validation and error handling in generic code
paths), but there's so, so much more that can go wrong than crash
the kernel when we do stupid shit like this.

> > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > /before/ the file-backed page is marked dirty, and hence we don't
> > expose ourselves to potential corruption or crashes that are a
> > result of inappropriately calling set_page_dirty() on clean
> > file-backed pages.
> 
> Yes and this would be handle by put_user_page ie:

No, put_user_page() is too late - it's after the DMA has completed,
but we have to ensure the file has backing store allocated and the
pages are in the correct state /before/ the DMA is done.

Think ENOSPC - that has to be handled before we do the DMA, not
after. Before the DMA it is a recoverable error, after the DMA it is
data loss/corruption failure.
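
The ordering can be shown with a tiny userspace model (a sketch only,
not kernel code; model_fs, model_page and both function names are made
up for illustration, and the memcpy merely stands in for the DMA):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16	/* tiny page, just for the model */

struct model_fs {
	unsigned int free_blocks;
};

struct model_page {
	bool has_backing_block;
	bool dirty;
	char data[MODEL_PAGE_SIZE];
};

/* Stand-in for the ->page_mkwrite() style preparation step. */
int prepare_page_for_write(struct model_fs *fs, struct model_page *page)
{
	if (page->has_backing_block)
		return 0;
	if (fs->free_blocks == 0)
		return -ENOSPC;		/* recoverable: nothing written yet */
	fs->free_blocks--;
	page->has_backing_block = true;
	return 0;
}

/* Prepare first, then let the "device" write, then mark the page dirty. */
int dma_into_page(struct model_fs *fs, struct model_page *page,
		  const char *src, size_t len)
{
	int ret;

	assert(len <= MODEL_PAGE_SIZE);

	ret = prepare_page_for_write(fs, page);
	if (ret)
		return ret;		/* page contents are untouched */

	memcpy(page->data, src, len);	/* stands in for the DMA */
	page->dirty = true;
	return 0;
}
```

If the preparation were moved after the memcpy, the -ENOSPC case would
leave new data in a page that has nowhere to go, which is the data loss
case described above.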

> put_user_page(struct page *page, bool dirty)
> {
>     if (!PageAnon(page)) {
>         if (dirty) {
>             // Do the whole dance ie page_mkwrite and all before
>             // calling set_page_dirty()
>         }
>         ...
>     }
>     ...
> }

Essentially, doing this would require a whole new "dirty a page"
infrastructure because it is in the IO path, not the page fault
path.

And, for hardware that does its own page faults for DMA, this whole
post-DMA page setup is broken because the pages will have already
gone through ->page_mkwrite() and be set up correctly already.

> > > For [O2] i believe we can handle that case in the put_user_page()
> > > function to properly dirty the page without causing filesystem
> > > freak out.
> > 
> > I'm pretty sure you can't call ->page_mkwrite() from
> > put_user_page(), so I don't think this is workable at all.
> 
> Hu why ? i can not think of any reason whike you could not. User of

It's not a fault path, you can't safely lock pages, you can't take
fault-path only locks in the IO path (mmap_sem inversion problems),
etc.

/me has a nagging feeling this was all explained in previous
discussions of this patchset...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:51                                 ` Dave Chinner
@ 2018-12-13  2:02                                   ` Jerome Glisse
  2018-12-13 15:56                                     ` Christopher Lameter
  2018-12-14  6:00                                     ` Dave Chinner
  2018-12-14  3:52                                   ` John Hubbard
  1 sibling, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13  2:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > it may be easier said than done...
> > > > 
> > > > So i want back to the drawing board and first i would like to ascertain
> > > > that we all agree on what the objectives are:
> > > > 
> > > >     [O1] Avoid write back from a page still being written by either a
> > > >          device or some direct I/O or any other existing user of GUP.
> 
> IOWs, you need to mark pages being written to by a GUP as
> PageWriteback, so all attempts to write the page will block on
> wait_on_page_writeback() before trying to write the dirty page.

No you don't, and you can't, for the simple reason that the GUP
of some device drivers can last days, weeks, months, years ... so
it is not something you want to do. Here is what happens today:
    - user space submits a directio read from a file, writing to a
      virtual address; the problematic case is when that virtual
      address is actually an mmap of a file itself
    - the kernel does GUP on the virtual address; if the page has
      write permission in the CPU page table mapping then the page
      refcount is incremented and the page is returned to the
      directio kernel code that does the memcpy

      It means that the page already went through page_mkwrite so
      all is fine from the fs point of view.

      If the page does not have write permission then a page fault
      is triggered and page_mkwrite will happen and prep the page
      accordingly

In the above scheme a page writeback might happen after we looked
up the page from the CPU page table and before directio finishes with
the memcpy, so the page content might not be stable during the
writeback. This is a small window for things to go bad, and I do not
think we know whether anybody has ever experienced a bug because of it.

For other GUP users the flow is the same, except that a device driver
that keeps the page around and does continuous DMA to it might hold it
for days, weeks, months, years ... so for those the race window is big
enough for bad things to happen. Jan has reports of such bugs.

So what I am proposing to fix the above is to have page_mkclean return
an is_pin boolean; if the page is pinned then the fs code uses a bounce
page to do the writeback, giving a stable bounce page. Moreover the fs
will need to keep around all buffer_heads, blocks, ... i.e. whatever is
associated with that file offset, so that any later set_page_dirty
would not freak out and would not need to reallocate blocks or do
anything heavyweight.
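
A rough userspace model of that writeback path, to show the shape of
the idea (a sketch only: wb_page, wb_page_mkclean, wb_writeback and the
fs_metadata_kept flag are hypothetical, the model is single threaded,
and the real page_mkclean does not return such a flag today):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define WB_PAGE_SIZE 16		/* tiny page, just for the model */

struct wb_page {
	bool pinned;		/* what the proposed is_pin would report */
	bool dirty;
	bool fs_metadata_kept;	/* buffer_heads, blocks, ... still attached */
	char data[WB_PAGE_SIZE];
};

/* Stand-in for the proposed page_mkclean() that also reports the pin. */
bool wb_page_mkclean(struct wb_page *page)
{
	page->dirty = false;
	return page->pinned;
}

/*
 * Write one page back to "disk". For a pinned page the data is first
 * snapshotted into the bounce buffer and the write is done from there,
 * and the fs metadata is kept for a later set_page_dirty().
 * Returns true if the bounce buffer was used.
 */
bool wb_writeback(struct wb_page *page, char *bounce, char *disk)
{
	bool is_pin;

	assert(page && bounce && disk);

	is_pin = wb_page_mkclean(page);
	if (is_pin) {
		memcpy(bounce, page->data, WB_PAGE_SIZE); /* stable snapshot */
		memcpy(disk, bounce, WB_PAGE_SIZE);
		page->fs_metadata_kept = true;
	} else {
		memcpy(disk, page->data, WB_PAGE_SIZE);
		page->fs_metadata_kept = false;	/* fs may recycle it */
	}
	return is_pin;
}
```

Anything the device writes into the page after the snapshot stays out
of the in-flight write and simply shows up in the next writeback.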

We have a separate discussion on what to do about truncate and other
fs events that inherently invalidate a portion of the file, so I do not
want to complicate the present discussion with those, but we also have
that in mind.

Do you see any fundamental issues with that? It abides by all
existing fs standards AFAICT (you have a page_mkwrite and we ask the
fs to keep the result of that around).


> > > >          This would avoid possible file system corruption.
> 
> This isn't a filesystem corruption vector. At worst, it could cause
> torn data writes due to updating the page while it is under IO. We
> have a name for this: "stable pages". This is designed to prevent
> updates to pages via mmap writes from causing corruption of things
> like MD RAID due to modification of the data during RAID parity
> calculations. Hence we have wait_for_stable_page() calls in all
> ->page_mkwrite implementations so that new mmap writes block until
> writeback IO is complete on the devices that require stable pages
> to prevent corruption.
> 
> IOWs, we already deal with this "delay new modification while
> writeback is in progress" problem in the mmap/filesystem world and
> have infrastructure to handle it. And the ->page_mkwrite code
> already deals with it.

Does the above answer that too ?

> 
> > > > 
> > > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > > >          considered clean by core mm (buffer head have been remove and
> > > >          with some file system this turns into an ugly mess).
> > > 
> > > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > > "prevent data and/or filesystem corruption" case. The primary goal
> > > we have here is removing our exposure to potential corruption, which
> > > has the secondary effect of avoiding the crash/panics that currently
> > > occur as a result of inconsistent page/filesystem state.
> > 
> > This is O1 avoid corruption is O1
> 
> It's "avoid a specific instance of data corruption", not a general
> mechanism for avoiding data/filesystem corruption.
> 
> Calling set_page_dirty() on a file backed page which has not been
> correctly prepared can cause data corruption, filesystem coruption
> and shutdowns, etc because we have dirty data over a region that is
> not correctly mapped. Yes, it can also cause a crash (because we
> really, really suck at validation and error handling in generic code
> paths), but there's so, so much more that can go wrong than crash
> the kernel when we do stupid shit like this.

I believe I answered this in the explanation above.

> 
> > > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > > /before/ the file-backed page is marked dirty, and hence we don't
> > > expose ourselves to potential corruption or crashes that are a
> > > result of inappropriately calling set_page_dirty() on clean
> > > file-backed pages.
> > 
> > Yes and this would be handle by put_user_page ie:
> 
> No, put_user_page() is too late - it's after the DMA has completed,
> but we have to ensure the file has backing store allocated and the
> pages are in the correct state /before/ the DMA is done.
> 
> Think ENOSPC - that has to be handled before we do the DMA, not
> after. Before the DMA it is a recoverable error, after the DMA it is
> data loss/corruption failure.

Yes, agreed, and I hope the explanation above makes clear that it
would become legal to do set_page_dirty in put_user_page, thanks to
page_mkclean telling the fs code not to recycle anything after
writeback finishes.


> > put_user_page(struct page *page, bool dirty)
> > {
> >     if (!PageAnon(page)) {
> >         if (dirty) {
> >             // Do the whole dance ie page_mkwrite and all before
> >             // calling set_page_dirty()
> >         }
> >         ...
> >     }
> >     ...
> > }
> 
> Essentially, doing this would require a whole new "dirty a page"
> infrastructure because it is in the IO path, not the page fault
> path.
> 
> And, for hardware that does it's own page faults for DMA, this whole
> post-DMA page setup is broken because the pages will have already
> gone through ->page_mkwrite() and be set up correctly already.

Does the above properly explain why you would not need a new
set_page_dirty()?

> 
> > > > For [O2] i believe we can handle that case in the put_user_page()
> > > > function to properly dirty the page without causing filesystem
> > > > freak out.
> > > 
> > > I'm pretty sure you can't call ->page_mkwrite() from
> > > put_user_page(), so I don't think this is workable at all.
> > 
> > Hu why ? i can not think of any reason whike you could not. User of
> 
> It's not a fault path, you can't safely lock pages, you can't take
> fault-path only locks in the IO path (mmap_sem inversion problems),
> etc.
> 
> /me has a nagging feeling this was all explained in a previous
> discussions of this patchset...

Did I explain my idea properly this time? The scheme I am proposing
abides by all the fs rules that I am aware of, at least. I hope I did
not forget any.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:01                                       ` Jerome Glisse
  2018-12-13  0:18                                         ` Dan Williams
@ 2018-12-13  3:20                                         ` Jason Gunthorpe
  2018-12-13 12:43                                           ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: Jason Gunthorpe @ 2018-12-13  3:20 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > Almost, we need some safety around assuming that DMA is complete the
> > > > page, so the notification would need to go all to way to userspace
> > > > with something like a file lease notification. It would also need to
> > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > can not stop in-flight DMA.
> > > 
> > > You can always reprogram the hardware right away it will redirect
> > > any dma to the crappy page.
> > 
> > That causes silent data corruption for RDMA users - we can't do that.
> > 
> > The only way out for current hardware is to forcibly terminate the
> > RDMA activity somehow (and I'm not even sure this is possible, at
> > least it would be driver specific)
> > 
> > Even the IOMMU idea probably doesn't work, I doubt all current
> > hardware can handle a PCI-E error TLP properly. 
> 
> What i saying is reprogram hardware to crappy page ie valid page
> dma map but that just has random content as a last resort to allow
> filesystem to reuse block. So their should be no PCIE error unless
> hardware freak out to see its page table reprogram randomly.

No, that isn't an option. You can't silently provide corrupted data
for RDMA to transfer out onto the network, or silently discard data
coming in!! 

Think of the consequences of that - I have a fileserver process and
someone does ftruncate and now my clients receive corrupted data??

The only option is to prevent the RDMA transfer from ever happening,
and we just don't have hardware support (beyond destroy everything) to
do that.

> The question is who do you want to punish ? RDMA user that pin stuff
> and expect thing to work forever without worrying for other fs
> activities ? Or filesystem to pin block forever :) 

I don't want to punish everyone, I want both sides to have complete
data integrity as the USER has deliberately decided to combine DAX and
RDMA. So either stop it at the front end (ie get_user_pages_longterm)
or make it work in a way that guarantees integrity for both.

>     S2: notify userspace program through device/sub-system
>         specific API and delay ftruncate. After a while if there
>         is no answer just be mean and force hardware to use
>         crappy page as anyway this is what happens today

I don't think this happens today (outside of DAX).. Does it?

.. and the remedy here is to kill the process, not provide corrupt
data. Killing the process is likely to not go over well with any real
users that want this combination.

Think Samba serving files over RDMA - you can't have random unpriv
users calling ftruncate and causing smbd to be killed or serve corrupt
data.

Jason

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:44                                           ` Jerome Glisse
@ 2018-12-13  3:26                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 206+ messages in thread
From: Jason Gunthorpe @ 2018-12-13  3:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 07:44:37PM -0500, Jerome Glisse wrote:

> On many GPUs you can do that, it is hardware dependant and you have
> steps to take but it is something you can do (and GPU can do
> continuous DMA traffic have they have threads running that can
> do continuous memory access). So i assume that other hardware
> can do it too.

RDMA has no generic way to modify an MR and then guarantee the HW sees
the modifications. Some HW can do this (ie the same HW that can do
ODP, because ODP needs this capability), other HW is an unknown as
this has never been asked for as a driver API.

Jason

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  3:20                                         ` Jason Gunthorpe
@ 2018-12-13 12:43                                           ` Jerome Glisse
  2018-12-13 13:40                                             ` Tom Talpey
  2018-12-14 10:41                                             ` Jan Kara
  0 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13 12:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > > Almost, we need some safety around assuming that DMA is complete the
> > > > > page, so the notification would need to go all to way to userspace
> > > > > with something like a file lease notification. It would also need to
> > > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > > can not stop in-flight DMA.
> > > > 
> > > > You can always reprogram the hardware right away it will redirect
> > > > any dma to the crappy page.
> > > 
> > > That causes silent data corruption for RDMA users - we can't do that.
> > > 
> > > The only way out for current hardware is to forcibly terminate the
> > > RDMA activity somehow (and I'm not even sure this is possible, at
> > > least it would be driver specific)
> > > 
> > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > hardware can handle a PCI-E error TLP properly. 
> > 
> > What i saying is reprogram hardware to crappy page ie valid page
> > dma map but that just has random content as a last resort to allow
> > filesystem to reuse block. So their should be no PCIE error unless
> > hardware freak out to see its page table reprogram randomly.
> 
> No, that isn't an option. You can't silently provide corrupted data
> for RDMA to transfer out onto the network, or silently discard data
> coming in!! 
> 
> Think of the consequences of that - I have a fileserver process and
> someone does ftruncate and now my clients receive corrupted data??

This is what happens _today_, i.e. today someone does GUP on a file
page and then someone else truncates the file: the first GUP user is
effectively streaming _random_ data to the network, as the page does
not correspond to anything anymore, and once the RDMA MR goes away and
releases the page, the page content will be lost. So I am not changing
anything here; what I proposed was to make it explicit, to the device
driver at least, that it is streaming random data. Right now this is
all silent, but this is what is happening whether you like it or not :)

Note that I am saying to do that only for truncate, to be nice to the
fs. But again, I am fine with whatever solution; you just cannot
please everyone here. Either block truncate and the fs folks will
hate you, or make it clear to the device driver that it is streaming
random things and the RDMA people will hate you.


> The only option is to prevent the RDMA transfer from ever happening,
> and we just don't have hardware support (beyond destroy everything) to
> do that.
> 
> > The question is who do you want to punish ? RDMA user that pin stuff
> > and expect thing to work forever without worrying for other fs
> > activities ? Or filesystem to pin block forever :) 
> 
> I don't want to punish everyone, I want both sides to have complete
> data integrity as the USER has deliberately decided to combine DAX and
> RDMA. So either stop it at the front end (ie get_user_pages_longterm)
> or make it work in a way that guarantees integrity for both.
> 
> >     S2: notify userspace program through device/sub-system
> >         specific API and delay ftruncate. After a while if there
> >         is no answer just be mean and force hardware to use
> >         crappy page as anyway this is what happens today
> 
> I don't think this happens today (outside of DAX).. Does it?

It does, it is just silent; I don't remember anything in the code
that would stop a truncate from happening because of an elevated
refcount. This does not happen with ODP on mlx5, as it abides by _all_
mmu notifiers. This applies to anything that does ODP without support
for mmu notifiers.

> .. and the remedy here is to kill the process, not provide corrupt
> data. Kill the process is likely to not go over well with any real
> users that want this combination.
> 
> Think Samba serving files over RDMA - you can't have random unpriv
> users calling ftruncate and causing smbd to be killed or serve corrupt
> data.

So what I am saying is that there is a choice, and it would be better
to decide something than to keep the existing status quo, where we just
keep streaming random data after truncate to a GUPed page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 12:43                                           ` Jerome Glisse
@ 2018-12-13 13:40                                             ` Tom Talpey
  2018-12-13 14:18                                               ` Jerome Glisse
  2018-12-14 10:41                                             ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: Tom Talpey @ 2018-12-13 13:40 UTC (permalink / raw)
  To: Jerome Glisse, Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On 12/13/2018 7:43 AM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
>> On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
>>> On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
>>>> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
>>>>>> Almost, we need some safety around assuming that DMA is complete the
>>>>>> page, so the notification would need to go all to way to userspace
>>>>>> with something like a file lease notification. It would also need to
>>>>>> be backstopped by an IOMMU in the case where the hardware does not /
>>>>>> can not stop in-flight DMA.
>>>>>
>>>>> You can always reprogram the hardware right away it will redirect
>>>>> any dma to the crappy page.
>>>>
>>>> That causes silent data corruption for RDMA users - we can't do that.
>>>>
>>>> The only way out for current hardware is to forcibly terminate the
>>>> RDMA activity somehow (and I'm not even sure this is possible, at
>>>> least it would be driver specific)
>>>>
>>>> Even the IOMMU idea probably doesn't work, I doubt all current
>>>> hardware can handle a PCI-E error TLP properly.
>>>
>>> What i saying is reprogram hardware to crappy page ie valid page
>>> dma map but that just has random content as a last resort to allow
>>> filesystem to reuse block. So their should be no PCIE error unless
>>> hardware freak out to see its page table reprogram randomly.
>>
>> No, that isn't an option. You can't silently provide corrupted data
>> for RDMA to transfer out onto the network, or silently discard data
>> coming in!!
>>
>> Think of the consequences of that - I have a fileserver process and
>> someone does ftruncate and now my clients receive corrupted data??
> 
> This is what happens _today_ ie today someone do GUP on page file
> and then someone else do truncate the first GUP is effectively
> streaming _random_ data to network as the page does not correspond
> to anything anymore and once the RDMA MR goes aways and release
> the page the page content will be lost. So i am not changing anything
> here, what i proposed was to make it explicit to device driver at
> least that they were streaming random data. Right now this is all
> silent but this is what is happening wether you like it or not :)
> 
> Note that  i am saying do that only for truncate to allow to be
> nice to fs. But again i am fine with whatever solution but you can
> not please everyone here. Either block truncate and fs folks will
> hate you or make it clear to device driver that you are streaming
> random things and RDMA people hates you.
> 
> 
>> The only option is to prevent the RDMA transfer from ever happening,
>> and we just don't have hardware support (beyond destroy everything) to
>> do that.
>>
>>> The question is who do you want to punish ? RDMA user that pin stuff
>>> and expect thing to work forever without worrying for other fs
>>> activities ? Or filesystem to pin block forever :)
>>
>> I don't want to punish everyone, I want both sides to have complete
>> data integrity as the USER has deliberately decided to combine DAX and
>> RDMA. So either stop it at the front end (ie get_user_pages_longterm)
>> or make it work in a way that guarantees integrity for both.
>>
>>>      S2: notify userspace program through device/sub-system
>>>          specific API and delay ftruncate. After a while if there
>>>          is no answer just be mean and force hardware to use
>>>          crappy page as anyway this is what happens today
>>
>> I don't think this happens today (outside of DAX).. Does it?
> 
> It does it is just silent, i don't remember anything in the code
> that would stop a truncate to happen because of elevated refcount.
> This does not happen with ODP mlx5 as it does abide by _all_ mmu
> notifier. This is for anything that does ODP without support for
> mmu notifier.

Wait - is it expected that the MMU notifier upcall is handled
synchronously? That is, the page DMA mapping must be torn down
immediately, and before returning?

That's simply not possible, since the hardware needs to get control
to do this. Even if there were an IOMMU that could intercept the
DMA, reprogramming it will require a flush, which cannot be guaranteed
to occur "inline".

>> .. and the remedy here is to kill the process, not provide corrupt
>> data. Kill the process is likely to not go over well with any real
>> users that want this combination.
>>
>> Think Samba serving files over RDMA - you can't have random unpriv
>> users calling ftruncate and causing smbd to be killed or serve corrupt
>> data.
> 
> So what i am saying is there is a choice and it would be better to
> decide something than let the existing status quo where we just keep
> streaming random data after truncate to a GUPed page.

Let's also remember that any torn-down DMA mapping can't be recycled
until all uses of the old DMA addresses are destroyed. The whole
thing screams for reference counting all the way down, to me.
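
As a rough illustration of what that reference counting might look like,
here is a single-threaded user-space sketch. The structure and helper
names are hypothetical, and real code would need atomics and locking;
the point is only that the DMA address becomes recyclable when the last
user drops its reference after teardown, not at teardown time:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one DMA mapping; not a kernel structure. */
struct dma_mapping {
    uint64_t dma_addr;   /* bus address handed to the device */
    unsigned int users;  /* in-flight uses of dma_addr */
    bool torn_down;      /* teardown requested, no new users allowed */
    bool recyclable;     /* safe to hand dma_addr to someone else */
};

static void mapping_init(struct dma_mapping *m, uint64_t addr)
{
    m->dma_addr = addr;
    m->users = 0;
    m->torn_down = false;
    m->recyclable = false;
}

/* Take a reference before issuing DMA; fails once teardown has begun. */
static bool mapping_get(struct dma_mapping *m)
{
    if (m->torn_down)
        return false;
    m->users++;
    return true;
}

/* Drop a reference; the last put after teardown allows recycling. */
static void mapping_put(struct dma_mapping *m)
{
    assert(m->users > 0);
    m->users--;
    if (m->torn_down && m->users == 0)
        m->recyclable = true;
}

/* Request teardown; recycling is deferred while users remain. */
static void mapping_teardown(struct dma_mapping *m)
{
    m->torn_down = true;
    if (m->users == 0)
        m->recyclable = true;
}
```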

Tom.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 13:40                                             ` Tom Talpey
@ 2018-12-13 14:18                                               ` Jerome Glisse
  2018-12-13 14:51                                                 ` Tom Talpey
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13 14:18 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:
> On 12/13/2018 7:43 AM, Jerome Glisse wrote:
> > On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > > > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > > > > Almost, we need some safety around assuming that DMA is complete the
> > > > > > > page, so the notification would need to go all to way to userspace
> > > > > > > with something like a file lease notification. It would also need to
> > > > > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > > > > can not stop in-flight DMA.
> > > > > > 
> > > > > > You can always reprogram the hardware right away it will redirect
> > > > > > any dma to the crappy page.
> > > > > 
> > > > > That causes silent data corruption for RDMA users - we can't do that.
> > > > > 
> > > > > The only way out for current hardware is to forcibly terminate the
> > > > > RDMA activity somehow (and I'm not even sure this is possible, at
> > > > > least it would be driver specific)
> > > > > 
> > > > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > > > hardware can handle a PCI-E error TLP properly.
> > > > 
> > > > What i saying is reprogram hardware to crappy page ie valid page
> > > > dma map but that just has random content as a last resort to allow
> > > > filesystem to reuse block. So their should be no PCIE error unless
> > > > hardware freak out to see its page table reprogram randomly.
> > > 
> > > No, that isn't an option. You can't silently provide corrupted data
> > > for RDMA to transfer out onto the network, or silently discard data
> > > coming in!!
> > > 
> > > Think of the consequences of that - I have a fileserver process and
> > > someone does ftruncate and now my clients receive corrupted data??
> > 
> > This is what happens _today_ ie today someone do GUP on page file
> > and then someone else do truncate the first GUP is effectively
> > streaming _random_ data to network as the page does not correspond
> > to anything anymore and once the RDMA MR goes aways and release
> > the page the page content will be lost. So i am not changing anything
> > here, what i proposed was to make it explicit to device driver at
> > least that they were streaming random data. Right now this is all
> > silent but this is what is happening wether you like it or not :)
> > 
> > Note that  i am saying do that only for truncate to allow to be
> > nice to fs. But again i am fine with whatever solution but you can
> > not please everyone here. Either block truncate and fs folks will
> > hate you or make it clear to device driver that you are streaming
> > random things and RDMA people hates you.
> > 
> > 
> > > The only option is to prevent the RDMA transfer from ever happening,
> > > and we just don't have hardware support (beyond destroy everything) to
> > > do that.
> > > 
> > > > The question is who do you want to punish ? RDMA user that pin stuff
> > > > and expect thing to work forever without worrying for other fs
> > > > activities ? Or filesystem to pin block forever :)
> > > 
> > > I don't want to punish everyone, I want both sides to have complete
> > > data integrity as the USER has deliberately decided to combine DAX and
> > > RDMA. So either stop it at the front end (ie get_user_pages_longterm)
> > > or make it work in a way that guarantees integrity for both.
> > > 
> > > >      S2: notify userspace program through device/sub-system
> > > >          specific API and delay ftruncate. After a while if there
> > > >          is no answer just be mean and force hardware to use
> > > >          crappy page as anyway this is what happens today
> > > 
> > > I don't think this happens today (outside of DAX).. Does it?
> > 
> > It does it is just silent, i don't remember anything in the code
> > that would stop a truncate to happen because of elevated refcount.
> > This does not happen with ODP mlx5 as it does abide by _all_ mmu
> > notifier. This is for anything that does ODP without support for
> > mmu notifier.
> 
> Wait - is it expected that the MMU notifier upcall is handled
> synchronously? That is, the page DMA mapping must be torn down
> immediately, and before returning?

Yes, you must tear down the mapping before returning from the mmu
notifier callback. Any time after is too late. You obviously need
hardware that can support that. In the infiniband sub-system, AFAIK
only the mlx5 hardware can do that. In the GPU sub-system everyone is
fine.

Dunno about other sub-systems.


> That's simply not possible, since the hardware needs to get control
> to do this. Even if there were an IOMMU that could intercept the
> DMA, reprogramming it will require a flush, which cannot be guaranteed
> to occur "inline".

If the hardware cannot do that then it should not use GUP, at least
not on file-backed pages. I advocated in favor of forbidding GUP for
devices that cannot do that, as right now this silently breaks in a
few cases (truncate, mremap, splice, reflink, ...). So the device in
those cases can end up with GUPed pages that do not correspond to
anything anymore, i.e. they do not correspond to the memory backing
the virtual address they were GUPed against, nor do they correspond
to the file content at the given offset anymore. It is just random
data as far as the kernel or filesystem is concerned.

Of course, for this to happen you need an application that does stupid
things like creating an MR in one thread on the mmap of a file and
truncating that same file in another thread (or from the same thread).

So this is unlikely to happen in a sane program. It does not mean it
will not happen.


The second set of issues has to do with set_page_dirty() happening a
long time after page_release happened, so the fs dirty-page callback
will see the page in a bad state and will BUG(), and you will have an
oops and lose any data your device might have written to the page.
This is highly filesystem dependent and also timing dependent, and
linked to things like memory pressure, so it might not happen that
often, but again it can happen.


> > > .. and the remedy here is to kill the process, not provide corrupt
> > > data. Kill the process is likely to not go over well with any real
> > > users that want this combination.
> > > 
> > > Think Samba serving files over RDMA - you can't have random unpriv
> > > users calling ftruncate and causing smbd to be killed or serve corrupt
> > > data.
> > 
> > So what i am saying is there is a choice and it would be better to
> > decide something than let the existing status quo where we just keep
> > streaming random data after truncate to a GUPed page.
> 
> Let's also remember that any torn-down DMA mapping can't be recycled
> until all uses of the old DMA addresses are destroyed. The whole
> thing screams for reference counting all the way down, to me.

I am not saying to reuse the DMA address in the emergency_mean_callback;
the idea was:

    gup_page_emergency_revoke(device, page)
    {
        crappy_page = alloc_page();
        dma_addr = dma_map(crappy_page, device, ...);
        mydevice_page_table_update(device, crappy_page, dma_addr);
        mydevice_tlb_flush(device);
        mydevice_wait_pending_dma(device);

        // at this point the original GUPed page is no longer accessed by hw

        dma_unmap(page);
        put_user_page(page);
    }

I know that it is something we can do with GPU devices. So I assumed
that other devices can do that too. But I understand this is highly
device dependent. Note that if you have a command queue it is more like:

    gup_page_emergency_revoke(device, page)
    {
        crappy_page = alloc_page();
        dma_addr = dma_map(crappy_page, device, ...);

        // the function below updates the kernel-side data structures
        // that store the pointer to the GUPed page and that are used to
        // build the cmds sent to the hardware. It does not update the
        // hardware, just the device driver internal data structures.
        mydevice_replace_page_in_object(device, page, crappy_page, dma_addr);

        mydevice_queue_wait_pending_job(device);

        // at this point the original GUPed page is no longer accessed by
        // hw, and any new command will be using the crappy page, not the
        // GUPed page

        dma_unmap(page);
        put_user_page(page);
    }
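
The ordering that matters in both variants (redirect first, drain pending
work, and only then drop the pin) can be modelled in user space. This is
a sketch under simplifying assumptions: the types and helpers are
hypothetical stand-ins, and "waiting" for the hardware is modelled as
immediate completion:

```c
#include <assert.h>

struct page_model { int refcount; };   /* stand-in for struct page */

struct device_model {
    struct page_model *mapped;  /* page the device table points to */
    int pending_dma;            /* jobs issued against the old mapping */
};

/* Stand-in for waiting on the hardware: completion is immediate here. */
static void device_wait_pending_dma(struct device_model *dev)
{
    dev->pending_dma = 0;
}

static void gup_page_emergency_revoke_model(struct device_model *dev,
                                            struct page_model *page,
                                            struct page_model *scratch)
{
    /* 1. Point the device at the scratch page before anything else. */
    scratch->refcount++;
    dev->mapped = scratch;

    /* 2. Drain every job that could still target the original page. */
    device_wait_pending_dma(dev);

    /* 3. Only now is it safe to drop the pin on the original page. */
    assert(dev->pending_dma == 0 && dev->mapped != page);
    page->refcount--;
}
```

Dropping the pin before step 2 completes would reintroduce exactly the
window in which the device still writes to a page it no longer owns.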

Again, if a device cannot do any of the above then it should really not
be using GUP, because there are corner cases that are simply not
solvable. We can avoid the kernel OOPS but we cannot pin the page the
way GUP users believe, i.e. the virtual address the GUP happened against
can point to a different page (true for both anonymous memory and
file-backed memory).


The put_user_page() patchset is about solving the OOPS and BUG() and
also fixing the tiny race that exists for direct I/O. Fixing other users
of GUP should happen sub-system by sub-system, and each sub-system or
device driver maintainer must choose their poison. This is what I am
advocating for. Whether the emergency_revoke above would work for a
given device is something I can't say for certain, only for devices
I know (which are mostly GPUs).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 14:18                                               ` Jerome Glisse
@ 2018-12-13 14:51                                                 ` Tom Talpey
  2018-12-13 15:18                                                   ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Tom Talpey @ 2018-12-13 14:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On 12/13/2018 9:18 AM, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:
>> On 12/13/2018 7:43 AM, Jerome Glisse wrote:
>>> On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
>>>> On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
>>>>> On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
>>>>>> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
>>>>>>>> Almost, we need some safety around assuming that DMA is complete the
>>>>>>>> page, so the notification would need to go all to way to userspace
>>>>>>>> with something like a file lease notification. It would also need to
>>>>>>>> be backstopped by an IOMMU in the case where the hardware does not /
>>>>>>>> can not stop in-flight DMA.
>>>>>>>
>>>>>>> You can always reprogram the hardware right away it will redirect
>>>>>>> any dma to the crappy page.
>>>>>>
>>>>>> That causes silent data corruption for RDMA users - we can't do that.
>>>>>>
>>>>>> The only way out for current hardware is to forcibly terminate the
>>>>>> RDMA activity somehow (and I'm not even sure this is possible, at
>>>>>> least it would be driver specific)
>>>>>>
>>>>>> Even the IOMMU idea probably doesn't work, I doubt all current
>>>>>> hardware can handle a PCI-E error TLP properly.
>>>>>
>>>>> What i saying is reprogram hardware to crappy page ie valid page
>>>>> dma map but that just has random content as a last resort to allow
>>>>> filesystem to reuse block. So their should be no PCIE error unless
>>>>> hardware freak out to see its page table reprogram randomly.
>>>>
>>>> No, that isn't an option. You can't silently provide corrupted data
>>>> for RDMA to transfer out onto the network, or silently discard data
>>>> coming in!!
>>>>
>>>> Think of the consequences of that - I have a fileserver process and
>>>> someone does ftruncate and now my clients receive corrupted data??
>>>
>>> This is what happens _today_ ie today someone do GUP on page file
>>> and then someone else do truncate the first GUP is effectively
>>> streaming _random_ data to network as the page does not correspond
>>> to anything anymore and once the RDMA MR goes aways and release
>>> the page the page content will be lost. So i am not changing anything
>>> here, what i proposed was to make it explicit to device driver at
>>> least that they were streaming random data. Right now this is all
>>> silent but this is what is happening wether you like it or not :)
>>>
>>> Note that  i am saying do that only for truncate to allow to be
>>> nice to fs. But again i am fine with whatever solution but you can
>>> not please everyone here. Either block truncate and fs folks will
>>> hate you or make it clear to device driver that you are streaming
>>> random things and RDMA people hates you.
>>>
>>>
>>>> The only option is to prevent the RDMA transfer from ever happening,
>>>> and we just don't have hardware support (beyond destroy everything) to
>>>> do that.
>>>>
>>>>> The question is who do you want to punish ? RDMA user that pin stuff
>>>>> and expect thing to work forever without worrying for other fs
>>>>> activities ? Or filesystem to pin block forever :)
>>>>
>>>> I don't want to punish everyone, I want both sides to have complete
>>>> data integrity as the USER has deliberately decided to combine DAX and
>>>> RDMA. So either stop it at the front end (ie get_user_pages_longterm)
>>>> or make it work in a way that guarantees integrity for both.
>>>>
>>>>>       S2: notify userspace program through device/sub-system
>>>>>           specific API and delay ftruncate. After a while if there
>>>>>           is no answer just be mean and force hardware to use
>>>>>           crappy page as anyway this is what happens today
>>>>
>>>> I don't think this happens today (outside of DAX).. Does it?
>>>
>>> It does, it is just silent; I don't remember anything in the code
>>> that would stop a truncate from happening because of an elevated
>>> refcount. This does not happen with ODP mlx5, as it abides by _all_
>>> mmu notifiers. This is for anything that does ODP without support for
>>> mmu notifiers.
>>
>> Wait - is it expected that the MMU notifier upcall is handled
>> synchronously? That is, the page DMA mapping must be torn down
>> immediately, and before returning?
> 
> Yes, you must tear down the mapping before returning from the mmu
> notifier callback. Any time after is too late. You obviously need hardware
> that can support that. In the infiniband sub-system, AFAIK only the
> mlx5 hardware can do that. In the GPU sub-system everyone is fine.

I'm skeptical that MLX5 can actually make this guarantee. But we
can take that offline in linux-rdma.

I'm also skeptical that NVMe can do this.

> Dunno about other sub-systems.
> 
> 
>> That's simply not possible, since the hardware needs to get control
>> to do this. Even if there were an IOMMU that could intercept the
>> DMA, reprogramming it will require a flush, which cannot be guaranteed
>> to occur "inline".
> 
> If hardware cannot do that then the hardware should not use GUP, at
> least not on file-backed pages. I advocated in favor of forbidding GUP
> for devices that cannot do that, as right now this silently breaks
> in a few cases (truncate, mremap, splice, reflink, ...). So the device
> in those cases can end up with GUPed pages that do not correspond
> to anything anymore, i.e. they correspond neither to the memory backing
> the virtual address they were GUPed against, nor to
> the file content at the given offset. It is just random
> data as far as the kernel or filesystem is concerned.
> 
> Of course, for this to happen you need an application that does something
> stupid, like creating an MR in one thread on the mmap of a file and
> truncating that same file in another thread (or from the same thread).
> 
> So this is unlikely to happen in a sane program. It does not mean it
> will not happen.

Completely agree. In other words, this is the responsibility of the
DAX (or g-u-p) consumer, which is not necessarily the program itself,
it could be an upper layer.

In SMB3 and NFSv4, which I've been focused on, we envision using the
existing protocol leases to protect this. When requesting a DAX mapping,
the server may require an exclusive lease. If this mapping needs to
change, because of another conflicting mapping, the lease would be
recalled and the mapping dropped. This is a normal and well-established
filesystem requirement.

The twist here is that the platform itself can initiate such an event.
It's my belief that this plumbing must flow to the *top* of the stack,
i.e. the entity that took the mapping (e.g. filesystem), and not
depend on the MMU notifier at the very bottom.
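
As a rough illustration of the lease idea described above, here is a toy
userspace state machine in C. It is only a sketch under stated assumptions:
the names (toy_file, toy_map_with_lease, toy_conflicting_op,
toy_lease_return) are hypothetical and are not SMB3, NFSv4 or kernel APIs.
It shows a mapping being granted only together with an exclusive lease, and
a conflicting operation having to wait until the holder has dropped the
mapping and returned the lease.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not a real protocol or kernel interface. */
enum toy_lease_state { TOY_LEASE_NONE, TOY_LEASE_HELD, TOY_LEASE_RECALLING };

struct toy_file {
    enum toy_lease_state lease;
    bool mapped;                /* holder currently has a direct mapping */
};

/* A direct mapping is only handed out together with an exclusive lease. */
bool toy_map_with_lease(struct toy_file *f)
{
    if (f->lease != TOY_LEASE_NONE)
        return false;
    f->lease = TOY_LEASE_HELD;
    f->mapped = true;
    return true;
}

/*
 * A conflicting operation (e.g. a truncate) starts a recall. It may only
 * proceed once no lease is outstanding; returns true in that case.
 */
bool toy_conflicting_op(struct toy_file *f)
{
    if (f->lease == TOY_LEASE_NONE)
        return true;
    f->lease = TOY_LEASE_RECALLING;
    return false;
}

/* The holder answers the recall by dropping the mapping, then the lease. */
void toy_lease_return(struct toy_file *f)
{
    f->mapped = false;
    f->lease = TOY_LEASE_NONE;
}
```

In this model the conflicting operation never runs while the mapping is
still in place; what happens when the holder does not answer the recall is
deliberately left out.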

> The second set of issues has to deal with set_page_dirty happening
> a long time after page_release happened, so that the fs dirty
> page callback will see the page in a bad state and will BUG(), and you
> will have an oops and lose any data your device might have written
> to the page. This is highly filesystem dependent and also timing
> dependent, and linked to things like memory pressure, so it might not
> happen that often, but again it can happen.
> 
> 
>>>> .. and the remedy here is to kill the process, not provide corrupt
>>>> data. Kill the process is likely to not go over well with any real
>>>> users that want this combination.
>>>>
>>>> Think Samba serving files over RDMA - you can't have random unpriv
>>>> users calling ftruncate and causing smbd to be killed or serve corrupt
>>>> data.
>>>
>>> So what i am saying is there is a choice and it would be better to
>>> decide something than let the existing status quo where we just keep
>>> streaming random data after truncate to a GUPed page.
>>
>> Let's also remember that any torn-down DMA mapping can't be recycled
>> until all uses of the old DMA addresses are destroyed. The whole
>> thing screams for reference counting all the way down, to me.
> 
> I am not saying reuse the DMA address in the emergency_mean_callback
> the idea was:
> 
>      gup_page_emergency_revoke(device, page)
>      {
>          crappy_page = alloc_page();
>          dma_addr = dma_map(crappy_page, device, ...);
>          mydevice_page_table_update(device, crappy_page, dma_addr);
>          mydevice_tlb_flush(device);
>          mydevice_wait_pending_dma(device);
> 
>          // at this point the original GUPed page is not accessed by hw
> 
>          dma_unmap(page);
>          put_user_page(page);
>      }

Ok, but my concern was also that the old DMA address then becomes
unused and may be grabbed by a new i/o. If the hardware still has
reads or writes in flight, and they arrive after the old address
becomes valid, well, oops.

Tom.

> I know that it is something we can do with GPU devices, so I assumed
> that other devices can do that too. But I understand this is highly
> device dependent. Note that if you have a command queue it is more like:
> 
>      gup_page_emergency_revoke(device, page)
>      {
>          crappy_page = alloc_page();
>          dma_addr = dma_map(crappy_page, device, ...);
> 
>          // the function below updates the kernel-side data structures that
>          // store the pointer to the GUPed page and that are used to build
>          // cmds sent to the hardware. It does not update the hardware, just
>          // the device driver's internal data structures.
>          mydevice_replace_page_in_object(device, page, crappy_page, dma_addr);
> 
>          mydevice_queue_wait_pending_job(device);
> 
>          // at this point the original GUPed page is not accessed by hw and
>          // any new command will be using the crappy page, not the GUPed
>          // page
> 
>          dma_unmap(page);
>          put_user_page(page);
>      }
> 
> Again, if a device cannot do any of the above then it should really not
> be using GUP, because there are corner cases that are simply not solvable.
> We can avoid a kernel OOPS but we cannot pin the page the way GUP users
> believe, i.e. the virtual address the GUP happened against can point to a
> different page (true for both anonymous memory and file-backed memory).
> 
> 
> The put_user_page() patchset is about solving the OOPS and BUG() and
> also fixing the tiny race that exists for direct I/O. Fixing other users
> of GUP should happen sub-system by sub-system, and each sub-system or
> device driver maintainer must choose their poison. This is what I am
> advocating for. Whether the emergency_revoke above is something that would
> work for a given device is something I can't say for certain, only for
> devices I know (which are mostly GPUs).
> 
> Cheers,
> Jérôme
> 
> 

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 14:51                                                 ` Tom Talpey
@ 2018-12-13 15:18                                                   ` Jerome Glisse
  2018-12-13 18:12                                                     ` Tom Talpey
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13 15:18 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Thu, Dec 13, 2018 at 09:51:18AM -0500, Tom Talpey wrote:
> On 12/13/2018 9:18 AM, Jerome Glisse wrote:
> > On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:
> > > On 12/13/2018 7:43 AM, Jerome Glisse wrote:
> > > > On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> > > > > On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > > > > > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > > > > > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > > > > > > Almost, we need some safety around assuming that DMA is complete the
> > > > > > > > > page, so the notification would need to go all to way to userspace
> > > > > > > > > with something like a file lease notification. It would also need to
> > > > > > > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > > > > > > can not stop in-flight DMA.
> > > > > > > > 
> > > > > > > > You can always reprogram the hardware right away it will redirect
> > > > > > > > any dma to the crappy page.
> > > > > > > 
> > > > > > > That causes silent data corruption for RDMA users - we can't do that.
> > > > > > > 
> > > > > > > The only way out for current hardware is to forcibly terminate the
> > > > > > > RDMA activity somehow (and I'm not even sure this is possible, at
> > > > > > > least it would be driver specific)
> > > > > > > 
> > > > > > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > > > > > hardware can handle a PCI-E error TLP properly.
> > > > > > 
> > > > > > What i saying is reprogram hardware to crappy page ie valid page
> > > > > > dma map but that just has random content as a last resort to allow
> > > > > > filesystem to reuse block. So their should be no PCIE error unless
> > > > > > hardware freak out to see its page table reprogram randomly.
> > > > > 
> > > > > No, that isn't an option. You can't silently provide corrupted data
> > > > > for RDMA to transfer out onto the network, or silently discard data
> > > > > coming in!!
> > > > > 
> > > > > Think of the consequences of that - I have a fileserver process and
> > > > > someone does ftruncate and now my clients receive corrupted data??
> > > > 
> > > > This is what happens _today_ ie today someone do GUP on page file
> > > > and then someone else do truncate the first GUP is effectively
> > > > streaming _random_ data to network as the page does not correspond
> > > > to anything anymore and once the RDMA MR goes aways and release
> > > > the page the page content will be lost. So i am not changing anything
> > > > here, what i proposed was to make it explicit to device driver at
> > > > least that they were streaming random data. Right now this is all
> > > > silent but this is what is happening wether you like it or not :)
> > > > 
> > > > Note that  i am saying do that only for truncate to allow to be
> > > > nice to fs. But again i am fine with whatever solution but you can
> > > > not please everyone here. Either block truncate and fs folks will
> > > > hate you or make it clear to device driver that you are streaming
> > > > random things and RDMA people hates you.
> > > > 
> > > > 
> > > > > The only option is to prevent the RDMA transfer from ever happening,
> > > > > and we just don't have hardware support (beyond destroy everything) to
> > > > > do that.
> > > > > 
> > > > > > The question is who do you want to punish ? RDMA user that pin stuff
> > > > > > and expect thing to work forever without worrying for other fs
> > > > > > activities ? Or filesystem to pin block forever :)
> > > > > 
> > > > > I don't want to punish everyone, I want both sides to have complete
> > > > > data integrity as the USER has deliberately decided to combine DAX and
> > > > > RDMA. So either stop it at the front end (ie get_user_pages_longterm)
> > > > > or make it work in a way that guarantees integrity for both.
> > > > > 
> > > > > >       S2: notify userspace program through device/sub-system
> > > > > >           specific API and delay ftruncate. After a while if there
> > > > > >           is no answer just be mean and force hardware to use
> > > > > >           crappy page as anyway this is what happens today
> > > > > 
> > > > > I don't think this happens today (outside of DAX).. Does it?
> > > > 
> > > > It does it is just silent, i don't remember anything in the code
> > > > that would stop a truncate to happen because of elevated refcount.
> > > > This does not happen with ODP mlx5 as it does abide by _all_ mmu
> > > > notifier. This is for anything that does ODP without support for
> > > > mmu notifier.
> > > 
> > > Wait - is it expected that the MMU notifier upcall is handled
> > > synchronously? That is, the page DMA mapping must be torn down
> > > immediately, and before returning?
> > 
> > Yes you must torn down mapping before returning from mmu notifier
> > call back. Any time after is too late. You obviously need hardware
> > that can support that. In the infiniband sub-system AFAIK only the
> > mlx5 hardware can do that. In the GPU sub-system everyone is fine.
> 
> I'm skeptical that MLX5 can actually make this guarantee. But we
> can take that offline in linux-rdma.

It does, unless the code lies about what the hardware does :) See umem_odp.c
in core and odp.c in the mlx5 directory.


> I'm also skeptical that NVMe can do this.
> 
> > Dunno about other sub-systems.
> > 
> > 
> > > That's simply not possible, since the hardware needs to get control
> > > to do this. Even if there were an IOMMU that could intercept the
> > > DMA, reprogramming it will require a flush, which cannot be guaranteed
> > > to occur "inline".
> > 
> > If hardware can not do that then hardware should not use GUP, at
> > least not on file back page. I advocated in favor of forbiding GUP
> > for device that can not do that as right now this silently breaks
> > in few cases (truncate, mremap, splice, reflink, ...). So the device
> > in those cases can end up with GUPed pages that do not correspond
> > to anything anymore ie they do not correspond to the memory backing
> > the virtual address they were GUP against, nor they correspond to
> > the file content at the given offset anymore. It is just random
> > data as far as the kernel or filesystem is concern.
> > 
> > Of course for this to happen you need an application that do stupid
> > thing like create an MR in one thread on the mmap of a file and
> > truncate that same file in another thread (or from the same thread).
> > 
> > So this is unlikely to happen in sane program. It does not mean it
> > will not happen.
> 
> Completely agree. In other words, this is the responsibility of the
> > DAX (or g-u-p) consumer, which is not necessarily the program itself,
> it could be an upper layer.
> 
> In SMB3 and NFSv4, which I've been focused on, we envision using the
> existing protocol leases to protect this. When requesting a DAX mapping,
> > the server may require an exclusive lease. If this mapping needs to
> change, because of another conflicting mapping, the lease would be
> recalled and the mapping dropped. This is a normal and well-established
> filesystem requirement.
> 
> The twist here is that the platform itself can initiate such an event.
> It's my belief that this plumbing must flow to the *top* of the stack,
> i.e. the entity that took the mapping (e.g. filesystem), and not
> depend on the MMU notifier at the very bottom.

So this patchset is about mm plumbing; what the fs does before mm code
gets called is a fs discussion. Note also that GUP is always against
a virtual address of a process. GUP is used by a few drivers to allow
the device direct access to some portion of the process address space.

The issue is that some of those devices have designed an API that
fully ignores things like munmap, splice, truncate, ... and as such
they open the door to undefined behavior. I believe the original
intention in all the cases was that the user would not do stupid
things like set up the device mapping through a device-specific API and
then munmap or truncate or do anything else that would affect the range
of virtual addresses.

The thing is, from the kernel point of view we should not and cannot assume
that userspace will behave properly. So we have to brace for the
worst. Which is what this patchset is trying to do, i.e. fix some of
the issues and make the rest of them explicit to the device driver so that
it can decide what to do about them.

My advice is for each of those sub-systems/devices to fail loudly
when such a thing happens so that the user knows he is doing something
stupid or illegal.


> > The second set of issue at to deals with set_page_dirty happening
> > long time after page_release did happens and thus the fs dirty
> > page callback will see page in bad state and will BUG() and you
> > will have an oops and loose any data your device might have written
> > to the page. This is highly filesystem dependend and also timing
> > dependend and link to thing like memory pressure so it might not
> > happen that often but again it can happen.
> > 
> > 
> > > > > .. and the remedy here is to kill the process, not provide corrupt
> > > > > data. Kill the process is likely to not go over well with any real
> > > > > users that want this combination.
> > > > > 
> > > > > Think Samba serving files over RDMA - you can't have random unpriv
> > > > > users calling ftruncate and causing smbd to be killed or serve corrupt
> > > > > data.
> > > > 
> > > > So what i am saying is there is a choice and it would be better to
> > > > decide something than let the existing status quo where we just keep
> > > > streaming random data after truncate to a GUPed page.
> > > 
> > > Let's also remember that any torn-down DMA mapping can't be recycled
> > > until all uses of the old DMA addresses are destroyed. The whole
> > > thing screams for reference counting all the way down, to me.
> > 
> > I am not saying reuse the DMA address in the emergency_mean_callback
> > the idea was:
> > 
> >      gup_page_emergency_revoke(device, page)
> >      {
> > >          crappy_page = alloc_page();
> > >          dma_addr = dma_map(crappy_page, device, ...);
> > >          mydevice_page_table_update(device, crappy_page, dma_addr);
> > >          mydevice_tlb_flush(device);
> > >          mydevice_wait_pending_dma(device);
> > 
> >          // at this point the original GUPed page is not access by hw
> > 
> >          dma_unmap(page);
> >          put_user_page(page);
> >      }
> 
> Ok, but my concern was also that the old DMA address then becomes
> unused and may be grabbed by a new i/o. If the hardware still has
> reads or writes in flight, and they arrive after the old address
> becomes valid, well, oops.

In the above code, at the comment point the driver guarantees that the
hardware will never use the old dma address any more, i.e. that all
in-flight dma is done. I know this can be done for some devices. It might
very well not work for all, and this is why this needs to be a
sub-system/device discussion.
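
To make the ordering argument concrete, here is a toy userspace model in C.
It is only a sketch under stated assumptions: toy_page, toy_device and
toy_emergency_revoke are hypothetical names, not kernel APIs, and "waiting
for pending DMA" is modeled by simply zeroing a counter. The point it
illustrates is that the pin on the original page is dropped only after the
device page table, the cached translation and the in-flight DMA no longer
refer to it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are real kernel APIs. */
struct toy_page {
    int refcount;               /* stands in for the GUP pin */
};

struct toy_device {
    struct toy_page *mapped;    /* what the device page table points at */
    struct toy_page *tlb;       /* translation cached by the device */
    int inflight;               /* DMAs issued but not yet completed */
};

/*
 * Redirect the device from 'old' to 'scratch', and drop the pin on 'old'
 * only after the stale translation and all in-flight DMA are gone.
 * Returns false if the device was not using 'old'.
 */
bool toy_emergency_revoke(struct toy_device *dev, struct toy_page *old,
                          struct toy_page *scratch)
{
    if (dev->mapped != old)
        return false;

    scratch->refcount++;        /* the device now holds the scratch page */
    dev->mapped = scratch;      /* page table update */
    dev->tlb = scratch;         /* TLB flush: forget the old translation */
    dev->inflight = 0;          /* models waiting for pending DMA to drain */

    /* From here on the hardware cannot touch 'old' any more. */
    old->refcount--;            /* models dma_unmap + put_user_page */
    return true;
}
```

Whether a real device can provide the "drain" step at all is exactly the
device-specific question raised in this thread; the model just assumes it.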

Cheers,
Jérôme

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  2:02                                   ` Jerome Glisse
@ 2018-12-13 15:56                                     ` Christopher Lameter
  2018-12-13 16:02                                       ` Jerome Glisse
  2018-12-14  6:00                                     ` Dave Chinner
  1 sibling, 1 reply; 206+ messages in thread
From: Christopher Lameter @ 2018-12-13 15:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, 12 Dec 2018, Jerome Glisse wrote:

> On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > > > >     [O1] Avoid write back from a page still being written by either a
> > > > >          device or some direct I/O or any other existing user of GUP.
> >
> > IOWs, you need to mark pages being written to by a GUP as
> > PageWriteback, so all attempts to write the page will block on
> > wait_on_page_writeback() before trying to write the dirty page.
>
> No you don't and you can't for the simple reasons is that the GUP
> of some device driver can last days, weeks, months, years ... so
> it is not something you want to do. Here is what happens today:

I think it would be better to use the established way to block access that
Dave suggests. Maybe deal with the issue of threads being blocked for
a long time instead? Introduce a way to abort these attempts in a
controlled fashion that also allows easy debugging of these conflicts?

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 15:56                                     ` Christopher Lameter
@ 2018-12-13 16:02                                       ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13 16:02 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Dave Chinner, Jan Kara, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 13, 2018 at 03:56:05PM +0000, Christopher Lameter wrote:
> On Wed, 12 Dec 2018, Jerome Glisse wrote:
> 
> > On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > > > > >     [O1] Avoid write back from a page still being written by either a
> > > > > >          device or some direct I/O or any other existing user of GUP.
> > >
> > > IOWs, you need to mark pages being written to by a GUP as
> > > PageWriteback, so all attempts to write the page will block on
> > > wait_on_page_writeback() before trying to write the dirty page.
> >
> > No you don't and you can't for the simple reasons is that the GUP
> > of some device driver can last days, weeks, months, years ... so
> > it is not something you want to do. Here is what happens today:
> 
> I think it would be better to use the established way to block access that
> Dave suggests. Maybe deal with the issue of threads being blocked for
> a long time instead? Introduce a way to abort these attempts in a
> controlled fashion that also allows easy debugging of these conflicts?

GUP does not have the information on how long the GUP will last,
and the GUP caller might not know either. What is worse is that
GUP users provide an API to userspace today to do this, and thus any
attempt to block this from happening can be interpreted (from
some point of view) as a regression, i.e. it worked in linux X.Y and does
not work in linux X.Y+1.

I am not against doing that; in fact I advocated at the last LSF that
any user of GUP that does not abide by mmu notifiers should be
denied GUP (direct IO, kvm and a couple of others like that being the
exception because they are cases we can properly fix).

Anyone that abides by mmu notifiers will drop the page reference on
any event like truncate, split, mremap, munmap, write back ... so
anyone with mmu notifiers is fine.
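
A minimal userspace sketch of that behavior, in C, assuming a hypothetical
toy_mirror structure that records which pages a device holds a reference
on. This is not the mmu_notifier API; it only models the rule that the
callback must drop every reference in the invalidated range before it
returns.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical per-device mirror of a small address range. */
struct toy_mirror {
    int pinned[TOY_NPAGES];     /* 1 if the device holds a ref on page i */
};

/*
 * Models a synchronous invalidate callback: every reference in
 * [start, end) is dropped before the function returns.
 * Returns the number of references dropped.
 */
int toy_invalidate_range(struct toy_mirror *m, size_t start, size_t end)
{
    int dropped = 0;

    if (end > TOY_NPAGES)
        end = TOY_NPAGES;       /* clamp to the mirrored range */

    for (size_t i = start; i < end; i++) {
        if (m->pinned[i]) {
            m->pinned[i] = 0;   /* device must stop using page i now */
            dropped++;
        }
    }
    return dropped;
}
```

A driver that cannot do the equivalent of the loop body before returning is
the case that is argued above should be denied GUP.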

Cheers,
Jérôme

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 15:18                                                   ` Jerome Glisse
@ 2018-12-13 18:12                                                     ` Tom Talpey
  2018-12-13 19:18                                                       ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Tom Talpey @ 2018-12-13 18:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On 12/13/2018 10:18 AM, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 09:51:18AM -0500, Tom Talpey wrote:
>> On 12/13/2018 9:18 AM, Jerome Glisse wrote:
>>> On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:
>>>> On 12/13/2018 7:43 AM, Jerome Glisse wrote:
>>>>> On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
>>>>>> On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
>>>>>>> On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
>>>>>>>> On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
>>>>>>>>>> Almost, we need some safety around assuming that DMA is complete the
>>>>>>>>>> page, so the notification would need to go all to way to userspace
>>>>>>>>>> with something like a file lease notification. It would also need to
>>>>>>>>>> be backstopped by an IOMMU in the case where the hardware does not /
>>>>>>>>>> can not stop in-flight DMA.
>>>>>>>>>
>>>>>>>>> You can always reprogram the hardware right away it will redirect
>>>>>>>>> any dma to the crappy page.
>>>>>>>>
>>>>>>>> That causes silent data corruption for RDMA users - we can't do that.
>>>>>>>>
>>>>>>>> The only way out for current hardware is to forcibly terminate the
>>>>>>>> RDMA activity somehow (and I'm not even sure this is possible, at
>>>>>>>> least it would be driver specific)
>>>>>>>>
>>>>>>>> Even the IOMMU idea probably doesn't work, I doubt all current
>>>>>>>> hardware can handle a PCI-E error TLP properly.
>>>>>>>
>>>>>>> What i saying is reprogram hardware to crappy page ie valid page
>>>>>>> dma map but that just has random content as a last resort to allow
>>>>>>> filesystem to reuse block. So their should be no PCIE error unless
>>>>>>> hardware freak out to see its page table reprogram randomly.
>>>>>>
>>>>>> No, that isn't an option. You can't silently provide corrupted data
>>>>>> for RDMA to transfer out onto the network, or silently discard data
>>>>>> coming in!!
>>>>>>
>>>>>> Think of the consequences of that - I have a fileserver process and
>>>>>> someone does ftruncate and now my clients receive corrupted data??
>>>>>
>>>>> This is what happens _today_ ie today someone do GUP on page file
>>>>> and then someone else do truncate the first GUP is effectively
>>>>> streaming _random_ data to network as the page does not correspond
>>>>> to anything anymore and once the RDMA MR goes aways and release
>>>>> the page the page content will be lost. So i am not changing anything
>>>>> here, what i proposed was to make it explicit to device driver at
>>>>> least that they were streaming random data. Right now this is all
>>>>> silent but this is what is happening wether you like it or not :)
>>>>>
>>>>> Note that  i am saying do that only for truncate to allow to be
>>>>> nice to fs. But again i am fine with whatever solution but you can
>>>>> not please everyone here. Either block truncate and fs folks will
>>>>> hate you or make it clear to device driver that you are streaming
>>>>> random things and RDMA people hates you.
>>>>>
>>>>>
>>>>>> The only option is to prevent the RDMA transfer from ever happening,
>>>>>> and we just don't have hardware support (beyond destroy everything) to
>>>>>> do that.
>>>>>>
>>>>>>> The question is who do you want to punish ? RDMA user that pin stuff
>>>>>>> and expect thing to work forever without worrying for other fs
>>>>>>> activities ? Or filesystem to pin block forever :)
>>>>>>
>>>>>> I don't want to punish everyone, I want both sides to have complete
>>>>>> data integrity as the USER has deliberately decided to combine DAX and
>>>>>> RDMA. So either stop it at the front end (ie get_user_pages_longterm)
>>>>>> or make it work in a way that guarantees integrity for both.
>>>>>>
>>>>>>>        S2: notify userspace program through device/sub-system
>>>>>>>            specific API and delay ftruncate. After a while if there
>>>>>>>            is no answer just be mean and force hardware to use
>>>>>>>            crappy page as anyway this is what happens today
>>>>>>
>>>>>> I don't think this happens today (outside of DAX).. Does it?
>>>>>
>>>>> It does it is just silent, i don't remember anything in the code
>>>>> that would stop a truncate to happen because of elevated refcount.
>>>>> This does not happen with ODP mlx5 as it does abide by _all_ mmu
>>>>> notifier. This is for anything that does ODP without support for
>>>>> mmu notifier.
>>>>
>>>> Wait - is it expected that the MMU notifier upcall is handled
>>>> synchronously? That is, the page DMA mapping must be torn down
>>>> immediately, and before returning?
>>>
>>> Yes you must torn down mapping before returning from mmu notifier
>>> call back. Any time after is too late. You obviously need hardware
>>> that can support that. In the infiniband sub-system AFAIK only the
>>> mlx5 hardware can do that. In the GPU sub-system everyone is fine.
>>
>> I'm skeptical that MLX5 can actually make this guarantee. But we
>> can take that offline in linux-rdma.
> 
> It does unless the code lies about what the hardware do :) See umem_odp.c
> in core and odp.c in mlx5 directories.

Ok, I did look and there are numerous error returns from these calls.
Some are related to resource shortages (including the rather ominous-
sounding "emergency_pages" in odp.c), others related to the generic
RDMA behaviors such as posting work requests and reaping their
completion status.

So I'd ask - what is the backup plan from the mmu notifier if the
unmap fails? Which it certainly will, in many real-world situations.

Tom.

>> I'm also skeptical that NVMe can do this.
>>
>>> Dunno about other sub-systems.
>>>
>>>
>>>> That's simply not possible, since the hardware needs to get control
>>>> to do this. Even if there were an IOMMU that could intercept the
>>>> DMA, reprogramming it will require a flush, which cannot be guaranteed
>>>> to occur "inline".
>>>
>>> If hardware can not do that then hardware should not use GUP, at
>>> least not on file back page. I advocated in favor of forbiding GUP
>>> for device that can not do that as right now this silently breaks
>>> in few cases (truncate, mremap, splice, reflink, ...). So the device
>>> in those cases can end up with GUPed pages that do not correspond
>>> to anything anymore ie they do not correspond to the memory backing
>>> the virtual address they were GUP against, nor they correspond to
>>> the file content at the given offset anymore. It is just random
>>> data as far as the kernel or filesystem is concern.
>>>
>>> Of course for this to happen you need an application that do stupid
>>> thing like create an MR in one thread on the mmap of a file and
>>> truncate that same file in another thread (or from the same thread).
>>>
>>> So this is unlikely to happen in sane program. It does not mean it
>>> will not happen.
>>
>> Completely agree. In other words, this is the responsibility of the
>> DAX (or g-u-p) consumer, which is not necessarily the program itself,
>> it could be an upper layer.
>>
>> In SMB3 and NFSv4, which I've been focused on, we envision using the
>> existing protocol leases to protect this. When requesting a DAX mapping,
>> the server may require an exclusive lease. If this mapping needs to
>> change, because of another conflicting mapping, the lease would be
>> recalled and the mapping dropped. This is a normal and well-established
>> filesystem requirement.
>>
>> The twist here is that the platform itself can initiate such an event.
>> It's my belief that this plumbing must flow to the *top* of the stack,
>> i.e. the entity that took the mapping (e.g. filesystem), and not
>> depend on the MMU notifier at the very bottom.
> 
> So this patchset is about mm plumbings, what fs does before mm code
> gets call is a fs discussion. Note also that GUP is always against
> a virtual address of a process. GUP is use by few driver to allow
> the device direct access to some portion of process address space.
> 
> The issues is that some of those device have designed an API that
> fully ignore things like munmap, splice, truncate, ... and as such
> they open the door for undefined behavior. I believe the original
> intention in all the cases was that the user would not do stupid
> thing like setup the device mapping through device specific API and
> then munmap or truncate or anything that would affect the range of
> virtual address.
> 
> Thing is from kernel point of view we should not and can not assume
> that userspace will behave properly. So we have to brace for the
> worst. Which is what this patchset is trying to do ie fix some of
> issues and make the rest of them explicit to device driver so that
> they can decide what to do about it.
> 
> My advice is for each of those sub-system/device to fail loudly
> when such thing happens so that the user knows he is doing something
> stupid or illegal.
> 
> 
>>> The second set of issues has to do with set_page_dirty happening
>>> a long time after page_release happened, so that the fs dirty
>>> page callback will see the page in a bad state and will BUG(), and you
>>> will have an oops and lose any data your device might have written
>>> to the page. This is highly filesystem dependent and also timing
>>> dependent, and linked to things like memory pressure, so it might not
>>> happen that often, but again it can happen.
>>>
>>>
>>>>>> .. and the remedy here is to kill the process, not provide corrupt
>>>>>> data. Kill the process is likely to not go over well with any real
>>>>>> users that want this combination.
>>>>>>
>>>>>> Think Samba serving files over RDMA - you can't have random unpriv
>>>>>> users calling ftruncate and causing smbd to be killed or serve corrupt
>>>>>> data.
>>>>>
>>>>> So what i am saying is there is a choice and it would be better to
>>>>> decide something than let the existing status quo where we just keep
>>>>> streaming random data after truncate to a GUPed page.
>>>>
>>>> Let's also remember that any torn-down DMA mapping can't be recycled
>>>> until all uses of the old DMA addresses are destroyed. The whole
>>>> thing screams for reference counting all the way down, to me.
>>>
>>> I am not saying reuse the DMA address in the emergency_mean_callback
>>> the idea was:
>>>
>>>       gup_page_emergency_revoke(device, page)
>>>       {
>>>           crappy_page = alloc_page();
>>>           dma_addr = dma_map(crappy_page, device, ...);
>>>           mydevice_page_table_update(device, crappy_page, dma_addr);
>>>           mydevice_tlb_flush(device);
>>>           mydevice_wait_pending_dma(device);
>>>
>>>           // at this point the original GUPed page is not access by hw
>>>
>>>           dma_unmap(page);
>>>           put_user_page(page);
>>>       }
>>
>> Ok, but my concern was also that the old DMA address then becomes
>> unused and may be grabbed by a new i/o. If the hardware still has
>> reads or writes in flight, and they arrive after the old address
>> becomes valid, well, oops.
> 
> In the above code, at the comment point the driver guarantees the hardware
> will never use the old dma address any more ie that all in flight
> dma are done. I know this can be done for some devices. It might
> very well not work for all and this is why this need to be a sub-
> system/device discussion.
> 
> Cheers,
> Jérôme
> 
> 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 18:12                                                     ` Tom Talpey
@ 2018-12-13 19:18                                                       ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-13 19:18 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Thu, Dec 13, 2018 at 01:12:06PM -0500, Tom Talpey wrote:
> On 12/13/2018 10:18 AM, Jerome Glisse wrote:
> > On Thu, Dec 13, 2018 at 09:51:18AM -0500, Tom Talpey wrote:
> > > On 12/13/2018 9:18 AM, Jerome Glisse wrote:
> > > > On Thu, Dec 13, 2018 at 08:40:49AM -0500, Tom Talpey wrote:
> > > > > On 12/13/2018 7:43 AM, Jerome Glisse wrote:
> > > > > > On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> > > > > > > On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > > > > > > > On Wed, Dec 12, 2018 at 04:37:03PM -0700, Jason Gunthorpe wrote:
> > > > > > > > > On Wed, Dec 12, 2018 at 04:53:49PM -0500, Jerome Glisse wrote:
> > > > > > > > > > > Almost, we need some safety around assuming that DMA is complete the
> > > > > > > > > > > > page, so the notification would need to go all the way to userspace
> > > > > > > > > > > with something like a file lease notification. It would also need to
> > > > > > > > > > > be backstopped by an IOMMU in the case where the hardware does not /
> > > > > > > > > > > can not stop in-flight DMA.
> > > > > > > > > > 
> > > > > > > > > > You can always reprogram the hardware right away it will redirect
> > > > > > > > > > any dma to the crappy page.
> > > > > > > > > 
> > > > > > > > > That causes silent data corruption for RDMA users - we can't do that.
> > > > > > > > > 
> > > > > > > > > The only way out for current hardware is to forcibly terminate the
> > > > > > > > > RDMA activity somehow (and I'm not even sure this is possible, at
> > > > > > > > > least it would be driver specific)
> > > > > > > > > 
> > > > > > > > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > > > > > > > hardware can handle a PCI-E error TLP properly.
> > > > > > > > 
> > > > > > > > > What I am saying is reprogram hardware to crappy page ie valid page
> > > > > > > > > dma map but that just has random content as a last resort to allow
> > > > > > > > > filesystem to reuse block. So there should be no PCIE error unless
> > > > > > > > > hardware freaks out on seeing its page table reprogrammed randomly.
> > > > > > > 
> > > > > > > No, that isn't an option. You can't silently provide corrupted data
> > > > > > > for RDMA to transfer out onto the network, or silently discard data
> > > > > > > coming in!!
> > > > > > > 
> > > > > > > Think of the consequences of that - I have a fileserver process and
> > > > > > > someone does ftruncate and now my clients receive corrupted data??
> > > > > > 
> > > > > > This is what happens _today_ ie today someone do GUP on page file
> > > > > > and then someone else do truncate the first GUP is effectively
> > > > > > streaming _random_ data to network as the page does not correspond
> > > > > > to anything anymore and once the RDMA MR goes aways and release
> > > > > > the page the page content will be lost. So i am not changing anything
> > > > > > here, what i proposed was to make it explicit to device driver at
> > > > > > least that they were streaming random data. Right now this is all
> > > > > > > silent but this is what is happening whether you like it or not :)
> > > > > > 
> > > > > > Note that  i am saying do that only for truncate to allow to be
> > > > > > nice to fs. But again i am fine with whatever solution but you can
> > > > > > not please everyone here. Either block truncate and fs folks will
> > > > > > hate you or make it clear to device driver that you are streaming
> > > > > > random things and RDMA people hates you.
> > > > > > 
> > > > > > 
> > > > > > > The only option is to prevent the RDMA transfer from ever happening,
> > > > > > > and we just don't have hardware support (beyond destroy everything) to
> > > > > > > do that.
> > > > > > > 
> > > > > > > > The question is who do you want to punish ? RDMA user that pin stuff
> > > > > > > > and expect thing to work forever without worrying for other fs
> > > > > > > > activities ? Or filesystem to pin block forever :)
> > > > > > > 
> > > > > > > I don't want to punish everyone, I want both sides to have complete
> > > > > > > data integrity as the USER has deliberately decided to combine DAX and
> > > > > > > RDMA. So either stop it at the front end (ie get_user_pages_longterm)
> > > > > > > or make it work in a way that guarantees integrity for both.
> > > > > > > 
> > > > > > > >        S2: notify userspace program through device/sub-system
> > > > > > > >            specific API and delay ftruncate. After a while if there
> > > > > > > >            is no answer just be mean and force hardware to use
> > > > > > > >            crappy page as anyway this is what happens today
> > > > > > > 
> > > > > > > I don't think this happens today (outside of DAX).. Does it?
> > > > > > 
> > > > > > It does it is just silent, i don't remember anything in the code
> > > > > > that would stop a truncate to happen because of elevated refcount.
> > > > > > This does not happen with ODP mlx5 as it does abide by _all_ mmu
> > > > > > notifier. This is for anything that does ODP without support for
> > > > > > mmu notifier.
> > > > > 
> > > > > Wait - is it expected that the MMU notifier upcall is handled
> > > > > synchronously? That is, the page DMA mapping must be torn down
> > > > > immediately, and before returning?
> > > > 
> > > > > Yes you must tear down the mapping before returning from mmu notifier
> > > > call back. Any time after is too late. You obviously need hardware
> > > > that can support that. In the infiniband sub-system AFAIK only the
> > > > mlx5 hardware can do that. In the GPU sub-system everyone is fine.
> > > 
> > > I'm skeptical that MLX5 can actually make this guarantee. But we
> > > can take that offline in linux-rdma.
> > 
> > It does unless the code lies about what the hardware do :) See umem_odp.c
> > in core and odp.c in mlx5 directories.
> 
> Ok, I did look and there are numerous error returns from these calls.
> Some are related to resource shortages (including the rather ominous-
> sounding "emergency_pages" in odp.c), others related to the generic
> RDMA behaviors such as posting work requests and reaping their
> completion status.
> 
> So I'd ask - what is the backup plan from the mmu notifier if the
> unmap fails? Which it certainly will, in many real-world situations.

No backup, it must succeed, invalidation is:
    invalidate_range_start_trampoline
        mlx5_ib_invalidate_range
            mlx5_ib_update_xlt

Besides sanity checks on data structure fields, the only failure path
I see there is for allocating a page to send commands to the device
and failing to map that page. What Mellanox should be doing there
is pre-allocate and pre-map a couple of pages for that, to avoid any
failure because of that.
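
As a rough illustration only, the pre-allocation idea could look like the
userspace sketch below. This is not mlx5 code: cmd_pool, cmd_page,
CMD_POOL_SIZE and the helper names are all hypothetical, and the "page" is a
plain buffer standing in for a pre-mapped DMA page. The point it tries to show
is that every step that can fail happens at setup time, so the invalidation
path only reserves from the pool and never allocates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CMD_POOL_SIZE 4	/* hypothetical number of pre-mapped command pages */

struct cmd_page {
	bool in_use;
	unsigned char data[64];	/* stand-in for a pre-mapped DMA page */
};

struct cmd_pool {
	struct cmd_page pages[CMD_POOL_SIZE];
};

/* Setup time: a real driver would allocate and DMA-map here, where an
 * error is still recoverable. */
void cmd_pool_init(struct cmd_pool *pool)
{
	for (size_t i = 0; i < CMD_POOL_SIZE; i++)
		pool->pages[i].in_use = false;
}

/* Invalidation path: no allocation and no mapping, only a reservation. */
struct cmd_page *cmd_pool_get(struct cmd_pool *pool)
{
	for (size_t i = 0; i < CMD_POOL_SIZE; i++) {
		if (!pool->pages[i].in_use) {
			pool->pages[i].in_use = true;
			return &pool->pages[i];
		}
	}
	/* Pool exhausted: a real driver would have to wait for a page to be
	 * returned rather than report failure to the mmu notifier. */
	return NULL;
}

/* Return a command page once the device has consumed the command. */
void cmd_pool_put(struct cmd_page *page)
{
	assert(page->in_use);
	page->in_use = false;
}
```

How large the pool needs to be, and how to wait when it is empty, is left
open here; that depends on how many invalidations can be in flight at once.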


There is no way we will accept mmu notifier to fail, it would block
tons of syscall like munmap, truncate, mremap, madvise, mprotect,...
So it is a non starter to try to ask for mmu notifier to fail.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  0:51                                 ` Dave Chinner
  2018-12-13  2:02                                   ` Jerome Glisse
@ 2018-12-14  3:52                                   ` John Hubbard
  2018-12-14  5:21                                     ` Dan Williams
  1 sibling, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-14  3:52 UTC (permalink / raw)
  To: Dave Chinner, Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/12/18 4:51 PM, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>> So this approach doesn't look like a win to me over using counter in struct
>>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>>> struct page so that we can fit that gup counter there as well. I know that
>>>>> it may be easier said than done...
>>>>

Agreed. After all the discussion this week, I'm thinking that the original idea
of a per-struct-page counter is better. Fortunately, we can do the moral equivalent 
of that, unless I'm overlooking something: Jerome had another proposal that he
described, off-list, for doing that counting, and his idea avoids the problem of 
finding space in struct page. (And in fact, when I responded yesterday, I initially 
thought that's where he was going with this.)

So how about this hybrid solution:

1. Stay with the basic RFC approach of using a per-page counter, but actually
store the counter(s) in the mappings instead of the struct page. We can use
!PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
there. So the total pinned count is scattered across mappings. Probably still need
a PageDmaPinned bit.

Thanks again to Jerome for coming up with that idea, and I hope I haven't missed
a critical point or misrepresented it.

2. put_user_page() would still restrict itself to managing PageDmaPinned and
dma_pinned_count, as before. No messing with page_mkwrite or anything that
requires lock_page():

void put_user_page(struct page *page)
{
	if (PageAnon(page))
		put_page(page);
	else {
		/* Approximately: Check PageDmaPinned, look up dma_pinned_count
		 * via page_mapping's, decrement the appropriate
		 * mapping's dma_pinned_count. Clear PageDmaPinned
		 * if dma_pinned_count hits zero.
		 */
	}

	...
}

I'm not sure how tricky finding the "appropriate" mapping is, but it seems 
like just comparing current->mm information with the mappings should do it.
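
To make the counting in steps 1 and 2 concrete, here is a minimal userspace
sketch. It assumes a single counter per page for simplicity (it does not model
scattering the count across mappings), and struct model_page and both helper
names are hypothetical stand-ins, not kernel API. It also ignores the race
between the count reaching zero and a concurrent pin, which a real
implementation would have to close.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in, not the kernel's struct page. */
struct model_page {
	bool anon;			/* models PageAnon() */
	atomic_bool dma_pinned;		/* models the proposed PageDmaPinned bit */
	atomic_int dma_pinned_count;	/* models the proposed pin counter */
	atomic_int refcount;		/* models the ordinary page refcount */
};

/* Models the get_user_pages() side: take a reference and, for file-backed
 * pages, record the pin. */
void model_get_user_page(struct model_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
	if (!page->anon) {
		atomic_fetch_add(&page->dma_pinned_count, 1);
		atomic_store(&page->dma_pinned, true);
	}
}

/* Models put_user_page(): drop the pin, clear the flag when the last pin
 * goes away, then drop the reference. No page lock is taken. */
void model_put_user_page(struct model_page *page)
{
	if (!page->anon) {
		int old = atomic_fetch_sub(&page->dma_pinned_count, 1);

		assert(old > 0);	/* unbalanced put would be a bug */
		if (old == 1)
			atomic_store(&page->dma_pinned, false);
	}
	atomic_fetch_sub(&page->refcount, 1);
}
```

In this model, two pins followed by one put leave the page marked pinned, and
only the second put clears the flag.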

3. And as before, use PageDmaPinned to decide what to do in page_mkclean() and
try_to_unmap().

Maybe here is the part where someone says, "you should have created the actual
patchset, instead of typing all those words". But I'm still hoping to get some
consensus first. :)

one more note below...

>>>> So i went back to the drawing board and first i would like to ascertain
>>>> that we all agree on what the objectives are:
>>>>
>>>>     [O1] Avoid write back from a page still being written by either a
>>>>          device or some direct I/O or any other existing user of GUP.
> 
> IOWs, you need to mark pages being written to by a GUP as
> PageWriteback, so all attempts to write the page will block on
> wait_on_page_writeback() before trying to write the dirty page.
> 
>>>>          This would avoid possible file system corruption.
> 
> This isn't a filesystem corruption vector. At worst, it could cause
> torn data writes due to updating the page while it is under IO. We
> have a name for this: "stable pages". This is designed to prevent
> updates to pages via mmap writes from causing corruption of things
> like MD RAID due to modification of the data during RAID parity
> calculations. Hence we have wait_for_stable_page() calls in all
> ->page_mkwrite implementations so that new mmap writes block until
> writeback IO is complete on the devices that require stable pages
> to prevent corruption.
> 
> IOWs, we already deal with this "delay new modification while
> writeback is in progress" problem in the mmap/filesystem world and
> have infrastructure to handle it. And the ->page_mkwrite code
> already deals with it.
> 
>>>>
>>>>     [O2] Avoid crash when set_page_dirty() is call on a page that is
>>>>          considered clean by core mm (buffer head have been remove and
>>>>          with some file system this turns into an ugly mess).
>>>
>>> I think that's wrong. This isn't an "avoid a crash" case, this is a
>>> "prevent data and/or filesystem corruption" case. The primary goal
>>> we have here is removing our exposure to potential corruption, which
>>> has the secondary effect of avoiding the crash/panics that currently
>>> occur as a result of inconsistent page/filesystem state.
>>
>> This is O1 avoid corruption is O1
> 
> It's "avoid a specific instance of data corruption", not a general
> mechanism for avoiding data/filesystem corruption.
> 
> Calling set_page_dirty() on a file backed page which has not been
> correctly prepared can cause data corruption, filesystem corruption
> and shutdowns, etc because we have dirty data over a region that is
> not correctly mapped. Yes, it can also cause a crash (because we
> really, really suck at validation and error handling in generic code
> paths), but there's so, so much more that can go wrong than crash
> the kernel when we do stupid shit like this.
> 
>>> i.e. The goal is to have ->page_mkwrite() called on the clean page
>>> /before/ the file-backed page is marked dirty, and hence we don't
>>> expose ourselves to potential corruption or crashes that are a
>>> result of inappropriately calling set_page_dirty() on clean
>>> file-backed pages.
>>
>> Yes and this would be handle by put_user_page ie:
> 
> No, put_user_page() is too late - it's after the DMA has completed,
> but we have to ensure the file has backing store allocated and the
> pages are in the correct state /before/ the DMA is done.
> 
> Think ENOSPC - that has to be handled before we do the DMA, not
> after. Before the DMA it is a recoverable error, after the DMA it is
> data loss/corruption failure.
> 
>> put_user_page(struct page *page, bool dirty)
>> {
>>     if (!PageAnon(page)) {
>>         if (dirty) {
>>             // Do the whole dance ie page_mkwrite and all before
>>             // calling set_page_dirty()
>>         }
>>         ...
>>     }
>>     ...
>> }
> 
> Essentially, doing this would require a whole new "dirty a page"
> infrastructure because it is in the IO path, not the page fault
> path.
> 
> And, for hardware that does its own page faults for DMA, this whole
> post-DMA page setup is broken because the pages will have already
> gone through ->page_mkwrite() and be set up correctly already.
> 
>>>> For [O2] i believe we can handle that case in the put_user_page()
>>>> function to properly dirty the page without causing filesystem
>>>> freak out.
>>>
>>> I'm pretty sure you can't call ->page_mkwrite() from
>>> put_user_page(), so I don't think this is workable at all.
>>
>> Hu why ? i can not think of any reason why you could not. User of
> 
> It's not a fault path, you can't safely lock pages, you can't take
> fault-path only locks in the IO path (mmap_sem inversion problems),
> etc.
> 

Yes, I looked closer at ->page_mkwrite (ext4_page_mkwrite, for example),
and it's clearly doing lock_page(), so it does seem like this particular
detail (calling page_mkwrite from put_user_page) is dead.

> /me has a nagging feeling this was all explained in a previous
> discussions of this patchset...
> 

Yes, lots of related discussion definitely happened already, for example
this October thread covered page_mkwrite and interactions with gup:

https://lore.kernel.org/r/20181001061127.GQ31060@dastard

...but so far, this is the first time I recall seeing a proposal to call
page_mkwrite from put_user_page. 


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14  3:52                                   ` John Hubbard
@ 2018-12-14  5:21                                     ` Dan Williams
  2018-12-14  6:11                                       ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-14  5:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: david, Jérôme Glisse, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/12/18 4:51 PM, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> >> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> >>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> >>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> >>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> >>>>> So this approach doesn't look like a win to me over using counter in struct
> >>>>> page and I'd rather try looking into squeezing HMM public page usage of
> >>>>> struct page so that we can fit that gup counter there as well. I know that
> >>>>> it may be easier said than done...
> >>>>
>
> Agreed. After all the discussion this week, I'm thinking that the original idea
> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
> of that, unless I'm overlooking something: Jerome had another proposal that he
> described, off-list, for doing that counting, and his idea avoids the problem of
> finding space in struct page. (And in fact, when I responded yesterday, I initially
> thought that's where he was going with this.)
>
> So how about this hybrid solution:
>
> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> store the counter(s) in the mappings instead of the struct page. We can use
> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
> there. So the total pinned count is scattered across mappings. Probably still need
> a PageDmaPinned bit.

How do you safely look at page->mapping from the get_user_pages_fast()
path? You'll be racing invalidation disconnecting the page from the
mapping.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13  2:02                                   ` Jerome Glisse
  2018-12-13 15:56                                     ` Christopher Lameter
@ 2018-12-14  6:00                                     ` Dave Chinner
  2018-12-14 15:13                                       ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: Dave Chinner @ 2018-12-14  6:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 12, 2018 at 09:02:29PM -0500, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > > it may be easier said than done...
> > > > > 
> > > > > So i went back to the drawing board and first i would like to ascertain
> > > > > that we all agree on what the objectives are:
> > > > > 
> > > > >     [O1] Avoid write back from a page still being written by either a
> > > > >          device or some direct I/O or any other existing user of GUP.
> > 
> > IOWs, you need to mark pages being written to by a GUP as
> > PageWriteback, so all attempts to write the page will block on
> > wait_on_page_writeback() before trying to write the dirty page.
> 
> No you don't and you can't for the simple reasons is that the GUP
> of some device driver can last days, weeks, months, years ... so
> it is not something you want to do. Here is what happens today:
>     - user space submit directio read from a file and writing to
>       virtual address and the problematic case is when that virtual
> >       address is actually a mmap of a file itself
>     - kernel do GUP on the virtual address, if the page has write
>       permission in the CPU page table mapping then the page
>       refcount is incremented and the page is return to directio
>       kernel code that do memcpy
> 
>       It means that the page already went through page_mkwrite so
>       all is fine from fs point of view.
>       If page does not have write permission then a page fault is
>       triggered and page_mkwrite will happen and prep the page
>       accordingly

Yes, the short term GUP references do the right thing. They aren't
the issue - the problem is the long term GUP references that dirty
clean pages without first having called ->page_mkwrite.

> In the above scheme a page write back might happens after we looked
> up the page from the CPU page table and before directio finish with
> memcpy so that the page content during the write back might not be
> stable. This is a small window for things to go bad and i do not
> think we know if anybody ever experience a bug because of that.
> 
> For other GUP users the flow is the same except that device driver
> that keep the page around and do continuous dma to it might last
> days, weeks, months, years ... so for those the race window is big
> enough for bad things to happen. Jan have report of such bugs.

i.e. this case.

GUP faults the page, gets marked dirty, time passes, page
writeback occurs, it's now mapped clean, time passes, another RDMA
hits those pages, it calls set_page_dirty() again and things go
boom.

Basically, you are saying that the problem here is that writeback
of a dirty page occurred while there was an active GUP, and that
you want us to ....

> So what I am proposing to fix the above is to have page_mkclean return
> an is_pin boolean; if the page is pinned then the fs code uses a bounce
> page to do the write back, giving a stable bounce page. Moreover the fs
> will need to keep around all buffer_head, blocks, ... ie whatever is
> associated with that file offset so that any later set_page_dirty
> would not freak out and would not need to reallocate blocks or do
> anything heavy weight.

.... keep the dirty page pinned and never written back until the GUP
is released.

Which, quite frankly, is insanity.  The whole point of
->page_mkwrite() is that we can clean file backed mapped pages at
any point in time and have the next write access correctly mark it
dirty again so it can be written back.

This is *absolutely necessary* for data integrity (i.e. fsync,
sync(), etc) as well as filesystem management operations (e.g.
filesystem freeze) to work correctly and not lose data if the system
crashes or generate corrupt snapshots for backup or migration
purposes.

> We have a separate discussion on what to do about truncate and other
> fs event that inherently invalidate portion of file so i do not
> want to complexify present discussion with those but we also have
> that in mind.
> 
> Do you see any fundamental issues with that ? It abides by all
> existing fs standard AFAICT (you have a page_mkwrite and we ask
> fs to keep the result of that around).

The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time.

> > Think ENOSPC - that has to be handled before we do the DMA, not
> > after. Before the DMA it is a recoverable error, after the DMA it is
> > data loss/corruption failure.
> 
> Yes agree and i hope that the above explanation properly explains
> that it would become legal to do set_page_dirty in put_user_page
> thanks to page_mkclean telling fs code not to recycle anything
> after write back finish.

No, page_mkclean doesn't help at all. Every time the page is dirtied
it may require block allocation (think COW filesystems) and so
ENOSPC (and block allocation) must be done /before/ the page is
dirtied. You can't just keep re-dirtying the same page and assuming
that the filesystem will just work with that - that's essentially
what the current code does with long term GUP references, and that's
why it's so broken.
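
A toy model of that ordering, for illustration only: the sketch below is
userspace C with hypothetical names (model_fs, model_fpage and the model_*
helpers), not kernel code. It assumes every dirtying needs one fresh block, as
on a COW filesystem, and it reports -EINVAL where a real kernel might BUG() or
corrupt data instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_fs {
	int free_blocks;	/* hypothetical free-space counter */
};

struct model_fpage {
	bool writable;	/* set once the model ->page_mkwrite() succeeded */
	bool dirty;
};

/* Models ->page_mkwrite(): reserve backing store BEFORE the write. */
int model_page_mkwrite(struct model_fs *fs, struct model_fpage *page)
{
	assert(fs->free_blocks >= 0);
	if (fs->free_blocks == 0)
		return -ENOSPC;	/* recoverable: nothing was written yet */
	fs->free_blocks--;
	page->writable = true;
	return 0;
}

/* Models set_page_dirty(): only valid on a page prepared by mkwrite. */
int model_set_page_dirty(struct model_fpage *page)
{
	if (!page->writable)
		return -EINVAL;	/* the long-term GUP case: page not prepared */
	page->dirty = true;
	return 0;
}

/* Models writeback: the page is clean and write-protected again, so the
 * next write access has to go through mkwrite once more. */
void model_writeback(struct model_fpage *page)
{
	page->dirty = false;
	page->writable = false;
}
```

With one free block, the first mkwrite and dirty succeed; after writeback a
second dirty without mkwrite is rejected, and the second mkwrite returns
-ENOSPC before any data has been written.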

/me is getting tired of explaining the same thing over and over
again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14  5:21                                     ` Dan Williams
@ 2018-12-14  6:11                                       ` John Hubbard
  2018-12-14 15:20                                         ` Jerome Glisse
  2018-12-14 19:38                                         ` Dan Williams
  0 siblings, 2 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-14  6:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: david, Jérôme Glisse, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 12/13/18 9:21 PM, Dan Williams wrote:
> On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/12/18 4:51 PM, Dave Chinner wrote:
>>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
>>>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
>>>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
>>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>>>> So this approach doesn't look like a win to me over using counter in struct
>>>>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>>>>> struct page so that we can fit that gup counter there as well. I know that
>>>>>>> it may be easier said than done...
>>>>>>
>>
>> Agreed. After all the discussion this week, I'm thinking that the original idea
>> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
>> of that, unless I'm overlooking something: Jerome had another proposal that he
>> described, off-list, for doing that counting, and his idea avoids the problem of
>> finding space in struct page. (And in fact, when I responded yesterday, I initially
>> thought that's where he was going with this.)
>>
>> So how about this hybrid solution:
>>
>> 1. Stay with the basic RFC approach of using a per-page counter, but actually
>> store the counter(s) in the mappings instead of the struct page. We can use
>> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
>> there. So the total pinned count is scattered across mappings. Probably still need
>> a PageDmaPinned bit.
> 
> How do you safely look at page->mapping from the get_user_pages_fast()
> path? You'll be racing invalidation disconnecting the page from the
> mapping.
> 

I don't have an answer for that, so maybe the page->mapping idea is dead already. 

So in that case, there is still one more way to do all of this, which is to
combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
there via basically page->private, more or less like this:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..13f651bb5cc1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,6 +67,13 @@ struct hmm;
 #define _struct_page_alignment
 #endif
 
+struct page_aux {
+       struct dev_pagemap *pgmap;
+       unsigned long hmm_data;
+       unsigned long private;
+       atomic_t dma_pinned_count;
+};
+
 struct page {
        unsigned long flags;            /* Atomic flags, some possibly
                                         * updated asynchronously */
@@ -149,11 +156,13 @@ struct page {
                        spinlock_t ptl;
 #endif
                };
-               struct {        /* ZONE_DEVICE pages */
+               struct {        /* ZONE_DEVICE, HMM or get_user_pages() pages */
                        /** @pgmap: Points to the hosting device page map. */
-                       struct dev_pagemap *pgmap;
-                       unsigned long hmm_data;
-                       unsigned long _zd_pad_1;        /* uses mapping */
+                       unsigned long _zd_pad_1;        /* LRU */
+                       unsigned long _zd_pad_2;        /* LRU */
+                       unsigned long _zd_pad_3;        /* mapping */
+                       unsigned long _zd_pad_4;        /* index */
+                       struct page_aux *aux;           /* private */
                };
 
                /** @rcu_head: You can use this to free a page by RCU. */

...is there any appetite for that approach?
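
For illustration only, here is a minimal userspace sketch of how a pin
count reached through such an out-of-line structure might be used. The
helper names (page_dma_pin, page_dma_unpin, page_dma_pinned) are
hypothetical and not part of the diff above, struct page is reduced to
two fields, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the proposed out-of-line structure. */
struct page_aux {
	void *pgmap;			/* stand-in for struct dev_pagemap * */
	unsigned long hmm_data;
	unsigned long private;
	atomic_int dma_pinned_count;	/* stand-in for atomic_t */
};

/* Heavily reduced model of struct page: only what the sketch needs. */
struct page {
	unsigned long flags;
	struct page_aux *aux;		/* would overlay page->private */
};

/* Hypothetical helper: take one DMA pin; the caller guarantees aux exists. */
static inline void page_dma_pin(struct page *page)
{
	atomic_fetch_add(&page->aux->dma_pinned_count, 1);
}

/* Hypothetical helper: drop one DMA pin taken by page_dma_pin(). */
static inline void page_dma_unpin(struct page *page)
{
	atomic_fetch_sub(&page->aux->dma_pinned_count, 1);
}

/* A page without an aux structure is, by construction, not pinned. */
static inline bool page_dma_pinned(const struct page *page)
{
	return page->aux && atomic_load(&page->aux->dma_pinned_count) > 0;
}
```

The sketch ignores who allocates and frees the aux structure, which is
the hard part of the proposal.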

-- 
thanks,
John Hubbard
NVIDIA


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-13 12:43                                           ` Jerome Glisse
  2018-12-13 13:40                                             ` Tom Talpey
@ 2018-12-14 10:41                                             ` Jan Kara
  2018-12-14 15:25                                               ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: Jan Kara @ 2018-12-14 10:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel, Weiny, Ira

On Thu 13-12-18 07:43:25, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> > On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > > hardware can handle a PCI-E error TLP properly. 
> > > 
> > > What i saying is reprogram hardware to crappy page ie valid page
> > > dma map but that just has random content as a last resort to allow
> > > filesystem to reuse block. So their should be no PCIE error unless
> > > hardware freak out to see its page table reprogram randomly.
> > 
> > No, that isn't an option. You can't silently provide corrupted data
> > for RDMA to transfer out onto the network, or silently discard data
> > coming in!! 
> > 
> > Think of the consequences of that - I have a fileserver process and
> > someone does ftruncate and now my clients receive corrupted data??
> 
> This is what happens _today_ ie today someone do GUP on page file
> and then someone else do truncate the first GUP is effectively
> streaming _random_ data to network as the page does not correspond
> to anything anymore and once the RDMA MR goes aways and release
> the page the page content will be lost. So i am not changing anything
> here, what i proposed was to make it explicit to device driver at
> least that they were streaming random data. Right now this is all
> silent but this is what is happening wether you like it or not :)

I think you're making the current behaviour sound worse than it really is.
You are correct that currently driver can setup RDMA with some page, one
instant later that page can get truncated from the file and thus has no
association to the file anymore. That can lead to *stale* data being
streamed over RDMA or loss of data that are coming from RDMA. But none of
this is actually a security issue - no streaming of random data or memory
corruption. And that's all kernel cares about. It is userspace
responsibility to make sure file cannot be truncated if it cannot tolerate
stale data.

So your "redirect RDMA to dummy page" solution has to make sure you really
swap one real page for one dummy page and copy old real page contents to
the dummy page contents. Then it will be equivalent to the current behavior
and if the hardware can do the swapping, then I'm fine with such
solution...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14  6:00                                     ` Dave Chinner
@ 2018-12-14 15:13                                       ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-14 15:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 14, 2018 at 05:00:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 09:02:29PM -0500, Jerome Glisse wrote:
> > On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > > > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > > > it may be easier said than done...
> > > > > > 
> > > > > > So i want back to the drawing board and first i would like to ascertain
> > > > > > that we all agree on what the objectives are:
> > > > > > 
> > > > > >     [O1] Avoid write back from a page still being written by either a
> > > > > >          device or some direct I/O or any other existing user of GUP.
> > > 
> > > IOWs, you need to mark pages being written to by a GUP as
> > > PageWriteback, so all attempts to write the page will block on
> > > wait_on_page_writeback() before trying to write the dirty page.
> > 
> > No you don't and you can't for the simple reasons is that the GUP
> > of some device driver can last days, weeks, months, years ... so
> > it is not something you want to do. Here is what happens today:
> >     - user space submit directio read from a file and writing to
> >       virtual address and the problematic case is when that virtual
> >       address is actualy a mmap of a file itself
> >     - kernel do GUP on the virtual address, if the page has write
> >       permission in the CPU page table mapping then the page
> >       refcount is incremented and the page is return to directio
> >       kernel code that do memcpy
> > 
> >       It means that the page already went through page_mkwrite so
> >       all is fine from fs point of view.
> >       If page does not have write permission then a page fault is
> >       triggered and page_mkwrite will happen and prep the page
> >       accordingly
> 
> Yes, the short term GUP references do the right thing. They aren't
> the issue - the problem is the long term GUP references that dirty
> clean pages without first having called ->page_mkwrite.
> 
> > In the above scheme a page write back might happens after we looked
> > up the page from the CPU page table and before directio finish with
> > memcpy so that the page content during the write back might not be
> > stable. This is a small window for things to go bad and i do not
> > think we know if anybody ever experience a bug because of that.
> > 
> > For other GUP users the flow is the same except that device driver
> > that keep the page around and do continuous dma to it might last
> > days, weeks, months, years ... so for those the race window is big
> > enough for bad things to happen. Jan have report of such bugs.
> 
> i.e. this case.
> 
> GUP faults the page, gets marked dirty, time passes, page
> writeback occurs, it's now mapped clean, time passes, another RDMA
> hits those pages, it calls set_page_dirty() again and things go
> boom.
> 
> Basically, you are saying that the problem here is that writeback
> of a dirty page occurred while there was an active GUP, and that
> you want us to ....
> 
> > So what i am proposing to fix the above is have page_mkclean return
> > a is_pin boolean if page is pin than the fs code use a bounce page
> > to do the write back giving a stable bounce page. More over fs will
> > need to keep around all buffer_head, blocks, ... ie whatever is
> > associated with that file offset so that any latter set_page_dirty
> > would not freak out and would not need to reallocate blocks or do
> > anything heavy weight.
> 
> .... keep the dirty page pinned and never written back until the GUP
> is released.

I am sorry if i am so hard to understand but this is not what i
have in mind. What i have in mind is that the write back will use a
bounce page so that the page content is stable and dma can keep
happening to the GUPed page while write back makes progress. But
the end-of-write-back callback should not free buffer_head or
blocks or anything that was set up by the first page_mkwrite, so
that another set_page_dirty can happen at any time afterwards without
the fs code freaking out.
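
As a rough userspace sketch of the bounce-page part of this idea (it
does not model buffer_head or block retention): when the page is
pinned, write back takes a snapshot and does I/O from the copy, so
later DMA to the original page cannot change what is being written.
The names page_model, MODEL_PAGE_SIZE and writeback_source are
hypothetical, chosen only for this example.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Toy page: contents plus a flag standing in for "has a GUP pin". */
struct page_model {
	unsigned char data[MODEL_PAGE_SIZE];
	bool pinned;
};

/*
 * Return the buffer that write back should submit for I/O.
 * Unpinned pages are written in place. Pinned pages are copied into
 * the caller-provided bounce buffer first, so the submitted content
 * stays stable even if a device keeps writing to pg->data.
 */
static inline const unsigned char *
writeback_source(const struct page_model *pg, unsigned char *bounce)
{
	if (!pg->pinned)
		return pg->data;

	memcpy(bounce, pg->data, MODEL_PAGE_SIZE);
	return bounce;
}
```

With a pinned page, a store to pg->data after writeback_source()
returns is not visible through the returned pointer, which is the
stability property being asked for.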


> Which, quite frankly, is insanity.  The whole point of
> ->page_mkwrite() is that we can clean file backed mapped pages at
> any point in time and have the next write access correctly mark it
> dirty again so it can be written back.
> 
> This is *absolutely necessary* for data integrity (i.e. fsync,
> sync(), etc) as well as filesystem management operations (e.g.
> filesystem freeze) to work correctly and not lose data if the system
> crashes or generate corrupt snapshots for backup or migration
> purposes.

Is keeping the result of the first page_mkwrite not doable? Or
at least avoiding freeing blocks and such, so that we can have a
later, lighter page_mkwrite_light that can be called by put_user_page?

The thing is we can not stop the DMA on some devices and thus we
can not force the device to redo a GUP after a write back. Note that
if you say that devices must do that and that you will not accept
anything that does not, i am fine with that; this was what i
advocated for in the first place. But it means that a certain
number of device drivers will have to regress from the user point of
view, ie they will not be able to support GUP anymore.

> 
> > We have a separate discussion on what to do about truncate and other
> > fs event that inherently invalidate portion of file so i do not
> > want to complexify present discussion with those but we also have
> > that in mind.
> > 
> > Do you see any fundamental issues with that ? It abides by all
> > existing fs standard AFAICT (you have a page_mkwrite and we ask
> > fs to keep the result of that around).
> 
> The fundamental issue is that ->page_mkwrite must be called on every
> write access to a clean file backed page, not just the first one.
> How long the GUP reference lasts is irrelevant, if the page is clean
> and you need to dirty it, you must call ->page_mkwrite before it is
> marked writeable and dirtied. Every. Time.

I am fine with that, then it is just a matter of telling devices that
do not abide by mmu notifier that they can not use GUP anymore, which
means that they will regress from the user point of view. But i am ok
with that.

> 
> > > Think ENOSPC - that has to be handled before we do the DMA, not
> > > after. Before the DMA it is a recoverable error, after the DMA it is
> > > data loss/corruption failure.
> > 
> > Yes agree and i hope that the above explaination properly explains
> > that it would become legal to do set_page_dirty in put_user_page
> > thanks to page_mkclean telling fs code not to recycle anything
> > after write back finish.
> 
> No, page_mkclean doesn't help at all. Every time the page is dirtied
> it may require block allocation (think COW filesystems) and so
> ENOSPC (and block allocation) must be done /before/ the page is
> dirtied. YOU can't just keep re-dirtying the same page and assuming
> that the filesystem will just work with that - that's essentially
> what the current code does with long term GUP references, and that's
> why it's so broken.
> 
> /me is getting tired of explaining the same thing over and over
> again.

Sorry you feel that way, thank you for bearing with me. Like i said,
i am fine with telling GUP users that do not abide by mmu notifier,
and thus keep writing to the page after write back, that they need
to stop even if it means breaking existing userspace.

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14  6:11                                       ` John Hubbard
@ 2018-12-14 15:20                                         ` Jerome Glisse
  2018-12-14 19:38                                         ` Dan Williams
  1 sibling, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-14 15:20 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, david, Jan Kara, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 13, 2018 at 10:11:09PM -0800, John Hubbard wrote:
> On 12/13/18 9:21 PM, Dan Williams wrote:
> > On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 12/12/18 4:51 PM, Dave Chinner wrote:
> >>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> >>>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> >>>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> >>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> >>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> >>>>>>> So this approach doesn't look like a win to me over using counter in struct
> >>>>>>> page and I'd rather try looking into squeezing HMM public page usage of
> >>>>>>> struct page so that we can fit that gup counter there as well. I know that
> >>>>>>> it may be easier said than done...
> >>>>>>
> >>
> >> Agreed. After all the discussion this week, I'm thinking that the original idea
> >> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
> >> of that, unless I'm overlooking something: Jerome had another proposal that he
> >> described, off-list, for doing that counting, and his idea avoids the problem of
> >> finding space in struct page. (And in fact, when I responded yesterday, I initially
> >> thought that's where he was going with this.)
> >>
> >> So how about this hybrid solution:
> >>
> >> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> >> store the counter(s) in the mappings instead of the struct page. We can use
> >> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
> >> there. So the total pinned count is scattered across mappings. Probably still need
> >> a PageDmaPinned bit.
> > 
> > How do you safely look at page->mapping from the get_user_pages_fast()
> > path? You'll be racing invalidation disconnecting the page from the
> > mapping.
> > 
> 
> I don't have an answer for that, so maybe the page->mapping idea is dead already. 
> 
> So in that case, there is still one more way to do all of this, which is to
> combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
> there via basically page->private, more or less like this:

The page mapcount idea does work to get a pin count. So i believe
that this is what should be pursued; if no one wants to try it i
will do patches. Anything else is too invasive and requires too
many changes. Note that in all the discussion that happened around the
mapcount, having a separate pin count would not have helped one bit,
nor would it solve the page_mkwrite issue.

So we need to audit the put_user_page call sites and see if they can
sleep and call mkwrite without issue. I believe the answer will be
yes for many ... maybe all.
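
A minimal userspace sketch of the mapcount accounting, under the
assumption that GUP bumps both refcount and mapcount and that a reverse
mapping walk (page_mkclean() in the kernel) reports how many real
mappings exist. All names below are hypothetical, and the counters
start at 0 rather than following the kernel's mapcount convention.

```c
#include <stdbool.h>

/* Toy page with the two counters the scheme relies on. */
struct page_model {
	int refcount;
	int mapcount;
};

/* A real CPU mapping takes a reference and a mapcount. */
static inline void map_page(struct page_model *pg)
{
	pg->refcount++;
	pg->mapcount++;
}

/* GUP on a file-backed page: bump both counters, like a mapping. */
static inline void gup_pin(struct page_model *pg)
{
	pg->refcount++;
	pg->mapcount++;
}

/* The special put_user_page(): undo exactly what gup_pin() did. */
static inline void put_user_page_model(struct page_model *pg)
{
	pg->mapcount--;
	pg->refcount--;
}

/*
 * real_mappings is the number of mappings found by the rmap walk.
 * Whatever mapcount exceeds that by is attributed to GUP pins.
 */
static inline int page_pin_count(const struct page_model *pg, int real_mappings)
{
	return pg->mapcount - real_mappings;
}

static inline bool page_is_pinned(const struct page_model *pg, int real_mappings)
{
	return page_pin_count(pg, real_mappings) > 0;
}
```

The pin count is only as accurate as the rmap walk, so in a real
implementation the comparison would have to be made while the walk's
result is still valid.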

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 10:41                                             ` Jan Kara
@ 2018-12-14 15:25                                               ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-14 15:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Dan Williams, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, Weiny, Ira

On Fri, Dec 14, 2018 at 11:41:25AM +0100, Jan Kara wrote:
> On Thu 13-12-18 07:43:25, Jerome Glisse wrote:
> > On Wed, Dec 12, 2018 at 08:20:43PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 12, 2018 at 07:01:09PM -0500, Jerome Glisse wrote:
> > > > > Even the IOMMU idea probably doesn't work, I doubt all current
> > > > > hardware can handle a PCI-E error TLP properly. 
> > > > 
> > > > What i saying is reprogram hardware to crappy page ie valid page
> > > > dma map but that just has random content as a last resort to allow
> > > > filesystem to reuse block. So their should be no PCIE error unless
> > > > hardware freak out to see its page table reprogram randomly.
> > > 
> > > No, that isn't an option. You can't silently provide corrupted data
> > > for RDMA to transfer out onto the network, or silently discard data
> > > coming in!! 
> > > 
> > > Think of the consequences of that - I have a fileserver process and
> > > someone does ftruncate and now my clients receive corrupted data??
> > 
> > This is what happens _today_ ie today someone do GUP on page file
> > and then someone else do truncate the first GUP is effectively
> > streaming _random_ data to network as the page does not correspond
> > to anything anymore and once the RDMA MR goes aways and release
> > the page the page content will be lost. So i am not changing anything
> > here, what i proposed was to make it explicit to device driver at
> > least that they were streaming random data. Right now this is all
> > silent but this is what is happening wether you like it or not :)
> 
> I think you're making the current behaviour sound worse than it really is.
> You are correct that currently driver can setup RDMA with some page, one
> instant later that page can get truncated from the file and thus has no
> association to the file anymore. That can lead to *stale* data being
> streamed over RDMA or loss of data that are coming from RDMA. But none of
> this is actually a security issue - no streaming of random data or memory
> corruption. And that's all kernel cares about. It is userspace
> responsibility to make sure file cannot be truncated if it cannot tolerate
> stale data.
> 
> So your "redirect RDMA to dummy page" solution has to make sure you really
> swap one real page for one dummy page and copy old real page contents to
> the dummy page contents. Then it will be equivalent to the current behavior
> and if the hardware can do the swapping, then I'm fine with such
> solution...

Yeah, sorry if i make it sound worse than it is. From my point of view
it is random data because it no longer matches anything, but yes, it
still corresponds to the correct file data from before the truncate, so
in that sense it is not random.

Yes, on copying the existing content to the new crappy page: i said that
at one point during the discussion, to make the crappy page less crappy.
But it seems people feel that devices that do not abide by mmu notifier
are also the devices that can not be updated at _any_ time, and thus that
reprogramming their dma engine is a non starter.

I believe the truncate and revoke issue is something that should be
discussed sub-system by sub-system, where people know what their devices
can and can not do. I will try to write a document that explains the
GUP pitfalls so that it can be used to explain the issue.

Cheers,
Jérôme


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-12 21:46                             ` Dave Chinner
  2018-12-12 21:59                               ` Jerome Glisse
@ 2018-12-14 15:43                               ` Jan Kara
  2018-12-16 21:58                                 ` Dave Chinner
  1 sibling, 1 reply; 206+ messages in thread
From: Jan Kara @ 2018-12-14 15:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, Jan Kara, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

Hi!

On Thu 13-12-18 08:46:41, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > So this approach doesn't look like a win to me over using counter in struct
> > > page and I'd rather try looking into squeezing HMM public page usage of
> > > struct page so that we can fit that gup counter there as well. I know that
> > > it may be easier said than done...
> > 
> > So i want back to the drawing board and first i would like to ascertain
> > that we all agree on what the objectives are:
> > 
> >     [O1] Avoid write back from a page still being written by either a
> >          device or some direct I/O or any other existing user of GUP.
> >          This would avoid possible file system corruption.
> > 
> >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> >          considered clean by core mm (buffer head have been remove and
> >          with some file system this turns into an ugly mess).
> 
> I think that's wrong. This isn't an "avoid a crash" case, this is a
> "prevent data and/or filesystem corruption" case. The primary goal
> we have here is removing our exposure to potential corruption, which
> has the secondary effect of avoiding the crash/panics that currently
> occur as a result of inconsistent page/filesystem state.
> 
> i.e. The goal is to have ->page_mkwrite() called on the clean page
> /before/ the file-backed page is marked dirty, and hence we don't
> expose ourselves to potential corruption or crashes that are a
> result of inappropriately calling set_page_dirty() on clean
> file-backed pages.

I agree that [O1] - i.e., avoid corrupting fs data - is more important and
[O2] is just one consequence of [O1].

> > For [O1] and [O2] i believe a solution with mapcount would work. So
> > no new struct, no fake vma, nothing like that. In GUP for file back
> > pages we increment both refcount and mapcount (we also need a special
> > put_user_page to decrement mapcount when GUP user are done with the
> > page).
> 
> I don't see how a mapcount can prevent anyone from calling
> set_page_dirty() inappropriately.
> 
> > Now for [O1] the write back have to call page_mkclean() to go through
> > all reverse mapping of the page and map read only. This means that
> > we can count the number of real mapping and see if the mapcount is
> > bigger than that. If mapcount is bigger than page is pin and we need
> > to use a bounce page to do the writeback.
> 
> Doesn't work. Generally filesystems have already mapped the page
> into bios before they call clear_page_dirty_for_io(), so it's too
> late for the filesystem to bounce the page at that point.

Yes, for filesystem it is too late. But the plan we figured back in October
was to do the bouncing in the block layer. I.e., mark the bio (or just the
particular page) as needing bouncing and then use the existing page
bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
it was still a separate fs driver) has been using a mechanism like this to
make DIF/DIX work with its metadata.

> > For [O2] i believe we can handle that case in the put_user_page()
> > function to properly dirty the page without causing filesystem
> > freak out.
> 
> I'm pretty sure you can't call ->page_mkwrite() from
> put_user_page(), so I don't think this is workable at all.

Yes, calling ->page_mkwrite() in put_user_page() is not only technically
complicated but also too late - DMA has already modified page contents.
What we planned to do (again discussed back in October) was to never allow
the pinned page to become clean. I.e., clear_page_dirty_for_io() would
leave pinned pages dirty. Also we would skip pinned pages for WB_SYNC_NONE
writeback as there's no point in that really. That way MM and filesystems
would be aware of the real page state - i.e., what's in memory is not in
sync (potentially) with what's on disk. I was thinking whether this
permanently-dirty state couldn't confuse filesystem in some way but I
didn't find anything serious - the worst I could think of are places that
do filemap_write_and_wait() and then invalidate page cache e.g. before hole
punching or extent shifting. But these should work fine as is (page cache
invalidation will just happily truncate dirty pages). DIO might get
confused by the inability to invalidate dirty pages but then user combining
RDMA with DIO on the same file at one moment gets what he deserves...
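
The two rules above (pinned pages never become clean, and WB_SYNC_NONE
writeback skips them) can be sketched in userspace roughly as follows.
This is only a model of the intended behaviour, not kernel code: the
names page_model, sync_mode, clear_dirty_for_io and writeback_one are
made up for the example.

```c
#include <stdbool.h>

/* Mirrors the kernel's WB_SYNC_NONE / WB_SYNC_ALL distinction. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

/* Toy page state: dirty bit plus "has a GUP pin". */
struct page_model {
	bool dirty;
	bool pinned;
};

/*
 * Model of clear_page_dirty_for_io(): returns true if the page needs
 * to be written. A pinned page is written but deliberately left dirty,
 * because the device may modify it again at any time.
 */
static inline bool clear_dirty_for_io(struct page_model *pg)
{
	if (!pg->dirty)
		return false;
	if (!pg->pinned)
		pg->dirty = false;
	return true;
}

/*
 * Model of per-page writeback: background (SYNC_NONE) writeback skips
 * pinned pages entirely, data-integrity (SYNC_ALL) writeback still
 * writes them. Returns true if I/O would be submitted for the page.
 */
static inline bool writeback_one(struct page_model *pg, enum sync_mode mode)
{
	if (pg->pinned && mode == SYNC_NONE)
		return false;
	return clear_dirty_for_io(pg);
}
```

In this model a pinned page is written on every SYNC_ALL pass and stays
dirty afterwards, which matches the "permanently dirty" state described
above.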

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14  6:11                                       ` John Hubbard
  2018-12-14 15:20                                         ` Jerome Glisse
@ 2018-12-14 19:38                                         ` Dan Williams
  2018-12-14 19:48                                           ` Matthew Wilcox
  2018-12-17  8:56                                           ` Jan Kara
  1 sibling, 2 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-14 19:38 UTC (permalink / raw)
  To: John Hubbard
  Cc: david, Jérôme Glisse, Jan Kara, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel, Dave Hansen

On Thu, Dec 13, 2018 at 10:11 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/13/18 9:21 PM, Dan Williams wrote:
> > On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 12/12/18 4:51 PM, Dave Chinner wrote:
> >>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> >>>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> >>>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> >>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> >>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> >>>>>>> So this approach doesn't look like a win to me over using counter in struct
> >>>>>>> page and I'd rather try looking into squeezing HMM public page usage of
> >>>>>>> struct page so that we can fit that gup counter there as well. I know that
> >>>>>>> it may be easier said than done...
> >>>>>>
> >>
> >> Agreed. After all the discussion this week, I'm thinking that the original idea
> >> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
> >> of that, unless I'm overlooking something: Jerome had another proposal that he
> >> described, off-list, for doing that counting, and his idea avoids the problem of
> >> finding space in struct page. (And in fact, when I responded yesterday, I initially
> >> thought that's where he was going with this.)
> >>
> >> So how about this hybrid solution:
> >>
> >> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> >> store the counter(s) in the mappings instead of the struct page. We can use
> >> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
> >> there. So the total pinned count is scattered across mappings. Probably still need
> >> a PageDmaPinned bit.
> >
> > How do you safely look at page->mapping from the get_user_pages_fast()
> > path? You'll be racing invalidation disconnecting the page from the
> > mapping.
> >
>
> I don't have an answer for that, so maybe the page->mapping idea is dead already.
>
> So in that case, there is still one more way to do all of this, which is to
> combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
> there via basically page->private, more or less like this:

If we're going to allocate something new out-of-line then maybe we
should go even further to allow for a page "proxy" object to front a
real struct page. This idea arose from Dave Hansen as I explained to
him the dax-reflink problem, and dovetails with Dave Chinner's
suggestion earlier in this thread for dax-reflink.

Have get_user_pages() allocate a proxy object that gets passed around
to drivers. Something like a struct page pointer with bit 0 set. This
would add a conditional branch and pointer chase to many page
operations, like page_to_pfn(). I thought something like it would be
unacceptable a few years ago, but then HMM went and added similar
overhead to put_page() and nobody balked.
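
A minimal userspace sketch of the "struct page pointer with bit 0 set"
encoding, to show where the conditional branch and pointer chase come
from. The struct layouts and helper names (page_proxy, proxy_to_handle,
handle_is_proxy, handle_to_page, handle_to_pfn) are hypothetical, and
the pfn field exists only so the example has something to look up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced model of the real struct page. */
struct page {
	unsigned long flags;
	unsigned long pfn;	/* illustrative payload only */
};

/* Proxy object fronting a real page; could override index/mapping. */
struct page_proxy {
	struct page *real;
	unsigned long index;
};

#define PROXY_TAG ((uintptr_t)1)

/* Encode a proxy as a "struct page *" with bit 0 set. */
static inline struct page *proxy_to_handle(struct page_proxy *proxy)
{
	return (struct page *)((uintptr_t)proxy | PROXY_TAG);
}

/* Real struct page pointers are aligned, so bit 0 is free as a tag. */
static inline bool handle_is_proxy(const struct page *handle)
{
	return ((uintptr_t)handle & PROXY_TAG) != 0;
}

/* The extra branch and pointer chase every page operation would pay. */
static inline struct page *handle_to_page(struct page *handle)
{
	if (handle_is_proxy(handle)) {
		struct page_proxy *proxy =
			(struct page_proxy *)((uintptr_t)handle & ~PROXY_TAG);
		return proxy->real;
	}
	return handle;
}

/* Stand-in for an operation like page_to_pfn() on either kind of handle. */
static inline unsigned long handle_to_pfn(struct page *handle)
{
	return handle_to_page(handle)->pfn;
}
```

Code that dereferences a tagged handle directly, without going through
handle_to_page(), would fault on the misaligned pointer, which is one
way such misuse could be caught.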

This has the additional benefit of catching cases that might be doing
a get_page() on a get_user_pages() result and should instead switch to
a "ref_user_page()" (opposite of put_user_page()) as the API to take
additional references on a get_user_pages() result.

page->index and page->mapping could be overridden by similar
attributes in the proxy, and allow an N:1 relationship of proxy
instances to actual pages. Filesystems could generate dynamic proxies
as well.

The auxiliary information (dev_pagemap, hmm_data, etc...) moves to the
proxy and stops polluting the base struct page which remains the
canonical location for dirty-tracking and dma operations.

The difficulty is reconciling the source of the proxies, as both
get_user_pages() and the filesystem may want to be the source of the
allocation. In the get_user_pages_fast() path we may not be able to
ask the filesystem for the proxy, at least not without destroying the
performance expectations of get_user_pages_fast().

>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..13f651bb5cc1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -67,6 +67,13 @@ struct hmm;
>  #define _struct_page_alignment
>  #endif
>
> +struct page_aux {
> +       struct dev_pagemap *pgmap;
> +       unsigned long hmm_data;
> +       unsigned long private;
> +       atomic_t dma_pinned_count;
> +};
> +
>  struct page {
>         unsigned long flags;            /* Atomic flags, some possibly
>                                          * updated asynchronously */
> @@ -149,11 +156,13 @@ struct page {
>                         spinlock_t ptl;
>  #endif
>                 };
> -               struct {        /* ZONE_DEVICE pages */
> +               struct {        /* ZONE_DEVICE, HMM or get_user_pages() pages */
>                         /** @pgmap: Points to the hosting device page map. */
> -                       struct dev_pagemap *pgmap;
> -                       unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
> +                       unsigned long _zd_pad_1;        /* LRU */
> +                       unsigned long _zd_pad_2;        /* LRU */
> +                       unsigned long _zd_pad_3;        /* mapping */
> +                       unsigned long _zd_pad_4;        /* index */
> +                       struct page_aux *aux;           /* private */
>                 };
>
>                 /** @rcu_head: You can use this to free a page by RCU. */
>
> ...is there any appetite for that approach?
>
> --
> thanks,
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 19:38                                         ` Dan Williams
@ 2018-12-14 19:48                                           ` Matthew Wilcox
  2018-12-14 19:53                                             ` Dave Hansen
  2018-12-17  8:56                                           ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-14 19:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, david, Jérôme Glisse, Jan Kara,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel, Dave Hansen

On Fri, Dec 14, 2018 at 11:38:59AM -0800, Dan Williams wrote:
> On Thu, Dec 13, 2018 at 10:11 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > I don't have an answer for that, so maybe the page->mapping idea is dead already.
> >
> > So in that case, there is still one more way to do all of this, which is to
> > combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
> > there via basically page->private, more or less like this:
> 
> If we're going to allocate something new out-of-line then maybe we
> should go even further to allow for a page "proxy" object to front a
> real struct page. This idea arose from Dave Hansen as I explained to
> him the dax-reflink problem, and dovetails with Dave Chinner's
> suggestion earlier in this thread for dax-reflink.
> 
> Have get_user_pages() allocate a proxy object that gets passed around
> to drivers. Something like a struct page pointer with bit 0 set. This
> would add a conditional branch and pointer chase to many page
> operations, like page_to_pfn(), I thought something like it would be
> unacceptable a few years ago, but then HMM went and added similar
> overhead to put_page() and nobody balked.
> 
> This has the additional benefit of catching cases that might be doing
> a get_page() on a get_user_pages() result and should instead switch to
> a "ref_user_page()" (opposite of put_user_page()) as the API to take
> additional references on a get_user_pages() result.
> 
> page->index and page->mapping could be overridden by similar
> attributes in the proxy, and allow an N:1 relationship of proxy
> instances to actual pages. Filesystems could generate dynamic proxies
> as well.
> 
> The auxiliary information (dev_pagemap, hmm_data, etc...) moves to the
> proxy and stops polluting the base struct page which remains the
> canonical location for dirty-tracking and dma operations.
> 
> The difficulties are reconciling the source of the proxies as both
> get_user_pages() and filesystem may want to be the source of the
> allocation. In the get_user_pages_fast() path we may not be able to
> ask the filesystem for the proxy, at least not without destroying the
> performance expectations of get_user_pages_fast().

I think we can do better than a proxy object with bit 0 set.  I'd go
for allocating something like this:

struct dynamic_page {
	struct page;
	unsigned long vaddr;
	unsigned long pfn;
	...
};

and use a bit in struct page to indicate that this is a dynamic page.
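
A minimal userspace sketch of that layout, for illustration only. The struct page here is a cut-down stand-in, the embedded member is given a name so that it compiles as plain C, and PG_DYNAMIC_BIT, page_is_dynamic() and to_dynamic_page() are hypothetical names rather than existing kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct page (assumption). */
struct page {
	unsigned long flags;
};

/* Hypothetical flag bit marking a page as dynamically allocated. */
#define PG_DYNAMIC_BIT	(1UL << 0)

struct dynamic_page {
	struct page page;	/* embedded, so a struct page * can be handed out */
	unsigned long vaddr;
	unsigned long pfn;
};

static bool page_is_dynamic(const struct page *page)
{
	assert(page != NULL);
	return page->flags & PG_DYNAMIC_BIT;
}

/* Recover the enclosing dynamic_page, or NULL for an ordinary page. */
static struct dynamic_page *to_dynamic_page(struct page *page)
{
	if (!page_is_dynamic(page))
		return NULL;
	return (struct dynamic_page *)((char *)page -
				       offsetof(struct dynamic_page, page));
}
```

Because the struct page is embedded rather than pointed to, callers keep receiving an ordinary, correctly aligned struct page pointer, and only code that cares about the extra fields needs to test the bit.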

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 19:48                                           ` Matthew Wilcox
@ 2018-12-14 19:53                                             ` Dave Hansen
  2018-12-14 20:03                                               ` Matthew Wilcox
  0 siblings, 1 reply; 206+ messages in thread
From: Dave Hansen @ 2018-12-14 19:53 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: John Hubbard, david, Jérôme Glisse, Jan Kara,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 12/14/18 11:48 AM, Matthew Wilcox wrote:
> I think we can do better than a proxy object with bit 0 set.  I'd go
> for allocating something like this:
> 
> struct dynamic_page {
> 	struct page;
> 	unsigned long vaddr;
> 	unsigned long pfn;
> 	...
> };
> 
> and use a bit in struct page to indicate that this is a dynamic page.

That might be fun.  We'd just need a fast/static and slow/dynamic path
in page_to_pfn()/pfn_to_page().  We'd also need some kind of auxiliary
pfn-to-page structure since we could not fit that^ structure in vmemmap[].
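
A rough userspace sketch of the two paths in the page-to-pfn direction, under stated assumptions: struct page is a cut-down stand-in, vmemmap[] is a toy array, and PG_DYNAMIC_BIT and sketch_page_to_pfn() are hypothetical names. The reverse direction (pfn to page) is not shown, since it depends on the auxiliary lookup structure mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct page (assumption). */
struct page {
	unsigned long flags;
};

/* Hypothetical flag bit marking a dynamically allocated page. */
#define PG_DYNAMIC_BIT	(1UL << 0)

struct dynamic_page {
	struct page page;
	unsigned long vaddr;
	unsigned long pfn;
};

/* Toy "vmemmap": static pages, indexed by pfn. */
#define NR_STATIC_PAGES	16
static struct page vmemmap[NR_STATIC_PAGES];

static unsigned long sketch_page_to_pfn(const struct page *page)
{
	assert(page != NULL);

	if (page->flags & PG_DYNAMIC_BIT) {
		/* slow/dynamic path: the pfn is stored next to the page */
		const struct dynamic_page *dp = (const struct dynamic_page *)
			((const char *)page - offsetof(struct dynamic_page, page));
		return dp->pfn;
	}

	/* fast/static path: plain pointer arithmetic on the array */
	return (unsigned long)(page - vmemmap);
}
```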

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 19:53                                             ` Dave Hansen
@ 2018-12-14 20:03                                               ` Matthew Wilcox
  2018-12-14 20:17                                                 ` Dan Williams
  2018-12-15  0:41                                                 ` John Hubbard
  0 siblings, 2 replies; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-14 20:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams, John Hubbard, david, Jérôme Glisse,
	Jan Kara, John Hubbard, Andrew Morton, Linux MM, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Fri, Dec 14, 2018 at 11:53:31AM -0800, Dave Hansen wrote:
> On 12/14/18 11:48 AM, Matthew Wilcox wrote:
> > I think we can do better than a proxy object with bit 0 set.  I'd go
> > for allocating something like this:
> > 
> > struct dynamic_page {
> > 	struct page;
> > 	unsigned long vaddr;
> > 	unsigned long pfn;
> > 	...
> > };
> > 
> > and use a bit in struct page to indicate that this is a dynamic page.
> 
> That might be fun.  We'd just need a fast/static and slow/dynamic path
> in page_to_pfn()/pfn_to_page().  We'd also need some kind of auxiliary
> pfn-to-page structure since we could not fit that^ structure in vmemmap[].

Yes; working on the pfn-to-page structure right now as it happens ...
in the meantime, an XArray for it probably wouldn't be _too_ bad.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 20:03                                               ` Matthew Wilcox
@ 2018-12-14 20:17                                                 ` Dan Williams
  2018-12-14 20:29                                                   ` Matthew Wilcox
  2018-12-15  0:41                                                 ` John Hubbard
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-14 20:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Hansen, John Hubbard, david, Jérôme Glisse,
	Jan Kara, John Hubbard, Andrew Morton, Linux MM, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Fri, Dec 14, 2018 at 12:03 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Dec 14, 2018 at 11:53:31AM -0800, Dave Hansen wrote:
> > On 12/14/18 11:48 AM, Matthew Wilcox wrote:
> > > I think we can do better than a proxy object with bit 0 set.  I'd go
> > > for allocating something like this:
> > >
> > > struct dynamic_page {
> > >     struct page;
> > >     unsigned long vaddr;
> > >     unsigned long pfn;
> > >     ...
> > > };
> > >
> > > and use a bit in struct page to indicate that this is a dynamic page.
> >
> > That might be fun.  We'd just need a fast/static and slow/dynamic path
> > in page_to_pfn()/pfn_to_page().  We'd also need some kind of auxiliary
> > pfn-to-page structure since we could not fit that^ structure in vmemmap[].
>
> Yes; working on the pfn-to-page structure right now as it happens ...
> in the meantime, an XArray for it probably wouldn't be _too_ bad.

It might... see the recent patch from Keith responding to complaints
about get_dev_pagemap() lookup overhead:

    df06b37ffe5a mm/gup: cache dev_pagemap while pinning pages

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 20:17                                                 ` Dan Williams
@ 2018-12-14 20:29                                                   ` Matthew Wilcox
  0 siblings, 0 replies; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-14 20:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, John Hubbard, david, Jérôme Glisse,
	Jan Kara, John Hubbard, Andrew Morton, Linux MM, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Fri, Dec 14, 2018 at 12:17:08PM -0800, Dan Williams wrote:
> On Fri, Dec 14, 2018 at 12:03 PM Matthew Wilcox <willy@infradead.org> wrote:
> > Yes; working on the pfn-to-page structure right now as it happens ...
> > in the meantime, an XArray for it probably wouldn't be _too_ bad.
> 
> It might... see the recent patch from Keith responding to complaints
> about get_dev_pagemap() lookup overhead:

Yeah, I saw.  I called xa_dump() on the pgmap_array running under
QEMU and it's truly awful because the NVDIMMs presented by QEMU are
very misaligned.  If we can make the NVDIMMs better aligned, we won't
hit such a bad case in the XArray data structure.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 20:03                                               ` Matthew Wilcox
  2018-12-14 20:17                                                 ` Dan Williams
@ 2018-12-15  0:41                                                 ` John Hubbard
  1 sibling, 0 replies; 206+ messages in thread
From: John Hubbard @ 2018-12-15  0:41 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Hansen
  Cc: Dan Williams, david, Jérôme Glisse, Jan Kara,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 12/14/18 12:03 PM, Matthew Wilcox wrote:
> On Fri, Dec 14, 2018 at 11:53:31AM -0800, Dave Hansen wrote:
>> On 12/14/18 11:48 AM, Matthew Wilcox wrote:
>>> I think we can do better than a proxy object with bit 0 set.  I'd go
>>> for allocating something like this:
>>>
>>> struct dynamic_page {
>>> 	struct page;
>>> 	unsigned long vaddr;
>>> 	unsigned long pfn;
>>> 	...
>>> };
>>>
>>> and use a bit in struct page to indicate that this is a dynamic page.
>>
>> That might be fun.  We'd just need a fast/static and slow/dynamic path
>> in page_to_pfn()/pfn_to_page().  We'd also need some kind of auxiliary
>> pfn-to-page structure since we could not fit that^ structure in vmemmap[].
> 
> Yes; working on the pfn-to-page structure right now as it happens ...
> in the meantime, an XArray for it probably wouldn't be _too_ bad.
> 

OK, this looks great. And as Dan pointed out, we get a nice side effect of
type safety for the gup/dma call site conversion. After doing partial
conversions, type safety really seems worth the extra work (some of the
callers really are complex), so that's a big benefit.

Next steps: I want to go try this dynamic_page approach out right away.
If there are pieces, such as page_to_pfn and related, that are already in
progress, I'd definitely like to work on top of that. Also, any up-front
advice, or pitfalls to avoid, are always welcome, of course. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 15:43                               ` Jan Kara
@ 2018-12-16 21:58                                 ` Dave Chinner
  2018-12-17 18:11                                   ` Jerome Glisse
  2018-12-18 10:33                                   ` Jan Kara
  0 siblings, 2 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-16 21:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jerome Glisse, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> Hi!
> 
> On Thu 13-12-18 08:46:41, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > So this approach doesn't look like a win to me over using counter in struct
> > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > struct page so that we can fit that gup counter there as well. I know that
> > > > it may be easier said than done...
> > > 
> > > So i want back to the drawing board and first i would like to ascertain
> > > that we all agree on what the objectives are:
> > > 
> > >     [O1] Avoid write back from a page still being written by either a
> > >          device or some direct I/O or any other existing user of GUP.
> > >          This would avoid possible file system corruption.
> > > 
> > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > >          considered clean by core mm (buffer head have been remove and
> > >          with some file system this turns into an ugly mess).
> > 
> > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > "prevent data and/or filesystem corruption" case. The primary goal
> > we have here is removing our exposure to potential corruption, which
> > has the secondary effect of avoiding the crash/panics that currently
> > occur as a result of inconsistent page/filesystem state.
> > 
> > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > /before/ the file-backed page is marked dirty, and hence we don't
> > expose ourselves to potential corruption or crashes that are a
> > result of inappropriately calling set_page_dirty() on clean
> > file-backed pages.
> 
> I agree that [O1] - i.e., avoid corrupting fs data - is more important and
> [O2] is just one consequence of [O1].
> 
> > > For [O1] and [O2] i believe a solution with mapcount would work. So
> > > no new struct, no fake vma, nothing like that. In GUP for file back
> > > pages we increment both refcount and mapcount (we also need a special
> > > put_user_page to decrement mapcount when GUP user are done with the
> > > page).
> > 
> > I don't see how a mapcount can prevent anyone from calling
> > set_page_dirty() inappropriately.
> > 
> > > Now for [O1] the write back have to call page_mkclean() to go through
> > > all reverse mapping of the page and map read only. This means that
> > > we can count the number of real mapping and see if the mapcount is
> > > bigger than that. If mapcount is bigger than page is pin and we need
> > > to use a bounce page to do the writeback.
> > 
> > Doesn't work. Generally filesystems have already mapped the page
> > into bios before they call clear_page_dirty_for_io(), so it's too
> > late for the filesystem to bounce the page at that point.
> 
> Yes, for filesystem it is too late. But the plan we figured back in October
> was to do the bouncing in the block layer. I.e., mark the bio (or just the
> particular page) as needing bouncing and then use the existing page
> bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> it was still a separate fs driver) has been using a mechanism like this to
> make DIF/DIX work with its metadata.

Sure, that's a possibility, but that doesn't close off any race
conditions because there can be DMA into the page in progress while
the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
different in that there is no 3rd-party access to the page while it
is under IO (ext3 arbitrates all access to its metadata), and so
nothing can actually race for modification of the page between
submission and bouncing at the block layer.

In this case, the moment the page is unlocked, anyone else can map
it and start (R)DMA on it, and that can happen before the bio is
bounced by the block layer. So AFAICT, block layer bouncing doesn't
solve the problem of racing writeback and DMA direct to the page we
are doing IO on. Yes, it reduces the race window substantially, but
it doesn't get rid of it.

/me points to wait_for_stable_page() in ->page_mkwrite as the
mechanism we already have to avoid races between dirtying mapped
pages and page writeback....

> > > For [O2] i believe we can handle that case in the put_user_page()
> > > function to properly dirty the page without causing filesystem
> > > freak out.
> > 
> > I'm pretty sure you can't call ->page_mkwrite() from
> > put_user_page(), so I don't think this is workable at all.
> 
> Yes, calling ->page_mkwrite() in put_user_page() is not only technically
> complicated but also too late - DMA has already modified page contents.
> What we planned to do (again discussed back in October) was to never allow
> the pinned page to become clean. I.e., clear_page_dirty_for_io() would
> leave pinned pages dirty. Also we would skip pinned pages for WB_SYNC_NONE
> writeback as there's no point in that really. That way MM and filesystems
> would be aware of the real page state - i.e., what's in memory is not in
> sync (potentially) with what's on disk. I was thinking whether this
> permanently-dirty state couldn't confuse filesystem in some way but I
> didn't find anything serious - the worst I could think of are places that
> do filemap_write_and_wait() and then invalidate page cache e.g. before hole
> punching or extent shifting.

If it's permanently dirty, how do we trigger new COW operations
after writeback has "cleaned" the page? i.e. we still need a
->page_mkwrite call to run before we allow the next write to the
page to be done, regardless of whether the page is "permanently
dirty" or not....

> But these should work fine as is (page cache
> invalidation will just happily truncate dirty pages). DIO might get
> confused by the inability to invalidate dirty pages but then user combining
> RDMA with DIO on the same file at one moment gets what he deserves...

I'm almost certain this is something that will occur. i.e.
permanently mapped RDMA file, filesystem backup program uses DIO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-14 19:38                                         ` Dan Williams
  2018-12-14 19:48                                           ` Matthew Wilcox
@ 2018-12-17  8:56                                           ` Jan Kara
  2018-12-17 18:28                                             ` Dan Williams
  1 sibling, 1 reply; 206+ messages in thread
From: Jan Kara @ 2018-12-17  8:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, david, Jérôme Glisse, Jan Kara,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel, Dave Hansen

On Fri 14-12-18 11:38:59, Dan Williams wrote:
> On Thu, Dec 13, 2018 at 10:11 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 12/13/18 9:21 PM, Dan Williams wrote:
> > > On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > >>
> > >> On 12/12/18 4:51 PM, Dave Chinner wrote:
> > >>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > >>>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > >>>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > >>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > >>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > >>>>>>> So this approach doesn't look like a win to me over using counter in struct
> > >>>>>>> page and I'd rather try looking into squeezing HMM public page usage of
> > >>>>>>> struct page so that we can fit that gup counter there as well. I know that
> > >>>>>>> it may be easier said than done...
> > >>>>>>
> > >>
> > >> Agreed. After all the discussion this week, I'm thinking that the original idea
> > >> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
> > >> of that, unless I'm overlooking something: Jerome had another proposal that he
> > >> described, off-list, for doing that counting, and his idea avoids the problem of
> > >> finding space in struct page. (And in fact, when I responded yesterday, I initially
> > >> thought that's where he was going with this.)
> > >>
> > >> So how about this hybrid solution:
> > >>
> > >> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> > >> store the counter(s) in the mappings instead of the struct page. We can use
> > >> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
> > >> there. So the total pinned count is scattered across mappings. Probably still need
> > >> a PageDmaPinned bit.
> > >
> > > How do you safely look at page->mapping from the get_user_pages_fast()
> > > path? You'll be racing invalidation disconnecting the page from the
> > > mapping.
> > >
> >
> > I don't have an answer for that, so maybe the page->mapping idea is dead already.
> >
> > So in that case, there is still one more way to do all of this, which is to
> > combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
> > there via basically page->private, more or less like this:
> 
> If we're going to allocate something new out-of-line then maybe we
> should go even further to allow for a page "proxy" object to front a
> real struct page. This idea arose from Dave Hansen as I explained to
> him the dax-reflink problem, and dovetails with Dave Chinner's
> suggestion earlier in this thread for dax-reflink.
> 
> Have get_user_pages() allocate a proxy object that gets passed around
> to drivers. Something like a struct page pointer with bit 0 set. This
> would add a conditional branch and pointer chase to many page
> operations, like page_to_pfn(), I thought something like it would be
> unacceptable a few years ago, but then HMM went and added similar
> overhead to put_page() and nobody balked.
> 
> This has the additional benefit of catching cases that might be doing
> a get_page() on a get_user_pages() result and should instead switch to
> a "ref_user_page()" (opposite of put_user_page()) as the API to take
> additional references on a get_user_pages() result.
> 
> page->index and page->mapping could be overridden by similar
> attributes in the proxy, and allow an N:1 relationship of proxy
> instances to actual pages. Filesystems could generate dynamic proxies
> as well.
> 
> The auxiliary information (dev_pagemap, hmm_data, etc...) moves to the
> proxy and stops polluting the base struct page which remains the
> canonical location for dirty-tracking and dma operations.
> 
> The difficulties are reconciling the source of the proxies as both
> get_user_pages() and filesystem may want to be the source of the
> allocation. In the get_user_pages_fast() path we may not be able to
> ask the filesystem for the proxy, at least not without destroying the
> performance expectations of get_user_pages_fast().

What you describe here sounds almost like the page_ext mechanism we already
have? Or do you really aim at a per-pin allocated structure?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-16 21:58                                 ` Dave Chinner
@ 2018-12-17 18:11                                   ` Jerome Glisse
  2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-18 10:33                                   ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-17 18:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> > Hi!
> > 
> > On Thu 13-12-18 08:46:41, Dave Chinner wrote:
> > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > So this approach doesn't look like a win to me over using counter in struct
> > > > > page and I'd rather try looking into squeezing HMM public page usage of
> > > > > struct page so that we can fit that gup counter there as well. I know that
> > > > > it may be easier said than done...
> > > > 
> > > > So i want back to the drawing board and first i would like to ascertain
> > > > that we all agree on what the objectives are:
> > > > 
> > > >     [O1] Avoid write back from a page still being written by either a
> > > >          device or some direct I/O or any other existing user of GUP.
> > > >          This would avoid possible file system corruption.
> > > > 
> > > >     [O2] Avoid crash when set_page_dirty() is call on a page that is
> > > >          considered clean by core mm (buffer head have been remove and
> > > >          with some file system this turns into an ugly mess).
> > > 
> > > I think that's wrong. This isn't an "avoid a crash" case, this is a
> > > "prevent data and/or filesystem corruption" case. The primary goal
> > > we have here is removing our exposure to potential corruption, which
> > > has the secondary effect of avoiding the crash/panics that currently
> > > occur as a result of inconsistent page/filesystem state.
> > > 
> > > i.e. The goal is to have ->page_mkwrite() called on the clean page
> > > /before/ the file-backed page is marked dirty, and hence we don't
> > > expose ourselves to potential corruption or crashes that are a
> > > result of inappropriately calling set_page_dirty() on clean
> > > file-backed pages.
> > 
> > I agree that [O1] - i.e., avoid corrupting fs data - is more important and
> > [O2] is just one consequence of [O1].
> > 
> > > > For [O1] and [O2] i believe a solution with mapcount would work. So
> > > > no new struct, no fake vma, nothing like that. In GUP for file back
> > > > pages we increment both refcount and mapcount (we also need a special
> > > > put_user_page to decrement mapcount when GUP user are done with the
> > > > page).
> > > 
> > > I don't see how a mapcount can prevent anyone from calling
> > > set_page_dirty() inappropriately.
> > > 
> > > > Now for [O1] the write back have to call page_mkclean() to go through
> > > > all reverse mapping of the page and map read only. This means that
> > > > we can count the number of real mapping and see if the mapcount is
> > > > bigger than that. If mapcount is bigger than page is pin and we need
> > > > to use a bounce page to do the writeback.
> > > 
> > > Doesn't work. Generally filesystems have already mapped the page
> > > into bios before they call clear_page_dirty_for_io(), so it's too
> > > late for the filesystem to bounce the page at that point.
> > 
> > Yes, for filesystem it is too late. But the plan we figured back in October
> > was to do the bouncing in the block layer. I.e., mark the bio (or just the
> > particular page) as needing bouncing and then use the existing page
> > bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> > it was still a separate fs driver) has been using a mechanism like this to
> > make DIF/DIX work with its metadata.
> 
> Sure, that's a possibility, but that doesn't close off any race
> conditions because there can be DMA into the page in progress while
> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> different in that there is no 3rd-party access to the page while it
> is under IO (ext3 arbitrates all access to its metadata), and so
> nothing can actually race for modification of the page between
> submission and bouncing at the block layer.
> 
> In this case, the moment the page is unlocked, anyone else can map
> it and start (R)DMA on it, and that can happen before the bio is
> bounced by the block layer. So AFAICT, block layer bouncing doesn't
> solve the problem of racing writeback and DMA direct to the page we
> are doing IO on. Yes, it reduces the race window substantially, but
> it doesn't get rid of it.

So the event flow is:
    - userspace creates an object that matches a range of virtual
      addresses against a given kernel sub-system (let's say infiniband),
      and let's assume that the range is an mmap() of a regular file
    - the device driver does GUP on the range (let's assume it is a write
      GUP), so if the page is not already mapped with write permission
      in the page table then a page fault is triggered and page_mkwrite
      happens
    - once GUP returns the page to the device driver and once the
      device driver has updated the hardware state to allow access
      to this page, then from that point on the hardware can write to the
      page at _any_ time; it is fully disconnected from any fs event
      like write back, and it fully ignores things like page_mkclean

This is how it is today; we allowed people to push such users of GUP
upstream. This is a fact we have to live with: we can not stop
hardware access to the page, and we can not force the hardware to follow
page_mkclean and force a page_mkwrite once write back ends. This is
the situation we are inheriting (and i am personally not happy with
that).

From my point of view we are left with 2 choices:
    [C1] break all drivers that do not abide by the page_mkclean and
         page_mkwrite
    [C2] mitigate the issue as much as possible

For [C2] the idea is to keep track of GUP per page so we know if we
can expect the page to be written to at any time. Here is the event
flow:
    - driver GUP the page and program the hardware, page is mark as
      GUPed
    ...
    - write back kicks in on the dirty page, lock the page and every
      thing as usual , sees it is GUPed and inform the block layer to
      use a bounce page
    - block layer copy the page to a bounce page effectively creating
      a snapshot of what is the content of the real page. This allows
      everything in block layer that need stable content to work on
      the bounce page (raid, stripping, encryption, ...)
    - once write back is done the page is not marked clean but stays
      dirty, this effectively disable things like COW for filesystem
      and other feature that expect page_mkwrite between write back.
      AFAIK it is believe that it is something acceptable

The whole write back sequence will repeat until all GUP users call
put_user_page. Once the page is no longer pinned by any GUP, things
resume the normal flow, i.e. the next write back will mark the page
clean and it will force a page_mkwrite on the next write fault.
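
To make the [C2] flow concrete, here is a minimal user-space sketch of
it. It is not kernel code and not a proposed patch: struct sketch_page,
sketch_gup_pin(), sketch_put_user_page() and sketch_writeback() are all
hypothetical names invented for illustration, and page locking,
page_mkwrite and the real block layer are left out. It only shows the
rule "pinned page: write back through a bounce copy and leave the page
dirty; unpinned page: write back and mark it clean".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	int gup_pins;	/* number of outstanding GUP references */
	bool dirty;
};

/* Write GUP: from now on hardware may write to the page at any time. */
static void sketch_gup_pin(struct sketch_page *p)
{
	p->gup_pins++;
	p->dirty = true;
}

/* Drop one GUP reference taken by sketch_gup_pin(). */
static void sketch_put_user_page(struct sketch_page *p)
{
	assert(p->gup_pins > 0);
	p->gup_pins--;
}

/*
 * Write the page back to "disk". Returns true if a bounce page was
 * used, i.e. the page was pinned and therefore stays dirty.
 */
static bool sketch_writeback(struct sketch_page *p, unsigned char *disk)
{
	if (p->gup_pins > 0) {
		unsigned char bounce[SKETCH_PAGE_SIZE];

		/* Snapshot: layers below only ever see stable content. */
		memcpy(bounce, p->data, SKETCH_PAGE_SIZE);
		memcpy(disk, bounce, SKETCH_PAGE_SIZE);
		/* The page is deliberately not marked clean. */
		return true;
	}

	/* Normal flow: write back directly and mark the page clean. */
	memcpy(disk, p->data, SKETCH_PAGE_SIZE);
	p->dirty = false;
	return false;
}
```

In this sketch every write back of a pinned page repeats the bounce
copy, and only the first write back after the last
sketch_put_user_page() marks the page clean again.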



> 
> /me points to wait_for_stable_page() in ->page_mkwrite as the
> mechanism we already have to avoid races between dirtying mapped
> pages and page writeback....

Sadly some devices cannot abide by that rule. They over-interpreted
what GUP means and what its guarantees are.


> > > > For [O2] i believe we can handle that case in the put_user_page()
> > > > function to properly dirty the page without causing filesystem
> > > > freak out.
> > > 
> > > I'm pretty sure you can't call ->page_mkwrite() from
> > > put_user_page(), so I don't think this is workable at all.
> > 
> > Yes, calling ->page_mkwrite() in put_user_page() is not only technically
> > complicated but also too late - DMA has already modified page contents.
> > What we planned to do (again discussed back in October) was to never allow
> > the pinned page to become clean. I.e., clear_page_dirty_for_io() would
> > leave pinned pages dirty. Also we would skip pinned pages for WB_SYNC_NONE
> > writeback as there's no point in that really. That way MM and filesystems
> > would be aware of the real page state - i.e., what's in memory is not in
> > sync (potentially) with what's on disk. I was thinking whether this
> > permanently-dirty state couldn't confuse filesystem in some way but I
> > didn't find anything serious - the worst I could think of are places that
> > do filemap_write_and_wait() and then invalidate page cache e.g. before hole
> > punching or extent shifting.
> 
> If it's permanently dirty, how do we trigger new COW operations
> after writeback has "cleaned" the page? i.e. we still need a
> ->page_mkwrite call to run before we allow the next write to the
> page to be done, regardless of whether the page is "permanently
> dirty" or not....

For as long as there are GUP references on the page we are effectively
disabling page cleaning, i.e. we still write back content so that data
loss is still unlikely, but the page stays dirty and this disables
anything that relies on page_mkwrite, including COW.

From the fs point of view it is as if the page is frozen as dirty and
under write for an undefined period of time. We only mitigate the data
loss by allowing write back to happen using a bounce page (so that
layers down the stack get a page with stable content, i.e. a snapshot).

In other words we cannot block the device from writing to the page;
some hardware is just not capable of that and sadly we allowed GUP
to be used for it.


> > But these should work fine as is (page cache
> > invalidation will just happily truncate dirty pages). DIO might get
> > confused by the inability to invalidate dirty pages but then user combining
> > RDMA with DIO on the same file at one moment gets what he deserves...
> 
> I'm almost certain this will do something that will occur. i.e.
> permanently mapped RDMA file, filesystem backup program uses DIO....

DIO being software, I believe it can be told to understand this special
case.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17  8:56                                           ` Jan Kara
@ 2018-12-17 18:28                                             ` Dan Williams
  0 siblings, 0 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-17 18:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, david, Jérôme Glisse, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel, Dave Hansen

On Mon, Dec 17, 2018 at 12:57 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 14-12-18 11:38:59, Dan Williams wrote:
> > On Thu, Dec 13, 2018 at 10:11 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 12/13/18 9:21 PM, Dan Williams wrote:
> > > > On Thu, Dec 13, 2018 at 7:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >>
> > > >> On 12/12/18 4:51 PM, Dave Chinner wrote:
> > > >>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > > >>>> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > >>>>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > >>>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > >>>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > >>>>>>> So this approach doesn't look like a win to me over using counter in struct
> > > >>>>>>> page and I'd rather try looking into squeezing HMM public page usage of
> > > >>>>>>> struct page so that we can fit that gup counter there as well. I know that
> > > >>>>>>> it may be easier said than done...
> > > >>>>>>
> > > >>
> > > >> Agreed. After all the discussion this week, I'm thinking that the original idea
> > > >> of a per-struct-page counter is better. Fortunately, we can do the moral equivalent
> > > >> of that, unless I'm overlooking something: Jerome had another proposal that he
> > > >> described, off-list, for doing that counting, and his idea avoids the problem of
> > > >> finding space in struct page. (And in fact, when I responded yesterday, I initially
> > > >> thought that's where he was going with this.)
> > > >>
> > > >> So how about this hybrid solution:
> > > >>
> > > >> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> > > >> store the counter(s) in the mappings instead of the struct page. We can use
> > > >> !PageAnon and page_mapping to look up all the mappings, stash the dma_pinned_count
> > > >> there. So the total pinned count is scattered across mappings. Probably still need
> > > >> a PageDmaPinned bit.
> > > >
> > > > How do you safely look at page->mapping from the get_user_pages_fast()
> > > > path? You'll be racing invalidation disconnecting the page from the
> > > > mapping.
> > > >
> > >
> > > I don't have an answer for that, so maybe the page->mapping idea is dead already.
> > >
> > > So in that case, there is still one more way to do all of this, which is to
> > > combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
> > > there via basically page->private, more or less like this:
> >
> > If we're going to allocate something new out-of-line then maybe we
> > should go even further to allow for a page "proxy" object to front a
> > real struct page. This idea arose from Dave Hansen as I explained to
> > him the dax-reflink problem, and dovetails with Dave Chinner's
> > suggestion earlier in this thread for dax-reflink.
> >
> > Have get_user_pages() allocate a proxy object that gets passed around
> > to drivers. Something like a struct page pointer with bit 0 set. This
> > would add a conditional branch and pointer chase to many page
> > operations, like page_to_pfn(), I thought something like it would be
> > unacceptable a few years ago, but then HMM went and added similar
> > overhead to put_page() and nobody balked.
> >
> > This has the additional benefit of catching cases that might be doing
> > a get_page() on a get_user_pages() result and should instead switch to
> > a "ref_user_page()" (opposite of put_user_page()) as the API to take
> > additional references on a get_user_pages() result.
> >
> > page->index and page->mapping could be overridden by similar
> > attributes in the proxy, and allow an N:1 relationship of proxy
> > instances to actual pages. Filesystems could generate dynamic proxies
> > as well.
> >
> > The auxiliary information (dev_pagemap, hmm_data, etc...) moves to the
> > proxy and stops polluting the base struct page which remains the
> > canonical location for dirty-tracking and dma operations.
> >
> > The difficulties are reconciling the source of the proxies as both
> > get_user_pages() and filesystem may want to be the source of the
> > allocation. In the get_user_pages_fast() path we may not be able to
> > ask the filesystem for the proxy, at least not without destroying the
> > performance expectations of get_user_pages_fast().
>
> What you describe here sounds almost like page_ext mechanism we already
> have? Or do you really aim at per-pin allocated structure?

Per-pin or dynamically allocated by the filesystem. The existing
page_ext seems to suffer from the expectation that a page_ext exists
for all pfns. The 'struct page' per pfn requirement is already painful
as memory capacities grow into the terabytes, page_ext seems to just
make that worse.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:11                                   ` Jerome Glisse
@ 2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-17 19:48                                       ` Jerome Glisse
                                                         ` (3 more replies)
  0 siblings, 4 replies; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-17 18:34 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > Sure, that's a possibility, but that doesn't close off any race
> > conditions because there can be DMA into the page in progress while
> > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > different in that there is no 3rd-party access to the page while it
> > is under IO (ext3 arbitrates all access to it's metadata), and so
> > nothing can actually race for modification of the page between
> > submission and bouncing at the block layer.
> > 
> > In this case, the moment the page is unlocked, anyone else can map
> > it and start (R)DMA on it, and that can happen before the bio is
> > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > solve the problem of racing writeback and DMA direct to the page we
> > are doing IO on. Yes, it reduces the race window substantially, but
> > it doesn't get rid of it.
> 
> So the event flow is:
>     - userspace create object that match a range of virtual address
>       against a given kernel sub-system (let's say infiniband) and
>       let's assume that the range is an mmap() of a regular file
>     - device driver do GUP on the range (let's assume it is a write
>       GUP) so if the page is not already map with write permission
>       in the page table than a page fault is trigger and page_mkwrite
>       happens
>     - Once GUP return the page to the device driver and once the
>       device driver as updated the hardware states to allow access
>       to this page then from that point on hardware can write to the
>       page at _any_ time, it is fully disconnected from any fs event
>       like write back, it fully ignore things like page_mkclean
> 
> This is how it is to day, we allowed people to push upstream such
> users of GUP. This is a fact we have to live with, we can not stop
> hardware access to the page, we can not force the hardware to follow
> page_mkclean and force a page_mkwrite once write back ends. This is
> the situation we are inheriting (and i am personnaly not happy with
> that).
> 
> From my point of view we are left with 2 choices:
>     [C1] break all drivers that do not abide by the page_mkclean and
>          page_mkwrite
>     [C2] mitigate as much as possible the issue
> 
> For [C2] the idea is to keep track of GUP per page so we know if we
> can expect the page to be written to at any time. Here is the event
> flow:
>     - driver GUP the page and program the hardware, page is mark as
>       GUPed
>     ...
>     - write back kicks in on the dirty page, lock the page and every
>       thing as usual , sees it is GUPed and inform the block layer to
>       use a bounce page

No.  The solution John, Dan & I have been looking at is to take the
dirty page off the LRU while it is pinned by GUP.  It will never be
found for writeback.

That's not the end of the story though.  Other parts of the kernel (eg
msync) also need to be taught to stay away from pages which are pinned
by GUP.  But the idea is that no page gets written back to storage while
it's pinned by GUP.  Only when the last GUP ends is the page returned
to the list of dirty pages.
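
As a rough illustration of that idea only: the user-space sketch below
uses hypothetical names (struct lru_page, lru_gup_pin(),
lru_put_user_page(), lru_writeback_scan()) and is not the actual
implementation. It assumes a single flag standing in for LRU
membership and ignores msync and every other path that would also
have to skip pinned pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct lru_page {
	int gup_pins;	/* number of outstanding GUP references */
	bool dirty;
	bool on_lru;	/* can write back find this page? */
};

/* The first pin takes the page off the LRU; a write GUP dirties it. */
static void lru_gup_pin(struct lru_page *p)
{
	if (p->gup_pins++ == 0)
		p->on_lru = false;
	p->dirty = true;
}

/* Only when the last pin goes away does the page return to the LRU. */
static void lru_put_user_page(struct lru_page *p)
{
	assert(p->gup_pins > 0);
	if (--p->gup_pins == 0)
		p->on_lru = true;
}

/*
 * Write back every dirty page that is on the LRU and return how many
 * were written. Pinned pages are off the LRU, so they are never found.
 */
static size_t lru_writeback_scan(struct lru_page *pages, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].on_lru || !pages[i].dirty)
			continue;
		pages[i].dirty = false;
		written++;
	}
	return written;
}
```

In this sketch a page pinned twice stays invisible to write back after
the first lru_put_user_page() and is written back only after the
second.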

>     - block layer copy the page to a bounce page effectively creating
>       a snapshot of what is the content of the real page. This allows
>       everything in block layer that need stable content to work on
>       the bounce page (raid, stripping, encryption, ...)
>     - once write back is done the page is not marked clean but stays
>       dirty, this effectively disable things like COW for filesystem
>       and other feature that expect page_mkwrite between write back.
>       AFAIK it is believe that it is something acceptable

So none of this is necessary.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
@ 2018-12-17 19:48                                       ` Jerome Glisse
  2018-12-17 19:51                                         ` Matthew Wilcox
  2018-12-18  1:09                                       ` Dave Chinner
                                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-17 19:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
> 
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.
> 
> >     - block layer copy the page to a bounce page effectively creating
> >       a snapshot of what is the content of the real page. This allows
> >       everything in block layer that need stable content to work on
> >       the bounce page (raid, stripping, encryption, ...)
> >     - once write back is done the page is not marked clean but stays
> >       dirty, this effectively disable things like COW for filesystem
> >       and other feature that expect page_mkwrite between write back.
> >       AFAIK it is believe that it is something acceptable
> 
> So none of this is necessary.

With the solution you are proposing we lose GUP fast and we have to
allocate a structure for each page that is under GUP, and the LRU
changes too. Moreover, by not writing back there is a greater chance
of data loss.

I will do patches with the mapcount solution, which requires very
little change, and people will be able to choose which solution they
prefer.

Personally I prefer the mapcount solution as it is less invasive.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:48                                       ` Jerome Glisse
@ 2018-12-17 19:51                                         ` Matthew Wilcox
  2018-12-17 19:54                                           ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-17 19:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > > Sure, that's a possibility, but that doesn't close off any race
> > > > conditions because there can be DMA into the page in progress while
> > > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > > different in that there is no 3rd-party access to the page while it
> > > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > > nothing can actually race for modification of the page between
> > > > submission and bouncing at the block layer.
> > > > 
> > > > In this case, the moment the page is unlocked, anyone else can map
> > > > it and start (R)DMA on it, and that can happen before the bio is
> > > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > > solve the problem of racing writeback and DMA direct to the page we
> > > > are doing IO on. Yes, it reduces the race window substantially, but
> > > > it doesn't get rid of it.
> > > 
> > > So the event flow is:
> > >     - userspace create object that match a range of virtual address
> > >       against a given kernel sub-system (let's say infiniband) and
> > >       let's assume that the range is an mmap() of a regular file
> > >     - device driver do GUP on the range (let's assume it is a write
> > >       GUP) so if the page is not already map with write permission
> > >       in the page table than a page fault is trigger and page_mkwrite
> > >       happens
> > >     - Once GUP return the page to the device driver and once the
> > >       device driver as updated the hardware states to allow access
> > >       to this page then from that point on hardware can write to the
> > >       page at _any_ time, it is fully disconnected from any fs event
> > >       like write back, it fully ignore things like page_mkclean
> > > 
> > > This is how it is to day, we allowed people to push upstream such
> > > users of GUP. This is a fact we have to live with, we can not stop
> > > hardware access to the page, we can not force the hardware to follow
> > > page_mkclean and force a page_mkwrite once write back ends. This is
> > > the situation we are inheriting (and i am personnaly not happy with
> > > that).
> > > 
> > > From my point of view we are left with 2 choices:
> > >     [C1] break all drivers that do not abide by the page_mkclean and
> > >          page_mkwrite
> > >     [C2] mitigate as much as possible the issue
> > > 
> > > For [C2] the idea is to keep track of GUP per page so we know if we
> > > can expect the page to be written to at any time. Here is the event
> > > flow:
> > >     - driver GUP the page and program the hardware, page is mark as
> > >       GUPed
> > >     ...
> > >     - write back kicks in on the dirty page, lock the page and every
> > >       thing as usual , sees it is GUPed and inform the block layer to
> > >       use a bounce page
> > 
> > No.  The solution John, Dan & I have been looking at is to take the
> > dirty page off the LRU while it is pinned by GUP.  It will never be
> > found for writeback.
> > 
> > That's not the end of the story though.  Other parts of the kernel (eg
> > msync) also need to be taught to stay away from pages which are pinned
> > by GUP.  But the idea is that no page gets written back to storage while
> > it's pinned by GUP.  Only when the last GUP ends is the page returned
> > to the list of dirty pages.
> > 
> > >     - block layer copy the page to a bounce page effectively creating
> > >       a snapshot of what is the content of the real page. This allows
> > >       everything in block layer that need stable content to work on
> > >       the bounce page (raid, stripping, encryption, ...)
> > >     - once write back is done the page is not marked clean but stays
> > >       dirty, this effectively disable things like COW for filesystem
> > >       and other feature that expect page_mkwrite between write back.
> > >       AFAIK it is believe that it is something acceptable
> > 
> > So none of this is necessary.
> 
> With the solution you are proposing we loose GUP fast and we have to
> allocate a structure for each page that is under GUP, and the LRU
> changes too. Moreover by not writing back there is a greater chance
> of data loss.

Why can't you store the hmm_data in a side data structure?  Why does it
have to be in struct page?

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:51                                         ` Matthew Wilcox
@ 2018-12-17 19:54                                           ` Jerome Glisse
  2018-12-17 19:59                                             ` Matthew Wilcox
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-17 19:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > > > Sure, that's a possibility, but that doesn't close off any race
> > > > > conditions because there can be DMA into the page in progress while
> > > > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > > > different in that there is no 3rd-party access to the page while it
> > > > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > > > nothing can actually race for modification of the page between
> > > > > submission and bouncing at the block layer.
> > > > > 
> > > > > In this case, the moment the page is unlocked, anyone else can map
> > > > > it and start (R)DMA on it, and that can happen before the bio is
> > > > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > > > solve the problem of racing writeback and DMA direct to the page we
> > > > > are doing IO on. Yes, it reduces the race window substantially, but
> > > > > it doesn't get rid of it.
> > > > 
> > > > So the event flow is:
> > > >     - userspace create object that match a range of virtual address
> > > >       against a given kernel sub-system (let's say infiniband) and
> > > >       let's assume that the range is an mmap() of a regular file
> > > >     - device driver do GUP on the range (let's assume it is a write
> > > >       GUP) so if the page is not already map with write permission
> > > >       in the page table than a page fault is trigger and page_mkwrite
> > > >       happens
> > > >     - Once GUP return the page to the device driver and once the
> > > >       device driver as updated the hardware states to allow access
> > > >       to this page then from that point on hardware can write to the
> > > >       page at _any_ time, it is fully disconnected from any fs event
> > > >       like write back, it fully ignore things like page_mkclean
> > > > 
> > > > This is how it is to day, we allowed people to push upstream such
> > > > users of GUP. This is a fact we have to live with, we can not stop
> > > > hardware access to the page, we can not force the hardware to follow
> > > > page_mkclean and force a page_mkwrite once write back ends. This is
> > > > the situation we are inheriting (and i am personnaly not happy with
> > > > that).
> > > > 
> > > > From my point of view we are left with 2 choices:
> > > >     [C1] break all drivers that do not abide by the page_mkclean and
> > > >          page_mkwrite
> > > >     [C2] mitigate as much as possible the issue
> > > > 
> > > > For [C2] the idea is to keep track of GUP per page so we know if we
> > > > can expect the page to be written to at any time. Here is the event
> > > > flow:
> > > >     - driver GUP the page and program the hardware, page is mark as
> > > >       GUPed
> > > >     ...
> > > >     - write back kicks in on the dirty page, lock the page and every
> > > >       thing as usual , sees it is GUPed and inform the block layer to
> > > >       use a bounce page
> > > 
> > > No.  The solution John, Dan & I have been looking at is to take the
> > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > found for writeback.
> > > 
> > > That's not the end of the story though.  Other parts of the kernel (eg
> > > msync) also need to be taught to stay away from pages which are pinned
> > > by GUP.  But the idea is that no page gets written back to storage while
> > > it's pinned by GUP.  Only when the last GUP ends is the page returned
> > > to the list of dirty pages.
> > > 
> > > >     - block layer copy the page to a bounce page effectively creating
> > > >       a snapshot of what is the content of the real page. This allows
> > > >       everything in block layer that need stable content to work on
> > > >       the bounce page (raid, stripping, encryption, ...)
> > > >     - once write back is done the page is not marked clean but stays
> > > >       dirty, this effectively disable things like COW for filesystem
> > > >       and other feature that expect page_mkwrite between write back.
> > > >       AFAIK it is believe that it is something acceptable
> > > 
> > > So none of this is necessary.
> > 
> > With the solution you are proposing we loose GUP fast and we have to
> > allocate a structure for each page that is under GUP, and the LRU
> > changes too. Moreover by not writing back there is a greater chance
> > of data loss.
> 
> Why can't you store the hmm_data in a side data structure?  Why does it
> have to be in struct page?

hmm_data is not even the issue here; we can have a pincount without
moving things around. So I do not see the need to complicate any of
the existing code to add a new structure and consume more memory for
no good reason. I do not see any benefit in that.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:54                                           ` Jerome Glisse
@ 2018-12-17 19:59                                             ` Matthew Wilcox
  2018-12-17 20:55                                               ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-17 19:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > found for writeback.
> > > 
> > > With the solution you are proposing we loose GUP fast and we have to
> > > allocate a structure for each page that is under GUP, and the LRU
> > > changes too. Moreover by not writing back there is a greater chance
> > > of data loss.
> > 
> > Why can't you store the hmm_data in a side data structure?  Why does it
> > have to be in struct page?
> 
> hmm_data is not even the issue here, we can have a pincount without
> moving things around. So i do not see the need to complexify any of
> the existing code to add new structure and consume more memory for
> no good reasons. I do not see any benefit in that.

You said "we have to allocate a structure for each page that is under
GUP".  The only reason to do that is if we want to keep hmm_data in
struct page.  If we ditch hmm_data, there's no need to allocate a
structure, and we don't lose GUP fast either.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 19:59                                             ` Matthew Wilcox
@ 2018-12-17 20:55                                               ` Jerome Glisse
  2018-12-17 21:03                                                 ` Matthew Wilcox
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-17 20:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > found for writeback.
> > > > 
> > > > With the solution you are proposing we loose GUP fast and we have to
> > > > allocate a structure for each page that is under GUP, and the LRU
> > > > changes too. Moreover by not writing back there is a greater chance
> > > > of data loss.
> > > 
> > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > have to be in struct page?
> > 
> > hmm_data is not even the issue here, we can have a pincount without
> > moving things around. So i do not see the need to complexify any of
> > the existing code to add new structure and consume more memory for
> > no good reasons. I do not see any benefit in that.
> 
> You said "we have to allocate a structure for each page that is under
> GUP".  The only reason to do that is if we want to keep hmm_data in
> struct page.  If we ditch hmm_data, there's no need to allocate a
> structure, and we don't lose GUP fast either.

And I have proposed a way that needs neither to ditch hmm_data nor
to remove the page from the LRU. What is it you do not like
about that?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 20:55                                               ` Jerome Glisse
@ 2018-12-17 21:03                                                 ` Matthew Wilcox
  2018-12-17 21:15                                                   ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: Matthew Wilcox @ 2018-12-17 21:03 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 03:55:01PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> > On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > > found for writeback.
> > > > > 
> > > > > With the solution you are proposing we loose GUP fast and we have to
> > > > > allocate a structure for each page that is under GUP, and the LRU
> > > > > changes too. Moreover by not writing back there is a greater chance
> > > > > of data loss.
> > > > 
> > > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > > have to be in struct page?
> > > 
> > > hmm_data is not even the issue here, we can have a pincount without
> > > moving things around. So i do not see the need to complexify any of
> > > the existing code to add new structure and consume more memory for
> > > no good reasons. I do not see any benefit in that.
> > 
> > You said "we have to allocate a structure for each page that is under
> > GUP".  The only reason to do that is if we want to keep hmm_data in
> > struct page.  If we ditch hmm_data, there's no need to allocate a
> > structure, and we don't lose GUP fast either.
> 
> And i have propose a way that do not need to ditch hmm_data nor
> needs to remove page from the lru. What is it you do not like
> with that ?

I don't like bounce buffering.  I don't like "end of writeback doesn't
mark page as clean".  I don't like pages being on the LRU that aren't
actually removable.  I don't like writing pages back which we know we're
going to have to write back again.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 21:03                                                 ` Matthew Wilcox
@ 2018-12-17 21:15                                                   ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-17 21:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 01:03:58PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 03:55:01PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 11:59:22AM -0800, Matthew Wilcox wrote:
> > > On Mon, Dec 17, 2018 at 02:54:08PM -0500, Jerome Glisse wrote:
> > > > On Mon, Dec 17, 2018 at 11:51:51AM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Dec 17, 2018 at 02:48:00PM -0500, Jerome Glisse wrote:
> > > > > > On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> > > > > > > No.  The solution John, Dan & I have been looking at is to take the
> > > > > > > dirty page off the LRU while it is pinned by GUP.  It will never be
> > > > > > > found for writeback.
> > > > > > 
> > > > > > With the solution you are proposing we loose GUP fast and we have to
> > > > > > allocate a structure for each page that is under GUP, and the LRU
> > > > > > changes too. Moreover by not writing back there is a greater chance
> > > > > > of data loss.
> > > > > 
> > > > > Why can't you store the hmm_data in a side data structure?  Why does it
> > > > > have to be in struct page?
> > > > 
> > > > hmm_data is not even the issue here, we can have a pincount without
> > > > moving things around. So i do not see the need to complexify any of
> > > > the existing code to add new structure and consume more memory for
> > > > no good reasons. I do not see any benefit in that.
> > > 
> > > You said "we have to allocate a structure for each page that is under
> > > GUP".  The only reason to do that is if we want to keep hmm_data in
> > > struct page.  If we ditch hmm_data, there's no need to allocate a
> > > structure, and we don't lose GUP fast either.
> > 
> > And i have propose a way that do not need to ditch hmm_data nor
> > needs to remove page from the lru. What is it you do not like
> > with that ?
> 
> I don't like bounce buffering.  I don't like "end of writeback doesn't
> mark page as clean".  I don't like pages being on the LRU that aren't
> actually removable.  I don't like writing pages back which we know we're
> going to have to write back again.

And my solution allows you to pick at whichever point ... you can decide
to abort writeback if you feel that is better, you can remove the page
from the LRU on the first writeback abort ... So you can do everything
you want in my solution; it is just as flexible. Right now I am finishing
a couple of patchsets; once I am done I will do an RFC on that. In my RFC
I will keep writeback and bounce, but it can easily be turned into no
writeback and removal from the LRU. My feeling is that not writing back
means data loss; at the same time, if the page is under continuous write,
one can argue that whatever snapshot we write back might be pointless.
I do not see any strong argument either way.

Cheers.
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-17 19:48                                       ` Jerome Glisse
@ 2018-12-18  1:09                                       ` Dave Chinner
  2018-12-18  6:12                                       ` Darrick J. Wong
  2018-12-18  9:30                                       ` Jan Kara
  3 siblings, 0 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-18  1:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Jan Kara, John Hubbard, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.

Pages are found for writeback by mapping tree lookup, not page LRU
scans (i.e. write_cache_pages() from background writeback)
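
To illustrate the point, here is a minimal user-space sketch, not kernel
code. It assumes a mapping can be modelled as a plain array of pages;
`struct toy_page`, its two flags and `toy_writeback_candidates()` are
hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; only the two properties under discussion. */
struct toy_page {
	bool dirty;   /* tagged dirty in the mapping */
	bool on_lru;  /* currently linked on an LRU list */
};

/*
 * Count the pages a mapping-driven writeback pass would pick up.
 * The walk is keyed on the dirty tag alone; on_lru is never consulted,
 * so taking a page off the LRU does not hide it from this pass.
 */
size_t toy_writeback_candidates(const struct toy_page *mapping, size_t nr)
{
	size_t found = 0;

	assert(mapping != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++)
		if (mapping[i].dirty)
			found++;
	return found;
}
```

In this model a dirty page with on_lru == false is still counted, which is
the behaviour the question below is probing.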

Are you suggesting that pages pinned by GUP are going to be removed from
the page cache *and* the mapping tree while they are pinned?

> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP. But the idea is that no page gets written back to storage while
> it's pinned by GUP. Only when the last GUP ends is the page returned
> to the list of dirty pages.

I think playing fast and loose with data integrity like this is
fundamentally wrong. If this gets implemented, then I'll be sending
every "I ran sync and then two hours later the system crashed but
the data was lost when the system came back up" bug report directly
to you.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
  2018-12-17 19:48                                       ` Jerome Glisse
  2018-12-18  1:09                                       ` Dave Chinner
@ 2018-12-18  6:12                                       ` Darrick J. Wong
  2018-12-18  9:30                                       ` Jan Kara
  3 siblings, 0 replies; 206+ messages in thread
From: Darrick J. Wong @ 2018-12-18  6:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Jan Kara, John Hubbard,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
> 
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.

Errr... what does fsync do in the meantime?  Not write the page?
That would seem to break what fsync() is supposed to do.

--D

> >     - block layer copy the page to a bounce page effectively creating
> >       a snapshot of what is the content of the real page. This allows
> >       everything in block layer that need stable content to work on
> >       the bounce page (raid, stripping, encryption, ...)
> >     - once write back is done the page is not marked clean but stays
> >       dirty, this effectively disable things like COW for filesystem
> >       and other feature that expect page_mkwrite between write back.
> >       AFAIK it is believe that it is something acceptable
> 
> So none of this is necessary.
> 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-17 18:34                                     ` Matthew Wilcox
                                                         ` (2 preceding siblings ...)
  2018-12-18  6:12                                       ` Darrick J. Wong
@ 2018-12-18  9:30                                       ` Jan Kara
  2018-12-18 23:29                                         ` John Hubbard
  3 siblings, 1 reply; 206+ messages in thread
From: Jan Kara @ 2018-12-18  9:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Jan Kara, John Hubbard,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > > 
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > So the event flow is:
> >     - userspace create object that match a range of virtual address
> >       against a given kernel sub-system (let's say infiniband) and
> >       let's assume that the range is an mmap() of a regular file
> >     - device driver do GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already map with write permission
> >       in the page table than a page fault is trigger and page_mkwrite
> >       happens
> >     - Once GUP return the page to the device driver and once the
> >       device driver as updated the hardware states to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignore things like page_mkclean
> > 
> > This is how it is to day, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we can not stop
> > hardware access to the page, we can not force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and i am personnaly not happy with
> > that).
> > 
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> > 
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUP the page and program the hardware, page is mark as
> >       GUPed
> >     ...
> >     - write back kicks in on the dirty page, lock the page and every
> >       thing as usual , sees it is GUPed and inform the block layer to
> >       use a bounce page
> 
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
> 
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.

We've been through this in:

https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/

back in July. You cannot just skip pages for fsync(2). So as I wrote above -
memory cleaning writeback can skip pinned pages. Data integrity writeback
must be able to write pinned pages. And bouncing is one reasonable way
to do that.
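
The rule above can be written down as a small decision function. This is a
minimal user-space sketch, not kernel code: the WB_SYNC_NONE / WB_SYNC_ALL
names mirror the kernel's writeback sync modes, while `enum wb_action`,
`writeback_action()` and the `pinned` flag are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's two writeback modes: cleaning vs. data integrity. */
enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Hypothetical outcomes for one dirty page. */
enum wb_action { WB_WRITE_DIRECT, WB_SKIP, WB_WRITE_BOUNCED };

enum wb_action writeback_action(enum sync_mode mode, bool pinned)
{
	assert(mode == WB_SYNC_NONE || mode == WB_SYNC_ALL);

	if (!pinned)
		return WB_WRITE_DIRECT;  /* ordinary page, nothing special */
	if (mode == WB_SYNC_NONE)
		return WB_SKIP;          /* memory cleaning may skip pinned pages */
	return WB_WRITE_BOUNCED;         /* integrity writeback must still write */
}
```

How `pinned` gets computed is left open here on purpose; it is the separate
question discussed in the next paragraph.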

This writeback decision is pretty much independent of the mechanism by
which we are going to identify pinned pages, whether that's going to be a
separate counter in struct page, page->_mapcount, or a separately
allocated data structure as you now promote.

I currently like Jerome's _mapcount suggestion the most, but I'm not
really attached to any solution as long as it performs reasonably and
someone can make it work :) as I don't have time to implement it at
least till January.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18  9:30                                       ` Jan Kara
@ 2018-12-18 23:29                                         ` John Hubbard
  2018-12-19  2:07                                           ` Jerome Glisse
  0 siblings, 1 reply; 206+ messages in thread
From: John Hubbard @ 2018-12-18 23:29 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: Jerome Glisse, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/18/18 1:30 AM, Jan Kara wrote:
> On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
>> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
>>> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
>>>> Sure, that's a possibility, but that doesn't close off any race
>>>> conditions because there can be DMA into the page in progress while
>>>> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
>>>> different in that there is no 3rd-party access to the page while it
>>>> is under IO (ext3 arbitrates all access to it's metadata), and so
>>>> nothing can actually race for modification of the page between
>>>> submission and bouncing at the block layer.
>>>>
>>>> In this case, the moment the page is unlocked, anyone else can map
>>>> it and start (R)DMA on it, and that can happen before the bio is
>>>> bounced by the block layer. So AFAICT, block layer bouncing doesn't
>>>> solve the problem of racing writeback and DMA direct to the page we
>>>> are doing IO on. Yes, it reduces the race window substantially, but
>>>> it doesn't get rid of it.
>>>
>>> So the event flow is:
>>>     - userspace create object that match a range of virtual address
>>>       against a given kernel sub-system (let's say infiniband) and
>>>       let's assume that the range is an mmap() of a regular file
>>>     - device driver do GUP on the range (let's assume it is a write
>>>       GUP) so if the page is not already map with write permission
>>>       in the page table than a page fault is trigger and page_mkwrite
>>>       happens
>>>     - Once GUP return the page to the device driver and once the
>>>       device driver as updated the hardware states to allow access
>>>       to this page then from that point on hardware can write to the
>>>       page at _any_ time, it is fully disconnected from any fs event
>>>       like write back, it fully ignore things like page_mkclean
>>>
>>> This is how it is to day, we allowed people to push upstream such
>>> users of GUP. This is a fact we have to live with, we can not stop
>>> hardware access to the page, we can not force the hardware to follow
>>> page_mkclean and force a page_mkwrite once write back ends. This is
>>> the situation we are inheriting (and i am personnaly not happy with
>>> that).
>>>
>>> From my point of view we are left with 2 choices:
>>>     [C1] break all drivers that do not abide by the page_mkclean and
>>>          page_mkwrite
>>>     [C2] mitigate as much as possible the issue
>>>
>>> For [C2] the idea is to keep track of GUP per page so we know if we
>>> can expect the page to be written to at any time. Here is the event
>>> flow:
>>>     - driver GUP the page and program the hardware, page is mark as
>>>       GUPed
>>>     ...
>>>     - write back kicks in on the dirty page, lock the page and every
>>>       thing as usual , sees it is GUPed and inform the block layer to
>>>       use a bounce page
>>
>> No.  The solution John, Dan & I have been looking at is to take the
>> dirty page off the LRU while it is pinned by GUP.  It will never be
>> found for writeback.
>>
>> That's not the end of the story though.  Other parts of the kernel (eg
>> msync) also need to be taught to stay away from pages which are pinned
>> by GUP.  But the idea is that no page gets written back to storage while
>> it's pinned by GUP.  Only when the last GUP ends is the page returned
>> to the list of dirty pages.
> 
> We've been through this in:
> 
> https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/
> 
> back in July. You cannot just skip pages for fsync(2). So as I wrote above -
> memory cleaning writeback can skip pinned pages. Data integrity writeback
> must be able to write pinned pages. And bouncing is one reasonable way how
> to do that.
> 
> This writeback decision is pretty much independent from the mechanism by
> which we are going to identify pinned pages. Whether that's going to be
> separate counter in struct page, using page->_mapcount, or separately
> allocated data structure as you know promote.
> 
> I currently like the most the _mapcount suggestion from Jerome but I'm not
> really attached to any solution as long as it performs reasonably and
> someone can make it working :) as I don't have time to implement it at
> least till January.
> 

OK, so let's take another look at Jerome's _mapcount idea all by itself (using
*only* the tracking pinned pages aspect), given that it is the lightest weight
solution for that.  

So as I understand it, this would use page->_mapcount to store both the real
mapcount, and the dma pinned count (simply added together), but only do so for
file-backed (non-anonymous) pages:


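__get_user_pages()
{
	...
	get_page(page);

	/* only file-backed pages carry the pin in _mapcount */
	if (!PageAnon(page))
		atomic_inc(&page->_mapcount);
	...
}

put_user_page(struct page *page)
{
	...
	/* drop the pin before dropping the page reference */
	if (!PageAnon(page))
		atomic_dec(&page->_mapcount);

	put_page(page);
	...
}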

...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
had in mind?
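
The scheme as described above can be modelled in user space. This is a
minimal sketch, assuming C11 atomics in place of the kernel's atomic_t
helpers; `struct toy_page` and every `toy_*` function are hypothetical names
for illustration. The one detail taken from the kernel is that _mapcount
starts at -1, so a page counts as mapped once the value is 0 or more.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the fields this sketch needs. */
struct toy_page {
	atomic_int mapcount;  /* starts at -1, like page->_mapcount */
	bool anon;            /* true for anonymous memory */
};

void toy_page_init(struct toy_page *p, bool anon)
{
	atomic_init(&p->mapcount, -1);
	p->anon = anon;
}

/* A real user mapping of the page. */
void toy_map(struct toy_page *p)
{
	atomic_fetch_add(&p->mapcount, 1);
}

/* GUP side: file-backed pages add the pin to the same counter. */
void toy_gup_pin(struct toy_page *p)
{
	if (!p->anon)
		atomic_fetch_add(&p->mapcount, 1);
}

/* put_user_page() side: drop the pin taken by toy_gup_pin(). */
void toy_put_user_page(struct toy_page *p)
{
	if (!p->anon) {
		int old = atomic_fetch_sub(&p->mapcount, 1);

		assert(old >= 0);  /* catch an unbalanced put */
	}
}

/* Same test as page_mapped() for a small page: any count left? */
bool toy_page_mapped(struct toy_page *p)
{
	return atomic_load(&p->mapcount) >= 0;
}
```

In this toy model a page that is mapped but was never pinned also reports
true, so the check can only say "possibly pinned", not "definitely pinned".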


-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:29                                         ` John Hubbard
@ 2018-12-19  2:07                                           ` Jerome Glisse
  2018-12-19 11:08                                             ` Jan Kara
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-19  2:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> On 12/18/18 1:30 AM, Jan Kara wrote:
> > On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
> >> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> >>> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> >>>> Sure, that's a possibility, but that doesn't close off any race
> >>>> conditions because there can be DMA into the page in progress while
> >>>> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> >>>> different in that there is no 3rd-party access to the page while it
> >>>> is under IO (ext3 arbitrates all access to it's metadata), and so
> >>>> nothing can actually race for modification of the page between
> >>>> submission and bouncing at the block layer.
> >>>>
> >>>> In this case, the moment the page is unlocked, anyone else can map
> >>>> it and start (R)DMA on it, and that can happen before the bio is
> >>>> bounced by the block layer. So AFAICT, block layer bouncing doesn't
> >>>> solve the problem of racing writeback and DMA direct to the page we
> >>>> are doing IO on. Yes, it reduces the race window substantially, but
> >>>> it doesn't get rid of it.
> >>>
> >>> So the event flow is:
> >>>     - userspace create object that match a range of virtual address
> >>>       against a given kernel sub-system (let's say infiniband) and
> >>>       let's assume that the range is an mmap() of a regular file
> >>>     - device driver do GUP on the range (let's assume it is a write
> >>>       GUP) so if the page is not already map with write permission
> >>>       in the page table than a page fault is trigger and page_mkwrite
> >>>       happens
> >>>     - Once GUP return the page to the device driver and once the
> >>>       device driver as updated the hardware states to allow access
> >>>       to this page then from that point on hardware can write to the
> >>>       page at _any_ time, it is fully disconnected from any fs event
> >>>       like write back, it fully ignore things like page_mkclean
> >>>
> >>> This is how it is to day, we allowed people to push upstream such
> >>> users of GUP. This is a fact we have to live with, we can not stop
> >>> hardware access to the page, we can not force the hardware to follow
> >>> page_mkclean and force a page_mkwrite once write back ends. This is
> >>> the situation we are inheriting (and i am personnaly not happy with
> >>> that).
> >>>
> >>> From my point of view we are left with 2 choices:
> >>>     [C1] break all drivers that do not abide by the page_mkclean and
> >>>          page_mkwrite
> >>>     [C2] mitigate as much as possible the issue
> >>>
> >>> For [C2] the idea is to keep track of GUP per page so we know if we
> >>> can expect the page to be written to at any time. Here is the event
> >>> flow:
> >>>     - driver GUP the page and program the hardware, page is mark as
> >>>       GUPed
> >>>     ...
> >>>     - write back kicks in on the dirty page, lock the page and every
> >>>       thing as usual , sees it is GUPed and inform the block layer to
> >>>       use a bounce page
> >>
> >> No.  The solution John, Dan & I have been looking at is to take the
> >> dirty page off the LRU while it is pinned by GUP.  It will never be
> >> found for writeback.
> >>
> >> That's not the end of the story though.  Other parts of the kernel (eg
> >> msync) also need to be taught to stay away from pages which are pinned
> >> by GUP.  But the idea is that no page gets written back to storage while
> >> it's pinned by GUP.  Only when the last GUP ends is the page returned
> >> to the list of dirty pages.
> > 
> > We've been through this in:
> > 
> > https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/
> > 
> > back in July. You cannot just skip pages for fsync(2). So as I wrote above -
> > memory cleaning writeback can skip pinned pages. Data integrity writeback
> > must be able to write pinned pages. And bouncing is one reasonable way how
> > to do that.
> > 
> > This writeback decision is pretty much independent from the mechanism by
> > which we are going to identify pinned pages. Whether that's going to be
> > separate counter in struct page, using page->_mapcount, or separately
> > allocated data structure as you know promote.
> > 
> > I currently like the most the _mapcount suggestion from Jerome but I'm not
> > really attached to any solution as long as it performs reasonably and
> > someone can make it working :) as I don't have time to implement it at
> > least till January.
> > 
> 
> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> *only* the tracking pinned pages aspect), given that it is the lightest weight
> solution for that.  
> 
> So as I understand it, this would use page->_mapcount to store both the real
> mapcount, and the dma pinned count (simply added together), but only do so for
> file-backed (non-anonymous) pages:
> 
> 
> __get_user_pages()
> {
> 	...
> 	get_page(page);
> 
> 	if (!PageAnon)
> 		atomic_inc(page->_mapcount);
> 	...
> }
> 
> put_user_page(struct page *page)
> {
> 	...
> 	if (!PageAnon)
> 		atomic_dec(&page->_mapcount);
> 
> 	put_page(page);
> 	...
> }
> 
> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> had in mind?

Mostly, with two extra observations:
    [1] We only need to know the pin count when a write back kicks in
    [2] We need to protect GUP code with wait_on_page_writeback() in
        case GUP is racing with a write back that might not see the
        elevated mapcount in time.

So for [2]

__get_user_pages()
{
    get_page(page);

    if (!PageAnon(page)) {
        atomic_inc(&page->_mapcount);
+       if (PageWriteback(page)) {
+           // Assume we are racing and the current write back will not
+           // see the elevated mapcount, so wait for the current write
+           // back and force a page fault
+           wait_on_page_writeback(page);
+           // force slow path that will fault again
+       }
    }
}

For [1], only needing the pin count during write back makes
page_mkclean() the perfect spot to check for it, so:

int page_mkclean(struct page *page)
{
    int cleaned = 0;
+   int real_mapcount = 0;
    struct address_space *mapping;
    struct rmap_walk_control rwc = {
        .arg = (void *)&cleaned,
        .rmap_one = page_mkclean_one,
        .invalid_vma = invalid_mkclean_vma,
+       .mapcount = &real_mapcount,
    };

    BUG_ON(!PageLocked(page));

    if (!page_mapped(page))
        return 0;

    mapping = page_mapping(page);
    if (!mapping)
        return 0;

    // rmap_walk() needs to change to count mappings and return the
    // value in .mapcount (the easy part)
    rmap_walk(page, &rwc);

    // Big fat comment to explain what is going on
+   if ((page_mapcount(page) - real_mapcount) > 0) {
+       SetPageDMAPined(page);
+   } else {
+       ClearPageDMAPined(page);
+   }

    // Maybe we want to leverage the int nature of return value so that
    // we can express more than cleaned/truncated and express cleaned/
    // truncated/pinned for benefit of caller and that way we do not
    // even need one bit as page flags above.

    return cleaned;
}

You do not want to change page_mapped(); I do not see a need for that.
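
To make the accounting concrete, here is a small userspace model of it.
This is only a sketch, not kernel code: struct toy_page and all toy_*()
helpers are made-up names, the count starts at 0 rather than the -1 that
the real _mapcount uses, and it assumes the rmap walk can report how many
real mappings it visited. The only point is that a page counts as DMA
pinned when the stored count exceeds the number of mappings the walk saw.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only what this sketch needs. */
struct toy_page {
    atomic_int mapcount;  /* real mappings + GUP pins (file-backed only) */
    bool anon;            /* anonymous pages are not tracked this way */
};

/* Model of a PTE being installed or zapped for this page. */
static void toy_map(struct toy_page *p)
{
    atomic_fetch_add(&p->mapcount, 1);
}

static void toy_unmap(struct toy_page *p)
{
    atomic_fetch_sub(&p->mapcount, 1);
}

/* Model of the GUP side: a pin is folded into the same counter. */
static void toy_gup_pin(struct toy_page *p)
{
    if (!p->anon)
        atomic_fetch_add(&p->mapcount, 1);
}

/* Model of put_user_page(): drop the pin taken by toy_gup_pin(). */
static void toy_put_user_page(struct toy_page *p)
{
    if (!p->anon)
        atomic_fetch_sub(&p->mapcount, 1);
}

/*
 * Model of the check at the end of page_mkclean(): real_mapcount is
 * what the rmap walk counted, anything beyond that must be a pin.
 */
static bool toy_page_dma_pinned(struct toy_page *p, int real_mapcount)
{
    return atomic_load(&p->mapcount) - real_mapcount > 0;
}
```

With two mappings and one pin the stored count is 3 while the walk sees
2, so the page is reported as pinned. Note that the model is single
threaded, so it says nothing about races between the walk and unmap.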

Then the whole discussion between Jan and Dave seems to indicate that
the bounce mechanism will need to be in the fs layer and that we can
not reuse the bio bounce mechanism. This means that more work is needed
at the fs level (so that filesystems do not freak out on a bounce page).

Note that there are a few gotchas where we need to preserve the pin
count, mostly in the truncate code path, which can remove a page from
the page cache and overwrite the mapcount in the process. This would
need to be fixed to not overwrite the mapcount, so that put_user_page()
does not set the mapcount to an invalid value, turning the page into a
bad state that will at some point trigger a kernel BUG_ON().

I am not saying block truncate; I am saying make sure it does not
erase the pin count, and keep truncating happily. How to handle
truncate is a discussion to have with each existing GUP user, to see
what they want to do for that.

Obviously a somewhat deeper analysis of all spots that use mapcount is
needed to check that we are not breaking anything, but off the top of my
head I can not think of anything bad (migrate will abort, and other things
will assume the page is mapped even if it is only in a hardware page
table, ...).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:42                                     ` Dave Chinner
@ 2018-12-19  3:03                                       ` Jason Gunthorpe
  2018-12-19  5:26                                         ` Dan Williams
  2018-12-19 10:28                                         ` Dave Chinner
  2018-12-19 13:24                                       ` Jan Kara
  1 sibling, 2 replies; 206+ messages in thread
From: Jason Gunthorpe @ 2018-12-19  3:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:

> Essentially, what we are talking about is how to handle broken
> hardware. I say we should just brun it with napalm and thermite
> (i.e. taint the kernel with "unsupportable hardware") and force
> wait_for_stable_page() to trigger when there are GUP mappings if
> the underlying storage doesn't already require it.

If you want to ban O_DIRECT/etc from writing to file backed pages,
then just do it.

Otherwise I'm not sure demanding some unrealistic HW design is
reasonable. ie nvme drives are not likely to add page faulting to
their IO path any time soon.

A SW architecture that relies on page faulting is just not going to
support real world block IO devices.

GPUs and one RDMA device are about the only things that can do this today,
and they are basically irrelevant to O_DIRECT.

Jason

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  3:03                                       ` Jason Gunthorpe
@ 2018-12-19  5:26                                         ` Dan Williams
  2018-12-19 11:19                                           ` Jan Kara
  2018-12-19 10:28                                         ` Dave Chinner
  1 sibling, 1 reply; 206+ messages in thread
From: Dan Williams @ 2018-12-19  5:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Jerome Glisse, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	Mike Marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue, Dec 18, 2018 at 7:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
>
> > Essentially, what we are talking about is how to handle broken
> > hardware. I say we should just brun it with napalm and thermite
> > (i.e. taint the kernel with "unsupportable hardware") and force
> > wait_for_stable_page() to trigger when there are GUP mappings if
> > the underlying storage doesn't already require it.
>
> If you want to ban O_DIRECT/etc from writing to file backed pages,
> then just do it.
>
> Otherwise I'm not sure demanding some unrealistic HW design is
> reasonable. ie nvme drives are not likely to add page faulting to
> their IO path any time soon.
>
> A SW architecture that relies on page faulting is just not going to
> support real world block IO devices.
>
> GPUs and one RDMA are about the only things that can do this today,
> and they are basically irrelevant to O_DIRECT.

Yes.

I'm missing why a bounce buffer is needed. If writeback hits a
DMA-writable page why can't that path just turn around and trigger
another mkwrite notification on behalf of hardware that will never send
it? "Nice try writeback, this page is dirty again".

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  3:03                                       ` Jason Gunthorpe
  2018-12-19  5:26                                         ` Dan Williams
@ 2018-12-19 10:28                                         ` Dave Chinner
  2018-12-19 11:35                                           ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: Dave Chinner @ 2018-12-19 10:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> 
> > Essentially, what we are talking about is how to handle broken
> > hardware. I say we should just brun it with napalm and thermite
> > (i.e. taint the kernel with "unsupportable hardware") and force
> > wait_for_stable_page() to trigger when there are GUP mappings if
> > the underlying storage doesn't already require it.
> 
> If you want to ban O_DIRECT/etc from writing to file backed pages,
> then just do it.

O_DIRECT IO *isn't the problem*.


O_DIRECT IO uses a short term pin that the existing prefaulting
during GUP works just fine for. The problem we have is the long term
pins where pages can be cleaned while the pages are pinned, i.e. the
use case we currently have to disable for DAX because *we can't make
it work sanely* without either revocable file leases and/or hardware
that is able to trigger page faults when it needs write access to a
clean page.

> Otherwise I'm not sure demanding some unrealistic HW design is
> reasonable. ie nvme drives are not likely to add page faulting to
> their IO path any time soon.

Direct IO on nvme drives is not the problem. It's RDMA pinning
pages for hours or days and expecting everyone else to jump through
hoops to support their broken page access model.

> A SW architecture that relies on page faulting is just not going to
> support real world block IO devices.

The existing software architecture for file backed pages has been
based around page faulting for write notifications since ~2005. That
horse bolted many, many years ago.

> GPUs and one RDMA are about the only things that can do this today,
> and they are basically irrelevant to O_DIRECT.

It's RDMA that we need these changes for, not O_DIRECT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  2:07                                           ` Jerome Glisse
@ 2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-19 11:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > solution for that.  
> > 
> > So as I understand it, this would use page->_mapcount to store both the real
> > mapcount, and the dma pinned count (simply added together), but only do so for
> > file-backed (non-anonymous) pages:
> > 
> > 
> > __get_user_pages()
> > {
> > 	...
> > 	get_page(page);
> > 
> > 	if (!PageAnon)
> > 		atomic_inc(page->_mapcount);
> > 	...
> > }
> > 
> > put_user_page(struct page *page)
> > {
> > 	...
> > 	if (!PageAnon)
> > 		atomic_dec(&page->_mapcount);
> > 
> > 	put_page(page);
> > 	...
> > }
> > 
> > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > had in mind?
> 
> Mostly, with the extra two observations:
>     [1] We only need to know the pin count when a write back kicks in
>     [2] We need to protect GUP code with wait_for_write_back() in case
>         GUP is racing with a write back that might not the see the
>         elevated mapcount in time.
> 
> So for [2]
> 
> __get_user_pages()
> {
>     get_page(page);
> 
>     if (!PageAnon) {
>         atomic_inc(page->_mapcount);
> +       if (PageWriteback(page)) {
> +           // Assume we are racing and curent write back will not see
> +           // the elevated mapcount so wait for current write back and
> +           // force page fault
> +           wait_on_page_writeback(page);
> +           // force slow path that will fault again
> +       }
>     }
> }

This is not needed AFAICT. __get_user_pages() gets the page reference (and
it should also increment page->_mapcount) under the PTE lock. So at that
point we are sure we have a writeable PTE that nobody can change. So
page_mkclean() has to block on the PTE lock to make the PTE read-only, and
only after going through all PTEs like this can it check page->_mapcount.
So the PTE lock provides enough synchronization.

> For [1] only needing pin count during write back turns page_mkclean into
> the perfect spot to check for that so:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);
> 
>     // Big fat comment to explain what is going on
> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> +       SetPageDMAPined(page);
> +   } else {
> +       ClearPageDMAPined(page);
> +   }

This is the detail I'm not sure about: Why can't rmap_walk_file() race
with e.g. zap_pte_range(), which decrements page->_mapcount, making the
check we do in page_mkclean() wrong?

> 
>     // Maybe we want to leverage the int nature of return value so that
>     // we can express more than cleaned/truncated and express cleaned/
>     // truncated/pinned for benefit of caller and that way we do not
>     // even need one bit as page flags above.
> 
>     return cleaned;
> }
> 
> You do not want to change page_mapped() i do not see a need for that.
> 
> Then the whole discussion between Jan and Dave seems to indicate that
> the bounce mechanism will need to be in the fs layer and that we can
> not reuse the bio bounce mechanism. This means that more work is needed
> at the fs level for that (so that fs do not freak on bounce page).
> 
> Note that they are few gotcha where we need to preserve the pin count
> ie mostly in truncate code path that can remove page from page cache
> and overwrite the mapcount in the process, this would need to be fixed
> to not overwrite mapcount so that put_user_page does not set the map
> count to an invalid value turning the page into a bad state that will
> at one point trigger kernel BUG_ON();
>
> I am not saying block truncate, i am saying make sure it does not
> erase pin count and keep truncating happily. The how to handle truncate
> is a per existing GUP user discussion to see what they want to do for
> that.
> 
> Obviously a bit deeper analysis of all spot that use mapcount is needed
> to check that we are not breaking anything but from the top of my head
> i can not think of anything bad (migrate will abort and other things will
> assume the page is mapped even it is only in hardware page table, ...).

Hum, grepping for page_mapped() and page_mapcount(), this is actually going
to be non-trivial to get right AFAICT.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19  5:26                                         ` Dan Williams
@ 2018-12-19 11:19                                           ` Jan Kara
  0 siblings, 0 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-19 11:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Jan Kara, Jerome Glisse,
	John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue 18-12-18 21:26:28, Dan Williams wrote:
> On Tue, Dec 18, 2018 at 7:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> >
> > > Essentially, what we are talking about is how to handle broken
> > > hardware. I say we should just brun it with napalm and thermite
> > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > the underlying storage doesn't already require it.
> >
> > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > then just do it.
> >
> > Otherwise I'm not sure demanding some unrealistic HW design is
> > reasonable. ie nvme drives are not likely to add page faulting to
> > their IO path any time soon.
> >
> > A SW architecture that relies on page faulting is just not going to
> > support real world block IO devices.
> >
> > GPUs and one RDMA are about the only things that can do this today,
> > and they are basically irrelevant to O_DIRECT.
> 
> Yes.
> 
> I'm missing why a bounce buffer is needed. If writeback hits a
> DMA-writable page why can't that path just turn around and trigger
> another mkwrite notifcation on behalf of hardware that will never send
> it? "Nice try writeback, this page is dirty again".

You are conflating two things here. A bounce buffer (or a way to stop DMA
from happening) is needed because of what happens when RAID5 computes its
stripe checksum while someone modifies the data through DMA: a checksum
mismatch and all the fun arising from that.

Notifying the filesystem about the fact that the page didn't get cleaned
by the writeback and can still be modified by the DMA is a different thing.
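
The checksum problem can be shown with a tiny userspace model. This is
only a sketch under simplifying assumptions: a two-block XOR parity stands
in for the RAID5 stripe checksum, an ordinary store stands in for the DMA
write, and TOY_BLOCK and the toy_*() helpers are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 8

/* XOR parity over two data blocks, as a RAID5-like layer would compute. */
static void toy_parity(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    for (size_t i = 0; i < TOY_BLOCK; i++)
        out[i] = a[i] ^ b[i];
}

/* True if the stored parity still matches the data that gets written. */
static bool toy_parity_ok(const uint8_t *a, const uint8_t *b,
                          const uint8_t *parity)
{
    uint8_t tmp[TOY_BLOCK];

    toy_parity(a, b, tmp);
    return memcmp(tmp, parity, TOY_BLOCK) == 0;
}

/* Snapshot the live page so later stores to it cannot change the IO. */
static void toy_bounce(const uint8_t *page, uint8_t *bounce)
{
    memcpy(bounce, page, TOY_BLOCK);
}
```

If the parity is computed from the live page and the page is then
modified before the data goes out, toy_parity_ok() fails. If both the
parity and the data come from the bounce copy, later stores to the page
do not matter for this IO.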

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 10:28                                         ` Dave Chinner
@ 2018-12-19 11:35                                           ` Jan Kara
  2018-12-19 16:56                                             ` Jason Gunthorpe
  2018-12-19 22:33                                             ` Dave Chinner
  0 siblings, 2 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-19 11:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Jerome Glisse, John Hubbard,
	Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > 
> > > Essentially, what we are talking about is how to handle broken
> > > hardware. I say we should just brun it with napalm and thermite
> > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > the underlying storage doesn't already require it.
> > 
> > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > then just do it.
> 
> O_DIRECT IO *isn't the problem*.

That is not true. O_DIRECT IO is a problem. In some aspects it is easier
than the problem with RDMA but currently O_DIRECT IO can crash your machine
or corrupt data the same way RDMA can. Just the race window is much
smaller. So we have to fix the generic GUP infrastructure to make O_DIRECT
IO work. I agree that fixing RDMA will likely require even more work, like
revocable leases or whatnot.

> iO_DIRECT IO uses a short term pin that the existing prefaulting
> during GUP works just fine for. The problem we have is the long term
> pins where pages can be cleaned while the pages are pinned. i.e. the
> use case we current have to disable for DAX because *we can't make
> it work sanely* without either revokable file leases and/or hardware
> that is able to trigger page faults when they need write access to a
> clean page.

I would like to find a solution to the O_DIRECT IO problem while making the
infrastructure reusable also for solving the problems with RDMA... Because
nobody wants to go through those couple hundred get_user_pages() users in
the kernel twice...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-18 23:42                                     ` Dave Chinner
  2018-12-19  3:03                                       ` Jason Gunthorpe
@ 2018-12-19 13:24                                       ` Jan Kara
  1 sibling, 0 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-19 13:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 19-12-18 10:42:54, Dave Chinner wrote:
> On Tue, Dec 18, 2018 at 11:33:06AM +0100, Jan Kara wrote:
> > On Mon 17-12-18 08:58:19, Dave Chinner wrote:
> > > On Fri, Dec 14, 2018 at 04:43:21PM +0100, Jan Kara wrote:
> > > > Yes, for filesystem it is too late. But the plan we figured back in October
> > > > was to do the bouncing in the block layer. I.e., mark the bio (or just the
> > > > particular page) as needing bouncing and then use the existing page
> > > > bouncing mechanism in the block layer to do the bouncing for us. Ext3 (when
> > > > it was still a separate fs driver) has been using a mechanism like this to
> > > > make DIF/DIX work with its metadata.
> > > 
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to it's metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > >
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> > 
> > The scenario you describe here cannot happen exactly because of the
> > wait_for_stable_page() in ->page_mkwrite() you mention below.
> 
> In general, no, because stable pages are controlled by block
> devices.
> 
> void wait_for_stable_page(struct page *page)
> {
>         if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
>                 wait_on_page_writeback(page);
> }
> 
> 
> I have previously advocated for the filesystem to be in control of
> stable pages but, well, too many people shouted "but performance!"
> and so we still have all these holes I wanted to close in our
> code...
> 
> > If someone
> > will try to GUP a page that is under writeback (has already PageWriteback
> > set), GUP will have to do a write fault because the page is writeprotected
> > in page tables and go into ->page_mkwrite() which will wait.
> 
> Correct, but that doesn't close the problem down because stable
> pages are something we cannot rely on right now. We need to fix
> wait_for_stable_page() to always block on page writeback before
> this specific race condition goes away.

Right, everything I said assumed that someone actually cares about stable
pages, so bdi_cap_stable_pages_required() is set. I agree with the
filesystem having the ability to control whether stable pages are required
or not. But when stable pages get enforced seems like a separate problem
to me.
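
For reference, the policy variants being discussed can be written down as
one decision function. This is a userspace sketch only: struct
toy_wb_state and its fields are made-up names, fs_requires_stable models
the filesystem-controlled knob mentioned above, and gup_pinned models the
suggestion to force stable pages whenever GUP mappings exist. None of
this is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs to the stable page decision. */
struct toy_wb_state {
    bool bdi_requires_stable;  /* models bdi_cap_stable_pages_required() */
    bool fs_requires_stable;   /* hypothetical per-filesystem override */
    bool gup_pinned;           /* hypothetical: page has GUP pins */
    bool under_writeback;      /* models PageWriteback() */
};

/*
 * Should a write fault wait for writeback to finish? Today only the
 * first condition exists; the other two are the proposed extensions.
 */
static bool toy_must_wait_for_stable_page(const struct toy_wb_state *s)
{
    bool stable = s->bdi_requires_stable ||
                  s->fs_requires_stable ||
                  s->gup_pinned;

    return stable && s->under_writeback;
}
```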

> > The problem rather is with someone mapping the page *before* writeback
> > starts, giving the page to HW. Then clear_page_dirty_for_io() writeprotects
> > the page in PTEs but the HW gives a damn about that. Then, after we add the
> > page to the bio but before the page gets bounced by the block layer, the HW
> > can still modify it.
> 
> Sure, that's yet another aspect of the same problem - not getting a
> write fault when the page is being written to. If we got a write
> fault, then the wait_for_stable_page() call in ->page_mkwrite would
> then solve the problem.
> 
> Essentially, what we are talking about is how to handle broken
> hardware. I say we should just brun it with napalm and thermite
> (i.e. taint the kernel with "unsupportable hardware") and force
> wait_for_stable_page() to trigger when there are GUP mappings if
> the underlying storage doesn't already require it.

As I wrote in another email, this is also about direct IO using a file
mapping as a data buffer. So burning it with napalm can hardly be a
complete solution... I agree that for hardware that cannot support
revoking of access / fault on access and uses long-term page pins, we may
just have to put up with weird behavior in some corner cases.

> > > If it's permanently dirty, how do we trigger new COW operations
> > > after writeback has "cleaned" the page? i.e. we still need a
> > > ->page_mkwrite call to run before we allow the next write to the
> > > page to be done, regardless of whether the page is "permanently
> > > dirty" or not....
> > 
> > Interaction with COW is certainly an interesting problem. When the page
> > gets pinned, GUP will make sure the page is writeably mapped and trigger a
> > write fault if not. So at the moment the page is pinned, we are sure the
> > page is COWed. Now the question is what should happen when the file A
> > containing this pinned page gets reflinked to file B while the page is still
> > pinned.
> > 
> > Options I can see are:
> > 
> > 1) Fail the reflink.
> >   - difficult for sysadmin to discover the source of failure
> >
> > 2) Block reflink until the pin of the page is released.
> >   - can last for a long time, again difficult to discover
> > 
> > 3) Don't do anything special.
> >   - can corrupt data as read accesses through file B won't see
> >     modifications done to the page (and thus eventually the corresponding disk
> >     block) by the HW.
> > 
> > 4) Immediately COW the block during reflink when the corresponding page
> >    cache page is pinned.
> >   - seems as the best solution at this point, although sadly also requires
> >     the most per-filesystem work
> 
> None of the above are acceptable solutions - they all have nasty
> corner cases which are going to be difficult to get right, test,
> etc. IMO, the robust, reliable, testable solution is this:
> 
> 5) The reflink breaks the file lease, the userspace app releases the
> pinned pages on the file and drops the lease. The reflink proceeds,
> does it's work, and then the app gets a new lease on the file. When
> the app pins the pages again, it triggers new ->page_mkwrite calls
> to break any sharing that the reflink created. And if the app fails
> to drop the lease, then we can either fail with a lease related
> error or kill it....

This is certainly fine for the GUP users that are going to support leases.
But do you want GUP in direct IO to create a lease if the pages are from a
file mapping? I believe we need another option at least for GUP references
that are short-term in nature and sometimes also performance critical.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:35                                           ` Jan Kara
@ 2018-12-19 16:56                                             ` Jason Gunthorpe
  2018-12-19 22:33                                             ` Dave Chinner
  1 sibling, 0 replies; 206+ messages in thread
From: Jason Gunthorpe @ 2018-12-19 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > 
> > > > Essentially, what we are talking about is how to handle broken
> > > > hardware. I say we should just brun it with napalm and thermite
> > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > the underlying storage doesn't already require it.
> > > 
> > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > then just do it.
> > 
> > O_DIRECT IO *isn't the problem*.
> 
> That is not true. O_DIRECT IO is a problem. In some aspects it is
> easier than the problem with RDMA but currently O_DIRECT IO can
> crash your machine or corrupt data the same way RDMA can. Just the
> race window is much smaller. So we have to fix the generic GUP
> infrastructure to make O_DIRECT IO work. I agree that fixing RDMA
> will likely require even more work like revokable leases or what
> not.

This is what I've understood, talking to all the experts. Dave? Why do
you think O_DIRECT is actually OK?

I agree the duration issue with RDMA is different, but don't forget,
O_DIRECT goes out to the network too and has potentially very long
timeouts as well.

If O_DIRECT works fine, then let's use the same approach in RDMA??

Jason

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:35                                           ` Jan Kara
  2018-12-19 16:56                                             ` Jason Gunthorpe
@ 2018-12-19 22:33                                             ` Dave Chinner
  2018-12-20  9:07                                               ` Jan Kara
  2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 2 replies; 206+ messages in thread
From: Dave Chinner @ 2018-12-19 22:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Jerome Glisse, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > 
> > > > Essentially, what we are talking about is how to handle broken
> > > > hardware. I say we should just brun it with napalm and thermite
> > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > the underlying storage doesn't already require it.
> > > 
> > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > then just do it.
> > 
> > O_DIRECT IO *isn't the problem*.
> 
> That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> than the problem with RDMA but currently O_DIRECT IO can crash your machine
> or corrupt data the same way RDMA can.

It's not O_DIRECT - it's a "transient page pin". Yes, there are
problems with that right now, but as we've discussed the issues can
be avoided by:

	a) stable pages always blocking in ->page_mkwrite;
	b) blocking in write_cache_pages() on an elevated map count
	when WB_SYNC_ALL is set; and
	c) blocking in truncate_pagecache() on an elevated map
	count.

That prevents:
	a) gup pinning a page that is currently under writeback and
	modifying it while IO is in flight;
	b) a dirty page being written back while it is pinned by
	GUP, thereby turning it clean before the gup reference calls
	set_page_dirty() on DMA completion; and
	c) truncate/hole punch for pulling the page out from under
	the gup operation that is ongoing.
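
As a rough illustration of the three blocking rules and what each one
prevents, here is a small userspace C model. It is only a sketch:
struct page_model, write_fault_must_wait(), writeback_action() and
truncate_must_wait() are hypothetical names invented for this example and
are not kernel functions. Only the WB_SYNC_NONE/WB_SYNC_ALL names mirror
the kernel's enum. Skipping a pinned page under WB_SYNC_NONE is an
assumption; the rules above only spell out the WB_SYNC_ALL case.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's writeback sync modes by name only. */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Hypothetical per-page state; this is not the kernel's struct page. */
struct page_model {
	bool under_writeback;	/* IO on this page is in flight */
	bool pinned;		/* a transient GUP pin is held */
};

enum wb_action { WB_WRITE, WB_SKIP, WB_WAIT_FOR_UNPIN };

/* Rule a): a write fault (->page_mkwrite) waits for in-flight writeback. */
bool write_fault_must_wait(const struct page_model *p)
{
	return p->under_writeback;
}

/* Rule b): data integrity writeback waits for the pin to be dropped. */
enum wb_action writeback_action(const struct page_model *p,
				enum wb_sync_mode mode)
{
	if (!p->pinned)
		return WB_WRITE;
	/* Assumption: background writeback just skips a pinned page. */
	return mode == WB_SYNC_ALL ? WB_WAIT_FOR_UNPIN : WB_SKIP;
}

/* Rule c): truncate/hole punch waits for the pin to be dropped. */
bool truncate_must_wait(const struct page_model *p)
{
	return p->pinned;
}
```

With this model, a pinned page is never cleaned behind the back of the pin
holder: fsync() (WB_SYNC_ALL) waits, and truncate waits as well.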

This is an adequate solution for short term transient pins. It
doesn't break fsync(), it doesn't change how truncate works and it
fixes the problem where a mapped file is the buffer for an O_DIRECT
IO rather than the open fd and that buffer file gets truncated.
IOWs, transient pins (and hence O_DIRECT) are not really the problem
here.

The problem with this is that blocking on elevated map count does
not work for long term pins (i.e. gup_longterm()) which are defined
as:

 * "longterm" == userspace controlled elevated page count lifetime.
 * Contrast this to iov_iter_get_pages() usages which are transient.

It's the "userspace controlled" part of the long term gup pin that
is the problem we need to solve. If we treat them the same as a
transient pin, then this leads to fsync() and truncate either
blocking for a long time waiting for userspace to drop its gup
reference, or having to be failed with something like EBUSY or
EAGAIN.

This is the problem revokable file layout leases solve. The NFS
server is already using this for revoking delegations from remote
clients. Userspace holding long term GUP references is essentially
the same thing - it's a delegation of file ownership to userspace
that the filesystem must be able to revoke when it needs to run
internal and/or 3rd-party requested operations on that delegated
file.
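
For readers unfamiliar with leases: the revocable layout leases discussed
here are a proposal, but the existing fcntl(2) lease API already follows
the same take / get notified / release pattern from userspace. The sketch
below shows only that existing API, as an analogy. The helper names
(take_write_lease(), drop_lease(), current_lease()) are made up for this
example. By default the holder is notified of a conflicting open() or
truncate() via SIGIO, and the kernel breaks the lease forcibly if the
holder does not release or downgrade it within
/proc/sys/fs/lease-break-time seconds.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* F_SETLEASE and F_GETLEASE are Linux-specific */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Take a write lease on fd. Returns 0 on success, -1 with errno set. */
int take_write_lease(int fd)
{
	return fcntl(fd, F_SETLEASE, F_WRLCK);
}

/*
 * Release the lease. A well-behaved holder calls this from its lease
 * break handling, after it has dropped whatever depends on the lease.
 */
int drop_lease(int fd)
{
	return fcntl(fd, F_SETLEASE, F_UNLCK);
}

/* Returns F_RDLCK, F_WRLCK or F_UNLCK, or -1 with errno set. */
int current_lease(int fd)
{
	return fcntl(fd, F_GETLEASE);
}
```

In the long term pin case, the step before drop_lease() would be releasing
the pinned pages, which is what makes the unpin latency depend on how
quickly the application reacts.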

If the hardware supports page faults, then we can further optimise
the long term pin case to relax stable page requirements and allow
page cleaning to occur while there are long term pins. In this case,
the hardware will write-fault the clean pages appropriately before
DMA is initiated, and hence avoid the need for data integrity
operations like fsync() to trigger lease revocation. However,
truncate/hole punch still requires lease revocation to work sanely,
especially when we consider DAX *must* ensure there are no remaining
references to the physical pmem page after the space has been freed.

i.e. conflating the transient and long term gup pins as the same
problem doesn't help anyone. If we fix the short term pin problems,
then the long term pin problem becomes tractable by adding a layer
over the top (i.e.  hardware page fault capability and/or file lease
requirements).  Existing apps and hardware will continue to work -
external operations on the pinned file will simply hang rather than
causing corruption or kernel crashes.  New (or updated) applications
will play nicely with lease revocation and at that point the "long
term pin" basically becomes a transient pin where the unpin latency
is determined by how quickly the app responds to the lease
revocation. And page fault capable hardware will reduce the
occurrence of lease revocations due to data writeback/integrity
operations and behave almost identically to cpu-based mmap accesses
to file backed pages.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 22:33                                             ` Dave Chinner
@ 2018-12-20  9:07                                               ` Jan Kara
  2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 0 replies; 206+ messages in thread
From: Jan Kara @ 2018-12-20  9:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jason Gunthorpe, Jerome Glisse, John Hubbard,
	Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Thu 20-12-18 09:33:12, Dave Chinner wrote:
> On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> > On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > > 
> > > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just brun it with napalm and thermite
> > > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > > the underlying storage doesn't already require it.
> > > > 
> > > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > > then just do it.
> > > 
> > > O_DIRECT IO *isn't the problem*.
> > 
> > That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> > than the problem with RDMA but currently O_DIRECT IO can crash your machine
> > or corrupt data the same way RDMA can.
> 
> It's not O_DIRECT - it's a ""transient page pin". Yes, there are
> problems with that right now, but as we've discussed the issues can
> be avoided by:
> 
> 	a) stable pages always blocking in ->page_mkwrite;
> 	b) blocking in write_cache_pages() on an elevated map count
> 	when WB_SYNC_ALL is set; and
> 	c) blocking in truncate_pagecache() on an elevated map
> 	count.
> 
> That prevents:
> 	a) gup pinning a page that is currently under writeback and
> 	modifying it while IO is in flight;
> 	b) a dirty page being written back while it is pinned by
> 	GUP, thereby turning it clean before the gup reference calls
> 	set_page_dirty() on DMA completion; and

This is not prevented by what you wrote above as currently GUP does not
increase page->_mapcount. Currently, there's no way to distinguish GUP page
reference from any other page reference - GUP simply does get_page() - and
a big part of this thread as I see it is exactly about how to introduce this
distinction and how to convert all GUP users to the new convention safely
(as currently they just pass struct page * pointers around and eventually
do put_page() on them). Increasing page->_mapcount in GUP and trying to
deduce the pin count from that is one option Jerome suggested. At this
point I'm not 100% sure this is going to work but we'll see.

> 	c) truncate/hole punch for pulling the page out from under
> 	the gup operation that is ongoing.
> 
> This is an adequate solution for a short term transient pins. It
> doesn't break fsync(), it doesn't change how truncate works and it
> fixes the problem where a mapped file is the buffer for an O_DIRECT
> IO rather than the open fd and that buffer file gets truncated.
> IOWs, transient pins (and hence O_DIRECT) is not really the problem
> here.

For now, let's assume that the problem of detecting a page pinned by GUP is
somehow solved and we already have page_pinned() implemented. Then
what you suggest can actually create a deadlock AFAICS:

Process 1:						Process 2:

							fsync("file")
/* Evil memory buffer with page order reversed */
addr1 = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, "file", 4096);
addr2 = mmap(addr1+4096, 4096, PROT_WRITE, MAP_SHARED, "file", 0);

/* Fault in pages */
*addr1 = 0;
*addr2 = 0;
							adds page with index 0
							  to bio

fd = open("file2", O_RDWR | O_DIRECT);
read(fd, addr1, 8192)
  -> eventually gets to iov_iter_get_pages() and then to
     get_user_pages_fast().
     -> pins "file" page with index 1
							blocks on pin for
							  page with index 1
     -> blocks in PageWriteback for page with index 0

The possibility of deadlocks like this is why I've decided it will be easier
to just bounce the page for writeback that we cannot avoid, rather than block
the writeback...
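
The scenario above is a wait-for cycle between two tasks, which can be
modelled in a few lines of userspace C. This is only an illustrative
sketch: has_wait_cycle() and the waits_for[] encoding are hypothetical and
have nothing to do with how the kernel detects (or fails to detect) such
deadlocks. Task 0 stands for Process 1 (the O_DIRECT read, holding the pin
on page 1 and waiting for writeback of page 0), task 1 stands for Process 2
(fsync, holding page 0 in its bio and waiting for the pin on page 1).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * waits_for[i] is the index of the task that task i is blocked on,
 * or -1 if task i is runnable. Each task waits on at most one other.
 */
bool has_wait_cycle(const int waits_for[], int ntasks)
{
	for (int start = 0; start < ntasks; start++) {
		int cur = start;

		/* Follow at most ntasks edges; reaching start again is a cycle. */
		for (int hops = 0; hops < ntasks && cur >= 0; hops++) {
			cur = waits_for[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

For the scenario above, waits_for would be { 1, 0 } and the check reports a
cycle. If writeback bounces the pinned page instead of blocking on the pin,
task 1 no longer waits on task 0 (waits_for becomes { 1, -1 }) and the
cycle disappears.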

> The problem with this is that blocking on elevated map count does
> not work for long term pins (i.e. gup_longterm()) which are defined
> as:
> 
>  * "longterm" == userspace controlled elevated page count lifetime.
>  * Contrast this to iov_iter_get_pages() usages which are transient.
> 
> It's the "userspace controlled" part of the long term gup pin that
> is the problem we need to solve. If we treat them the same as a
> transient pin, then this leads to fsync() and truncate either
> blocking for a long time waiting for userspace to drop it's gup
> reference, or having to be failed with something like EBUSY or
> EAGAIN.

I agree. "userspace controlled" pins are another big problem to solve.

> This is the problem revokable file layout leases solve. The NFS
> server is already using this for revoking delegations from remote
> clients. Userspace holding long term GUP references is essentially
> the same thing - it's a delegation of file ownership to userspace
> that the filesystem must be able to revoke when it needs to run
> internal and/or 3rd-party requested operations on that delegated
> file.
> 
> If the hardware supports page faults, then we can further optimise
> the long term pin case to relax stable page requirements and allow
> page cleaning to occur while there are long term pins. In this case,
> the hardware will write-fault the clean pages appropriately before
> DMA is initiated, and hence avoid the need for data integrity
> operations like fsync() to trigger lease revocation. However,
> truncate/hole punch still requires lease revocation to work sanely,
> especially when we consider DAX *must* ensure there are no remaining
> references to the physical pmem page after the space has been freed.
> 
> i.e. conflating the transient and long term gup pins as the same
> problem doesn't help anyone. If we fix the short term pin problems,
> then the long term pin problem become tractable by adding a layer
> over the top (i.e.  hardware page fault capability and/or file lease
> requirements).  Existing apps and hardware will continue to work -
> external operations on the pinned file will simply hang rather than
> causing corruption or kernel crashes.  New (or updated) applications
> will play nicely with lease revocation and at that point the "long
> term pin" basically becomes a transient pin where the unpin latency
> is determined by how quickly the app responds to the lease
> revocation. And page fault capable hardware will reduce the
> occurrence of lease revocations due to data writeback/integrity
> operations and behave almost identically to cpu-based mmap accesses
> to file backed pages.

Agreed. I think we are on the same page wrt this. Just at this point I'm
trying to solve the "transient pin" problem...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
@ 2018-12-20 16:49                                               ` Jerome Glisse
  2019-01-03  1:55                                               ` Jerome Glisse
  2 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > solution for that.  
> > > 
> > > So as I understand it, this would use page->_mapcount to store both the real
> > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > file-backed (non-anonymous) pages:
> > > 
> > > 
> > > __get_user_pages()
> > > {
> > > 	...
> > > 	get_page(page);
> > > 
> > > 	if (!PageAnon)
> > > 		atomic_inc(page->_mapcount);
> > > 	...
> > > }
> > > 
> > > put_user_page(struct page *page)
> > > {
> > > 	...
> > > 	if (!PageAnon)
> > > 		atomic_dec(&page->_mapcount);
> > > 
> > > 	put_page(page);
> > > 	...
> > > }
> > > 
> > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > had in mind?
> > 
> > Mostly, with the extra two observations:
> >     [1] We only need to know the pin count when a write back kicks in
> >     [2] We need to protect GUP code with wait_for_write_back() in case
> >         GUP is racing with a write back that might not the see the
> >         elevated mapcount in time.
> > 
> > So for [2]
> > 
> > __get_user_pages()
> > {
> >     get_page(page);
> > 
> >     if (!PageAnon) {
> >         atomic_inc(page->_mapcount);
> > +       if (PageWriteback(page)) {
> > +           // Assume we are racing and curent write back will not see
> > +           // the elevated mapcount so wait for current write back and
> > +           // force page fault
> > +           wait_on_page_writeback(page);
> > +           // force slow path that will fault again
> > +       }
> >     }
> > }
> 
> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> should also increment page->_mapcount) under PTE lock. So at that point we
> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> block on PTE lock to make PTE read-only and only after going through all
> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> enough synchronization.

This is needed: a file-backed page can be mapped in any number of page
tables, so no single PTE lock is going to protect anything in the end.
Moreover, with GUP fast we really have to assume there is no lock that
forces ordering.

In fact, in the above snippet the mapcount increment should not happen if
there is an ongoing write back.


> > For [1] only needing pin count during write back turns page_mkclean into
> > the perfect spot to check for that so:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> >     // Big fat comment to explain what is going on
> > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > +       SetPageDMAPined(page);
> > +   } else {
> > +       ClearPageDMAPined(page);
> > +   }
> 
> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> check we do in page_mkclean() is wrong?

OK, so I thought about this. Here is what we have:
    mp1 = page_mapcount(page);
    // call rc1 the real mapping count at mp1 time (this is an
    // ideal value that we cannot get)

    rmap_walk(page, &rwc);
    // at this point, call frc the real mapping count found by
    // rmap_walk

    mp2 = page_mapcount(page);
    // call rc2 the real mapping count at mp2 time (this is an
    // ideal value that we cannot get)


So we have
    rc1 >= frc >= rc2
    pc1 = mp1 - rc1     // pin count at mp1 time
    pc2 = mp2 - rc2     // pin count at mp2 time

So we have:
    mp1 - rc1 <= mp1 - frc
    mp2 - rc2 >= mp2 - frc

From the above:
    mp1 - frc <  0 impossible value: mapcount can only go down, so
                   frc <= mp1
    mp1 - frc == 0 -> the page is not pinned
U1  mp1 - frc >  0 -> the page might be pinned

U2  mp2 - frc <= 0 -> the page might be pinned
    mp2 - frc >  0 -> the page is pinned

There are two unknowns, [U1] and [U2]:
    [U1]    a zap raced before rmap_walk() could account for the zapped
            mapping (frc < rc1)
    [U2]    a zap raced after rmap_walk() accounted for the zapped
            mapping (frc > rc2)

In both cases we can detect the race, but we cannot ascertain whether the
page is pinned or not.

So we can do 2 things here:
    - try to recount the real mappings (this is bound to terminate, as no
      new mapping can be added and thus mapcount can only go down)
    - assume the page is pinned (a possible false positive) and needlessly
      bounce a page that would not need bouncing if we were not unlucky

We could mitigate this with a flag: GUP unconditionally sets it, and
page_mkclean clears it when mp1 - frc == 0. This way we never bounce a
page that was never GUPed, but we might keep bouncing a page that was
GUPed once in its lifetime until page_mkclean runs without a race on it.

I will ponder a bit more and see if I can get an idea on how to close
that race, i.e. either close U1 or close U2.
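
To make the case analysis above concrete, here is a minimal userspace
sketch of the decision it leads to. Nothing here is kernel code:
enum pin_state and classify_pin() are hypothetical names, and the three
integers are simply the mp1, frc and mp2 samples defined above. Checking
the "definitely pinned" case first is a choice made for this sketch, so
that a pin observed at mp2 time takes precedence.

```c
#include <assert.h>

enum pin_state {
	PIN_NONE,	/* mp1 - frc == 0: not pinned at mp1 time */
	PIN_DEFINITE,	/* mp2 - frc >  0: pinned at mp2 time */
	PIN_UNKNOWN,	/* [U1]/[U2]: a zap raced with the rmap walk */
};

/*
 * mp1: page_mapcount() sampled before the rmap walk
 * frc: number of real mappings found by the rmap walk
 * mp2: page_mapcount() sampled after the rmap walk
 */
enum pin_state classify_pin(int mp1, int frc, int mp2)
{
	/* Real mappings can only go down, so frc can never exceed mp1. */
	assert(frc <= mp1);

	if (mp2 - frc > 0)
		return PIN_DEFINITE;
	if (mp1 - frc == 0)
		return PIN_NONE;
	return PIN_UNKNOWN;
}
```

A caller would either retry the walk or conservatively bounce the page on
PIN_UNKNOWN, which are the two options listed above.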


> >     // Maybe we want to leverage the int nature of return value so that
> >     // we can express more than cleaned/truncated and express cleaned/
> >     // truncated/pinned for benefit of caller and that way we do not
> >     // even need one bit as page flags above.
> > 
> >     return cleaned;
> > }
> > 
> > You do not want to change page_mapped() i do not see a need for that.
> > 
> > Then the whole discussion between Jan and Dave seems to indicate that
> > the bounce mechanism will need to be in the fs layer and that we can
> > not reuse the bio bounce mechanism. This means that more work is needed
> > at the fs level for that (so that fs do not freak on bounce page).
> > 
> > Note that they are few gotcha where we need to preserve the pin count
> > ie mostly in truncate code path that can remove page from page cache
> > and overwrite the mapcount in the process, this would need to be fixed
> > to not overwrite mapcount so that put_user_page does not set the map
> > count to an invalid value turning the page into a bad state that will
> > at one point trigger kernel BUG_ON();
> >
> > I am not saying block truncate, i am saying make sure it does not
> > erase pin count and keep truncating happily. The how to handle truncate
> > is a per existing GUP user discussion to see what they want to do for
> > that.
> > 
> > Obviously a bit deeper analysis of all spot that use mapcount is needed
> > to check that we are not breaking anything but from the top of my head
> > i can not think of anything bad (migrate will abort and other things will
> > assume the page is mapped even it is only in hardware page table, ...).
> 
> Hum, grepping for page_mapped() and page_mapcount(), this is actually going
> to be non-trivial to get right AFAICT.

No, that's not that scary: a good chunk of all those are for anonymous
memory, and many are obvious (like migrate, ksm, ...).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-20 10:54                                               ` John Hubbard
@ 2018-12-20 16:50                                                 ` Jerome Glisse
  2018-12-20 16:57                                                   ` Dan Williams
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> On 12/19/18 3:08 AM, Jan Kara wrote:
> > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> >>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> >>> solution for that.  
> >>>
> >>> So as I understand it, this would use page->_mapcount to store both the real
> >>> mapcount, and the dma pinned count (simply added together), but only do so for
> >>> file-backed (non-anonymous) pages:
> >>>
> >>>
> >>> __get_user_pages()
> >>> {
> >>> 	...
> >>> 	get_page(page);
> >>>
> >>> 	if (!PageAnon)
> >>> 		atomic_inc(page->_mapcount);
> >>> 	...
> >>> }
> >>>
> >>> put_user_page(struct page *page)
> >>> {
> >>> 	...
> >>> 	if (!PageAnon)
> >>> 		atomic_dec(&page->_mapcount);
> >>>
> >>> 	put_page(page);
> >>> 	...
> >>> }
> >>>
> >>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> >>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> >>> had in mind?
> >>
> >> Mostly, with the extra two observations:
> >>     [1] We only need to know the pin count when a write back kicks in
> >>     [2] We need to protect GUP code with wait_for_write_back() in case
> >>         GUP is racing with a write back that might not the see the
> >>         elevated mapcount in time.
> >>
> >> So for [2]
> >>
> >> __get_user_pages()
> >> {
> >>     get_page(page);
> >>
> >>     if (!PageAnon) {
> >>         atomic_inc(page->_mapcount);
> >> +       if (PageWriteback(page)) {
> >> +           // Assume we are racing and curent write back will not see
> >> +           // the elevated mapcount so wait for current write back and
> >> +           // force page fault
> >> +           wait_on_page_writeback(page);
> >> +           // force slow path that will fault again
> >> +       }
> >>     }
> >> }
> > 
> > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > should also increment page->_mapcount) under PTE lock. So at that point we
> > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > block on PTE lock to make PTE read-only and only after going through all
> > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > enough synchronization.
> > 
> >> For [1] only needing pin count during write back turns page_mkclean into
> >> the perfect spot to check for that so:
> >>
> >> int page_mkclean(struct page *page)
> >> {
> >>     int cleaned = 0;
> >> +   int real_mapcount = 0;
> >>     struct address_space *mapping;
> >>     struct rmap_walk_control rwc = {
> >>         .arg = (void *)&cleaned,
> >>         .rmap_one = page_mkclean_one,
> >>         .invalid_vma = invalid_mkclean_vma,
> >> +       .mapcount = &real_mapcount,
> >>     };
> >>
> >>     BUG_ON(!PageLocked(page));
> >>
> >>     if (!page_mapped(page))
> >>         return 0;
> >>
> >>     mapping = page_mapping(page);
> >>     if (!mapping)
> >>         return 0;
> >>
> >>     // rmap_walk need to change to count mapping and return value
> >>     // in .mapcount easy one
> >>     rmap_walk(page, &rwc);
> >>
> >>     // Big fat comment to explain what is going on
> >> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> >> +       SetPageDMAPined(page);
> >> +   } else {
> >> +       ClearPageDMAPined(page);
> >> +   }
> > 
> > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > check we do in page_mkclean() is wrong?
> 
> Right. This looks like a dead end, after all. We can't lock a whole chunk 
> of "all these are mapped, hold still while we count you" pages. It's not
> designed to allow that at all.
> 
> IMHO, we are now back to something like dynamic_page, which provides an
> independent dma pinned count. 

I will keep looking, because allocating a structure for every GUP is
insane to me. There are users out there that are GUPing gigabytes of data,
and it is going to waste tons of memory just to fix crappy hardware.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 22:33                                             ` Dave Chinner
  2018-12-20  9:07                                               ` Jan Kara
@ 2018-12-20 16:54                                               ` Jerome Glisse
  1 sibling, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jason Gunthorpe, John Hubbard, Matthew Wilcox,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Thu, Dec 20, 2018 at 09:33:12AM +1100, Dave Chinner wrote:
> On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> > On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > > 
> > > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just brun it with napalm and thermite
> > > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > > the underlying storage doesn't already require it.
> > > > 
> > > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > > then just do it.
> > > 
> > > O_DIRECT IO *isn't the problem*.
> > 
> > That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> > than the problem with RDMA but currently O_DIRECT IO can crash your machine
> > or corrupt data the same way RDMA can.
> 
> It's not O_DIRECT - it's a ""transient page pin". Yes, there are
> problems with that right now, but as we've discussed the issues can
> be avoided by:
> 
> 	a) stable pages always blocking in ->page_mkwrite;
> 	b) blocking in write_cache_pages() on an elevated map count
> 	when WB_SYNC_ALL is set; and
> 	c) blocking in truncate_pagecache() on an elevated map
> 	count.
> 
> That prevents:
> 	a) gup pinning a page that is currently under writeback and
> 	modifying it while IO is in flight;
> 	b) a dirty page being written back while it is pinned by
> 	GUP, thereby turning it clean before the gup reference calls
> 	set_page_dirty() on DMA completion; and
> 	c) truncate/hole punch for pulling the page out from under
> 	the gup operation that is ongoing.
> 
> This is an adequate solution for a short term transient pins. It
> doesn't break fsync(), it doesn't change how truncate works and it
> fixes the problem where a mapped file is the buffer for an O_DIRECT
> IO rather than the open fd and that buffer file gets truncated.
> IOWs, transient pins (and hence O_DIRECT) is not really the problem
> here.
> 
> The problem with this is that blocking on elevated map count does
> not work for long term pins (i.e. gup_longterm()) which are defined
> as:
> 
>  * "longterm" == userspace controlled elevated page count lifetime.
>  * Contrast this to iov_iter_get_pages() usages which are transient.
> 
> It's the "userspace controlled" part of the long term gup pin that
> is the problem we need to solve. If we treat them the same as a
> transient pin, then this leads to fsync() and truncate either
> blocking for a long time waiting for userspace to drop its gup
> reference, or having to be failed with something like EBUSY or
> EAGAIN.
> 
> This is the problem revokable file layout leases solve. The NFS
> server is already using this for revoking delegations from remote
> clients. Userspace holding long term GUP references is essentially
> the same thing - it's a delegation of file ownership to userspace
> that the filesystem must be able to revoke when it needs to run
> internal and/or 3rd-party requested operations on that delegated
> file.
> 
> If the hardware supports page faults, then we can further optimise
> the long term pin case to relax stable page requirements and allow
> page cleaning to occur while there are long term pins. In this case,
> the hardware will write-fault the clean pages appropriately before
> DMA is initiated, and hence avoid the need for data integrity
> operations like fsync() to trigger lease revocation. However,
> truncate/hole punch still requires lease revocation to work sanely,
> especially when we consider DAX *must* ensure there are no remaining
> references to the physical pmem page after the space has been freed.

truncate does not require lease revocation for faulting hardware:
truncate will trigger an mmu notifier callback which will invalidate
the hardware page table. On the next access the hardware will fault, and
this will turn into a regular page fault from the kernel's point of view.

So truncate/reflink and all fs expectations for faulting hardware do
hold. It is exactly like the CPU page table: if the CPU page table is
properly updated then so will be the hardware one.

Note that such hardware also abides by munmap(), so the hardware mapping
does not outlive the vma.
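
To make the invalidate-then-refault sequence concrete, here is a small
userspace toy model in C. It is only a sketch under stated assumptions:
toy_dev_table, toy_invalidate_range(), toy_truncate() and toy_dev_access()
are made-up names for illustration, and none of this is the kernel's mmu
notifier API. It models one valid bit per file page on the device side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical device-side page table: one valid bit per file page. */
struct toy_dev_table {
	bool valid[TOY_NPAGES];
	int faults;		/* number of device faults taken so far */
};

/*
 * Stand-in for the callback a driver would get when the CPU side
 * unmaps pages [start, end): drop the matching device entries.
 */
static void toy_invalidate_range(struct toy_dev_table *t,
				 size_t start, size_t end)
{
	for (size_t i = start; i < end && i < TOY_NPAGES; i++)
		t->valid[i] = false;
}

/* Shrink the toy file; device entries past the new size go away first. */
static void toy_truncate(struct toy_dev_table *t, size_t *file_npages,
			 size_t new_npages)
{
	if (new_npages < *file_npages)
		toy_invalidate_range(t, new_npages, *file_npages);
	*file_npages = new_npages;
}

/*
 * Device access: a missing entry takes a fault, and the fault only
 * succeeds inside the file, like a CPU fault past EOF would fail.
 */
static bool toy_dev_access(struct toy_dev_table *t, size_t file_npages,
			   size_t idx)
{
	if (idx >= TOY_NPAGES)
		return false;
	if (t->valid[idx])
		return true;
	t->faults++;
	if (idx >= file_npages)
		return false;
	t->valid[idx] = true;
	return true;
}
```

In this model, a device that has touched page 3 of a 4-page file loses
that entry when the file is truncated to 2 pages, and its next access to
page 3 faults and fails instead of reaching freed space.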


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-20 16:50                                                 ` Jerome Glisse
@ 2018-12-20 16:57                                                   ` Dan Williams
  0 siblings, 0 replies; 206+ messages in thread
From: Dan Williams @ 2018-12-20 16:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 20, 2018 at 8:50 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> > On 12/19/18 3:08 AM, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > >>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > >>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> > >>> solution for that.
> > >>>
> > >>> So as I understand it, this would use page->_mapcount to store both the real
> > >>> mapcount, and the dma pinned count (simply added together), but only do so for
> > >>> file-backed (non-anonymous) pages:
> > >>>
> > >>>
> > >>> __get_user_pages()
> > >>> {
> > >>>   ...
> > >>>   get_page(page);
> > >>>
> > >>>   if (!PageAnon)
> > >>>           atomic_inc(page->_mapcount);
> > >>>   ...
> > >>> }
> > >>>
> > >>> put_user_page(struct page *page)
> > >>> {
> > >>>   ...
> > >>>   if (!PageAnon)
> > >>>           atomic_dec(&page->_mapcount);
> > >>>
> > >>>   put_page(page);
> > >>>   ...
> > >>> }
> > >>>
> > >>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > >>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you
> > >>> had in mind?
> > >>
> > >> Mostly, with the extra two observations:
> > >>     [1] We only need to know the pin count when a write back kicks in
> > >>     [2] We need to protect GUP code with wait_for_write_back() in case
> > >>         GUP is racing with a write back that might not the see the
> > >>         elevated mapcount in time.
> > >>
> > >> So for [2]
> > >>
> > >> __get_user_pages()
> > >> {
> > >>     get_page(page);
> > >>
> > >>     if (!PageAnon) {
> > >>         atomic_inc(page->_mapcount);
> > >> +       if (PageWriteback(page)) {
> > >> +           // Assume we are racing and curent write back will not see
> > >> +           // the elevated mapcount so wait for current write back and
> > >> +           // force page fault
> > >> +           wait_on_page_writeback(page);
> > >> +           // force slow path that will fault again
> > >> +       }
> > >>     }
> > >> }
> > >
> > > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > > should also increment page->_mapcount) under PTE lock. So at that point we
> > > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > > block on PTE lock to make PTE read-only and only after going through all
> > > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > > enough synchronization.
> > >
> > >> For [1] only needing pin count during write back turns page_mkclean into
> > >> the perfect spot to check for that so:
> > >>
> > >> int page_mkclean(struct page *page)
> > >> {
> > >>     int cleaned = 0;
> > >> +   int real_mapcount = 0;
> > >>     struct address_space *mapping;
> > >>     struct rmap_walk_control rwc = {
> > >>         .arg = (void *)&cleaned,
> > >>         .rmap_one = page_mkclean_one,
> > >>         .invalid_vma = invalid_mkclean_vma,
> > >> +       .mapcount = &real_mapcount,
> > >>     };
> > >>
> > >>     BUG_ON(!PageLocked(page));
> > >>
> > >>     if (!page_mapped(page))
> > >>         return 0;
> > >>
> > >>     mapping = page_mapping(page);
> > >>     if (!mapping)
> > >>         return 0;
> > >>
> > >>     // rmap_walk need to change to count mapping and return value
> > >>     // in .mapcount easy one
> > >>     rmap_walk(page, &rwc);
> > >>
> > >>     // Big fat comment to explain what is going on
> > >> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > >> +       SetPageDMAPined(page);
> > >> +   } else {
> > >> +       ClearPageDMAPined(page);
> > >> +   }
> > >
> > > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > > check we do in page_mkclean() is wrong?
> >
> > Right. This looks like a dead end, after all. We can't lock a whole chunk
> > of "all these are mapped, hold still while we count you" pages. It's not
> > designed to allow that at all.
> >
> > IMHO, we are now back to something like dynamic_page, which provides an
> > independent dma pinned count.
>
> I will keep looking because allocating a structure for every GUP is
> insane to me; there are users out there that are GUPing gigabytes of data

This is not the common case.

> and it is going to waste tons of memory just to fix crappy hardware.

This is the common case.

Please refrain from the hyperbolic assessments.

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-05 14:08     ` David Laight
@ 2018-12-28  8:37       ` Pavel Machek
  0 siblings, 0 replies; 206+ messages in thread
From: Pavel Machek @ 2018-12-28  8:37 UTC (permalink / raw)
  To: David Laight
  Cc: John Hubbard, john.hubbard, Andrew Morton, linux-mm, Jan Kara,
	Tom Talpey, Al Viro, Christian Benvenuti, Christoph Hellwig,
	Christopher Lameter, Dan Williams, Dennis Dalessandro,
	Doug Ledford, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox,
	Michal Hocko, Mike Marciniszyn, Ralph Campbell, LKML,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 951 bytes --]

Hi!

> > This "patch 0000" is not a commit message, as it never shows up in git log.
> > Each of the follow-up patches does have details about the changes it makes.
> 
> I think you should still describe the change - at least in summary.
> 
> The patch I looked at didn't really...
> IIRC it still referred to external links.
> 
> > But maybe you are really asking for more background information, which I
> > should have added in this cover letter. Here's a start:
> > 
> > https://lore.kernel.org/r/20181110085041.10071-1-jhubbard@nvidia.com
> 
> Yes, but links go stale....

It should really explain what the end goal is... and not even the
20181110085041.10071-1-jhubbard@nvidia.com explains that.

It seems you are introducing a small slowdown to simplify something...?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-19 11:08                                             ` Jan Kara
  2018-12-20 10:54                                               ` John Hubbard
  2018-12-20 16:49                                               ` Jerome Glisse
@ 2019-01-03  1:55                                               ` Jerome Glisse
  2019-01-03  3:27                                                 ` John Hubbard
  2019-01-03  9:26                                                 ` Jan Kara
  2 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-03  1:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > solution for that.  
> > > 
> > > So as I understand it, this would use page->_mapcount to store both the real
> > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > file-backed (non-anonymous) pages:
> > > 
> > > 
> > > __get_user_pages()
> > > {
> > > 	...
> > > 	get_page(page);
> > > 
> > > 	if (!PageAnon)
> > > 		atomic_inc(page->_mapcount);
> > > 	...
> > > }
> > > 
> > > put_user_page(struct page *page)
> > > {
> > > 	...
> > > 	if (!PageAnon)
> > > 		atomic_dec(&page->_mapcount);
> > > 
> > > 	put_page(page);
> > > 	...
> > > }
> > > 
> > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > had in mind?
> > 
> > Mostly, with the extra two observations:
> >     [1] We only need to know the pin count when a write back kicks in
> >     [2] We need to protect GUP code with wait_for_write_back() in case
> >         GUP is racing with a write back that might not the see the
> >         elevated mapcount in time.
> > 
> > So for [2]
> > 
> > __get_user_pages()
> > {
> >     get_page(page);
> > 
> >     if (!PageAnon) {
> >         atomic_inc(page->_mapcount);
> > +       if (PageWriteback(page)) {
> > +           // Assume we are racing and curent write back will not see
> > +           // the elevated mapcount so wait for current write back and
> > +           // force page fault
> > +           wait_on_page_writeback(page);
> > +           // force slow path that will fault again
> > +       }
> >     }
> > }
> 
> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> should also increment page->_mapcount) under PTE lock. So at that point we
> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> block on PTE lock to make PTE read-only and only after going through all
> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> enough synchronization.
> 
> > For [1] only needing pin count during write back turns page_mkclean into
> > the perfect spot to check for that so:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> >     // Big fat comment to explain what is going on
> > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > +       SetPageDMAPined(page);
> > +   } else {
> > +       ClearPageDMAPined(page);
> > +   }
> 
> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> check we do in page_mkclean() is wrong?
> 

Ok, so I found a solution for that. First, GUP must wait for racing
write back. If GUP sees a valid write-able PTE and the page has the
write back flag set, then it must back off as if the PTE was not
valid, to force a fault. It is just a race with page_mkclean and we
want ordering between the two. Note this is not strictly needed,
so we can relax that, but I believe this ordering is better done
in GUP rather than having every single user of GUP test for this
to avoid the race.

GUP increases the mapcount only after checking that it is not racing
with writeback; it also sets a page flag (SetPageDMAPined(page)).

When clearing a write-able pte we set a special entry inside the
page table (might need a new special swap type for this) and change
page_mkclean_one() to clear those special entries to 0.


Now page_mkclean:

int page_mkclean(struct page *page)
{
    int cleaned = 0;
+   int real_mapcount = 0;
    struct address_space *mapping;
    struct rmap_walk_control rwc = {
        .arg = (void *)&cleaned,
        .rmap_one = page_mkclean_one,
        .invalid_vma = invalid_mkclean_vma,
+       .mapcount = &real_mapcount,
    };
+   int mapcount1, mapcount2;

    BUG_ON(!PageLocked(page));

    if (!page_mapped(page))
        return 0;

    mapping = page_mapping(page);
    if (!mapping)
        return 0;

+   mapcount1 = page_mapcount(page);

    // rmap_walk need to change to count mapping and return value
    // in .mapcount easy one
    rmap_walk(page, &rwc);

+   if (PageDMAPined(page)) {
+       int rc2;
+
+       if (mapcount1 == real_mapcount) {
+           /* Page is no longer pinned, no zap pte race */
+           ClearPageDMAPined(page);
+           goto out;
+       }
+       /* No new mapping of the page, so mapcount1 < real_mapcount is illegal. */
+       VM_BUG_ON(mapcount1 < real_mapcount);
+       /* Page might be pinned. */
+       mapcount2 = page_mapcount(page);
+       if (mapcount2 > real_mapcount) {
+           /* Page is pinned for sure. */
+           goto out;
+       }
+       /* We raced with zap pte, so we need to walk again. */
+       rc2 = real_mapcount;
+       real_mapcount = 0;
+       rwc.rmap_one = page_pin_one;
+       rmap_walk(page, &rwc);
+       if (mapcount2 <= (real_mapcount + rc2)) {
+           /* Page is no longer pinned */
+           ClearPageDMAPined(page);
+       }
+       /* At this point the page pin flag reflects the pin status of the page */
+   }
+
+out:
    ...
}

The page_pin_one() function counts the number of special PTE entries,
which matches the count of ptes that have been zapped since the first
reverse map walk.
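
The marker bookkeeping can be sketched in plain userspace C. This is a toy
model, not kernel code: enum toy_pte and the toy_* helpers are hypothetical
names, every mapped slot is treated as write-able, and a flat array stands
in for all the page table entries that map one page. It only shows how a
zap leaves a marker behind, how the first walk resets stale markers, and
how a second walk counts the zaps that happened in between.

```c
#include <assert.h>
#include <stddef.h>

/* Toy PTE states; TOY_ZAP_MARKER stands in for the special entry. */
enum toy_pte { TOY_NONE = 0, TOY_MAPPED, TOY_ZAP_MARKER };

/* First walk: count live mappings and reset stale markers to TOY_NONE. */
static int toy_walk_mkclean(enum toy_pte *ptes, size_t n)
{
	int mapped = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] == TOY_MAPPED)
			mapped++;
		else if (ptes[i] == TOY_ZAP_MARKER)
			ptes[i] = TOY_NONE;
	}
	return mapped;
}

/* Zap one slot: a live mapping is replaced by the marker. */
static void toy_zap(enum toy_pte *ptes, size_t i)
{
	if (ptes[i] == TOY_MAPPED)
		ptes[i] = TOY_ZAP_MARKER;
}

/* Second walk: count the markers left since the first walk. */
static int toy_walk_count_zapped(const enum toy_pte *ptes, size_t n)
{
	int zapped = 0;

	for (size_t i = 0; i < n; i++)
		if (ptes[i] == TOY_ZAP_MARKER)
			zapped++;
	return zapped;
}
```

With two live slots and one stale marker, the first walk returns 2 and
clears the marker; a toy_zap() after that makes the second walk return 1.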

So worst case, a page that was pinned by GUP would need 2 reverse map
walks during page_mkclean(). Moreover, this is only needed if we race
with something that clears a pte. I believe this is an acceptable worst
case. I will work on an RFC patchset next week (once I am done with
email catch up).


I do not think I made a mistake here; I have been torturing my mind
trying to think of any race scenario and I believe it holds for any
racing zap and page_mkclean().

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  1:55                                               ` Jerome Glisse
@ 2019-01-03  3:27                                                 ` John Hubbard
  2019-01-03 14:57                                                   ` Jerome Glisse
  2019-01-03  9:26                                                 ` Jan Kara
  1 sibling, 1 reply; 206+ messages in thread
From: John Hubbard @ 2019-01-03  3:27 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/2/19 5:55 PM, Jerome Glisse wrote:
> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>> solution for that.  
>>>>
>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>> file-backed (non-anonymous) pages:
>>>>
>>>>
>>>> __get_user_pages()
>>>> {
>>>> 	...
>>>> 	get_page(page);
>>>>
>>>> 	if (!PageAnon)
>>>> 		atomic_inc(page->_mapcount);
>>>> 	...
>>>> }
>>>>
>>>> put_user_page(struct page *page)
>>>> {
>>>> 	...
>>>> 	if (!PageAnon)
>>>> 		atomic_dec(&page->_mapcount);
>>>>
>>>> 	put_page(page);
>>>> 	...
>>>> }
>>>>
>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>>> had in mind?
>>>
>>> Mostly, with the extra two observations:
>>>     [1] We only need to know the pin count when a write back kicks in
>>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>>         GUP is racing with a write back that might not the see the
>>>         elevated mapcount in time.
>>>
>>> So for [2]
>>>
>>> __get_user_pages()
>>> {
>>>     get_page(page);
>>>
>>>     if (!PageAnon) {
>>>         atomic_inc(page->_mapcount);
>>> +       if (PageWriteback(page)) {
>>> +           // Assume we are racing and curent write back will not see
>>> +           // the elevated mapcount so wait for current write back and
>>> +           // force page fault
>>> +           wait_on_page_writeback(page);
>>> +           // force slow path that will fault again
>>> +       }
>>>     }
>>> }
>>
>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>> should also increment page->_mapcount) under PTE lock. So at that point we
>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>> block on PTE lock to make PTE read-only and only after going through all
>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>> enough synchronization.
>>
>>> For [1] only needing pin count during write back turns page_mkclean into
>>> the perfect spot to check for that so:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>>     // rmap_walk need to change to count mapping and return value
>>>     // in .mapcount easy one
>>>     rmap_walk(page, &rwc);
>>>
>>>     // Big fat comment to explain what is going on
>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>> +       SetPageDMAPined(page);
>>> +   } else {
>>> +       ClearPageDMAPined(page);
>>> +   }
>>
>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>> check we do in page_mkclean() is wrong?
>>
> 
> Ok so i found a solution for that. First GUP must wait for racing
> write back. If GUP see a valid write-able PTE and the page has
> write back flag set then it must back of as if the PTE was not
> valid to force fault. It is just a race with page_mkclean and we
> want ordering between the two. Note this is not strictly needed
> so we can relax that but i believe this ordering is better to do
> in GUP rather then having each single user of GUP test for this
> to avoid the race.
> 
> GUP increase mapcount only after checking that it is not racing
> with writeback it also set a page flag (SetPageDMAPined(page)).
> 
> When clearing a write-able pte we set a special entry inside the
> page table (might need a new special swap type for this) and change
> page_mkclean_one() to clear to 0 those special entry.
> 
> 
> Now page_mkclean:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> +   int mapcount1, mapcount2;
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
> +   mapcount1 = page_mapcount(page);
> 
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);
> 
> +   if (PageDMAPined(page)) {
> +       int rc2;
> +
> +       if (mapcount1 == real_count) {
> +           /* Page is no longer pin, no zap pte race */
> +           ClearPageDMAPined(page);
> +           goto out;
> +       }
> +       /* No new mapping of the page so mp1 < rc is illegal. */
> +       VM_BUG_ON(mapcount1 < real_count);
> +       /* Page might be pin. */
> +       mapcount2 = page_mapcount(page);
> +       if (mapcount2 > real_count) {
> +           /* Page is pin for sure. */
> +           goto out;
> +       }
> +       /* We had a race with zap pte we need to rewalk again. */
> +       rc2 = real_mapcount;
> +       real_mapcount = 0;
> +       rwc.rmap_one = page_pin_one;
> +       rmap_walk(page, &rwc);
> +       if (mapcount2 <= (real_count + rc2)) {
> +           /* Page is no longer pin */
> +           ClearPageDMAPined(page);
> +       }
> +       /* At this point the page pin flag reflect pin status of the page */

Until...what? In other words, what is providing synchronization here?

thanks,
-- 
John Hubbard
NVIDIA

> +   }
> +
> +out:
>     ...
> }
> 
> The page_pin_one() function count the number of special PTE entry so
> which match the count of pte that have been zapped since the first
> reverse map walk.
> 
> So worst case a page that was pin by a GUP would need 2 reverse map
> walk during page_mkclean(). Moreover this is only needed if we race
> with something that clear pte. I believe this is an acceptable worst
> case. I will work on some RFC patchset next week (once i am down with
> email catch up).
> 
> 
> I do not think i made mistake here, i have been torturing my mind
> trying to think of any race scenario and i believe it holds to any
> racing zap and page_mkclean()
> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  1:55                                               ` Jerome Glisse
  2019-01-03  3:27                                                 ` John Hubbard
@ 2019-01-03  9:26                                                 ` Jan Kara
  2019-01-03 14:44                                                   ` Jerome Glisse
  1 sibling, 1 reply; 206+ messages in thread
From: Jan Kara @ 2019-01-03  9:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, Dave Chinner,
	Dan Williams, John Hubbard, Andrew Morton, Linux MM, tom,
	Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > > solution for that.  
> > > > 
> > > > So as I understand it, this would use page->_mapcount to store both the real
> > > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > > file-backed (non-anonymous) pages:
> > > > 
> > > > 
> > > > __get_user_pages()
> > > > {
> > > > 	...
> > > > 	get_page(page);
> > > > 
> > > > 	if (!PageAnon)
> > > > 		atomic_inc(page->_mapcount);
> > > > 	...
> > > > }
> > > > 
> > > > put_user_page(struct page *page)
> > > > {
> > > > 	...
> > > > 	if (!PageAnon)
> > > > 		atomic_dec(&page->_mapcount);
> > > > 
> > > > 	put_page(page);
> > > > 	...
> > > > }
> > > > 
> > > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > > had in mind?
> > > 
> > > Mostly, with the extra two observations:
> > >     [1] We only need to know the pin count when a write back kicks in
> > >     [2] We need to protect GUP code with wait_for_write_back() in case
> > >         GUP is racing with a write back that might not the see the
> > >         elevated mapcount in time.
> > > 
> > > So for [2]
> > > 
> > > __get_user_pages()
> > > {
> > >     get_page(page);
> > > 
> > >     if (!PageAnon) {
> > >         atomic_inc(page->_mapcount);
> > > +       if (PageWriteback(page)) {
> > > +           // Assume we are racing and curent write back will not see
> > > +           // the elevated mapcount so wait for current write back and
> > > +           // force page fault
> > > +           wait_on_page_writeback(page);
> > > +           // force slow path that will fault again
> > > +       }
> > >     }
> > > }
> > 
> > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > should also increment page->_mapcount) under PTE lock. So at that point we
> > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > block on PTE lock to make PTE read-only and only after going through all
> > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > enough synchronization.
> > 
> > > For [1] only needing pin count during write back turns page_mkclean into
> > > the perfect spot to check for that so:
> > > 
> > > int page_mkclean(struct page *page)
> > > {
> > >     int cleaned = 0;
> > > +   int real_mapcount = 0;
> > >     struct address_space *mapping;
> > >     struct rmap_walk_control rwc = {
> > >         .arg = (void *)&cleaned,
> > >         .rmap_one = page_mkclean_one,
> > >         .invalid_vma = invalid_mkclean_vma,
> > > +       .mapcount = &real_mapcount,
> > >     };
> > > 
> > >     BUG_ON(!PageLocked(page));
> > > 
> > >     if (!page_mapped(page))
> > >         return 0;
> > > 
> > >     mapping = page_mapping(page);
> > >     if (!mapping)
> > >         return 0;
> > > 
> > >     // rmap_walk need to change to count mapping and return value
> > >     // in .mapcount easy one
> > >     rmap_walk(page, &rwc);
> > > 
> > >     // Big fat comment to explain what is going on
> > > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > > +       SetPageDMAPined(page);
> > > +   } else {
> > > +       ClearPageDMAPined(page);
> > > +   }
> > 
> > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > check we do in page_mkclean() is wrong?
> > 
> 
> Ok so i found a solution for that. First GUP must wait for racing
> write back. If GUP see a valid write-able PTE and the page has
> write back flag set then it must back of as if the PTE was not
> valid to force fault. It is just a race with page_mkclean and we
> want ordering between the two. Note this is not strictly needed
> so we can relax that but i believe this ordering is better to do
> in GUP rather then having each single user of GUP test for this
> to avoid the race.
> 
> GUP increase mapcount only after checking that it is not racing
> with writeback it also set a page flag (SetPageDMAPined(page)).
> 
> When clearing a write-able pte we set a special entry inside the
> page table (might need a new special swap type for this) and change
> page_mkclean_one() to clear to 0 those special entry.
> 
> 
> Now page_mkclean:
> 
> int page_mkclean(struct page *page)
> {
>     int cleaned = 0;
> +   int real_mapcount = 0;
>     struct address_space *mapping;
>     struct rmap_walk_control rwc = {
>         .arg = (void *)&cleaned,
>         .rmap_one = page_mkclean_one,
>         .invalid_vma = invalid_mkclean_vma,
> +       .mapcount = &real_mapcount,
>     };
> +   int mapcount1, mapcount2;
> 
>     BUG_ON(!PageLocked(page));
> 
>     if (!page_mapped(page))
>         return 0;
> 
>     mapping = page_mapping(page);
>     if (!mapping)
>         return 0;
> 
> +   mapcount1 = page_mapcount(page);
>     // rmap_walk need to change to count mapping and return value
>     // in .mapcount easy one
>     rmap_walk(page, &rwc);

So what prevents GUP_fast() from grabbing a reference here, so that the
test below would think the page is not pinned? Or do you assume that every
page_mkclean() call will be protected by PageWriteback (currently it is
not) so that GUP_fast() blocks / bails out?

But I think that detecting pinned pages with a small false positive rate is
OK. The extra page bouncing will cost some performance but if it is rare,
then we are OK. So I think we can go for the simple version of detecting
pinned pages as you mentioned in some earlier email. We just have to be
sure there are no false negatives.
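
The difference between a false positive and a false negative can be shown
with toy arithmetic in C. This is a sketch under assumptions, not kernel
behaviour: the helper names are made up, the mapcount is modelled as
"real mappings + pins", the only racing event is a zap that drops a
mapping, and a GUP_fast() pin arriving after the sample is not modelled.
The simple check treats anything the walk did not account for as a pin.

```c
#include <assert.h>
#include <stdbool.h>

/* Simple check: whatever exceeds the walked mappings is taken as a pin. */
static bool toy_looks_pinned(int mapcount_sample, int walked)
{
	return mapcount_sample - walked > 0;
}

/* Order A: sample the mapcount, then zaps happen, then the walk runs. */
static bool toy_sample_before_walk(int mappings, int pins, int zapped)
{
	int sample = mappings + pins;
	int walked = mappings - zapped;

	return toy_looks_pinned(sample, walked);
}

/* Order B: the walk runs, then zaps happen, then the mapcount is sampled. */
static bool toy_sample_after_walk(int mappings, int pins, int zapped)
{
	int walked = mappings;
	int sample = mappings - zapped + pins;

	return toy_looks_pinned(sample, walked);
}
```

In this model, order A with 2 mappings, 0 pins and 1 racing zap reports a
pin that does not exist (a false positive), while order B with 2 mappings,
1 pin and 1 racing zap misses a real pin (a false negative).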

								Honza

> +   if (PageDMAPined(page)) {
> +       int rc2;
> +
> +       if (mapcount1 == real_count) {
> +           /* Page is no longer pin, no zap pte race */
> +           ClearPageDMAPined(page);
> +           goto out;
> +       }
> +       /* No new mapping of the page so mp1 < rc is illegal. */
> +       VM_BUG_ON(mapcount1 < real_count);
> +       /* Page might be pin. */
> +       mapcount2 = page_mapcount(page);
> +       if (mapcount2 > real_count) {
> +           /* Page is pin for sure. */
> +           goto out;
> +       }
> +       /* We had a race with zap pte we need to rewalk again. */
> +       rc2 = real_mapcount;
> +       real_mapcount = 0;
> +       rwc.rmap_one = page_pin_one;
> +       rmap_walk(page, &rwc);
> +       if (mapcount2 <= (real_count + rc2)) {
> +           /* Page is no longer pin */
> +           ClearPageDMAPined(page);
> +       }
> +       /* At this point the page pin flag reflect pin status of the page */
> +   }
> +
> +out:
>     ...
> }
> 
> The page_pin_one() function count the number of special PTE entry so
> which match the count of pte that have been zapped since the first
> reverse map walk.
> 
> So worst case a page that was pin by a GUP would need 2 reverse map
> walk during page_mkclean(). Moreover this is only needed if we race
> with something that clear pte. I believe this is an acceptable worst
> case. I will work on some RFC patchset next week (once i am down with
> email catch up).
> 
> 
> I do not think i made mistake here, i have been torturing my mind
> trying to think of any race scenario and i believe it holds to any
> racing zap and page_mkclean()
> 
> Cheers,
> Jérôme
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  9:26                                                 ` Jan Kara
@ 2019-01-03 14:44                                                   ` Jerome Glisse
  2019-01-11  2:59                                                     ` John Hubbard
  0 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2019-01-03 14:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> > On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > > > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > > > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> > > > > *only* the tracking pinned pages aspect), given that it is the lightest weight
> > > > > solution for that.  
> > > > > 
> > > > > So as I understand it, this would use page->_mapcount to store both the real
> > > > > mapcount, and the dma pinned count (simply added together), but only do so for
> > > > > file-backed (non-anonymous) pages:
> > > > > 
> > > > > 
> > > > > __get_user_pages()
> > > > > {
> > > > > 	...
> > > > > 	get_page(page);
> > > > > 
> > > > > 	if (!PageAnon)
> > > > > 		atomic_inc(page->_mapcount);
> > > > > 	...
> > > > > }
> > > > > 
> > > > > put_user_page(struct page *page)
> > > > > {
> > > > > 	...
> > > > > 	if (!PageAnon)
> > > > > 		atomic_dec(&page->_mapcount);
> > > > > 
> > > > > 	put_page(page);
> > > > > 	...
> > > > > }
> > > > > 
> > > > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> > > > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> > > > > had in mind?
> > > > 
> > > > Mostly, with the extra two observations:
> > > >     [1] We only need to know the pin count when a write back kicks in
> > > >     [2] We need to protect GUP code with wait_for_write_back() in case
> > > >         GUP is racing with a write back that might not the see the
> > > >         elevated mapcount in time.
> > > > 
> > > > So for [2]
> > > > 
> > > > __get_user_pages()
> > > > {
> > > >     get_page(page);
> > > > 
> > > >     if (!PageAnon) {
> > > >         atomic_inc(page->_mapcount);
> > > > +       if (PageWriteback(page)) {
> > > > +           // Assume we are racing and curent write back will not see
> > > > +           // the elevated mapcount so wait for current write back and
> > > > +           // force page fault
> > > > +           wait_on_page_writeback(page);
> > > > +           // force slow path that will fault again
> > > > +       }
> > > >     }
> > > > }
> > > 
> > > This is not needed AFAICT. __get_user_pages() gets page reference (and it
> > > should also increment page->_mapcount) under PTE lock. So at that point we
> > > are sure we have writeable PTE nobody can change. So page_mkclean() has to
> > > block on PTE lock to make PTE read-only and only after going through all
> > > PTEs like this, it can check page->_mapcount. So the PTE lock provides
> > > enough synchronization.
> > > 
> > > > For [1] only needing pin count during write back turns page_mkclean into
> > > > the perfect spot to check for that so:
> > > > 
> > > > int page_mkclean(struct page *page)
> > > > {
> > > >     int cleaned = 0;
> > > > +   int real_mapcount = 0;
> > > >     struct address_space *mapping;
> > > >     struct rmap_walk_control rwc = {
> > > >         .arg = (void *)&cleaned,
> > > >         .rmap_one = page_mkclean_one,
> > > >         .invalid_vma = invalid_mkclean_vma,
> > > > +       .mapcount = &real_mapcount,
> > > >     };
> > > > 
> > > >     BUG_ON(!PageLocked(page));
> > > > 
> > > >     if (!page_mapped(page))
> > > >         return 0;
> > > > 
> > > >     mapping = page_mapping(page);
> > > >     if (!mapping)
> > > >         return 0;
> > > > 
> > > >     // rmap_walk need to change to count mapping and return value
> > > >     // in .mapcount easy one
> > > >     rmap_walk(page, &rwc);
> > > > 
> > > >     // Big fat comment to explain what is going on
> > > > +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > > > +       SetPageDMAPined(page);
> > > > +   } else {
> > > > +       ClearPageDMAPined(page);
> > > > +   }
> > > 
> > > This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> > > with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> > > check we do in page_mkclean() is wrong?
> > > 
> > 
> > Ok so i found a solution for that. First GUP must wait for racing
> > write back. If GUP see a valid write-able PTE and the page has
> > write back flag set then it must back of as if the PTE was not
> > valid to force fault. It is just a race with page_mkclean and we
> > want ordering between the two. Note this is not strictly needed
> > so we can relax that but i believe this ordering is better to do
> > in GUP rather then having each single user of GUP test for this
> > to avoid the race.
> > 
> > GUP increase mapcount only after checking that it is not racing
> > with writeback it also set a page flag (SetPageDMAPined(page)).
> > 
> > When clearing a write-able pte we set a special entry inside the
> > page table (might need a new special swap type for this) and change
> > page_mkclean_one() to clear to 0 those special entry.
> > 
> > 
> > Now page_mkclean:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > +   int mapcount1, mapcount2;
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> > +   mapcount1 = page_mapcount(page);
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> 
> So what prevents GUP_fast() to grab reference here and the test below would
> think the page is not pinned? Or do you assume that every page_mkclean()
> call will be protected by PageWriteback (currently it is not) so that
> GUP_fast() blocks / bails out?

So GUP_fast() becomes:

GUP_fast_existing() { ... }
GUP_fast()
{
    GUP_fast_existing();

    for (i = 0; i < npages; ++i) {
        if (PageWriteback(pages[i])) {
            // need to force slow path for this page
        } else {
            SetPageDmaPinned(pages[i]);
            atomic_inc(&pages[i]->_mapcount);
        }
    }
}

This is a minor slowdown for GUP fast and it takes care of the
write back race on behalf of the caller. It means that page_mkclean
can not see a mapcount value that increases. This simplifies things,
though we could relax it. Note that what this is doing is making sure
that GUP_fast never gets lucky :) ie it never GUPs a page that is in
the process of being written back but has not yet had its pte
updated to reflect that.


> But I think that detecting pinned pages with small false positive rate is
> OK. The extra page bouncing will cost some performance but if it is rare,
> then we are OK. So I think we can go for the simple version of detecting
> pinned pages as you mentioned in some earlier email. We just have to be
> sure there are no false negatives.

What worries me is that a page might stay with the DMA pinned flag forever
if it keeps getting unlucky, ie some process keeps mapping it after the last
write back and keeps zapping that mapping while racing with page_mkclean.
This should be unlikely but nothing would prevent it. I am fine with
living with this but the page might become a zombie GUP :)

Maybe we can start with the simple version, add a big fat comment, and see
if anyone complains about a zombie GUP ...

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03  3:27                                                 ` John Hubbard
@ 2019-01-03 14:57                                                   ` Jerome Glisse
  0 siblings, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-03 14:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Wed, Jan 02, 2019 at 07:27:17PM -0800, John Hubbard wrote:
> On 1/2/19 5:55 PM, Jerome Glisse wrote:
> > On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> >>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
> >>>> solution for that.  
> >>>>
> >>>> So as I understand it, this would use page->_mapcount to store both the real
> >>>> mapcount, and the dma pinned count (simply added together), but only do so for
> >>>> file-backed (non-anonymous) pages:
> >>>>
> >>>>
> >>>> __get_user_pages()
> >>>> {
> >>>> 	...
> >>>> 	get_page(page);
> >>>>
> >>>> 	if (!PageAnon)
> >>>> 		atomic_inc(page->_mapcount);
> >>>> 	...
> >>>> }
> >>>>
> >>>> put_user_page(struct page *page)
> >>>> {
> >>>> 	...
> >>>> 	if (!PageAnon)
> >>>> 		atomic_dec(&page->_mapcount);
> >>>>
> >>>> 	put_page(page);
> >>>> 	...
> >>>> }
> >>>>
> >>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> >>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> >>>> had in mind?
> >>>
> >>> Mostly, with the extra two observations:
> >>>     [1] We only need to know the pin count when a write back kicks in
> >>>     [2] We need to protect GUP code with wait_for_write_back() in case
> >>>         GUP is racing with a write back that might not the see the
> >>>         elevated mapcount in time.
> >>>
> >>> So for [2]
> >>>
> >>> __get_user_pages()
> >>> {
> >>>     get_page(page);
> >>>
> >>>     if (!PageAnon) {
> >>>         atomic_inc(page->_mapcount);
> >>> +       if (PageWriteback(page)) {
> >>> +           // Assume we are racing and curent write back will not see
> >>> +           // the elevated mapcount so wait for current write back and
> >>> +           // force page fault
> >>> +           wait_on_page_writeback(page);
> >>> +           // force slow path that will fault again
> >>> +       }
> >>>     }
> >>> }
> >>
> >> This is not needed AFAICT. __get_user_pages() gets page reference (and it
> >> should also increment page->_mapcount) under PTE lock. So at that point we
> >> are sure we have writeable PTE nobody can change. So page_mkclean() has to
> >> block on PTE lock to make PTE read-only and only after going through all
> >> PTEs like this, it can check page->_mapcount. So the PTE lock provides
> >> enough synchronization.
> >>
> >>> For [1] only needing pin count during write back turns page_mkclean into
> >>> the perfect spot to check for that so:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>>     int cleaned = 0;
> >>> +   int real_mapcount = 0;
> >>>     struct address_space *mapping;
> >>>     struct rmap_walk_control rwc = {
> >>>         .arg = (void *)&cleaned,
> >>>         .rmap_one = page_mkclean_one,
> >>>         .invalid_vma = invalid_mkclean_vma,
> >>> +       .mapcount = &real_mapcount,
> >>>     };
> >>>
> >>>     BUG_ON(!PageLocked(page));
> >>>
> >>>     if (!page_mapped(page))
> >>>         return 0;
> >>>
> >>>     mapping = page_mapping(page);
> >>>     if (!mapping)
> >>>         return 0;
> >>>
> >>>     // rmap_walk need to change to count mapping and return value
> >>>     // in .mapcount easy one
> >>>     rmap_walk(page, &rwc);
> >>>
> >>>     // Big fat comment to explain what is going on
> >>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> >>> +       SetPageDMAPined(page);
> >>> +   } else {
> >>> +       ClearPageDMAPined(page);
> >>> +   }
> >>
> >> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
> >> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
> >> check we do in page_mkclean() is wrong?
> >>
> > 
> > Ok so i found a solution for that. First GUP must wait for racing
> > write back. If GUP see a valid write-able PTE and the page has
> > write back flag set then it must back of as if the PTE was not
> > valid to force fault. It is just a race with page_mkclean and we
> > want ordering between the two. Note this is not strictly needed
> > so we can relax that but i believe this ordering is better to do
> > in GUP rather then having each single user of GUP test for this
> > to avoid the race.
> > 
> > GUP increase mapcount only after checking that it is not racing
> > with writeback it also set a page flag (SetPageDMAPined(page)).
> > 
> > When clearing a write-able pte we set a special entry inside the
> > page table (might need a new special swap type for this) and change
> > page_mkclean_one() to clear to 0 those special entry.
> > 
> > 
> > Now page_mkclean:
> > 
> > int page_mkclean(struct page *page)
> > {
> >     int cleaned = 0;
> > +   int real_mapcount = 0;
> >     struct address_space *mapping;
> >     struct rmap_walk_control rwc = {
> >         .arg = (void *)&cleaned,
> >         .rmap_one = page_mkclean_one,
> >         .invalid_vma = invalid_mkclean_vma,
> > +       .mapcount = &real_mapcount,
> >     };
> > +   int mapcount1, mapcount2;
> > 
> >     BUG_ON(!PageLocked(page));
> > 
> >     if (!page_mapped(page))
> >         return 0;
> > 
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >         return 0;
> > 
> > +   mapcount1 = page_mapcount(page);
> > 
> >     // rmap_walk need to change to count mapping and return value
> >     // in .mapcount easy one
> >     rmap_walk(page, &rwc);
> > 
> > +   if (PageDMAPined(page)) {
> > +       int rc2;
> > +
> > +       if (mapcount1 == real_count) {
> > +           /* Page is no longer pin, no zap pte race */
> > +           ClearPageDMAPined(page);
> > +           goto out;
> > +       }
> > +       /* No new mapping of the page so mp1 < rc is illegal. */
> > +       VM_BUG_ON(mapcount1 < real_count);
> > +       /* Page might be pin. */
> > +       mapcount2 = page_mapcount(page);
> > +       if (mapcount2 > real_count) {
> > +           /* Page is pin for sure. */
> > +           goto out;
> > +       }
> > +       /* We had a race with zap pte we need to rewalk again. */
> > +       rc2 = real_mapcount;
> > +       real_mapcount = 0;
> > +       rwc.rmap_one = page_pin_one;
> > +       rmap_walk(page, &rwc);
> > +       if (mapcount2 <= (real_count + rc2)) {
> > +           /* Page is no longer pin */
> > +           ClearPageDMAPined(page);
> > +       }
> > +       /* At this point the page pin flag reflect pin status of the page */
> 
> Until...what? In other words, what is providing synchronization here?

It can still race with put_user_page() but this is fine, ie it means
that a racing put_user_page() will not be taken into account and that
the page will still be considered pinned for this round, even though the
last pin might just have been dropped.

It is all about getting the "real" mapcount value at one point in
time while racing with something that zaps ptes. So what you want is
to be able to count the number of pte zaps that are racing with you.
If there are none then you know you have a stable real mapcount value;
if there are some you can account for them in the real mapcount and
compare it to the mapcount value of the page. Worst case is you report
a page as pinned while it has just been released, but the next write
back will catch that (unless the page is GUPed again).
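
The arithmetic behind that accounting can be sketched in userspace C. This
is only a model under stated assumptions, not the proposed kernel code:
page_still_pinned() and its parameters are hypothetical names, and it
assumes that no new mapping or pin appears during the walk and that the
number of racing zaps is known exactly (obtaining that count reliably is
the hard part).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only; none of these names exist in the kernel.
 *
 * mapcount_before: page mapcount sampled before the reverse map walk,
 *                  i.e. real mappings plus GUP pins in this scheme
 * walked:          PTEs the reverse map walk actually found
 * missed_zaps:     mappings that were included in mapcount_before but
 *                  were zapped before the walk reached them
 */
static bool page_still_pinned(int mapcount_before, int walked, int missed_zaps)
{
	/* With no new mappings, the walk can never explain more than the sample. */
	assert(mapcount_before >= walked + missed_zaps);

	/* Whatever is not explained by a mapping must be a pin. */
	return mapcount_before - (walked + missed_zaps) > 0;
}
```

For example, a sample of 3 with 2 PTEs walked and 1 racing zap leaves
nothing unexplained, so the page is treated as unpinned; a sample of 4
with the same walk result leaves one pin.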

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-03 14:44                                                   ` Jerome Glisse
@ 2019-01-11  2:59                                                     ` John Hubbard
  2019-01-11  2:59                                                       ` John Hubbard
  2019-01-11 16:51                                                       ` Jerome Glisse
  0 siblings, 2 replies; 206+ messages in thread
From: John Hubbard @ 2019-01-11  2:59 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/3/19 6:44 AM, Jerome Glisse wrote:
> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>>>> solution for that.  
>>>>>>
>>>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>>>> file-backed (non-anonymous) pages:
>>>>>>
>>>>>>
>>>>>> __get_user_pages()
>>>>>> {
>>>>>> 	...
>>>>>> 	get_page(page);
>>>>>>
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_inc(page->_mapcount);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> put_user_page(struct page *page)
>>>>>> {
>>>>>> 	...
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_dec(&page->_mapcount);
>>>>>>
>>>>>> 	put_page(page);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>>>>> had in mind?
>>>>>
>>>>> Mostly, with the extra two observations:
>>>>>     [1] We only need to know the pin count when a write back kicks in
>>>>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>>>>         GUP is racing with a write back that might not the see the
>>>>>         elevated mapcount in time.
>>>>>
>>>>> So for [2]
>>>>>
>>>>> __get_user_pages()
>>>>> {
>>>>>     get_page(page);
>>>>>
>>>>>     if (!PageAnon) {
>>>>>         atomic_inc(page->_mapcount);
>>>>> +       if (PageWriteback(page)) {
>>>>> +           // Assume we are racing and curent write back will not see
>>>>> +           // the elevated mapcount so wait for current write back and
>>>>> +           // force page fault
>>>>> +           wait_on_page_writeback(page);
>>>>> +           // force slow path that will fault again
>>>>> +       }
>>>>>     }
>>>>> }
>>>>
>>>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>>>> should also increment page->_mapcount) under PTE lock. So at that point we
>>>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>>>> block on PTE lock to make PTE read-only and only after going through all
>>>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>>>> enough synchronization.
>>>>
>>>>> For [1] only needing pin count during write back turns page_mkclean into
>>>>> the perfect spot to check for that so:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>>     // rmap_walk need to change to count mapping and return value
>>>>>     // in .mapcount easy one
>>>>>     rmap_walk(page, &rwc);
>>>>>
>>>>>     // Big fat comment to explain what is going on
>>>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>>>> +       SetPageDMAPined(page);
>>>>> +   } else {
>>>>> +       ClearPageDMAPined(page);
>>>>> +   }
>>>>
>>>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>>>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>>>> check we do in page_mkclean() is wrong?
>>>>
>>>
>>> Ok so i found a solution for that. First GUP must wait for racing
>>> write back. If GUP see a valid write-able PTE and the page has
>>> write back flag set then it must back of as if the PTE was not
>>> valid to force fault. It is just a race with page_mkclean and we
>>> want ordering between the two. Note this is not strictly needed
>>> so we can relax that but i believe this ordering is better to do
>>> in GUP rather then having each single user of GUP test for this
>>> to avoid the race.
>>>
>>> GUP increase mapcount only after checking that it is not racing
>>> with writeback it also set a page flag (SetPageDMAPined(page)).
>>>
>>> When clearing a write-able pte we set a special entry inside the
>>> page table (might need a new special swap type for this) and change
>>> page_mkclean_one() to clear to 0 those special entry.
>>>
>>>
>>> Now page_mkclean:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>> +   int mapcount1, mapcount2;
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>> +   mapcount1 = page_mapcount(page);
>>>     // rmap_walk need to change to count mapping and return value
>>>     // in .mapcount easy one
>>>     rmap_walk(page, &rwc);
>>
>> So what prevents GUP_fast() to grab reference here and the test below would
>> think the page is not pinned? Or do you assume that every page_mkclean()
>> call will be protected by PageWriteback (currently it is not) so that
>> GUP_fast() blocks / bails out?

Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
for each page" question (ignoring, for now, what to actually *do* in response to 
that flag being set):

1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
This is probably less troubling than the next point, but it does undermine all the 
complicated schemes involving PageWriteback, that try to synchronize gup() with
page_mkclean().

2. Also, the mapcount approach here still does not reliably avoid false negatives
(that is, a page may have been gup'd, but page_mkclean could miss that): gup()
can always jump in and increment the mapcount, while page_mkclean is in the middle
of making (wrong) decisions based on that mapcount. There's no lock to prevent that.

Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
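
To make point 2 concrete, here is a small single-threaded C model that
replays one bad interleaving deterministically. It is an illustration
only: struct toy_page and the toy_*() helpers are hypothetical and are not
kernel APIs. It assumes the page starts with a stale pinned flag left over
from an earlier pin that has already been released.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: mapcount holds real mappings plus GUP pins, as proposed. */
struct toy_page {
	int mappings;		/* PTEs a reverse map walk would find */
	int mapcount;		/* mappings + pins */
	bool dma_pinned;	/* models the proposed page flag */
};

/* Values page_mkclean() would have gathered before deciding. */
struct toy_sample {
	int mapcount;
	int walked;
};

/* Models gup(): take a pin by raising the shared counter. */
static void toy_gup(struct toy_page *p)
{
	p->mapcount++;
	p->dma_pinned = true;
}

/* Step 1 of the modelled page_mkclean(): sample mapcount and walk. */
static struct toy_sample toy_mkclean_sample(const struct toy_page *p)
{
	struct toy_sample s = { .mapcount = p->mapcount, .walked = p->mappings };
	return s;
}

/* Step 2: decide from the (possibly stale) sample, not the live counter. */
static void toy_mkclean_decide(struct toy_page *p, struct toy_sample s)
{
	if (p->dma_pinned && s.mapcount == s.walked)
		p->dma_pinned = false;
}

/* Returns true when the page ends up pinned but not flagged. */
static bool toy_false_negative(void)
{
	struct toy_page p = { .mappings = 1, .mapcount = 1, .dma_pinned = true };
	struct toy_sample s = toy_mkclean_sample(&p);

	toy_gup(&p);			/* the pin lands after the sample */
	toy_mkclean_decide(&p, s);	/* the flag is cleared anyway */

	int pins = p.mapcount - p.mappings;
	assert(pins == 1);
	return pins > 0 && !p.dma_pinned;
}
```

In this model toy_false_negative() returns true: the page holds one pin
yet the flag is clear, because nothing orders toy_gup() against the two
steps of the decision.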

> 
> So GUP_fast() becomes:
> 
> GUP_fast_existing() { ... }
> GUP_fast()
> {
>     GUP_fast_existing();
> 
>     for (i = 0; i < npages; ++i) {
>         if (PageWriteback(pages[i])) {
>             // need to force slow path for this page
>         } else {
>             SetPageDmaPinned(pages[i]);
>             atomic_inc(pages[i]->mapcount);
>         }
>     }
> }
> 
> This is a minor slow down for GUP fast and it takes care of a
> write back race on behalf of caller. This means that page_mkclean
> can not see a mapcount value that increase. This simplify thing
> we can relax that. Note that what this is doing is making sure
> that GUP_fast never get lucky :) ie never GUP a page that is in
> the process of being write back but has not yet had its pte
> updated to reflect that.
> 
> 
>> But I think that detecting pinned pages with small false positive rate is
>> OK. The extra page bouncing will cost some performance but if it is rare,
>> then we are OK. So I think we can go for the simple version of detecting
>> pinned pages as you mentioned in some earlier email. We just have to be
>> sure there are no false negatives.
> 

Agree with that sentiment, but there are still false negatives and I'm not
yet seeing any solutions for that.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11  2:59                                                     ` John Hubbard
@ 2019-01-11  2:59                                                       ` John Hubbard
  2019-01-11 16:51                                                       ` Jerome Glisse
  1 sibling, 0 replies; 206+ messages in thread
From: John Hubbard @ 2019-01-11  2:59 UTC (permalink / raw)
  To: Jerome Glisse, Jan Kara
  Cc: Matthew Wilcox, Dave Chinner, Dan Williams, John Hubbard,
	Andrew Morton, Linux MM, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 1/3/19 6:44 AM, Jerome Glisse wrote:
> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
>>>>>> *only* the tracking pinned pages aspect), given that it is the lightest weight
>>>>>> solution for that.  
>>>>>>
>>>>>> So as I understand it, this would use page->_mapcount to store both the real
>>>>>> mapcount, and the dma pinned count (simply added together), but only do so for
>>>>>> file-backed (non-anonymous) pages:
>>>>>>
>>>>>>
>>>>>> __get_user_pages()
>>>>>> {
>>>>>> 	...
>>>>>> 	get_page(page);
>>>>>>
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_inc(page->_mapcount);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> put_user_page(struct page *page)
>>>>>> {
>>>>>> 	...
>>>>>> 	if (!PageAnon)
>>>>>> 		atomic_dec(&page->_mapcount);
>>>>>>
>>>>>> 	put_page(page);
>>>>>> 	...
>>>>>> }
>>>>>>
>>>>>> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
>>>>>> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
>>>>>> had in mind?
>>>>>
>>>>> Mostly, with the extra two observations:
>>>>>     [1] We only need to know the pin count when a write back kicks in
>>>>>     [2] We need to protect GUP code with wait_for_write_back() in case
>>>>>         GUP is racing with a write back that might not the see the
>>>>>         elevated mapcount in time.
>>>>>
>>>>> So for [2]
>>>>>
>>>>> __get_user_pages()
>>>>> {
>>>>>     get_page(page);
>>>>>
>>>>>     if (!PageAnon) {
>>>>>         atomic_inc(page->_mapcount);
>>>>> +       if (PageWriteback(page)) {
>>>>> +           // Assume we are racing and curent write back will not see
>>>>> +           // the elevated mapcount so wait for current write back and
>>>>> +           // force page fault
>>>>> +           wait_on_page_writeback(page);
>>>>> +           // force slow path that will fault again
>>>>> +       }
>>>>>     }
>>>>> }
>>>>
>>>> This is not needed AFAICT. __get_user_pages() gets page reference (and it
>>>> should also increment page->_mapcount) under PTE lock. So at that point we
>>>> are sure we have writeable PTE nobody can change. So page_mkclean() has to
>>>> block on PTE lock to make PTE read-only and only after going through all
>>>> PTEs like this, it can check page->_mapcount. So the PTE lock provides
>>>> enough synchronization.
>>>>
>>>>> For [1] only needing pin count during write back turns page_mkclean into
>>>>> the perfect spot to check for that so:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>>     // rmap_walk need to change to count mapping and return value
>>>>>     // in .mapcount easy one
>>>>>     rmap_walk(page, &rwc);
>>>>>
>>>>>     // Big fat comment to explain what is going on
>>>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>>>> +       SetPageDMAPined(page);
>>>>> +   } else {
>>>>> +       ClearPageDMAPined(page);
>>>>> +   }
>>>>
>>>> This is the detail I'm not sure about: Why cannot rmap_walk_file() race
>>>> with e.g. zap_pte_range() which decrements page->_mapcount and thus the
>>>> check we do in page_mkclean() is wrong?
>>>>
>>>
>>> Ok so i found a solution for that. First GUP must wait for racing
>>> write back. If GUP see a valid write-able PTE and the page has
>>> write back flag set then it must back of as if the PTE was not
>>> valid to force fault. It is just a race with page_mkclean and we
>>> want ordering between the two. Note this is not strictly needed
>>> so we can relax that but i believe this ordering is better to do
>>> in GUP rather then having each single user of GUP test for this
>>> to avoid the race.
>>>
>>> GUP increase mapcount only after checking that it is not racing
>>> with writeback it also set a page flag (SetPageDMAPined(page)).
>>>
>>> When clearing a write-able pte we set a special entry inside the
>>> page table (might need a new special swap type for this) and change
>>> page_mkclean_one() to clear to 0 those special entry.
>>>
>>>
>>> Now page_mkclean:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>> +   int mapcount1, mapcount2;
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>> +   mapcount1 = page_mapcount(page);
>>>     // rmap_walk need to change to count mapping and return value
>>>     // in .mapcount easy one
>>>     rmap_walk(page, &rwc);
>>
>> So what prevents GUP_fast() to grab reference here and the test below would
>> think the page is not pinned? Or do you assume that every page_mkclean()
>> call will be protected by PageWriteback (currently it is not) so that
>> GUP_fast() blocks / bails out?

Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
for each page" question (ignoring, for now, what to actually *do* in response to 
that flag being set):

1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
This is probably less troubling than the next point, but it does undermine all the 
complicated schemes involving PageWriteback, that try to synchronize gup() with
page_mkclean().

2. Also, the mapcount approach here still does not reliably avoid false negatives
(that is, a page may have been gup'd, but page_mkclean could miss that): gup()
can always jump in and increment the mapcount, while page_mkclean is in the middle
of making (wrong) decisions based on that mapcount. There's no lock to prevent that.

Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

> 
> So GUP_fast() becomes:
> 
> GUP_fast_existing() { ... }
> GUP_fast()
> {
>     GUP_fast_existing();
> 
>     for (i = 0; i < npages; ++i) {
>         if (PageWriteback(pages[i])) {
>             // need to force slow path for this page
>         } else {
>             SetPageDmaPinned(pages[i]);
>             atomic_inc(pages[i]->mapcount);
>         }
>     }
> }
> 
> This is a minor slow down for GUP fast and it takes care of a
> write back race on behalf of caller. This means that page_mkclean
> can not see a mapcount value that increase. This simplify thing
> we can relax that. Note that what this is doing is making sure
> that GUP_fast never get lucky :) ie never GUP a page that is in
> the process of being write back but has not yet had its pte
> updated to reflect that.
> 
> 
>> But I think that detecting pinned pages with small false positive rate is
>> OK. The extra page bouncing will cost some performance but if it is rare,
>> then we are OK. So I think we can go for the simple version of detecting
>> pinned pages as you mentioned in some earlier email. We just have to be
>> sure there are no false negatives.
> 

Agree with that sentiment, but there are still false negatives and I'm not
yet seeing any solutions for that.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11  2:59                                                     ` John Hubbard
  2019-01-11  2:59                                                       ` John Hubbard
@ 2019-01-11 16:51                                                       ` Jerome Glisse
  2019-01-11 16:51                                                         ` Jerome Glisse
  2019-01-12  1:04                                                         ` John Hubbard
  1 sibling, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-11 16:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> > On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:

[...]

> >>> Now page_mkclean:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>>     int cleaned = 0;
> >>> +   int real_mapcount = 0;
> >>>     struct address_space *mapping;
> >>>     struct rmap_walk_control rwc = {
> >>>         .arg = (void *)&cleaned,
> >>>         .rmap_one = page_mkclean_one,
> >>>         .invalid_vma = invalid_mkclean_vma,
> >>> +       .mapcount = &real_mapcount,
> >>>     };
> >>> +   int mapcount1, mapcount2;
> >>>
> >>>     BUG_ON(!PageLocked(page));
> >>>
> >>>     if (!page_mapped(page))
> >>>         return 0;
> >>>
> >>>     mapping = page_mapping(page);
> >>>     if (!mapping)
> >>>         return 0;
> >>>
> >>> +   mapcount1 = page_mapcount(page);
> >>>     // rmap_walk need to change to count mapping and return value
> >>>     // in .mapcount easy one
> >>>     rmap_walk(page, &rwc);
> >>
> >> So what prevents GUP_fast() to grab reference here and the test below would
> >> think the page is not pinned? Or do you assume that every page_mkclean()
> >> call will be protected by PageWriteback (currently it is not) so that
> >> GUP_fast() blocks / bails out?
> 
> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> for each page" question (ignoring, for now, what to actually *do* in response to 
> that flag being set):
> 
> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> This is probably less troubling than the next point, but it does undermine all the 
> complicated schemes involving PageWriteback, that try to synchronize gup() with
> page_mkclean().
> 
> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> can always jump in and increment the mapcount, while page_mkclean is in the middle
> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> 
> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

Both points are addressed by the solution at the end of this email.

> > 
> > So GUP_fast() becomes:
> > 
> > GUP_fast_existing() { ... }
> > GUP_fast()
> > {
> >     GUP_fast_existing();
> > 
> >     for (i = 0; i < npages; ++i) {
> >         if (PageWriteback(pages[i])) {
> >             // need to force slow path for this page
> >         } else {
> >             SetPageDmaPinned(pages[i]);
> >             atomic_inc(pages[i]->mapcount);
> >         }
> >     }
> > }
> > 
> > This is a minor slow down for GUP fast and it takes care of a
> > write back race on behalf of caller. This means that page_mkclean
> > can not see a mapcount value that increase. This simplify thing
> > we can relax that. Note that what this is doing is making sure
> > that GUP_fast never get lucky :) ie never GUP a page that is in
> > the process of being write back but has not yet had its pte
> > updated to reflect that.
> > 
> > 
> >> But I think that detecting pinned pages with small false positive rate is
> >> OK. The extra page bouncing will cost some performance but if it is rare,
> >> then we are OK. So I think we can go for the simple version of detecting
> >> pinned pages as you mentioned in some earlier email. We just have to be
> >> sure there are no false negatives.
> > 
> 
> Agree with that sentiment, but there are still false negatives and I'm not
> yet seeing any solutions for that.

So here is the solution:


Is a page pinned? With no false negatives:
==========================================

get_user_pages*() aka GUP:
     if (!PageAnon(page)) {
        bool write_back = PageWriteback(page);
        bool page_is_pin = PagePin(page);
        if (write_back && !page_is_pin) {
            /* Wait for write back and retry GUP */
            ...
            goto retry;
        }
[G1]    smp_rmb();
[G2]    atomic_inc(&page->_mapcount);
[G3]    smp_wmb();
[G4]    SetPagePin(page);
[G5]    smp_wmb();
[G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
            /* Back off as write back might have missed us */
            atomic_dec(&page->_mapcount);
            /* Wait for write back and retry GUP */
            ...
            goto retry;
        }
     }

put_user_page() aka PUP:
[P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
[P2] put_page(page);

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount);
[C4] map_count = rmap_walk(page);
[C5] if (pinned && map_count < map_and_pin_count) SetPagePin(page);

So with the above code we store the map and pin count inside the struct
page _mapcount field. The idea is that we can count the number of page
table entries that point to the page when reverse walking all the page
mappings in page_mkclean() [C4].
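
To make the bookkeeping concrete, here is a minimal single-threaded
sketch of it using C11 atomics. It is a model under stated assumptions,
not kernel code: struct model_page, model_gup(), model_pup(),
model_mkclean() and model_scenario() are hypothetical names, the count
starts at 0 rather than at the kernel's -1 bias for _mapcount, the
result of the rmap walk is passed in as a plain number, and the memory
barriers and write back checks are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of struct page state used here. */
struct model_page {
    atomic_int map_and_pin_count; /* models page->_mapcount (0-based here) */
    atomic_bool pin;              /* models the proposed PagePin flag */
};

/* GUP: bump the shared count [G2], then set the pin flag [G4]. */
static void model_gup(struct model_page *p)
{
    atomic_fetch_add(&p->map_and_pin_count, 1);
    atomic_store(&p->pin, true);
}

/* PUP: only drop the count [P1]; the flag is left for page_mkclean(). */
static void model_pup(struct model_page *p)
{
    int old = atomic_fetch_sub(&p->map_and_pin_count, 1);

    assert(old > 0); /* a PUP without a matching GUP is a caller bug */
}

/* page_mkclean(): [C1], [C3] and [C5]; the walk result [C4] is an input. */
static bool model_mkclean(struct model_page *p, int walked_map_count)
{
    bool pinned = atomic_exchange(&p->pin, false);       /* [C1] */
    int snapshot = atomic_load(&p->map_and_pin_count);   /* [C3] */

    if (pinned && walked_map_count < snapshot)           /* [C5] */
        atomic_store(&p->pin, true);
    return atomic_load(&p->pin);
}

/*
 * Two page table mappings and one GUP. Bit 0 of the result is the flag
 * after page_mkclean() while the pin is held, bit 1 is the flag after
 * page_mkclean() once the pin has been dropped.
 */
static int model_scenario(void)
{
    struct model_page p;
    int held, dropped;

    atomic_init(&p.map_and_pin_count, 2);
    atomic_init(&p.pin, false);

    model_gup(&p);
    held = model_mkclean(&p, 2);
    model_pup(&p);
    dropped = model_mkclean(&p, 2);
    return held | (dropped << 1);
}
```

In this model the flag survives page_mkclean() while the GUP reference
is held and is dropped by the first page_mkclean() that runs after the
matching PUP.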

The issue is that GUP, PUP and page table entry zapping can all run
concurrently with page_mkclean(), and thus we can not get both the real
map and pin count and the real map count at one given point in time
([C5] for instance in the above). However we only care about avoiding
false negatives, ie we do not want to report a page as unpinned if in
fact it is pinned (it has an active GUP). Avoiding false positives would
be nice but it would need more heavyweight synchronization within GUP
and PUP (we can mitigate it, see the section on that below).

With the above scheme a page is _not_ pinned (unpinned) if and only if
we have real_map_count == real_map_and_pin_count at a given point in
time. In the above pseudo code the page is locked within page_mkclean(),
thus no new page table entry can be added and the number of page
mappings can only go down (because of concurrent pte zapping). So no
matter what happens, at [C5] we have map_count <= real_map_count.

At [C3] we have two cases to consider:
 [R1] A concurrent GUP after [C3]: then we do not care what happens at
      [C5] as the GUP would already have set the page pin flag. If it
      raced before [C3], at [C1] with TestClearPagePin(), then the
      map_and_pin_count would reflect the GUP thanks to the memory
      barriers [G3] and [C2].
 [R2] No concurrent GUP after [C3]: then we only have concurrent PUP to
      worry about and thus the real_map_and_pin_count can only go down.
      So because we first snapshot that value at [C3] we have:
      real_map_and_pin_count <= map_and_pin_count.

      So at [C5] we end up with map_count <= real_map_count and with
      real_map_and_pin_count <= map_and_pin_count, but we also always
      have real_map_count <= real_map_and_pin_count, so we are in an
      a <= b <= c <= d scenario and if a == d then b == c. So at [C5]
      if map_count == map_and_pin_count then we know for sure that we
      have real_map_count == real_map_and_pin_count, and if that is the
      case then the page is no longer pinned. So at [C5] we will never
      miss a pinned page (no false negative).

      Another way to word this is that we always under-estimate the real
      map count and over-estimate the map and pin count, and thus we can
      never have a false negative (map count equal to map and pin count
      while in fact the real map count is less than the real map and pin
      count).
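
The a <= b <= c <= d argument can be checked mechanically for small
values. The sketch below is only an illustration of that arithmetic: it
assumes the chain of inequalities already holds (which is what the
reasoning above establishes for the [R2] case) and says nothing about
the barriers. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the [C5] test; not kernel code. */
static bool mkclean_keeps_pin(bool was_pinned, int map_count,
                              int map_and_pin_count)
{
    return was_pinned && map_count < map_and_pin_count;
}

/*
 * Walk every chain a <= b <= c <= d with values in [0, max], where
 *   a = map_count              (counted by the rmap walk at [C4])
 *   b = real_map_count
 *   c = real_map_and_pin_count
 *   d = map_and_pin_count      (snapshot taken at [C3])
 * and count the cases where the page really is pinned (b < c) but
 * [C5] would leave the pin flag clear.
 */
static int count_false_negatives(int max)
{
    int missed = 0;

    assert(max >= 0);
    for (int a = 0; a <= max; a++)
        for (int b = a; b <= max; b++)
            for (int c = b; c <= max; c++)
                for (int d = c; d <= max; d++)
                    if (b < c && !mkclean_keeps_pin(true, a, d))
                        missed++;
    return missed;
}

/* Same walk, counting unpinned pages (b == c) that still keep the flag. */
static int count_false_positives(int max)
{
    int spurious = 0;

    assert(max >= 0);
    for (int a = 0; a <= max; a++)
        for (int b = a; b <= max; b++)
            for (int c = b; c <= max; c++)
                for (int d = c; d <= max; d++)
                    if (b == c && mkclean_keeps_pin(true, a, d))
                        spurious++;
    return spurious;
}
```

The first counter stays at zero for any bound, while the second one is
non-zero as soon as the observed ends of the chain can differ from the
real values, which matches the "no false negative, some false positive"
claim.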


PageWriteback() test and ordering with page_mkclean()
=====================================================

In GUP we test for the page write back flag to avoid pinning a page that
is undergoing write back. That flag is set after page_mkclean(), so
the filesystem code that will check for the pin flag needs a memory
barrier:
    int __test_set_page_writeback(struct page *page, bool keep_write,
+                                 bool *use_bounce_page)
    {
        ...
  [T1]  TestSetPageWriteback(page);
+ [T2]  smp_wmb();
+ [T3]  *use_bounce_page = PagePin(page);
        ...
    }

That way if there is a concurrent GUP we either have:
    [R1] GUP sees the write back flag set before [G1], so it backs off.
    [R2] GUP sees no write back before [G1]. Here either GUP sees the
         write back flag at [G6], or [T3] sees the pin flag thanks to
         the memory barriers [G5] and [T2].

So in all cases we never miss a pin or a write back.
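
The [R2] case is a "each side writes its flag, then reads the other
flag" pattern. As a sketch, the code below enumerates every interleaving
of the two GUP steps ([G4], [G6]) with the two write back steps ([T1],
[T3]) and checks that at least one side always notices the other. It
assumes that each step becomes globally visible in program order, which
is a stronger assumption than what the barriers in the pseudo code are
shown to provide here; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Run one interleaving. Bit s of gup_slots says that slot s (0..3) is
 * taken by the next GUP step, otherwise by the next write back step.
 * Returns true if GUP saw the write back flag or write back saw the pin.
 */
static bool interleaving_is_covered(unsigned gup_slots)
{
    bool pin = false, writeback = false;
    bool gup_saw_writeback = false, wb_saw_pin = false;
    int gup_step = 0, wb_step = 0;

    for (int s = 0; s < 4; s++) {
        if ((gup_slots >> s) & 1u) {
            if (gup_step++ == 0)
                pin = true;                    /* [G4] SetPagePin() */
            else
                gup_saw_writeback = writeback; /* [G6] PageWriteback() */
        } else {
            if (wb_step++ == 0)
                writeback = true;              /* [T1] set write back */
            else
                wb_saw_pin = pin;              /* [T3] PagePin() */
        }
    }
    assert(gup_step == 2 && wb_step == 2);
    return gup_saw_writeback || wb_saw_pin;
}

/* Try all ways of placing the two GUP steps in the four slots. */
static int count_missed_interleavings(void)
{
    int missed = 0;

    for (unsigned slots = 0; slots < 16; slots++) {
        int gup_steps = 0;

        for (int s = 0; s < 4; s++)
            gup_steps += (slots >> s) & 1u;
        if (gup_steps != 2)
            continue; /* not a valid split of the four steps */
        if (!interleaving_is_covered(slots))
            missed++;
    }
    return missed;
}
```

For both sides to miss each other [G6] would have to run before [T1]
and [T3] before [G4], which contradicts each side's own program order,
so no interleaving is missed under that assumption.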


Mitigating false positives:
===========================

If false positives are ever an issue we can improve the situation and
properly account for concurrent pte zapping with the following changes:

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount);
[C4] map_count = rmap_walk(page, &page_mkclean_one());
[C5] if (pinned && !PagePin(page) && map_count < map_and_pin_count) {
[C6]    map_and_pin_count2 = atomic_read(&page->_mapcount);
[C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
[C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
     }

page_map_count():
[M1] if (pte_valid(pte)) {
        map_count++;
     } else if (pte_special_zap(pte)) {
[M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
[M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
     }

Pte zapping of a file backed page will write a special pte entry which
holds the page map and pin count value at the time the pte is zapped.
page_mkclean_one() unconditionally replaces those special ptes with pte
none and ignores them altogether. We only want to detect pte zapping
that happens after [C6] and before [C7] is done.

With [M3] we are counting all page table entries that have been zapped
after the map_and_pin_count value we read at [C6]. Again we have two
cases:
 [R1] A concurrent GUP after [C6]: then we do not care what happens
      at [C8] as the GUP would already have set the page pin flag.
 [R2] No concurrent GUP: then we only have concurrent PUP to worry
      about. If they happen before [C6] they are included in the [C6]
      map_and_pin_count value. If after [C6] then we might miss a
      page that is no longer pinned, ie we are over-estimating the
      map_and_pin_count (real_map_and_pin_count < map_and_pin_count
      at [C8]). So no false negative, just false positive.

Here we just get the accurate real_map_count at [C6] time, so if the
page was no longer pinned at [C6] time we will correctly detect it and
not set the flag at [C8]. If there is any concurrent GUP, that GUP
will set the flag properly.

There is one last thing to note about the above code: the MASK in [M3].
For a special pte entry we might not have enough bits to store the
whole map and pin count value (on 32-bit architectures), so we might
expose ourselves to wrap around. Again we do not care about the [R1]
case as any concurrent GUP will set the pin flag. So we only care if
the only thing happening concurrently is either PUP or pte zapping. In
both cases it means that the map and pin count is going down, so if
there is a wrap around sometime within [C7]/page_map_count() we have:
  [t0] page_map_count() executed on some pte
  [t1] page_map_count() executed on another pte after [t0]
With:
    (map_count_t0 & MASK) < (map_count_t1 & MASK)
While in fact:
    map_count_t0 > map_count_t1

So if that happens then we will under-estimate the map count, ie we
will ignore some of the concurrent pte zapping and not count them.
So again we are only exposing ourselves to false positives, not false
negatives.
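
A small worked example of that wrap around, as a sketch only: the 8-bit
width, the ZAP_COUNT_* macros and both helper names are assumptions made
for illustration, the real number of bits available in a special pte is
not specified here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: pretend a special zap pte can only hold 8 bits of count. */
#define ZAP_COUNT_BITS 8u
#define ZAP_COUNT_MASK ((1ul << ZAP_COUNT_BITS) - 1ul)

/* Value a zapped pte would carry: the count at zap time, truncated. */
static unsigned long zap_pte_stored_count(unsigned long map_and_pin_count)
{
    unsigned long stored = map_and_pin_count & ZAP_COUNT_MASK;

    assert(stored <= ZAP_COUNT_MASK);
    return stored;
}

/*
 * The [M3] test: treat the zapped pte as still mapped if it was zapped
 * at or below the (masked) count snapshot taken at [C6].
 */
static bool zap_counts_as_mapped(unsigned long map_count_at_zap,
                                 unsigned long map_and_pin_count)
{
    return map_count_at_zap <= (map_and_pin_count & ZAP_COUNT_MASK);
}
```

With a snapshot of 200 and a pte zapped later at 199, the zapped pte is
counted as expected. With a snapshot of 256 and a pte zapped later at
255, the masked snapshot is 0, the zapped pte is not counted, and the
map count comes out too low, which can only leave the pin flag set when
it did not need to be.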


---------------------------------------------------------------------


Hopefully this proves that this solution does work. The false positive
is something that i believe is acceptable. We will get them only when
there is a racing GUP or PUP. For a racing GUP it is safer to have a
false positive. For a racing PUP it would be nice to catch them but hey,
sometimes you just get unlucky.

Note that any other solution will also suffer from false positives,
because you are testing for the page pin status at a given point in
time, so it can always race with a PUP. So the only difference with any
other solution would be how long the false positive race window is.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-11 16:51                                                       ` Jerome Glisse
  2019-01-11 16:51                                                         ` Jerome Glisse
@ 2019-01-12  1:04                                                         ` John Hubbard
  2019-01-12  1:04                                                           ` John Hubbard
  2019-01-12  2:02                                                           ` Jerome Glisse
  1 sibling, 2 replies; 206+ messages in thread
From: John Hubbard @ 2019-01-12  1:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 8:51 AM, Jerome Glisse wrote:
> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> 
> [...]
> 
>>>>> Now page_mkclean:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>> +   int mapcount1, mapcount2;
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>> +   mapcount1 = page_mapcount(page);
>>>>>     // rmap_walk needs to change to count mappings and return the
>>>>>     // value in .mapcount (the easy one)
>>>>>     rmap_walk(page, &rwc);
>>>>
>>>> So what prevents GUP_fast() from grabbing a reference here, so that the test
>>>> below would think the page is not pinned? Or do you assume that every
>>>> page_mkclean() call will be protected by PageWriteback (currently it is not)
>>>> so that GUP_fast() blocks / bails out?
>>
>> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
>> for each page" question (ignoring, for now, what to actually *do* in response to 
>> that flag being set):
>>
>> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
>> This is probably less troubling than the next point, but it does undermine all the 
>> complicated schemes involving PageWriteback, that try to synchronize gup() with
>> page_mkclean().
>>
>> 2. Also, the mapcount approach here still does not reliably avoid false negatives
>> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
>> can always jump in and increment the mapcount, while page_mkclean is in the middle
>> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
>>
>> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> 
> Both points are addressed by the solution at the end of this email.
> 
>>>
>>> So GUP_fast() becomes:
>>>
>>> GUP_fast_existing() { ... }
>>> GUP_fast()
>>> {
>>>     GUP_fast_existing();
>>>
>>>     for (i = 0; i < npages; ++i) {
>>>         if (PageWriteback(pages[i])) {
>>>             // need to force slow path for this page
>>>         } else {
>>>             SetPageDmaPinned(pages[i]);
>>>             atomic_inc(&pages[i]->_mapcount);
>>>         }
>>>     }
>>> }
>>>
>>> This is a minor slowdown for GUP fast, and it takes care of a
>>> write back race on behalf of the caller. This means that page_mkclean
>>> can not see a mapcount value that increases. This simplifies things;
>>> we can relax that. Note that what this is doing is making sure
>>> that GUP_fast never gets lucky :) i.e. never GUPs a page that is in
>>> the process of being written back but has not yet had its pte
>>> updated to reflect that.
>>>
>>>
>>>> But I think that detecting pinned pages with small false positive rate is
>>>> OK. The extra page bouncing will cost some performance but if it is rare,
>>>> then we are OK. So I think we can go for the simple version of detecting
>>>> pinned pages as you mentioned in some earlier email. We just have to be
>>>> sure there are no false negatives.
>>>
>>
>> Agree with that sentiment, but there are still false negatives and I'm not
>> yet seeing any solutions for that.
> 
> So here is the solution:
> 
> 
> Is a page pinned? With no false negatives:
> ==========================================
> 
> get_user_pages*() aka GUP:
>      if (!PageAnon(page)) {
>         bool write_back = PageWriteback(page);
>         bool page_is_pin = PagePin(page);
>         if (write_back && !page_is_pin) {
>             /* Wait for write back and re-try GUP */
>             ...
>             goto retry;
>         }
> [G1]    smp_rmb();
> [G2]    atomic_inc(&page->_mapcount);
> [G3]    smp_wmb();
> [G4]    SetPagePin(page);
> [G5]    smp_wmb();
> [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
>             /* Back off as write back might have missed us */
>             atomic_dec(&page->_mapcount);
>             /* Wait for write back and re-try GUP */
>             ...
>             goto retry;
>         }
>      }
> 
> put_user_page() aka PUP:
> [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> [P2] put_page(page);
> 
> page_mkclean():
> [C1] pinned = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount);
> [C4] map_count = rmap_walk(page);
> [C5] if (pinned && map_count < map_and_pin_count) SetPagePin(page);
> 
> So with the above code we store the map and pin count inside the struct
> page _mapcount field. The idea is that we can count the number of page
> table entries that point to the page when reverse walking all the page
> mappings in page_mkclean() [C4].
> 
> The issue is that GUP, PUP and page table entry zapping can all run
> concurrently with page_mkclean(), and thus we can not get the real
> map and pin count and the real map count at a given point in time
> ([C5] for instance in the above). However, we only care about avoiding
> false negatives, i.e. we do not want to report a page as unpinned if in
> fact it is pinned (it has an active GUP). Avoiding false positives would
> be nice, but it would need more heavyweight synchronization within GUP
> and PUP (we can mitigate it, see the section on that below).
> 
> With the above scheme a page is _not_ pinned (unpinned) if and only if
> we have real_map_count == real_map_and_pin_count at a given point in
> time. In the above pseudo code the page is locked within page_mkclean(),
> thus no new page table entry can be added, and thus the number of page
> mappings can only go down (because of concurrent pte zapping). So no
> matter what happens, at [C5] we have map_count <= real_map_count.
> 
> At [C3] we have two cases to consider:
>  [R1] A concurrent GUP after [C3]: then we do not care what happens at
>       [C5], as the GUP would already have set the page pin flag. If it
>       raced before [C3], at [C1] with TestClearPagePin(), then we would
>       have the map_and_pin_count reflect the GUP thanks to the memory
>       barriers [G3] and [C2].
>  [R2] No concurrent GUP after [C3]: then we only have concurrent PUP to
>       worry about, and thus the real_map_and_pin_count can only go down.
>       So because we first snapshot that value at [C3], we have:
>       real_map_and_pin_count <= map_and_pin_count.
> 
>       So at [C5] we end up with map_count <= real_map_count and with
>       real_map_and_pin_count <= map_and_pin_count, but we also always have
>       real_map_count <= real_map_and_pin_count, so we have
>       a <= b <= c <= d, and if a == d then b == c. So at [C5],
>       if map_count == map_and_pin_count then we know for sure that we have
>       real_map_count == real_map_and_pin_count, and if that is the case
>       then the page is no longer pinned. So at [C5] we will never miss a
>       pinned page (no false negative).
> 
>       Another way to word this is that we always under-estimate the real
>       map count and over-estimate the map and pin count, and thus we can
>       never have a false negative (map count equal to map and pin count
>       while in fact the real map count is lower than the real map and pin
>       count).
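
A minimal user-space sketch of the [C5] decision and of the
a <= b <= c <= d argument above, in case it helps review. This is not
kernel code: both function names are hypothetical, and plain integers stand
in for map_count (a), real_map_count (b), real_map_and_pin_count (c) and
map_and_pin_count (d).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of [C5]: keep the pin flag unless the counts match. */
static bool c5_keeps_pin(bool pinned, int map_count, int map_and_pin_count)
{
    return pinned && map_count < map_and_pin_count;
}

/*
 * Exhaustively check every a <= b <= c <= d up to 'limit', where
 *   a = map_count,              b = real_map_count,
 *   c = real_map_and_pin_count, d = map_and_pin_count.
 * Returns false if [C5] would ever drop the flag while b != c, that is,
 * while the page still has a real pin (a false negative).
 */
static bool no_false_negative(int limit)
{
    assert(limit >= 0);
    for (int a = 0; a <= limit; a++)
        for (int b = a; b <= limit; b++)
            for (int c = b; c <= limit; c++)
                for (int d = c; d <= limit; d++)
                    if (!c5_keeps_pin(true, a, d) && b != c)
                        return false;
    return true;
}
```

The flag is dropped only when a == d, which forces b == c, so the
exhaustive check never finds a counterexample. The reverse is not true: the
flag can be kept while b == c, which is the false positive case.
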
> 
> 
> PageWriteback() test and ordering with page_mkclean()
> =====================================================
> 
> In GUP we test for the page write back flag to avoid pinning a page that
> is undergoing write back. That flag is set after page_mkclean(), so
> the filesystem code that will check for the pin flag needs some memory
> barrier:
>     int __test_set_page_writeback(struct page *page, bool keep_write,
> +                                 bool *use_bounce_page)
>     {
>         ...
>   [T1]  TestSetPageWriteback(page);
> + [T2]  smp_wmb();
> + [T3]  *use_bounce_page = PagePin(page);
>         ...
>     }
> 
> That way if there is a concurrent GUP we either have:
>     [R1] GUP sees the write back flag set before [G1], so it backs off.
>     [R2] GUP sees no write back before [G1]; here either GUP sees the
>          write back flag at [G6], or [T3] sees the pin flag thanks to
>          the memory barriers [G5] and [T2].
> 
> So in all cases we never miss a pin or a write back.
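
The argument above is a store-then-load pattern on both sides: GUP stores
the pin flag and then loads the write back flag ([G4]..[G6]), while the
writeback path stores the write back flag and then loads the pin flag
([T1]..[T3]). As far as I understand the kernel memory model, ordering a
store against a later load needs a full barrier (smp_mb()); smp_wmb() only
orders stores against other stores. So the sketch below, which is a
user-space model with C11 atomics and not kernel code, uses full fences on
both sides. struct model_page and all three functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two page flags involved. */
struct model_page {
    atomic_bool pin;
    atomic_bool writeback;
};

static void model_page_init(struct model_page *page)
{
    atomic_init(&page->pin, false);
    atomic_init(&page->writeback, false);
}

/* GUP side, [G4]..[G6]: returns true if the pin may be kept. */
static bool model_gup_pin(struct model_page *page)
{
    atomic_store_explicit(&page->pin, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);    /* full barrier */
    if (atomic_load_explicit(&page->writeback, memory_order_relaxed))
        return false;    /* back off; the caller waits and retries */
    return true;
}

/* Writeback side, [T1]..[T3]: returns true if a bounce page is needed. */
static bool model_start_writeback(struct model_page *page)
{
    atomic_store_explicit(&page->writeback, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);    /* full barrier */
    return atomic_load_explicit(&page->pin, memory_order_relaxed);
}
```

With a full fence between the store and the load on each side, at least one
of the two paths observes the other's flag, whichever way they interleave.
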
> 
> 
> Mitigate false positive:
> ========================
> 
> If false positives are ever an issue, we can improve the situation and
> properly account for concurrent pte zapping with the following changes:
> 
> page_mkclean():
> [C1] pinned = TestClearPagePin(page);
> [C2] smp_mb();
> [C3] map_and_pin_count = atomic_read(&page->_mapcount);
> [C4] map_count = rmap_walk(page, &page_mkclean_one());
> [C5] if (pinned && !PagePin(page) && map_count < map_and_pin_count) {
> [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount);
> [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
>      }
> 
> page_map_count():
> [M1] if (pte_valid(pte)) {
>         map_count++;
>      } else if (pte_special_zap(pte)) {
> [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
>      }
> 
> And pte zapping of a file-backed page will write a special pte entry which
> holds the page map and pin count value at the time the pte is zapped. Also,
> page_mkclean_one() unconditionally replaces those special ptes with pte
> none and ignores them altogether. We only want to detect pte zapping that
> happens after [C6] and before [C7] is done.
> 
> With [M3] we are counting all page table entries that have been zapped
> after the map_and_pin_count value we read at [C6]. Again we have two cases:
>  [R1] A concurrent GUP after [C6]: then we do not care what happens
>       at [C8], as the GUP would already have set the page pin flag.
>  [R2] No concurrent GUP: then we only have concurrent PUP to worry
>       about. If they happen before [C6], they are included in the [C6]
>       map_and_pin_count value. If after [C6], then we might miss a
>       page that is no longer pinned, i.e. we are over-estimating the
>       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
>       at [C8]). So no false negative, just false positive.
> 
> Here we just get the accurate real_map_count at [C6] time, so if the
> page was no longer pinned at [C6] time we will correctly detect it and
> not set the flag at [C8]. If there is any concurrent GUP, that GUP
> would set the flag properly.
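
A small user-space model of the [C7] walk with [M1]..[M3], taking the
pseudo code above at face value. The pte representation (enum
model_pte_kind, struct model_pte) and the function name are hypothetical;
real ptes and the special zap encoding would look nothing like this.

```c
#include <assert.h>
#include <stddef.h>

enum model_pte_kind {
    MODEL_PTE_NONE,
    MODEL_PTE_VALID,
    MODEL_PTE_SPECIAL_ZAP,
};

struct model_pte {
    enum model_pte_kind kind;
    unsigned long count_at_zap;    /* only used for MODEL_PTE_SPECIAL_ZAP */
};

/* Model of the [C7] walk: apply [M1]..[M3] to each pte and sum the result. */
static int model_page_map_count(const struct model_pte *ptes, size_t nr,
                                unsigned long map_and_pin_count,
                                unsigned long mask)
{
    int map_count = 0;

    assert(ptes != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (ptes[i].kind == MODEL_PTE_VALID)
            map_count++;                            /* [M1] */
        else if (ptes[i].kind == MODEL_PTE_SPECIAL_ZAP &&
                 ptes[i].count_at_zap <= (map_and_pin_count & mask))
            map_count++;                            /* [M3] */
    }
    return map_count;
}
```

For example, with one valid pte, one zap pte recorded at count 3, one zap
pte recorded at count 5 and a [C6] snapshot of 3, the walk counts the valid
pte and the first zap pte only.
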
> 
> There is one last thing to note about the above code: the MASK in [M3].
> For a special pte entry we might not have enough bits to store the
> whole map and pin count value (on 32-bit archs). So we might expose
> ourselves to wrap around. Again we do not care about the [R1] case, as
> any concurrent GUP will set the pin flag. So we only care if the only
> thing happening concurrently is either PUP or pte zapping. In both
> cases it means that the map and pin count is going down, so if there
> is a wrap around sometime within [C7]/page_map_count() we have:
>   [t0] page_map_count() executed on some pte
>   [t1] page_map_count() executed on another pte after [t0]
> With:
>     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> While in fact:
>     map_count_t0 > map_count_t1
> 
> So if that happens then we will under-estimate the map count, i.e. we
> will ignore some of the concurrent pte zapping and not count them.
> So again we are only exposing ourselves to false positives, not false
> negatives.
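
To make the wrap around concrete, here is a tiny sketch of the masked
comparison in [M3]. The 8-bit ZAP_COUNT_MASK and the function name are
assumptions for illustration only; the real width would depend on how many
bits a special pte can spare.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical width of the count stored in a special zap pte. */
#define ZAP_COUNT_MASK 0xffUL

/* Model of [M3]: does this zapped pte still count as a mapping? */
static bool zap_pte_counts(unsigned long map_count_at_zap,
                           unsigned long map_and_pin_count)
{
    /* The pte can only hold the low bits of the count at zap time. */
    return (map_count_at_zap & ZAP_COUNT_MASK) <=
           (map_and_pin_count & ZAP_COUNT_MASK);
}
```

With a [C6] snapshot of 0x101, a pte zapped when the count was 0xff is
stored as 0xff and compared against 0x01, so it is not counted even though
it was zapped after the snapshot. The map count is under-estimated and the
page stays flagged: a false positive, not a false negative.
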
> 
> 
> ---------------------------------------------------------------------
> 
> 
> Hope this proves that this solution does work. The false positives are
> something that I believe is acceptable. We will get them only when
> they are racing with GUP or PUP. For a racing GUP it is safer to have a
> false positive. For a racing PUP it would be nice to catch them, but hey,
> sometimes you just get unlucky.
> 
> Note that any other solution will also suffer from false positives,
> because you are testing for the page pin status at a given point in
> time, so it can always race with a PUP. So the only difference with
> any other solution would be how long the false positive race window is.
> 

Hi Jerome,

Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
in case anyone spots a disastrous conceptual error (such as the lock_page
point), while I'm putting together the revised patchset.

I've studied this carefully, and I agree that using mapcount in 
this way is viable, *as long* as we use a lock (or a construct that looks just 
like one: your "memory barrier, check, retry" is really just a lock) in
order to hold off gup() while page_mkclean() is in progress. In other words,
nothing that increments mapcount may proceed while page_mkclean() is running.

I especially am intrigued by your idea about a fuzzy count that allows
false positives but no false negatives. To do that, we need to put a hard
lock protecting the increment operation, but we can be loose (no lock) on
decrement. That turns out to be a perfect match for the problem here, because
as I recall from my earlier efforts, put_user_page() must *not* take locks--
and that's where we just decrement. Sweet! See below.

The other idea that you and Dan (and maybe others) pointed out was a debug
option, which we'll certainly need in order to safely convert all the call
sites. (Mirror the mappings at a different kernel offset, so that put_page()
and put_user_page() can verify that the right call was made.)  That will be
a separate patchset, as you recommended.

I'll even go as far as recommending the page lock itself. I realize that this 
adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
that this (below) has similar overhead to the notes above--but is *much* easier
to verify correct. (If the page lock is unacceptable due to being so widely used,
then I'd recommend using another page bit to do the same thing.)

(Note that memory barriers will simply be built into the various Set|Clear|Read
operations, as is common with a few other page flags.)

page_mkclean():
===============
lock_page()
    page_mkclean()
        Count actual mappings
            if(mappings == atomic_read(&page->_mapcount))
                ClearPageDmaPinned 

gup_fast():
===========
for each page {
    lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */

    atomic_inc(&page->_mapcount)
    SetPageDmaPinned()

    /* details of gup vs gup_fast not shown here... */
}


put_user_page():
================
    atomic_dec(&page->_mapcount); /* no locking! */
   

try_to_unmap() and other consumers of the PageDmaPinned flag:
=============================================================
lock_page() /* not required, but already done by existing callers */
    if(PageDmaPinned) {
        ...take appropriate action /* future patchsets */
    }

page freeing:
=============
ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */
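
To show the "lock on increment, no lock on decrement" idea end to end, here
is a rough user-space model of the outline above. It is only a sketch under
loud assumptions: struct model_page and every model_*() name are
hypothetical, a spinning atomic flag stands in for the page lock, and the
count starts at the number of mappings (the kernel's _mapcount starts at -1,
which is ignored here).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what the scheme needs. */
struct model_page {
    atomic_bool locked;             /* plays the role of the page lock */
    atomic_int map_and_pin_count;   /* plays the role of _mapcount */
    atomic_bool dma_pinned;         /* plays the role of PageDmaPinned */
};

static void model_page_init(struct model_page *page, int mappings)
{
    atomic_init(&page->locked, false);
    atomic_init(&page->map_and_pin_count, mappings);
    atomic_init(&page->dma_pinned, false);
}

static void model_lock_page(struct model_page *page)
{
    while (atomic_exchange(&page->locked, true))
        ;   /* spin; the real page lock sleeps instead */
}

static void model_unlock_page(struct model_page *page)
{
    atomic_store(&page->locked, false);
}

/* gup: the increment happens only under the lock. */
static void model_gup(struct model_page *page)
{
    model_lock_page(page);
    atomic_fetch_add(&page->map_and_pin_count, 1);
    atomic_store(&page->dma_pinned, true);
    model_unlock_page(page);
}

/* put_user_page: the decrement takes no lock. */
static void model_put_user_page(struct model_page *page)
{
    atomic_fetch_sub(&page->map_and_pin_count, 1);
}

/* page_mkclean: 'mappings' is what the rmap walk counted. */
static void model_page_mkclean(struct model_page *page, int mappings)
{
    model_lock_page(page);
    if (mappings == atomic_load(&page->map_and_pin_count))
        atomic_store(&page->dma_pinned, false);
    model_unlock_page(page);
}
```

After model_put_user_page() the flag stays set until the next
model_page_mkclean() runs, which is the tolerated false positive; the flag
is never cleared while a pin is still counted.
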



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  1:04                                                         ` John Hubbard
  2019-01-12  1:04                                                           ` John Hubbard
@ 2019-01-12  2:02                                                           ` Jerome Glisse
  2019-01-12  2:02                                                             ` Jerome Glisse
  2019-01-12  2:38                                                             ` John Hubbard
  1 sibling, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> > On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > 
> > [...]
> > 
> >>>>> Now page_mkclean:
> >>>>>
> >>>>> int page_mkclean(struct page *page)
> >>>>> {
> >>>>>     int cleaned = 0;
> >>>>> +   int real_mapcount = 0;
> >>>>>     struct address_space *mapping;
> >>>>>     struct rmap_walk_control rwc = {
> >>>>>         .arg = (void *)&cleaned,
> >>>>>         .rmap_one = page_mkclean_one,
> >>>>>         .invalid_vma = invalid_mkclean_vma,
> >>>>> +       .mapcount = &real_mapcount,
> >>>>>     };
> >>>>> +   int mapcount1, mapcount2;
> >>>>>
> >>>>>     BUG_ON(!PageLocked(page));
> >>>>>
> >>>>>     if (!page_mapped(page))
> >>>>>         return 0;
> >>>>>
> >>>>>     mapping = page_mapping(page);
> >>>>>     if (!mapping)
> >>>>>         return 0;
> >>>>>
> >>>>> +   mapcount1 = page_mapcount(page);
> >>>>>     // rmap_walk need to change to count mapping and return value
> >>>>>     // in .mapcount easy one
> >>>>>     rmap_walk(page, &rwc);
> >>>>
> >>>> So what prevents GUP_fast() to grab reference here and the test below would
> >>>> think the page is not pinned? Or do you assume that every page_mkclean()
> >>>> call will be protected by PageWriteback (currently it is not) so that
> >>>> GUP_fast() blocks / bails out?
> >>
> >> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> >> for each page" question (ignoring, for now, what to actually *do* in response to 
> >> that flag being set):
> >>
> >> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> >> This is probably less troubling than the next point, but it does undermine all the 
> >> complicated schemes involving PageWriteback, that try to synchronize gup() with
> >> page_mkclean().
> >>
> >> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> >> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> >> can always jump in and increment the mapcount, while page_mkclean is in the middle
> >> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> >>
> >> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> > 
> > Both point is address by the solution at the end of this email.
> > 
> >>>
> >>> So GUP_fast() becomes:
> >>>
> >>> GUP_fast_existing() { ... }
> >>> GUP_fast()
> >>> {
> >>>     GUP_fast_existing();
> >>>
> >>>     for (i = 0; i < npages; ++i) {
> >>>         if (PageWriteback(pages[i])) {
> >>>             // need to force slow path for this page
> >>>         } else {
> >>>             SetPageDmaPinned(pages[i]);
> >>>             atomic_inc(&pages[i]->_mapcount);
> >>>         }
> >>>     }
> >>> }
> >>>
> >>> This is a minor slow down for GUP fast and it takes care of a
> >>> write back race on behalf of caller. This means that page_mkclean
> >>> can not see a mapcount value that increase. This simplify thing
> >>> we can relax that. Note that what this is doing is making sure
> >>> that GUP_fast never get lucky :) ie never GUP a page that is in
> >>> the process of being write back but has not yet had its pte
> >>> updated to reflect that.
> >>>
> >>>
> >>>> But I think that detecting pinned pages with small false positive rate is
> >>>> OK. The extra page bouncing will cost some performance but if it is rare,
> >>>> then we are OK. So I think we can go for the simple version of detecting
> >>>> pinned pages as you mentioned in some earlier email. We just have to be
> >>>> sure there are no false negatives.
> >>>
> >>
> >> Agree with that sentiment, but there are still false negatives and I'm not
> >> yet seeing any solutions for that.
> > 
> > So here is the solution:
> > 
> > 
> > Is a page pin ? With no false negative:
> > =======================================
> > 
> > get_user_page*() aka GUP:
> >      if (!PageAnon(page)) {
> >         bool write_back = PageWriteback(page);
> >         bool page_is_pin = PagePin(page);
> >         if (write_back && !page_is_pin) {
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> > [G1]    smp_rmb();
> > [G2]    atomic_inc(&page->_mapcount)
> > [G3]    smp_wmb();
> > [G4]    SetPagePin(page);
> > [G5]    smp_wmb();
> > [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
> >             /* Back-off as write back might have miss us */
> >             atomic_dec(&page->_mapcount);
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> >      }
> > 
> > put_user_page() aka PUP:
> > [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> > [P2] put_page(page);
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page);
> > [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> > 
> > So with above code we store the map and pin count inside struct page
> > _mapcount field. The idea is that we can count the number of page
> > table entry that point to the page when reverse walking all the page
> > mapping in page_mkclean() [C4].
> > 
> > The issue is that GUP, PUP and page table entry zapping can all run
> > concurrently with page_mkclean() and thus we can not get the real
> > map and pin count and the real map count at a given point in time
> > ([C5] for instance in the above). However we only care about avoiding
> > false negative ie we do not want to report a page as unpin if in fact
> > it is pin (it has active GUP). Avoiding false positive would be nice
> > but it would need more heavy weight synchronization within GUP and
> > PUP (we can mitigate it see the section on that below).
> > 
> > With the above scheme a page is _not_ pin (unpin) if and only if we
> > have real_map_count == real_map_and_pin_count at a given point in
> > time. In the above pseudo code the page is lock within page_mkclean()
> > thus no new page table entry can be added and thus the number of page
> > mapping can only go down (because of conccurent pte zapping). So no
> > matter what happens at [C5] we have map_count <= real_map_count.
> > 
> > At [C3] we have two cases to consider:
> >  [R1] A concurrent GUP after [C3] then we do not care what happens at
> >       [C5] as the GUP would already have set the page pin flag. If it
> >       raced before [C3] at [C1] with TestClearPagePin() then we would
> >       have the map_and_pin_count reflect the GUP thanks to the memory
> >       barrier [G3] and [C2].
> >  [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
> >       worry about and thus the real_map_and_pin_count can only go down.
> >       So because we first snapshot that value at [C3] we have:
> >       real_map_and_pin_count <= map_and_pin_count.
> > 
> >       So at [C5] we end up with map_count <= real_map_count and with
> >       real_map_and_pin_count <= map_and_pin_count, but we also always
> >       have real_map_count <= real_map_and_pin_count, so we are in an
> >       a <= b <= c <= d scenario, and if a == d then b == c. So at [C5]
> >       if map_count == map_and_pin_count then we know for sure that we
> >       have real_map_count == real_map_and_pin_count, and if that is the
> >       case then the page is no longer pinned. So at [C5] we will never
> >       miss a pinned page (no false negative).
> > 
> >       Another way to word this is that we always under-estimate the real
> >       map count and over estimate the map and pin count and thus we can
> >       never have false negative (map count equal to map and pin count
> >       while in fact real map count is inferior to real map and pin count).
> > 
> > 
> > PageWriteback() test and ordering with page_mkclean()
> > =====================================================
> > 
> > In GUP we test for page write back flag to avoid pining a page that
> > is under going write back. That flag is set after page_mkclean() so
> > the filesystem code that will check for the pin flag need some memory
> > barrier:
> >     int __test_set_page_writeback(struct page *page, bool keep_write,
> > +                                 bool *use_bounce_page)
> >     {
> >         ...
> >   [T1]  TestSetPageWriteback(page);
> > + [T2]  smp_wmb();
> > + [T3]  *use_bounce_page = PagePin(page);
> >         ...
> >     }
> > 
> > That way if there is a concurrent GUP we either have:
> >     [R1] GUP sees the write back flag set before [G1] so it back-off
> >     [R2] GUP sees no write back before [G1] here either we have GUP
> >          that sees the write back flag at [G6] or [T3] that sees the
> >          pin flag thanks to the memory barrier [G5] and [T2].
> > 
> > So in all cases we never miss a pin or a write back.
> > 
> > 
> > Mitigate false positive:
> > ========================
> > 
> > If false positive is ever an issue we can improve the situation and to
> > properly account conccurent pte zapping with the following changes:
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page, &page_mkclean_one());
> > [C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
> > [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> > [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> > [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
> >      }
> > 
> > page_map_count():
> > [M1] if (pte_valid(pte)) { map_count++;
> >      } else if (pte_special_zap(pte)) {
> > [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> > [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
> >      }
> > 
> > And pte zapping of file back page will write a special pte entry which
> > has the page map and pin count value at the time the pte is zap. Also
> > page_mkclean_one() unconditionaly replace those special pte with pte
> > none and ignore them altogether. We only want to detect pte zapping that
> > happens after [C6] and before [C7] is done.
> > 
> > With [M3] we are counting all page table entry that have been zap after
> > the map_and_pin_count value we read at [C6]. Again we have two cases:
> >  [R1] A concurrent GUP after [C6] then we do not care what happens
> >       at [C8] as the GUP would already have set the page pin flag.
> >  [R2] No concurrent GUP then we only have concurrent PUP to worry
> >       about. If they happen before [C6] they are included in [C6]
> >       map_and_pin_count value. If after [C6] then we might miss a
> >       page that is no longer pin ie we are over estimating the
> >       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
> >       at [C8]). So no false negative just false positive.
> > 
> > Here we just get the accurate real_map_count at [C6] time so if the
> > page was no longer pin at [C6] time we will correctly detect it and
> > not set the flag at [C8]. If there is any concurrent GUP that GUP
> > would set the flag properly.
> > 
> > There is one last thing to note about above code, the MASK in [M3].
> > For special pte entry we might not have enough bits to store the
> > whole map and pin count value (on 32bits arch). So we might expose
> > ourself to wrap around. Again we do not care about [R1] case as any
> > concurrent GUP will set the pin flag. So we only care if the only
> > thing happening concurrently is either PUP or pte zapping. In both
> > case its means that the map and pin count is going down so if there
> > is a wrap around sometimes within [C7]/page_map_count() we have:
> >   [t0] page_map_count() executed on some pte
> >   [t1] page_map_count() executed on another pte after [t0]
> > With:
> >     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> > While in fact:
> >     map_count_t0 > map_count_t1
> > 
> > So if that happens then we will under-estimate the map count ie we
> > will ignore some of the concurrent pte zapping and not count them.
> > So again we are only exposing our self to false positive not false
> > negative.
> > 
> > 
> > ---------------------------------------------------------------------
> > 
> > 
> > Hopes this prove that this solution do work. The false positive is
> > something that i believe is acceptable. We will get them only when
> > they are racing GUP or PUP. For racing GUP it is safer to have false
> > positive. For racing PUP it would be nice to catch them but hey some
> > times you just get unlucky.
> > 
> > Note that any other solution will also suffer from false positive
> > situation because anyway you are testing for the page pin status
> > at a given point in time so it can always race with a PUP. So the
> > only difference with any other solution would be how long is the
> > false positive race window.
> > 
> 
> Hi Jerome,
> 
> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> in case anyone spots a disastrous conceptual error (such as the lock_page
> point), while I'm putting together the revised patchset.
> 
> I've studied this carefully, and I agree that using mapcount in 
> this way is viable, *as long* as we use a lock (or a construct that looks just 
> like one: your "memory barrier, check, retry" is really just a lock) in
> order to hold off gup() while page_mkclean() is in progress. In other words,
> nothing that increments mapcount may proceed while page_mkclean() is running.

No, increments to page->_mapcount are fine while page_mkclean() is running.
The above solution works no matter what happens, thanks to the memory
barriers. By clearing the pin flag first and reading page->_mapcount
afterwards (and doing the reverse in GUP), we know that a racing GUP will
either have its pin flag cleared but its mapcount increment taken into
account by page_mkclean(), or page_mkclean() will miss the mapcount
increment but will also not clear the pin flag set concurrently by the GUP.
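
To make the ordering concrete, here is a minimal userspace sketch of the
two sides, assuming C11 atomics as stand-ins for the kernel primitives.
Everything named here (fake_page, fake_page_init(), gup_pin(), pup_unpin(),
mkclean()) is hypothetical and not kernel code. The release fence only
approximates smp_wmb(), the rmap walk [C4] is replaced by a map_count
argument, and the counter starts at the number of mappings rather than
using the -1 bias of the real _mapcount.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct page fields the scheme uses. */
struct fake_page {
    atomic_int map_and_pin_count; /* models page->_mapcount (mappings + pins) */
    atomic_bool pin;              /* models the PagePin flag */
};

static void fake_page_init(struct fake_page *p, int map_count)
{
    atomic_init(&p->map_and_pin_count, map_count);
    atomic_init(&p->pin, false);
}

/* GUP side: increment the count first, publish the flag second. */
static void gup_pin(struct fake_page *p)
{
    atomic_fetch_add_explicit(&p->map_and_pin_count, 1,
                              memory_order_relaxed);            /* [G2] */
    atomic_thread_fence(memory_order_release);                  /* [G3] */
    atomic_store_explicit(&p->pin, true, memory_order_relaxed); /* [G4] */
}

/* PUP side: a bare decrement, no lock and no flag update. */
static void pup_unpin(struct fake_page *p)
{
    atomic_fetch_sub_explicit(&p->map_and_pin_count, 1,
                              memory_order_relaxed);            /* [P1] */
}

/*
 * page_mkclean() side: clear the flag first, read the count second.
 * map_count is what the rmap walk [C4] would have returned.
 * Returns the value of the pin flag once the pass is done.
 */
static bool mkclean(struct fake_page *p, int map_count)
{
    bool pinned;
    int snap;

    assert(map_count >= 0);
    pinned = atomic_exchange_explicit(&p->pin, false,
                                      memory_order_relaxed);    /* [C1] */
    atomic_thread_fence(memory_order_seq_cst);                  /* [C2] */
    snap = atomic_load_explicit(&p->map_and_pin_count,
                                memory_order_relaxed);          /* [C3] */
    if (pinned && map_count < snap)                             /* [C5] */
        atomic_store_explicit(&p->pin, true, memory_order_relaxed);
    return atomic_load_explicit(&p->pin, memory_order_relaxed);
}
```

With one mapping and one pin, mkclean(&p, 1) keeps the flag set; after
pup_unpin() the next mkclean(&p, 1) sees map_count == snapshot and leaves
the flag cleared.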

Here are all the possible timelines:
[T1]:
GUP on CPU0                       | page_mkclean() on CPU1
                                  |
[G2] atomic_inc(&page->_mapcount) |
[G3] smp_wmb();                   |
[G4] SetPagePin(page);            |
                                 ...
                                  | [C1] pined = TestClearPagePin(page);
                                  | [C2] smp_mb();
                                  | [C3] map_and_pin_count =
                                  |        atomic_read(&page->_mapcount)

This is fine because page_mkclean() will read the correct page->_mapcount,
which includes the GUP that happened before [C1].


[T2]:
GUP on CPU0                       | page_mkclean() on CPU1
                                  |
                                  | [C1] pined = TestClearPagePin(page);
                                  | [C2] smp_mb();
                                  | [C3] map_and_pin_count =
                                  |        atomic_read(&page->_mapcount)
                                 ...
[G2] atomic_inc(&page->_mapcount) |
[G3] smp_wmb();                   |
[G4] SetPagePin(page);            |

This is fine because [G4] sets the pin flag, so it does not matter that
[C3] missed the mapcount increase from the GUP.


[T3]:
GUP on CPU0                      | page_mkclean() on CPU1
[G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);

Whichever CPU wins the race, we get one of two orderings:
    - [G4] is overwritten by [C1]: in that case [C3] will see the mapcount
      that was incremented by [G2], so we will have map_count <
      map_and_pin_count and we will set the pin flag again at the end of
      page_mkclean().
    - [C1] is overwritten by [G4]: in that case the pin flag is set, and thus
      it does not matter whether [C3] also sees the mapcount that was
      incremented by [G2].


This is totally race free, i.e. at the end of page_mkclean() the pin flag
will be set for all pages that are pinned and for some pages that are no
longer pinned. What matters is that there are no false negatives.
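
The claim can be checked mechanically on a small model. The sketch below
is my own illustration, not kernel code: it assumes sequentially consistent
execution (which is what the barriers are meant to give for these
accesses), a single pte mapping the page, and no concurrent PUP or pte
zapping. It enumerates every interleaving of the GUP steps [G2], [G4] with
the page_mkclean() steps [C1], [C3], [C5] and counts the schedules that
leave the pinned page without its flag. The flag_first switch is a
hypothetical broken variant that sets the flag before incrementing, added
only to show why the order in GUP matters. All names are made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>

struct model {
    int map_and_pin_count; /* models page->_mapcount */
    bool pin;              /* models the PagePin flag */
    bool pined;            /* page_mkclean() local set at [C1] */
    int snap;              /* page_mkclean() local set at [C3] */
};

enum { MAP_COUNT = 1 };    /* one pte maps the page; [C4] always finds it */

/* GUP runs step 0 then step 1; flag_first swaps [G2] and [G4]. */
static void gup_step(struct model *m, int step, bool flag_first)
{
    bool set_flag = flag_first ? (step == 0) : (step == 1);

    if (set_flag)
        m->pin = true;             /* [G4] */
    else
        m->map_and_pin_count++;    /* [G2] */
}

static void mkclean_step(struct model *m, int step)
{
    switch (step) {
    case 0:                        /* [C1] TestClearPagePin() */
        m->pined = m->pin;
        m->pin = false;
        break;
    case 1:                        /* [C3] snapshot the count */
        m->snap = m->map_and_pin_count;
        break;
    default:                       /* [C5] restore the flag if needed */
        if (m->pined && MAP_COUNT < m->snap)
            m->pin = true;
        break;
    }
}

/* Bit i of mask set means slot i of the schedule runs a GUP step. */
static bool run_schedule(unsigned mask, bool flag_first)
{
    struct model m = { MAP_COUNT, false, false, 0 };
    int g = 0, c = 0;

    for (int slot = 0; slot < 5; slot++) {
        if (mask & (1u << slot))
            gup_step(&m, g++, flag_first);
        else
            mkclean_step(&m, c++);
    }
    assert(g == 2 && c == 3);
    return m.pin;
}

static int bits_set(unsigned mask)
{
    int n = 0;

    for (; mask; mask >>= 1)
        n += (int)(mask & 1u);
    return n;
}

/* Count schedules after which the pinned page has no pin flag. */
static int count_false_negatives(bool flag_first)
{
    int missed = 0;

    for (unsigned mask = 0; mask < 32; mask++) {
        if (bits_set(mask) != 2)
            continue;              /* need exactly two GUP slots */
        if (!run_schedule(mask, flag_first))
            missed++;
    }
    return missed;
}
```

count_false_negatives(false) covers [T1], [T2] and [T3] above plus the
remaining orderings; calling it with true exercises the swapped variant.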


> I especially am intrigued by your idea about a fuzzy count that allows
> false positives but no false negatives. To do that, we need to put a hard
> lock protecting the increment operation, but we can be loose (no lock) on
> decrement. That turns out to be a perfect match for the problem here, because
> as I recall from my earlier efforts, put_user_page() must *not* take locks--
> and that's where we just decrement. Sweet! See below.

You do not need a lock. Locks are easier to reason about, but they are not
always necessary, and in this case we do not need any. We can happily have
any number of concurrent GUP, PUP or pte zapping. The worst case is a false
positive, i.e. reporting a page as pinned when it has just been unpinned
concurrently by a PUP.
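
A small deterministic sketch of that worst case, under the same modelling
assumptions as the pseudo code (plain integers instead of atomics, one pte
mapping the page, hypothetical names throughout): a PUP slips in between
[C3] and [C5], so the pass reports the page as pinned even though the pin
is already gone. In this model the next quiet pass then clears the flag.

```c
#include <assert.h>
#include <stdbool.h>

struct model {
    int map_and_pin_count; /* models page->_mapcount */
    bool pin;              /* models the PagePin flag */
};

struct mkclean_ctx {
    bool pined;            /* result of [C1] */
    int snap;              /* result of [C3] */
};

/* First half of page_mkclean(): [C1] then [C3]. */
static void mkclean_begin(struct model *m, struct mkclean_ctx *ctx)
{
    ctx->pined = m->pin;
    m->pin = false;
    ctx->snap = m->map_and_pin_count;
}

/* Second half: [C5], with map_count standing in for the rmap walk [C4]. */
static void mkclean_finish(struct model *m, const struct mkclean_ctx *ctx,
                           int map_count)
{
    if (ctx->pined && map_count < ctx->snap)
        m->pin = true;
}

static void pup(struct model *m)
{
    m->map_and_pin_count--;    /* [P1], no lock taken */
}

/*
 * Run one pass with a PUP racing between [C3] and [C5], followed by
 * extra_passes passes with nothing racing. Returns the final flag.
 */
static bool flag_after_racing_pup(int extra_passes)
{
    struct model m = { 2, true };  /* one pte + one pin, flag already set */
    struct mkclean_ctx ctx;

    assert(extra_passes >= 0);
    mkclean_begin(&m, &ctx);
    pup(&m);                       /* the pin goes away mid-pass */
    mkclean_finish(&m, &ctx, 1);   /* stale snapshot: flag is set again */

    while (extra_passes-- > 0) {
        mkclean_begin(&m, &ctx);
        mkclean_finish(&m, &ctx, 1);
    }
    return m.pin;
}
```

The false positive lasts at most until the next page_mkclean() pass in
this model, which is the race window mentioned earlier.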

> The other idea that you and Dan (and maybe others) pointed out was a debug
> option, which we'll certainly need in order to safely convert all the call
> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> and put_user_page() can verify that the right call was made.)  That will be
> a separate patchset, as you recommended.
> 
> I'll even go as far as recommending the page lock itself. I realize that this 
> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> that this (below) has similar overhead to the notes above--but is *much* easier
> to verify correct. (If the page lock is unacceptable due to being so widely used,
> then I'd recommend using another page bit to do the same thing.)

Please, the page lock is pointless and it will not work for GUP fast. The
above scheme does work and is fine. I spent the day thinking again about
all the memory ordering and I do not see any issues.


> (Note that memory barriers will simply be built into the various Set|Clear|Read
> operations, as is common with a few other page flags.)
> 
> page_mkclean():
> ===============
> lock_page()
>     page_mkclean()
>         Count actual mappings
>             if(mappings == atomic_read(&page->_mapcount))
>                 ClearPageDmaPinned 
> 
> gup_fast():
> ===========
> for each page {
>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
> 
>     atomic_inc(&page->_mapcount)
>     SetPageDmaPinned()
> 
>     /* details of gup vs gup_fast not shown here... */
> 
> 
> put_user_page():
> ================
>     atomic_dec(&page->_mapcount); /* no locking! */
>    
> 
> try_to_unmap() and other consumers of the PageDmaPinned flag:
> =============================================================
> lock_page() /* not required, but already done by existing callers */
>     if(PageDmaPinned) {
>         ...take appropriate action /* future patchsets */

We can not block try_to_unmap() on pinned pages. What we want to block is
the fs using a different page for the same file offset at which the
original page was pinned (modulo truncate, which we should not block).
Everything else must keep working as if there was no pin. We can not fix
that: drivers doing long term GUP and not abiding by mmu notifiers are
hopelessly broken in the face of many regular syscalls (mremap, truncate,
splice, ...). We can not block those syscalls or fail them; doing so would
mean breaking applications in a bad way.

The only thing we should do is avoid fs corruption and bugs due to
dirtying a page after the fs believes it has been cleaned.
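
As a rough userspace illustration of how a writeback path might act on the
pin flag, the sketch below copies a pinned page into a bounce buffer so
that the data handed to I/O stays stable even if a device keeps writing to
the original. This is only one reading of the use_bounce_page idea quoted
above; fake_page, FAKE_PAGE_SIZE and writeback_source() are hypothetical
names, and no real filesystem interface is implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { FAKE_PAGE_SIZE = 16 };      /* tiny page, just for the example */

struct fake_page {
    bool pin;                              /* models the PagePin flag */
    unsigned char data[FAKE_PAGE_SIZE];    /* page contents */
};

/*
 * Pick the buffer that writeback should submit for I/O.
 * Unpinned page: write straight from the page.
 * Pinned page: snapshot it into the caller's bounce buffer first.
 */
static const unsigned char *writeback_source(const struct fake_page *page,
                                             unsigned char *bounce)
{
    assert(bounce != NULL);
    if (!page->pin)
        return page->data;
    memcpy(bounce, page->data, FAKE_PAGE_SIZE);
    return bounce;
}
```

With a pinned page, a later store into page.data (standing in for device
DMA) no longer changes what the returned pointer refers to.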


> page freeing:
> ============
> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */

Yeah, this needs to happen when we sanitize the flags of a freed page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:02                                                           ` Jerome Glisse
@ 2019-01-12  2:02                                                             ` Jerome Glisse
  2019-01-12  2:38                                                             ` John Hubbard
  1 sibling, 0 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> > On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > 
> > [...]
> > 
> >>>>> Now page_mkclean:
> >>>>>
> >>>>> int page_mkclean(struct page *page)
> >>>>> {
> >>>>>     int cleaned = 0;
> >>>>> +   int real_mapcount = 0;
> >>>>>     struct address_space *mapping;
> >>>>>     struct rmap_walk_control rwc = {
> >>>>>         .arg = (void *)&cleaned,
> >>>>>         .rmap_one = page_mkclean_one,
> >>>>>         .invalid_vma = invalid_mkclean_vma,
> >>>>> +       .mapcount = &real_mapcount,
> >>>>>     };
> >>>>> +   int mapcount1, mapcount2;
> >>>>>
> >>>>>     BUG_ON(!PageLocked(page));
> >>>>>
> >>>>>     if (!page_mapped(page))
> >>>>>         return 0;
> >>>>>
> >>>>>     mapping = page_mapping(page);
> >>>>>     if (!mapping)
> >>>>>         return 0;
> >>>>>
> >>>>> +   mapcount1 = page_mapcount(page);
> >>>>>     // rmap_walk need to change to count mapping and return value
> >>>>>     // in .mapcount easy one
> >>>>>     rmap_walk(page, &rwc);
> >>>>
> >>>> So what prevents GUP_fast() to grab reference here and the test below would
> >>>> think the page is not pinned? Or do you assume that every page_mkclean()
> >>>> call will be protected by PageWriteback (currently it is not) so that
> >>>> GUP_fast() blocks / bails out?
> >>
> >> Continuing this thread, still focusing only on the "how to maintain a PageDmaPinned
> >> for each page" question (ignoring, for now, what to actually *do* in response to 
> >> that flag being set):
> >>
> >> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is happening".
> >> This is probably less troubling than the next point, but it does undermine all the 
> >> complicated schemes involving PageWriteback, that try to synchronize gup() with
> >> page_mkclean().
> >>
> >> 2. Also, the mapcount approach here still does not reliably avoid false negatives
> >> (that is, a page may have been gup'd, but page_mkclean could miss that): gup()
> >> can always jump in and increment the mapcount, while page_mkclean is in the middle
> >> of making (wrong) decisions based on that mapcount. There's no lock to prevent that.
> >>
> >> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.
> > 
> > Both point is address by the solution at the end of this email.
> > 
> >>>
> >>> So GUP_fast() becomes:
> >>>
> >>> GUP_fast_existing() { ... }
> >>> GUP_fast()
> >>> {
> >>>     GUP_fast_existing();
> >>>
> >>>     for (i = 0; i < npages; ++i) {
> >>>         if (PageWriteback(pages[i])) {
> >>>             // need to force slow path for this page
> >>>         } else {
> >>>             SetPageDmaPinned(pages[i]);
> >>>             atomic_inc(&pages[i]->_mapcount);
> >>>         }
> >>>     }
> >>> }
> >>>
> >>> This is a minor slow down for GUP fast and it takes care of a
> >>> write back race on behalf of caller. This means that page_mkclean
> >>> can not see a mapcount value that increase. This simplify thing
> >>> we can relax that. Note that what this is doing is making sure
> >>> that GUP_fast never get lucky :) ie never GUP a page that is in
> >>> the process of being write back but has not yet had its pte
> >>> updated to reflect that.
> >>>
> >>>
> >>>> But I think that detecting pinned pages with small false positive rate is
> >>>> OK. The extra page bouncing will cost some performance but if it is rare,
> >>>> then we are OK. So I think we can go for the simple version of detecting
> >>>> pinned pages as you mentioned in some earlier email. We just have to be
> >>>> sure there are no false negatives.
> >>>
> >>
> >> Agree with that sentiment, but there are still false negatives and I'm not
> >> yet seeing any solutions for that.
> > 
> > So here is the solution:
> > 
> > 
> > Is a page pin ? With no false negative:
> > =======================================
> > 
> > get_user_page*() aka GUP:
> >      if (!PageAnon(page)) {
> >         bool write_back = PageWriteback(page);
> >         bool page_is_pin = PagePin(page);
> >         if (write_back && !page_is_pin) {
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> > [G1]    smp_rmb();
> > [G2]    atomic_inc(&page->_mapcount)
> > [G3]    smp_wmb();
> > [G4]    SetPagePin(page);
> > [G5]    smp_wmb();
> > [G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
> >             /* Back-off as write back might have miss us */
> >             atomic_dec(&page->_mapcount);
> >             /* Wait for write back a re-try GUP */
> >             ...
> >             goto retry;
> >         }
> >      }
> > 
> > put_user_page() aka PUP:
> > [P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
> > [P2] put_page(page);
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page);
> > [C5] if (pined && map_count < map_and_pin_count) SetPagePin(page);
> > 
> > So with above code we store the map and pin count inside struct page
> > _mapcount field. The idea is that we can count the number of page
> > table entry that point to the page when reverse walking all the page
> > mapping in page_mkclean() [C4].
> > 
> > The issue is that GUP, PUP and page table entry zapping can all run
> > concurrently with page_mkclean() and thus we can not get the real
> > map and pin count and the real map count at a given point in time
> > ([C5] for instance in the above). However we only care about avoiding
> > false negative ie we do not want to report a page as unpin if in fact
> > it is pin (it has active GUP). Avoiding false positive would be nice
> > but it would need more heavy weight synchronization within GUP and
> > PUP (we can mitigate it see the section on that below).
> > 
> > With the above scheme a page is _not_ pin (unpin) if and only if we
> > have real_map_count == real_map_and_pin_count at a given point in
> > time. In the above pseudo code the page is lock within page_mkclean()
> > thus no new page table entry can be added and thus the number of page
> > mapping can only go down (because of conccurent pte zapping). So no
> > matter what happens at [C5] we have map_count <= real_map_count.
> > 
> > At [C3] we have two cases to consider:
> >  [R1] A concurrent GUP after [C3] then we do not care what happens at
> >       [C5] as the GUP would already have set the page pin flag. If it
> >       raced before [C3] at [C1] with TestClearPagePin() then we would
> >       have the map_and_pin_count reflect the GUP thanks to the memory
> >       barrier [G3] and [C2].
> >  [R2] No concurrent GUP after [C3] then we only have concurrent PUP to
> >       worry about and thus the real_map_and_pin_count can only go down.
> >       So because we first snapshot that value at [C3] we have:
> >       real_map_and_pin_count <= map_and_pin_count.
> > 
> >       So at [C5] we end up with map_count <= real_map_count and with
> >       real_map_and_pin_count <= map_and_pin_count, but we also always
> >       have real_map_count <= real_map_and_pin_count, so we are in an
> >       a <= b <= c <= d scenario, and if a == d then b == c. So at [C5]
> >       if map_count == map_and_pin_count then we know for sure that we
> >       have real_map_count == real_map_and_pin_count, and if that is the
> >       case then the page is no longer pinned. So at [C5] we will never
> >       miss a pinned page (no false negative).
> > 
> >       Another way to word this is that we always under-estimate the real
> >       map count and over estimate the map and pin count and thus we can
> >       never have false negative (map count equal to map and pin count
> >       while in fact real map count is inferior to real map and pin count).
> > 
> > 
> > PageWriteback() test and ordering with page_mkclean()
> > =====================================================
> > 
> > In GUP we test for page write back flag to avoid pining a page that
> > is under going write back. That flag is set after page_mkclean() so
> > the filesystem code that will check for the pin flag need some memory
> > barrier:
> >     int __test_set_page_writeback(struct page *page, bool keep_write,
> > +                                 bool *use_bounce_page)
> >     {
> >         ...
> >   [T1]  TestSetPageWriteback(page);
> > + [T2]  smp_wmb();
> > + [T3]  *use_bounce_page = PagePin(page);
> >         ...
> >     }
> > 
> > That way if there is a concurrent GUP we either have:
> >     [R1] GUP sees the write back flag set before [G1] so it back-off
> >     [R2] GUP sees no write back before [G1] here either we have GUP
> >          that sees the write back flag at [G6] or [T3] that sees the
> >          pin flag thanks to the memory barrier [G5] and [T2].
> > 
> > So in all cases we never miss a pin or a write back.
> > 
> > 
> > Mitigate false positive:
> > ========================
> > 
> > If false positive is ever an issue we can improve the situation and to
> > properly account conccurent pte zapping with the following changes:
> > 
> > page_mkclean():
> > [C1] pined = TestClearPagePin(page);
> > [C2] smp_mb();
> > [C3] map_and_pin_count = atomic_read(&page->_mapcount)
> > [C4] map_count = rmap_walk(page, &page_mkclean_one());
> > [C5] if (pined && !PagePin(page) && map_count < map_and_pin_count) {
> > [C6]    map_and_pin_count2 = atomic_read(&page->_mapcount)
> > [C7]    map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
> > [C8]    if (map_count < map_and_pin_count2) SetPagePin(page);
> >      }
> > 
> > page_map_count():
> > [M1] if (pte_valid(pte)) { map_count++;
> >      } else if (pte_special_zap(pte)) {
> > [M2]    unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
> > [M3]    if (map_count_at_zap <= (map_and_pin_count & MASK)) map_count++;
> >      }
> > 
> > And pte zapping of file back page will write a special pte entry which
> > has the page map and pin count value at the time the pte is zap. Also
> > page_mkclean_one() unconditionaly replace those special pte with pte
> > none and ignore them altogether. We only want to detect pte zapping that
> > happens after [C6] and before [C7] is done.
> > 
> > With [M3] we are counting all page table entry that have been zap after
> > the map_and_pin_count value we read at [C6]. Again we have two cases:
> >  [R1] A concurrent GUP after [C6] then we do not care what happens
> >       at [C8] as the GUP would already have set the page pin flag.
> >  [R2] No concurrent GUP then we only have concurrent PUP to worry
> >       about. If they happen before [C6] they are included in [C6]
> >       map_and_pin_count value. If after [C6] then we might miss a
> >       page that is no longer pin ie we are over estimating the
> >       map_and_pin_count (real_map_and_pin_count < map_and_pin_count
> >       at [C8]). So no false negative just false positive.
> > 
> > Here we just get the accurate real_map_count at [C6] time so if the
> > page was no longer pin at [C6] time we will correctly detect it and
> > not set the flag at [C8]. If there is any concurrent GUP that GUP
> > would set the flag properly.
> > 
> > There is one last thing to note about above code, the MASK in [M3].
> > For special pte entry we might not have enough bits to store the
> > whole map and pin count value (on 32bits arch). So we might expose
> > ourself to wrap around. Again we do not care about [R1] case as any
> > concurrent GUP will set the pin flag. So we only care if the only
> > thing happening concurrently is either PUP or pte zapping. In both
> > case its means that the map and pin count is going down so if there
> > is a wrap around sometimes within [C7]/page_map_count() we have:
> >   [t0] page_map_count() executed on some pte
> >   [t1] page_map_count() executed on another pte after [t0]
> > With:
> >     (map_count_t0 & MASK) < (map_count_t1 & MASK)
> > While in fact:
> >     map_count_t0 > map_count_t1
> > 
> > So if that happens then we will under-estimate the map count ie we
> > will ignore some of the concurrent pte zapping and not count them.
> > So again we are only exposing our self to false positive not false
> > negative.
> > 
> > 
> > ---------------------------------------------------------------------
> > 
> > 
> > Hopes this prove that this solution do work. The false positive is
> > something that i believe is acceptable. We will get them only when
> > they are racing GUP or PUP. For racing GUP it is safer to have false
> > positive. For racing PUP it would be nice to catch them but hey some
> > times you just get unlucky.
> > 
> > Note that any other solution will also suffer from false positive
> > situation because anyway you are testing for the page pin status
> > at a given point in time so it can always race with a PUP. So the
> > only difference with any other solution would be how long is the
> > false positive race window.
> > 
> 
> Hi Jerome,
> 
> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> in case anyone spots a disastrous conceptual error (such as the lock_page
> point), while I'm putting together the revised patchset.
> 
> I've studied this carefully, and I agree that using mapcount in 
> this way is viable, *as long* as we use a lock (or a construct that looks just 
> like one: your "memory barrier, check, retry" is really just a lock) in
> order to hold off gup() while page_mkclean() is in progress. In other words,
> nothing that increments mapcount may proceed while page_mkclean() is running.

No, increments to page->_mapcount are fine while page_mkclean() is running.
The above solution works no matter what happens, thanks to the memory
barrier. By clearing the pin flag first and reading page->_mapcount
afterwards (and doing the reverse in GUP), we know that a racing GUP will
either have its pin flag cleared but its incremented mapcount taken into
account by page_mkclean(), or page_mkclean() will miss the incremented
mapcount but will also not clear the pin flag set concurrently by that GUP.

Here are all the possible timelines:
[T1]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |
                                ...
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)

This is fine because page_mkclean() will read the correct page->mapcount,
which includes the GUP that happened before [C1].


[T2]:
GUP on CPU0                      | page_mkclean() on CPU1
                                 |
                                 | [C1] pined = TestClearPagePin(page);
                                 | [C2] smp_mb();
                                 | [C3] map_and_pin_count =
                                 |        atomic_read(&page->mapcount)
                                ...
[G2] atomic_inc(&page->mapcount) |
[G3] smp_wmb();                  |
[G4] SetPagePin(page);           |

This is fine because [G4] sets the pin flag, so it does not matter that [C3]
missed the mapcount increase from the GUP.


[T3]:
GUP on CPU0                      | page_mkclean() on CPU1
[G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);

Either CPU ordering is fine:
    - [G4] is overwritten by [C1]: in that case [C3] will see the mapcount
      that was incremented by [G2], so we will have
      map_count < map_and_pin_count and we will set the pin flag again at
      the end of page_mkclean().
    - [C1] is overwritten by [G4]: in that case the pin flag is set, and thus
      it does not matter that [C3] also sees the mapcount that was
      incremented by [G2].


This is totally race free, i.e. at the end of page_mkclean() the pin flag
will be set for all pages that are pinned and for some pages that are no
longer pinned. What matters is that there are no false negatives.
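
To make the ordering argument concrete, here is a minimal user-space sketch
of the scheme in C11 atomics. It is only a model under stated assumptions,
not kernel code: struct page_model and the model_*() helpers are
hypothetical names, the rmap walk result is passed in as map_count, a full
fence stands in for the smp_wmb() at [G3] (stronger than what the timelines
use), and the count starts at the number of mappings rather than following
the kernel's _mapcount bias.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not a kernel type. */
struct page_model {
    atomic_int  mapcount; /* mappings + pins, as in the proposed scheme */
    atomic_bool pin;      /* stands in for the proposed pin page flag */
};

static void model_init(struct page_model *p, int mappings)
{
    atomic_init(&p->mapcount, mappings);
    atomic_init(&p->pin, false);
}

/* GUP side: increment first, then barrier, then set the flag. */
static void model_gup(struct page_model *p)
{
    atomic_fetch_add_explicit(&p->mapcount, 1, memory_order_relaxed); /* [G2] */
    atomic_thread_fence(memory_order_seq_cst);                        /* [G3] */
    atomic_store_explicit(&p->pin, true, memory_order_relaxed);       /* [G4] */
}

/* PUP side: lockless decrement, never touches the flag. */
static void model_pup(struct page_model *p)
{
    int prev = atomic_fetch_sub_explicit(&p->mapcount, 1, memory_order_relaxed);
    assert(prev > 0); /* an unbalanced PUP would be a caller bug */
    (void)prev;
}

/*
 * page_mkclean() side: the reverse order of GUP. map_count is what the
 * rmap walk would have counted (real mappings only). Returns the pin flag
 * as seen at the end of the function.
 */
static bool model_page_mkclean(struct page_model *p, int map_count)
{
    bool pinned = atomic_exchange_explicit(&p->pin, false,
                                           memory_order_relaxed);     /* [C1] */
    atomic_thread_fence(memory_order_seq_cst);                        /* [C2] */
    int map_and_pin_count =
        atomic_load_explicit(&p->mapcount, memory_order_relaxed);     /* [C3] */

    /* More references than mappings: at least one pin may remain. */
    if (pinned && map_count < map_and_pin_count)
        atomic_store_explicit(&p->pin, true, memory_order_relaxed);

    return atomic_load_explicit(&p->pin, memory_order_relaxed);
}
```

Run sequentially with one mapping, model_gup() followed by
model_page_mkclean(&p, 1) leaves the flag set ([T1]); after a matching
model_pup() the next model_page_mkclean() clears it; and a model_gup() that
runs after model_page_mkclean() simply sets the flag itself ([T2]).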


> I especially am intrigued by your idea about a fuzzy count that allows
> false positives but no false negatives. To do that, we need to put a hard
> lock protecting the increment operation, but we can be loose (no lock) on
> decrement. That turns out to be a perfect match for the problem here, because
> as I recall from my earlier efforts, put_user_page() must *not* take locks--
> and that's where we just decrement. Sweet! See below.

You do not need a lock. Locks are easier to reason about, but they are not
always necessary, and in this case we do not need any. We can happily have
any number of concurrent GUP, PUP or pte zapping. The worst case is a false
positive, i.e. reporting a page as pinned when it has just been unpinned
concurrently by a PUP.

> The other idea that you and Dan (and maybe others) pointed out was a debug
> option, which we'll certainly need in order to safely convert all the call
> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> and put_user_page() can verify that the right call was made.)  That will be
> a separate patchset, as you recommended.
> 
> I'll even go as far as recommending the page lock itself. I realize that this 
> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> that this (below) has similar overhead to the notes above--but is *much* easier
> to verify correct. (If the page lock is unacceptable due to being so widely used,
> then I'd recommend using another page bit to do the same thing.)

Please, the page lock is pointless and it will not work for GUP fast. The
above scheme does work and is fine. I spent the day thinking again about all
the memory ordering and I do not see any issues.


> (Note that memory barriers will simply be built into the various Set|Clear|Read
> operations, as is common with a few other page flags.)
> 
> page_mkclean():
> ===============
> lock_page()
>     page_mkclean()
>         Count actual mappings
>             if(mappings == atomic_read(&page->_mapcount))
>                 ClearPageDmaPinned 
> 
> gup_fast():
> ===========
> for each page {
>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
> 
>     atomic_inc(&page->_mapcount)
>     SetPageDmaPinned()
> 
>     /* details of gup vs gup_fast not shown here... */
> 
> 
> put_user_page():
> ================
>     atomic_dec(&page->_mapcount); /* no locking! */
>    
> 
> try_to_unmap() and other consumers of the PageDmaPinned flag:
> =============================================================
> lock_page() /* not required, but already done by existing callers */
>     if(PageDmaPinned) {
>         ...take appropriate action /* future patchsets */

We cannot block try_to_unmap() on a pinned page. What we want to block is
the fs using a different page for the same file offset at which the
original page was pinned (modulo truncate, which we should not block).
Everything else must keep working as if there were no pin. We cannot fix
that: drivers doing long-term GUP and not abiding by mmu notifiers are
hopelessly broken in the face of many regular syscalls (mremap, truncate,
splice, ...). We cannot block those syscalls or fail them; doing so would
mean breaking applications in a bad way.

The only thing we should do is avoid fs corruption and bugs due to
dirtying a page after the fs believes it has been cleaned.


> page freeing:
> ============
> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */

Yeah, this needs to happen when we sanitize the flags of a freed page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:02                                                           ` Jerome Glisse
  2019-01-12  2:02                                                             ` Jerome Glisse
@ 2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
                                                                                 ` (2 more replies)
  1 sibling, 3 replies; 206+ messages in thread
From: John Hubbard @ 2019-01-12  2:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 6:02 PM, Jerome Glisse wrote:
> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>> [...]
>>
>> Hi Jerome,
>>
>> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
>> in case anyone spots a disastrous conceptual error (such as the lock_page
>> point), while I'm putting together the revised patchset.
>>
>> I've studied this carefully, and I agree that using mapcount in 
>> this way is viable, *as long* as we use a lock (or a construct that looks just 
>> like one: your "memory barrier, check, retry" is really just a lock) in
>> order to hold off gup() while page_mkclean() is in progress. In other words,
>> nothing that increments mapcount may proceed while page_mkclean() is running.
> 
> No, increment to page->_mapcount are fine while page_mkclean() is running.
> The above solution do work no matter what happens thanks to the memory
> barrier. By clearing the pin flag first and reading the page->_mapcount
> after (and doing the reverse in GUP) we know that a racing GUP will either
> have its pin page clear but the incremented mapcount taken into account by
> page_mkclean() or page_mkclean() will miss the incremented mapcount but
> it will also no clear the pin flag set concurrently by any GUP.
> 
> Here are all the possible time line:
> [T1]:
> GUP on CPU0                      | page_mkclean() on CPU1
>                                  |
> [G2] atomic_inc(&page->mapcount) |
> [G3] smp_wmb();                  |
> [G4] SetPagePin(page);           |
>                                 ...
>                                  | [C1] pined = TestClearPagePin(page);

It appears that you're using the "page pin is clear" to indicate that
page_mkclean() is running. The problem is, that approach leads to toggling
the PagePin flag, and so an observer (other than gup or page_mkclean) will
see intervals during which the PagePin flag is clear, when conceptually it
should be set.

Jan and other FS people, is it definitely the case that we only have to take
action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
Because I recall from earlier experiments that there were several places, not 
just page_mkclean().

One more quick question below...

>                                  | [C2] smp_mb();
>                                  | [C3] map_and_pin_count =
>                                  |        atomic_read(&page->mapcount)
> 
> It is fine because page_mkclean() will read the correct page->mapcount
> which include the GUP that happens before [C1]
> 
> 
> [T2]:
> GUP on CPU0                      | page_mkclean() on CPU1
>                                  |
>                                  | [C1] pined = TestClearPagePin(page);
>                                  | [C2] smp_mb();
>                                  | [C3] map_and_pin_count =
>                                  |        atomic_read(&page->mapcount)
>                                 ...
> [G2] atomic_inc(&page->mapcount) |
> [G3] smp_wmb();                  |
> [G4] SetPagePin(page);           |
> 
> It is fine because [G4] set the pin flag so it does not matter that [C3]
> did miss the mapcount increase from the GUP.
> 
> 
> [T3]:
> GUP on CPU0                      | page_mkclean() on CPU1
> [G4] SetPagePin(page);           | [C1] pined = TestClearPagePin(page);
> 
> No matter which CPU ordering we get ie either:
>     - [G4] is overwritten by [C1] in that case [C3] will see the mapcount
>       that was incremented by [G2] so we will map_count < map_and_pin_count
>       and we will set the pin flag again at the end of page_mkclean()
>     - [C1] is overwritten by [G4] in that case the pin flag is set and thus
>       it does not matter that [C3] also see the mapcount that was incremented
>       by [G2]
> 
> 
> This is totaly race free ie at the end of page_mkclean() the pin flag will
> be set for all page that are pin and for some page that are no longer pin.
> What matter is that they are no false negative.
> 
> 
>> I especially am intrigued by your idea about a fuzzy count that allows
>> false positives but no false negatives. To do that, we need to put a hard
>> lock protecting the increment operation, but we can be loose (no lock) on
>> decrement. That turns out to be a perfect match for the problem here, because
>> as I recall from my earlier efforts, put_user_page() must *not* take locks--
>> and that's where we just decrement. Sweet! See below.
> 
> You do not need lock, lock are easier to think with but they are not always
> necessary and in this case we do not need any lock. We can happily have any
> number of concurrent GUP, PUP or pte zapping. Worse case is false positive
> ie reporting a page as pin while it has just been unpin concurrently by a
> PUP.
> 
>> The other idea that you and Dan (and maybe others) pointed out was a debug
>> option, which we'll certainly need in order to safely convert all the call
>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>> and put_user_page() can verify that the right call was made.)  That will be
>> a separate patchset, as you recommended.
>>
>> I'll even go as far as recommending the page lock itself. I realize that this 
>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>> that this (below) has similar overhead to the notes above--but is *much* easier
>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>> then I'd recommend using another page bit to do the same thing.)
> 
> Please page lock is pointless and it will not work for GUP fast. The above
> scheme do work and is fine. I spend the day again thinking about all memory
> ordering and i do not see any issues.
> 

Why is it that page lock cannot be used for gup fast, btw?

> 
>> (Note that memory barriers will simply be built into the various Set|Clear|Read
>> operations, as is common with a few other page flags.)
>>
>> page_mkclean():
>> ===============
>> lock_page()
>>     page_mkclean()
>>         Count actual mappings
>>             if(mappings == atomic_read(&page->_mapcount))
>>                 ClearPageDmaPinned 
>>
>> gup_fast():
>> ===========
>> for each page {
>>     lock_page() /* gup MUST NOT proceed until page_mkclean and writeback finish */
>>
>>     atomic_inc(&page->_mapcount)
>>     SetPageDmaPinned()
>>
>>     /* details of gup vs gup_fast not shown here... */
>>
>>
>> put_user_page():
>> ================
>>     atomic_dec(&page->_mapcount); /* no locking! */
>>    
>>
>> try_to_unmap() and other consumers of the PageDmaPinned flag:
>> =============================================================
>> lock_page() /* not required, but already done by existing callers */
>>     if(PageDmaPinned) {
>>         ...take appropriate action /* future patchsets */
> 
> We can not block try_to_unmap() on pined page. What we want to block is
> fs using a different page for the same file offset the original pined
> page was pin (modulo truncate that we should not block). Everything else
> must keep working as if there was no pin. We can not fix that, driver
> doing long term GUP and not abiding to mmu notifier are hopelessly broken
> in front of many regular syscall (mremap, truncate, splice, ...) we can
> not block those syscall or failing them, doing so would mean breaking
> applications in a bad way.
> 
> The only thing we should do is avoid fs corruption and bug due to
> dirtying page after fs believe it has been clean.
> 
> 
>> page freeing:
>> ============
>> ClearPageDmaPinned() /* It may not have ever had page_mkclean() run on it */
> 
> Yeah this need to happen when we sanitize flags of free page.
> 


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
@ 2019-01-12  2:46                                                               ` Jerome Glisse
  2019-01-12  2:46                                                                 ` Jerome Glisse
  2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:14                                                               ` Jerome Glisse
  2 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-12  2:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> [...]
> >>
> >> Hi Jerome,
> >>
> >> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> >> in case anyone spots a disastrous conceptual error (such as the lock_page
> >> point), while I'm putting together the revised patchset.
> >>
> >> I've studied this carefully, and I agree that using mapcount in 
> >> this way is viable, *as long* as we use a lock (or a construct that looks just 
> >> like one: your "memory barrier, check, retry" is really just a lock) in
> >> order to hold off gup() while page_mkclean() is in progress. In other words,
> >> nothing that increments mapcount may proceed while page_mkclean() is running.
> > 
> > No, increment to page->_mapcount are fine while page_mkclean() is running.
> > The above solution do work no matter what happens thanks to the memory
> > barrier. By clearing the pin flag first and reading the page->_mapcount
> > after (and doing the reverse in GUP) we know that a racing GUP will either
> > have its pin page clear but the incremented mapcount taken into account by
> > page_mkclean() or page_mkclean() will miss the incremented mapcount but
> > it will also no clear the pin flag set concurrently by any GUP.
> > 
> > Here are all the possible time line:
> > [T1]:
> > GUP on CPU0                      | page_mkclean() on CPU1
> >                                  |
> > [G2] atomic_inc(&page->mapcount) |
> > [G3] smp_wmb();                  |
> > [G4] SetPagePin(page);           |
> >                                 ...
> >                                  | [C1] pined = TestClearPagePin(page);
> 
> It appears that you're using the "page pin is clear" to indicate that
> page_mkclean() is running. The problem is, that approach leads to toggling
> the PagePin flag, and so an observer (other than gup or page_mkclean) will
> see intervals during which the PagePin flag is clear, when conceptually it
> should be set.
> 
> Jan and other FS people, is it definitely the case that we only have to take
> action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
> Because I recall from earlier experiments that there were several places, not 
> just page_mkclean().

Yes, and it is fine to temporarily have the pin flag unstable. Anything
that needs stable page content will have to lock the page, and so will
have to sync against any page_mkclean(). In the end, the only place where
we want to check the pin flag is when doing writeback, i.e. after
page_mkclean() while the page is still locked. If there are any other
places that need to check the pin flag, then they will need to lock the
page. But I cannot think of any other place right now.


[...]

> >> The other idea that you and Dan (and maybe others) pointed out was a debug
> >> option, which we'll certainly need in order to safely convert all the call
> >> sites. (Mirror the mappings at a different kernel offset, so that put_page()
> >> and put_user_page() can verify that the right call was made.)  That will be
> >> a separate patchset, as you recommended.
> >>
> >> I'll even go as far as recommending the page lock itself. I realize that this 
> >> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
> >> that this (below) has similar overhead to the notes above--but is *much* easier
> >> to verify correct. (If the page lock is unacceptable due to being so widely used,
> >> then I'd recommend using another page bit to do the same thing.)
> > 
> > Please page lock is pointless and it will not work for GUP fast. The above
> > scheme do work and is fine. I spend the day again thinking about all memory
> > ordering and i do not see any issues.
> > 
> 
> Why is it that page lock cannot be used for gup fast, btw?

Well, it cannot happen within the preempt-disabled section. But done
afterwards, as a post pass before GUP_fast returns and after re-enabling
preemption, it is fine, just as it would be for regular GUP. But locking
the page for GUP is also likely to slow down some workloads (with direct-IO).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:46                                                               ` Jerome Glisse
  2019-01-12  2:46                                                                 ` Jerome Glisse
@ 2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:06                                                                   ` John Hubbard
                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 206+ messages in thread
From: John Hubbard @ 2019-01-12  3:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On 1/11/19 6:46 PM, Jerome Glisse wrote:
> On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
>> On 1/11/19 6:02 PM, Jerome Glisse wrote:
>>> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
>>>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
>>>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
>>>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
>>>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>>>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>> [...]
>>>>
>>>> Hi Jerome,
>>>>
>>>> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
>>>> in case anyone spots a disastrous conceptual error (such as the lock_page
>>>> point), while I'm putting together the revised patchset.
>>>>
>>>> I've studied this carefully, and I agree that using mapcount in 
>>>> this way is viable, *as long* as we use a lock (or a construct that looks just 
>>>> like one: your "memory barrier, check, retry" is really just a lock) in
>>>> order to hold off gup() while page_mkclean() is in progress. In other words,
>>>> nothing that increments mapcount may proceed while page_mkclean() is running.
>>>
>>> No, increment to page->_mapcount are fine while page_mkclean() is running.
>>> The above solution do work no matter what happens thanks to the memory
>>> barrier. By clearing the pin flag first and reading the page->_mapcount
>>> after (and doing the reverse in GUP) we know that a racing GUP will either
>>> have its pin page clear but the incremented mapcount taken into account by
>>> page_mkclean() or page_mkclean() will miss the incremented mapcount but
>>> it will also no clear the pin flag set concurrently by any GUP.
>>>
>>> Here are all the possible time line:
>>> [T1]:
>>> GUP on CPU0                      | page_mkclean() on CPU1
>>>                                  |
>>> [G2] atomic_inc(&page->mapcount) |
>>> [G3] smp_wmb();                  |
>>> [G4] SetPagePin(page);           |
>>>                                 ...
>>>                                  | [C1] pined = TestClearPagePin(page);
>>
>> It appears that you're using the "page pin is clear" to indicate that
>> page_mkclean() is running. The problem is, that approach leads to toggling
>> the PagePin flag, and so an observer (other than gup or page_mkclean) will
>> see intervals during which the PagePin flag is clear, when conceptually it
>> should be set.
>>
>> Jan and other FS people, is it definitely the case that we only have to take
>> action (defer, wait, revoke, etc) for gup-pinned pages, in page_mkclean()?
>> Because I recall from earlier experiments that there were several places, not 
>> just page_mkclean().
> 
> Yes and it is fine to temporarily have the pin flag unstable. Anything
> that need stable page content will have to lock the page so will have
> to sync against any page_mkclean() and in the end the only thing were
> we want to check the pin flag is when doing write back ie after
> page_mkclean() while the page is still locked. If they are any other
> place that need to check the pin flag then they will need to lock the
> page. But i can not think of any other place right now.
> 
> 

OK. Yes, since the clearing and resetting happens under page lock, that will
suffice to synchronize it. That's a good point.

> [...]
> 
>>>> The other idea that you and Dan (and maybe others) pointed out was a debug
>>>> option, which we'll certainly need in order to safely convert all the call
>>>> sites. (Mirror the mappings at a different kernel offset, so that put_page()
>>>> and put_user_page() can verify that the right call was made.)  That will be
>>>> a separate patchset, as you recommended.
>>>>
>>>> I'll even go as far as recommending the page lock itself. I realize that this 
>>>> adds overhead to gup(), but we *must* hold off page_mkclean(), and I believe
>>>> that this (below) has similar overhead to the notes above--but is *much* easier
>>>> to verify correct. (If the page lock is unacceptable due to being so widely used,
>>>> then I'd recommend using another page bit to do the same thing.)
>>>
>>> Please page lock is pointless and it will not work for GUP fast. The above
>>> scheme do work and is fine. I spend the day again thinking about all memory
>>> ordering and i do not see any issues.
>>>
>>
>> Why is it that page lock cannot be used for gup fast, btw?
> 
> Well it can not happen within the preempt disable section. But after
> as a post pass before GUP_fast return and after reenabling preempt then
> it is fine like it would be for regular GUP. But locking page for GUP
> is also likely to slow down some workload (with direct-IO).
> 

Right, and so to the crux of the matter: taking an uncontended page lock involves
pretty much the same set of operations that your approach does. (If gup ends up
contended with the page lock for other reasons than these paths, that seems
surprising.) I'd expect very similar performance.

But the page lock approach leads to really dramatically simpler code (and code
reviews, let's not forget). Any objection to my going that direction, and keeping
this idea as a Plan B? I think the next step will be, once again, to gather some
performance metrics, so maybe that will help us decide.
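
To make the "same set of operations" comparison concrete, here is a minimal
userspace sketch of an uncontended bit lock, assuming C11 atomics stand in
for the kernel's bit operations. The names (lock_model, MODEL_PG_LOCKED,
model_trylock_page, model_unlock_page) are hypothetical and not taken from
any patch. The real lock_page() sleeps on contention and unlock_page() wakes
waiters, and none of that is modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PG_LOCKED (1UL << 0)   /* hypothetical bit position */

struct lock_model {
	atomic_ulong flags;              /* stands in for page->flags */
};

/* Uncontended acquire: one atomic read-modify-write with acquire ordering. */
static bool model_trylock_page(struct lock_model *p)
{
	unsigned long old = atomic_fetch_or_explicit(&p->flags, MODEL_PG_LOCKED,
						     memory_order_acquire);
	return !(old & MODEL_PG_LOCKED);   /* true if we took the lock */
}

/* Release: clear the bit with release ordering; caller must hold the lock. */
static void model_unlock_page(struct lock_model *p)
{
	unsigned long old = atomic_fetch_and_explicit(&p->flags, ~MODEL_PG_LOCKED,
						      memory_order_release);
	assert(old & MODEL_PG_LOCKED);
	(void)old;                         /* keeps NDEBUG builds warning-free */
}
```

In this model the uncontended path costs one atomic operation to acquire and
one to release, which is roughly the atomic_inc() plus barrier plus flag
update of the other scheme. Whether that holds up in the kernel is exactly
what the performance numbers would need to show.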


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  2:38                                                             ` John Hubbard
  2019-01-12  2:38                                                               ` John Hubbard
  2019-01-12  2:46                                                               ` Jerome Glisse
@ 2019-01-12  3:14                                                               ` Jerome Glisse
  2019-01-12  3:14                                                                 ` Jerome Glisse
  2 siblings, 1 reply; 206+ messages in thread
From: Jerome Glisse @ 2019-01-12  3:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> >>> [...]
> >>
> >> Hi Jerome,
> >>
> >> Looks good, in a conceptual sense. Let me do a brain dump of how I see it,
> >> in case anyone spots a disastrous conceptual error (such as the lock_page
> >> point), while I'm putting together the revised patchset.
> >>
> >> I've studied this carefully, and I agree that using mapcount in 
> >> this way is viable, *as long* as we use a lock (or a construct that looks just 
> >> like one: your "memory barrier, check, retry" is really just a lock) in
> >> order to hold off gup() while page_mkclean() is in progress. In other words,
> >> nothing that increments mapcount may proceed while page_mkclean() is running.
> > 
> > No, increment to page->_mapcount are fine while page_mkclean() is running.
> > The above solution do work no matter what happens thanks to the memory
> > barrier. By clearing the pin flag first and reading the page->_mapcount
> > after (and doing the reverse in GUP) we know that a racing GUP will either
> > have its pin page clear but the incremented mapcount taken into account by
> > page_mkclean() or page_mkclean() will miss the incremented mapcount but
> > it will also no clear the pin flag set concurrently by any GUP.
> > 
> > Here are all the possible time line:
> > [T1]:
> > GUP on CPU0                      | page_mkclean() on CPU1
> >                                  |
> > [G2] atomic_inc(&page->mapcount) |
> > [G3] smp_wmb();                  |
> > [G4] SetPagePin(page);           |
> >                                 ...
> >                                  | [C1] pined = TestClearPagePin(page);
> 
> It appears that you're using the "page pin is clear" to indicate that
> page_mkclean() is running. The problem is, that approach leads to toggling
> the PagePin flag, and so an observer (other than gup or page_mkclean) will
> see intervals during which the PagePin flag is clear, when conceptually it
> should be set.

I also forgot to stress that I am not using the pin flag to report that
page_mkclean() is running. I am clearing it first because clearing that bit
is the thing that is racy. If we clear it first, then read the map and pin
count, and then count the number of real mappings, we get a proper ordering:
we will always detect a pinned page and properly restore the pin flag at the
end of page_mkclean().

In fact, GUP or PUP never needs to check whether the flag is clear. The check
in GUP in my pseudocode is an optimization for the writeback ordering (no
need to do any ordering if the pin flag was already set before the current
GUP).
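
The ordering described above can be modeled roughly in userspace as follows.
This is only a sketch under stated assumptions: C11 atomics and fences stand
in for atomic_inc(), smp_wmb() and TestClearPagePin(), the full fence on the
page_mkclean() side is an assumption, the count starts at 0 rather than the
kernel's -1 bias, and the names (page_model, model_gup_pin,
model_mkclean_check) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two struct page fields involved. */
struct page_model {
	atomic_int mapcount;    /* real mappings plus GUP pins (0-based here) */
	atomic_bool pin_flag;   /* models the proposed PagePin bit */
};

/* GUP side: bump the count first, then publish the flag ([G2]..[G4]). */
static void model_gup_pin(struct page_model *p)
{
	atomic_fetch_add_explicit(&p->mapcount, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);   /* stands in for smp_wmb() */
	atomic_store_explicit(&p->pin_flag, true, memory_order_relaxed);
}

/*
 * page_mkclean() side: clear the flag first ([C1]), then read the count,
 * then compare it with the number of real mappings the caller counted.
 * Returns true if the page still looks pinned, restoring the flag if so.
 */
static bool model_mkclean_check(struct page_model *p, int real_mappings)
{
	assert(real_mappings >= 0);

	(void)atomic_exchange_explicit(&p->pin_flag, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);   /* assumed full barrier */

	int count = atomic_load_explicit(&p->mapcount, memory_order_relaxed);
	bool pinned = count > real_mappings;

	if (pinned)
		atomic_store_explicit(&p->pin_flag, true, memory_order_relaxed);
	return pinned;
}
```

If model_mkclean_check() clears a flag that a racing model_gup_pin() had
already set, the fences ensure it also sees that pin's count and restores
the flag. If it misses the count, the flag store comes after the clear and
so survives.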

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 206+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2019-01-12  3:06                                                                 ` John Hubbard
  2019-01-12  3:06                                                                   ` John Hubbard
@ 2019-01-12  3:25                                                                   ` Jerome Glisse
  2019-01-12  3:25                                                                     ` Jerome Glisse
  2019-01-12 20:46                                                                     ` John Hubbard
  2019-01-14 14:54                                                                   ` Jan Kara
  2 siblings, 2 replies; 206+ messages in thread
From: Jerome Glisse @ 2019-01-12  3:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Dave Chinner, Dan Williams,
	John Hubbard, Andrew Morton, Linux MM, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Jan 11, 2019 at 07:06:08PM -0800, John Hubbard wrote:
> On 1/11/19 6:46 PM, Jerome Glisse wrote:
> > On Fri, Jan 11, 2019 at 06:38:44PM -0800, John Hubbard wrote:
> >> On 1/11/19 6:02 PM, Jerome Glisse wrote:
> >>> On Fri, Jan 11, 2019 at 05:04:05PM -0800, John Hubbard wrote:
> >>>> On 1/11/19 8:51 AM, Jerome Glisse wrote:
> >>>>> On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> >>>>>> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> >>>>>>> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >>>>>>>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>>>>>>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>>>>>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote: