LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH 0/2] put_user_page*(): start converting the call sites
@ 2018-12-04  0:17 john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
                   ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi,

Summary: I'd like these two patches to go into the next convenient cycle.
I *think* that means 4.21.

Details

At the Linux Plumbers Conference, we talked about this approach [1], and
the primary lingering concern was over performance. Tom Talpey helped me
through a much more accurate run of the fio performance test, and now
it's looking like an under 1% performance cost, to add and remove pages
from the LRU (this is only paid when dealing with get_user_pages) [2]. So
we should be fine to start converting call sites.

This patchset gets the conversion started. Both patches already had a fair
amount of review.

(Tom, I'll add you Tested-by to the actual implementation that moves
pages on and off the LRU. These first two patches don't do that.)

[1] https://linuxplumbersconf.org/event/2/contributions/126/
    "RDMA and get_user_pages"

[2] https://lore.kernel.org/r/79d1ee27-9ea0-3d15-3fc4-97c1bd79c990@talpey.com

John Hubbard (2):
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c              |  7 +-
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 +-
 include/linux/mm.h                          | 20 ++++++
 mm/swap.c                                   | 80 +++++++++++++++++++++
 9 files changed, 123 insertions(+), 27 deletions(-)

-- 
2.19.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
@ 2018-12-04  0:17 ` john.hubbard
  2018-12-04  7:53   ` Mike Rapoport
  2018-12-04 20:28   ` Dan Williams
  2018-12-04  0:17 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
  2 siblings, 2 replies; 36+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 20 ++++++++++++
 mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..09fbb2c81aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..bb8c32595e5f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+				   unsigned long npages,
+				   set_dirty_func sdf)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			sdf(page);
+
+		put_user_page(page);
+	}
+}
+
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/*
+ * put_user_pages() - for each page in the @pages array, release the page
+ * using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures
-- 
2.19.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*()
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-12-04  0:17 ` john.hubbard
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
  2 siblings, 0 replies; 36+ messages in thread
From: john.hubbard @ 2018-12-04  0:17 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page()

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>

Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  7 ++++---
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++++-------
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++++-------
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 ++++---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c6144df47ea4..c2898bc7b3b2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (!PageDirty(page) && umem->writable && dirty)
-			set_page_dirty_lock(page);
-		put_page(page);
+		if (umem->writable && dirty)
+			put_user_pages_dirty_lock(&page, 1);
+		else
+			put_user_page(page);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 676c1fd1119d..99715049cd3b 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -659,7 +659,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 					ret = -EFAULT;
 					break;
 				}
-				put_page(local_page_list[j]);
+				put_user_page(local_page_list[j]);
 				continue;
 			}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	size_t i;
-
-	for (i = 0; i < npages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, npages);
+	else
+		put_user_pages(p, npages);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_page(pages[0]);
+		put_user_page(pages[0]);
 		goto out;
 	}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_page(sg_page(&db_tab->page[i].mem));
+		put_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_page(sg_page(&db_tab->page[i].mem));
+			put_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..1a5c64c8695f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,10 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	size_t i;
-
-	for (i = 0; i < num_pages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, num_pages);
+	else
+		put_user_pages(p, num_pages);
 }
 
 /*
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..4a4b802b011f 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_page(page);
+			put_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_page(pkt->addr[i].page);
+			put_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_page(pages[i++]);
+		put_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 49275a548751..2ef8d31dc838 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -77,9 +77,10 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (!PageDirty(page) && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
+			if (dirty)
+				put_user_pages_dirty_lock(&page, 1);
+			else
+				put_user_page(page);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
-- 
2.19.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-12-04  7:53   ` Mike Rapoport
  2018-12-05  1:40     ` John Hubbard
  2018-12-04 20:28   ` Dan Williams
  1 sibling, 1 reply; 36+ messages in thread
From: Mike Rapoport @ 2018-12-04  7:53 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, linux-mm, Jan Kara, Tom Talpey, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel, John Hubbard

Hi John,

Thanks for having documentation as a part of the patch. Some kernel-doc
nits below.

On Mon, Dec 03, 2018 at 04:17:19PM -0800, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing the problem described in [1]. The steps
> are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem. Again, [1] provides details as to why that is
>    desirable.
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 20 ++++++++++++
>  mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 100 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..09fbb2c81aba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
> 
> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.

Please add @page parameter description, otherwise kernel-doc is unhappy

> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index aa483719922e..bb8c32595e5f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
> 
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> +				   unsigned long npages,
> +				   set_dirty_func sdf)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			sdf(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +
> +/*
> + * put_user_pages_dirty() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.

Please put the parameters description next to the brief function
description, as described in [1]

[1] https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation


> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/*
> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.

Ditto

> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/*
> + * put_user_pages() - for each page in the @pages array, release the page
> + * using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *

And here as well :)

> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.19.2
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2018-12-04  0:17 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
@ 2018-12-04 17:10 ` David Laight
  2018-12-05  1:05   ` John Hubbard
  2 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2018-12-04 17:10 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel, John Hubbard

From: john.hubbard@gmail.com
> Sent: 04 December 2018 00:17
> 
> Summary: I'd like these two patches to go into the next convenient cycle.
> I *think* that means 4.21.
> 
> Details
> 
> At the Linux Plumbers Conference, we talked about this approach [1], and
> the primary lingering concern was over performance. Tom Talpey helped me
> through a much more accurate run of the fio performance test, and now
> it's looking like an under 1% performance cost, to add and remove pages
> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
> we should be fine to start converting call sites.
> 
> This patchset gets the conversion started. Both patches already had a fair
> amount of review.

Shouldn't the commit message contain actual details of the change?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
  2018-12-04  7:53   ` Mike Rapoport
@ 2018-12-04 20:28   ` Dan Williams
  2018-12-04 21:56     ` John Hubbard
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Williams @ 2018-12-04 20:28 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel, John Hubbard

On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>
> From: John Hubbard <jhubbard@nvidia.com>
>
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
>
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
>
> This is the first step of fixing the problem described in [1]. The steps
> are:
>
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
>
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
>
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
>
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem. Again, [1] provides details as to why that is
>    desirable.

I thought at Plumbers we talked about using a page bit to tag pages
that have had their reference count elevated by get_user_pages()? That
way there is no need to distinguish put_page() from put_user_page() it
just happens internally to put_page(). At the conference Matthew was
offering to free up a page bit for this purpose.

> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>
> Reviewed-by: Jan Kara <jack@suse.cz>

Wish, you could have been there Jan. I'm missing why it's safe to
assume that a single put_user_page() is paired with a get_user_page()?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 20:28   ` Dan Williams
@ 2018-12-04 21:56     ` John Hubbard
  2018-12-04 23:03       ` Dan Williams
  2018-12-05 11:16       ` Jan Kara
  0 siblings, 2 replies; 36+ messages in thread
From: John Hubbard @ 2018-12-04 21:56 UTC (permalink / raw)
  To: Dan Williams, John Hubbard
  Cc: Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 12:28 PM, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>    for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>    call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>    implement tracking of these pages. This tracking will be separate from
>>    the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>    special handling (especially in writeback paths) when the pages are
>>    backed by a filesystem. Again, [1] provides details as to why that is
>>    desirable.
> 
> I thought at Plumbers we talked about using a page bit to tag pages
> that have had their reference count elevated by get_user_pages()? That
> way there is no need to distinguish put_page() from put_user_page() it
> just happens internally to put_page(). At the conference Matthew was
> offering to free up a page bit for this purpose.
> 

...but then, upon further discussion in that same session, we realized that
that doesn't help. You need a reference count. Otherwise a random put_page
could affect your dma-pinned pages, etc, etc.

I was not able to actually find any place where a single additional page
bit would help our situation, which is why this still uses LRU fields for
both the two bits required (the RFC [1] still applies), and the dma_pinned_count.


[1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com



>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Wish, you could have been there Jan. I'm missing why it's safe to
> assume that a single put_user_page() is paired with a get_user_page()?
> 

A put_user_page() per page, or a put_user_pages() for an array of pages. See
patch 0002 for several examples.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 21:56     ` John Hubbard
@ 2018-12-04 23:03       ` Dan Williams
  2018-12-05  0:36         ` Jerome Glisse
  2018-12-05  0:58         ` John Hubbard
  2018-12-05 11:16       ` Jan Kara
  1 sibling, 2 replies; 36+ messages in thread
From: Dan Williams @ 2018-12-04 23:03 UTC (permalink / raw)
  To: John Hubbard
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/18 12:28 PM, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Introduces put_user_page(), which simply calls put_page().
> >> This provides a way to update all get_user_pages*() callers,
> >> so that they call put_user_page(), instead of put_page().
> >>
> >> Also introduces put_user_pages(), and a few dirty/locked variations,
> >> as a replacement for release_pages(), and also as a replacement
> >> for open-coded loops that release multiple pages.
> >> These may be used for subsequent performance improvements,
> >> via batching of pages to be released.
> >>
> >> This is the first step of fixing the problem described in [1]. The steps
> >> are:
> >>
> >> 1) (This patch): provide put_user_page*() routines, intended to be used
> >>    for releasing pages that were pinned via get_user_pages*().
> >>
> >> 2) Convert all of the call sites for get_user_pages*(), to
> >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> >>    call sites, and will take some time.
> >>
> >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> >>    implement tracking of these pages. This tracking will be separate from
> >>    the existing struct page refcounting.
> >>
> >> 4) Use the tracking and identification of these pages, to implement
> >>    special handling (especially in writeback paths) when the pages are
> >>    backed by a filesystem. Again, [1] provides details as to why that is
> >>    desirable.
> >
> > I thought at Plumbers we talked about using a page bit to tag pages
> > that have had their reference count elevated by get_user_pages()? That
> > way there is no need to distinguish put_page() from put_user_page() it
> > just happens internally to put_page(). At the conference Matthew was
> > offering to free up a page bit for this purpose.
> >
>
> ...but then, upon further discussion in that same session, we realized that
> that doesn't help. You need a reference count. Otherwise a random put_page
> could affect your dma-pinned pages, etc, etc.

Ok, sorry, I mis-remembered. So, you're effectively trying to capture
the end of the page pin event separate from the final 'put' of the
page? Makes sense.

> I was not able to actually find any place where a single additional page
> bit would help our situation, which is why this still uses LRU fields for
> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.

Except the LRU fields are already in use for ZONE_DEVICE pages... how
does this proposal interact with those?

> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
>
> >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > Wish, you could have been there Jan. I'm missing why it's safe to
> > assume that a single put_user_page() is paired with a get_user_page()?
> >
>
> A put_user_page() per page, or a put_user_pages() for an array of pages. See
> patch 0002 for several examples.

Yes, however I was more concerned about validation and trying to
locate missed places where put_page() is used instead of
put_user_page().

It would be interesting to see if we could have a debug mode where
get_user_pages() returned dynamically allocated pages from a known
address range and catch drivers that operate on a user-pinned page
without using the proper helper to 'put' it. I think we might also
need a ref_user_page() for drivers that may do their own get_page()
and expect the dma_pinned_count to also increase.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 23:03       ` Dan Williams
@ 2018-12-05  0:36         ` Jerome Glisse
  2018-12-05  0:40           ` Dan Williams
  2018-12-05  0:58         ` John Hubbard
  1 sibling, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2018-12-05  0:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 12/4/18 12:28 PM, Dan Williams wrote:
> > > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> > >>
> > >> From: John Hubbard <jhubbard@nvidia.com>
> > >>
> > >> Introduces put_user_page(), which simply calls put_page().
> > >> This provides a way to update all get_user_pages*() callers,
> > >> so that they call put_user_page(), instead of put_page().
> > >>
> > >> Also introduces put_user_pages(), and a few dirty/locked variations,
> > >> as a replacement for release_pages(), and also as a replacement
> > >> for open-coded loops that release multiple pages.
> > >> These may be used for subsequent performance improvements,
> > >> via batching of pages to be released.
> > >>
> > >> This is the first step of fixing the problem described in [1]. The steps
> > >> are:
> > >>
> > >> 1) (This patch): provide put_user_page*() routines, intended to be used
> > >>    for releasing pages that were pinned via get_user_pages*().
> > >>
> > >> 2) Convert all of the call sites for get_user_pages*(), to
> > >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> > >>    call sites, and will take some time.
> > >>
> > >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> > >>    implement tracking of these pages. This tracking will be separate from
> > >>    the existing struct page refcounting.
> > >>
> > >> 4) Use the tracking and identification of these pages, to implement
> > >>    special handling (especially in writeback paths) when the pages are
> > >>    backed by a filesystem. Again, [1] provides details as to why that is
> > >>    desirable.
> > >
> > > I thought at Plumbers we talked about using a page bit to tag pages
> > > that have had their reference count elevated by get_user_pages()? That
> > > way there is no need to distinguish put_page() from put_user_page() it
> > > just happens internally to put_page(). At the conference Matthew was
> > > offering to free up a page bit for this purpose.
> > >
> >
> > ...but then, upon further discussion in that same session, we realized that
> > that doesn't help. You need a reference count. Otherwise a random put_page
> > could affect your dma-pinned pages, etc, etc.
> 
> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> the end of the page pin event separate from the final 'put' of the
> page? Makes sense.
> 
> > I was not able to actually find any place where a single additional page
> > bit would help our situation, which is why this still uses LRU fields for
> > both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> 
> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> does this proposal interact with those?
> 
> > [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
> >
> > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> > >>
> > >> Reviewed-by: Jan Kara <jack@suse.cz>
> > >
> > > Wish, you could have been there Jan. I'm missing why it's safe to
> > > assume that a single put_user_page() is paired with a get_user_page()?
> > >
> >
> > A put_user_page() per page, or a put_user_pages() for an array of pages. See
> > patch 0002 for several examples.
> 
> Yes, however I was more concerned about validation and trying to
> locate missed places where put_page() is used instead of
> put_user_page().
> 
> It would be interesting to see if we could have a debug mode where
> get_user_pages() returned dynamically allocated pages from a known
> address range and catch drivers that operate on a user-pinned page
> without using the proper helper to 'put' it. I think we might also
> need a ref_user_page() for drivers that may do their own get_page()
> and expect the dma_pinned_count to also increase.

Total crazy idea for this, but this is the right time of day
for this (for me at least it is beer time :)) What about mapping
all struct page in two different range of kernel virtual address
and when get user space is use it returns a pointer from the second
range of kernel virtual address to the struct page. Then in put_page
you know for sure if the code putting the page got it from GUP or
from somewhere else. page_to_pfn() would need some trickery to
handle that.

Dunno if we are running out of kernel virtual address (outside
32bits that i believe we are trying to shot down quietly behind
the bar).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:36         ` Jerome Glisse
@ 2018-12-05  0:40           ` Dan Williams
  2018-12-05  0:59             ` John Hubbard
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2018-12-05  0:40 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 12/4/18 12:28 PM, Dan Williams wrote:
> > > > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> > > >>
> > > >> From: John Hubbard <jhubbard@nvidia.com>
> > > >>
> > > >> Introduces put_user_page(), which simply calls put_page().
> > > >> This provides a way to update all get_user_pages*() callers,
> > > >> so that they call put_user_page(), instead of put_page().
> > > >>
> > > >> Also introduces put_user_pages(), and a few dirty/locked variations,
> > > >> as a replacement for release_pages(), and also as a replacement
> > > >> for open-coded loops that release multiple pages.
> > > >> These may be used for subsequent performance improvements,
> > > >> via batching of pages to be released.
> > > >>
> > > >> This is the first step of fixing the problem described in [1]. The steps
> > > >> are:
> > > >>
> > > >> 1) (This patch): provide put_user_page*() routines, intended to be used
> > > >>    for releasing pages that were pinned via get_user_pages*().
> > > >>
> > > >> 2) Convert all of the call sites for get_user_pages*(), to
> > > >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > >>    call sites, and will take some time.
> > > >>
> > > >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> > > >>    implement tracking of these pages. This tracking will be separate from
> > > >>    the existing struct page refcounting.
> > > >>
> > > >> 4) Use the tracking and identification of these pages, to implement
> > > >>    special handling (especially in writeback paths) when the pages are
> > > >>    backed by a filesystem. Again, [1] provides details as to why that is
> > > >>    desirable.
> > > >
> > > > I thought at Plumbers we talked about using a page bit to tag pages
> > > > that have had their reference count elevated by get_user_pages()? That
> > > > way there is no need to distinguish put_page() from put_user_page() it
> > > > just happens internally to put_page(). At the conference Matthew was
> > > > offering to free up a page bit for this purpose.
> > > >
> > >
> > > ...but then, upon further discussion in that same session, we realized that
> > > that doesn't help. You need a reference count. Otherwise a random put_page
> > > could affect your dma-pinned pages, etc, etc.
> >
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
> > > I was not able to actually find any place where a single additional page
> > > bit would help our situation, which is why this still uses LRU fields for
> > > both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
> >
> > > [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
> > >
> > > >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> > > >>
> > > >> Reviewed-by: Jan Kara <jack@suse.cz>
> > > >
> > > > Wish, you could have been there Jan. I'm missing why it's safe to
> > > > assume that a single put_user_page() is paired with a get_user_page()?
> > > >
> > >
> > > A put_user_page() per page, or a put_user_pages() for an array of pages. See
> > > patch 0002 for several examples.
> >
> > Yes, however I was more concerned about validation and trying to
> > locate missed places where put_page() is used instead of
> > put_user_page().
> >
> > It would be interesting to see if we could have a debug mode where
> > get_user_pages() returned dynamically allocated pages from a known
> > address range and catch drivers that operate on a user-pinned page
> > without using the proper helper to 'put' it. I think we might also
> > need a ref_user_page() for drivers that may do their own get_page()
> > and expect the dma_pinned_count to also increase.
>
> Total crazy idea for this, but this is the right time of day
> for this (for me at least it is beer time :)) What about mapping
> all struct page in two different range of kernel virtual address
> and when get user space is use it returns a pointer from the second
> range of kernel virtual address to the struct page. Then in put_page
> you know for sure if the code putting the page got it from GUP or
> from somewhere else. page_to_pfn() would need some trickery to
> handle that.

Yes, exactly what I was thinking, if only as a debug mode since
instrumenting every pfn/page translation would be expensive.

> Dunno if we are running out of kernel virtual address (outside
> 32bits that i believe we are trying to shot down quietly behind
> the bar).

There's room, KASAN is in a roughly similar place.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 23:03       ` Dan Williams
  2018-12-05  0:36         ` Jerome Glisse
@ 2018-12-05  0:58         ` John Hubbard
  2018-12-05  1:00           ` Dan Williams
  2018-12-05  1:15           ` Matthew Wilcox
  1 sibling, 2 replies; 36+ messages in thread
From: John Hubbard @ 2018-12-05  0:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 3:03 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>>>
>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
>>>>
>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>> as a replacement for release_pages(), and also as a replacement
>>>> for open-coded loops that release multiple pages.
>>>> These may be used for subsequent performance improvements,
>>>> via batching of pages to be released.
>>>>
>>>> This is the first step of fixing the problem described in [1]. The steps
>>>> are:
>>>>
>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>    for releasing pages that were pinned via get_user_pages*().
>>>>
>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>>>    call sites, and will take some time.
>>>>
>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>    implement tracking of these pages. This tracking will be separate from
>>>>    the existing struct page refcounting.
>>>>
>>>> 4) Use the tracking and identification of these pages, to implement
>>>>    special handling (especially in writeback paths) when the pages are
>>>>    backed by a filesystem. Again, [1] provides details as to why that is
>>>>    desirable.
>>>
>>> I thought at Plumbers we talked about using a page bit to tag pages
>>> that have had their reference count elevated by get_user_pages()? That
>>> way there is no need to distinguish put_page() from put_user_page() it
>>> just happens internally to put_page(). At the conference Matthew was
>>> offering to free up a page bit for this purpose.
>>>
>>
>> ...but then, upon further discussion in that same session, we realized that
>> that doesn't help. You need a reference count. Otherwise a random put_page
>> could affect your dma-pinned pages, etc, etc.
> 
> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> the end of the page pin event separate from the final 'put' of the
> page? Makes sense.
> 

Yes, that's it exactly.

>> I was not able to actually find any place where a single additional page
>> bit would help our situation, which is why this still uses LRU fields for
>> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> 
> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> does this proposal interact with those?

Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?

If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
LRU field approach is unusable.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:40           ` Dan Williams
@ 2018-12-05  0:59             ` John Hubbard
  0 siblings, 0 replies; 36+ messages in thread
From: John Hubbard @ 2018-12-05  0:59 UTC (permalink / raw)
  To: Dan Williams, Jérôme Glisse
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Matthew Wilcox,
	Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/4/18 4:40 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse <jglisse@redhat.com> wrote:
>>
>> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
>>> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>
>>>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>>>> On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
>>>>>>
>>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>>>>>
>>>>>> Introduces put_user_page(), which simply calls put_page().
>>>>>> This provides a way to update all get_user_pages*() callers,
>>>>>> so that they call put_user_page(), instead of put_page().
>>>>>>
>>>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>>>> as a replacement for release_pages(), and also as a replacement
>>>>>> for open-coded loops that release multiple pages.
>>>>>> These may be used for subsequent performance improvements,
>>>>>> via batching of pages to be released.
>>>>>>
>>>>>> This is the first step of fixing the problem described in [1]. The steps
>>>>>> are:
>>>>>>
>>>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>>>    for releasing pages that were pinned via get_user_pages*().
>>>>>>
>>>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>>>>>    call sites, and will take some time.
>>>>>>
>>>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>>>    implement tracking of these pages. This tracking will be separate from
>>>>>>    the existing struct page refcounting.
>>>>>>
>>>>>> 4) Use the tracking and identification of these pages, to implement
>>>>>>    special handling (especially in writeback paths) when the pages are
>>>>>>    backed by a filesystem. Again, [1] provides details as to why that is
>>>>>>    desirable.
>>>>>
>>>>> I thought at Plumbers we talked about using a page bit to tag pages
>>>>> that have had their reference count elevated by get_user_pages()? That
>>>>> way there is no need to distinguish put_page() from put_user_page() it
>>>>> just happens internally to put_page(). At the conference Matthew was
>>>>> offering to free up a page bit for this purpose.
>>>>>
>>>>
>>>> ...but then, upon further discussion in that same session, we realized that
>>>> that doesn't help. You need a reference count. Otherwise a random put_page
>>>> could affect your dma-pinned pages, etc, etc.
>>>
>>> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
>>> the end of the page pin event separate from the final 'put' of the
>>> page? Makes sense.
>>>
>>>> I was not able to actually find any place where a single additional page
>>>> bit would help our situation, which is why this still uses LRU fields for
>>>> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
>>>
>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>> does this proposal interact with those?
>>>
>>>> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubbard@nvidia.com
>>>>
>>>>>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>>>>>
>>>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>>>
>>>>> Wish, you could have been there Jan. I'm missing why it's safe to
>>>>> assume that a single put_user_page() is paired with a get_user_page()?
>>>>>
>>>>
>>>> A put_user_page() per page, or a put_user_pages() for an array of pages. See
>>>> patch 0002 for several examples.
>>>
>>> Yes, however I was more concerned about validation and trying to
>>> locate missed places where put_page() is used instead of
>>> put_user_page().
>>>
>>> It would be interesting to see if we could have a debug mode where
>>> get_user_pages() returned dynamically allocated pages from a known
>>> address range and catch drivers that operate on a user-pinned page
>>> without using the proper helper to 'put' it. I think we might also
>>> need a ref_user_page() for drivers that may do their own get_page()
>>> and expect the dma_pinned_count to also increase.

Good idea about a new ref_user_page() call. It's going to hard to find 
those places at all of the call sites, btw.

>>
>> Total crazy idea for this, but this is the right time of day
>> for this (for me at least it is beer time :)) What about mapping
>> all struct page in two different range of kernel virtual address
>> and when get user space is use it returns a pointer from the second
>> range of kernel virtual address to the struct page. Then in put_page
>> you know for sure if the code putting the page got it from GUP or
>> from somewhere else. page_to_pfn() would need some trickery to
>> handle that.
> 
> Yes, exactly what I was thinking, if only as a debug mode since
> instrumenting every pfn/page translation would be expensive.
> 

That does sound viable as a debug mode. I'll try it out. A reliable way
(in both directions) of sorting out put_page() vs. put_user_page() 
would be a huge improvement, even if just in debug mode.

>> Dunno if we are running out of kernel virtual address (outside
>> 32bits that i believe we are trying to shot down quietly behind
>> the bar).
> 
> There's room, KASAN is in a roughly similar place.
> 

Looks like I'd better post a new version of the entire RFC, rather than just
these two patches. It's still less fully-baked than I'd hoped. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:58         ` John Hubbard
@ 2018-12-05  1:00           ` Dan Williams
  2018-12-05  1:15           ` Matthew Wilcox
  1 sibling, 0 replies; 36+ messages in thread
From: Dan Williams @ 2018-12-05  1:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christoph Hellwig, Christopher Lameter, Dalessandro,
	Dennis, Doug Ledford, Jason Gunthorpe, Jérôme Glisse,
	Matthew Wilcox, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 4:58 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/4/18 3:03 PM, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 1:56 PM John Hubbard <jhubbard@nvidia.com> wrote:
[..]
> > Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> > the end of the page pin event separate from the final 'put' of the
> > page? Makes sense.
> >
>
> Yes, that's it exactly.
>
> >> I was not able to actually find any place where a single additional page
> >> bit would help our situation, which is why this still uses LRU fields for
> >> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.
> >
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
>
> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>
> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> LRU field approach is unusable.

Unfortunately, the entire motivation for ZONE_DEVICE was to support
get_user_pages() for persistent memory.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
@ 2018-12-05  1:05   ` John Hubbard
  2018-12-05 14:08     ` David Laight
  0 siblings, 1 reply; 36+ messages in thread
From: John Hubbard @ 2018-12-05  1:05 UTC (permalink / raw)
  To: David Laight, john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel

On 12/4/18 9:10 AM, David Laight wrote:
> From: john.hubbard@gmail.com
>> Sent: 04 December 2018 00:17
>>
>> Summary: I'd like these two patches to go into the next convenient cycle.
>> I *think* that means 4.21.
>>
>> Details
>>
>> At the Linux Plumbers Conference, we talked about this approach [1], and
>> the primary lingering concern was over performance. Tom Talpey helped me
>> through a much more accurate run of the fio performance test, and now
>> it's looking like an under 1% performance cost, to add and remove pages
>> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
>> we should be fine to start converting call sites.
>>
>> This patchset gets the conversion started. Both patches already had a fair
>> amount of review.
> 
> Shouldn't the commit message contain actual details of the change?
> 

Hi David,

This "patch 0000" is not a commit message, as it never shows up in git log.
Each of the follow-up patches does have details about the changes it makes.

But maybe you are really asking for more background information, which I
should have added in this cover letter. Here's a start:

https://lore.kernel.org/r/20181110085041.10071-1-jhubbard@nvidia.com

...and it looks like this small patch series is not going to work out--I'm
going to have to fall back to another RFC spin. So I'll be sure to include 
you and everyone on that. Hope that helps.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  0:58         ` John Hubbard
  2018-12-05  1:00           ` Dan Williams
@ 2018-12-05  1:15           ` Matthew Wilcox
  2018-12-05  1:44             ` Jerome Glisse
  2018-12-05  5:52             ` Dan Williams
  1 sibling, 2 replies; 36+ messages in thread
From: Matthew Wilcox @ 2018-12-05  1:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> On 12/4/18 3:03 PM, Dan Williams wrote:
> > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > does this proposal interact with those?
> 
> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> 
> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> LRU field approach is unusable.

We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
damage:

+++ b/include/linux/mm_types.h
@@ -151,10 +151,12 @@ struct page {
 #endif
                };
                struct {        /* ZONE_DEVICE pages */
+                       unsigned long _zd_pad_2;        /* LRU */
+                       unsigned long _zd_pad_3;        /* LRU */
+                       unsigned long _zd_pad_1;        /* uses mapping */
                        /** @pgmap: Points to the hosting device page map. */
                        struct dev_pagemap *pgmap;
                        unsigned long hmm_data;
-                       unsigned long _zd_pad_1;        /* uses mapping */
                };
 
                /** @rcu_head: You can use this to free a page by RCU. */

You don't use page->private or page->index, do you Dan?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04  7:53   ` Mike Rapoport
@ 2018-12-05  1:40     ` John Hubbard
  0 siblings, 0 replies; 36+ messages in thread
From: John Hubbard @ 2018-12-05  1:40 UTC (permalink / raw)
  To: Mike Rapoport, john.hubbard
  Cc: Andrew Morton, linux-mm, Jan Kara, Tom Talpey, Al Viro,
	Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
	Dan Williams, Dennis Dalessandro, Doug Ledford, Jason Gunthorpe,
	Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Marciniszyn,
	Ralph Campbell, LKML, linux-fsdevel

On 12/3/18 11:53 PM, Mike Rapoport wrote:
> Hi John,
> 
> Thanks for having documentation as a part of the patch. Some kernel-doc
> nits below.
> 
> On Mon, Dec 03, 2018 at 04:17:19PM -0800, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>    for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>>    call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>    implement tracking of these pages. This tracking will be separate from
>>    the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>    special handling (especially in writeback paths) when the pages are
>>    backed by a filesystem. Again, [1] provides details as to why that is
>>    desirable.
>>
>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Christopher Lameter <cl@linux.com>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Jerome Glisse <jglisse@redhat.com>
>> Cc: Christoph Hellwig <hch@infradead.org>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>  include/linux/mm.h | 20 ++++++++++++
>>  mm/swap.c          | 80 ++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5411de93a363..09fbb2c81aba 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
>>  		__put_page(page);
>>  }
>>
>> +/*
>> + * put_user_page() - release a page that had previously been acquired via
>> + * a call to one of the get_user_pages*() functions.
> 
> Please add @page parameter description, otherwise kernel-doc is unhappy

Hi Mike,

Sorry I missed these kerneldoc points from your earlier review! I'll fix it
up now and it will show up in the next posting.

> 
>> + *
>> + * Pages that were pinned via get_user_pages*() must be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below. This is so that eventually, pages that are pinned via
>> + * get_user_pages*() can be separately tracked and uniquely handled. In
>> + * particular, interactions with RDMA and filesystems need special
>> + * handling.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +	put_page(page);
>> +}
>> +
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
>> +void put_user_pages(struct page **pages, unsigned long npages);
>> +
>>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index aa483719922e..bb8c32595e5f 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>
>> +typedef int (*set_dirty_func)(struct page *page);
>> +
>> +static void __put_user_pages_dirty(struct page **pages,
>> +				   unsigned long npages,
>> +				   set_dirty_func sdf)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
>> +		struct page *page = compound_head(pages[index]);
>> +
>> +		if (!PageDirty(page))
>> +			sdf(page);
>> +
>> +		put_user_page(page);
>> +	}
>> +}
>> +
>> +/*
>> + * put_user_pages_dirty() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * set_page_dirty(), which does not lock the page, is used here.
>> + * Therefore, it is the caller's responsibility to ensure that this is
>> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
> 
> Please put the parameters description next to the brief function
> description, as described in [1]
> 
> [1] https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html#function-documentation
> 

OK. 

> 
>> + *
>> + */
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
>> +{
>> +	__put_user_pages_dirty(pages, npages, set_page_dirty);
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty);
>> +
>> +/*
>> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * This is just like put_user_pages_dirty(), except that it invokes
>> + * set_page_dirty_lock(), instead of set_page_dirty().
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
> 
> Ditto

OK.

> 
>> + *
>> + */
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
>> +{
>> +	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
>> +
>> +/*
>> + * put_user_pages() - for each page in the @pages array, release the page
>> + * using put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
>> + *
> 
> And here as well :)

OK.


thanks,
-- 
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:15           ` Matthew Wilcox
@ 2018-12-05  1:44             ` Jerome Glisse
  2018-12-05  1:57               ` John Hubbard
  2018-12-05  5:52             ` Dan Williams
  1 sibling, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2018-12-05  1:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > On 12/4/18 3:03 PM, Dan Williams wrote:
> > > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > does this proposal interact with those?
> > 
> > Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > 
> > If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> > LRU field approach is unusable.
> 
> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> damage:
> 
> +++ b/include/linux/mm_types.h
> @@ -151,10 +151,12 @@ struct page {
>  #endif
>                 };
>                 struct {        /* ZONE_DEVICE pages */
> +                       unsigned long _zd_pad_2;        /* LRU */
> +                       unsigned long _zd_pad_3;        /* LRU */
> +                       unsigned long _zd_pad_1;        /* uses mapping */
>                         /** @pgmap: Points to the hosting device page map. */
>                         struct dev_pagemap *pgmap;
>                         unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
>                 };
>  
>                 /** @rcu_head: You can use this to free a page by RCU. */
> 
> You don't use page->private or page->index, do you Dan?

page->private and page->index are use by HMM DEVICE page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:44             ` Jerome Glisse
@ 2018-12-05  1:57               ` John Hubbard
  2018-12-07  2:45                 ` John Hubbard
  0 siblings, 1 reply; 36+ messages in thread
From: John Hubbard @ 2018-12-05  1:57 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/4/18 5:44 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>> does this proposal interact with those?
>>>
>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>
>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>> LRU field approach is unusable.
>>
>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>> damage:
>>
>> +++ b/include/linux/mm_types.h
>> @@ -151,10 +151,12 @@ struct page {
>>  #endif
>>                 };
>>                 struct {        /* ZONE_DEVICE pages */
>> +                       unsigned long _zd_pad_2;        /* LRU */
>> +                       unsigned long _zd_pad_3;        /* LRU */
>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>                         /** @pgmap: Points to the hosting device page map. */
>>                         struct dev_pagemap *pgmap;
>>                         unsigned long hmm_data;
>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>                 };
>>  
>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>
>> You don't use page->private or page->index, do you Dan?
> 
> page->private and page->index are use by HMM DEVICE page.
> 

OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
dma-pinned information. Which might work. To recap, we need:

-- 1 bit for PageDmaPinned
-- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
-- N bits for a reference count

Those *could* be packed into a single 64-bit field, if really necessary.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:15           ` Matthew Wilcox
  2018-12-05  1:44             ` Jerome Glisse
@ 2018-12-05  5:52             ` Dan Williams
  1 sibling, 0 replies; 36+ messages in thread
From: Dan Williams @ 2018-12-05  5:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Michal Hocko, mike.marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Tue, Dec 4, 2018 at 5:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > On 12/4/18 3:03 PM, Dan Williams wrote:
> > > Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > does this proposal interact with those?
> >
> > Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >
> > If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > LRU field approach is unusable.
>
> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> damage:
>
> +++ b/include/linux/mm_types.h
> @@ -151,10 +151,12 @@ struct page {
>  #endif
>                 };
>                 struct {        /* ZONE_DEVICE pages */
> +                       unsigned long _zd_pad_2;        /* LRU */
> +                       unsigned long _zd_pad_3;        /* LRU */
> +                       unsigned long _zd_pad_1;        /* uses mapping */
>                         /** @pgmap: Points to the hosting device page map. */
>                         struct dev_pagemap *pgmap;
>                         unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
>                 };
>
>                 /** @rcu_head: You can use this to free a page by RCU. */
>
> You don't use page->private or page->index, do you Dan?

I don't use page->private, but page->index is used by the
memory-failure path to do an rmap.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-04 21:56     ` John Hubbard
  2018-12-04 23:03       ` Dan Williams
@ 2018-12-05 11:16       ` Jan Kara
  1 sibling, 0 replies; 36+ messages in thread
From: Jan Kara @ 2018-12-05 11:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe,
	Jérôme Glisse, Matthew Wilcox, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On Tue 04-12-18 13:56:36, John Hubbard wrote:
> On 12/4/18 12:28 PM, Dan Williams wrote:
> > On Mon, Dec 3, 2018 at 4:17 PM <john.hubbard@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Introduces put_user_page(), which simply calls put_page().
> >> This provides a way to update all get_user_pages*() callers,
> >> so that they call put_user_page(), instead of put_page().
> >>
> >> Also introduces put_user_pages(), and a few dirty/locked variations,
> >> as a replacement for release_pages(), and also as a replacement
> >> for open-coded loops that release multiple pages.
> >> These may be used for subsequent performance improvements,
> >> via batching of pages to be released.
> >>
> >> This is the first step of fixing the problem described in [1]. The steps
> >> are:
> >>
> >> 1) (This patch): provide put_user_page*() routines, intended to be used
> >>    for releasing pages that were pinned via get_user_pages*().
> >>
> >> 2) Convert all of the call sites for get_user_pages*(), to
> >>    invoke put_user_page*(), instead of put_page(). This involves dozens of
> >>    call sites, and will take some time.
> >>
> >> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> >>    implement tracking of these pages. This tracking will be separate from
> >>    the existing struct page refcounting.
> >>
> >> 4) Use the tracking and identification of these pages, to implement
> >>    special handling (especially in writeback paths) when the pages are
> >>    backed by a filesystem. Again, [1] provides details as to why that is
> >>    desirable.
> > 
> > I thought at Plumbers we talked about using a page bit to tag pages
> > that have had their reference count elevated by get_user_pages()? That
> > way there is no need to distinguish put_page() from put_user_page() it
> > just happens internally to put_page(). At the conference Matthew was
> > offering to free up a page bit for this purpose.
> > 
> 
> ...but then, upon further discussion in that same session, we realized that
> that doesn't help. You need a reference count. Otherwise a random put_page
> could affect your dma-pinned pages, etc, etc.

Exactly.

> I was not able to actually find any place where a single additional page
> bit would help our situation, which is why this still uses LRU fields for
> both the two bits required (the RFC [1] still applies), and the dma_pinned_count.

So single page bit could help you with performance. In 99% of cases there's
just one reference from GUP. So if you could store that info in page flags,
you could safe yourself a relatively expensive removal from LRU and putting
it back to make space in struct page for proper refcount. But since you
report that the performance isn't that horrible, I'd leave this idea on a
backburner. We can always implement it later in case we find in future we
need to improve the performance.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 0/2] put_user_page*(): start converting the call sites
  2018-12-05  1:05   ` John Hubbard
@ 2018-12-05 14:08     ` David Laight
  0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2018-12-05 14:08 UTC (permalink / raw)
  To: John Hubbard, john.hubbard, Andrew Morton, linux-mm
  Cc: Jan Kara, Tom Talpey, Al Viro, Christian Benvenuti,
	Christoph Hellwig, Christopher Lameter, Dan Williams,
	Dennis Dalessandro, Doug Ledford, Jason Gunthorpe, Jerome Glisse,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Ralph Campbell,
	LKML, linux-fsdevel

From: John Hubbard
> Sent: 05 December 2018 01:06
> On 12/4/18 9:10 AM, David Laight wrote:
> > From: john.hubbard@gmail.com
> >> Sent: 04 December 2018 00:17
> >>
> >> Summary: I'd like these two patches to go into the next convenient cycle.
> >> I *think* that means 4.21.
> >>
> >> Details
> >>
> >> At the Linux Plumbers Conference, we talked about this approach [1], and
> >> the primary lingering concern was over performance. Tom Talpey helped me
> >> through a much more accurate run of the fio performance test, and now
> >> it's looking like an under 1% performance cost, to add and remove pages
> >> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
> >> we should be fine to start converting call sites.
> >>
> >> This patchset gets the conversion started. Both patches already had a fair
> >> amount of review.
> >
> > Shouldn't the commit message contain actual details of the change?
> >
> 
> Hi David,
> 
> This "patch 0000" is not a commit message, as it never shows up in git log.
> Each of the follow-up patches does have details about the changes it makes.

I think you should still describe the change - at least in summary.

The patch I looked at didn't really...
IIRC it still referred to external links.

> But maybe you are really asking for more background information, which I
> should have added in this cover letter. Here's a start:
> 
> https://lore.kernel.org/r/20181110085041.10071-1-jhubbard@nvidia.com

Yes, but links go stale....

> ...and it looks like this small patch series is not going to work out--I'm
> going to have to fall back to another RFC spin. So I'll be sure to include
> you and everyone on that. Hope that helps.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-05  1:57               ` John Hubbard
@ 2018-12-07  2:45                 ` John Hubbard
  2018-12-07 19:16                   ` Jerome Glisse
  0 siblings, 1 reply; 36+ messages in thread
From: John Hubbard @ 2018-12-07  2:45 UTC (permalink / raw)
  To: Jerome Glisse, Matthew Wilcox
  Cc: Dan Williams, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christoph Hellwig, Christopher Lameter,
	Dalessandro, Dennis, Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn, rcampbell, Linux Kernel Mailing List,
	linux-fsdevel

On 12/4/18 5:57 PM, John Hubbard wrote:
> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>> does this proposal interact with those?
>>>>
>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>>
>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>>> LRU field approach is unusable.
>>>
>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>> damage:
>>>
>>> +++ b/include/linux/mm_types.h
>>> @@ -151,10 +151,12 @@ struct page {
>>>  #endif
>>>                 };
>>>                 struct {        /* ZONE_DEVICE pages */
>>> +                       unsigned long _zd_pad_2;        /* LRU */
>>> +                       unsigned long _zd_pad_3;        /* LRU */
>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>>                         /** @pgmap: Points to the hosting device page map. */
>>>                         struct dev_pagemap *pgmap;
>>>                         unsigned long hmm_data;
>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>>                 };
>>>  
>>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>>
>>> You don't use page->private or page->index, do you Dan?
>>
>> page->private and page->index are use by HMM DEVICE page.
>>
> 
> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> dma-pinned information. Which might work. To recap, we need:
> 
> -- 1 bit for PageDmaPinned
> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> -- N bits for a reference count
> 
> Those *could* be packed into a single 64-bit field, if really necessary.
> 

...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
However, it is still possible for this to work.

Matthew, can I have that bit now please? I'm about out of options, and now it will actually
solve the problem here.

Given:

1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
for that case. 

2) There is an independent bit available (according to Matthew). 

3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
   in that case.

4) get_user_pages() must work on ZONE_DEVICE and HMM pages.

5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
unsigned long field.

So that leads to the following approach:

-- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).

For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
optionally renamed:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 017ab82e36ca..b5dcd9398cae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -90,8 +90,8 @@ struct page {
                                 * are in use.
                                 */
                                struct {
-                                       unsigned long dma_pinned_flags;
-                                       atomic_t      dma_pinned_count;
+                                       unsigned long dma_pinned_flags; /* LRU.next */
+                                       atomic_t      dma_pinned_count; /* LRU.prev */
                                };
                        };
                        /* See page-flags.h for PAGE_MAPPING_FLAGS */
@@ -161,9 +161,9 @@ struct page {
                };
                struct {        /* ZONE_DEVICE pages */
                        /** @pgmap: Points to the hosting device page map. */
-                       struct dev_pagemap *pgmap;
-                       unsigned long hmm_data;
-                       unsigned long _zd_pad_1;        /* uses mapping */
+                       struct dev_pagemap *pgmap;      /* LRU.next */
+                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
+                       unsigned long hmm_data;         /* uses mapping */
                };
 
                /** @rcu_head: You can use this to free a page by RCU. */



-- Use an additional, fully independent page bit (from Matthew) for PageDmaPinned.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07  2:45                 ` John Hubbard
@ 2018-12-07 19:16                   ` Jerome Glisse
  2018-12-07 19:26                     ` Dan Williams
  2018-12-08  0:52                     ` John Hubbard
  0 siblings, 2 replies; 36+ messages in thread
From: Jerome Glisse @ 2018-12-07 19:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> On 12/4/18 5:57 PM, John Hubbard wrote:
> > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> >>>>> does this proposal interact with those?
> >>>>
> >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >>>>
> >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> >>>> LRU field approach is unusable.
> >>>
> >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> >>> damage:
> >>>
> >>> +++ b/include/linux/mm_types.h
> >>> @@ -151,10 +151,12 @@ struct page {
> >>>  #endif
> >>>                 };
> >>>                 struct {        /* ZONE_DEVICE pages */
> >>> +                       unsigned long _zd_pad_2;        /* LRU */
> >>> +                       unsigned long _zd_pad_3;        /* LRU */
> >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>                         /** @pgmap: Points to the hosting device page map. */
> >>>                         struct dev_pagemap *pgmap;
> >>>                         unsigned long hmm_data;
> >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>                 };
> >>>  
> >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> >>>
> >>> You don't use page->private or page->index, do you Dan?
> >>
> >> page->private and page->index are use by HMM DEVICE page.
> >>
> > 
> > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> > dma-pinned information. Which might work. To recap, we need:
> > 
> > -- 1 bit for PageDmaPinned
> > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > -- N bits for a reference count
> > 
> > Those *could* be packed into a single 64-bit field, if really necessary.
> > 
> 
> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> However, it is still possible for this to work.
> 
> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> solve the problem here.
> 
> Given:
> 
> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
> for that case. 
> 
> 2) There is an independent bit available (according to Matthew). 
> 
> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
>    in that case.

To expend on this, HMM private page are use for anonymous page
so the index and mapping fields have the value you expect for
such pages. Down the road i want also to support file backed
page with HMM private (mapping, private, index).

For HMM public both anonymous and file back page are supported
today (HMM public is only useful on platform with something like
OpenCAPI, CCIX or NVlink ... so PowerPC for now).

> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.

get_user_pages() only need to work with HMM public page not the
private one as we can not allow _anyone_ to pin HMM private page.
So on get_user_pages() on HMM private we get a page fault and
it is migrated back to regular memory.


> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
> unsigned long field.
> 
> So that leads to the following approach:
> 
> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
> 
> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
> optionally renamed:
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 017ab82e36ca..b5dcd9398cae 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -90,8 +90,8 @@ struct page {
>                                  * are in use.
>                                  */
>                                 struct {
> -                                       unsigned long dma_pinned_flags;
> -                                       atomic_t      dma_pinned_count;
> +                                       unsigned long dma_pinned_flags; /* LRU.next */
> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
>                                 };
>                         };
>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> @@ -161,9 +161,9 @@ struct page {
>                 };
>                 struct {        /* ZONE_DEVICE pages */
>                         /** @pgmap: Points to the hosting device page map. */
> -                       struct dev_pagemap *pgmap;
> -                       unsigned long hmm_data;
> -                       unsigned long _zd_pad_1;        /* uses mapping */
> +                       struct dev_pagemap *pgmap;      /* LRU.next */
> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
> +                       unsigned long hmm_data;         /* uses mapping */

This breaks HMM today as hmm_data would alias with mapping field.
hmm_data can only be in LRU.prev

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:16                   ` Jerome Glisse
@ 2018-12-07 19:26                     ` Dan Williams
  2018-12-07 19:40                       ` Jerome Glisse
  2018-12-08  0:52                     ` John Hubbard
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Williams @ 2018-12-07 19:26 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > >>>>> does this proposal interact with those?
> > >>>>
> > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > >>>>
> > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > >>>> LRU field approach is unusable.
> > >>>
> > >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> > >>> damage:
> > >>>
> > >>> +++ b/include/linux/mm_types.h
> > >>> @@ -151,10 +151,12 @@ struct page {
> > >>>  #endif
> > >>>                 };
> > >>>                 struct {        /* ZONE_DEVICE pages */
> > >>> +                       unsigned long _zd_pad_2;        /* LRU */
> > >>> +                       unsigned long _zd_pad_3;        /* LRU */
> > >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> > >>>                         /** @pgmap: Points to the hosting device page map. */
> > >>>                         struct dev_pagemap *pgmap;
> > >>>                         unsigned long hmm_data;
> > >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> > >>>                 };
> > >>>
> > >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> > >>>
> > >>> You don't use page->private or page->index, do you Dan?
> > >>
> > >> page->private and page->index are use by HMM DEVICE page.
> > >>
> > >
> > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for
> > > dma-pinned information. Which might work. To recap, we need:
> > >
> > > -- 1 bit for PageDmaPinned
> > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > -- N bits for a reference count
> > >
> > > Those *could* be packed into a single 64-bit field, if really necessary.
> > >
> >
> > ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> > However, it is still possible for this to work.
> >
> > Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> > solve the problem here.
> >
> > Given:
> >
> > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> > That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required,
> > for that case.
> >
> > 2) There is an independent bit available (according to Matthew).
> >
> > 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter
> >    in that case.
>
> To expend on this, HMM private page are use for anonymous page
> so the index and mapping fields have the value you expect for
> such pages. Down the road i want also to support file backed
> page with HMM private (mapping, private, index).
>
> For HMM public both anonymous and file back page are supported
> today (HMM public is only useful on platform with something like
> OpenCAPI, CCIX or NVlink ... so PowerPC for now).
>
> > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
>
> get_user_pages() only need to work with HMM public page not the
> private one as we can not allow _anyone_ to pin HMM private page.

How does HMM enforce that? Because the kernel should not allow *any*
memory management facility to arbitrarily fail direct-I/O operations.
That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
any arch that plans to properly support DAX. I would classify any
memory management that can't support direct-I/O in the same
"experimental" category.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:26                     ` Dan Williams
@ 2018-12-07 19:40                       ` Jerome Glisse
  0 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2018-12-07 19:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 11:26:34AM -0800, Dan Williams wrote:
> On Fri, Dec 7, 2018 at 11:16 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> > > On 12/4/18 5:57 PM, John Hubbard wrote:
> > > > On 12/4/18 5:44 PM, Jerome Glisse wrote:
> > > >> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> > > >>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> > > >>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> > > >>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> > > >>>>> does this proposal interact with those?
> > > >>>>
> > > >>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> > > >>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> > > >>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> > > >>>>
> > > >>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
> > > >>>> LRU field approach is unusable.
> > > >>>
> > > >>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> > > >>> damage:
> > > >>>
> > > >>> +++ b/include/linux/mm_types.h
> > > >>> @@ -151,10 +151,12 @@ struct page {
> > > >>>  #endif
> > > >>>                 };
> > > >>>                 struct {        /* ZONE_DEVICE pages */
> > > >>> +                       unsigned long _zd_pad_2;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_3;        /* LRU */
> > > >>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                         /** @pgmap: Points to the hosting device page map. */
> > > >>>                         struct dev_pagemap *pgmap;
> > > >>>                         unsigned long hmm_data;
> > > >>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> > > >>>                 };
> > > >>>
> > > >>>                 /** @rcu_head: You can use this to free a page by RCU. */
> > > >>>
> > > >>> You don't use page->private or page->index, do you Dan?
> > > >>
> > > >> page->private and page->index are use by HMM DEVICE page.
> > > >>
> > > >
> > > > OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for
> > > > dma-pinned information. Which might work. To recap, we need:
> > > >
> > > > -- 1 bit for PageDmaPinned
> > > > -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> > > > -- N bits for a reference count
> > > >
> > > > Those *could* be packed into a single 64-bit field, if really necessary.
> > > >
> > >
> > > ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> > > However, it is still possible for this to work.
> > >
> > > Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> > > solve the problem here.
> > >
> > > Given:
> > >
> > > 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> > > That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required,
> > > for that case.
> > >
> > > 2) There is an independent bit available (according to Matthew).
> > >
> > > 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter
> > >    in that case.
> >
> > To expend on this, HMM private page are use for anonymous page
> > so the index and mapping fields have the value you expect for
> > such pages. Down the road i want also to support file backed
> > page with HMM private (mapping, private, index).
> >
> > For HMM public both anonymous and file back page are supported
> > today (HMM public is only useful on platform with something like
> > OpenCAPI, CCIX or NVlink ... so PowerPC for now).
> >
> > > 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> >
> > get_user_pages() only need to work with HMM public page not the
> > private one as we can not allow _anyone_ to pin HMM private page.
> 
> How does HMM enforce that? Because the kernel should not allow *any*
> memory management facility to arbitrarily fail direct-I/O operations.
> That's why CONFIG_FS_DAX_LIMITED is a temporary / experimental hack
> for S390 and ZONE_DEVICE was invented to bypass that hack for X86 and
> any arch that plans to properly support DAX. I would classify any
> memory management that can't support direct-I/O in the same
> "experimental" category.

It does not fail direct-I/O GUP sees a swap entry for the private
memory and it behave just like if the page was swap to disk so i
am not introducing any new behavior.

With HMM page everything just work as you expect they would from
CPU point of view. It is just like swap.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-07 19:16                   ` Jerome Glisse
  2018-12-07 19:26                     ` Dan Williams
@ 2018-12-08  0:52                     ` John Hubbard
  2018-12-08  2:24                       ` Jerome Glisse
                                         ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: John Hubbard @ 2018-12-08  0:52 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On 12/7/18 11:16 AM, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
>> On 12/4/18 5:57 PM, John Hubbard wrote:
>>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>>>> does this proposal interact with those?
>>>>>>
>>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
>>>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
>>>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
>>>>>>
>>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>>>>> LRU field approach is unusable.
>>>>>
>>>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>>>> damage:
>>>>>
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -151,10 +151,12 @@ struct page {
>>>>>  #endif
>>>>>                 };
>>>>>                 struct {        /* ZONE_DEVICE pages */
>>>>> +                       unsigned long _zd_pad_2;        /* LRU */
>>>>> +                       unsigned long _zd_pad_3;        /* LRU */
>>>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                         /** @pgmap: Points to the hosting device page map. */
>>>>>                         struct dev_pagemap *pgmap;
>>>>>                         unsigned long hmm_data;
>>>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                 };
>>>>>  
>>>>>                 /** @rcu_head: You can use this to free a page by RCU. */
>>>>>
>>>>> You don't use page->private or page->index, do you Dan?
>>>>
>>>> page->private and page->index are use by HMM DEVICE page.
>>>>
>>>
>>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
>>> dma-pinned information. Which might work. To recap, we need:
>>>
>>> -- 1 bit for PageDmaPinned
>>> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
>>> -- N bits for a reference count
>>>
>>> Those *could* be packed into a single 64-bit field, if really necessary.
>>>
>>
>> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
>> However, it is still possible for this to work.
>>
>> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
>> solve the problem here.
>>
>> Given:
>>
>> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
>> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
>> for that case. 
>>
>> 2) There is an independent bit available (according to Matthew). 
>>
>> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
>>    in that case.
> 
> To expend on this, HMM private page are use for anonymous page
> so the index and mapping fields have the value you expect for
> such pages. Down the road i want also to support file backed
> page with HMM private (mapping, private, index).
> 
> For HMM public both anonymous and file back page are supported
> today (HMM public is only useful on platform with something like
> OpenCAPI, CCIX or NVlink ... so PowerPC for now).
> 
>> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> 
> get_user_pages() only need to work with HMM public page not the
> private one as we can not allow _anyone_ to pin HMM private page.
> So on get_user_pages() on HMM private we get a page fault and
> it is migrated back to regular memory.
> 
> 
>> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
>> unsigned long field.
>>
>> So that leads to the following approach:
>>
>> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
>> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
>>
>> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
>> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
>> optionally renamed:
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 017ab82e36ca..b5dcd9398cae 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -90,8 +90,8 @@ struct page {
>>                                  * are in use.
>>                                  */
>>                                 struct {
>> -                                       unsigned long dma_pinned_flags;
>> -                                       atomic_t      dma_pinned_count;
>> +                                       unsigned long dma_pinned_flags; /* LRU.next */
>> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
>>                                 };
>>                         };
>>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
>> @@ -161,9 +161,9 @@ struct page {
>>                 };
>>                 struct {        /* ZONE_DEVICE pages */
>>                         /** @pgmap: Points to the hosting device page map. */
>> -                       struct dev_pagemap *pgmap;
>> -                       unsigned long hmm_data;
>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>> +                       struct dev_pagemap *pgmap;      /* LRU.next */
>> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
>> +                       unsigned long hmm_data;         /* uses mapping */
> 
> This breaks HMM today as hmm_data would alias with mapping field.
> hmm_data can only be in LRU.prev
> 

I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
completely out of space. At this point, after thinking about it carefully, it seems clear
that it's time for a single, new field:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..1c789e324da8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -182,6 +182,9 @@ struct page {
        /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
        atomic_t _refcount;
 
+       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
+       atomic_t _dma_pinned_count;
+
 #ifdef CONFIG_MEMCG
        struct mem_cgroup *mem_cgroup;
 #endif


...because after all, the reason this is so difficult is that this fix has to work
in pretty much every configuration. get_user_pages() use is widespread, it's a very
general facility, and...it needs fixing.  And we're out of space. 

I'm going to send out an updated RFC that shows the latest, and I think it's going
to include the above.

-- 
thanks,
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  0:52                     ` John Hubbard
@ 2018-12-08  2:24                       ` Jerome Glisse
  2018-12-08  5:18                       ` Matthew Wilcox
  2018-12-08  7:16                       ` Dan Williams
  2 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2018-12-08  2:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 04:52:42PM -0800, John Hubbard wrote:
> On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> >> On 12/4/18 5:57 PM, John Hubbard wrote:
> >>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
> >>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
> >>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
> >>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
> >>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> >>>>>>> does this proposal interact with those?
> >>>>>>
> >>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an entire
> >>>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said another
> >>>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?
> >>>>>>
> >>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
> >>>>>> LRU field approach is unusable.
> >>>>>
> >>>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
> >>>>> damage:
> >>>>>
> >>>>> +++ b/include/linux/mm_types.h
> >>>>> @@ -151,10 +151,12 @@ struct page {
> >>>>>  #endif
> >>>>>                 };
> >>>>>                 struct {        /* ZONE_DEVICE pages */
> >>>>> +                       unsigned long _zd_pad_2;        /* LRU */
> >>>>> +                       unsigned long _zd_pad_3;        /* LRU */
> >>>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>>>                         /** @pgmap: Points to the hosting device page map. */
> >>>>>                         struct dev_pagemap *pgmap;
> >>>>>                         unsigned long hmm_data;
> >>>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >>>>>                 };
> >>>>>  
> >>>>>                 /** @rcu_head: You can use this to free a page by RCU. */
> >>>>>
> >>>>> You don't use page->private or page->index, do you Dan?
> >>>>
> >>>> page->private and page->index are use by HMM DEVICE page.
> >>>>
> >>>
> >>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
> >>> dma-pinned information. Which might work. To recap, we need:
> >>>
> >>> -- 1 bit for PageDmaPinned
> >>> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> >>> -- N bits for a reference count
> >>>
> >>> Those *could* be packed into a single 64-bit field, if really necessary.
> >>>
> >>
> >> ...actually, this needs to work on 32-bit systems, as well. And HMM is using a lot.
> >> However, it is still possible for this to work.
> >>
> >> Matthew, can I have that bit now please? I'm about out of options, and now it will actually
> >> solve the problem here.
> >>
> >> Given:
> >>
> >> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on the LRU.
> >> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is required, 
> >> for that case. 
> >>
> >> 2) There is an independent bit available (according to Matthew). 
> >>
> >> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for a counter 
> >>    in that case.
> > 
> > To expend on this, HMM private page are use for anonymous page
> > so the index and mapping fields have the value you expect for
> > such pages. Down the road i want also to support file backed
> > page with HMM private (mapping, private, index).
> > 
> > For HMM public both anonymous and file back page are supported
> > today (HMM public is only useful on platform with something like
> > OpenCAPI, CCIX or NVlink ... so PowerPC for now).
> > 
> >> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> > 
> > get_user_pages() only need to work with HMM public page not the
> > private one as we can not allow _anyone_ to pin HMM private page.
> > So on get_user_pages() on HMM private we get a page fault and
> > it is migrated back to regular memory.
> > 
> > 
> >> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a complete
> >> unsigned long field.
> >>
> >> So that leads to the following approach:
> >>
> >> -- Use a single unsigned long field for an atomic reference count for the DMA pinned count.
> >> For normal pages, this will be the *second* field of the LRU (in order to avoid PageTail bit).
> >>
> >> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU field is 
> >> available and reserved for this DMA pinned count. Basically _zd_pad_1 gets move up and
> >> optionally renamed:
> >>
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 017ab82e36ca..b5dcd9398cae 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -90,8 +90,8 @@ struct page {
> >>                                  * are in use.
> >>                                  */
> >>                                 struct {
> >> -                                       unsigned long dma_pinned_flags;
> >> -                                       atomic_t      dma_pinned_count;
> >> +                                       unsigned long dma_pinned_flags; /* LRU.next */
> >> +                                       atomic_t      dma_pinned_count; /* LRU.prev */
> >>                                 };
> >>                         };
> >>                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> >> @@ -161,9 +161,9 @@ struct page {
> >>                 };
> >>                 struct {        /* ZONE_DEVICE pages */
> >>                         /** @pgmap: Points to the hosting device page map. */
> >> -                       struct dev_pagemap *pgmap;
> >> -                       unsigned long hmm_data;
> >> -                       unsigned long _zd_pad_1;        /* uses mapping */
> >> +                       struct dev_pagemap *pgmap;      /* LRU.next */
> >> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
> >> +                       unsigned long hmm_data;         /* uses mapping */
> > 
> > This breaks HMM today as hmm_data would alias with mapping field.
> > hmm_data can only be in LRU.prev
> > 
> 
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> completely out of space. At this point, after thinking about it carefully, it seems clear
> that it's time for a single, new field:
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
>         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>         atomic_t _refcount;
>  
> +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +       atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
>         struct mem_cgroup *mem_cgroup;
>  #endif
> 
> 
> ...because after all, the reason this is so difficult is that this fix has to work
> in pretty much every configuration. get_user_pages() use is widespread, it's a very
> general facility, and...it needs fixing.  And we're out of space. 
> 
> I'm going to send out an updated RFC that shows the latest, and I think it's going
> to include the above.

Another crazy idea, why not treating GUP as another mapping of the page
and caller of GUP would have to provide either a fake anon_vma struct or
a fake vma struct (or both for PRIVATE mapping of a file where you can
have a mix of both private and file page thus only if it is a read only
GUP) that would get added to the list of existing mapping.

So the flow would be:
    somefunction_thatuse_gup()
    {
        ...
        GUP(_fast)(vma, ..., fake_anon, fake_vma);
        ...
    }

    GUP(vma, ..., fake_anon, fake_vma)
    {
        if (vma->flags == ANON) {
            // Add the fake anon vma to the anon vma chain as a child
            // of current vma
        } else {
            // Add the fake vma to the mapping tree
        }

        // The existing GUP except that now it inc mapcount and not
        // refcount
        GUP_old(..., &nanonymous, &nfiles);

        atomic_add(&fake_anon->refcount, nanonymous);
        atomic_add(&fake_vma->refcount, nfiles);

        return nanonymous + nfiles;
    }

I believe all call place of GUP could be updated they fall into 2
categories:
    - fake_anon/fake_vma on stack (direct I/O and few other who
      just do GUP inside their work function and drop reference
      their too)
    - fake_anon/fake_vma as part of the object they have ie GUP
      user that have some kind of struct where they keep the result
      of the GUP around (most user in driver directory fall under
      that)


Few nice bonus:
    [B1] GUP_pin <= (mapcount - refcount) ie it gives a boundary for
         number of GUP on the page (some other part of the kernel might
         still temporarily inc the refcount without a mapcount increase)
    [B2] can add a revoke call back as part of the fake anon_vma/
         vma structure (if the existing GUP user can do that or maybe
         something like an emergency revoke when the memory is poisonous)
    [B3] extra cost is once per GUP call not per page so the impact
         on performance should definitly be better
    [B4] no need to modify LRU or complexify the inner of GUP code
         only the pre-ambule.

Few issues with that proposal:
    [I1] need to check mapcount in page free code path to avoid
         freeing the page if refcount reach 0 before all the GUP
         user unmap the page, page is no in some zombie state ie
         refcount = 0 and mapcount > 0
    [I2] KVM seems to use GUP for weird reasons, it might be better
         to convert KVM to use something else than GUP that have the
         same end result from KVM point of view (i believe it uses it
         to force page fault on the host page). Maybe we can work
         with KVM folks and see if we can provide them with the API
         that actualy do what they want instead of them using GUP
         for its side effect
    [I3] ANON page will need special handling as this will confuse
         mm code path that deal with COW pages ... the page is not
         COW but still has mapcount > 1
    [I4] GUP must be per vma (not an issue everywhere) we can provide
         helpers to iterate over virtual address by vma
    [I5] to ascertain that a page is under GUP might be costly code
         would look like:
            bool page_is_guped(struct page *page)
            {
                if (page_mapcount(page) > page_refcount(page)) {
                    return true;
                }
                // Unknown have to walk the reverse mapping to see
                // if they are any fake anon or fake vma and even
                // if there is we could not say for sure if they
                // apply to the page under consideration we would
                // have to assume so unless:
                //
                // GUP user keep around the array they used to store
                // the GUP results then we can check if the page is
                // in there.
            }

Probably other issues i can not think of right now ...


Maybe even better would be to add a pointer to struct address_space
and re-arrange struct anon_vma to move unsigned degree at the top and
define some flag in it (i don't think we want to grow anon_vam struct)
so that we can differentiate fake anon_vma from others by looking at
first word.

Patchset would probably looks like:
    [1-N] Convert all put_page to put_user_page() with an extra void
          pointer as first step (to allow converting using one at a
          time)
    [N-M] convet every GUP user to provide the fake struct (from
          stack or as part of their object that track GUP result)
    [M-O] patches to add all the helpers and changes inside mm to
          handle fake vma and fake anon vma and the implication of
          having mapcount > refcount (not all time)
    [P] convert GUP to add the fake to anon_vma/mapping and convert
        GUP to inc mapcount and not refcount

Note that the GUP user do not necessarily need to keep the fake
anon or vma struct as part of their own struct. It can use a key
system ie:
    put_user_page_with_key(page, key);
    versus
    put_user_page(page, fake_anon/fake_vma);

Then put_user_page would walk the list of mapping of the page
until it finds the fake anon or fake vma that have the matching
key and dec the refcount of that fake struct and free it once
it reaches zero ...

Anyway they are few thing we can do to alievate the pain for the
GUP users.


Maybe this is crazy but this is what i have without needing to add
a field to struct page.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  0:52                     ` John Hubbard
  2018-12-08  2:24                       ` Jerome Glisse
@ 2018-12-08  5:18                       ` Matthew Wilcox
  2018-12-08  7:16                       ` Dan Williams
  2 siblings, 0 replies; 36+ messages in thread
From: Matthew Wilcox @ 2018-12-08  5:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Dan Williams, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, mike.marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 04:52:42PM -0800, John Hubbard wrote:
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> completely out of space. At this point, after thinking about it carefully, it seems clear
> that it's time for a single, new field:

Sorry for not replying earlier; I'm travelling and have had trouble
keeping on top of my mail.

Adding this field will grow struct page by 4-8 bytes, so it will no
longer be 64 bytes.  This isn't an acceptable answer.

We have a few options for bits.  One is that we have (iirc) two
bits available in page->flags on 32-bit.  That'll force a few more
configurations into using _last_cpupid and/or page_ext.  I'm not a huge
fan of this approach.

The second is to use page->lru.next bit 1.  This requires some care
because m68k allows misaligned pointers.  If the list_head that it's
joined to is misaligned, we'll be in trouble.  This can get tricky because
some pages are attached to list_heads which are on the stack ... and I
don't think gcc guarantees __aligned attributes work for stack variables.

The third is to use page->lru.prev bit 0.  We'd want to switch pgmap
and hmm_data around to make this work, and we'd want to record this
in mm_types.h so nobody tries to use a field which aliases with
page->lru.prev and has bit 0 set on a page which can be mapped to
userspace (which I currently believe to be true).

The fourth is to use a bit in page->flags for 64-bit and a bit in
page_ext->flags for 32-bit.  Or we could get rid of page_ext and grow
struct page with a ->flags2 on 32-bit.

Fifth, it isn't clear to me how many bits might be left in ->_last_cpupid
at this point, and perhaps there's scope for using a bit in there.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
>         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>         atomic_t _refcount;
>  
> +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +       atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
>         struct mem_cgroup *mem_cgroup;
>  #endif
> 
> 
> ...because after all, the reason this is so difficult is that this fix has to work
> in pretty much every configuration. get_user_pages() use is widespread, it's a very
> general facility, and...it needs fixing.  And we're out of space. 
> 
> I'm going to send out an updated RFC that shows the latest, and I think it's going
> to include the above.
> 
> -- 
> thanks,
> John Hubbard
> NVIDIA
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  0:52                     ` John Hubbard
  2018-12-08  2:24                       ` Jerome Glisse
  2018-12-08  5:18                       ` Matthew Wilcox
@ 2018-12-08  7:16                       ` Dan Williams
  2018-12-08 16:33                         ` Jerome Glisse
  2 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2018-12-08  7:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jérôme Glisse, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 7, 2018 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
[..]
> I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> completely out of space. At this point, after thinking about it carefully, it seems clear
> that it's time for a single, new field:
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..1c789e324da8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -182,6 +182,9 @@ struct page {
>         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>         atomic_t _refcount;
>
> +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> +       atomic_t _dma_pinned_count;
> +
>  #ifdef CONFIG_MEMCG
>         struct mem_cgroup *mem_cgroup;
>  #endif
>
>
> ...because after all, the reason this is so difficult is that this fix has to work
> in pretty much every configuration. get_user_pages() use is widespread, it's a very
> general facility, and...it needs fixing.  And we're out of space.

HMM seems entirely too greedy in this regard. Especially with zero
upstream users. When can we start to delete the pieces of HMM that
have no upstream consumers? I would think that would be 4.21 / 5.0 as
there needs to be some forcing function. We can always re-add pieces
of HMM with it's users when / if they arrive.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08  7:16                       ` Dan Williams
@ 2018-12-08 16:33                         ` Jerome Glisse
  2018-12-08 16:48                           ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2018-12-08 16:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Matthew Wilcox, John Hubbard, Andrew Morton,
	Linux MM, Jan Kara, tom, Al Viro, benve, Christoph Hellwig,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Fri, Dec 07, 2018 at 11:16:32PM -0800, Dan Williams wrote:
> On Fri, Dec 7, 2018 at 4:53 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 12/7/18 11:16 AM, Jerome Glisse wrote:
> > > On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
> [..]
> > I see. OK, HMM has done an efficient job of mopping up unused fields, and now we are
> > completely out of space. At this point, after thinking about it carefully, it seems clear
> > that it's time for a single, new field:
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5ed8f6292a53..1c789e324da8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -182,6 +182,9 @@ struct page {
> >         /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> >         atomic_t _refcount;
> >
> > +       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
> > +       atomic_t _dma_pinned_count;
> > +
> >  #ifdef CONFIG_MEMCG
> >         struct mem_cgroup *mem_cgroup;
> >  #endif
> >
> >
> > ...because after all, the reason this is so difficult is that this fix has to work
> > in pretty much every configuration. get_user_pages() use is widespread, it's a very
> > general facility, and...it needs fixing.  And we're out of space.
> 
> HMM seems entirely too greedy in this regard. Especially with zero
> upstream users. When can we start to delete the pieces of HMM that
> have no upstream consumers? I would think that would be 4.21 / 5.0 as
> there needs to be some forcing function. We can always re-add pieces
> of HMM with it's users when / if they arrive.

Patchset to use HMM inside nouveau have already been posted, some
of the bits have already made upstream and more are line up for
next merge window.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:33                         ` Jerome Glisse
@ 2018-12-08 16:48                           ` Christoph Hellwig
  2018-12-08 17:47                             ` Jerome Glisse
  2018-12-08 18:09                             ` Dan Williams
  0 siblings, 2 replies; 36+ messages in thread
From: Christoph Hellwig @ 2018-12-08 16:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christoph Hellwig, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> Patchset to use HMM inside nouveau have already been posted, some
> of the bits have already made upstream and more are line up for
> next merge window.

Even with that it is a relative fringe feature compared to making
something like get_user_pages() that is literally used every to actually
work properly.

So I think we need to kick out HMM here and just find another place for
it to store data.

And just to make clear that I'm not picking just on this - the same is
true to a just a little smaller extent for the pgmap..

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:48                           ` Christoph Hellwig
@ 2018-12-08 17:47                             ` Jerome Glisse
  2018-12-08 18:26                               ` Christoph Hellwig
  2018-12-08 18:09                             ` Dan Williams
  1 sibling, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2018-12-08 17:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 08:48:25AM -0800, Christoph Hellwig wrote:
> On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > Patchset to use HMM inside nouveau have already been posted, some
> > of the bits have already made upstream and more are line up for
> > next merge window.
> 
> Even with that it is a relative fringe feature compared to making
> something like get_user_pages() that is literally used every to actually
> work properly.
> 
> So I think we need to kick out HMM here and just find another place for
> it to store data.
> 
> And just to make clear that I'm not picking just on this - the same is
> true to a just a little smaller extent for the pgmap..

Most of the user of GUP are well behave (everything under driver/gpu and
so is mellanox driver and many other) ie they abide by mmu notifier
invalidation call backs. They are a handfull of device driver that thought
they could just do GUP and ignore the mmu notifier part and those are the
one being problematic. So to me it feels like bystander are be shot for no
good reasons.

I proposed an alternative solution to this GUP thing and i don't think it
is a crazy one and thinking about it we only need to do that for file back
page so we can leave untouch the anonymous page case. This would put a
small burden on the user of GUP (by the way i am working on removing GUP
from drivers/gpu and other well behave driver, patch posted on dri-devel
for some of the GPU already).

So why not explore my idea and see if they are any roadblock on it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 16:48                           ` Christoph Hellwig
  2018-12-08 17:47                             ` Jerome Glisse
@ 2018-12-08 18:09                             ` Dan Williams
  2018-12-08 18:12                               ` Christoph Hellwig
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Williams @ 2018-12-08 18:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jérôme Glisse, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 8, 2018 at 8:48 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > Patchset to use HMM inside nouveau have already been posted, some
> > of the bits have already made upstream and more are line up for
> > next merge window.
>
> Even with that it is a relative fringe feature compared to making
> something like get_user_pages() that is literally used every to actually
> work properly.
>
> So I think we need to kick out HMM here and just find another place for
> it to store data.
>
> And just to make clear that I'm not picking just on this - the same is
> true to a just a little smaller extent for the pgmap..

Fair enough, I cringed as I took a full pointer for that use case, I'm
happy to look at ways of consolidating or dropping that usage.

Another fix that may put pressure 'struct page' is resolving the
untenable situation of dax being incompatible with reflink, i.e.
reflink currently requires page-cache pages. Dave has talked about
silently establishing page-cache entries when a dax-page is cow'd for
reflink, but I wonder if we could go the other way and introduce the
mechanism of a page belonging to multiple mappings simultaneously and
managed by the filesystem.

Both HMM and ZONE_DEVICE in general are guilty of side-stepping the mm
and I'm in favor of undoing that as much as possible,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 18:09                             ` Dan Williams
@ 2018-12-08 18:12                               ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2018-12-08 18:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jérôme Glisse, John Hubbard,
	Matthew Wilcox, John Hubbard, Andrew Morton, Linux MM, Jan Kara,
	tom, Al Viro, benve, Christopher Lameter, Dalessandro, Dennis,
	Doug Ledford, Jason Gunthorpe, Michal Hocko, Mike Marciniszyn,
	rcampbell, Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 10:09:26AM -0800, Dan Williams wrote:
> Another fix that may put pressure 'struct page' is resolving the
> untenable situation of dax being incompatible with reflink, i.e.
> reflink currently requires page-cache pages. Dave has talked about
> silently establishing page-cache entries when a dax-page is cow'd for
> reflink, but I wonder if we could go the other way and introduce the
> mechanism of a page belonging to multiple mappings simultaneously and
> managed by the filesystem.

FYI, I had a a prototype for DAX + reflink that didn't require
the page cache, although it badly reimplemented parts of it.  But
that was a long time ago, before we started requiring struct page
for the DAX memory.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 17:47                             ` Jerome Glisse
@ 2018-12-08 18:26                               ` Christoph Hellwig
  2018-12-08 18:45                                 ` Jerome Glisse
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2018-12-08 18:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christoph Hellwig, Dan Williams, John Hubbard, Matthew Wilcox,
	John Hubbard, Andrew Morton, Linux MM, Jan Kara, tom, Al Viro,
	benve, Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 12:47:30PM -0500, Jerome Glisse wrote:
> Most of the user of GUP are well behave (everything under driver/gpu and
> so is mellanox driver and many other) ie they abide by mmu notifier
> invalidation call backs. They are a handfull of device driver that thought
> they could just do GUP and ignore the mmu notifier part and those are the
> one being problematic. So to me it feels like bystander are be shot for no
> good reasons.

get_user_pages is used by every single direct I/O, and while the race
windows in that case are small they very much exists.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
  2018-12-08 18:26                               ` Christoph Hellwig
@ 2018-12-08 18:45                                 ` Jerome Glisse
  0 siblings, 0 replies; 36+ messages in thread
From: Jerome Glisse @ 2018-12-08 18:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, John Hubbard, Matthew Wilcox, John Hubbard,
	Andrew Morton, Linux MM, Jan Kara, tom, Al Viro, benve,
	Christopher Lameter, Dalessandro, Dennis, Doug Ledford,
	Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell,
	Linux Kernel Mailing List, linux-fsdevel

On Sat, Dec 08, 2018 at 10:26:04AM -0800, Christoph Hellwig wrote:
> On Sat, Dec 08, 2018 at 12:47:30PM -0500, Jerome Glisse wrote:
> > Most of the user of GUP are well behave (everything under driver/gpu and
> > so is mellanox driver and many other) ie they abide by mmu notifier
> > invalidation call backs. They are a handfull of device driver that thought
> > they could just do GUP and ignore the mmu notifier part and those are the
> > one being problematic. So to me it feels like bystander are be shot for no
> > good reasons.
> 
> get_user_pages is used by every single direct I/O, and while the race
> windows in that case are small they very much exists.

Yes and my proposal allow to fix that in even a better way than
the pin count would ie allowing to provide a callback for write
back to wait on direct I/O as you said for direct I/O it is a
small window so it would be fine to have write back wait on it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, back to index

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-04  0:17 [PATCH 0/2] put_user_page*(): start converting the call sites john.hubbard
2018-12-04  0:17 ` [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions john.hubbard
2018-12-04  7:53   ` Mike Rapoport
2018-12-05  1:40     ` John Hubbard
2018-12-04 20:28   ` Dan Williams
2018-12-04 21:56     ` John Hubbard
2018-12-04 23:03       ` Dan Williams
2018-12-05  0:36         ` Jerome Glisse
2018-12-05  0:40           ` Dan Williams
2018-12-05  0:59             ` John Hubbard
2018-12-05  0:58         ` John Hubbard
2018-12-05  1:00           ` Dan Williams
2018-12-05  1:15           ` Matthew Wilcox
2018-12-05  1:44             ` Jerome Glisse
2018-12-05  1:57               ` John Hubbard
2018-12-07  2:45                 ` John Hubbard
2018-12-07 19:16                   ` Jerome Glisse
2018-12-07 19:26                     ` Dan Williams
2018-12-07 19:40                       ` Jerome Glisse
2018-12-08  0:52                     ` John Hubbard
2018-12-08  2:24                       ` Jerome Glisse
2018-12-08  5:18                       ` Matthew Wilcox
2018-12-08  7:16                       ` Dan Williams
2018-12-08 16:33                         ` Jerome Glisse
2018-12-08 16:48                           ` Christoph Hellwig
2018-12-08 17:47                             ` Jerome Glisse
2018-12-08 18:26                               ` Christoph Hellwig
2018-12-08 18:45                                 ` Jerome Glisse
2018-12-08 18:09                             ` Dan Williams
2018-12-08 18:12                               ` Christoph Hellwig
2018-12-05  5:52             ` Dan Williams
2018-12-05 11:16       ` Jan Kara
2018-12-04  0:17 ` [PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
2018-12-04 17:10 ` [PATCH 0/2] put_user_page*(): start converting the call sites David Laight
2018-12-05  1:05   ` John Hubbard
2018-12-05 14:08     ` David Laight

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox