* [PATCH v4 0/3] get_user_pages*() and RDMA: first steps
From: john.hubbard @ 2018-10-08 21:16 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell, Andrew Morton

From: John Hubbard <jhubbard@nvidia.com>

Andrew, do you have a preference for which tree (MM or RDMA) this should
go in? If not, then could you please ACK this so that Jason can pick it
up for the RDMA tree?

Changes since v3:

-- Picks up Reviewed-by tags from Jan Kara and Dennis Dalessandro.

-- Picks up Acked-by tag from Jason Gunthorpe, in case this ends up *not*
   going in via the RDMA tree.

-- Fixes formatting of a comment.

Changes since v2:

-- Absorbed more dirty page handling logic into the put_user_page*() routines, and
   handled some page releasing loops in infiniband more thoroughly, as per
   Jason Gunthorpe's feedback.

-- Fixed a bug in the put_user_pages*() routines' loops (thanks to
   Ralph Campbell for spotting it).

Changes since v1:

-- Renamed release_user_pages*() to put_user_pages*(), from Jan's feedback.

-- Removed the goldfish.c changes, and instead, only included a single
   user (infiniband) of the new functions. That is because goldfish.c no
   longer has a name collision (its existing release_user_pages() routine no
   longer clashes, now that these functions are named put_user_pages*()), and
   also because infiniband exercises both the put_user_page() and
   put_user_pages*() paths.

-- Updated links to discussions and plans, so as to be sure to include
   bounce buffers, thanks to Jerome's feedback.

Also:

-- Dennis, thanks for your earlier review, and I have not yet added your
   Reviewed-by tag, because this revision changes the things that you had
   previously reviewed, thus potentially requiring another look.

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

Patch 1, although not technically critical to do now, is still nice to
have, because it has already been reviewed by Jan, and it's one more item
on the long TODO list here that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to
start submitting changes to convert all of the callsites of
get_user_pages*() and put_page().  I think this will work a lot better
than trying to maintain a massive patchset and submitting it all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(), and
also exercises put_user_pages_dirty_lock().

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.
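
To make the pattern concrete, here is a sketch of a typical conversion
(hypothetical driver code; the real conversions in patch 3 follow the
same shape):

	/* Before: open-coded release loop */
	static void driver_release_pages(struct page **pages, size_t npages,
					 bool dirty)
	{
		size_t i;

		for (i = 0; i < npages; i++) {
			if (dirty)
				set_page_dirty_lock(pages[i]);
			put_page(pages[i]);
		}
	}

	/* After: the new helpers absorb the loop and the dirtying logic */
	static void driver_release_pages(struct page **pages, size_t npages,
					 bool dirty)
	{
		if (dirty)
			put_user_pages_dirty_lock(pages, npages);
		else
			put_user_pages(pages, npages);
	}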

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
    Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
    Follow-up discussions.

CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Christopher Lameter <cl@linux.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Ralph Campbell <rcampbell@nvidia.com>
CC: Andrew Morton <akpm@linux-foundation.org>

John Hubbard (3):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c              |  7 +--
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 +--
 include/linux/mm.h                          | 49 ++++++++++++++++++++-
 mm/gup.c                                    | 37 +++++++++-------
 9 files changed, 93 insertions(+), 45 deletions(-)

-- 
2.19.0


* [PATCH v4 1/3] mm: get_user_pages: consolidate error handling
From: john.hubbard @ 2018-10-08 21:16 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
 	long i = 0;
+	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (!vma || start >= vma->vm_end) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
-				int ret;
-				ret = get_gate_page(mm, start & PAGE_MASK,
+				err = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
 						pages ? &pages[i] : NULL);
-				if (ret)
-					return i ? : ret;
+				if (err)
+					goto out;
 				page_mask = 0;
 				goto next_page;
 			}
 
-			if (!vma || check_vma_flags(vma, gup_flags))
-				return i ? : -EFAULT;
+			if (!vma || check_vma_flags(vma, gup_flags)) {
+				err = -EFAULT;
+				goto out;
+			}
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		 * If we have a pending SIGKILL, don't keep faulting pages and
 		 * potentially allocating memory.
 		 */
-		if (unlikely(fatal_signal_pending(current)))
-			return i ? i : -ERESTARTSYS;
+		if (unlikely(fatal_signal_pending(current))) {
+			err = -ERESTARTSYS;
+			goto out;
+		}
 		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
-			int ret;
-			ret = faultin_page(tsk, vma, start, &foll_flags,
+			err = faultin_page(tsk, vma, start, &foll_flags,
 					nonblocking);
-			switch (ret) {
+			switch (err) {
 			case 0:
 				goto retry;
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				err = 0;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			err = PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+
+out:
+	return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0


* [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: john.hubbard @ 2018-10-08 21:16 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
    Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
    Follow-up discussions.

CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Christopher Lameter <cl@linux.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 49 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..0490f4a71b9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
@@ -943,6 +945,51 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/*
+ * Pages that were pinned via get_user_pages*() should be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+static inline void put_user_pages_dirty(struct page **pages,
+					unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		if (!PageDirty(pages[index]))
+			set_page_dirty(pages[index]);
+
+		put_user_page(pages[index]);
+	}
+}
+
+static inline void put_user_pages_dirty_lock(struct page **pages,
+					     unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		if (!PageDirty(pages[index]))
+			set_page_dirty_lock(pages[index]);
+
+		put_user_page(pages[index]);
+	}
+}
+
+static inline void put_user_pages(struct page **pages,
+				  unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1534,8 +1581,6 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {
-- 
2.19.0


* [PATCH v4 3/3] infiniband/mm: convert put_page() to put_user_page*()
From: john.hubbard @ 2018-10-08 21:16 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard,
	Doug Ledford, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti

From: John Hubbard <jhubbard@nvidia.com>

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page() or
put_user_pages*() calls, instead of put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
    Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
    Follow-up discussions.

CC: Doug Ledford <dledford@redhat.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Christian Benvenuti <benve@cisco.com>

CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  7 ++++---
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++++-------
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++++-------
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++++----
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 ++++---
 7 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..7ab7a3a35eb4 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (!PageDirty(page) && umem->writable && dirty)
-			set_page_dirty_lock(page);
-		put_page(page);
+		if (umem->writable && dirty)
+			put_user_pages_dirty_lock(&page, 1);
+		else
+			put_user_page(page);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
 					ret = -EFAULT;
 					break;
 				}
-				put_page(local_page_list[j]);
+				put_user_page(local_page_list[j]);
 				continue;
 			}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	size_t i;
-
-	for (i = 0; i < npages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, npages);
+	else
+		put_user_pages(p, npages);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_page(pages[0]);
+		put_user_page(pages[0]);
 		goto out;
 	}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_page(sg_page(&db_tab->page[i].mem));
+		put_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_page(sg_page(&db_tab->page[i].mem));
+			put_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..1a5c64c8695f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,10 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	size_t i;
-
-	for (i = 0; i < num_pages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, num_pages);
+	else
+		put_user_pages(p, num_pages);
 }
 
 /*
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..14f94d823907 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -266,7 +266,7 @@ static void qib_user_sdma_init_frag(struct qib_user_sdma_pkt *pkt,
 	pkt->addr[i].length = len;
 	pkt->addr[i].first_desc = first_desc;
 	pkt->addr[i].last_desc = last_desc;
-	pkt->addr[i].put_page = put_page;
+	pkt->addr[i].put_page = put_user_page;
 	pkt->addr[i].dma_mapped = dma_mapped;
 	pkt->addr[i].page = page;
 	pkt->addr[i].kvaddr = kvaddr;
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_page(page);
+			put_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_page(pkt->addr[i].page);
+			put_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_page(pages[i++]);
+		put_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 9dd39daa602b..9e3615fd05f7 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -89,9 +89,10 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (!PageDirty(page) && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
+			if (dirty)
+				put_user_pages_dirty_lock(&page, 1);
+			else
+				put_user_page(page);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
-- 
2.19.0


* Re: [PATCH v4 1/3] mm: get_user_pages: consolidate error handling
From: Andrew Morton @ 2018-10-09  0:05 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

On Mon,  8 Oct 2018 14:16:21 -0700 john.hubbard@gmail.com wrote:

> An upcoming patch requires a way to operate on each page that
> any of the get_user_pages_*() variants returns.
> 
> In preparation for that, consolidate the error handling for
> __get_user_pages(). This provides a single location (the "out:" label)
> for operating on the collected set of pages that are about to be returned.
> 
> As long as every use of the "ret" variable is being edited, rename
> "ret" --> "err", so that its name matches its true role.
> This also gets rid of two shadowed variable declarations, as a
> tiny beneficial side effect.

Reviewed-by: Andrew Morton <akpm@linux-foundation.org>

`i' is a pretty crappy identifier as well, but we'll live.


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Andrew Morton @ 2018-10-09  0:14 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:

> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This prepares for eventually fixing the problem described
> in [1], and is following a plan listed in [2], [3], [4].
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 
> [3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
>     Bounce buffers (otherwise [2] is not really viable).
> 
> [4] https://lkml.kernel.org/r/20181003162115.GG24030@quack2.suse.cz
>     Follow-up discussions.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
>  extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
> +int set_page_dirty(struct page *page);
> +int set_page_dirty_lock(struct page *page);
>  
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>  
> @@ -943,6 +945,51 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/*
> + * Pages that were pinned via get_user_pages*() should be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +static inline void put_user_pages_dirty(struct page **pages,
> +					unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		if (!PageDirty(pages[index]))

Both put_page() and set_page_dirty() handle compound pages.  But
because of the above statement, put_user_pages_dirty() might misbehave? 
Or maybe it won't - perhaps the intent here is to skip dirtying the
head page if the sub page is clean?  Please clarify, explain and add
comment if so.

> +			set_page_dirty(pages[index]);
> +
> +		put_user_page(pages[index]);
> +	}
> +}
> +
> +static inline void put_user_pages_dirty_lock(struct page **pages,
> +					     unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		if (!PageDirty(pages[index]))
> +			set_page_dirty_lock(pages[index]);

Ditto.

> +		put_user_page(pages[index]);
> +	}
> +}
> +
> +static inline void put_user_pages(struct page **pages,
> +				  unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}
> +

Otherwise looks OK.  Ish.  But it would be nice if that comment were to
explain *why* get_user_pages() pages must be released with
put_user_page().

Also, maintainability.  What happens if someone now uses put_page() by
mistake?  Kernel fails in some mysterious fashion?  How can we prevent
this from occurring as code evolves?  Is there a cheap way of detecting
this bug at runtime?


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Jan Kara @ 2018-10-09  8:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Mon 08-10-18 17:14:42, Andrew Morton wrote:
> On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:
> > +		put_user_page(pages[index]);
> > +	}
> > +}
> > +
> > +static inline void put_user_pages(struct page **pages,
> > +				  unsigned long npages)
> > +{
> > +	unsigned long index;
> > +
> > +	for (index = 0; index < npages; index++)
> > +		put_user_page(pages[index]);
> > +}
> > +
> 
> Otherwise looks OK.  Ish.  But it would be nice if that comment were to
> explain *why* get_user_pages() pages must be released with
> put_user_page().

The reason is that eventually we want to track references from GUP
separately, but you're right that it would be good to have a comment about
that somewhere.
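
Something along these lines, perhaps (the wording here is only a
suggestion):

	/*
	 * Pages that were pinned via get_user_pages*() must be released via
	 * either put_user_page(), or one of the put_user_pages*() routines
	 * below. This is so that such pages can eventually have their
	 * references tracked, and handled, separately from normal pages.
	 */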

> Also, maintainability.  What happens if someone now uses put_page() by
> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> this from occurring as code evolves?  Is there a cheap way of detecting
> this bug at runtime?

The same will happen as with any other reference counting bug - the special
user reference will leak. It will be pretty hard to debug I agree. I was
thinking about whether we could provide some type safety against such bugs
such as get_user_pages() not returning struct page pointers but rather some
other special type but it would result in a big amount of additional churn
as we'd have to propagate this different type e.g. through the IO path so
that IO completion routines could properly call put_user_pages(). So I'm
not sure it's really worth it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 3/3] infiniband/mm: convert put_page() to put_user_page*()
From: kbuild test robot @ 2018-10-09  9:52 UTC (permalink / raw)
  To: john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard, Doug Ledford,
	Mike Marciniszyn, Dennis Dalessandro, Christian Benvenuti


Hi John,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.19-rc7 next-20181008]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/get_user_pages-and-RDMA-first-steps/20181009-152159
config: x86_64-randconfig-s4-10091707 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/infiniband/hw/qib/qib_user_sdma.c: In function 'qib_user_sdma_init_frag':
>> drivers/infiniband/hw/qib/qib_user_sdma.c:269:24: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
     pkt->addr[i].put_page = put_user_page;
                           ^

vim +269 drivers/infiniband/hw/qib/qib_user_sdma.c

   257	
   258	static void qib_user_sdma_init_frag(struct qib_user_sdma_pkt *pkt,
   259					    int i, u16 offset, u16 len,
   260					    u16 first_desc, u16 last_desc,
   261					    u16 put_page, u16 dma_mapped,
   262					    struct page *page, void *kvaddr,
   263					    dma_addr_t dma_addr, u16 dma_length)
   264	{
   265		pkt->addr[i].offset = offset;
   266		pkt->addr[i].length = len;
   267		pkt->addr[i].first_desc = first_desc;
   268		pkt->addr[i].last_desc = last_desc;
 > 269		pkt->addr[i].put_page = put_user_page;
   270		pkt->addr[i].dma_mapped = dma_mapped;
   271		pkt->addr[i].page = page;
   272		pkt->addr[i].kvaddr = kvaddr;
   273		pkt->addr[i].addr = dma_addr;
   274		pkt->addr[i].dma_length = dma_length;
   275	}
   276	
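
Note: in this function, "put_page" is the u16 flag parameter visible in
the signature above, not the function, so the blind rename turned a flag
assignment into a function-address assignment. The fix is simply to drop
that hunk and restore the original line:

	pkt->addr[i].put_page = put_page;	/* the u16 flag argument */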

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Andrew Morton @ 2018-10-09 23:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel, John Hubbard, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Tue, 9 Oct 2018 10:30:25 +0200 Jan Kara <jack@suse.cz> wrote:

> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> 
> The same will happen as with any other reference counting bug - the special
> user reference will leak. It will be pretty hard to debug I agree. I was
> thinking about whether we could provide some type safety against such bugs
> such as get_user_pages() not returning struct page pointers but rather some
> other special type but it would result in a big amount of additional churn
> as we'd have to propagate this different type e.g. through the IO path so
> that IO completion routines could properly call put_user_pages(). So I'm
> not sure it's really worth it.

I'm not really understanding.  Patch 3/3 changes just one infiniband
driver to use put_user_page().  But the changelogs here imply (to me)
that every user of get_user_pages() needs to be converted to
s/put_page/put_user_page/.

Methinks a bit more explanation is needed in these changelogs?


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: John Hubbard @ 2018-10-10  0:32 UTC (permalink / raw)
  To: Andrew Morton, Jan Kara
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel, Al Viro, Jerome Glisse, Christoph Hellwig,
	Ralph Campbell

On 10/9/18 4:20 PM, Andrew Morton wrote:
> On Tue, 9 Oct 2018 10:30:25 +0200 Jan Kara <jack@suse.cz> wrote:
> 
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>
>> The same will happen as with any other reference counting bug - the special
>> user reference will leak. It will be pretty hard to debug I agree. I was
>> thinking about whether we could provide some type safety against such bugs
>> such as get_user_pages() not returning struct page pointers but rather some
>> other special type but it would result in a big amount of additional churn
>> as we'd have to propagate this different type e.g. through the IO path so
>> that IO completion routines could properly call put_user_pages(). So I'm
>> not sure it's really worth it.
> 
> I'm not really understanding.  Patch 3/3 changes just one infiniband
> driver to use put_user_page().  But the changelogs here imply (to me)
> that every user of get_user_pages() needs to be converted to
> s/put_page/put_user_page/.
> 
> Methinks a bit more explanation is needed in these changelogs?
> 

OK, yes, it does sound like the explanation is falling short. I'll work on something 
clearer. Did the proposed steps in the changelogs, such as:
  
[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

help at all, or is it just too many references, and I should write the words
directly in the changelog?

Anyway, patch 3/3 is just a working example (which we do want to submit, though), and
many more conversions will follow. But they don't have to be done all upfront--they
can be done in follow-up patchsets.

The put_user_page*() routines are, at this point, not going to significantly change
behavior. 

I'm working on an RFC that will show what the long-term fix to get_user_pages and
put_user_pages will look like. But meanwhile it's good to get started on converting
all of the call sites.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: John Hubbard @ 2018-10-10  0:42 UTC (permalink / raw)
  To: Andrew Morton, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On 10/8/18 5:14 PM, Andrew Morton wrote:
> On Mon,  8 Oct 2018 14:16:22 -0700 john.hubbard@gmail.com wrote:
> 
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> +/*
>> + * Pages that were pinned via get_user_pages*() should be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +	put_page(page);
>> +}
>> +
>> +static inline void put_user_pages_dirty(struct page **pages,
>> +					unsigned long npages)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
>> +		if (!PageDirty(pages[index]))
> 
> Both put_page() and set_page_dirty() handle compound pages.  But
> because of the above statement, put_user_pages_dirty() might misbehave? 
> Or maybe it won't - perhaps the intent here is to skip dirtying the
> head page if the sub page is clean?  Please clarify, explain and add
> comment if so.
> 

Yes, technically, the accounting is wrong: we normally use the head page to 
track dirtiness, and here, that is not done. (Nor was it done before this
patch). However, it's not causing problems in code today because sub pages
are released at about the same time as head pages, so the head page does get 
properly checked at some point. And that means that set_page_dirty*() gets
called if it needs to be called. 

Obviously this is a little fragile, in that it depends on the caller behaving 
a certain way. And in any case, the long-term fix (coming later) *also* only
operates on the head page. So actually, instead of a comment, I think it's good 
to just insert

	page = compound_head(page);

...into these new routines, right now. I'll do that.
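
For example, put_user_pages_dirty() would then look something like this
(a sketch only, not the final patch):

	static inline void put_user_pages_dirty(struct page **pages,
						unsigned long npages)
	{
		unsigned long index;

		for (index = 0; index < npages; index++) {
			struct page *page = compound_head(pages[index]);

			if (!PageDirty(page))
				set_page_dirty(page);

			put_user_page(page);
		}
	}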

[...]
> 
> Otherwise looks OK.  Ish.  But it would be nice if that comment were to
> explain *why* get_user_pages() pages must be released with
> put_user_page().
> 

Yes, will do.

> Also, maintainability.  What happens if someone now uses put_page() by
> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> this from occurring as code evolves?  Is there a cheap way of detecting
> this bug at runtime?
> 

It might be possible to do a few run-time checks, such as "does the page that came 
back to put_user_page() have the correct flags?", but it's harder (without 
having a dedicated page flag) to detect the other direction: "did someone pass 
a get_user_pages page to put_page?"

As Jan said in his reply, converting get_user_pages (and put_user_page) to 
work with a new data type that wraps struct pages, would solve it, but that's
an awfully large change. Still...given how much of a mess this can turn into 
if it's wrong, I wonder if it's worth it--maybe? 

thanks,
-- 
John Hubbard
NVIDIA
 


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Jan Kara @ 2018-10-10  8:59 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, Jan Kara,
	linux-mm, LKML, linux-rdma, linux-fsdevel, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Tue 09-10-18 17:42:09, John Hubbard wrote:
> On 10/8/18 5:14 PM, Andrew Morton wrote:
> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> > 
> 
> It might be possible to do a few run-time checks, such as "does the page that came 
> back to put_user_page() have the correct flags?", but it's harder (without 
> having a dedicated page flag) to detect the other direction: "did someone pass 
> a get_user_pages page to put_page?"
> 
> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> work with a new data type that wraps struct pages, would solve it, but that's
> an awfully large change. Still...given how much of a mess this can turn into 
> if it's wrong, I wonder if it's worth it--maybe? 

I'm certainly not opposed to looking into it. But after looking into this
for a while it is not clear to me how to convert e.g. fs/direct-io.c or
fs/iomap.c. They pass the reference from gup() via
bio->bi_io_vec[]->bv_page and then release it after IO completion.
Propagating the new type to ->bv_page is not good as lower layers do not
really care how the page is pinned in memory. But we do need to somehow
pass the information to the IO completion functions in a robust manner.

Hmm, what about the following:

1) Make gup() return new type - struct user_page *? In practice that would
be just a struct page pointer with 0 bit set so that people are forced to
use proper helpers and not just force types (and the setting of bit 0 and
masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
performance reasons). Also the transition would have to be gradual so we'd
have to name the function differently and use it from converted code.

2) Provide helper bio_add_user_page() that will take user_page, convert it
to struct page, add it to the bio, and flag the bio as having pages with
user references. That code would also make sure the bio is consistent in
having only user-referenced pages in that case. IO completion (like
bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
use the appropriate release function.

3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
to clear stuff so we'll probably need a helper function to acquire 'user pin'
reference given a page pointer so that that code can be kept reasonably
simple and pass user_page references all around.

So this way we could maintain reasonable confidence that refcounts didn't
get mixed up. Thoughts?
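
To make 1) concrete, something like the following (a rough sketch; the
helper names, and exactly what CONFIG_DEBUG_USER_PAGE_REFERENCES hides,
are made up for illustration):

	struct user_page;	/* opaque, only ever handled via helpers */

	#ifdef CONFIG_DEBUG_USER_PAGE_REFERENCES
	static inline struct user_page *page_to_user_page(struct page *page)
	{
		/* Tag bit 0, so a raw cast back to struct page * is caught */
		return (struct user_page *)((unsigned long)page | 1UL);
	}

	static inline struct page *user_page_to_page(struct user_page *upage)
	{
		VM_BUG_ON(!((unsigned long)upage & 1UL));
		return (struct page *)((unsigned long)upage & ~1UL);
	}
	#else
	static inline struct user_page *page_to_user_page(struct page *page)
	{
		return (struct user_page *)page;
	}

	static inline struct page *user_page_to_page(struct user_page *upage)
	{
		return (struct page *)upage;
	}
	#endif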

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: John Hubbard @ 2018-10-10 23:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	LKML, linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On 10/10/18 1:59 AM, Jan Kara wrote:
> On Tue 09-10-18 17:42:09, John Hubbard wrote:
>> On 10/8/18 5:14 PM, Andrew Morton wrote:
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>>
>>
>> It might be possible to do a few run-time checks, such as "does the page that came 
>> back to put_user_page() have the correct flags?", but it's harder (without 
>> having a dedicated page flag) to detect the other direction: "did someone pass 
>> a get_user_pages page to put_page?"
>>
>> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
>> work with a new data type that wraps struct pages, would solve it, but that's
>> an awfully large change. Still...given how much of a mess this can turn into 
>> if it's wrong, I wonder if it's worth it--maybe? 
> 
> I'm certainly not opposed to looking into it. But after looking into this
> for a while it is not clear to me how to convert e.g. fs/direct-io.c or
> fs/iomap.c. They pass the reference from gup() via
> bio->bi_io_vec[]->bv_page and then release it after IO completion.
> Propagating the new type to ->bv_page is not good as lower layers do not
> really care how the page is pinned in memory. But we do need to somehow
> pass the information to the IO completion functions in a robust manner.
> 

You know, that problem has to be solved in either case: even if we do not
use a new data type for get_user_pages, we still need to clearly, accurately
match up the get/put pairs. And for the complicated systems (block IO, and
GPU DRM layer, especially) one of the things that has caused me concern is 
the way the pages all end up in a large, complicated pool, and put_page is
used to free all of them, indiscriminately.

So I'm glad you're looking at ways to disambiguate this for the bio system.

> Hmm, what about the following:
> 
> 1) Make gup() return new type - struct user_page *? In practice that would
> be just a struct page pointer with 0 bit set so that people are forced to
> use proper helpers and not just force types (and the setting of bit 0 and
> masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
> performance reasons). Also the transition would have to be gradual so we'd
> have to name the function differently and use it from converted code.

Yes. That seems perfect: it just fades away if you're not debugging, but we
can catch lots of problems when CONFIG_DEBUG_USER_PAGE_REFERENCES is set. 

> 
> 2) Provide helper bio_add_user_page() that will take user_page, convert it
> to struct page, add it to the bio, and flag the bio as having pages with
> user references. That code would also make sure the bio is consistent in
> having only user-referenced pages in that case. IO completion (like
> bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
> use the appropriate release function.

I'm very new to bio, so I have to ask: can we be sure that the same types of 
pages are always used, within each bio? Because otherwise, we'll have to plumb 
it all the way down to bio_vec's--or so it appears, based on my reading of 
bio_release_pages() and surrounding code.

> 
> 3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
> to clear stuff so we'll probably need a helper function to acquire 'user pin'
> reference given a page pointer so that that code can be kept reasonably
> simple and pass user_page references all around.
>

This only works if we don't set page flags, because if we do set page flags 
on the single, global zero page, that will break the world. So I'm not sure
that the zero page usage in fs/direct-io.c is going to survive the conversion
to this new approach. :)
 
> So this way we could maintain reasonable confidence that refcounts didn't
> get mixed up. Thoughts?
> 

After thinking some more about the complicated situations in bio and DRM,
and looking into the future (bug reports...), I am leaning toward your 
struct user_page approach. 

I'm looking forward to hearing other opinions on whether it's worth it to go
and do this fairly intrusive change, in return for, probably, fewer bugs along
the way.


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Andrew Morton @ 2018-10-10 23:43 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	LKML, linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Tue, 9 Oct 2018 17:32:16 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> > I'm not really understanding.  Patch 3/3 changes just one infiniband
> > driver to use put_user_page().  But the changelogs here imply (to me)
> > that every user of get_user_pages() needs to be converted to
> > s/put_page/put_user_page/.
> > 
> > Methinks a bit more explanation is needed in these changelogs?
> > 
> 
> OK, yes, it does sound like the explanation is falling short. I'll work on something 
> clearer. Did the proposed steps in the changelogs, such as:
>   
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 
> help at all, or is it just too many references, and I should write the words
> directly in the changelog?
> 
> Anyway, patch 3/3 is a just a working example (which we do want to submit, though), and
> many more conversions will follow. But they don't have to be done all upfront--they
> can be done in follow up patchsets. 
> 
> The put_user_page*() routines are, at this point, not going to significantly change
> behavior. 
> 
> I'm working on an RFC that will show what the long-term fix to get_user_pages and
> put_user_pages will look like. But meanwhile it's good to get started on converting
> all of the call sites.

I see.  Yes, please do put all of it into the changelog[s].


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Andrew Morton @ 2018-10-10 23:45 UTC (permalink / raw)
  To: John Hubbard
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Tue, 9 Oct 2018 17:42:09 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> > Also, maintainability.  What happens if someone now uses put_page() by
> > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > this from occurring as code evolves?  Is there a cheap way of detecting
> > this bug at runtime?
> > 
> 
> It might be possible to do a few run-time checks, such as "does the page that came 
> back to put_user_page() have the correct flags?", but it's harder (without 
> having a dedicated page flag) to detect the other direction: "did someone pass 
> a get_user_pages page to put_page?"
> 
> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> work with a new data type that wraps struct pages, would solve it, but that's
> an awfully large change. Still...given how much of a mess this can turn into 
> if it's wrong, I wonder if it's worth it--maybe? 

This is a real worry.  If someone uses a mistaken put_page() then how
will that bug manifest at runtime?  Under what set of circumstances
will the kernel trigger the bug?

(btw, please cc me on all patches, not just [0/n]!)


* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
From: Jan Kara @ 2018-10-11  8:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Andrew Morton, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Jason Gunthorpe, Dan Williams,
	linux-mm, LKML, linux-rdma, linux-fsdevel, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Wed 10-10-18 16:23:35, John Hubbard wrote:
> On 10/10/18 1:59 AM, Jan Kara wrote:
> > On Tue 09-10-18 17:42:09, John Hubbard wrote:
> >> On 10/8/18 5:14 PM, Andrew Morton wrote:
> >>> Also, maintainability.  What happens if someone now uses put_page() by
> >>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> >>> this from occurring as code evolves?  Is there a cheap way of detecting
> >>> this bug at runtime?
> >>>
> >>
> >> It might be possible to do a few run-time checks, such as "does the page that came 
> >> back to put_user_page() have the correct flags?", but it's harder (without 
> >> having a dedicated page flag) to detect the other direction: "did someone pass 
> >> a get_user_pages page to put_page?"
> >>
> >> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
> >> work with a new data type that wraps struct pages, would solve it, but that's
> >> an awfully large change. Still...given how much of a mess this can turn into 
> >> if it's wrong, I wonder if it's worth it--maybe? 
> > 
> > I'm certainly not opposed to looking into it. But after looking into this
> > for a while it is not clear to me how to convert e.g. fs/direct-io.c or
> > fs/iomap.c. They pass the reference from gup() via
> > bio->bi_io_vec[]->bv_page and then release it after IO completion.
> > Propagating the new type to ->bv_page is not good as lower layers do not
> > really care how the page is pinned in memory. But we do need to somehow
> > pass the information to the IO completion functions in a robust manner.
> > 
> 
> You know, that problem has to be solved in either case: even if we do not
> use a new data type for get_user_pages, we still need to clearly, accurately
> match up the get/put pairs. And for the complicated systems (block IO, and
> GPU DRM layer, especially) one of the things that has caused me concern is 
> the way the pages all end up in a large, complicated pool, and put_page is
> used to free all of them, indiscriminately.

Agreed.

> > Hmm, what about the following:
> > 
> > 1) Make gup() return new type - struct user_page *? In practice that would
> > be just a struct page pointer with 0 bit set so that people are forced to
> > use proper helpers and not just force types (and the setting of bit 0 and
> > masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
> > performance reasons). Also the transition would have to be gradual so we'd
> > have to name the function differently and use it from converted code.
> 
> Yes. That seems perfect: it just fades away if you're not debugging, but we
> can catch lots of problems when CONFIG_DEBUG_USER_PAGE_REFERENCES is set. 

Yeah, and when you suspect issues with page pinning, you can try to run
with CONFIG_DEBUG_USER_PAGE_REFERENCES enabled. It's not like the overhead
is going to be huge. But it could be measurable for some workloads...

> > 2) Provide helper bio_add_user_page() that will take user_page, convert it
> > to struct page, add it to the bio, and flag the bio as having pages with
> > user references. That code would also make sure the bio is consistent in
> > having only user-referenced pages in that case. IO completion (like
> > bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
> > use the appropriate release function.
> 
> I'm very new to bio, so I have to ask: can we be sure that the same types of 
> pages are always used, within each bio? Because otherwise, we'll have to plumb 
> it all the way down to bio_vec's--or so it appears, based on my reading of 
> bio_release_pages() and surrounding code.

No, we cannot be sure (the zero page usage within the DIO code is one example
where it currently is not true), although usually it is the case that the same
type of pages is used for one bio. But bio_add_page() (and thus similarly
bio_add_user_page()) is allowed to refuse to add a page to the bio, and the
caller then has to submit the current bio and start a new one. So we can
just reuse this mechanism when we detect that the currently passed page is of a
different type than the other pages in the bio.
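
Roughly like this, say (a sketch only; BIO_USER_PAGES is a made-up bio
flag, and user_page_to_page() is the hypothetical helper from earlier):

	static inline int bio_add_user_page(struct bio *bio,
					    struct user_page *upage,
					    unsigned int len,
					    unsigned int offset)
	{
		struct page *page = user_page_to_page(upage);

		/* Refuse to mix user-pinned and ordinary pages in one bio */
		if (bio->bi_vcnt && !bio_flagged(bio, BIO_USER_PAGES))
			return 0;

		bio_set_flag(bio, BIO_USER_PAGES);
		return bio_add_page(bio, page, len, offset);
	}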

> > 3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
> > to clear stuff so we'll probably need a helper function to acquire 'user pin'
> > reference given a page pointer so that that code can be kept reasonably
> > simple and pass user_page references all around.
> >
> 
> This only works if we don't set page flags, because if we do set page flags
> on the single, global zero page, that will break the world. So I'm not sure
> that the zero page usage in fs/direct-io.c is going to survive the conversion
> to this new approach. :)

Hum, we can always allocate a single page filled with zeros for use by the
DIO code itself. But at this point I'm actually not sure why "user pinning"
of the zero page would be an issue. After all, if you have a private anonymous
read-only mapping, it is going to be backed by zero pages, and
get_user_pages() and put_user_pages() have to properly detect pin attempts
for these pages and handle them consistently... So the DIO code does not seem
to be doing anything that special.

> > So this way we could maintain reasonable confidence that refcounts didn't
> > get mixed up. Thoughts?
> > 
> 
> After thinking some more about the complicated situations in bio and DRM,
> and looking into the future (bug reports...), I am leaning toward your 
> struct user_page approach. 
> 
> I'm looking forward to hearing other opinions on whether it's worth doing
> this fairly intrusive change, in return for, probably, fewer bugs along
> the way.

Yeah, at this point I think it is worth it as it's probably going to save
us quite some hair-tearing when debugging stuff.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-10 23:45       ` Andrew Morton
@ 2018-10-11  8:49         ` Jan Kara
  2018-10-11 13:20           ` Jason Gunthorpe
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2018-10-11  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: John Hubbard, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, Jan Kara,
	linux-mm, LKML, linux-rdma, linux-fsdevel, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Wed 10-10-18 16:45:41, Andrew Morton wrote:
> On Tue, 9 Oct 2018 17:42:09 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> 
> > > Also, maintainability.  What happens if someone now uses put_page() by
> > > mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> > > this from occurring as code evolves?  Is there a cheap way of detecting
> > > this bug at runtime?
> > > 
> > 
> > It might be possible to do a few run-time checks, such as "does the page
> > that came back to put_user_page() have the correct flags?", but it's harder
> > (without having a dedicated page flag) to detect the other direction: "did
> > someone pass a get_user_pages page to put_page()?"
> > 
> > As Jan said in his reply, converting get_user_pages() (and put_user_page())
> > to work with a new data type that wraps struct page would solve it, but
> > that's an awfully large change. Still... given how much of a mess this can
> > turn into if it's wrong, I wonder if it's worth it--maybe?
> 
> This is a real worry.  If someone uses a mistaken put_page() then how
> will that bug manifest at runtime?  Under what set of circumstances
> will the kernel trigger the bug?

At runtime such a bug will manifest as a page that can never be evicted from
memory. We could warn in put_page() if the page reference count drops below
the bare minimum for the given user pin count, which would catch some
issues, but it won't be 100% reliable. So at this point I'm leaning more
towards making get_user_pages() return a different type than just
struct page *, to make it much harder for the refcount to go wrong...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-11  8:49         ` Jan Kara
@ 2018-10-11 13:20           ` Jason Gunthorpe
  2018-10-12  1:23             ` John Hubbard
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2018-10-11 13:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, John Hubbard, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:

> > This is a real worry.  If someone uses a mistaken put_page() then how
> > will that bug manifest at runtime?  Under what set of circumstances
> > will the kernel trigger the bug?
> 
> At runtime such a bug will manifest as a page that can never be evicted from
> memory. We could warn in put_page() if the page reference count drops below
> the bare minimum for the given user pin count, which would catch some
> issues, but it won't be 100% reliable. So at this point I'm leaning more
> towards making get_user_pages() return a different type than just
> struct page *, to make it much harder for the refcount to go wrong...

At least for the infiniband code being used as an example here we take
the struct page from get_user_pages, then stick it in a sgl, and at
put_page time we get the page back out of the sgl via sg_page()

So type safety will not help this case... I wonder how many other
users are similar? I think this is a pretty reasonable flow for DMA
with user pages.
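
I.e., condensed and simplified, the pattern is basically this (not the exact
driver code):

	ret = get_user_pages(addr, npages, FOLL_WRITE, page_list, NULL);

	for_each_sg(sgl, sg, npages, i)
		sg_set_page(sg, page_list[i], PAGE_SIZE, 0);

	/* ... DMA happens; the page_list array is long gone by now ... */

	for_each_sg(sgl, sg, npages, i)
		put_page(sg_page(sg));	/* no trace of how we got the page */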

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-11 13:20           ` Jason Gunthorpe
@ 2018-10-12  1:23             ` John Hubbard
  2018-10-12  3:53               ` John Hubbard
  2018-10-22 19:43               ` Jason Gunthorpe
  0 siblings, 2 replies; 25+ messages in thread
From: John Hubbard @ 2018-10-12  1:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Jan Kara
  Cc: Andrew Morton, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel, Al Viro, Jerome Glisse, Christoph Hellwig,
	Ralph Campbell

On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> 
>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>> will that bug manifest at runtime?  Under what set of circumstances
>>> will the kernel trigger the bug?
>>
>> At runtime such a bug will manifest as a page that can never be evicted from
>> memory. We could warn in put_page() if the page reference count drops below
>> the bare minimum for the given user pin count, which would catch some
>> issues, but it won't be 100% reliable. So at this point I'm leaning more
>> towards making get_user_pages() return a different type than just
>> struct page *, to make it much harder for the refcount to go wrong...
> 
> At least for the infiniband code being used as an example here we take
> the struct page from get_user_pages, then stick it in a sgl, and at
> put_page time we get the page back out of the sgl via sg_page()
> 
> So type safety will not help this case... I wonder how many other
> users are similar? I think this is a pretty reasonable flow for DMA
> with user pages.
> 

That is true. The infiniband code, fortunately, never mixes the two page
types into the same pool (or sg list), so it's actually an easier example
than some other subsystems. But, yes, type safety doesn't help there. I can 
take a moment to look around at the other areas, to quantify how much a type
safety change might help.

Back to page flags again, out of desperation:

How much do we know about the page types that all of these subsystems
use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
for context) as the "dma-pinned" page flag, while tracking pages within parts 
of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
In order for that to work, page->index, page->private, and bit 1 of page->mapping
must not be used. I doubt that this is always going to hold, but...does it?

Other ideas: provide a fast lookup tree that tracks pages that were 
obtained via get_user_pages. Before calling put_page or put_user_page,
use that tree to decide which to call. Or anything along the lines of
"yet another way to track pages without using page flags".

(Also, Andrew: "ack" on your point about CC-ing you on all patches; I've fixed
my scripts accordingly, sorry about that.)


[1] Matthew Wilcox's idea for stealing some tracking bits, by removing
dma-pinned pages from the LRU:

   https://lore.kernel.org/r/20180619090255.GA25522@bombadil.infradead.org


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  1:23             ` John Hubbard
@ 2018-10-12  3:53               ` John Hubbard
  2018-10-18 10:19                 ` Jan Kara
  2018-10-22 19:43               ` Jason Gunthorpe
  1 sibling, 1 reply; 25+ messages in thread
From: John Hubbard @ 2018-10-12  3:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Jan Kara
  Cc: Andrew Morton, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel, Al Viro, Jerome Glisse, Christoph Hellwig,
	Ralph Campbell

On 10/11/18 6:23 PM, John Hubbard wrote:
> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
>>
>>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>>> will that bug manifest at runtime?  Under what set of circumstances
>>>> will the kernel trigger the bug?
>>>
>>> At runtime such a bug will manifest as a page that can never be evicted from
>>> memory. We could warn in put_page() if the page reference count drops below
>>> the bare minimum for the given user pin count, which would catch some
>>> issues, but it won't be 100% reliable. So at this point I'm leaning more
>>> towards making get_user_pages() return a different type than just
>>> struct page *, to make it much harder for the refcount to go wrong...
>>
>> At least for the infiniband code being used as an example here we take
>> the struct page from get_user_pages, then stick it in a sgl, and at
>> put_page time we get the page back out of the sgl via sg_page()
>>
>> So type safety will not help this case... I wonder how many other
>> users are similar? I think this is a pretty reasonable flow for DMA
>> with user pages.
>>
> 
> That is true. The infiniband code, fortunately, never mixes the two page
> types into the same pool (or sg list), so it's actually an easier example
> than some other subsystems. But, yes, type safety doesn't help there. I can 
> take a moment to look around at the other areas, to quantify how much a type
> safety change might help.
> 
> Back to page flags again, out of desperation:
> 
> How much do we know about the page types that all of these subsystems
> use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
> for context) as the "dma-pinned" page flag, while tracking pages within parts 
> of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
> In order for that to work, page->index, page->private, and bit 1 of page->mapping
> must not be used. I doubt that this is always going to hold, but...does it?
> 

Oops, pardon me, please ignore that nonsense about page->index and page->private
and page->mapping; that's actually fine (I was seeing "union" where "struct" was
written--too much staring at this code).

So actually, I think maybe we can just use bit 1 in page->lru.next to sort out
which pages are dma-pinned, in the calling code, just like we're going to do
in writeback situations. This should also allow the run-time checking that
Andrew was hoping for:

    put_user_page(): assert that the page is dma-pinned
    put_page(): assert that the page is *not* dma-pinned

...both of which depend on that bit being, essentially, available as sort of a
general page flag. And in fact, if it's not, then the whole approach is dead anyway.

Am I missing anything? This avoids the need to change the get_user_pages interface.
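
For reference, here is roughly what I'm picturing for the helpers. The names
are invented, and this assumes the page has already been taken off the LRU
(per [1]), so that page->lru is otherwise unused; bit 0 of that word is taken
by the compound_head encoding, which is why bit 1 is the candidate:

/* Hypothetical helpers -- sketch only. */
#define PAGE_DMA_PINNED_BIT	2UL	/* bit 1 of page->lru.next */

static inline void set_page_dma_pinned(struct page *page)
{
	page->lru.next = (struct list_head *)
			 ((unsigned long)page->lru.next | PAGE_DMA_PINNED_BIT);
}

static inline bool page_dma_pinned(struct page *page)
{
	return (unsigned long)page->lru.next & PAGE_DMA_PINNED_BIT;
}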


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  3:53               ` John Hubbard
@ 2018-10-18 10:19                 ` Jan Kara
  2018-11-05  7:25                   ` John Hubbard
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2018-10-18 10:19 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, john.hubbard,
	Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	linux-mm, LKML, linux-rdma, linux-fsdevel, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Thu 11-10-18 20:53:34, John Hubbard wrote:
> On 10/11/18 6:23 PM, John Hubbard wrote:
> > On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> >> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> >>
> >>>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>>> will that bug manifest at runtime?  Under what set of circumstances
> >>>> will the kernel trigger the bug?
> >>>
> >>> At runtime such a bug will manifest as a page that can never be evicted from
> >>> memory. We could warn in put_page() if the page reference count drops below
> >>> the bare minimum for the given user pin count, which would catch some
> >>> issues, but it won't be 100% reliable. So at this point I'm leaning more
> >>> towards making get_user_pages() return a different type than just
> >>> struct page *, to make it much harder for the refcount to go wrong...
> >>
> >> At least for the infiniband code being used as an example here we take
> >> the struct page from get_user_pages, then stick it in a sgl, and at
> >> put_page time we get the page back out of the sgl via sg_page()
> >>
> >> So type safety will not help this case... I wonder how many other
> >> users are similar? I think this is a pretty reasonable flow for DMA
> >> with user pages.
> >>
> > 
> > That is true. The infiniband code, fortunately, never mixes the two page
> > types into the same pool (or sg list), so it's actually an easier example
> > than some other subsystems. But, yes, type safety doesn't help there. I can 
> > take a moment to look around at the other areas, to quantify how much a type
> > safety change might help.
> > 
> > Back to page flags again, out of desperation:
> > 
> > How much do we know about the page types that all of these subsystems
> > use? In other words, can we, for example, use bit 1 of page->lru.next (see [1]
> > for context) as the "dma-pinned" page flag, while tracking pages within parts 
> > of the kernel that call a mix of alloc_pages, get_user_pages, and other allocators? 
> > In order for that to work, page->index, page->private, and bit 1 of page->mapping
> > must not be used. I doubt that this is always going to hold, but...does it?
> > 
> 
> Oops, pardon me, please ignore that nonsense about page->index and page->private
> and page->mapping; that's actually fine (I was seeing "union" where "struct" was
> written--too much staring at this code). 
> 
> So actually, I think maybe we can just use bit 1 in page->lru.next to sort out
> which pages are dma-pinned, in the calling code, just like we're going to do
> in writeback situations. This should also allow the run-time checking that
> Andrew was hoping for:
> 
>     put_user_page(): assert that the page is dma-pinned
>     put_page(): assert that the page is *not* dma-pinned
> 
> ...both of which depend on that bit being, essentially, available as sort
> of a general page flag. And in fact, if it's not, then the whole approach
> is dead anyway.

Well, put_page() cannot assert that the page is not dma-pinned, as someone can
still do get_page()/put_page() on a dma-pinned page, and that must not barf.
But put_page() could assert that if the page is pinned, the refcount is >=
the pincount. That will detect leaked pin references relatively quickly.
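
Something along these lines, that is. This is a simplified sketch of
put_page() (the real one also has ZONE_DEVICE handling, omitted here), and
page_dma_pinned() / page_pin_count() are placeholders for whatever form the
pin tracking finally takes:

static inline void put_page(struct page *page)
{
	page = compound_head(page);

	/* A pin dropped via plain put_page() leaves refcount < pincount. */
	VM_WARN_ON(page_dma_pinned(page) &&
		   page_ref_count(page) < page_pin_count(page));

	if (put_page_testzero(page))
		__put_page(page);
}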

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  1:23             ` John Hubbard
  2018-10-12  3:53               ` John Hubbard
@ 2018-10-22 19:43               ` Jason Gunthorpe
  2018-11-05  7:17                 ` John Hubbard
  1 sibling, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2018-10-22 19:43 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Andrew Morton, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> > On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> > 
> >>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>> will that bug manifest at runtime?  Under what set of circumstances
> >>> will the kernel trigger the bug?
> >>
> >> At runtime such a bug will manifest as a page that can never be evicted from
> >> memory. We could warn in put_page() if the page reference count drops below
> >> the bare minimum for the given user pin count, which would catch some
> >> issues, but it won't be 100% reliable. So at this point I'm leaning more
> >> towards making get_user_pages() return a different type than just
> >> struct page *, to make it much harder for the refcount to go wrong...
> > 
> > At least for the infiniband code being used as an example here we take
> > the struct page from get_user_pages, then stick it in a sgl, and at
> > put_page time we get the page back out of the sgl via sg_page()
> > 
> > So type safety will not help this case... I wonder how many other
> > users are similar? I think this is a pretty reasonable flow for DMA
> > with user pages.
> > 
> 
> That is true. The infiniband code, fortunately, never mixes the two page
> types into the same pool (or sg list), so it's actually an easier example
> than some other subsystems. But, yes, type safety doesn't help there. I can 
> take a moment to look around at the other areas, to quantify how much a type
> safety change might help.

Are most (all?) of the places working with SGLs?

Maybe we could just have a 'get_user_pages_to_sgl' and 'put_pages_sgl'
sort of interface that handled all this instead of trying to make
something that is struct page based?

It seems easier to get an extra bit for user/!user in the SGL
data structure?
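
Strawman signatures, just to make that concrete (entirely hypothetical,
nothing like this exists today):

/* Hypothetical: pin user pages and load them straight into an sgl... */
long get_user_pages_to_sgl(unsigned long start, unsigned long nr_pages,
			   unsigned int gup_flags, struct scatterlist *sgl);

/* ...and release them from the sgl, with the user/!user bit carried in
 * the scatterlist entries so teardown needs no extra bookkeeping.
 */
void put_pages_sgl(struct scatterlist *sgl, int nents, bool make_dirty);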

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-22 19:43               ` Jason Gunthorpe
@ 2018-11-05  7:17                 ` John Hubbard
  2018-11-05  8:37                   ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: John Hubbard @ 2018-11-05  7:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Andrew Morton, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On 10/22/18 12:43 PM, Jason Gunthorpe wrote:
> On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
>> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
>>>
>>>>> This is a real worry.  If someone uses a mistaken put_page() then how
>>>>> will that bug manifest at runtime?  Under what set of circumstances
>>>>> will the kernel trigger the bug?
>>>>
>>>> At runtime such a bug will manifest as a page that can never be evicted from
>>>> memory. We could warn in put_page() if the page reference count drops below
>>>> the bare minimum for the given user pin count, which would catch some
>>>> issues, but it won't be 100% reliable. So at this point I'm leaning more
>>>> towards making get_user_pages() return a different type than just
>>>> struct page *, to make it much harder for the refcount to go wrong...
>>>
>>> At least for the infiniband code being used as an example here we take
>>> the struct page from get_user_pages, then stick it in a sgl, and at
>>> put_page time we get the page back out of the sgl via sg_page()
>>>
>>> So type safety will not help this case... I wonder how many other
>>> users are similar? I think this is a pretty reasonable flow for DMA
>>> with user pages.
>>>
>>
>> That is true. The infiniband code, fortunately, never mixes the two page
>> types into the same pool (or sg list), so it's actually an easier example
>> than some other subsystems. But, yes, type safety doesn't help there. I can 
>> take a moment to look around at the other areas, to quantify how much a type
>> safety change might help.
> 
> Are most (all?) of the places working with SGLs?

I finally put together a spreadsheet, in order to answer this sort of thing.
Some notes:

a) There are around 100 call sites of either get_user_pages*(), or indirect
calls via iov_iter_get_pages*().

b) There are only a few SGL users. Most are ad hoc instead: some loop that
either can be collapsed nicely into the new put_user_pages*() APIs, or...
cannot.

c) The real problem is the 20+ iov_iter_get_pages*() call sites. I need
to change the iov_iter system a little bit, and also change the callers
so that they don't pile all the gup-pinned pages into the same page** array
that also contains other allocation types. This can be done; it just takes
time. That's the good news.

> 
> Maybe we could just have a 'get_user_pages_to_sgl' and 'put_pages_sgl'
> sort of interface that handled all this instead of trying to make
> something that is struct page based?
> 
> It seems easier to get an extra bit for user/!user in the SGL
> data structure?
> 

So at the moment I don't think we need this *_sgl interface. We need iov_iter*
changes instead.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-10-18 10:19                 ` Jan Kara
@ 2018-11-05  7:25                   ` John Hubbard
  0 siblings, 0 replies; 25+ messages in thread
From: John Hubbard @ 2018-11-05  7:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Andrew Morton, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On 10/18/18 3:19 AM, Jan Kara wrote:
> On Thu 11-10-18 20:53:34, John Hubbard wrote:
>> On 10/11/18 6:23 PM, John Hubbard wrote:
>>> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
>>>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
[...]
> Well, put_page() cannot assert that the page is not dma-pinned, as someone can
> still do get_page()/put_page() on a dma-pinned page, and that must not barf.
> But put_page() could assert that if the page is pinned, the refcount is >=
> the pincount. That will detect leaked pin references relatively quickly.
> 

That assertion is definitely a life saver. I've been attempting a combination
of finishing up more call site conversions and runtime testing, and this
lights up the missing conversions pretty nicely.

As I mentioned in another thread just now, I'll send out an updated RFC this week,
so that people can look through it well before the LPC (next week).

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions
  2018-11-05  7:17                 ` John Hubbard
@ 2018-11-05  8:37                   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2018-11-05  8:37 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, john.hubbard,
	Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	linux-mm, LKML, linux-rdma, linux-fsdevel, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Sun 04-11-18 23:17:58, John Hubbard wrote:
> On 10/22/18 12:43 PM, Jason Gunthorpe wrote:
> > On Thu, Oct 11, 2018 at 06:23:24PM -0700, John Hubbard wrote:
> >> On 10/11/18 6:20 AM, Jason Gunthorpe wrote:
> >>> On Thu, Oct 11, 2018 at 10:49:29AM +0200, Jan Kara wrote:
> >>>
> >>>>> This is a real worry.  If someone uses a mistaken put_page() then how
> >>>>> will that bug manifest at runtime?  Under what set of circumstances
> >>>>> will the kernel trigger the bug?
> >>>>
> >>>> At runtime such a bug will manifest as a page that can never be evicted from
> >>>> memory. We could warn in put_page() if the page reference count drops below
> >>>> the bare minimum for the given user pin count, which would catch some
> >>>> issues, but it won't be 100% reliable. So at this point I'm leaning more
> >>>> towards making get_user_pages() return a different type than just
> >>>> struct page *, to make it much harder for the refcount to go wrong...
> >>>
> >>> At least for the infiniband code being used as an example here we take
> >>> the struct page from get_user_pages, then stick it in a sgl, and at
> >>> put_page time we get the page back out of the sgl via sg_page()
> >>>
> >>> So type safety will not help this case... I wonder how many other
> >>> users are similar? I think this is a pretty reasonable flow for DMA
> >>> with user pages.
> >>>
> >>
> >> That is true. The infiniband code, fortunately, never mixes the two page
> >> types into the same pool (or sg list), so it's actually an easier example
> >> than some other subsystems. But, yes, type safety doesn't help there. I can 
> >> take a moment to look around at the other areas, to quantify how much a type
> >> safety change might help.
> > 
> > Are most (all?) of the places working with SGLs?
> 
> I finally put together a spreadsheet, in order to answer this sort of thing.
> Some notes:
> 
> a) There are around 100 call sites of either get_user_pages*(), or indirect
> calls via iov_iter_get_pages*().

Quite a bit...

> b) There are only a few SGL users. Most are ad hoc instead: some loop that
> either can be collapsed nicely into the new put_user_pages*() APIs, or...
> cannot.
> 
> c) The real problem is the 20+ iov_iter_get_pages*() call sites. I need
> to change the iov_iter system a little bit, and also change the callers
> so that they don't pile all the gup-pinned pages into the same page** array
> that also contains other allocation types. This can be done; it just takes
> time. That's the good news.

Yes, but looking into the iov_iter_get_pages() users, a lot of them then end
up feeding the result into either an SGL, an SKB (which is basically the same
thing, just for networking), or a BVEC (which is again a very similar thing,
just for the generic block layer). I'm not saying that we must have an _sgl()
interface, as untangling all those users might be just too complex, but there
is certainly some space for unification and common interfaces ;)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-11-05  8:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-08 21:16 [PATCH v4 0/3] get_user_pages*() and RDMA: first steps john.hubbard
2018-10-08 21:16 ` [PATCH v4 1/3] mm: get_user_pages: consolidate error handling john.hubbard
2018-10-09  0:05   ` Andrew Morton
2018-10-08 21:16 ` [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions john.hubbard
2018-10-09  0:14   ` Andrew Morton
2018-10-09  8:30     ` Jan Kara
2018-10-09 23:20       ` Andrew Morton
2018-10-10  0:32         ` John Hubbard
2018-10-10 23:43           ` Andrew Morton
2018-10-10  0:42     ` John Hubbard
2018-10-10  8:59       ` Jan Kara
2018-10-10 23:23         ` John Hubbard
2018-10-11  8:42           ` Jan Kara
2018-10-10 23:45       ` Andrew Morton
2018-10-11  8:49         ` Jan Kara
2018-10-11 13:20           ` Jason Gunthorpe
2018-10-12  1:23             ` John Hubbard
2018-10-12  3:53               ` John Hubbard
2018-10-18 10:19                 ` Jan Kara
2018-11-05  7:25                   ` John Hubbard
2018-10-22 19:43               ` Jason Gunthorpe
2018-11-05  7:17                 ` John Hubbard
2018-11-05  8:37                   ` Jan Kara
2018-10-08 21:16 ` [PATCH v4 3/3] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
2018-10-09  9:52   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).