LKML Archive on lore.kernel.org
* [PATCH 0/4] get_user_pages*() and RDMA: first steps
@ 2018-09-28  5:39 john.hubbard
  2018-09-28  5:39 ` [PATCH 1/4] mm: get_user_pages: consolidate error handling john.hubbard
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: john.hubbard @ 2018-09-28  5:39 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

From: John Hubbard <jhubbard@nvidia.com>

Hi,

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

I'd like to get the first two patches into the -mm tree.

Patch 1, although not technically critical to do now, is still nice to have,
because it's already been reviewed by Jan, and it's just one more thing on the
long TODO list here that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to start
submitting changes to convert all of the callsites of get_user_pages*() and
put_page().  I think this will work a lot better than trying to maintain a
massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
a fairly small and easy example.

Patch 4 converts a small driver from put_page() --> release_user_pages(). This
could just as easily have been done as a change from put_page() to
put_user_page(). The reason I did it this way is that this provides a small and
simple caller of the new release_user_pages() routine. I wanted both of the
new routines, even though just placeholders, to have callers.

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.
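
To make the intended call-site conversion concrete, here is a small
userspace mock of the before/after pattern (the struct page and helper
definitions below are simplified stand-ins for the kernel's, used only to
illustrate the shape of the change):

```c
/* Userspace stand-ins for the kernel types and helpers; illustration only. */
struct page { int refcount; };

static void put_page(struct page *page)      { page->refcount--; }
static void put_user_page(struct page *page) { page->refcount--; }

/* Mock of the new helper that patch 2 introduces. */
static void release_user_pages(struct page **pages, unsigned long npages)
{
	while (npages)
		put_user_page(pages[--npages]);
}

/* Before this series, a typical call site released GUP'd pages with: */
static void release_old(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		put_page(pages[i]);
}

/* After conversion, the same call site shrinks to a single call: */
static void release_new(struct page **pages, unsigned long npages)
{
	release_user_pages(pages, npages);
}
```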

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Christian Benvenuti <benve@cisco.com>
CC: Christopher Lameter <cl@linux.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Doug Ledford <dledford@redhat.com>
CC: Jan Kara <jack@suse.cz>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org
CC: linux-rdma@vger.kernel.org

John Hubbard (4):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page(), placeholder version
  infiniband/mm: convert to the new put_user_page() call
  goldfish_pipe/mm: convert to the new release_user_pages() call

 drivers/infiniband/core/umem.c              |  2 +-
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++---
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  2 +-
 drivers/platform/goldfish/goldfish_pipe.c   |  7 ++--
 include/linux/mm.h                          | 14 ++++++++
 mm/gup.c                                    | 37 ++++++++++++---------
 10 files changed, 52 insertions(+), 30 deletions(-)

-- 
2.19.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/4] mm: get_user_pages: consolidate error handling
  2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
@ 2018-09-28  5:39 ` john.hubbard
  2018-09-28  5:39 ` [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call john.hubbard
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: john.hubbard @ 2018-09-28  5:39 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited anyway, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
 	long i = 0;
+	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (!vma || start >= vma->vm_end) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
-				int ret;
-				ret = get_gate_page(mm, start & PAGE_MASK,
+				err = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
 						pages ? &pages[i] : NULL);
-				if (ret)
-					return i ? : ret;
+				if (err)
+					goto out;
 				page_mask = 0;
 				goto next_page;
 			}
 
-			if (!vma || check_vma_flags(vma, gup_flags))
-				return i ? : -EFAULT;
+			if (!vma || check_vma_flags(vma, gup_flags)) {
+				err = -EFAULT;
+				goto out;
+			}
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		 * If we have a pending SIGKILL, don't keep faulting pages and
 		 * potentially allocating memory.
 		 */
-		if (unlikely(fatal_signal_pending(current)))
-			return i ? i : -ERESTARTSYS;
+		if (unlikely(fatal_signal_pending(current))) {
+			err = -ERESTARTSYS;
+			goto out;
+		}
 		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
-			int ret;
-			ret = faultin_page(tsk, vma, start, &foll_flags,
+			err = faultin_page(tsk, vma, start, &foll_flags,
 					nonblocking);
-			switch (ret) {
+			switch (err) {
 			case 0:
 				goto retry;
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				err = 0;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			err = PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+
+out:
+	return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0



* [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
  2018-09-28  5:39 ` [PATCH 1/4] mm: get_user_pages: consolidate error handling john.hubbard
@ 2018-09-28  5:39 ` john.hubbard
  2018-09-28 15:39   ` Jason Gunthorpe
  2018-09-28  5:39 ` [PATCH 2/4] mm: introduce put_user_page(), placeholder version john.hubbard
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: john.hubbard @ 2018-09-28  5:39 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard,
	Doug Ledford, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti

From: John Hubbard <jhubbard@nvidia.com>

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(),
instead of put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

CC: Doug Ledford <dledford@redhat.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Christian Benvenuti <benve@cisco.com>

CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              | 2 +-
 drivers/infiniband/core/umem_odp.c          | 2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   | 8 ++++----
 drivers/infiniband/hw/usnic/usnic_uiom.c    | 2 +-
 7 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..9430d697cb9f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 		page = sg_page(sg);
 		if (!PageDirty(page) && umem->writable && dirty)
 			set_page_dirty_lock(page);
-		put_page(page);
+		put_user_page(page);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
 					ret = -EFAULT;
 					break;
 				}
-				put_page(local_page_list[j]);
+				put_user_page(local_page_list[j]);
 				continue;
 			}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..c7516029af33 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -126,7 +126,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 	for (i = 0; i < npages; i++) {
 		if (dirty)
 			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
+		put_user_page(p[i]);
 	}
 
 	if (mm) { /* during close after signal, mm can be NULL */
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_page(pages[0]);
+		put_user_page(pages[0]);
 		goto out;
 	}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_page(sg_page(&db_tab->page[i].mem));
+		put_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_page(sg_page(&db_tab->page[i].mem));
+			put_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..3f8fd42dd7fc 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -45,7 +45,7 @@ static void __qib_release_user_pages(struct page **p, size_t num_pages,
 	for (i = 0; i < num_pages; i++) {
 		if (dirty)
 			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
+		put_user_page(p[i]);
 	}
 }
 
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..14f94d823907 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -266,7 +266,7 @@ static void qib_user_sdma_init_frag(struct qib_user_sdma_pkt *pkt,
 	pkt->addr[i].length = len;
 	pkt->addr[i].first_desc = first_desc;
 	pkt->addr[i].last_desc = last_desc;
-	pkt->addr[i].put_page = put_page;
+	pkt->addr[i].put_page = put_user_page;
 	pkt->addr[i].dma_mapped = dma_mapped;
 	pkt->addr[i].page = page;
 	pkt->addr[i].kvaddr = kvaddr;
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_page(page);
+			put_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_page(pkt->addr[i].page);
+			put_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_page(pages[i++]);
+		put_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 9dd39daa602b..0f607f31c262 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -91,7 +91,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 			pa = sg_phys(sg);
 			if (!PageDirty(page) && dirty)
 				set_page_dirty_lock(page);
-			put_page(page);
+			put_user_page(page);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
-- 
2.19.0



* [PATCH 2/4] mm: introduce put_user_page(), placeholder version
  2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
  2018-09-28  5:39 ` [PATCH 1/4] mm: get_user_pages: consolidate error handling john.hubbard
  2018-09-28  5:39 ` [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call john.hubbard
@ 2018-09-28  5:39 ` john.hubbard
  2018-10-03 16:22   ` Jan Kara
  2018-09-28  5:39 ` [PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call john.hubbard
  2018-09-28 15:29 ` [PATCH 0/4] get_user_pages*() and RDMA: first steps Jerome Glisse
  4 siblings, 1 reply; 29+ messages in thread
From: john.hubbard @ 2018-09-28  5:39 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also adds release_user_pages(), a drop-in replacement for
release_pages(). This is intended to be easily grep-able
for later performance improvements, since release_user_pages()
is not batched the way release_pages() is, and is significantly
slower.

Also: rename goldfish_pipe.c's release_user_pages(), in order
to avoid a naming conflict with the new external function of
the same name.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Christopher Lameter <cl@linux.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c |  4 ++--
 include/linux/mm.h                        | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index 2da567540c2d..fad0345376e0 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -332,7 +332,7 @@ static int pin_user_pages(unsigned long first_page, unsigned long last_page,
 
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
+static void __release_user_pages(struct page **pages, int pages_count,
 	int is_write, s32 consumed_size)
 {
 	int i;
@@ -410,7 +410,7 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
 	*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-	release_user_pages(pages, pages_count, is_write, *consumed_size);
+	__release_user_pages(pages, pages_count, is_write, *consumed_size);
 
 	mutex_unlock(&pipe->lock);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..72caf803115f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -943,6 +943,20 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/* Placeholder version, until all get_user_pages*() callers are updated. */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+/* A drop-in replacement for release_pages(): */
+static inline void release_user_pages(struct page **pages,
+				      unsigned long npages)
+{
+	while (npages)
+		put_user_page(pages[--npages]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
-- 
2.19.0



* [PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call
  2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
                   ` (2 preceding siblings ...)
  2018-09-28  5:39 ` [PATCH 2/4] mm: introduce put_user_page(), placeholder version john.hubbard
@ 2018-09-28  5:39 ` john.hubbard
  2018-09-28 15:29 ` [PATCH 0/4] get_user_pages*() and RDMA: first steps Jerome Glisse
  4 siblings, 0 replies; 29+ messages in thread
From: john.hubbard @ 2018-09-28  5:39 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

For code that retains pages via get_user_pages*(),
release those pages via the new release_user_pages(),
instead of calling put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
    Proposed steps for fixing get_user_pages() + DMA problems.

CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index fad0345376e0..1e9455a86698 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -340,8 +340,9 @@ static void __release_user_pages(struct page **pages, int pages_count,
 	for (i = 0; i < pages_count; i++) {
 		if (!is_write && consumed_size > 0)
 			set_page_dirty(pages[i]);
-		put_page(pages[i]);
 	}
+
+	release_user_pages(pages, pages_count);
 }
 
 /* Populate the call parameters, merging adjacent pages together */
-- 
2.19.0



* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
                   ` (3 preceding siblings ...)
  2018-09-28  5:39 ` [PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call john.hubbard
@ 2018-09-28 15:29 ` Jerome Glisse
  2018-09-28 19:06   ` John Hubbard
  4 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-09-28 15:29 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Hi,
> 
> This short series prepares for eventually fixing the problem described
> in [1], and is following a plan listed in [2].
> 
> I'd like to get the first two patches into the -mm tree.
> 
> Patch 1, although not technically critical to do now, is still nice to have,
> because it's already been reviewed by Jan, and it's just one more thing on the
> long TODO list here that is ready to be checked off.
> 
> Patch 2 is required in order to allow me (and others, if I'm lucky) to start
> submitting changes to convert all of the callsites of get_user_pages*() and
> put_page().  I think this will work a lot better than trying to maintain a
> massive patchset and submitting all at once.
> 
> Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
> a fairly small and easy example.
> 
> Patch 4 converts a small driver from put_page() --> release_user_pages(). This
> could just as easily have been done as a change from put_page() to
> put_user_page(). The reason I did it this way is that this provides a small and
> simple caller of the new release_user_pages() routine. I wanted both of the
> new routines, even though just placeholders, to have callers.
> 
> Once these are all in, then the floodgates can open up to convert the large
> number of get_user_pages*() callsites.
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 

So the solution is to wait (possibly for days, months, years) for the
RDMA or GPU driver which did GUP and does not have an mmu notifier to
release the page (or call put_user_page()) ?

This sounds bad. Like I said during LSF/MM, there is no way to properly
fix hardware that can not be preempted/invalidated ... most GPUs are fine.
A few RDMA devices are fine, but most are not ...

If it is just about fixing the set_page_dirty() bug, then just looking at
refcount versus mapcount should already tell you whether you can remove the
buffer head from the page or not. That would fix the bug without complex
changes (I still like put_user_page() just for symmetry with GUP).

Cheers,
Jérôme


* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-28  5:39 ` [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call john.hubbard
@ 2018-09-28 15:39   ` Jason Gunthorpe
  2018-09-29  3:12     ` John Hubbard
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2018-09-28 15:39 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	John Hubbard, Doug Ledford, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti

On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For code that retains pages via get_user_pages*(),
> release those pages via the new put_user_page(),
> instead of put_page().
> 
> This prepares for eventually fixing the problem described
> in [1], and is following a plan listed in [2].
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>     Proposed steps for fixing get_user_pages() + DMA problems.
> 
> CC: Doug Ledford <dledford@redhat.com>
> CC: Jason Gunthorpe <jgg@ziepe.ca>
> CC: Mike Marciniszyn <mike.marciniszyn@intel.com>
> CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
> CC: Christian Benvenuti <benve@cisco.com>
> 
> CC: linux-rdma@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> CC: linux-mm@kvack.org
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>  drivers/infiniband/core/umem.c              | 2 +-
>  drivers/infiniband/core/umem_odp.c          | 2 +-
>  drivers/infiniband/hw/hfi1/user_pages.c     | 2 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++---
>  drivers/infiniband/hw/qib/qib_user_pages.c  | 2 +-
>  drivers/infiniband/hw/qib/qib_user_sdma.c   | 8 ++++----
>  drivers/infiniband/hw/usnic/usnic_uiom.c    | 2 +-
>  7 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a41792dbae1f..9430d697cb9f 100644
> +++ b/drivers/infiniband/core/umem.c
> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>  		page = sg_page(sg);
>  		if (!PageDirty(page) && umem->writable && dirty)
>  			set_page_dirty_lock(page);
> -		put_page(page);
> +		put_user_page(page);

Would it make sense to have a release/put_user_pages_dirtied to absorb
the set_page_dirty pattern too? I notice in this patch there is some
variety here; I wonder what the right way is.
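
One possible shape for such a helper, sketched as a standalone userspace
mock (the name put_user_pages_dirty() and the stub types below are
hypothetical, not part of the posted series):

```c
/* Userspace stand-ins for the kernel types and helpers; illustration only. */
struct page { int dirty; int refcount; };

static void set_page_dirty_lock(struct page *page) { page->dirty = 1; }
static void put_user_page(struct page *page)       { page->refcount--; }

/*
 * Hypothetical helper absorbing the common "dirty it, then release it"
 * pattern that appears at several call sites in this patch.
 */
static void put_user_pages_dirty(struct page **pages, unsigned long npages)
{
	while (npages) {
		struct page *page = pages[--npages];

		set_page_dirty_lock(page);
		put_user_page(page);
	}
}
```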

Also, I'm told this code here is a big performance bottleneck when the
number of pages becomes very large (think >> GB of memory), so having a
future path to some kind of batching/threading sounds great.

Otherwise this RDMA part seems fine to me, though there might be some minor
conflicts. I assume you want to run this through the -mm tree?

Acked-by: Jason Gunthorpe <jgg@mellanox.com>

Jason


* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-28 15:29 ` [PATCH 0/4] get_user_pages*() and RDMA: first steps Jerome Glisse
@ 2018-09-28 19:06   ` John Hubbard
  2018-09-28 21:49     ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: John Hubbard @ 2018-09-28 19:06 UTC (permalink / raw)
  To: Jerome Glisse, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On 9/28/18 8:29 AM, Jerome Glisse wrote:
> On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Hi,
>>
>> This short series prepares for eventually fixing the problem described
>> in [1], and is following a plan listed in [2].
>>
>> I'd like to get the first two patches into the -mm tree.
>>
>> Patch 1, although not technically critical to do now, is still nice to have,
>> because it's already been reviewed by Jan, and it's just one more thing on the
>> long TODO list here that is ready to be checked off.
>>
>> Patch 2 is required in order to allow me (and others, if I'm lucky) to start
>> submitting changes to convert all of the callsites of get_user_pages*() and
>> put_page().  I think this will work a lot better than trying to maintain a
>> massive patchset and submitting all at once.
>>
>> Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
>> a fairly small and easy example.
>>
>> Patch 4 converts a small driver from put_page() --> release_user_pages(). This
>> could just as easily have been done as a change from put_page() to
>> put_user_page(). The reason I did it this way is that this provides a small and
>> simple caller of the new release_user_pages() routine. I wanted both of the
>> new routines, even though just placeholders, to have callers.
>>
>> Once these are all in, then the floodgates can open up to convert the large
>> number of get_user_pages*() callsites.
>>
>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
>>     Proposed steps for fixing get_user_pages() + DMA problems.
>>
> 
> So the solution is to wait (possibly for days, months, years) for the
> RDMA or GPU driver which did GUP and does not have an mmu notifier to
> release the page (or call put_user_page()) ?
> 
> This sounds bad. Like I said during LSF/MM, there is no way to properly
> fix hardware that can not be preempted/invalidated ... most GPUs are fine.
> A few RDMA devices are fine, but most are not ...
> 

Hi Jerome,

Personally, I think that this particular design is the best one I've seen
so far, but if other, better designs show up, then let's do those instead, sure.

I guess your main concern is that this might take longer than other approaches.

As for time frame, perhaps I made it sound worse than it really is. I have patches
staged already for all of the simpler call sites, and for about half of the more
complicated ones. The core solution in mm is not large, and we've gone through a 
few discussion threads about it back in July or so, so it shouldn't take too long
to perfect it.

So it may be a few months to get it all reviewed and submitted, but I don't
see "years" by any stretch.


> If it is just about fixing the set_page_dirty() bug, then just looking at
> refcount versus mapcount should already tell you whether you can remove the
> buffer head from the page or not. That would fix the bug without complex
> changes (I still like put_user_page() just for symmetry with GUP).
> 

It's about more than that. The goal is to make it safe and correct to
use a non-CPU device to read and write to "pinned" memory, especially when
that memory is backed by a file system.

I recall there were objections to just narrowly fixing the set_page_dirty()
bug, because the underlying problem is large and serious. So here we are.

thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-28 19:06   ` John Hubbard
@ 2018-09-28 21:49     ` Jerome Glisse
  2018-09-29  2:28       ` John Hubbard
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-09-28 21:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
> On 9/28/18 8:29 AM, Jerome Glisse wrote:
> > On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubbard@gmail.com wrote:
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Hi,
> >>
> >> This short series prepares for eventually fixing the problem described
> >> in [1], and is following a plan listed in [2].
> >>
> >> I'd like to get the first two patches into the -mm tree.
> >>
> >> Patch 1, although not technically critical to do now, is still nice to have,
> >> because it's already been reviewed by Jan, and it's just one more thing on the
> >> long TODO list here, that is ready to be checked off.
> >>
> >> Patch 2 is required in order to allow me (and others, if I'm lucky) to start
> >> submitting changes to convert all of the callsites of get_user_pages*() and
> >> put_page().  I think this will work a lot better than trying to maintain a
> >> massive patchset and submitting all at once.
> >>
> >> Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
> >> a fairly small and easy example.
> >>
> >> Patch 4 converts a small driver from put_page() --> release_user_pages(). This
> >> could just as easily have been done as a change from put_page() to
> >> put_user_page(). The reason I did it this way is that this provides a small and
> >> simple caller of the new release_user_pages() routine. I wanted both of the
> >> new routines, even though just placeholders, to have callers.
> >>
> >> Once these are all in, then the floodgates can open up to convert the large
> >> number of get_user_pages*() callsites.
> >>
> >> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> >>
> >> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubbard@nvidia.com
> >>     Proposed steps for fixing get_user_pages() + DMA problems.
> >>
> > 
> > So the solution is to wait (possibly for days, months, years) that the
> > RDMA or GPU which did GUP and do not have mmu notifier, release the page
> > (or put_user_page()) ?
> > 
> > This sounds bads. Like i said during LSF/MM there is no way to properly
> > fix hardware that can not be preempted/invalidated ... most GPU are fine.
> > Few RDMA are fine, most can not ...
> > 
> 
> Hi Jerome,
> 
> Personally, I'm think that this particular design is the best one I've seen
> so far, but if other, better designs show up, than let's do those instead, sure.
> 
> I guess your main concern is that this might take longer than other approaches.
> 
> As for time frame, perhaps I made it sound worse than it really is. I have patches
> staged already for all of the simpler call sites, and for about half of the more
> complicated ones. The core solution in mm is not large, and we've gone through a 
> few discussion threads about it back in July or so, so it shouldn't take too long
> to perfect it.
> 
> So it may be a few months to get it all reviewed and submitted, but I don't
> see "years" by any stretch.

Bit of miscomprehension there :) By months, years, I am talking about
the time it will take for some user to release the pin they have on the
page, not the time to push something upstream.

AFAICT RDMA drivers do not have any upper bound on how long they can hold
a page reference, and thus your solution can leave one CPU core stuck for
as long as the pin is active. The worst case might lead to all CPU cores
waiting for something that might never happen.

> 
> 
> > If it is just about fixing the set_page_dirty() bug then just looking at
> > refcount versus mapcount should already tell you if you can remove the
> > buffer head from the page or not. Which would fix the bug without complex
> > changes (i still like the put_user_page just for symetry with GUP).
> > 
> 
> It's about more than that. The goal is to make it safe and correct to
> use a non-CPU device to read and write to "pinned" memory, especially when
> that memory is backed by a file system.
> 
> I recall there were objections to just narrowly fixing the set_page_dirty()
> bug, because the underlying problem is large and serious. So here we are.

Except that you cannot solve that issue without proper hardware. GPUs are
fine. RDMA is broken, except for the Mellanox mlx5 hardware, which can
invalidate its page table at any time and thus allows the page to be
write-protected at any time.

With the solution put forward here you can potentially wait _forever_ for
the driver that holds a pin to drop it. This was the point I was trying to
get across during LSF/MM. You cannot fix broken hardware that decided to
use GUP for a feature it cannot reliably implement, because the hardware
is not capable of behaving.

Because code is easier to discuss, here is what I meant:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-28 21:49     ` Jerome Glisse
@ 2018-09-29  2:28       ` John Hubbard
  2018-09-29  8:46         ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: John Hubbard @ 2018-09-29  2:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On 9/28/18 2:49 PM, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
>> On 9/28/18 8:29 AM, Jerome Glisse wrote:
>>> On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>>> So the solution is to wait (possibly for days, months, years) that the
>>> RDMA or GPU which did GUP and do not have mmu notifier, release the page
>>> (or put_user_page()) ?
>>>
>>> This sounds bads. Like i said during LSF/MM there is no way to properly
>>> fix hardware that can not be preempted/invalidated ... most GPU are fine.
>>> Few RDMA are fine, most can not ...
>>>
>>
>> Hi Jerome,
>>
>> Personally, I'm think that this particular design is the best one I've seen
>> so far, but if other, better designs show up, than let's do those instead, sure.
>>
>> I guess your main concern is that this might take longer than other approaches.
>>
>> As for time frame, perhaps I made it sound worse than it really is. I have patches
>> staged already for all of the simpler call sites, and for about half of the more
>> complicated ones. The core solution in mm is not large, and we've gone through a 
>> few discussion threads about it back in July or so, so it shouldn't take too long
>> to perfect it.
>>
>> So it may be a few months to get it all reviewed and submitted, but I don't
>> see "years" by any stretch.
> 
> Bit of miss-comprehention there :) By month, years, i am talking about
> the time it will take for some user to release the pin they have on the
> page. Not the time to push something upstream.
> 
> AFAICT RDMA driver do not have any upper bound on how long they can hold
> a page reference and thus your solution can leave one CPU core stuck for
> as long as the pin is active. Worst case might lead to all CPU core waiting
> for something that might never happen.
> 

Actually, the latest direction on that discussion was toward periodically
writing back, even while under RDMA, via bounce buffers:

  https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz

I still think that's viable. Of course, there are other things besides 
writeback (see below) that might also lead to waiting.

>>> If it is just about fixing the set_page_dirty() bug then just looking at
>>> refcount versus mapcount should already tell you if you can remove the
>>> buffer head from the page or not. Which would fix the bug without complex
>>> changes (i still like the put_user_page just for symetry with GUP).
>>>
>>
>> It's about more than that. The goal is to make it safe and correct to
>> use a non-CPU device to read and write to "pinned" memory, especially when
>> that memory is backed by a file system.
>>
>> I recall there were objections to just narrowly fixing the set_page_dirty()
>> bug, because the underlying problem is large and serious. So here we are.
> 
> Except that you can not solve that issue without proper hardware. GPU are
> fine. RDMA are broken except the mellanox5 hardware which can invalidate
> at anytime its page table thus allowing to write protect the page at any
> time.

Today, people are out there using RDMA without page-fault-capable hardware.
And they are hitting problems, as we've seen. From the discussions so far,
I don't think it's impossible to solve the problems, even for "lesser", 
non-fault-capable hardware. Especially once we decide on what is reasonable
and supported.  Certainly the existing situation needs *something* to 
change, even if it's (I don't recommend this) "go forth and tell the world
to stop using RDMA with their current hardware".

> 
> With the solution put forward here you can potentialy wait _forever_ for
> the driver that holds a pin to drop it. This was the point i was trying to
> get accross during LSF/MM. 

I agree that just blocking indefinitely is generally unacceptable for kernel
code, but we can probably avoid it for many cases (bounce buffers), and
if we think it is really appropriate (file system unmounting, maybe?) then
maybe tolerate it in some rare cases.  

>You can not fix broken hardware that decided to
> use GUP to do a feature they can't reliably do because their hardware is
> not capable to behave.
> 
> Because code is easier here is what i was meaning:
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826
> 

While that may work sometimes, I don't think it is reliable enough to trust for
identifying pages that have been gup-pinned. There's just too much overloading of
other mechanisms going on there, and if we pile on top with this constraint of "if you
have +3 refcounts, and this particular combination of page counts and mapcounts, then
you're definitely a long-term pinned page", I think users will find a lot of corner
cases for us that break that assumption. 

So I think we agree that the put_user_page() approach, to complement the
get_user_pages*() call sites, is worth doing regardless of the details of the core
solution. btw, now that I'm refreshing my memory of our earlier discussions: Jan had an
interesting point that "long-term pinned" is a property of the call site, rather than
of the page:

  https://lkml.kernel.org/r/20180704104318.f5pnqtnn3unkwauw@quack2.suse.cz

...which really sounded like a useful way to think about this.

Here's what I think would help:

1) I'll send out a freshened-up RFC for the core implementation (it's hard to talk about
here without that, although your code above helps), and we can hammer out some answers
there.

2) I'll work through remaining comments (Jason had some) on this and respin this patchset.

Basically, I'm hearing "Jerome is totally going to ACK this, but maybe disagree about
some or all of the upcoming RFC". But then again, I hear what I want to hear! :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-28 15:39   ` Jason Gunthorpe
@ 2018-09-29  3:12     ` John Hubbard
  2018-09-29 16:21       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: John Hubbard @ 2018-09-29  3:12 UTC (permalink / raw)
  To: Jason Gunthorpe, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Doug Ledford, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti

On 9/28/18 8:39 AM, Jason Gunthorpe wrote:
> On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index a41792dbae1f..9430d697cb9f 100644
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>>  		page = sg_page(sg);
>>  		if (!PageDirty(page) && umem->writable && dirty)
>>  			set_page_dirty_lock(page);
>> -		put_page(page);
>> +		put_user_page(page);
> 
> Would it make sense to have a release/put_user_pages_dirtied to absorb
> the set_page_dity pattern too? I notice in this patch there is some
> variety here, I wonder what is the right way?
> 
> Also, I'm told this code here is a big performance bottleneck when the
> number of pages becomes very long (think >> GB of memory), so having a
> future path to use some kind of batching/threading sound great.
> 

Yes. And you asked for this the first time, too. Consistent! :) Sorry for
being slow to pick it up. It looks like there are several patterns, and
we have to support both set_page_dirty() and set_page_dirty_lock(). So
the best combination looks to be adding a few variations of
release_user_pages*(), but leaving put_user_page() alone, because it's
the "do it yourself" basic one. Scatter-gather will be stuck with that.

Here's a differential patch with that, that shows a nice little cleanup in 
a couple of IB places, and as you point out, it also provides the hooks for 
performance upgrades (via batching) in the future.

Does this API look about right?

diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index c7516029af33..48afec362c31 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -123,11 +123,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 {
        size_t i;
 
-       for (i = 0; i < npages; i++) {
-               if (dirty)
-                       set_page_dirty_lock(p[i]);
-               put_user_page(p[i]);
-       }
+       release_user_pages_lock(p, npages, dirty);
 
        if (mm) { /* during close after signal, mm can be NULL */
                down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 3f8fd42dd7fc..c57a3a6730b6 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,7 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
                                     int dirty)
 {
-       size_t i;
-
-       for (i = 0; i < num_pages; i++) {
-               if (dirty)
-                       set_page_dirty_lock(p[i]);
-               put_user_page(p[i]);
-       }
+       release_user_pages_lock(p, num_pages, dirty);
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72caf803115f..b280d0181e06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -138,6 +138,9 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
                                    size_t *, loff_t *);
 
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
+
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 /* to align the pointer to the (next) page boundary */
@@ -949,12 +952,56 @@ static inline void put_user_page(struct page *page)
        put_page(page);
 }
 
-/* A drop-in replacement for release_pages(): */
+/* For get_user_pages*()-pinned pages, use these variants instead of
+ * release_pages():
+ */
+static inline void release_user_pages_dirty(struct page **pages,
+                                           unsigned long npages)
+{
+       while (npages) {
+               --npages;
+               set_page_dirty(pages[npages]);
+               put_user_page(pages[npages]);
+       }
+}
+
+static inline void release_user_pages_dirty_lock(struct page **pages,
+                                                unsigned long npages)
+{
+       while (npages) {
+               --npages;
+               set_page_dirty_lock(pages[npages]);
+               put_user_page(pages[npages]);
+       }
+}
+
+static inline void release_user_pages_basic(struct page **pages,
+                                           unsigned long npages)
+{
+       while (npages) {
+               --npages;
+               put_user_page(pages[npages]);
+       }
+}
+
 static inline void release_user_pages(struct page **pages,
-                                     unsigned long npages)
+                                     unsigned long npages,
+                                     bool set_dirty)
 {
-       while (npages)
-               put_user_page(pages[--npages]);
+       if (set_dirty)
+               release_user_pages_dirty(pages, npages);
+       else
+               release_user_pages_basic(pages, npages);
+}
+
+static inline void release_user_pages_lock(struct page **pages,
+                                          unsigned long npages,
+                                          bool set_dirty)
+{
+       if (set_dirty)
+               release_user_pages_dirty_lock(pages, npages);
+       else
+               release_user_pages_basic(pages, npages);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -1548,8 +1595,6 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
                          struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {


> Otherwise this RDMA part seems fine to me, there might be some minor
> conflicts however. I assume you want to run this through the -mm tree?
> 
> Acked-by: Jason Gunthorpe <jgg@mellanox.com>
> 

Great, thanks for the ACK.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-29  2:28       ` John Hubbard
@ 2018-09-29  8:46         ` Jerome Glisse
  2018-10-01  6:11           ` Dave Chinner
  2018-10-03 16:08           ` Jan Kara
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2018-09-29  8:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> On 9/28/18 2:49 PM, Jerome Glisse wrote:
> > On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
> >> On 9/28/18 8:29 AM, Jerome Glisse wrote:
> >>> On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubbard@gmail.com wrote:
> >>>> From: John Hubbard <jhubbard@nvidia.com>
> [...]
> >>> So the solution is to wait (possibly for days, months, years) that the
> >>> RDMA or GPU which did GUP and do not have mmu notifier, release the page
> >>> (or put_user_page()) ?
> >>>
> >>> This sounds bads. Like i said during LSF/MM there is no way to properly
> >>> fix hardware that can not be preempted/invalidated ... most GPU are fine.
> >>> Few RDMA are fine, most can not ...
> >>>
> >>
> >> Hi Jerome,
> >>
> >> Personally, I'm think that this particular design is the best one I've seen
> >> so far, but if other, better designs show up, than let's do those instead, sure.
> >>
> >> I guess your main concern is that this might take longer than other approaches.
> >>
> >> As for time frame, perhaps I made it sound worse than it really is. I have patches
> >> staged already for all of the simpler call sites, and for about half of the more
> >> complicated ones. The core solution in mm is not large, and we've gone through a 
> >> few discussion threads about it back in July or so, so it shouldn't take too long
> >> to perfect it.
> >>
> >> So it may be a few months to get it all reviewed and submitted, but I don't
> >> see "years" by any stretch.
> > 
> > Bit of miss-comprehention there :) By month, years, i am talking about
> > the time it will take for some user to release the pin they have on the
> > page. Not the time to push something upstream.
> > 
> > AFAICT RDMA driver do not have any upper bound on how long they can hold
> > a page reference and thus your solution can leave one CPU core stuck for
> > as long as the pin is active. Worst case might lead to all CPU core waiting
> > for something that might never happen.
> > 
> 
> Actually, the latest direction on that discussion was toward periodically
> writing back, even while under RDMA, via bounce buffers:
> 
>   https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
> 
> I still think that's viable. Of course, there are other things besides 
> writeback (see below) that might also lead to waiting.

Writeback under a bounce buffer is fine; when looking back at the links you
provided, the solution that was discussed was blocking in page_mkclean(),
which is horrible in my point of view.

> 
> >>> If it is just about fixing the set_page_dirty() bug then just looking at
> >>> refcount versus mapcount should already tell you if you can remove the
> >>> buffer head from the page or not. Which would fix the bug without complex
> >>> changes (i still like the put_user_page just for symetry with GUP).
> >>>
> >>
> >> It's about more than that. The goal is to make it safe and correct to
> >> use a non-CPU device to read and write to "pinned" memory, especially when
> >> that memory is backed by a file system.
> >>
> >> I recall there were objections to just narrowly fixing the set_page_dirty()
> >> bug, because the underlying problem is large and serious. So here we are.
> > 
> > Except that you can not solve that issue without proper hardware. GPU are
> > fine. RDMA are broken except the mellanox5 hardware which can invalidate
> > at anytime its page table thus allowing to write protect the page at any
> > time.
> 
> Today, people are out there using RDMA without page-fault-capable hardware.
> And they are hitting problems, as we've seen. From the discussions so far,
> I don't think it's impossible to solve the problems, even for "lesser", 
> non-fault-capable hardware. Especially once we decide on what is reasonable
> and supported.  Certainly the existing situation needs *something* to 
> change, even if it's (I don't recommend this) "go forth and tell the world
> to stop using RDMA with their current hardware".

I am not saying stop using their hardware; I am saying their hardware does
things it should not do, because the driver developers did not do their due
diligence in understanding what GUP really is. As a result I believe that
whatever solution we decide on must penalize only the users of such device
drivers. It should not hurt the general case with sane hardware.


> > 
> > With the solution put forward here you can potentialy wait _forever_ for
> > the driver that holds a pin to drop it. This was the point i was trying to
> > get accross during LSF/MM. 
> 
> I agree that just blocking indefinitely is generally unacceptable for kernel
> code, but we can probably avoid it for many cases (bounce buffers), and
> if we think it is really appropriate (file system unmounting, maybe?) then
> maybe tolerate it in some rare cases.  
> 
> >You can not fix broken hardware that decided to
> > use GUP to do a feature they can't reliably do because their hardware is
> > not capable to behave.
> > 
> > Because code is easier here is what i was meaning:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826
> > 
> 
> While that may work sometimes, I don't think it is reliable enough to trust for
> identifying pages that have been gup-pinned. There's just too much overloading of
> other mechanisms going on there, and if we pile on top with this constraint of "if you
> have +3 refcounts, and this particular combination of page counts and mapcounts, then
> you're definitely a long-term pinned page", I think users will find a lot of corner
> cases for us that break that assumption. 

So mapcount == refcount (modulo extra references for mapping and
private) should hold; here are the cases when it does not:
    - page being migrated
    - page being isolated from the LRU
    - mempolicy changes against the page
    - page cache lookup
    - some file system activities
    - I likely miss a couple here, I am doing this from memory

What matters is that all of the above are transitory; the extra reference
only lasts for as long as it takes for the action to finish (migration,
mempolicy change, ...).

So skipping those false-positive pages while reclaiming likely makes sense;
blocking on freeing the buffers maybe not.


> 
> So I think we agree that the put_user_page() approach, to complement the
> get_user_pages*() call sites, is worth doing regardless of the details of the core
> solution. btw, now that I'm refreshing my memory of our earlier discussions: Jan had an
> interesting point that "long-term pinned" is a property of the call site, rather than
> of the page:
> 
>   https://lkml.kernel.org/r/20180704104318.f5pnqtnn3unkwauw@quack2.suse.cz
> 
> ...which really sounded like a useful way to think about this.
> 
> Here's what I think would help:
> 
> 1) I'll send out a freshened-up RFC for the core implementation (it's hard to talk about
> here without that, although your code above helps), and we can hammer out some answers
> there.
> 
> 2) I'll work through remaining comments (Jason had some) on this and respin this patchset.
> 
> Basically, I'm hearing "Jerome is totally going to ACK this, but maybe disagree about
> some or all of the upcoming RFC". But then again, I hear what I want to hear! :)

I am fine with having symmetry between GUP and put_user_page(), so this
patchset is fine from my POV. But I was looking at the final solution,
and what the first links suggested, when digging through the internet,
was "let's block forever in page_mkclean()", and that is not fine. The
bounce buffer solution is fine; it will put the extra cost burden onto
users of such broken hardware.

Thanks for looking into all that,
Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-29  3:12     ` John Hubbard
@ 2018-09-29 16:21       ` Matthew Wilcox
  2018-09-29 19:19         ` Jason Gunthorpe
  2018-10-01 12:50         ` Christoph Hellwig
  2018-10-01 14:35       ` Dennis Dalessandro
  2018-10-03 16:27       ` Jan Kara
  2 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox @ 2018-09-29 16:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, john.hubbard, Michal Hocko, Christopher Lameter,
	Dan Williams, Jan Kara, Al Viro, linux-mm, LKML, linux-rdma,
	linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On Fri, Sep 28, 2018 at 08:12:33PM -0700, John Hubbard wrote:
> >> +++ b/drivers/infiniband/core/umem.c
> >> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
> >>  		page = sg_page(sg);
> >>  		if (!PageDirty(page) && umem->writable && dirty)
> >>  			set_page_dirty_lock(page);
> >> -		put_page(page);
> >> +		put_user_page(page);
> > 
> > Would it make sense to have a release/put_user_pages_dirtied to absorb
> > the set_page_dity pattern too? I notice in this patch there is some
> > variety here, I wonder what is the right way?
> > 
> > Also, I'm told this code here is a big performance bottleneck when the
> > number of pages becomes very long (think >> GB of memory), so having a
> > future path to use some kind of batching/threading sound great.
> 
> Yes. And you asked for this the first time, too. Consistent! :) Sorry for
> being slow to pick it up. It looks like there are several patterns, and
> we have to support both set_page_dirty() and set_page_dirty_lock(). So
> the best combination looks to be adding a few variations of
> release_user_pages*(), but leaving put_user_page() alone, because it's
> the "do it yourself" basic one. Scatter-gather will be stuck with that.

I think our current interfaces are wrong.  We should really have a
get_user_sg() / put_user_sg() function that will set up / destroy an
SG list appropriate for that range of user memory.  This is almost
orthogonal to the original intent here, so please don't see this as a
"must do first" kind of argument that might derail the whole thing.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-29 16:21       ` Matthew Wilcox
@ 2018-09-29 19:19         ` Jason Gunthorpe
  2018-10-01 12:50         ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2018-09-29 19:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, john.hubbard, Michal Hocko, Christopher Lameter,
	Dan Williams, Jan Kara, Al Viro, linux-mm, LKML, linux-rdma,
	linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On Sat, Sep 29, 2018 at 09:21:17AM -0700, Matthew Wilcox wrote:
> On Fri, Sep 28, 2018 at 08:12:33PM -0700, John Hubbard wrote:
> > >> +++ b/drivers/infiniband/core/umem.c
> > >> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
> > >>  		page = sg_page(sg);
> > >>  		if (!PageDirty(page) && umem->writable && dirty)
> > >>  			set_page_dirty_lock(page);
> > >> -		put_page(page);
> > >> +		put_user_page(page);
> > > 
> > > Would it make sense to have a release/put_user_pages_dirtied to absorb
> > > the set_page_dity pattern too? I notice in this patch there is some
> > > variety here, I wonder what is the right way?
> > > 
> > > Also, I'm told this code here is a big performance bottleneck when the
> > > number of pages becomes very long (think >> GB of memory), so having a
> > > future path to use some kind of batching/threading sound great.
> > 
> > Yes. And you asked for this the first time, too. Consistent! :) Sorry for
> > being slow to pick it up. It looks like there are several patterns, and
> > we have to support both set_page_dirty() and set_page_dirty_lock(). So
> > the best combination looks to be adding a few variations of
> > release_user_pages*(), but leaving put_user_page() alone, because it's
> > the "do it yourself" basic one. Scatter-gather will be stuck with that.
> 
> I think our current interfaces are wrong.  We should really have a
> get_user_sg() / put_user_sg() function that will set up / destroy an
> SG list appropriate for that range of user memory.  This is almost
> orthogonal to the original intent here, so please don't see this as a
> "must do first" kind of argument that might derail the whole thing.

This would be an excellent API, and it is exactly what this 'umem'
code in RDMA is trying to do for RDMA drivers.

I suspect it could do a much better job if it wasn't hindered by the
get_user_pages API - ie it could directly stuff huge pages into the SG
list instead of breaking them up, maybe also avoid the temporary memory
allocations for page * pointers, etc?

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-29  8:46         ` Jerome Glisse
@ 2018-10-01  6:11           ` Dave Chinner
  2018-10-01 12:47             ` Christoph Hellwig
  2018-10-01 15:31             ` Jason Gunthorpe
  2018-10-03 16:08           ` Jan Kara
  1 sibling, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2018-10-01  6:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, Jan Kara,
	Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

On Sat, Sep 29, 2018 at 04:46:09AM -0400, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > On 9/28/18 2:49 PM, Jerome Glisse wrote:
> > > On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
> > >> use a non-CPU device to read and write to "pinned" memory, especially when
> > >> that memory is backed by a file system.

"backed by a filesystem" is the biggest problem here.

> > >> I recall there were objections to just narrowly fixing the set_page_dirty()
> > >> bug, because the underlying problem is large and serious. So here we are.
> > > 
> > > Except that you can not solve that issue without proper hardware. GPUs are
> > > fine. RDMA is broken except for the mellanox5 hardware, which can invalidate
> > > its page table at any time, thus allowing it to write protect the page at
> > > any time.
> > 
> > Today, people are out there using RDMA without page-fault-capable hardware.
> > And they are hitting problems, as we've seen. From the discussions so far,
> > I don't think it's impossible to solve the problems, even for "lesser", 
> > non-fault-capable hardware.

This reminds me so much of Linux mmap() in the mid-2000s - mmap()
worked for ext3 without being aware of page faults, so most mm/
developers at the time were of the opinion that all the other
filesystems should work just fine without being aware of page
faults.

But some loud-mouthed idiot at SGI kept complaining that mmap()
could never be fixed for XFS without write fault notification
because of delayed allocation, unwritten extents and ENOSPC had to
be handled before mapped writes could be posted.  Eventually
Christoph Lameter got ->page_mkwrite into the page fault path and
the loud mouthed idiot finally got mmap() to work correctly on XFS:

commit 4f57dbc6b5bae5a3978d429f45ac597ca7a3b8c6
Author: David Chinner <dgc@sgi.com>
Date:   Thu Jul 19 16:28:17 2007 +1000

    [XFS] Implement ->page_mkwrite in XFS.
    
    Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages
    being written to. This allows us to do correct delayed allocation and
    ENOSPC checking as well as remap unwritten extents so that they get
    converted correctly during writeback. This is done via the generic
    block_page_mkwrite code.
    
    SGI-PV: 940392
    SGI-Modid: xfs-linux-melb:xfs-kern:29149a
    
    Signed-off-by: David Chinner <dgc@sgi.com>
    Signed-off-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Tim Shimmin <tes@sgi.com>

Nowadays, ->page_mkwrite is fundamental filesystem functionality -
copy-on-write filesystems like btrfs simply don't work if they can't
trigger COW on mapped write accesses. These days all the main linux
filesystems depend on write fault notifications in some way or
another for correct operation.

The way RDMA uses GUP to take references to file backed pages to
'stop them going away' is reminiscent of mmap() back before
->page_mkwrite(). i.e. it assumes that an initial read of the page
will populate the page state correctly for all future operations,
including set_page_dirty() after write accesses.

This is not a valid assumption - filesystems can have different
private clean vs dirty page state, and may need to perform
operations to take a page from clean to dirty.  Hence calling
set_page_dirty() on a file backed mapped page without first having
called ->page_mkwrite() is a bug.

RDMA does not call ->page_mkwrite on clean file backed pages before it
writes to them and calls set_page_dirty(), and hence RDMA to file
backed pages is completely unreliable. I'm not sure this can be
solved without having page fault capable RDMA hardware....
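
The ordering rule can be boiled down to a toy model (invented types;
the real interplay involves bufferheads, delayed allocation and so on):
a page dirtied without a preceding ->page_mkwrite leaves the filesystem
with no prepared state when writeback comes around.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ordering rule, not kernel code. */
struct toy_page {
	bool dirty;
	bool fs_ready_for_write;  /* e.g. blocks allocated, ENOSPC checked */
};

/* What ->page_mkwrite does: prepare fs state before the write lands. */
static void page_mkwrite(struct toy_page *p)
{
	p->fs_ready_for_write = true;
}

static void set_page_dirty(struct toy_page *p)
{
	p->dirty = true;
}

/* Writeback only works if the fs was told about the write first. */
static bool writeback(struct toy_page *p)
{
	if (p->dirty && !p->fs_ready_for_write)
		return false;		/* the crash-and-burn case */
	p->dirty = false;
	return true;
}
```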

> > > With the solution put forward here you can potentially wait _forever_ for
> > > the driver that holds a pin to drop it. This was the point i was trying to
> > > get across during LSF/MM. 

Right, but pinning via GUP is not an option for file backed pages
because the filesystem is completely unaware of these references.
i.e. waiting forever isn't an issue here because the filesystem
never waits on them. Instead, they are a filesystem corruption
vector because the filesystem can invalidate those mappings and free
the backing store while they are still in use by RDMA.

Hence for DAX filesystems, this leaves the RDMA app with direct
access to the physical storage even though the filesystem has freed
the space it is accessing. This is a use after free of the physical
storage that the filesystem cannot control, and why DAX+RDMA is
disabled right now.

We could address these use-after-free situations via forcing RDMA to
use file layout leases and revoke the lease when we need to modify
the backing store on leased files. However, this doesn't solve the
need for filesystems to receive write fault notifications via
->page_mkwrite.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-10-01  6:11           ` Dave Chinner
@ 2018-10-01 12:47             ` Christoph Hellwig
  2018-10-02  1:14               ` Dave Chinner
  2018-10-01 15:31             ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2018-10-01 12:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, John Hubbard, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Jason Gunthorpe, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:
> This reminds me so much of Linux mmap() in the mid-2000s - mmap()
> worked for ext3 without being aware of page faults,

And "worked" still is a bit of a stretch, as soon as you'd get
ENOSPC it would still blow up badly.  Which probably makes it an
even better analogy to the current case.

> RDMA does not call ->page_mkwrite on clean file backed pages before it
> writes to them and calls set_page_dirty(), and hence RDMA to file
> backed pages is completely unreliable. I'm not sure this can be
> solved without having page fault capable RDMA hardware....

We can always software prefault at gup time.  And also remember that
while RDMA might be the case at least some people care about here it
really isn't different from any of the other gup + I/O cases, including
doing direct I/O to a mmap area.  The only difference in the various
cases is how long the area should be pinned down - some users like RDMA
want a long term mapping, while others like direct I/O just need a short
transient one.

> We could address these use-after-free situations via forcing RDMA to
> use file layout leases and revoke the lease when we need to modify
> the backing store on leased files. However, this doesn't solve the
> need for filesystems to receive write fault notifications via
> ->page_mkwrite.

Exactly.   We need three things here:

 - notification to the filesystem that a page is (possibly) being
   written to
 - a way to block fs operations while the pages are pinned
 - a way to distinguish between short and long term mappings,
   and only allow long-term mappings if they can be broken
   using something like leases

I'm also pretty sure we already explained this a long time ago when the
issue came up last year, so I'm not sure why this is even still
contentious.
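
A toy sketch of the third point, with invented names rather than any
real kernel API: a long-term pin is only granted against a lease the
filesystem can later revoke, while a transient direct-I/O style pin
needs none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lease object the filesystem hands out and can revoke. */
struct lease {
	bool valid;
};

enum pin_kind { PIN_SHORT_TERM, PIN_LONG_TERM };

/*
 * Grant a pin on a user range. Short-term pins always succeed;
 * long-term pins require a currently valid lease.
 */
static bool pin_user_range(enum pin_kind kind, struct lease *l)
{
	if (kind == PIN_LONG_TERM)
		return l && l->valid;	/* no lease, no long-term pin */
	return true;			/* transient pin, e.g. direct I/O */
}
```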

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-29 16:21       ` Matthew Wilcox
  2018-09-29 19:19         ` Jason Gunthorpe
@ 2018-10-01 12:50         ` Christoph Hellwig
  2018-10-01 15:29           ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2018-10-01 12:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, Jason Gunthorpe, john.hubbard, Michal Hocko,
	Christopher Lameter, Dan Williams, Jan Kara, Al Viro, linux-mm,
	LKML, linux-rdma, linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On Sat, Sep 29, 2018 at 09:21:17AM -0700, Matthew Wilcox wrote:
> > being slow to pick it up. It looks like there are several patterns, and
> > we have to support both set_page_dirty() and set_page_dirty_lock(). So
> > the best combination looks to be adding a few variations of
> > release_user_pages*(), but leaving put_user_page() alone, because it's
> > the "do it yourself" basic one. Scatter-gather will be stuck with that.
> 
> I think our current interfaces are wrong.  We should really have a
> get_user_sg() / put_user_sg() function that will set up / destroy an
> SG list appropriate for that range of user memory.  This is almost
> orthogonal to the original intent here, so please don't see this as a
> "must do first" kind of argument that might derail the whole thing.

The SG list really is the wrong interface, as it mixes up information
about the pages/phys addr range and a potential dma mapping.  I think
the right interface is an array of bio_vecs.  In fact I've recently
been looking into a get_user_pages variant that does fill bio_vecs,
as it fundamentally is the right thing for doing I/O on large pages,
and will really help with direct I/O performance in that case.
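
As a rough userspace model (toy types, not the kernel's struct bio_vec)
of filling such an array from a user range: each vec carries (page,
offset, length), so the first and last entries can cover partial pages,
and a huge-page-aware variant could emit one vec per huge page.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096ul

/* Minimal bio_vec shape: (page, offset, length). */
struct toy_bvec {
	unsigned long page_index;  /* stands in for struct page * */
	unsigned int offset;
	unsigned int len;
};

/*
 * Fill bio_vecs covering the user range [addr, addr + count).
 * One vec per base page here; a huge-page-aware version would emit
 * one vec spanning the whole huge page instead.
 */
size_t fill_bvecs(unsigned long addr, size_t count, struct toy_bvec *bv)
{
	size_t n = 0;

	while (count) {
		unsigned int off = addr & (TOY_PAGE_SIZE - 1);
		unsigned int len = TOY_PAGE_SIZE - off;

		if (len > count)
			len = count;
		bv[n].page_index = addr / TOY_PAGE_SIZE;
		bv[n].offset = off;
		bv[n].len = len;
		n++;
		addr += len;
		count -= len;
	}
	return n;
}
```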

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-29  3:12     ` John Hubbard
  2018-09-29 16:21       ` Matthew Wilcox
@ 2018-10-01 14:35       ` Dennis Dalessandro
  2018-10-03  5:40         ` John Hubbard
  2018-10-03 16:27       ` Jan Kara
  2 siblings, 1 reply; 29+ messages in thread
From: Dennis Dalessandro @ 2018-10-01 14:35 UTC (permalink / raw)
  To: John Hubbard, Jason Gunthorpe, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Doug Ledford, Mike Marciniszyn, Christian Benvenuti

On 9/28/2018 11:12 PM, John Hubbard wrote:
> On 9/28/18 8:39 AM, Jason Gunthorpe wrote:
>> On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubbard@gmail.com wrote:
>>> From: John Hubbard <jhubbard@nvidia.com>
> [...]
>>>
>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>>> index a41792dbae1f..9430d697cb9f 100644
>>> +++ b/drivers/infiniband/core/umem.c
>>> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>>>   		page = sg_page(sg);
>>>   		if (!PageDirty(page) && umem->writable && dirty)
>>>   			set_page_dirty_lock(page);
>>> -		put_page(page);
>>> +		put_user_page(page);
>>
>> Would it make sense to have a release/put_user_pages_dirtied to absorb
>> the set_page_dirty pattern too? I notice in this patch there is some
>> variety here, I wonder what is the right way?
>>
>> Also, I'm told this code here is a big performance bottleneck when the
>> number of pages becomes very long (think >> GB of memory), so having a
>> future path to use some kind of batching/threading sounds great.
>>
> 
> Yes. And you asked for this the first time, too. Consistent! :) Sorry for
> being slow to pick it up. It looks like there are several patterns, and
> we have to support both set_page_dirty() and set_page_dirty_lock(). So
> the best combination looks to be adding a few variations of
> release_user_pages*(), but leaving put_user_page() alone, because it's
> the "do it yourself" basic one. Scatter-gather will be stuck with that.
> 
> Here's a differential patch with that, that shows a nice little cleanup in
> a couple of IB places, and as you point out, it also provides the hooks for
> performance upgrades (via batching) in the future.
> 
> Does this API look about right?

I'm on board with that and the changes to hfi1 and qib.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-10-01 12:50         ` Christoph Hellwig
@ 2018-10-01 15:29           ` Matthew Wilcox
  2018-10-01 15:51             ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2018-10-01 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Hubbard, Jason Gunthorpe, john.hubbard, Michal Hocko,
	Christopher Lameter, Dan Williams, Jan Kara, Al Viro, linux-mm,
	LKML, linux-rdma, linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On Mon, Oct 01, 2018 at 05:50:13AM -0700, Christoph Hellwig wrote:
> On Sat, Sep 29, 2018 at 09:21:17AM -0700, Matthew Wilcox wrote:
> > > being slow to pick it up. It looks like there are several patterns, and
> > > we have to support both set_page_dirty() and set_page_dirty_lock(). So
> > > the best combination looks to be adding a few variations of
> > > release_user_pages*(), but leaving put_user_page() alone, because it's
> > > the "do it yourself" basic one. Scatter-gather will be stuck with that.
> > 
> > I think our current interfaces are wrong.  We should really have a
> > get_user_sg() / put_user_sg() function that will set up / destroy an
> > SG list appropriate for that range of user memory.  This is almost
> > orthogonal to the original intent here, so please don't see this as a
> > "must do first" kind of argument that might derail the whole thing.
> 
> The SG list really is the wrong interface, as it mixes up information
> about the pages/phys addr range and a potential dma mapping.  I think
> the right interface is an array of bio_vecs.  In fact I've recently
> been looking into a get_user_pages variant that does fill bio_vecs,
> as it fundamentally is the right thing for doing I/O on large pages,
> and will really help with direct I/O performance in that case.

I don't think the bio_vec is really a big improvement; it's just a (page,
offset, length) tuple.  Not to mention that due to the annoying divergence
between block and networking [1] this is actually a less useful interface.

I don't understand the dislike of the sg list.  Other than for special
cases which we shouldn't be optimising for (ramfs, brd, loopback
filesystems), when we get a page to do I/O, we're going to want a dma
mapping for them.  It makes sense to already allocate space to store
the mapping at the outset.

[1] Can we ever admit that the bio_vec and the skb_frag_t are actually
the same thing?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-10-01  6:11           ` Dave Chinner
  2018-10-01 12:47             ` Christoph Hellwig
@ 2018-10-01 15:31             ` Jason Gunthorpe
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2018-10-01 15:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, John Hubbard, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Dan Williams, Jan Kara,
	Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:

> This reminds me so much of Linux mmap() in the mid-2000s - mmap()
> worked for ext3 without being aware of page faults, so most mm/
> developers at the time were of the opinion that all the other
> filesystems should work just fine without being aware of page
> faults.

This is probably because RDMA was introduced around that time, before
page_mkwrite was added.. It kind of sounds like page_mkwrite() broke
all the writing get_user_pages users and it wasn't realized at the
time.

BTW, we are hosting a session on this at Plumbers during the RDMA
track, hope everyone interested will be able to attend.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-10-01 15:29           ` Matthew Wilcox
@ 2018-10-01 15:51             ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2018-10-01 15:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, John Hubbard, Jason Gunthorpe, john.hubbard,
	Michal Hocko, Christopher Lameter, Dan Williams, Jan Kara,
	Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel, Doug Ledford,
	Mike Marciniszyn, Dennis Dalessandro, Christian Benvenuti

On Mon, Oct 01, 2018 at 08:29:29AM -0700, Matthew Wilcox wrote:
> I don't understand the dislike of the sg list.  Other than for special
> cases which we shouldn't be optimising for (ramfs, brd, loopback
> filesystems), when we get a page to do I/O, we're going to want a dma
> mapping for them.  It makes sense to already allocate space to store
> the mapping at the outset.

We don't actually need the space - the scatterlist forces it on us,
otherwise we could translate directly into the on-disk format and
save that duplicate space.  I have prototypes for NVMe and RDMA that do
away with the scatterlist entirely.

And even if we are still using the scatterlist as we do right now we'd
need a second scatterlist at least for block / file system based I/O
as we can't plug the scatterlist into the I/O stack (never mind that
due to splitting and merging the lower one might not map 1:1 to the upper
one).

> [1] Can we ever admit that the bio_vec and the skb_frag_t are actually
> the same thing?

When I brought this up years ago the networking folks insisted that
their use of u16 offset/size fields was important for performance,
while for bio_vecs we needed the larger ones for some cases.  Since
then networking switched to 32-bit fields for what is now the fast
path, so it might be worth giving it another spin.

That should also help with using my new bio_vec based dma-mapping
helpers to batch iommu mappings in networking, which Jesper had on
his todo list as all the indirect calls are causing performance
issues.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-10-01 12:47             ` Christoph Hellwig
@ 2018-10-02  1:14               ` Dave Chinner
  2018-10-03 16:21                 ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2018-10-02  1:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, John Hubbard, john.hubbard, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Jason Gunthorpe, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

On Mon, Oct 01, 2018 at 05:47:57AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:
> > This reminds me so much of Linux mmap() in the mid-2000s - mmap()
> > worked for ext3 without being aware of page faults,
> 
> And "worked" still is a bit of a stretch, as soon as you'd get
> ENOSPC it would still blow up badly.  Which probably makes it an
> even better analogy to the current case.
> 
> > RDMA does not call ->page_mkwrite on clean file backed pages before it
> > writes to them and calls set_page_dirty(), and hence RDMA to file
> > backed pages is completely unreliable. I'm not sure this can be
> > solved without having page fault capable RDMA hardware....
> 
> We can always software prefault at gup time.

I'm not sure that's sufficient - we've got a series of panics from
machines running ext4+RDMA where there are no bufferheads on dirty
pages at writeback time. This was also reproducible on versions of
XFS that used bufferheads.

We suspect that memory reclaim has tripped the bufferhead stripping
threshold (yeah, that old ext3 hack to avoid writeback deadlocks
under memory pressure), hence removed the bufferheads from clean
mapped pages while RDMA has them pinned. And then some time later
after set_page_dirty() was called on them the filesystem's page
writeback code crashes and burns....

i.e. just because the page was in a known state when it was
pinned, it doesn't mean it will remain in that state until it is
unpinned....

> And also remember that
> while RDMA might be the case at least some people care about here it
> really isn't different from any of the other gup + I/O cases, including
> doing direct I/O to a mmap area.  The only difference in the various
> cases is how long the area should be pinned down - some users like RDMA
> want a long term mapping, while others like direct I/O just need a short
> transient one.

Yup, now that I'm aware of all those little intricacies with gup I
always try to consider what impact they have...

> > We could address these use-after-free situations via forcing RDMA to
> > use file layout leases and revoke the lease when we need to modify
> > the backing store on leased files. However, this doesn't solve the
> > need for filesystems to receive write fault notifications via
> > ->page_mkwrite.
> 
> Exactly.   We need three things here:
> 
>  - notification to the filesystem that a page is (possibly) being
>    written to
>  - a way to block fs operations while the pages are pinned
>  - a way to distinguish between short and long term mappings,
>    and only allow long-term mappings if they can be broken
>    using something like leases
> 
> I'm also pretty sure we already explained this a long time ago when the
> issue came up last year, so I'm not sure why this is even still
> contentious.

I suspect that it's simply because these discussions have been
spread across different groups and not everyone is aware of what the
other groups are discussing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-10-01 14:35       ` Dennis Dalessandro
@ 2018-10-03  5:40         ` John Hubbard
  0 siblings, 0 replies; 29+ messages in thread
From: John Hubbard @ 2018-10-03  5:40 UTC (permalink / raw)
  To: Dennis Dalessandro, Jason Gunthorpe, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter, Dan Williams,
	Jan Kara, Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Doug Ledford, Mike Marciniszyn, Christian Benvenuti

On 10/1/18 7:35 AM, Dennis Dalessandro wrote:
> On 9/28/2018 11:12 PM, John Hubbard wrote:
>> On 9/28/18 8:39 AM, Jason Gunthorpe wrote:
>>> On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>> [...]
>>>>
>>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>>>> index a41792dbae1f..9430d697cb9f 100644
>>>> +++ b/drivers/infiniband/core/umem.c
>>>> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>>>>           page = sg_page(sg);
>>>>           if (!PageDirty(page) && umem->writable && dirty)
>>>>               set_page_dirty_lock(page);
>>>> -        put_page(page);
>>>> +        put_user_page(page);
>>>
>>> Would it make sense to have a release/put_user_pages_dirtied to absorb
>>> the set_page_dirty pattern too? I notice in this patch there is some
>>> variety here, I wonder what is the right way?
>>>
>>> Also, I'm told this code here is a big performance bottleneck when the
>>> number of pages becomes very long (think >> GB of memory), so having a
>>> future path to use some kind of batching/threading sounds great.
>>>
>>
>> Yes. And you asked for this the first time, too. Consistent! :) Sorry for
>> being slow to pick it up. It looks like there are several patterns, and
>> we have to support both set_page_dirty() and set_page_dirty_lock(). So
>> the best combination looks to be adding a few variations of
>> release_user_pages*(), but leaving put_user_page() alone, because it's
>> the "do it yourself" basic one. Scatter-gather will be stuck with that.
>>
>> Here's a differential patch with that, that shows a nice little cleanup in
>> a couple of IB places, and as you point out, it also provides the hooks for
>> performance upgrades (via batching) in the future.
>>
>> Does this API look about right?
> 
> I'm on board with that and the changes to hfi1 and qib.
> 
> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>

Hi Dennis, thanks for the review!

I'll add those new routines in and send out a v2 soon, now that it appears, from 
the recent discussion, that this aspect of the approach is still viable.
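
For the curious, a rough userspace model of the proposed calls (toy
types; the real routines operate on struct page and its refcount, and
the names may still change in v2):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct page, just enough to show the bookkeeping. */
struct toy_page {
	int refcount;
	bool dirty;
};

/* The "do it yourself" basic call: drop one user-pin reference. */
static void put_user_page(struct toy_page *p)
{
	p->refcount--;
}

/* Bulk release, the common loop absorbed into one helper. */
static void release_user_pages(struct toy_page **pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		put_user_page(pages[i]);
}

/* Variant absorbing the set_page_dirty_lock() pattern seen in IB. */
static void release_user_pages_dirty_lock(struct toy_page **pages,
					  size_t n)
{
	for (size_t i = 0; i < n; i++) {
		pages[i]->dirty = true;
		put_user_page(pages[i]);
	}
}
```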


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-09-29  8:46         ` Jerome Glisse
  2018-10-01  6:11           ` Dave Chinner
@ 2018-10-03 16:08           ` Jan Kara
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Kara @ 2018-10-03 16:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, Jan Kara,
	Al Viro, linux-mm, LKML, linux-rdma, linux-fsdevel,
	Christian Benvenuti, Dennis Dalessandro, Doug Ledford,
	Mike Marciniszyn

On Sat 29-09-18 04:46:09, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > Actually, the latest direction on that discussion was toward periodically
> > writing back, even while under RDMA, via bounce buffers:
> > 
> >   https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
> > 
> > I still think that's viable. Of course, there are other things besides 
> > writeback (see below) that might also lead to waiting.
> 
> Write back under a bounce buffer is fine; when looking back at the links you
> provided, the solution that was discussed was blocking in page_mkclean(),
> which is horrible from my point of view.

Yeah, after looking into it for some time, we figured that waiting for page
pins in page_mkclean() isn't really going to fly due to deadlocks. So we
came up with the bounce buffers idea which should solve that nicely.

> > > With the solution put forward here you can potentially wait _forever_ for
> > > the driver that holds a pin to drop it. This was the point i was trying to
> > > get across during LSF/MM. 
> > 
> > I agree that just blocking indefinitely is generally unacceptable for kernel
> > code, but we can probably avoid it for many cases (bounce buffers), and
> > if we think it is really appropriate (file system unmounting, maybe?) then
> > maybe tolerate it in some rare cases.  
> > 
> > > You can not fix broken hardware that decided to
> > > use GUP to do a feature it can't reliably do because its hardware is
> > > not capable of behaving.
> > > 
> > > Because code is easier here is what i was meaning:
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826
> > > 
> > 
> > While that may work sometimes, I don't think it is reliable enough to trust for
> > identifying pages that have been gup-pinned. There's just too much overloading of
> > other mechanisms going on there, and if we pile on top with this constraint of "if you
> > have +3 refcounts, and this particular combination of page counts and mapcounts, then
> > you're definitely a long-term pinned page", I think users will find a lot of corner
> > cases for us that break that assumption. 
> 
> So the mapcount == refcount (modulo extra reference for mapping and
> private) should hold; here are the cases when it does not:
>     - page being migrated
>     - page being isolated from LRU
>     - mempolicy changes against the page
>     - page cache lookup
>     - some file system activities
>     - i likely miss a couple here, i am doing that from memory
> 
> What matters is that all of the above are transitory, the extra reference
> only lasts for as long as it takes for the action to finish (migration,
> mempolicy change, ...).
> 
> So skipping those false-positive pages while reclaiming likely makes sense,
> the blocking free buffer maybe not.

Well, as John wrote, these page refcounts are fragile (and actually
filesystem dependent as some filesystems hold page reference from their
page->private data and some don't). So I think we really need a new
reliable mechanism for tracking page references from GUP. And John works
towards that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
  2018-10-02  1:14               ` Dave Chinner
@ 2018-10-03 16:21                 ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2018-10-03 16:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Jerome Glisse, John Hubbard, john.hubbard,
	Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Christian Benvenuti,
	Dennis Dalessandro, Doug Ledford, Mike Marciniszyn

On Tue 02-10-18 11:14:47, Dave Chinner wrote:
> On Mon, Oct 01, 2018 at 05:47:57AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:
> > > This reminds me so much of Linux mmap() in the mid-2000s - mmap()
> > > worked for ext3 without being aware of page faults,
> > 
> > And "worked" still is a bit of a stretch, as soon as you'd get
> > ENOSPC it would still blow up badly.  Which probably makes it an
> > even better analogy to the current case.
> > 
> > > RDMA does not call ->page_mkwrite on clean file backed pages before it
> > > writes to them and calls set_page_dirty(), and hence RDMA to file
> > > backed pages is completely unreliable. I'm not sure this can be
> > > solved without having page fault capable RDMA hardware....
> > 
> > We can always software prefault at gup time.
> 
> I'm not sure that's sufficient - we've got a series of panics from
> machines running ext4+RDMA where there are no bufferheads on dirty
> pages at writeback time. This was also reproducible on versions of
> XFS that used bufferheads.
> 
> We suspect that memory reclaim has tripped the bufferhead stripping
> threshold (yeah, that old ext3 hack to avoid writeback deadlocks
> under memory pressure), hence removed the bufferheads from clean
> mapped pages while RDMA has them pinned. And then some time later
> after set_page_dirty() was called on them the filesystem's page
> writeback code crashes and burns....
> 
> i.e. just because the page was in a known state when it was
> pinned, it doesn't mean it will remain in that state until it is
> unpinned....

Exactly. The problem really is that a GUP user can call set_page_dirty() after
the page has been unmapped from all page tables, and thus the filesystem is
totally unprepared to handle that.

> > And also remember that
> > while RDMA might be the case at least some people care about here it
> > really isn't different from any of the other gup + I/O cases, including
> > doing direct I/O to a mmap area.  The only difference in the various
> > cases is how long the area should be pinned down - some users like RDMA
> > want a long term mapping, while others like direct I/O just need a short
> > transient one.
> 
> Yup, now that I'm aware of all those little intricacies with gup I
> always try to consider what impact they have...
> 
> > > We could address these use-after-free situations via forcing RDMA to
> > > use file layout leases and revoke the lease when we need to modify
> > > the backing store on leased files. However, this doesn't solve the
> > > need for filesystems to receive write fault notifications via
> > > ->page_mkwrite.
> > 
> > Exactly.   We need three things here:
> > 
> >  - notification to the filesystem that a page is (possibly) being
> >    written to
> >  - a way to block fs operations while the pages are pinned
> >  - a way to distinguish between short and long term mappings,
> >    and only allow long-term mappings if they can be broken
> >    using something like leases

Actually, when using the page cache, we should be fine with just knowing the
page is pinned by GUP. Writeback can then use bounce buffers to handle
data integrity writeout requests, skip periodic writeback, and keep the page
dirty the whole time it is pinned. That should deal with all the problems we
are currently aware of.

With DAX more is needed, as truncate cannot proceed while there are pending
GUP references. So there we really need everything you mention above. But for
DAX we have much more wiggle room in how we define the behavior and what we
declare as unsupported, since DAX is new and was never really used with RDMA
and the like.

> > I'm also pretty sure we already explained this a long time ago when the
> > issue came up last year, so I'm not sure why this is even still
> > contentious.
> 
> I suspect that it's simply because these discussions have been
> spread across different groups and not everyone is aware of what the
> other groups are discussing...

Yes, I have the same feeling.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] mm: introduce put_user_page(), placeholder version
  2018-09-28  5:39 ` [PATCH 2/4] mm: introduce put_user_page(), placeholder version john.hubbard
@ 2018-10-03 16:22   ` Jan Kara
  2018-10-03 23:23     ` John Hubbard
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2018-10-03 16:22 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

On Thu 27-09-18 22:39:48, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also adds release_user_pages(), a drop-in replacement for
> release_pages(). This is intended to be easily grep-able,
> for later performance improvements, since release_user_pages()
> is not batched like release_pages() is, and is therefore
> significantly slower.

A small nit, but can we maybe call this put_user_pages() for symmetry with
put_user_page()? I don't really care too much, but it would look more
natural to me.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-09-29  3:12     ` John Hubbard
  2018-09-29 16:21       ` Matthew Wilcox
  2018-10-01 14:35       ` Dennis Dalessandro
@ 2018-10-03 16:27       ` Jan Kara
  2018-10-03 23:19         ` John Hubbard
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2018-10-03 16:27 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Dan Williams, Jan Kara, Al Viro, linux-mm,
	LKML, linux-rdma, linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On Fri 28-09-18 20:12:33, John Hubbard wrote:
>  static inline void release_user_pages(struct page **pages,
> -                                     unsigned long npages)
> +                                     unsigned long npages,
> +                                     bool set_dirty)
>  {
> -       while (npages)
> -               put_user_page(pages[--npages]);
> +       if (set_dirty)
> +               release_user_pages_dirty(pages, npages);
> +       else
> +               release_user_pages_basic(pages, npages);
> +}

Is there a good reason to have this take a set_dirty argument? Generally bool
arguments are not great for readability (or greppability for that matter).
Also in this case callers can just as easily do:
	if (set_dirty)
		release_user_pages_dirty(...);
	else
		release_user_pages(...);

And furthermore it makes the code author think harder about whether he needs
set_page_dirty() or set_page_dirty_lock(), rather than just passing 'true'
and hoping the function magically does the right thing for him.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call
  2018-10-03 16:27       ` Jan Kara
@ 2018-10-03 23:19         ` John Hubbard
  0 siblings, 0 replies; 29+ messages in thread
From: John Hubbard @ 2018-10-03 23:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, john.hubbard, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Dan Williams, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel, Doug Ledford, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti

On 10/3/18 9:27 AM, Jan Kara wrote:
> On Fri 28-09-18 20:12:33, John Hubbard wrote:
>>  static inline void release_user_pages(struct page **pages,
>> -                                     unsigned long npages)
>> +                                     unsigned long npages,
>> +                                     bool set_dirty)
>>  {
>> -       while (npages)
>> -               put_user_page(pages[--npages]);
>> +       if (set_dirty)
>> +               release_user_pages_dirty(pages, npages);
>> +       else
>> +               release_user_pages_basic(pages, npages);
>> +}
> 
> Is there a good reason to have this take a set_dirty argument? Generally bool
> arguments are not great for readability (or greppability for that matter).
> Also in this case callers can just as easily do:
> 	if (set_dirty)
> 		release_user_pages_dirty(...);
> 	else
> 		release_user_pages(...);
> 
> And furthermore it makes the code author think harder about whether he needs
> set_page_dirty() or set_page_dirty_lock(), rather than just passing 'true'
> and hoping the function magically does the right thing for him.
> 

Ha, I went through *precisely* that argument in my head, too--and then
got seduced by the idea that it pretties up the existing calling code,
because it's a drop-in one-liner at the call sites. But yes, I'll change it 
back to omit the bool set_dirty argument.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] mm: introduce put_user_page(), placeholder version
  2018-10-03 16:22   ` Jan Kara
@ 2018-10-03 23:23     ` John Hubbard
  0 siblings, 0 replies; 29+ messages in thread
From: John Hubbard @ 2018-10-03 23:23 UTC (permalink / raw)
  To: Jan Kara, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Al Viro, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On 10/3/18 9:22 AM, Jan Kara wrote:
> On Thu 27-09-18 22:39:48, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also adds release_user_pages(), a drop-in replacement for
>> release_pages(). This is intended to be easily grep-able,
>> for later performance improvements, since release_user_pages()
>> is not batched like release_pages() is, and is therefore
>> significantly slower.
> 
> A small nit, but can we maybe call this put_user_pages() for symmetry with
> put_user_page()? I don't really care too much, but it would look more
> natural to me.
> 

Sure. It started out as "make it a drop-in replacement for release_pages()",
but now it's not quite a drop-in replacement anymore. And in any case it's an 
opportunity to make the name more intuitive, so that seems like a good
idea.

If anyone hates put_user_pages() and wants to campaign relentlessly for
release_pages*(), then now is the time! :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, back to index

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-28  5:39 [PATCH 0/4] get_user_pages*() and RDMA: first steps john.hubbard
2018-09-28  5:39 ` [PATCH 1/4] mm: get_user_pages: consolidate error handling john.hubbard
2018-09-28  5:39 ` [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call john.hubbard
2018-09-28 15:39   ` Jason Gunthorpe
2018-09-29  3:12     ` John Hubbard
2018-09-29 16:21       ` Matthew Wilcox
2018-09-29 19:19         ` Jason Gunthorpe
2018-10-01 12:50         ` Christoph Hellwig
2018-10-01 15:29           ` Matthew Wilcox
2018-10-01 15:51             ` Christoph Hellwig
2018-10-01 14:35       ` Dennis Dalessandro
2018-10-03  5:40         ` John Hubbard
2018-10-03 16:27       ` Jan Kara
2018-10-03 23:19         ` John Hubbard
2018-09-28  5:39 ` [PATCH 2/4] mm: introduce put_user_page(), placeholder version john.hubbard
2018-10-03 16:22   ` Jan Kara
2018-10-03 23:23     ` John Hubbard
2018-09-28  5:39 ` [PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call john.hubbard
2018-09-28 15:29 ` [PATCH 0/4] get_user_pages*() and RDMA: first steps Jerome Glisse
2018-09-28 19:06   ` John Hubbard
2018-09-28 21:49     ` Jerome Glisse
2018-09-29  2:28       ` John Hubbard
2018-09-29  8:46         ` Jerome Glisse
2018-10-01  6:11           ` Dave Chinner
2018-10-01 12:47             ` Christoph Hellwig
2018-10-02  1:14               ` Dave Chinner
2018-10-03 16:21                 ` Jan Kara
2018-10-01 15:31             ` Jason Gunthorpe
2018-10-03 16:08           ` Jan Kara
