* [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages
@ 2018-10-12  6:00 john.hubbard
  2018-10-12  6:00 ` [PATCH 1/6] mm: get_user_pages: consolidate error handling john.hubbard
                   ` (5 more replies)
  0 siblings, 6 replies; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Here is an updated proposal for tracking pages that have had
get_user_pages*() called on them. This is in support of fixing the problem
discussed in [1]. This RFC only shows how to set up a reliable
PageDmaPinned flag. What to *do* with that flag is left for a later
discussion.

I'm providing this in order to help the discussion about patches 1-3, which
I'm hoping to check in first. The sequence would be:

    -- apply patches 1-3, convert the rest of the subsystems to call
       put_user_page*(), then

    -- apply patches 4-6, then

    -- Apply more patches, to actually use the new PageDmaPinned flag.

One question up front is, "how do we ensure that either put_user_page()
or put_page() is called, depending on whether the page came from
get_user_pages() or not?". From this series, you can see that:

    -- It's possible to assert, within put_user_page(), that we are in the
       right place.

    -- It's less clear that there is a way to assert within put_page(),
       because put_page() is called from put_user_page(), and PageDmaPinned
       may or may not be set--either case is valid.

       Opinions and ideas are welcome there.
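
For reference, here is a rough sketch of the kind of assertion that
put_user_page() can make, once the PageDmaPinned flag from patches 4-6 is
in place (this mirrors what patch 6 does; it is illustrative, not final):

    void put_user_page(struct page *page)
    {
            page = compound_head(page);

            /* Only gup-pinned pages should come through here: */
            VM_BUG_ON_PAGE(!PageDmaPinned(page), page);

            /* ...drop page->dma_pinned_count, then put_page()... */
    }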

This is a lightly tested example (it boots up on x86_64, and simply lets the
dma-pinned pages leak in all non-infiniband cases...which is all cases, on
my particular test computer). This series does the following (a short
caller-side sketch follows the list):

a) Provides the put_user_page*() routines that have been discussed in
   another thread (patch 2).

b) Provides a single example of converting some code (infiniband) to use
   those routines (patch 3).

c) Connects up get_user_pages*() to use the new refcounting and flags
   fields (patches 4-6).
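
A converted call site then ends up looking roughly like this (a sketch
only; the exact gup flags and error handling depend on the caller):

    ret = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    if (ret <= 0)
            goto err;

    /* ...set up and perform DMA to/from the pages... */

    put_user_pages_dirty_lock(pages, ret);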

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

John Hubbard (6):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()
  mm: introduce page->dma_pinned_flags, _count
  mm: introduce zone_gup_lock, for dma-pinned pages
  mm: track gup pages with page->dma_pinned_* fields

 drivers/infiniband/core/umem.c              |   7 +-
 drivers/infiniband/core/umem_odp.c          |   2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  11 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  11 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   7 +-
 include/linux/mm.h                          |   9 ++
 include/linux/mm_types.h                    |  22 +++-
 include/linux/mmzone.h                      |   6 +
 include/linux/page-flags.h                  |  47 +++++++
 mm/gup.c                                    |  93 +++++++++++---
 mm/memcontrol.c                             |   7 +
 mm/page_alloc.c                             |   1 +
 mm/swap.c                                   | 134 ++++++++++++++++++++
 15 files changed, 319 insertions(+), 50 deletions(-)

-- 
2.19.1

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/6] mm: get_user_pages: consolidate error handling
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12  6:30   ` Balbir Singh
  2018-10-12  6:00 ` [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions john.hubbard
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited anyway, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.
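
With this change, the function's exit path boils down to roughly the
following (the comment marks where an upcoming patch will hook in):

    out:
            /* ...an upcoming patch operates on pages[0..i) here... */
            return i ? i : err;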

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
 	long i = 0;
+	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (!vma || start >= vma->vm_end) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
-				int ret;
-				ret = get_gate_page(mm, start & PAGE_MASK,
+				err = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
 						pages ? &pages[i] : NULL);
-				if (ret)
-					return i ? : ret;
+				if (err)
+					goto out;
 				page_mask = 0;
 				goto next_page;
 			}
 
-			if (!vma || check_vma_flags(vma, gup_flags))
-				return i ? : -EFAULT;
+			if (!vma || check_vma_flags(vma, gup_flags)) {
+				err = -EFAULT;
+				goto out;
+			}
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		 * If we have a pending SIGKILL, don't keep faulting pages and
 		 * potentially allocating memory.
 		 */
-		if (unlikely(fatal_signal_pending(current)))
-			return i ? i : -ERESTARTSYS;
+		if (unlikely(fatal_signal_pending(current))) {
+			err = -ERESTARTSYS;
+			goto out;
+		}
 		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
-			int ret;
-			ret = faultin_page(tsk, vma, start, &foll_flags,
+			err = faultin_page(tsk, vma, start, &foll_flags,
 					nonblocking);
-			switch (ret) {
+			switch (err) {
 			case 0:
 				goto retry;
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				err = 0;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			err = PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+
+out:
+	return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
  2018-10-12  6:00 ` [PATCH 1/6] mm: get_user_pages: consolidate error handling john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12  7:35   ` Balbir Singh
  2018-10-12  6:00 ` [PATCH 3/6] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel,
	John Hubbard, Al Viro, Jerome Glisse, Christoph Hellwig,
	Ralph Campbell

From: John Hubbard <jhubbard@nvidia.com>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.
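
As an example of the intended conversion (illustrative only; the real
conversions are done in later patches), a typical open-coded release loop
such as:

    for (i = 0; i < npages; i++) {
            if (dirty)
                    set_page_dirty_lock(pages[i]);
            put_page(pages[i]);
    }

becomes:

    if (dirty)
            put_user_pages_dirty_lock(pages, npages);
    else
            put_user_pages(pages, npages);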

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Christopher Lameter <cl@linux.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Jan Kara <jack@suse.cz>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Ralph Campbell <rcampbell@nvidia.com>

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 20 +++++++++++
 mm/swap.c          | 83 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..76d18aada9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -943,6 +943,26 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+	put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index 26fc9b5f1b6c..efab3a6b6f91 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -134,6 +134,89 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			set_page_dirty(page);
+
+		put_user_page(page);
+	}
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+
+		if (!PageDirty(page))
+			set_page_dirty_lock(page);
+
+		put_user_page(page);
+	}
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/*
+ * put_user_pages() - for each page in the @pages array, release the page
+ * using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+	unsigned long index;
+
+	for (index = 0; index < npages; index++)
+		put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 /*
  * get_kernel_pages() - pin kernel pages in memory
  * @kiov:	An array of struct kvec structures
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/6] infiniband/mm: convert put_page() to put_user_page*()
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
  2018-10-12  6:00 ` [PATCH 1/6] mm: get_user_pages: consolidate error handling john.hubbard
  2018-10-12  6:00 ` [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12  6:00 ` [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel,
	John Hubbard, Doug Ledford, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti

From: John Hubbard <jhubbard@nvidia.com>

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

CC: Doug Ledford <dledford@redhat.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Christian Benvenuti <benve@cisco.com>

CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  7 ++++---
 drivers/infiniband/core/umem_odp.c          |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     | 11 ++++-------
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++++-------
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  7 ++++---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..7ab7a3a35eb4 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
 		page = sg_page(sg);
-		if (!PageDirty(page) && umem->writable && dirty)
-			set_page_dirty_lock(page);
-		put_page(page);
+		if (umem->writable && dirty)
+			put_user_pages_dirty_lock(&page, 1);
+		else
+			put_user_page(page);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
 					ret = -EFAULT;
 					break;
 				}
-				put_page(local_page_list[j]);
+				put_user_page(local_page_list[j]);
 				continue;
 			}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	size_t i;
-
-	for (i = 0; i < npages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, npages);
+	else
+		put_user_pages(p, npages);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_page(pages[0]);
+		put_user_page(pages[0]);
 		goto out;
 	}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_page(sg_page(&db_tab->page[i].mem));
+		put_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_page(sg_page(&db_tab->page[i].mem));
+			put_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..1a5c64c8695f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,10 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	size_t i;
-
-	for (i = 0; i < num_pages; i++) {
-		if (dirty)
-			set_page_dirty_lock(p[i]);
-		put_page(p[i]);
-	}
+	if (dirty)
+		put_user_pages_dirty_lock(p, num_pages);
+	else
+		put_user_pages(p, num_pages);
 }
 
 /*
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..4a4b802b011f 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_page(page);
+			put_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_page(pkt->addr[i].page);
+			put_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_page(pages[i++]);
+		put_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 9dd39daa602b..9e3615fd05f7 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -89,9 +89,10 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (!PageDirty(page) && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
+			if (dirty)
+				put_user_pages_dirty_lock(&page, 1);
+			else
+				put_user_page(page);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
                   ` (2 preceding siblings ...)
  2018-10-12  6:00 ` [PATCH 3/6] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12 10:56   ` Balbir Singh
  2018-10-13  3:55   ` Dave Chinner
  2018-10-12  6:00 ` [PATCH 5/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
  2018-10-12  6:00 ` [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
  5 siblings, 2 replies; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Add two struct page fields that, combined, are unioned with
struct page->lru. There is no change in the size of
struct page. These new fields are for type safety and clarity.

Also add page flag accessors to test, set and clear the new
page->dma_pinned_flags field.

The page->dma_pinned_count field will be used in upcoming
patches.
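
For orientation, on 64-bit the new fields overlay page->lru roughly like
this (just a sketch of the union below, not a separate layout):

    page->lru.next  <--overlaps-->  page->dma_pinned_flags
    page->lru.prev  <--overlaps-->  page->dma_pinned_count (first 4 bytes)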

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm_types.h   | 22 +++++++++++++-----
 include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..017ab82e36ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,12 +78,22 @@ struct page {
 	 */
 	union {
 		struct {	/* Page cache and anonymous pages */
-			/**
-			 * @lru: Pageout list, eg. active_list protected by
-			 * zone_lru_lock.  Sometimes used as a generic list
-			 * by the page owner.
-			 */
-			struct list_head lru;
+			union {
+				/**
+				 * @lru: Pageout list, eg. active_list protected
+				 * by zone_lru_lock.  Sometimes used as a
+				 * generic list by the page owner.
+				 */
+				struct list_head lru;
+				/* Used by get_user_pages*(). Pages may not be
+				 * on an LRU while these dma_pinned_* fields
+				 * are in use.
+				 */
+				struct {
+					unsigned long dma_pinned_flags;
+					atomic_t      dma_pinned_count;
+				};
+			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
 			pgoff_t index;		/* Our offset within mapping. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74bee8cecf4c..81ed52c3caae 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -425,6 +425,53 @@ static __always_inline int __PageMovable(struct page *page)
 				PAGE_MAPPING_MOVABLE;
 }
 
+/*
+ * Because page->dma_pinned_flags is unioned with page->lru, any page that
+ * uses these flags must NOT be on an LRU. That's partly enforced by
+ * ClearPageDmaPinned, which gives the page back to LRU.
+ *
+ * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
+ * of struct page), and this flag is checked without knowing whether it is a
+ * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
+ * rather than bit 0.
+ */
+#define PAGE_DMA_PINNED		0x2
+#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
+
+/*
+ * Because these flags are read outside of a lock, ensure visibility between
+ * different threads, by using READ|WRITE_ONCE.
+ */
+static __always_inline int PageDmaPinnedFlags(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0;
+}
+
+static __always_inline int PageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
+}
+
+static __always_inline void SetPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
+}
+
+static __always_inline void ClearPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
+
+	/* This does a WRITE_ONCE to the lru.next, which is also the
+	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
+	 * this provides visibility to other threads.
+	 */
+	INIT_LIST_HEAD(&page->lru);
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 5/6] mm: introduce zone_gup_lock, for dma-pinned pages
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
                   ` (3 preceding siblings ...)
  2018-10-12  6:00 ` [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12  6:00 ` [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
  5 siblings, 0 replies; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

The page->dma_pinned_flags and _count fields require
lock protection. A lock at approximately the granularity
of the zone_lru_lock is called for, but adding to the
locking contention of zone_lru_lock is undesirable,
because that is a pre-existing hot spot. Fortunately,
these new dma_pinned_* fields can use an independent
lock, so this patch creates an entirely new lock, right
next to the zone_lru_lock.

Why "zone_gup_lock"?

Most of the naming refers to "DMA-pinned pages", but
"zone DMA lock" has other meanings already, so this is
called zone_gup_lock instead. The "dma pinning" is a result
of get_user_pages (gup) being called, so the name still
helps explain its use.
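
The intended usage pattern (used by the next patch in this series) is
simply:

    struct zone *zone = page_zone(page);

    spin_lock(zone_gup_lock(zone));
    /* ...set or clear PageDmaPinned, adjust page->dma_pinned_count... */
    spin_unlock(zone_gup_lock(zone));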

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mmzone.h | 6 ++++++
 mm/page_alloc.c        | 1 +
 2 files changed, 7 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4b0c79d2924..971a63f84ad5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -661,6 +661,7 @@ typedef struct pglist_data {
 	enum zone_type kswapd_classzone_idx;
 
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+	spinlock_t pinned_dma_lock;
 
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
@@ -730,6 +731,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
+static inline spinlock_t *zone_gup_lock(struct zone *zone)
+{
+	return &zone->zone_pgdat->pinned_dma_lock;
+}
+
 static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
 	return &pgdat->lruvec;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..850f90223cc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6225,6 +6225,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
+	spin_lock_init(&pgdat->pinned_dma_lock);
 	lruvec_init(node_lruvec(pgdat));
 }
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields
  2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
                   ` (4 preceding siblings ...)
  2018-10-12  6:00 ` [PATCH 5/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
@ 2018-10-12  6:00 ` john.hubbard
  2018-10-12 11:07   ` Balbir Singh
  5 siblings, 1 reply; 46+ messages in thread
From: john.hubbard @ 2018-10-12  6:00 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

This patch sets and restores the new page->dma_pinned_flags and
page->dma_pinned_count fields, but does not actually use them for
anything yet.

In order to use these fields at all, the page must be removed from
any LRU list that it's on. The patch also adds some precautions that
prevent the page from getting moved back onto an LRU, once it is
in this state.

This is in preparation to fix some problems that came up when using
devices (NICs, GPUs, for example) that set up direct access to a chunk
of system (CPU) memory, so that they can DMA to/from that memory.
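
From a driver's point of view, the resulting lifecycle is roughly the
following (a sketch only; gup flags and error handling depend on the
caller):

    nr = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    /*
     * Each returned page is now off the LRU, with PageDmaPinned set and
     * dma_pinned_count elevated.
     */

    /* ...DMA to/from the pages... */

    put_user_pages_dirty_lock(pages, nr);
    /*
     * dma_pinned_count drops; PageDmaPinned is cleared once it reaches
     * zero, and the page can then go back onto an LRU.
     */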

CC: Matthew Wilcox <willy@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 15 ++-----------
 mm/gup.c           | 56 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c    |  7 ++++++
 mm/swap.c          | 51 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 114 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76d18aada9f8..44878d21e27b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -944,21 +944,10 @@ static inline void put_page(struct page *page)
 }
 
 /*
- * put_user_page() - release a page that had previously been acquired via
- * a call to one of the get_user_pages*() functions.
- *
  * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * one of these put_user_pages*() routines:
  */
-static inline void put_user_page(struct page *page)
-{
-	put_page(page);
-}
-
+void put_user_page(struct page *page);
 void put_user_pages_dirty(struct page **pages, unsigned long npages);
 void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
 void put_user_pages(struct page **pages, unsigned long npages);
diff --git a/mm/gup.c b/mm/gup.c
index 05ee7c18e59a..fddbc30cde89 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -20,6 +20,51 @@
 
 #include "internal.h"
 
+static int pin_page_for_dma(struct page *page)
+{
+	int ret = 0;
+	struct zone *zone;
+
+	page = compound_head(page);
+	zone = page_zone(page);
+
+	spin_lock(zone_gup_lock(zone));
+
+	if (PageDmaPinned(page)) {
+		/* Page was not on an LRU list, because it was DMA-pinned. */
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+
+		atomic_inc(&page->dma_pinned_count);
+		goto unlock_out;
+	}
+
+	/*
+	 * Note that page->dma_pinned_flags is unioned with page->lru.
+	 * Therefore, the rules are: checking if any of the
+	 * PAGE_DMA_PINNED_FLAGS bits are set may be done while page->lru
+	 * is in use. However, setting those flags requires that
+	 * the page is both locked, and also, removed from the LRU.
+	 */
+	ret = isolate_lru_page(page);
+
+	if (ret == 0) {
+		/* Avoid problems later, when freeing the page: */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+
+		/* counteract isolate_lru_page's effects: */
+		put_page(page);
+
+		atomic_set(&page->dma_pinned_count, 1);
+		SetPageDmaPinned(page);
+	}
+
+unlock_out:
+	spin_unlock(zone_gup_lock(zone));
+
+	return ret;
+}
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 		unsigned int flags)
 {
@@ -659,7 +704,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, j;
 	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
@@ -764,6 +809,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	} while (nr_pages);
 
 out:
+	if (pages)
+		for (j = 0; j < i; j++)
+			pin_page_for_dma(pages[j]);
+
 	return i ? i : err;
 }
 
@@ -1841,7 +1890,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr = 0, ret = 0, i;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -1862,6 +1911,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		ret = nr;
 	}
 
+	for (i = 0; i < nr; i++)
+		pin_page_for_dma(pages[i]);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..af9719756081 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2335,6 +2335,11 @@ static void lock_page_lru(struct page *page, int *isolated)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
+		/* LRU and PageDmaPinned are mutually exclusive: they use the
+		 * same fields in struct page, but for different purposes.
+		 */
+		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
+
 		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -2352,6 +2357,8 @@ static void unlock_page_lru(struct page *page, int isolated)
 
 		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
+
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
diff --git a/mm/swap.c b/mm/swap.c
index efab3a6b6f91..6b2b1a958a67 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -134,6 +134,46 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+void put_user_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	page = compound_head(page);
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
+
+	if (atomic_dec_and_test(&page->dma_pinned_count)) {
+		spin_lock(zone_gup_lock(zone));
+
+		/* Re-check while holding the lock, because
+		 * pin_page_for_dma() or get_page() may have snuck in right
+		 * after the atomic_dec_and_test, and raised the count
+		 * above zero again. If so, just leave the flag set. And
+		 * because the atomic_dec_and_test above already got the
+		 * accounting correct, no other action is required.
+		 */
+		if (atomic_read(&page->dma_pinned_count) == 0)
+			ClearPageDmaPinned(page);
+
+		spin_unlock(zone_gup_lock(zone));
+	}
+
+	put_page(page);
+}
+EXPORT_SYMBOL(put_user_page);
+
 /*
  * put_user_pages_dirty() - for each page in the @pages array, make
  * that page (or its head page, if a compound page) dirty, if it was
@@ -907,6 +947,11 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+
+	/* LRU and PageDmaPinned are mutually exclusive: they use the
+	 * same fields in struct page, but for different purposes.
+	 */
+	VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
 		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
 
@@ -946,6 +991,12 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	/* LRU and PageDmaPinned are mutually exclusive: they use the
+	 * same fields in struct page, but for different purposes.
+	 */
+	if (PageDmaPinned(page))
+		return;
+
 	SetPageLRU(page);
 	/*
 	 * Page becomes evictable in two ways:
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] mm: get_user_pages: consolidate error handling
  2018-10-12  6:00 ` [PATCH 1/6] mm: get_user_pages: consolidate error handling john.hubbard
@ 2018-10-12  6:30   ` Balbir Singh
  2018-10-12 22:45       ` John Hubbard
  0 siblings, 1 reply; 46+ messages in thread
From: Balbir Singh @ 2018-10-12  6:30 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, John Hubbard

On Thu, Oct 11, 2018 at 11:00:09PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> An upcoming patch requires a way to operate on each page that
> any of the get_user_pages_*() variants returns.
> 
> In preparation for that, consolidate the error handling for
> __get_user_pages(). This provides a single location (the "out:" label)
> for operating on the collected set of pages that are about to be returned.
> 
> As long as every use of the "ret" variable is being edited anyway, rename
> "ret" --> "err", so that its name matches its true role.
> This also gets rid of two shadowed variable declarations, as a
> tiny beneficial side effect.
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---

Looks good, might not be needed but
Reviewed-by: Balbir Singh <bsingharora@gmail.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  6:00 ` [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions john.hubbard
@ 2018-10-12  7:35   ` Balbir Singh
  2018-10-12 22:31       ` John Hubbard
  0 siblings, 1 reply; 46+ messages in thread
From: Balbir Singh @ 2018-10-12  7:35 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, John Hubbard, Al Viro,
	Jerome Glisse, Christoph Hellwig, Ralph Campbell

On Thu, Oct 11, 2018 at 11:00:10PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing the problem described in [1]. The steps
> are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>    for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>    invoke put_user_page*(), instead of put_page(). This involves dozens of
>    call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>    implement tracking of these pages. This tracking will be separate from
>    the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>    special handling (especially in writeback paths) when the pages are
>    backed by a filesystem. Again, [1] provides details as to why that is
>    desirable.
> 
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Michal Hocko <mhocko@kernel.org>
> CC: Christopher Lameter <cl@linux.com>
> CC: Jason Gunthorpe <jgg@ziepe.ca>
> CC: Dan Williams <dan.j.williams@intel.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Al Viro <viro@zeniv.linux.org.uk>
> CC: Jerome Glisse <jglisse@redhat.com>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Ralph Campbell <rcampbell@nvidia.com>
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 20 +++++++++++
>  mm/swap.c          | 83 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 103 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0416a7204be3..76d18aada9f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -943,6 +943,26 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> +	put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index 26fc9b5f1b6c..efab3a6b6f91 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -134,6 +134,89 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +/*
> + * put_user_pages_dirty() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
Do we need any checks on npages, npages <= (PUD_SHIFT - PAGE_SHIFT)?

> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			set_page_dirty(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/*
> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +
> +		if (!PageDirty(page))
> +			set_page_dirty_lock(page);
> +
> +		put_user_page(page);
> +	}
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +

This can be collapsed with put_user_pages_dirty(), using a function pointer
indirection for the locked vs unlocked case; I'm not sure how that affects
function optimization, though.


> +/*
> + * put_user_pages() - for each page in the @pages array, release the page
> + * using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().

The comment is incorrect.

> + *
> + * @pages:  array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *
> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	for (index = 0; index < npages; index++)
> +		put_user_page(pages[index]);
> +}

Ditto in terms of code duplication

How about

for_each_page_index(index, npages) {
	<do the dirty bits if needed>
	put_user_page(pages[index]);
}

Then pass what you want the page iterator to do


> +EXPORT_SYMBOL(put_user_pages);
> +
>  /*
>   * get_kernel_pages() - pin kernel pages in memory
>   * @kiov:	An array of struct kvec structures
> -- 
> 2.19.1
> 

Balbir Singh.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-12  6:00 ` [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
@ 2018-10-12 10:56   ` Balbir Singh
  2018-10-13  0:15       ` John Hubbard
  2018-10-13  3:55   ` Dave Chinner
  1 sibling, 1 reply; 46+ messages in thread
From: Balbir Singh @ 2018-10-12 10:56 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, John Hubbard

On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Add two struct page fields that, combined, are unioned with
> struct page->lru. There is no change in the size of
> struct page. These new fields are for type safety and clarity.
> 
> Also add page flag accessors to test, set and clear the new
> page->dma_pinned_flags field.
> 
> The page->dma_pinned_count field will be used in upcoming
> patches.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm_types.h   | 22 +++++++++++++-----
>  include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..017ab82e36ca 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,12 +78,22 @@ struct page {
>  	 */
>  	union {
>  		struct {	/* Page cache and anonymous pages */
> -			/**
> -			 * @lru: Pageout list, eg. active_list protected by
> -			 * zone_lru_lock.  Sometimes used as a generic list
> -			 * by the page owner.
> -			 */
> -			struct list_head lru;
> +			union {
> +				/**
> +				 * @lru: Pageout list, eg. active_list protected
> +				 * by zone_lru_lock.  Sometimes used as a
> +				 * generic list by the page owner.
> +				 */
> +				struct list_head lru;
> +				/* Used by get_user_pages*(). Pages may not be
> +				 * on an LRU while these dma_pinned_* fields
> +				 * are in use.
> +				 */
> +				struct {
> +					unsigned long dma_pinned_flags;
> +					atomic_t      dma_pinned_count;
> +				};
> +			};
>  			/* See page-flags.h for PAGE_MAPPING_FLAGS */
>  			struct address_space *mapping;
>  			pgoff_t index;		/* Our offset within mapping. */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74bee8cecf4c..81ed52c3caae 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -425,6 +425,53 @@ static __always_inline int __PageMovable(struct page *page)
>  				PAGE_MAPPING_MOVABLE;
>  }
>  
> +/*
> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
> + * uses these flags must NOT be on an LRU. That's partly enforced by
> + * ClearPageDmaPinned, which gives the page back to LRU.
> + *
> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
> + * of struct page), and this flag is checked without knowing whether it is a
> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
> + * rather than bit 0.
> + */
> +#define PAGE_DMA_PINNED		0x2
> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
> +

This is really subtle: will additional changes to compound_head() need to
coordinate with these flags? Also, doesn't this bit need to be unique across
all structs in the union? I guess that is guaranteed by the fact that
page == compound_head(page), as per your assertion, but I've forgotten why
that is true. Could you please add some commentary on that?

> +/*
> + * Because these flags are read outside of a lock, ensure visibility between
> + * different threads, by using READ|WRITE_ONCE.
> + */
> +static __always_inline int PageDmaPinnedFlags(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0;
> +}
> +
> +static __always_inline int PageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
> +}
> +
> +static __always_inline void SetPageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));

VM_BUG_ON(!list_empty(&page->lru))

> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
> +}
> +
> +static __always_inline void ClearPageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
> +
> +	/* This does a WRITE_ONCE to the lru.next, which is also the
> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
> +	 * this provides visibility to other threads.
> +	 */
> +	INIT_LIST_HEAD(&page->lru);

This assumes certain things about the list_head layout; why not use the
correct initialization bits instead?

> +}
> +
>  #ifdef CONFIG_KSM
>  /*
>   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields
  2018-10-12  6:00 ` [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
@ 2018-10-12 11:07   ` Balbir Singh
  2018-10-13  0:33       ` John Hubbard
  0 siblings, 1 reply; 46+ messages in thread
From: Balbir Singh @ 2018-10-12 11:07 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, John Hubbard

On Thu, Oct 11, 2018 at 11:00:14PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> This patch sets and restores the new page->dma_pinned_flags and
> page->dma_pinned_count fields, but does not actually use them for
> anything yet.
> 
> In order to use these fields at all, the page must be removed from
> any LRU list that it's on. The patch also adds some precautions that
> prevent the page from getting moved back onto an LRU, once it is
> in this state.
> 
> This is in preparation to fix some problems that came up when using
> devices (NICs, GPUs, for example) that set up direct access to a chunk
> of system (CPU) memory, so that they can DMA to/from that memory.
> 
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Jan Kara <jack@suse.cz>
> CC: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 15 ++-----------
>  mm/gup.c           | 56 ++++++++++++++++++++++++++++++++++++++++++++--
>  mm/memcontrol.c    |  7 ++++++
>  mm/swap.c          | 51 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 114 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 76d18aada9f8..44878d21e27b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -944,21 +944,10 @@ static inline void put_page(struct page *page)
>  }
>  
>  /*
> - * put_user_page() - release a page that had previously been acquired via
> - * a call to one of the get_user_pages*() functions.
> - *
>   * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> + * one of these put_user_pages*() routines:
>   */
> -static inline void put_user_page(struct page *page)
> -{
> -	put_page(page);
> -}
> -
> +void put_user_page(struct page *page);
>  void put_user_pages_dirty(struct page **pages, unsigned long npages);
>  void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
>  void put_user_pages(struct page **pages, unsigned long npages);
> diff --git a/mm/gup.c b/mm/gup.c
> index 05ee7c18e59a..fddbc30cde89 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -20,6 +20,51 @@
>  
>  #include "internal.h"
>  
> +static int pin_page_for_dma(struct page *page)
> +{
> +	int ret = 0;
> +	struct zone *zone;
> +
> +	page = compound_head(page);
> +	zone = page_zone(page);
> +
> +	spin_lock(zone_gup_lock(zone));
> +
> +	if (PageDmaPinned(page)) {
> +		/* Page was not on an LRU list, because it was DMA-pinned. */
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +		atomic_inc(&page->dma_pinned_count);
> +		goto unlock_out;
> +	}
> +
> +	/*
> +	 * Note that page->dma_pinned_flags is unioned with page->lru.
> +	 * Therefore, the rules are: checking if any of the
> +	 * PAGE_DMA_PINNED_FLAGS bits are set may be done while page->lru
> +	 * is in use. However, setting those flags requires that
> +	 * the page is both locked, and also, removed from the LRU.
> +	 */
> +	ret = isolate_lru_page(page);
> +

isolate_lru_page() can be expensive, and in terms of the overall locking
order, it sounds like zone_gup_lock is now higher in the hierarchy than the
locks taken inside isolate_lru_page().

> +	if (ret == 0) {
> +		/* Avoid problems later, when freeing the page: */
> +		ClearPageActive(page);
> +		ClearPageUnevictable(page);
> +
> +		/* counteract isolate_lru_page's effects: */
> +		put_page(page);

Can the page get reaped here? What's the expected page count?

> +
> +		atomic_set(&page->dma_pinned_count, 1);
> +		SetPageDmaPinned(page);
> +	}
> +
> +unlock_out:
> +	spin_unlock(zone_gup_lock(zone));
> +
> +	return ret;
> +}
> +
>  static struct page *no_page_table(struct vm_area_struct *vma,
>  		unsigned int flags)
>  {
> @@ -659,7 +704,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *nonblocking)
>  {
> -	long i = 0;
> +	long i = 0, j;
>  	int err = 0;
>  	unsigned int page_mask;
>  	struct vm_area_struct *vma = NULL;
> @@ -764,6 +809,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  	} while (nr_pages);
>  
>  out:
> +	if (pages)
> +		for (j = 0; j < i; j++)
> +			pin_page_for_dma(pages[j]);
> +

Why does get_user_pages() unconditionally pin_page_for_dma?

>  	return i ? i : err;
>  }
>  
> @@ -1841,7 +1890,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages)
>  {
>  	unsigned long addr, len, end;
> -	int nr = 0, ret = 0;
> +	int nr = 0, ret = 0, i;
>  
>  	start &= PAGE_MASK;
>  	addr = start;
> @@ -1862,6 +1911,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  		ret = nr;
>  	}
>  
> +	for (i = 0; i < nr; i++)
> +		pin_page_for_dma(pages[i]);

Why does get_user_pages_fast() unconditionally pin_page_for_dma?

> +
>  	if (nr < nr_pages) {
>  		/* Try to get the remaining pages with get_user_pages */
>  		start += nr << PAGE_SHIFT;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e79cb59552d9..af9719756081 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2335,6 +2335,11 @@ static void lock_page_lru(struct page *page, int *isolated)
>  	if (PageLRU(page)) {
>  		struct lruvec *lruvec;
>  
> +		/* LRU and PageDmaPinned are mutually exclusive: they use the
> +		 * same fields in struct page, but for different purposes.
> +		 */

Comment style needs fixing

> +		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
> +
>  		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>  		ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
> @@ -2352,6 +2357,8 @@ static void unlock_page_lru(struct page *page, int isolated)
>  
>  		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
> +		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
> +
>  		SetPageLRU(page);
>  		add_page_to_lru_list(page, lruvec, page_lru(page));
>  	}
> diff --git a/mm/swap.c b/mm/swap.c
> index efab3a6b6f91..6b2b1a958a67 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -134,6 +134,46 @@ void put_pages_list(struct list_head *pages)
>  }
>  EXPORT_SYMBOL(put_pages_list);
>  
> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +void put_user_page(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	page = compound_head(page);
> +
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
> +
> +	if (atomic_dec_and_test(&page->dma_pinned_count)) {
> +		spin_lock(zone_gup_lock(zone));
> +
> +		/* Re-check while holding the lock, because
> +		 * pin_page_for_dma() or get_page() may have snuck in right
> +		 * after the atomic_dec_and_test, and raised the count
> +		 * above zero again. If so, just leave the flag set. And
> +		 * because the atomic_dec_and_test above already got the
> +		 * accounting correct, no other action is required.
> +		 */
> +		if (atomic_read(&page->dma_pinned_count) == 0)
> +			ClearPageDmaPinned(page);
> +
> +		spin_unlock(zone_gup_lock(zone));
> +	}
> +
> +	put_page(page);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
>  /*
>   * put_user_pages_dirty() - for each page in the @pages array, make
>   * that page (or its head page, if a compound page) dirty, if it was
> @@ -907,6 +947,11 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>  	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> +
> +	/* LRU and PageDmaPinned are mutually exclusive: they use the
> +	 * same fields in struct page, but for different purposes.
> +	 */
> +	VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
>  	VM_BUG_ON(NR_CPUS != 1 &&
>  		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>  
> @@ -946,6 +991,12 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> +	/* LRU and PageDmaPinned are mutually exclusive: they use the
> +	 * same fields in struct page, but for different purposes.
> +	 */
> +	if (PageDmaPinned(page))
> +		return;
> +
>  	SetPageLRU(page);
>  	/*
>  	 * Page becomes evictable in two ways:
> -- 
> 2.19.1


Balbir

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions
  2018-10-12  7:35   ` Balbir Singh
@ 2018-10-12 22:31       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-12 22:31 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, Al Viro, Jerome Glisse,
	Christoph Hellwig, Ralph Campbell

On 10/12/18 12:35 AM, Balbir Singh wrote:
> On Thu, Oct 11, 2018 at 11:00:10PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]>> +/*
>> + * put_user_pages_dirty() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * set_page_dirty(), which does not lock the page, is used here.
>> + * Therefore, it is the caller's responsibility to ensure that this is
>> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
>> + *
>> + */
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
> Do we need any checks on npages, npages <= (PUD_SHIFT - PAGE_SHIFT)?
> 

Hi Balbir,

Thanks for combing through this series.

I'd go with "probably not", because the only check that can be done is
what you showed above: "did someone crazy pass in more pages than are
possible for this system?". I don't think checking for that helps here,
as that will show up loud and clear, in other ways.

The release_pages() implementation made the same judgment call to not 
check npages, which also influenced me.

A VM_BUG_ON could be added but I'd prefer not to, as it seems to have 
not enough benefit to be worth it.


>> +		struct page *page = compound_head(pages[index]);
>> +
>> +		if (!PageDirty(page))
>> +			set_page_dirty(page);
>> +
>> +		put_user_page(page);
>> +	}
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty);
>> +
>> +/*
>> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
>> + * that page (or its head page, if a compound page) dirty, if it was
>> + * previously listed as clean. Then, release the page using
>> + * put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * This is just like put_user_pages_dirty(), except that it invokes
>> + * set_page_dirty_lock(), instead of set_page_dirty().
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
>> + *
>> + */
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++) {
>> +		struct page *page = compound_head(pages[index]);
>> +
>> +		if (!PageDirty(page))
>> +			set_page_dirty_lock(page);
>> +
>> +		put_user_page(page);
>> +	}
>> +}
>> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
>> +
> 
> This can be collapsed w.r.t. put_user_pages_dirty(): use a function pointer
> indirection for the locked vs unlocked case. Not sure how that affects
> function optimization.
> 

OK, good point. I initially wanted to avoid the overhead of a function 
pointer, but considering that there are lots of other operations 
happening, I think you're right: best to just get rid of the code 
duplication. If later on we find that changing it back actually helps 
any benchmarks, that can always be done.

See below for how I'm planning on fixing it, and it is a nice little 
cleanup, so thanks for pointing that out.

>> +/*
>> + * put_user_pages() - for each page in the @pages array, release the page
>> + * using put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * This is just like put_user_pages_dirty(), except that it invokes
>> + * set_page_dirty_lock(), instead of set_page_dirty().
> 
> The comment is incorrect.

Yes, it sure is! Jan spotted it before, and I fixed it once, then 
rebased off of the version right *before* the fix, so now I have to 
delete that sentence again.  It's hard to kill! :)

> 
>> + *
>> + * @pages:  array of pages to be marked dirty and released.
>> + * @npages: number of pages in the @pages array.
>> + *
>> + */
>> +void put_user_pages(struct page **pages, unsigned long npages)
>> +{
>> +	unsigned long index;
>> +
>> +	for (index = 0; index < npages; index++)
>> +		put_user_page(pages[index]);
>> +}
> 
> Ditto in terms of code duplication
> 

Here, I think you'll find that the end result is sufficiently
de-duplicated after applying the function pointer above. Here's what it
looks like without the comment blocks, below. I don't want to go any 
further than this, because the only thing left is the "for" loops, and 
macro-izing such a trivial thing is not really a win:


typedef int (*set_dirty_func)(struct page *page);

static void __put_user_pages_dirty(struct page **pages,
				   unsigned long npages,
				   set_dirty_func sdf)
{
	unsigned long index;

	for (index = 0; index < npages; index++) {
		struct page *page = compound_head(pages[index]);

		if (!PageDirty(page))
			sdf(page);

		put_user_page(page);
	}
}

void put_user_pages_dirty(struct page **pages, unsigned long npages)
{
	__put_user_pages_dirty(pages, npages, set_page_dirty);
}
EXPORT_SYMBOL(put_user_pages_dirty);

void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
{
	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
}
EXPORT_SYMBOL(put_user_pages_dirty_lock);

void put_user_pages(struct page **pages, unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		put_user_page(pages[index]);
}
EXPORT_SYMBOL(put_user_pages);

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] mm: get_user_pages: consolidate error handling
  2018-10-12  6:30   ` Balbir Singh
@ 2018-10-12 22:45       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-12 22:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/11/18 11:30 PM, Balbir Singh wrote:
> On Thu, Oct 11, 2018 at 11:00:09PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> An upcoming patch requires a way to operate on each page that
>> any of the get_user_pages_*() variants returns.
>>
>> In preparation for that, consolidate the error handling for
>> __get_user_pages(). This provides a single location (the "out:" label)
>> for operating on the collected set of pages that are about to be returned.
>>
>> As long as every use of the "ret" variable is being edited, rename
>> "ret" --> "err", so that its name matches its true role.
>> This also gets rid of two shadowed variable declarations, as a
>> tiny beneficial side effect.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
> 
> Looks good, might not be needed but
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
> 

Thanks for the review, very good to have another set of eyes on
this one.

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-12 10:56   ` Balbir Singh
@ 2018-10-13  0:15       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-13  0:15 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/12/18 3:56 AM, Balbir Singh wrote:
> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
>> + * uses these flags must NOT be on an LRU. That's partly enforced by
>> + * ClearPageDmaPinned, which gives the page back to LRU.
>> + *
>> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
>> + * of struct page), and this flag is checked without knowing whether it is a
>> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
>> + * rather than bit 0.
>> + */
>> +#define PAGE_DMA_PINNED		0x2
>> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
>> +
> 
> This is really subtle: will additional changes to compound_head() need to
> coordinate with these flags? Also, doesn't this bit need to be unique across
> all structs in the union? I guess that is guaranteed by the fact that
> page == compound_head(page), as per your assertion, but I've forgotten why
> that is true. Could you please add some commentary on that?
> 

Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
to even have it). So now it looks like this:

/*
 * Because page->dma_pinned_flags is unioned with page->lru, any page that
 * uses these flags must NOT be on an LRU. That's partly enforced by
 * ClearPageDmaPinned, which gives the page back to LRU.
 *
 * PageDmaPinned is checked without knowing whether it is a tail page or a
 * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
 * bit in the first union of struct page), and instead uses bit 1 (0x2),
 * rather than bit 0.
 *
 * PageDmaPinned can only be used if no other systems are using the same bit
 * across the first struct page union. In this regard, it is similar to
 * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
 * alone, bit 1 is also left alone so far: other union elements (ignoring tail
 * pages) put pointers there, and pointer alignment leaves the lower two bits
 * available.
 *
 * So, constraints include:
 *
 *     -- Only use PageDmaPinned on non-tail pages.
 *     -- Remove the page from any LRU list first.
 */
#define PAGE_DMA_PINNED		0x2

/*
 * Because these flags are read outside of a lock, ensure visibility between
 * different threads, by using READ|WRITE_ONCE.
 */
static __always_inline int PageDmaPinned(struct page *page)
{
	VM_BUG_ON(page != compound_head(page));
	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
}

[...]
>> +static __always_inline void SetPageDmaPinned(struct page *page)
>> +{
>> +	VM_BUG_ON(page != compound_head(page));
> 
> VM_BUG_ON(!list_empty(&page->lru))


There is only one place where we set this flag, and that is when (in patch 6/6)
transitioning from a page that might (or might not) have been
on an LRU. In that case, the calling code has already corrupted page->lru, by
writing to page->dma_pinned_count, which is unioned with page->lru:

		atomic_set(&page->dma_pinned_count, 1);
		SetPageDmaPinned(page);

...so it would be inappropriate to call a list function, such as 
list_empty(), on that field.  Let's just leave it as-is.
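
To spell out the overlap (my annotation, not text from the patch; it follows
from the union that patch 4/6 adds to struct page):

/*
 * On 64-bit, the first union words line up roughly like this:
 *
 *   page->lru.next   overlaps   page->dma_pinned_flags
 *   page->lru.prev   overlaps   page->dma_pinned_count (its first bytes)
 *
 * So the atomic_set() above has already scribbled on page->lru, and
 * list_empty(&page->lru) would be reading garbage at that point.
 */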


> 
>> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
>> +}
>> +
>> +static __always_inline void ClearPageDmaPinned(struct page *page)
>> +{
>> +	VM_BUG_ON(page != compound_head(page));
>> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
>> +
>> +	/* This does a WRITE_ONCE to the lru.next, which is also the
>> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
>> +	 * this provides visibility to other threads.
>> +	 */
>> +	INIT_LIST_HEAD(&page->lru);
> 
> This assumes certain things about list_head, why not use the correct
> initialization bits.
> 

Yes, OK, changed to:

static __always_inline void ClearPageDmaPinned(struct page *page)
{
	VM_BUG_ON(page != compound_head(page));
	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);

	/* Provide visibility to other threads: */
	WRITE_ONCE(page->dma_pinned_flags, 0);

	/*
	 * Safety precaution: restore the list head, before possibly returning
	 * the page to other subsystems.
	 */
	INIT_LIST_HEAD(&page->lru);
}



-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields
  2018-10-12 11:07   ` Balbir Singh
@ 2018-10-13  0:33       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-13  0:33 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/12/18 4:07 AM, Balbir Singh wrote:
> On Thu, Oct 11, 2018 at 11:00:14PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> +static int pin_page_for_dma(struct page *page)
>> +{
>> +	int ret = 0;
>> +	struct zone *zone;
>> +
>> +	page = compound_head(page);
>> +	zone = page_zone(page);
>> +
>> +	spin_lock(zone_gup_lock(zone));
>> +
>> +	if (PageDmaPinned(page)) {
>> +		/* Page was not on an LRU list, because it was DMA-pinned. */
>> +		VM_BUG_ON_PAGE(PageLRU(page), page);
>> +
>> +		atomic_inc(&page->dma_pinned_count);
>> +		goto unlock_out;
>> +	}
>> +
>> +	/*
>> +	 * Note that page->dma_pinned_flags is unioned with page->lru.
>> +	 * Therefore, the rules are: checking if any of the
>> +	 * PAGE_DMA_PINNED_FLAGS bits are set may be done while page->lru
>> +	 * is in use. However, setting those flags requires that
>> +	 * the page is both locked, and also, removed from the LRU.
>> +	 */
>> +	ret = isolate_lru_page(page);
>> +
> 
> isolate_lru_page() can be expensive, and in terms of the overall locking order
> it sounds like zone_gup_lock is higher in the hierarchy than the locks taken
> inside isolate_lru_page().
 
As for the expensive part, that is a concern. But I do think we need some lock
here. The hierarchy shouldn't be a problem, given that this is a new lock. But
I'm not sure how to make this better. In any case, I think it should work--note that
the zone_lru_lock, within isolate_lru_page(), is of similar use, and is held
for a similar duration, so...maybe not really a problem?
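
To make that explicit, the nesting that pin_page_for_dma() implies could be
documented like this (my wording, not a comment from the patch):

/*
 * Lock ordering, outermost first:
 *
 *   zone_gup_lock(zone)
 *       zone_lru_lock(zone)    (taken inside isolate_lru_page())
 *
 * zone_gup_lock is new in this series and is only ever taken outside
 * the LRU lock, never the other way around.
 */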


>> +	if (ret == 0) {
>> +		/* Avoid problems later, when freeing the page: */
>> +		ClearPageActive(page);
>> +		ClearPageUnevictable(page);
>> +
>> +		/* counteract isolate_lru_page's effects: */
>> +		put_page(page);
> 
> Can the page get reaped here? What's the expected page count?

Nope. The page_count is at least one, because get_user_pages() incremented it.
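
Walking through the counts, just to be concrete (N is whatever the count was
before gup, e.g. the page cache and mapping references):

	page_count == N + 1   after get_user_pages() takes its reference
	page_count == N + 2   after isolate_lru_page() (it does a get_page())
	page_count == N + 1   after the put_page() above

so the gup reference is still held here, and the page cannot be freed out
from under us at this point.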

 
>> +
>> +		atomic_set(&page->dma_pinned_count, 1);
>> +		SetPageDmaPinned(page);
>> +	}
>> +
>> +unlock_out:
>> +	spin_unlock(zone_gup_lock(zone));
>> +
>> +	return ret;
>> +}
>> +
>>  static struct page *no_page_table(struct vm_area_struct *vma,
>>  		unsigned int flags)
>>  {
>> @@ -659,7 +704,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  		unsigned int gup_flags, struct page **pages,
>>  		struct vm_area_struct **vmas, int *nonblocking)
>>  {
>> -	long i = 0;
>> +	long i = 0, j;
>>  	int err = 0;
>>  	unsigned int page_mask;
>>  	struct vm_area_struct *vma = NULL;
>> @@ -764,6 +809,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>  	} while (nr_pages);
>>  
>>  out:
>> +	if (pages)
>> +		for (j = 0; j < i; j++)
>> +			pin_page_for_dma(pages[j]);
>> +
> 
> Why does get_user_pages() unconditionally pin_page_for_dma?

That's the grand plan here: get_user_pages() now means "unconditionally pin the page for dma".
If you didn't want that, then either release it quickly (many callers do), or use a different
way of pinning or acquiring the page.
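
In other words, the caller-visible contract becomes something like this
(sketch only, not lifted from any particular driver; error handling trimmed):

	unsigned long addr;	/* some user VA, provided by the caller */
	struct page *page;
	long ret;

	ret = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
	if (ret == 1) {
		/* ... set up DMA and let the device read/write the page ... */

		/* Release with put_user_page(), not plain put_page(): */
		put_user_page(page);
	}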

> 
>>  	return i ? i : err;
>>  }
>>  
>> @@ -1841,7 +1890,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  			struct page **pages)
>>  {
>>  	unsigned long addr, len, end;
>> -	int nr = 0, ret = 0;
>> +	int nr = 0, ret = 0, i;
>>  
>>  	start &= PAGE_MASK;
>>  	addr = start;
>> @@ -1862,6 +1911,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  		ret = nr;
>>  	}
>>  
>> +	for (i = 0; i < nr; i++)
>> +		pin_page_for_dma(pages[i]);
> 
> Why does get_user_pages_fast() unconditionally pin_page_for_dma?

All of the get_user_pages*() variants need to follow the same rules, so the same 
explanation as above applies here also.

>> +
>>  	if (nr < nr_pages) {
>>  		/* Try to get the remaining pages with get_user_pages */
>>  		start += nr << PAGE_SHIFT;
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e79cb59552d9..af9719756081 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2335,6 +2335,11 @@ static void lock_page_lru(struct page *page, int *isolated)
>>  	if (PageLRU(page)) {
>>  		struct lruvec *lruvec;
>>  
>> +		/* LRU and PageDmaPinned are mutually exclusive: they use the
>> +		 * same fields in struct page, but for different purposes.
>> +		 */
> 
> Comment style needs fixing
 
oops, thanks for spotting those, will fix.


-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-12  6:00 ` [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
  2018-10-12 10:56   ` Balbir Singh
@ 2018-10-13  3:55   ` Dave Chinner
  2018-10-13  7:34       ` John Hubbard
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2018-10-13  3:55 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel, John Hubbard

On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Add two struct page fields that, combined, are unioned with
> struct page->lru. There is no change in the size of
> struct page. These new fields are for type safety and clarity.
> 
> Also add page flag accessors to test, set and clear the new
> page->dma_pinned_flags field.
> 
> The page->dma_pinned_count field will be used in upcoming
> patches
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm_types.h   | 22 +++++++++++++-----
>  include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..017ab82e36ca 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,12 +78,22 @@ struct page {
>  	 */
>  	union {
>  		struct {	/* Page cache and anonymous pages */
> -			/**
> -			 * @lru: Pageout list, eg. active_list protected by
> -			 * zone_lru_lock.  Sometimes used as a generic list
> -			 * by the page owner.
> -			 */
> -			struct list_head lru;
> +			union {
> +				/**
> +				 * @lru: Pageout list, eg. active_list protected
> +				 * by zone_lru_lock.  Sometimes used as a
> +				 * generic list by the page owner.
> +				 */
> +				struct list_head lru;
> +				/* Used by get_user_pages*(). Pages may not be
> +				 * on an LRU while these dma_pinned_* fields
> +				 * are in use.
> +				 */
> +				struct {
> +					unsigned long dma_pinned_flags;
> +					atomic_t      dma_pinned_count;
> +				};
> +			};

Isn't this broken for mapped file-backed pages? i.e. they may be
passed as the user buffer to read/write direct IO and so the pages
passed to gup will be on the active/inactive LRUs. hence I can't see
how you can have dual use of the LRU list head like this....

What am I missing here?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13  3:55   ` Dave Chinner
@ 2018-10-13  7:34       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-13  7:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/12/18 8:55 PM, Dave Chinner wrote:
> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 5ed8f6292a53..017ab82e36ca 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -78,12 +78,22 @@ struct page {
>>  	 */
>>  	union {
>>  		struct {	/* Page cache and anonymous pages */
>> -			/**
>> -			 * @lru: Pageout list, eg. active_list protected by
>> -			 * zone_lru_lock.  Sometimes used as a generic list
>> -			 * by the page owner.
>> -			 */
>> -			struct list_head lru;
>> +			union {
>> +				/**
>> +				 * @lru: Pageout list, eg. active_list protected
>> +				 * by zone_lru_lock.  Sometimes used as a
>> +				 * generic list by the page owner.
>> +				 */
>> +				struct list_head lru;
>> +				/* Used by get_user_pages*(). Pages may not be
>> +				 * on an LRU while these dma_pinned_* fields
>> +				 * are in use.
>> +				 */
>> +				struct {
>> +					unsigned long dma_pinned_flags;
>> +					atomic_t      dma_pinned_count;
>> +				};
>> +			};
> 
> Isn't this broken for mapped file-backed pages? i.e. they may be
> passed as the user buffer to read/write direct IO and so the pages
> passed to gup will be on the active/inactive LRUs. hence I can't see
> how you can have dual use of the LRU list head like this....
> 
> What am I missing here?

Hi Dave,

In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
unceremoniously rips the pages out of the LRU, as a prerequisite to using
either of the page->dma_pinned_* fields. 

The idea is that LRU is not especially useful for this situation anyway,
so we'll just make it one or the other: either a page is dma-pinned, and
just hanging out doing RDMA most likely (and LRU is less meaningful during that
time), or it's possibly on an LRU list.


-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13  7:34       ` John Hubbard
  (?)
@ 2018-10-13 16:47       ` Christoph Hellwig
  2018-10-13 21:19           ` John Hubbard
  2018-11-05  7:10           ` John Hubbard
  -1 siblings, 2 replies; 46+ messages in thread
From: Christoph Hellwig @ 2018-10-13 16:47 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dave Chinner, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> either of the page->dma_pinned_* fields. 
> 
> The idea is that LRU is not especially useful for this situation anyway,
> so we'll just make it one or the other: either a page is dma-pinned, and
> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> time), or it's possibly on an LRU list.

Have you done any benchmarking what this does to direct I/O performance,
especially for small I/O directly to a (fast) block device?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13 16:47       ` Christoph Hellwig
@ 2018-10-13 21:19           ` John Hubbard
  2018-11-05  7:10           ` John Hubbard
  1 sibling, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-13 21:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>> either of the page->dma_pinned_* fields. 
>>
>> The idea is that LRU is not especially useful for this situation anyway,
>> so we'll just make it one or the other: either a page is dma-pinned, and
>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>> time), or it's possibly on an LRU list.
> 
> Have you done any benchmarking what this does to direct I/O performance,
> especially for small I/O directly to a (fast) block device?

Not yet. I can go do that now. If you have any particular test suites, benchmarks,
or just programs to recommend, please let me know. So far, I see

   tools/testing/selftests/vm/gup_benchmark.c


-- 
thanks,
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13  7:34       ` John Hubbard
  (?)
  (?)
@ 2018-10-13 23:01       ` Dave Chinner
  2018-10-16  8:51         ` Jan Kara
  -1 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2018-10-13 23:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> On 10/12/18 8:55 PM, Dave Chinner wrote:
> > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> >> From: John Hubbard <jhubbard@nvidia.com>
> [...]
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 5ed8f6292a53..017ab82e36ca 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -78,12 +78,22 @@ struct page {
> >>  	 */
> >>  	union {
> >>  		struct {	/* Page cache and anonymous pages */
> >> -			/**
> >> -			 * @lru: Pageout list, eg. active_list protected by
> >> -			 * zone_lru_lock.  Sometimes used as a generic list
> >> -			 * by the page owner.
> >> -			 */
> >> -			struct list_head lru;
> >> +			union {
> >> +				/**
> >> +				 * @lru: Pageout list, eg. active_list protected
> >> +				 * by zone_lru_lock.  Sometimes used as a
> >> +				 * generic list by the page owner.
> >> +				 */
> >> +				struct list_head lru;
> >> +				/* Used by get_user_pages*(). Pages may not be
> >> +				 * on an LRU while these dma_pinned_* fields
> >> +				 * are in use.
> >> +				 */
> >> +				struct {
> >> +					unsigned long dma_pinned_flags;
> >> +					atomic_t      dma_pinned_count;
> >> +				};
> >> +			};
> > 
> > Isn't this broken for mapped file-backed pages? i.e. they may be
> > passed as the user buffer to read/write direct IO and so the pages
> > passed to gup will be on the active/inactive LRUs. hence I can't see
> > how you can have dual use of the LRU list head like this....
> > 
> > What am I missing here?
> 
> Hi Dave,
> 
> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> either of the page->dma_pinned_* fields. 

How is that safe? If you've ripped the page out of the LRU, it's no
longer being tracked by the page cache aging and reclaim algorithms.
Patch 6 doesn't appear to put these pages back in the LRU, either,
so it looks to me like this just dumps them on the ground after the
gup reference is dropped.  How do we reclaim these page cache pages
when there is memory pressure if they aren't in the LRU?

> The idea is that LRU is not especially useful for this situation anyway,
> so we'll just make it one or the other: either a page is dma-pinned, and
> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> time), or it's possibly on an LRU list.

gup isn't just used for RDMA. It's used by direct IO in far, far
more situations and machines than RDMA is. Please explain why
ripping pages out of the LRU and not putting them back is safe, has
no side effects, doesn't adversely impact page cache reclaim, etc.
Indeed, I'd love to see a description of all the page references and
where they come and go so we know the changes aren't just leaking
these pages until the filesystem invalidates them at unmount.

Maybe I'm not seeing why this is safe yet, but seeing as you haven't
explained why it is safe then, at minimum, the patch descriptions
are incomplete.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13 23:01       ` Dave Chinner
@ 2018-10-16  8:51         ` Jan Kara
  2018-10-17  1:48             ` John Hubbard
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Kara @ 2018-10-16  8:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On Sun 14-10-18 10:01:24, Dave Chinner wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> > On 10/12/18 8:55 PM, Dave Chinner wrote:
> > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> > >> From: John Hubbard <jhubbard@nvidia.com>
> > [...]
> > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > >> index 5ed8f6292a53..017ab82e36ca 100644
> > >> --- a/include/linux/mm_types.h
> > >> +++ b/include/linux/mm_types.h
> > >> @@ -78,12 +78,22 @@ struct page {
> > >>  	 */
> > >>  	union {
> > >>  		struct {	/* Page cache and anonymous pages */
> > >> -			/**
> > >> -			 * @lru: Pageout list, eg. active_list protected by
> > >> -			 * zone_lru_lock.  Sometimes used as a generic list
> > >> -			 * by the page owner.
> > >> -			 */
> > >> -			struct list_head lru;
> > >> +			union {
> > >> +				/**
> > >> +				 * @lru: Pageout list, eg. active_list protected
> > >> +				 * by zone_lru_lock.  Sometimes used as a
> > >> +				 * generic list by the page owner.
> > >> +				 */
> > >> +				struct list_head lru;
> > >> +				/* Used by get_user_pages*(). Pages may not be
> > >> +				 * on an LRU while these dma_pinned_* fields
> > >> +				 * are in use.
> > >> +				 */
> > >> +				struct {
> > >> +					unsigned long dma_pinned_flags;
> > >> +					atomic_t      dma_pinned_count;
> > >> +				};
> > >> +			};
> > > 
> > > Isn't this broken for mapped file-backed pages? i.e. they may be
> > > passed as the user buffer to read/write direct IO and so the pages
> > > passed to gup will be on the active/inactive LRUs. hence I can't see
> > > how you can have dual use of the LRU list head like this....
> > > 
> > > What am I missing here?
> > 
> > Hi Dave,
> > 
> > In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> > unceremoniously rips the pages out of the LRU, as a prerequisite to using
> > either of the page->dma_pinned_* fields. 
> 
> How is that safe? If you've ripped the page out of the LRU, it's no
> longer being tracked by the page cache aging and reclaim algorithms.
> Patch 6 doesn't appear to put these pages back in the LRU, either,
> so it looks to me like this just dumps them on the ground after the
> gup reference is dropped.  How do we reclaim these page cache pages
> when there is memory pressure if they aren't in the LRU?

Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should
return the page to the LRU from put_user_page().
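
I.e. roughly this shape in put_user_page() (an untested sketch only, ignoring
compound pages and the question of which LRU list the page should go back to):

	if (atomic_dec_and_test(&page->dma_pinned_count)) {
		ClearPageDmaPinned(page);	/* restores page->lru */
		/* ...and here, add the page back onto the appropriate LRU,
		 * before dropping the pin's page reference below.
		 */
	}
	put_page(page);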

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-16  8:51         ` Jan Kara
@ 2018-10-17  1:48             ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-17  1:48 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, Andrew Morton, LKML,
	linux-rdma, linux-fsdevel

On 10/16/18 1:51 AM, Jan Kara wrote:
> On Sun 14-10-18 10:01:24, Dave Chinner wrote:
>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>> On 10/12/18 8:55 PM, Dave Chinner wrote:
>>>> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>> [...]
>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>> index 5ed8f6292a53..017ab82e36ca 100644
>>>>> --- a/include/linux/mm_types.h
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -78,12 +78,22 @@ struct page {
>>>>>  	 */
>>>>>  	union {
>>>>>  		struct {	/* Page cache and anonymous pages */
>>>>> -			/**
>>>>> -			 * @lru: Pageout list, eg. active_list protected by
>>>>> -			 * zone_lru_lock.  Sometimes used as a generic list
>>>>> -			 * by the page owner.
>>>>> -			 */
>>>>> -			struct list_head lru;
>>>>> +			union {
>>>>> +				/**
>>>>> +				 * @lru: Pageout list, eg. active_list protected
>>>>> +				 * by zone_lru_lock.  Sometimes used as a
>>>>> +				 * generic list by the page owner.
>>>>> +				 */
>>>>> +				struct list_head lru;
>>>>> +				/* Used by get_user_pages*(). Pages may not be
>>>>> +				 * on an LRU while these dma_pinned_* fields
>>>>> +				 * are in use.
>>>>> +				 */
>>>>> +				struct {
>>>>> +					unsigned long dma_pinned_flags;
>>>>> +					atomic_t      dma_pinned_count;
>>>>> +				};
>>>>> +			};
>>>>
>>>> Isn't this broken for mapped file-backed pages? i.e. they may be
>>>> passed as the user buffer to read/write direct IO and so the pages
>>>> passed to gup will be on the active/inactive LRUs. hence I can't see
>>>> how you can have dual use of the LRU list head like this....
>>>>
>>>> What am I missing here?
>>>
>>> Hi Dave,
>>>
>>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>>> either of the page->dma_pinned_* fields. 
>>
>> How is that safe? If you've ripped the page out of the LRU, it's no
>> longer being tracked by the page cache aging and reclaim algorithms.
>> Patch 6 doesn't appear to put these pages back in the LRU, either,
>> so it looks to me like this just dumps them on the ground after the
>> gup reference is dropped.  How do we reclaim these page cache pages
>> when there is memory pressure if they aren't in the LRU?
> 
> Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should
> return the page to the LRU from put_user_page().
> 

Yes. Ugh, the LRU handling in this series is definitely not all there yet:
we probably need to track (in page->dma_pinned_flags) which LRU (if any) each
page was taken from.

It's hard to say exactly what the active/inactive/unevictable list should
be when DMA is done and put_user_page*() is called, because we don't know
if some device read, wrote, or ignored any of those pages. Although if 
put_user_pages_dirty() is called, that's an argument for "active", at least.

And maybe this will all be moot if the direct I/O performance test that
Christoph requested shows that the LRU operations are too expensive here anyway.
I wonder if we should just limit this to 64-bit arches and find a real
page flag...well, let's see what the testing shows first, I suppose.

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-17  1:48             ` John Hubbard
  (?)
@ 2018-10-17 11:09             ` Michal Hocko
  2018-10-18  0:03                 ` John Hubbard
  -1 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2018-10-17 11:09 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Dave Chinner, Matthew Wilcox, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, Andrew Morton, LKML,
	linux-rdma, linux-fsdevel

On Tue 16-10-18 18:48:23, John Hubbard wrote:
[...]
> It's hard to say exactly what the active/inactive/unevictable list should
> be when DMA is done and put_user_page*() is called, because we don't know
> if some device read, wrote, or ignored any of those pages. Although if 
> put_user_pages_dirty() is called, that's an argument for "active", at least.

Any reason to not use putback_lru_page?

Please note I haven't really got through your patches to have a wider
picture of the change so this is just hint for the LRU part of the
issue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-17 11:09             ` Michal Hocko
@ 2018-10-18  0:03                 ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-10-18  0:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, Dave Chinner, Matthew Wilcox, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, Andrew Morton, LKML,
	linux-rdma, linux-fsdevel

On 10/17/18 4:09 AM, Michal Hocko wrote:
> On Tue 16-10-18 18:48:23, John Hubbard wrote:
> [...]
>> It's hard to say exactly what the active/inactive/unevictable list should
>> be when DMA is done and put_user_page*() is called, because we don't know
>> if some device read, wrote, or ignored any of those pages. Although if 
>> put_user_pages_dirty() is called, that's an argument for "active", at least.
> 
> Any reason to not use putback_lru_page?

That does help with which LRU to use. I guess I'd still need to track whether
a page was on an LRU when get_user_pages() was called, because it seems
that that is not necessarily always the case. And putback_lru_page() definitely
wants to deal with a page that *was* previously on an LRU.
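
So probably something along these lines (a rough sketch only: PAGE_DMA_PINNED_WAS_LRU
is a made-up name, and finding a spare bit for it in dma_pinned_flags is its own
problem):

	/* pin side, in pin_page_for_dma(): */
	unsigned long flags = PAGE_DMA_PINNED;

	if (!isolate_lru_page(page))	/* 0 means the page really was on an LRU */
		flags |= PAGE_DMA_PINNED_WAS_LRU;

	atomic_set(&page->dma_pinned_count, 1);
	WRITE_ONCE(page->dma_pinned_flags, flags);

	/* unpin side, in put_user_page(), once dma_pinned_count drops to zero: */
	bool was_lru = READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_WAS_LRU;

	ClearPageDmaPinned(page);	/* restores page->lru */
	if (was_lru)
		putback_lru_page(page);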

> 
> Please note I haven't really got through your patches to have a wider
> picture of the change so this is just hint for the LRU part of the
> issue.
> 

Understood, and the hints are much appreciated.

-- 
thanks,
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-18  0:03                 ` John Hubbard
  (?)
@ 2018-10-19  8:11                 ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2018-10-19  8:11 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Dave Chinner, Matthew Wilcox, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, Andrew Morton, LKML,
	linux-rdma, linux-fsdevel

On Wed 17-10-18 17:03:03, John Hubbard wrote:
> On 10/17/18 4:09 AM, Michal Hocko wrote:
> > On Tue 16-10-18 18:48:23, John Hubbard wrote:
> > [...]
> >> It's hard to say exactly what the active/inactive/unevictable list should
> >> be when DMA is done and put_user_page*() is called, because we don't know
> >> if some device read, wrote, or ignored any of those pages. Although if 
> >> put_user_pages_dirty() is called, that's an argument for "active", at least.
> > 
> > Any reason to not use putback_lru_page?
> 
> That does help with which LRU to use. I guess I'd still need to track whether
> a page was on an LRU when get_user_pages() was called, because it seems
> that that is not necessarily always the case. And putback_lru_page() definitely
> wants to deal with a page that *was* previously on an LRU.

Well, if you ever g-u-p pages which are never going to go to LRU then
sure (e.g. hugetlb pages).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13  0:15       ` John Hubbard
  (?)
@ 2018-10-24 11:00       ` Balbir Singh
  2018-11-02 23:27           ` John Hubbard
  -1 siblings, 1 reply; 46+ messages in thread
From: Balbir Singh @ 2018-10-24 11:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote:
> On 10/12/18 3:56 AM, Balbir Singh wrote:
> > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> >> From: John Hubbard <jhubbard@nvidia.com>
> [...]
> >> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
> >> + * uses these flags must NOT be on an LRU. That's partly enforced by
> >> + * ClearPageDmaPinned, which gives the page back to LRU.
> >> + *
> >> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
> >> + * of struct page), and this flag is checked without knowing whether it is a
> >> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
> >> + * rather than bit 0.
> >> + */
> >> +#define PAGE_DMA_PINNED		0x2
> >> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
> >> +
> > 
> > This is really subtle, additional changes to compound_head will need to coordinate
> > with these flags? Also doesn't this bit need to be unique across all structs in
> > the union? I guess that is guaranteed by the fact that page == compound_head(page)
> > as per your assertion, but I've forgotten why that is true. Could you please
> > add some commentary on that
> > 
> 
> Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
> PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
> to even have it). So now it looks like this:
> 
> /*
>  * Because page->dma_pinned_flags is unioned with page->lru, any page that
>  * uses these flags must NOT be on an LRU. That's partly enforced by
>  * ClearPageDmaPinned, which gives the page back to LRU.
>  *
>  * PageDmaPinned is checked without knowing whether it is a tail page or a
>  * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
>  * bit in the first union of struct page), and instead uses bit 1 (0x2),
>  * rather than bit 0.
>  *
>  * PageDmaPinned can only be used if no other systems are using the same bit
>  * across the first struct page union. In this regard, it is similar to
>  * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
>  * alone, bit 1 is also left alone so far: other union elements (ignoring tail
>  * pages) put pointers there, and pointer alignment leaves the lower two bits
>  * available.
>  *
>  * So, constraints include:
>  *
>  *     -- Only use PageDmaPinned on non-tail pages.
>  *     -- Remove the page from any LRU list first.
>  */
> #define PAGE_DMA_PINNED		0x2
> 
> /*
>  * Because these flags are read outside of a lock, ensure visibility between
>  * different threads, by using READ|WRITE_ONCE.
>  */
> static __always_inline int PageDmaPinned(struct page *page)
> {
> 	VM_BUG_ON(page != compound_head(page));
> 	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
> }
> 
> [...]
> >> +static __always_inline void SetPageDmaPinned(struct page *page)
> >> +{
> >> +	VM_BUG_ON(page != compound_head(page));
> > 
> > VM_BUG_ON(!list_empty(&page->lru))
> 
> 
> There is only one place where we set this flag, and that is when (in patch 6/6)
> transitioning from a page that might (or might not) have been
> on an LRU. In that case, the calling code has already corrupted page->lru, by
> writing to page->dma_pinned_count, which is unioned with page->lru:
> 
> 		atomic_set(&page->dma_pinned_count, 1);
> 		SetPageDmaPinned(page);
> 
> ...so it would be inappropriate to call a list function, such as 
> list_empty(), on that field.  Let's just leave it as-is.
> 
> 
> > 
> >> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
> >> +}
> >> +
> >> +static __always_inline void ClearPageDmaPinned(struct page *page)
> >> +{
> >> +	VM_BUG_ON(page != compound_head(page));
> >> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
> >> +
> >> +	/* This does a WRITE_ONCE to the lru.next, which is also the
> >> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
> >> +	 * this provides visibility to other threads.
> >> +	 */
> >> +	INIT_LIST_HEAD(&page->lru);
> > 
> > This assumes certain things about list_head, why not use the correct
> > initialization bits.
> > 
> 
> Yes, OK, changed to:
> 
> static __always_inline void ClearPageDmaPinned(struct page *page)
> {
> 	VM_BUG_ON(page != compound_head(page));
> 	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
> 
> 	/* Provide visibility to other threads: */
> 	WRITE_ONCE(page->dma_pinned_flags, 0);
> 
> 	/*
> 	 * Safety precaution: restore the list head, before possibly returning
> 	 * the page to other subsystems.
> 	 */
> 	INIT_LIST_HEAD(&page->lru);
> }
> 
>

Sorry, I've been distracted with other things

This looks better, do we still need the INIT_LIST_HEAD?

Balbir Singh.
 
> 
> -- 
> thanks,
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-24 11:00       ` Balbir Singh
@ 2018-11-02 23:27           ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-11-02 23:27 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/24/18 4:00 AM, Balbir Singh wrote:
> On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote:
>> On 10/12/18 3:56 AM, Balbir Singh wrote:
>>> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>> [...]
>>>> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
>>>> + * uses these flags must NOT be on an LRU. That's partly enforced by
>>>> + * ClearPageDmaPinned, which gives the page back to LRU.
>>>> + *
>>>> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
>>>> + * of struct page), and this flag is checked without knowing whether it is a
>>>> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
>>>> + * rather than bit 0.
>>>> + */
>>>> +#define PAGE_DMA_PINNED		0x2
>>>> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
>>>> +
>>>
>>> This is really subtle, additional changes to compound_head will need to coordinate
>>> with these flags? Also doesn't this bit need to be unique across all structs in
>>> the union? I guess that is guaranteed by the fact that page == compound_head(page)
>>> as per your assertion, but I've forgotten why that is true. Could you please
>>> add some commentary on that
>>>
>>
>> Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
>> PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
>> to even have it). So now it looks like this:
>>
>> /*
>>  * Because page->dma_pinned_flags is unioned with page->lru, any page that
>>  * uses these flags must NOT be on an LRU. That's partly enforced by
>>  * ClearPageDmaPinned, which gives the page back to LRU.
>>  *
>>  * PageDmaPinned is checked without knowing whether it is a tail page or a
>>  * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
>>  * bit in the first union of struct page), and instead uses bit 1 (0x2),
>>  * rather than bit 0.
>>  *
>>  * PageDmaPinned can only be used if no other systems are using the same bit
>>  * across the first struct page union. In this regard, it is similar to
>>  * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
>>  * alone, bit 1 is also left alone so far: other union elements (ignoring tail
>>  * pages) put pointers there, and pointer alignment leaves the lower two bits
>>  * available.
>>  *
>>  * So, constraints include:
>>  *
>>  *     -- Only use PageDmaPinned on non-tail pages.
>>  *     -- Remove the page from any LRU list first.
>>  */
>> #define PAGE_DMA_PINNED		0x2
>>
>> /*
>>  * Because these flags are read outside of a lock, ensure visibility between
>>  * different threads, by using READ|WRITE_ONCE.
>>  */
>> static __always_inline int PageDmaPinned(struct page *page)
>> {
>> 	VM_BUG_ON(page != compound_head(page));
>> 	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
>> }
>>
>> [...]
>>>> +static __always_inline void SetPageDmaPinned(struct page *page)
>>>> +{
>>>> +	VM_BUG_ON(page != compound_head(page));
>>>
>>> VM_BUG_ON(!list_empty(&page->lru))
>>
>>
>> There is only one place where we set this flag, and that is when (in patch 6/6)
>> transitioning from a page that might (or might not) have been
>> on an LRU. In that case, the calling code has already corrupted page->lru, by
>> writing to page->dma_pinned_count, which is unioned with page->lru:
>>
>> 		atomic_set(&page->dma_pinned_count, 1);
>> 		SetPageDmaPinned(page);
>>
>> ...so it would be inappropriate to call a list function, such as 
>> list_empty(), on that field.  Let's just leave it as-is.
>>
>>
>>>
>>>> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
>>>> +}
>>>> +
>>>> +static __always_inline void ClearPageDmaPinned(struct page *page)
>>>> +{
>>>> +	VM_BUG_ON(page != compound_head(page));
>>>> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
>>>> +
>>>> +	/* This does a WRITE_ONCE to the lru.next, which is also the
>>>> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
>>>> +	 * this provides visibility to other threads.
>>>> +	 */
>>>> +	INIT_LIST_HEAD(&page->lru);
>>>
>>> This assumes certain things about list_head, why not use the correct
>>> initialization bits.
>>>
>>
>> Yes, OK, changed to:
>>
>> static __always_inline void ClearPageDmaPinned(struct page *page)
>> {
>> 	VM_BUG_ON(page != compound_head(page));
>> 	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
>>
>> 	/* Provide visibility to other threads: */
>> 	WRITE_ONCE(page->dma_pinned_flags, 0);
>>
>> 	/*
>> 	 * Safety precaution: restore the list head, before possibly returning
>> 	 * the page to other subsystems.
>> 	 */
>> 	INIT_LIST_HEAD(&page->lru);
>> }
>>
>>
> 
> Sorry, I've been distracted with other things
> 
> This looks better, do we still need the INIT_LIST_HEAD?
> 

Good point. I guess not. I was getting a little too fancy, and it's better
for ClearPageDmaPinned to be true to its name and just do that.
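
So, something like this (just a sketch of where I expect the next version to
end up):

static __always_inline void ClearPageDmaPinned(struct page *page)
{
	VM_BUG_ON(page != compound_head(page));
	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);

	/* Provide visibility to other threads: */
	WRITE_ONCE(page->dma_pinned_flags, 0);
}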

(Sorry for the delayed response.)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-10-13 16:47       ` Christoph Hellwig
@ 2018-11-05  7:10           ` John Hubbard
  2018-11-05  7:10           ` John Hubbard
  1 sibling, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-11-05  7:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, Andrew Morton,
	LKML, linux-rdma, linux-fsdevel

On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>> either of the page->dma_pinned_* fields. 
>>
>> The idea is that LRU is not especially useful for this situation anyway,
>> so we'll just make it one or the other: either a page is dma-pinned, and
>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>> time), or it's possibly on an LRU list.
> 
> Have you done any benchmarking what this does to direct I/O performance,
> especially for small I/O directly to a (fast) block device?
> 

Hi Christoph,

I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B,
on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete 
and buggy yet, but I've got enough of them done to briefly run the test.

One thing that occurs to me is that jumping on and off the LRU takes time, and
if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
know that leaves 32-bit out in the cold, but...maybe use this slower approach
for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
by 20%. 

Test program is below. I hope I didn't overlook something obvious, but it's 
definitely possible, given my lack of experience with direct IO. 

I'm preparing to send an updated RFC this week, that contains the feedback to date,
and also many converted call sites as well, so that everyone can see what the whole
(proposed) story would look like in its latest incarnation.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

const static unsigned BUF_SIZE       = 4096;
static const unsigned FULL_DATA_SIZE = 2 * BUF_SIZE;

void read_from_file(int fd, size_t how_much, char * buf)
{
	size_t bytes_read;

	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		bytes_read = read(fd, buf, BUF_SIZE);
		if (bytes_read != BUF_SIZE) {
			printf("reading file failed: %m\n");
			exit(3);
		}
	}
}

void seek_to_start(int fd, char *caller)
{
	off_t result = lseek(fd, 0, SEEK_SET);
	if (result == -1) {
		printf("%s: lseek failed: %m\n", caller);
		exit(4);
	}
}

void write_to_file(int fd, size_t how_much, char * buf)
{
	int result;
	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		result = write(fd, buf, BUF_SIZE);
		if (result < 0) {
			printf("writing file failed: %m\n");
			exit(3);
		}
	}
}

void read_and_write(int fd, size_t how_much, char * buf)
{
	seek_to_start(fd, "About to read");
	read_from_file(fd, how_much, buf);

	memset(buf, 'a', BUF_SIZE);

	seek_to_start(fd, "About to write");
	write_to_file(fd, how_much, buf);
}

int main(int argc, char *argv[])
{
	void *buf;
	/*
	 * O_DIRECT requires at least 512 B alignment, but runs faster
	 * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
	 */
	unsigned align = 4096;
	posix_memalign(&buf, align, BUF_SIZE );

	if (argc < 3) {
		printf("Usage: %s <filename> <iterations>\n", argv[0]);
		return 1;
	}
	char *filename = argv[1];
	unsigned iterations = strtoul(argv[2], 0, 0);

	/* Not using O_SYNC for now, anyway. */
	int fd = open(filename, O_DIRECT | O_RDWR);
	if (fd < 0) {
		printf("Failed to open %s: %m\n", filename);
		return 2;
	}

	printf("File: %s, data size: %u, interations: %u\n",
		       filename, FULL_DATA_SIZE, iterations);

	for (int count = 0; count < iterations; count++) {
		read_and_write(fd, FULL_DATA_SIZE, buf);
	}

	close(fd);
	return 0;
}
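
For reference, I build and run it along these lines (names and the iteration
count here are just examples; the target file has to already exist, and be at
least 8 KB, on the NVMe filesystem, because the program opens it without
O_CREAT):

    gcc -O2 -o dio_test dio_test.c
    ./dio_test /mnt/nvme/testfile 100000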


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-05  7:10           ` John Hubbard
  (?)
@ 2018-11-05  9:54           ` Jan Kara
  2018-11-06  0:26               ` John Hubbard
  -1 siblings, 1 reply; 46+ messages in thread
From: Jan Kara @ 2018-11-05  9:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Dave Chinner, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, Jan Kara,
	linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel

On Sun 04-11-18 23:10:12, John Hubbard wrote:
> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> >> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> >> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> >> either of the page->dma_pinned_* fields. 
> >>
> >> The idea is that LRU is not especially useful for this situation anyway,
> >> so we'll just make it one or the other: either a page is dma-pinned, and
> >> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> >> time), or it's possibly on an LRU list.
> > 
> > Have you done any benchmarking what this does to direct I/O performance,
> > especially for small I/O directly to a (fast) block device?
> > 
> 
> Hi Christoph,
> 
> I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B,
> on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete 
> and buggy yet, but I've got enough of them done to briefly run the test.
> 
> One thing that occurs to me is that jumping on and off the LRU takes time, and
> if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
> know that leaves 32-bit out in the cold, but...maybe use this slower approach
> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
> by 20%. 
> 
> Test program is below. I hope I didn't overlook something obvious, but it's 
> definitely possible, given my lack of experience with direct IO. 
> 
> I'm preparing to send an updated RFC this week, that contains the feedback to date,
> and also many converted call sites as well, so that everyone can see what the whole
> (proposed) story would look like in its latest incarnation.

Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
going to max-out NVME iops by far. Can I suggest you install fio [1] (it
has the advantage that it is pretty much standard for a test like this so
everyone knows what the test does from a glimpse) and run with it something
like the following workfile:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

And see how the numbers with and without your patches compare?
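
(Just drop that into a file, say reader.fio, cd to a directory on the filesystem
backed by the NVMe device, and run "fio reader.fio". With no filename= option,
fio creates its own 1g test file in the current directory.)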

								Honza

[1] https://github.com/axboe/fio


> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdbool.h>
> #include <string.h>
> 
> const static unsigned BUF_SIZE       = 4096;
> static const unsigned FULL_DATA_SIZE = 2 * BUF_SIZE;
> 
> void read_from_file(int fd, size_t how_much, char * buf)
> {
> 	size_t bytes_read;
> 
> 	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
> 		bytes_read = read(fd, buf, BUF_SIZE);
> 		if (bytes_read != BUF_SIZE) {
> 			printf("reading file failed: %m\n");
> 			exit(3);
> 		}
> 	}
> }
> 
> void seek_to_start(int fd, char *caller)
> {
> 	off_t result = lseek(fd, 0, SEEK_SET);
> 	if (result == -1) {
> 		printf("%s: lseek failed: %m\n", caller);
> 		exit(4);
> 	}
> }
> 
> void write_to_file(int fd, size_t how_much, char * buf)
> {
> 	int result;
> 	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
> 		result = write(fd, buf, BUF_SIZE);
> 		if (result < 0) {
> 			printf("writing file failed: %m\n");
> 			exit(3);
> 		}
> 	}
> }
> 
> void read_and_write(int fd, size_t how_much, char * buf)
> {
> 	seek_to_start(fd, "About to read");
> 	read_from_file(fd, how_much, buf);
> 
> 	memset(buf, 'a', BUF_SIZE);
> 
> 	seek_to_start(fd, "About to write");
> 	write_to_file(fd, how_much, buf);
> }
> 
> int main(int argc, char *argv[])
> {
> 	void *buf;
> 	/*
> 	 * O_DIRECT requires at least 512 B alignment, but runs faster
> 	 * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
> 	 */
> 	unsigned align = 4096;
> 	posix_memalign(&buf, align, BUF_SIZE );
> 
> 	if (argc < 3) {
> 		printf("Usage: %s <filename> <iterations>\n", argv[0]);
> 		return 1;
> 	}
> 	char *filename = argv[1];
> 	unsigned iterations = strtoul(argv[2], 0, 0);
> 
> 	/* Not using O_SYNC for now, anyway. */
> 	int fd = open(filename, O_DIRECT | O_RDWR);
> 	if (fd < 0) {
> 		printf("Failed to open %s: %m\n", filename);
> 		return 2;
> 	}
> 
> 	printf("File: %s, data size: %u, interations: %u\n",
> 		       filename, FULL_DATA_SIZE, iterations);
> 
> 	for (int count = 0; count < iterations; count++) {
> 		read_and_write(fd, FULL_DATA_SIZE, buf);
> 	}
> 
> 	close(fd);
> 	return 0;
> }
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-05  9:54           ` Jan Kara
@ 2018-11-06  0:26               ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-11-06  0:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	Andrew Morton, LKML, linux-rdma, linux-fsdevel

On 11/5/18 1:54 AM, Jan Kara wrote:
> On Sun 04-11-18 23:10:12, John Hubbard wrote:
>> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
>>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>>>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>>>> either of the page->dma_pinned_* fields. 
>>>>
>>>> The idea is that LRU is not especially useful for this situation anyway,
>>>> so we'll just make it one or the other: either a page is dma-pinned, and
>>>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>>>> time), or it's possibly on an LRU list.
>>>
>>> Have you done any benchmarking what this does to direct I/O performance,
>>> especially for small I/O directly to a (fast) block device?
>>>
>>
>> Hi Christoph,
>>
>> I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B,
>> on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete 
>> and buggy yet, but I've got enough of them done to briefly run the test.
>>
>> One thing that occurs to me is that jumping on and off the LRU takes time, and
>> if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
>> know that leaves 32-bit out in the cold, but...maybe use this slower approach
>> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
>> by 20%. 
>>
>> Test program is below. I hope I didn't overlook something obvious, but it's 
>> definitely possible, given my lack of experience with direct IO. 
>>
>> I'm preparing to send an updated RFC this week, that contains the feedback to date,
>> and also many converted call sites as well, so that everyone can see what the whole
>> (proposed) story would look like in its latest incarnation.
> 
> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> has the advantage that it is pretty much standard for a test like this so
> everyone knows what the test does from a glimpse) and run with it something
> like the following workfile:
> 
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
> 
> And see how the numbers with and without your patches compare?
> 
> 								Honza
> 
> [1] https://github.com/axboe/fio

That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
is approximately 74 MiB/s with my patch (it varies a bit, run to run),
as compared to around 85 without the patch, so still showing about a 20%
performance degradation, assuming I'm reading this correctly.

Raw data follows, using the fio options you listed above:

Baseline (without my patch):
---------------------------- 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=87.2MiB/s,w=0KiB/s][r=22.3k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1775: Mon Nov  5 12:08:45 2018
   read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(1024MiB/11945msec)
    slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
    clat (usec): min=71, max=13093, avg=2869.40, stdev=1225.23
     lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
    clat percentiles (usec):
     |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
     | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
     | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
     | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
     | 99.99th=[12256]
   bw (  KiB/s): min=80648, max=93288, per=99.80%, avg=87608.57, stdev=3201.35, samples=23
   iops        : min=20162, max=23322, avg=21902.09, stdev=800.37, samples=23
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=88.47%, 10=11.27%, 20=0.25%
  cpu          : usr=2.68%, sys=94.68%, ctx=408, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=85.7MiB/s (89.9MB/s), 85.7MiB/s-85.7MiB/s (89.9MB/s-89.9MB/s), io=1024MiB (1074MB), run=11945-11945msec

Disk stats (read/write):
  nvme0n1: ios=260906/3, merge=0/1, ticks=14618/4, in_queue=17670, util=100.00%

Modified (with my patch):
---------------------------- 

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=74.1MiB/s,w=0KiB/s][r=18.0k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1808: Mon Nov  5 16:11:09 2018
   read: IOPS=18.3k, BW=71.4MiB/s (74.9MB/s)(1024MiB/14334msec)
    slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
    clat (usec): min=31, max=15622, avg=3443.86, stdev=1431.27
     lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
    clat percentiles (usec):
     |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
     | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
     | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
     | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
     | 99.99th=[14484]
   bw (  KiB/s): min=63142, max=77464, per=99.97%, avg=73128.46, stdev=3383.81, samples=28
   iops        : min=15785, max=19366, avg=18281.96, stdev=845.95, samples=28
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=84.77%, 10=14.12%, 20=1.09%
  cpu          : usr=2.20%, sys=95.72%, ctx=360, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=71.4MiB/s (74.9MB/s), 71.4MiB/s-71.4MiB/s (74.9MB/s-74.9MB/s), io=1024MiB (1074MB), run=14334-14334msec

Disk stats (read/write):
  nvme0n1: ios=258235/3, merge=0/1, ticks=12583/10, in_queue=14779, util=100.00%



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-06  0:26               ` John Hubbard
  (?)
@ 2018-11-06  2:47               ` Dave Chinner
  2018-11-06 11:00                 ` Jan Kara
  -1 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2018-11-06  2:47 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	Andrew Morton, LKML, linux-rdma, linux-fsdevel

On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> On 11/5/18 1:54 AM, Jan Kara wrote:
> > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > has the advantage that it is pretty much standard for a test like this so
> > everyone knows what the test does from a glimpse) and run with it something
> > like the following workfile:
> > 
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> > 
> > And see how the numbers with and without your patches compare?
> > 
> > 								Honza
> > 
> > [1] https://github.com/axboe/fio
> 
> That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> as compared to around 85 without the patch, so still showing about a 20%
> performance degradation, assuming I'm reading this correctly.
> 
> Raw data follows, using the fio options you listed above:
> 
> Baseline (without my patch):
> ---------------------------- 
....
>      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>     clat percentiles (usec):
>      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>      | 99.99th=[12256]
.....
> Modified (with my patch):
> ---------------------------- 
.....
>      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>     clat percentiles (usec):
>      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>      | 99.99th=[14484]

So it's adding at least 500us of completion latency to every IO?
I'd argue that the IO latency impact is far worse than the 20%
throughput drop.

i.e. You can make up for throughput drops by running a deeper
queue/more dispatch threads, but you can't reduce IO latency at
all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-06  2:47               ` Dave Chinner
@ 2018-11-06 11:00                 ` Jan Kara
  2018-11-06 20:41                   ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Kara @ 2018-11-06 11:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Hubbard, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Michal Hocko, Christopher Lameter, Jason Gunthorpe, Dan Williams,
	linux-mm, Andrew Morton, LKML, linux-rdma, linux-fsdevel

On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > has the advantage that it is pretty much standard for a test like this so
> > > everyone knows what the test does from a glimpse) and run with it something
> > > like the following workfile:
> > > 
> > > [reader]
> > > direct=1
> > > ioengine=libaio
> > > blocksize=4096
> > > size=1g
> > > numjobs=1
> > > rw=read
> > > iodepth=64
> > > 
> > > And see how the numbers with and without your patches compare?
> > > 
> > > 								Honza
> > > 
> > > [1] https://github.com/axboe/fio
> > 
> > That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> > is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> > as compared to around 85 without the patch, so still showing about a 20%
> > performance degradation, assuming I'm reading this correctly.
> > 
> > Raw data follows, using the fio options you listed above:
> > 
> > Baseline (without my patch):
> > ---------------------------- 
> ....
> >      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> >     clat percentiles (usec):
> >      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> >      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> >      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> >      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> >      | 99.99th=[12256]
> .....
> > Modified (with my patch):
> > ---------------------------- 
> .....
> >      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> >     clat percentiles (usec):
> >      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> >      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> >      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> >      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> >      | 99.99th=[14484]
> 
> So it's adding at least 500us of completion latency to every IO?
> I'd argue that the IO latency impact is far worse than the 20%
> throughput drop.

Hum, right. So for each IO we have to remove the page from LRU on submit
and then put it back on IO completion (which is going to race with new
submits so LRU lock contention might be an issue). Spending 500 us on that
is not unthinkable when the lock is contended but it is more expensive than
I'd have thought. John, could you perhaps profile where the time is spent?
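
For concreteness, the per-IO page handling being discussed looks roughly
like this (an illustrative sketch only, not the actual patch code: the
names pin_page_for_dma(), put_user_page() and the PageDmaPinned flag are
taken from this patch series, but the function bodies here are guesses):

/*
 * Rough sketch, for discussion only.  Submission side: take the page off
 * the LRU so reclaim leaves it alone while DMA may be in flight.
 * isolate_lru_page() takes the LRU lock and also takes a page reference.
 */
static void pin_page_for_dma(struct page *page)
{
	if (!isolate_lru_page(page))	/* returns 0 on success */
		SetPageDmaPinned(page);
}

/*
 * Completion side: put the page back on the LRU (taking the same LRU
 * lock again) and drop the gup reference.  putback_lru_page() releases
 * the reference that isolate_lru_page() took.
 */
void put_user_page(struct page *page)
{
	if (PageDmaPinned(page)) {
		ClearPageDmaPinned(page);
		putback_lru_page(page);
	}
	put_page(page);
}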

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-06 11:00                 ` Jan Kara
@ 2018-11-06 20:41                   ` Dave Chinner
  2018-11-07  6:36                       ` John Hubbard
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2018-11-06 20:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Christoph Hellwig, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	Andrew Morton, LKML, linux-rdma, linux-fsdevel

On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> > On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > > has the advantage that it is pretty much standard for a test like this so
> > > > everyone knows what the test does from a glimpse) and run with it something
> > > > like the following workfile:
> > > > 
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > > 
> > > > And see how the numbers with and without your patches compare?
> > > > 
> > > > 								Honza
> > > > 
> > > > [1] https://github.com/axboe/fio
> > > 
> > > That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> > > is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> > > as compared to around 85 without the patch, so still showing about a 20%
> > > performance degradation, assuming I'm reading this correctly.
> > > 
> > > Raw data follows, using the fio options you listed above:
> > > 
> > > Baseline (without my patch):
> > > ---------------------------- 
> > ....
> > >      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> > >     clat percentiles (usec):
> > >      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> > >      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> > >      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> > >      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> > >      | 99.99th=[12256]
> > .....
> > > Modified (with my patch):
> > > ---------------------------- 
> > .....
> > >      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> > >     clat percentiles (usec):
> > >      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> > >      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> > >      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> > >      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> > >      | 99.99th=[14484]
> > 
> > So it's adding at least 500us of completion latency to every IO?
> > I'd argue that the IO latency impact is far worse than the 20%
> > throughput drop.
> 
> Hum, right. So for each IO we have to remove the page from LRU on submit

Which costs us less than 10us on average:

	slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
vs
	slat (usec): min=18, max=4378, avg=52.59, stdev=63.66

> and then put it back on IO completion (which is going to race with new
> submits so LRU lock contention might be an issue).

Removal has to take the same LRU lock, so I don't think contention
is the problem here. More likely the overhead is in selecting the
LRU to put it on. e.g. list_lru_from_kmem() which may well be doing
a memcg lookup.

> Spending 500 us on that
> is not unthinkable when the lock is contended but it is more expensive than
> I'd have thought. John, could you perhaps profile where the time is spent?

That'll tell us for sure :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
  2018-11-06 20:41                   ` Dave Chinner
@ 2018-11-07  6:36                       ` John Hubbard
  0 siblings, 0 replies; 46+ messages in thread
From: John Hubbard @ 2018-11-07  6:36 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Christoph Hellwig, Matthew Wilcox, Michal Hocko,
	Christopher Lameter, Jason Gunthorpe, Dan Williams, linux-mm,
	Andrew Morton, LKML, linux-rdma, linux-fsdevel

On 11/6/18 12:41 PM, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
>> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
>>> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
>>>> On 11/5/18 1:54 AM, Jan Kara wrote:
>>>>> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
>>>>> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
>>>>> has the advantage that it is pretty much standard for a test like this so
>>>>> everyone knows what the test does from a glimpse) and run with it something
>>>>> like the following workfile:
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>>
>>>>> And see how the numbers with and without your patches compare?
>>>>>
>>>>> 								Honza
>>>>>
>>>>> [1] https://github.com/axboe/fio
>>>>
>>>> That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
>>>> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
>>>> as compared to around 85 without the patch, so still showing about a 20%
>>>> performance degradation, assuming I'm reading this correctly.
>>>>
>>>> Raw data follows, using the fio options you listed above:
>>>>
>>>> Baseline (without my patch):
>>>> ---------------------------- 
>>> ....
>>>>      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>>>>      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>>>>      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>>>>      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>>>>      | 99.99th=[12256]
>>> .....
>>>> Modified (with my patch):
>>>> ---------------------------- 
>>> .....
>>>>      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>>>>      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>>>>      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>>>>      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>>>>      | 99.99th=[14484]
>>>
>>> So it's adding at least 500us of completion latency to every IO?
>>> I'd argue that the IO latency impact is far worse than the 20%
>>> throughput drop.
>>
>> Hum, right. So for each IO we have to remove the page from LRU on submit
> 
> Which costs us less than 10us on average:
> 
> 	slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
> vs
> 	slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
> 
>> and then put it back on IO completion (which is going to race with new
>> submits so LRU lock contention might be an issue).
> 
> Removal has to take the same LRU lock, so I don't think contention
> is the problem here. More likely the overhead is in selecting the
> LRU to put it on. e.g. list_lru_from_kmem() which may well be doing
> a memcg lookup.
> 
>> Spending 500 us on that
>> is not unthinkable when the lock is contended but it is more expensive than
>> I'd have thought. John, could you perhaps profile where the time is spent?
> 

OK, some updates on that:

1. First of all, I fixed a direct-io-related call site (it was still calling put_page()
instead of put_user_page()), and that not only got rid of a problem, it also changed the
performance picture: the impact of the patch is now a bit smaller. (Sorry about that!
I was hoping that I could get away with temporarily ignoring that failure, but no.)
The bandwidth numbers in particular now look much closer to each other. The general
shape of that kind of conversion is sketched after the profile data below.

2. Second, note that these fio results are noisy. The standard deviation is large
enough that some of this could be noise. To highlight that, I did 5 runs each, with
and without the patch, and while there is definitely a performance drop on average,
there is also overlap in the results. In other words, the best "with patch" run is
about the same as the worst "without patch" run.

3. Finally, initial profiling shows that we're adding about 1% total to this
particular test run...I think. I may have to narrow this down some more, but I don't
seem to see any real lock contention. Hints or ideas on measurement methods are
welcome, btw.

    -- 0.59% in put_user_page
    -- 0.2% (or 0.54%, depending on how you read the perf output below) in
       get_user_pages_fast:


          1.36%--iov_iter_get_pages
                    |
                     --1.27%--get_user_pages_fast
                               |
                                --0.54%--pin_page_for_dma

          0.59%--put_user_page

          1.34%     0.03%  fio   [kernel.vmlinux]     [k] _raw_spin_lock
          0.95%     0.55%  fio   [kernel.vmlinux]     [k] do_raw_spin_lock
          0.17%     0.03%  fio   [kernel.vmlinux]     [k] isolate_lru_page
          0.06%     0.00%  fio   [kernel.vmlinux]     [k] putback_lru_page
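
As mentioned in (1), here is the general shape of that call-site conversion
(illustrative only; release_user_pages() is a made-up helper name, not the
actual direct-io code that was fixed): pages obtained via get_user_pages*()
have to be released with put_user_page() rather than put_page(), so that
the dma-pinned tracking stays balanced.

/* Illustrative sketch, not the real direct-io call site. */
static void release_user_pages(struct page **pages, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		/* was: put_page(pages[i]); */
		put_user_page(pages[i]);
	}
}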

4. Here's some raw fio data: one run without the patch, and two with the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
   read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
    slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
    clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
     lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
    clat percentiles (usec):
     |  1.00th=[ 3687],  5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
     | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
     | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
     | 99.99th=[18220]
   bw (  KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
   iops        : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
  cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
  nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov  6 20:03:50 2018
   read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
    slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
    clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
     lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
    clat percentiles (usec):
     |  1.00th=[ 4047],  5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
     | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
     | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
     | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
     | 99.99th=[19006]
   bw (  KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
   iops        : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
  cpu          : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
  nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
as the "without" case:
------------------------------------------------------

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
   read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
    slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
    clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
     lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
    clat percentiles (usec):
     |  1.00th=[ 3884],  5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
     | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
     | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
     | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
     | 99.99th=[24249]
   bw (  KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
   iops        : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
  cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
  nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 46+ messages in thread


end of thread, other threads:[~2018-11-07  6:36 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-12  6:00 [PATCH 0/6] RFC: gup+dma: tracking dma-pinned pages john.hubbard
2018-10-12  6:00 ` [PATCH 1/6] mm: get_user_pages: consolidate error handling john.hubbard
2018-10-12  6:30   ` Balbir Singh
2018-10-12 22:45     ` John Hubbard
2018-10-12 22:45       ` John Hubbard
2018-10-12  6:00 ` [PATCH 2/6] mm: introduce put_user_page*(), placeholder versions john.hubbard
2018-10-12  7:35   ` Balbir Singh
2018-10-12 22:31     ` John Hubbard
2018-10-12 22:31       ` John Hubbard
2018-10-12  6:00 ` [PATCH 3/6] infiniband/mm: convert put_page() to put_user_page*() john.hubbard
2018-10-12  6:00 ` [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
2018-10-12 10:56   ` Balbir Singh
2018-10-13  0:15     ` John Hubbard
2018-10-13  0:15       ` John Hubbard
2018-10-24 11:00       ` Balbir Singh
2018-11-02 23:27         ` John Hubbard
2018-11-02 23:27           ` John Hubbard
2018-10-13  3:55   ` Dave Chinner
2018-10-13  7:34     ` John Hubbard
2018-10-13  7:34       ` John Hubbard
2018-10-13 16:47       ` Christoph Hellwig
2018-10-13 21:19         ` John Hubbard
2018-10-13 21:19           ` John Hubbard
2018-11-05  7:10         ` John Hubbard
2018-11-05  7:10           ` John Hubbard
2018-11-05  9:54           ` Jan Kara
2018-11-06  0:26             ` John Hubbard
2018-11-06  0:26               ` John Hubbard
2018-11-06  2:47               ` Dave Chinner
2018-11-06 11:00                 ` Jan Kara
2018-11-06 20:41                   ` Dave Chinner
2018-11-07  6:36                     ` John Hubbard
2018-11-07  6:36                       ` John Hubbard
2018-10-13 23:01       ` Dave Chinner
2018-10-16  8:51         ` Jan Kara
2018-10-17  1:48           ` John Hubbard
2018-10-17  1:48             ` John Hubbard
2018-10-17 11:09             ` Michal Hocko
2018-10-18  0:03               ` John Hubbard
2018-10-18  0:03                 ` John Hubbard
2018-10-19  8:11                 ` Michal Hocko
2018-10-12  6:00 ` [PATCH 5/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
2018-10-12  6:00 ` [PATCH 6/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
2018-10-12 11:07   ` Balbir Singh
2018-10-13  0:33     ` John Hubbard
2018-10-13  0:33       ` John Hubbard
