LinuxPPC-Dev Archive on lore.kernel.org
 help / Atom feed
* [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
@ 2019-02-11 22:44 Daniel Jordan
  2019-02-11 22:44 ` [PATCH 1/5] vfio/type1: " Daniel Jordan
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Hi,

This series converts users that account pinned pages with locked_vm to
account with pinned_vm instead, pinned_vm being the correct counter to
use.  It's based on a similar patch I posted recently[0].

The patches are based on rdma/for-next to build on Davidlohr Bueso's
recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
sense for these to be routed the same way, despite lack of rdma content?

All five of these places, and probably some of Davidlohr's conversions,
probably want to be collapsed into a common helper in the core mm for
accounting pinned pages.  I tried, and there are several details that
likely need discussion, so this can be done as a follow-on.

I'd appreciate a look at patch 5 especially since the accounting is
unusual no matter whether locked_vm or pinned_vm are used.

On powerpc, this was cross-compile tested only.

[0] http://lkml.kernel.org/r/20181105165558.11698-8-daniel.m.jordan@oracle.com
[1] http://lkml.kernel.org/r/20190206175920.31082-1-dave@stgolabs.net

Daniel Jordan (5):
  vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned
    pages
  fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned
    pages
  powerpc/mmu: use pinned_vm instead of locked_vm to account pinned
    pages
  kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages

 Documentation/vfio.txt              |  6 +--
 arch/powerpc/kvm/book3s_64_vio.c    | 35 +++++++---------
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++++++++++---------
 drivers/fpga/dfl-afu-dma-region.c   | 50 +++++++++++-----------
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
 drivers/vfio/vfio_iommu_type1.c     | 31 ++++++--------
 6 files changed, 104 insertions(+), 125 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
@ 2019-02-11 22:44 ` " Daniel Jordan
  2019-02-11 22:56   ` Jason Gunthorpe
  2019-02-11 22:44 ` [PATCH 2/5] vfio/spapr_tce: " Daniel Jordan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  Type1
accounts pinned pages to locked_vm; use pinned_vm instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 73652e21efec..a56cc341813f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
 	struct mm_struct *mm;
-	int ret;
+	s64 pinned_vm;
+	int ret = 0;
 
 	if (!npage)
 		return 0;
@@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!mm)
 		return -ESRCH; /* process exited */
 
-	ret = down_write_killable(&mm->mmap_sem);
-	if (!ret) {
-		if (npage > 0) {
-			if (!dma->lock_cap) {
-				unsigned long limit;
-
-				limit = task_rlimit(dma->task,
-						RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
 
-				if (mm->locked_vm + npage > limit)
-					ret = -ENOMEM;
-			}
+	if (npage > 0 && !dma->lock_cap) {
+		unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
+								   PAGE_SHIFT;
+		if (pinned_vm > limit) {
+			atomic64_sub(npage, &mm->pinned_vm);
+			ret = -ENOMEM;
 		}
-
-		if (!ret)
-			mm->locked_vm += npage;
-
-		up_write(&mm->mmap_sem);
 	}
 
 	if (async)
@@ -401,6 +393,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	long ret, pinned = 0, lock_acct = 0;
 	bool rsvd;
 	dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
+	atomic64_t *pinned_vm = &current->mm->pinned_vm;
 
 	/* This code path is only user initiated */
 	if (!current->mm)
@@ -418,7 +411,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	 * pages are already counted against the user.
 	 */
 	if (!rsvd && !vfio_find_vpfn(dma, iova)) {
-		if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
+		if (!dma->lock_cap && atomic64_read(pinned_vm) + 1 > limit) {
 			put_pfn(*pfn_base, dma->prot);
 			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
 					limit << PAGE_SHIFT);
@@ -445,7 +438,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 		if (!rsvd && !vfio_find_vpfn(dma, iova)) {
 			if (!dma->lock_cap &&
-			    current->mm->locked_vm + lock_acct + 1 > limit) {
+			    atomic64_read(pinned_vm) + lock_acct + 1 > limit) {
 				put_pfn(pfn, dma->prot);
 				pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 					__func__, limit << PAGE_SHIFT);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
  2019-02-11 22:44 ` [PATCH 1/5] vfio/type1: " Daniel Jordan
@ 2019-02-11 22:44 ` " Daniel Jordan
  2019-02-12  6:56   ` Alexey Kardashevskiy
  2019-02-11 22:44 ` [PATCH 3/5] fpga/dlf/afu: " Daniel Jordan
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The SPAPR
TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/vfio.txt              |  6 +--
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
 2 files changed, 33 insertions(+), 37 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..fa37d65363f9 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -308,7 +308,7 @@ This implementation has some specifics:
    currently there is no way to reduce the number of calls. In order to make
    things faster, the map/unmap handling has been implemented in real mode
    which provides an excellent performance which has limitations such as
-   inability to do locked pages accounting in real time.
+   inability to do pinned pages accounting in real time.
 
 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
    subtree that can be treated as a unit for the purposes of partitioning and
@@ -324,7 +324,7 @@ This implementation has some specifics:
 		returns the size and the start of the DMA window on the PCI bus.
 
 	VFIO_IOMMU_ENABLE
-		enables the container. The locked pages accounting
+		enables the container. The pinned pages accounting
 		is done at this point. This lets user first to know what
 		the DMA window is and adjust rlimit before doing any real job.
 
@@ -454,7 +454,7 @@ This implementation has some specifics:
 
    PPC64 paravirtualized guests generate a lot of map/unmap requests,
    and the handling of those includes pinning/unpinning pages and updating
-   mm::locked_vm counter to make sure we do not exceed the rlimit.
+   mm::pinned_vm counter to make sure we do not exceed the rlimit.
    The v2 IOMMU splits accounting and pinning into separate operations:
 
    - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c424913324e3..f47e020dc5e4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -34,9 +34,11 @@
 static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(struct mm_struct *mm, long npages)
+static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
 {
-	long ret = 0, locked, lock_limit;
+	long ret = 0;
+	s64 pinned;
+	unsigned long lock_limit;
 
 	if (WARN_ON_ONCE(!mm))
 		return -EPERM;
@@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 	if (!npages)
 		return 0;
 
-	down_write(&mm->mmap_sem);
-	locked = mm->locked_vm + npages;
+	pinned = atomic64_add_return(npages, &mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
 		ret = -ENOMEM;
-	else
-		mm->locked_vm += npages;
+		atomic64_sub(npages, &mm->pinned_vm);
+	}
 
-	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
 			npages << PAGE_SHIFT,
-			mm->locked_vm << PAGE_SHIFT,
-			rlimit(RLIMIT_MEMLOCK),
-			ret ? " - exceeded" : "");
-
-	up_write(&mm->mmap_sem);
+			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
+			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
 	return ret;
 }
 
-static void decrement_locked_vm(struct mm_struct *mm, long npages)
+static void decrement_pinned_vm(struct mm_struct *mm, long npages)
 {
 	if (!mm || !npages)
 		return;
 
-	down_write(&mm->mmap_sem);
-	if (WARN_ON_ONCE(npages > mm->locked_vm))
-		npages = mm->locked_vm;
-	mm->locked_vm -= npages;
-	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
+		npages = atomic64_read(&mm->pinned_vm);
+	atomic64_sub(npages, &mm->pinned_vm);
+	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
 			npages << PAGE_SHIFT,
-			mm->locked_vm << PAGE_SHIFT,
+			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK));
-	up_write(&mm->mmap_sem);
 }
 
 /*
@@ -110,7 +106,7 @@ struct tce_container {
 	bool enabled;
 	bool v2;
 	bool def_window_pending;
-	unsigned long locked_pages;
+	unsigned long pinned_pages;
 	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
@@ -283,7 +279,7 @@ static int tce_iommu_find_free_table(struct tce_container *container)
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked;
+	unsigned long pinned;
 	struct iommu_table_group *table_group;
 	struct tce_iommu_group *tcegrp;
 
@@ -292,15 +288,15 @@ static int tce_iommu_enable(struct tce_container *container)
 
 	/*
 	 * When userspace pages are mapped into the IOMMU, they are effectively
-	 * locked memory, so, theoretically, we need to update the accounting
-	 * of locked pages on each map and unmap.  For powerpc, the map unmap
+	 * pinned memory, so, theoretically, we need to update the accounting
+	 * of pinned pages on each map and unmap.  For powerpc, the map unmap
 	 * paths can be very hot, though, and the accounting would kill
 	 * performance, especially since it would be difficult to impossible
 	 * to handle the accounting in real mode only.
 	 *
 	 * To address that, rather than precisely accounting every page, we
-	 * instead account for a worst case on locked memory when the iommu is
-	 * enabled and disabled.  The worst case upper bound on locked memory
+	 * instead account for a worst case on pinned memory when the iommu is
+	 * enabled and disabled.  The worst case upper bound on pinned memory
 	 * is the size of the whole iommu window, which is usually relatively
 	 * small (compared to total memory sizes) on POWER hardware.
 	 *
@@ -317,7 +313,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 *
 	 * So we do not allow enabling a container without a group attached
 	 * as there is no way to know how much we should increment
-	 * the locked_vm counter.
+	 * the pinned_vm counter.
 	 */
 	if (!tce_groups_attached(container))
 		return -ENODEV;
@@ -335,12 +331,12 @@ static int tce_iommu_enable(struct tce_container *container)
 	if (ret)
 		return ret;
 
-	locked = table_group->tce32_size >> PAGE_SHIFT;
-	ret = try_increment_locked_vm(container->mm, locked);
+	pinned = table_group->tce32_size >> PAGE_SHIFT;
+	ret = try_increment_pinned_vm(container->mm, pinned);
 	if (ret)
 		return ret;
 
-	container->locked_pages = locked;
+	container->pinned_pages = pinned;
 
 	container->enabled = true;
 
@@ -355,7 +351,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	container->enabled = false;
 
 	BUG_ON(!container->mm);
-	decrement_locked_vm(container->mm, container->locked_pages);
+	decrement_pinned_vm(container->mm, container->pinned_pages);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -658,7 +654,7 @@ static long tce_iommu_create_table(struct tce_container *container,
 	if (!table_size)
 		return -EINVAL;
 
-	ret = try_increment_locked_vm(container->mm, table_size >> PAGE_SHIFT);
+	ret = try_increment_pinned_vm(container->mm, table_size >> PAGE_SHIFT);
 	if (ret)
 		return ret;
 
@@ -677,7 +673,7 @@ static void tce_iommu_free_table(struct tce_container *container,
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
 	iommu_tce_table_put(tbl);
-	decrement_locked_vm(container->mm, pages);
+	decrement_pinned_vm(container->mm, pages);
 }
 
 static long tce_iommu_create_window(struct tce_container *container,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/5] fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
  2019-02-11 22:44 ` [PATCH 1/5] vfio/type1: " Daniel Jordan
  2019-02-11 22:44 ` [PATCH 2/5] vfio/spapr_tce: " Daniel Jordan
@ 2019-02-11 22:44 ` " Daniel Jordan
  2019-02-11 22:44 ` [PATCH 4/5] powerpc/mmu: " Daniel Jordan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The FPGA AFU
driver accounts pinned pages to locked_vm; use pinned_vm instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 drivers/fpga/dfl-afu-dma-region.c | 50 ++++++++++++++-----------------
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index e18a786fc943..a9a6b317fe2e 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -32,47 +32,43 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata)
 }
 
 /**
- * afu_dma_adjust_locked_vm - adjust locked memory
+ * afu_dma_adjust_pinned_vm - adjust pinned memory
  * @dev: port device
  * @npages: number of pages
- * @incr: increase or decrease locked memory
  *
- * Increase or decrease the locked memory size with npages input.
+ * Increase or decrease the pinned memory size with npages input.
  *
  * Return 0 on success.
- * Return -ENOMEM if locked memory size is over the limit and no CAP_IPC_LOCK.
+ * Return -ENOMEM if pinned memory size is over the limit and no CAP_IPC_LOCK.
  */
-static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr)
+static int afu_dma_adjust_pinned_vm(struct device *dev, long pages)
 {
-	unsigned long locked, lock_limit;
+	unsigned long lock_limit;
+	s64 pinned_vm;
 	int ret = 0;
 
 	/* the task is exiting. */
-	if (!current->mm)
+	if (!current->mm || !pages)
 		return 0;
 
-	down_write(&current->mm->mmap_sem);
-
-	if (incr) {
-		locked = current->mm->locked_vm + npages;
+	if (pages > 0) {
 		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
+		if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
 			ret = -ENOMEM;
-		else
-			current->mm->locked_vm += npages;
+			atomic64_sub(pages, &current->mm->pinned_vm);
+		}
 	} else {
-		if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-			npages = current->mm->locked_vm;
-		current->mm->locked_vm -= npages;
+		pinned_vm = atomic64_read(&current->mm->pinned_vm);
+		if (WARN_ON_ONCE(pages > pinned_vm))
+			pages = pinned_vm;
+		atomic64_sub(pages, &current->mm->pinned_vm);
 	}
 
-	dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid,
-		incr ? '+' : '-', npages << PAGE_SHIFT,
-		current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK),
-		ret ? "- exceeded" : "");
-
-	up_write(&current->mm->mmap_sem);
+	dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %lld/%lu%s\n", current->pid,
+		(pages > 0) ? '+' : '-', pages << PAGE_SHIFT,
+		(s64)atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
+		rlimit(RLIMIT_MEMLOCK), ret ? "- exceeded" : "");
 
 	return ret;
 }
@@ -92,7 +88,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 	struct device *dev = &pdata->dev->dev;
 	int ret, pinned;
 
-	ret = afu_dma_adjust_locked_vm(dev, npages, true);
+	ret = afu_dma_adjust_pinned_vm(dev, npages);
 	if (ret)
 		return ret;
 
@@ -121,7 +117,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 free_pages:
 	kfree(region->pages);
 unlock_vm:
-	afu_dma_adjust_locked_vm(dev, npages, false);
+	afu_dma_adjust_pinned_vm(dev, -npages);
 	return ret;
 }
 
@@ -141,7 +137,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata,
 
 	put_all_pages(region->pages, npages);
 	kfree(region->pages);
-	afu_dma_adjust_locked_vm(dev, npages, false);
+	afu_dma_adjust_pinned_vm(dev, -npages);
 
 	dev_dbg(dev, "%ld pages unpinned\n", npages);
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/5] powerpc/mmu: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
                   ` (2 preceding siblings ...)
  2019-02-11 22:44 ` [PATCH 3/5] fpga/dlf/afu: " Daniel Jordan
@ 2019-02-11 22:44 ` " Daniel Jordan
  2019-02-13  1:14   ` kbuild test robot
  2019-02-11 22:44 ` [PATCH 5/5] kvm/book3s: " Daniel Jordan
  2019-02-11 22:54 ` [PATCH 0/5] " Jason Gunthorpe
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The IOMMU
MMU helpers on powerpc account pinned pages to locked_vm; use pinned_vm
instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++++++++++++++---------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..fdf670542847 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -40,36 +40,35 @@ struct mm_iommu_table_group_mem_t {
 	u64 dev_hpa;		/* Device memory base address */
 };
 
-static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
+static long mm_iommu_adjust_pinned_vm(struct mm_struct *mm,
 		unsigned long npages, bool incr)
 {
-	long ret = 0, locked, lock_limit;
+	long ret = 0;
+	unsigned long lock_limit;
+	s64 pinned_vm;
 
 	if (!npages)
 		return 0;
 
-	down_write(&mm->mmap_sem);
-
 	if (incr) {
-		locked = mm->locked_vm + npages;
 		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
+		if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
 			ret = -ENOMEM;
-		else
-			mm->locked_vm += npages;
+			atomic64_sub(npages, &mm->pinned_vm);
+		}
 	} else {
-		if (WARN_ON_ONCE(npages > mm->locked_vm))
-			npages = mm->locked_vm;
-		mm->locked_vm -= npages;
+		pinned_vm = atomic64_read(&mm->pinned_vm);
+		if (WARN_ON_ONCE(npages > pinned_vm))
+			npages = pinned_vm;
+		atomic64_sub(npages, &mm->pinned_vm);
 	}
 
-	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-			current ? current->pid : 0,
-			incr ? '+' : '-',
+	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%lu %ld/%lu\n",
+			current ? current->pid : 0, incr ? '+' : '-',
 			npages << PAGE_SHIFT,
-			mm->locked_vm << PAGE_SHIFT,
+			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK));
-	up_write(&mm->mmap_sem);
 
 	return ret;
 }
@@ -133,7 +132,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
-	long i, j, ret = 0, locked_entries = 0;
+	long i, j, ret = 0, pinned_entries = 0;
 	unsigned int pageshift;
 	unsigned long flags;
 	unsigned long cur_ua;
@@ -154,11 +153,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	}
 
 	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
-		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		ret = mm_iommu_adjust_pinned_vm(mm, entries, true);
 		if (ret)
 			goto unlock_exit;
 
-		locked_entries = entries;
+		pinned_entries = entries;
 	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
@@ -252,8 +251,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
-	if (locked_entries && ret)
-		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
+	if (pinned_entries && ret)
+		mm_iommu_adjust_pinned_vm(mm, pinned_entries, false);
 
 	mutex_unlock(&mem_list_mutex);
 
@@ -352,7 +351,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	mm_iommu_release(mem);
 
 	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
-		mm_iommu_adjust_locked_vm(mm, entries, false);
+		mm_iommu_adjust_pinned_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 5/5] kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
                   ` (3 preceding siblings ...)
  2019-02-11 22:44 ` [PATCH 4/5] powerpc/mmu: " Daniel Jordan
@ 2019-02-11 22:44 ` " Daniel Jordan
  2019-02-13  1:43   ` kbuild test robot
  2019-02-11 22:54 ` [PATCH 0/5] " Jason Gunthorpe
  5 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 22:44 UTC (permalink / raw)
  To: jgg
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	daniel.m.jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

Memory used for TCE tables in kvm_vm_ioctl_create_spapr_tce is currently
accounted to locked_vm because it stays resident and its allocation is
directly triggered from userspace as explained in f8626985c7c2 ("KVM:
PPC: Account TCE-containing pages in locked_vm").

However, since the memory comes straight from the page allocator (and to
a lesser extent unreclaimable slab) and is effectively pinned, it should
be accounted with pinned_vm (see bc3e53f682d9 ("mm: distinguish between
mlocked and pinned pages")).

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/powerpc/kvm/book3s_64_vio.c | 35 ++++++++++++++------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab79734c7..2f8d7c051e4e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -56,39 +56,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages)
 	return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE;
 }
 
-static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
+static long kvmppc_account_memlimit(unsigned long pages, bool inc)
 {
 	long ret = 0;
+	s64 pinned_vm;
 
 	if (!current || !current->mm)
 		return ret; /* process exited */
 
-	down_write(&current->mm->mmap_sem);
-
 	if (inc) {
-		unsigned long locked, lock_limit;
+		unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-		locked = current->mm->locked_vm + stt_pages;
-		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+		pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
+		if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
 			ret = -ENOMEM;
-		else
-			current->mm->locked_vm += stt_pages;
+			atomic64_sub(pages, &current->mm->pinned_vm);
+		}
 	} else {
-		if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm))
-			stt_pages = current->mm->locked_vm;
+		pinned_vm = atomic64_read(&current->mm->pinned_vm);
+		if (WARN_ON_ONCE(pages > pinned_vm))
+			pages = pinned_vm;
 
-		current->mm->locked_vm -= stt_pages;
+		atomic64_sub(pages, &current->mm->pinned_vm);
 	}
 
-	pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
-			inc ? '+' : '-',
-			stt_pages << PAGE_SHIFT,
-			current->mm->locked_vm << PAGE_SHIFT,
-			rlimit(RLIMIT_MEMLOCK),
-			ret ? " - exceeded" : "");
-
-	up_write(&current->mm->mmap_sem);
+	pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%lu %ld/%lu%s\n", current->pid,
+			inc ? '+' : '-', pages << PAGE_SHIFT,
+			atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
+			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
 	return ret;
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
                   ` (4 preceding siblings ...)
  2019-02-11 22:44 ` [PATCH 5/5] kvm/book3s: " Daniel Jordan
@ 2019-02-11 22:54 ` " Jason Gunthorpe
  2019-02-11 23:15   ` Daniel Jordan
  2019-02-14  1:53   ` Ira Weiny
  5 siblings, 2 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 22:54 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	linux-mm, alex.williamson, mdf, akpm, linuxppc-dev, cl, hao.wu

On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
> 
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use.  It's based on a similar patch I posted recently[0].
> 
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus
*separately* from RDMA to Linus, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the
subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages.  I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 ` [PATCH 1/5] vfio/type1: " Daniel Jordan
@ 2019-02-11 22:56   ` Jason Gunthorpe
  2019-02-11 23:11     ` Daniel Jordan
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 22:56 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	linux-mm, alex.williamson, mdf, akpm, linuxppc-dev, cl, hao.wu

On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:
> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages"), locked and pinned pages are accounted separately.  Type1
> accounts pinned pages to locked_vm; use pinned_vm instead.
> 
> pinned_vm recently became atomic and so no longer relies on mmap_sem
> held as writer: delete.
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>  drivers/vfio/vfio_iommu_type1.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 73652e21efec..a56cc341813f 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  {
>  	struct mm_struct *mm;
> -	int ret;
> +	s64 pinned_vm;
> +	int ret = 0;
>  
>  	if (!npage)
>  		return 0;
> @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  	if (!mm)
>  		return -ESRCH; /* process exited */
>  
> -	ret = down_write_killable(&mm->mmap_sem);
> -	if (!ret) {
> -		if (npage > 0) {
> -			if (!dma->lock_cap) {
> -				unsigned long limit;
> -
> -				limit = task_rlimit(dma->task,
> -						RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
>  
> -				if (mm->locked_vm + npage > limit)
> -					ret = -ENOMEM;
> -			}
> +	if (npage > 0 && !dma->lock_cap) {
> +		unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> +
> -					PAGE_SHIFT;

I haven't looked at this super closely, but how does this stuff work?

do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...

Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?

Otherwise MEMLOCK is really doubled..

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:56   ` Jason Gunthorpe
@ 2019-02-11 23:11     ` Daniel Jordan
  2019-02-12 18:41       ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 23:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:
> > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> >  	if (!mm)
> >  		return -ESRCH; /* process exited */
> >  
> > -	ret = down_write_killable(&mm->mmap_sem);
> > -	if (!ret) {
> > -		if (npage > 0) {
> > -			if (!dma->lock_cap) {
> > -				unsigned long limit;
> > -
> > -				limit = task_rlimit(dma->task,
> > -						RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +	pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
> >  
> > -				if (mm->locked_vm + npage > limit)
> > -					ret = -ENOMEM;
> > -			}
> > +	if (npage > 0 && !dma->lock_cap) {
> > +		unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> > +
> > -					PAGE_SHIFT;
> 
> I haven't looked at this super closely, but how does this stuff work?
> 
> do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> 
> Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
>
> Otherwise MEMLOCK is really doubled..

So this has been a problem for some time, but it's not as easy as adding them
together, see [1][2] for a start.

The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
trying to do is account to the right counter.

Daniel

[1] http://lkml.kernel.org/r/20130523104154.GA23650@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/20130524140114.GK23650@twins.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:54 ` [PATCH 0/5] " Jason Gunthorpe
@ 2019-02-11 23:15   ` Daniel Jordan
  2019-02-14  1:53   ` Ira Weiny
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2019-02-11 23:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > Hi,
> > 
> > This series converts users that account pinned pages with locked_vm to
> > account with pinned_vm instead, pinned_vm being the correct counter to
> > use.  It's based on a similar patch I posted recently[0].
> > 
> > The patches are based on rdma/for-next to build on Davidlohr Bueso's
> > recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> > sense for these to be routed the same way, despite lack of rdma content?
> 
> Oy.. I'd be willing to accumulate a branch with acks to send to Linus
> *separately* from RDMA to Linus, but this is very abnormal.
> 
> Better to wait a few weeks for -rc1 and send patches through the
> subsystem trees.

Ok, I can do that.  It did seem strange, so I made it a question...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 ` [PATCH 2/5] vfio/spapr_tce: " Daniel Jordan
@ 2019-02-12  6:56   ` Alexey Kardashevskiy
  2019-02-12 16:50     ` Christopher Lameter
  2019-02-12 18:56     ` Alex Williamson
  0 siblings, 2 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-12  6:56 UTC (permalink / raw)
  To: Daniel Jordan, jgg
  Cc: dave, jack, kvm, atull, linux-fpga, linux-kernel, kvm-ppc,
	linux-mm, alex.williamson, mdf, akpm, linuxppc-dev, cl, hao.wu



On 12/02/2019 09:44, Daniel Jordan wrote:
> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages"), locked and pinned pages are accounted separately.  The SPAPR
> TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
> instead.
> 
> pinned_vm recently became atomic and so no longer relies on mmap_sem
> held as writer: delete.
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  Documentation/vfio.txt              |  6 +--
>  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
>  2 files changed, 33 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index f1a4d3c3ba0b..fa37d65363f9 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -308,7 +308,7 @@ This implementation has some specifics:
>     currently there is no way to reduce the number of calls. In order to make
>     things faster, the map/unmap handling has been implemented in real mode
>     which provides an excellent performance which has limitations such as
> -   inability to do locked pages accounting in real time.
> +   inability to do pinned pages accounting in real time.
>  
>  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
>     subtree that can be treated as a unit for the purposes of partitioning and
> @@ -324,7 +324,7 @@ This implementation has some specifics:
>  		returns the size and the start of the DMA window on the PCI bus.
>  
>  	VFIO_IOMMU_ENABLE
> -		enables the container. The locked pages accounting
> +		enables the container. The pinned pages accounting
>  		is done at this point. This lets user first to know what
>  		the DMA window is and adjust rlimit before doing any real job.
>  
> @@ -454,7 +454,7 @@ This implementation has some specifics:
>  
>     PPC64 paravirtualized guests generate a lot of map/unmap requests,
>     and the handling of those includes pinning/unpinning pages and updating
> -   mm::locked_vm counter to make sure we do not exceed the rlimit.
> +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
>     The v2 IOMMU splits accounting and pinning into separate operations:
>  
>     - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c424913324e3..f47e020dc5e4 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -34,9 +34,11 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
>  {
> -	long ret = 0, locked, lock_limit;
> +	long ret = 0;
> +	s64 pinned;
> +	unsigned long lock_limit;
>  
>  	if (WARN_ON_ONCE(!mm))
>  		return -EPERM;
> @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  	if (!npages)
>  		return 0;
>  
> -	down_write(&mm->mmap_sem);
> -	locked = mm->locked_vm + npages;
> +	pinned = atomic64_add_return(npages, &mm->pinned_vm);
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> +	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
>  		ret = -ENOMEM;
> -	else
> -		mm->locked_vm += npages;
> +		atomic64_sub(npages, &mm->pinned_vm);
> +	}
>  
> -	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> +	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
>  			npages << PAGE_SHIFT,
> -			mm->locked_vm << PAGE_SHIFT,
> -			rlimit(RLIMIT_MEMLOCK),
> -			ret ? " - exceeded" : "");
> -
> -	up_write(&mm->mmap_sem);
> +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> +			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
>  
>  	return ret;
>  }
>  
> -static void decrement_locked_vm(struct mm_struct *mm, long npages)
> +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
>  {
>  	if (!mm || !npages)
>  		return;
>  
> -	down_write(&mm->mmap_sem);
> -	if (WARN_ON_ONCE(npages > mm->locked_vm))
> -		npages = mm->locked_vm;
> -	mm->locked_vm -= npages;
> -	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> +	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
> +		npages = atomic64_read(&mm->pinned_vm);
> +	atomic64_sub(npages, &mm->pinned_vm);
> +	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
>  			npages << PAGE_SHIFT,
> -			mm->locked_vm << PAGE_SHIFT,
> +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
>  			rlimit(RLIMIT_MEMLOCK));
> -	up_write(&mm->mmap_sem);


So it used to be down_write+up_write and stuff in between.

Now it is 3 independent accesses (actually 4 but the last one is
diagnostic) with no locking around them. Why do not we need a lock
anymore precisely? Thanks,




>  }
>  
>  /*
> @@ -110,7 +106,7 @@ struct tce_container {
>  	bool enabled;
>  	bool v2;
>  	bool def_window_pending;
> -	unsigned long locked_pages;
> +	unsigned long pinned_pages;
>  	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
> @@ -283,7 +279,7 @@ static int tce_iommu_find_free_table(struct tce_container *container)
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> -	unsigned long locked;
> +	unsigned long pinned;
>  	struct iommu_table_group *table_group;
>  	struct tce_iommu_group *tcegrp;
>  
> @@ -292,15 +288,15 @@ static int tce_iommu_enable(struct tce_container *container)
>  
>  	/*
>  	 * When userspace pages are mapped into the IOMMU, they are effectively
> -	 * locked memory, so, theoretically, we need to update the accounting
> -	 * of locked pages on each map and unmap.  For powerpc, the map unmap
> +	 * pinned memory, so, theoretically, we need to update the accounting
> +	 * of pinned pages on each map and unmap.  For powerpc, the map unmap
>  	 * paths can be very hot, though, and the accounting would kill
>  	 * performance, especially since it would be difficult to impossible
>  	 * to handle the accounting in real mode only.
>  	 *
>  	 * To address that, rather than precisely accounting every page, we
> -	 * instead account for a worst case on locked memory when the iommu is
> -	 * enabled and disabled.  The worst case upper bound on locked memory
> +	 * instead account for a worst case on pinned memory when the iommu is
> +	 * enabled and disabled.  The worst case upper bound on pinned memory
>  	 * is the size of the whole iommu window, which is usually relatively
>  	 * small (compared to total memory sizes) on POWER hardware.
>  	 *
> @@ -317,7 +313,7 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 *
>  	 * So we do not allow enabling a container without a group attached
>  	 * as there is no way to know how much we should increment
> -	 * the locked_vm counter.
> +	 * the pinned_vm counter.
>  	 */
>  	if (!tce_groups_attached(container))
>  		return -ENODEV;
> @@ -335,12 +331,12 @@ static int tce_iommu_enable(struct tce_container *container)
>  	if (ret)
>  		return ret;
>  
> -	locked = table_group->tce32_size >> PAGE_SHIFT;
> -	ret = try_increment_locked_vm(container->mm, locked);
> +	pinned = table_group->tce32_size >> PAGE_SHIFT;
> +	ret = try_increment_pinned_vm(container->mm, pinned);
>  	if (ret)
>  		return ret;
>  
> -	container->locked_pages = locked;
> +	container->pinned_pages = pinned;
>  
>  	container->enabled = true;
>  
> @@ -355,7 +351,7 @@ static void tce_iommu_disable(struct tce_container *container)
>  	container->enabled = false;
>  
>  	BUG_ON(!container->mm);
> -	decrement_locked_vm(container->mm, container->locked_pages);
> +	decrement_pinned_vm(container->mm, container->pinned_pages);
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)
> @@ -658,7 +654,7 @@ static long tce_iommu_create_table(struct tce_container *container,
>  	if (!table_size)
>  		return -EINVAL;
>  
> -	ret = try_increment_locked_vm(container->mm, table_size >> PAGE_SHIFT);
> +	ret = try_increment_pinned_vm(container->mm, table_size >> PAGE_SHIFT);
>  	if (ret)
>  		return ret;
>  
> @@ -677,7 +673,7 @@ static void tce_iommu_free_table(struct tce_container *container,
>  	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>  	iommu_tce_table_put(tbl);
> -	decrement_locked_vm(container->mm, pages);
> +	decrement_pinned_vm(container->mm, pages);
>  }
>  
>  static long tce_iommu_create_window(struct tce_container *container,
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12  6:56   ` Alexey Kardashevskiy
@ 2019-02-12 16:50     ` Christopher Lameter
  2019-02-12 17:18       ` Daniel Jordan
  2019-02-12 18:56     ` Alex Williamson
  1 sibling, 1 reply; 30+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: dave, jack, kvm, jgg, linux-fpga, atull, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, hao.wu

On Tue, 12 Feb 2019, Alexey Kardashevskiy wrote:

> Now it is 3 independent accesses (actually 4 but the last one is
> diagnostic) with no locking around them. Why do not we need a lock
> anymore precisely? Thanks,

Updating a regular counter is racy and requires a lock. It was converted
to be an atomic which can be incremented without a race.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12 16:50     ` Christopher Lameter
@ 2019-02-12 17:18       ` Daniel Jordan
  2019-02-13  0:37         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-12 17:18 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: dave, jack, kvm, linux-mm, Alexey Kardashevskiy, linux-fpga,
	atull, linux-kernel, kvm-ppc, Daniel Jordan, jgg,
	alex.williamson, mdf, akpm, linuxppc-dev, hao.wu

On Tue, Feb 12, 2019 at 04:50:11PM +0000, Christopher Lameter wrote:
> On Tue, 12 Feb 2019, Alexey Kardashevskiy wrote:
> 
> > Now it is 3 independent accesses (actually 4 but the last one is
> > diagnostic) with no locking around them. Why do not we need a lock
> > anymore precisely? Thanks,
> 
> Updating a regular counter is racy and requires a lock. It was converted
> to be an atomic which can be incremented without a race.

Yes, though Alexey may have meant that the multiple reads of the atomic in
decrement_pinned_vm are racy.  It only matters when there's a bug that would
make the counter go negative, but it's there.

And FWIW the debug print in try_increment_pinned_vm is also racy.

This fixes all that.  It doesn't try to correct the negative pinned_vm as the
old code did because it's already a bug and adjusting the value by the negative
amount seems to do nothing but make debugging harder.

If it's ok, I'll respin the whole series this way (another point for common
helper)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index f47e020dc5e4..b79257304de6 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -53,25 +53,24 @@ static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
 		atomic64_sub(npages, &mm->pinned_vm);
 	}
 
-	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
-			npages << PAGE_SHIFT,
-			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
-			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
+	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %lld/%lu%s\n", current->pid,
+			npages << PAGE_SHIFT, pinned << PAGE_SHIFT,
+			lock_limit, ret ? " - exceeded" : "");
 
 	return ret;
 }
 
 static void decrement_pinned_vm(struct mm_struct *mm, long npages)
 {
+	s64 pinned;
+
 	if (!mm || !npages)
 		return;
 
-	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
-		npages = atomic64_read(&mm->pinned_vm);
-	atomic64_sub(npages, &mm->pinned_vm);
-	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
-			npages << PAGE_SHIFT,
-			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
+	pinned = atomic64_sub_return(npages, &mm->pinned_vm);
+	WARN_ON_ONCE(pinned < 0);
+	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %lld/%lu\n", current->pid,
+			npages << PAGE_SHIFT, pinned << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK));
 }
 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 23:11     ` Daniel Jordan
@ 2019-02-12 18:41       ` Alex Williamson
  2019-02-13  0:26         ` Daniel Jordan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2019-02-12 18:41 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, Jason Gunthorpe, mdf, akpm, linuxppc-dev, cl, hao.wu

On Mon, 11 Feb 2019 18:11:53 -0500
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:  
> > > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> > >  	if (!mm)
> > >  		return -ESRCH; /* process exited */
> > >  
> > > -	ret = down_write_killable(&mm->mmap_sem);
> > > -	if (!ret) {
> > > -		if (npage > 0) {
> > > -			if (!dma->lock_cap) {
> > > -				unsigned long limit;
> > > -
> > > -				limit = task_rlimit(dma->task,
> > > -						RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > +	pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
> > >  
> > > -				if (mm->locked_vm + npage > limit)
> > > -					ret = -ENOMEM;
> > > -			}
> > > +	if (npage > 0 && !dma->lock_cap) {
> > > +		unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> > > +
> > > -					PAGE_SHIFT;  
> > 
> > I haven't looked at this super closely, but how does this stuff work?
> > 
> > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > 
> > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> >
> > Otherwise MEMLOCK is really doubled..  
> 
> So this has been a problem for some time, but it's not as easy as adding them
> together, see [1][2] for a start.
> 
> The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
> trying to do is account to the right counter.

This still makes me nervous because we have userspace dependencies on
setting process locked memory.  There's a user visible difference if we
account for them in the same bucket vs separate.  Perhaps we're
counting in the wrong bucket now, but if we "fix" that and userspace
adapts, how do we ever go back to accounting both mlocked and pinned
memory combined against rlimit?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12  6:56   ` Alexey Kardashevskiy
  2019-02-12 16:50     ` Christopher Lameter
@ 2019-02-12 18:56     ` Alex Williamson
  2019-02-13  0:34       ` Alexey Kardashevskiy
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2019-02-12 18:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: dave, jack, kvm, linux-mm, linux-fpga, atull, linux-kernel,
	kvm-ppc, Daniel Jordan, jgg, mdf, akpm, linuxppc-dev, cl, hao.wu

On Tue, 12 Feb 2019 17:56:18 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 12/02/2019 09:44, Daniel Jordan wrote:
> > Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> > pages"), locked and pinned pages are accounted separately.  The SPAPR
> > TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
> > instead.
> > 
> > pinned_vm recently became atomic and so no longer relies on mmap_sem
> > held as writer: delete.
> > 
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> >  Documentation/vfio.txt              |  6 +--
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
> >  2 files changed, 33 insertions(+), 37 deletions(-)
> > 
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > index f1a4d3c3ba0b..fa37d65363f9 100644
> > --- a/Documentation/vfio.txt
> > +++ b/Documentation/vfio.txt
> > @@ -308,7 +308,7 @@ This implementation has some specifics:
> >     currently there is no way to reduce the number of calls. In order to make
> >     things faster, the map/unmap handling has been implemented in real mode
> >     which provides an excellent performance which has limitations such as
> > -   inability to do locked pages accounting in real time.
> > +   inability to do pinned pages accounting in real time.
> >  
> >  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
> >     subtree that can be treated as a unit for the purposes of partitioning and
> > @@ -324,7 +324,7 @@ This implementation has some specifics:
> >  		returns the size and the start of the DMA window on the PCI bus.
> >  
> >  	VFIO_IOMMU_ENABLE
> > -		enables the container. The locked pages accounting
> > +		enables the container. The pinned pages accounting
> >  		is done at this point. This lets user first to know what
> >  		the DMA window is and adjust rlimit before doing any real job.

I don't know of a ulimit only covering pinned pages, so for
documentation it seems more correct to continue referring to this as
locked page accounting.

> > @@ -454,7 +454,7 @@ This implementation has some specifics:
> >  
> >     PPC64 paravirtualized guests generate a lot of map/unmap requests,
> >     and the handling of those includes pinning/unpinning pages and updating
> > -   mm::locked_vm counter to make sure we do not exceed the rlimit.
> > +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
> >     The v2 IOMMU splits accounting and pinning into separate operations:
> >  
> >     - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index c424913324e3..f47e020dc5e4 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -34,9 +34,11 @@
> >  static void tce_iommu_detach_group(void *iommu_data,
> >  		struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> > -	long ret = 0, locked, lock_limit;
> > +	long ret = 0;
> > +	s64 pinned;
> > +	unsigned long lock_limit;
> >  
> >  	if (WARN_ON_ONCE(!mm))
> >  		return -EPERM;
> > @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> >  	if (!npages)
> >  		return 0;
> >  
> > -	down_write(&mm->mmap_sem);
> > -	locked = mm->locked_vm + npages;
> > +	pinned = atomic64_add_return(npages, &mm->pinned_vm);
> >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > +	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
> >  		ret = -ENOMEM;
> > -	else
> > -		mm->locked_vm += npages;
> > +		atomic64_sub(npages, &mm->pinned_vm);
> > +	}
> >  
> > -	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > +	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
> >  			npages << PAGE_SHIFT,
> > -			mm->locked_vm << PAGE_SHIFT,
> > -			rlimit(RLIMIT_MEMLOCK),
> > -			ret ? " - exceeded" : "");
> > -
> > -	up_write(&mm->mmap_sem);
> > +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> > +			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
> >  
> >  	return ret;
> >  }
> >  
> > -static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
> >  {
> >  	if (!mm || !npages)
> >  		return;
> >  
> > -	down_write(&mm->mmap_sem);
> > -	if (WARN_ON_ONCE(npages > mm->locked_vm))
> > -		npages = mm->locked_vm;
> > -	mm->locked_vm -= npages;
> > -	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> > +	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
> > +		npages = atomic64_read(&mm->pinned_vm);
> > +	atomic64_sub(npages, &mm->pinned_vm);
> > +	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
> >  			npages << PAGE_SHIFT,
> > -			mm->locked_vm << PAGE_SHIFT,
> > +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> >  			rlimit(RLIMIT_MEMLOCK));
> > -	up_write(&mm->mmap_sem);  
> 
> 
> So it used to be down_write+up_write and stuff in between.
> 
> Now it is 3 independent accesses (actually 4 but the last one is
> diagnostic) with no locking around them. Why do not we need a lock
> anymore precisely? Thanks,

The first 2 look pretty sketchy to me, is there a case where you don't
know how many pages you've pinned to unpin them?  And can it ever
really be correct to just unpin whatever remains?  The last access is
diagnostic, which leaves 1.  Daniel's rework to warn on a negative
result looks more sane. Thanks,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12 18:41       ` Alex Williamson
@ 2019-02-13  0:26         ` Daniel Jordan
  2019-02-13 20:03           ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-02-13  0:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, Daniel Jordan, Jason Gunthorpe, peterz, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Tue, Feb 12, 2019 at 11:41:10AM -0700, Alex Williamson wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> > > I haven't looked at this super closely, but how does this stuff work?
> > > 
> > > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > > 
> > > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> > >
> > > Otherwise MEMLOCK is really doubled..  
> > 
> > So this has been a problem for some time, but it's not as easy as adding them
> > together, see [1][2] for a start.
> > 
> > The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
> > trying to do is account to the right counter.

Thanks for taking a look, Alex.

> This still makes me nervous because we have userspace dependencies on
> setting process locked memory.

Could you please expand on this?  Trying to get more context.

> There's a user visible difference if we
> account for them in the same bucket vs separate.  Perhaps we're
> counting in the wrong bucket now, but if we "fix" that and userspace
> adapts, how do we ever go back to accounting both mlocked and pinned
> memory combined against rlimit?  Thanks,

PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
locked_vm accounting separate, but allowed the two to be added safely to be
compared against RLIMIT_MEMLOCK.

Anyway, until some solution is agreed on, are there objections to converting
locked_vm to an atomic, to avoid user-visible changes, instead of switching
locked_vm users to pinned_vm?

Daniel

[1] http://lkml.kernel.org/r/20130524140114.GK23650@twins.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12 18:56     ` Alex Williamson
@ 2019-02-13  0:34       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-13  0:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: dave, jack, kvm, linux-mm, linux-fpga, atull, linux-kernel,
	kvm-ppc, Daniel Jordan, jgg, mdf, akpm, linuxppc-dev, cl, hao.wu



On 13/02/2019 05:56, Alex Williamson wrote:
> On Tue, 12 Feb 2019 17:56:18 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 12/02/2019 09:44, Daniel Jordan wrote:
>>> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
>>> pages"), locked and pinned pages are accounted separately.  The SPAPR
>>> TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
>>> instead.
>>>
>>> pinned_vm recently became atomic and so no longer relies on mmap_sem
>>> held as writer: delete.
>>>
>>> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>>> ---
>>>  Documentation/vfio.txt              |  6 +--
>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++++++++++++++---------------
>>>  2 files changed, 33 insertions(+), 37 deletions(-)
>>>
>>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>> index f1a4d3c3ba0b..fa37d65363f9 100644
>>> --- a/Documentation/vfio.txt
>>> +++ b/Documentation/vfio.txt
>>> @@ -308,7 +308,7 @@ This implementation has some specifics:
>>>     currently there is no way to reduce the number of calls. In order to make
>>>     things faster, the map/unmap handling has been implemented in real mode
>>>     which provides an excellent performance which has limitations such as
>>> -   inability to do locked pages accounting in real time.
>>> +   inability to do pinned pages accounting in real time.
>>>  
>>>  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
>>>     subtree that can be treated as a unit for the purposes of partitioning and
>>> @@ -324,7 +324,7 @@ This implementation has some specifics:
>>>  		returns the size and the start of the DMA window on the PCI bus.
>>>  
>>>  	VFIO_IOMMU_ENABLE
>>> -		enables the container. The locked pages accounting
>>> +		enables the container. The pinned pages accounting
>>>  		is done at this point. This lets user first to know what
>>>  		the DMA window is and adjust rlimit before doing any real job.
> 
> I don't know of a ulimit only covering pinned pages, so for
> documentation it seems more correct to continue referring to this as
> locked page accounting.
> 
>>> @@ -454,7 +454,7 @@ This implementation has some specifics:
>>>  
>>>     PPC64 paravirtualized guests generate a lot of map/unmap requests,
>>>     and the handling of those includes pinning/unpinning pages and updating
>>> -   mm::locked_vm counter to make sure we do not exceed the rlimit.
>>> +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
>>>     The v2 IOMMU splits accounting and pinning into separate operations:
>>>  
>>>     - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> index c424913324e3..f47e020dc5e4 100644
>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> @@ -34,9 +34,11 @@
>>>  static void tce_iommu_detach_group(void *iommu_data,
>>>  		struct iommu_group *iommu_group);
>>>  
>>> -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>>> +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
>>>  {
>>> -	long ret = 0, locked, lock_limit;
>>> +	long ret = 0;
>>> +	s64 pinned;
>>> +	unsigned long lock_limit;
>>>  
>>>  	if (WARN_ON_ONCE(!mm))
>>>  		return -EPERM;
>>> @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>>>  	if (!npages)
>>>  		return 0;
>>>  
>>> -	down_write(&mm->mmap_sem);
>>> -	locked = mm->locked_vm + npages;
>>> +	pinned = atomic64_add_return(npages, &mm->pinned_vm);
>>>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>>> +	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
>>>  		ret = -ENOMEM;
>>> -	else
>>> -		mm->locked_vm += npages;
>>> +		atomic64_sub(npages, &mm->pinned_vm);
>>> +	}
>>>  
>>> -	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>>> +	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
>>>  			npages << PAGE_SHIFT,
>>> -			mm->locked_vm << PAGE_SHIFT,
>>> -			rlimit(RLIMIT_MEMLOCK),
>>> -			ret ? " - exceeded" : "");
>>> -
>>> -	up_write(&mm->mmap_sem);
>>> +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
>>> +			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
>>>  
>>>  	return ret;
>>>  }
>>>  
>>> -static void decrement_locked_vm(struct mm_struct *mm, long npages)
>>> +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
>>>  {
>>>  	if (!mm || !npages)
>>>  		return;
>>>  
>>> -	down_write(&mm->mmap_sem);
>>> -	if (WARN_ON_ONCE(npages > mm->locked_vm))
>>> -		npages = mm->locked_vm;
>>> -	mm->locked_vm -= npages;
>>> -	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
>>> +	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
>>> +		npages = atomic64_read(&mm->pinned_vm);
>>> +	atomic64_sub(npages, &mm->pinned_vm);
>>> +	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
>>>  			npages << PAGE_SHIFT,
>>> -			mm->locked_vm << PAGE_SHIFT,
>>> +			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
>>>  			rlimit(RLIMIT_MEMLOCK));
>>> -	up_write(&mm->mmap_sem);  
>>
>>
>> So it used to be down_write+up_write and stuff in between.
>>
>> Now it is 3 independent accesses (actually 4 but the last one is
>> diagnostic) with no locking around them. Why do not we need a lock
>> anymore precisely? Thanks,
> 
> The first 2 look pretty sketchy to me, is there a case where you don't
> know how many pages you've pinned to unpin them?

No case like this, this is why WARN_ON_ONCE(). At the time I could have
been under impression that pinned_vm is system-global, hence that
adjustment but we do not really need it there.

>  And can it ever
> really be correct to just unpin whatever remains?  The last access is
> diagnostic, which leaves 1.  Daniel's rework to warn on a negative
> result looks more sane. Thanks,

Yes it does look sane.


-- 
Alexey

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-12 17:18       ` Daniel Jordan
@ 2019-02-13  0:37         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 30+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-13  0:37 UTC (permalink / raw)
  To: Daniel Jordan, Christopher Lameter
  Cc: dave, jack, kvm, linux-mm, linux-fpga, atull, linux-kernel,
	kvm-ppc, jgg, alex.williamson, mdf, akpm, linuxppc-dev, hao.wu



On 13/02/2019 04:18, Daniel Jordan wrote:
> On Tue, Feb 12, 2019 at 04:50:11PM +0000, Christopher Lameter wrote:
>> On Tue, 12 Feb 2019, Alexey Kardashevskiy wrote:
>>
>>> Now it is 3 independent accesses (actually 4 but the last one is
>>> diagnostic) with no locking around them. Why do not we need a lock
>>> anymore precisely? Thanks,
>>
>> Updating a regular counter is racy and requires a lock. It was converted
>> to be an atomic which can be incremented without a race.
> 
> Yes, though Alexey may have meant that the multiple reads of the atomic in
> decrement_pinned_vm are racy.

Yes, I meant this race, thanks for clarifying this.

>  It only matters when there's a bug that would
> make the counter go negative, but it's there.
> 
> And FWIW the debug print in try_increment_pinned_vm is also racy.
> 
> This fixes all that.  It doesn't try to correct the negative pinned_vm as the
> old code did because it's already a bug and adjusting the value by the negative
> amount seems to do nothing but make debugging harder.
> 
> If it's ok, I'll respin the whole series this way (another point for common
> helper)

This looks good, thanks for fixing this.


> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index f47e020dc5e4..b79257304de6 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -53,25 +53,24 @@ static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
>  		atomic64_sub(npages, &mm->pinned_vm);
>  	}
>  
> -	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
> -			npages << PAGE_SHIFT,
> -			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> -			rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
> +	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %lld/%lu%s\n", current->pid,
> +			npages << PAGE_SHIFT, pinned << PAGE_SHIFT,
> +			lock_limit, ret ? " - exceeded" : "");
>  
>  	return ret;
>  }
>  
>  static void decrement_pinned_vm(struct mm_struct *mm, long npages)
>  {
> +	s64 pinned;
> +
>  	if (!mm || !npages)
>  		return;
>  
> -	if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
> -		npages = atomic64_read(&mm->pinned_vm);
> -	atomic64_sub(npages, &mm->pinned_vm);
> -	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
> -			npages << PAGE_SHIFT,
> -			atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> +	pinned = atomic64_sub_return(npages, &mm->pinned_vm);
> +	WARN_ON_ONCE(pinned < 0);
> +	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %lld/%lu\n", current->pid,
> +			npages << PAGE_SHIFT, pinned << PAGE_SHIFT,
>  			rlimit(RLIMIT_MEMLOCK));
>  }
>  
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] powerpc/mmu: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 ` [PATCH 4/5] powerpc/mmu: " Daniel Jordan
@ 2019-02-13  1:14   ` kbuild test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2019-02-13  1:14 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, daniel.m.jordan, jgg, alex.williamson, mdf, kbuild-all,
	cl, linuxppc-dev, akpm, hao.wu

[-- Attachment #1: Type: text/plain, Size: 11118 bytes --]

Hi Daniel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.0-rc4]
[cannot apply to next-20190212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Jordan/use-pinned_vm-instead-of-locked_vm-to-account-pinned-pages/20190213-070458
base:   https://github.com/awilliam/linux-vfio.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 8.2.0-11) 8.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   In file included from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/mm/mmu_context_iommu.c: In function 'mm_iommu_adjust_pinned_vm':
>> arch/powerpc/mm/mmu_context_iommu.c:55:43: error: passing argument 2 of 'atomic64_add_return_relaxed' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
                                              ^~~~~~~~~~~~~~
   include/linux/atomic.h:75:22: note: in definition of macro '__atomic_op_fence'
     typeof(op##_relaxed(args)) __ret;    \
                         ^~~~
   arch/powerpc/mm/mmu_context_iommu.c:55:15: note: in expansion of macro 'atomic64_add_return'
      pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
                  ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:331:52: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    atomic64_##op##_return_relaxed(long a, atomic64_t *v)   \
                                           ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:367:2: note: in expansion of macro 'ATOMIC64_OP_RETURN_RELAXED'
     ATOMIC64_OP_RETURN_RELAXED(op, asm_op)    \
     ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:370:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(add, add)
    ^~~~~~~~~~~~
   In file included from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
>> arch/powerpc/mm/mmu_context_iommu.c:55:43: error: passing argument 2 of 'atomic64_add_return_relaxed' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
                                              ^~~~~~~~~~~~~~
   include/linux/atomic.h:77:23: note: in definition of macro '__atomic_op_fence'
     __ret = op##_relaxed(args);     \
                          ^~~~
   arch/powerpc/mm/mmu_context_iommu.c:55:15: note: in expansion of macro 'atomic64_add_return'
      pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
                  ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:331:52: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    atomic64_##op##_return_relaxed(long a, atomic64_t *v)   \
                                           ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:367:2: note: in expansion of macro 'ATOMIC64_OP_RETURN_RELAXED'
     ATOMIC64_OP_RETURN_RELAXED(op, asm_op)    \
     ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:370:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(add, add)
    ^~~~~~~~~~~~
>> arch/powerpc/mm/mmu_context_iommu.c:58:25: error: passing argument 2 of 'atomic64_sub' from incompatible pointer type [-Werror=incompatible-pointer-types]
       atomic64_sub(npages, &mm->pinned_vm);
                            ^~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:315:58: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ void atomic64_##op(long a, atomic64_t *v)  \
                                                 ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:366:2: note: in expansion of macro 'ATOMIC64_OP'
     ATOMIC64_OP(op, asm_op)      \
     ^~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:371:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(sub, subf)
    ^~~~~~~~~~~~
>> arch/powerpc/mm/mmu_context_iommu.c:61:29: error: passing argument 1 of 'atomic64_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_read(&mm->pinned_vm);
                                ^~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:300:56: note: expected 'const atomic64_t *' {aka 'const struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ long atomic64_read(const atomic64_t *v)
                                         ~~~~~~~~~~~~~~~~~~^
   arch/powerpc/mm/mmu_context_iommu.c:64:24: error: passing argument 2 of 'atomic64_sub' from incompatible pointer type [-Werror=incompatible-pointer-types]
      atomic64_sub(npages, &mm->pinned_vm);
                           ^~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:315:58: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ void atomic64_##op(long a, atomic64_t *v)  \
                                                 ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:366:2: note: in expansion of macro 'ATOMIC64_OP'
     ATOMIC64_OP(op, asm_op)      \
     ^~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:371:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(sub, subf)
    ^~~~~~~~~~~~
   In file included from include/linux/kernel.h:14,
                    from include/linux/list.h:9,
                    from include/linux/rculist.h:10,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/mm/mmu_context_iommu.c:70:18: error: passing argument 1 of 'atomic64_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
       atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
                     ^~~~~~~~~~~~~~
   include/linux/printk.h:136:17: note: in definition of macro 'no_printk'
      printk(fmt, ##__VA_ARGS__);  \
                    ^~~~~~~~~~~
   arch/powerpc/mm/mmu_context_iommu.c:67:2: note: in expansion of macro 'pr_debug'
     pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%lu %ld/%lu\n",
     ^~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/rcupdate.h:38,
                    from include/linux/rculist.h:11,
                    from include/linux/sched/signal.h:5,
                    from arch/powerpc/mm/mmu_context_iommu.c:13:
   arch/powerpc/include/asm/atomic.h:300:56: note: expected 'const atomic64_t *' {aka 'const struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ long atomic64_read(const atomic64_t *v)
                                         ~~~~~~~~~~~~~~~~~~^
   cc1: all warnings being treated as errors

vim +/atomic64_add_return_relaxed +55 arch/powerpc/mm/mmu_context_iommu.c

  > 13	#include <linux/sched/signal.h>
    14	#include <linux/slab.h>
    15	#include <linux/rculist.h>
    16	#include <linux/vmalloc.h>
    17	#include <linux/mutex.h>
    18	#include <linux/migrate.h>
    19	#include <linux/hugetlb.h>
    20	#include <linux/swap.h>
    21	#include <linux/sizes.h>
    22	#include <asm/mmu_context.h>
    23	#include <asm/pte-walk.h>
    24	
    25	static DEFINE_MUTEX(mem_list_mutex);
    26	
    27	#define MM_IOMMU_TABLE_GROUP_PAGE_DIRTY	0x1
    28	#define MM_IOMMU_TABLE_GROUP_PAGE_MASK	~(SZ_4K - 1)
    29	
    30	struct mm_iommu_table_group_mem_t {
    31		struct list_head next;
    32		struct rcu_head rcu;
    33		unsigned long used;
    34		atomic64_t mapped;
    35		unsigned int pageshift;
    36		u64 ua;			/* userspace address */
    37		u64 entries;		/* number of entries in hpas[] */
    38		u64 *hpas;		/* vmalloc'ed */
    39	#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
    40		u64 dev_hpa;		/* Device memory base address */
    41	};
    42	
    43	static long mm_iommu_adjust_pinned_vm(struct mm_struct *mm,
    44			unsigned long npages, bool incr)
    45	{
    46		long ret = 0;
    47		unsigned long lock_limit;
    48		s64 pinned_vm;
    49	
    50		if (!npages)
    51			return 0;
    52	
    53		if (incr) {
    54			lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
  > 55			pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
    56			if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
    57				ret = -ENOMEM;
  > 58				atomic64_sub(npages, &mm->pinned_vm);
    59			}
    60		} else {
  > 61			pinned_vm = atomic64_read(&mm->pinned_vm);
    62			if (WARN_ON_ONCE(npages > pinned_vm))
    63				npages = pinned_vm;
    64			atomic64_sub(npages, &mm->pinned_vm);
    65		}
    66	
    67		pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%lu %ld/%lu\n",
    68				current ? current->pid : 0, incr ? '+' : '-',
    69				npages << PAGE_SHIFT,
    70				atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
    71				rlimit(RLIMIT_MEMLOCK));
    72	
    73		return ret;
    74	}
    75	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24035 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:44 ` [PATCH 5/5] kvm/book3s: " Daniel Jordan
@ 2019-02-13  1:43   ` kbuild test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2019-02-13  1:43 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, daniel.m.jordan, jgg, alex.williamson, mdf, kbuild-all,
	cl, linuxppc-dev, akpm, hao.wu

[-- Attachment #1: Type: text/plain, Size: 13126 bytes --]

Hi Daniel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.0-rc4]
[cannot apply to next-20190212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Jordan/use-pinned_vm-instead-of-locked_vm-to-account-pinned-pages/20190213-070458
base:   https://github.com/awilliam/linux-vfio.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 8.2.0-11) 8.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=8.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   In file included from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/kvm/book3s_64_vio.c: In function 'kvmppc_account_memlimit':
>> arch/powerpc/kvm/book3s_64_vio.c:70:42: error: passing argument 2 of 'atomic64_add_return_relaxed' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
                                             ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/atomic.h:75:22: note: in definition of macro '__atomic_op_fence'
     typeof(op##_relaxed(args)) __ret;    \
                         ^~~~
   arch/powerpc/kvm/book3s_64_vio.c:70:15: note: in expansion of macro 'atomic64_add_return'
      pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
                  ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:331:52: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    atomic64_##op##_return_relaxed(long a, atomic64_t *v)   \
                                           ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:367:2: note: in expansion of macro 'ATOMIC64_OP_RETURN_RELAXED'
     ATOMIC64_OP_RETURN_RELAXED(op, asm_op)    \
     ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:370:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(add, add)
    ^~~~~~~~~~~~
   In file included from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
>> arch/powerpc/kvm/book3s_64_vio.c:70:42: error: passing argument 2 of 'atomic64_add_return_relaxed' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
                                             ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/atomic.h:77:23: note: in definition of macro '__atomic_op_fence'
     __ret = op##_relaxed(args);     \
                          ^~~~
   arch/powerpc/kvm/book3s_64_vio.c:70:15: note: in expansion of macro 'atomic64_add_return'
      pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
                  ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:331:52: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    atomic64_##op##_return_relaxed(long a, atomic64_t *v)   \
                                           ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:367:2: note: in expansion of macro 'ATOMIC64_OP_RETURN_RELAXED'
     ATOMIC64_OP_RETURN_RELAXED(op, asm_op)    \
     ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:370:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(add, add)
    ^~~~~~~~~~~~
>> arch/powerpc/kvm/book3s_64_vio.c:73:24: error: passing argument 2 of 'atomic64_sub' from incompatible pointer type [-Werror=incompatible-pointer-types]
       atomic64_sub(pages, &current->mm->pinned_vm);
                           ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:315:58: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ void atomic64_##op(long a, atomic64_t *v)  \
                                                 ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:366:2: note: in expansion of macro 'ATOMIC64_OP'
     ATOMIC64_OP(op, asm_op)      \
     ^~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:371:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(sub, subf)
    ^~~~~~~~~~~~
>> arch/powerpc/kvm/book3s_64_vio.c:76:29: error: passing argument 1 of 'atomic64_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
      pinned_vm = atomic64_read(&current->mm->pinned_vm);
                                ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:300:56: note: expected 'const atomic64_t *' {aka 'const struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ long atomic64_read(const atomic64_t *v)
                                         ~~~~~~~~~~~~~~~~~~^
   arch/powerpc/kvm/book3s_64_vio.c:80:23: error: passing argument 2 of 'atomic64_sub' from incompatible pointer type [-Werror=incompatible-pointer-types]
      atomic64_sub(pages, &current->mm->pinned_vm);
                          ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:315:58: note: expected 'atomic64_t *' {aka 'struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ void atomic64_##op(long a, atomic64_t *v)  \
                                                 ~~~~~~~~~~~~^
   arch/powerpc/include/asm/atomic.h:366:2: note: in expansion of macro 'ATOMIC64_OP'
     ATOMIC64_OP(op, asm_op)      \
     ^~~~~~~~~~~
   arch/powerpc/include/asm/atomic.h:371:1: note: in expansion of macro 'ATOMIC64_OPS'
    ATOMIC64_OPS(sub, subf)
    ^~~~~~~~~~~~
   In file included from include/linux/kernel.h:14,
                    from include/linux/list.h:9,
                    from include/linux/preempt.h:11,
                    from include/linux/hardirq.h:5,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/kvm/book3s_64_vio.c:85:18: error: passing argument 1 of 'atomic64_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
       atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
                     ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/printk.h:136:17: note: in definition of macro 'no_printk'
      printk(fmt, ##__VA_ARGS__);  \
                    ^~~~~~~~~~~
   arch/powerpc/kvm/book3s_64_vio.c:83:2: note: in expansion of macro 'pr_debug'
     pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%lu %ld/%lu%s\n", current->pid,
     ^~~~~~~~
   In file included from include/linux/atomic.h:7,
                    from include/linux/llist.h:63,
                    from include/linux/smp.h:15,
                    from include/linux/percpu.h:7,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/vtime.h:5,
                    from include/linux/hardirq.h:8,
                    from include/linux/kvm_host.h:10,
                    from arch/powerpc/kvm/book3s_64_vio.c:23:
   arch/powerpc/include/asm/atomic.h:300:56: note: expected 'const atomic64_t *' {aka 'const struct <anonymous> *'} but argument is of type 'long unsigned int *'
    static __inline__ long atomic64_read(const atomic64_t *v)
                                         ~~~~~~~~~~~~~~~~~~^
   cc1: all warnings being treated as errors

vim +/atomic64_add_return_relaxed +70 arch/powerpc/kvm/book3s_64_vio.c

  > 23	#include <linux/kvm_host.h>
    24	#include <linux/highmem.h>
    25	#include <linux/gfp.h>
    26	#include <linux/slab.h>
    27	#include <linux/sched/signal.h>
    28	#include <linux/hugetlb.h>
    29	#include <linux/list.h>
    30	#include <linux/anon_inodes.h>
    31	#include <linux/iommu.h>
    32	#include <linux/file.h>
    33	
    34	#include <asm/kvm_ppc.h>
    35	#include <asm/kvm_book3s.h>
    36	#include <asm/book3s/64/mmu-hash.h>
    37	#include <asm/hvcall.h>
    38	#include <asm/synch.h>
    39	#include <asm/ppc-opcode.h>
    40	#include <asm/kvm_host.h>
    41	#include <asm/udbg.h>
    42	#include <asm/iommu.h>
    43	#include <asm/tce.h>
    44	#include <asm/mmu_context.h>
    45	
    46	static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
    47	{
    48		return ALIGN(iommu_pages * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
    49	}
    50	
    51	static unsigned long kvmppc_stt_pages(unsigned long tce_pages)
    52	{
    53		unsigned long stt_bytes = sizeof(struct kvmppc_spapr_tce_table) +
    54				(tce_pages * sizeof(struct page *));
    55	
    56		return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE;
    57	}
    58	
    59	static long kvmppc_account_memlimit(unsigned long pages, bool inc)
    60	{
    61		long ret = 0;
    62		s64 pinned_vm;
    63	
    64		if (!current || !current->mm)
    65			return ret; /* process exited */
    66	
    67		if (inc) {
    68			unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    69	
  > 70			pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
    71			if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
    72				ret = -ENOMEM;
  > 73				atomic64_sub(pages, &current->mm->pinned_vm);
    74			}
    75		} else {
  > 76			pinned_vm = atomic64_read(&current->mm->pinned_vm);
    77			if (WARN_ON_ONCE(pages > pinned_vm))
    78				pages = pinned_vm;
    79	
    80			atomic64_sub(pages, &current->mm->pinned_vm);
    81		}
    82	
    83		pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%lu %ld/%lu%s\n", current->pid,
    84				inc ? '+' : '-', pages << PAGE_SHIFT,
    85				atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
    86				rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
    87	
    88		return ret;
    89	}
    90	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24035 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-13  0:26         ` Daniel Jordan
@ 2019-02-13 20:03           ` Alex Williamson
  2019-02-13 23:07             ` Jason Gunthorpe
  2019-02-14  1:46             ` Daniel Jordan
  0 siblings, 2 replies; 30+ messages in thread
From: Alex Williamson @ 2019-02-13 20:03 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, Jason Gunthorpe, peterz, mdf, akpm, linuxppc-dev, cl,
	hao.wu

On Tue, 12 Feb 2019 19:26:50 -0500
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> On Tue, Feb 12, 2019 at 11:41:10AM -0700, Alex Williamson wrote:
> > Daniel Jordan <daniel.m.jordan@oracle.com> wrote:  
> > > On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:  
> > > > I haven't looked at this super closely, but how does this stuff work?
> > > > 
> > > > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> > > > 
> > > > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
> > > >
> > > > Otherwise MEMLOCK is really doubled..    
> > > 
> > > So this has been a problem for some time, but it's not as easy as adding them
> > > together, see [1][2] for a start.
> > > 
> > > The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
> > > trying to do is account to the right counter.  
> 
> Thanks for taking a look, Alex.
> 
> > This still makes me nervous because we have userspace dependencies on
> > setting process locked memory.  
> 
> Could you please expand on this?  Trying to get more context.

VFIO is a userspace driver interface and the pinned/locked page
accounting we're doing here is trying to prevent a user from exceeding
their locked memory limits.  Thus a VM management tool or unprivileged
userspace driver needs to have appropriate locked memory limits
configured for their use case.  Currently we do not have a unified
accounting scheme, so if a page is mlock'd by the user and also mapped
through VFIO for DMA, it's accounted twice, these both increment
locked_vm and userspace needs to manage that.  If pinned memory
and locked memory are now two separate buckets and we're only comparing
one of them against the locked memory limit, then it seems we have
effectively doubled the user's locked memory for this use case, as
Jason questioned.  The user could mlock one page and DMA map another,
they're both "locked", but now they only take one slot in each bucket.

If we continue forward with using a separate bucket here, userspace
could infer that accounting is unified and lower the user's locked
memory limit, or exploit the gap that their effective limit might
actually exceed system memory.  In the former case, if we do eventually
correct to compare the total of the combined buckets against the user's
locked memory limits, we'll break users that have adapted their locked
memory limits to meet the apparent needs.  In the latter case, the
inconsistent accounting is potentially an attack vector.

> > There's a user visible difference if we
> > account for them in the same bucket vs separate.  Perhaps we're
> > counting in the wrong bucket now, but if we "fix" that and userspace
> > adapts, how do we ever go back to accounting both mlocked and pinned
> > memory combined against rlimit?  Thanks,  
> 
> PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> locked_vm accounting separate, but allowed the two to be added safely to be
> compared against RLIMIT_MEMLOCK.

Unless I'm incorrect in the concerns above, I don't see how we can
convert vfio before this occurs.
 
> Anyway, until some solution is agreed on, are there objections to converting
> locked_vm to an atomic, to avoid user-visible changes, instead of switching
> locked_vm users to pinned_vm?

Seems that as long as we have separate buckets that are compared
individually to rlimit that we've got problems, it's just a matter of
where they're exposed based on which bucket is used for which
interface.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-13 20:03           ` Alex Williamson
@ 2019-02-13 23:07             ` Jason Gunthorpe
  2019-02-14  1:46             ` Daniel Jordan
  1 sibling, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-13 23:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, peterz, mdf, akpm, linuxppc-dev, cl,
	hao.wu

On Wed, Feb 13, 2019 at 01:03:30PM -0700, Alex Williamson wrote:
> > PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> > locked_vm accounting separate, but allowed the two to be added safely to be
> > compared against RLIMIT_MEMLOCK.
> 
> Unless I'm incorrect in the concerns above, I don't see how we can
> convert vfio before this occurs.

RDMA was converted to this pinned_vm scheme a long time ago, arguably
it is a mistake that VFIO did something different... This was to fix
some other bug where reporting of pages was wrong.

You are not wrong that this approach doesn't entirely make sense
though. :)

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  2019-02-13 20:03           ` Alex Williamson
  2019-02-13 23:07             ` Jason Gunthorpe
@ 2019-02-14  1:46             ` Daniel Jordan
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Jordan @ 2019-02-14  1:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: dave, jack, kvm, linux-mm, aik, linux-fpga, atull, linux-kernel,
	kvm-ppc, Daniel Jordan, Jason Gunthorpe, peterz, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Wed, Feb 13, 2019 at 01:03:30PM -0700, Alex Williamson wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> > On Tue, Feb 12, 2019 at 11:41:10AM -0700, Alex Williamson wrote:
> > > This still makes me nervous because we have userspace dependencies on
> > > setting process locked memory.  
> > 
> > Could you please expand on this?  Trying to get more context.
> 
> VFIO is a userspace driver interface and the pinned/locked page
> accounting we're doing here is trying to prevent a user from exceeding
> their locked memory limits.  Thus a VM management tool or unprivileged
> userspace driver needs to have appropriate locked memory limits
> configured for their use case.  Currently we do not have a unified
> accounting scheme, so if a page is mlock'd by the user and also mapped
> through VFIO for DMA, it's accounted twice, these both increment
> locked_vm and userspace needs to manage that.  If pinned memory
> and locked memory are now two separate buckets and we're only comparing
> one of them against the locked memory limit, then it seems we have
> effectively doubled the user's locked memory for this use case, as
> Jason questioned.  The user could mlock one page and DMA map another,
> they're both "locked", but now they only take one slot in each bucket.

Right, yes.  Should have been more specific.  I was after a concrete use case
where this would happen (sounded like you may have had a specific tool in
mind).

But it doesn't matter.  I understand your concern and agree that, given the
possibility that accounting in _some_ tool can be affected, we should fix
accounting before changing user visible behavior.  I can start a separate
discussion, having opened the can of worms again :)

> If we continue forward with using a separate bucket here, userspace
> could infer that accounting is unified and lower the user's locked
> memory limit, or exploit the gap that their effective limit might
> actually exceed system memory.  In the former case, if we do eventually
> correct to compare the total of the combined buckets against the user's
> locked memory limits, we'll break users that have adapted their locked
> memory limits to meet the apparent needs.  In the latter case, the
> inconsistent accounting is potentially an attack vector.

Makes sense.

> > > There's a user visible difference if we
> > > account for them in the same bucket vs separate.  Perhaps we're
> > > counting in the wrong bucket now, but if we "fix" that and userspace
> > > adapts, how do we ever go back to accounting both mlocked and pinned
> > > memory combined against rlimit?  Thanks,  
> > 
> > PeterZ posted an RFC that addresses this point[1].  It kept pinned_vm and
> > locked_vm accounting separate, but allowed the two to be added safely to be
> > compared against RLIMIT_MEMLOCK.
> 
> Unless I'm incorrect in the concerns above, I don't see how we can
> convert vfio before this occurs.
>  
> > Anyway, until some solution is agreed on, are there objections to converting
> > locked_vm to an atomic, to avoid user-visible changes, instead of switching
> > locked_vm users to pinned_vm?
> 
> Seems that as long as we have separate buckets that are compared
> individually to rlimit that we've got problems, it's just a matter of
> where they're exposed based on which bucket is used for which
> interface.  Thanks,

Indeed.  But for now, any concern with simply changing the type of the
currently used counter to an atomic, to reduce mmap_sem usage?  This is just an
implementation detail, invisible to userspace.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-11 22:54 ` [PATCH 0/5] " Jason Gunthorpe
  2019-02-11 23:15   ` Daniel Jordan
@ 2019-02-14  1:53   ` Ira Weiny
  2019-02-14  6:00     ` Jason Gunthorpe
  1 sibling, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-02-14  1:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> 
> > All five of these places, and probably some of Davidlohr's conversions,
> > probably want to be collapsed into a common helper in the core mm for
> > accounting pinned pages.  I tried, and there are several details that
> > likely need discussion, so this can be done as a follow-on.
> 
> I've wondered the same..

I'm really thinking this would be a nice way to ensure it gets cleaned up and
does not happen again.

Also, by moving it to the core we could better manage any user visible changes.

From a high level, pinned is a subset of locked so it seems like we need a 2
sets of helpers.

try_increment_locked_vm(...)
decrement_locked_vm(...)

try_increment_pinned_vm(...)
decrement_pinned_vm(...)

Where try_increment_pinned_vm() also increments locked_vm...  Of course this
may end up reverting the improvement of Davidlohr  Bueso's atomic work...  :-(

Furthermore it would seem better (although I don't know if at all possible) if
this were accounted for in core calls which tracked them based on how the pages
are being used so that drivers can't call try_increment_locked_vm() and then
pin the pages...  Thus getting the account wrong vs what actually happened.

And then in the end we can go back to locked_vm being the value checked against
RLIMIT_MEMLOCK.

Ira


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14  1:53   ` Ira Weiny
@ 2019-02-14  6:00     ` Jason Gunthorpe
  2019-02-14 19:33       ` Ira Weiny
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-14  6:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > 
> > > All five of these places, and probably some of Davidlohr's conversions,
> > > probably want to be collapsed into a common helper in the core mm for
> > > accounting pinned pages.  I tried, and there are several details that
> > > likely need discussion, so this can be done as a follow-on.
> > 
> > I've wondered the same..
> 
> I'm really thinking this would be a nice way to ensure it gets cleaned up and
> does not happen again.
> 
> Also, by moving it to the core we could better manage any user visible changes.
> 
> From a high level, pinned is a subset of locked so it seems like we need a 2
> sets of helpers.
> 
> try_increment_locked_vm(...)
> decrement_locked_vm(...)
> 
> try_increment_pinned_vm(...)
> decrement_pinned_vm(...)
> 
> Where try_increment_pinned_vm() also increments locked_vm...  Of course this
> may end up reverting the improvement of Davidlohr  Bueso's atomic work...  :-(
> 
> Furthermore it would seem better (although I don't know if at all possible) if
> this were accounted for in core calls which tracked them based on how the pages
> are being used so that drivers can't call try_increment_locked_vm() and then
> pin the pages...  Thus getting the account wrong vs what actually happened.
> 
> And then in the end we can go back to locked_vm being the value checked against
> RLIMIT_MEMLOCK.

Someone would need to understand the bug that was fixed by splitting
them. 

I think it had to do with double accounting pinned and mlocked pages
and thus delivering a lower than expected limit to userspace.

vfio has this bug, RDMA does not. RDMA has a bug where it can
overallocate locked memory, vfio doesn't.

Really unclear how to fix this. The pinned/locked split with two
buckets may be the right way.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14  6:00     ` Jason Gunthorpe
@ 2019-02-14 19:33       ` Ira Weiny
  2019-02-14 20:12         ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-02-14 19:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Wed, Feb 13, 2019 at 11:00:06PM -0700, Jason Gunthorpe wrote:
> On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> > On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > > 
> > > > All five of these places, and probably some of Davidlohr's conversions,
> > > > probably want to be collapsed into a common helper in the core mm for
> > > > accounting pinned pages.  I tried, and there are several details that
> > > > likely need discussion, so this can be done as a follow-on.
> > > 
> > > I've wondered the same..
> > 
> > I'm really thinking this would be a nice way to ensure it gets cleaned up and
> > does not happen again.
> > 
> > Also, by moving it to the core we could better manage any user visible changes.
> > 
> > From a high level, pinned is a subset of locked so it seems like we need a 2
> > sets of helpers.
> > 
> > try_increment_locked_vm(...)
> > decrement_locked_vm(...)
> > 
> > try_increment_pinned_vm(...)
> > decrement_pinned_vm(...)
> > 
> > Where try_increment_pinned_vm() also increments locked_vm...  Of course this
> > may end up reverting the improvement of Davidlohr  Bueso's atomic work...  :-(
> > 
> > Furthermore it would seem better (although I don't know if at all possible) if
> > this were accounted for in core calls which tracked them based on how the pages
> > are being used so that drivers can't call try_increment_locked_vm() and then
> > pin the pages...  Thus getting the account wrong vs what actually happened.
> > 
> > And then in the end we can go back to locked_vm being the value checked against
> > RLIMIT_MEMLOCK.
> 
> Someone would need to understand the bug that was fixed by splitting
> them. 
>

My suggestion above assumes that splitting them is required/correct.  To be
fair I've not dug into if this is true or not, but I trust Christopher.

What I have found is this commit:

bc3e53f682d9 mm: distinguish between mlocked and pinned pages

I think that commit introduced the bug (for IB) which at the time may have been
"ok" because many users of IB at the time were HPC/MPI users and I don't think
MPI does a lot of _separate_ mlock operations so the count of locked_vm was
probably negligible.  Alternatively, the clusters I've worked on in the past
had compute nodes set with RLIMIT_MEMLOCK to 'unlimited' whilst running MPI
applications on compute nodes of a cluster...  :-/

I think what Christopher did was probably ok for the internal tracking but we
_should_ have had something which summed the 2 for RLIMIT_MEMLOCK checking at
that time to be 100% correct?  Christopher do you remember why you did not do
that?

[1] http://lkml.kernel.org/r/20130524140114.GK23650@twins.programming.kicks-ass.net

> 
> I think it had to do with double accounting pinned and mlocked pages
> and thus delivering a lower than expected limit to userspace.
> 
> vfio has this bug, RDMA does not. RDMA has a bug where it can
> overallocate locked memory, vfio doesn't.

Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

I think the problem is that if the user calls mlock on a large range then both
vfio and RDMA could potentially overallocate even with this fix.  This was your
initial email to Daniel, I think...  And Alex's concern.

> 
> Really unclear how to fix this. The pinned/locked split with two
> buckets may be the right way.

Are you suggesting that we have 2 user limits?

Ira

> 
> Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14 19:33       ` Ira Weiny
@ 2019-02-14 20:12         ` Jason Gunthorpe
  2019-02-14 21:46           ` Ira Weiny
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-14 20:12 UTC (permalink / raw)
  To: Ira Weiny
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:

> > I think it had to do with double accounting pinned and mlocked pages
> > and thus delivering a lower than expected limit to userspace.
> > 
> > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > overallocate locked memory, vfio doesn't.
> 
> Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

Yes
 
> I think the problem is that if the user calls mlock on a large range then both
> vfio and RDMA could potentially overallocate even with this fix.  This was your
> initial email to Daniel, I think...  And Alex's concern.

Here are the possibilities
- mlock and pin on the same pages - RDMA respects the limit, VFIO halfs it.
- mlock and pin on different pages - RDMA doubles the limit, VFIO
  respects it
- VFIO and RDMA in the same process, the limit is halfed or doubled, depending.

IHMO we should make VFIO & RDMA the same, and then decide what to do
about case #2.

> > Really unclear how to fix this. The pinned/locked split with two
> > buckets may be the right way.
> 
> Are you suggesting that we have 2 user limits?

This is what RDMA has done since CL's patch.

It is very hard to fix as you need to track how many pages are mlocked
*AND* pinned.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14 20:12         ` Jason Gunthorpe
@ 2019-02-14 21:46           ` Ira Weiny
  2019-02-14 22:16             ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Ira Weiny @ 2019-02-14 21:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Thu, Feb 14, 2019 at 01:12:31PM -0700, Jason Gunthorpe wrote:
> On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:
> 
> > > I think it had to do with double accounting pinned and mlocked pages
> > > and thus delivering a lower than expected limit to userspace.
> > > 
> > > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > > overallocate locked memory, vfio doesn't.
> > 
> > Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?
> 
> Yes
>  
> > I think the problem is that if the user calls mlock on a large range then both
> > vfio and RDMA could potentially overallocate even with this fix.  This was your
> > initial email to Daniel, I think...  And Alex's concern.
> 
> Here are the possibilities
> - mlock and pin on the same pages - RDMA respects the limit, VFIO halfs it.
> - mlock and pin on different pages - RDMA doubles the limit, VFIO
>   respects it
> - VFIO and RDMA in the same process, the limit is halfed or doubled, depending.
> 
> IHMO we should make VFIO & RDMA the same, and then decide what to do
> about case #2.

I'm not against that.  Sorry if I came across that way.  For this series I
agree we should make it consistent.

> 
> > > Really unclear how to fix this. The pinned/locked split with two
> > > buckets may be the right way.
> > 
> > Are you suggesting that we have 2 user limits?
> 
> This is what RDMA has done since CL's patch.

I don't understand?  What is the other _user_ limit (other than
RLIMIT_MEMLOCK)?

> 
> It is very hard to fix as you need to track how many pages are mlocked
> *AND* pinned.

Understood. :-/

Ira

> 
> Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14 21:46           ` Ira Weiny
@ 2019-02-14 22:16             ` Jason Gunthorpe
  2019-02-15 15:26               ` Christopher Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2019-02-14 22:16 UTC (permalink / raw)
  To: Ira Weiny
  Cc: dave, jack, kvm, atull, aik, linux-fpga, linux-kernel, kvm-ppc,
	Daniel Jordan, linux-mm, alex.williamson, mdf, akpm,
	linuxppc-dev, cl, hao.wu

On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:

> > > > Really unclear how to fix this. The pinned/locked split with two
> > > > buckets may be the right way.
> > > 
> > > Are you suggesting that we have 2 user limits?
> > 
> > This is what RDMA has done since CL's patch.
> 
> I don't understand?  What is the other _user_ limit (other than
> RLIMIT_MEMLOCK)?

With todays implementation RLIMIT_MEMLOCK covers two user limits,
total number of pinned pages and total number of mlocked pages. The
two are different buckets and not summed.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
  2019-02-14 22:16             ` Jason Gunthorpe
@ 2019-02-15 15:26               ` Christopher Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christopher Lameter @ 2019-02-15 15:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dave, jack, kvm, atull, aik, linuxppc-dev, linux-fpga,
	linux-kernel, kvm-ppc, Daniel Jordan, linux-mm, alex.williamson,
	mdf, akpm, Ira Weiny, hao.wu

On Thu, 14 Feb 2019, Jason Gunthorpe wrote:

> On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:
>
> > > > > Really unclear how to fix this. The pinned/locked split with two
> > > > > buckets may be the right way.
> > > >
> > > > Are you suggesting that we have 2 user limits?
> > >
> > > This is what RDMA has done since CL's patch.
> >
> > I don't understand?  What is the other _user_ limit (other than
> > RLIMIT_MEMLOCK)?
>
> With todays implementation RLIMIT_MEMLOCK covers two user limits,
> total number of pinned pages and total number of mlocked pages. The
> two are different buckets and not summed.

Applications were failing at some point because they were effectively
summed up. If you mlocked/pinned a dataset of more than half the memory of
a system then things would get really weird.

Also there is the possibility of even more duplication because pages can
be pinned by multiple kernel subsystems. So you could get more than
doubling of the number.

The sane thing was to account them separately so that mlocking and
pinning worked without apps failing and then wait for another genius
to find out how to improve the situation by getting the pinned page mess
under control.

It is not even advisable to check pinned pages against any limit because
pages can be pinned by multiple subsystems.

The main problem here is that we only have a refcount to indicate pinning
and no way to clearly distinguish long term from short pins. In order to
really fix this issue we would need to have a list of subsystems that have
taken long term pins on a page. But doing so would waste a lot of memory
and cause a significant performance regression.

And the discussions here seem to be meandering around these issues.
Nothing really that convinces me that we have a clean solution at hand.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, back to index

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-11 22:44 [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages Daniel Jordan
2019-02-11 22:44 ` [PATCH 1/5] vfio/type1: " Daniel Jordan
2019-02-11 22:56   ` Jason Gunthorpe
2019-02-11 23:11     ` Daniel Jordan
2019-02-12 18:41       ` Alex Williamson
2019-02-13  0:26         ` Daniel Jordan
2019-02-13 20:03           ` Alex Williamson
2019-02-13 23:07             ` Jason Gunthorpe
2019-02-14  1:46             ` Daniel Jordan
2019-02-11 22:44 ` [PATCH 2/5] vfio/spapr_tce: " Daniel Jordan
2019-02-12  6:56   ` Alexey Kardashevskiy
2019-02-12 16:50     ` Christopher Lameter
2019-02-12 17:18       ` Daniel Jordan
2019-02-13  0:37         ` Alexey Kardashevskiy
2019-02-12 18:56     ` Alex Williamson
2019-02-13  0:34       ` Alexey Kardashevskiy
2019-02-11 22:44 ` [PATCH 3/5] fpga/dlf/afu: " Daniel Jordan
2019-02-11 22:44 ` [PATCH 4/5] powerpc/mmu: " Daniel Jordan
2019-02-13  1:14   ` kbuild test robot
2019-02-11 22:44 ` [PATCH 5/5] kvm/book3s: " Daniel Jordan
2019-02-13  1:43   ` kbuild test robot
2019-02-11 22:54 ` [PATCH 0/5] " Jason Gunthorpe
2019-02-11 23:15   ` Daniel Jordan
2019-02-14  1:53   ` Ira Weiny
2019-02-14  6:00     ` Jason Gunthorpe
2019-02-14 19:33       ` Ira Weiny
2019-02-14 20:12         ` Jason Gunthorpe
2019-02-14 21:46           ` Ira Weiny
2019-02-14 22:16             ` Jason Gunthorpe
2019-02-15 15:26               ` Christopher Lameter

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org linuxppc-dev@archiver.kernel.org
	public-inbox-index linuxppc-dev


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox