* [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
@ 2020-11-17 10:25 John Garry
  2020-11-17 10:25 ` [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas() John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: John Garry @ 2020-11-17 10:25 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66, John Garry

This series contains a patch to solve the longterm IOVA issue which
leizhen originally tried to address at [0].

A sieved kernel log is at the following, showing periodic dumps of IOVA
sizes, per CPU and per depot bin, per IOVA size granule:
https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test

Notice, for example, the following logs:
[13175.355584] print_iova1 cpu_total=40135 depot_total=3866 total=44001
[83483.457858] print_iova1 cpu_total=62532 depot_total=24476 total=87008

Where total IOVA rcache size has grown from 44K->87K over a long time.

Along with this patch, I included the following:
- A smaller helper to clear all IOVAs for a domain
- Change polarity of the IOVA magazine helpers
- Small optimisation from Cong Wang included, which was never applied [1].
  There was some debate about the other patches in that series, but this one
  is quite straightforward.

Differences to v2:
- Update commit message for patch 3/4

Differences to v1:
- Add IOVA clearing helper
- Add patch to change polarity of mag helpers
- Avoid logically-redundant extra variable in __iova_rcache_insert()

[0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
[1] https://lore.kernel.org/linux-iommu/4b74d40a-22d1-af53-fcb6-5d70183705a8@huawei.com/

Cong Wang (1):
  iommu: avoid taking iova_rbtree_lock twice

John Garry (3):
  iommu/iova: Add free_all_cpu_cached_iovas()
  iommu/iova: Avoid double-negatives in magazine helpers
  iommu/iova: Flush CPU rcache for when a depot fills

 drivers/iommu/iova.c | 66 +++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 28 deletions(-)

-- 
2.26.2



* [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas()
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
@ 2020-11-17 10:25 ` John Garry
  2020-12-09  8:58   ` Leizhen (ThunderTown)
  2020-11-17 10:25 ` [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers John Garry
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-11-17 10:25 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66, John Garry

Add a helper function to free the CPU rcache for all online CPUs.

There also exists a function of the same name in
drivers/iommu/intel/iommu.c, but the parameters are different, and there
should be no conflict.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 30d969a4c5fd..81b7399dd5e8 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -227,6 +227,14 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	return -ENOMEM;
 }
 
+static void free_all_cpu_cached_iovas(struct iova_domain *iovad)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		free_cpu_cached_iovas(cpu, iovad);
+}
+
 static struct kmem_cache *iova_cache;
 static unsigned int iova_cache_users;
 static DEFINE_MUTEX(iova_cache_mutex);
@@ -422,15 +430,12 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 retry:
 	new_iova = alloc_iova(iovad, size, limit_pfn, true);
 	if (!new_iova) {
-		unsigned int cpu;
-
 		if (!flush_rcache)
 			return 0;
 
 		/* Try replenishing IOVAs by flushing rcache. */
 		flush_rcache = false;
-		for_each_online_cpu(cpu)
-			free_cpu_cached_iovas(cpu, iovad);
+		free_all_cpu_cached_iovas(iovad);
 		goto retry;
 	}
 
-- 
2.26.2



* [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
  2020-11-17 10:25 ` [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas() John Garry
@ 2020-11-17 10:25 ` John Garry
  2020-12-09  9:03   ` Leizhen (ThunderTown)
  2020-11-17 10:25 ` [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills John Garry
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-11-17 10:25 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66, John Garry

A similar crash to the following could be observed if initial CPU rcache
magazine allocations fail in init_iova_rcaches():

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Mem abort info:
   ESR = 0x96000004
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
[0000000000000000] user address but active_mm is swapper
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 11 PID: 696 Comm: irq/40-hisi_sas Not tainted 5.9.0-rc7-dirty #109
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
Call trace:
  free_iova_fast+0xfc/0x280
  iommu_dma_free_iova+0x64/0x70
  __iommu_dma_unmap+0x9c/0xf8
  iommu_dma_unmap_sg+0xa8/0xc8
  dma_unmap_sg_attrs+0x28/0x50
  cq_thread_v3_hw+0x2dc/0x528
  irq_thread_fn+0x2c/0xa0
  irq_thread+0x130/0x1e0
  kthread+0x154/0x158
  ret_from_fork+0x10/0x34

Code: f9400060 f102001f 54000981 d4210000 (f9400043)

 ---[ end trace 4afcbdfc61b60467 ]---

The issue is that the expression !iova_magazine_full(NULL) evaluates true; this
falls over in __iova_rcache_insert() when we attempt to cache a mag
and cpu_rcache->loaded == NULL:

if (!iova_magazine_full(cpu_rcache->loaded)) {
	can_insert = true;
...

if (can_insert)
	iova_magazine_push(cpu_rcache->loaded, iova_pfn);

As above, can_insert is evaluated true, which it shouldn't be, and we try
to insert pfns in a NULL mag, which is not safe.
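
For illustration, here is a minimal user-space sketch of the hazard
(hypothetical demo code, not from the kernel; it only mimics the shape
of the old helper):

#include <stdbool.h>
#include <stdio.h>

#define IOVA_MAG_SIZE 128

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

/* old-style helper: a NULL magazine is reported as "not full" */
static bool iova_magazine_full(struct iova_magazine *mag)
{
	return (mag && mag->size == IOVA_MAG_SIZE);
}

int main(void)
{
	struct iova_magazine *loaded = NULL;	/* initial allocation failed */

	if (!iova_magazine_full(loaded))	/* true for NULL! */
		printf("can_insert = true -> push into NULL mag -> oops\n");
	return 0;
}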

To avoid this, stop using double-negatives, like !iova_magazine_full() and
!iova_magazine_empty(), and use positive tests, like
iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
can safely deal with cpu_rcache->{loaded, prev} = NULL.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 81b7399dd5e8..1f3f0f8b12e0 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -827,14 +827,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 	mag->size = 0;
 }
 
-static bool iova_magazine_full(struct iova_magazine *mag)
+static bool iova_magazine_has_space(struct iova_magazine *mag)
 {
-	return (mag && mag->size == IOVA_MAG_SIZE);
+	if (!mag)
+		return false;
+	return mag->size < IOVA_MAG_SIZE;
 }
 
-static bool iova_magazine_empty(struct iova_magazine *mag)
+static bool iova_magazine_has_pfns(struct iova_magazine *mag)
 {
-	return (!mag || mag->size == 0);
+	if (!mag)
+		return false;
+	return mag->size;
 }
 
 static unsigned long iova_magazine_pop(struct iova_magazine *mag,
@@ -843,7 +847,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
 	int i;
 	unsigned long pfn;
 
-	BUG_ON(iova_magazine_empty(mag));
+	BUG_ON(!iova_magazine_has_pfns(mag));
 
 	/* Only fall back to the rbtree if we have no suitable pfns at all */
 	for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
@@ -859,7 +863,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
 
 static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
 {
-	BUG_ON(iova_magazine_full(mag));
+	BUG_ON(!iova_magazine_has_space(mag));
 
 	mag->pfns[mag->size++] = pfn;
 }
@@ -905,9 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
 	spin_lock_irqsave(&cpu_rcache->lock, flags);
 
-	if (!iova_magazine_full(cpu_rcache->loaded)) {
+	if (iova_magazine_has_space(cpu_rcache->loaded)) {
 		can_insert = true;
-	} else if (!iova_magazine_full(cpu_rcache->prev)) {
+	} else if (iova_magazine_has_space(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		can_insert = true;
 	} else {
@@ -916,8 +920,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 		if (new_mag) {
 			spin_lock(&rcache->lock);
 			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-				rcache->depot[rcache->depot_size++] =
-						cpu_rcache->loaded;
+				if (cpu_rcache->loaded)
+					rcache->depot[rcache->depot_size++] =
+							cpu_rcache->loaded;
 			} else {
 				mag_to_free = cpu_rcache->loaded;
 			}
@@ -968,9 +973,9 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
 	spin_lock_irqsave(&cpu_rcache->lock, flags);
 
-	if (!iova_magazine_empty(cpu_rcache->loaded)) {
+	if (iova_magazine_has_pfns(cpu_rcache->loaded)) {
 		has_pfn = true;
-	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
+	} else if (iova_magazine_has_pfns(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		has_pfn = true;
 	} else {
-- 
2.26.2



* [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
  2020-11-17 10:25 ` [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas() John Garry
  2020-11-17 10:25 ` [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers John Garry
@ 2020-11-17 10:25 ` John Garry
  2020-12-09  9:13   ` Leizhen (ThunderTown)
  2020-11-17 10:25 ` [RESEND PATCH v3 4/4] iommu: avoid taking iova_rbtree_lock twice John Garry
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-11-17 10:25 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66, John Garry

Leizhen reported some time ago that IOVA performance may degrade over time
[0], but unfortunately his solution to fix this problem was not given
attention.

To summarize, the issue is that as time goes by, the CPU rcache and depot
rcache continue to grow. As such, IOVA RB tree access time also continues
to grow.

At a certain point, a depot may become full, and some CPU rcaches may
also be full when an attempt is made to insert another IOVA. For this scenario,
currently the "loaded" CPU rcache is freed and a new one is created. This
freeing means that many IOVAs in the RB tree need to be freed, which
makes IO throughput performance fall off a cliff in some storage scenarios:

Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]

And it continues in this fashion, without recovering. Note that in this
example it took 16 hours for this to occur. Also note that IO throughput
gradually becomes more unstable leading up to this point.

This problem is only seen for non-strict mode. For strict mode, the rcaches
stay quite compact.

As a solution to this issue, judge that the IOVA caches have grown too big
when cached magazines need to be freed, and just flush all the CPU rcaches
instead.

The depot rcaches, however, are not flushed, as they can be used to
immediately replenish active CPUs.

In the future, some IOVA compaction could be implemented to solve the
instability issue, which I figure could be quite complex to implement.
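
For reference, here is a condensed sketch of the resulting insert slow
path (simplified from the diff below; locking and the new_mag allocation
failure handling are elided):

	if (rcache->depot_size < MAX_GLOBAL_MAGS) {
		if (cpu_rcache->loaded)
			rcache->depot[rcache->depot_size++] =
					cpu_rcache->loaded;
		can_insert = true;
		cpu_rcache->loaded = new_mag;
	}
	...
	if (!can_insert)	/* depot was full */
		free_all_cpu_cached_iovas(iovad);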

[0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/

Analyzed-by: Zhen Lei <thunder.leizhen@huawei.com>
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 1f3f0f8b12e0..386005055aca 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 				 struct iova_rcache *rcache,
 				 unsigned long iova_pfn)
 {
-	struct iova_magazine *mag_to_free = NULL;
 	struct iova_cpu_rcache *cpu_rcache;
 	bool can_insert = false;
 	unsigned long flags;
@@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 				if (cpu_rcache->loaded)
 					rcache->depot[rcache->depot_size++] =
 							cpu_rcache->loaded;
-			} else {
-				mag_to_free = cpu_rcache->loaded;
+				can_insert = true;
+				cpu_rcache->loaded = new_mag;
 			}
 			spin_unlock(&rcache->lock);
-
-			cpu_rcache->loaded = new_mag;
-			can_insert = true;
+			if (!can_insert)
+				iova_magazine_free(new_mag);
 		}
 	}
 
@@ -938,10 +936,8 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 
 	spin_unlock_irqrestore(&cpu_rcache->lock, flags);
 
-	if (mag_to_free) {
-		iova_magazine_free_pfns(mag_to_free, iovad);
-		iova_magazine_free(mag_to_free);
-	}
+	if (!can_insert)
+		free_all_cpu_cached_iovas(iovad);
 
 	return can_insert;
 }
-- 
2.26.2



* [RESEND PATCH v3 4/4] iommu: avoid taking iova_rbtree_lock twice
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
                   ` (2 preceding siblings ...)
  2020-11-17 10:25 ` [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills John Garry
@ 2020-11-17 10:25 ` John Garry
  2020-12-01 15:35 ` [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: John Garry @ 2020-11-17 10:25 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66, John Garry

From: Cong Wang <xiyou.wangcong@gmail.com>

Both find_iova() and __free_iova() take iova_rbtree_lock,
there is no reason to take and release it twice inside
free_iova().

Fold them into one critical section by calling the unlock
versions instead.
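
In other words, a before/after sketch condensed from the diff below:

	/* before: iova_rbtree_lock is taken and released twice */
	iova = find_iova(iovad, pfn);
	if (iova)
		__free_iova(iovad, iova);

	/* after: a single critical section via the unlocked helpers */
	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	iova = private_find_iova(iovad, pfn);
	if (iova)
		private_free_iova(iovad, iova);
	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);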

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 386005055aca..3b32e6746c70 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -398,10 +398,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
 void
 free_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	struct iova *iova = find_iova(iovad, pfn);
+	unsigned long flags;
+	struct iova *iova;
 
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	iova = private_find_iova(iovad, pfn);
 	if (iova)
-		__free_iova(iovad, iova);
+		private_free_iova(iovad, iova);
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
 }
 EXPORT_SYMBOL_GPL(free_iova);
-- 
2.26.2



* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
                   ` (3 preceding siblings ...)
  2020-11-17 10:25 ` [RESEND PATCH v3 4/4] iommu: avoid taking iova_rbtree_lock twice John Garry
@ 2020-12-01 15:35 ` John Garry
  2020-12-01 21:02   ` Will Deacon
  2020-12-01 21:45 ` Will Deacon
  2021-01-15 11:32 ` John Garry
  6 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-12-01 15:35 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, linuxarm, iommu, linux-kernel, chenxiang66

On 17/11/2020 10:25, John Garry wrote:

Hi Will,

Is there any chance that we can get these picked up for 5.11? We've been
waiting a long time to see this issue solved.

Or, @Robin, let me know if not happy with this since v1.

BTW, patch #4 has been on the go for ~1 year now, and is a nice small
optimisation from Cong, which I picked up and which already had an RB tag.

Thanks,
John

> This series contains a patch to solve the longterm IOVA issue which
> leizhen originally tried to address at [0].
> 
> A sieved kernel log is at the following, showing periodic dumps of IOVA
> sizes, per CPU and per depot bin, per IOVA size granule:
> https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test
> 
> Notice, for example, the following logs:
> [13175.355584] print_iova1 cpu_total=40135 depot_total=3866 total=44001
> [83483.457858] print_iova1 cpu_total=62532 depot_total=24476 total=87008
> 
> Where total IOVA rcache size has grown from 44K->87K over a long time.
> 
> Along with this patch, I included the following:
> - A smaller helper to clear all IOVAs for a domain
> - Change polarity of the IOVA magazine helpers
> - Small optimisation from Cong Wang included, which was never applied [1].
>    There was some debate about the other patches in that series, but this one
>    is quite straightforward.
> 
> Differences to v2:
> - Update commit message for patch 3/4
> 
> Differences to v1:
> - Add IOVA clearing helper
> - Add patch to change polarity of mag helpers
> - Avoid logically-redundant extra variable in __iova_rcache_insert()
> 
> [0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
> [1] https://lore.kernel.org/linux-iommu/4b74d40a-22d1-af53-fcb6-5d70183705a8@huawei.com/
> 
> Cong Wang (1):
>    iommu: avoid taking iova_rbtree_lock twice
> 
> John Garry (3):
>    iommu/iova: Add free_all_cpu_cached_iovas()
>    iommu/iova: Avoid double-negatives in magazine helpers
>    iommu/iova: Flush CPU rcache for when a depot fills
> 
>   drivers/iommu/iova.c | 66 +++++++++++++++++++++++++-------------------
>   1 file changed, 38 insertions(+), 28 deletions(-)
> 



* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-12-01 15:35 ` [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
@ 2020-12-01 21:02   ` Will Deacon
  2020-12-02 15:20     ` John Garry
  0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2020-12-01 21:02 UTC (permalink / raw)
  To: John Garry
  Cc: robin.murphy, joro, xiyou.wangcong, linuxarm, iommu,
	linux-kernel, chenxiang66

On Tue, Dec 01, 2020 at 03:35:02PM +0000, John Garry wrote:
> On 17/11/2020 10:25, John Garry wrote:
> Is there any chance that we can get these picked up for 5.11? We've been
> waiting a long time to see this issue solved.
> 
> Or, @Robin, let me know if not happy with this since v1.
> 
> BTW, patch #4 has been on the go for ~1 year now, and is a nice small
> optimisation from Cong, which I picked up and which already had an RB tag.

I can pick the last patch up, but I'd really like some reviewed/tested-bys
on the others.

Will


* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
                   ` (4 preceding siblings ...)
  2020-12-01 15:35 ` [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
@ 2020-12-01 21:45 ` Will Deacon
  2020-12-03  6:04   ` Dmitry Safonov
  2021-01-15 11:32 ` John Garry
  6 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2020-12-01 21:45 UTC (permalink / raw)
  To: John Garry, joro, robin.murphy
  Cc: catalin.marinas, kernel-team, Will Deacon, xiyou.wangcong,
	linuxarm, iommu, linux-kernel

On Tue, 17 Nov 2020 18:25:30 +0800, John Garry wrote:
> This series contains a patch to solve the longterm IOVA issue which
> leizhen originally tried to address at [0].
> 
> A sieved kernel log is at the following, showing periodic dumps of IOVA
> sizes, per CPU and per depot bin, per IOVA size granule:
> https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test
> 
> [...]

Applied the final patch to arm64 (for-next/iommu/iova), thanks!

[4/4] iommu: avoid taking iova_rbtree_lock twice
      https://git.kernel.org/arm64/c/3a651b3a27a1

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-12-01 21:02   ` Will Deacon
@ 2020-12-02 15:20     ` John Garry
  0 siblings, 0 replies; 20+ messages in thread
From: John Garry @ 2020-12-02 15:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: robin.murphy, joro, xiyou.wangcong, linuxarm, iommu,
	linux-kernel, chenxiang66, Vijayanand Jitta, yuqi jin, dima

On 01/12/2020 21:02, Will Deacon wrote:

cc'ing some more people who have touched iova code recently

> On Tue, Dec 01, 2020 at 03:35:02PM +0000, John Garry wrote:
>> On 17/11/2020 10:25, John Garry wrote:
>> Is there any chance that we can get these picked up for 5.11? We've been
>> waiting a long time to see this issue solved.
>>
>> Or, @Robin, let me know if not happy with this since v1.
>>
>> BTW, patch #4 has been on the go for ~1 year now, and is a nice small
>> optimisation from Cong, which I picked up and which already had an RB tag.
> I can pick the last patch up, but I'd really like some reviewed/tested-bys
> on the others.
> 

ok, fair enough.

Considering the extremes required to unearth the main problem, it'll be 
hard to get testers, but, fwiw, I can provide a tested-by from the reporter:

Tested-by: Xiang Chen <chenxiang66@hisilicon.com>

@Robin, You originally had some interest in this topic - are you now 
satisfied with the changes I am proposing?

Please let me know.

Thanks,
John


* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-12-01 21:45 ` Will Deacon
@ 2020-12-03  6:04   ` Dmitry Safonov
  2020-12-03 14:54     ` John Garry
  0 siblings, 1 reply; 20+ messages in thread
From: Dmitry Safonov @ 2020-12-03  6:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: John Garry, Joerg Roedel, robin.murphy, Catalin Marinas,
	kernel-team, xiyou.wangcong, linuxarm, iommu, open list

On Tue, 1 Dec 2020 at 21:50, Will Deacon <will@kernel.org> wrote:
>
> On Tue, 17 Nov 2020 18:25:30 +0800, John Garry wrote:
> > This series contains a patch to solve the longterm IOVA issue which
> > leizhen originally tried to address at [0].
> >
> > A sieved kernel log is at the following, showing periodic dumps of IOVA
> > sizes, per CPU and per depot bin, per IOVA size granule:
> > https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test
> >
> > [...]
>
> Applied the final patch to arm64 (for-next/iommu/iova), thanks!
>
> [4/4] iommu: avoid taking iova_rbtree_lock twice
>       https://git.kernel.org/arm64/c/3a651b3a27a1

Glad it made it into next; 2 years ago I couldn't convince the iommu
maintainer it was worth it (albeit with a different justification):
https://lore.kernel.org/linux-iommu/20180621180823.805-3-dima@arista.com/

Thanks,
             Dmitry


* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-12-03  6:04   ` Dmitry Safonov
@ 2020-12-03 14:54     ` John Garry
  0 siblings, 0 replies; 20+ messages in thread
From: John Garry @ 2020-12-03 14:54 UTC (permalink / raw)
  To: Dmitry Safonov, Will Deacon
  Cc: Joerg Roedel, robin.murphy, Catalin Marinas, kernel-team,
	xiyou.wangcong, linuxarm, iommu, open list

On 03/12/2020 06:04, Dmitry Safonov wrote:
> On Tue, 1 Dec 2020 at 21:50, Will Deacon<will@kernel.org>  wrote:
>> On Tue, 17 Nov 2020 18:25:30 +0800, John Garry wrote:
>>> This series contains a patch to solve the longterm IOVA issue which
>>> leizhen originally tried to address at [0].
>>>
>>> A sieved kernel log is at the following, showing periodic dumps of IOVA
>>> sizes, per CPU and per depot bin, per IOVA size granule:
>>> https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test
>>>
>>> [...]
>> Applied the final patch to arm64 (for-next/iommu/iova), thanks!
>>
>> [4/4] iommu: avoid taking iova_rbtree_lock twice
>>        https://git.kernel.org/arm64/c/3a651b3a27a1
> Glad it made it into next; 2 years ago I couldn't convince the iommu
> maintainer it was worth it (albeit with a different justification):
> https://lore.kernel.org/linux-iommu/20180621180823.805-3-dima@arista.com/

Hi Dmitry,

I was unaware of your series, and it’s unfortunate that your
optimization never made it. However, I had a quick look there, and, in
case you did not notice, the code which you were proposing to change in
patch #1 for intel-iommu.c was removed in e70b081c6f37
("iommu/vt-d: Remove IOVA handling code from the non-dma_ops path").

BTW, split_and_remove_iova() has no in-tree users anymore, so I can send
a patch to delete it if nobody else wants to.

BTW2, there are some more patches in my series which could use a review if
you're feeling very helpful :)

Cheers,
John


* Re: [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas()
  2020-11-17 10:25 ` [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas() John Garry
@ 2020-12-09  8:58   ` Leizhen (ThunderTown)
  2020-12-09 12:41     ` John Garry
  0 siblings, 1 reply; 20+ messages in thread
From: Leizhen (ThunderTown) @ 2020-12-09  8:58 UTC (permalink / raw)
  To: John Garry, robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong

On 2020/11/17 18:25, John Garry wrote:
> Add a helper function to free the CPU rcache for all online CPUs.
> 
> There also exists a function of the same name in
> drivers/iommu/intel/iommu.c, but the parameters are different, and there
> should be no conflict.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/iova.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 30d969a4c5fd..81b7399dd5e8 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -227,6 +227,14 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>  	return -ENOMEM;
>  }
>  
> +static void free_all_cpu_cached_iovas(struct iova_domain *iovad)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		free_cpu_cached_iovas(cpu, iovad);
> +}
> +
>  static struct kmem_cache *iova_cache;
>  static unsigned int iova_cache_users;
>  static DEFINE_MUTEX(iova_cache_mutex);
> @@ -422,15 +430,12 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
>  retry:
>  	new_iova = alloc_iova(iovad, size, limit_pfn, true);
>  	if (!new_iova) {
> -		unsigned int cpu;
> -
>  		if (!flush_rcache)
>  			return 0;
>  
>  		/* Try replenishing IOVAs by flushing rcache. */
>  		flush_rcache = false;
> -		for_each_online_cpu(cpu)
> -			free_cpu_cached_iovas(cpu, iovad);
> +		free_all_cpu_cached_iovas(iovad);
>  		goto retry;
>  	}
>  

Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>

> 



* Re: [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers
  2020-11-17 10:25 ` [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers John Garry
@ 2020-12-09  9:03   ` Leizhen (ThunderTown)
  2020-12-09 11:39     ` John Garry
  0 siblings, 1 reply; 20+ messages in thread
From: Leizhen (ThunderTown) @ 2020-12-09  9:03 UTC (permalink / raw)
  To: John Garry, robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong



On 2020/11/17 18:25, John Garry wrote:
> A similar crash to the following could be observed if initial CPU rcache
> magazine allocations fail in init_iova_rcaches():
> 
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> Mem abort info:
>    ESR = 0x96000004
>    EC = 0x25: DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
> Data abort info:
>    ISV = 0, ISS = 0x00000004
>    CM = 0, WnR = 0
> [0000000000000000] user address but active_mm is swapper
> Internal error: Oops: 96000004 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 11 PID: 696 Comm: irq/40-hisi_sas Not tainted 5.9.0-rc7-dirty #109
> Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
> Call trace:
>   free_iova_fast+0xfc/0x280
>   iommu_dma_free_iova+0x64/0x70
>   __iommu_dma_unmap+0x9c/0xf8
>   iommu_dma_unmap_sg+0xa8/0xc8
>   dma_unmap_sg_attrs+0x28/0x50
>   cq_thread_v3_hw+0x2dc/0x528
>   irq_thread_fn+0x2c/0xa0
>   irq_thread+0x130/0x1e0
>   kthread+0x154/0x158
>   ret_from_fork+0x10/0x34
> 
> Code: f9400060 f102001f 54000981 d4210000 (f9400043)
> 
>  ---[ end trace 4afcbdfc61b60467 ]---
> 
> The issue is that the expression !iova_magazine_full(NULL) evaluates true; this
> falls over in __iova_rcache_insert() when we attempt to cache a mag
> and cpu_rcache->loaded == NULL:
> 
> if (!iova_magazine_full(cpu_rcache->loaded)) {
> 	can_insert = true;
> ...
> 
> if (can_insert)
> 	iova_magazine_push(cpu_rcache->loaded, iova_pfn);
> 
> As above, can_insert is evaluated true, which it shouldn't be, and we try
> to insert pfns in a NULL mag, which is not safe.
> 
> To avoid this, stop using double-negatives, like !iova_magazine_full() and
> !iova_magazine_empty(), and use positive tests, like
> iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
> can safely deal with cpu_rcache->{loaded, prev} = NULL.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/iova.c | 29 +++++++++++++++++------------
>  1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 81b7399dd5e8..1f3f0f8b12e0 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -827,14 +827,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>  	mag->size = 0;
>  }
>  
> -static bool iova_magazine_full(struct iova_magazine *mag)
> +static bool iova_magazine_has_space(struct iova_magazine *mag)
>  {
> -	return (mag && mag->size == IOVA_MAG_SIZE);
> +	if (!mag)
> +		return false;
> +	return mag->size < IOVA_MAG_SIZE;
>  }
>  
> -static bool iova_magazine_empty(struct iova_magazine *mag)
> +static bool iova_magazine_has_pfns(struct iova_magazine *mag)
>  {
> -	return (!mag || mag->size == 0);
> +	if (!mag)
> +		return false;
> +	return mag->size;
>  }
>  
>  static unsigned long iova_magazine_pop(struct iova_magazine *mag,
> @@ -843,7 +847,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>  	int i;
>  	unsigned long pfn;
>  
> -	BUG_ON(iova_magazine_empty(mag));
> +	BUG_ON(!iova_magazine_has_pfns(mag));
>  
>  	/* Only fall back to the rbtree if we have no suitable pfns at all */
>  	for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
> @@ -859,7 +863,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>  
>  static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
>  {
> -	BUG_ON(iova_magazine_full(mag));
> +	BUG_ON(!iova_magazine_has_space(mag));
>  
>  	mag->pfns[mag->size++] = pfn;
>  }
> @@ -905,9 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>  	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>  	spin_lock_irqsave(&cpu_rcache->lock, flags);
>  
> -	if (!iova_magazine_full(cpu_rcache->loaded)) {
> +	if (iova_magazine_has_space(cpu_rcache->loaded)) {
>  		can_insert = true;
> -	} else if (!iova_magazine_full(cpu_rcache->prev)) {
> +	} else if (iova_magazine_has_space(cpu_rcache->prev)) {
>  		swap(cpu_rcache->prev, cpu_rcache->loaded);
>  		can_insert = true;
>  	} else {
> @@ -916,8 +920,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>  		if (new_mag) {
>  			spin_lock(&rcache->lock);
>  			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> -				rcache->depot[rcache->depot_size++] =
> -						cpu_rcache->loaded;
> +				if (cpu_rcache->loaded)

Looks like only this place needs to change; the && short-circuit ensures that
mag->size will not be accessed when mag is NULL.

static bool iova_magazine_full(struct iova_magazine *mag)
{
        return (mag && mag->size == IOVA_MAG_SIZE);
}

static bool iova_magazine_empty(struct iova_magazine *mag)
{
        return (!mag || mag->size == 0);
}


> +					rcache->depot[rcache->depot_size++] =
> +							cpu_rcache->loaded;
>  			} else {
>  				mag_to_free = cpu_rcache->loaded;
>  			}
> @@ -968,9 +973,9 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>  	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>  	spin_lock_irqsave(&cpu_rcache->lock, flags);
>  
> -	if (!iova_magazine_empty(cpu_rcache->loaded)) {
> +	if (iova_magazine_has_pfns(cpu_rcache->loaded)) {
>  		has_pfn = true;
> -	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
> +	} else if (iova_magazine_has_pfns(cpu_rcache->prev)) {
>  		swap(cpu_rcache->prev, cpu_rcache->loaded);
>  		has_pfn = true;
>  	} else {
> 



* Re: [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills
  2020-11-17 10:25 ` [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills John Garry
@ 2020-12-09  9:13   ` Leizhen (ThunderTown)
  2020-12-09 11:22     ` John Garry
  0 siblings, 1 reply; 20+ messages in thread
From: Leizhen (ThunderTown) @ 2020-12-09  9:13 UTC (permalink / raw)
  To: John Garry, robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong



On 2020/11/17 18:25, John Garry wrote:
> Leizhen reported some time ago that IOVA performance may degrade over time
> [0], but unfortunately his solution to fix this problem was not given
> attention.
> 
> To summarize, the issue is that as time goes by, the CPU rcache and depot
> rcache continue to grow. As such, IOVA RB tree access time also continues
> to grow.
> 
> At a certain point, a depot may become full, and some CPU rcaches may
> also be full when an attempt is made to insert another IOVA. For this scenario,
> currently the "loaded" CPU rcache is freed and a new one is created. This
> freeing means that many IOVAs in the RB tree need to be freed, which
> makes IO throughput performance fall off a cliff in some storage scenarios:
> 
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]
> 
> And it continues in this fashion, without recovering. Note that in this
> example it took 16 hours for this to occur. Also note that IO throughput
> gradually becomes more unstable leading up to this point.
> 
> This problem is only seen for non-strict mode. For strict mode, the rcaches
> stay quite compact.
> 
> As a solution to this issue, judge that the IOVA caches have grown too big
> when cached magazines need to be freed, and just flush all the CPU rcaches
> instead.
> 
> The depot rcaches, however, are not flushed, as they can be used to
> immediately replenish active CPUs.
> 
> In the future, some IOVA compaction could be implemented to solve the
> instability issue, which I figure could be quite complex to implement.
> 
> [0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
> 
> Analyzed-by: Zhen Lei <thunder.leizhen@huawei.com>
> Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/iommu/iova.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 1f3f0f8b12e0..386005055aca 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>  				 struct iova_rcache *rcache,
>  				 unsigned long iova_pfn)
>  {
> -	struct iova_magazine *mag_to_free = NULL;
>  	struct iova_cpu_rcache *cpu_rcache;
>  	bool can_insert = false;
>  	unsigned long flags;
> @@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>  				if (cpu_rcache->loaded)
>  					rcache->depot[rcache->depot_size++] =
>  							cpu_rcache->loaded;
> -			} else {
> -				mag_to_free = cpu_rcache->loaded;
> +				can_insert = true;
> +				cpu_rcache->loaded = new_mag;
>  			}
>  			spin_unlock(&rcache->lock);
> -
> -			cpu_rcache->loaded = new_mag;
> -			can_insert = true;
> +			if (!can_insert)
> +				iova_magazine_free(new_mag);
>  		}
>  	}
>  
> @@ -938,10 +936,8 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>  
>  	spin_unlock_irqrestore(&cpu_rcache->lock, flags);
>  
> -	if (mag_to_free) {
> -		iova_magazine_free_pfns(mag_to_free, iovad);
> -		iova_magazine_free(mag_to_free);
mag_to_free has been stripped out; that's why lock protection is not required here.

> -	}
> +	if (!can_insert)
> +		free_all_cpu_cached_iovas(iovad);
Lock protection required.

>  
>  	return can_insert;
>  }
> 



* Re: [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills
  2020-12-09  9:13   ` Leizhen (ThunderTown)
@ 2020-12-09 11:22     ` John Garry
  2020-12-09 12:11       ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-12-09 11:22 UTC (permalink / raw)
  To: Leizhen (ThunderTown), robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong

On 09/12/2020 09:13, Leizhen (ThunderTown) wrote:
> 
> 
> On 2020/11/17 18:25, John Garry wrote:
>> Leizhen reported some time ago that IOVA performance may degrade over time
>> [0], but unfortunately his solution to fix this problem was not given
>> attention.
>>
>> To summarize, the issue is that as time goes by, the CPU rcache and depot
>> rcache continue to grow. As such, IOVA RB tree access time also continues
>> to grow.
>>
>> At a certain point, a depot may become full, and some CPU rcaches may
>> also be full when an attempt is made to insert another IOVA. For this scenario,
>> currently the "loaded" CPU rcache is freed and a new one is created. This
>> freeing means that many IOVAs in the RB tree need to be freed, which
>> makes IO throughput performance fall off a cliff in some storage scenarios:
>>
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]
>>
>> And it continues in this fashion, without recovering. Note that in this
>> example it took 16 hours for this to occur. Also note that IO throughput
>> gradually becomes more unstable leading up to this point.
>>
>> This problem is only seen for non-strict mode. For strict mode, the rcaches
>> stay quite compact.
>>
>> As a solution to this issue, judge that the IOVA caches have grown too big
>> when cached magazines need to be freed, and just flush all the CPU rcaches
>> instead.
>>
>> The depot rcaches, however, are not flushed, as they can be used to
>> immediately replenish active CPUs.
>>
>> In the future, some IOVA compaction could be implemented to solve the
>> instability issue, which I figure could be quite complex to implement.
>>
>> [0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
>>
>> Analyzed-by: Zhen Lei <thunder.leizhen@huawei.com>
>> Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
>> Signed-off-by: John Garry <john.garry@huawei.com>

Thanks for having a look

>> ---
>>   drivers/iommu/iova.c | 16 ++++++----------
>>   1 file changed, 6 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 1f3f0f8b12e0..386005055aca 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>   				 struct iova_rcache *rcache,
>>   				 unsigned long iova_pfn)
>>   {
>> -	struct iova_magazine *mag_to_free = NULL;
>>   	struct iova_cpu_rcache *cpu_rcache;
>>   	bool can_insert = false;
>>   	unsigned long flags;
>> @@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>   				if (cpu_rcache->loaded)
>>   					rcache->depot[rcache->depot_size++] =
>>   							cpu_rcache->loaded;
>> -			} else {
>> -				mag_to_free = cpu_rcache->loaded;
>> +				can_insert = true;
>> +				cpu_rcache->loaded = new_mag;
>>   			}
>>   			spin_unlock(&rcache->lock);
>> -
>> -			cpu_rcache->loaded = new_mag;
>> -			can_insert = true;
>> +			if (!can_insert)
>> +				iova_magazine_free(new_mag);
>>   		}
>>   	}
>>   
>> @@ -938,10 +936,8 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>   
>>   	spin_unlock_irqrestore(&cpu_rcache->lock, flags);
>>   
>> -	if (mag_to_free) {
>> -		iova_magazine_free_pfns(mag_to_free, iovad);
>> -		iova_magazine_free(mag_to_free);
> mag_to_free has been stripped out; that's why lock protection is not required here.
> 
>> -	}
>> +	if (!can_insert)
>> +		free_all_cpu_cached_iovas(iovad);
> Lock protection required.

But we have the per-CPU rcache locking again in free_cpu_cached_iovas() 
(which is called per-CPU from free_all_cpu_cached_iovas()).
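
For context, the flush path takes each CPU rcache's own lock while
returning that CPU's cached pfns to the rbtree, roughly like this
(a simplified sketch, not verbatim kernel source):

	/* what free_cpu_cached_iovas() does for one CPU, per rcache bin */
	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; i++) {
		rcache = &iovad->rcaches[i];
		cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
		spin_lock_irqsave(&cpu_rcache->lock, flags);
		iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
		iova_magazine_free_pfns(cpu_rcache->prev, iovad);
		spin_unlock_irqrestore(&cpu_rcache->lock, flags);
	}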

ok? Or do you mean some other lock?

Cheers,
John

> 
>>   
>>   	return can_insert;
>>   }
>>
> 
> .
> 



* Re: [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers
  2020-12-09  9:03   ` Leizhen (ThunderTown)
@ 2020-12-09 11:39     ` John Garry
  2020-12-09 12:31       ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 20+ messages in thread
From: John Garry @ 2020-12-09 11:39 UTC (permalink / raw)
  To: Leizhen (ThunderTown), robin.murphy, joro, will
  Cc: Linuxarm, linux-kernel, iommu, xiyou.wangcong

On 09/12/2020 09:03, Leizhen (ThunderTown) wrote:
> 
> 
> On 2020/11/17 18:25, John Garry wrote:
>> A similar crash to the following could be observed if initial CPU rcache
>> magazine allocations fail in init_iova_rcaches():
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>> Mem abort info:
>>     ESR = 0x96000004
>>     EC = 0x25: DABT (current EL), IL = 32 bits
>>     SET = 0, FnV = 0
>>     EA = 0, S1PTW = 0
>> Data abort info:
>>     ISV = 0, ISS = 0x00000004
>>     CM = 0, WnR = 0
>> [0000000000000000] user address but active_mm is swapper
>> Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> Modules linked in:
>> CPU: 11 PID: 696 Comm: irq/40-hisi_sas Not tainted 5.9.0-rc7-dirty #109
>> Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
>> Call trace:
>>    free_iova_fast+0xfc/0x280
>>    iommu_dma_free_iova+0x64/0x70
>>    __iommu_dma_unmap+0x9c/0xf8
>>    iommu_dma_unmap_sg+0xa8/0xc8
>>    dma_unmap_sg_attrs+0x28/0x50
>>    cq_thread_v3_hw+0x2dc/0x528
>>    irq_thread_fn+0x2c/0xa0
>>    irq_thread+0x130/0x1e0
>>    kthread+0x154/0x158
>>    ret_from_fork+0x10/0x34
>>
>> Code: f9400060 f102001f 54000981 d4210000 (f9400043)
>>
>>   ---[ end trace 4afcbdfc61b60467 ]---
>>
>> The issue is that the expression !iova_magazine_full(NULL) evaluates true; this
>> falls over in __iova_rcache_insert() when we attempt to cache a mag
>> and cpu_rcache->loaded == NULL:
>>
>> if (!iova_magazine_full(cpu_rcache->loaded)) {
>> 	can_insert = true;
>> ...
>>
>> if (can_insert)
>> 	iova_magazine_push(cpu_rcache->loaded, iova_pfn);
>>
>> As above, can_insert is evaluated true, which it shouldn't be, and we try
>> to insert pfns in a NULL mag, which is not safe.
>>
>> To avoid this, stop using double-negatives, like !iova_magazine_full() and
>> !iova_magazine_empty(), and use positive tests, like
>> iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
>> can safely deal with cpu_rcache->{loaded, prev} = NULL.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>

Thanks for checking here...

>> ---
>>   drivers/iommu/iova.c | 29 +++++++++++++++++------------
>>   1 file changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 81b7399dd5e8..1f3f0f8b12e0 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -827,14 +827,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>>   	mag->size = 0;
>>   }
>>   
>> -static bool iova_magazine_full(struct iova_magazine *mag)
>> +static bool iova_magazine_has_space(struct iova_magazine *mag)
>>   {
>> -	return (mag && mag->size == IOVA_MAG_SIZE);
>> +	if (!mag)
>> +		return false;
>> +	return mag->size < IOVA_MAG_SIZE;
>>   }
>>   
>> -static bool iova_magazine_empty(struct iova_magazine *mag)
>> +static bool iova_magazine_has_pfns(struct iova_magazine *mag)
>>   {
>> -	return (!mag || mag->size == 0);
>> +	if (!mag)
>> +		return false;
>> +	return mag->size;
>>   }
>>   
>>   static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>> @@ -843,7 +847,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>>   	int i;
>>   	unsigned long pfn;
>>   
>> -	BUG_ON(iova_magazine_empty(mag));
>> +	BUG_ON(!iova_magazine_has_pfns(mag));
>>   
>>   	/* Only fall back to the rbtree if we have no suitable pfns at all */
>>   	for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
>> @@ -859,7 +863,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>>   
>>   static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
>>   {
>> -	BUG_ON(iova_magazine_full(mag));
>> +	BUG_ON(!iova_magazine_has_space(mag));
>>   
>>   	mag->pfns[mag->size++] = pfn;
>>   }
>> @@ -905,9 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>   	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>>   	spin_lock_irqsave(&cpu_rcache->lock, flags);
>>   
>> -	if (!iova_magazine_full(cpu_rcache->loaded)) {

*

>> +	if (iova_magazine_has_space(cpu_rcache->loaded)) {
>>   		can_insert = true;
>> -	} else if (!iova_magazine_full(cpu_rcache->prev)) {
>> +	} else if (iova_magazine_has_space(cpu_rcache->prev)) {
>>   		swap(cpu_rcache->prev, cpu_rcache->loaded);
>>   		can_insert = true;
>>   	} else {
>> @@ -916,8 +920,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>   		if (new_mag) {
>>   			spin_lock(&rcache->lock);
>>   			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
>> -				rcache->depot[rcache->depot_size++] =
>> -						cpu_rcache->loaded;
>> +				if (cpu_rcache->loaded)
> 
> Looks like only this place needs to change; the && short-circuit ensures that
> mag->size will not be accessed when mag is NULL.

Not sure about that. We would get a crash prior to touching this codepath:

So if cpu_rcache->loaded == NULL at entry, then
!iova_magazine_full(cpu_rcache->loaded (== NULL)) evaluates true (at *
above), so can_insert is set true, and we then attempt
iova_magazine_push(cpu_rcache->loaded (== NULL), iova_pfn), which is not
safe.

ok?

Cheers,
John

> 
> static bool iova_magazine_full(struct iova_magazine *mag)
> {
>          return (mag && mag->size == IOVA_MAG_SIZE);
> }
> 
> static bool iova_magazine_empty(struct iova_magazine *mag)
> {
>          return (!mag || mag->size == 0);
> }
> 




> 
>> +					rcache->depot[rcache->depot_size++] =
>> +							cpu_rcache->loaded;
>>   			} else {
>>   				mag_to_free = cpu_rcache->loaded;
>>   			}
>> @@ -968,9 +973,9 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>>   	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>>   	spin_lock_irqsave(&cpu_rcache->lock, flags);
>>   
>> -	if (!iova_magazine_empty(cpu_rcache->loaded)) {
>> +	if (iova_magazine_has_pfns(cpu_rcache->loaded)) {
>>   		has_pfn = true;
>> -	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
>> +	} else if (iova_magazine_has_pfns(cpu_rcache->prev)) {
>>   		swap(cpu_rcache->prev, cpu_rcache->loaded);
>>   		has_pfn = true;
>>   	} else {
>>
> 



* Re: [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills
  2020-12-09 11:22     ` John Garry
@ 2020-12-09 12:11       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 20+ messages in thread
From: Leizhen (ThunderTown) @ 2020-12-09 12:11 UTC (permalink / raw)
  To: John Garry, robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong



On 2020/12/9 19:22, John Garry wrote:
> On 09/12/2020 09:13, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2020/11/17 18:25, John Garry wrote:
>>> Leizhen reported some time ago that IOVA performance may degrade over time
>>> [0], but unfortunately his solution to fix this problem was not given
>>> attention.
>>>
>>> To summarize, the issue is that as time goes by, the CPU rcache and depot
>>> rcache continue to grow. As such, IOVA RB tree access time also continues
>>> to grow.
>>>
>>> At a certain point, a depot may become full, and some CPU rcaches may
>>> also be full when an attempt is made to insert another IOVA. For this scenario,
>>> currently the "loaded" CPU rcache is freed and a new one is created. This
>>> freeing means that many IOVAs in the RB tree need to be freed, which
>>> makes IO throughput performance fall off a cliff in some storage scenarios:
>>>
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6314MB/0KB/0KB /s] [1616K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [5669MB/0KB/0KB /s] [1451K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6673MB/0KB/0KB /s] [1708K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6031MB/0KB/0KB /s] [1544K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6761MB/0KB/0KB /s] [1731K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6705MB/0KB/0KB /s] [1717K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6685MB/0KB/0KB /s] [1711K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6178MB/0KB/0KB /s] [1582K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [6731MB/0KB/0KB /s] [1723K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2387MB/0KB/0KB /s] [611K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2689MB/0KB/0KB /s] [688K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [2278MB/0KB/0KB /s] [583K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1288MB/0KB/0KB /s] [330K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1632MB/0KB/0KB /s] [418K/0/0 iops]
>>> Jobs: 12 (f=12): [RRRRRRRRRRRR] [0.0% done] [1765MB/0KB/0KB /s] [452K/0/0 iops]
>>>
>>> And it continues in this fashion, without recovering. Note that in this
>>> example it took 16 hours for this to occur. Also note that IO throughput
>>> gradually becomes more unstable leading up to this point.
>>>
>>> This problem is only seen for non-strict mode. For strict mode, the rcaches
>>> stay quite compact.
>>>
>>> As a solution to this issue, judge that the IOVA caches have grown too big
>>> when cached magazines need to be freed, and just flush all the CPU rcaches
>>> instead.
>>>
>>> The depot rcaches, however, are not flushed, as they can be used to
>>> immediately replenish active CPUs.
>>>
>>> In the future, some IOVA compaction could be implemented to solve the
>>> instability issue, which I figure could be quite complex to implement.
>>>
>>> [0] https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
>>>
>>> Analyzed-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
>>> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> Thanks for having a look
> 
>>> ---
>>>   drivers/iommu/iova.c | 16 ++++++----------
>>>   1 file changed, 6 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 1f3f0f8b12e0..386005055aca 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -901,7 +901,6 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>>                    struct iova_rcache *rcache,
>>>                    unsigned long iova_pfn)
>>>   {
>>> -    struct iova_magazine *mag_to_free = NULL;
>>>       struct iova_cpu_rcache *cpu_rcache;
>>>       bool can_insert = false;
>>>       unsigned long flags;
>>> @@ -923,13 +922,12 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>>                   if (cpu_rcache->loaded)
>>>                       rcache->depot[rcache->depot_size++] =
>>>                               cpu_rcache->loaded;
>>> -            } else {
>>> -                mag_to_free = cpu_rcache->loaded;
>>> +                can_insert = true;
>>> +                cpu_rcache->loaded = new_mag;
>>>               }
>>>               spin_unlock(&rcache->lock);
>>> -
>>> -            cpu_rcache->loaded = new_mag;
>>> -            can_insert = true;
>>> +            if (!can_insert)
>>> +                iova_magazine_free(new_mag);
>>>           }
>>>       }
>>> @@ -938,10 +936,8 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>>
>>>       spin_unlock_irqrestore(&cpu_rcache->lock, flags);
>>>
>>> -    if (mag_to_free) {
>>> -        iova_magazine_free_pfns(mag_to_free, iovad);
>>> -        iova_magazine_free(mag_to_free);
>> mag_to_free has been stripped out; that's why lock protection is not required here.
>>
>>> -    }
>>> +    if (!can_insert)
>>> +        free_all_cpu_cached_iovas(iovad);
>> Lock protection is required here.
> 
> But we have the per-CPU rcache locking again in free_cpu_cached_iovas() (which is called per-CPU from free_all_cpu_cached_iovas()).
> 
> ok? Or some other lock you mean?

Oh, sorry, I had mistakenly read free_cpu_cached_iovas() as free_iova_rcaches().

Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>

> 
> Cheers,
> John
> 
>>
>>>
>>>       return can_insert;
>>>   }
>>>
>>
>> .
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread
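
[Editorial note: for readers piecing the end result together from the diff
fragments above, here is a hedged C sketch of what __iova_rcache_insert()
looks like with patches 2-4 applied. It is a simplified reconstruction from
the diffs in this thread, not the verbatim upstream function; helpers such
as iova_magazine_alloc() and swap() are assumed from the kernel source of
this era.]

/*
 * Simplified sketch (not verbatim kernel code) of the insert path after
 * patches 2-4. The point is the depot-full handling at the bottom.
 */
static bool __iova_rcache_insert(struct iova_domain *iovad,
				 struct iova_rcache *rcache,
				 unsigned long iova_pfn)
{
	struct iova_cpu_rcache *cpu_rcache;
	bool can_insert = false;
	unsigned long flags;

	cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
	spin_lock_irqsave(&cpu_rcache->lock, flags);

	if (iova_magazine_has_space(cpu_rcache->loaded)) {
		can_insert = true;
	} else if (iova_magazine_has_space(cpu_rcache->prev)) {
		swap(cpu_rcache->prev, cpu_rcache->loaded);
		can_insert = true;
	} else {
		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);

		if (new_mag) {
			spin_lock(&rcache->lock);
			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
				/* loaded may be NULL, see patch 2/4 */
				if (cpu_rcache->loaded)
					rcache->depot[rcache->depot_size++] =
							cpu_rcache->loaded;
				cpu_rcache->loaded = new_mag;
				can_insert = true;
			}
			spin_unlock(&rcache->lock);
			if (!can_insert)
				iova_magazine_free(new_mag);
		}
	}

	if (can_insert)
		iova_magazine_push(cpu_rcache->loaded, iova_pfn);

	spin_unlock_irqrestore(&cpu_rcache->lock, flags);

	/*
	 * Depot full: the rcaches have grown too big, so flush them all
	 * back to the rbtree. The depot magazines are kept, so active
	 * CPUs can be replenished immediately. No extra locking needed
	 * here: free_all_cpu_cached_iovas() takes each CPU's rcache
	 * lock itself, which is the point resolved in the exchange above.
	 */
	if (!can_insert)
		free_all_cpu_cached_iovas(iovad);

	return can_insert;
}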

* Re: [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers
  2020-12-09 11:39     ` John Garry
@ 2020-12-09 12:31       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 20+ messages in thread
From: Leizhen (ThunderTown) @ 2020-12-09 12:31 UTC (permalink / raw)
  To: John Garry, robin.murphy, joro, will
  Cc: Linuxarm, linux-kernel, iommu, xiyou.wangcong



On 2020/12/9 19:39, John Garry wrote:
> On 09/12/2020 09:03, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2020/11/17 18:25, John Garry wrote:
>>> A similar crash to the following could be observed if initial CPU rcache
>>> magazine allocations fail in init_iova_rcaches():
>>>
>>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>>> Mem abort info:
>>>     ESR = 0x96000004
>>>     EC = 0x25: DABT (current EL), IL = 32 bits
>>>     SET = 0, FnV = 0
>>>     EA = 0, S1PTW = 0
>>> Data abort info:
>>>     ISV = 0, ISS = 0x00000004
>>>     CM = 0, WnR = 0
>>> [0000000000000000] user address but active_mm is swapper
>>> Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>> Modules linked in:
>>> CPU: 11 PID: 696 Comm: irq/40-hisi_sas Not tainted 5.9.0-rc7-dirty #109
>>> Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
>>> Call trace:
>>>    free_iova_fast+0xfc/0x280
>>>    iommu_dma_free_iova+0x64/0x70
>>>    __iommu_dma_unmap+0x9c/0xf8
>>>    iommu_dma_unmap_sg+0xa8/0xc8
>>>    dma_unmap_sg_attrs+0x28/0x50
>>>    cq_thread_v3_hw+0x2dc/0x528
>>>    irq_thread_fn+0x2c/0xa0
>>>    irq_thread+0x130/0x1e0
>>>    kthread+0x154/0x158
>>>    ret_from_fork+0x10/0x34
>>>
>>> Code: f9400060 f102001f 54000981 d4210000 (f9400043)
>>>
>>>   ---[ end trace 4afcbdfc61b60467 ]---
>>>
>>> The issue is that the expression !iova_magazine_full(NULL) evaluates true;
>>> this falls over in __iova_rcache_insert() when we attempt to cache a mag
>>> and cpu_rcache->loaded == NULL:
>>>
>>> if (!iova_magazine_full(cpu_rcache->loaded)) {
>>>     can_insert = true;
>>> ...
>>>
>>> if (can_insert)
>>>     iova_magazine_push(cpu_rcache->loaded, iova_pfn);
>>>
>>> As above, can_insert is evaluated true, which it shouldn't be, and we try
>>> to insert pfns in a NULL mag, which is not safe.
>>>
>>> To avoid this, stop using double-negatives, like !iova_magazine_full() and
>>> !iova_magazine_empty(), and use positive tests, like
>>> iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
>>> can safely deal with cpu_rcache->{loaded, prev} = NULL.
>>>
>>> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> Thanks for checking here...
> 
>>> ---
>>>   drivers/iommu/iova.c | 29 +++++++++++++++++------------
>>>   1 file changed, 17 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 81b7399dd5e8..1f3f0f8b12e0 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -827,14 +827,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>>>       mag->size = 0;
>>>   }
>>>
>>> -static bool iova_magazine_full(struct iova_magazine *mag)
>>> +static bool iova_magazine_has_space(struct iova_magazine *mag)
>>>   {
>>> -    return (mag && mag->size == IOVA_MAG_SIZE);
>>> +    if (!mag)
>>> +        return false;
>>> +    return mag->size < IOVA_MAG_SIZE;
>>>   }
>>>
>>> -static bool iova_magazine_empty(struct iova_magazine *mag)
>>> +static bool iova_magazine_has_pfns(struct iova_magazine *mag)
>>>   {
>>> -    return (!mag || mag->size == 0);
>>> +    if (!mag)
>>> +        return false;
>>> +    return mag->size;
>>>   }
>>>
>>>   static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>>> @@ -843,7 +847,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>>>       int i;
>>>       unsigned long pfn;
>>>
>>> -    BUG_ON(iova_magazine_empty(mag));
>>> +    BUG_ON(!iova_magazine_has_pfns(mag));
>>>
>>>       /* Only fall back to the rbtree if we have no suitable pfns at all */
>>>       for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
>>> @@ -859,7 +863,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>>>
>>>   static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
>>>   {
>>> -    BUG_ON(iova_magazine_full(mag));
>>> +    BUG_ON(!iova_magazine_has_space(mag));
>>>
>>>       mag->pfns[mag->size++] = pfn;
>>>   }
>>> @@ -905,9 +909,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>>       cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>>>       spin_lock_irqsave(&cpu_rcache->lock, flags);
>>>
>>> -    if (!iova_magazine_full(cpu_rcache->loaded)) {
> 
> *
> 
>>> +    if (iova_magazine_has_space(cpu_rcache->loaded)) {
>>>           can_insert = true;
>>> -    } else if (!iova_magazine_full(cpu_rcache->prev)) {
>>> +    } else if (iova_magazine_has_space(cpu_rcache->prev)) {
>>>           swap(cpu_rcache->prev, cpu_rcache->loaded);
>>>           can_insert = true;
>>>       } else {
>>> @@ -916,8 +920,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>>>           if (new_mag) {
>>>               spin_lock(&rcache->lock);
>>>               if (rcache->depot_size < MAX_GLOBAL_MAGS) {
>>> -                rcache->depot[rcache->depot_size++] =
>>> -                        cpu_rcache->loaded;
>>> +                if (cpu_rcache->loaded)
>>
>> Looks like only this place needs to change. Short-circuit evaluation
>> ensures that mag->size will not be accessed when mag is NULL.
> 
> Not sure about that. We would get a crash prior to touching this codepath:
> 
> So if cpu_rcache->loaded == NULL at entry, then !iova_magazine_full(cpu_rcache->loaded (== NULL)) evaluates true at the point marked * above, so can_insert is set true, and we then attempt iova_magazine_push(cpu_rcache->loaded (== NULL), iova_pfn), which is not safe.
> 
> ok?

OK, I got it.

Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>

> 
> Cheers,
> John
> 
>>
>> static bool iova_magazine_full(struct iova_magazine *mag)
>> {
>>          return (mag && mag->size == IOVA_MAG_SIZE);
>> }
>>
>> static bool iova_magazine_empty(struct iova_magazine *mag)
>> {
>>          return (!mag || mag->size == 0);
>> }
>>
> 
> 
> 
> 
>>
>>> +                    rcache->depot[rcache->depot_size++] =
>>> +                            cpu_rcache->loaded;
>>>               } else {
>>>                   mag_to_free = cpu_rcache->loaded;
>>>               }
>>> @@ -968,9 +973,9 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>>>       cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
>>>       spin_lock_irqsave(&cpu_rcache->lock, flags);
>>>
>>> -    if (!iova_magazine_empty(cpu_rcache->loaded)) {
>>> +    if (iova_magazine_has_pfns(cpu_rcache->loaded)) {
>>>           has_pfn = true;
>>> -    } else if (!iova_magazine_empty(cpu_rcache->prev)) {
>>> +    } else if (iova_magazine_has_pfns(cpu_rcache->prev)) {
>>>           swap(cpu_rcache->prev, cpu_rcache->loaded);
>>>           has_pfn = true;
>>>       } else {
>>>
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread
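
[Editorial note: a small userspace demonstration of the failure mode
discussed above. This is hypothetical demo code written for illustration,
not taken from the thread; only the two helper bodies are copied from the
diff.]

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IOVA_MAG_SIZE 128

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

/* Old helper: returns false for mag == NULL, so the *negated* test
 * !iova_magazine_full(NULL) is true and the caller goes on to push
 * into a NULL magazine and crashes. */
static bool iova_magazine_full(struct iova_magazine *mag)
{
	return (mag && mag->size == IOVA_MAG_SIZE);
}

/* New helper: a NULL magazine has no space, so the positive test
 * fails safely and the slow path is taken instead. */
static bool iova_magazine_has_space(struct iova_magazine *mag)
{
	if (!mag)
		return false;
	return mag->size < IOVA_MAG_SIZE;
}

int main(void)
{
	struct iova_magazine *loaded = NULL; /* initial alloc failed */

	assert(!iova_magazine_full(loaded));      /* old path: "can insert" */
	assert(!iova_magazine_has_space(loaded)); /* new path: slow path */
	return 0;
}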

* Re: [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas()
  2020-12-09  8:58   ` Leizhen (ThunderTown)
@ 2020-12-09 12:41     ` John Garry
  0 siblings, 0 replies; 20+ messages in thread
From: John Garry @ 2020-12-09 12:41 UTC (permalink / raw)
  To: Leizhen (ThunderTown), robin.murphy, joro, will
  Cc: linuxarm, linux-kernel, iommu, xiyou.wangcong

On 09/12/2020 08:58, Leizhen (ThunderTown) wrote:
>>   		goto retry;
>>   	}
>>   
> Reviewed-by: Zhen Lei<thunder.leizhen@huawei.com>
> 


Thanks. Incidentally, this needs to be rebased, which I'll do now.

^ permalink raw reply	[flat|nested] 20+ messages in thread
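
[Editorial note: as background to the lock-protection question resolved
earlier in the thread, the flush helper works per CPU and takes each CPU's
rcache lock itself, so callers need no additional protection. A hedged
sketch along the lines of drivers/iommu/iova.c of this era (paraphrased;
details may differ):]

/* Flush one CPU's rcaches back to the rbtree. Each per-CPU rcache has
 * its own lock, taken here; this is why the caller of
 * free_all_cpu_cached_iovas() needs no extra locking.
 * iova_magazine_free_pfns() tolerates a NULL magazine. */
static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
{
	int i;

	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
		struct iova_rcache *rcache = &iovad->rcaches[i];
		struct iova_cpu_rcache *cpu_rcache =
				per_cpu_ptr(rcache->cpu_rcaches, cpu);
		unsigned long flags;

		spin_lock_irqsave(&cpu_rcache->lock, flags);
		iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
		iova_magazine_free_pfns(cpu_rcache->prev, iovad);
		spin_unlock_irqrestore(&cpu_rcache->lock, flags);
	}
}

/* The helper added in patch 1/4 then just walks the online CPUs. */
static void free_all_cpu_cached_iovas(struct iova_domain *iovad)
{
	unsigned int cpu;

	for_each_online_cpu(cpu)
		free_cpu_cached_iovas(cpu, iovad);
}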

* Re: [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue
  2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
                   ` (5 preceding siblings ...)
  2020-12-01 21:45 ` Will Deacon
@ 2021-01-15 11:32 ` John Garry
  6 siblings, 0 replies; 20+ messages in thread
From: John Garry @ 2021-01-15 11:32 UTC (permalink / raw)
  To: robin.murphy, joro, will
  Cc: xiyou.wangcong, iommu, linux-kernel, chenxiang66, linuxarm,
	linux-scsi, Kashyap Desai

+ linux-scsi (see 
https://lore.kernel.org/linux-iommu/1607538189-237944-4-git-send-email-john.garry@huawei.com/)

On 17/11/2020 10:25, John Garry wrote:
> This series contains a patch to solve the longterm IOVA issue which
> leizhen originally tried to address at [0].
> 
> A sieved kernel log is at the following, showing periodic dumps of IOVA
> sizes, per CPU and per depot bin, per IOVA size granule:
> https://raw.githubusercontent.com/hisilicon/kernel-dev/topic-iommu-5.10-iova-debug-v3/aging_test
> 
> Notice, for example, the following logs:
> [13175.355584] print_iova1 cpu_total=40135 depot_total=3866 total=44001
> [83483.457858] print_iova1 cpu_total=62532 depot_total=24476 total=87008
> 
> Where total IOVA rcache size has grown from 44K->87K over a long time.
> 

JFYI, I am able to reproduce this aging issue on another storage card,
an LSI SAS 3008, so now it's harder to say that the issue is specific
to a single (buggy) driver.

A log of the IOVA size dumps is here:
https://raw.githubusercontent.com/hisilicon/kernel-dev/064c4dc8869b3f2ad07edffceafde0b129f276b0/lsi3008_dmesg

Notice again how the total IOVA size goes up over time, like:
[ 68.176914] print_iova1 cpu_total=23663 depot_total=256 total=23919
[ 2337.008194] print_iova1 cpu_total=67361 depot_total=9088 total=76449
[17141.860078] print_iova1 cpu_total=73397 depot_total=10368 total=83765
[27087.850830] print_iova1 cpu_total=73386 depot_total=10624 total=84010
[10434.042877] print_iova1 cpu_total=90652 depot_total=12928 total=103580

I had to change some settings for that storage card to reproduce it,
though [0], which could explain why there have been no other reports.

So please consider this issue again...

Thanks,
john

[0] 
https://lore.kernel.org/linux-scsi/dd8e6fdc-397d-b6ad-3371-0b65d1932ad1@huawei.com/T/#m953d21446a5756981412c92d0924ca65c8d2f3a5

> Along with this patch, I included the following:
> - A smaller helper to clear all IOVAs for a domain
> - Change polarity of the IOVA magazine helpers
> - Small optimisation from Cong Wang included, which was never applied [1].
>    There was some debate of the other patches in that series, but this one
>    is quite straightforward.
> 
> Differences to v2:
> - Update commit message for patch 3/4

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-01-15 11:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-17 10:25 [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
2020-11-17 10:25 ` [RESEND PATCH v3 1/4] iommu/iova: Add free_all_cpu_cached_iovas() John Garry
2020-12-09  8:58   ` Leizhen (ThunderTown)
2020-12-09 12:41     ` John Garry
2020-11-17 10:25 ` [RESEND PATCH v3 2/4] iommu/iova: Avoid double-negatives in magazine helpers John Garry
2020-12-09  9:03   ` Leizhen (ThunderTown)
2020-12-09 11:39     ` John Garry
2020-12-09 12:31       ` Leizhen (ThunderTown)
2020-11-17 10:25 ` [RESEND PATCH v3 3/4] iommu/iova: Flush CPU rcache for when a depot fills John Garry
2020-12-09  9:13   ` Leizhen (ThunderTown)
2020-12-09 11:22     ` John Garry
2020-12-09 12:11       ` Leizhen (ThunderTown)
2020-11-17 10:25 ` [RESEND PATCH v3 4/4] iommu: avoid taking iova_rbtree_lock twice John Garry
2020-12-01 15:35 ` [RESEND PATCH v3 0/4] iommu/iova: Solve longterm IOVA issue John Garry
2020-12-01 21:02   ` Will Deacon
2020-12-02 15:20     ` John Garry
2020-12-01 21:45 ` Will Deacon
2020-12-03  6:04   ` Dmitry Safonov
2020-12-03 14:54     ` John Garry
2021-01-15 11:32 ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).