* [PATCH 00/10] mpt3sas and dmapool scalability
From: Tony Battersby @ 2022-05-31 18:11 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

This patch series improves dmapool scalability by replacing linear scans
with red-black trees.

History:

In 2018 this patch series made it through four versions.  v1 used
red-black trees; v2-v4 put the dma pool info directly into struct page
and used virt_to_page() to get at it.  v4 made a brief appearance in
linux-next, but it caused problems on non-x86 archs where
virt_to_page() doesn't work with dma_alloc_coherent(), so it was
reverted.  I was too busy at the time to repost the red-black tree
version, and I forgot about it until now.  This version is based on the
red-black trees of v1, but it addresses all the review comments I
received at the time and adds further cleanup patches.

Note that Keith Busch is also working on improving dmapool scalability,
so for now I would recommend not merging my scalability patches until
Keith's approach can be evaluated.  In the meantime, my patches can
serve as a benchmark comparison.  I also have a number of cleanup
patches in my series that could be useful on their own.

References:

v1
https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@cybernetics.com/

v2
https://lore.kernel.org/linux-mm/ec701153-fdc9-37f3-c267-f056159b4606@cybernetics.com/

v3
https://lore.kernel.org/linux-mm/d48854ff-995d-228e-8356-54c141c32117@cybernetics.com/

v4
https://lore.kernel.org/linux-mm/88395080-efc1-4e7b-f813-bb90c86d0745@cybernetics.com/

problem caused by virt_to_page()
https://lore.kernel.org/linux-kernel/20181206013054.GI6707@atomide.com/

Keith Busch's dmapool performance enhancements
https://lore.kernel.org/linux-mm/20220428202714.17630-1-kbusch@kernel.org/

Below is my original description of the motivation for these patches.

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation.  With an LSI/Broadcom SAS 9300-8i
12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the
equivalent of:

chain_dma_pool = dma_pool_create(size = 128);
for (i = 0; i < 373959; i++)
	dma_addr[i] = dma_pool_alloc(chain_dma_pool);

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

for (i = 0; i < 373959; i++)
	dma_pool_free(chain_dma_pool, dma_addr[i]);
dma_pool_destroy(chain_dma_pool);

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n^2) complexity, although dma_pool_free() is much worse due to
implementation details.  On my system, the dma_pool_free() loop above
takes about 9 seconds to run.  Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take ~30
seconds.
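
For reference, here is the lookup on the free path that dominates that
time, lightly simplified from mm/dmapool.c (the exact body may differ
slightly by kernel version).  It is a linear scan over every allocated
page, so N frees cost O(N^2):

static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
{
	struct dma_page *page;

	list_for_each_entry(page, &pool->page_list, page_list) {
		if (dma < page->dma)
			continue;
		if ((dma - page->dma) < pool->allocation)
			return page;	/* dma falls inside this page */
	}
	return NULL;
}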

mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment; the columns are pool name,
blocks in use, total blocks, block size, and pages)
poolinfo - 0.1
reply_post_free_array pool  1      21     192     1
reply_free pool             1      1      41728   1
reply pool                  1      1      1335296 1
sense pool                  1      1      970272  1
chain pool                  373959 386048 128     12064
reply_post_free pool        12     12     166528  12

The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops.  With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster.  Here are some benchmarks (of DMA alloc/free
only, not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:        350 ms ( 1x)
  dmapool patches:  18 ms (19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:        8901 ms (   1x)
  dmapool patches:   19 ms ( 477x)



* [PATCH 01/10] dmapool: remove checks for dev == NULL
From: Tony Battersby @ 2022-05-31 18:12 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without one.  But nobody
ended up using device-less dma pools, so the checks in dmapool.c for
pool->dev == NULL are both insufficient and a source of bloat.  Remove
them.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 42 +++++++++++-------------------------------
 1 file changed, 11 insertions(+), 31 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index a7eb5d0eb2da..0f89de408cbe 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool)
 	mutex_lock(&pools_reg_lock);
 	mutex_lock(&pools_lock);
 	list_del(&pool->pools);
-	if (pool->dev && list_empty(&pool->dev->dma_pools))
+	if (list_empty(&pool->dev->dma_pools))
 		empty = true;
 	mutex_unlock(&pools_lock);
 	if (empty)
@@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool)
 
 	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
 		if (is_page_busy(page)) {
-			if (pool->dev)
-				dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-					pool->name, page->vaddr);
-			else
-				pr_err("%s %s, %p busy\n", __func__,
-				       pool->name, page->vaddr);
+			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
+				pool->name, page->vaddr);
 			/* leak the still-in-use consistent memory */
 			list_del(&page->page_list);
 			kfree(page);
@@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 		for (i = sizeof(page->offset); i < pool->size; i++) {
 			if (data[i] == POOL_POISON_FREED)
 				continue;
-			if (pool->dev)
-				dev_err(pool->dev, "%s %s, %p (corrupted)\n",
-					__func__, pool->name, retval);
-			else
-				pr_err("%s %s, %p (corrupted)\n",
-					__func__, pool->name, retval);
+			dev_err(pool->dev, "%s %s, %p (corrupted)\n",
+				__func__, pool->name, retval);
 
 			/*
 			 * Dump the first 4 bytes even if they are not
@@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	page = pool_find_page(pool, dma);
 	if (!page) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		if (pool->dev)
-			dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
-				__func__, pool->name, vaddr, &dma);
-		else
-			pr_err("%s %s, %p/%pad (bad dma)\n",
-			       __func__, pool->name, vaddr, &dma);
+		dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
+			__func__, pool->name, vaddr, &dma);
 		return;
 	}
 
@@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 #ifdef	DMAPOOL_DEBUG
 	if ((dma - page->dma) != offset) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		if (pool->dev)
-			dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
-				__func__, pool->name, vaddr, &dma);
-		else
-			pr_err("%s %s, %p (bad vaddr)/%pad\n",
-			       __func__, pool->name, vaddr, &dma);
+		dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
+			__func__, pool->name, vaddr, &dma);
 		return;
 	}
 	{
@@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 				continue;
 			}
 			spin_unlock_irqrestore(&pool->lock, flags);
-			if (pool->dev)
-				dev_err(pool->dev, "%s %s, dma %pad already free\n",
-					__func__, pool->name, &dma);
-			else
-				pr_err("%s %s, dma %pad already free\n",
-				       __func__, pool->name, &dma);
+			dev_err(pool->dev, "%s %s, dma %pad already free\n",
+				__func__, pool->name, &dma);
 			return;
 		}
 	}
-- 
2.25.1


* [PATCH 02/10] dmapool: cleanup integer types
From: Tony Battersby @ 2022-05-31 18:13 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places.  Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---

This puts an upper bound on 'size' of INT_MAX to avoid overflowing the
following comparison in pool_initialise_page():

unsigned int offset = 0;
unsigned int next = offset + pool->size;
if (unlikely((next + pool->size) > ...

'boundary' is passed in as a size_t but gets stored as an unsigned int.
'boundary' values >= 'allocation' do not have any effect, so clipping
'boundary' to 'allocation' keeps it within the range of unsigned int
without affecting anything else.  A few lines above (not in the diff)
you can see that if 'boundary' is passed in as 0 then it is set to
'allocation', so it is nothing new.  For reference, here is the
relevant code after being patched:

	if (!boundary)
		boundary = allocation;
	else if ((boundary < size) || (boundary & (boundary - 1)))
		return NULL;

	boundary = min(boundary, allocation);
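
For instance (illustrative values, not from the patch): a caller
passing a 64-bit 'boundary' of 2^32 would see it truncate to 0 when
stored in the unsigned int field, breaking the boundary logic entirely.
Clipping it to 'allocation' first is behavior-preserving, since any
boundary >= allocation already had no effect.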

 mm/dmapool.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 0f89de408cbe..d7b372248111 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -43,10 +43,10 @@
 struct dma_pool {		/* the pool */
 	struct list_head page_list;
 	spinlock_t lock;
-	size_t size;
+	unsigned int size;
 	struct device *dev;
-	size_t allocation;
-	size_t boundary;
+	unsigned int allocation;
+	unsigned int boundary;
 	char name[32];
 	struct list_head pools;
 };
@@ -80,7 +80,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 	mutex_lock(&pools_lock);
 	list_for_each_entry(pool, &dev->dma_pools, pools) {
 		unsigned pages = 0;
-		unsigned blocks = 0;
+		size_t blocks = 0;
 
 		spin_lock_irq(&pool->lock);
 		list_for_each_entry(page, &pool->page_list, page_list) {
@@ -90,9 +90,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 		spin_unlock_irq(&pool->lock);
 
 		/* per-pool info, no real statistics yet */
-		temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n",
+		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 				 pool->name, blocks,
-				 pages * (pool->allocation / pool->size),
+				 (size_t) pages *
+				 (pool->allocation / pool->size),
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -139,7 +140,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	else if (align & (align - 1))
 		return NULL;
 
-	if (size == 0)
+	if (size == 0 || size > INT_MAX)
 		return NULL;
 	else if (size < 4)
 		size = 4;
@@ -152,6 +153,8 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	else if ((boundary < size) || (boundary & (boundary - 1)))
 		return NULL;
 
+	boundary = min(boundary, allocation);
+
 	retval = kmalloc(sizeof(*retval), GFP_KERNEL);
 	if (!retval)
 		return retval;
@@ -312,7 +315,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 {
 	unsigned long flags;
 	struct dma_page *page;
-	size_t offset;
+	unsigned int offset;
 	void *retval;
 
 	might_alloc(mem_flags);
-- 
2.25.1


* [PATCH 03/10] dmapool: fix boundary comparison
From: Tony Battersby @ 2022-05-31 18:14 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

Fix the boundary comparison when constructing the list of free blocks
for the case that 'size' is a power of two.  Since 'boundary' is also a
power of two, that would make 'boundary' a multiple of 'size', in which
case a single block would never cross the boundary.  This bug would
cause some of the allocated memory to be wasted (but not leaked).

Example:

size       = 512
boundary   = 2048
allocation = 4096

Address range
   0 -  511
 512 - 1023
1024 - 1535
1536 - 2047 *
2048 - 2559
2560 - 3071
3072 - 3583
3584 - 4095 *

Prior to this fix, the address ranges marked with "*" would not have
been used even though they didn't cross the given boundary.
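
As a sanity check, here is a minimal user-space sketch (standalone test
code, not part of the patch) that mirrors the loop in
pool_initialise_page() and counts usable blocks with the old and new
comparisons; for the values above it prints "old: 6 blocks, fixed: 8
blocks":

#include <stdio.h>

static unsigned int count_blocks(unsigned int size, unsigned int boundary,
				 unsigned int allocation, int fixed)
{
	unsigned int offset = 0, next_boundary = boundary, blocks = 0;

	do {
		unsigned int next = offset + size;

		/* the old test used ">=", the fix uses ">" */
		if (fixed ? (next + size) > next_boundary
			  : (next + size) >= next_boundary) {
			next = next_boundary;
			next_boundary += boundary;
		}
		blocks++;	/* a free block starts at 'offset' */
		offset = next;
	} while (offset < allocation);
	return blocks;
}

int main(void)
{
	printf("old: %u blocks, fixed: %u blocks\n",
	       count_blocks(512, 2048, 4096, 0),
	       count_blocks(512, 2048, 4096, 1));
	return 0;
}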

Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index d7b372248111..782143144a32 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -210,7 +210,7 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 
 	do {
 		unsigned int next = offset + pool->size;
-		if (unlikely((next + pool->size) >= next_boundary)) {
+		if (unlikely((next + pool->size) > next_boundary)) {
 			next = next_boundary;
 			next_boundary += pool->boundary;
 		}
-- 
2.25.1


* [PATCH 04/10] dmapool: improve accuracy of debug statistics
From: Tony Battersby @ 2022-05-31 18:17 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect.  Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.

This depends on the patch "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.
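
As a worked example (illustrative values in which 'boundary' is not a
multiple of 'size'): with size = 96, boundary = 1024, and allocation =
4096, each 1024-byte boundary segment holds 1024 / 96 = 10 blocks, so

	blks_per_alloc = (4096 / 1024) * (1024 / 96) + (4096 % 1024) / 96
	               = 4 * 10 + 0 = 40

whereas the old pages * (allocation / size) calculation would report
4096 / 96 = 42 blocks per page.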

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 782143144a32..9e30f4425dea 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -47,6 +47,7 @@ struct dma_pool {		/* the pool */
 	struct device *dev;
 	unsigned int allocation;
 	unsigned int boundary;
+	unsigned int blks_per_alloc;
 	char name[32];
 	struct list_head pools;
 };
@@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 		/* per-pool info, no real statistics yet */
 		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 				 pool->name, blocks,
-				 (size_t) pages *
-				 (pool->allocation / pool->size),
+				 (size_t) pages * pool->blks_per_alloc,
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	retval->size = size;
 	retval->boundary = boundary;
 	retval->allocation = allocation;
+	retval->blks_per_alloc =
+		(allocation / boundary) * (boundary / size) +
+		(allocation % boundary) / size;
 
 	INIT_LIST_HEAD(&retval->pools);
 
-- 
2.25.1


* [PATCH 05/10] dmapool: debug: prevent endless loop in case of corruption
From: Tony Battersby @ 2022-05-31 18:18 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
driver corrupts DMA pool memory.  The debug freelist walk follows links
stored inside the freed blocks themselves, so a use-after-free or
buffer overflow can splice the freelist into a cycle; bounding the walk
by the number of blocks per allocation detects the corruption and
terminates the loop.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 9e30f4425dea..7a9161d4f7a6 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -426,16 +426,39 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	}
 	{
 		unsigned int chain = page->offset;
+		unsigned int free_blks = 0;
+
 		while (chain < pool->allocation) {
-			if (chain != offset) {
-				chain = *(int *)(page->vaddr + chain);
-				continue;
+			if (unlikely(chain == offset)) {
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"%s %s, dma %pad already free\n",
+					__func__, pool->name, &dma);
+				return;
 			}
-			spin_unlock_irqrestore(&pool->lock, flags);
-			dev_err(pool->dev, "%s %s, dma %pad already free\n",
-				__func__, pool->name, &dma);
-			return;
+
+			/*
+			 * A buggy driver could corrupt the freelist by
+			 * use-after-free, buffer overflow, etc.  Besides
+			 * checking for corruption, this also prevents an
+			 * endless loop in case corruption causes a circular
+			 * loop in the freelist.
+			 */
+			if (unlikely(++free_blks + page->in_use >
+				     pool->blks_per_alloc)) {
+ freelist_corrupt:
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"%s %s, freelist corrupted\n",
+					__func__, pool->name);
+				return;
+			}
+
+			chain = *(int *)(page->vaddr + chain);
 		}
+		if (unlikely(free_blks + page->in_use !=
+			     pool->blks_per_alloc))
+			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
-- 
2.25.1


* [PATCH 06/10] dmapool: ignore init_on_free when DMAPOOL_DEBUG enabled
From: Tony Battersby @ 2022-05-31 18:20 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

There are two cases:

1) In the normal case, where the memory is being freed correctly,
DMAPOOL_DEBUG will memset the memory anyway, so speed things up by
avoiding a double-memset of the same memory.

2) In the abnormal case, where DMAPOOL_DEBUG detects that a driver
passed incorrect parameters to dma_pool_free() (e.g. double-free,
invalid free, mismatched vaddr/dma), that is a kernel bug, and we
can't be sure that the memory behind the possibly-invalid pointer is
really free.  So don't clear it just because init_on_free=1.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 7a9161d4f7a6..49019ef6dd83 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -415,8 +415,6 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	}
 
 	offset = vaddr - page->vaddr;
-	if (want_init_on_free())
-		memset(vaddr, 0, pool->size);
 #ifdef	DMAPOOL_DEBUG
 	if ((dma - page->dma) != offset) {
 		spin_unlock_irqrestore(&pool->lock, flags);
@@ -461,6 +459,9 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
+#else
+	if (want_init_on_free())
+		memset(vaddr, 0, pool->size);
 #endif
 
 	page->in_use--;
-- 
2.25.1


* [PATCH 07/10] dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
From: Tony Battersby @ 2022-05-31 18:21 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

Avoid double-memset of the same allocated memory in dma_pool_alloc()
when both DMAPOOL_DEBUG is enabled and init_on_alloc=1.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 49019ef6dd83..8749a9d7927e 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -365,7 +365,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 			break;
 		}
 	}
-	if (!(mem_flags & __GFP_ZERO))
+	if (!want_init_on_alloc(mem_flags))
 		memset(retval, POOL_POISON_ALLOCATED, pool->size);
 #endif
 	spin_unlock_irqrestore(&pool->lock, flags);
-- 
2.25.1


* [PATCH 08/10] dmapool: cleanup dma_pool_destroy
From: Tony Battersby @ 2022-05-31 18:22 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

Remove a small amount of code duplication between dma_pool_destroy() and
pool_free_page() in preparation for adding more code without having to
duplicate it.  No functional changes.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 8749a9d7927e..58c11dcaa4e4 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page)
 	return page->in_use != 0;
 }
 
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_free_page(struct dma_pool *pool,
+			   struct dma_page *page,
+			   bool destroying_pool)
 {
+	void *vaddr = page->vaddr;
 	dma_addr_t dma = page->dma;
 
+	if (destroying_pool && is_page_busy(page)) {
+		dev_err(pool->dev,
+			"dma_pool_destroy %s, %p busy\n",
+			pool->name, vaddr);
+		/* leak the still-in-use consistent memory */
+	} else {
 #ifdef	DMAPOOL_DEBUG
-	memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+		memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-	dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
+		dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+	}
+
 	list_del(&page->page_list);
 	kfree(page);
 }
@@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-	struct dma_page *page, *tmp;
+	struct dma_page *page;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool)
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
-		if (is_page_busy(page)) {
-			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-				pool->name, page->vaddr);
-			/* leak the still-in-use consistent memory */
-			list_del(&page->page_list);
-			kfree(page);
-		} else
-			pool_free_page(pool, page);
+	while ((page = list_first_entry_or_null(&pool->page_list,
+						struct dma_page,
+						page_list))) {
+		pool_free_page(pool, page, true);
 	}
 
 	kfree(pool);
@@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	page->offset = offset;
 	/*
 	 * Resist a temptation to do
-	 *    if (!is_page_busy(page)) pool_free_page(pool, page);
+	 *    if (!is_page_busy(page)) pool_free_page(pool, page, false);
 	 * Better have a few empty pages hang around.
 	 */
 	spin_unlock_irqrestore(&pool->lock, flags);
-- 
2.25.1


* [PATCH 09/10] dmapool: improve scalability of dma_pool_alloc
From: Tony Battersby @ 2022-05-31 18:23 UTC
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

dma_pool_alloc() scales poorly when allocating a large number of pages
because it does a linear scan of all previously-allocated pages before
allocating a new one.  Improve its scalability by maintaining a separate
list of pages that have free blocks ready to (re)allocate.  In big O
notation, this improves the algorithm from O(n^2) to O(n).
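
The "add only if absent" test on the free path uses a standard list.h
idiom (a sketch of the pattern, not new text from the patch):
list_del_init() re-points an unlinked node at itself, so list_empty()
on the node is an O(1) test for whether the page is currently linked
into avail_page_list:

	if (list_empty(&page->avail_page_link))
		list_add(&page->avail_page_link, &pool->avail_page_list);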

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 58c11dcaa4e4..b3dd2ace0d2a 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -17,6 +17,10 @@
  * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
  * list of free blocks within the page.  Used blocks aren't tracked, but we
  * keep a count of how many are currently allocated from each page.
+ *
+ * The avail_page_list keeps track of pages that have one or more free blocks
+ * available to (re)allocate.  Pages are moved in and out of avail_page_list
+ * as their blocks are allocated and freed.
  */
 
 #include <linux/device.h>
@@ -42,6 +46,7 @@
 
 struct dma_pool {		/* the pool */
 	struct list_head page_list;
+	struct list_head avail_page_list;
 	spinlock_t lock;
 	unsigned int size;
 	struct device *dev;
@@ -54,6 +59,7 @@ struct dma_pool {		/* the pool */
 
 struct dma_page {		/* cacheable header for 'allocation' bytes */
 	struct list_head page_list;
+	struct list_head avail_page_link;
 	void *vaddr;
 	dma_addr_t dma;
 	unsigned int in_use;
@@ -164,6 +170,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	retval->dev = dev;
 
 	INIT_LIST_HEAD(&retval->page_list);
+	INIT_LIST_HEAD(&retval->avail_page_list);
 	spin_lock_init(&retval->lock);
 	retval->size = size;
 	retval->boundary = boundary;
@@ -270,6 +277,7 @@ static void pool_free_page(struct dma_pool *pool,
 	}
 
 	list_del(&page->page_list);
+	list_del(&page->avail_page_link);
 	kfree(page);
 }
 
@@ -330,10 +338,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	might_alloc(mem_flags);
 
 	spin_lock_irqsave(&pool->lock, flags);
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (page->offset < pool->allocation)
-			goto ready;
-	}
+	page = list_first_entry_or_null(&pool->avail_page_list,
+					struct dma_page,
+					avail_page_link);
+	if (page)
+		goto ready;
 
 	/* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
 	spin_unlock_irqrestore(&pool->lock, flags);
@@ -345,10 +354,13 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	spin_lock_irqsave(&pool->lock, flags);
 
 	list_add(&page->page_list, &pool->page_list);
+	list_add(&page->avail_page_link, &pool->avail_page_list);
  ready:
 	page->in_use++;
 	offset = page->offset;
 	page->offset = *(int *)(page->vaddr + offset);
+	if (page->offset >= pool->allocation)
+		list_del_init(&page->avail_page_link);
 	retval = offset + page->vaddr;
 	*handle = offset + page->dma;
 #ifdef	DMAPOOL_DEBUG
@@ -470,6 +482,13 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 		memset(vaddr, 0, pool->size);
 #endif
 
+	/*
+	 * list_empty() on the page tests if the page is already linked into
+	 * avail_page_list to avoid adding it more than once.
+	 */
+	if (list_empty(&page->avail_page_link))
+		list_add(&page->avail_page_link, &pool->avail_page_list);
+
 	page->in_use--;
 	*(int *)vaddr = page->offset;
 	page->offset = offset;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 09/10] dmapool: improve scalability of dma_pool_alloc
@ 2022-05-31 18:23   ` Tony Battersby
  0 siblings, 0 replies; 40+ messages in thread
From: Tony Battersby @ 2022-05-31 18:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Tony Lindgren, Andy Shevchenko, Matthew Wilcox, iommu,
	Keith Busch, kernel-team, Robin Murphy

dma_pool_alloc() scales poorly when allocating a large number of pages
because it does a linear scan of all previously-allocated pages before
allocating a new one.  Improve its scalability by maintaining a separate
list of pages that have free blocks ready to (re)allocate.  In big O
notation, this improves the algorithm from O(n^2) to O(n).

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 58c11dcaa4e4..b3dd2ace0d2a 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -17,6 +17,10 @@
  * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
  * list of free blocks within the page.  Used blocks aren't tracked, but we
  * keep a count of how many are currently allocated from each page.
+ *
+ * The avail_page_list keeps track of pages that have one or more free blocks
+ * available to (re)allocate.  Pages are moved in and out of avail_page_list
+ * as their blocks are allocated and freed.
  */
 
 #include <linux/device.h>
@@ -42,6 +46,7 @@
 
 struct dma_pool {		/* the pool */
 	struct list_head page_list;
+	struct list_head avail_page_list;
 	spinlock_t lock;
 	unsigned int size;
 	struct device *dev;
@@ -54,6 +59,7 @@ struct dma_pool {		/* the pool */
 
 struct dma_page {		/* cacheable header for 'allocation' bytes */
 	struct list_head page_list;
+	struct list_head avail_page_link;
 	void *vaddr;
 	dma_addr_t dma;
 	unsigned int in_use;
@@ -164,6 +170,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	retval->dev = dev;
 
 	INIT_LIST_HEAD(&retval->page_list);
+	INIT_LIST_HEAD(&retval->avail_page_list);
 	spin_lock_init(&retval->lock);
 	retval->size = size;
 	retval->boundary = boundary;
@@ -270,6 +277,7 @@ static void pool_free_page(struct dma_pool *pool,
 	}
 
 	list_del(&page->page_list);
+	list_del(&page->avail_page_link);
 	kfree(page);
 }
 
@@ -330,10 +338,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	might_alloc(mem_flags);
 
 	spin_lock_irqsave(&pool->lock, flags);
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (page->offset < pool->allocation)
-			goto ready;
-	}
+	page = list_first_entry_or_null(&pool->avail_page_list,
+					struct dma_page,
+					avail_page_link);
+	if (page)
+		goto ready;
 
 	/* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
 	spin_unlock_irqrestore(&pool->lock, flags);
@@ -345,10 +354,13 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	spin_lock_irqsave(&pool->lock, flags);
 
 	list_add(&page->page_list, &pool->page_list);
+	list_add(&page->avail_page_link, &pool->avail_page_list);
  ready:
 	page->in_use++;
 	offset = page->offset;
 	page->offset = *(int *)(page->vaddr + offset);
+	if (page->offset >= pool->allocation)
+		list_del_init(&page->avail_page_link);
 	retval = offset + page->vaddr;
 	*handle = offset + page->dma;
 #ifdef	DMAPOOL_DEBUG
@@ -470,6 +482,13 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 		memset(vaddr, 0, pool->size);
 #endif
 
+	/*
+	 * list_empty() on the page tests if the page is already linked into
+	 * avail_page_list to avoid adding it more than once.
+	 */
+	if (list_empty(&page->avail_page_link))
+		list_add(&page->avail_page_link, &pool->avail_page_list);
+
 	page->in_use--;
 	*(int *)vaddr = page->offset;
 	page->offset = offset;
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 10/10] dmapool: improve scalability of dma_pool_free
  2022-05-31 18:11 ` Tony Battersby
@ 2022-05-31 18:23   ` Tony Battersby
  -1 siblings, 0 replies; 40+ messages in thread
From: Tony Battersby @ 2022-05-31 18:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Robin Murphy, Tony Lindgren

dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages.  Improve its
scalability by replacing the linear scan with a red-black tree lookup.
In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
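
For scale: with the ~374,000-page chain pool from the mpt3sas example, the
linear scan touches on the order of n/2 pages per free on average, whereas
a red-black tree descent takes only about log2(374000) ~= 19 comparisons.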

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
---
 mm/dmapool.c | 128 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 100 insertions(+), 28 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index b3dd2ace0d2a..24535483f781 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -12,11 +12,12 @@
  * Many older drivers still have their own code to do this.
  *
  * The current design of this allocator is fairly simple.  The pool is
- * represented by the 'struct dma_pool' which keeps a doubly-linked list of
- * allocated pages.  Each page in the page_list is split into blocks of at
- * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
- * list of free blocks within the page.  Used blocks aren't tracked, but we
- * keep a count of how many are currently allocated from each page.
+ * represented by the 'struct dma_pool' which keeps a red-black tree of all
+ * allocated pages, keyed by DMA address for fast lookup when freeing.
+ * Each page in the page_tree is split into blocks of at least 'size' bytes.
+ * Free blocks are tracked in an unsorted singly-linked list of free blocks
+ * within the page.  Used blocks aren't tracked, but we keep a count of how
+ * many are currently allocated from each page.
  *
  * The avail_page_list keeps track of pages that have one or more free blocks
  * available to (re)allocate.  Pages are moved in and out of avail_page_list
@@ -36,6 +37,7 @@
 #include <linux/slab.h>
 #include <linux/stat.h>
 #include <linux/spinlock.h>
+#include <linux/rbtree.h>
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/wait.h>
@@ -45,7 +47,7 @@
 #endif
 
 struct dma_pool {		/* the pool */
-	struct list_head page_list;
+	struct rb_root page_tree;
 	struct list_head avail_page_list;
 	spinlock_t lock;
 	unsigned int size;
@@ -58,7 +60,7 @@ struct dma_pool {		/* the pool */
 };
 
 struct dma_page {		/* cacheable header for 'allocation' bytes */
-	struct list_head page_list;
+	struct rb_node page_node;
 	struct list_head avail_page_link;
 	void *vaddr;
 	dma_addr_t dma;
@@ -69,6 +71,11 @@ struct dma_page {		/* cacheable header for 'allocation' bytes */
 static DEFINE_MUTEX(pools_lock);
 static DEFINE_MUTEX(pools_reg_lock);
 
+static inline struct dma_page *rb_to_dma_page(struct rb_node *node)
+{
+	return rb_entry(node, struct dma_page, page_node);
+}
+
 static ssize_t pools_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	unsigned temp;
@@ -76,6 +83,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 	char *next;
 	struct dma_page *page;
 	struct dma_pool *pool;
+	struct rb_node *node;
 
 	next = buf;
 	size = PAGE_SIZE;
@@ -90,7 +98,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 		size_t blocks = 0;
 
 		spin_lock_irq(&pool->lock);
-		list_for_each_entry(page, &pool->page_list, page_list) {
+		for (node = rb_first(&pool->page_tree);
+		     node;
+		     node = rb_next(node)) {
+			page = rb_to_dma_page(node);
 			pages++;
 			blocks += page->in_use;
 		}
@@ -169,7 +180,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 
 	retval->dev = dev;
 
-	INIT_LIST_HEAD(&retval->page_list);
+	retval->page_tree = RB_ROOT;
 	INIT_LIST_HEAD(&retval->avail_page_list);
 	spin_lock_init(&retval->lock);
 	retval->size = size;
@@ -213,6 +224,63 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 }
 EXPORT_SYMBOL(dma_pool_create);
 
+/*
+ * Find the dma_page that manages the given DMA address.
+ */
+static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
+{
+	struct rb_node *node = pool->page_tree.rb_node;
+
+	while (node) {
+		struct dma_page *page = rb_to_dma_page(node);
+
+		if (dma < page->dma)
+			node = node->rb_left;
+		else if ((dma - page->dma) >= pool->allocation)
+			node = node->rb_right;
+		else
+			return page;
+	}
+	return NULL;
+}
+
+/*
+ * Insert a dma_page into the page_tree.
+ */
+static int pool_insert_page(struct dma_pool *pool, struct dma_page *new_page)
+{
+	dma_addr_t dma = new_page->dma;
+	struct rb_node **node = &(pool->page_tree.rb_node), *parent = NULL;
+
+	while (*node) {
+		struct dma_page *this_page = rb_to_dma_page(*node);
+
+		parent = *node;
+		if (dma < this_page->dma)
+			node = &((*node)->rb_left);
+		else if (likely((dma - this_page->dma) >= pool->allocation))
+			node = &((*node)->rb_right);
+		else {
+			/*
+			 * A page that overlaps the new DMA range is already
+			 * present in the tree.  This should not happen.
+			 */
+			WARN(1,
+			     "%s: %s: DMA address overlap: old %pad new %pad len %u\n",
+			     dev_name(pool->dev),
+			     pool->name, &this_page->dma, &dma,
+			     pool->allocation);
+			return -1;
+		}
+	}
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&new_page->page_node, parent, node);
+	rb_insert_color(&new_page->page_node, &pool->page_tree);
+
+	return 0;
+}
+
 static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 {
 	unsigned int offset = 0;
@@ -276,8 +344,16 @@ static void pool_free_page(struct dma_pool *pool,
 		dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
 	}
 
-	list_del(&page->page_list);
 	list_del(&page->avail_page_link);
+
+	/*
+	 * If the pool is being destroyed, it is not safe to modify the
+	 * page_tree while iterating over it, and it is also unnecessary since
+	 * the whole tree will be discarded anyway.
+	 */
+	if (!destroying_pool)
+		rb_erase(&page->page_node, &pool->page_tree);
+
 	kfree(page);
 }
 
@@ -291,7 +367,7 @@ static void pool_free_page(struct dma_pool *pool,
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-	struct dma_page *page;
+	struct dma_page *page, *tmp;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -307,9 +383,10 @@ void dma_pool_destroy(struct dma_pool *pool)
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	while ((page = list_first_entry_or_null(&pool->page_list,
-						struct dma_page,
-						page_list))) {
+	rbtree_postorder_for_each_entry_safe(page,
+					     tmp,
+					     &pool->page_tree,
+					     page_node) {
 		pool_free_page(pool, page, true);
 	}
 
@@ -353,7 +430,15 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 
 	spin_lock_irqsave(&pool->lock, flags);
 
-	list_add(&page->page_list, &pool->page_list);
+	if (unlikely(pool_insert_page(pool, page))) {
+		/*
+		 * This should not happen, so something must have gone horribly
+		 * wrong.  Instead of crashing, intentionally leak the memory
+		 * and make for the exit.
+		 */
+		spin_unlock_irqrestore(&pool->lock, flags);
+		return NULL;
+	}
 	list_add(&page->avail_page_link, &pool->avail_page_list);
  ready:
 	page->in_use++;
@@ -395,19 +480,6 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 }
 EXPORT_SYMBOL(dma_pool_alloc);
 
-static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma)
-{
-	struct dma_page *page;
-
-	list_for_each_entry(page, &pool->page_list, page_list) {
-		if (dma < page->dma)
-			continue;
-		if ((dma - page->dma) < pool->allocation)
-			return page;
-	}
-	return NULL;
-}
-
 /**
  * dma_pool_free - put block back into dma pool
  * @pool: the dma pool holding the block
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/10] dmapool: remove checks for dev == NULL
  2022-05-31 18:12   ` Tony Battersby
@ 2022-05-31 18:23     ` Robin Murphy
  -1 siblings, 0 replies; 40+ messages in thread
From: Robin Murphy @ 2022-05-31 18:23 UTC (permalink / raw)
  To: Tony Battersby, linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Tony Lindgren

On 2022-05-31 19:12, Tony Battersby wrote:
> dmapool originally tried to support pools without a device because
> dma_alloc_coherent() supports allocations without a device.  But nobody
> ended up using dma pools without a device, so the current checks in
> dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
> Remove them.

Furthermore, dma_pool_create() already dereferences the dev argument 
unconditionally, so there's no way we could possibly get as far as these 
checks even if a caller did ever pass NULL.

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

> Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
> ---
>   mm/dmapool.c | 42 +++++++++++-------------------------------
>   1 file changed, 11 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index a7eb5d0eb2da..0f89de408cbe 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool)
>   	mutex_lock(&pools_reg_lock);
>   	mutex_lock(&pools_lock);
>   	list_del(&pool->pools);
> -	if (pool->dev && list_empty(&pool->dev->dma_pools))
> +	if (list_empty(&pool->dev->dma_pools))
>   		empty = true;
>   	mutex_unlock(&pools_lock);
>   	if (empty)
> @@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool)
>   
>   	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
>   		if (is_page_busy(page)) {
> -			if (pool->dev)
> -				dev_err(pool->dev, "%s %s, %p busy\n", __func__,
> -					pool->name, page->vaddr);
> -			else
> -				pr_err("%s %s, %p busy\n", __func__,
> -				       pool->name, page->vaddr);
> +			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
> +				pool->name, page->vaddr);
>   			/* leak the still-in-use consistent memory */
>   			list_del(&page->page_list);
>   			kfree(page);
> @@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
>   		for (i = sizeof(page->offset); i < pool->size; i++) {
>   			if (data[i] == POOL_POISON_FREED)
>   				continue;
> -			if (pool->dev)
> -				dev_err(pool->dev, "%s %s, %p (corrupted)\n",
> -					__func__, pool->name, retval);
> -			else
> -				pr_err("%s %s, %p (corrupted)\n",
> -					__func__, pool->name, retval);
> +			dev_err(pool->dev, "%s %s, %p (corrupted)\n",
> +				__func__, pool->name, retval);
>   
>   			/*
>   			 * Dump the first 4 bytes even if they are not
> @@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
>   	page = pool_find_page(pool, dma);
>   	if (!page) {
>   		spin_unlock_irqrestore(&pool->lock, flags);
> -		if (pool->dev)
> -			dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
> -				__func__, pool->name, vaddr, &dma);
> -		else
> -			pr_err("%s %s, %p/%pad (bad dma)\n",
> -			       __func__, pool->name, vaddr, &dma);
> +		dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
> +			__func__, pool->name, vaddr, &dma);
>   		return;
>   	}
>   
> @@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
>   #ifdef	DMAPOOL_DEBUG
>   	if ((dma - page->dma) != offset) {
>   		spin_unlock_irqrestore(&pool->lock, flags);
> -		if (pool->dev)
> -			dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
> -				__func__, pool->name, vaddr, &dma);
> -		else
> -			pr_err("%s %s, %p (bad vaddr)/%pad\n",
> -			       __func__, pool->name, vaddr, &dma);
> +		dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
> +			__func__, pool->name, vaddr, &dma);
>   		return;
>   	}
>   	{
> @@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
>   				continue;
>   			}
>   			spin_unlock_irqrestore(&pool->lock, flags);
> -			if (pool->dev)
> -				dev_err(pool->dev, "%s %s, dma %pad already free\n",
> -					__func__, pool->name, &dma);
> -			else
> -				pr_err("%s %s, dma %pad already free\n",
> -				       __func__, pool->name, &dma);
> +			dev_err(pool->dev, "%s %s, dma %pad already free\n",
> +				__func__, pool->name, &dma);
>   			return;
>   		}
>   	}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy
  2022-05-31 18:22   ` Tony Battersby
@ 2022-05-31 19:33     ` Robin Murphy
  -1 siblings, 0 replies; 40+ messages in thread
From: Robin Murphy @ 2022-05-31 19:33 UTC (permalink / raw)
  To: Tony Battersby, linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Tony Lindgren

On 2022-05-31 19:22, Tony Battersby wrote:
> Remove a small amount of code duplication between dma_pool_destroy() and
> pool_free_page() in preparation for adding more code without having to
> duplicate it.  No functional changes.
> 
> Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
> ---
>   mm/dmapool.c | 34 ++++++++++++++++++++--------------
>   1 file changed, 20 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index 8749a9d7927e..58c11dcaa4e4 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page)
>   	return page->in_use != 0;
>   }
>   
> -static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
> +static void pool_free_page(struct dma_pool *pool,
> +			   struct dma_page *page,
> +			   bool destroying_pool)
>   {
> +	void *vaddr = page->vaddr;
>   	dma_addr_t dma = page->dma;
>   
> +	if (destroying_pool && is_page_busy(page)) {
> +		dev_err(pool->dev,
> +			"dma_pool_destroy %s, %p busy\n",
> +			pool->name, vaddr);
> +		/* leak the still-in-use consistent memory */
> +	} else {
>   #ifdef	DMAPOOL_DEBUG
> -	memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
> +		memset(vaddr, POOL_POISON_FREED, pool->allocation);
>   #endif
> -	dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
> +		dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
> +	}
> +
>   	list_del(&page->page_list);

If we're tearing down the whole pool, surely we can skip this as well? 
(Same for the second list in patch #9)

In fact I think it might make more sense to refactor in the opposite 
direction and just streamline this directly into dma_pool_destroy(), 
more like:

	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
		if (is_page_busy(page)) {
			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
				pool->name, page->vaddr);
			/* leak the still-in-use consistent memory */
		} else {
			dma_free_coherent(pool->dev, pool->allocation,
					  page->vaddr, page->dma);
		}
		kfree(page);
	}

>   	kfree(page);
>   }
> @@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
>    */
>   void dma_pool_destroy(struct dma_pool *pool)
>   {
> -	struct dma_page *page, *tmp;
> +	struct dma_page *page;

Nit: you bring this back again in patch #10, so we may as well leave the 
list_for_each_entry_safe() iterator in place until then as well, and 
save a bit of churn in this patch.

>   	bool empty = false;
>   
>   	if (unlikely(!pool))
> @@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool)
>   		device_remove_file(pool->dev, &dev_attr_pools);
>   	mutex_unlock(&pools_reg_lock);
>   
> -	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
> -		if (is_page_busy(page)) {
> -			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
> -				pool->name, page->vaddr);
> -			/* leak the still-in-use consistent memory */
> -			list_del(&page->page_list);
> -			kfree(page);
> -		} else
> -			pool_free_page(pool, page);
> +	while ((page = list_first_entry_or_null(&pool->page_list,
> +						struct dma_page,
> +						page_list))) {
> +		pool_free_page(pool, page, true);
>   	}
>   
>   	kfree(pool);
> @@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
>   	page->offset = offset;
>   	/*
>   	 * Resist a temptation to do
> -	 *    if (!is_page_busy(page)) pool_free_page(pool, page);
> +	 *    if (!is_page_busy(page)) pool_free_page(pool, page, false);

Further to the above, even if we did retain a separate function, if an 
argument is hard-coded at the one single callsite, and the only 
reference to passing any other value is in a comment effectively saying 
"don't do this", do we really need to pretend it's an argument at all? ;)

FWIW I'd just reword the comment in more general terms, e.g. "Resist the 
temptation to free unused pages immediately..."

Thanks,
Robin.

>   	 * Better have a few empty pages hang around.
>   	 */
>   	spin_unlock_irqrestore(&pool->lock, flags);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
  2022-05-31 18:17   ` Tony Battersby
@ 2022-05-31 19:48     ` Robin Murphy
  -1 siblings, 0 replies; 40+ messages in thread
From: Robin Murphy @ 2022-05-31 19:48 UTC (permalink / raw)
  To: Tony Battersby, linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Tony Lindgren

On 2022-05-31 19:17, Tony Battersby wrote:
> The "total number of blocks in pool" debug statistic currently does not
> take the boundary value into account, so it diverges from the "total
> number of blocks in use" statistic when a boundary is in effect.  Add a
> calculation for the number of blocks per allocation that takes the
> boundary into account, and use it to replace the inaccurate calculation.
> 
> This depends on the patch "dmapool: fix boundary comparison" for the
> calculated blks_per_alloc value to be correct.
> 
> Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
> ---
>   mm/dmapool.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index 782143144a32..9e30f4425dea 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -47,6 +47,7 @@ struct dma_pool {		/* the pool */
>   	struct device *dev;
>   	unsigned int allocation;
>   	unsigned int boundary;
> +	unsigned int blks_per_alloc;
>   	char name[32];
>   	struct list_head pools;
>   };
> @@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
>   		/* per-pool info, no real statistics yet */
>   		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",

Nit: if we're tinkering with this, it's probably worth updating the 
whole function to use sysfs_emit{_at}().

>   				 pool->name, blocks,
> -				 (size_t) pages *
> -				 (pool->allocation / pool->size),
> +				 (size_t) pages * pool->blks_per_alloc,
>   				 pool->size, pages);
>   		size -= temp;
>   		next += temp;
> @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
>   	retval->size = size;
>   	retval->boundary = boundary;
>   	retval->allocation = allocation;
> +	retval->blks_per_alloc =
> +		(allocation / boundary) * (boundary / size) +
> +		(allocation % boundary) / size;

Do we really need to store this? Sure, 4 divisions (which could possibly 
be fewer given the constraints on boundary) isn't the absolute cheapest 
calculation, but I still can't imagine anyone would be polling sysfs 
stats hard enough to even notice.
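
Something like the below, computed on demand (just a sketch reusing the
same expression from the patch):

	static unsigned int pool_blks_per_alloc(struct dma_pool *pool)
	{
		return (pool->allocation / pool->boundary) *
		       (pool->boundary / pool->size) +
		       (pool->allocation % pool->boundary) / pool->size;
	}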

Thanks,
Robin.

>   
>   	INIT_LIST_HEAD(&retval->pools);
>   

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
  2022-05-31 19:48     ` Robin Murphy
@ 2022-05-31 19:52       ` Tony Battersby
  -1 siblings, 0 replies; 40+ messages in thread
From: Tony Battersby @ 2022-05-31 19:52 UTC (permalink / raw)
  To: Robin Murphy, linux-mm, linux-kernel
  Cc: iommu, kernel-team, Matthew Wilcox, Keith Busch, Andy Shevchenko,
	Tony Lindgren

On 5/31/22 15:48, Robin Murphy wrote:
> On 2022-05-31 19:17, Tony Battersby wrote:
>
>>   				 pool->name, blocks,
>> -				 (size_t) pages *
>> -				 (pool->allocation / pool->size),
>> +				 (size_t) pages * pool->blks_per_alloc,
>>   				 pool->size, pages);
>>   		size -= temp;
>>   		next += temp;
>> @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
>>   	retval->size = size;
>>   	retval->boundary = boundary;
>>   	retval->allocation = allocation;
>> +	retval->blks_per_alloc =
>> +		(allocation / boundary) * (boundary / size) +
>> +		(allocation % boundary) / size;
> Do we really need to store this? Sure, 4 divisions (which could possibly 
> be fewer given the constraints on boundary) isn't the absolute cheapest 
> calculation, but I still can't imagine anyone would be polling sysfs 
> stats hard enough to even notice.
>
The stored value is also used in patch #5, in more performance-critical
code, although only when debug is enabled.

Tony


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy
  2022-05-31 18:22   ` Tony Battersby
@ 2022-05-31 21:40     ` Keith Busch
  -1 siblings, 0 replies; 40+ messages in thread
From: Keith Busch @ 2022-05-31 21:40 UTC (permalink / raw)
  To: Tony Battersby
  Cc: linux-mm, linux-kernel, iommu, kernel-team, Matthew Wilcox,
	Andy Shevchenko, Robin Murphy, Tony Lindgren

On Tue, May 31, 2022 at 02:22:21PM -0400, Tony Battersby wrote:
> +static void pool_free_page(struct dma_pool *pool,
> +			   struct dma_page *page,
> +			   bool destroying_pool)

'destroying_pool' is always true, so I don't think you need it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free
  2022-05-31 18:23   ` Tony Battersby
@ 2022-05-31 21:54     ` Keith Busch
  -1 siblings, 0 replies; 40+ messages in thread
From: Keith Busch @ 2022-05-31 21:54 UTC (permalink / raw)
  To: Tony Battersby
  Cc: linux-mm, linux-kernel, iommu, kernel-team, Matthew Wilcox,
	Andy Shevchenko, Robin Murphy, Tony Lindgren

On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote:
> dma_pool_free() scales poorly when the pool contains many pages because
> pool_find_page() does a linear scan of all allocated pages.  Improve its
> scalability by replacing the linear scan with a red-black tree lookup.
> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).

The improvement should say O(n) to O(log n), right?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
  2022-05-31 19:52       ` Tony Battersby
@ 2022-05-31 21:55         ` Robin Murphy
  -1 siblings, 0 replies; 40+ messages in thread
From: Robin Murphy @ 2022-05-31 21:55 UTC (permalink / raw)
  To: Tony Battersby, linux-mm, linux-kernel
  Cc: Tony Lindgren, Andy Shevchenko, Matthew Wilcox, iommu,
	Keith Busch, kernel-team

On 2022-05-31 20:52, Tony Battersby wrote:
> On 5/31/22 15:48, Robin Murphy wrote:
>> On 2022-05-31 19:17, Tony Battersby wrote:
>>
>>>    				 pool->name, blocks,
>>> -				 (size_t) pages *
>>> -				 (pool->allocation / pool->size),
>>> +				 (size_t) pages * pool->blks_per_alloc,
>>>    				 pool->size, pages);
>>>    		size -= temp;
>>>    		next += temp;
>>> @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
>>>    	retval->size = size;
>>>    	retval->boundary = boundary;
>>>    	retval->allocation = allocation;
>>> +	retval->blks_per_alloc =
>>> +		(allocation / boundary) * (boundary / size) +
>>> +		(allocation % boundary) / size;
>> Do we really need to store this? Sure, 4 divisions (which could possibly
>> be fewer given the constraints on boundary) isn't the absolute cheapest
>> calculation, but I still can't imagine anyone would be polling sysfs
>> stats hard enough to even notice.
>>
> The stored value is also used in patch #5, in more performance-critical
> code, although only when debug is enabled.

Ah, fair enough. On second look I think 64-bit systems could effectively 
store this for free anyway, if patch #2 moved "size" down past "dev" in 
struct dma_pool, such that blks_per_alloc then ends up padding out the 
hole again.

FWIW the mathematician in me also now can't help seeing the algebraic 
reduction to at least "(allocation + (allocation % boundary)) / size", 
but is now too tired to reason about the power-of-two constraints and 
whether the intermediate integer truncations matter...
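
(A quick check with made-up numbers: allocation = 4096, boundary = 1024,
size = 96.  The stored expression gives (4096/1024) * (1024/96) + 0 =
4 * 10 = 40 blocks, while the reduced form gives 4096 / 96 = 42, so the
per-boundary truncation does matter when size does not divide boundary.)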

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free
  2022-05-31 21:54     ` Keith Busch
@ 2022-05-31 22:10       ` Tony Battersby
  -1 siblings, 0 replies; 40+ messages in thread
From: Tony Battersby @ 2022-05-31 22:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-mm, linux-kernel, iommu, kernel-team, Matthew Wilcox,
	Andy Shevchenko, Robin Murphy, Tony Lindgren

On 5/31/22 17:54, Keith Busch wrote:
> On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote:
>> dma_pool_free() scales poorly when the pool contains many pages because
>> pool_find_page() does a linear scan of all allocated pages.  Improve its
>> scalability by replacing the linear scan with a red-black tree lookup.
>> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
> The improvement should say O(n) to O(log n), right?

That would be the improvement for a single call to dma_pool_alloc or
dma_pool_free, but I was going with the improvement for "n" calls
instead, which is consistent with the improvement for the example in the
cover letter for mpt3sas.  I would have to look up the convention to be
sure of the proper notation in a situation like this, but I usually
think "inserting N items takes N^2 time"; to me it makes less sense to
say "inserting 1 item takes N time", because the N seems to come out of
nowhere.
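
(In summation form: the i-th free with a linear scan costs O(i), so n
frees cost sum(i = 1..n) i = O(n^2) total; with the tree each free costs
O(log i), for O(n log n) total.)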

Tony


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free
  2022-05-31 22:10       ` Tony Battersby
@ 2022-06-01  9:44         ` Robin Murphy
  -1 siblings, 0 replies; 40+ messages in thread
From: Robin Murphy @ 2022-06-01  9:44 UTC (permalink / raw)
  To: Tony Battersby, Keith Busch
  Cc: linux-mm, linux-kernel, iommu, kernel-team, Matthew Wilcox,
	Andy Shevchenko, Tony Lindgren

On 2022-05-31 23:10, Tony Battersby wrote:
> On 5/31/22 17:54, Keith Busch wrote:
>> On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote:
>>> dma_pool_free() scales poorly when the pool contains many pages because
>>> pool_find_page() does a linear scan of all allocated pages.  Improve its
>>> scalability by replacing the linear scan with a red-black tree lookup.
>>> In big O notation, this improves the algorithm from O(n^2) to O(n * log n).
>> The improvement should say O(n) to O(log n), right?
> 
> That would be the improvement for a single call to dma_pool_alloc or
> dma_pool_free, but I was going with the improvement for "n" calls
> instead, which is consistent with the improvement for the example in the
> cover letter for mpt3sas.  I would have to look up the convention to be
> sure of the proper notation in a situation like this, but I usually
> think "inserting N items takes N^2 time"; to me it makes less sense to
> say "inserting 1 item takes N time", because the N seems to come out of
> nowhere.

No, n represents the size of the set that the algorithm is inserting 
into (or removing from, searching within, etc.). Hence why constant time 
is represented by O(1), i.e. not involving the variable at all.

Robin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-06-01  9:45 UTC | newest]

Thread overview: 40+ messages
2022-05-31 18:11 [PATCH 00/10] mpt3sas and dmapool scalability Tony Battersby
2022-05-31 18:12 ` [PATCH 01/10] dmapool: remove checks for dev == NULL Tony Battersby
2022-05-31 18:23   ` Robin Murphy
2022-05-31 18:13 ` [PATCH 02/10] dmapool: cleanup integer types Tony Battersby
2022-05-31 18:14 ` [PATCH 03/10] dmapool: fix boundary comparison Tony Battersby
2022-05-31 18:17 ` [PATCH 04/10] dmapool: improve accuracy of debug statistics Tony Battersby
2022-05-31 19:48   ` Robin Murphy
2022-05-31 19:52     ` Tony Battersby
2022-05-31 21:55       ` Robin Murphy
2022-05-31 18:18 ` [PATCH 05/10] dmapool: debug: prevent endless loop in case of corruption Tony Battersby
2022-05-31 18:20 ` [PATCH 06/10] dmapool: ignore init_on_free when DMAPOOL_DEBUG enabled Tony Battersby
2022-05-31 18:21 ` [PATCH 07/10] dmapool: speedup DMAPOOL_DEBUG with init_on_alloc Tony Battersby
2022-05-31 18:22 ` [PATCH 08/10] dmapool: cleanup dma_pool_destroy Tony Battersby
2022-05-31 19:33   ` Robin Murphy
2022-05-31 21:40   ` Keith Busch
2022-05-31 18:23 ` [PATCH 09/10] dmapool: improve scalability of dma_pool_alloc Tony Battersby
2022-05-31 18:23 ` [PATCH 10/10] dmapool: improve scalability of dma_pool_free Tony Battersby
2022-05-31 21:54   ` Keith Busch
2022-05-31 22:10     ` Tony Battersby
2022-06-01  9:44       ` Robin Murphy
