* [PATCH v1 0/3] swiotlb performance optimizations
@ 2022-06-28  7:01 ` Chao Gao
  0 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-28  7:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, ilpo.jarvinen, Chao Gao,
	Andrew Morton, Borislav Petkov, Damien Le Moal, iommu, Kees Cook,
	linux-doc, linux-pm, Muchun Song, Paul E. McKenney, Randy Dunlap

Intent of this post:
 Seek reviews from Intel reviewers and anyone else on the list who is
 interested in IO performance in confidential VMs. I need some Acked-by/
 Reviewed-by tags before I can add the swiotlb maintainers to the "to/cc"
 lists and ask them for a review.

swiotlb is now widely used by confidential VMs. This series optimizes
swiotlb to reduce cache misses and lock contention during bounce buffer
allocation/freeing and memory bouncing, which improves IO workload
performance in confidential VMs.

Here are some FIO tests we did to demonstrate the improvement.

Test setup
----------

A normal VM with 8 vCPUs and 32GB memory; swiotlb is enabled via
swiotlb=force. A value of 100 in the Host/Guest CPU utilization columns
corresponds to one fully utilized logical processor. FIO block size is 4K
and iodepth is 256. Note that a normal VM is used so that others who lack
the hardware needed to host confidential VMs can reproduce the results
below.

Results
-------

1 FIO job	read/write	Throughput	IOPS	Host CPU	Guest CPU
				(MB/s)		(k)	utilization	utilization
vanilla		read		1037		253	228.48		101.92
		write		1148		280	233.28		100.96
optimized	read		1160		283	232.32		101.12
		write		1195		292	233.28		100.64

1-job FIO sequential read/write perf increases by 12% and 4%, respectively.

4 FIO jobs	read/write	Throughput	IOPS	Host CPU	Guest CPU
				(MB/s)		(k)	utilization	utilization
vanilla		read		885		214.9	527.04		401.12
		write		868		212.1	531.84		400.64
optimized	read		2320		567	344.64		202.8
		write		1998		488	312		173.92

4-job FIO sequential read/write perf increases by 164% and 130%, respectively.

This series is based on 5.19-rc2.

Andi Kleen (1):
  swiotlb: Split up single swiotlb lock

Chao Gao (2):
  swiotlb: Use bitmap to track free slots
  swiotlb: Allocate memory in a cache-friendly way

 .../admin-guide/kernel-parameters.txt         |   4 +-
 arch/x86/kernel/acpi/boot.c                   |   4 +
 include/linux/swiotlb.h                       |  47 +++-
 kernel/dma/swiotlb.c                          | 263 +++++++++++++-----
 4 files changed, 229 insertions(+), 89 deletions(-)

-- 
2.25.1


* [PATCH v1 1/3] swiotlb: Use bitmap to track free slots
  2022-06-28  7:01 ` Chao Gao
@ 2022-06-28  7:01   ` Chao Gao
  -1 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-28  7:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, ilpo.jarvinen, Chao Gao, iommu

Currently, each slot tracks the number of contiguous free slots starting
from itself. This helps to quickly check whether there are enough contiguous
entries when handling an allocation request. But maintaining this
information can lead to some overhead. Specifically, when a slot is
allocated/freed, preceding slots may need to be updated because the number
of contiguous free slots can change. This process may access memory
scattered across multiple cachelines.

To reduce the overhead of maintaining the number of contiguous free
entries, use a global bitmap to track free slots; each bit indicates
whether a slot is available. The number of contiguous free slots can be
determined by counting the number of consecutive 1s in the bitmap.
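
As a rough userspace sketch (not the kernel implementation, which uses
find_next_zero_bit(); the helpers below are made up for illustration),
the contiguity check boils down to verifying that no zero bit appears in
[index, index + nslots):

#include <stdbool.h>
#include <limits.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* bit i == 1 means slot i is free */
static bool slot_is_free(const unsigned long *bitmap, unsigned int i)
{
	return bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG));
}

/* same test as find_next_zero_bit(bitmap, index + nslots, index) == index + nslots */
static bool slots_are_free(const unsigned long *bitmap,
			   unsigned int index, unsigned int nslots)
{
	for (unsigned int i = index; i < index + nslots; i++)
		if (!slot_is_free(bitmap, i))
			return false;
	return true;
}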

Tests show that the average cost of freeing slots drops by 120 cycles
while the average cost of allocation increases by 20 cycles. Overall,
100 cycles are saved per allocation/free pair.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 include/linux/swiotlb.h |  6 ++--
 kernel/dma/swiotlb.c    | 64 ++++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..c3eab237991a 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -78,8 +78,6 @@ extern enum swiotlb_force swiotlb_force;
  *		@end. For default swiotlb, this is command line adjustable via
  *		setup_io_tlb_npages.
  * @used:	The number of used IO TLB block.
- * @list:	The free list describing the number of free entries available
- *		from each index.
  * @index:	The index to start searching in the next round.
  * @orig_addr:	The original address corresponding to a mapped entry.
  * @alloc_size:	Size of the allocated buffer.
@@ -89,6 +87,8 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @bitmap:	The bitmap used to track free entries. 1 in bit X means the slot
+ *		indexed by X is free.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -105,8 +105,8 @@ struct io_tlb_mem {
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
-		unsigned int list;
 	} *slots;
+	unsigned long *bitmap;
 };
 extern struct io_tlb_mem io_tlb_default_mem;
 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index cb50f8d38360..d7f68c0af7f5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -207,7 +207,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 
 	spin_lock_init(&mem->lock);
 	for (i = 0; i < mem->nslabs; i++) {
-		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
 	}
@@ -274,6 +274,11 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
+	mem->bitmap = memblock_alloc(BITS_TO_BYTES(nslabs), SMP_CACHE_BYTES);
+	if (!mem->bitmap)
+		panic("%s: Failed to allocate %lu bytes align=0x%x\n",
+		      __func__, DIV_ROUND_UP(nslabs, BITS_PER_BYTE), SMP_CACHE_BYTES);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, flags, false);
 
 	if (flags & SWIOTLB_VERBOSE)
@@ -337,10 +342,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 			(PAGE_SIZE << order) >> 20);
 	}
 
+	mem->bitmap = bitmap_zalloc(nslabs, GFP_KERNEL);
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
-	if (!mem->slots) {
+	if (!mem->slots || !mem->bitmap) {
 		free_pages((unsigned long)vstart, order);
+		bitmap_free(mem->bitmap);
+		kfree(mem->slots);
 		return -ENOMEM;
 	}
 
@@ -498,7 +506,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	unsigned int iotlb_align_mask =
 		dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
 	unsigned int nslots = nr_slots(alloc_size), stride;
-	unsigned int index, wrap, count = 0, i;
+	unsigned int index, wrap, i;
 	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
 	unsigned long flags;
 
@@ -514,6 +522,9 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
 	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
 
+	/* slots shouldn't cross one segment */
+	max_slots = min_t(unsigned long, max_slots, IO_TLB_SEGSIZE);
+
 	spin_lock_irqsave(&mem->lock, flags);
 	if (unlikely(nslots > mem->nslabs - mem->used))
 		goto not_found;
@@ -535,8 +546,15 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		if (!iommu_is_span_boundary(index, nslots,
 					    nr_slots(tbl_dma_addr),
 					    max_slots)) {
-			if (mem->slots[index].list >= nslots)
+			if (find_next_zero_bit(mem->bitmap, index + nslots, index) ==
+					index + nslots)
 				goto found;
+		} else {
+			/*
+			 * Remaining slots between current one and the next
+			 * bounary cannot meet our requirement.
+			 */
+			index = wrap_index(mem, round_up(index, max_slots));
 		}
 		index = wrap_index(mem, index + stride);
 	} while (index != wrap);
@@ -547,14 +565,10 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 
 found:
 	for (i = index; i < index + nslots; i++) {
-		mem->slots[i].list = 0;
+		__clear_bit(i, mem->bitmap);
 		mem->slots[i].alloc_size =
 			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
 	}
-	for (i = index - 1;
-	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
-	     mem->slots[i].list; i--)
-		mem->slots[i].list = ++count;
 
 	/*
 	 * Update the indices to avoid searching in the next round.
@@ -628,38 +642,19 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
-	int count, i;
+	int i;
 
-	/*
-	 * Return the buffer to the free list by setting the corresponding
-	 * entries to indicate the number of contiguous entries available.
-	 * While returning the entries to the free list, we merge the entries
-	 * with slots below and above the pool being returned.
-	 */
 	spin_lock_irqsave(&mem->lock, flags);
-	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
-		count = mem->slots[index + nslots].list;
-	else
-		count = 0;
-
 	/*
-	 * Step 1: return the slots to the free list, merging the slots with
-	 * superceeding slots
+	 * Return the slots to swiotlb, updating bitmap to indicate
+	 * corresponding entries are free.
 	 */
 	for (i = index + nslots - 1; i >= index; i--) {
-		mem->slots[i].list = ++count;
+		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
 	}
 
-	/*
-	 * Step 2: merge the returned slots with the preceding slots, if
-	 * available (non zero)
-	 */
-	for (i = index - 1;
-	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list;
-	     i--)
-		mem->slots[i].list = ++count;
 	mem->used -= nslots;
 	spin_unlock_irqrestore(&mem->lock, flags);
 }
@@ -826,7 +821,10 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
 			return -ENOMEM;
 
 		mem->slots = kcalloc(nslabs, sizeof(*mem->slots), GFP_KERNEL);
-		if (!mem->slots) {
+		mem->bitmap = bitmap_zalloc(nslabs, GFP_KERNEL);
+		if (!mem->slots || !mem->bitmap) {
+			kfree(mem->slots);
+			bitmap_free(mem->bitmap);
 			kfree(mem);
 			return -ENOMEM;
 		}
-- 
2.25.1


* [PATCH v1 2/3] swiotlb: Allocate memory in a cache-friendly way
  2022-06-28  7:01 ` Chao Gao
@ 2022-06-28  7:01   ` Chao Gao
  -1 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-28  7:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, ilpo.jarvinen, Chao Gao, Andi Kleen,
	iommu

Currently, swiotlb uses a global index to indicate the starting point
of the next search. The index increases from 0 to the number of slots - 1
and then wraps around. This is straightforward but not cache-friendly
because the "oldest" slot in swiotlb tends to be used first.

Freed slots were probably accessed right before being freed, especially
in a VM (device backends access them in DMA_TO_DEVICE mode; the guest
accesses them in other DMA modes). Thus just-freed slots are likely to
still reside in the cache, and reusing them can reduce cache misses.

To that end, maintain a free list of slots: freed slots are inserted at
the head, and the search for free slots always starts from the head.
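
A minimal userspace sketch of the LIFO reuse idea (the kernel code uses
list_head; the "slot" type and helpers below are made up for
illustration): freed slots are pushed onto the head of the free list and
allocation pops from the head, so the most recently freed slot, which is
most likely still cache-hot, is handed out first:

#include <stddef.h>

struct slot {
	struct slot *next;
	/* per-slot bookkeeping (orig_addr, alloc_size, ...) lives here */
};

static struct slot *free_head;

static void slot_free(struct slot *s)
{
	s->next = free_head;		/* insert at the head */
	free_head = s;
}

static struct slot *slot_alloc(void)
{
	struct slot *s = free_head;	/* allocation starts from the head */

	if (s)
		free_head = s->next;
	return s;
}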

With this optimization, network throughput of sending data from host to
guest, measured by iperf3, increases by 7%.

A downside of this patch is that we can no longer use a large stride to
skip unaligned slots when there is an alignment requirement. Currently, a
large stride is used when a) the device has an alignment requirement, in
which case the stride is calculated from that requirement; or b) the
requested size is larger than PAGE_SIZE. For x86 with a 4KB page size,
the stride is set to 2.

For case a), few devices have an alignment requirement, so the impact is
limited. For case b), this patch probably leads to one (or more, if the
page size is larger than 4K) additional lookup; but as the "io_tlb_slot"
structs of free slots are also accessed when freeing slots, they probably
reside in the CPU cache as well, so the overhead is almost negligible.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 include/linux/swiotlb.h | 13 +++++----
 kernel/dma/swiotlb.c    | 63 ++++++++++++++++-------------------------
 2 files changed, 32 insertions(+), 44 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c3eab237991a..012a6fde873b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,12 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
 
+struct io_tlb_slot {
+	phys_addr_t orig_addr;
+	size_t alloc_size;
+	struct list_head node;
+};
+
 /**
  * struct io_tlb_mem - IO TLB Memory Pool Descriptor
  *
@@ -96,16 +102,13 @@ struct io_tlb_mem {
 	void *vaddr;
 	unsigned long nslabs;
 	unsigned long used;
-	unsigned int index;
+	struct list_head free_slots;
 	spinlock_t lock;
 	struct dentry *debugfs;
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
-	struct io_tlb_slot {
-		phys_addr_t orig_addr;
-		size_t alloc_size;
-	} *slots;
+	struct io_tlb_slot *slots;
 	unsigned long *bitmap;
 };
 extern struct io_tlb_mem io_tlb_default_mem;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d7f68c0af7f5..2283f3a36f0d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -200,7 +200,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->nslabs = nslabs;
 	mem->start = start;
 	mem->end = mem->start + bytes;
-	mem->index = 0;
+	INIT_LIST_HEAD(&mem->free_slots);
 	mem->late_alloc = late_alloc;
 
 	mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
@@ -210,6 +210,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
+		list_add_tail(&mem->slots[i].node, &mem->free_slots);
 	}
 
 	/*
@@ -484,13 +485,6 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }
 
-static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
-{
-	if (index >= mem->nslabs)
-		return 0;
-	return index;
-}
-
 /*
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
@@ -499,29 +493,21 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 			      size_t alloc_size, unsigned int alloc_align_mask)
 {
 	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_slot *slot, *tmp;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
+	dma_addr_t slot_dma_addr;
 	unsigned long max_slots = get_max_slots(boundary_mask);
 	unsigned int iotlb_align_mask =
 		dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
-	unsigned int nslots = nr_slots(alloc_size), stride;
-	unsigned int index, wrap, i;
+	unsigned int nslots = nr_slots(alloc_size);
+	unsigned int index, i;
 	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
 	unsigned long flags;
 
 	BUG_ON(!nslots);
 
-	/*
-	 * For mappings with an alignment requirement don't bother looping to
-	 * unaligned slots once we found an aligned one.  For allocations of
-	 * PAGE_SIZE or larger only look for page aligned allocations.
-	 */
-	stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
-	if (alloc_size >= PAGE_SIZE)
-		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
-	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
-
 	/* slots shouldn't cross one segment */
 	max_slots = min_t(unsigned long, max_slots, IO_TLB_SEGSIZE);
 
@@ -529,15 +515,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	if (unlikely(nslots > mem->nslabs - mem->used))
 		goto not_found;
 
-	index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
-	do {
+	list_for_each_entry_safe(slot, tmp, &mem->free_slots, node) {
+		index = slot - mem->slots;
+		slot_dma_addr = slot_addr(tbl_dma_addr, index);
 		if (orig_addr &&
-		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
+		    (slot_dma_addr & iotlb_align_mask) !=
 			    (orig_addr & iotlb_align_mask)) {
-			index = wrap_index(mem, index + 1);
 			continue;
 		}
 
+		/* Ensure requested alignment is met */
+		if (alloc_align_mask && (slot_dma_addr & (alloc_align_mask - 1)))
+			continue;
+
+		/*
+		 * If the requested size is larger than a page, ensure the
+		 * allocated memory is page aligned.
+		 */
+		if (alloc_size >= PAGE_SIZE && (slot_dma_addr & ~PAGE_MASK))
+			continue;
+
 		/*
 		 * If we find a slot that indicates we have 'nslots' number of
 		 * contiguous buffers, we allocate the buffers from that slot
@@ -549,15 +546,8 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 			if (find_next_zero_bit(mem->bitmap, index + nslots, index) ==
 					index + nslots)
 				goto found;
-		} else {
-			/*
-			 * Remaining slots between current one and the next
-			 * bounary cannot meet our requirement.
-			 */
-			index = wrap_index(mem, round_up(index, max_slots));
 		}
-		index = wrap_index(mem, index + stride);
-	} while (index != wrap);
+	}
 
 not_found:
 	spin_unlock_irqrestore(&mem->lock, flags);
@@ -568,15 +558,9 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		__clear_bit(i, mem->bitmap);
 		mem->slots[i].alloc_size =
 			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
+		list_del(&mem->slots[i].node);
 	}
 
-	/*
-	 * Update the indices to avoid searching in the next round.
-	 */
-	if (index + nslots < mem->nslabs)
-		mem->index = index + nslots;
-	else
-		mem->index = 0;
 	mem->used += nslots;
 
 	spin_unlock_irqrestore(&mem->lock, flags);
@@ -653,6 +637,7 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
+		list_add(&mem->slots[i].node, &mem->free_slots);
 	}
 
 	mem->used -= nslots;
-- 
2.25.1


* [PATCH v1 3/3] swiotlb: Split up single swiotlb lock
  2022-06-28  7:01 ` Chao Gao
@ 2022-06-28  7:01   ` Chao Gao
  -1 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-28  7:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: len.brown, tony.luck, Kees Cook, Paul E. McKenney, linux-doc,
	Damien Le Moal, rafael.j.wysocki, dave.hansen, iommu, Andi Kleen,
	Randy Dunlap, Borislav Petkov, Muchun Song, linux-pm,
	ilpo.jarvinen, dan.j.williams, reinette.chatre, Andrew Morton,
	kirill.shutemov

From: Andi Kleen <ak@linux.intel.com>

Traditionally, swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock. We've seen 20+% of CPU
time spent in locks in some extreme cases.

This patch splits the swiotlb into individual areas which have their
own lock. Each CPU tries to allocate in its own area first. Only if
that fails does it search other areas. On freeing, the allocation is
returned to the area from which the memory was originally allocated.

To avoid doing a full modulo in the main path, the number of swiotlb
areas is always rounded to the next power of two. I believe that's
not really needed anymore on modern CPUs (which have fast enough
dividers), but it is still a good idea on older parts.
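
A simplified sketch of the area selection described above (the function
names and the try_area() stub are made up; the real code uses
raw_smp_processor_id() and swiotlb_do_find_slots()): because num_areas
is a power of two, the per-CPU starting area is picked with a mask
instead of a modulo, and on failure the search wraps through the
remaining areas:

/* hypothetical stand-in for the per-area allocator; pretends the area is full */
static int try_area(int area)
{
	(void)area;
	return -1;
}

static int pick_area(int cpu, int num_areas)
{
	/* num_areas is a power of two, so the modulo reduces to a mask */
	return cpu & (num_areas - 1);
}

static int find_slots_any_area(int cpu, int num_areas)
{
	int start = pick_area(cpu, num_areas);
	int i = start;

	do {
		int index = try_area(i);

		if (index >= 0)
			return index;
		if (++i >= num_areas)
			i = 0;
	} while (i != start);

	return -1;	/* every area is full */
}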

The number of areas can be set using the swiotlb option. But to avoid
every user having to set this option, the default is the number of
available CPUs. Unfortunately, on x86 swiotlb is initialized before
num_possible_cpus() is available, which is why it uses a custom hook
called from the early ACPI code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[ rebase and fix warnings of checkpatch.pl ]
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 .../admin-guide/kernel-parameters.txt         |   4 +-
 arch/x86/kernel/acpi/boot.c                   |   4 +
 include/linux/swiotlb.h                       |  30 +++-
 kernel/dma/swiotlb.c                          | 168 ++++++++++++++++--
 4 files changed, 180 insertions(+), 26 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..5d46271982d5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5869,8 +5869,10 @@
 			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
 
 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
-			Format: { <int> | force | noforce }
+			Format: { <int> [,<int>] | force | noforce }
 			<int> -- Number of I/O TLB slabs
+			<int> -- Second integer after comma. Number of swiotlb
+				 areas with their own lock. Must be a power of 2.
 			force -- force using of bounce buffers even if they
 			         wouldn't be automatically used by the kernel
 			noforce -- Never use bounce buffers (for debugging)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 907cc98b1938..f05e55b2d50c 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include <linux/efi-bgrt.h>
 #include <linux/serial_core.h>
 #include <linux/pgtable.h>
+#include <linux/swiotlb.h>
 
 #include <asm/e820/api.h>
 #include <asm/irqdomain.h>
@@ -1131,6 +1132,9 @@ static int __init acpi_parse_madt_lapic_entries(void)
 		return count;
 	}
 
+	/* This does not take overrides into consideration */
+	swiotlb_hint_cpus(max(count, x2count));
+
 	x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
 					acpi_parse_x2apic_nmi, 0);
 	count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 012a6fde873b..cf36c5cc7584 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -69,7 +69,25 @@ struct io_tlb_slot {
 };
 
 /**
- * struct io_tlb_mem - IO TLB Memory Pool Descriptor
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:	The number of used IO TLB slots in this area.
+ * @free_slots:	The list of free slots in this area, from which
+ *		allocations are served.
+ * @lock:	The lock to protect the above data structures in the map and
+ *		unmap calls.
+ */
+
+struct io_tlb_area {
+	unsigned long used;
+	struct list_head free_slots;
+	spinlock_t lock;
+};
+
+/**
+ * struct io_tlb_mem - IO TLB Memory Pool Descriptor
  *
  * @start:	The start address of the swiotlb memory pool. Used to do a quick
  *		range check to see if the memory was in fact allocated by this
@@ -87,8 +105,6 @@ struct io_tlb_slot {
  * @index:	The index to start searching in the next round.
  * @orig_addr:	The original address corresponding to a mapped entry.
  * @alloc_size:	Size of the allocated buffer.
- * @lock:	The lock to protect the above data structures in the map and
- *		unmap calls.
  * @debugfs:	The dentry to debugfs.
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
@@ -101,13 +117,11 @@ struct io_tlb_mem {
 	phys_addr_t end;
 	void *vaddr;
 	unsigned long nslabs;
-	unsigned long used;
-	struct list_head free_slots;
-	spinlock_t lock;
 	struct dentry *debugfs;
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	struct io_tlb_area *areas;
 	struct io_tlb_slot *slots;
 	unsigned long *bitmap;
 };
@@ -133,6 +147,7 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+void swiotlb_hint_cpus(int cpus);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -165,6 +180,9 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+static inline void swiotlb_hint_cpus(int cpus)
+{
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 2283f3a36f0d..a25e53e84b80 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
 
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
+#define NUM_AREAS_DEFAULT 1
+
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
@@ -71,14 +73,61 @@ phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
 
+static __read_mostly unsigned int area_index_shift;
+static __read_mostly unsigned int area_bits;
+static __read_mostly int num_areas = NUM_AREAS_DEFAULT;
+
+static __init int setup_areas(int n)
+{
+	unsigned long nslabs;
+
+	if (n < 1 || !is_power_of_2(n) || n > default_nslabs) {
+		pr_err("swiotlb: Invalid areas parameter %d\n", n);
+		return -EINVAL;
+	}
+
+	/*
+	 * Round up number of slabs to the next power of 2.
+	 * The last area is going be smaller than the rest if default_nslabs is
+	 * The last area is going to be smaller than the rest if
+	 * default_nslabs is not a power of two.
+	nslabs = roundup_pow_of_two(default_nslabs);
+
+	pr_info("swiotlb: Using %d areas\n", n);
+	num_areas = n;
+	area_index_shift = __fls(nslabs) - __fls(num_areas);
+	area_bits = __fls(n);
+	return 0;
+}
+
+/*
+ * Can be called from architecture specific code when swiotlb is set up before
+ * possible cpus are setup.
+ */
+
+void __init swiotlb_hint_cpus(int cpus)
+{
+	if (num_areas == NUM_AREAS_DEFAULT && cpus > 1) {
+		if (!is_power_of_2(cpus))
+			cpus = 1U << (__fls(cpus) + 1);
+		setup_areas(cpus);
+	}
+}
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
+	int ret = 0;
+
 	if (isdigit(*str)) {
 		/* avoid tail segment of size < IO_TLB_SEGSIZE */
 		default_nslabs =
 			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
 	}
+	if (*str == ',')
+		++str;
+	if (isdigit(*str))
+		ret = setup_areas(simple_strtoul(str, &str, 0));
 	if (*str == ',')
 		++str;
 	if (!strcmp(str, "force"))
@@ -86,7 +135,7 @@ setup_io_tlb_npages(char *str)
 	else if (!strcmp(str, "noforce"))
 		swiotlb_force_disable = true;
 
-	return 0;
+	return ret;
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
@@ -114,6 +163,7 @@ void __init swiotlb_adjust_size(unsigned long size)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
 	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	setup_areas(num_areas);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -200,17 +250,21 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->nslabs = nslabs;
 	mem->start = start;
 	mem->end = mem->start + bytes;
-	INIT_LIST_HEAD(&mem->free_slots);
 	mem->late_alloc = late_alloc;
 
 	mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
 
-	spin_lock_init(&mem->lock);
+	for (i = 0; i < num_areas; i++) {
+		INIT_LIST_HEAD(&mem->areas[i].free_slots);
+		spin_lock_init(&mem->areas[i].lock);
+	}
 	for (i = 0; i < mem->nslabs; i++) {
+		int aindex = area_index_shift ? i >> area_index_shift : 0;
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add_tail(&mem->slots[i].node, &mem->free_slots);
+		list_add_tail(&mem->slots[i].node,
+			      &mem->areas[aindex].free_slots);
 	}
 
 	/*
@@ -274,6 +328,10 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 	if (!mem->slots)
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
+	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * num_areas,
+				    SMP_CACHE_BYTES);
+	if (!mem->areas)
+		panic("Cannot allocate io_tlb_areas");
 
 	mem->bitmap = memblock_alloc(BITS_TO_BYTES(nslabs), SMP_CACHE_BYTES);
 	if (!mem->bitmap)
@@ -353,6 +411,12 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 		return -ENOMEM;
 	}
 
+	mem->areas = kcalloc(num_areas, sizeof(struct io_tlb_area), GFP_KERNEL);
+	if (!mem->areas) {
+		free_pages((unsigned long)mem->slots, order);
+		return -ENOMEM;
+	}
+
 	set_memory_decrypted((unsigned long)vstart,
 			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, 0, true);
@@ -380,9 +444,12 @@ void __init swiotlb_exit(void)
 
 	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
 	if (mem->late_alloc) {
+		kfree(mem->areas);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
+		memblock_free_late(__pa(mem->areas),
+				   num_areas * sizeof(struct io_tlb_area));
 		memblock_free_late(mem->start, tbl_size);
 		memblock_free_late(__pa(mem->slots), slots_size);
 	}
@@ -485,14 +552,32 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }
 
+static inline unsigned long area_nslabs(struct io_tlb_mem *mem)
+{
+	return area_index_shift ? (1UL << area_index_shift) + 1 :
+			mem->nslabs;
+}
+
+static inline unsigned int area_start(int aindex)
+{
+	return aindex << area_index_shift;
+}
+
+static inline unsigned int area_end(struct io_tlb_mem *mem, int aindex)
+{
+	return area_start(aindex) + area_nslabs(mem);
+}
+
 /*
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct io_tlb_area *area,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	struct io_tlb_slot *slot, *tmp;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
@@ -511,11 +596,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	/* slots shouldn't cross one segment */
 	max_slots = min_t(unsigned long, max_slots, IO_TLB_SEGSIZE);
 
-	spin_lock_irqsave(&mem->lock, flags);
-	if (unlikely(nslots > mem->nslabs - mem->used))
+	spin_lock_irqsave(&area->lock, flags);
+	if (unlikely(nslots > area_nslabs(mem) - area->used))
 		goto not_found;
 
-	list_for_each_entry_safe(slot, tmp, &mem->free_slots, node) {
+	list_for_each_entry_safe(slot, tmp, &area->free_slots, node) {
 		index = slot - mem->slots;
 		slot_dma_addr = slot_addr(tbl_dma_addr, index);
 		if (orig_addr &&
@@ -550,7 +635,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	}
 
 not_found:
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return -1;
 
 found:
@@ -561,12 +646,43 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		list_del(&mem->slots[i].node);
 	}
 
-	mem->used += nslots;
+	area->used += nslots;
 
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return index;
 }
 
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int start = raw_smp_processor_id() & ((1U << area_bits) - 1);
+	int i, index;
+
+	i = start;
+	do {
+		index = swiotlb_do_find_slots(mem, mem->areas + i,
+					      dev, orig_addr, alloc_size,
+					      alloc_align_mask);
+		if (index >= 0)
+			return index;
+		if (++i >= num_areas)
+			i = 0;
+	} while (i != start);
+	return -1;
+}
+
+/* Somewhat racy estimate */
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	for (i = 0; i < num_areas; i++)
+		used += mem->areas[i].used;
+	return used;
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -596,7 +712,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				 alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -626,9 +742,14 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
+	int aindex = area_index_shift ? index >> area_index_shift : 0;
+	struct io_tlb_area *area = mem->areas + aindex;
 	int i;
 
-	spin_lock_irqsave(&mem->lock, flags);
+	WARN_ON(aindex >= num_areas);
+
+	spin_lock_irqsave(&area->lock, flags);
+
 	/*
 	 * Return the slots to swiotlb, updating bitmap to indicate
 	 * corresponding entries are free.
@@ -637,11 +758,11 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add(&mem->slots[i].node, &mem->free_slots);
+		list_add(&mem->slots[i].node, &area->free_slots);
 	}
 
-	mem->used -= nslots;
-	spin_unlock_irqrestore(&mem->lock, flags);
+	area->used -= nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
 }
 
 /*
@@ -736,6 +857,15 @@ bool is_swiotlb_active(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
+static int used_get(void *data, u64 *val)
+{
+	struct io_tlb_mem *mem = (struct io_tlb_mem *)data;
+	*val = mem_used(mem);
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(used_fops, used_get, NULL, "%llu\n");
+
 static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
 					 const char *dirname)
 {
@@ -744,7 +874,7 @@ static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
 		return;
 
 	debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
-	debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
+	debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem, &used_fops);
 }
 
 static int __init __maybe_unused swiotlb_create_default_debugfs(void)
-- 
2.25.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 3/3] swiotlb: Split up single swiotlb lock
@ 2022-06-28  7:01   ` Chao Gao
  0 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-28  7:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, ilpo.jarvinen, Andi Kleen, Chao Gao,
	Paul E. McKenney, Andrew Morton, Borislav Petkov, Muchun Song,
	Kees Cook, Randy Dunlap, Damien Le Moal, linux-doc, linux-pm,
	iommu

From: Andi Kleen <ak@linux.intel.com>

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
signifiant lock contention on the swiotlb lock. We've seen 20+% CPU
time in locks in some extreme cases.

This patch splits the swiotlb into individual areas which have their
own lock. Each CPU tries to allocate in its own area first. Only if
that fails does it search other areas. On freeing the allocation is
freed into the area where the memory was originally allocated from.

To avoid doing a full modulo in the main path the number of swiotlb
areas is always rounded to the next power of two. I believe that's
not really needed anymore on modern CPUs (which have fast enough
dividers), but still a good idea on older parts.

The number of areas can be set using the swiotlb option. But to avoid
every user having to set this option set the default to the number of
available CPUs. Unfortunately on x86 swiotlb is initialized before
num_possible_cpus() is available, that is why it uses a custom hook
called from the early ACPI code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[ rebase and fix warnings of checkpatch.pl ]
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 .../admin-guide/kernel-parameters.txt         |   4 +-
 arch/x86/kernel/acpi/boot.c                   |   4 +
 include/linux/swiotlb.h                       |  30 +++-
 kernel/dma/swiotlb.c                          | 168 ++++++++++++++++--
 4 files changed, 180 insertions(+), 26 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..5d46271982d5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5869,8 +5869,10 @@
 			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
 
 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
-			Format: { <int> | force | noforce }
+			Format: { <int> [,<int>] | force | noforce }
 			<int> -- Number of I/O TLB slabs
+			<int> -- Second integer after comma. Number of swiotlb
+				 areas, each with its own lock. Must be a power of 2.
 			force -- force using of bounce buffers even if they
 			         wouldn't be automatically used by the kernel
 			noforce -- Never use bounce buffers (for debugging)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 907cc98b1938..f05e55b2d50c 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include <linux/efi-bgrt.h>
 #include <linux/serial_core.h>
 #include <linux/pgtable.h>
+#include <linux/swiotlb.h>
 
 #include <asm/e820/api.h>
 #include <asm/irqdomain.h>
@@ -1131,6 +1132,9 @@ static int __init acpi_parse_madt_lapic_entries(void)
 		return count;
 	}
 
+	/* This does not take overrides into consideration */
+	swiotlb_hint_cpus(max(count, x2count));
+
 	x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
 					acpi_parse_x2apic_nmi, 0);
 	count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 012a6fde873b..cf36c5cc7584 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -69,7 +69,25 @@ struct io_tlb_slot {
 };
 
 /**
- * struct io_tlb_mem - IO TLB Memory Pool Descriptor
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:	The number of used IO TLB slots in this area.
+ * @free_slots:	The list of slots currently free for allocation in this
+ *		area.
+ * @lock:	The lock to protect the above data structures in the map and
+ *		unmap calls.
+ */
+
+struct io_tlb_area {
+	unsigned long used;
+	struct list_head free_slots;
+	spinlock_t lock;
+};
+
+/**
+ * struct io_tlb_mem - IO TLB memory pool descriptor
  *
  * @start:	The start address of the swiotlb memory pool. Used to do a quick
  *		range check to see if the memory was in fact allocated by this
@@ -87,8 +105,6 @@ struct io_tlb_slot {
  * @index:	The index to start searching in the next round.
  * @orig_addr:	The original address corresponding to a mapped entry.
  * @alloc_size:	Size of the allocated buffer.
- * @lock:	The lock to protect the above data structures in the map and
- *		unmap calls.
  * @debugfs:	The dentry to debugfs.
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
@@ -101,13 +117,11 @@ struct io_tlb_mem {
 	phys_addr_t end;
 	void *vaddr;
 	unsigned long nslabs;
-	unsigned long used;
-	struct list_head free_slots;
-	spinlock_t lock;
 	struct dentry *debugfs;
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	struct io_tlb_area *areas;
 	struct io_tlb_slot *slots;
 	unsigned long *bitmap;
 };
@@ -133,6 +147,7 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+void swiotlb_hint_cpus(int cpus);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -165,6 +180,9 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+static inline void swiotlb_hint_cpus(int cpus)
+{
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 2283f3a36f0d..a25e53e84b80 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
 
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
+#define NUM_AREAS_DEFAULT 1
+
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
@@ -71,14 +73,61 @@ phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
 
+static __read_mostly unsigned int area_index_shift;
+static __read_mostly unsigned int area_bits;
+static __read_mostly int num_areas = NUM_AREAS_DEFAULT;
+
+static __init int setup_areas(int n)
+{
+	unsigned long nslabs;
+
+	if (n < 1 || !is_power_of_2(n) || n > default_nslabs) {
+		pr_err("swiotlb: Invalid areas parameter %d\n", n);
+		return -EINVAL;
+	}
+
+	/*
+	 * Round up number of slabs to the next power of 2.
+	 * The last area is going to be smaller than the rest if default_nslabs
+	 * is not a power of two.
+	 */
+	nslabs = roundup_pow_of_two(default_nslabs);
+
+	pr_info("swiotlb: Using %d areas\n", n);
+	num_areas = n;
+	area_index_shift = __fls(nslabs) - __fls(num_areas);
+	area_bits = __fls(n);
+	return 0;
+}
+
+/*
+ * Can be called from architecture-specific code when swiotlb is set up
+ * before the possible CPUs are known.
+ */
+
+void __init swiotlb_hint_cpus(int cpus)
+{
+	if (num_areas == NUM_AREAS_DEFAULT && cpus > 1) {
+		if (!is_power_of_2(cpus))
+			cpus = 1U << (__fls(cpus) + 1);
+		setup_areas(cpus);
+	}
+}
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
+	int ret = 0;
+
 	if (isdigit(*str)) {
 		/* avoid tail segment of size < IO_TLB_SEGSIZE */
 		default_nslabs =
 			ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
 	}
+	if (*str == ',')
+		++str;
+	if (isdigit(*str))
+		ret = setup_areas(simple_strtoul(str, &str, 0));
 	if (*str == ',')
 		++str;
 	if (!strcmp(str, "force"))
@@ -86,7 +135,7 @@ setup_io_tlb_npages(char *str)
 	else if (!strcmp(str, "noforce"))
 		swiotlb_force_disable = true;
 
-	return 0;
+	return ret;
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
@@ -114,6 +163,7 @@ void __init swiotlb_adjust_size(unsigned long size)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
 	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	setup_areas(num_areas);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -200,17 +250,21 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->nslabs = nslabs;
 	mem->start = start;
 	mem->end = mem->start + bytes;
-	INIT_LIST_HEAD(&mem->free_slots);
 	mem->late_alloc = late_alloc;
 
 	mem->force_bounce = swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
 
-	spin_lock_init(&mem->lock);
+	for (i = 0; i < num_areas; i++) {
+		INIT_LIST_HEAD(&mem->areas[i].free_slots);
+		spin_lock_init(&mem->areas[i].lock);
+	}
 	for (i = 0; i < mem->nslabs; i++) {
+		int aindex = area_index_shift ? i >> area_index_shift : 0;
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add_tail(&mem->slots[i].node, &mem->free_slots);
+		list_add_tail(&mem->slots[i].node,
+			      &mem->areas[aindex].free_slots);
 	}
 
 	/*
@@ -274,6 +328,10 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 	if (!mem->slots)
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
+	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * num_areas,
+				    SMP_CACHE_BYTES);
+	if (!mem->areas)
+		panic("Cannot allocate io_tlb_areas");
 
 	mem->bitmap = memblock_alloc(BITS_TO_BYTES(nslabs), SMP_CACHE_BYTES);
 	if (!mem->bitmap)
@@ -353,6 +411,12 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 		return -ENOMEM;
 	}
 
+	mem->areas = kcalloc(num_areas, sizeof(struct io_tlb_area), GFP_KERNEL);
+	if (!mem->areas) {
+		free_pages((unsigned long)mem->slots, order);
+		return -ENOMEM;
+	}
+
 	set_memory_decrypted((unsigned long)vstart,
 			     (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, 0, true);
@@ -380,9 +444,12 @@ void __init swiotlb_exit(void)
 
 	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
 	if (mem->late_alloc) {
+		kfree(mem->areas);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
+		memblock_free_late(__pa(mem->areas),
+				   num_areas * sizeof(struct io_tlb_area));
 		memblock_free_late(mem->start, tbl_size);
 		memblock_free_late(__pa(mem->slots), slots_size);
 	}
@@ -485,14 +552,32 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }
 
+static inline unsigned long area_nslabs(struct io_tlb_mem *mem)
+{
+	return area_index_shift ? (1UL << area_index_shift) + 1 :
+			mem->nslabs;
+}
+
+static inline unsigned int area_start(int aindex)
+{
+	return aindex << area_index_shift;
+}
+
+static inline unsigned int area_end(struct io_tlb_mem *mem, int aindex)
+{
+	return area_start(aindex) + area_nslabs(mem);
+}
+
 /*
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct io_tlb_area *area,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	struct io_tlb_slot *slot, *tmp;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
@@ -511,11 +596,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	/* slots shouldn't cross one segment */
 	max_slots = min_t(unsigned long, max_slots, IO_TLB_SEGSIZE);
 
-	spin_lock_irqsave(&mem->lock, flags);
-	if (unlikely(nslots > mem->nslabs - mem->used))
+	spin_lock_irqsave(&area->lock, flags);
+	if (unlikely(nslots > area_nslabs(mem) - area->used))
 		goto not_found;
 
-	list_for_each_entry_safe(slot, tmp, &mem->free_slots, node) {
+	list_for_each_entry_safe(slot, tmp, &area->free_slots, node) {
 		index = slot - mem->slots;
 		slot_dma_addr = slot_addr(tbl_dma_addr, index);
 		if (orig_addr &&
@@ -550,7 +635,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	}
 
 not_found:
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return -1;
 
 found:
@@ -561,12 +646,43 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		list_del(&mem->slots[i].node);
 	}
 
-	mem->used += nslots;
+	area->used += nslots;
 
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return index;
 }
 
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int start = raw_smp_processor_id() & ((1U << area_bits) - 1);
+	int i, index;
+
+	i = start;
+	do {
+		index = swiotlb_do_find_slots(mem, mem->areas + i,
+					      dev, orig_addr, alloc_size,
+					      alloc_align_mask);
+		if (index >= 0)
+			return index;
+		if (++i >= num_areas)
+			i = 0;
+	} while (i != start);
+	return -1;
+}
+
+/* Somewhat racy estimate */
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	for (i = 0; i < num_areas; i++)
+		used += mem->areas[i].used;
+	return used;
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -596,7 +712,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				 alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -626,9 +742,14 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
+	int aindex = area_index_shift ? index >> area_index_shift : 0;
+	struct io_tlb_area *area = mem->areas + aindex;
 	int i;
 
-	spin_lock_irqsave(&mem->lock, flags);
+	WARN_ON(aindex >= num_areas);
+
+	spin_lock_irqsave(&area->lock, flags);
+
 	/*
 	 * Return the slots to swiotlb, updating bitmap to indicate
 	 * corresponding entries are free.
@@ -637,11 +758,11 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 		__set_bit(i, mem->bitmap);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
-		list_add(&mem->slots[i].node, &mem->free_slots);
+		list_add(&mem->slots[i].node, &area->free_slots);
 	}
 
-	mem->used -= nslots;
-	spin_unlock_irqrestore(&mem->lock, flags);
+	area->used -= nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
 }
 
 /*
@@ -736,6 +857,15 @@ bool is_swiotlb_active(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
+static int used_get(void *data, u64 *val)
+{
+	struct io_tlb_mem *mem = (struct io_tlb_mem *)data;
+	*val = mem_used(mem);
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(used_fops, used_get, NULL, "%llu\n");
+
 static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
 					 const char *dirname)
 {
@@ -744,7 +874,7 @@ static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
 		return;
 
 	debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
-	debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
+	debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem, &used_fops);
 }
 
 static int __init __maybe_unused swiotlb_create_default_debugfs(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/3] swiotlb: Use bitmap to track free slots
  2022-06-28  7:01   ` Chao Gao
@ 2022-06-28 13:11     ` Ilpo Järvinen
  -1 siblings, 0 replies; 12+ messages in thread
From: Ilpo Järvinen @ 2022-06-28 13:11 UTC (permalink / raw)
  To: Chao Gao
  Cc: LKML, dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, iommu

[-- Attachment #1: Type: text/plain, Size: 5560 bytes --]

On Tue, 28 Jun 2022, Chao Gao wrote:

> Currently, each slot tracks the number of contiguous free slots starting
> from itself. It helps to quickly check if there are enough contiguous
> entries when dealing with an allocation request. But maintaining this
> information can leads to some overhead. Specifically, if a slot is
> allocated/freed, preceding slots may need to be updated as the number
> of contiguous free slots can change. This process may access memory
> scattering over multiple cachelines.
> 
> To reduce the overhead of maintaining the number of contiguous free
> entries, use a global bitmap to track free slots; each bit represents
> if a slot is available. The number of contiguous free slots can be
> calculated by counting the number of consecutive 1s in the bitmap.
> 
> Tests show that the average cost of freeing slots drops by 120 cycles
> while the average cost of allocation increases by 20 cycles. Overall,
> 100 cycles are saved from a pair of allocation and freeing.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

One nit below.
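
As an illustration of the bitmap check used in the hunk quoted further
down (a stand-alone sketch with made-up names such as slot_free[] and
first_used_in(); it is not the kernel's find_next_zero_bit()): a request
for nslots contiguous slots starting at index succeeds only if the first
used slot in [index, index + nslots) lies at or beyond index + nslots.

#include <stdbool.h>
#include <stdio.h>

#define NSLABS 32

/* 1 = slot free, 0 = slot in use, the same convention as the patch. */
static bool slot_free[NSLABS];

/*
 * Return the index of the first used slot in [index, index + nslots),
 * or index + nslots if the whole window is free -- the same answer
 * find_next_zero_bit() gives for the free-slot bitmap.
 */
static int first_used_in(int index, int nslots)
{
	for (int i = index; i < index + nslots; i++)
		if (!slot_free[i])
			return i;
	return index + nslots;
}

int main(void)
{
	for (int i = 0; i < NSLABS; i++)
		slot_free[i] = true;
	slot_free[5] = false;		/* pretend slot 5 is allocated */

	/* A 4-slot request at index 2 overlaps the used slot 5 and fails... */
	printf("slots [2,6) all free? %d\n", first_used_in(2, 4) == 2 + 4);
	/* ...while the same request at index 6 succeeds. */
	printf("slots [6,10) all free? %d\n", first_used_in(6, 4) == 6 + 4);
	return 0;
}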

> ---
>  include/linux/swiotlb.h |  6 ++--
>  kernel/dma/swiotlb.c    | 64 ++++++++++++++++++++---------------------
>  2 files changed, 34 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 7ed35dd3de6e..c3eab237991a 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -78,8 +78,6 @@ extern enum swiotlb_force swiotlb_force;
>   *		@end. For default swiotlb, this is command line adjustable via
>   *		setup_io_tlb_npages.
>   * @used:	The number of used IO TLB block.
> - * @list:	The free list describing the number of free entries available
> - *		from each index.
>   * @index:	The index to start searching in the next round.
>   * @orig_addr:	The original address corresponding to a mapped entry.
>   * @alloc_size:	Size of the allocated buffer.
> @@ -89,6 +87,8 @@ extern enum swiotlb_force swiotlb_force;
>   * @late_alloc:	%true if allocated using the page allocator
>   * @force_bounce: %true if swiotlb bouncing is forced
>   * @for_alloc:  %true if the pool is used for memory allocation
> + * @bitmap:	The bitmap used to track free entries. 1 in bit X means the slot
> + *		indexed by X is free.
>   */
>  struct io_tlb_mem {
>  	phys_addr_t start;
> @@ -105,8 +105,8 @@ struct io_tlb_mem {
>  	struct io_tlb_slot {
>  		phys_addr_t orig_addr;
>  		size_t alloc_size;
> -		unsigned int list;
>  	} *slots;
> +	unsigned long *bitmap;
>  };
>  extern struct io_tlb_mem io_tlb_default_mem;
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index cb50f8d38360..d7f68c0af7f5 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -207,7 +207,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>  
>  	spin_lock_init(&mem->lock);
>  	for (i = 0; i < mem->nslabs; i++) {
> -		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
> +		__set_bit(i, mem->bitmap);
>  		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>  		mem->slots[i].alloc_size = 0;
>  	}
> @@ -274,6 +274,11 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>  		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
>  		      __func__, alloc_size, PAGE_SIZE);
>  
> +	mem->bitmap = memblock_alloc(BITS_TO_BYTES(nslabs), SMP_CACHE_BYTES);
> +	if (!mem->bitmap)
> +		panic("%s: Failed to allocate %lu bytes align=0x%x\n",
> +		      __func__, DIV_ROUND_UP(nslabs, BITS_PER_BYTE), SMP_CACHE_BYTES);
> +
>  	swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, flags, false);
>  
>  	if (flags & SWIOTLB_VERBOSE)
> @@ -337,10 +342,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>  			(PAGE_SIZE << order) >> 20);
>  	}
>  
> +	mem->bitmap = bitmap_zalloc(nslabs, GFP_KERNEL);
>  	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>  		get_order(array_size(sizeof(*mem->slots), nslabs)));
> -	if (!mem->slots) {
> +	if (!mem->slots || !mem->bitmap) {
>  		free_pages((unsigned long)vstart, order);
> +		bitmap_free(mem->bitmap);
> +		kfree(mem->slots);
>  		return -ENOMEM;
>  	}
>  
> @@ -498,7 +506,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  	unsigned int iotlb_align_mask =
>  		dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
>  	unsigned int nslots = nr_slots(alloc_size), stride;
> -	unsigned int index, wrap, count = 0, i;
> +	unsigned int index, wrap, i;
>  	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
>  	unsigned long flags;
>  
> @@ -514,6 +522,9 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
>  	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
>  
> +	/* slots shouldn't cross one segment */
> +	max_slots = min_t(unsigned long, max_slots, IO_TLB_SEGSIZE);
> +
>  	spin_lock_irqsave(&mem->lock, flags);
>  	if (unlikely(nslots > mem->nslabs - mem->used))
>  		goto not_found;
> @@ -535,8 +546,15 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  		if (!iommu_is_span_boundary(index, nslots,
>  					    nr_slots(tbl_dma_addr),
>  					    max_slots)) {
> -			if (mem->slots[index].list >= nslots)
> +			if (find_next_zero_bit(mem->bitmap, index + nslots, index) ==
> +					index + nslots)
>  				goto found;
> +		} else {
> +			/*
> +			 * Remaining slots between current one and the next
> +			 * bounary cannot meet our requirement.

bounary -> boundary


-- 
 i.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/3] swiotlb: Split up single swiotlb lock
  2022-06-28  7:01   ` Chao Gao
@ 2022-06-30  2:42     ` Chao Gao
  -1 siblings, 0 replies; 12+ messages in thread
From: Chao Gao @ 2022-06-30  2:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: dave.hansen, len.brown, tony.luck, rafael.j.wysocki,
	reinette.chatre, dan.j.williams, kirill.shutemov,
	sathyanarayanan.kuppuswamy, ilpo.jarvinen, Andi Kleen,
	Paul E. McKenney, Andrew Morton, Borislav Petkov, Muchun Song,
	Kees Cook, Randy Dunlap, Damien Le Moal, linux-doc, linux-pm,
	iommu

On Tue, Jun 28, 2022 at 03:01:34PM +0800, Chao Gao wrote:
>From: Andi Kleen <ak@linux.intel.com>
>
>Traditionally swiotlb was not performance critical because it was only
>used for slow devices. But in some setups, like TDX confidential
>guests, all IO has to go through swiotlb. Currently swiotlb only has a
>single lock. Under high IO load with multiple CPUs this can lead to
>signifiant lock contention on the swiotlb lock. We've seen 20+% CPU
>time in locks in some extreme cases.
>
>This patch splits the swiotlb into individual areas which have their
>own lock. Each CPU tries to allocate in its own area first. Only if
>that fails does it search other areas. On freeing the allocation is
>freed into the area where the memory was originally allocated from.
>
>To avoid doing a full modulo in the main path the number of swiotlb
>areas is always rounded to the next power of two. I believe that's
>not really needed anymore on modern CPUs (which have fast enough
>dividers), but still a good idea on older parts.
>
>The number of areas can be set using the swiotlb option. But to avoid
>every user having to set this option set the default to the number of
>available CPUs. Unfortunately on x86 swiotlb is initialized before
>num_possible_cpus() is available, that is why it uses a custom hook
>called from the early ACPI code.
>
>Signed-off-by: Andi Kleen <ak@linux.intel.com>
>[ rebase and fix warnings of checkpatch.pl ]
>Signed-off-by: Chao Gao <chao.gao@intel.com>

Just noticed that Tianyu already posted a variant of this patch.
Will drop this one from my series.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-06-30  2:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28  7:01 [PATCH v1 0/3] swiotlb performance optimizations Chao Gao
2022-06-28  7:01 ` Chao Gao
2022-06-28  7:01 ` [PATCH v1 1/3] swiotlb: Use bitmap to track free slots Chao Gao
2022-06-28  7:01   ` Chao Gao
2022-06-28 13:11   ` Ilpo Järvinen
2022-06-28 13:11     ` Ilpo Järvinen
2022-06-28  7:01 ` [PATCH v1 2/3] swiotlb: Allocate memory in a cache-friendly way Chao Gao
2022-06-28  7:01   ` Chao Gao
2022-06-28  7:01 ` [PATCH v1 3/3] swiotlb: Split up single swiotlb lock Chao Gao
2022-06-28  7:01   ` Chao Gao
2022-06-30  2:42   ` Chao Gao
2022-06-30  2:42     ` Chao Gao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.