* [RFC PATCH 0/2] swiotlb: Introduce swiotlb device allocation function
@ 2022-04-28 14:14 ` Tianyu Lan
  0 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 14:14 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patchset splits the swiotlb into individual areas, each with its
own lock. Swiotlb map/allocate requests are spread evenly across the
areas, and each allocation is freed back to the area it came from.

Patch 2 introduces a helper function that allocates a bounce buffer
for a device from the default IO TLB pool, using a new IO TLB block
unit, and sets up IO TLB areas for the device's queues to avoid
spinlock overhead. The number of areas is set by the device driver
according to its queue count.
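
As a rough usage sketch (not part of the patches; the helper name,
queue count and 8MB size below are made up for illustration), a driver
could carve out its own bounce buffer and per-queue areas like this:

#include <linux/device.h>
#include <linux/swiotlb.h>

/* Hypothetical probe-path helper in a driver. */
static int example_setup_bounce(struct device *dev, unsigned int nr_queues)
{
	int ret;

	/*
	 * Request 8MB of dedicated bounce buffer, split into one IO TLB
	 * area per hardware queue. The size is rounded up internally to
	 * the 2MB IO TLB block unit.
	 */
	ret = swiotlb_device_allocate(dev, nr_queues, 8 * 1024 * 1024);
	if (ret)
		/* Not fatal: keep using the shared default pool. */
		dev_warn(dev, "no per-device swiotlb buffer (%d)\n", ret);

	return 0;
}

On teardown the dedicated buffer would be released again with
swiotlb_device_free(dev).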

In a network test between a traditional VM and a Confidential VM, the
throughput improves from ~20Gb/s to ~34Gb/s with this patchset.

Tianyu Lan (2):
  swiotlb: Split up single swiotlb lock
  Swiotlb: Add device bounce buffer allocation interface

 include/linux/swiotlb.h |  58 +++++++
 kernel/dma/swiotlb.c    | 340 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 362 insertions(+), 36 deletions(-)

-- 
2.25.1


* [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:14 ` Tianyu Lan
@ 2022-04-28 14:14   ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 14:14 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov, Andi Kleen

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patch splits the swiotlb into individual areas, each with its own
lock. Swiotlb map/allocate requests are spread evenly across the areas,
and each allocation is freed back to the area it came from. This
prepares for removing the overhead of a single spinlock shared among a
device's queues: each device may later get its own io tlb mem and
bounce buffer pool.

The idea comes from an Andi Kleen patch
(https://github.com/intel/tdx/commit/4529b5784c141782c72ec9bd9a92df2b68cb7d45).
It is reworked here so that it can also be used for an individual
device's io tlb mem. The device driver may determine the number of
areas according to its queue count.
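
For a sense of the resulting layout (the numbers are illustrative, not
taken from the patch; they assume the default 64MB pool and a
hypothetical split into 8 areas):

	IO_TLB_SIZE  = 1 << IO_TLB_SHIFT  = 2KB per slot
	nslabs       = 64MB / 2KB         = 32768 slots
	area_nslabs  = nslabs / num_areas = 32768 / 8 = 4096 slots per area

Each area then has its own spinlock, search index and used counter, and
swiotlb_find_slots() starts from a round-robin chosen area, falling
back to the other areas only when that area cannot satisfy the request.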

Based-on-idea-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 include/linux/swiotlb.h |  25 ++++++
 kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++---------
 2 files changed, 162 insertions(+), 36 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..489c249da434 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,24 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
 
+/**
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:	The number of used IO TLB slots in this area.
+ * @area_index: The index of this IO TLB area.
+ * @index:	The slot index to start searching in this area for next round.
+ * @lock:	The lock to protect the above data structures in the map and
+ *		unmap calls.
+ */
+struct io_tlb_area {
+	unsigned long used;
+	unsigned int area_index;
+	unsigned int index;
+	spinlock_t lock;
+};
+
 /**
  * struct io_tlb_mem - IO TLB Memory Pool Descriptor
  *
@@ -89,6 +107,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @num_areas:  The number of areas in the pool.
+ * @area_start: The area index to start searching in the next round.
+ * @area_nslabs: The number of slots in each area.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -102,6 +123,10 @@ struct io_tlb_mem {
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	unsigned int num_areas;
+	unsigned int area_start;
+	unsigned int area_nslabs;
+	struct io_tlb_area *areas;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..00a16f540f20 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -62,6 +62,8 @@
 
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
+#define NUM_AREAS_DEFAULT 1
+
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
@@ -70,6 +72,25 @@ struct io_tlb_mem io_tlb_default_mem;
 phys_addr_t swiotlb_unencrypted_base;
 
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long default_area_num = NUM_AREAS_DEFAULT;
+
+static int swiotlb_setup_areas(struct io_tlb_mem *mem,
+		unsigned int num_areas, unsigned long nslabs)
+{
+	if (nslabs < 1 || !is_power_of_2(num_areas)) {
+		pr_err("swiotlb: Invalid areas parameter %d.\n", num_areas);
+		return -EINVAL;
+	}
+
+	/* Round up the number of slabs to the next power of 2.
+	 * The last area is going to be smaller than the rest if default_nslabs
+	 * is not a power of two.
+	 */
+	mem->area_start = 0;
+	mem->num_areas = num_areas;
+	mem->area_nslabs = nslabs / num_areas;
+	return 0;
+}
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -114,6 +135,8 @@ void __init swiotlb_adjust_size(unsigned long size)
 		return;
 	size = ALIGN(size, IO_TLB_SIZE);
 	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+			    default_nslabs);
 	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 }
 
@@ -195,7 +218,8 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 				    unsigned long nslabs, bool late_alloc)
 {
 	void *vaddr = phys_to_virt(start);
-	unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+	unsigned int block_list;
 
 	mem->nslabs = nslabs;
 	mem->start = start;
@@ -206,8 +230,13 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	if (swiotlb_force_bounce)
 		mem->force_bounce = true;
 
-	spin_lock_init(&mem->lock);
-	for (i = 0; i < mem->nslabs; i++) {
+	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
+		if (!(i % mem->area_nslabs)) {
+			mem->areas[j].index = 0;
+			spin_lock_init(&mem->areas[j].lock);
+			j++;
+		}
+
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
@@ -272,6 +301,13 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+		    default_nslabs);
+	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * mem->num_areas,
+			    SMP_CACHE_BYTES);
+	if (!mem->areas)
+		panic("%s: Failed to allocate mem->areas.\n", __func__);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
 	mem->force_bounce = flags & SWIOTLB_FORCE;
 
@@ -296,7 +332,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
-	unsigned int order;
+	unsigned int order, area_order;
 	int rc = 0;
 
 	if (swiotlb_force_disable)
@@ -334,18 +370,32 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 		goto retry;
 	}
 
+	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
+			    nslabs);
+
+	area_order = get_order(array_size(sizeof(*mem->areas),
+		default_area_num));
+	mem->areas = (struct io_tlb_area *)
+		__get_free_pages(GFP_KERNEL | __GFP_ZERO, area_order);
+	if (!mem->areas)
+		goto error_area;
+
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
-	if (!mem->slots) {
-		free_pages((unsigned long)vstart, order);
-		return -ENOMEM;
-	}
+	if (!mem->slots)
+		goto error_slots;
 
 	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
 
 	swiotlb_print_info();
 	return 0;
+
+error_slots:
+	free_pages((unsigned long)mem->areas, area_order);
+error_area:
+	free_pages((unsigned long)vstart, order);
+	return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
@@ -353,6 +403,7 @@ void __init swiotlb_exit(void)
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long tbl_vaddr;
 	size_t tbl_size, slots_size;
+	unsigned int area_order;
 
 	if (swiotlb_force_bounce)
 		return;
@@ -367,9 +418,14 @@ void __init swiotlb_exit(void)
 
 	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
 	if (mem->late_alloc) {
+		area_order = get_order(array_size(sizeof(*mem->areas),
+			mem->num_areas));
+		free_pages((unsigned long)mem->areas, area_order);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
+		memblock_free_late(__pa(mem->areas),
+				   mem->num_areas * sizeof(struct io_tlb_area));
 		memblock_free_late(mem->start, tbl_size);
 		memblock_free_late(__pa(mem->slots), slots_size);
 	}
@@ -472,9 +528,9 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
 	return nr_slots(boundary_mask + 1);
 }
 
-static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
+static unsigned int wrap_area_index(struct io_tlb_mem *mem, unsigned int index)
 {
-	if (index >= mem->nslabs)
+	if (index >= mem->area_nslabs)
 		return 0;
 	return index;
 }
@@ -483,10 +539,13 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct io_tlb_area *area,
+				 int area_index,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -497,8 +556,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	unsigned int index, wrap, count = 0, i;
 	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
 	unsigned long flags;
+	unsigned int slot_base;
+	unsigned int slot_index;
 
 	BUG_ON(!nslots);
+	BUG_ON(area_index >= mem->num_areas);
 
 	/*
 	 * For mappings with an alignment requirement don't bother looping to
@@ -510,16 +572,20 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
 	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
 
-	spin_lock_irqsave(&mem->lock, flags);
-	if (unlikely(nslots > mem->nslabs - mem->used))
+	spin_lock_irqsave(&area->lock, flags);
+	if (unlikely(nslots > mem->area_nslabs - area->used))
 		goto not_found;
 
-	index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
+	slot_base = area_index * mem->area_nslabs;
+	index = wrap = wrap_area_index(mem, ALIGN(area->index, stride));
+
 	do {
+		slot_index = slot_base + index;
+
 		if (orig_addr &&
-		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
-			    (orig_addr & iotlb_align_mask)) {
-			index = wrap_index(mem, index + 1);
+		    (slot_addr(tbl_dma_addr, slot_index) &
+		     iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
+			index = wrap_area_index(mem, index + 1);
 			continue;
 		}
 
@@ -528,26 +594,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		 * contiguous buffers, we allocate the buffers from that slot
 		 * and mark the entries as '0' indicating unavailable.
 		 */
-		if (!iommu_is_span_boundary(index, nslots,
+		if (!iommu_is_span_boundary(slot_index, nslots,
 					    nr_slots(tbl_dma_addr),
 					    max_slots)) {
-			if (mem->slots[index].list >= nslots)
+			if (mem->slots[slot_index].list >= nslots)
 				goto found;
 		}
-		index = wrap_index(mem, index + stride);
+		index = wrap_area_index(mem, index + stride);
 	} while (index != wrap);
 
 not_found:
-	spin_unlock_irqrestore(&mem->lock, flags);
+	spin_unlock_irqrestore(&area->lock, flags);
 	return -1;
 
 found:
-	for (i = index; i < index + nslots; i++) {
+	for (i = slot_index; i < slot_index + nslots; i++) {
 		mem->slots[i].list = 0;
 		mem->slots[i].alloc_size =
-			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
+			alloc_size - (offset + ((i - slot_index) << IO_TLB_SHIFT));
 	}
-	for (i = index - 1;
+	for (i = slot_index - 1;
 	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 	     mem->slots[i].list; i--)
 		mem->slots[i].list = ++count;
@@ -555,14 +621,45 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	/*
 	 * Update the indices to avoid searching in the next round.
 	 */
-	if (index + nslots < mem->nslabs)
-		mem->index = index + nslots;
+	if (index + nslots < mem->area_nslabs)
+		area->index = index + nslots;
 	else
-		mem->index = 0;
-	mem->used += nslots;
+		area->index = 0;
+	area->used += nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
+	return slot_index;
+}
 
-	spin_unlock_irqrestore(&mem->lock, flags);
-	return index;
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int start, i, index;
+
+	i = start = mem->area_start;
+	mem->area_start = (mem->area_start + 1) % mem->num_areas;
+
+	do {
+		index = swiotlb_do_find_slots(mem, mem->areas + i, i,
+					      dev, orig_addr, alloc_size,
+					      alloc_align_mask);
+		if (index >= 0)
+			return index;
+		if (++i >= mem->num_areas)
+			i = 0;
+	} while (i != start);
+
+	return -1;
+}
+
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	for (i = 0; i < mem->num_areas; i++)
+		used += mem->areas[i].used;
+	return used;
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
@@ -594,7 +691,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				 alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -624,6 +721,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
 	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
+	int aindex = index / mem->area_nslabs;
+	struct io_tlb_area *area = &mem->areas[aindex];
 	int count, i;
 
 	/*
@@ -632,7 +731,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	 * While returning the entries to the free list, we merge the entries
 	 * with slots below and above the pool being returned.
 	 */
-	spin_lock_irqsave(&mem->lock, flags);
+	BUG_ON(aindex >= mem->num_areas);
+
+	spin_lock_irqsave(&area->lock, flags);
 	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
 		count = mem->slots[index + nslots].list;
 	else
@@ -656,8 +757,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list;
 	     i--)
 		mem->slots[i].list = ++count;
-	mem->used -= nslots;
-	spin_unlock_irqrestore(&mem->lock, flags);
+	area->used -= nslots;
+	spin_unlock_irqrestore(&area->lock, flags);
 }
 
 /*
-- 
2.25.1


* [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface
  2022-04-28 14:14 ` Tianyu Lan
@ 2022-04-28 14:14   ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 14:14 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

In SEV/TDX Confidential VMs, device DMA transactions need to use the
swiotlb bounce buffer to share data with the host/hypervisor. The
swiotlb spinlock introduces overhead among devices if they share an io
tlb mem. To avoid this, introduce swiotlb_device_allocate() to allocate
a device bounce buffer from the default io tlb pool and to set up areas
according to the input queue number. A device may have multiple IO
queues, and setting up the same number of io tlb areas may help resolve
the spinlock overhead among those queues.

Introduce an IO TLB block unit (2MB) for allocating big bounce buffers
from the default pool for devices; the IO TLB segment (256KB) is too
small for this purpose.
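
For reference, these sizes follow from the definitions added below
together with the existing constants (IO_TLB_SHIFT = 11, i.e. 2KB
slots, and IO_TLB_SEGSIZE = 128 slots):

	IO_TLB_SEGSIZE    = 128 slots * 2KB                  = 256KB
	IO_TLB_BLOCKSIZE  = 8 * IO_TLB_SEGSIZE               = 1024 slots
	IO_TLB_BLOCK_UNIT = IO_TLB_BLOCKSIZE << IO_TLB_SHIFT = 2MB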

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 include/linux/swiotlb.h |  33 ++++++++
 kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 203 insertions(+), 3 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 489c249da434..380bd1ce3d0f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -31,6 +31,14 @@ struct scatterlist;
 #define IO_TLB_SHIFT 11
 #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
 
+/*
+ * IO TLB BLOCK UNIT as the device bounce buffer allocation unit.
+ * This allows a device to allocate a bounce buffer from the
+ * default io tlb pool.
+ */
+#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
+#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
+
 /* default to 64MB */
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
@@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
  * @index:	The slot index to start searching in this area for next round.
  * @lock:	The lock to protect the above data structures in the map and
  *		unmap calls.
+ * @block_index: The block index to start searching in this area for next round.
  */
 struct io_tlb_area {
 	unsigned long used;
 	unsigned int area_index;
 	unsigned int index;
+	unsigned int block_index;
 	spinlock_t lock;
 };
 
@@ -110,6 +120,7 @@ struct io_tlb_area {
  * @num_areas:  The number of areas in the pool.
  * @area_start: The area index to start searching in the next round.
  * @area_nslabs: The number of slots in each area.
+ * @area_block_number: The number of blocks in each area.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -126,7 +137,14 @@ struct io_tlb_mem {
 	unsigned int num_areas;
 	unsigned int area_start;
 	unsigned int area_nslabs;
+	unsigned int area_block_number;
+	struct io_tlb_mem *parent;
 	struct io_tlb_area *areas;
+	struct io_tlb_block {
+		size_t alloc_size;
+		unsigned long start_slot;
+		unsigned int list;
+	} *block;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
@@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size);
+void swiotlb_device_free(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+static inline void swiotlb_device_free(struct device *dev)
+{
+}
+
+static inline int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size)
+{
+	return -ENOMEM;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 00a16f540f20..7b95a140694a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 				    unsigned long nslabs, bool late_alloc)
 {
 	void *vaddr = phys_to_virt(start);
-	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
 	unsigned int block_list;
 
 	mem->nslabs = nslabs;
@@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->end = mem->start + bytes;
 	mem->index = 0;
 	mem->late_alloc = late_alloc;
+	mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
 
 	if (swiotlb_force_bounce)
 		mem->force_bounce = true;
@@ -233,10 +234,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
 		if (!(i % mem->area_nslabs)) {
 			mem->areas[j].index = 0;
+			mem->areas[j].block_index = 0;
 			spin_lock_init(&mem->areas[j].lock);
+			block_list = mem->area_block_number;
 			j++;
 		}
 
+		if (!(i % IO_TLB_BLOCKSIZE)) {
+			mem->block[k].alloc_size = 0;
+			mem->block[k].list = block_list--;
+			k++;
+		}
+
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
@@ -308,6 +317,12 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 	if (!mem->areas)
 		panic("%s: Failed to allocate mem->areas.\n", __func__);
 
+	mem->block = memblock_alloc(sizeof(struct io_tlb_block) *
+				    (default_nslabs / IO_TLB_BLOCKSIZE),
+				     SMP_CACHE_BYTES);
+	if (!mem->block)
+		panic("%s: Failed to allocate mem->block.\n", __func__);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
 	mem->force_bounce = flags & SWIOTLB_FORCE;
 
@@ -332,7 +347,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
-	unsigned int order, area_order;
+	unsigned int order, area_order, block_order;
 	int rc = 0;
 
 	if (swiotlb_force_disable)
@@ -380,6 +395,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	if (!mem->areas)
 		goto error_area;
 
+	block_order = get_order(array_size(sizeof(*mem->block),
+		nslabs / IO_TLB_BLOCKSIZE));
+	mem->block = (struct io_tlb_block *)
+		__get_free_pages(GFP_KERNEL | __GFP_ZERO, block_order);
+	if (!mem->block)
+		goto error_block;
+
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
 	if (!mem->slots)
@@ -392,6 +414,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	return 0;
 
 error_slots:
+	free_pages((unsigned long)mem->block, block_order);
+error_block:
 	free_pages((unsigned long)mem->areas, area_order);
 error_area:
 	free_pages((unsigned long)vstart, order);
@@ -403,7 +427,7 @@ void __init swiotlb_exit(void)
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long tbl_vaddr;
 	size_t tbl_size, slots_size;
-	unsigned int area_order;
+	unsigned int area_order, block_order;
 
 	if (swiotlb_force_bounce)
 		return;
@@ -421,6 +445,9 @@ void __init swiotlb_exit(void)
 		area_order = get_order(array_size(sizeof(*mem->areas),
 			mem->num_areas));
 		free_pages((unsigned long)mem->areas, area_order);
+		block_order = get_order(array_size(sizeof(*mem->block),
+			mem->nslabs / IO_TLB_BLOCKSIZE));
+		free_pages((unsigned long)mem->block, block_order);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
@@ -863,6 +890,146 @@ static int __init __maybe_unused swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 #endif
 
+static void swiotlb_free_block(struct io_tlb_mem *mem,
+			       phys_addr_t start, unsigned int block_num)
+{
+	unsigned int start_slot = (start - mem->start) >> IO_TLB_SHIFT;
+	unsigned int area_index = start_slot / mem->num_areas;
+	unsigned int block_index = start_slot / IO_TLB_BLOCKSIZE;
+	unsigned int area_block_index = start_slot % mem->area_block_number;
+	struct io_tlb_area *area = &mem->areas[area_index];
+	unsigned long flags;
+	int count, i, num;
+
+	spin_lock_irqsave(&area->lock, flags);
+	if (area_block_index + block_num < mem->area_block_number)
+		count = mem->block[block_index + block_num].list;
+	else
+		count = 0;
+
+
+	for (i = block_index + block_num; i >= block_index; i--) {
+		mem->block[i].list = ++count;
+		/* Todo: recover slot->list and alloc_size here. */
+	}
+
+	for (i = block_index - 1, num = block_index % mem->area_block_number;
+	    i < num && mem->block[i].list; i--)
+		mem->block[i].list = ++count;
+
+	spin_unlock_irqrestore(&area->lock, flags);
+}
+
+void swiotlb_device_free(struct device *dev)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *parent_mem = dev->dma_io_tlb_mem->parent;
+
+	swiotlb_free_block(parent_mem, mem->start, mem->nslabs / IO_TLB_BLOCKSIZE);
+}
+
+static struct page *swiotlb_alloc_block(struct io_tlb_mem *mem, unsigned int block_num)
+{
+	unsigned int area_index, block_index, nslot;
+	phys_addr_t tlb_addr;
+	struct io_tlb_area *area;
+	unsigned long flags;
+	int i, j;
+
+	if (!mem || !mem->block)
+		return NULL;
+
+	area_index = mem->area_start;
+	mem->area_start = (mem->area_start + 1) % mem->num_areas;
+	area = &mem->areas[area_index];
+
+	spin_lock_irqsave(&area->lock, flags);
+	block_index = area_index * mem->area_block_number + area->block_index;
+
+	/* Todo: Search more blocks. */
+	if (mem->block[block_index].list < block_num) {
+		spin_unlock_irqrestore(&area->lock, flags);
+		return NULL;
+	}
+
+	/* Update block and slot list. */
+	for (i = block_index; i < block_index + block_num; i++) {
+		mem->block[i].list = 0;
+		for (j = 0; j < IO_TLB_BLOCKSIZE; j++) {
+			nslot = i * IO_TLB_BLOCKSIZE + j;
+			mem->slots[nslot].list = 0;
+			mem->slots[nslot].alloc_size = IO_TLB_SIZE;
+		}
+	}
+	spin_unlock_irqrestore(&area->lock, flags);
+
+	area->block_index += block_num;
+	area->used += block_num * IO_TLB_BLOCKSIZE;
+	tlb_addr = slot_addr(mem->start, block_index * IO_TLB_BLOCKSIZE);
+	return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+/*
+ * swiotlb_device_allocate - Allocate a bounce buffer for a device from
+ * the default io tlb pool. The allocation size should be aligned to
+ * IO_TLB_BLOCK_UNIT.
+ */
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size)
+{
+	struct io_tlb_mem *mem, *parent_mem = dev->dma_io_tlb_mem;
+	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_BLOCKSIZE);
+	struct page *page;
+	int ret = -ENOMEM;
+
+	page = swiotlb_alloc_block(parent_mem, nslabs / IO_TLB_BLOCKSIZE);
+	if (!page)
+		return -ENOMEM;
+
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	if (!mem)
+		goto error_mem;
+
+	mem->slots = kzalloc(array_size(sizeof(*mem->slots), nslabs),
+			     GFP_KERNEL);
+	if (!mem->slots)
+		goto error_slots;
+
+	swiotlb_setup_areas(mem, area_num, nslabs);
+	mem->areas = (struct io_tlb_area *)kcalloc(area_num,
+				   sizeof(struct io_tlb_area),
+				   GFP_KERNEL);
+	if (!mem->areas)
+		goto error_areas;
+
+	mem->block = (struct io_tlb_block *)kcalloc(nslabs / IO_TLB_BLOCKSIZE,
+				sizeof(struct io_tlb_block),
+				GFP_KERNEL);
+	if (!mem->block)
+		goto error_block;
+
+	swiotlb_init_io_tlb_mem(mem, page_to_phys(page), nslabs, true);
+	mem->force_bounce = true;
+	mem->for_alloc = true;
+
+	mem->vaddr = parent_mem->vaddr + page_to_phys(page) -  parent_mem->start;
+	mem->parent = parent_mem;
+	dev->dma_io_tlb_mem = mem;
+	return 0;
+
+error_block:
+	kfree(mem->areas);
+error_areas:
+	kfree(mem->slots);
+error_slots:
+	kfree(mem);
+error_mem:
+	swiotlb_device_free(dev);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(swiotlb_device_allocate);
+
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 
 struct page *swiotlb_alloc(struct device *dev, size_t size)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface
@ 2022-04-28 14:14   ` Tianyu Lan
  0 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 14:14 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv,
	konrad.wilk, linux-kernel, kirill.shutemov, iommu, andi.kleen,
	brijesh.singh, vkuznets, hch

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

In SEV/TDX Confidential VM, device DMA transaction needs use swiotlb
bounce buffer to share data with host/hypervisor. The swiotlb spinlock
introduces overhead among devices if they share io tlb mem. Avoid such
issue, introduce swiotlb_device_allocate() to allocate device bounce
buffer from default io tlb pool and set up areas according input queue
number. Device may have multi io queues and setting up the same number
of io tlb area may help to resolve spinlock overhead among queues.

Introduce IO TLB Block unit(2MB) concepts to allocate big bounce buffer
from default pool for devices. IO TLB segment(256k) is too small.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 include/linux/swiotlb.h |  33 ++++++++
 kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 203 insertions(+), 3 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 489c249da434..380bd1ce3d0f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -31,6 +31,14 @@ struct scatterlist;
 #define IO_TLB_SHIFT 11
 #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
 
+/*
+ * IO TLB BLOCK UNIT as device bounce buffer allocation unit.
+ * This allows device allocates bounce buffer from default io
+ * tlb pool.
+ */
+#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
+#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
+
 /* default to 64MB */
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
@@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
  * @index:	The slot index to start searching in this area for next round.
  * @lock:	The lock to protect the above data structures in the map and
  *		unmap calls.
+ * @block_index: The block index to start earching in this area for next round.
  */
 struct io_tlb_area {
 	unsigned long used;
 	unsigned int area_index;
 	unsigned int index;
+	unsigned int block_index;
 	spinlock_t lock;
 };
 
@@ -110,6 +120,7 @@ struct io_tlb_area {
  * @num_areas:  The area number in the pool.
  * @area_start: The area index to start searching in the next round.
  * @area_nslabs: The slot number in the area.
+ * @areas_block_number: The block number in the area.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -126,7 +137,14 @@ struct io_tlb_mem {
 	unsigned int num_areas;
 	unsigned int area_start;
 	unsigned int area_nslabs;
+	unsigned int area_block_number;
+	struct io_tlb_mem *parent;
 	struct io_tlb_area *areas;
+	struct io_tlb_block {
+		size_t alloc_size;
+		unsigned long start_slot;
+		unsigned int list;
+	} *block;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
@@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size);
+void swiotlb_device_free(struct device *dev);
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
 static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
+
+void swiotlb_device_free(struct device *dev)
+{
+}
+
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size)
+{
+	return -ENOMEM;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 00a16f540f20..7b95a140694a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 				    unsigned long nslabs, bool late_alloc)
 {
 	void *vaddr = phys_to_virt(start);
-	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
+	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
 	unsigned int block_list;
 
 	mem->nslabs = nslabs;
@@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	mem->end = mem->start + bytes;
 	mem->index = 0;
 	mem->late_alloc = late_alloc;
+	mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
 
 	if (swiotlb_force_bounce)
 		mem->force_bounce = true;
@@ -233,10 +234,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
 		if (!(i % mem->area_nslabs)) {
 			mem->areas[j].index = 0;
+			mem->areas[j].block_index = 0;
 			spin_lock_init(&mem->areas[j].lock);
+			block_list = mem->area_block_number;
 			j++;
 		}
 
+		if (!(i % IO_TLB_BLOCKSIZE)) {
+			mem->block[k].alloc_size = 0;
+			mem->block[k].list = block_list--;
+			k++;
+		}
+
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
@@ -308,6 +317,12 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 	if (!mem->areas)
 		panic("%s: Failed to allocate mem->areas.\n", __func__);
 
+	mem->block = memblock_alloc(sizeof(struct io_tlb_block) *
+				    (default_nslabs / IO_TLB_BLOCKSIZE),
+				     SMP_CACHE_BYTES);
+	if (!mem->block)
+		panic("%s: Failed to allocate mem->block.\n", __func__);
+
 	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
 	mem->force_bounce = flags & SWIOTLB_FORCE;
 
@@ -332,7 +347,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
 	unsigned long bytes;
 	unsigned char *vstart = NULL;
-	unsigned int order, area_order;
+	unsigned int order, area_order, block_order;
 	int rc = 0;
 
 	if (swiotlb_force_disable)
@@ -380,6 +395,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	if (!mem->areas)
 		goto error_area;
 
+	block_order = get_order(array_size(sizeof(*mem->block),
+		nslabs / IO_TLB_BLOCKSIZE));
+	mem->block = (struct io_tlb_block *)
+		__get_free_pages(GFP_KERNEL | __GFP_ZERO, block_order);
+	if (!mem->block)
+		goto error_block;
+
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
 	if (!mem->slots)
@@ -392,6 +414,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	return 0;
 
 error_slots:
+	free_pages((unsigned long)mem->block, block_order);
+error_block:
 	free_pages((unsigned long)mem->areas, area_order);
 error_area:
 	free_pages((unsigned long)vstart, order);
@@ -403,7 +427,7 @@ void __init swiotlb_exit(void)
 	struct io_tlb_mem *mem = &io_tlb_default_mem;
 	unsigned long tbl_vaddr;
 	size_t tbl_size, slots_size;
-	unsigned int area_order;
+	unsigned int area_order, block_order;
 
 	if (swiotlb_force_bounce)
 		return;
@@ -421,6 +445,9 @@ void __init swiotlb_exit(void)
 		area_order = get_order(array_size(sizeof(*mem->areas),
 			mem->num_areas));
 		free_pages((unsigned long)mem->areas, area_order);
+		block_order = get_order(array_size(sizeof(*mem->block),
+			mem->nslabs / IO_TLB_BLOCKSIZE));
+		free_pages((unsigned long)mem->block, block_order);
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
@@ -863,6 +890,146 @@ static int __init __maybe_unused swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 #endif
 
+static void swiotlb_free_block(struct io_tlb_mem *mem,
+			       phys_addr_t start, unsigned int block_num)
+{
+	unsigned int start_slot = (start - mem->start) >> IO_TLB_SHIFT;
+	unsigned int area_index = start_slot / mem->num_areas;
+	unsigned int block_index = start_slot / IO_TLB_BLOCKSIZE;
+	unsigned int area_block_index = start_slot % mem->area_block_number;
+	struct io_tlb_area *area = &mem->areas[area_index];
+	unsigned long flags;
+	int count, i, num;
+
+	spin_lock_irqsave(&area->lock, flags);
+	if (area_block_index + block_num < mem->area_block_number)
+		count = mem->block[block_index + block_num].list;
+	else
+		count = 0;
+
+
+	for (i = block_index + block_num; i >= block_index; i--) {
+		mem->block[i].list = ++count;
+		/* Todo: recover slot->list and alloc_size here. */
+	}
+
+	for (i = block_index - 1, num = block_index % mem->area_block_number;
+	    i < num && mem->block[i].list; i--)
+		mem->block[i].list = ++count;
+
+	spin_unlock_irqrestore(&area->lock, flags);
+}
+
+void swiotlb_device_free(struct device *dev)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *parent_mem = dev->dma_io_tlb_mem->parent;
+
+	swiotlb_free_block(parent_mem, mem->start, mem->nslabs / IO_TLB_BLOCKSIZE);
+}
+
+static struct page *swiotlb_alloc_block(struct io_tlb_mem *mem, unsigned int block_num)
+{
+	unsigned int area_index, block_index, nslot;
+	phys_addr_t tlb_addr;
+	struct io_tlb_area *area;
+	unsigned long flags;
+	int i, j;
+
+	if (!mem || !mem->block)
+		return NULL;
+
+	area_index = mem->area_start;
+	mem->area_start = (mem->area_start + 1) % mem->num_areas;
+	area = &mem->areas[area_index];
+
+	spin_lock_irqsave(&area->lock, flags);
+	block_index = area_index * mem->area_block_number + area->block_index;
+
+	/* Todo: Search more blocks. */
+	if (mem->block[block_index].list < block_num) {
+		spin_unlock_irqrestore(&area->lock, flags);
+		return NULL;
+	}
+
+	/* Update block and slot list. */
+	for (i = block_index; i < block_index + block_num; i++) {
+		mem->block[i].list = 0;
+		for (j = 0; j < IO_TLB_BLOCKSIZE; j++) {
+			nslot = i * IO_TLB_BLOCKSIZE + j;
+			mem->slots[nslot].list = 0;
+			mem->slots[nslot].alloc_size = IO_TLB_SIZE;
+		}
+	}
+	spin_unlock_irqrestore(&area->lock, flags);
+
+	area->block_index += block_num;
+	area->used += block_num * IO_TLB_BLOCKSIZE;
+	tlb_addr = slot_addr(mem->start, block_index * IO_TLB_BLOCKSIZE);
+	return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+/*
+ * swiotlb_device_allocate - Allocate bounce buffer fo device from
+ * default io tlb pool. The allocation size should be aligned with
+ * IO_TLB_BLOCK_UNIT.
+ */
+int swiotlb_device_allocate(struct device *dev,
+			    unsigned int area_num,
+			    unsigned long size)
+{
+	struct io_tlb_mem *mem, *parent_mem = dev->dma_io_tlb_mem;
+	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_BLOCKSIZE);
+	struct page *page;
+	int ret = -ENOMEM;
+
+	page = swiotlb_alloc_block(parent_mem, nslabs / IO_TLB_BLOCKSIZE);
+	if (!page)
+		return -ENOMEM;
+
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	if (!mem)
+		goto error_mem;
+
+	mem->slots = kzalloc(array_size(sizeof(*mem->slots), nslabs),
+			     GFP_KERNEL);
+	if (!mem->slots)
+		goto error_slots;
+
+	swiotlb_setup_areas(mem, area_num, nslabs);
+	mem->areas = (struct io_tlb_area *)kcalloc(area_num,
+				   sizeof(struct io_tlb_area),
+				   GFP_KERNEL);
+	if (!mem->areas)
+		goto error_areas;
+
+	mem->block = (struct io_tlb_block *)kcalloc(nslabs / IO_TLB_BLOCKSIZE,
+				sizeof(struct io_tlb_block),
+				GFP_KERNEL);
+	if (!mem->block)
+		goto error_block;
+
+	swiotlb_init_io_tlb_mem(mem, page_to_phys(page), nslabs, true);
+	mem->force_bounce = true;
+	mem->for_alloc = true;
+
+	mem->vaddr = parent_mem->vaddr + page_to_phys(page) -  parent_mem->start;
+	mem->parent = parent_mem;
+	dev->dma_io_tlb_mem = mem;
+	return 0;
+
+error_block:
+	kfree(mem->areas);
+error_areas:
+	kfree(mem->slots);
+error_slots:
+	kfree(mem);
+error_mem:
+	swiotlb_device_free(dev);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(swiotlb_device_allocate);
+
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 
 struct page *swiotlb_alloc(struct device *dev, size_t size)
-- 
2.25.1
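
For illustration, here is a minimal sketch of how a multi-queue driver might
consume the interface added above. The foo_* helpers, the 4MB-per-queue figure
and the rounding policy are invented for the example and are not part of the
patch:

#include <linux/log2.h>
#include <linux/sizes.h>
#include <linux/swiotlb.h>

/*
 * Hypothetical probe-time helper for a driver with nr_queues I/O queues:
 * one IO TLB area per queue (rounded up to a power of two, which is what
 * swiotlb_setup_areas() accepts), with the pool size aligned to the 2MB
 * IO_TLB_BLOCK_UNIT that swiotlb_device_allocate() expects.
 */
static int foo_setup_bounce_buffer(struct device *dev, unsigned int nr_queues)
{
	unsigned int areas = roundup_pow_of_two(nr_queues);
	/* Example sizing only: 4MB of bounce buffer per queue. */
	unsigned long size = ALIGN(nr_queues * 4UL * SZ_1M, IO_TLB_BLOCK_UNIT);

	return swiotlb_device_allocate(dev, areas, size);
}

/* Hypothetical remove-time counterpart. */
static void foo_free_bounce_buffer(struct device *dev)
{
	swiotlb_device_free(dev);
}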

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:14   ` Tianyu Lan
@ 2022-04-28 14:44     ` Robin Murphy
  -1 siblings, 0 replies; 33+ messages in thread
From: Robin Murphy @ 2022-04-28 14:44 UTC (permalink / raw)
  To: Tianyu Lan, hch, m.szyprowski, michael.h.kelley, kys
  Cc: parri.andrea, thomas.lendacky, wei.liu, Andi Kleen, Tianyu Lan,
	linux-hyperv, konrad.wilk, linux-kernel, kirill.shutemov, iommu,
	andi.kleen, brijesh.singh, vkuznets, hch

On 2022-04-28 15:14, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> Traditionally swiotlb was not performance critical because it was only
> used for slow devices. But in some setups, like TDX/SEV confidential
> guests, all IO has to go through swiotlb. Currently swiotlb only has a
> single lock. Under high IO load with multiple CPUs this can lead to
> significant lock contention on the swiotlb lock.
> 
> This patch splits the swiotlb into individual areas which have their
> own lock. When there is a swiotlb map/allocate request, an io tlb
> buffer is allocated from the areas evenly and the allocation is freed
> back to the associated area. This prepares for resolving the overhead
> of a single spinlock among a device's queues. Each device may have its
> own io tlb mem and bounce buffer pool.
> 
> This idea comes from Andi Kleen's patch (https://github.com/intel/tdx/commit/4529b578
> 4c141782c72ec9bd9a92df2b68cb7d45). It is reworked here so that it can
> also work for an individual device's io tlb mem. The device driver may
> determine the area number according to the device queue number.

Rather than introduce this extra level of allocator complexity, how 
about just dividing up the initial SWIOTLB allocation into multiple 
io_tlb_mem instances?

Robin.

> Based-on-idea-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>   include/linux/swiotlb.h |  25 ++++++
>   kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++---------
>   2 files changed, 162 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 7ed35dd3de6e..489c249da434 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -62,6 +62,24 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
>   #ifdef CONFIG_SWIOTLB
>   extern enum swiotlb_force swiotlb_force;
>   
> +/**
> + * struct io_tlb_area - IO TLB memory area descriptor
> + *
> + * This is a single area with a single lock.
> + *
> + * @used:	The number of used IO TLB slots in this area.
> + * @area_index: The index of this IO TLB area.
> + * @index:	The slot index to start searching in this area for next round.
> + * @lock:	The lock to protect the above data structures in the map and
> + *		unmap calls.
> + */
> +struct io_tlb_area {
> +	unsigned long used;
> +	unsigned int area_index;
> +	unsigned int index;
> +	spinlock_t lock;
> +};
> +
>   /**
>    * struct io_tlb_mem - IO TLB Memory Pool Descriptor
>    *
> @@ -89,6 +107,9 @@ extern enum swiotlb_force swiotlb_force;
>    * @late_alloc:	%true if allocated using the page allocator
>    * @force_bounce: %true if swiotlb bouncing is forced
>    * @for_alloc:  %true if the pool is used for memory allocation
> + * @num_areas:  The area number in the pool.
> + * @area_start: The area index to start searching in the next round.
> + * @area_nslabs: The slot number in the area.
>    */
>   struct io_tlb_mem {
>   	phys_addr_t start;
> @@ -102,6 +123,10 @@ struct io_tlb_mem {
>   	bool late_alloc;
>   	bool force_bounce;
>   	bool for_alloc;
> +	unsigned int num_areas;
> +	unsigned int area_start;
> +	unsigned int area_nslabs;
> +	struct io_tlb_area *areas;
>   	struct io_tlb_slot {
>   		phys_addr_t orig_addr;
>   		size_t alloc_size;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index e2ef0864eb1e..00a16f540f20 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -62,6 +62,8 @@
>   
>   #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
>   
> +#define NUM_AREAS_DEFAULT 1
> +
>   static bool swiotlb_force_bounce;
>   static bool swiotlb_force_disable;
>   
> @@ -70,6 +72,25 @@ struct io_tlb_mem io_tlb_default_mem;
>   phys_addr_t swiotlb_unencrypted_base;
>   
>   static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> +static unsigned long default_area_num = NUM_AREAS_DEFAULT;
> +
> +static int swiotlb_setup_areas(struct io_tlb_mem *mem,
> +		unsigned int num_areas, unsigned long nslabs)
> +{
> +	if (nslabs < 1 || !is_power_of_2(num_areas)) {
> +		pr_err("swiotlb: Invalid areas parameter %d.\n", num_areas);
> +		return -EINVAL;
> +	}
> +
> +	/* Round up number of slabs to the next power of 2.
> +	 * The last area is going to be smaller than the rest if default_nslabs
> +	 * is not a power of two.
> +	 */
> +	mem->area_start = 0;
> +	mem->num_areas = num_areas;
> +	mem->area_nslabs = nslabs / num_areas;
> +	return 0;
> +}
>   
>   static int __init
>   setup_io_tlb_npages(char *str)
> @@ -114,6 +135,8 @@ void __init swiotlb_adjust_size(unsigned long size)
>   		return;
>   	size = ALIGN(size, IO_TLB_SIZE);
>   	default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
> +	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
> +			    default_nslabs);
>   	pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
>   }
>   
> @@ -195,7 +218,8 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   				    unsigned long nslabs, bool late_alloc)
>   {
>   	void *vaddr = phys_to_virt(start);
> -	unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
> +	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
> +	unsigned int block_list;
>   
>   	mem->nslabs = nslabs;
>   	mem->start = start;
> @@ -206,8 +230,13 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   	if (swiotlb_force_bounce)
>   		mem->force_bounce = true;
>   
> -	spin_lock_init(&mem->lock);
> -	for (i = 0; i < mem->nslabs; i++) {
> +	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
> +		if (!(i % mem->area_nslabs)) {
> +			mem->areas[j].index = 0;
> +			spin_lock_init(&mem->areas[j].lock);
> +			j++;
> +		}
> +
>   		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>   		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>   		mem->slots[i].alloc_size = 0;
> @@ -272,6 +301,13 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>   		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
>   		      __func__, alloc_size, PAGE_SIZE);
>   
> +	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
> +		    default_nslabs);
> +	mem->areas = memblock_alloc(sizeof(struct io_tlb_area) * mem->num_areas,
> +			    SMP_CACHE_BYTES);
> +	if (!mem->areas)
> +		panic("%s: Failed to allocate mem->areas.\n", __func__);
> +
>   	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
>   	mem->force_bounce = flags & SWIOTLB_FORCE;
>   
> @@ -296,7 +332,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
>   	unsigned long bytes;
>   	unsigned char *vstart = NULL;
> -	unsigned int order;
> +	unsigned int order, area_order;
>   	int rc = 0;
>   
>   	if (swiotlb_force_disable)
> @@ -334,18 +370,32 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   		goto retry;
>   	}
>   
> +	swiotlb_setup_areas(&io_tlb_default_mem, default_area_num,
> +			    nslabs);
> +
> +	area_order = get_order(array_size(sizeof(*mem->areas),
> +		default_area_num));
> +	mem->areas = (struct io_tlb_area *)
> +		__get_free_pages(GFP_KERNEL | __GFP_ZERO, area_order);
> +	if (!mem->areas)
> +		goto error_area;
> +
>   	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>   		get_order(array_size(sizeof(*mem->slots), nslabs)));
> -	if (!mem->slots) {
> -		free_pages((unsigned long)vstart, order);
> -		return -ENOMEM;
> -	}
> +	if (!mem->slots)
> +		goto error_slots;
>   
>   	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
>   	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
>   
>   	swiotlb_print_info();
>   	return 0;
> +
> +error_slots:
> +	free_pages((unsigned long)mem->areas, area_order);
> +error_area:
> +	free_pages((unsigned long)vstart, order);
> +	return -ENOMEM;
>   }
>   
>   void __init swiotlb_exit(void)
> @@ -353,6 +403,7 @@ void __init swiotlb_exit(void)
>   	struct io_tlb_mem *mem = &io_tlb_default_mem;
>   	unsigned long tbl_vaddr;
>   	size_t tbl_size, slots_size;
> +	unsigned int area_order;
>   
>   	if (swiotlb_force_bounce)
>   		return;
> @@ -367,9 +418,14 @@ void __init swiotlb_exit(void)
>   
>   	set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
>   	if (mem->late_alloc) {
> +		area_order = get_order(array_size(sizeof(*mem->areas),
> +			mem->num_areas));
> +		free_pages((unsigned long)mem->areas, area_order);
>   		free_pages(tbl_vaddr, get_order(tbl_size));
>   		free_pages((unsigned long)mem->slots, get_order(slots_size));
>   	} else {
> +		memblock_free_late(__pa(mem->areas),
> +				   mem->num_areas * sizeof(struct io_tlb_area));
>   		memblock_free_late(mem->start, tbl_size);
>   		memblock_free_late(__pa(mem->slots), slots_size);
>   	}
> @@ -472,9 +528,9 @@ static inline unsigned long get_max_slots(unsigned long boundary_mask)
>   	return nr_slots(boundary_mask + 1);
>   }
>   
> -static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
> +static unsigned int wrap_area_index(struct io_tlb_mem *mem, unsigned int index)
>   {
> -	if (index >= mem->nslabs)
> +	if (index >= mem->area_nslabs)
>   		return 0;
>   	return index;
>   }
> @@ -483,10 +539,13 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
>    * Find a suitable number of IO TLB entries size that will fit this request and
>    * allocate a buffer from that IO TLB pool.
>    */
> -static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> -			      size_t alloc_size, unsigned int alloc_align_mask)
> +static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
> +				 struct io_tlb_area *area,
> +				 int area_index,
> +				 struct device *dev, phys_addr_t orig_addr,
> +				 size_t alloc_size,
> +				 unsigned int alloc_align_mask)
>   {
> -	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>   	unsigned long boundary_mask = dma_get_seg_boundary(dev);
>   	dma_addr_t tbl_dma_addr =
>   		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
> @@ -497,8 +556,11 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>   	unsigned int index, wrap, count = 0, i;
>   	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
>   	unsigned long flags;
> +	unsigned int slot_base;
> +	unsigned int slot_index;
>   
>   	BUG_ON(!nslots);
> +	BUG_ON(area_index >= mem->num_areas);
>   
>   	/*
>   	 * For mappings with an alignment requirement don't bother looping to
> @@ -510,16 +572,20 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>   		stride = max(stride, stride << (PAGE_SHIFT - IO_TLB_SHIFT));
>   	stride = max(stride, (alloc_align_mask >> IO_TLB_SHIFT) + 1);
>   
> -	spin_lock_irqsave(&mem->lock, flags);
> -	if (unlikely(nslots > mem->nslabs - mem->used))
> +	spin_lock_irqsave(&area->lock, flags);
> +	if (unlikely(nslots > mem->area_nslabs - area->used))
>   		goto not_found;
>   
> -	index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
> +	slot_base = area_index * mem->area_nslabs;
> +	index = wrap = wrap_area_index(mem, ALIGN(area->index, stride));
> +
>   	do {
> +		slot_index = slot_base + index;
> +
>   		if (orig_addr &&
> -		    (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
> -			    (orig_addr & iotlb_align_mask)) {
> -			index = wrap_index(mem, index + 1);
> +		    (slot_addr(tbl_dma_addr, slot_index) &
> +		     iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> +			index = wrap_area_index(mem, index + 1);
>   			continue;
>   		}
>   
> @@ -528,26 +594,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>   		 * contiguous buffers, we allocate the buffers from that slot
>   		 * and mark the entries as '0' indicating unavailable.
>   		 */
> -		if (!iommu_is_span_boundary(index, nslots,
> +		if (!iommu_is_span_boundary(slot_index, nslots,
>   					    nr_slots(tbl_dma_addr),
>   					    max_slots)) {
> -			if (mem->slots[index].list >= nslots)
> +			if (mem->slots[slot_index].list >= nslots)
>   				goto found;
>   		}
> -		index = wrap_index(mem, index + stride);
> +		index = wrap_area_index(mem, index + stride);
>   	} while (index != wrap);
>   
>   not_found:
> -	spin_unlock_irqrestore(&mem->lock, flags);
> +	spin_unlock_irqrestore(&area->lock, flags);
>   	return -1;
>   
>   found:
> -	for (i = index; i < index + nslots; i++) {
> +	for (i = slot_index; i < slot_index + nslots; i++) {
>   		mem->slots[i].list = 0;
>   		mem->slots[i].alloc_size =
> -			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
> +			alloc_size - (offset + ((i - slot_index) << IO_TLB_SHIFT));
>   	}
> -	for (i = index - 1;
> +	for (i = slot_index - 1;
>   	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
>   	     mem->slots[i].list; i--)
>   		mem->slots[i].list = ++count;
> @@ -555,14 +621,45 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>   	/*
>   	 * Update the indices to avoid searching in the next round.
>   	 */
> -	if (index + nslots < mem->nslabs)
> -		mem->index = index + nslots;
> +	if (index + nslots < mem->area_nslabs)
> +		area->index = index + nslots;
>   	else
> -		mem->index = 0;
> -	mem->used += nslots;
> +		area->index = 0;
> +	area->used += nslots;
> +	spin_unlock_irqrestore(&area->lock, flags);
> +	return slot_index;
> +}
>   
> -	spin_unlock_irqrestore(&mem->lock, flags);
> -	return index;
> +static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> +			      size_t alloc_size, unsigned int alloc_align_mask)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	int start, i, index;
> +
> +	i = start = mem->area_start;
> +	mem->area_start = (mem->area_start + 1) % mem->num_areas;
> +
> +	do {
> +		index = swiotlb_do_find_slots(mem, mem->areas + i, i,
> +					      dev, orig_addr, alloc_size,
> +					      alloc_align_mask);
> +		if (index >= 0)
> +			return index;
> +		if (++i >= mem->num_areas)
> +			i = 0;
> +	} while (i != start);
> +
> +	return -1;
> +}
> +
> +static unsigned long mem_used(struct io_tlb_mem *mem)
> +{
> +	int i;
> +	unsigned long used = 0;
> +
> +	for (i = 0; i < mem->num_areas; i++)
> +		used += mem->areas[i].used;
> +	return used;
>   }
>   
>   phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> @@ -594,7 +691,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   		if (!(attrs & DMA_ATTR_NO_WARN))
>   			dev_warn_ratelimited(dev,
>   	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
> -				 alloc_size, mem->nslabs, mem->used);
> +				 alloc_size, mem->nslabs, mem_used(mem));
>   		return (phys_addr_t)DMA_MAPPING_ERROR;
>   	}
>   
> @@ -624,6 +721,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>   	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
>   	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
>   	int nslots = nr_slots(mem->slots[index].alloc_size + offset);
> +	int aindex = index / mem->area_nslabs;
> +	struct io_tlb_area *area = &mem->areas[aindex];
>   	int count, i;
>   
>   	/*
> @@ -632,7 +731,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>   	 * While returning the entries to the free list, we merge the entries
>   	 * with slots below and above the pool being returned.
>   	 */
> -	spin_lock_irqsave(&mem->lock, flags);
> +	BUG_ON(aindex >= mem->num_areas);
> +
> +	spin_lock_irqsave(&area->lock, flags);
>   	if (index + nslots < ALIGN(index + 1, IO_TLB_SEGSIZE))
>   		count = mem->slots[index + nslots].list;
>   	else
> @@ -656,8 +757,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>   	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list;
>   	     i--)
>   		mem->slots[i].list = ++count;
> -	mem->used -= nslots;
> -	spin_unlock_irqrestore(&mem->lock, flags);
> +	area->used -= nslots;
> +	spin_unlock_irqrestore(&area->lock, flags);
>   }
>   
>   /*

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:44     ` Robin Murphy
@ 2022-04-28 14:45       ` Christoph Hellwig
  -1 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2022-04-28 14:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Tianyu Lan, hch, m.szyprowski, michael.h.kelley, kys,
	parri.andrea, thomas.lendacky, wei.liu, Andi Kleen, Tianyu Lan,
	linux-hyperv, konrad.wilk, linux-kernel, kirill.shutemov, iommu,
	andi.kleen, brijesh.singh, vkuznets, hch

On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
> Rather than introduce this extra level of allocator complexity, how about
> just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
> instances?

Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
so the very last thing I want is an interface that allows a driver to
allocate a per-device buffer.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:45       ` Christoph Hellwig
@ 2022-04-28 14:55         ` Andi Kleen
  -1 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2022-04-28 14:55 UTC (permalink / raw)
  To: Christoph Hellwig, Robin Murphy
  Cc: Tianyu Lan, m.szyprowski, michael.h.kelley, kys, parri.andrea,
	thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv, konrad.wilk,
	linux-kernel, kirill.shutemov, iommu, andi.kleen, brijesh.singh,
	vkuznets, hch


On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>> Rather than introduce this extra level of allocator complexity, how about
>> just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
>> instances?
> Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
> so the very last thing I want is an interface that allows a driver to
> allocate a per-device buffer.

At least for TDX we need parallelism with a single device for performance.

So if you split up the io tlb mems for a device then you would need a 
new mechanism to load balance the requests for a single device over those. 
I doubt it would be any simpler.


-Andi



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:45       ` Christoph Hellwig
@ 2022-04-28 14:56         ` Robin Murphy
  -1 siblings, 0 replies; 33+ messages in thread
From: Robin Murphy @ 2022-04-28 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tianyu Lan, m.szyprowski, michael.h.kelley, kys, parri.andrea,
	thomas.lendacky, wei.liu, Andi Kleen, Tianyu Lan, linux-hyperv,
	konrad.wilk, linux-kernel, kirill.shutemov, iommu, andi.kleen,
	brijesh.singh, vkuznets, hch

On 2022-04-28 15:45, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>> Rather than introduce this extra level of allocator complexity, how about
>> just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
>> instances?
> 
> Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
> so the very last thing I want is an interface that allows a driver to
> allocate a per-device buffer.

FWIW I'd already started thinking about having a distinct io_tlb_mem for 
non-coherent devices where vaddr is made non-cacheable to avoid the 
hassle of keeping the arch_dma_sync_* calls lined up, so I'm certainly 
in favour of bringing in a bit more flexibility at this level :)

Robin.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:55         ` Andi Kleen
@ 2022-04-28 15:05           ` Christoph Hellwig
  -1 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2022-04-28 15:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Robin Murphy, Tianyu Lan, m.szyprowski,
	michael.h.kelley, kys, parri.andrea, thomas.lendacky, wei.liu,
	Tianyu Lan, linux-hyperv, konrad.wilk, linux-kernel,
	kirill.shutemov, iommu, andi.kleen, brijesh.singh, vkuznets, hch

On Thu, Apr 28, 2022 at 07:55:39AM -0700, Andi Kleen wrote:
> At least for TDX need parallelism with a single device for performance.

So find a way to make it happen without exposing details to random
drivers.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:55         ` Andi Kleen
@ 2022-04-28 15:07           ` Robin Murphy
  -1 siblings, 0 replies; 33+ messages in thread
From: Robin Murphy @ 2022-04-28 15:07 UTC (permalink / raw)
  To: Andi Kleen, Christoph Hellwig
  Cc: Tianyu Lan, m.szyprowski, michael.h.kelley, kys, parri.andrea,
	thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv, konrad.wilk,
	linux-kernel, kirill.shutemov, iommu, andi.kleen, brijesh.singh,
	vkuznets, hch

On 2022-04-28 15:55, Andi Kleen wrote:
> 
> On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
>> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>>> Rather than introduce this extra level of allocator complexity, how 
>>> about
>>> just dividing up the initial SWIOTLB allocation into multiple io_tlb_mem
>>> instances?
>> Yeah.  We're almost done removing all knowledge of swiotlb from drivers,
>> so the very last thing I want is an interface that allows a driver to
>> allocate a per-device buffer.
> 
> At least for TDX need parallelism with a single device for performance.
> 
> So if you split up the io tlb mems for a device then you would need a 
> new mechanism to load balance the requests for single device over those. 
> I doubt it would be any simpler.

Eh, I think it would be, since the round-robin retry loop can then just 
sit around the existing io_tlb_mem-based allocator, vs. the churn of 
inserting it in the middle, plus it's then really easy to statically 
distribute different starting points across different devices via 
dev->dma_io_tlb_mem if we wanted to.

Admittedly the overall patch probably ends up about the same size, since 
it likely pushes a bit more complexity into swiotlb_init to compensate, 
but that's still a trade-off I like.

Thanks,
Robin.
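
A rough sketch of the shape being suggested here, with invented names
(io_tlb_instances[], nr_instances, swiotlb_find_slots_in()) used purely for
illustration rather than as code against any particular tree:

/*
 * Carve the initial SWIOTLB allocation into several independent
 * io_tlb_mem instances and wrap the existing single-lock allocator
 * in a round-robin retry loop.
 */
static struct io_tlb_mem *io_tlb_instances;	/* invented */
static unsigned int nr_instances;		/* invented */
static atomic_t io_tlb_next_instance;

static int swiotlb_find_slots_any(struct device *dev, phys_addr_t orig_addr,
				  size_t alloc_size,
				  unsigned int alloc_align_mask,
				  struct io_tlb_mem **ret_mem)
{
	unsigned int start = (unsigned int)atomic_inc_return(&io_tlb_next_instance);
	unsigned int i;

	for (i = 0; i < nr_instances; i++) {
		struct io_tlb_mem *mem =
			&io_tlb_instances[(start + i) % nr_instances];
		/*
		 * swiotlb_find_slots_in() stands for today's allocator,
		 * taking the io_tlb_mem explicitly instead of reading
		 * dev->dma_io_tlb_mem (invented name).
		 */
		int index = swiotlb_find_slots_in(mem, dev, orig_addr,
						  alloc_size, alloc_align_mask);

		if (index >= 0) {
			*ret_mem = mem;
			return index;
		}
	}

	return -1;
}

A statically assigned starting instance per device, as mentioned above, would
simply replace the io_tlb_next_instance counter.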

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 15:05           ` Christoph Hellwig
@ 2022-04-28 15:16             ` Andi Kleen
  -1 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2022-04-28 15:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Tianyu Lan, m.szyprowski, michael.h.kelley, kys,
	parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv,
	konrad.wilk, linux-kernel, kirill.shutemov, iommu, andi.kleen,
	brijesh.singh, vkuznets, hch


On 4/28/2022 8:05 AM, Christoph Hellwig wrote:
> On Thu, Apr 28, 2022 at 07:55:39AM -0700, Andi Kleen wrote:
>> At least for TDX need parallelism with a single device for performance.
> So find a way to make it happen without exposing details to random
> drivers.


That's what the original patch (that this one is derived from) did.

It was completely transparent to everyone outside swiotlb.c

-Andi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface
  2022-04-28 14:14   ` Tianyu Lan
@ 2022-04-28 15:50     ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 15:50 UTC (permalink / raw)
  To: robin.murphy, hch
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov, hch, m.szyprowski,
	robin.murphy, michael.h.kelley, kys

On 4/28/2022 10:14 PM, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> In a SEV/TDX Confidential VM, device DMA transactions need to use swiotlb
> bounce buffers to share data with the host/hypervisor. The swiotlb spinlock
> introduces overhead among devices if they share an io tlb mem. To avoid
> this issue, introduce swiotlb_device_allocate() to allocate a device bounce
> buffer from the default io tlb pool and set up areas according to the input
> queue number. A device may have multiple io queues, and setting up the same
> number of io tlb areas may help to resolve spinlock overhead among queues.
> 
> Introduce an IO TLB block unit (2MB) concept to allocate big bounce buffers
> from the default pool for devices. The IO TLB segment (256KB) is too small.

Hi Christoph and Robin Murphy:

 From Christoph:
"Yeah.  We're almost done removing all knowledge of swiotlb from 
drivers, so the very last thing I want is an interface that allows a 
driver to allocate a per-device buffer."
	Please have a look at this patch. It provides an API for a device
driver to allocate a per-device bounce buffer. Just providing a
per-device bounce buffer is not enough: a device may still have multiple
queues. The single io tlb mem has only one spin lock in the current code,
and this introduces overhead among the queues' DMA transactions. So the
new API takes the queue number as the IO TLB area number, and this is why
we still need to create areas in the io tlb mem.
	This new API is the one mentioned in Christoph's comment above.
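
	As a quick sanity check on the 2MB block and 256KB segment figures
quoted in the changelog above, assuming the in-tree constants
IO_TLB_SHIFT = 11 (2KB slots) and IO_TLB_SEGSIZE = 128; the asserts below
are only an illustration, not part of the patch:

#include <linux/build_bug.h>
#include <linux/sizes.h>
#include <linux/swiotlb.h>

/* 2KB slots, 128-slot (256KB) segments, 1024-slot (2MB) blocks. */
static_assert((IO_TLB_SEGSIZE << IO_TLB_SHIFT) == SZ_256K);
static_assert(IO_TLB_BLOCKSIZE == 8 * IO_TLB_SEGSIZE);
static_assert((IO_TLB_BLOCKSIZE << IO_TLB_SHIFT) == SZ_2M);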






> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>   include/linux/swiotlb.h |  33 ++++++++
>   kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 203 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 489c249da434..380bd1ce3d0f 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -31,6 +31,14 @@ struct scatterlist;
>   #define IO_TLB_SHIFT 11
>   #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
>   
> +/*
> + * IO TLB BLOCK UNIT as device bounce buffer allocation unit.
> + * This allows a device to allocate bounce buffers from the default
> + * io tlb pool.
> + */
> +#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
> +#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
> +
>   /* default to 64MB */
>   #define IO_TLB_DEFAULT_SIZE (64UL<<20)
>   
> @@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
>    * @index:	The slot index to start searching in this area for next round.
>    * @lock:	The lock to protect the above data structures in the map and
>    *		unmap calls.
> + * @block_index: The block index to start searching in this area for the next round.
>    */
>   struct io_tlb_area {
>   	unsigned long used;
>   	unsigned int area_index;
>   	unsigned int index;
> +	unsigned int block_index;
>   	spinlock_t lock;
>   };
>   
> @@ -110,6 +120,7 @@ struct io_tlb_area {
>    * @num_areas:  The area number in the pool.
>    * @area_start: The area index to start searching in the next round.
>    * @area_nslabs: The slot number in the area.
> + * @area_block_number: The number of blocks in each area.
>    */
>   struct io_tlb_mem {
>   	phys_addr_t start;
> @@ -126,7 +137,14 @@ struct io_tlb_mem {
>   	unsigned int num_areas;
>   	unsigned int area_start;
>   	unsigned int area_nslabs;
> +	unsigned int area_block_number;
> +	struct io_tlb_mem *parent;
>   	struct io_tlb_area *areas;
> +	struct io_tlb_block {
> +		size_t alloc_size;
> +		unsigned long start_slot;
> +		unsigned int list;
> +	} *block;
>   	struct io_tlb_slot {
>   		phys_addr_t orig_addr;
>   		size_t alloc_size;
> @@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
>   size_t swiotlb_max_mapping_size(struct device *dev);
>   bool is_swiotlb_active(struct device *dev);
>   void __init swiotlb_adjust_size(unsigned long size);
> +int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size);
> +void swiotlb_device_free(struct device *dev);
>   #else
>   static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
>   {
> @@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
>   static inline void swiotlb_adjust_size(unsigned long size)
>   {
>   }
> +
> +static inline void swiotlb_device_free(struct device *dev)
> +{
> +}
> +
> +static inline int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size)
> +{
> +	return -ENOMEM;
> +}
>   #endif /* CONFIG_SWIOTLB */
>   
>   extern void swiotlb_print_info(void);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 00a16f540f20..7b95a140694a 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   				    unsigned long nslabs, bool late_alloc)
>   {
>   	void *vaddr = phys_to_virt(start);
> -	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
> +	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
>   	unsigned int block_list;
>   
>   	mem->nslabs = nslabs;
> @@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   	mem->end = mem->start + bytes;
>   	mem->index = 0;
>   	mem->late_alloc = late_alloc;
> +	mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
>   
>   	if (swiotlb_force_bounce)
>   		mem->force_bounce = true;
> @@ -233,10 +234,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
>   		if (!(i % mem->area_nslabs)) {
>   			mem->areas[j].index = 0;
> +			mem->areas[j].block_index = 0;
>   			spin_lock_init(&mem->areas[j].lock);
> +			block_list = mem->area_block_number;
>   			j++;
>   		}
>   
> +		if (!(i % IO_TLB_BLOCKSIZE)) {
> +			mem->block[k].alloc_size = 0;
> +			mem->block[k].list = block_list--;
> +			k++;
> +		}
> +
>   		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>   		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>   		mem->slots[i].alloc_size = 0;
> @@ -308,6 +317,12 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>   	if (!mem->areas)
>   		panic("%s: Failed to allocate mem->areas.\n", __func__);
>   
> +	mem->block = memblock_alloc(sizeof(struct io_tlb_block) *
> +				    (default_nslabs / IO_TLB_BLOCKSIZE),
> +				     SMP_CACHE_BYTES);
> +	if (!mem->block)
> +		panic("%s: Failed to allocate mem->block.\n", __func__);
> +
>   	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
>   	mem->force_bounce = flags & SWIOTLB_FORCE;
>   
> @@ -332,7 +347,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
>   	unsigned long bytes;
>   	unsigned char *vstart = NULL;
> -	unsigned int order, area_order;
> +	unsigned int order, area_order, block_order;
>   	int rc = 0;
>   
>   	if (swiotlb_force_disable)
> @@ -380,6 +395,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	if (!mem->areas)
>   		goto error_area;
>   
> +	block_order = get_order(array_size(sizeof(*mem->block),
> +		nslabs / IO_TLB_BLOCKSIZE));
> +	mem->block = (struct io_tlb_block *)
> +		__get_free_pages(GFP_KERNEL | __GFP_ZERO, block_order);
> +	if (!mem->block)
> +		goto error_block;
> +
>   	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>   		get_order(array_size(sizeof(*mem->slots), nslabs)));
>   	if (!mem->slots)
> @@ -392,6 +414,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	return 0;
>   
>   error_slots:
> +	free_pages((unsigned long)mem->block, block_order);
> +error_block:
>   	free_pages((unsigned long)mem->areas, area_order);
>   error_area:
>   	free_pages((unsigned long)vstart, order);
> @@ -403,7 +427,7 @@ void __init swiotlb_exit(void)
>   	struct io_tlb_mem *mem = &io_tlb_default_mem;
>   	unsigned long tbl_vaddr;
>   	size_t tbl_size, slots_size;
> -	unsigned int area_order;
> +	unsigned int area_order, block_order;
>   
>   	if (swiotlb_force_bounce)
>   		return;
> @@ -421,6 +445,9 @@ void __init swiotlb_exit(void)
>   		area_order = get_order(array_size(sizeof(*mem->areas),
>   			mem->num_areas));
>   		free_pages((unsigned long)mem->areas, area_order);
> +		block_order = get_order(array_size(sizeof(*mem->block),
> +			mem->nslabs / IO_TLB_BLOCKSIZE));
> +		free_pages((unsigned long)mem->block, block_order);
>   		free_pages(tbl_vaddr, get_order(tbl_size));
>   		free_pages((unsigned long)mem->slots, get_order(slots_size));
>   	} else {
> @@ -863,6 +890,146 @@ static int __init __maybe_unused swiotlb_create_default_debugfs(void)
>   late_initcall(swiotlb_create_default_debugfs);
>   #endif
>   
> +static void swiotlb_free_block(struct io_tlb_mem *mem,
> +			       phys_addr_t start, unsigned int block_num)
> +{
> +	unsigned int start_slot = (start - mem->start) >> IO_TLB_SHIFT;
> +	unsigned int area_index = start_slot / mem->num_areas;
> +	unsigned int block_index = start_slot / IO_TLB_BLOCKSIZE;
> +	unsigned int area_block_index = start_slot % mem->area_block_number;
> +	struct io_tlb_area *area = &mem->areas[area_index];
> +	unsigned long flags;
> +	int count, i, num;
> +
> +	spin_lock_irqsave(&area->lock, flags);
> +	if (area_block_index + block_num < mem->area_block_number)
> +		count = mem->block[block_index + block_num].list;
> +	else
> +		count = 0;
> +
> +
> +	for (i = block_index + block_num; i >= block_index; i--) {
> +		mem->block[i].list = ++count;
> +		/* Todo: recover slot->list and alloc_size here. */
> +	}
> +
> +	for (i = block_index - 1, num = block_index % mem->area_block_number;
> +	    i < num && mem->block[i].list; i--)
> +		mem->block[i].list = ++count;
> +
> +	spin_unlock_irqrestore(&area->lock, flags);
> +}
> +
> +void swiotlb_device_free(struct device *dev)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	struct io_tlb_mem *parent_mem = dev->dma_io_tlb_mem->parent;
> +
> +	swiotlb_free_block(parent_mem, mem->start, mem->nslabs / IO_TLB_BLOCKSIZE);
> +}
> +
> +static struct page *swiotlb_alloc_block(struct io_tlb_mem *mem, unsigned int block_num)
> +{
> +	unsigned int area_index, block_index, nslot;
> +	phys_addr_t tlb_addr;
> +	struct io_tlb_area *area;
> +	unsigned long flags;
> +	int i, j;
> +
> +	if (!mem || !mem->block)
> +		return NULL;
> +
> +	area_index = mem->area_start;
> +	mem->area_start = (mem->area_start + 1) % mem->num_areas;
> +	area = &mem->areas[area_index];
> +
> +	spin_lock_irqsave(&area->lock, flags);
> +	block_index = area_index * mem->area_block_number + area->block_index;
> +
> +	/* Todo: Search more blocks. */
> +	if (mem->block[block_index].list < block_num) {
> +		spin_unlock_irqrestore(&area->lock, flags);
> +		return NULL;
> +	}
> +
> +	/* Update block and slot list. */
> +	for (i = block_index; i < block_index + block_num; i++) {
> +		mem->block[i].list = 0;
> +		for (j = 0; j < IO_TLB_BLOCKSIZE; j++) {
> +			nslot = i * IO_TLB_BLOCKSIZE + j;
> +			mem->slots[nslot].list = 0;
> +			mem->slots[nslot].alloc_size = IO_TLB_SIZE;
> +		}
> +	}
> +	spin_unlock_irqrestore(&area->lock, flags);
> +
> +	area->block_index += block_num;
> +	area->used += block_num * IO_TLB_BLOCKSIZE;
> +	tlb_addr = slot_addr(mem->start, block_index * IO_TLB_BLOCKSIZE);
> +	return pfn_to_page(PFN_DOWN(tlb_addr));
> +}
> +
> +/*
> + * swiotlb_device_allocate - Allocate a bounce buffer for a device from
> + * the default io tlb pool. The allocation size should be aligned to
> + * IO_TLB_BLOCK_UNIT.
> + */
> +int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size)
> +{
> +	struct io_tlb_mem *mem, *parent_mem = dev->dma_io_tlb_mem;
> +	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_BLOCKSIZE);
> +	struct page *page;
> +	int ret = -ENOMEM;
> +
> +	page = swiotlb_alloc_block(parent_mem, nslabs / IO_TLB_BLOCKSIZE);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> +	if (!mem)
> +		goto error_mem;
> +
> +	mem->slots = kzalloc(array_size(sizeof(*mem->slots), nslabs),
> +			     GFP_KERNEL);
> +	if (!mem->slots)
> +		goto error_slots;
> +
> +	swiotlb_setup_areas(mem, area_num, nslabs);
> +	mem->areas = (struct io_tlb_area *)kcalloc(area_num,
> +				   sizeof(struct io_tlb_area),
> +				   GFP_KERNEL);
> +	if (!mem->areas)
> +		goto error_areas;
> +
> +	mem->block = (struct io_tlb_block *)kcalloc(nslabs / IO_TLB_BLOCKSIZE,
> +				sizeof(struct io_tlb_block),
> +				GFP_KERNEL);
> +	if (!mem->block)
> +		goto error_block;
> +
> +	swiotlb_init_io_tlb_mem(mem, page_to_phys(page), nslabs, true);
> +	mem->force_bounce = true;
> +	mem->for_alloc = true;
> +
> +	mem->vaddr = parent_mem->vaddr + page_to_phys(page) -  parent_mem->start;
> +	dev->dma_io_tlb_mem->parent = parent_mem;
> +	dev->dma_io_tlb_mem = mem;
> +	return 0;
> +
> +error_block:
> +	kfree(mem->areas);
> +error_areas:
> +	kfree(mem->slots);
> +error_slots:
> +	kfree(mem);
> +error_mem:
> +	swiotlb_device_free(dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_device_allocate);
> +
>   #ifdef CONFIG_DMA_RESTRICTED_POOL
>   
>   struct page *swiotlb_alloc(struct device *dev, size_t size)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface
@ 2022-04-28 15:50     ` Tianyu Lan
  0 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 15:50 UTC (permalink / raw)
  To: robin.murphy, hch
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv,
	konrad.wilk, linux-kernel, kirill.shutemov, iommu,
	michael.h.kelley, hch, andi.kleen, brijesh.singh, vkuznets, kys,
	robin.murphy, hch

On 4/28/2022 10:14 PM, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> In a SEV/TDX Confidential VM, device DMA transactions need to use the
> swiotlb bounce buffer to share data with the host/hypervisor. The
> swiotlb spinlock introduces overhead among devices if they share an io
> tlb mem. To avoid this, introduce swiotlb_device_allocate() to allocate
> a device bounce buffer from the default io tlb pool and set up areas
> according to the input queue number. A device may have multiple io
> queues, and setting up the same number of io tlb areas may help to
> resolve the spinlock overhead among the queues.
> 
> Introduce an IO TLB block unit (2MB) to allocate big bounce buffers
> from the default pool for devices. The IO TLB segment (256KB) is too small.

Hi Christoph and Robin Murphy:

 From Christoph:
"Yeah.  We're almost done removing all knowledge of swiotlb from 
drivers, so the very last thing I want is an interface that allows a 
driver to allocate a per-device buffer."
	Please have a look at this patch. This patch provides an API for
device drivers to allocate a per-device buffer. Just providing a
per-device bounce buffer is not enough: a device may still have multiple
queues. The single io tlb mem has only one spinlock in the current code,
which introduces overhead among the queues' DMA transactions. So the new
API takes the queue number as the IO TLB area number, and this is why we
still need to create areas in the io tlb mem.
        This new API is the one mentioned in Christoph's comment above.
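
For example, a driver with N queues could call the new API roughly like
this (a sketch only, not taken from a real driver, and the 4MB size below
is made up):

	/*
	 * Carve a per-device bounce pool with one IO TLB area per queue.
	 * size is rounded up internally to IO_TLB_BLOCK_UNIT, i.e.
	 * IO_TLB_BLOCKSIZE << IO_TLB_SHIFT = (8 * 128) * 2KB = 2MB.
	 */
	ret = swiotlb_device_allocate(dev, nr_queues, 4 * 1024 * 1024);
	if (ret)
		dev_warn(dev, "no per-device swiotlb, using the default pool\n");

	...

	/* On teardown, return the blocks to the default pool. */
	swiotlb_device_free(dev);

Once swiotlb_device_allocate() succeeds, dev->dma_io_tlb_mem points at
the per-device pool, so the normal swiotlb map/unmap path takes slots
from it without any further driver involvement.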






> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>   include/linux/swiotlb.h |  33 ++++++++
>   kernel/dma/swiotlb.c    | 173 +++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 203 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 489c249da434..380bd1ce3d0f 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -31,6 +31,14 @@ struct scatterlist;
>   #define IO_TLB_SHIFT 11
>   #define IO_TLB_SIZE (1 << IO_TLB_SHIFT)
>   
> +/*
> + * IO_TLB_BLOCK_UNIT is the device bounce buffer allocation unit.
> + * It allows a device to allocate bounce buffers from the default
> + * io tlb pool.
> + */
> +#define IO_TLB_BLOCKSIZE   (8 * IO_TLB_SEGSIZE)
> +#define IO_TLB_BLOCK_UNIT  (IO_TLB_BLOCKSIZE << IO_TLB_SHIFT)
> +
>   /* default to 64MB */
>   #define IO_TLB_DEFAULT_SIZE (64UL<<20)
>   
> @@ -72,11 +80,13 @@ extern enum swiotlb_force swiotlb_force;
>    * @index:	The slot index to start searching in this area for next round.
>    * @lock:	The lock to protect the above data structures in the map and
>    *		unmap calls.
> + * @block_index: The block index to start searching in this area for the next round.
>    */
>   struct io_tlb_area {
>   	unsigned long used;
>   	unsigned int area_index;
>   	unsigned int index;
> +	unsigned int block_index;
>   	spinlock_t lock;
>   };
>   
> @@ -110,6 +120,7 @@ struct io_tlb_area {
>    * @num_areas:  The area number in the pool.
>    * @area_start: The area index to start searching in the next round.
>    * @area_nslabs: The slot number in the area.
> + * @area_block_number: The number of blocks in each area.
>    */
>   struct io_tlb_mem {
>   	phys_addr_t start;
> @@ -126,7 +137,14 @@ struct io_tlb_mem {
>   	unsigned int num_areas;
>   	unsigned int area_start;
>   	unsigned int area_nslabs;
> +	unsigned int area_block_number;
> +	struct io_tlb_mem *parent;
>   	struct io_tlb_area *areas;
> +	struct io_tlb_block {
> +		size_t alloc_size;
> +		unsigned long start_slot;
> +		unsigned int list;
> +	} *block;
>   	struct io_tlb_slot {
>   		phys_addr_t orig_addr;
>   		size_t alloc_size;
> @@ -155,6 +173,10 @@ unsigned int swiotlb_max_segment(void);
>   size_t swiotlb_max_mapping_size(struct device *dev);
>   bool is_swiotlb_active(struct device *dev);
>   void __init swiotlb_adjust_size(unsigned long size);
> +int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size);
> +void swiotlb_device_free(struct device *dev);
>   #else
>   static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
>   {
> @@ -187,6 +209,17 @@ static inline bool is_swiotlb_active(struct device *dev)
>   static inline void swiotlb_adjust_size(unsigned long size)
>   {
>   }
> +
> +void swiotlb_device_free(struct device *dev)
> +{
> +}
> +
> +int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size)
> +{
> +	return -ENOMEM;
> +}
>   #endif /* CONFIG_SWIOTLB */
>   
>   extern void swiotlb_print_info(void);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 00a16f540f20..7b95a140694a 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -218,7 +218,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   				    unsigned long nslabs, bool late_alloc)
>   {
>   	void *vaddr = phys_to_virt(start);
> -	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j;
> +	unsigned long bytes = nslabs << IO_TLB_SHIFT, i, j, k;
>   	unsigned int block_list;
>   
>   	mem->nslabs = nslabs;
> @@ -226,6 +226,7 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   	mem->end = mem->start + bytes;
>   	mem->index = 0;
>   	mem->late_alloc = late_alloc;
> +	mem->area_block_number = nslabs / (IO_TLB_BLOCKSIZE * mem->num_areas);
>   
>   	if (swiotlb_force_bounce)
>   		mem->force_bounce = true;
> @@ -233,10 +234,18 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   	for (i = 0, j = 0, k = 0; i < mem->nslabs; i++) {
>   		if (!(i % mem->area_nslabs)) {
>   			mem->areas[j].index = 0;
> +			mem->areas[j].block_index = 0;
>   			spin_lock_init(&mem->areas[j].lock);
> +			block_list = mem->area_block_number;
>   			j++;
>   		}
>   
> +		if (!(i % IO_TLB_BLOCKSIZE)) {
> +			mem->block[k].alloc_size = 0;
> +			mem->block[k].list = block_list--;
> +			k++;
> +		}
> +
>   		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>   		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>   		mem->slots[i].alloc_size = 0;
> @@ -308,6 +317,12 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
>   	if (!mem->areas)
>   		panic("%s: Failed to allocate mem->areas.\n", __func__);
>   
> +	mem->block = memblock_alloc(sizeof(struct io_tlb_block) *
> +				    (default_nslabs / IO_TLB_BLOCKSIZE),
> +				     SMP_CACHE_BYTES);
> +	if (!mem->block)
> +		panic("%s: Failed to allocate mem->block.\n", __func__);
> +
>   	swiotlb_init_io_tlb_mem(mem, __pa(tlb), default_nslabs, false);
>   	mem->force_bounce = flags & SWIOTLB_FORCE;
>   
> @@ -332,7 +347,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
>   	unsigned long bytes;
>   	unsigned char *vstart = NULL;
> -	unsigned int order, area_order;
> +	unsigned int order, area_order, block_order;
>   	int rc = 0;
>   
>   	if (swiotlb_force_disable)
> @@ -380,6 +395,13 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	if (!mem->areas)
>   		goto error_area;
>   
> +	block_order = get_order(array_size(sizeof(*mem->block),
> +		nslabs / IO_TLB_BLOCKSIZE));
> +	mem->block = (struct io_tlb_block *)
> +		__get_free_pages(GFP_KERNEL | __GFP_ZERO, block_order);
> +	if (!mem->block)
> +		goto error_block;
> +
>   	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>   		get_order(array_size(sizeof(*mem->slots), nslabs)));
>   	if (!mem->slots)
> @@ -392,6 +414,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   	return 0;
>   
>   error_slots:
> +	free_pages((unsigned long)mem->block, block_order);
> +error_block:
>   	free_pages((unsigned long)mem->areas, area_order);
>   error_area:
>   	free_pages((unsigned long)vstart, order);
> @@ -403,7 +427,7 @@ void __init swiotlb_exit(void)
>   	struct io_tlb_mem *mem = &io_tlb_default_mem;
>   	unsigned long tbl_vaddr;
>   	size_t tbl_size, slots_size;
> -	unsigned int area_order;
> +	unsigned int area_order, block_order;
>   
>   	if (swiotlb_force_bounce)
>   		return;
> @@ -421,6 +445,9 @@ void __init swiotlb_exit(void)
>   		area_order = get_order(array_size(sizeof(*mem->areas),
>   			mem->num_areas));
>   		free_pages((unsigned long)mem->areas, area_order);
> +		block_order = get_order(array_size(sizeof(*mem->block),
> +			mem->nslabs / IO_TLB_BLOCKSIZE));
> +		free_pages((unsigned long)mem->block, block_order);
>   		free_pages(tbl_vaddr, get_order(tbl_size));
>   		free_pages((unsigned long)mem->slots, get_order(slots_size));
>   	} else {
> @@ -863,6 +890,146 @@ static int __init __maybe_unused swiotlb_create_default_debugfs(void)
>   late_initcall(swiotlb_create_default_debugfs);
>   #endif
>   
> +static void swiotlb_free_block(struct io_tlb_mem *mem,
> +			       phys_addr_t start, unsigned int block_num)
> +{
> +	unsigned int start_slot = (start - mem->start) >> IO_TLB_SHIFT;
> +	unsigned int area_index = start_slot / mem->num_areas;
> +	unsigned int block_index = start_slot / IO_TLB_BLOCKSIZE;
> +	unsigned int area_block_index = start_slot % mem->area_block_number;
> +	struct io_tlb_area *area = &mem->areas[area_index];
> +	unsigned long flags;
> +	int count, i, num;
> +
> +	spin_lock_irqsave(&area->lock, flags);
> +	if (area_block_index + block_num < mem->area_block_number)
> +		count = mem->block[block_index + block_num].list;
> +	else
> +		count = 0;
> +
> +
> +	for (i = block_index + block_num; i >= block_index; i--) {
> +		mem->block[i].list = ++count;
> +		/* Todo: recover slot->list and alloc_size here. */
> +	}
> +
> +	for (i = block_index - 1, num = block_index % mem->area_block_number;
> +	    i < num && mem->block[i].list; i--)
> +		mem->block[i].list = ++count;
> +
> +	spin_unlock_irqrestore(&area->lock, flags);
> +}
> +
> +void swiotlb_device_free(struct device *dev)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	struct io_tlb_mem *parent_mem = dev->dma_io_tlb_mem->parent;
> +
> +	swiotlb_free_block(parent_mem, mem->start, mem->nslabs / IO_TLB_BLOCKSIZE);
> +}
> +
> +static struct page *swiotlb_alloc_block(struct io_tlb_mem *mem, unsigned int block_num)
> +{
> +	unsigned int area_index, block_index, nslot;
> +	phys_addr_t tlb_addr;
> +	struct io_tlb_area *area;
> +	unsigned long flags;
> +	int i, j;
> +
> +	if (!mem || !mem->block)
> +		return NULL;
> +
> +	area_index = mem->area_start;
> +	mem->area_start = (mem->area_start + 1) % mem->num_areas;
> +	area = &mem->areas[area_index];
> +
> +	spin_lock_irqsave(&area->lock, flags);
> +	block_index = area_index * mem->area_block_number + area->block_index;
> +
> +	/* Todo: Search more blocks. */
> +	if (mem->block[block_index].list < block_num) {
> +		spin_unlock_irqrestore(&area->lock, flags);
> +		return NULL;
> +	}
> +
> +	/* Update block and slot list. */
> +	for (i = block_index; i < block_index + block_num; i++) {
> +		mem->block[i].list = 0;
> +		for (j = 0; j < IO_TLB_BLOCKSIZE; j++) {
> +			nslot = i * IO_TLB_BLOCKSIZE + j;
> +			mem->slots[nslot].list = 0;
> +			mem->slots[nslot].alloc_size = IO_TLB_SIZE;
> +		}
> +	}
> +	spin_unlock_irqrestore(&area->lock, flags);
> +
> +	area->block_index += block_num;
> +	area->used += block_num * IO_TLB_BLOCKSIZE;
> +	tlb_addr = slot_addr(mem->start, block_index * IO_TLB_BLOCKSIZE);
> +	return pfn_to_page(PFN_DOWN(tlb_addr));
> +}
> +
> +/*
> + * swiotlb_device_allocate - Allocate a bounce buffer for a device from
> + * the default io tlb pool. The allocation size should be aligned to
> + * IO_TLB_BLOCK_UNIT.
> + */
> +int swiotlb_device_allocate(struct device *dev,
> +			    unsigned int area_num,
> +			    unsigned long size)
> +{
> +	struct io_tlb_mem *mem, *parent_mem = dev->dma_io_tlb_mem;
> +	unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_BLOCKSIZE);
> +	struct page *page;
> +	int ret = -ENOMEM;
> +
> +	page = swiotlb_alloc_block(parent_mem, nslabs / IO_TLB_BLOCKSIZE);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> +	if (!mem)
> +		goto error_mem;
> +
> +	mem->slots = kzalloc(array_size(sizeof(*mem->slots), nslabs),
> +			     GFP_KERNEL);
> +	if (!mem->slots)
> +		goto error_slots;
> +
> +	swiotlb_setup_areas(mem, area_num, nslabs);
> +	mem->areas = (struct io_tlb_area *)kcalloc(area_num,
> +				   sizeof(struct io_tlb_area),
> +				   GFP_KERNEL);
> +	if (!mem->areas)
> +		goto error_areas;
> +
> +	mem->block = (struct io_tlb_block *)kcalloc(nslabs / IO_TLB_BLOCKSIZE,
> +				sizeof(struct io_tlb_block),
> +				GFP_KERNEL);
> +	if (!mem->block)
> +		goto error_block;
> +
> +	swiotlb_init_io_tlb_mem(mem, page_to_phys(page), nslabs, true);
> +	mem->force_bounce = true;
> +	mem->for_alloc = true;
> +
> +	mem->vaddr = parent_mem->vaddr + page_to_phys(page) -  parent_mem->start;
> +	dev->dma_io_tlb_mem->parent = parent_mem;
> +	dev->dma_io_tlb_mem = mem;
> +	return 0;
> +
> +error_block:
> +	kfree(mem->areas);
> +error_areas:
> +	kfree(mem->slots);
> +error_slots:
> +	kfree(mem);
> +error_mem:
> +	swiotlb_device_free(dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_device_allocate);
> +
>   #ifdef CONFIG_DMA_RESTRICTED_POOL
>   
>   struct page *swiotlb_alloc(struct device *dev, size_t size)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 14:44     ` Robin Murphy
@ 2022-04-28 15:54       ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 15:54 UTC (permalink / raw)
  To: Robin Murphy, hch, m.szyprowski, michael.h.kelley, kys
  Cc: parri.andrea, thomas.lendacky, wei.liu, Andi Kleen, Tianyu Lan,
	linux-hyperv, konrad.wilk, linux-kernel, kirill.shutemov, iommu,
	andi.kleen, brijesh.singh, vkuznets, hch

On 4/28/2022 10:44 PM, Robin Murphy wrote:
> On 2022-04-28 15:14, Tianyu Lan wrote:
>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>
>> Traditionally swiotlb was not performance critical because it was only
>> used for slow devices. But in some setups, like TDX/SEV confidential
>> guests, all IO has to go through swiotlb. Currently swiotlb only has a
>> single lock. Under high IO load with multiple CPUs this can lead to
>> significant lock contention on the swiotlb lock.
>>
>> This patch splits the swiotlb into individual areas which have their
>> own lock. When there is a swiotlb map/allocate request, an io tlb
>> buffer is allocated from the areas evenly and the allocation is freed
>> back to the associated area. This prepares for resolving the overhead
>> of a single spinlock among a device's queues. Each device may have its
>> own io tlb mem and bounce buffer pool.
>>
>> This idea is from Andi Kleen's patch
>> (https://github.com/intel/tdx/commit/4529b5784c141782c72ec9bd9a92df2b68cb7d45).
>> Rework it so that it may work for an individual device's io tlb mem.
>> The device driver may determine the area number according to the
>> device queue number.
> 
> Rather than introduce this extra level of allocator complexity, how 
> about just dividing up the initial SWIOTLB allocation into multiple 
> io_tlb_mem instances?
> 
> Robin.

Agreed. Thanks for the suggestion. That will be more generic; I will
update it in the next version.

Thanks.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
@ 2022-04-28 15:54       ` Tianyu Lan
  0 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-28 15:54 UTC (permalink / raw)
  To: Robin Murphy, hch, m.szyprowski, michael.h.kelley, kys
  Cc: parri.andrea, thomas.lendacky, wei.liu, Andi Kleen, Tianyu Lan,
	konrad.wilk, linux-hyperv, linux-kernel, kirill.shutemov, iommu,
	andi.kleen, brijesh.singh, vkuznets, hch

On 4/28/2022 10:44 PM, Robin Murphy wrote:
> On 2022-04-28 15:14, Tianyu Lan wrote:
>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>
>> Traditionally swiotlb was not performance critical because it was only
>> used for slow devices. But in some setups, like TDX/SEV confidential
>> guests, all IO has to go through swiotlb. Currently swiotlb only has a
>> single lock. Under high IO load with multiple CPUs this can lead to
>> significant lock contention on the swiotlb lock.
>>
>> This patch splits the swiotlb into individual areas which have their
>> own lock. When there is a swiotlb map/allocate request, an io tlb
>> buffer is allocated from the areas evenly and the allocation is freed
>> back to the associated area. This prepares for resolving the overhead
>> of a single spinlock among a device's queues. Each device may have its
>> own io tlb mem and bounce buffer pool.
>>
>> This idea is from Andi Kleen's patch
>> (https://github.com/intel/tdx/commit/4529b5784c141782c72ec9bd9a92df2b68cb7d45).
>> Rework it so that it may work for an individual device's io tlb mem.
>> The device driver may determine the area number according to the
>> device queue number.
> 
> Rather than introduce this extra level of allocator complexity, how 
> about just dividing up the initial SWIOTLB allocation into multiple 
> io_tlb_mem instances?
> 
> Robin.

Agreed. Thanks for the suggestion. That will be more generic; I will
update it in the next version.

Thanks.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 15:07           ` Robin Murphy
@ 2022-04-28 16:02             ` Andi Kleen
  -1 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2022-04-28 16:02 UTC (permalink / raw)
  To: Robin Murphy, Christoph Hellwig
  Cc: Tianyu Lan, m.szyprowski, michael.h.kelley, kys, parri.andrea,
	thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv, konrad.wilk,
	linux-kernel, kirill.shutemov, iommu, andi.kleen, brijesh.singh,
	vkuznets, hch


On 4/28/2022 8:07 AM, Robin Murphy wrote:
> On 2022-04-28 15:55, Andi Kleen wrote:
>>
>> On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
>>> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>>>> Rather than introduce this extra level of allocator complexity, how 
>>>> about
>>>> just dividing up the initial SWIOTLB allocation into multiple 
>>>> io_tlb_mem
>>>> instances?
>>> Yeah.  We're almost done removing all knowledge of swiotlb from 
>>> drivers,
>>> so the very last thing I want is an interface that allows a driver to
>>> allocate a per-device buffer.
>>
>> At least for TDX need parallelism with a single device for performance.
>>
>> So if you split up the io tlb mems for a device then you would need a 
>> new mechanism to load balance the requests for single device over 
>> those. I doubt it would be any simpler.
>
> Eh, I think it would be, since the round-robin retry loop can then 
> just sit around the existing io_tlb_mem-based allocator, vs. the churn 
> of inserting it in the middle, plus it's then really easy to 
> statically distribute different starting points across different 
> devices via dev->dma_io_tlb_mem if we wanted to.
>
> Admittedly the overall patch probably ends up about the same size, 
> since it likely pushes a bit more complexity into swiotlb_init to 
> compensate, but that's still a trade-off I like.

Unless you completely break the external API, this will require a new
mechanism to search a list of io_tlb_mems for the right area to free into.

If the memory area is not contiguous (like in the original patch), this
will be an O(n) operation in the number of io_tlb_mems, so it would get
more and more expensive on larger systems. Or you merge them all together
(so that the simple address arithmetic to look up the area works again),
which will require even more changes in the setup. Or you add hashing or
something similar, which will be even more complicated.

In the end doing it with a single io_tlb_mem is significantly simpler 
and also more natural.
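
(For reference, the O(1) lookup that the contiguous layout buys is plain
address arithmetic, roughly this sketch based on the release path in the
patches; dividing by area_nslabs to pick the area is my assumption:

	offset = swiotlb_align_offset(dev, tlb_addr);
	index  = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
	area   = &mem->areas[index / mem->area_nslabs];

With separate io_tlb_mem instances you first have to figure out which
mem the tlb_addr belongs to before any of this works.)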

-Andi



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
@ 2022-04-28 16:02             ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2022-04-28 16:02 UTC (permalink / raw)
  To: Robin Murphy, Christoph Hellwig
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, konrad.wilk,
	linux-hyperv, Tianyu Lan, linux-kernel, michael.h.kelley, iommu,
	andi.kleen, brijesh.singh, vkuznets, kys, kirill.shutemov, hch


On 4/28/2022 8:07 AM, Robin Murphy wrote:
> On 2022-04-28 15:55, Andi Kleen wrote:
>>
>> On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
>>> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>>>> Rather than introduce this extra level of allocator complexity, how 
>>>> about
>>>> just dividing up the initial SWIOTLB allocation into multiple 
>>>> io_tlb_mem
>>>> instances?
>>> Yeah.  We're almost done removing all knowledge of swiotlb from 
>>> drivers,
>>> so the very last thing I want is an interface that allows a driver to
>>> allocate a per-device buffer.
>>
>> At least for TDX need parallelism with a single device for performance.
>>
>> So if you split up the io tlb mems for a device then you would need a 
>> new mechanism to load balance the requests for single device over 
>> those. I doubt it would be any simpler.
>
> Eh, I think it would be, since the round-robin retry loop can then 
> just sit around the existing io_tlb_mem-based allocator, vs. the churn 
> of inserting it in the middle, plus it's then really easy to 
> statically distribute different starting points across different 
> devices via dev->dma_io_tlb_mem if we wanted to.
>
> Admittedly the overall patch probably ends up about the same size, 
> since it likely pushes a bit more complexity into swiotlb_init to 
> compensate, but that's still a trade-off I like.

Unless you completely break the external API, this will require a new
mechanism to search a list of io_tlb_mems for the right area to free into.

If the memory area is not contiguous (like in the original patch), this
will be an O(n) operation in the number of io_tlb_mems, so it would get
more and more expensive on larger systems. Or you merge them all together
(so that the simple address arithmetic to look up the area works again),
which will require even more changes in the setup. Or you add hashing or
something similar, which will be even more complicated.

In the end doing it with a single io_tlb_mem is significantly simpler 
and also more natural.

-Andi



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
  2022-04-28 16:02             ` Andi Kleen
@ 2022-04-28 16:59               ` Robin Murphy
  -1 siblings, 0 replies; 33+ messages in thread
From: Robin Murphy @ 2022-04-28 16:59 UTC (permalink / raw)
  To: Andi Kleen, Christoph Hellwig
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, konrad.wilk,
	linux-hyperv, Tianyu Lan, linux-kernel, michael.h.kelley, iommu,
	andi.kleen, brijesh.singh, vkuznets, kys, kirill.shutemov, hch

On 2022-04-28 17:02, Andi Kleen wrote:
> 
> On 4/28/2022 8:07 AM, Robin Murphy wrote:
>> On 2022-04-28 15:55, Andi Kleen wrote:
>>>
>>> On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
>>>> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>>>>> Rather than introduce this extra level of allocator complexity, how 
>>>>> about
>>>>> just dividing up the initial SWIOTLB allocation into multiple 
>>>>> io_tlb_mem
>>>>> instances?
>>>> Yeah.  We're almost done removing all knowledge of swiotlb from 
>>>> drivers,
>>>> so the very last thing I want is an interface that allows a driver to
>>>> allocate a per-device buffer.
>>>
>>> At least for TDX need parallelism with a single device for performance.
>>>
>>> So if you split up the io tlb mems for a device then you would need a 
>>> new mechanism to load balance the requests for single device over 
>>> those. I doubt it would be any simpler.
>>
>> Eh, I think it would be, since the round-robin retry loop can then 
>> just sit around the existing io_tlb_mem-based allocator, vs. the churn 
>> of inserting it in the middle, plus it's then really easy to 
>> statically distribute different starting points across different 
>> devices via dev->dma_io_tlb_mem if we wanted to.
>>
>> Admittedly the overall patch probably ends up about the same size, 
>> since it likely pushes a bit more complexity into swiotlb_init to 
>> compensate, but that's still a trade-off I like.
> 
> Unless you completely break the external API this will require a new 
> mechanism to search a list of io_tlb_mems for the right area to free into.
> 
> If the memory area not contiguous (like in the original patch) this will 
> be a O(n) operation on the number of io_tlb_mems, so it would get more 
> and more expensive on larger systems. Or you merge them all together (so 
> that the simple address arithmetic to look up the area works again), 
> which will require even more changes in the setup. Or you add hashing or 
> similar which will be even more complicated.
> 
> In the end doing it with a single io_tlb_mem is significantly simpler 
> and also more natural.

Sorry if "dividing up the initial SWIOTLB allocation" somehow sounded 
like "making multiple separate SWIOTLB allocations all over the place"?

I don't see there being any *functional* difference in whether a slice 
of the overall SWIOTLB memory is represented by 
"io_tlb_default_mem->areas[i]->blah" or "io_tlb_default_mem[i]->blah", 
I'm simply advocating for not churning the already-complex allocator 
internals by pushing the new complexity out to the margins instead.
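
A minimal sketch of that shape, assuming io_tlb_default_mem becomes a
small array of instances and the existing slot search is factored into a
swiotlb_do_find_slots(mem, ...) helper (both assumptions, not code from
this thread):

	/* Try each instance in turn, starting at a per-device hint. */
	i = start = dev_start_hint % nr_instances;
	do {
		index = swiotlb_do_find_slots(&io_tlb_default_mem[i], dev,
					      orig_addr, alloc_size,
					      alloc_align_mask);
		if (index >= 0)
			break;
		i = (i + 1) % nr_instances;
	} while (i != start);

The allocator internals stay as they are; only the outer loop and the
bookkeeping of which instance an address belongs to are new.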

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock
@ 2022-04-28 16:59               ` Robin Murphy
  0 siblings, 0 replies; 33+ messages in thread
From: Robin Murphy @ 2022-04-28 16:59 UTC (permalink / raw)
  To: Andi Kleen, Christoph Hellwig
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, konrad.wilk,
	linux-hyperv, Tianyu Lan, linux-kernel, michael.h.kelley, iommu,
	andi.kleen, brijesh.singh, vkuznets, kys, kirill.shutemov, hch

On 2022-04-28 17:02, Andi Kleen wrote:
> 
> On 4/28/2022 8:07 AM, Robin Murphy wrote:
>> On 2022-04-28 15:55, Andi Kleen wrote:
>>>
>>> On 4/28/2022 7:45 AM, Christoph Hellwig wrote:
>>>> On Thu, Apr 28, 2022 at 03:44:36PM +0100, Robin Murphy wrote:
>>>>> Rather than introduce this extra level of allocator complexity, how 
>>>>> about
>>>>> just dividing up the initial SWIOTLB allocation into multiple 
>>>>> io_tlb_mem
>>>>> instances?
>>>> Yeah.  We're almost done removing all knowledge of swiotlb from 
>>>> drivers,
>>>> so the very last thing I want is an interface that allows a driver to
>>>> allocate a per-device buffer.
>>>
>>> At least for TDX need parallelism with a single device for performance.
>>>
>>> So if you split up the io tlb mems for a device then you would need a 
>>> new mechanism to load balance the requests for single device over 
>>> those. I doubt it would be any simpler.
>>
>> Eh, I think it would be, since the round-robin retry loop can then 
>> just sit around the existing io_tlb_mem-based allocator, vs. the churn 
>> of inserting it in the middle, plus it's then really easy to 
>> statically distribute different starting points across different 
>> devices via dev->dma_io_tlb_mem if we wanted to.
>>
>> Admittedly the overall patch probably ends up about the same size, 
>> since it likely pushes a bit more complexity into swiotlb_init to 
>> compensate, but that's still a trade-off I like.
> 
> Unless you completely break the external API this will require a new 
> mechanism to search a list of io_tlb_mems for the right area to free into.
> 
> If the memory area not contiguous (like in the original patch) this will 
> be a O(n) operation on the number of io_tlb_mems, so it would get more 
> and more expensive on larger systems. Or you merge them all together (so 
> that the simple address arithmetic to look up the area works again), 
> which will require even more changes in the setup. Or you add hashing or 
> similar which will be even more complicated.
> 
> In the end doing it with a single io_tlb_mem is significantly simpler 
> and also more natural.

Sorry if "dividing up the initial SWIOTLB allocation" somehow sounded 
like "making multiple separate SWIOTLB allocations all over the place"?

I don't see there being any *functional* difference in whether a slice 
of the overall SWIOTLB memory is represented by 
"io_tlb_default_mem->areas[i]->blah" or "io_tlb_default_mem[i]->blah", 
I'm simply advocating for not churning the already-complex allocator 
internals by pushing the new complexity out to the margins instead.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface
  2022-04-28 14:14   ` Tianyu Lan
  (?)
  (?)
@ 2022-04-28 17:16   ` kernel test robot
  -1 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2022-04-28 17:16 UTC (permalink / raw)
  To: Tianyu Lan; +Cc: llvm, kbuild-all

Hi Tianyu,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on next-20220428]
[cannot apply to linus/master v5.18-rc4 v5.18-rc3 v5.18-rc2 v5.18-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Tianyu-Lan/swiotlb-Introduce-swiotlb-device-allocation-function/20220428-221517
base:    bdc61aad77faf67187525028f1f355eff3849f22
config: hexagon-randconfig-r045-20220428 (https://download.01.org/0day-ci/archive/20220429/202204290145.wcC5anxG-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c59473aacce38cd7dd77eebceaf3c98c5707ab3b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/68c4bec9aa77ed6aedaa27a63a2b1c5916f0db50
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Tianyu-Lan/swiotlb-Introduce-swiotlb-device-allocation-function/20220428-221517
        git checkout 68c4bec9aa77ed6aedaa27a63a2b1c5916f0db50
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/base/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from drivers/base/core.c:30:
>> include/linux/swiotlb.h:213:6: warning: no previous prototype for function 'swiotlb_device_free' [-Wmissing-prototypes]
   void swiotlb_device_free(struct device *dev)
        ^
   include/linux/swiotlb.h:213:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void swiotlb_device_free(struct device *dev)
   ^
   static 
>> include/linux/swiotlb.h:217:5: warning: no previous prototype for function 'swiotlb_device_allocate' [-Wmissing-prototypes]
   int swiotlb_device_allocate(struct device *dev,
       ^
   include/linux/swiotlb.h:217:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int swiotlb_device_allocate(struct device *dev,
   ^
   static 
   2 warnings generated.


vim +/swiotlb_device_free +213 include/linux/swiotlb.h

   212	
 > 213	void swiotlb_device_free(struct device *dev)
   214	{
   215	}
   216	
 > 217	int swiotlb_device_allocate(struct device *dev,
   218				    unsigned int area_num,
   219				    unsigned long size)
   220	{
   221		return -ENOMEM;
   222	}
   223	#endif /* CONFIG_SWIOTLB */
   224	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH] swiotlb: Add Child IO TLB mem support
  2022-04-28 14:44     ` Robin Murphy
@ 2022-04-29 14:21       ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-29 14:21 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: parri.andrea, thomas.lendacky, wei.liu, Tianyu Lan, linux-hyperv,
	konrad.wilk, linux-kernel, kirill.shutemov, iommu, andi.kleen,
	brijesh.singh, vkuznets, hch

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patch adds child IO TLB mem support to resolve spinlock overhead
among a device's queues. Each device may allocate an IO tlb mem and set
up child IO TLB mems according to its queue number. The swiotlb code
then allocates bounce buffers from the child IO tlb mems iteratively.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 include/linux/swiotlb.h |  7 +++
 kernel/dma/swiotlb.c    | 96 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..4a3f6a7b4b7e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -89,6 +89,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @child_nslot: The number of IO TLB slots in each child IO TLB mem.
+ * @num_child:  The number of child io tlb mems in the pool.
+ * @child_start: The child index to start searching in the next round.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -102,6 +105,10 @@ struct io_tlb_mem {
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	unsigned int num_child;
+	unsigned int child_nslot;
+	unsigned int child_start;
+	struct io_tlb_mem *child;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..382fa2288645 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -207,6 +207,25 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 		mem->force_bounce = true;
 
 	spin_lock_init(&mem->lock);
+
+	if (mem->num_child) {
+		mem->child_nslot = nslabs / mem->num_child;
+		mem->child_start = 0;
+
+		/*
+		 * Initialize the child IO TLB mems, dividing the IO TLB
+		 * pool evenly among them. Reuse the parent's mem->slots
+		 * array for the child mem->slots.
+		 */
+		for (i = 0; i < mem->num_child; i++) {
+			mem->child[i].num_child = 0;	/* a child has no children of its own */
+			mem->child[i].slots = mem->slots + i * mem->child_nslot;
+			swiotlb_init_io_tlb_mem(&mem->child[i],
+				start + ((i * mem->child_nslot) << IO_TLB_SHIFT),
+				mem->child_nslot, late_alloc);
+		}
+	}
+
 	for (i = 0; i < mem->nslabs; i++) {
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
@@ -336,16 +355,18 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
-	if (!mem->slots) {
-		free_pages((unsigned long)vstart, order);
-		return -ENOMEM;
-	}
+	if (!mem->slots)
+		goto error_slots;
 
 	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
 
 	swiotlb_print_info();
 	return 0;
+
+error_slots:
+	free_pages((unsigned long)vstart, order);
+	return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
@@ -483,10 +504,11 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -565,6 +587,46 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	return index;
 }
 
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *child_mem = mem;
+	int start = 0, i = 0, index;
+
+	if (mem->num_child) {
+		i = start = mem->child_start;
+		mem->child_start = (mem->child_start + 1) % mem->num_child;
+		child_mem = mem->child;
+	}
+
+	do {
+		index = swiotlb_do_find_slots(child_mem + i, dev, orig_addr,
+					      alloc_size, alloc_align_mask);
+		if (index >= 0)
+			return i * mem->child_nslot + index;
+		if (++i >= mem->num_child)
+			i = 0;
+	} while (i != start);
+
+	return -1;
+}
+
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	if (mem->num_child) {
+		for (i = 0; i < mem->num_child; i++)
+			used += mem->child[i].used;
+	} else {
+		used = mem->used;
+	}
+
+	return used;
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -594,7 +656,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				     alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -617,9 +679,9 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	return tlb_addr;
 }
 
-static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+static void swiotlb_do_release_slots(struct io_tlb_mem *mem,
+				     struct device *dev, phys_addr_t tlb_addr)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long flags;
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -660,6 +722,20 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	spin_unlock_irqrestore(&mem->lock, flags);
 }
 
+static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int index, offset;
+
+	if (mem->num_child) {
+		offset = swiotlb_align_offset(dev, tlb_addr);	
+		index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
+		mem = &mem->child[index / mem->child_nslot];
+	}
+
+	swiotlb_do_release_slots(mem, dev, tlb_addr);
+}
+
 /*
  * tlb_addr is the physical address of the bounce buffer to unmap.
  */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH] swiotlb: Add Child IO TLB mem support
@ 2022-04-29 14:21       ` Tianyu Lan
  0 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-29 14:21 UTC (permalink / raw)
  To: hch, m.szyprowski, robin.murphy, michael.h.kelley, kys
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov

From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patch adds child IO TLB mem support to resolve spinlock overhead
among a device's queues. Each device may allocate an IO tlb mem and set
up child IO TLB mems according to its queue number. The swiotlb code
then allocates bounce buffers from the child IO tlb mems iteratively.

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
---
 include/linux/swiotlb.h |  7 +++
 kernel/dma/swiotlb.c    | 96 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..4a3f6a7b4b7e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -89,6 +89,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:	%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @child_nslot: The number of IO TLB slots in each child IO TLB mem.
+ * @num_child:  The number of child io tlb mems in the pool.
+ * @child_start: The child index to start searching in the next round.
  */
 struct io_tlb_mem {
 	phys_addr_t start;
@@ -102,6 +105,10 @@ struct io_tlb_mem {
 	bool late_alloc;
 	bool force_bounce;
 	bool for_alloc;
+	unsigned int num_child;
+	unsigned int child_nslot;
+	unsigned int child_start;
+	struct io_tlb_mem *child;
 	struct io_tlb_slot {
 		phys_addr_t orig_addr;
 		size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..382fa2288645 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -207,6 +207,25 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
 		mem->force_bounce = true;
 
 	spin_lock_init(&mem->lock);
+
+	if (mem->num_child) {
+		mem->child_nslot = nslabs / mem->num_child;
+		mem->child_start = 0;
+
+		/*
+		 * Initialize the child IO TLB mems, dividing the IO TLB
+		 * pool evenly among them. Reuse the parent's mem->slots
+		 * array for the child mem->slots.
+		 */
+		for (i = 0; i < mem->num_child; i++) {
+			mem->child[i].num_child = 0;	/* a child has no children of its own */
+			mem->child[i].slots = mem->slots + i * mem->child_nslot;
+			swiotlb_init_io_tlb_mem(&mem->child[i],
+				start + ((i * mem->child_nslot) << IO_TLB_SHIFT),
+				mem->child_nslot, late_alloc);
+		}
+	}
+
 	for (i = 0; i < mem->nslabs; i++) {
 		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
@@ -336,16 +355,18 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 
 	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
 		get_order(array_size(sizeof(*mem->slots), nslabs)));
-	if (!mem->slots) {
-		free_pages((unsigned long)vstart, order);
-		return -ENOMEM;
-	}
+	if (!mem->slots)
+		goto error_slots;
 
 	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
 	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
 
 	swiotlb_print_info();
 	return 0;
+
+error_slots:
+	free_pages((unsigned long)vstart, order);
+	return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
@@ -483,10 +504,11 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
-			      size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+				 struct device *dev, phys_addr_t orig_addr,
+				 size_t alloc_size,
+				 unsigned int alloc_align_mask)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -565,6 +587,46 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 	return index;
 }
 
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+			      size_t alloc_size, unsigned int alloc_align_mask)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *child_mem = mem;
+	int start = 0, i = 0, index;
+
+	if (mem->num_child) {
+		i = start = mem->child_start;
+		mem->child_start = (mem->child_start + 1) % mem->num_child;
+		child_mem = mem->child;
+	}
+
+	do {
+		index = swiotlb_do_find_slots(child_mem + i, dev, orig_addr,
+					      alloc_size, alloc_align_mask);
+		if (index >= 0)
+			return i * mem->child_nslot + index;
+		if (++i >= mem->num_child)
+			i = 0;
+	} while (i != start);
+
+	return -1;
+}
+
+static unsigned long mem_used(struct io_tlb_mem *mem)
+{
+	int i;
+	unsigned long used = 0;
+
+	if (mem->num_child) {
+		for (i = 0; i < mem->num_child; i++)
+			used += mem->child[i].used;
+	} else {
+		used = mem->used;
+	}
+
+	return used;
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -594,7 +656,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-				 alloc_size, mem->nslabs, mem->used);
+				     alloc_size, mem->nslabs, mem_used(mem));
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
@@ -617,9 +679,9 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	return tlb_addr;
 }
 
-static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+static void swiotlb_do_release_slots(struct io_tlb_mem *mem,
+				     struct device *dev, phys_addr_t tlb_addr)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned long flags;
 	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
 	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
@@ -660,6 +722,20 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 	spin_unlock_irqrestore(&mem->lock, flags);
 }
 
+static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	int index, offset;
+
+	if (mem->num_child) {
+		offset = swiotlb_align_offset(dev, tlb_addr);	
+		index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
+		mem = &mem->child[index / mem->child_nslot];
+	}
+
+	swiotlb_do_release_slots(mem, dev, tlb_addr);
+}
+
 /*
  * tlb_addr is the physical address of the bounce buffer to unmap.
  */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH] swiotlb: Add Child IO TLB mem support
  2022-04-29 14:21       ` Tianyu Lan
@ 2022-04-29 14:25         ` Tianyu Lan
  -1 siblings, 0 replies; 33+ messages in thread
From: Tianyu Lan @ 2022-04-29 14:25 UTC (permalink / raw)
  To: hch, robin.murphy
  Cc: Tianyu Lan, iommu, linux-kernel, vkuznets, brijesh.singh,
	konrad.wilk, hch, wei.liu, parri.andrea, thomas.lendacky,
	linux-hyperv, andi.kleen, kirill.shutemov, m.szyprowski,
	michael.h.kelley, kys

On 4/29/2022 10:21 PM, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> Traditionally swiotlb was not performance critical because it was only
> used for slow devices. But in some setups, like TDX/SEV confidential
> guests, all IO has to go through swiotlb. Currently swiotlb only has a
> single lock. Under high IO load with multiple CPUs this can lead to
> significant lock contention on the swiotlb lock.
> 
> This patch adds child IO TLB mem support to resolve spinlock overhead
> among a device's queues. Each device may allocate an IO tlb mem and set
> up child IO TLB mems according to its queue number. The swiotlb code
> then allocates bounce buffers from the child IO tlb mems iteratively.
> 

Hi Robin and Christoph:
       According to Robin's idea, I drafted this patch. Please have a
look and check whether it is the right direction.

Thanks.

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>   include/linux/swiotlb.h |  7 +++
>   kernel/dma/swiotlb.c    | 96 ++++++++++++++++++++++++++++++++++++-----
>   2 files changed, 93 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 7ed35dd3de6e..4a3f6a7b4b7e 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -89,6 +89,9 @@ extern enum swiotlb_force swiotlb_force;
>    * @late_alloc:	%true if allocated using the page allocator
>    * @force_bounce: %true if swiotlb bouncing is forced
>    * @for_alloc:  %true if the pool is used for memory allocation
> + * @child_nslot: The number of IO TLB slots in each child IO TLB mem.
> + * @num_child:  The number of child io tlb mems in the pool.
> + * @child_start: The child index to start searching in the next round.
>    */
>   struct io_tlb_mem {
>   	phys_addr_t start;
> @@ -102,6 +105,10 @@ struct io_tlb_mem {
>   	bool late_alloc;
>   	bool force_bounce;
>   	bool for_alloc;
> +	unsigned int num_child;
> +	unsigned int child_nslot;
> +	unsigned int child_start;
> +	struct io_tlb_mem *child;
>   	struct io_tlb_slot {
>   		phys_addr_t orig_addr;
>   		size_t alloc_size;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index e2ef0864eb1e..382fa2288645 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -207,6 +207,25 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>   		mem->force_bounce = true;
>   
>   	spin_lock_init(&mem->lock);
> +
> +	if (mem->num_child) {
> +		mem->child_nslot = nslabs / mem->num_child;
> +		mem->child_start = 0;
> +
> +		/*
> +		 * Initialize the child IO TLB mems, dividing the
> +		 * parent IO TLB pool evenly among them. Each child
> +		 * reuses a slice of the parent's mem->slots array.
> +		 */
> +		for (i = 0; i < mem->num_child; i++) {
> +			mem->child[i].num_child = 0;
> +			mem->child[i].slots = mem->slots + i * mem->child_nslot;
> +			swiotlb_init_io_tlb_mem(&mem->child[i],
> +				start + ((i * mem->child_nslot) << IO_TLB_SHIFT),
> +				mem->child_nslot, late_alloc);
> +		}
> +	}
> +
>   	for (i = 0; i < mem->nslabs; i++) {
>   		mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
>   		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
> @@ -336,16 +355,18 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   
>   	mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>   		get_order(array_size(sizeof(*mem->slots), nslabs)));
> -	if (!mem->slots) {
> -		free_pages((unsigned long)vstart, order);
> -		return -ENOMEM;
> -	}
> +	if (!mem->slots)
> +		goto error_slots;
>   
>   	set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
>   	swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
>   
>   	swiotlb_print_info();
>   	return 0;
> +
> +error_slots:
> +	free_pages((unsigned long)vstart, order);
> +	return -ENOMEM;
>   }
>   
>   void __init swiotlb_exit(void)
> @@ -483,10 +504,11 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
>    * Find a suitable number of IO TLB entries size that will fit this request and
>    * allocate a buffer from that IO TLB pool.
>    */
> -static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> -			      size_t alloc_size, unsigned int alloc_align_mask)
> +static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
> +				 struct device *dev, phys_addr_t orig_addr,
> +				 size_t alloc_size,
> +				 unsigned int alloc_align_mask)
>   {
> -	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>   	unsigned long boundary_mask = dma_get_seg_boundary(dev);
>   	dma_addr_t tbl_dma_addr =
>   		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
> @@ -565,6 +587,46 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>   	return index;
>   }
>   
> +static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> +			      size_t alloc_size, unsigned int alloc_align_mask)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	struct io_tlb_mem *child_mem = mem;
> +	int start = 0, i = 0, index;
> +
> +	if (mem->num_child) {
> +		i = start = mem->child_start;
> +		mem->child_start = (mem->child_start + 1) % mem->num_child;
> +		child_mem = mem->child;
> +	}
> +
> +	do {
> +		index = swiotlb_do_find_slots(child_mem + i, dev, orig_addr,
> +					      alloc_size, alloc_align_mask);
> +		if (index >= 0)
> +			return i * mem->child_nslot + index;
> +		if (++i >= mem->num_child)
> +			i = 0;
> +	} while (i != start);
> +
> +	return -1;
> +}
> +
> +static unsigned long mem_used(struct io_tlb_mem *mem)
> +{
> +	int i;
> +	unsigned long used = 0;
> +
> +	if (mem->num_child) {
> +		for (i = 0; i < mem->num_child; i++)
> +			used += mem->child[i].used;
> +	} else {
> +		used = mem->used;
> +	}
> +
> +	return used;
> +}
> +
>   phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   		size_t mapping_size, size_t alloc_size,
>   		unsigned int alloc_align_mask, enum dma_data_direction dir,
> @@ -594,7 +656,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   		if (!(attrs & DMA_ATTR_NO_WARN))
>   			dev_warn_ratelimited(dev,
>   	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
> -				 alloc_size, mem->nslabs, mem->used);
> +				     alloc_size, mem->nslabs, mem_used(mem));
>   		return (phys_addr_t)DMA_MAPPING_ERROR;
>   	}
>   
> @@ -617,9 +679,9 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   	return tlb_addr;
>   }
>   
> -static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
> +static void swiotlb_do_release_slots(struct io_tlb_mem *mem,
> +				     struct device *dev, phys_addr_t tlb_addr)
>   {
> -	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
>   	unsigned long flags;
>   	unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
>   	int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
> @@ -660,6 +722,20 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
>   	spin_unlock_irqrestore(&mem->lock, flags);
>   }
>   
> +static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
> +{
> +	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +	int index, offset;
> +
> +	if (mem->num_child) {
> +		offset = swiotlb_align_offset(dev, tlb_addr);	
> +		index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
> +		mem = &mem->child[index / mem->child_nslot];
> +	}
> +
> +	swiotlb_do_release_slots(mem, dev, tlb_addr);
> +}
> +
>   /*
>    * tlb_addr is the physical address of the bounce buffer to unmap.
>    */
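
As a rough sizing note (assuming the usual 64 MB default swiotlb, i.e.
32768 slots of 2 KB, which this patch does not change), the split in
swiotlb_init_io_tlb_mem() gives each child pool nslabs / num_child slots:

#include <stdio.h>

#define IO_TLB_SHIFT	11			/* 2 KB per IO TLB slot */

int main(void)
{
	unsigned long nslabs = 32768;		/* assumed 64 MB default pool */
	unsigned int num_child = 8;		/* e.g. one child per queue */
	unsigned long child_nslot = nslabs / num_child;

	/* 4096 slots, i.e. 8 MB of bounce space behind each child lock. */
	printf("child_nslot = %lu slots (%lu MB per child)\n",
	       child_nslot, (child_nslot << IO_TLB_SHIFT) >> 20);
	return 0;
}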

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2022-04-29 14:26 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-28 14:14 [RFC PATCH 0/2] swiotlb: Introduce swiotlb device allocation function Tianyu Lan
2022-04-28 14:14 ` Tianyu Lan
2022-04-28 14:14 ` [RFC PATCH 1/2] swiotlb: Split up single swiotlb lock Tianyu Lan
2022-04-28 14:14   ` Tianyu Lan
2022-04-28 14:44   ` Robin Murphy
2022-04-28 14:44     ` Robin Murphy
2022-04-28 14:45     ` Christoph Hellwig
2022-04-28 14:45       ` Christoph Hellwig
2022-04-28 14:55       ` Andi Kleen
2022-04-28 14:55         ` Andi Kleen
2022-04-28 15:05         ` Christoph Hellwig
2022-04-28 15:05           ` Christoph Hellwig
2022-04-28 15:16           ` Andi Kleen
2022-04-28 15:16             ` Andi Kleen
2022-04-28 15:07         ` Robin Murphy
2022-04-28 15:07           ` Robin Murphy
2022-04-28 16:02           ` Andi Kleen
2022-04-28 16:02             ` Andi Kleen
2022-04-28 16:59             ` Robin Murphy
2022-04-28 16:59               ` Robin Murphy
2022-04-28 14:56       ` Robin Murphy
2022-04-28 14:56         ` Robin Murphy
2022-04-28 15:54     ` Tianyu Lan
2022-04-28 15:54       ` Tianyu Lan
2022-04-29 14:21     ` [RFC PATCH] swiotlb: Add Child IO TLB mem support Tianyu Lan
2022-04-29 14:21       ` Tianyu Lan
2022-04-29 14:25       ` Tianyu Lan
2022-04-29 14:25         ` Tianyu Lan
2022-04-28 14:14 ` [RFC PATCH 2/2] Swiotlb: Add device bounce buffer allocation interface Tianyu Lan
2022-04-28 14:14   ` Tianyu Lan
2022-04-28 15:50   ` Tianyu Lan
2022-04-28 15:50     ` Tianyu Lan
2022-04-28 17:16   ` kernel test robot
