* [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible
@ 2024-01-26 13:54 Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

The series grew from Eric's idea and patch at [0]. The idea of using the
shortcut for direct DMA as well belongs to Chris.

When an architecture doesn't need DMA synchronization and the buffer is
not an SWIOTLB buffer, most of the time the kernel and the drivers end
up calling DMA sync operations for nothing.
Even when DMA is direct, this involves a fairly long non-inline call
ladder and eats a bunch of CPU time. With an IOMMU, it results in
indirect calls on the hotpath just to check what is already known and
return.
XSk has been using a custom shortcut for that for quite some time.
I recently wanted to introduce a similar one for Page Pool. Let's combine
all this into one generic shortcut, which covers all DMA sync ops
and all types of DMA (direct, IOMMU, ...).

* #1 adds stub inlines to be able to skip DMA sync ops or even compile
     them out when not needed.
* #2 adds the generic shortcut and enables it for direct DMA.
* #3 adds the ability to skip DMA syncs behind an IOMMU.
* #4-5 are just cleanups for Page Pool to avoid merge conflicts in the future.
* #6 checks for the shortcut as early as possible in the Page Pool code to
     make sure no cycles are wasted.
* #7 replaces XSk's shortcut with the generic one.

On a 100G NIC, the result is +3-5% for direct DMA and +10-11% for IOMMU.
As a bonus, the XSk core now allows batched buffer allocations for IOMMU
setups.
If the shortcut is not available on some system, there should be no
visible performance regressions.

[0] https://lore.kernel.org/netdev/20221115182841.2640176-1-edumazet@google.com
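
For reference, from a caller's point of view the shortcut boils down to
roughly the following (a simplified sketch of the wrappers from patches 1-2,
not the exact kernel code, which uses a macro):

/* Sketch only: when CONFIG_DMA_NEED_SYNC=n, dma_skip_sync() is constant
 * true and the whole call compiles out; with it enabled, it becomes a
 * cheap per-device flag test (set up at probe time in patch 2).
 */
static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
					   size_t size, enum dma_data_direction dir)
{
	if (!dma_skip_sync(dev))
		__dma_sync_single_for_cpu(dev, addr, size, dir);
}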

Alexander Lobakin (5):
  dma: compile-out DMA sync op calls when not used
  page_pool: make sure frag API fields don't span between cachelines
  page_pool: don't use driver-set flags field directly
  page_pool: check for DMA sync shortcut earlier
  xsk: use generic DMA sync shortcut instead of a custom one

Eric Dumazet (2):
  dma: avoid expensive redundant calls for sync operations
  iommu/dma: avoid expensive indirect calls for sync operations

 kernel/dma/Kconfig                            |   4 +
 include/net/page_pool/types.h                 |  21 +++-
 include/linux/device.h                        |   5 +
 include/linux/dma-map-ops.h                   |  17 +++
 include/linux/dma-mapping.h                   | 100 +++++++++++++-----
 include/net/xdp_sock_drv.h                    |   7 +-
 include/net/xsk_buff_pool.h                   |  13 +--
 drivers/base/dd.c                             |   2 +
 drivers/iommu/dma-iommu.c                     |   1 +
 drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
 .../net/ethernet/freescale/dpaa2/dpaa2-xsk.c  |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |   2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |   2 +-
 drivers/net/ethernet/intel/igc/igc_main.c     |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |   2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   2 +-
 drivers/net/ethernet/netronome/nfp/nfd3/xsk.c |   2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |   2 +-
 kernel/dma/mapping.c                          |  60 ++++++++---
 kernel/dma/swiotlb.c                          |  14 +++
 net/core/page_pool.c                          |  67 +++++++-----
 net/xdp/xsk_buff_pool.c                       |  29 +----
 23 files changed, 237 insertions(+), 125 deletions(-)

-- 
2.43.0



* [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-29  6:11   ` Christoph Hellwig
  2024-01-31 16:52   ` Simon Horman
  2024-01-26 13:54 ` [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations Alexander Lobakin
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

Some platforms do have DMA, but DMA there is always direct and coherent.
Currently, even on such platforms, DMA sync operations are compiled in
and called.
Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
either the arch sync operations are needed, or DMA ops or SWIOTLB is
enabled. Define dma_need_sync() and dma_skip_sync() (a stub for now)
depending on this symbol's state and don't call the sync ops when
dma_skip_sync() is true.
The change allows for future optimizations of DMA sync calls depending
on compile-time or runtime conditions.
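
For instance, a typical Rx-path sync in a driver (a hypothetical snippet,
rx_buf and len are illustrative only) now expands to an empty statement
when CONFIG_DMA_NEED_SYNC ends up disabled, and stays an out-of-line call
otherwise:

	/* no-op at compile time with CONFIG_DMA_NEED_SYNC=n, unchanged otherwise */
	dma_sync_single_for_cpu(dev, rx_buf->dma, len, DMA_FROM_DEVICE);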

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 kernel/dma/Kconfig          |  4 ++
 include/linux/dma-mapping.h | 92 +++++++++++++++++++++++++------------
 kernel/dma/mapping.c        | 26 ++++++-----
 3 files changed, 81 insertions(+), 41 deletions(-)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index d62f5957f36b..1c9ff05b1ecb 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -107,6 +107,10 @@ config DMA_BOUNCE_UNALIGNED_KMALLOC
 	bool
 	depends on SWIOTLB
 
+config DMA_NEED_SYNC
+	def_bool ARCH_HAS_SYNC_DMA_FOR_DEVICE || ARCH_HAS_SYNC_DMA_FOR_CPU || \
+		 ARCH_HAS_SYNC_DMA_FOR_CPU_ALL || DMA_OPS || SWIOTLB
+
 config DMA_RESTRICTED_POOL
 	bool "DMA Restricted Pool"
 	depends on OF && OF_RESERVED_MEM && SWIOTLB
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4a658de44ee9..9dd7e1578bf6 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -117,14 +117,14 @@ dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
 		size_t size, enum dma_data_direction dir, unsigned long attrs);
 void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs);
-void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
-		enum dma_data_direction dir);
-void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
-		size_t size, enum dma_data_direction dir);
-void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
-		    int nelems, enum dma_data_direction dir);
-void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
-		       int nelems, enum dma_data_direction dir);
+void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
+			       size_t size, enum dma_data_direction dir);
+void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
+				  size_t size, enum dma_data_direction dir);
+void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+			   int nelems, enum dma_data_direction dir);
+void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+			      int nelems, enum dma_data_direction dir);
 void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
 		gfp_t flag, unsigned long attrs);
 void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr,
@@ -147,7 +147,6 @@ u64 dma_get_required_mask(struct device *dev);
 bool dma_addressing_limited(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
 size_t dma_opt_mapping_size(struct device *dev);
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
 		enum dma_data_direction dir, gfp_t gfp, unsigned long attrs);
@@ -195,20 +194,24 @@ static inline void dma_unmap_resource(struct device *dev, dma_addr_t addr,
 		size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
 }
-static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
-		size_t size, enum dma_data_direction dir)
+static inline void __dma_sync_single_for_cpu(struct device *dev,
+					     dma_addr_t addr, size_t size,
+					     enum dma_data_direction dir)
 {
 }
-static inline void dma_sync_single_for_device(struct device *dev,
-		dma_addr_t addr, size_t size, enum dma_data_direction dir)
+static inline void __dma_sync_single_for_device(struct device *dev,
+						dma_addr_t addr, size_t size,
+						enum dma_data_direction dir)
 {
 }
-static inline void dma_sync_sg_for_cpu(struct device *dev,
-		struct scatterlist *sg, int nelems, enum dma_data_direction dir)
+static inline void __dma_sync_sg_for_cpu(struct device *dev,
+					 struct scatterlist *sg, int nelems,
+					 enum dma_data_direction dir)
 {
 }
-static inline void dma_sync_sg_for_device(struct device *dev,
-		struct scatterlist *sg, int nelems, enum dma_data_direction dir)
+static inline void __dma_sync_sg_for_device(struct device *dev,
+					    struct scatterlist *sg, int nelems,
+					    enum dma_data_direction dir)
 {
 }
 static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
@@ -277,10 +280,6 @@ static inline size_t dma_opt_mapping_size(struct device *dev)
 {
 	return 0;
 }
-static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
-{
-	return false;
-}
 static inline unsigned long dma_get_merge_boundary(struct device *dev)
 {
 	return 0;
@@ -348,20 +347,55 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
 	return dma_unmap_page_attrs(dev, addr, size, dir, attrs);
 }
 
-static inline void dma_sync_single_range_for_cpu(struct device *dev,
-		dma_addr_t addr, unsigned long offset, size_t size,
-		enum dma_data_direction dir)
+static inline void
+__dma_sync_single_range_for_cpu(struct device *dev, dma_addr_t addr,
+				unsigned long offset, size_t size,
+				enum dma_data_direction dir)
 {
-	return dma_sync_single_for_cpu(dev, addr + offset, size, dir);
+	__dma_sync_single_for_cpu(dev, addr + offset, size, dir);
 }
 
-static inline void dma_sync_single_range_for_device(struct device *dev,
-		dma_addr_t addr, unsigned long offset, size_t size,
-		enum dma_data_direction dir)
+static inline void
+__dma_sync_single_range_for_device(struct device *dev, dma_addr_t addr,
+				   unsigned long offset, size_t size,
+				   enum dma_data_direction dir)
 {
-	return dma_sync_single_for_device(dev, addr + offset, size, dir);
+	__dma_sync_single_for_device(dev, addr + offset, size, dir);
 }
 
+#ifdef CONFIG_DMA_NEED_SYNC
+
+#define dma_skip_sync(dev)			false
+
+bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+
+#else /* !CONFIG_DMA_NEED_SYNC */
+
+#define dma_skip_sync(dev)			true
+#define dma_need_sync(dev, dma_addr)		false
+
+#endif /* !CONFIG_DMA_NEED_SYNC */
+
+#define dma_check_sync(op, dev, ...)					  \
+	do {								  \
+		if (!dma_skip_sync(dev))				  \
+			op(dev, __VA_ARGS__);				  \
+	} while (0)
+
+#define dma_sync_single_for_cpu(d, a, s, r)				  \
+	dma_check_sync(__dma_sync_single_for_cpu, d, a, s, r)
+#define dma_sync_single_for_device(d, a, s, r)				  \
+	dma_check_sync(__dma_sync_single_for_device, d, a, s, r)
+#define dma_sync_sg_for_cpu(d, s, n, r)					  \
+	dma_check_sync(__dma_sync_sg_for_cpu, d, s, n, r)
+#define dma_sync_sg_for_device(d, s, n, r)				  \
+	dma_check_sync(__dma_sync_sg_for_device, d, s, n, r)
+
+#define dma_sync_single_range_for_cpu(d, a, o, s, r)			  \
+	dma_check_sync(__dma_sync_single_range_for_cpu, d, a, o, s, r)
+#define dma_sync_single_range_for_device(d, a, o, s, r)			  \
+	dma_check_sync(__dma_sync_single_range_for_device, d, a, o, s, r)
+
 /**
  * dma_unmap_sgtable - Unmap the given buffer for DMA
  * @dev:	The device for which to perform the DMA operation
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 58db8fd70471..a30f37f9d4db 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -329,8 +329,8 @@ void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
 }
 EXPORT_SYMBOL(dma_unmap_resource);
 
-void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
-		enum dma_data_direction dir)
+void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
+			       size_t size, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
@@ -341,10 +341,10 @@ void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		ops->sync_single_for_cpu(dev, addr, size, dir);
 	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
 }
-EXPORT_SYMBOL(dma_sync_single_for_cpu);
+EXPORT_SYMBOL(__dma_sync_single_for_cpu);
 
-void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
-		size_t size, enum dma_data_direction dir)
+void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
+				  size_t size, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
@@ -355,10 +355,10 @@ void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
 		ops->sync_single_for_device(dev, addr, size, dir);
 	debug_dma_sync_single_for_device(dev, addr, size, dir);
 }
-EXPORT_SYMBOL(dma_sync_single_for_device);
+EXPORT_SYMBOL(__dma_sync_single_for_device);
 
-void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
-		    int nelems, enum dma_data_direction dir)
+void __dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+			   int nelems, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
@@ -369,10 +369,10 @@ void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
 		ops->sync_sg_for_cpu(dev, sg, nelems, dir);
 	debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir);
 }
-EXPORT_SYMBOL(dma_sync_sg_for_cpu);
+EXPORT_SYMBOL(__dma_sync_sg_for_cpu);
 
-void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
-		       int nelems, enum dma_data_direction dir)
+void __dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
+			      int nelems, enum dma_data_direction dir)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
@@ -383,7 +383,7 @@ void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
 		ops->sync_sg_for_device(dev, sg, nelems, dir);
 	debug_dma_sync_sg_for_device(dev, sg, nelems, dir);
 }
-EXPORT_SYMBOL(dma_sync_sg_for_device);
+EXPORT_SYMBOL(__dma_sync_sg_for_device);
 
 /*
  * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
@@ -841,6 +841,7 @@ size_t dma_opt_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
 
+#ifdef CONFIG_DMA_NEED_SYNC
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
@@ -850,6 +851,7 @@ bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 	return ops->sync_single_for_cpu || ops->sync_single_for_device;
 }
 EXPORT_SYMBOL_GPL(dma_need_sync);
+#endif /* CONFIG_DMA_NEED_SYNC */
 
 unsigned long dma_get_merge_boundary(struct device *dev)
 {
-- 
2.43.0



* [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-26 15:48   ` Robin Murphy
  2024-01-26 13:54 ` [PATCH net-next 3/7] iommu/dma: avoid expensive indirect " Alexander Lobakin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

From: Eric Dumazet <edumazet@google.com>

Quite often, NIC devices do not need dma_sync operations on x86_64
at least.
Indeed, when dev_is_dma_coherent(dev) is true and
dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
and friends do nothing.

However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.

Add a new dev->dma_skip_sync boolean, which is set during device
initialization depending on the setup: dev_is_dma_coherent() for direct
DMA, !(sync_single_for_device || sync_single_for_cpu), or a positive
result from the new dma_map_ops::can_skip_sync callback for non-NULL
DMA ops. Later, if/when SWIOTLB is used for the first time, the flag
is turned off, from swiotlb_tbl_map_single().

On iavf, a UDP trafficgen test with XDP_DROP in skb mode shows a
+3-5% improvement for direct DMA.
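
The probe-time decision boils down to roughly this (a condensed sketch of
dma_setup_skip_sync() below, with dma_map_direct() simplified to a NULL-ops
check):

static void setup_skip_sync_sketch(struct device *dev)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);
	bool skip;

	if (!ops)			/* direct DMA */
		skip = dev_is_dma_coherent(dev);
	else if (!ops->sync_single_for_device && !ops->sync_single_for_cpu)
		skip = true;		/* these ops never sync anyway */
	else
		skip = ops->can_skip_sync && ops->can_skip_sync(dev);

	dma_set_skip_sync(dev, skip);
	/* swiotlb_tbl_map_single() clears the flag later if it ever bounces */
}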

Signed-off-by: Eric Dumazet <edumazet@google.com>
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/linux/device.h      |  5 +++++
 include/linux/dma-map-ops.h | 17 +++++++++++++++++
 include/linux/dma-mapping.h | 12 ++++++++++--
 drivers/base/dd.c           |  2 ++
 kernel/dma/mapping.c        | 34 +++++++++++++++++++++++++++++++---
 kernel/dma/swiotlb.c        | 14 ++++++++++++++
 6 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 97c4b046c09d..f23e6a32bea0 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -686,6 +686,8 @@ struct device_physical_location {
  *		other devices probe successfully.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @dma_skip_sync: DMA sync operations can be skipped for coherent non-SWIOTLB
+ *		buffers.
  * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
  *		streaming DMA operations (->map_* / ->unmap_* / ->sync_*),
  *		and optionall (if the coherent mask is large enough) also
@@ -800,6 +802,9 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+#ifdef CONFIG_DMA_NEED_SYNC
+	bool			dma_skip_sync:1;
+#endif
 #ifdef CONFIG_DMA_OPS_BYPASS
 	bool			dma_ops_bypass : 1;
 #endif
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..937c295e9da8 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -78,6 +78,7 @@ struct dma_map_ops {
 			int nents, enum dma_data_direction dir);
 	void (*cache_sync)(struct device *dev, void *vaddr, size_t size,
 			enum dma_data_direction direction);
+	bool (*can_skip_sync)(struct device *dev);
 	int (*dma_supported)(struct device *dev, u64 mask);
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
@@ -111,6 +112,22 @@ static inline void set_dma_ops(struct device *dev,
 }
 #endif /* CONFIG_DMA_OPS */
 
+#ifdef CONFIG_DMA_NEED_SYNC
+
+static inline void dma_set_skip_sync(struct device *dev, bool skip)
+{
+	dev->dma_skip_sync = skip;
+}
+
+void dma_setup_skip_sync(struct device *dev);
+
+#else /* !CONFIG_DMA_NEED_SYNC */
+
+#define dma_set_skip_sync(dev, skip)		do { } while (0)
+#define dma_setup_skip_sync(dev)		do { } while (0)
+
+#endif /* !CONFIG_DMA_NEED_SYNC */
+
 #ifdef CONFIG_DMA_CMA
 extern struct cma *dma_contiguous_default_area;
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 9dd7e1578bf6..bc9f67e0c139 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -365,9 +365,17 @@ __dma_sync_single_range_for_device(struct device *dev, dma_addr_t addr,
 
 #ifdef CONFIG_DMA_NEED_SYNC
 
-#define dma_skip_sync(dev)			false
+static inline bool dma_skip_sync(const struct device *dev)
+{
+	return dev->dma_skip_sync;
+}
+
+bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
+static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+{
+	return dma_skip_sync(dev) ? false : __dma_need_sync(dev, dma_addr);
+}
 
 #else /* !CONFIG_DMA_NEED_SYNC */
 
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 85152537dbf1..67ad3e1d51f6 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -642,6 +642,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 			goto pinctrl_bind_failed;
 	}
 
+	dma_setup_skip_sync(dev);
+
 	ret = driver_sysfs_add(dev);
 	if (ret) {
 		pr_err("%s: driver_sysfs_add(%s) failed\n",
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index a30f37f9d4db..8fa464b3954e 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -842,15 +842,43 @@ size_t dma_opt_mapping_size(struct device *dev)
 EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
 
 #ifdef CONFIG_DMA_NEED_SYNC
-bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
+bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	if (dma_map_direct(dev, ops))
+		/*
+		 * dma_skip_sync could've been set to false on first SWIOTLB
+		 * buffer mapping, but @dma_addr is not necessarily an SWIOTLB
+		 * buffer. In this case, fall back to more granular check.
+		 */
 		return dma_direct_need_sync(dev, dma_addr);
-	return ops->sync_single_for_cpu || ops->sync_single_for_device;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(__dma_need_sync);
+
+void dma_setup_skip_sync(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+	bool skip;
+
+	if (dma_map_direct(dev, ops))
+		/*
+		 * dma_skip_sync will be set to false on first SWIOTLB buffer
+		 * mapping, if any. During the device initialization, it's
+		 * enough to check only for DMA coherence.
+		 */
+		skip = dev_is_dma_coherent(dev);
+	else if (!ops->sync_single_for_device && !ops->sync_single_for_cpu)
+		skip = true;
+	else if (ops->can_skip_sync)
+		skip = ops->can_skip_sync(dev);
+	else
+		skip = false;
+
+	dma_set_skip_sync(dev, skip);
 }
-EXPORT_SYMBOL_GPL(dma_need_sync);
 #endif /* CONFIG_DMA_NEED_SYNC */
 
 unsigned long dma_get_merge_boundary(struct device *dev)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b079a9a8e087..b62ea0a4f106 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1286,6 +1286,16 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
 
 #endif /* CONFIG_DEBUG_FS */
 
+static inline void swiotlb_disable_dma_skip_sync(struct device *dev)
+{
+	/*
+	 * If dma_skip_sync was set, reset it to false on first SWIOTLB buffer
+	 * mapping/allocation to always sync SWIOTLB buffers.
+	 */
+	if (unlikely(dma_skip_sync(dev)))
+		dma_set_skip_sync(dev, false);
+}
+
 phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		unsigned int alloc_align_mask, enum dma_data_direction dir,
@@ -1323,6 +1333,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
+	swiotlb_disable_dma_skip_sync(dev);
+
 	/*
 	 * Save away the mapping from the original address to the DMA address.
 	 * This is needed when we sync the memory.  Then we sync the buffer if
@@ -1640,6 +1652,8 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
 	if (index == -1)
 		return NULL;
 
+	swiotlb_disable_dma_skip_sync(dev);
+
 	tlb_addr = slot_addr(pool->start, index);
 
 	return pfn_to_page(PFN_DOWN(tlb_addr));
-- 
2.43.0



* [PATCH net-next 3/7] iommu/dma: avoid expensive indirect calls for sync operations
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 4/7] page_pool: make sure frag API fields don't span between cachelines Alexander Lobakin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

From: Eric Dumazet <edumazet@google.com>

Use the new dma_map_ops::can_skip_sync() callback in IOMMU DMA. It is
enough to check only for DMA coherence there, as SWIOTLB is checked
dynamically later.

perf profile before the patch:

    18.53%  [kernel]       [k] gq_rx_skb
    14.77%  [kernel]       [k] napi_reuse_skb
     8.95%  [kernel]       [k] skb_release_data
     5.42%  [kernel]       [k] dev_gro_receive
     5.37%  [kernel]       [k] memcpy
<*>  5.26%  [kernel]       [k] iommu_dma_sync_sg_for_cpu
     4.78%  [kernel]       [k] tcp_gro_receive
<*>  4.42%  [kernel]       [k] iommu_dma_sync_sg_for_device
     4.12%  [kernel]       [k] ipv6_gro_receive
     3.65%  [kernel]       [k] gq_pool_get
     3.25%  [kernel]       [k] skb_gro_receive
     2.07%  [kernel]       [k] napi_gro_frags
     1.98%  [kernel]       [k] tcp6_gro_receive
     1.27%  [kernel]       [k] gq_rx_prep_buffers
     1.18%  [kernel]       [k] gq_rx_napi_handler
     0.99%  [kernel]       [k] csum_partial
     0.74%  [kernel]       [k] csum_ipv6_magic
     0.72%  [kernel]       [k] free_pcp_prepare
     0.60%  [kernel]       [k] __napi_poll
     0.58%  [kernel]       [k] net_rx_action
     0.56%  [kernel]       [k] read_tsc
<*>  0.50%  [kernel]       [k] __x86_indirect_thunk_r11
     0.45%  [kernel]       [k] memset

After the patch, the lines marked with <*> no longer show up, and
overall CPU usage looks much better (~60% instead of ~72%):

    25.56%  [kernel]       [k] gq_rx_skb
     9.90%  [kernel]       [k] napi_reuse_skb
     7.39%  [kernel]       [k] dev_gro_receive
     6.78%  [kernel]       [k] memcpy
     6.53%  [kernel]       [k] skb_release_data
     6.39%  [kernel]       [k] tcp_gro_receive
     5.71%  [kernel]       [k] ipv6_gro_receive
     4.35%  [kernel]       [k] napi_gro_frags
     4.34%  [kernel]       [k] skb_gro_receive
     3.50%  [kernel]       [k] gq_pool_get
     3.08%  [kernel]       [k] gq_rx_napi_handler
     2.35%  [kernel]       [k] tcp6_gro_receive
     2.06%  [kernel]       [k] gq_rx_prep_buffers
     1.32%  [kernel]       [k] csum_partial
     0.93%  [kernel]       [k] csum_ipv6_magic
     0.65%  [kernel]       [k] net_rx_action

On iavf, this yields +10% Mpps on Rx. It also unblocks batched
allocations of XSk buffers when an IOMMU is active.
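
For reference, any other ops-based backend whose sync callbacks are no-ops
on coherent devices could opt in the same way; a hypothetical example (the
foo_* names are made up):

static const struct dma_map_ops foo_dma_ops = {
	.map_page		= foo_map_page,
	.unmap_page		= foo_unmap_page,
	.sync_single_for_cpu	= foo_sync_single_for_cpu,
	.sync_single_for_device	= foo_sync_single_for_device,
	/* syncs can be skipped whenever the device is DMA coherent */
	.can_skip_sync		= dev_is_dma_coherent,
};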

Signed-off-by: Eric Dumazet <edumazet@google.com>
Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/iommu/dma-iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..86290562eda5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1720,6 +1720,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.unmap_page		= iommu_dma_unmap_page,
 	.map_sg			= iommu_dma_map_sg,
 	.unmap_sg		= iommu_dma_unmap_sg,
+	.can_skip_sync		= dev_is_dma_coherent,
 	.sync_single_for_cpu	= iommu_dma_sync_single_for_cpu,
 	.sync_single_for_device	= iommu_dma_sync_single_for_device,
 	.sync_sg_for_cpu	= iommu_dma_sync_sg_for_cpu,
-- 
2.43.0



* [PATCH net-next 4/7] page_pool: make sure frag API fields don't span between cachelines
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
                   ` (2 preceding siblings ...)
  2024-01-26 13:54 ` [PATCH net-next 3/7] iommu/dma: avoid expensive indirect " Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 5/7] page_pool: don't use driver-set flags field directly Alexander Lobakin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

After commit 5027ec19f104 ("net: page_pool: split the page_pool_params
into fast and slow"), which made &page_pool contain only "hot" params at
the start, the cacheline boundary chops the frag API fields group in the
middle again.
To avoid bothering with this each time the fast params get expanded or
shrunk, let's just align the frag fields to `4 * sizeof(long)`, the
closest upper pow-2 to their actual size (2 longs + 1 int). This ensures
16-byte alignment on 32-bit architectures and 32-byte alignment on
64-bit ones, avoiding unnecessary false sharing.
::pages_state_hold_cnt is used quite intensively on the hotpath no
matter whether the frag API is used, so move it to the newly created
hole in the first cacheline.
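
For clarity, the size arithmetic behind the chosen alignment (a
back-of-the-envelope sketch, not tool output):

/*
 * frag group = frag_users (long) + frag_page (pointer) + frag_offset (int)
 *
 *   64-bit: 8 + 8 + 4 = 20 bytes -> closest upper pow-2 = 32 = 4 * sizeof(long)
 *   32-bit: 4 + 4 + 4 = 12 bytes -> closest upper pow-2 = 16 = 4 * sizeof(long)
 *
 * With that alignment, the group always fits within one cacheline on both
 * classes of architectures.
 */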

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool/types.h | 12 +++++++++++-
 net/core/page_pool.c          |  9 +++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 76481c465375..217e73b7e4fc 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -128,12 +128,22 @@ struct page_pool_stats {
 struct page_pool {
 	struct page_pool_params_fast p;
 
+	u32 pages_state_hold_cnt;
 	bool has_init_callback;
 
+	/* The following block must stay within one cacheline. On 32-bit
+	 * systems, sizeof(long) == sizeof(int), so that the block size is
+	 * ``3 * sizeof(long)``. On 64-bit systems, the actual size is
+	 * ``2 * sizeof(long) + sizeof(int)``. The closest pow-2 to both of
+	 * them is ``4 * sizeof(long)``, so just use that one for simplicity.
+	 * Having it aligned to a cacheline boundary may be excessive and
+	 * doesn't bring any good.
+	 */
+	__cacheline_group_begin(frag) __aligned(4 * sizeof(long));
 	long frag_users;
 	struct page *frag_page;
 	unsigned int frag_offset;
-	u32 pages_state_hold_cnt;
+	__cacheline_group_end(frag);
 
 	struct delayed_work release_dw;
 	void (*disconnect)(void *pool);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 4933762e5a6b..be1219816990 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -170,11 +170,20 @@ static void page_pool_producer_unlock(struct page_pool *pool,
 		spin_unlock_bh(&pool->ring.producer_lock);
 }
 
+static void page_pool_struct_check(void)
+{
+	CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_users);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_page);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_offset);
+}
+
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
 {
 	unsigned int ring_qsize = 1024; /* Default */
 
+	page_pool_struct_check();
+
 	memcpy(&pool->p, &params->fast, sizeof(pool->p));
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
 
-- 
2.43.0



* [PATCH net-next 5/7] page_pool: don't use driver-set flags field directly
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
                   ` (3 preceding siblings ...)
  2024-01-26 13:54 ` [PATCH net-next 4/7] page_pool: make sure frag API fields don't span between cachelines Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 6/7] page_pool: check for DMA sync shortcut earlier Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 7/7] xsk: use generic DMA sync shortcut instead of a custom one Alexander Lobakin
  6 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

page_pool::p holds the driver-defined params, copied directly from the
structure passed to page_pool_create(). That structure isn't meant to be
modified by the Page Pool core code, and doing so might even look
confusing [0][1].
In order to be able to alter some flags, let's define our own internal
fields the same way as the already existing one (::has_init_callback).
The flags are defined as bits in the driver-set params, so keep them as
bits here as well to avoid wasting a byte per flag. Almost 30 bits are
still free for future extensions.
We could've defined only the new flags here, or only the ones we may
need to alter, but checking some flags in one place and others in
another doesn't sound convenient or intuitive. The ::flags field passed
by the driver can now go to the "slow" PP params.
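
Drivers are not affected by the move; a hypothetical setup keeps passing
the flags via the same struct, they simply land in the slow part of the
params now (field values are illustrative only):

	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE,
		.offset		= 0,
	};
	struct page_pool *pool = page_pool_create(&pp_params);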

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool/types.h |  9 ++++++---
 net/core/page_pool.c          | 34 ++++++++++++++++++----------------
 2 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 217e73b7e4fc..6a767ad1c572 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -44,7 +44,6 @@ struct pp_alloc_cache {
 
 /**
  * struct page_pool_params - page pool parameters
- * @flags:	PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
  * @order:	2^order pages on allocation
  * @pool_size:	size of the ptr_ring
  * @nid:	NUMA node id to allocate from pages from
@@ -54,10 +53,10 @@ struct pp_alloc_cache {
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
+ * @flags:	PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
  */
 struct page_pool_params {
 	struct_group_tagged(page_pool_params_fast, fast,
-		unsigned int	flags;
 		unsigned int	order;
 		unsigned int	pool_size;
 		int		nid;
@@ -68,6 +67,7 @@ struct page_pool_params {
 		unsigned int	offset;
 	);
 	struct_group_tagged(page_pool_params_slow, slow,
+		unsigned int	flags;
 		struct net_device *netdev;
 /* private: used by test code only */
 		void (*init_callback)(struct page *page, void *arg);
@@ -129,7 +129,10 @@ struct page_pool {
 	struct page_pool_params_fast p;
 
 	u32 pages_state_hold_cnt;
-	bool has_init_callback;
+
+	bool dma_map:1;			/* Perform DMA mapping */
+	bool dma_sync:1;		/* Perform DMA sync */
+	bool has_init_callback:1;	/* slow.init_callback is set */
 
 	/* The following block must stay within one cacheline. On 32-bit
 	 * systems, sizeof(long) == sizeof(int), so that the block size is
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index be1219816990..2c353906407c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -188,7 +188,7 @@ static int page_pool_init(struct page_pool *pool,
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
 
 	/* Validate only known flags were used */
-	if (pool->p.flags & ~(PP_FLAG_ALL))
+	if (pool->slow.flags & ~(PP_FLAG_ALL))
 		return -EINVAL;
 
 	if (pool->p.pool_size)
@@ -202,22 +202,26 @@ static int page_pool_init(struct page_pool *pool,
 	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
 	 * which is the XDP_TX use-case.
 	 */
-	if (pool->p.flags & PP_FLAG_DMA_MAP) {
+	if (pool->slow.flags & PP_FLAG_DMA_MAP) {
 		if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
 		    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
 			return -EINVAL;
+
+		pool->dma_map = true;
 	}
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) {
+	if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
 		/* In order to request DMA-sync-for-device the page
 		 * needs to be mapped
 		 */
-		if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		if (!(pool->slow.flags & PP_FLAG_DMA_MAP))
 			return -EINVAL;
 
 		if (!pool->p.max_len)
 			return -EINVAL;
 
+		pool->dma_sync = true;
+
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
 		 */
@@ -243,7 +247,7 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->p.flags & PP_FLAG_DMA_MAP)
+	if (pool->dma_map)
 		get_device(pool->p.dev);
 
 	return 0;
@@ -253,7 +257,7 @@ static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->p.flags & PP_FLAG_DMA_MAP)
+	if (pool->dma_map)
 		put_device(pool->p.dev);
 
 #ifdef CONFIG_PAGE_POOL_STATS
@@ -396,7 +400,7 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	if (page_pool_set_dma_addr(page, dma))
 		goto unmap_failed;
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+	if (pool->dma_sync)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
 	return true;
@@ -442,8 +446,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
-	    unlikely(!page_pool_dma_map(pool, page))) {
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page))) {
 		put_page(page);
 		return NULL;
 	}
@@ -463,8 +466,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t gfp)
 {
 	const int bulk = PP_ALLOC_CACHE_REFILL;
-	unsigned int pp_flags = pool->p.flags;
 	unsigned int pp_order = pool->p.order;
+	bool dma_map = pool->dma_map;
 	struct page *page;
 	int i, nr_pages;
 
@@ -489,8 +492,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		page = pool->alloc.cache[i];
-		if ((pp_flags & PP_FLAG_DMA_MAP) &&
-		    unlikely(!page_pool_dma_map(pool, page))) {
+		if (dma_map && unlikely(!page_pool_dma_map(pool, page))) {
 			put_page(page);
 			continue;
 		}
@@ -562,7 +564,7 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
 
-	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+	if (!pool->dma_map)
 		/* Always account for inflight pages, even if we didn't
 		 * map them
 		 */
@@ -640,7 +642,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
 }
 
 /* If the page refcnt == 1, this will try to recycle the page.
- * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
+ * If pool->dma_sync is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
  * If the page refcnt != 1, then the page will be returned to memory
  * subsystem.
@@ -663,7 +665,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		if (pool->dma_sync)
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
@@ -776,7 +778,7 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
 		return NULL;
 
 	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
-		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		if (pool->dma_sync)
 			page_pool_dma_sync_for_device(pool, page, -1);
 
 		return page;
-- 
2.43.0



* [PATCH net-next 6/7] page_pool: check for DMA sync shortcut earlier
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
                   ` (4 preceding siblings ...)
  2024-01-26 13:54 ` [PATCH net-next 5/7] page_pool: don't use driver-set flags field directly Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  2024-01-26 13:54 ` [PATCH net-next 7/7] xsk: use generic DMA sync shortcut instead of a custom one Alexander Lobakin
  6 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

We can save a couple more function calls in the Page Pool code if we
check dma_skip_sync() earlier, right where we test pool->dma_sync.
Move both checks into an inline wrapper and call the out-of-line PP
helper around the generic DMA sync function only when both are true.
Note that the result of dma_skip_sync() can't be cached in &page_pool,
as it may change at any time once an SWIOTLB buffer is allocated or
mapped.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 net/core/page_pool.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2c353906407c..cefdd9822fb7 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -369,16 +369,24 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 	return page;
 }
 
-static void page_pool_dma_sync_for_device(struct page_pool *pool,
-					  struct page *page,
-					  unsigned int dma_sync_size)
+static void __page_pool_dma_sync_for_device(struct page_pool *pool,
+					    struct page *page,
+					    unsigned int dma_sync_size)
 {
 	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
 
 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
-	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
-					 pool->p.offset, dma_sync_size,
-					 pool->p.dma_dir);
+	__dma_sync_single_range_for_device(pool->p.dev, dma_addr,
+					   pool->p.offset, dma_sync_size,
+					   pool->p.dma_dir);
+}
+
+static __always_inline void
+page_pool_dma_sync_for_device(struct page_pool *pool, struct page *page,
+			      unsigned int dma_sync_size)
+{
+	if (pool->dma_sync && !dma_skip_sync(pool->p.dev))
+		__page_pool_dma_sync_for_device(pool, page, dma_sync_size);
 }
 
 static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
@@ -400,8 +408,7 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	if (page_pool_set_dma_addr(page, dma))
 		goto unmap_failed;
 
-	if (pool->dma_sync)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+	page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
 	return true;
 
@@ -665,9 +672,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		if (pool->dma_sync)
-			page_pool_dma_sync_for_device(pool, page,
-						      dma_sync_size);
+		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
 
 		if (allow_direct && in_softirq() &&
 		    page_pool_recycle_in_cache(page, pool))
@@ -778,8 +783,7 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
 		return NULL;
 
 	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
-		if (pool->dma_sync)
-			page_pool_dma_sync_for_device(pool, page, -1);
+		page_pool_dma_sync_for_device(pool, page, -1);
 
 		return page;
 	}
-- 
2.43.0



* [PATCH net-next 7/7] xsk: use generic DMA sync shortcut instead of a custom one
  2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
                   ` (5 preceding siblings ...)
  2024-01-26 13:54 ` [PATCH net-next 6/7] page_pool: check for DMA sync shortcut earlier Alexander Lobakin
@ 2024-01-26 13:54 ` Alexander Lobakin
  6 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 13:54 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Christoph Hellwig, Marek Szyprowski,
	Robin Murphy, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

The XSk infra has been using its own DMA sync shortcut to try to avoid
redundant function calls. Now that there is a generic one, remove the
custom implementation and rely on the generic helpers.
xsk_buff_dma_sync_for_cpu() doesn't need its second argument anymore,
so remove it.
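
Drivers that previously keyed decisions off pool->dma_need_sync can now ask
the DMA layer directly; a hypothetical Rx refill path (the foo_* names are
made up, the mlx5 hunk below follows the same pattern):

	if (likely(dma_skip_sync(dev)))
		/* no syncs will ever be needed: take the batched, sync-free path */
		count = foo_alloc_rx_batched(rq, wqe_bulk);
	else
		count = foo_alloc_rx_one_by_one(rq, wqe_bulk);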

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/xdp_sock_drv.h                    |  7 ++---
 include/net/xsk_buff_pool.h                   | 13 ++-------
 drivers/net/ethernet/engleder/tsnep_main.c    |  2 +-
 .../net/ethernet/freescale/dpaa2/dpaa2-xsk.c  |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  2 +-
 drivers/net/ethernet/intel/igc/igc_main.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  2 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  4 +--
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  2 +-
 drivers/net/ethernet/netronome/nfp/nfd3/xsk.c |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +-
 net/xdp/xsk_buff_pool.c                       | 29 +++----------------
 13 files changed, 20 insertions(+), 51 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index c9aec9ab6191..0a5dca2b2b3f 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -219,13 +219,10 @@ static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool
 	return meta;
 }
 
-static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
+static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
 
-	if (!pool->dma_need_sync)
-		return;
-
 	xp_dma_sync_for_cpu(xskb);
 }
 
@@ -402,7 +399,7 @@ static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool
 	return NULL;
 }
 
-static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
+static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp)
 {
 }
 
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 99dd7376df6a..b61e787a0ee5 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -43,7 +43,6 @@ struct xsk_dma_map {
 	refcount_t users;
 	struct list_head list; /* Protected by the RTNL_LOCK */
 	u32 dma_pages_cnt;
-	bool dma_need_sync;
 };
 
 struct xsk_buff_pool {
@@ -82,7 +81,6 @@ struct xsk_buff_pool {
 	u8 tx_metadata_len; /* inherited from umem */
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
-	bool dma_need_sync;
 	bool unaligned;
 	bool tx_sw_csum;
 	void *addrs;
@@ -155,21 +153,16 @@ static inline dma_addr_t xp_get_frame_dma(struct xdp_buff_xsk *xskb)
 	return xskb->frame_dma;
 }
 
-void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb);
 static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb)
 {
-	xp_dma_sync_for_cpu_slow(xskb);
+	dma_sync_single_for_cpu(xskb->pool->dev, xskb->dma,
+				xskb->pool->frame_len, DMA_BIDIRECTIONAL);
 }
 
-void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
-				 size_t size);
 static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
 					  dma_addr_t dma, size_t size)
 {
-	if (!pool->dma_need_sync)
-		return;
-
-	xp_dma_sync_for_device_slow(pool, dma, size);
+	dma_sync_single_for_device(pool->dev, dma, size, DMA_BIDIRECTIONAL);
 }
 
 /* Masks for xdp_umem_page flags.
diff --git a/drivers/net/ethernet/engleder/tsnep_main.c b/drivers/net/ethernet/engleder/tsnep_main.c
index ae0b8b37b9bf..12a92e436d0b 100644
--- a/drivers/net/ethernet/engleder/tsnep_main.c
+++ b/drivers/net/ethernet/engleder/tsnep_main.c
@@ -1571,7 +1571,7 @@ static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
 		length = __le32_to_cpu(entry->desc_wb->properties) &
 			 TSNEP_DESC_LENGTH_MASK;
 		xsk_buff_set_size(entry->xdp, length - ETH_FCS_LEN);
-		xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(entry->xdp);
 
 		/* RX metadata with timestamps is in front of actual data,
 		 * subtract metadata size to get length of actual data and
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
index 051748b997f3..a466c2379146 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-xsk.c
@@ -55,7 +55,7 @@ static u32 dpaa2_xsk_run_xdp(struct dpaa2_eth_priv *priv,
 	xdp_set_data_meta_invalid(xdp_buff);
 	xdp_buff->rxq = &ch->xdp_rxq;
 
-	xsk_buff_dma_sync_for_cpu(xdp_buff, ch->xsk_pool);
+	xsk_buff_dma_sync_for_cpu(xdp_buff);
 	xdp_act = bpf_prog_run_xdp(xdp_prog, xdp_buff);
 
 	/* xdp.data pointer may have changed */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 11500003af0d..d20ce517426e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -483,7 +483,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 
 		bi = *i40e_rx_bi(rx_ring, next_to_process);
 		xsk_buff_set_size(bi, size);
-		xsk_buff_dma_sync_for_cpu(bi, rx_ring->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(bi);
 
 		if (!first)
 			first = bi;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 8b81a1677045..5d4aabf7e1b1 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -892,7 +892,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 				   ICE_RX_FLX_DESC_PKT_LEN_M;
 
 		xsk_buff_set_size(xdp, size);
-		xsk_buff_dma_sync_for_cpu(xdp, xsk_pool);
+		xsk_buff_dma_sync_for_cpu(xdp);
 
 		if (!first) {
 			first = xdp;
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index ba8d3fe186ae..ad9ebbd9d61d 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -2817,7 +2817,7 @@ static int igc_clean_rx_irq_zc(struct igc_q_vector *q_vector, const int budget)
 		}
 
 		bi->xdp->data_end = bi->xdp->data + size;
-		xsk_buff_dma_sync_for_cpu(bi->xdp, ring->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(bi->xdp);
 
 		res = __igc_xdp_run_prog(adapter, prog, bi->xdp);
 		switch (res) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 59798bc33298..ebda0cebe910 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -304,7 +304,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 		}
 
 		bi->xdp->data_end = bi->xdp->data + size;
-		xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(bi->xdp);
 		xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp);
 
 		if (likely(xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR))) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index b8dd74453655..1b7132fa70de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -270,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
 	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp);
 	net_prefetch(mxbuf->xdp.data);
 
 	/* Possible flows:
@@ -319,7 +319,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
 	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp);
 	net_prefetch(mxbuf->xdp.data);
 
 	prog = rcu_dereference(rq->xdp_prog);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d601b5faaed5..5e5d9fd0bfd5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -917,7 +917,7 @@ INDIRECT_CALLABLE_SCOPE bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 
 	if (!rq->xsk_pool) {
 		count = mlx5e_refill_rx_wqes(rq, head, wqe_bulk);
-	} else if (likely(!rq->xsk_pool->dma_need_sync)) {
+	} else if (likely(dma_skip_sync(rq->pdev))) {
 		mlx5e_xsk_free_rx_wqes(rq, head, wqe_bulk);
 		count = mlx5e_xsk_alloc_rx_wqes_batched(rq, head, wqe_bulk);
 	} else {
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
index 45be6954d5aa..01cfa9cc1b5e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
@@ -184,7 +184,7 @@ nfp_nfd3_xsk_rx(struct nfp_net_rx_ring *rx_ring, int budget,
 		xrxbuf->xdp->data += meta_len;
 		xrxbuf->xdp->data_end = xrxbuf->xdp->data + pkt_len;
 		xdp_set_data_meta_invalid(xrxbuf->xdp);
-		xsk_buff_dma_sync_for_cpu(xrxbuf->xdp, r_vec->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(xrxbuf->xdp);
 		net_prefetch(xrxbuf->xdp->data);
 
 		if (meta_len) {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b334eb16da23..39834dc4ba88 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5310,7 +5310,7 @@ static int stmmac_rx_zc(struct stmmac_priv *priv, int limit, u32 queue)
 
 		/* RX buffer is good and fit into a XSK pool buffer */
 		buf->xdp->data_end = buf->xdp->data + buf1_len;
-		xsk_buff_dma_sync_for_cpu(buf->xdp, rx_q->xsk_pool);
+		xsk_buff_dma_sync_for_cpu(buf->xdp);
 
 		prog = READ_ONCE(priv->xdp_prog);
 		res = __stmmac_xdp_run_prog(priv, prog, buf->xdp);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index ce60ecd48a4d..ecea2a329b1d 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -338,7 +338,6 @@ static struct xsk_dma_map *xp_create_dma_map(struct device *dev, struct net_devi
 
 	dma_map->netdev = netdev;
 	dma_map->dev = dev;
-	dma_map->dma_need_sync = false;
 	dma_map->dma_pages_cnt = nr_pages;
 	refcount_set(&dma_map->users, 1);
 	list_add(&dma_map->list, &umem->xsk_dma_list);
@@ -424,7 +423,6 @@ static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_
 
 	pool->dev = dma_map->dev;
 	pool->dma_pages_cnt = dma_map->dma_pages_cnt;
-	pool->dma_need_sync = dma_map->dma_need_sync;
 	memcpy(pool->dma_pages, dma_map->dma_pages,
 	       pool->dma_pages_cnt * sizeof(*pool->dma_pages));
 
@@ -460,8 +458,6 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 			__xp_dma_unmap(dma_map, attrs);
 			return -ENOMEM;
 		}
-		if (dma_need_sync(dev, dma))
-			dma_map->dma_need_sync = true;
 		dma_map->dma_pages[i] = dma;
 	}
 
@@ -557,11 +553,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
 	xskb->xdp.data_meta = xskb->xdp.data;
 	xskb->xdp.flags = 0;
 
-	if (pool->dma_need_sync) {
-		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
-						 pool->frame_len,
-						 DMA_BIDIRECTIONAL);
-	}
+	dma_sync_single_for_device(pool->dev, xskb->dma, pool->frame_len,
+				   DMA_BIDIRECTIONAL);
+
 	return &xskb->xdp;
 }
 EXPORT_SYMBOL(xp_alloc);
@@ -633,7 +627,7 @@ u32 xp_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max)
 {
 	u32 nb_entries1 = 0, nb_entries2;
 
-	if (unlikely(pool->dma_need_sync)) {
+	if (unlikely(!dma_skip_sync(pool->dev))) {
 		struct xdp_buff *buff;
 
 		/* Slow path */
@@ -693,18 +687,3 @@ dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr)
 		(addr & ~PAGE_MASK);
 }
 EXPORT_SYMBOL(xp_raw_get_dma);
-
-void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
-{
-	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
-				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
-}
-EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
-
-void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
-				 size_t size)
-{
-	dma_sync_single_range_for_device(pool->dev, dma, 0,
-					 size, DMA_BIDIRECTIONAL);
-}
-EXPORT_SYMBOL(xp_dma_sync_for_device_slow);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 13:54 ` [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations Alexander Lobakin
@ 2024-01-26 15:48   ` Robin Murphy
  2024-01-26 16:45     ` Alexander Lobakin
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2024-01-26 15:48 UTC (permalink / raw)
  To: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Quite often, NIC devices do not need dma_sync operations on x86_64
> at least.
> Indeed, when dev_is_dma_coherent(dev) is true and
> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> and friends do nothing.
> 
> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
> 
> Add dev->skip_dma_sync boolean which is set during the device
> initialization depending on the setup: dev_is_dma_coherent() for direct
> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
> Then later, if/when swiotlb is used for the first time, the flag
> is turned off, from swiotlb_tbl_map_single().

I think you could probably just promote the dma_uses_io_tlb flag from 
SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.

Similarly I don't think a new op is necessary now that we have 
dma_map_ops.flags. A simple static flag to indicate that sync may be 
skipped under the same conditions as implied for dma-direct - i.e. 
dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought 
to suffice.
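
For illustration, the check it would boil down to in the sync paths is
roughly this (rough sketch only; DMA_F_WHATEVER stands in for the
yet-to-be-named flag, and dev->dma_uses_io_tlb assumes the promotion
suggested above):

	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* may skip iff the dma-direct conditions hold for this device */
	if (dev_is_dma_coherent(dev) && !READ_ONCE(dev->dma_uses_io_tlb) &&
	    (!ops || (ops->flags & DMA_F_WHATEVER)))
		return;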

Thanks,
Robin.

> On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows
> +3-5% increase for direct DMA.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>   include/linux/device.h      |  5 +++++
>   include/linux/dma-map-ops.h | 17 +++++++++++++++++
>   include/linux/dma-mapping.h | 12 ++++++++++--
>   drivers/base/dd.c           |  2 ++
>   kernel/dma/mapping.c        | 34 +++++++++++++++++++++++++++++++---
>   kernel/dma/swiotlb.c        | 14 ++++++++++++++
>   6 files changed, 79 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 97c4b046c09d..f23e6a32bea0 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -686,6 +686,8 @@ struct device_physical_location {
>    *		other devices probe successfully.
>    * @dma_coherent: this particular device is dma coherent, even if the
>    *		architecture supports non-coherent devices.
> + * @dma_skip_sync: DMA sync operations can be skipped for coherent non-SWIOTLB
> + *		buffers.
>    * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
>    *		streaming DMA operations (->map_* / ->unmap_* / ->sync_*),
>    *		and optionall (if the coherent mask is large enough) also
> @@ -800,6 +802,9 @@ struct device {
>       defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
>   	bool			dma_coherent:1;
>   #endif
> +#ifdef CONFIG_DMA_NEED_SYNC
> +	bool			dma_skip_sync:1;
> +#endif
>   #ifdef CONFIG_DMA_OPS_BYPASS
>   	bool			dma_ops_bypass : 1;
>   #endif
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 4abc60f04209..937c295e9da8 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -78,6 +78,7 @@ struct dma_map_ops {
>   			int nents, enum dma_data_direction dir);
>   	void (*cache_sync)(struct device *dev, void *vaddr, size_t size,
>   			enum dma_data_direction direction);
> +	bool (*can_skip_sync)(struct device *dev);
>   	int (*dma_supported)(struct device *dev, u64 mask);
>   	u64 (*get_required_mask)(struct device *dev);
>   	size_t (*max_mapping_size)(struct device *dev);
> @@ -111,6 +112,22 @@ static inline void set_dma_ops(struct device *dev,
>   }
>   #endif /* CONFIG_DMA_OPS */
>   
> +#ifdef CONFIG_DMA_NEED_SYNC
> +
> +static inline void dma_set_skip_sync(struct device *dev, bool skip)
> +{
> +	dev->dma_skip_sync = skip;
> +}
> +
> +void dma_setup_skip_sync(struct device *dev);
> +
> +#else /* !CONFIG_DMA_NEED_SYNC */
> +
> +#define dma_set_skip_sync(dev, skip)		do { } while (0)
> +#define dma_setup_skip_sync(dev)		do { } while (0)
> +
> +#endif /* !CONFIG_DMA_NEED_SYNC */
> +
>   #ifdef CONFIG_DMA_CMA
>   extern struct cma *dma_contiguous_default_area;
>   
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 9dd7e1578bf6..bc9f67e0c139 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -365,9 +365,17 @@ __dma_sync_single_range_for_device(struct device *dev, dma_addr_t addr,
>   
>   #ifdef CONFIG_DMA_NEED_SYNC
>   
> -#define dma_skip_sync(dev)			false
> +static inline bool dma_skip_sync(const struct device *dev)
> +{
> +	return dev->dma_skip_sync;
> +}
> +
> +bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr);
>   
> -bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
> +static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
> +{
> +	return dma_skip_sync(dev) ? false : __dma_need_sync(dev, dma_addr);
> +}
>   
>   #else /* !CONFIG_DMA_NEED_SYNC */
>   
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 85152537dbf1..67ad3e1d51f6 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -642,6 +642,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
>   			goto pinctrl_bind_failed;
>   	}
>   
> +	dma_setup_skip_sync(dev);
> +
>   	ret = driver_sysfs_add(dev);
>   	if (ret) {
>   		pr_err("%s: driver_sysfs_add(%s) failed\n",
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index a30f37f9d4db..8fa464b3954e 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -842,15 +842,43 @@ size_t dma_opt_mapping_size(struct device *dev)
>   EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
>   
>   #ifdef CONFIG_DMA_NEED_SYNC
> -bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
> +bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>   {
>   	const struct dma_map_ops *ops = get_dma_ops(dev);
>   
>   	if (dma_map_direct(dev, ops))
> +		/*
> +		 * dma_skip_sync could've been set to false on first SWIOTLB
> +		 * buffer mapping, but @dma_addr is not necessarily an SWIOTLB
> +		 * buffer. In this case, fall back to more granular check.
> +		 */
> +		 */
>   		return dma_direct_need_sync(dev, dma_addr);
> -	return ops->sync_single_for_cpu || ops->sync_single_for_device;
> +
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(__dma_need_sync);
> +
> +void dma_setup_skip_sync(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +	bool skip;
> +
> +	if (dma_map_direct(dev, ops))
> +		/*
> +		 * dma_skip_sync will be set to false on first SWIOTLB buffer
> +		 * mapping, if any. During the device initialization, it's
> +		 * enough to check only for DMA coherence.
> +		 */
> +		skip = dev_is_dma_coherent(dev);
> +	else if (!ops->sync_single_for_device && !ops->sync_single_for_cpu)
> +		skip = true;
> +	else if (ops->can_skip_sync)
> +		skip = ops->can_skip_sync(dev);
> +	else
> +		skip = false;
> +
> +	dma_set_skip_sync(dev, skip);
>   }
> -EXPORT_SYMBOL_GPL(dma_need_sync);
>   #endif /* CONFIG_DMA_NEED_SYNC */
>   
>   unsigned long dma_get_merge_boundary(struct device *dev)
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index b079a9a8e087..b62ea0a4f106 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -1286,6 +1286,16 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
>   
>   #endif /* CONFIG_DEBUG_FS */
>   
> +static inline void swiotlb_disable_dma_skip_sync(struct device *dev)
> +{
> +	/*
> +	 * If dma_skip_sync was set, reset it to false on first SWIOTLB buffer
> +	 * mapping/allocation to always sync SWIOTLB buffers.
> +	 */
> +	if (unlikely(dma_skip_sync(dev)))
> +		dma_set_skip_sync(dev, false);
> +}
> +
>   phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   		size_t mapping_size, size_t alloc_size,
>   		unsigned int alloc_align_mask, enum dma_data_direction dir,
> @@ -1323,6 +1333,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>   		return (phys_addr_t)DMA_MAPPING_ERROR;
>   	}
>   
> +	swiotlb_disable_dma_skip_sync(dev);
> +
>   	/*
>   	 * Save away the mapping from the original address to the DMA address.
>   	 * This is needed when we sync the memory.  Then we sync the buffer if
> @@ -1640,6 +1652,8 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
>   	if (index == -1)
>   		return NULL;
>   
> +	swiotlb_disable_dma_skip_sync(dev);
> +
>   	tlb_addr = slot_addr(pool->start, index);
>   
>   	return pfn_to_page(PFN_DOWN(tlb_addr));

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 15:48   ` Robin Murphy
@ 2024-01-26 16:45     ` Alexander Lobakin
  2024-01-26 17:21       ` Robin Murphy
  2024-01-29 14:07       ` Alexander Lobakin
  0 siblings, 2 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-26 16:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

From: Robin Murphy <robin.murphy@arm.com>
Date: Fri, 26 Jan 2024 15:48:54 +0000

> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> Quite often, NIC devices do not need dma_sync operations on x86_64
>> at least.
>> Indeed, when dev_is_dma_coherent(dev) is true and
>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>> and friends do nothing.
>>
>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>
>> Add dev->skip_dma_sync boolean which is set during the device
>> initialization depending on the setup: dev_is_dma_coherent() for direct
>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>> Then later, if/when swiotlb is used for the first time, the flag
>> is turned off, from swiotlb_tbl_map_single().
> 
> I think you could probably just promote the dma_uses_io_tlb flag from
> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.

Nice catch!

> 
> Similarly I don't think a new op is necessary now that we have
> dma_map_ops.flags. A simple static flag to indicate that sync may be
> skipped under the same conditions as implied for dma-direct - i.e.
> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
> to suffice.

In my initial implementation, I used a new dma_map_ops flag, but then I
realized different DMA ops may require or not require syncing under
different conditions, not only dev_is_dma_coherent().
Or am I wrong and they would always be the same?

> 
> Thanks,
> Robin.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 16:45     ` Alexander Lobakin
@ 2024-01-26 17:21       ` Robin Murphy
  2024-01-26 18:48         ` Petr Tesařík
  2024-01-29 14:07       ` Alexander Lobakin
  1 sibling, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2024-01-26 17:21 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

On 26/01/2024 4:45 pm, Alexander Lobakin wrote:
> From: Robin Murphy <robin.murphy@arm.com>
> Date: Fri, 26 Jan 2024 15:48:54 +0000
> 
>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>> at least.
>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>> and friends do nothing.
>>>
>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>
>>> Add dev->skip_dma_sync boolean which is set during the device
>>> initialization depending on the setup: dev_is_dma_coherent() for direct
>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>>> Then later, if/when swiotlb is used for the first time, the flag
>>> is turned off, from swiotlb_tbl_map_single().
>>
>> I think you could probably just promote the dma_uses_io_tlb flag from
>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.
> 
> Nice catch!
> 
>>
>> Similarly I don't think a new op is necessary now that we have
>> dma_map_ops.flags. A simple static flag to indicate that sync may be
>> skipped under the same conditions as implied for dma-direct - i.e.
>> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
>> to suffice.
> 
> In my initial implementation, I used a new dma_map_ops flag, but then I
> realized different DMA ops may require or not require syncing under
> different conditions, not only dev_is_dma_coherent().
> Or am I wrong and they would always be the same?

I think it's safe to assume that, as with P2P support, this will only 
matter for dma-direct and iommu-dma for the foreseeable future, and 
those do currently share the same conditions as above. Thus we may as 
well keep things simple for now, and if anything ever does have cause to 
change, it can be the future's problem to keep this mechanism working as 
intended.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 17:21       ` Robin Murphy
@ 2024-01-26 18:48         ` Petr Tesařík
  2024-01-26 19:13           ` Robin Murphy
  2024-01-29 14:36           ` Alexander Lobakin
  0 siblings, 2 replies; 24+ messages in thread
From: Petr Tesařík @ 2024-01-26 18:48 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Will Deacon, Greg Kroah-Hartman, Rafael J. Wysocki,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Duyck, bpf,
	netdev, iommu, linux-kernel

On Fri, 26 Jan 2024 17:21:24 +0000
Robin Murphy <robin.murphy@arm.com> wrote:

> On 26/01/2024 4:45 pm, Alexander Lobakin wrote:
> > From: Robin Murphy <robin.murphy@arm.com>
> > Date: Fri, 26 Jan 2024 15:48:54 +0000
> >   
> >> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:  
> >>> From: Eric Dumazet <edumazet@google.com>
> >>>
> >>> Quite often, NIC devices do not need dma_sync operations on x86_64
> >>> at least.
> >>> Indeed, when dev_is_dma_coherent(dev) is true and
> >>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> >>> and friends do nothing.
> >>>
> >>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
> >>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
> >>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
> >>>
> >>> Add dev->skip_dma_sync boolean which is set during the device
> >>> initialization depending on the setup: dev_is_dma_coherent() for direct
> >>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
> >>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
> >>> Then later, if/when swiotlb is used for the first time, the flag
> >>> is turned off, from swiotlb_tbl_map_single().  
> >>
> >> I think you could probably just promote the dma_uses_io_tlb flag from
> >> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.  
> > 
> > Nice catch!
> >   
> >>
> >> Similarly I don't think a new op is necessary now that we have
> >> dma_map_ops.flags. A simple static flag to indicate that sync may be
> >> skipped under the same conditions as implied for dma-direct - i.e.
> >> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
> >> to suffice.  
> > 
> > In my initial implementation, I used a new dma_map_ops flag, but then I
> > realized different DMA ops may require or not require syncing under
> > different conditions, not only dev_is_dma_coherent().
> > Or am I wrong and they would always be the same?  
> 
> I think it's safe to assume that, as with P2P support, this will only 
> matter for dma-direct and iommu-dma for the foreseeable future, and 
> those do currently share the same conditions as above. Thus we may as 
> well keep things simple for now, and if anything ever does have cause to 
> change, it can be the future's problem to keep this mechanism working as 
> intended.

Can we have a comment that states this assumption along with the flag?
Because when it breaks, it will keep someone cursing for days why DMA
sometimes fails on their device before they find out it's not synced.
And then wondering why the code makes such silly assumptions...
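
Even a short blurb next to the flag definition would go a long way,
something like (strawman wording only):

	/*
	 * Sync may be skipped only under the same conditions as for
	 * dma-direct: the device is DMA coherent and no SWIOTLB bounce
	 * buffer has been used for it. If an implementation ever needs
	 * different conditions, this has to be revisited.
	 */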

My two cents
Petr T

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 18:48         ` Petr Tesařík
@ 2024-01-26 19:13           ` Robin Murphy
  2024-01-29  6:09             ` Christoph Hellwig
  2024-01-29 14:36           ` Alexander Lobakin
  1 sibling, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2024-01-26 19:13 UTC (permalink / raw)
  To: Petr Tesařík
  Cc: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Will Deacon, Greg Kroah-Hartman, Rafael J. Wysocki,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Duyck, bpf,
	netdev, iommu, linux-kernel

On 26/01/2024 6:48 pm, Petr Tesařík wrote:
> On Fri, 26 Jan 2024 17:21:24 +0000
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
>> On 26/01/2024 4:45 pm, Alexander Lobakin wrote:
>>> From: Robin Murphy <robin.murphy@arm.com>
>>> Date: Fri, 26 Jan 2024 15:48:54 +0000
>>>    
>>>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>
>>>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>>>> at least.
>>>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>>>> and friends do nothing.
>>>>>
>>>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>>>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>>>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>>>
>>>>> Add dev->skip_dma_sync boolean which is set during the device
>>>>> initialization depending on the setup: dev_is_dma_coherent() for direct
>>>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>>>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>>>>> Then later, if/when swiotlb is used for the first time, the flag
>>>>> is turned off, from swiotlb_tbl_map_single().
>>>>
>>>> I think you could probably just promote the dma_uses_io_tlb flag from
>>>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.
>>>
>>> Nice catch!
>>>    
>>>>
>>>> Similarly I don't think a new op is necessary now that we have
>>>> dma_map_ops.flags. A simple static flag to indicate that sync may be
>>>> skipped under the same conditions as implied for dma-direct - i.e.
>>>> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
>>>> to suffice.
>>>
>>> In my initial implementation, I used a new dma_map_ops flag, but then I
>>> realized different DMA ops may require or not require syncing under
>>> different conditions, not only dev_is_dma_coherent().
>>> Or am I wrong and they would always be the same?
>>
>> I think it's safe to assume that, as with P2P support, this will only
>> matter for dma-direct and iommu-dma for the foreseeable future, and
>> those do currently share the same conditions as above. Thus we may as
>> well keep things simple for now, and if anything ever does have cause to
>> change, it can be the future's problem to keep this mechanism working as
>> intended.
> 
> Can we have a comment that states this assumption along with the flag?
> Because when it breaks, it will keep someone cursing for days why DMA
> sometimes fails on their device before they find out it's not synced.
> And then wondering why the code makes such silly assumptions...

Indeed, apologies if it wasn't totally clear, but I really was implying 
a literal "may skip sync if coherent and not using SWIOTLB (which 
matches dma-direct)" flag, documented as such, and not trying to dress 
it up as anything more generic. I just can't suggest a suitably concise 
name for that off the top of my head... :)

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 19:13           ` Robin Murphy
@ 2024-01-29  6:09             ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-01-29  6:09 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Petr Tesařík, Alexander Lobakin, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christoph Hellwig,
	Marek Szyprowski, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

On Fri, Jan 26, 2024 at 07:13:05PM +0000, Robin Murphy wrote:
>> Can we have a comment that states this assumption along with the flag?
>> Because when it breaks, it will keep someone cursing for days why DMA
>> sometimes fails on their device before they find out it's not synced.
>> And then wondering why the code makes such silly assumptions...
>
> Indeed, apologies if it wasn't totally clear, but I really was implying a 
> literal "may skip sync if coherent and not using SWIOTLB (which matches 
> dma-direct)" flag, documented as such, and not trying to dress it up as 
> anything more generic. I just can't suggest a suitably concise name for 
> that off the top of my head... :)

Yes, that seems like the right way to go.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
@ 2024-01-29  6:11   ` Christoph Hellwig
  2024-01-29 11:07     ` Alexander Lobakin
  2024-01-31 16:52   ` Simon Horman
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-01-29  6:11 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
	Will Deacon, Greg Kroah-Hartman, Rafael J. Wysocki,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Duyck, bpf,
	netdev, iommu, linux-kernel

On Fri, Jan 26, 2024 at 02:54:50PM +0100, Alexander Lobakin wrote:
> Some platforms do have DMA, but DMA there is always direct and coherent.
> Currently, even on such platforms DMA sync operations are compiled and
> called.
> Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
> either sync operations are needed or there is DMA ops or swiotlb
> enabled. Set dma_need_sync() and dma_skip_sync() (stub for now)
> depending on this symbol state and don't call sync ops when
> dma_skip_sync() is true.
> The change allows for future optimizations of DMA sync calls depending
> on compile-time or runtime conditions.

So the idea of compiling out the calls sounds fine to me.  But what
is the point of the extra indirection through the __-prefixed calls?

And if we need that (please document it in the commit log), please
make the wrappers proper inline functions and not macros.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-29  6:11   ` Christoph Hellwig
@ 2024-01-29 11:07     ` Alexander Lobakin
  2024-01-29 12:15       ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-29 11:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marek Szyprowski, Robin Murphy, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

From: Christoph Hellwig <hch@lst.de>
Date: Mon, 29 Jan 2024 07:11:36 +0100

> On Fri, Jan 26, 2024 at 02:54:50PM +0100, Alexander Lobakin wrote:
>> Some platforms do have DMA, but DMA there is always direct and coherent.
>> Currently, even on such platforms DMA sync operations are compiled and
>> called.
>> Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
>> either sync operations are needed or there is DMA ops or swiotlb
>> enabled. Set dma_need_sync() and dma_skip_sync() (stub for now)
>> depending on this symbol state and don't call sync ops when
>> dma_skip_sync() is true.
>> The change allows for future optimizations of DMA sync calls depending
>> on compile-time or runtime conditions.
> 
> So the idea of compiling out the calls sounds fine to me.  But what
> is the point of the extra indirection through the __-prefixed calls?

Because dma_sync_* ops are external functions, not inlines, and in the
next patch I'm adding a check there.

> 
> And if we need that (please document it in the commit log), please
> make the wrappers proper inline functions and not macros.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-29 11:07     ` Alexander Lobakin
@ 2024-01-29 12:15       ` Robin Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Robin Murphy @ 2024-01-29 12:15 UTC (permalink / raw)
  To: Alexander Lobakin, Christoph Hellwig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marek Szyprowski, Joerg Roedel, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, Magnus Karlsson, Maciej Fijalkowski,
	Alexander Duyck, bpf, netdev, iommu, linux-kernel

On 2024-01-29 11:07 am, Alexander Lobakin wrote:
> From: Christoph Hellwig <hch@lst.de>
> Date: Mon, 29 Jan 2024 07:11:36 +0100
> 
>> On Fri, Jan 26, 2024 at 02:54:50PM +0100, Alexander Lobakin wrote:
>>> Some platforms do have DMA, but DMA there is always direct and coherent.
>>> Currently, even on such platforms DMA sync operations are compiled and
>>> called.
>>> Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
>>> either sync operations are needed or there is DMA ops or swiotlb
>>> enabled. Set dma_need_sync() and dma_skip_sync() (stub for now)
>>> depending on this symbol state and don't call sync ops when
>>> dma_skip_sync() is true.
>>> The change allows for future optimizations of DMA sync calls depending
>>> on compile-time or runtime conditions.
>>
>> So the idea of compiling out the calls sounds fine to me.  But what
>> is the point of the extra indirection through the __-prefixed calls?
> 
> Because dma_sync_* ops are external functions, not inlines, and in the
> next patch I'm adding a check there.
> 
>>
>> And if we need that (please document it in the commit log), please
>> make the wrappers proper inline functions and not macros.

In fact those wrappers could perhaps subsume the existing stub 
definitions, by starting with a refactor along these lines:

static inline void dma_sync_x(...)
{
	if (IS_ENABLED(CONFIG_DMA_NEED_SYNC))
		__dma_sync_x(...);
}
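
Spelled out for one of them, that would be something like this (sketch
only, assuming the __-prefixed out-of-line helpers from this patch):

static inline void dma_sync_single_for_cpu(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
	if (IS_ENABLED(CONFIG_DMA_NEED_SYNC))
		__dma_sync_single_for_cpu(dev, addr, size, dir);
}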

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 16:45     ` Alexander Lobakin
  2024-01-26 17:21       ` Robin Murphy
@ 2024-01-29 14:07       ` Alexander Lobakin
  2024-01-29 14:29         ` Robin Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-29 14:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri, 26 Jan 2024 17:45:11 +0100

> From: Robin Murphy <robin.murphy@arm.com>
> Date: Fri, 26 Jan 2024 15:48:54 +0000
> 
>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>> at least.
>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>> and friends do nothing.
>>>
>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>
>>> Add dev->skip_dma_sync boolean which is set during the device
>>> initialization depending on the setup: dev_is_dma_coherent() for direct
>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>>> Then later, if/when swiotlb is used for the first time, the flag
>>> is turned off, from swiotlb_tbl_map_single().
>>
>> I think you could probably just promote the dma_uses_io_tlb flag from
>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.
> 
> Nice catch!

BTW, this implies such a hotpath check:

	if (dev->dma_skip_sync && !READ_ONCE(dev->dma_uses_io_tlb))
		// ...

This seems less effective than just resetting dma_skip_sync on first
allocation.

> 
>>
>> Similarly I don't think a new op is necessary now that we have
>> dma_map_ops.flags. A simple static flag to indicate that sync may be
>> skipped under the same conditions as implied for dma-direct - i.e.
>> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
>> to suffice.
> 
> In my initial implementation, I used a new dma_map_ops flag, but then I
> realized different DMA ops may require or not require syncing under
> different conditions, not only dev_is_dma_coherent().
> Or am I wrong and they would always be the same?
> 
>>
>> Thanks,
>> Robin.
> 
> Thanks,
> Olek

Thanks,
Olek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-29 14:07       ` Alexander Lobakin
@ 2024-01-29 14:29         ` Robin Murphy
  2024-01-29 14:34           ` Alexander Lobakin
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2024-01-29 14:29 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

On 2024-01-29 2:07 pm, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Fri, 26 Jan 2024 17:45:11 +0100
> 
>> From: Robin Murphy <robin.murphy@arm.com>
>> Date: Fri, 26 Jan 2024 15:48:54 +0000
>>
>>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>>>> From: Eric Dumazet <edumazet@google.com>
>>>>
>>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>>> at least.
>>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>>> and friends do nothing.
>>>>
>>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>>
>>>> Add dev->skip_dma_sync boolean which is set during the device
>>>> initialization depending on the setup: dev_is_dma_coherent() for direct
>>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>>>> Then later, if/when swiotlb is used for the first time, the flag
>>>> is turned off, from swiotlb_tbl_map_single().
>>>
>>> I think you could probably just promote the dma_uses_io_tlb flag from
>>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.
>>
>> Nice catch!
> 
> BTW, this implies such hotpath check:
> 
> 	if (dev->dma_skip_sync && !READ_ONCE(dev->dma_uses_io_tlb))
> 		// ...
> 
> This seems less effective than just resetting dma_skip_sync on first
> allocation.

Well, my point is not to have a dma_skip_sync at all; I'm suggesting the 
check would be:

	if (dev_is_dma_coherent(dev) && !dev_uses_io_tlb(dev))
		...

where on the platform which cares about this most, that first condition 
is a compile-time constant (and as implied, the second would want to be 
similarly wrapped for !SWIOTLB configs).
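
e.g. with a trivial helper along these lines (sketch; the name is made
up, and it assumes dma_uses_io_tlb gets promoted out of SWIOTLB_DYNAMIC
as discussed):

static inline bool dev_uses_io_tlb(struct device *dev)
{
#ifdef CONFIG_SWIOTLB
	/* set on the first bounce-buffer mapping for this device */
	return READ_ONCE(dev->dma_uses_io_tlb);
#else
	return false;
#endif
}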

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-29 14:29         ` Robin Murphy
@ 2024-01-29 14:34           ` Alexander Lobakin
  0 siblings, 0 replies; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-29 14:34 UTC (permalink / raw)
  To: Robin Murphy
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

From: Robin Murphy <robin.murphy@arm.com>
Date: Mon, 29 Jan 2024 14:29:43 +0000

> On 2024-01-29 2:07 pm, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Fri, 26 Jan 2024 17:45:11 +0100
>>
>>> From: Robin Murphy <robin.murphy@arm.com>
>>> Date: Fri, 26 Jan 2024 15:48:54 +0000
>>>
>>>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>
>>>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>>>> at least.
>>>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>>>> and friends do nothing.
>>>>>
>>>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes
>>>>> about
>>>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit
>>>>> rate.
>>>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>>>
>>>>> Add dev->skip_dma_sync boolean which is set during the device
>>>>> initialization depending on the setup: dev_is_dma_coherent() for
>>>>> direct
>>>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive
>>>>> result
>>>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA
>>>>> ops.
>>>>> Then later, if/when swiotlb is used for the first time, the flag
>>>>> is turned off, from swiotlb_tbl_map_single().
>>>>
>>>> I think you could probably just promote the dma_uses_io_tlb flag from
>>>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.
>>>
>>> Nice catch!
>>
>> BTW, this implies such hotpath check:
>>
>>     if (dev->dma_skip_sync && !READ_ONCE(dev->dma_uses_io_tlb))
>>         // ...
>>
>> This seems less effective than just resetting dma_skip_sync on first
>> allocation.
> 
> Well, my point is not to have a dma_skip_sync at all; I'm suggesting the
> check would be:
> 
>     if (dev_is_dma_coherent(dev) && !dev_uses_io_tlb(dev))
>         ...

Aaah, okay, but what about dma_map_ops?
It would then become:

	if ((!dev->dma_ops ||
	     (!dev->dma_ops->sync_single_for_device &&
	      !dev->dma_ops->sync_single_for_cpu) ||
	     (dev->dma_ops->flags & DMA_F_SKIP_SYNC)) &&
	    dev_is_dma_coherent(dev) && !dev_uses_io_tlb(dev))
		dma_sync_ ...

That's quite a lot, and everything except dev_uses_io_tlb() is known at
device probe time, which is why I decided to cache the result in a new
flag...

> 
> where on the platform which cares about this most, that first condition
> is a compile-time constant (and as implied, the second would want to be
> similarly wrapped for !SWIOTLB configs).
> 
> Thanks,
> Robin.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-26 18:48         ` Petr Tesařík
  2024-01-26 19:13           ` Robin Murphy
@ 2024-01-29 14:36           ` Alexander Lobakin
  2024-01-29 16:15             ` Petr Tesařík
  1 sibling, 1 reply; 24+ messages in thread
From: Alexander Lobakin @ 2024-01-29 14:36 UTC (permalink / raw)
  To: Petr Tesařík, Robin Murphy
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

From: Petr Tesařík <petr@tesarici.cz>
Date: Fri, 26 Jan 2024 19:48:19 +0100

> On Fri, 26 Jan 2024 17:21:24 +0000
> Robin Murphy <robin.murphy@arm.com> wrote:
> 
>> On 26/01/2024 4:45 pm, Alexander Lobakin wrote:
>>> From: Robin Murphy <robin.murphy@arm.com>
>>> Date: Fri, 26 Jan 2024 15:48:54 +0000
>>>   
>>>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:  
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>
>>>>> Quite often, NIC devices do not need dma_sync operations on x86_64
>>>>> at least.
>>>>> Indeed, when dev_is_dma_coherent(dev) is true and
>>>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
>>>>> and friends do nothing.
>>>>>
>>>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
>>>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
>>>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
>>>>>
>>>>> Add dev->skip_dma_sync boolean which is set during the device
>>>>> initialization depending on the setup: dev_is_dma_coherent() for direct
>>>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
>>>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
>>>>> Then later, if/when swiotlb is used for the first time, the flag
>>>>> is turned off, from swiotlb_tbl_map_single().  
>>>>
>>>> I think you could probably just promote the dma_uses_io_tlb flag from
>>>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.  
>>>
>>> Nice catch!
>>>   
>>>>
>>>> Similarly I don't think a new op is necessary now that we have
>>>> dma_map_ops.flags. A simple static flag to indicate that sync may be
>>>> skipped under the same conditions as implied for dma-direct - i.e.
>>>> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
>>>> to suffice.  
>>>
>>> In my initial implementation, I used a new dma_map_ops flag, but then I
>>> realized different DMA ops may require or not require syncing under
>>> different conditions, not only dev_is_dma_coherent().
>>> Or am I wrong and they would always be the same?  
>>
>> I think it's safe to assume that, as with P2P support, this will only 
>> matter for dma-direct and iommu-dma for the foreseeable future, and 
>> those do currently share the same conditions as above. Thus we may as 
>> well keep things simple for now, and if anything ever does have cause to 
>> change, it can be the future's problem to keep this mechanism working as 
>> intended.
> 
> Can we have a comment that states this assumption along with the flag?
> Because when it breaks, it will keep someone cursing for days why DMA
> sometimes fails on their device before they find out it's not synced.

BTW, dma_skip_sync is set right before driver->probe(), so that if any
problematic device appears, it could easily be fixed by adding one line
to its probe callback.
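
E.g. (purely illustrative, made-up driver and probe callback):

	/* hypothetical quirk for a device that still needs the syncs */
	static int foo_probe(struct platform_device *pdev)
	{
		dma_set_skip_sync(&pdev->dev, false);

		/* ... rest of the usual probe work ... */
		return 0;
	}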

> And then wondering why the code makes such silly assumptions...
> 
> My two cents
> Petr T

Thanks,
Olek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations
  2024-01-29 14:36           ` Alexander Lobakin
@ 2024-01-29 16:15             ` Petr Tesařík
  0 siblings, 0 replies; 24+ messages in thread
From: Petr Tesařík @ 2024-01-29 16:15 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Robin Murphy, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Will Deacon, Greg Kroah-Hartman, Rafael J. Wysocki,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Duyck, bpf,
	netdev, iommu, linux-kernel

On Mon, 29 Jan 2024 15:36:35 +0100
Alexander Lobakin <aleksander.lobakin@intel.com> wrote:

> From: Petr Tesařík <petr@tesarici.cz>
> Date: Fri, 26 Jan 2024 19:48:19 +0100
> 
> > On Fri, 26 Jan 2024 17:21:24 +0000
> > Robin Murphy <robin.murphy@arm.com> wrote:
> >   
> >> On 26/01/2024 4:45 pm, Alexander Lobakin wrote:  
> >>> From: Robin Murphy <robin.murphy@arm.com>
> >>> Date: Fri, 26 Jan 2024 15:48:54 +0000
> >>>     
> >>>> On 26/01/2024 1:54 pm, Alexander Lobakin wrote:    
> >>>>> From: Eric Dumazet <edumazet@google.com>
> >>>>>
> >>>>> Quite often, NIC devices do not need dma_sync operations on x86_64
> >>>>> at least.
> >>>>> Indeed, when dev_is_dma_coherent(dev) is true and
> >>>>> dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> >>>>> and friends do nothing.
> >>>>>
> >>>>> However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
> >>>>> 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
> >>>>> Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.
> >>>>>
> >>>>> Add dev->skip_dma_sync boolean which is set during the device
> >>>>> initialization depending on the setup: dev_is_dma_coherent() for direct
> >>>>> DMA, !(sync_single_for_device || sync_single_for_cpu) or positive result
> >>>>> from the new callback, dma_map_ops::can_skip_sync for non-NULL DMA ops.
> >>>>> Then later, if/when swiotlb is used for the first time, the flag
> >>>>> is turned off, from swiotlb_tbl_map_single().    
> >>>>
> >>>> I think you could probably just promote the dma_uses_io_tlb flag from
> >>>> SWIOTLB_DYNAMIC to a general SWIOTLB thing to serve this purpose now.    
> >>>
> >>> Nice catch!
> >>>     
> >>>>
> >>>> Similarly I don't think a new op is necessary now that we have
> >>>> dma_map_ops.flags. A simple static flag to indicate that sync may be
> >>>> skipped under the same conditions as implied for dma-direct - i.e.
> >>>> dev_is_dma_coherent(dev) && !dev->dma_use_io_tlb - seems like it ought
> >>>> to suffice.    
> >>>
> >>> In my initial implementation, I used a new dma_map_ops flag, but then I
> >>> realized different DMA ops may require or not require syncing under
> >>> different conditions, not only dev_is_dma_coherent().
> >>> Or am I wrong and they would always be the same?    
> >>
> >> I think it's safe to assume that, as with P2P support, this will only 
> >> matter for dma-direct and iommu-dma for the foreseeable future, and 
> >> those do currently share the same conditions as above. Thus we may as 
> >> well keep things simple for now, and if anything ever does have cause to 
> >> change, it can be the future's problem to keep this mechanism working as 
> >> intended.  
> > 
> > Can we have a comment that states this assumption along with the flag?
> > Because when it breaks, it will keep someone cursing for days why DMA
> > sometimes fails on their device before they find out it's not synced.  
> 
> BTW, dma_skip_sync is set right before driver->probe(), so that if any
> problematic device appears, it could easily be fixed by adding one line
> to its probe callback.

Ah, perfect!

Petr T

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
  2024-01-29  6:11   ` Christoph Hellwig
@ 2024-01-31 16:52   ` Simon Horman
  2024-01-31 17:14     ` Robin Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Simon Horman @ 2024-01-31 16:52 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
	Will Deacon, Greg Kroah-Hartman, Rafael J. Wysocki,
	Magnus Karlsson, Maciej Fijalkowski, Alexander Duyck, bpf,
	netdev, iommu, linux-kernel

On Fri, Jan 26, 2024 at 02:54:50PM +0100, Alexander Lobakin wrote:
> Some platforms do have DMA, but DMA there is always direct and coherent.
> Currently, even on such platforms DMA sync operations are compiled and
> called.
> Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
> either sync operations are needed or there is DMA ops or swiotlb
> enabled. Set dma_need_sync() and dma_skip_sync() (stub for now)
> depending on this symbol state and don't call sync ops when
> dma_skip_sync() is true.
> The change allows for future optimizations of DMA sync calls depending
> on compile-time or runtime conditions.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Hi Alexander,

This seems to cause x86_64 allmodconfig builds to fail:

 ../drivers/media/platform/ti/omap3isp/ispstat.c:82:35: error: ‘dma_sync_single_range_for_device’ undeclared (first use in this function); did you mean ‘dma_sync_sgtable_for_device’?
    82 |                                   dma_sync_single_range_for_device);
       |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                   dma_sync_sgtable_for_device
 ../drivers/media/platform/ti/omap3isp/ispstat.c:82:35: note: each undeclared identifier is reported only once for each function it appears in
 ../drivers/media/platform/ti/omap3isp/ispstat.c: In function ‘isp_stat_buf_sync_magic_for_cpu’:
 ../drivers/media/platform/ti/omap3isp/ispstat.c:94:35: error: ‘dma_sync_single_range_for_cpu’ undeclared (first use in this function); did you mean ‘dma_sync_sgtable_for_cpu’?
    94 |                                   dma_sync_single_range_for_cpu);
       |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       |                                   dma_sync_sgtable_for_cpu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used
  2024-01-31 16:52   ` Simon Horman
@ 2024-01-31 17:14     ` Robin Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Robin Murphy @ 2024-01-31 17:14 UTC (permalink / raw)
  To: Simon Horman, Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christoph Hellwig, Marek Szyprowski, Joerg Roedel, Will Deacon,
	Greg Kroah-Hartman, Rafael J. Wysocki, Magnus Karlsson,
	Maciej Fijalkowski, Alexander Duyck, bpf, netdev, iommu,
	linux-kernel

On 31/01/2024 4:52 pm, Simon Horman wrote:
> On Fri, Jan 26, 2024 at 02:54:50PM +0100, Alexander Lobakin wrote:
>> Some platforms do have DMA, but DMA there is always direct and coherent.
>> Currently, even on such platforms DMA sync operations are compiled and
>> called.
>> Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
>> either sync operations are needed or there is DMA ops or swiotlb
>> enabled. Set dma_need_sync() and dma_skip_sync() (stub for now)
>> depending on this symbol state and don't call sync ops when
>> dma_skip_sync() is true.
>> The change allows for future optimizations of DMA sync calls depending
>> on compile-time or runtime conditions.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> Hi Alexander,
> 
> This seems to cause x86_64 allmodconfig builds to fail:

Oh yeah, the sync_single_range definitions shouldn't need touching at 
all, since they're unconditional wrappers around regular sync_single 
invocations (which already may or may not do anything).
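
i.e. they can stay plain unconditional wrappers, along the lines of
(sketch; this mirrors the current definition):

static inline void dma_sync_single_range_for_cpu(struct device *dev,
		dma_addr_t addr, unsigned long offset, size_t size,
		enum dma_data_direction dir)
{
	/* the underlying sync_single already no-ops when not needed */
	dma_sync_single_for_cpu(dev, addr + offset, size, dir);
}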

Thanks,
Robin.

> 
>   ../drivers/media/platform/ti/omap3isp/ispstat.c:82:35: error: ‘dma_sync_single_range_for_device’ undeclared (first use in this function); did you mean ‘dma_sync_sgtable_for_device’?
>      82 |                                   dma_sync_single_range_for_device);
>         |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>         |                                   dma_sync_sgtable_for_device
>   ../drivers/media/platform/ti/omap3isp/ispstat.c:82:35: note: each undeclared identifier is reported only once for each function it appears in
>   ../drivers/media/platform/ti/omap3isp/ispstat.c: In function ‘isp_stat_buf_sync_magic_for_cpu’:
>   ../drivers/media/platform/ti/omap3isp/ispstat.c:94:35: error: ‘dma_sync_single_range_for_cpu’ undeclared (first use in this function); did you mean ‘dma_sync_sgtable_for_cpu’?
>      94 |                                   dma_sync_single_range_for_cpu);
>         |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>         |                                   dma_sync_sgtable_for_cpu

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-01-31 17:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-26 13:54 [PATCH net-next 0/7] dma: skip calling no-op sync ops when possible Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 1/7] dma: compile-out DMA sync op calls when not used Alexander Lobakin
2024-01-29  6:11   ` Christoph Hellwig
2024-01-29 11:07     ` Alexander Lobakin
2024-01-29 12:15       ` Robin Murphy
2024-01-31 16:52   ` Simon Horman
2024-01-31 17:14     ` Robin Murphy
2024-01-26 13:54 ` [PATCH net-next 2/7] dma: avoid expensive redundant calls for sync operations Alexander Lobakin
2024-01-26 15:48   ` Robin Murphy
2024-01-26 16:45     ` Alexander Lobakin
2024-01-26 17:21       ` Robin Murphy
2024-01-26 18:48         ` Petr Tesařík
2024-01-26 19:13           ` Robin Murphy
2024-01-29  6:09             ` Christoph Hellwig
2024-01-29 14:36           ` Alexander Lobakin
2024-01-29 16:15             ` Petr Tesařík
2024-01-29 14:07       ` Alexander Lobakin
2024-01-29 14:29         ` Robin Murphy
2024-01-29 14:34           ` Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 3/7] iommu/dma: avoid expensive indirect " Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 4/7] page_pool: make sure frag API fields don't span between cachelines Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 5/7] page_pool: don't use driver-set flags field directly Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 6/7] page_pool: check for DMA sync shortcut earlier Alexander Lobakin
2024-01-26 13:54 ` [PATCH net-next 7/7] xsk: use generic DMA sync shortcut instead of a custom one Alexander Lobakin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).