* [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping
@ 2015-03-10 17:43 Govindarajulu Varadarajan
  2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan
  2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan
  0 siblings, 2 replies; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

The following series tries to address these two problems in dma buffer allocation.

* Memory wastage because of large 9k allocation using kmalloc:
  For a 9k dma buffer, netdev_alloc_skb_ip_align internally calls kmalloc for
  size > 4096. For a 9k buffer, kmalloc returns an order-2 page, i.e. 16k.
  We use only ~9k of that 16k, so 7k of memory is wasted per buffer. Using
  the frag allocator in patch 1/2, we can fit three 9k buffers in one 32k
  page. A typical enic configuration has 8 rq, each with a desc ring of size
  4096. That is 8 * 4096 * (16*1024) = 512 MB. Using this frag allocator:
  8 * 4096 * (32*1024/3) = 341 MB. That is a saving of 171 MB of memory.

* Frequent dma_map() calls:
  We call dma_map() for every buffer we allocate. When the iommu is on, this
  is very time consuming. From my testing, most of the cpu cycles are wasted
  spinning on spin_lock_irqsave(&iovad->iova_rbtree_lock, flags) in
  intel_map_page() .. -> ..__alloc_and_insert_iova_range()

With this patch, we call dma_map() once per 32k page, i.e. once for every
three 9k descriptors, and once for every twenty 1500-byte descriptors.
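
Spelling the arithmetic out (per the configuration above; ~10922 bytes is
simply 32k/3, i.e. three 9k-plus-overhead buffers packed into one order-3
page):

  kmalloc path:   8 rq * 4096 desc * 16384 B (order-2 page each)  ~= 512 MB
  frag allocator: 8 rq * 4096 desc * 10922 B (32k page / 3)       ~= 341 MB
  saving:                                                         ~= 171 MB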

Here are my test results with 8 rq, a ring size of 4096 and 9k mtu. The irq
of each rq is affinitized to a different CPU. I ran iperf with 32 threads.
The link is 10G and the iommu is on.

		CPU utilization		throughput
without patch	100%			1.8 Gbps
with patch	13%			9.8 Gbps

v3:
Make this facility more generic so that other drivers can use it.

v2:
Remove changing order facility

Govindarajulu Varadarajan (2):
  net: implement dma cache skb allocator
  enic: use netdev_dma_alloc

 drivers/net/ethernet/cisco/enic/enic_main.c |  31 ++---
 drivers/net/ethernet/cisco/enic/vnic_rq.c   |   3 +
 drivers/net/ethernet/cisco/enic/vnic_rq.h   |   3 +
 include/linux/skbuff.h                      |  22 +++
 net/core/skbuff.c                           | 209 ++++++++++++++++++++++++++++
 5 files changed, 246 insertions(+), 22 deletions(-)

-- 
2.3.2

* [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 17:43 [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping Govindarajulu Varadarajan
@ 2015-03-10 17:43 ` Govindarajulu Varadarajan
  2015-03-10 20:33   ` Alexander Duyck
  2015-03-14 20:08   ` Ben Hutchings
  2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan
  1 sibling, 2 replies; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

This patch implements a dma cache skb allocator. It is based on the
__alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c.

In addition to frag allocation from an order(3) page, as in __alloc_page_frag,
we also maintain the dma address of the page. While allocating a frag for an
skb we use va + offset for the virtual address of the frag, and pa + offset
for the dma address of the frag. This reduces the number of dma_map() calls
to one per three buffers for 9000 bytes and one per twenty buffers for 1500
bytes.
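
In other words, for each frag handed out (a sketch of the address arithmetic
only; nc here is the per-page cache node, struct netdev_dma_node, added by
this patch):

	frag_va  = nc->va       + nc->frag.offset;	/* CPU address of the frag */
	frag_dma = nc->dma_addr + nc->frag.offset;	/* bus address of the frag */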

__alloc_page_frag is limited to a max buffer size of PAGE_SIZE, i.e. 4096 in
most cases. So a 9k buffer allocation goes through kmalloc, which returns an
order-2 page, 16k. We waste 7k bytes for every 9k buffer.

We maintain a dma_count variable which is incremented when we allocate a
frag. netdev_dma_frag_unmap() decrements dma_count and unmaps the page when
there are no more users of that page.

This reduces the memory utilization for 9k mtu by 33%.

The user of the dma cache allocator first calls netdev_dma_init() to
initialize the dma_head structure. The user can then allocate a dma frag by
calling netdev_dma_alloc_skb(); the allocated skb is returned and its dma
address is stored in the *dma_addr pointer passed to the function. Every call
to this function should have a corresponding call to netdev_dma_frag_unmap()
to unmap the page.

Finally, netdev_dma_destroy() is called to clean up the dma cache. Before
calling destroy(), the user should have unmapped all the allocated skbs.

All calls to these functions should be mutually exclusive. It is the
responsibility of the caller to satisfy this condition. Since this will
mostly be used for rx buffer allocation, where unmap and allocation are
serialized, we can avoid using locks.
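
A minimal usage sketch from a driver rx path (error handling trimmed;
rx_buf_len, pdev, napi and the descriptor bookkeeping are placeholders and
not part of this patch):

	struct netdev_dma_head nc_head;
	struct netdev_dma_node *nc;
	dma_addr_t dma_addr;
	struct sk_buff *skb;

	/* once per rx queue, e.g. when the device is opened */
	netdev_dma_init(&nc_head, &pdev->dev, GFP_ATOMIC);

	/* posting a buffer: one allocation per descriptor, but the page
	 * behind it is dma-mapped only once
	 */
	skb = netdev_dma_alloc_skb(&nc_head, &nc, &dma_addr, rx_buf_len);
	if (!skb)
		return -ENOMEM;
	/* hand dma_addr to the hw; remember skb and nc with the descriptor */

	/* on completion: drop this frag's reference; the page is unmapped
	 * when its last frag is released
	 */
	netdev_dma_frag_unmap(&nc_head, nc);
	napi_gro_receive(napi, skb);

	/* at teardown, after every outstanding frag has been unmapped */
	netdev_dma_destroy(&nc_head);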

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
---
 include/linux/skbuff.h |  22 ++++++
 net/core/skbuff.c      | 209 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bba1330..ae2d024 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2153,6 +2153,28 @@ static inline struct sk_buff *dev_alloc_skb(unsigned int length)
 	return netdev_alloc_skb(NULL, length);
 }
 
+struct netdev_dma_node {
+	struct page_frag	frag;
+	unsigned int		pagecnt_bias;
+	int			dma_count;
+	void			*va;
+	dma_addr_t		dma_addr;
+};
+
+struct netdev_dma_head {
+	struct netdev_dma_node	*nc;
+	struct device		*dev;
+	gfp_t			gfp;
+};
+
+void netdev_dma_destroy(struct netdev_dma_head *nc_head);
+void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head,
+			   struct netdev_dma_node *nc);
+void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev,
+		     gfp_t gfp);
+struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head,
+				     struct netdev_dma_node **nc_node,
+				     dma_addr_t *dma_addr, size_t sz);
 
 static inline struct sk_buff *__netdev_alloc_skb_ip_align(struct net_device *dev,
 		unsigned int length, gfp_t gfp)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 47c3241..ec3c46c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -79,6 +79,7 @@
 
 struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
+struct kmem_cache *netdev_dma_cache __read_mostly;
 
 /**
  *	skb_panic - private function for out-of-line support
@@ -361,6 +362,210 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
 	return page;
 }
 
+static void __dma_cache_refill(struct netdev_dma_head *nc_head, size_t sz)
+{
+	struct netdev_dma_node *nc;
+	gfp_t gfp_comp = nc_head->gfp | __GFP_COMP | __GFP_NOWARN |
+			 __GFP_NORETRY;
+	u8 order = NETDEV_FRAG_PAGE_MAX_ORDER;
+
+	nc = kmem_cache_alloc(netdev_dma_cache, GFP_ATOMIC);
+	nc_head->nc = nc;
+	if (unlikely(!nc))
+		return;
+
+	if (unlikely(sz > (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)))
+		goto precise_order;
+
+	nc->frag.page = alloc_pages_node(NUMA_NO_NODE, gfp_comp, order);
+	if (unlikely(!nc->frag.page)) {
+precise_order:
+		order = get_order(sz);
+		nc->frag.page = alloc_pages_node(NUMA_NO_NODE, nc_head->gfp,
+						 order);
+		if (!nc->frag.page)
+			goto free_nc;
+	}
+
+	nc->frag.size = (PAGE_SIZE << order);
+	nc->dma_addr = dma_map_page(nc_head->dev, nc->frag.page, 0,
+				    nc->frag.size, DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(nc_head->dev, nc->dma_addr)))
+		goto free_page;
+	nc->va = page_address(nc->frag.page);
+	atomic_add(nc->frag.size - 1, &nc->frag.page->_count);
+	nc->pagecnt_bias = nc->frag.size;
+	nc->frag.offset = nc->frag.size;
+	nc->dma_count = 0;
+
+	return;
+
+free_page:
+	__free_pages(nc->frag.page, order);
+free_nc:
+	kmem_cache_free(netdev_dma_cache, nc);
+	nc_head->nc = NULL;
+}
+
+static struct netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head,
+						size_t sz)
+{
+	struct netdev_dma_node *nc = nc_head->nc;
+	int offset;
+
+	if (unlikely(!nc)) {
+refill:
+		__dma_cache_refill(nc_head, sz);
+		nc = nc_head->nc;
+		if (unlikely(!nc))
+			return NULL;
+	}
+
+	offset = nc->frag.offset - sz;
+	if (offset < 0) {
+		if (!atomic_sub_and_test(nc->pagecnt_bias,
+					 &nc->frag.page->_count)) {
+			/* the caller has processed all the frags belonging to
+			 * this page. Since page->_count is not 0 and
+			 * nc->dma_count is 0 these frags should be in stack.
+			 * We should unmap the page here.
+			 */
+			if (!nc->dma_count) {
+				dma_unmap_single(nc_head->dev, nc->dma_addr,
+						 nc->frag.size,
+						 DMA_FROM_DEVICE);
+				kmem_cache_free(netdev_dma_cache, nc);
+			} else {
+			/* frags from this page are used by the caller. Let the
+			 * caller unmap the page using netdev_dma_frag_unmap().
+			 */
+				nc->pagecnt_bias = 0;
+			}
+			goto refill;
+		}
+		WARN_ON(nc->dma_count);
+		atomic_set(&nc->frag.page->_count, nc->frag.size);
+		nc->pagecnt_bias = nc->frag.size;
+		offset = nc->frag.size - sz;
+	}
+	nc->pagecnt_bias--;
+	nc->dma_count++;
+	nc->frag.offset = offset;
+
+	return nc;
+}
+
+/* netdev_dma_alloc_skb - alloc skb with dma mapped data
+ * @nc_head:	initialized netdev_dma_head from which allocation should be
+ *		done.
+ * @nc_node:	Corresponding dma_node is stored in this. The caller needs to
+ *		save this and use it for netdev_dma_frag_unmap() to unmap
+ *		page frag later.
+ * @dma_addr:	dma address of the data is stored here.
+ * @sz:		size of data needed.
+ *
+ * Allocates skb from dma cached page. This function returns allocated skb_buff
+ * from the provided netdev_dma_head cache. It stores the dma address in
+ * dma_addr.
+ *
+ * All the calls to netdev_dma_alloc_skb(), netdev_dma_frag_unmap(),
+ * netdev_dma_init(), netdev_dma_destroy() should be mutually exclusive for a
+ * dma_head. It is the responsibility of the caller to satisfy this condition.
+ * Since this is mostly used for rx buffer allocation, where unmap,allocation
+ * is serialized we can avoid using locks.
+ */
+struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head,
+				     struct netdev_dma_node **nc_node,
+				     dma_addr_t *dma_addr, size_t sz)
+{
+	struct sk_buff *skb;
+	struct netdev_dma_node *nc;
+	void *va;
+
+	sz = sz + NET_IP_ALIGN + NET_SKB_PAD;
+	sz = SKB_DATA_ALIGN(sz) +
+	     SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	nc = netdev_dma_alloc(nc_head, sz);
+	if (unlikely(!nc))
+		return NULL;
+	va = nc->va + nc->frag.offset;
+	skb = build_skb(va, sz);
+	if (unlikely(!skb)) {
+		nc->pagecnt_bias++;
+		nc->frag.offset += sz;
+		nc->dma_count--;
+
+		return NULL;
+	}
+
+	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
+	*dma_addr = nc->dma_addr + nc->frag.offset + NET_SKB_PAD + NET_IP_ALIGN;
+	*nc_node = nc;
+
+	return skb;
+}
+EXPORT_SYMBOL_GPL(netdev_dma_alloc_skb);
+
+/* netdev_dma_init - initialize dma cache head
+ * @nc_head:	dma_head to be initialized
+ * @dev:	device for dma map
+ * @gfp:	gfp flagg for allocating page
+ *
+ * initializes dma_head with provided gfp flags and device
+ */
+void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev,
+		     gfp_t gfp)
+{
+	nc_head->nc = NULL;
+	nc_head->dev = dev;
+	nc_head->gfp = gfp;
+}
+EXPORT_SYMBOL_GPL(netdev_dma_init);
+
+/* netdev_dma_frag_unmap - unmap page
+ * @nc_head:	dma_head of page frag allocator
+ * @nc:		dma_node to be unmapped
+ *
+ * unmaps requested page if there are no dma references to the page
+ */
+void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head,
+			   struct netdev_dma_node *nc)
+{
+	/* If nc is not used by dma_head. We should be free to unmap the
+	 * page if there are no pending dma frags addr used by the caller.
+	 */
+	if (!--nc->dma_count && !nc->pagecnt_bias) {
+		dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size,
+			       DMA_FROM_DEVICE);
+		kmem_cache_free(netdev_dma_cache, nc);
+	}
+}
+EXPORT_SYMBOL_GPL(netdev_dma_frag_unmap);
+
+/* netdev_dma_destroy - destroy dma_head
+ * @nc_head:	dma_head to be destroyed
+ *
+ * unmap the current dma_page and free the page. Caller should have called
+ * netdev_dma_frag_unmap() for all the allocated frages before calling this
+ * function.
+ */
+void netdev_dma_destroy(struct netdev_dma_head *nc_head)
+{
+	struct netdev_dma_node *nc = nc_head->nc;
+
+	if (unlikely(!nc))
+		return;
+	if (WARN_ON(nc->dma_count))
+		return;
+	dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size,
+		       DMA_FROM_DEVICE);
+	atomic_sub(nc->pagecnt_bias - 1, &nc->frag.page->_count);
+	__free_pages(nc->frag.page, get_order(nc->frag.size));
+	kmem_cache_free(netdev_dma_cache, nc);
+	nc_head->nc = NULL;
+}
+EXPORT_SYMBOL_GPL(netdev_dma_destroy);
+
 static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache,
 			       unsigned int fragsz, gfp_t gfp_mask)
 {
@@ -3324,6 +3529,10 @@ void __init skb_init(void)
 						0,
 						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 						NULL);
+	netdev_dma_cache = kmem_cache_create("netdev_dma_cache",
+					     sizeof(struct netdev_dma_node),
+					     0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
+					     NULL);
 }
 
 /**
-- 
2.3.2

* [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-10 17:43 [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping Govindarajulu Varadarajan
  2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan
@ 2015-03-10 17:43 ` Govindarajulu Varadarajan
  2015-03-10 20:14   ` Alexander Duyck
  1 sibling, 1 reply; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

This patch uses the dma cache skb allocator for rx buffers.

netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb()
and netdev_dma_frag_unmap() happen in napi_poll and are serialized.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
---
 drivers/net/ethernet/cisco/enic/enic_main.c | 31 +++++++++--------------------
 drivers/net/ethernet/cisco/enic/vnic_rq.c   |  3 +++
 drivers/net/ethernet/cisco/enic/vnic_rq.h   |  3 +++
 3 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 204bd182..3be5bc12 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -952,13 +952,9 @@ nla_put_failure:
 
 static void enic_free_rq_buf(struct vnic_rq *rq, struct vnic_rq_buf *buf)
 {
-	struct enic *enic = vnic_dev_priv(rq->vdev);
-
 	if (!buf->os_buf)
 		return;
-
-	pci_unmap_single(enic->pdev, buf->dma_addr,
-		buf->len, PCI_DMA_FROMDEVICE);
+	netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
 	dev_kfree_skb_any(buf->os_buf);
 	buf->os_buf = NULL;
 }
@@ -979,17 +975,10 @@ static int enic_rq_alloc_buf(struct vnic_rq *rq)
 
 		return 0;
 	}
-	skb = netdev_alloc_skb_ip_align(netdev, len);
+	skb = netdev_dma_alloc_skb(&rq->nc_head, &buf->nc, &dma_addr, len);
 	if (!skb)
 		return -ENOMEM;
 
-	dma_addr = pci_map_single(enic->pdev, skb->data, len,
-				  PCI_DMA_FROMDEVICE);
-	if (unlikely(enic_dma_map_check(enic, dma_addr))) {
-		dev_kfree_skb(skb);
-		return -ENOMEM;
-	}
-
 	enic_queue_rq_desc(rq, skb, os_buf_index,
 		dma_addr, len);
 
@@ -1016,8 +1005,6 @@ static bool enic_rxcopybreak(struct net_device *netdev, struct sk_buff **skb,
 	new_skb = netdev_alloc_skb_ip_align(netdev, len);
 	if (!new_skb)
 		return false;
-	pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, len,
-				    DMA_FROM_DEVICE);
 	memcpy(new_skb->data, (*skb)->data, len);
 	*skb = new_skb;
 
@@ -1065,8 +1052,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
 				enic->rq_truncated_pkts++;
 		}
 
-		pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
-				 PCI_DMA_FROMDEVICE);
+		netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
 		dev_kfree_skb_any(skb);
 		buf->os_buf = NULL;
 
@@ -1078,10 +1064,11 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
 		/* Good receive
 		 */
 
+		pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr,
+					    bytes_written, DMA_FROM_DEVICE);
 		if (!enic_rxcopybreak(netdev, &skb, buf, bytes_written)) {
 			buf->os_buf = NULL;
-			pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
-					 PCI_DMA_FROMDEVICE);
+			netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
 		}
 		prefetch(skb->data - NET_IP_ALIGN);
 
@@ -1122,9 +1109,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
 
 		/* Buffer overflow
 		 */
-
-		pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
-				 PCI_DMA_FROMDEVICE);
+		netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
 		dev_kfree_skb_any(skb);
 		buf->os_buf = NULL;
 	}
@@ -1648,6 +1633,8 @@ static int enic_open(struct net_device *netdev)
 	}
 
 	for (i = 0; i < enic->rq_count; i++) {
+		netdev_dma_init(&enic->rq[i].nc_head, &enic->pdev->dev,
+				GFP_ATOMIC);
 		vnic_rq_fill(&enic->rq[i], enic_rq_alloc_buf);
 		/* Need at least one buffer on ring to get going */
 		if (vnic_rq_desc_used(&enic->rq[i]) == 0) {
diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c b/drivers/net/ethernet/cisco/enic/vnic_rq.c
index 36a2ed6..afa1d71 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_rq.c
+++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c
@@ -23,6 +23,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/skbuff.h>
 
 #include "vnic_dev.h"
 #include "vnic_rq.h"
@@ -199,6 +200,8 @@ void vnic_rq_clean(struct vnic_rq *rq,
 		rq->ring.desc_avail++;
 	}
 
+	netdev_dma_destroy(&rq->nc_head);
+
 	/* Use current fetch_index as the ring starting point */
 	fetch_index = ioread32(&rq->ctrl->fetch_index);
 
diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.h b/drivers/net/ethernet/cisco/enic/vnic_rq.h
index 8111d52..d4ee963 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_rq.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_rq.h
@@ -21,6 +21,7 @@
 #define _VNIC_RQ_H_
 
 #include <linux/pci.h>
+#include <linux/skbuff.h>
 
 #include "vnic_dev.h"
 #include "vnic_cq.h"
@@ -73,6 +74,7 @@ struct vnic_rq_buf {
 	unsigned int index;
 	void *desc;
 	uint64_t wr_id;
+	struct netdev_dma_node *nc;
 };
 
 struct vnic_rq {
@@ -100,6 +102,7 @@ struct vnic_rq {
 	unsigned int bpoll_state;
 	spinlock_t bpoll_lock;
 #endif /* CONFIG_NET_RX_BUSY_POLL */
+	struct netdev_dma_head nc_head;
 };
 
 static inline unsigned int vnic_rq_desc_avail(struct vnic_rq *rq)
-- 
2.3.2

* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan
@ 2015-03-10 20:14   ` Alexander Duyck
  2015-03-11  9:27     ` Govindarajulu Varadarajan
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2015-03-10 20:14 UTC (permalink / raw)
  To: Govindarajulu Varadarajan, davem, netdev; +Cc: ssujith, benve


On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
> This patches uses dma cache skb allocator fot rx buffers.
>
> netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb() and
> netdev_dma_frag_unmap() happens in napi_poll and they are serialized.
>
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>

This isn't going to work. The problem is that, the way you are using your 
fragments, you can end up with memory corruption, as the frame headers 
that were updated by the stack may be reverted for any frames received 
before the last frame was unmapped.  I ran into that issue when I was 
doing page reuse with build_skb on the Intel drivers and I suspect you 
will see the same issue.

The way to work around it is to receive the data into the fragments, 
and then pull the headers out and store them in a separate skb via 
something similar to copy-break.  You can then track the fragments in frags.
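
Something along these lines (a sketch only; hdr_len, va, page, offset, len
and truesize come from the driver's completion handling):

	/* small private skb for the headers, shared page frag for the rest */
	skb = netdev_alloc_skb_ip_align(netdev, hdr_len);
	if (!skb)
		return NULL;
	memcpy(skb_put(skb, hdr_len), va, hdr_len);
	if (len > hdr_len)
		skb_add_rx_frag(skb, 0, page, offset + hdr_len,
				len - hdr_len, truesize);
	/* anything the stack may modify lives in the private skb; the shared
	 * page is only referenced as a fragment
	 */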

> ---
>   drivers/net/ethernet/cisco/enic/enic_main.c | 31 +++++++++--------------------
>   drivers/net/ethernet/cisco/enic/vnic_rq.c   |  3 +++
>   drivers/net/ethernet/cisco/enic/vnic_rq.h   |  3 +++
>   3 files changed, 15 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
> index 204bd182..3be5bc12 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
> @@ -952,13 +952,9 @@ nla_put_failure:
>   
>   static void enic_free_rq_buf(struct vnic_rq *rq, struct vnic_rq_buf *buf)
>   {
> -	struct enic *enic = vnic_dev_priv(rq->vdev);
> -
>   	if (!buf->os_buf)
>   		return;
> -
> -	pci_unmap_single(enic->pdev, buf->dma_addr,
> -		buf->len, PCI_DMA_FROMDEVICE);
> +	netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
>   	dev_kfree_skb_any(buf->os_buf);
>   	buf->os_buf = NULL;
>   }
> @@ -979,17 +975,10 @@ static int enic_rq_alloc_buf(struct vnic_rq *rq)
>   
>   		return 0;
>   	}
> -	skb = netdev_alloc_skb_ip_align(netdev, len);
> +	skb = netdev_dma_alloc_skb(&rq->nc_head, &buf->nc, &dma_addr, len);
>   	if (!skb)
>   		return -ENOMEM;
>   
> -	dma_addr = pci_map_single(enic->pdev, skb->data, len,
> -				  PCI_DMA_FROMDEVICE);
> -	if (unlikely(enic_dma_map_check(enic, dma_addr))) {
> -		dev_kfree_skb(skb);
> -		return -ENOMEM;
> -	}
> -
>   	enic_queue_rq_desc(rq, skb, os_buf_index,
>   		dma_addr, len);
>   

I'm curious why you are still using skbs as your data type for receiving 
frames before they come in.  Why not just store a pointer to your dma 
buffer and hold off on allocating the sk_buff until you have actually 
received the frame in the buffer?  It would save you something like 256B 
per frame if you just hold off on the allocation until the skb is really 
needed.

> @@ -1016,8 +1005,6 @@ static bool enic_rxcopybreak(struct net_device *netdev, struct sk_buff **skb,
>   	new_skb = netdev_alloc_skb_ip_align(netdev, len);
>   	if (!new_skb)
>   		return false;
> -	pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, len,
> -				    DMA_FROM_DEVICE);
>   	memcpy(new_skb->data, (*skb)->data, len);
>   	*skb = new_skb;
>   
> @@ -1065,8 +1052,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
>   				enic->rq_truncated_pkts++;
>   		}
>   
> -		pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
> -				 PCI_DMA_FROMDEVICE);
> +		netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
>   		dev_kfree_skb_any(skb);
>   		buf->os_buf = NULL;
>   
> @@ -1078,10 +1064,11 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
>   		/* Good receive
>   		 */
>   
> +		pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr,
> +					    bytes_written, DMA_FROM_DEVICE);
>   		if (!enic_rxcopybreak(netdev, &skb, buf, bytes_written)) {
>   			buf->os_buf = NULL;
> -			pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
> -					 PCI_DMA_FROMDEVICE);
> +			netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
>   		}
>   		prefetch(skb->data - NET_IP_ALIGN);
>   

It looks like you already have copy-break code in your codepath.  It 
might be worth taking a look at what you would gain by deferring the skb 
allocation and using the copy-break code path to take care of small 
frames and headers for larger frames.

> @@ -1122,9 +1109,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq,
>   
>   		/* Buffer overflow
>   		 */
> -
> -		pci_unmap_single(enic->pdev, buf->dma_addr, buf->len,
> -				 PCI_DMA_FROMDEVICE);
> +		netdev_dma_frag_unmap(&rq->nc_head, buf->nc);
>   		dev_kfree_skb_any(skb);
>   		buf->os_buf = NULL;
>   	}
> @@ -1648,6 +1633,8 @@ static int enic_open(struct net_device *netdev)
>   	}
>   
>   	for (i = 0; i < enic->rq_count; i++) {
> +		netdev_dma_init(&enic->rq[i].nc_head, &enic->pdev->dev,
> +				GFP_ATOMIC);
>   		vnic_rq_fill(&enic->rq[i], enic_rq_alloc_buf);
>   		/* Need at least one buffer on ring to get going */
>   		if (vnic_rq_desc_used(&enic->rq[i]) == 0) {
> diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> index 36a2ed6..afa1d71 100644
> --- a/drivers/net/ethernet/cisco/enic/vnic_rq.c
> +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> @@ -23,6 +23,7 @@
>   #include <linux/pci.h>
>   #include <linux/delay.h>
>   #include <linux/slab.h>
> +#include <linux/skbuff.h>
>   
>   #include "vnic_dev.h"
>   #include "vnic_rq.h"
> @@ -199,6 +200,8 @@ void vnic_rq_clean(struct vnic_rq *rq,
>   		rq->ring.desc_avail++;
>   	}
>   
> +	netdev_dma_destroy(&rq->nc_head);
> +
>   	/* Use current fetch_index as the ring starting point */
>   	fetch_index = ioread32(&rq->ctrl->fetch_index);
>   
> diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.h b/drivers/net/ethernet/cisco/enic/vnic_rq.h
> index 8111d52..d4ee963 100644
> --- a/drivers/net/ethernet/cisco/enic/vnic_rq.h
> +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.h
> @@ -21,6 +21,7 @@
>   #define _VNIC_RQ_H_
>   
>   #include <linux/pci.h>
> +#include <linux/skbuff.h>
>   
>   #include "vnic_dev.h"
>   #include "vnic_cq.h"
> @@ -73,6 +74,7 @@ struct vnic_rq_buf {
>   	unsigned int index;
>   	void *desc;
>   	uint64_t wr_id;
> +	struct netdev_dma_node *nc;
>   };
>   
>   struct vnic_rq {
> @@ -100,6 +102,7 @@ struct vnic_rq {
>   	unsigned int bpoll_state;
>   	spinlock_t bpoll_lock;
>   #endif /* CONFIG_NET_RX_BUSY_POLL */
> +	struct netdev_dma_head nc_head;
>   };
>   
>   static inline unsigned int vnic_rq_desc_avail(struct vnic_rq *rq)

* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan
@ 2015-03-10 20:33   ` Alexander Duyck
  2015-03-11  8:57     ` Govindarajulu Varadarajan
  2015-03-11 17:06     ` David Laight
  2015-03-14 20:08   ` Ben Hutchings
  1 sibling, 2 replies; 14+ messages in thread
From: Alexander Duyck @ 2015-03-10 20:33 UTC (permalink / raw)
  To: Govindarajulu Varadarajan, davem, netdev; +Cc: ssujith, benve


On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
> This patch implements dma cache skb allocator. This is based on
> __alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c
>
> In addition to frag allocation from order(3) page in __alloc_page_frag,
> we also maintain dma address of the page. While allocating a frag for skb
> we use va + offset for virtual address of the frag, and pa + offset for
> dma address of the frag. This reduces the number of calls to dma_map() by 1/3
> for 9000 bytes and by 1/20 for 1500 bytes.
>
> __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e 4096 in most
> of the cases. So 9k buffer allocation goes through kmalloc which return
> page of order 2, 16k. We waste 7k bytes for every 9k buffer.

The question I would have is do you actually need to have the 9k 
buffer?  Does the hardware support any sort of scatter-gather receive?  
If so that would be preferable as the 9k allocation per skb will have 
significant overhead when you start receiving small packets.

A classic example is a TCP flow where you are only receiving a few 
hundred bytes per frame.  You will take a huge truesize penalty for 
allocating a 9k skb for a frame of only a few hundred bytes, though it 
sounds like you are taking that hit already.

> we maintain dma_count variable which is incremented when we allocate a frag.
> netdev_dma_frag_unmap() will decrement the dma_count and unmap it when there is
> no user of that page.
>
> This reduces the memory utilization for 9k mtu by 33%.
>
> The user of dma cache allocator first calls netdev_dma_init() to initialize the
> dma_head structure. User can allocate a dma frag by calling
> netdev_dma_alloc_skb(), allocated skb is returned and dma_addr is stored in the
> pointer *dma_addr passed to the function. Every call to this function should
> have corresponding call to netdev_dma_frag_unmap() to unmap the page.
>
> At last netdev_dma_destroy() is called to clean to dma_cache. Before calling
> destroy(), user should have unmapped all the skb allocated.
>
> All calls to these function should be mutually exclusive. It is the
> responsibility of the caller to satisfy this condition. Since this will be
> mostly used for rx buffer allocation, where unmap,allocation is serialized, we
> can avoid using locks.
>
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
> ---
>   include/linux/skbuff.h |  22 ++++++
>   net/core/skbuff.c      | 209 +++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 231 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bba1330..ae2d024 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2153,6 +2153,28 @@ static inline struct sk_buff *dev_alloc_skb(unsigned int length)
>   	return netdev_alloc_skb(NULL, length);
>   }
>   
> +struct netdev_dma_node {
> +	struct page_frag	frag;
> +	unsigned int		pagecnt_bias;
> +	int			dma_count;
> +	void			*va;
> +	dma_addr_t		dma_addr;
> +};
> +
> +struct netdev_dma_head {
> +	struct netdev_dma_node	*nc;
> +	struct device		*dev;
> +	gfp_t			gfp;
> +};
> +
> +void netdev_dma_destroy(struct netdev_dma_head *nc_head);
> +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head,
> +			   struct netdev_dma_node *nc);
> +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev,
> +		     gfp_t gfp);
> +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head,
> +				     struct netdev_dma_node **nc_node,
> +				     dma_addr_t *dma_addr, size_t sz);
>   
>   static inline struct sk_buff *__netdev_alloc_skb_ip_align(struct net_device *dev,
>   		unsigned int length, gfp_t gfp)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 47c3241..ec3c46c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -79,6 +79,7 @@
>   
>   struct kmem_cache *skbuff_head_cache __read_mostly;
>   static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> +struct kmem_cache *netdev_dma_cache __read_mostly;
>   
>   /**
>    *	skb_panic - private function for out-of-line support
> @@ -361,6 +362,210 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
>   	return page;
>   }
>   
> +static void __dma_cache_refill(struct netdev_dma_head *nc_head, size_t sz)
> +{
> +	struct netdev_dma_node *nc;
> +	gfp_t gfp_comp = nc_head->gfp | __GFP_COMP | __GFP_NOWARN |
> +			 __GFP_NORETRY;
> +	u8 order = NETDEV_FRAG_PAGE_MAX_ORDER;
> +
> +	nc = kmem_cache_alloc(netdev_dma_cache, GFP_ATOMIC);
> +	nc_head->nc = nc;
> +	if (unlikely(!nc))
> +		return;
> +
> +	if (unlikely(sz > (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER)))
> +		goto precise_order;
> +
> +	nc->frag.page = alloc_pages_node(NUMA_NO_NODE, gfp_comp, order);
> +	if (unlikely(!nc->frag.page)) {
> +precise_order:
> +		order = get_order(sz);
> +		nc->frag.page = alloc_pages_node(NUMA_NO_NODE, nc_head->gfp,
> +						 order);
> +		if (!nc->frag.page)
> +			goto free_nc;
> +	}
> +
> +	nc->frag.size = (PAGE_SIZE << order);
> +	nc->dma_addr = dma_map_page(nc_head->dev, nc->frag.page, 0,
> +				    nc->frag.size, DMA_FROM_DEVICE);
> +	if (unlikely(dma_mapping_error(nc_head->dev, nc->dma_addr)))
> +		goto free_page;
> +	nc->va = page_address(nc->frag.page);
> +	atomic_add(nc->frag.size - 1, &nc->frag.page->_count);
> +	nc->pagecnt_bias = nc->frag.size;
> +	nc->frag.offset = nc->frag.size;
> +	nc->dma_count = 0;
> +
> +	return;
> +
> +free_page:
> +	__free_pages(nc->frag.page, order);
> +free_nc:
> +	kmem_cache_free(netdev_dma_cache, nc);
> +	nc_head->nc = NULL;
> +}
> +
> +static struct netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head,
> +						size_t sz)
> +{
> +	struct netdev_dma_node *nc = nc_head->nc;
> +	int offset;
> +
> +	if (unlikely(!nc)) {
> +refill:
> +		__dma_cache_refill(nc_head, sz);
> +		nc = nc_head->nc;
> +		if (unlikely(!nc))
> +			return NULL;
> +	}
> +
> +	offset = nc->frag.offset - sz;
> +	if (offset < 0) {
> +		if (!atomic_sub_and_test(nc->pagecnt_bias,
> +					 &nc->frag.page->_count)) {
> +			/* the caller has processed all the frags belonging to
> +			 * this page. Since page->_count is not 0 and
> +			 * nc->dma_count is 0 these frags should be in stack.
> +			 * We should unmap the page here.
> +			 */
> +			if (!nc->dma_count) {
> +				dma_unmap_single(nc_head->dev, nc->dma_addr,
> +						 nc->frag.size,
> +						 DMA_FROM_DEVICE);
> +				kmem_cache_free(netdev_dma_cache, nc);
> +			} else {
> +			/* frags from this page are used by the caller. Let the
> +			 * caller unmap the page using netdev_dma_frag_unmap().
> +			 */
> +				nc->pagecnt_bias = 0;
> +			}
> +			goto refill;
> +		}
> +		WARN_ON(nc->dma_count);
> +		atomic_set(&nc->frag.page->_count, nc->frag.size);
> +		nc->pagecnt_bias = nc->frag.size;
> +		offset = nc->frag.size - sz;
> +	}
> +	nc->pagecnt_bias--;
> +	nc->dma_count++;
> +	nc->frag.offset = offset;
> +
> +	return nc;
> +}
> +
> +/* netdev_dma_alloc_skb - alloc skb with dma mapped data
> + * @nc_head:	initialized netdev_dma_head from which allocation should be
> + *		done.
> + * @nc_node:	Corresponding dma_node is stored in this. The caller needs to
> + *		save this and use it for netdev_dma_frag_unmap() to unmap
> + *		page frag later.
> + * @dma_addr:	dma address of the data is stored here.
> + * @sz:		size of data needed.
> + *
> + * Allocates skb from dma cached page. This function returns allocated skb_buff
> + * from the provided netdev_dma_head cache. It stores the dma address in
> + * dma_addr.
> + *
> + * All the calls to netdev_dma_alloc_skb(), netdev_dma_frag_unmap(),
> + * netdev_dma_init(), netdev_dma_destroy() should be mutually exclusive for a
> + * dma_head. It is the responsibility of the caller to satisfy this condition.
> + * Since this is mostly used for rx buffer allocation, where unmap,allocation
> + * is serialized we can avoid using locks.
> + */
> +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head,
> +				     struct netdev_dma_node **nc_node,
> +				     dma_addr_t *dma_addr, size_t sz)
> +{
> +	struct sk_buff *skb;
> +	struct netdev_dma_node *nc;
> +	void *va;
> +
> +	sz = sz + NET_IP_ALIGN + NET_SKB_PAD;
> +	sz = SKB_DATA_ALIGN(sz) +
> +	     SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	nc = netdev_dma_alloc(nc_head, sz);
> +	if (unlikely(!nc))
> +		return NULL;
> +	va = nc->va + nc->frag.offset;
> +	skb = build_skb(va, sz);
> +	if (unlikely(!skb)) {
> +		nc->pagecnt_bias++;
> +		nc->frag.offset += sz;
> +		nc->dma_count--;
> +
> +		return NULL;
> +	}
> +
> +	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +	*dma_addr = nc->dma_addr + nc->frag.offset + NET_SKB_PAD + NET_IP_ALIGN;
> +	*nc_node = nc;
> +
> +	return skb;
> +}
> +EXPORT_SYMBOL_GPL(netdev_dma_alloc_skb);
> +

This function should be dropped.  It is going to lead to memory 
corruption since you should not modify any of the header data until 
after the unmap has been called for the page(s) in this fragment.

> +/* netdev_dma_init - initialize dma cache head
> + * @nc_head:	dma_head to be initialized
> + * @dev:	device for dma map
> + * @gfp:	gfp flagg for allocating page
> + *
> + * initializes dma_head with provided gfp flags and device
> + */
> +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev,
> +		     gfp_t gfp)
> +{
> +	nc_head->nc = NULL;
> +	nc_head->dev = dev;
> +	nc_head->gfp = gfp;
> +}
> +EXPORT_SYMBOL_GPL(netdev_dma_init);
> +
> +/* netdev_dma_frag_unmap - unmap page
> + * @nc_head:	dma_head of page frag allocator
> + * @nc:		dma_node to be unmapped
> + *
> + * unmaps requested page if there are no dma references to the page
> + */
> +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head,
> +			   struct netdev_dma_node *nc)
> +{
> +	/* If nc is not used by dma_head. We should be free to unmap the
> +	 * page if there are no pending dma frags addr used by the caller.
> +	 */
> +	if (!--nc->dma_count && !nc->pagecnt_bias) {
> +		dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size,
> +			       DMA_FROM_DEVICE);
> +		kmem_cache_free(netdev_dma_cache, nc);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(netdev_dma_frag_unmap);
> +
> +/* netdev_dma_destroy - destroy dma_head
> + * @nc_head:	dma_head to be destroyed
> + *
> + * unmap the current dma_page and free the page. Caller should have called
> + * netdev_dma_frag_unmap() for all the allocated frages before calling this
> + * function.
> + */
> +void netdev_dma_destroy(struct netdev_dma_head *nc_head)
> +{
> +	struct netdev_dma_node *nc = nc_head->nc;
> +
> +	if (unlikely(!nc))
> +		return;
> +	if (WARN_ON(nc->dma_count))
> +		return;
> +	dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size,
> +		       DMA_FROM_DEVICE);
> +	atomic_sub(nc->pagecnt_bias - 1, &nc->frag.page->_count);
> +	__free_pages(nc->frag.page, get_order(nc->frag.size));
> +	kmem_cache_free(netdev_dma_cache, nc);
> +	nc_head->nc = NULL;
> +}
> +EXPORT_SYMBOL_GPL(netdev_dma_destroy);
> +
>   static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache,
>   			       unsigned int fragsz, gfp_t gfp_mask)
>   {
> @@ -3324,6 +3529,10 @@ void __init skb_init(void)
>   						0,
>   						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>   						NULL);
> +	netdev_dma_cache = kmem_cache_create("netdev_dma_cache",
> +					     sizeof(struct netdev_dma_node),
> +					     0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
> +					     NULL);
>   }
>   
>   /**

* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 20:33   ` Alexander Duyck
@ 2015-03-11  8:57     ` Govindarajulu Varadarajan
  2015-03-11 13:55       ` Alexander Duyck
  2015-03-11 17:06     ` David Laight
  1 sibling, 1 reply; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-11  8:57 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Govindarajulu Varadarajan, davem, netdev, ssujith, benve

On Tue, 10 Mar 2015, Alexander Duyck wrote:

>
> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
>> This patch implements dma cache skb allocator. This is based on
>> __alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c
>> 
>> In addition to frag allocation from order(3) page in __alloc_page_frag,
>> we also maintain dma address of the page. While allocating a frag for skb
>> we use va + offset for virtual address of the frag, and pa + offset for
>> dma address of the frag. This reduces the number of calls to dma_map() by 
>> 1/3
>> for 9000 bytes and by 1/20 for 1500 bytes.
>> 
>> __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e 4096 in 
>> most
>> of the cases. So 9k buffer allocation goes through kmalloc which return
>> page of order 2, 16k. We waste 7k bytes for every 9k buffer.
>
> The question I would have is do you actually need to have the 9k buffer? 
> Does the hardware support any sort of scatter-gather receive?  If so that 
> would be preferable as the 9k allocation per skb will have significant 
> overhead when you start receiving small packets.
>

enic hw has limited desc per rq (4096), and we can have only one dma block per
desc. Having sg/header-split will reduce the effective rq ring size for
large packets.

> A classic example is a TCP flow where you are only receiving a few hundred 
> bytes per frame.  You will take a huge truesize penalty for allocating a 9k 
> skb for a frame of only a few hundred bytes, though it sounds like you are 
> taking that hit already.
>

For this we have rx copybreak for pkts < 256 bytes.

* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-10 20:14   ` Alexander Duyck
@ 2015-03-11  9:27     ` Govindarajulu Varadarajan
  2015-03-11 14:00       ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-11  9:27 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Govindarajulu Varadarajan, davem, netdev, ssujith, benve


On Tue, 10 Mar 2015, Alexander Duyck wrote:

>
> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
>> This patches uses dma cache skb allocator fot rx buffers.
>> 
>> netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb() 
>> and
>> netdev_dma_frag_unmap() happens in napi_poll and they are serialized.
>> 
>> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
>
> This isn't going to work. The problem is the way you are using your fragments 
> you can end up with a memory corruption as the frame headers that were 
> updated by the stack may be reverted for any frames received before the last 
> frame was unmapped.  I ran into that issue when I was doing page reuse with 
> build_skb on the Intel drivers and I suspect you will see the same issue.
>

Is this behaviour platform dependent? I tested this patch for more than a month
and I did not face any issue. I ran normal traffic like ssh, nfs and iperf/netperf.
Is there a special scenario when this could occur?

Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this?
Each desc should have a different dma address to write to. Can you explain to
me how this can happen?

> The way to work around it is to receive the data in to the fragments, and 
> then pull the headers out and store them in a separate skb via something 
> similar to copy-break.  You can then track the fragments in frags.
>

If I split the pkt header into another frame, is it guaranteed that the stack
will not modify the pkt data?

Thanks a lot for reviewing this patch.

* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-11  8:57     ` Govindarajulu Varadarajan
@ 2015-03-11 13:55       ` Alexander Duyck
  2015-03-11 15:42         ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2015-03-11 13:55 UTC (permalink / raw)
  To: Govindarajulu Varadarajan, Alexander Duyck; +Cc: davem, netdev, ssujith, benve

On 03/11/2015 01:57 AM, Govindarajulu Varadarajan wrote:
> On Tue, 10 Mar 2015, Alexander Duyck wrote:
>
>>
>> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
>>> This patch implements dma cache skb allocator. This is based on
>>> __alloc_page_frag & __page_frag_refill implementation in
>>> net/core/skbuff.c
>>>
>>> In addition to frag allocation from order(3) page in __alloc_page_frag,
>>> we also maintain dma address of the page. While allocating a frag
>>> for skb
>>> we use va + offset for virtual address of the frag, and pa + offset for
>>> dma address of the frag. This reduces the number of calls to
>>> dma_map() by 1/3
>>> for 9000 bytes and by 1/20 for 1500 bytes.
>>>
>>> __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e
>>> 4096 in most
>>> of the cases. So 9k buffer allocation goes through kmalloc which return
>>> page of order 2, 16k. We waste 7k bytes for every 9k buffer.
>>
>> The question I would have is do you actually need to have the 9k
>> buffer? Does the hardware support any sort of scatter-gather
>> receive?  If so that would be preferable as the 9k allocation per skb
>> will have significant overhead when you start receiving small packets.
>>
>
> enic hw has limited desc per rq (4096), and we can have only one dma
> block per
> desc. Having sg/header-split will reduce the effective rq ring size for
> large packets.

Yes, so?  In the case of the Intel driver they default to only 512
descriptors per ring with a size of 2K per descriptor.  Are you saying
the latency for your interrupt processing is so high that you need to
buffer 9k per descriptor in order to achieve line rate?

>
>> A classic example is a TCP flow where you are only receiving a few
>> hundred bytes per frame.  You will take a huge truesize penalty for
>> allocating a 9k skb for a frame of only a few hundred bytes, though
>> it sounds like you are taking that hit already.
>>
>
> For this we have rx copybreak for pkts < 256 bytes.

That definitely helps but still leaves you open since a 257 byte packet
would be consuming 9K in that case.  Also standard flows with frames of
1514 still take a hard hit with a receive buffer size of 9K.

- Alex

* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-11  9:27     ` Govindarajulu Varadarajan
@ 2015-03-11 14:00       ` Alexander Duyck
  2015-03-11 17:34         ` David Laight
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2015-03-11 14:00 UTC (permalink / raw)
  To: Govindarajulu Varadarajan, Alexander Duyck; +Cc: davem, netdev, ssujith, benve

On 03/11/2015 02:27 AM, Govindarajulu Varadarajan wrote:
>
> On Tue, 10 Mar 2015, Alexander Duyck wrote:
>
>>
>> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote:
>>> This patches uses dma cache skb allocator fot rx buffers.
>>>
>>> netdev_dma_head is initialized per rq. All calls to
>>> netdev_dma_alloc_skb() and
>>> netdev_dma_frag_unmap() happens in napi_poll and they are serialized.
>>>
>>> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
>>
>> This isn't going to work. The problem is the way you are using your
>> fragments you can end up with a memory corruption as the frame
>> headers that were updated by the stack may be reverted for any frames
>> received before the last frame was unmapped.  I ran into that issue
>> when I was doing page reuse with build_skb on the Intel drivers and I
>> suspect you will see the same issue.
>>
>
> Is this behaviour platform dependent? I tested this patch for more
> than a month
> and I did not face any issue. I ran normal traffic like ssh, nfs and
> iperf/netperf.
> Is there a special scenario when this could occur?

Yes, it depends on the platform and IOMMU used.  For an example, take a
look at the SWIOTLB implementation.  I always assumed that if I can work with
that when it is doing bounce buffers, I can work with any IOMMU or platform.

>
> Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this?
> Each desc should have different dma address to write to. Can you
> explain me how
> this can happen?

No that won't help.  The issue is that when the page is mapped you
should not be updating any fields in the page until it is unmapped. 
Since you have multiple buffers mapped to a single page you should be
waiting until the entire page is unmapped.

>
>> The way to work around it is to receive the data in to the fragments,
>> and then pull the headers out and store them in a separate skb via
>> something similar to copy-break.  You can then track the fragments in
>> frags.
>>
>
> If I split the pkt header into another frame, is it guaranteed that
> stack will
> not modify the pkt data?

Paged fragments in the frags list use the page count to determine if
they can update it.  The problem is you cannot use a shared page as
skb->head if you plan to do any DMA mapping with it as it can cause
issues if you change any of the fields before it is unmapped.

>
> Thanks a lot for reviewing this patch.

No problem.  Just glad I saw this before you had to go through reverting
your stuff like I did.

- Alex

* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-11 13:55       ` Alexander Duyck
@ 2015-03-11 15:42         ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-03-11 15:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Govindarajulu Varadarajan, Alexander Duyck, davem, netdev,
	ssujith, benve

On Wed, 2015-03-11 at 06:55 -0700, Alexander Duyck wrote:

> That definitely helps but still leaves you open since a 257 byte packet
> would be consuming 9K in that case.  Also standard flows with frames of
> 1514 still take a hard hit with a receive buffer size of 9K.

One possible way to deal with this would be to try to use order-2 pages
(but not compound pages) and break them up.

split_page(page, 2);

struct page *page0 = page;
struct page *page1 = page + 1;
struct page *page2 = page + 2;
struct page *page3 = page + 3;


if frame length <= 4096, attach page0 as skb first frag, and free
page[1-3]

If frame length <= 8192 bytes, driver can attach page0 & page1 to the
skb, and free page[2-3]

Otherwise, attach page[0-2] to skb and free page3

Not sure if buddy allocator will be pleased, but worth trying ?
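
Roughly (a sketch of the idea only; skb and frame_len come from the
completion path, and dma mapping of the pieces is left out):

	struct page *page = alloc_pages(GFP_ATOMIC, 2);	/* order-2, no __GFP_COMP */
	int i, used;

	if (!page)
		return NULL;
	split_page(page, 2);		/* now four independent order-0 pages */
	used = DIV_ROUND_UP(frame_len, PAGE_SIZE);
	for (i = 0; i < used; i++)
		skb_add_rx_frag(skb, i, page + i, 0,
				min_t(int, frame_len - i * PAGE_SIZE, PAGE_SIZE),
				PAGE_SIZE);
	for (; i < 4; i++)
		__free_page(page + i);	/* give the unused tail pages back */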

* RE: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 20:33   ` Alexander Duyck
  2015-03-11  8:57     ` Govindarajulu Varadarajan
@ 2015-03-11 17:06     ` David Laight
  1 sibling, 0 replies; 14+ messages in thread
From: David Laight @ 2015-03-11 17:06 UTC (permalink / raw)
  To: 'Alexander Duyck', Govindarajulu Varadarajan, davem, netdev
  Cc: ssujith, benve

From: Alexander Duyck
...
> The question I would have is do you actually need to have the 9k
> buffer?  Does the hardware support any sort of scatter-gather receive?
> If so that would be preferable as the 9k allocation per skb will have
> significant overhead when you start receiving small packets.
...

You don't necessarily need true scatter gather.

A lot of ethernet MACs continue long frames into the buffer
associated with the next ring entry.
So if you fill the ring with (say) 2k buffers you 'just' have
to piece the fragments together when a long frame is received.

I used to use a single 64k buffer split into 128 buffers of 512
bytes each (last was actually smaller for alignment).
All receive frames were copied into (the equivalent of) an skb
that was allocated after the receive completed.
The aligned copy (including the leading and trailing padding)
didn't actually cost enough to worry about.
(Without TCP checksum offload the data was heading for the cache.)

	David

* RE: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-11 14:00       ` Alexander Duyck
@ 2015-03-11 17:34         ` David Laight
  2015-03-11 17:51           ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2015-03-11 17:34 UTC (permalink / raw)
  To: 'Alexander Duyck', Govindarajulu Varadarajan, Alexander Duyck
  Cc: davem, netdev, ssujith, benve

From: Alexander Duyck
...
> > Is this behaviour platform dependent? I tested this patch for more
> > than a month
> > and I did not face any issue. I ran normal traffic like ssh, nfs and
> > iperf/netperf.
> > Is there a special scenario when this could occur?
> 
> Yes it depends on the platform and IOMMU used.  For an example take a
> loot at the SWIOTLB implementation.  I always assumed if I can work with
> that when it is doing bounce buffers I can work with any IOMMU or platform.
> 
> >
> > Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this?
> > Each desc should have different dma address to write to. Can you
> > explain me how
> > this can happen?
> 
> No that won't help.  The issue is that when the page is mapped you
> should not be updating any fields in the page until it is unmapped.
> Since you have multiple buffers mapped to a single page you should be
> waiting until the entire page is unmapped.

Isn't the 'unit of memory for dma sync' a cache line, not a page?

You certainly need to test on systems without cache coherent io.

	David

* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc
  2015-03-11 17:34         ` David Laight
@ 2015-03-11 17:51           ` Alexander Duyck
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2015-03-11 17:51 UTC (permalink / raw)
  To: David Laight, Govindarajulu Varadarajan, Alexander Duyck
  Cc: davem, netdev, ssujith, benve

On 03/11/2015 10:34 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>>> Is this behaviour platform dependent? I tested this patch for more
>>> than a month
>>> and I did not face any issue. I ran normal traffic like ssh, nfs and
>>> iperf/netperf.
>>> Is there a special scenario when this could occur?
>> Yes it depends on the platform and IOMMU used.  For an example take a
>> loot at the SWIOTLB implementation.  I always assumed if I can work with
>> that when it is doing bounce buffers I can work with any IOMMU or platform.
>>
>>> Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this?
>>> Each desc should have different dma address to write to. Can you
>>> explain me how
>>> this can happen?
>> No that won't help.  The issue is that when the page is mapped you
>> should not be updating any fields in the page until it is unmapped.
>> Since you have multiple buffers mapped to a single page you should be
>> waiting until the entire page is unmapped.
> Isn't the 'unit of memory for dma sync' a cache line, not a page?

Yes, but the problem is that the entire page is mapped and unmapped, and that
triggers a synchronization over the entire page, not just the most recent
buffer within the page that was used.

The problem is that the API maps an order-3 page and then uses chunks of
it for receive buffers, but the last buffer unmaps the entire page, which
could invalidate any CPU-side accesses to the page made while it was
still mapped.

In order to make it workable it would have to be mapped bidirectional, and
on the last unmap everything that isn't the last buffer would have to be
synced for device before the page is unmapped, which would likely be more
expensive than just avoiding all of this by identifying the page as shared
and cloning the header out of the page frag.

> You certainly need to test on systems without cache coherent io.
>
> 	David

I agree.

- Alex

* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan
  2015-03-10 20:33   ` Alexander Duyck
@ 2015-03-14 20:08   ` Ben Hutchings
  1 sibling, 0 replies; 14+ messages in thread
From: Ben Hutchings @ 2015-03-14 20:08 UTC (permalink / raw)
  To: Govindarajulu Varadarajan; +Cc: davem, netdev, ssujith, benve

On Tue, 2015-03-10 at 23:13 +0530, Govindarajulu Varadarajan wrote:
[...]
> +static struct netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head,
> +						size_t sz)
> +{
> +	struct netdev_dma_node *nc = nc_head->nc;
> +	int offset;
> +
> +	if (unlikely(!nc)) {
> +refill:
> +		__dma_cache_refill(nc_head, sz);
> +		nc = nc_head->nc;
> +		if (unlikely(!nc))
> +			return NULL;
> +	}
> +
> +	offset = nc->frag.offset - sz;
> +	if (offset < 0) {
> +		if (!atomic_sub_and_test(nc->pagecnt_bias,
> +					 &nc->frag.page->_count)) {
> +			/* the caller has processed all the frags belonging to
> +			 * this page. Since page->_count is not 0 and
> +			 * nc->dma_count is 0 these frags should be in stack.
> +			 * We should unmap the page here.

Unmapping potentially copies data back from a bounce buffer (even if
it's already been synchronised).  This is OK in the case of a buffer
that's attached to the skb as a page fragment, because in that case it's
read-only for the stack.  It's not OK for a buffer that has been passed
to build_skb(), as this may legitimately have been changed by the stack
and this will overwrite the changes.

So you have to choose between DMA buffer recycling and build_skb().

> +			 */
> +			if (!nc->dma_count) {
> +				dma_unmap_single(nc_head->dev, nc->dma_addr,
> +						 nc->frag.size,
> +						 DMA_FROM_DEVICE);
[...]

Don't mix dma_map_page() and dma_unmap_single().
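
i.e. whichever call created the mapping also has to tear it down:

	dma = dma_map_page(dev, page, 0, size, DMA_FROM_DEVICE);
	/* ... device writes into the page ... */
	dma_unmap_page(dev, dma, size, DMA_FROM_DEVICE);  /* not dma_unmap_single() */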

Ben.

-- 
Ben Hutchings
Q.  Which is the greater problem in the world today, ignorance or apathy?
A.  I don't know and I couldn't care less.
