* [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping
@ 2015-03-10 17:43 Govindarajulu Varadarajan
  2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan
  2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan
  0 siblings, 2 replies; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

The following series tries to address these two problems in dma buffer
allocation.

* Memory wastage because of large 9k allocation using kmalloc:

  For a 9k dma buffer, netdev_alloc_skb_ip_align internally calls kmalloc
  for any size > 4096. For a 9k buffer, kmalloc returns a page of order 2,
  i.e. 16k, of which we use only ~9k; 7k of memory is wasted. Using the
  frag allocator in patch 1/2, we can fit three 9k buffers in one 32k page.

  A typical enic configuration has 8 rqs, each with a desc ring of size
  4096. That is 8 * 4096 * (16*1024) = 512 MB. Using this frag allocator:
  8 * 4096 * (32*1024/3) = 341 MB. That is a saving of 171 MB of memory.

* Frequent dma_map() calls:

  We call dma_map() for every buffer we allocate. When the iommu is on,
  this is very time consuming. From my testing, most of the cpu cycles are
  wasted spinning on spin_lock_irqsave(&iovad->iova_rbtree_lock, flags)
  in intel_map_page() .. -> ..__alloc_and_insert_iova_range()

  With this patch, we call dma_map() once per 32k page, i.e. once for
  every three 9k descs, and once for every twenty 1500 byte descs.

Here are my testing results with 8 rqs, a ring size of 4096 and 9k mtu.
The irq of each rq is affinitized to a different CPU. Ran iperf with 32
threads. Link is 10G. iommu is on.

		CPU utilization		throughput
without patch	100%			1.8 Gbps
with patch	13%			9.8 Gbps

v3: Make this facility more generic so that other drivers can use it.
v2: Remove changing order facility

Govindarajulu Varadarajan (2):
  net: implement dma cache skb allocator
  enic: use netdev_dma_alloc

 drivers/net/ethernet/cisco/enic/enic_main.c | 31 ++---
 drivers/net/ethernet/cisco/enic/vnic_rq.c   |  3 +
 drivers/net/ethernet/cisco/enic/vnic_rq.h   |  3 +
 include/linux/skbuff.h                      | 22 +++
 net/core/skbuff.c                           | 209 ++++++++++++++++++++++++++++
 5 files changed, 246 insertions(+), 22 deletions(-)

-- 
2.3.2

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 17:43 [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping Govindarajulu Varadarajan
@ 2015-03-10 17:43 ` Govindarajulu Varadarajan
  2015-03-10 20:33 ` Alexander Duyck
  2015-03-14 20:08 ` Ben Hutchings
  2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan
  1 sibling, 2 replies; 14+ messages in thread
From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan

This patch implements a dma cache skb allocator. It is based on the
__alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c.

In addition to frag allocation from an order(3) page as in
__alloc_page_frag, we also maintain the dma address of the page. While
allocating a frag for an skb we use va + offset for the virtual address of
the frag, and pa + offset for the dma address of the frag. This reduces the
number of calls to dma_map() to 1/3 for 9000 byte buffers and to 1/20 for
1500 byte buffers.

__alloc_page_frag is limited to a max buffer size of PAGE_SIZE, i.e. 4096
in most cases. So a 9k buffer allocation goes through kmalloc, which
returns a page of order 2, 16k. We waste 7k bytes for every 9k buffer.

We maintain a dma_count variable which is incremented when we allocate a
frag. netdev_dma_frag_unmap() decrements dma_count and unmaps the page when
there are no more users of that page.

This reduces the memory utilization for 9k mtu by 33%.

The user of the dma cache allocator first calls netdev_dma_init() to
initialize the dma_head structure. The user can then allocate a dma frag by
calling netdev_dma_alloc_skb(); the allocated skb is returned, and its dma
address is stored in the pointer *dma_addr passed to the function. Every
call to this function should have a corresponding call to
netdev_dma_frag_unmap() to unmap the page.

Finally, netdev_dma_destroy() is called to clean up the dma cache.
Before calling destroy(), user should have unmapped all the skb allocated. All calls to these function should be mutually exclusive. It is the responsibility of the caller to satisfy this condition. Since this will be mostly used for rx buffer allocation, where unmap,allocation is serialized, we can avoid using locks. Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> --- include/linux/skbuff.h | 22 ++++++ net/core/skbuff.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 231 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bba1330..ae2d024 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2153,6 +2153,28 @@ static inline struct sk_buff *dev_alloc_skb(unsigned int length) return netdev_alloc_skb(NULL, length); } +struct netdev_dma_node { + struct page_frag frag; + unsigned int pagecnt_bias; + int dma_count; + void *va; + dma_addr_t dma_addr; +}; + +struct netdev_dma_head { + struct netdev_dma_node *nc; + struct device *dev; + gfp_t gfp; +}; + +void netdev_dma_destroy(struct netdev_dma_head *nc_head); +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head, + struct netdev_dma_node *nc); +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev, + gfp_t gfp); +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head, + struct netdev_dma_node **nc_node, + dma_addr_t *dma_addr, size_t sz); static inline struct sk_buff *__netdev_alloc_skb_ip_align(struct net_device *dev, unsigned int length, gfp_t gfp) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 47c3241..ec3c46c 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -79,6 +79,7 @@ struct kmem_cache *skbuff_head_cache __read_mostly; static struct kmem_cache *skbuff_fclone_cache __read_mostly; +struct kmem_cache *netdev_dma_cache __read_mostly; /** * skb_panic - private function for out-of-line support @@ -361,6 +362,210 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc, 
return page; } +static void __dma_cache_refill(struct netdev_dma_head *nc_head, size_t sz) +{ + struct netdev_dma_node *nc; + gfp_t gfp_comp = nc_head->gfp | __GFP_COMP | __GFP_NOWARN | + __GFP_NORETRY; + u8 order = NETDEV_FRAG_PAGE_MAX_ORDER; + + nc = kmem_cache_alloc(netdev_dma_cache, GFP_ATOMIC); + nc_head->nc = nc; + if (unlikely(!nc)) + return; + + if (unlikely(sz > (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER))) + goto precise_order; + + nc->frag.page = alloc_pages_node(NUMA_NO_NODE, gfp_comp, order); + if (unlikely(!nc->frag.page)) { +precise_order: + order = get_order(sz); + nc->frag.page = alloc_pages_node(NUMA_NO_NODE, nc_head->gfp, + order); + if (!nc->frag.page) + goto free_nc; + } + + nc->frag.size = (PAGE_SIZE << order); + nc->dma_addr = dma_map_page(nc_head->dev, nc->frag.page, 0, + nc->frag.size, DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(nc_head->dev, nc->dma_addr))) + goto free_page; + nc->va = page_address(nc->frag.page); + atomic_add(nc->frag.size - 1, &nc->frag.page->_count); + nc->pagecnt_bias = nc->frag.size; + nc->frag.offset = nc->frag.size; + nc->dma_count = 0; + + return; + +free_page: + __free_pages(nc->frag.page, order); +free_nc: + kmem_cache_free(netdev_dma_cache, nc); + nc_head->nc = NULL; +} + +static struct netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head, + size_t sz) +{ + struct netdev_dma_node *nc = nc_head->nc; + int offset; + + if (unlikely(!nc)) { +refill: + __dma_cache_refill(nc_head, sz); + nc = nc_head->nc; + if (unlikely(!nc)) + return NULL; + } + + offset = nc->frag.offset - sz; + if (offset < 0) { + if (!atomic_sub_and_test(nc->pagecnt_bias, + &nc->frag.page->_count)) { + /* the caller has processed all the frags belonging to + * this page. Since page->_count is not 0 and + * nc->dma_count is 0 these frags should be in stack. + * We should unmap the page here. 
+ */ + if (!nc->dma_count) { + dma_unmap_single(nc_head->dev, nc->dma_addr, + nc->frag.size, + DMA_FROM_DEVICE); + kmem_cache_free(netdev_dma_cache, nc); + } else { + /* frags from this page are used by the caller. Let the + * caller unmap the page using netdev_dma_frag_unmap(). + */ + nc->pagecnt_bias = 0; + } + goto refill; + } + WARN_ON(nc->dma_count); + atomic_set(&nc->frag.page->_count, nc->frag.size); + nc->pagecnt_bias = nc->frag.size; + offset = nc->frag.size - sz; + } + nc->pagecnt_bias--; + nc->dma_count++; + nc->frag.offset = offset; + + return nc; +} + +/* netdev_dma_alloc_skb - alloc skb with dma mapped data + * @nc_head: initialized netdev_dma_head from which allocation should be + * done. + * @nc_node: Corresponding dma_node is stored in this. The caller needs to + * save this and use it for netdev_dma_frag_unmap() to unmap + * page frag later. + * @dma_addr: dma address of the data is stored here. + * @sz: size of data needed. + * + * Allocates skb from dma cached page. This function returns allocated skb_buff + * from the provided netdev_dma_head cache. It stores the dma address in + * dma_addr. + * + * All the calls to netdev_dma_alloc_skb(), netdev_dma_frag_unmap(), + * netdev_dma_init(), netdev_dma_destroy() should be mutually exclusive for a + * dma_head. It is the responsibility of the caller to satisfy this condition. + * Since this is mostly used for rx buffer allocation, where unmap,allocation + * is serialized we can avoid using locks. 
+ */ +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head, + struct netdev_dma_node **nc_node, + dma_addr_t *dma_addr, size_t sz) +{ + struct sk_buff *skb; + struct netdev_dma_node *nc; + void *va; + + sz = sz + NET_IP_ALIGN + NET_SKB_PAD; + sz = SKB_DATA_ALIGN(sz) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + nc = netdev_dma_alloc(nc_head, sz); + if (unlikely(!nc)) + return NULL; + va = nc->va + nc->frag.offset; + skb = build_skb(va, sz); + if (unlikely(!skb)) { + nc->pagecnt_bias++; + nc->frag.offset += sz; + nc->dma_count--; + + return NULL; + } + + skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); + *dma_addr = nc->dma_addr + nc->frag.offset + NET_SKB_PAD + NET_IP_ALIGN; + *nc_node = nc; + + return skb; +} +EXPORT_SYMBOL_GPL(netdev_dma_alloc_skb); + +/* netdev_dma_init - initialize dma cache head + * @nc_head: dma_head to be initialized + * @dev: device for dma map + * @gfp: gfp flagg for allocating page + * + * initializes dma_head with provided gfp flags and device + */ +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev, + gfp_t gfp) +{ + nc_head->nc = NULL; + nc_head->dev = dev; + nc_head->gfp = gfp; +} +EXPORT_SYMBOL_GPL(netdev_dma_init); + +/* netdev_dma_frag_unmap - unmap page + * @nc_head: dma_head of page frag allocator + * @nc: dma_node to be unmapped + * + * unmaps requested page if there are no dma references to the page + */ +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head, + struct netdev_dma_node *nc) +{ + /* If nc is not used by dma_head. We should be free to unmap the + * page if there are no pending dma frags addr used by the caller. + */ + if (!--nc->dma_count && !nc->pagecnt_bias) { + dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size, + DMA_FROM_DEVICE); + kmem_cache_free(netdev_dma_cache, nc); + } +} +EXPORT_SYMBOL_GPL(netdev_dma_frag_unmap); + +/* netdev_dma_destroy - destroy dma_head + * @nc_head: dma_head to be destroyed + * + * unmap the current dma_page and free the page. 
Caller should have called + * netdev_dma_frag_unmap() for all the allocated frages before calling this + * function. + */ +void netdev_dma_destroy(struct netdev_dma_head *nc_head) +{ + struct netdev_dma_node *nc = nc_head->nc; + + if (unlikely(!nc)) + return; + if (WARN_ON(nc->dma_count)) + return; + dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size, + DMA_FROM_DEVICE); + atomic_sub(nc->pagecnt_bias - 1, &nc->frag.page->_count); + __free_pages(nc->frag.page, get_order(nc->frag.size)); + kmem_cache_free(netdev_dma_cache, nc); + nc_head->nc = NULL; +} +EXPORT_SYMBOL_GPL(netdev_dma_destroy); + static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache, unsigned int fragsz, gfp_t gfp_mask) { @@ -3324,6 +3529,10 @@ void __init skb_init(void) 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + netdev_dma_cache = kmem_cache_create("netdev_dma_cache", + sizeof(struct netdev_dma_node), + 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, + NULL); } /** -- 2.3.2 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator 2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan @ 2015-03-10 20:33 ` Alexander Duyck 2015-03-11 8:57 ` Govindarajulu Varadarajan 2015-03-11 17:06 ` David Laight 2015-03-14 20:08 ` Ben Hutchings 1 sibling, 2 replies; 14+ messages in thread From: Alexander Duyck @ 2015-03-10 20:33 UTC (permalink / raw) To: Govindarajulu Varadarajan, davem, netdev; +Cc: ssujith, benve On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: > This patch implements dma cache skb allocator. This is based on > __alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c > > In addition to frag allocation from order(3) page in __alloc_page_frag, > we also maintain dma address of the page. While allocating a frag for skb > we use va + offset for virtual address of the frag, and pa + offset for > dma address of the frag. This reduces the number of calls to dma_map() by 1/3 > for 9000 bytes and by 1/20 for 1500 bytes. > > __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e 4096 in most > of the cases. So 9k buffer allocation goes through kmalloc which return > page of order 2, 16k. We waste 7k bytes for every 9k buffer. The question I would have is do you actually need to have the 9k buffer? Does the hardware support any sort of scatter-gather receive? If so that would be preferable as the 9k allocation per skb will have significant overhead when you start receiving small packets. A classic example is a TCP flow where you are only receiving a few hundred bytes per frame. You will take a huge truesize penalty for allocating a 9k skb for a frame of only a few hundred bytes, though it sounds like you are taking that hit already. > we maintain dma_count variable which is incremented when we allocate a frag. > netdev_dma_frag_unmap() will decrement the dma_count and unmap it when there is > no user of that page. 
> > This reduces the memory utilization for 9k mtu by 33%. > > The user of dma cache allocator first calls netdev_dma_init() to initialize the > dma_head structure. User can allocate a dma frag by calling > netdev_dma_alloc_skb(), allocated skb is returned and dma_addr is stored in the > pointer *dma_addr passed to the function. Every call to this function should > have corresponding call to netdev_dma_frag_unmap() to unmap the page. > > At last netdev_dma_destroy() is called to clean to dma_cache. Before calling > destroy(), user should have unmapped all the skb allocated. > > All calls to these function should be mutually exclusive. It is the > responsibility of the caller to satisfy this condition. Since this will be > mostly used for rx buffer allocation, where unmap,allocation is serialized, we > can avoid using locks. > > Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> > --- > include/linux/skbuff.h | 22 ++++++ > net/core/skbuff.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 231 insertions(+) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index bba1330..ae2d024 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -2153,6 +2153,28 @@ static inline struct sk_buff *dev_alloc_skb(unsigned int length) > return netdev_alloc_skb(NULL, length); > } > > +struct netdev_dma_node { > + struct page_frag frag; > + unsigned int pagecnt_bias; > + int dma_count; > + void *va; > + dma_addr_t dma_addr; > +}; > + > +struct netdev_dma_head { > + struct netdev_dma_node *nc; > + struct device *dev; > + gfp_t gfp; > +}; > + > +void netdev_dma_destroy(struct netdev_dma_head *nc_head); > +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head, > + struct netdev_dma_node *nc); > +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev, > + gfp_t gfp); > +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head, > + struct netdev_dma_node **nc_node, > + dma_addr_t 
*dma_addr, size_t sz); > > static inline struct sk_buff *__netdev_alloc_skb_ip_align(struct net_device *dev, > unsigned int length, gfp_t gfp) > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 47c3241..ec3c46c 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -79,6 +79,7 @@ > > struct kmem_cache *skbuff_head_cache __read_mostly; > static struct kmem_cache *skbuff_fclone_cache __read_mostly; > +struct kmem_cache *netdev_dma_cache __read_mostly; > > /** > * skb_panic - private function for out-of-line support > @@ -361,6 +362,210 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc, > return page; > } > > +static void __dma_cache_refill(struct netdev_dma_head *nc_head, size_t sz) > +{ > + struct netdev_dma_node *nc; > + gfp_t gfp_comp = nc_head->gfp | __GFP_COMP | __GFP_NOWARN | > + __GFP_NORETRY; > + u8 order = NETDEV_FRAG_PAGE_MAX_ORDER; > + > + nc = kmem_cache_alloc(netdev_dma_cache, GFP_ATOMIC); > + nc_head->nc = nc; > + if (unlikely(!nc)) > + return; > + > + if (unlikely(sz > (PAGE_SIZE << NETDEV_FRAG_PAGE_MAX_ORDER))) > + goto precise_order; > + > + nc->frag.page = alloc_pages_node(NUMA_NO_NODE, gfp_comp, order); > + if (unlikely(!nc->frag.page)) { > +precise_order: > + order = get_order(sz); > + nc->frag.page = alloc_pages_node(NUMA_NO_NODE, nc_head->gfp, > + order); > + if (!nc->frag.page) > + goto free_nc; > + } > + > + nc->frag.size = (PAGE_SIZE << order); > + nc->dma_addr = dma_map_page(nc_head->dev, nc->frag.page, 0, > + nc->frag.size, DMA_FROM_DEVICE); > + if (unlikely(dma_mapping_error(nc_head->dev, nc->dma_addr))) > + goto free_page; > + nc->va = page_address(nc->frag.page); > + atomic_add(nc->frag.size - 1, &nc->frag.page->_count); > + nc->pagecnt_bias = nc->frag.size; > + nc->frag.offset = nc->frag.size; > + nc->dma_count = 0; > + > + return; > + > +free_page: > + __free_pages(nc->frag.page, order); > +free_nc: > + kmem_cache_free(netdev_dma_cache, nc); > + nc_head->nc = NULL; > +} > + > +static struct 
netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head, > + size_t sz) > +{ > + struct netdev_dma_node *nc = nc_head->nc; > + int offset; > + > + if (unlikely(!nc)) { > +refill: > + __dma_cache_refill(nc_head, sz); > + nc = nc_head->nc; > + if (unlikely(!nc)) > + return NULL; > + } > + > + offset = nc->frag.offset - sz; > + if (offset < 0) { > + if (!atomic_sub_and_test(nc->pagecnt_bias, > + &nc->frag.page->_count)) { > + /* the caller has processed all the frags belonging to > + * this page. Since page->_count is not 0 and > + * nc->dma_count is 0 these frags should be in stack. > + * We should unmap the page here. > + */ > + if (!nc->dma_count) { > + dma_unmap_single(nc_head->dev, nc->dma_addr, > + nc->frag.size, > + DMA_FROM_DEVICE); > + kmem_cache_free(netdev_dma_cache, nc); > + } else { > + /* frags from this page are used by the caller. Let the > + * caller unmap the page using netdev_dma_frag_unmap(). > + */ > + nc->pagecnt_bias = 0; > + } > + goto refill; > + } > + WARN_ON(nc->dma_count); > + atomic_set(&nc->frag.page->_count, nc->frag.size); > + nc->pagecnt_bias = nc->frag.size; > + offset = nc->frag.size - sz; > + } > + nc->pagecnt_bias--; > + nc->dma_count++; > + nc->frag.offset = offset; > + > + return nc; > +} > + > +/* netdev_dma_alloc_skb - alloc skb with dma mapped data > + * @nc_head: initialized netdev_dma_head from which allocation should be > + * done. > + * @nc_node: Corresponding dma_node is stored in this. The caller needs to > + * save this and use it for netdev_dma_frag_unmap() to unmap > + * page frag later. > + * @dma_addr: dma address of the data is stored here. > + * @sz: size of data needed. > + * > + * Allocates skb from dma cached page. This function returns allocated skb_buff > + * from the provided netdev_dma_head cache. It stores the dma address in > + * dma_addr. 
> + * > + * All the calls to netdev_dma_alloc_skb(), netdev_dma_frag_unmap(), > + * netdev_dma_init(), netdev_dma_destroy() should be mutually exclusive for a > + * dma_head. It is the responsibility of the caller to satisfy this condition. > + * Since this is mostly used for rx buffer allocation, where unmap,allocation > + * is serialized we can avoid using locks. > + */ > +struct sk_buff *netdev_dma_alloc_skb(struct netdev_dma_head *nc_head, > + struct netdev_dma_node **nc_node, > + dma_addr_t *dma_addr, size_t sz) > +{ > + struct sk_buff *skb; > + struct netdev_dma_node *nc; > + void *va; > + > + sz = sz + NET_IP_ALIGN + NET_SKB_PAD; > + sz = SKB_DATA_ALIGN(sz) + > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + nc = netdev_dma_alloc(nc_head, sz); > + if (unlikely(!nc)) > + return NULL; > + va = nc->va + nc->frag.offset; > + skb = build_skb(va, sz); > + if (unlikely(!skb)) { > + nc->pagecnt_bias++; > + nc->frag.offset += sz; > + nc->dma_count--; > + > + return NULL; > + } > + > + skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); > + *dma_addr = nc->dma_addr + nc->frag.offset + NET_SKB_PAD + NET_IP_ALIGN; > + *nc_node = nc; > + > + return skb; > +} > +EXPORT_SYMBOL_GPL(netdev_dma_alloc_skb); > + This function should be dropped. It is going to lead to memory corruption since you should not modify any of the header data until after the unmap has been called for the page(s) in this fragment. 
> +/* netdev_dma_init - initialize dma cache head > + * @nc_head: dma_head to be initialized > + * @dev: device for dma map > + * @gfp: gfp flagg for allocating page > + * > + * initializes dma_head with provided gfp flags and device > + */ > +void netdev_dma_init(struct netdev_dma_head *nc_head, struct device *dev, > + gfp_t gfp) > +{ > + nc_head->nc = NULL; > + nc_head->dev = dev; > + nc_head->gfp = gfp; > +} > +EXPORT_SYMBOL_GPL(netdev_dma_init); > + > +/* netdev_dma_frag_unmap - unmap page > + * @nc_head: dma_head of page frag allocator > + * @nc: dma_node to be unmapped > + * > + * unmaps requested page if there are no dma references to the page > + */ > +void netdev_dma_frag_unmap(struct netdev_dma_head *nc_head, > + struct netdev_dma_node *nc) > +{ > + /* If nc is not used by dma_head. We should be free to unmap the > + * page if there are no pending dma frags addr used by the caller. > + */ > + if (!--nc->dma_count && !nc->pagecnt_bias) { > + dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size, > + DMA_FROM_DEVICE); > + kmem_cache_free(netdev_dma_cache, nc); > + } > +} > +EXPORT_SYMBOL_GPL(netdev_dma_frag_unmap); > + > +/* netdev_dma_destroy - destroy dma_head > + * @nc_head: dma_head to be destroyed > + * > + * unmap the current dma_page and free the page. Caller should have called > + * netdev_dma_frag_unmap() for all the allocated frages before calling this > + * function. 
> + */ > +void netdev_dma_destroy(struct netdev_dma_head *nc_head) > +{ > + struct netdev_dma_node *nc = nc_head->nc; > + > + if (unlikely(!nc)) > + return; > + if (WARN_ON(nc->dma_count)) > + return; > + dma_unmap_page(nc_head->dev, nc->dma_addr, nc->frag.size, > + DMA_FROM_DEVICE); > + atomic_sub(nc->pagecnt_bias - 1, &nc->frag.page->_count); > + __free_pages(nc->frag.page, get_order(nc->frag.size)); > + kmem_cache_free(netdev_dma_cache, nc); > + nc_head->nc = NULL; > +} > +EXPORT_SYMBOL_GPL(netdev_dma_destroy); > + > static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache, > unsigned int fragsz, gfp_t gfp_mask) > { > @@ -3324,6 +3529,10 @@ void __init skb_init(void) > 0, > SLAB_HWCACHE_ALIGN|SLAB_PANIC, > NULL); > + netdev_dma_cache = kmem_cache_create("netdev_dma_cache", > + sizeof(struct netdev_dma_node), > + 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, > + NULL); > } > > /** ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator 2015-03-10 20:33 ` Alexander Duyck @ 2015-03-11 8:57 ` Govindarajulu Varadarajan 2015-03-11 13:55 ` Alexander Duyck 2015-03-11 17:06 ` David Laight 1 sibling, 1 reply; 14+ messages in thread From: Govindarajulu Varadarajan @ 2015-03-11 8:57 UTC (permalink / raw) To: Alexander Duyck; +Cc: Govindarajulu Varadarajan, davem, netdev, ssujith, benve On Tue, 10 Mar 2015, Alexander Duyck wrote: > > On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: >> This patch implements dma cache skb allocator. This is based on >> __alloc_page_frag & __page_frag_refill implementation in net/core/skbuff.c >> >> In addition to frag allocation from order(3) page in __alloc_page_frag, >> we also maintain dma address of the page. While allocating a frag for skb >> we use va + offset for virtual address of the frag, and pa + offset for >> dma address of the frag. This reduces the number of calls to dma_map() by >> 1/3 >> for 9000 bytes and by 1/20 for 1500 bytes. >> >> __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e 4096 in >> most >> of the cases. So 9k buffer allocation goes through kmalloc which return >> page of order 2, 16k. We waste 7k bytes for every 9k buffer. > > The question I would have is do you actually need to have the 9k buffer? > Does the hardware support any sort of scatter-gather receive? If so that > would be preferable as the 9k allocation per skb will have significant > overhead when you start receiving small packets. > enic hw has limited desc per rq (4096), and we can have only one dma block per desc. Having sg/header-split will reduce the effective rq ring size for large packets. > A classic example is a TCP flow where you are only receiving a few hundred > bytes per frame. You will take a huge truesize penalty for allocating a 9k > skb for a frame of only a few hundred bytes, though it sounds like you are > taking that hit already. 
> For this we have rx copybreak for pkts < 256 bytes. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator 2015-03-11 8:57 ` Govindarajulu Varadarajan @ 2015-03-11 13:55 ` Alexander Duyck 2015-03-11 15:42 ` Eric Dumazet 0 siblings, 1 reply; 14+ messages in thread From: Alexander Duyck @ 2015-03-11 13:55 UTC (permalink / raw) To: Govindarajulu Varadarajan, Alexander Duyck; +Cc: davem, netdev, ssujith, benve On 03/11/2015 01:57 AM, Govindarajulu Varadarajan wrote: > On Tue, 10 Mar 2015, Alexander Duyck wrote: > >> >> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: >>> This patch implements dma cache skb allocator. This is based on >>> __alloc_page_frag & __page_frag_refill implementation in >>> net/core/skbuff.c >>> >>> In addition to frag allocation from order(3) page in __alloc_page_frag, >>> we also maintain dma address of the page. While allocating a frag >>> for skb >>> we use va + offset for virtual address of the frag, and pa + offset for >>> dma address of the frag. This reduces the number of calls to >>> dma_map() by 1/3 >>> for 9000 bytes and by 1/20 for 1500 bytes. >>> >>> __alloc_page_frag is limited to max buffer size of PAGE_SIZE, i.e >>> 4096 in most >>> of the cases. So 9k buffer allocation goes through kmalloc which return >>> page of order 2, 16k. We waste 7k bytes for every 9k buffer. >> >> The question I would have is do you actually need to have the 9k >> buffer? Does the hardware support any sort of scatter-gather >> receive? If so that would be preferable as the 9k allocation per skb >> will have significant overhead when you start receiving small packets. >> > > enic hw has limited desc per rq (4096), and we can have only one dma > block per > desc. Having sg/header-split will reduce the effective rq ring size for > large packets. Yes, so? In the case of the Intel driver they default to only 512 descriptors per ring with a size of 2K per descriptor. 
Are you saying the latency for your interrupt processing is so high that you need to buffer 9k per descriptor in order to achieve line rate? > >> A classic example is a TCP flow where you are only receiving a few >> hundred bytes per frame. You will take a huge truesize penalty for >> allocating a 9k skb for a frame of only a few hundred bytes, though >> it sounds like you are taking that hit already. >> > > For this we have rx copybreak for pkts < 256 bytes. That definitely helps but still leaves you open since a 257 byte packet would be consuming 9K in that case. Also standard flows with frames of 1514 still take a hard hit with a receive buffer size of 9K. - Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator 2015-03-11 13:55 ` Alexander Duyck @ 2015-03-11 15:42 ` Eric Dumazet 0 siblings, 0 replies; 14+ messages in thread From: Eric Dumazet @ 2015-03-11 15:42 UTC (permalink / raw) To: Alexander Duyck Cc: Govindarajulu Varadarajan, Alexander Duyck, davem, netdev, ssujith, benve On Wed, 2015-03-11 at 06:55 -0700, Alexander Duyck wrote: > That definitely helps but still leaves you open since a 257 byte packet > would be consuming 9K in that case. Also standard flows with frames of > 1514 still take a hard hit with a receive buffer size of 9K. One possible way to deal with this would be to try to use order-2 pages, (but not compound pages) but break them. split_page(page, 2); struct page *page0 = page; struct page *page1 = page + 1; struct page *page2 = page + 2; struct page *page3 = page + 3; if frame length <= 4096, attach page0 as skb first frag, and free page[1-3] If frame length <= 8192 bytes, driver can attach page0 & page1 to the skb, and free page[2-3] Otherwise, attach page[0-2] to skb and free page3 Not sure if buddy allocator will be pleased, but worth trying ? ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH net-next v3 1/2] net: implement dma cache skb allocator
  2015-03-10 20:33 ` Alexander Duyck
  2015-03-11  8:57 ` Govindarajulu Varadarajan
@ 2015-03-11 17:06 ` David Laight
  1 sibling, 0 replies; 14+ messages in thread
From: David Laight @ 2015-03-11 17:06 UTC (permalink / raw)
  To: 'Alexander Duyck', Govindarajulu Varadarajan, davem, netdev
  Cc: ssujith, benve

From: Alexander Duyck
...
> The question I would have is do you actually need to have the 9k
> buffer? Does the hardware support any sort of scatter-gather receive?
> If so that would be preferable as the 9k allocation per skb will have
> significant overhead when you start receiving small packets.
...

You don't necessarily need true scatter-gather. A lot of ethernet MACs
continue long frames into the buffer associated with the next ring entry.
So if you fill the ring with (say) 2k buffers you 'just' have to piece the
fragments together when a long frame is received.

I used to use a single 64k buffer split into 128 buffers of 512 bytes each
(the last was actually smaller for alignment). All receive frames were
copied into (the equivalent of) an skb that was allocated after the
receive completed. The aligned copy (including the leading and trailing
padding) didn't actually cost enough to worry about. (Without TCP checksum
offload the data was heading for the cache.)

	David

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 1/2] net: implement dma cache skb allocator 2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan 2015-03-10 20:33 ` Alexander Duyck @ 2015-03-14 20:08 ` Ben Hutchings 1 sibling, 0 replies; 14+ messages in thread From: Ben Hutchings @ 2015-03-14 20:08 UTC (permalink / raw) To: Govindarajulu Varadarajan; +Cc: davem, netdev, ssujith, benve [-- Attachment #1: Type: text/plain, Size: 1606 bytes --] On Tue, 2015-03-10 at 23:13 +0530, Govindarajulu Varadarajan wrote: [...] > +static struct netdev_dma_node *netdev_dma_alloc(struct netdev_dma_head *nc_head, > + size_t sz) > +{ > + struct netdev_dma_node *nc = nc_head->nc; > + int offset; > + > + if (unlikely(!nc)) { > +refill: > + __dma_cache_refill(nc_head, sz); > + nc = nc_head->nc; > + if (unlikely(!nc)) > + return NULL; > + } > + > + offset = nc->frag.offset - sz; > + if (offset < 0) { > + if (!atomic_sub_and_test(nc->pagecnt_bias, > + &nc->frag.page->_count)) { > + /* the caller has processed all the frags belonging to > + * this page. Since page->_count is not 0 and > + * nc->dma_count is 0 these frags should be in stack. > + * We should unmap the page here. Unmapping potentially copies data back from a bounce buffer (even if it's already been synchronised). This is OK in the case of a buffer that's attached to the skb as a page fragment, because in that case it's read-only for the stack. It's not OK for a buffer that has been passed to build_skb(), as this may legitimately have been changed by the stack and this will overwrite the changes. So you have to choose between DMA buffer recycling and build_skb(). > + */ > + if (!nc->dma_count) { > + dma_unmap_single(nc_head->dev, nc->dma_addr, > + nc->frag.size, > + DMA_FROM_DEVICE); [...] Don't mix dma_map_page() and dma_unmap_single(). Ben. -- Ben Hutchings Q. Which is the greater problem in the world today, ignorance or apathy? A. I don't know and I couldn't care less. 
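The offset arithmetic in the `netdev_dma_alloc()` hunk Ben quotes is a bump-down allocator: each allocation moves `frag.offset` toward zero, and when the remainder is too small a fresh block is mapped. A minimal userspace model of just that arithmetic (the real code also tracks `page->_count` and `dma_count`; the 32k block size matches the cover letter, other names are illustrative) shows why three 9k buffers fit per block:

```c
#define BLOCK_SZ (32 * 1024)	/* order-3 page on 4k-page systems */

struct frag_block {
	int offset;		/* bump-down allocation cursor */
	int refills;		/* blocks consumed so far */
};

/* Carve sz bytes out of the current block, taking a fresh block when the
 * remainder is too small -- mirrors the offset check in the quoted patch,
 * with the refill path reduced to a counter. */
static int frag_alloc(struct frag_block *b, int sz)
{
	if (b->offset - sz < 0) {	/* current block exhausted */
		b->refills++;
		b->offset = BLOCK_SZ;
	}
	b->offset -= sz;
	return b->offset;		/* offset of the new buffer */
}
```

Three 9216-byte buffers leave 5120 bytes of slack per 32k block, which is the 16k-vs-32k/3 saving claimed in the cover letter.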
* [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-10 17:43 [PATCH net-next v3 0/2] introduce dma frag allocation and reduce dma mapping Govindarajulu Varadarajan 2015-03-10 17:43 ` [PATCH net-next v3 1/2] net: implement dma cache skb allocator Govindarajulu Varadarajan @ 2015-03-10 17:43 ` Govindarajulu Varadarajan 2015-03-10 20:14 ` Alexander Duyck 1 sibling, 1 reply; 14+ messages in thread From: Govindarajulu Varadarajan @ 2015-03-10 17:43 UTC (permalink / raw) To: davem, netdev; +Cc: ssujith, benve, Govindarajulu Varadarajan This patches uses dma cache skb allocator fot rx buffers. netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb() and netdev_dma_frag_unmap() happens in napi_poll and they are serialized. Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> --- drivers/net/ethernet/cisco/enic/enic_main.c | 31 +++++++++-------------------- drivers/net/ethernet/cisco/enic/vnic_rq.c | 3 +++ drivers/net/ethernet/cisco/enic/vnic_rq.h | 3 +++ 3 files changed, 15 insertions(+), 22 deletions(-) diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c index 204bd182..3be5bc12 100644 --- a/drivers/net/ethernet/cisco/enic/enic_main.c +++ b/drivers/net/ethernet/cisco/enic/enic_main.c @@ -952,13 +952,9 @@ nla_put_failure: static void enic_free_rq_buf(struct vnic_rq *rq, struct vnic_rq_buf *buf) { - struct enic *enic = vnic_dev_priv(rq->vdev); - if (!buf->os_buf) return; - - pci_unmap_single(enic->pdev, buf->dma_addr, - buf->len, PCI_DMA_FROMDEVICE); + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); dev_kfree_skb_any(buf->os_buf); buf->os_buf = NULL; } @@ -979,17 +975,10 @@ static int enic_rq_alloc_buf(struct vnic_rq *rq) return 0; } - skb = netdev_alloc_skb_ip_align(netdev, len); + skb = netdev_dma_alloc_skb(&rq->nc_head, &buf->nc, &dma_addr, len); if (!skb) return -ENOMEM; - dma_addr = pci_map_single(enic->pdev, skb->data, len, - PCI_DMA_FROMDEVICE); - if (unlikely(enic_dma_map_check(enic, 
dma_addr))) { - dev_kfree_skb(skb); - return -ENOMEM; - } - enic_queue_rq_desc(rq, skb, os_buf_index, dma_addr, len); @@ -1016,8 +1005,6 @@ static bool enic_rxcopybreak(struct net_device *netdev, struct sk_buff **skb, new_skb = netdev_alloc_skb_ip_align(netdev, len); if (!new_skb) return false; - pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, len, - DMA_FROM_DEVICE); memcpy(new_skb->data, (*skb)->data, len); *skb = new_skb; @@ -1065,8 +1052,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, enic->rq_truncated_pkts++; } - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, - PCI_DMA_FROMDEVICE); + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); dev_kfree_skb_any(skb); buf->os_buf = NULL; @@ -1078,10 +1064,11 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, /* Good receive */ + pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, + bytes_written, DMA_FROM_DEVICE); if (!enic_rxcopybreak(netdev, &skb, buf, bytes_written)) { buf->os_buf = NULL; - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, - PCI_DMA_FROMDEVICE); + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); } prefetch(skb->data - NET_IP_ALIGN); @@ -1122,9 +1109,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, /* Buffer overflow */ - - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, - PCI_DMA_FROMDEVICE); + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); dev_kfree_skb_any(skb); buf->os_buf = NULL; } @@ -1648,6 +1633,8 @@ static int enic_open(struct net_device *netdev) } for (i = 0; i < enic->rq_count; i++) { + netdev_dma_init(&enic->rq[i].nc_head, &enic->pdev->dev, + GFP_ATOMIC); vnic_rq_fill(&enic->rq[i], enic_rq_alloc_buf); /* Need at least one buffer on ring to get going */ if (vnic_rq_desc_used(&enic->rq[i]) == 0) { diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c b/drivers/net/ethernet/cisco/enic/vnic_rq.c index 36a2ed6..afa1d71 100644 --- a/drivers/net/ethernet/cisco/enic/vnic_rq.c +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c @@ -23,6 +23,7 @@ #include 
<linux/pci.h> #include <linux/delay.h> #include <linux/slab.h> +#include <linux/skbuff.h> #include "vnic_dev.h" #include "vnic_rq.h" @@ -199,6 +200,8 @@ void vnic_rq_clean(struct vnic_rq *rq, rq->ring.desc_avail++; } + netdev_dma_destroy(&rq->nc_head); + /* Use current fetch_index as the ring starting point */ fetch_index = ioread32(&rq->ctrl->fetch_index); diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.h b/drivers/net/ethernet/cisco/enic/vnic_rq.h index 8111d52..d4ee963 100644 --- a/drivers/net/ethernet/cisco/enic/vnic_rq.h +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.h @@ -21,6 +21,7 @@ #define _VNIC_RQ_H_ #include <linux/pci.h> +#include <linux/skbuff.h> #include "vnic_dev.h" #include "vnic_cq.h" @@ -73,6 +74,7 @@ struct vnic_rq_buf { unsigned int index; void *desc; uint64_t wr_id; + struct netdev_dma_node *nc; }; struct vnic_rq { @@ -100,6 +102,7 @@ struct vnic_rq { unsigned int bpoll_state; spinlock_t bpoll_lock; #endif /* CONFIG_NET_RX_BUSY_POLL */ + struct netdev_dma_head nc_head; }; static inline unsigned int vnic_rq_desc_avail(struct vnic_rq *rq) -- 2.3.2 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-10 17:43 ` [PATCH net-next v3 2/2] enic: use netdev_dma_alloc Govindarajulu Varadarajan @ 2015-03-10 20:14 ` Alexander Duyck 2015-03-11 9:27 ` Govindarajulu Varadarajan 0 siblings, 1 reply; 14+ messages in thread From: Alexander Duyck @ 2015-03-10 20:14 UTC (permalink / raw) To: Govindarajulu Varadarajan, davem, netdev; +Cc: ssujith, benve On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: > This patches uses dma cache skb allocator fot rx buffers. > > netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb() and > netdev_dma_frag_unmap() happens in napi_poll and they are serialized. > > Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> This isn't going to work. The problem is the way you are using your fragments you can end up with a memory corruption as the frame headers that were updated by the stack may be reverted for any frames received before the last frame was unmapped. I ran into that issue when I was doing page reuse with build_skb on the Intel drivers and I suspect you will see the same issue. The way to work around it is to receive the data in to the fragments, and then pull the headers out and store them in a separate skb via something similar to copy-break. You can then track the fragments in frags. 
> --- > drivers/net/ethernet/cisco/enic/enic_main.c | 31 +++++++++-------------------- > drivers/net/ethernet/cisco/enic/vnic_rq.c | 3 +++ > drivers/net/ethernet/cisco/enic/vnic_rq.h | 3 +++ > 3 files changed, 15 insertions(+), 22 deletions(-) > > diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c > index 204bd182..3be5bc12 100644 > --- a/drivers/net/ethernet/cisco/enic/enic_main.c > +++ b/drivers/net/ethernet/cisco/enic/enic_main.c > @@ -952,13 +952,9 @@ nla_put_failure: > > static void enic_free_rq_buf(struct vnic_rq *rq, struct vnic_rq_buf *buf) > { > - struct enic *enic = vnic_dev_priv(rq->vdev); > - > if (!buf->os_buf) > return; > - > - pci_unmap_single(enic->pdev, buf->dma_addr, > - buf->len, PCI_DMA_FROMDEVICE); > + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); > dev_kfree_skb_any(buf->os_buf); > buf->os_buf = NULL; > } > @@ -979,17 +975,10 @@ static int enic_rq_alloc_buf(struct vnic_rq *rq) > > return 0; > } > - skb = netdev_alloc_skb_ip_align(netdev, len); > + skb = netdev_dma_alloc_skb(&rq->nc_head, &buf->nc, &dma_addr, len); > if (!skb) > return -ENOMEM; > > - dma_addr = pci_map_single(enic->pdev, skb->data, len, > - PCI_DMA_FROMDEVICE); > - if (unlikely(enic_dma_map_check(enic, dma_addr))) { > - dev_kfree_skb(skb); > - return -ENOMEM; > - } > - > enic_queue_rq_desc(rq, skb, os_buf_index, > dma_addr, len); > I'm curious why you are still using skbs as your data type for receiving frames before they come in. Why not just store a pointer to your dma buffer and hold off on allocating the sk_buff until you have actually received the frame in the buffer? It would save you something like 256B per frame if you just hold off on the allocation until the skb is really needed. 
> @@ -1016,8 +1005,6 @@ static bool enic_rxcopybreak(struct net_device *netdev, struct sk_buff **skb, > new_skb = netdev_alloc_skb_ip_align(netdev, len); > if (!new_skb) > return false; > - pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, len, > - DMA_FROM_DEVICE); > memcpy(new_skb->data, (*skb)->data, len); > *skb = new_skb; > > @@ -1065,8 +1052,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, > enic->rq_truncated_pkts++; > } > > - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, > - PCI_DMA_FROMDEVICE); > + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); > dev_kfree_skb_any(skb); > buf->os_buf = NULL; > > @@ -1078,10 +1064,11 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, > /* Good receive > */ > > + pci_dma_sync_single_for_cpu(enic->pdev, buf->dma_addr, > + bytes_written, DMA_FROM_DEVICE); > if (!enic_rxcopybreak(netdev, &skb, buf, bytes_written)) { > buf->os_buf = NULL; > - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, > - PCI_DMA_FROMDEVICE); > + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); > } > prefetch(skb->data - NET_IP_ALIGN); > It looks like you already have copy-break code in your codepath. It might be worth taking a look at what you would gain by deferring the skb allocation and using the copy-break code path to take care of small frames and headers for larger frames. 
> @@ -1122,9 +1109,7 @@ static void enic_rq_indicate_buf(struct vnic_rq *rq, > > /* Buffer overflow > */ > - > - pci_unmap_single(enic->pdev, buf->dma_addr, buf->len, > - PCI_DMA_FROMDEVICE); > + netdev_dma_frag_unmap(&rq->nc_head, buf->nc); > dev_kfree_skb_any(skb); > buf->os_buf = NULL; > } > @@ -1648,6 +1633,8 @@ static int enic_open(struct net_device *netdev) > } > > for (i = 0; i < enic->rq_count; i++) { > + netdev_dma_init(&enic->rq[i].nc_head, &enic->pdev->dev, > + GFP_ATOMIC); > vnic_rq_fill(&enic->rq[i], enic_rq_alloc_buf); > /* Need at least one buffer on ring to get going */ > if (vnic_rq_desc_used(&enic->rq[i]) == 0) { > diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c b/drivers/net/ethernet/cisco/enic/vnic_rq.c > index 36a2ed6..afa1d71 100644 > --- a/drivers/net/ethernet/cisco/enic/vnic_rq.c > +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c > @@ -23,6 +23,7 @@ > #include <linux/pci.h> > #include <linux/delay.h> > #include <linux/slab.h> > +#include <linux/skbuff.h> > > #include "vnic_dev.h" > #include "vnic_rq.h" > @@ -199,6 +200,8 @@ void vnic_rq_clean(struct vnic_rq *rq, > rq->ring.desc_avail++; > } > > + netdev_dma_destroy(&rq->nc_head); > + > /* Use current fetch_index as the ring starting point */ > fetch_index = ioread32(&rq->ctrl->fetch_index); > > diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.h b/drivers/net/ethernet/cisco/enic/vnic_rq.h > index 8111d52..d4ee963 100644 > --- a/drivers/net/ethernet/cisco/enic/vnic_rq.h > +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.h > @@ -21,6 +21,7 @@ > #define _VNIC_RQ_H_ > > #include <linux/pci.h> > +#include <linux/skbuff.h> > > #include "vnic_dev.h" > #include "vnic_cq.h" > @@ -73,6 +74,7 @@ struct vnic_rq_buf { > unsigned int index; > void *desc; > uint64_t wr_id; > + struct netdev_dma_node *nc; > }; > > struct vnic_rq { > @@ -100,6 +102,7 @@ struct vnic_rq { > unsigned int bpoll_state; > spinlock_t bpoll_lock; > #endif /* CONFIG_NET_RX_BUSY_POLL */ > + struct netdev_dma_head nc_head; > }; > 
> static inline unsigned int vnic_rq_desc_avail(struct vnic_rq *rq)
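The copy-break approach Alex suggests — copy small frames out entirely, and for large frames pull only the headers into a private linear area while the payload stays in the shared fragment — can be modelled in userspace like this. The 256-byte threshold and all names are illustrative; a real driver would build the linear part into a freshly allocated skb and attach the remainder via `skb_add_rx_frag()`:

```c
#include <string.h>
#include <stddef.h>

#define COPYBREAK 256	/* illustrative pull threshold, not from the patch */

struct rx_result {
	unsigned char head[COPYBREAK];	/* linear part: headers, or a whole
					 * small frame */
	int head_len;
	const unsigned char *frag;	/* payload left in the DMA buffer */
	int frag_len;
};

/* Small frames are copied out so the DMA buffer can be reused at once;
 * for large frames only the headers leave the shared page, so the stack
 * never writes into mapped memory. */
static void rx_copybreak(const unsigned char *buf, int len,
			 struct rx_result *r)
{
	int pull = len < COPYBREAK ? len : COPYBREAK;

	memcpy(r->head, buf, pull);
	r->head_len = pull;
	r->frag = len > pull ? buf + pull : NULL;
	r->frag_len = len - pull;
}
```

The point of the split is that the stack only ever modifies `head`, which the driver owns outright — the shared, still-mapped page is referenced read-only as a paged fragment.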
* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-10 20:14 ` Alexander Duyck @ 2015-03-11 9:27 ` Govindarajulu Varadarajan 2015-03-11 14:00 ` Alexander Duyck 0 siblings, 1 reply; 14+ messages in thread From: Govindarajulu Varadarajan @ 2015-03-11 9:27 UTC (permalink / raw) To: Alexander Duyck; +Cc: Govindarajulu Varadarajan, davem, netdev, ssujith, benve On Tue, 10 Mar 2015, Alexander Duyck wrote: > > On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: >> This patches uses dma cache skb allocator fot rx buffers. >> >> netdev_dma_head is initialized per rq. All calls to netdev_dma_alloc_skb() >> and >> netdev_dma_frag_unmap() happens in napi_poll and they are serialized. >> >> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> > > This isn't going to work. The problem is the way you are using your fragments > you can end up with a memory corruption as the frame headers that were > updated by the stack may be reverted for any frames received before the last > frame was unmapped. I ran into that issue when I was doing page reuse with > build_skb on the Intel drivers and I suspect you will see the same issue. > Is this behaviour platform dependent? I tested this patch for more than a month and I did not face any issue. I ran normal traffic like ssh, nfs and iperf/netperf. Is there a special scenario when this could occur? Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this? Each desc should have different dma address to write to. Can you explain me how this can happen? > The way to work around it is to receive the data in to the fragments, and > then pull the headers out and store them in a separate skb via something > similar to copy-break. You can then track the fragments in frags. > If I split the pkt header into another frame, is it guaranteed that stack will not modify the pkt data? Thanks a lot for reviewing this patch. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-11 9:27 ` Govindarajulu Varadarajan @ 2015-03-11 14:00 ` Alexander Duyck 2015-03-11 17:34 ` David Laight 0 siblings, 1 reply; 14+ messages in thread From: Alexander Duyck @ 2015-03-11 14:00 UTC (permalink / raw) To: Govindarajulu Varadarajan, Alexander Duyck; +Cc: davem, netdev, ssujith, benve On 03/11/2015 02:27 AM, Govindarajulu Varadarajan wrote: > > On Tue, 10 Mar 2015, Alexander Duyck wrote: > >> >> On 03/10/2015 10:43 AM, Govindarajulu Varadarajan wrote: >>> This patches uses dma cache skb allocator fot rx buffers. >>> >>> netdev_dma_head is initialized per rq. All calls to >>> netdev_dma_alloc_skb() and >>> netdev_dma_frag_unmap() happens in napi_poll and they are serialized. >>> >>> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> >> >> This isn't going to work. The problem is the way you are using your >> fragments you can end up with a memory corruption as the frame >> headers that were updated by the stack may be reverted for any frames >> received before the last frame was unmapped. I ran into that issue >> when I was doing page reuse with build_skb on the Intel drivers and I >> suspect you will see the same issue. >> > > Is this behaviour platform dependent? I tested this patch for more > than a month > and I did not face any issue. I ran normal traffic like ssh, nfs and > iperf/netperf. > Is there a special scenario when this could occur? Yes it depends on the platform and IOMMU used. For an example take a loot at the SWIOTLB implementation. I always assumed if I can work with that when it is doing bounce buffers I can work with any IOMMU or platform. > > Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this? > Each desc should have different dma address to write to. Can you > explain me how > this can happen? No that won't help. The issue is that when the page is mapped you should not be updating any fields in the page until it is unmapped. 
Since you have multiple buffers mapped to a single page you should be waiting until the entire page is unmapped. > >> The way to work around it is to receive the data in to the fragments, >> and then pull the headers out and store them in a separate skb via >> something similar to copy-break. You can then track the fragments in >> frags. >> > > If I split the pkt header into another frame, is it guaranteed that > stack will > not modify the pkt data? Paged fragments in the frags list use the page count to determine if they can update it. The problem is you cannot use a shared page as skb->head if you plan to do any DMA mapping with it as it can cause issues if you change any of the fields before it is unmapped. > > Thanks a lot for reviewing this patch. No problem. Just glad I saw this before you had to go though reverting your stuff like I did. - Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-11 14:00 ` Alexander Duyck @ 2015-03-11 17:34 ` David Laight 2015-03-11 17:51 ` Alexander Duyck 0 siblings, 1 reply; 14+ messages in thread From: David Laight @ 2015-03-11 17:34 UTC (permalink / raw) To: 'Alexander Duyck', Govindarajulu Varadarajan, Alexander Duyck Cc: davem, netdev, ssujith, benve From: Alexander Duyck ... > > Is this behaviour platform dependent? I tested this patch for more > > than a month > > and I did not face any issue. I ran normal traffic like ssh, nfs and > > iperf/netperf. > > Is there a special scenario when this could occur? > > Yes it depends on the platform and IOMMU used. For an example take a > loot at the SWIOTLB implementation. I always assumed if I can work with > that when it is doing bounce buffers I can work with any IOMMU or platform. > > > > > Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this? > > Each desc should have different dma address to write to. Can you > > explain me how > > this can happen? > > No that won't help. The issue is that when the page is mapped you > should not be updating any fields in the page until it is unmapped. > Since you have multiple buffers mapped to a single page you should be > waiting until the entire page is unmapped. Isn't the 'unit of memory for dma sync' a cache line, not a page? You certainly need to test on systems without cache coherent io. David ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH net-next v3 2/2] enic: use netdev_dma_alloc 2015-03-11 17:34 ` David Laight @ 2015-03-11 17:51 ` Alexander Duyck 0 siblings, 0 replies; 14+ messages in thread From: Alexander Duyck @ 2015-03-11 17:51 UTC (permalink / raw) To: David Laight, Govindarajulu Varadarajan, Alexander Duyck Cc: davem, netdev, ssujith, benve On 03/11/2015 10:34 AM, David Laight wrote: > From: Alexander Duyck > ... >>> Is this behaviour platform dependent? I tested this patch for more >>> than a month >>> and I did not face any issue. I ran normal traffic like ssh, nfs and >>> iperf/netperf. >>> Is there a special scenario when this could occur? >> Yes it depends on the platform and IOMMU used. For an example take a >> loot at the SWIOTLB implementation. I always assumed if I can work with >> that when it is doing bounce buffers I can work with any IOMMU or platform. >> >>> Will using DMA_BIDIRECTIONAL and sync_to_cpu & sync_to_device solve this? >>> Each desc should have different dma address to write to. Can you >>> explain me how >>> this can happen? >> No that won't help. The issue is that when the page is mapped you >> should not be updating any fields in the page until it is unmapped. >> Since you have multiple buffers mapped to a single page you should be >> waiting until the entire page is unmapped. > Isn't the 'unit of memory for dma sync' a cache line, not a page? Yes, but the problem is the entire page is mapped, and unmapped and that triggers a syncronization over the entire page, not just the most recent buffer within the page that was used. The problem is the API maps an order 3 page and then is using chunks of it for receive buffers, but then the last buffer unmaps the entire page which could invalidate any CPU side accesses to the page while it was still mapped. 
In order to make it workable it would have to be mapped bidirectional and on the last unmap everything that isn't the last buffer would have to be synced for device before the page is unmapped which would likely be more expensive than just avoiding all of this by identifying the page as being shared and cloning the header out of the page frag. > You certainly need to test on systems without cache coherent io. > > David I agree. - Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
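The failure mode Alex and Ben describe can be reproduced with a toy bounce-buffer model: per-buffer `sync_for_cpu` copies only that buffer, but the final `dma_unmap` of a `DMA_FROM_DEVICE` mapping copies the *whole* page back, clobbering any header edits the stack made in earlier buffers. All names and sizes here are illustrative stand-ins for the SWIOTLB behaviour, not kernel API:

```c
#include <string.h>

#define PAGE_SZ 4096
#define BUF_SZ  1024

/* Toy SWIOTLB: 'bounce' is what the device wrote, 'cpu' is the kernel's
 * view of the original page. */
static unsigned char bounce[PAGE_SZ];
static unsigned char cpu[PAGE_SZ];

/* Per-buffer sync: copy just this buffer's bytes device -> cpu. */
static void sync_for_cpu(int off, int len)
{
	memcpy(cpu + off, bounce + off, len);
}

/* Unmapping a DMA_FROM_DEVICE mapping syncs the WHOLE mapped region
 * device -> cpu, regardless of which buffers were already synced. */
static void unmap_page_fromdevice(void)
{
	memcpy(cpu, bounce, PAGE_SZ);
}
```

Running the sequence — receive buffer 0, let the stack rewrite a header byte, then unmap the page once the last buffer is done — shows the stack's write being silently reverted, which is exactly why a shared mapped page must not back `skb->head`.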