linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/7] add socket to netdev page frag recycling support
@ 2021-08-18  3:32 Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 1/7] page_pool: refactor the page pool to support multi alloc context Yunsheng Lin
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

This patchset adds the socket to netdev page frag recycling
support based on the busy polling and page pool infrastructure.

The performance improves from 30Gbit to 41Gbit for one thread iperf
tcp flow, and the CPU usage decreases about 20% for four threads
iperf flow with 100Gb line speed in IOMMU strict mode.

The performance improves about 2.5% for one thread iperf tcp flow
in IOMMU passthrough mode.
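
For orientation, a rough sketch of how the pieces below are wired
together; the function names come from the individual patches, and
the snippet is a simplified flow rather than the literal driver/TCP
code:

	/* 1) driver (patches 3/7 and 6/7): tie the rx page pool to the
	 *    NAPI instance
	 */
	netif_recyclable_napi_add(netdev, napi, poll, NAPI_POLL_WEIGHT,
				  rx_page_pool);

	/* 2) tcp_sendmsg_locked() (patches 4/7 and 5/7): refill the
	 *    per-socket page frag from the page pool looked up via the
	 *    socket's busy-poll napi_id
	 */
	pfrag_pool_updata_napi(sk->sk_frag_pool, READ_ONCE(sk->sk_napi_id));
	pfrag = pfrag_pool_refill(sk->sk_frag_pool, sk->sk_allocation);

	/* 3) on tx completion, freeing the skb drops the frag reference
	 *    via __skb_frag_unref() (patch 2/7), which hands the page back
	 *    to the page pool for reuse instead of freeing it
	 */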

Yunsheng Lin (7):
  page_pool: refactor the page pool to support multi alloc context
  skbuff: add interface to manipulate frag count for tx recycling
  net: add NAPI api to register and retrieve the page pool ptr
  net: pfrag_pool: add pfrag pool support based on page pool
  sock: support refilling pfrag from pfrag_pool
  net: hns3: support tx recycling in the hns3 driver
  sysctl_tcp_use_pfrag_pool

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 32 +++++----
 include/linux/netdevice.h                       |  9 +++
 include/linux/skbuff.h                          | 43 +++++++++++-
 include/net/netns/ipv4.h                        |  1 +
 include/net/page_pool.h                         | 15 ++++
 include/net/pfrag_pool.h                        | 24 +++++++
 include/net/sock.h                              |  1 +
 net/core/Makefile                               |  1 +
 net/core/dev.c                                  | 34 ++++++++-
 net/core/page_pool.c                            | 86 ++++++++++++-----------
 net/core/pfrag_pool.c                           | 92 +++++++++++++++++++++++++
 net/core/sock.c                                 | 12 ++++
 net/ipv4/sysctl_net_ipv4.c                      |  7 ++
 net/ipv4/tcp.c                                  | 34 ++++++---
 14 files changed, 325 insertions(+), 66 deletions(-)
 create mode 100644 include/net/pfrag_pool.h
 create mode 100644 net/core/pfrag_pool.c

-- 
2.7.4


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 1/7] page_pool: refactor the page pool to support multi alloc context
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 2/7] skbuff: add interface to manipulate frag count for tx recycling Yunsheng Lin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

Currently the page pool assumes the caller MUST guarantee safe
non-concurrent access, e.g. softirq for rx.

This patch refactors the page pool to support multiple allocation
contexts, in order to enable tx recycling support in the
page pool (tx means 'socket to netdev' here).
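
To illustrate the intent (a minimal sketch, not part of the patch):
after this refactor a caller outside the rx softirq context can
allocate from an existing pool through its own pp_alloc_cache, while
drivers keep using the unchanged page_pool_alloc_pages() on
pool->alloc:

	struct pp_alloc_cache alloc = {};	/* caller-owned alloc context,
						 * e.g. the pfrag pool in patch 4/7 */
	struct page *page;

	page = __page_pool_alloc_pages(pool, &alloc, GFP_KERNEL);

	/* on teardown, return any pages still cached in 'alloc' to the pool */
	page_pool_empty_alloc_cache_once(pool, &alloc);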

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h | 10 ++++++
 net/core/page_pool.c    | 86 +++++++++++++++++++++++++++----------------------
 2 files changed, 57 insertions(+), 39 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index a408240..8d4ae4b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -135,6 +135,9 @@ struct page_pool {
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+struct page *__page_pool_alloc_pages(struct page_pool *pool,
+				     struct pp_alloc_cache *alloc,
+				     gfp_t gfp);
 
 static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
 {
@@ -155,6 +158,13 @@ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
 	return page_pool_alloc_frag(pool, offset, size, gfp);
 }
 
+struct page *page_pool_drain_frag(struct page_pool *pool, struct page *page,
+				  long drain_count);
+void page_pool_free_frag(struct page_pool *pool, struct page *page,
+			 long drain_count);
+void page_pool_empty_alloc_cache_once(struct page_pool *pool,
+				      struct pp_alloc_cache *alloc);
+
 /* get the stored dma direction. A driver might decide to treat this locally and
  * avoid the extra cache line from page_pool to determine the direction
  */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e140905..7194dcc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -110,7 +110,8 @@ EXPORT_SYMBOL(page_pool_create);
 static void page_pool_return_page(struct page_pool *pool, struct page *page);
 
 noinline
-static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
+static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
+						 struct pp_alloc_cache *alloc)
 {
 	struct ptr_ring *r = &pool->ring;
 	struct page *page;
@@ -140,7 +141,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 			break;
 
 		if (likely(page_to_nid(page) == pref_nid)) {
-			pool->alloc.cache[pool->alloc.count++] = page;
+			alloc->cache[alloc->count++] = page;
 		} else {
 			/* NUMA mismatch;
 			 * (1) release 1 page to page-allocator and
@@ -151,27 +152,28 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 			page = NULL;
 			break;
 		}
-	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);
+	} while (alloc->count < PP_ALLOC_CACHE_REFILL);
 
 	/* Return last page */
-	if (likely(pool->alloc.count > 0))
-		page = pool->alloc.cache[--pool->alloc.count];
+	if (likely(alloc->count > 0))
+		page = alloc->cache[--alloc->count];
 
 	spin_unlock(&r->consumer_lock);
 	return page;
 }
 
 /* fast path */
-static struct page *__page_pool_get_cached(struct page_pool *pool)
+static struct page *__page_pool_get_cached(struct page_pool *pool,
+					   struct pp_alloc_cache *alloc)
 {
 	struct page *page;
 
 	/* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
-	if (likely(pool->alloc.count)) {
+	if (likely(alloc->count)) {
 		/* Fast-path */
-		page = pool->alloc.cache[--pool->alloc.count];
+		page = alloc->cache[--alloc->count];
 	} else {
-		page = page_pool_refill_alloc_cache(pool);
+		page = page_pool_refill_alloc_cache(pool, alloc);
 	}
 
 	return page;
@@ -252,6 +254,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 struct pp_alloc_cache *alloc,
 						 gfp_t gfp)
 {
 	const int bulk = PP_ALLOC_CACHE_REFILL;
@@ -265,13 +268,13 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		return __page_pool_alloc_page_order(pool, gfp);
 
 	/* Unnecessary as alloc cache is empty, but guarantees zero count */
-	if (unlikely(pool->alloc.count > 0))
-		return pool->alloc.cache[--pool->alloc.count];
+	if (unlikely(alloc->count > 0))
+		return alloc->cache[--alloc->count];
 
 	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */
-	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
+	memset(alloc->cache, 0, sizeof(void *) * bulk);
 
-	nr_pages = alloc_pages_bulk_array(gfp, bulk, pool->alloc.cache);
+	nr_pages = alloc_pages_bulk_array(gfp, bulk, alloc->cache);
 	if (unlikely(!nr_pages))
 		return NULL;
 
@@ -279,7 +282,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	 * page element have not been (possibly) DMA mapped.
 	 */
 	for (i = 0; i < nr_pages; i++) {
-		page = pool->alloc.cache[i];
+		page = alloc->cache[i];
 		if ((pp_flags & PP_FLAG_DMA_MAP) &&
 		    unlikely(!page_pool_dma_map(pool, page))) {
 			put_page(page);
@@ -287,7 +290,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		}
 
 		page_pool_set_pp_info(pool, page);
-		pool->alloc.cache[pool->alloc.count++] = page;
+		alloc->cache[alloc->count++] = page;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
 		trace_page_pool_state_hold(pool, page,
@@ -295,8 +298,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	}
 
 	/* Return last page */
-	if (likely(pool->alloc.count > 0))
-		page = pool->alloc.cache[--pool->alloc.count];
+	if (likely(alloc->count > 0))
+		page = alloc->cache[--alloc->count];
 	else
 		page = NULL;
 
@@ -307,19 +310,27 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 /* For using page_pool replace: alloc_pages() API calls, but provide
  * synchronization guarantee for allocation side.
  */
-struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+struct page *__page_pool_alloc_pages(struct page_pool *pool,
+				     struct pp_alloc_cache *alloc,
+				     gfp_t gfp)
 {
 	struct page *page;
 
 	/* Fast-path: Get a page from cache */
-	page = __page_pool_get_cached(pool);
+	page = __page_pool_get_cached(pool, alloc);
 	if (page)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	page = __page_pool_alloc_pages_slow(pool, alloc, gfp);
 	return page;
 }
+EXPORT_SYMBOL(__page_pool_alloc_pages);
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	return __page_pool_alloc_pages(pool, &pool->alloc, gfp);
+}
 EXPORT_SYMBOL(page_pool_alloc_pages);
 
 /* Calculate distance between two u32 values, valid if distance is below 2^(31)
@@ -522,11 +533,9 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
 
-static struct page *page_pool_drain_frag(struct page_pool *pool,
-					 struct page *page)
+struct page *page_pool_drain_frag(struct page_pool *pool, struct page *page,
+				  long drain_count)
 {
-	long drain_count = BIAS_MAX - pool->frag_users;
-
 	/* Some user is still using the page frag */
 	if (likely(page_pool_atomic_sub_frag_count_return(page,
 							  drain_count)))
@@ -543,13 +552,9 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
 	return NULL;
 }
 
-static void page_pool_free_frag(struct page_pool *pool)
+void page_pool_free_frag(struct page_pool *pool, struct page *page,
+			 long drain_count)
 {
-	long drain_count = BIAS_MAX - pool->frag_users;
-	struct page *page = pool->frag_page;
-
-	pool->frag_page = NULL;
-
 	if (!page ||
 	    page_pool_atomic_sub_frag_count_return(page, drain_count))
 		return;
@@ -572,7 +577,8 @@ struct page *page_pool_alloc_frag(struct page_pool *pool,
 	*offset = pool->frag_offset;
 
 	if (page && *offset + size > max_size) {
-		page = page_pool_drain_frag(pool, page);
+		page = page_pool_drain_frag(pool, page,
+					    BIAS_MAX - pool->frag_users);
 		if (page)
 			goto frag_reset;
 	}
@@ -628,26 +634,26 @@ static void page_pool_free(struct page_pool *pool)
 	kfree(pool);
 }
 
-static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
+void page_pool_empty_alloc_cache_once(struct page_pool *pool,
+				      struct pp_alloc_cache *alloc)
 {
 	struct page *page;
 
-	if (pool->destroy_cnt)
-		return;
-
 	/* Empty alloc cache, assume caller made sure this is
 	 * no-longer in use, and page_pool_alloc_pages() cannot be
 	 * call concurrently.
 	 */
-	while (pool->alloc.count) {
-		page = pool->alloc.cache[--pool->alloc.count];
+	while (alloc->count) {
+		page = alloc->cache[--alloc->count];
 		page_pool_return_page(pool, page);
 	}
 }
 
 static void page_pool_scrub(struct page_pool *pool)
 {
-	page_pool_empty_alloc_cache_once(pool);
+	if (!pool->destroy_cnt)
+		page_pool_empty_alloc_cache_once(pool, &pool->alloc);
+
 	pool->destroy_cnt++;
 
 	/* No more consumers should exist, but producers could still
@@ -705,7 +711,9 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_put(pool))
 		return;
 
-	page_pool_free_frag(pool);
+	page_pool_free_frag(pool, pool->frag_page,
+			    BIAS_MAX - pool->frag_users);
+	pool->frag_page = NULL;
 
 	if (!page_pool_release(pool))
 		return;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 2/7] skbuff: add interface to manipulate frag count for tx recycling
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 1/7] page_pool: refactor the page pool to support multi alloc context Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 3/7] net: add NAPI api to register and retrieve the page pool ptr Yunsheng Lin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

The skb->pp_recycle and page->pp_magic may not be enough
to track whether a frag page is from the page pool after
__skb_frag_ref() has been called, mostly because of a data race,
see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
recycling page_pool packets").

In the case of tcp there may be fragmenting, coalescing or
retransmitting cases that lose track of whether a frag page
is from the page pool or not.

So increment the frag count when __skb_frag_ref() is called,
and use bit 0 in frag->bv_page to indicate that a page is
from a page pool, which is automatically passed down to another
frag->bv_page when doing a '*new_frag = *frag' or memcpying
the shinfo.

It seems we could do the trick for rx too if it makes sense.
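
For illustration only (not part of the patch), the intended usage of
the new helpers versus the regular page-refcount path looks roughly
like this, where 'from_pp' stands in for whatever tells the caller
the page came from a page pool:

	if (from_pp)
		skb_fill_pp_page_desc(skb, i, page, off, copy);	/* sets bit 0 in bv_page */
	else
		skb_fill_page_desc(skb, i, page, off, copy);

	/* later, when the frag is cloned/coalesced/retransmitted: */
	__skb_frag_ref(frag);	/* bumps page->pp_frag_count if skb_frag_is_pp(frag) */

	__skb_frag_unref(frag, skb->pp_recycle);	/* may return the page to its pool */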

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/skbuff.h  | 43 ++++++++++++++++++++++++++++++++++++++++---
 include/net/page_pool.h |  5 +++++
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6bdb0db..2878d26 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -331,6 +331,11 @@ static inline unsigned int skb_frag_size(const skb_frag_t *frag)
 	return frag->bv_len;
 }
 
+static inline bool skb_frag_is_pp(const skb_frag_t *frag)
+{
+	return (unsigned long)frag->bv_page & 1UL;
+}
+
 /**
  * skb_frag_size_set() - Sets the size of a skb fragment
  * @frag: skb fragment
@@ -2190,6 +2195,21 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 		skb->pfmemalloc	= true;
 }
 
+static inline void __skb_fill_pp_page_desc(struct sk_buff *skb, int i,
+					   struct page *page, int off,
+					   int size)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+	frag->bv_page = (struct page *)((unsigned long)page | 0x1UL);
+	frag->bv_offset = off;
+	skb_frag_size_set(frag, size);
+
+	page = compound_head(page);
+	if (page_is_pfmemalloc(page))
+		skb->pfmemalloc = true;
+}
+
 /**
  * skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
@@ -2211,6 +2231,14 @@ static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+static inline void skb_fill_pp_page_desc(struct sk_buff *skb, int i,
+					 struct page *page, int off,
+					 int size)
+{
+	__skb_fill_pp_page_desc(skb, i, page, off, size);
+	skb_shinfo(skb)->nr_frags = i + 1;
+}
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size, unsigned int truesize);
 
@@ -3062,7 +3090,10 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto,
  */
 static inline struct page *skb_frag_page(const skb_frag_t *frag)
 {
-	return frag->bv_page;
+	unsigned long page = (unsigned long)frag->bv_page;
+
+	page &= ~1UL;
+	return (struct page *)page;
 }
 
 /**
@@ -3073,7 +3104,12 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+	if (skb_frag_is_pp(frag))
+		page_pool_atomic_inc_frag_count(page);
+	else
+		get_page(page);
 }
 
 /**
@@ -3101,7 +3137,8 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if ((recycle || skb_frag_is_pp(frag)) &&
+	    page_pool_return_skb_page(page))
 		return;
 #endif
 	put_page(page);
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 8d4ae4b..86babb2 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -270,6 +270,11 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 	return ret;
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
+
 static inline bool is_page_pool_compiled_in(void)
 {
 #ifdef CONFIG_PAGE_POOL
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 3/7] net: add NAPI api to register and retrieve the page pool ptr
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 1/7] page_pool: refactor the page pool to support multi alloc context Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 2/7] skbuff: add interface to manipulate frag count for tx recycling Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 4/7] net: pfrag_pool: add pfrag pool support based on page pool Yunsheng Lin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

As tx recycling is built upon the busy polling infrastructure,
and busy polling is based on napi_id, add an API for the driver
to register a page pool to a NAPI instance and an API for the
socket layer to retrieve the page pool corresponding to a NAPI.
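
A minimal usage sketch (the driver side mirrors patch 6/7, the socket
side mirrors patch 4/7; 'my_poll' is just a stand-in for the driver's
poll callback):

	/* driver: tie the rx page pool to the NAPI instance */
	netif_recyclable_napi_add(netdev, &tqp_vector->napi, my_poll,
				  NAPI_POLL_WEIGHT, ring->page_pool);

	/* socket layer: look the pool up by the napi_id recorded by busy
	 * polling; the lookup is based on napi_by_id(), so it must run
	 * under rcu_read_lock()
	 */
	rcu_read_lock();
	pp = page_pool_get_by_napi_id(READ_ONCE(sk->sk_napi_id));
	rcu_read_unlock();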

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/netdevice.h |  9 +++++++++
 net/core/dev.c            | 34 +++++++++++++++++++++++++++++++---
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2f03cd9..51a1169 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -40,6 +40,7 @@
 #endif
 #include <net/netprio_cgroup.h>
 #include <net/xdp.h>
+#include <net/page_pool.h>
 
 #include <linux/netdev_features.h>
 #include <linux/neighbour.h>
@@ -336,6 +337,7 @@ struct napi_struct {
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
 	struct task_struct	*thread;
+	struct page_pool        *pp;
 };
 
 enum {
@@ -349,6 +351,7 @@ enum {
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
+	NAPI_STATE_RECYCLABLE,          /* Support tx page recycling */
 };
 
 enum {
@@ -362,6 +365,7 @@ enum {
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
+	NAPIF_STATE_RECYCLABLE          = BIT(NAPI_STATE_RECYCLABLE),
 };
 
 enum gro_result {
@@ -2473,6 +2477,10 @@ static inline void *netdev_priv(const struct net_device *dev)
 void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 		    int (*poll)(struct napi_struct *, int), int weight);
 
+void netif_recyclable_napi_add(struct net_device *dev, struct napi_struct *napi,
+			       int (*poll)(struct napi_struct *, int),
+			       int weight, struct page_pool *pool);
+
 /**
  *	netif_tx_napi_add - initialize a NAPI context
  *	@dev:  network device
@@ -2997,6 +3005,7 @@ struct net_device *dev_get_by_index(struct net *net, int ifindex);
 struct net_device *__dev_get_by_index(struct net *net, int ifindex);
 struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex);
 struct net_device *dev_get_by_napi_id(unsigned int napi_id);
+struct page_pool *page_pool_get_by_napi_id(unsigned int napi_id);
 int netdev_get_name(struct net *net, char *name, int ifindex);
 int dev_restart(struct net_device *dev);
 int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 74fd402..d6b905b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -935,6 +935,19 @@ struct net_device *dev_get_by_napi_id(unsigned int napi_id)
 }
 EXPORT_SYMBOL(dev_get_by_napi_id);
 
+struct page_pool *page_pool_get_by_napi_id(unsigned int napi_id)
+{
+	struct napi_struct *napi;
+	struct page_pool *pp = NULL;
+
+	napi = napi_by_id(napi_id);
+	if (napi)
+		pp = napi->pp;
+
+	return pp;
+}
+EXPORT_SYMBOL(page_pool_get_by_napi_id);
+
 /**
  *	netdev_get_name - get a netdevice name, knowing its ifindex.
  *	@net: network namespace
@@ -6757,7 +6770,8 @@ EXPORT_SYMBOL(napi_busy_loop);
 
 static void napi_hash_add(struct napi_struct *napi)
 {
-	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
+	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state) ||
+	    !test_bit(NAPI_STATE_RECYCLABLE, &napi->state))
 		return;
 
 	spin_lock(&napi_hash_lock);
@@ -6860,8 +6874,10 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
-void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
-		    int (*poll)(struct napi_struct *, int), int weight)
+void netif_recyclable_napi_add(struct net_device *dev,
+			       struct napi_struct *napi,
+			       int (*poll)(struct napi_struct *, int),
+			       int weight, struct page_pool *pool)
 {
 	if (WARN_ON(test_and_set_bit(NAPI_STATE_LISTED, &napi->state)))
 		return;
@@ -6886,6 +6902,11 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
+	if (pool) {
+		napi->pp = pool;
+		set_bit(NAPI_STATE_RECYCLABLE, &napi->state);
+	}
+
 	napi_hash_add(napi);
 	/* Create kthread for this napi if dev->threaded is set.
 	 * Clear dev->threaded if kthread creation failed so that
@@ -6894,6 +6915,13 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 	if (dev->threaded && napi_kthread_create(napi))
 		dev->threaded = 0;
 }
+EXPORT_SYMBOL(netif_recyclable_napi_add);
+
+void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
+		    int (*poll)(struct napi_struct *, int), int weight)
+{
+	netif_recyclable_napi_add(dev, napi, poll, weight, NULL);
+}
 EXPORT_SYMBOL(netif_napi_add);
 
 void napi_disable(struct napi_struct *n)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 4/7] net: pfrag_pool: add pfrag pool support based on page pool
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
                   ` (2 preceding siblings ...)
  2021-08-18  3:32 ` [PATCH RFC 3/7] net: add NAPI api to register and retrieve the page pool ptr Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 5/7] sock: support refilling pfrag from pfrag_pool Yunsheng Lin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

This patch adds the pfrag pool support based on the page pool.
The caller needs to call pfrag_pool_updata_napi() to connect the
pfrag pool to the page pool through the napi.
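
A minimal sketch of the intended calling sequence (this mirrors how
patch 5/7 uses it from tcp_sendmsg_locked(); error handling omitted):

	/* bind (or re-bind) the pfrag pool to the page pool behind sk_napi_id */
	pfrag_pool_updata_napi(pool, READ_ONCE(sk->sk_napi_id));

	pfrag = pfrag_pool_refill(pool, sk->sk_allocation);
	if (pfrag) {
		/* copy 'copy' bytes into pfrag->page at pfrag->offset, then
		 * account it; 'merge' means the data was coalesced into the
		 * previous frag instead of starting a new one
		 */
		pfrag_pool_commit(pool, copy, merge);
	}

	/* when the socket is destroyed: return everything to the page pool */
	pfrag_pool_flush(pool);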

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/pfrag_pool.h | 24 +++++++++++++
 net/core/Makefile        |  1 +
 net/core/pfrag_pool.c    | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+)
 create mode 100644 include/net/pfrag_pool.h
 create mode 100644 net/core/pfrag_pool.c

diff --git a/include/net/pfrag_pool.h b/include/net/pfrag_pool.h
new file mode 100644
index 0000000..2abea26
--- /dev/null
+++ b/include/net/pfrag_pool.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PAGE_FRAG_H
+#define _PAGE_FRAG_H
+
+#include <linux/gfp.h>
+#include <linux/llist.h>
+#include <linux/mm_types_task.h>
+#include <net/page_pool.h>
+
+struct pfrag_pool {
+	struct page_frag frag;
+	long frag_users;
+	unsigned int napi_id;
+	struct page_pool *pp;
+	struct pp_alloc_cache alloc;
+};
+
+void pfrag_pool_updata_napi(struct pfrag_pool *pool,
+			    unsigned int napi_id);
+struct page_frag *pfrag_pool_refill(struct pfrag_pool *pool, gfp_t gfp);
+void pfrag_pool_commit(struct pfrag_pool *pool, unsigned int sz,
+		      bool merge);
+void pfrag_pool_flush(struct pfrag_pool *pool);
+#endif
diff --git a/net/core/Makefile b/net/core/Makefile
index 35ced62..171f839 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o flow_offload.o
 
 obj-y += net-sysfs.o
+obj-y += pfrag_pool.o
 obj-$(CONFIG_PAGE_POOL) += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
diff --git a/net/core/pfrag_pool.c b/net/core/pfrag_pool.c
new file mode 100644
index 0000000..6ad1383
--- /dev/null
+++ b/net/core/pfrag_pool.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/align.h>
+#include <linux/dma-mapping.h>
+#include <linux/mm.h>
+#include <linux/netdevice.h>
+#include <net/pfrag_pool.h>
+
+#define BAIS_MAX	(LONG_MAX / 2)
+
+void pfrag_pool_updata_napi(struct pfrag_pool *pool,
+			    unsigned int napi_id)
+{
+	struct page_pool *pp;
+
+	if (!pool || pool->napi_id == napi_id)
+		return;
+
+	pr_info("frag pool %pK's napi id changed from %u to %u\n",
+		pool, pool->napi_id, napi_id);
+
+	rcu_read_lock();
+	pp = page_pool_get_by_napi_id(napi_id);
+	if (!pp) {
+		rcu_read_unlock();
+		return;
+	}
+
+	pool->napi_id = napi_id;
+	pool->pp = pp;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(pfrag_pool_updata_napi);
+
+struct page_frag *pfrag_pool_refill(struct pfrag_pool *pool, gfp_t gfp)
+{
+	struct page_frag *pfrag = &pool->frag;
+
+	if (!pool || !pool->pp)
+		return NULL;
+
+	if (pfrag->page) {
+		long drain_users;
+
+		if (pfrag->offset < pfrag->size)
+			return pfrag;
+
+		drain_users = BAIS_MAX - pool->frag_users;
+		if (page_pool_drain_frag(pool->pp, pfrag->page, drain_users))
+			goto out;
+	}
+
+	pfrag->page = __page_pool_alloc_pages(pool->pp, &pool->alloc, gfp);
+	if (unlikely(!pfrag->page))
+		return NULL;
+
+out:
+	page_pool_set_frag_count(pfrag->page, BAIS_MAX);
+	pfrag->size = page_size(pfrag->page);
+	pool->frag_users = 0;
+	pfrag->offset = 0;
+	return pfrag;
+}
+EXPORT_SYMBOL(pfrag_pool_refill);
+
+void pfrag_pool_commit(struct pfrag_pool *pool, unsigned int sz,
+		       bool merge)
+{
+	struct page_frag *pfrag = &pool->frag;
+
+	pfrag->offset += ALIGN(sz, dma_get_cache_alignment());
+	WARN_ON(pfrag->offset > pfrag->size);
+
+	if (!merge)
+		pool->frag_users++;
+}
+EXPORT_SYMBOL(pfrag_pool_commit);
+
+void pfrag_pool_flush(struct pfrag_pool *pool)
+{
+	struct page_frag *pfrag = &pool->frag;
+
+	page_pool_empty_alloc_cache_once(pool->pp, &pool->alloc);
+
+	if (!pfrag->page)
+		return;
+
+	page_pool_free_frag(pool->pp, pfrag->page,
+			    BAIS_MAX - pool->frag_users);
+	pfrag->page = NULL;
+}
+EXPORT_SYMBOL(pfrag_pool_flush);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 5/7] sock: support refilling pfrag from pfrag_pool
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
                   ` (3 preceding siblings ...)
  2021-08-18  3:32 ` [PATCH RFC 4/7] net: pfrag_pool: add pfrag pool support based on page pool Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  3:32 ` [PATCH RFC 6/7] net: hns3: support tx recycling in the hns3 driver Yunsheng Lin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

As the previous patch has added the pfrag pool based on the
page pool, support refilling the pfrag from the new pfrag pool
for tcpv4.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/sock.h |  1 +
 net/core/sock.c    |  9 +++++++++
 net/ipv4/tcp.c     | 34 ++++++++++++++++++++++++++--------
 3 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 6e76145..af40084 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -455,6 +455,7 @@ struct sock {
 	unsigned long		sk_pacing_rate; /* bytes per second */
 	unsigned long		sk_max_pacing_rate;
 	struct page_frag	sk_frag;
+	struct pfrag_pool	*sk_frag_pool;
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	netdev_features_t	sk_route_forced_caps;
diff --git a/net/core/sock.c b/net/core/sock.c
index aada649..53152c9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -140,6 +140,7 @@
 #include <net/busy_poll.h>
 
 #include <linux/ethtool.h>
+#include <net/pfrag_pool.h>
 
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
@@ -1934,6 +1935,11 @@ static void __sk_destruct(struct rcu_head *head)
 		put_page(sk->sk_frag.page);
 		sk->sk_frag.page = NULL;
 	}
+	if (sk->sk_frag_pool) {
+		pfrag_pool_flush(sk->sk_frag_pool);
+		kfree(sk->sk_frag_pool);
+		sk->sk_frag_pool = NULL;
+	}
 
 	if (sk->sk_peer_cred)
 		put_cred(sk->sk_peer_cred);
@@ -3134,6 +3140,9 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_frag.page	=	NULL;
 	sk->sk_frag.offset	=	0;
+
+	sk->sk_frag_pool = kzalloc(sizeof(*sk->sk_frag_pool), sk->sk_allocation);
+
 	sk->sk_peek_off		=	-1;
 
 	sk->sk_peer_pid 	=	NULL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f931def..992dcbc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -280,6 +280,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <net/pfrag_pool.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -1337,12 +1338,20 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err)
 				goto do_fault;
 		} else if (!zc) {
-			bool merge = true;
+			bool merge = true, pfrag_pool = true;
 			int i = skb_shinfo(skb)->nr_frags;
-			struct page_frag *pfrag = sk_page_frag(sk);
+			struct page_frag *pfrag;
 
-			if (!sk_page_frag_refill(sk, pfrag))
-				goto wait_for_space;
+			pfrag_pool_updata_napi(sk->sk_frag_pool,
+					       READ_ONCE(sk->sk_napi_id));
+			pfrag = pfrag_pool_refill(sk->sk_frag_pool, sk->sk_allocation);
+			if (!pfrag) {
+				pfrag = sk_page_frag(sk);
+				if (!sk_page_frag_refill(sk, pfrag))
+					goto wait_for_space;
+
+				pfrag_pool = false;
+			}
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1369,11 +1378,20 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (merge) {
 				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			} else {
-				skb_fill_page_desc(skb, i, pfrag->page,
-						   pfrag->offset, copy);
-				page_ref_inc(pfrag->page);
+				if (pfrag_pool) {
+					skb_fill_pp_page_desc(skb, i, pfrag->page,
+							      pfrag->offset, copy);
+				} else {
+					page_ref_inc(pfrag->page);
+					skb_fill_page_desc(skb, i, pfrag->page,
+							   pfrag->offset, copy);
+				}
 			}
-			pfrag->offset += copy;
+
+			if (pfrag_pool)
+				pfrag_pool_commit(sk->sk_frag_pool, copy, merge);
+			else
+				pfrag->offset += copy;
 		} else {
 			if (!sk_wmem_schedule(sk, copy))
 				goto wait_for_space;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 6/7] net: hns3: support tx recycling in the hns3 driver
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
                   ` (4 preceding siblings ...)
  2021-08-18  3:32 ` [PATCH RFC 5/7] sock: support refilling pfrag from pfrag_pool Yunsheng Lin
@ 2021-08-18  3:32 ` Yunsheng Lin
  2021-08-18  8:57 ` [PATCH RFC 0/7] add socket to netdev page frag recycling support Eric Dumazet
  2021-08-18 22:05 ` David Ahern
  7 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  3:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

Use netif_recyclable_napi_add() to register the page pool to
the NAPI instance, and avoid doing the DMA mapping/unmapping
when the page is from the page pool.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 32 +++++++++++++++----------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index fcbeb1f..ab86566 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1689,12 +1689,18 @@ static int hns3_map_and_fill_desc(struct hns3_enet_ring *ring, void *priv,
 		return 0;
 	} else {
 		skb_frag_t *frag = (skb_frag_t *)priv;
+		struct page *page = skb_frag_page(frag);
 
 		size = skb_frag_size(frag);
 		if (!size)
 			return 0;
 
-		dma = skb_frag_dma_map(dev, frag, 0, size, DMA_TO_DEVICE);
+		if (skb_frag_is_pp(frag) && page->pp->p.dev == dev) {
+			dma = page_pool_get_dma_addr(page) + skb_frag_off(frag);
+			type = DESC_TYPE_PP_FRAG;
+		} else {
+			dma = skb_frag_dma_map(dev, frag, 0, size, DMA_TO_DEVICE);
+		}
 	}
 
 	if (unlikely(dma_mapping_error(dev, dma))) {
@@ -4525,7 +4531,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv)
 		ret = hns3_get_vector_ring_chain(tqp_vector,
 						 &vector_ring_chain);
 		if (ret)
-			goto map_ring_fail;
+			return ret;
 
 		ret = h->ae_algo->ops->map_ring_to_vector(h,
 			tqp_vector->vector_irq, &vector_ring_chain);
@@ -4533,19 +4539,10 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv)
 		hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain);
 
 		if (ret)
-			goto map_ring_fail;
-
-		netif_napi_add(priv->netdev, &tqp_vector->napi,
-			       hns3_nic_common_poll, NAPI_POLL_WEIGHT);
+			return ret;
 	}
 
 	return 0;
-
-map_ring_fail:
-	while (i--)
-		netif_napi_del(&priv->tqp_vector[i].napi);
-
-	return ret;
 }
 
 static void hns3_nic_init_coal_cfg(struct hns3_nic_priv *priv)
@@ -4754,7 +4751,7 @@ static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
 				(PAGE_SIZE << hns3_page_order(ring)),
 		.nid = dev_to_node(ring_to_dev(ring)),
 		.dev = ring_to_dev(ring),
-		.dma_dir = DMA_FROM_DEVICE,
+		.dma_dir = DMA_BIDIRECTIONAL,
 		.offset = 0,
 		.max_len = PAGE_SIZE << hns3_page_order(ring),
 	};
@@ -4923,6 +4920,15 @@ int hns3_init_all_ring(struct hns3_nic_priv *priv)
 		u64_stats_init(&priv->ring[i].syncp);
 	}
 
+	for (i = 0; i < priv->vector_num; i++) {
+		struct hns3_enet_tqp_vector *tqp_vector;
+
+		tqp_vector = &priv->tqp_vector[i];
+		netif_recyclable_napi_add(priv->netdev, &tqp_vector->napi,
+					  hns3_nic_common_poll, NAPI_POLL_WEIGHT,
+					  tqp_vector->rx_group.ring->page_pool);
+	}
+
 	return 0;
 
 out_when_alloc_ring_memory:
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
                   ` (5 preceding siblings ...)
  2021-08-18  3:32 ` [PATCH RFC 6/7] net: hns3: support tx recycling in the hns3 driver Yunsheng Lin
@ 2021-08-18  8:57 ` Eric Dumazet
  2021-08-18  9:36   ` Yunsheng Lin
  2021-08-18 22:05 ` David Ahern
  7 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2021-08-18  8:57 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe

On Wed, Aug 18, 2021 at 5:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> This patchset adds the socket to netdev page frag recycling
> support based on the busy polling and page pool infrastructure.

I really do not see how this can scale to thousands of sockets.

tcp_mem[] defaults to ~ 9 % of physical memory.

If you now run tests with thousands of sockets, their skbs will
consume Gigabytes
of memory on typical servers, now backed by order-0 pages (instead of
current order-3 pages)
So IOMMU costs will actually be much bigger.

Are we planning to use Gigabyte sized page pools for NIC ?

Have you tried instead to make TCP frags twice bigger ?
This would require less IOMMU mappings.
(Note: This could require some mm help, since PAGE_ALLOC_COSTLY_ORDER
is currently 3, not 4)

diff --git a/net/core/sock.c b/net/core/sock.c
index a3eea6e0b30a7d43793f567ffa526092c03e3546..6b66b51b61be9f198f6f1c4a3d81b57fa327986a
100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2560,7 +2560,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
        }
 }

-#define SKB_FRAG_PAGE_ORDER    get_order(32768)
+#define SKB_FRAG_PAGE_ORDER    get_order(65536)
 DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);

 /**



>
> The performance improves from 30Gbit to 41Gbit for one thread iperf
> tcp flow, and the CPU usage decreases about 20% for four threads
> iperf flow with 100Gb line speed in IOMMU strict mode.
>
> The performance improves about 2.5% for one thread iperf tcp flow
> in IOMMU passthrough mode.
>
> Yunsheng Lin (7):
>   page_pool: refactor the page pool to support multi alloc context
>   skbuff: add interface to manipulate frag count for tx recycling
>   net: add NAPI api to register and retrieve the page pool ptr
>   net: pfrag_pool: add pfrag pool support based on page pool
>   sock: support refilling pfrag from pfrag_pool
>   net: hns3: support tx recycling in the hns3 driver
>   sysctl_tcp_use_pfrag_pool
>
>  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 32 +++++----
>  include/linux/netdevice.h                       |  9 +++
>  include/linux/skbuff.h                          | 43 +++++++++++-
>  include/net/netns/ipv4.h                        |  1 +
>  include/net/page_pool.h                         | 15 ++++
>  include/net/pfrag_pool.h                        | 24 +++++++
>  include/net/sock.h                              |  1 +
>  net/core/Makefile                               |  1 +
>  net/core/dev.c                                  | 34 ++++++++-
>  net/core/page_pool.c                            | 86 ++++++++++++-----------
>  net/core/pfrag_pool.c                           | 92 +++++++++++++++++++++++++
>  net/core/sock.c                                 | 12 ++++
>  net/ipv4/sysctl_net_ipv4.c                      |  7 ++
>  net/ipv4/tcp.c                                  | 34 ++++++---
>  14 files changed, 325 insertions(+), 66 deletions(-)
>  create mode 100644 include/net/pfrag_pool.h
>  create mode 100644 net/core/pfrag_pool.c
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-18  8:57 ` [PATCH RFC 0/7] add socket to netdev page frag recycling support Eric Dumazet
@ 2021-08-18  9:36   ` Yunsheng Lin
  2021-08-23  9:25     ` [Linuxarm] " Yunsheng Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-18  9:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, hch

On 2021/8/18 16:57, Eric Dumazet wrote:
> On Wed, Aug 18, 2021 at 5:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> This patchset adds the socket to netdev page frag recycling
>> support based on the busy polling and page pool infrastructure.
> 
> I really do not see how this can scale to thousands of sockets.
> 
> tcp_mem[] defaults to ~ 9 % of physical memory.
> 
> If you now run tests with thousands of sockets, their skbs will
> consume Gigabytes
> of memory on typical servers, now backed by order-0 pages (instead of
> current order-3 pages)
> So IOMMU costs will actually be much bigger.

As the page allocator supports bulk allocation now, see:
https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L252

if the DMA also supports batch mapping/unmapping, maybe having a
small-sized page pool for thousands of sockets may not be a problem?
Christoph Hellwig mentioned the batch DMA operation support in below
thread:
https://www.spinics.net/lists/netdev/msg666715.html

if the batched DMA operation is supported, maybe having the
page pool mainly benefits the case of a small number of sockets?

> 
> Are we planning to use Gigabyte sized page pools for NIC ?
> 
> Have you tried instead to make TCP frags twice bigger ?

Not yet.

> This would require less IOMMU mappings.
> (Note: This could require some mm help, since PAGE_ALLOC_COSTLY_ORDER
> is currently 3, not 4)

I am not familiar with mm yet, but I will take a look at that :)

> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index a3eea6e0b30a7d43793f567ffa526092c03e3546..6b66b51b61be9f198f6f1c4a3d81b57fa327986a
> 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2560,7 +2560,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
>         }
>  }
> 
> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> 
>  /**
> 
> 
> 
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
                   ` (6 preceding siblings ...)
  2021-08-18  8:57 ` [PATCH RFC 0/7] add socket to netdev page frag recycling support Eric Dumazet
@ 2021-08-18 22:05 ` David Ahern
  2021-08-19  8:18   ` Yunsheng Lin
  7 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2021-08-18 22:05 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 8/17/21 9:32 PM, Yunsheng Lin wrote:
> This patchset adds the socket to netdev page frag recycling
> support based on the busy polling and page pool infrastructure.
> 
> The performance improves from 30Gbit to 41Gbit for one thread iperf
> tcp flow, and the CPU usage decreases about 20% for four threads
> iperf flow with 100Gb line speed in IOMMU strict mode.
> 
> The performance improves about 2.5% for one thread iperf tcp flow
> in IOMMU passthrough mode.
> 

Details about the test setup? cpu model, mtu, any other relevant changes
/ settings.

How does that performance improvement compare with using the Tx ZC API?
At 1500 MTU I see a CPU drop on the Tx side from 80% to 20% with the ZC
API and ~10% increase in throughput. Bumping the MTU to 3300 and
performance with the ZC API is 2x the current model with 1/2 the cpu.

Epyc 7502, ConnectX-6, IOMMU off.

In short, it seems like improving the Tx ZC API is the better path
forward than per-socket page pools.
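
For reference, the Tx ZC API mentioned here is MSG_ZEROCOPY; its use
looks roughly like this (minimal sketch, error handling and completion
parsing omitted; see Documentation/networking/msg_zerocopy.rst and the
msg_zerocopy selftest used later in this thread):

	int one = 1;

	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);
	/* pages stay pinned until the completion notification arrives
	 * on the socket error queue */
	recvmsg(fd, &msg, MSG_ERRQUEUE);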

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-18 22:05 ` David Ahern
@ 2021-08-19  8:18   ` Yunsheng Lin
  2021-08-20 14:35     ` David Ahern
  0 siblings, 1 reply; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-19  8:18 UTC (permalink / raw)
  To: David Ahern, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 2021/8/19 6:05, David Ahern wrote:
> On 8/17/21 9:32 PM, Yunsheng Lin wrote:
>> This patchset adds the socket to netdev page frag recycling
>> support based on the busy polling and page pool infrastructure.
>>
>> The performance improves from 30Gbit to 41Gbit for one thread iperf
>> tcp flow, and the CPU usage decreases about 20% for four threads
>> iperf flow with 100Gb line speed in IOMMU strict mode.
>>
>> The performance improves about 2.5% for one thread iperf tcp flow
>> in IOMMU passthrough mode.
>>
> 
> Details about the test setup? cpu model, mtu, any other relevant changes
> / settings.

CPU is arm64 Kunpeng 920, see:
https://www.hisilicon.com/en/products/Kunpeng/Huawei-Kunpeng-920

mtu is 1500; the relevant changes/settings I can think of are that the iperf
client runs on the same numa node as the nic hw (which has one 100Gbit
port), and the driver has XPS enabled too.

> 
> How does that performance improvement compare with using the Tx ZC API?
> At 1500 MTU I see a CPU drop on the Tx side from 80% to 20% with the ZC
> API and ~10% increase in throughput. Bumping the MTU to 3300 and
> performance with the ZC API is 2x the current model with 1/2 the cpu.

I added a sysctl node to decide whether pfrag pool is used:
net.ipv4.tcp_use_pfrag_pool

and use msg_zerocopy to compare the result:
Server uses cmd "./msg_zerocopy -4 -i eth4 -C 32 -S 192.168.100.2 -r tcp"
Client uses cmd "./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -"

The zc does seem to improve the CPU usages significantly, but not for throughput
with mtu 1500. And the result seems to be similar with mtu 3300.

the detail result is below:

(1) IOMMU strict mode + net.ipv4.tcp_use_pfrag_pool = 0:
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
tx=115317 (7196 MB) txc=0 zc=n

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':

        4315472244      cycles

       4.199890190 seconds time elapsed

       0.084328000 seconds user
       1.528714000 seconds sys
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
tx=90121 (5623 MB) txc=90121 zc=y

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':

        1715892155      cycles

       4.243329050 seconds time elapsed

       0.083275000 seconds user
       0.755355000 seconds sys


(2)IOMMU strict mode + net.ipv4.tcp_use_pfrag_pool = 1:
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
tx=138932 (8669 MB) txc=0 zc=n

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':

        4034016168      cycles

       4.199877510 seconds time elapsed

       0.058143000 seconds user
       1.644480000 seconds sys
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
tx=93369 (5826 MB) txc=93369 zc=y

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':

        1815300491      cycles

       4.243259530 seconds time elapsed

       0.051767000 seconds user
       0.796610000 seconds sys


(3)IOMMU passthrough + net.ipv4.tcp_use_pfrag_pool=0
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
tx=129927 (8107 MB) txc=0 zc=n

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':

        3720131007      cycles

       4.200651840 seconds time elapsed

       0.038604000 seconds user
       1.455521000 seconds sys
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
tx=135285 (8442 MB) txc=135285 zc=y

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':

        1721949875      cycles

       4.242596800 seconds time elapsed

       0.024963000 seconds user
       0.779391000 seconds sys

(4)IOMMU  passthrough + net.ipv4.tcp_use_pfrag_pool=1
root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
tx=151844 (9475 MB) txc=0 zc=n

 Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':

        3786216097      cycles

       4.200606520 seconds time elapsed

       0.028633000 seconds user
       1.569736000 seconds sys


> 
> Epyc 7502, ConnectX-6, IOMMU off.
> 
> In short, it seems like improving the Tx ZC API is the better path
> forward than per-socket page pools.

The main goal is to optimize the SMMU mapping/unmapping. If the cost of memcpy
is higher than the SMMU mapping/unmapping + page pinning, then Tx ZC may be a
better path; at least it is not clear for small packets.


> .
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-19  8:18   ` Yunsheng Lin
@ 2021-08-20 14:35     ` David Ahern
  2021-08-23  3:32       ` Yunsheng Lin
  0 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2021-08-20 14:35 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 8/19/21 2:18 AM, Yunsheng Lin wrote:
> On 2021/8/19 6:05, David Ahern wrote:
>> On 8/17/21 9:32 PM, Yunsheng Lin wrote:
>>> This patchset adds the socket to netdev page frag recycling
>>> support based on the busy polling and page pool infrastructure.
>>>
>>> The performance improves from 30Gbit to 41Gbit for one thread iperf
>>> tcp flow, and the CPU usage decreases about 20% for four threads
>>> iperf flow with 100Gb line speed in IOMMU strict mode.
>>>
>>> The performance improves about 2.5% for one thread iperf tcp flow
>>> in IOMMU passthrough mode.
>>>
>>
>> Details about the test setup? cpu model, mtu, any other relevant changes
>> / settings.
> 
> CPU is arm64 Kunpeng 920, see:
> https://www.hisilicon.com/en/products/Kunpeng/Huawei-Kunpeng-920
> 
> mtu is 1500; the relevant changes/settings I can think of are that the iperf
> client runs on the same numa node as the nic hw (which has one 100Gbit
> port), and the driver has XPS enabled too.
> 
>>
>> How does that performance improvement compare with using the Tx ZC API?
>> At 1500 MTU I see a CPU drop on the Tx side from 80% to 20% with the ZC
>> API and ~10% increase in throughput. Bumping the MTU to 3300 and
>> performance with the ZC API is 2x the current model with 1/2 the cpu.
> 
> I added a sysctl node to decide whether pfrag pool is used:
> net.ipv4.tcp_use_pfrag_pool
> 
> and use msg_zerocopy to compare the result:
> Server uses cmd "./msg_zerocopy -4 -i eth4 -C 32 -S 192.168.100.2 -r tcp"
> Client uses cmd "./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -"
> 
> The zc does seem to improve the CPU usages significantly, but not for throughput
> with mtu 1500. And the result seems to be similar with mtu 3300.
> 
> the detail result is below:
> 
> (1) IOMMU strict mode + net.ipv4.tcp_use_pfrag_pool = 0:
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
> tx=115317 (7196 MB) txc=0 zc=n
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':
> 
>         4315472244      cycles
> 
>        4.199890190 seconds time elapsed
> 
>        0.084328000 seconds user
>        1.528714000 seconds sys
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
> tx=90121 (5623 MB) txc=90121 zc=y
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':
> 
>         1715892155      cycles
> 
>        4.243329050 seconds time elapsed
> 
>        0.083275000 seconds user
>        0.755355000 seconds sys
> 
> 
> (2)IOMMU strict mode + net.ipv4.tcp_use_pfrag_pool = 1:
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
> tx=138932 (8669 MB) txc=0 zc=n
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':
> 
>         4034016168      cycles
> 
>        4.199877510 seconds time elapsed
> 
>        0.058143000 seconds user
>        1.644480000 seconds sys
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
> tx=93369 (5826 MB) txc=93369 zc=y
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':
> 
>         1815300491      cycles
> 
>        4.243259530 seconds time elapsed
> 
>        0.051767000 seconds user
>        0.796610000 seconds sys
> 
> 
> (3)IOMMU passthrough + net.ipv4.tcp_use_pfrag_pool=0
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
> tx=129927 (8107 MB) txc=0 zc=n
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':
> 
>         3720131007      cycles
> 
>        4.200651840 seconds time elapsed
> 
>        0.038604000 seconds user
>        1.455521000 seconds sys
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z
> tx=135285 (8442 MB) txc=135285 zc=y
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp -z':
> 
>         1721949875      cycles
> 
>        4.242596800 seconds time elapsed
> 
>        0.024963000 seconds user
>        0.779391000 seconds sys
> 
> (4)IOMMU  passthrough + net.ipv4.tcp_use_pfrag_pool=1
> root@(none):/# perf stat -e cycles ./msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp
> tx=151844 (9475 MB) txc=0 zc=n
> 
>  Performance counter stats for './msg_zerocopy -4 -i eth4 -C 0 -S 192.168.100.1 -D 192.168.100.2 tcp':
> 
>         3786216097      cycles
> 
>        4.200606520 seconds time elapsed
> 
>        0.028633000 seconds user
>        1.569736000 seconds sys
> 
> 
>>
>> Epyc 7502, ConnectX-6, IOMMU off.
>>
>> In short, it seems like improving the Tx ZC API is the better path
>> forward than per-socket page pools.
> 
> The main goal is to optimize the SMMU mapping/unmaping, if the cost of memcpy
> it higher than the SMMU mapping/unmaping + page pinning, then Tx ZC may be a
> better path, at leas it is not sure for small packet?
> 

It's a CPU bound problem - either Rx or Tx is cpu bound depending on the
test configuration. In my tests 3.3 to 3.5M pps is the limit (not using
LRO in the NIC - that's a different solution with its own problems).

At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
on throughput since the Rx is 100% cpu.

At 3300 MTU you have ~47% of the pps for the same throughput. Lower pps
reduces Rx processing and lowers the CPU needed to process the incoming
stream. Then using the Tx ZC API you lower the Tx overhead, allowing a
single stream to go faster - sending more data, which in the end results
in much higher pps and throughput. At the limit you are CPU bound (both
ends in my testing, as the Rx side approaches the max pps and the Tx side
continually tries to send data).

Lowering CPU usage on the Tx side is a win regardless of whether there is
a big increase in throughput at 1500 MTU, since that configuration is an
Rx CPU bound problem. Hence my point that we have a good starting point
for lowering CPU usage on the Tx side; we should improve it rather than
add per-socket page pools.

You can stress the Tx side and emphasize its overhead by modifying the
receiver to drop the data on Rx rather than copy it to userspace, which is
a huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
stream to go faster and emphasizes Tx bottlenecks as the pps at 3300
approaches the top pps at 1500. e.g., doing this with iperf3 shows the
spinlock overhead with tcp_sendmsg, overhead related to 'select' and then
gup_pgd_range.
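
For reference, a minimal sketch of such a receiver change (the helper name
and the surrounding socket setup are illustrative, not taken from any
existing tool):

#include <sys/types.h>
#include <sys/socket.h>

/* Drain a connected TCP socket without copying payload to userspace.
 * For TCP, MSG_TRUNC makes recv() discard up to buflen bytes of queued
 * data and only report the discarded length (see tcp(7)), so the
 * copy-to-user cost disappears from the Rx path.
 */
static void drain_socket(int fd, char *buf, size_t buflen)
{
	ssize_t n;

	do {
		n = recv(fd, buf, buflen, MSG_TRUNC);
	} while (n > 0);
}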

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-20 14:35     ` David Ahern
@ 2021-08-23  3:32       ` Yunsheng Lin
  2021-08-24  3:34         ` David Ahern
  0 siblings, 1 reply; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-23  3:32 UTC (permalink / raw)
  To: David Ahern, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 2021/8/20 22:35, David Ahern wrote:
> On 8/19/21 2:18 AM, Yunsheng Lin wrote:
>> On 2021/8/19 6:05, David Ahern wrote:
>>> On 8/17/21 9:32 PM, Yunsheng Lin wrote:
>>>> This patchset adds the socket to netdev page frag recycling
>>>> support based on the busy polling and page pool infrastructure.
>>>>
>>>> The profermance improve from 30Gbit to 41Gbit for one thread iperf
>>>> tcp flow, and the CPU usages decreases about 20% for four threads
>>>> iperf flow with 100Gb line speed in IOMMU strict mode.
>>>>
>>>> The profermance improve about 2.5% for one thread iperf tcp flow
>>>> in IOMMU passthrough mode.
>>>>
>>>
>>> Details about the test setup? cpu model, mtu, any other relevant changes
>>> / settings.
>>
>> CPU is arm64 Kunpeng 920, see:
>> https://www.hisilicon.com/en/products/Kunpeng/Huawei-Kunpeng-920
>>
>> mtu is 1500, the relevant changes/settings I can think of the iperf
>> client runs on the same numa as the nic hw exists(which has one 100Gbit
>> port), and the driver has the XPS enabled too.
>>
>>>
>>> How does that performance improvement compare with using the Tx ZC API?
>>> At 1500 MTU I see a CPU drop on the Tx side from 80% to 20% with the ZC
>>> API and ~10% increase in throughput. Bumping the MTU to 3300 and
>>> performance with the ZC API is 2x the current model with 1/2 the cpu.
>>
>> I added a sysctl node to decide whether pfrag pool is used:
>> net.ipv4.tcp_use_pfrag_pool
>>

[..]

>>
>>
>>>
>>> Epyc 7502, ConnectX-6, IOMMU off.
>>>
>>> In short, it seems like improving the Tx ZC API is the better path
>>> forward than per-socket page pools.
>>
>> The main goal is to optimize the SMMU mapping/unmaping, if the cost of memcpy
>> it higher than the SMMU mapping/unmaping + page pinning, then Tx ZC may be a
>> better path, at leas it is not sure for small packet?
>>
> 
> It's a CPU bound problem - either Rx or Tx is cpu bound depending on the
> test configuration. In my tests 3.3 to 3.5M pps is the limit (not using
> LRO in the NIC - that's a different solution with its own problems).

I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
bottleneck?

It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
are unchanged and the MTU is 1500:

IOMMU in strict mode:
1. Tx ZC case:
   22Gbit with Tx being the bottleneck (cpu bound)
2. Tx non-ZC case with pfrag pool enabled:
   40Gbit with Rx being the bottleneck (cpu bound)
3. Tx non-ZC case with pfrag pool disabled:
   30Gbit; the bottleneck does not seem to be cpu bound, as neither Rx
   nor Tx has a single CPU reaching about 100% usage.

> 
> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
> on throughput since the Rx is 100% cpu.

As the above performance data shows, enabling ZC does not seem to help
when the IOMMU is involved: there is about a 30% performance degradation
when the pfrag pool is disabled and about 50% when it is enabled.

> 
> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
> reduces Rx processing and lower CPU to process the incoming stream. Then
> using the Tx ZC API you lower the Tx overehad allowing a single stream
> to faster - sending more data which in the end results in much higher
> pps and throughput. At the limit you are CPU bound (both ends in my
> testing as Rx side approaches the max pps, and Tx side as it continually
> tries to send data).
> 
> Lowering CPU usage on Tx the side is a win regardless of whether there
> is a big increase on the throughput at 1500 MTU since that configuration
> is an Rx CPU bound problem. Hence, my point that we have a good start
> point for lowering CPU usage on the Tx side; we should improve it rather
> than add per-socket page pools.

Actually it is not a per-socket page pool; the page pool is still per
NAPI. This patchset adds a multi allocation context to the page pool, so
that Tx can reuse the same page pool as Rx, which is quite useful if ARFS
is enabled.

> 
> You can stress the Tx side and emphasize its overhead by modifying the
> receiver to drop the data on Rx rather than copy to userspace which is a
> huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow

As frag pages are supported in the page pool for Rx, Rx is probably not a
bottleneck any more, at least not with the IOMMU in strict mode.

It seems iperf3 does not support MSG_TRUNC yet; is there any testing tool
that supports MSG_TRUNC, or do I have to hack the kernel or the iperf3
tool to do that?

> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
> then gup_pgd_range.

When the IOMMU is in strict mode, the IOMMU overhead seems to be much
bigger than the spinlock overhead (23% vs 10%).

Anyway, I still think ZC mostly benefits packets bigger than a specific
size and the case where the IOMMU is disabled.


> .
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-18  9:36   ` Yunsheng Lin
@ 2021-08-23  9:25     ` Yunsheng Lin
  2021-08-23 15:04       ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-23  9:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, hch

On 2021/8/18 17:36, Yunsheng Lin wrote:
> On 2021/8/18 16:57, Eric Dumazet wrote:
>> On Wed, Aug 18, 2021 at 5:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>
>>> This patchset adds the socket to netdev page frag recycling
>>> support based on the busy polling and page pool infrastructure.
>>
>> I really do not see how this can scale to thousands of sockets.
>>
>> tcp_mem[] defaults to ~ 9 % of physical memory.
>>
>> If you now run tests with thousands of sockets, their skbs will
>> consume Gigabytes
>> of memory on typical servers, now backed by order-0 pages (instead of
>> current order-3 pages)
>> So IOMMU costs will actually be much bigger.
> 
> As the page allocator support bulk allocating now, see:
> https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L252
> 
> if the DMA also support batch mapping/unmapping, maybe having a
> small-sized page pool for thousands of sockets may not be a problem?
> Christoph Hellwig mentioned the batch DMA operation support in below
> thread:
> https://www.spinics.net/lists/netdev/msg666715.html
> 
> if the batched DMA operation is supported, maybe having the
> page pool is mainly benefit the case of small number of socket?
> 
>>
>> Are we planning to use Gigabyte sized page pools for NIC ?
>>
>> Have you tried instead to make TCP frags twice bigger ?
> 
> Not yet.
> 
>> This would require less IOMMU mappings.
>> (Note: This could require some mm help, since PAGE_ALLOC_COSTLY_ORDER
>> is currently 3, not 4)
> 
> I am not familiar with mm yet, but I will take a look about that:)


It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp pages, OOM,
memory compaction and memory isolation. As the test system has a lot of
memory installed (about 500G, only 3-4G is used), I used the below patch
to test the maximum possible performance improvement from making TCP frags
twice as big: the improvement went from about 30Gbit to 32Gbit for a
one-thread iperf tcp flow in IOMMU strict mode, and with the pfrag pool
the improvement went from about 30Gbit to 40Gbit for the same testing
configuration:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fcb5355..dda20f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,7 +37,7 @@
  * coalesce naturally under reasonable reclaim pressure and those which
  * will not.
  */
-#define PAGE_ALLOC_COSTLY_ORDER 3
+#define PAGE_ALLOC_COSTLY_ORDER 4

 enum migratetype {
        MIGRATE_UNMOVABLE,
diff --git a/net/core/sock.c b/net/core/sock.c
index 870a3b7..b1e0dfc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2580,7 +2580,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
        }
 }

-#define SKB_FRAG_PAGE_ORDER    get_order(32768)
+#define SKB_FRAG_PAGE_ORDER    get_order(65536)
 DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);

 /**

> 
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index a3eea6e0b30a7d43793f567ffa526092c03e3546..6b66b51b61be9f198f6f1c4a3d81b57fa327986a
>> 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2560,7 +2560,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
>>         }
>>  }
>>
>> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
>> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
>>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>>
>>  /**
>>
>>
>>
>>>
> _______________________________________________
> Linuxarm mailing list -- linuxarm@openeuler.org
> To unsubscribe send an email to linuxarm-leave@openeuler.org
> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-23  9:25     ` [Linuxarm] " Yunsheng Lin
@ 2021-08-23 15:04       ` Eric Dumazet
  2021-08-24  8:03         ` Yunsheng Lin
  2021-08-25 16:29         ` David Ahern
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2021-08-23 15:04 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On Mon, Aug 23, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/8/18 17:36, Yunsheng Lin wrote:
> > On 2021/8/18 16:57, Eric Dumazet wrote:
> >> On Wed, Aug 18, 2021 at 5:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>
> >>> This patchset adds the socket to netdev page frag recycling
> >>> support based on the busy polling and page pool infrastructure.
> >>
> >> I really do not see how this can scale to thousands of sockets.
> >>
> >> tcp_mem[] defaults to ~ 9 % of physical memory.
> >>
> >> If you now run tests with thousands of sockets, their skbs will
> >> consume Gigabytes
> >> of memory on typical servers, now backed by order-0 pages (instead of
> >> current order-3 pages)
> >> So IOMMU costs will actually be much bigger.
> >
> > As the page allocator support bulk allocating now, see:
> > https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L252
> >
> > if the DMA also support batch mapping/unmapping, maybe having a
> > small-sized page pool for thousands of sockets may not be a problem?
> > Christoph Hellwig mentioned the batch DMA operation support in below
> > thread:
> > https://www.spinics.net/lists/netdev/msg666715.html
> >
> > if the batched DMA operation is supported, maybe having the
> > page pool is mainly benefit the case of small number of socket?
> >
> >>
> >> Are we planning to use Gigabyte sized page pools for NIC ?
> >>
> >> Have you tried instead to make TCP frags twice bigger ?
> >
> > Not yet.
> >
> >> This would require less IOMMU mappings.
> >> (Note: This could require some mm help, since PAGE_ALLOC_COSTLY_ORDER
> >> is currently 3, not 4)
> >
> > I am not familiar with mm yet, but I will take a look about that:)
>
>
> It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp page, OOM, memory
> compact and memory isolation, as the test system has a lot of memory installed
> (about 500G, only 3-4G is used), so I used the below patch to test the max
> possible performance improvement when making TCP frags twice bigger, and
> the performance improvement went from about 30Gbit to 32Gbit for one thread
> iperf tcp flow in IOMMU strict mode,

This is encouraging, and means we can do much better.

Even with SKB_FRAG_PAGE_ORDER  set to 4, typical skbs will need 3 mappings

1) One for the headers (in skb->head)
2) Two page frags, because one TSO packet payload is not a nice power-of-two.

The first issue can be addressed using a piece of coherent memory (128
or 256 bytes per entry in TX ring).
Copying the headers can avoid one IOMMU mapping, and improve IOTLB
hits, because all
slots of the TX ring buffer will use one single IOTLB slot.

The second issue can be solved by tweaking a bit
skb_page_frag_refill() to accept an additional parameter
so that the whole skb payload fits in a single order-4 page.
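
A rough sketch of what such a tweak could look like (purely illustrative;
the wrapper name and the exact refill policy here are assumptions, not a
concrete proposal):

#include <linux/mm.h>
#include <linux/skbuff.h>

/* Hypothetical wrapper around skb_page_frag_refill(): if what is left of
 * the current page would force 'payload' to be split across two frags
 * (two IOMMU mappings), drop our reference so the refill starts on a
 * fresh order-4 page and the whole payload lands in a single frag.
 */
static bool skb_page_frag_refill_contig(unsigned int payload,
					struct page_frag *pfrag, gfp_t gfp)
{
	if (pfrag->page && page_ref_count(pfrag->page) != 1 &&
	    pfrag->offset + payload > pfrag->size) {
		put_page(pfrag->page);
		pfrag->page = NULL;
	}

	return skb_page_frag_refill(payload, pfrag, gfp);
}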


 and using the pfrag pool, the improvement
> went from about 30Gbit to 40Gbit for the same testing configuation:

Yes, but you have not provided performance numbers when 200 (or 1000+)
concurrent flows are running.

Optimizing single flow TCP performance while killing performance for the
more common case is not an option.


>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fcb5355..dda20f9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -37,7 +37,7 @@
>   * coalesce naturally under reasonable reclaim pressure and those which
>   * will not.
>   */
> -#define PAGE_ALLOC_COSTLY_ORDER 3
> +#define PAGE_ALLOC_COSTLY_ORDER 4
>
>  enum migratetype {
>         MIGRATE_UNMOVABLE,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 870a3b7..b1e0dfc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2580,7 +2580,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
>         }
>  }
>
> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>
>  /**
>
> >
> >>
> >> diff --git a/net/core/sock.c b/net/core/sock.c
> >> index a3eea6e0b30a7d43793f567ffa526092c03e3546..6b66b51b61be9f198f6f1c4a3d81b57fa327986a
> >> 100644
> >> --- a/net/core/sock.c
> >> +++ b/net/core/sock.c
> >> @@ -2560,7 +2560,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
> >>         }
> >>  }
> >>
> >> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
> >> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
> >>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> >>
> >>  /**
> >>
> >>
> >>
> >>>
> > _______________________________________________
> > Linuxarm mailing list -- linuxarm@openeuler.org
> > To unsubscribe send an email to linuxarm-leave@openeuler.org
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-23  3:32       ` Yunsheng Lin
@ 2021-08-24  3:34         ` David Ahern
  2021-08-24  8:41           ` Yunsheng Lin
  0 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2021-08-24  3:34 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 8/22/21 9:32 PM, Yunsheng Lin wrote:
> 
> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
> bottleneck?

yes.

> 
> It seems iperf3 support the Tx ZC, I retested using the iperf3, Rx settings
> is not changed when testing, MTU is 1500:

-Z == sendfile API. That works fine to a point and that point is well
below 100G.

I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.
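
For context, a minimal sketch of that send path, following
Documentation/networking/msg_zerocopy.rst (send_zc() is an illustrative
helper; reaping completion notifications from the error queue before
reusing 'buf' is omitted):

#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt a connected TCP socket into zerocopy, then send one buffer. */
static ssize_t send_zc(int fd, const void *buf, size_t len)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -1;

	return send(fd, buf, len, MSG_ZEROCOPY);
}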

> 
> IOMMU in strict mode:
> 1. Tx ZC case:
>    22Gbit with Tx being bottleneck(cpu bound)
> 2. Tx non-ZC case with pfrag pool enabled:
>    40Git with Rx being bottleneck(cpu bound)
> 3. Tx non-ZC case with pfrag pool disabled:
>    30Git, the bottleneck seems not to be cpu bound, as the Rx and Tx does
>    not have a single CPU reaching about 100% usage.
> 
>>
>> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
>> on throughput since the Rx is 100% cpu.
> 
> As above performance data, enabling ZC does not seems to help when IOMMU
> is involved, which has about 30% performance degrade when pfrag pool is
> disabled and 50% performance degrade when pfrag pool is enabled.

In a past response you showed numbers for the Tx ZC API with a custom
program. That program showed a dramatic reduction in CPU cycles for Tx
with the ZC API.

> 
>>
>> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
>> reduces Rx processing and lower CPU to process the incoming stream. Then
>> using the Tx ZC API you lower the Tx overehad allowing a single stream
>> to faster - sending more data which in the end results in much higher
>> pps and throughput. At the limit you are CPU bound (both ends in my
>> testing as Rx side approaches the max pps, and Tx side as it continually
>> tries to send data).
>>
>> Lowering CPU usage on Tx the side is a win regardless of whether there
>> is a big increase on the throughput at 1500 MTU since that configuration
>> is an Rx CPU bound problem. Hence, my point that we have a good start
>> point for lowering CPU usage on the Tx side; we should improve it rather
>> than add per-socket page pools.
> 
> Acctually it is not a per-socket page pools, the page pool is still per
> NAPI, this patchset adds multi allocation context to the page pool, so that
> the tx can reuse the same page pool with rx, which is quite usefully if the
> ARFS is enabled.
> 
>>
>> You can stress the Tx side and emphasize its overhead by modifying the
>> receiver to drop the data on Rx rather than copy to userspace which is a
>> huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
> 
> As the frag page is supported in page pool for Rx, the Rx probably is not
> a bottleneck any more, at least not for IOMMU in strict mode.
> 
> It seems iperf3 does not support MSG_TRUNC yet, any testing tool supporting
> MSG_TRUNC? Or do I have to hack the kernel or iperf3 tool to do that?

https://github.com/dsahern/iperf, mods branch

--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.


> 
>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
>> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
>> then gup_pgd_range.
> 
> When IOMMU is in strict mode, the overhead with IOMMU seems to be much
> bigger than spinlock(23% to 10%).
> 
> Anyway, I still think ZC mostly benefit to packet which is bigger than a
> specific size and IOMMU disabling case.
> 
> 
>> .
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-23 15:04       ` Eric Dumazet
@ 2021-08-24  8:03         ` Yunsheng Lin
  2021-08-25 16:29         ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-24  8:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On 2021/8/23 23:04, Eric Dumazet wrote:
> On Mon, Aug 23, 2021 at 2:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/8/18 17:36, Yunsheng Lin wrote:
>>> On 2021/8/18 16:57, Eric Dumazet wrote:
>>>> On Wed, Aug 18, 2021 at 5:33 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>>
>>>>> This patchset adds the socket to netdev page frag recycling
>>>>> support based on the busy polling and page pool infrastructure.
>>>>
>>>> I really do not see how this can scale to thousands of sockets.
>>>>
>>>> tcp_mem[] defaults to ~ 9 % of physical memory.
>>>>
>>>> If you now run tests with thousands of sockets, their skbs will
>>>> consume Gigabytes
>>>> of memory on typical servers, now backed by order-0 pages (instead of
>>>> current order-3 pages)
>>>> So IOMMU costs will actually be much bigger.
>>>
>>> As the page allocator support bulk allocating now, see:
>>> https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L252
>>>
>>> if the DMA also support batch mapping/unmapping, maybe having a
>>> small-sized page pool for thousands of sockets may not be a problem?
>>> Christoph Hellwig mentioned the batch DMA operation support in below
>>> thread:
>>> https://www.spinics.net/lists/netdev/msg666715.html
>>>
>>> if the batched DMA operation is supported, maybe having the
>>> page pool is mainly benefit the case of small number of socket?
>>>
>>>>
>>>> Are we planning to use Gigabyte sized page pools for NIC ?
>>>>
>>>> Have you tried instead to make TCP frags twice bigger ?
>>>
>>> Not yet.
>>>
>>>> This would require less IOMMU mappings.
>>>> (Note: This could require some mm help, since PAGE_ALLOC_COSTLY_ORDER
>>>> is currently 3, not 4)
>>>
>>> I am not familiar with mm yet, but I will take a look about that:)
>>
>>
>> It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp page, OOM, memory
>> compact and memory isolation, as the test system has a lot of memory installed
>> (about 500G, only 3-4G is used), so I used the below patch to test the max
>> possible performance improvement when making TCP frags twice bigger, and
>> the performance improvement went from about 30Gbit to 32Gbit for one thread
>> iperf tcp flow in IOMMU strict mode,
> 
> This is encouraging, and means we can do much better.
> 
> Even with SKB_FRAG_PAGE_ORDER  set to 4, typical skbs will need 3 mappings
> 
> 1) One for the headers (in skb->head)
> 2) Two page frags, because one TSO packet payload is not a nice power-of-two.
> 
> The first issue can be addressed using a piece of coherent memory (128
> or 256 bytes per entry in TX ring).
> Copying the headers can avoid one IOMMU mapping, and improve IOTLB
> hits, because all
> slots of the TX ring buffer will use one single IOTLB slot.

Actually, the hns3 driver has already implemented a bounce buffer for the
above case, see:
https://elixir.bootlin.com/linux/v5.14-rc7/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L2042

With the header buffer copying enabled, the performance only improves from
about 30Gbit to 32Gbit for a one-thread iperf tcp flow.

So it seems the IOMMU overhead is not only related to how many frags an
skb has, but also to the length of each frag: the IOMMU mapping is done at
4K/2M granularity (for arm64), so it may still take a lot of time to write
each 4K page entry to the page table when mapping and to invalidate each
4K page entry when unmapping.

Also, the hns3 driver has implemented dma_map_sg() to reduce the number of
IOMMU mappings/unmappings, but the improvement is only about 10%, possibly
due to the above reason too, see:
https://lkml.org/lkml/2021/6/16/134
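
For reference, the general shape of that batching (illustrative only; the
scatterlist is assumed to have been built from the skb frags elsewhere,
and this is not the hns3 code):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Map all frags of one skb with a single dma_map_sg() call instead of
 * one dma_map_page() per frag, so the IOMMU sees one map (and later one
 * unmap) operation per packet rather than one per fragment.
 */
static int map_tx_frags(struct device *dev, struct sg_table *sgt)
{
	int nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
			       DMA_TO_DEVICE);

	if (!nents)
		return -ENOMEM;

	sgt->nents = nents;
	return 0;
}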

> 
> The second issue can be solved by tweaking a bit
> skb_page_frag_refill() to accept an additional parameter
> so that the whole skb payload fits in a single order-4 page.

I am not sure I understand the above. Are you suggesting passing 'copy'
to skb_page_frag_refill(), so that it will allocate a new page if there is
not enough room left for the caller?

> 
> 
>  and using the pfrag pool, the improvement
>> went from about 30Gbit to 40Gbit for the same testing configuation:
> 
> Yes, but you have not provided performance number when 200 (or 1000+)
> concurrent flows are running.

As iperf seems to only support 200 concurrent flows (running more threads
seems to cause "Connection timed out"), is there any other performance
tool that supports 1000+ concurrent flows?

There are 32 CPUs on the NUMA node where the NIC hw resides, and taskset
is used to run iperf on the same NUMA node; as the page pool supports page
frags for Rx now, allocating multi-order pages as skb_page_frag_refill()
does won't waste memory any more.

The below is the performance data for 200 concurrent iperf tcp flows for
one tx queue:
                          throughput      node cpu usage
pfrag_pool disabled:        31Gbit            5%
pfrag_pool-page order 0:    43Gbit            8%
pfrag_pool-page order 1:    50Gbit            8%
pfrag_pool-page order 2:    70Gbit            10%
pfrag_pool-page order 3:    90Gbit            11%


The below is the performance data for 200 concurrent iperf tcp flows for
32 tx queues (94.1Gbit is the tcp flow line speed for a 100Gbit port with mtu 1500):
                          throughput      node cpu usage
pfrag_pool disabled:        94.1Gbit            23%
pfrag_pool-page order 0:    93.9Gbit            31%
pfrag_pool-page order 1:    94.1Gbit            24%
pfrag_pool-page order 2:    94.1Gbit            23%
pfrag_pool-page order 3:    94.1Gbit            16%

So the page pool for Tx seems promising for a large number of sockets
too?

> 
> Optimizing singe flow TCP performance while killing performance for
> the more common case is not an option.
> 
> 
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index fcb5355..dda20f9 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -37,7 +37,7 @@
>>   * coalesce naturally under reasonable reclaim pressure and those which
>>   * will not.
>>   */
>> -#define PAGE_ALLOC_COSTLY_ORDER 3
>> +#define PAGE_ALLOC_COSTLY_ORDER 4
>>
>>  enum migratetype {
>>         MIGRATE_UNMOVABLE,
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 870a3b7..b1e0dfc 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2580,7 +2580,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
>>         }
>>  }
>>
>> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
>> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
>>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>>
>>  /**
>>
>>>
>>>>
>>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>>> index a3eea6e0b30a7d43793f567ffa526092c03e3546..6b66b51b61be9f198f6f1c4a3d81b57fa327986a
>>>> 100644
>>>> --- a/net/core/sock.c
>>>> +++ b/net/core/sock.c
>>>> @@ -2560,7 +2560,7 @@ static void sk_leave_memory_pressure(struct sock *sk)
>>>>         }
>>>>  }
>>>>
>>>> -#define SKB_FRAG_PAGE_ORDER    get_order(32768)
>>>> +#define SKB_FRAG_PAGE_ORDER    get_order(65536)
>>>>  DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>>>>
>>>>  /**
>>>>
>>>>
>>>>
>>>>>
>>> _______________________________________________
>>> Linuxarm mailing list -- linuxarm@openeuler.org
>>> To unsubscribe send an email to linuxarm-leave@openeuler.org
>>>
> .
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-24  3:34         ` David Ahern
@ 2021-08-24  8:41           ` Yunsheng Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Yunsheng Lin @ 2021-08-24  8:41 UTC (permalink / raw)
  To: David Ahern, davem, kuba
  Cc: alexander.duyck, linux, mw, linuxarm, yisen.zhuang, salil.mehta,
	thomas.petazzoni, hawk, ilias.apalodimas, ast, daniel,
	john.fastabend, akpm, peterz, will, willy, vbabka, fenghua.yu,
	guro, peterx, feng.tang, jgg, mcroce, hughd, jonathan.lemon,
	alobakin, willemb, wenxu, cong.wang, haokexin, nogikh, elver,
	yhs, kpsingh, andrii, kafai, songliubraving, netdev,
	linux-kernel, bpf, chenhao288, edumazet, yoshfuji, dsahern,
	memxor, linux, atenart, weiwan, ap420073, arnd,
	mathew.j.martineau, aahringo, ceggers, yangbo.lu, fw,
	xiangxia.m.yue, linmiaohe

On 2021/8/24 11:34, David Ahern wrote:
> On 8/22/21 9:32 PM, Yunsheng Lin wrote:
>>
>> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
>> bottleneck?
> 
> yes.
> 
>>
>> It seems iperf3 support the Tx ZC, I retested using the iperf3, Rx settings
>> is not changed when testing, MTU is 1500:
> 
> -Z == sendfile API. That works fine to a point and that point is well
> below 100G.
> 
> I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.
> 
>>
>> IOMMU in strict mode:
>> 1. Tx ZC case:
>>    22Gbit with Tx being bottleneck(cpu bound)
>> 2. Tx non-ZC case with pfrag pool enabled:
>>    40Git with Rx being bottleneck(cpu bound)
>> 3. Tx non-ZC case with pfrag pool disabled:
>>    30Git, the bottleneck seems not to be cpu bound, as the Rx and Tx does
>>    not have a single CPU reaching about 100% usage.
>>
>>>
>>> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
>>> on throughput since the Rx is 100% cpu.
>>
>> As above performance data, enabling ZC does not seems to help when IOMMU
>> is involved, which has about 30% performance degrade when pfrag pool is
>> disabled and 50% performance degrade when pfrag pool is enabled.
> 
> In a past response you should numbers for Tx ZC API with a custom
> program. That program showed the dramatic reduction in CPU cycles for Tx
> with the ZC API.

I deduced the cpu usage from the cycles in "perf stat -e cycles XX", which
does not seem to include the cycles for NAPI polling; NAPI polling does
the tx clean (including DMA unmapping) and does not run on the same cpu as
msg_zerocopy.

I retested it using msg_zerocopy:
       msg_zerocopy cpu usage      NAPI polling cpu usage
ZC:        23%                               70%
non-ZC     50%                               40%

So it seems to match now, sorry for the confusion.
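
(For completeness, the NAPI side can be sampled separately with something
like "perf stat -a -C <napi cpu> -e cycles -- sleep 10", where <napi cpu>
is whichever CPU the queue's IRQ affinity points at.)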

> 
>>
>>>
>>> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
>>> reduces Rx processing and lower CPU to process the incoming stream. Then
>>> using the Tx ZC API you lower the Tx overehad allowing a single stream
>>> to faster - sending more data which in the end results in much higher
>>> pps and throughput. At the limit you are CPU bound (both ends in my
>>> testing as Rx side approaches the max pps, and Tx side as it continually
>>> tries to send data).
>>>
>>> Lowering CPU usage on Tx the side is a win regardless of whether there
>>> is a big increase on the throughput at 1500 MTU since that configuration
>>> is an Rx CPU bound problem. Hence, my point that we have a good start
>>> point for lowering CPU usage on the Tx side; we should improve it rather
>>> than add per-socket page pools.
>>
>> Acctually it is not a per-socket page pools, the page pool is still per
>> NAPI, this patchset adds multi allocation context to the page pool, so that
>> the tx can reuse the same page pool with rx, which is quite usefully if the
>> ARFS is enabled.
>>
>>>
>>> You can stress the Tx side and emphasize its overhead by modifying the
>>> receiver to drop the data on Rx rather than copy to userspace which is a
>>> huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
>>
>> As the frag page is supported in page pool for Rx, the Rx probably is not
>> a bottleneck any more, at least not for IOMMU in strict mode.
>>
>> It seems iperf3 does not support MSG_TRUNC yet, any testing tool supporting
>> MSG_TRUNC? Or do I have to hack the kernel or iperf3 tool to do that?
> 
> https://github.com/dsahern/iperf, mods branch
> 
> --zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.

Thanks for sharing the tool.
I retested using the above iperf, and the result is similar to the
previous one.

> 
> 
>>
>>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>>> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
>>> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
>>> then gup_pgd_range.
>>
>> When IOMMU is in strict mode, the overhead with IOMMU seems to be much
>> bigger than spinlock(23% to 10%).
>>
>> Anyway, I still think ZC mostly benefit to packet which is bigger than a
>> specific size and IOMMU disabling case.
>>
>>
>>> .
>>>
> 
> .
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-23 15:04       ` Eric Dumazet
  2021-08-24  8:03         ` Yunsheng Lin
@ 2021-08-25 16:29         ` David Ahern
  2021-08-25 16:32           ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: David Ahern @ 2021-08-25 16:29 UTC (permalink / raw)
  To: Eric Dumazet, Yunsheng Lin
  Cc: David Miller, Jakub Kicinski, Alexander Duyck, Russell King,
	Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On 8/23/21 8:04 AM, Eric Dumazet wrote:
>>
>>
>> It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp page, OOM, memory
>> compact and memory isolation, as the test system has a lot of memory installed
>> (about 500G, only 3-4G is used), so I used the below patch to test the max
>> possible performance improvement when making TCP frags twice bigger, and
>> the performance improvement went from about 30Gbit to 32Gbit for one thread
>> iperf tcp flow in IOMMU strict mode,
> 
> This is encouraging, and means we can do much better.
> 
> Even with SKB_FRAG_PAGE_ORDER  set to 4, typical skbs will need 3 mappings
> 
> 1) One for the headers (in skb->head)
> 2) Two page frags, because one TSO packet payload is not a nice power-of-two.

Interesting observation. I have noticed 17 mappings with the ZC API. That
might explain the lower than expected performance bump with IOMMU strict
mode.

> 
> The first issue can be addressed using a piece of coherent memory (128
> or 256 bytes per entry in TX ring).
> Copying the headers can avoid one IOMMU mapping, and improve IOTLB
> hits, because all
> slots of the TX ring buffer will use one single IOTLB slot.
> 
> The second issue can be solved by tweaking a bit
> skb_page_frag_refill() to accept an additional parameter
> so that the whole skb payload fits in a single order-4 page.
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-25 16:29         ` David Ahern
@ 2021-08-25 16:32           ` Eric Dumazet
  2021-08-25 16:38             ` David Ahern
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2021-08-25 16:32 UTC (permalink / raw)
  To: David Ahern
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski, Alexander Duyck,
	Russell King, Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On Wed, Aug 25, 2021 at 9:29 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 8/23/21 8:04 AM, Eric Dumazet wrote:
> >>
> >>
> >> It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp page, OOM, memory
> >> compact and memory isolation, as the test system has a lot of memory installed
> >> (about 500G, only 3-4G is used), so I used the below patch to test the max
> >> possible performance improvement when making TCP frags twice bigger, and
> >> the performance improvement went from about 30Gbit to 32Gbit for one thread
> >> iperf tcp flow in IOMMU strict mode,
> >
> > This is encouraging, and means we can do much better.
> >
> > Even with SKB_FRAG_PAGE_ORDER  set to 4, typical skbs will need 3 mappings
> >
> > 1) One for the headers (in skb->head)
> > 2) Two page frags, because one TSO packet payload is not a nice power-of-two.
>
> interesting observation. I have noticed 17 with the ZC API. That might
> explain the less than expected performance bump with iommu strict mode.

Note that if application is using huge pages, things get better after

commit 394fcd8a813456b3306c423ec4227ed874dfc08b
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Aug 20 08:43:59 2020 -0700

    net: zerocopy: combine pages in zerocopy_sg_from_iter()

    Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0
fragments.
    Compared to standard sendmsg(), these skbs usually contain up to
16 fragments
    on arches with 4KB page sizes, instead of two.

    This adds considerable costs on various ndo_start_xmit() handlers,
    especially when IOMMU is in the picture.

    As high performance applications are often using huge pages,
    we can try to combine adjacent pages belonging to same
    compound page.

    Tested on AMD Rome platform, with IOMMU, nominal single TCP flow speed
    is roughly doubled (~55Gbit -> ~100Gbit), when user application
    is using hugepages.

    For reference, nominal single TCP flow speed on this platform
    without MSG_ZEROCOPY is ~65Gbit.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Ideally the gup stuff should really directly deal with hugepages, so
that we avoid
all these crazy refcounting games on the per-huge-page central refcount.
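
For what it is worth, a minimal sketch of giving the sender a
hugepage-backed buffer (illustrative only; fallback to regular pages when
MAP_HUGETLB is unavailable is left out):

#include <stddef.h>
#include <sys/mman.h>

/* Back the MSG_ZEROCOPY send buffer with huge pages so its 4K pieces
 * belong to the same compound page and can be combined into fewer frags
 * by zerocopy_sg_from_iter().
 */
static void *alloc_huge_buf(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}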

>
> >
> > The first issue can be addressed using a piece of coherent memory (128
> > or 256 bytes per entry in TX ring).
> > Copying the headers can avoid one IOMMU mapping, and improve IOTLB
> > hits, because all
> > slots of the TX ring buffer will use one single IOTLB slot.
> >
> > The second issue can be solved by tweaking a bit
> > skb_page_frag_refill() to accept an additional parameter
> > so that the whole skb payload fits in a single order-4 page.
> >
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-25 16:32           ` Eric Dumazet
@ 2021-08-25 16:38             ` David Ahern
  2021-08-25 17:24               ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2021-08-25 16:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski, Alexander Duyck,
	Russell King, Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On 8/25/21 9:32 AM, Eric Dumazet wrote:
> On Wed, Aug 25, 2021 at 9:29 AM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 8/23/21 8:04 AM, Eric Dumazet wrote:
>>>>
>>>>
>>>> It seems PAGE_ALLOC_COSTLY_ORDER is mostly related to pcp page, OOM, memory
>>>> compact and memory isolation, as the test system has a lot of memory installed
>>>> (about 500G, only 3-4G is used), so I used the below patch to test the max
>>>> possible performance improvement when making TCP frags twice bigger, and
>>>> the performance improvement went from about 30Gbit to 32Gbit for one thread
>>>> iperf tcp flow in IOMMU strict mode,
>>>
>>> This is encouraging, and means we can do much better.
>>>
>>> Even with SKB_FRAG_PAGE_ORDER  set to 4, typical skbs will need 3 mappings
>>>
>>> 1) One for the headers (in skb->head)
>>> 2) Two page frags, because one TSO packet payload is not a nice power-of-two.
>>
>> interesting observation. I have noticed 17 with the ZC API. That might
>> explain the less than expected performance bump with iommu strict mode.
> 
> Note that if application is using huge pages, things get better after
> 
> commit 394fcd8a813456b3306c423ec4227ed874dfc08b
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Thu Aug 20 08:43:59 2020 -0700
> 
>     net: zerocopy: combine pages in zerocopy_sg_from_iter()
> 
>     Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0
> fragments.
>     Compared to standard sendmsg(), these skbs usually contain up to
> 16 fragments
>     on arches with 4KB page sizes, instead of two.
> 
>     This adds considerable costs on various ndo_start_xmit() handlers,
>     especially when IOMMU is in the picture.
> 
>     As high performance applications are often using huge pages,
>     we can try to combine adjacent pages belonging to same
>     compound page.
> 
>     Tested on AMD Rome platform, with IOMMU, nominal single TCP flow speed
>     is roughly doubled (~55Gbit -> ~100Gbit), when user application
>     is using hugepages.
> 
>     For reference, nominal single TCP flow speed on this platform
>     without MSG_ZEROCOPY is ~65Gbit.
> 
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Cc: Willem de Bruijn <willemb@google.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Ideally the gup stuff should really directly deal with hugepages, so
> that we avoid
> all these crazy refcounting games on the per-huge-page central refcount.
> 

thanks for the pointer. I need to revisit my past attempt to get iperf3
working with hugepages.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-25 16:38             ` David Ahern
@ 2021-08-25 17:24               ` Eric Dumazet
  2021-08-26  4:05                 ` David Ahern
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2021-08-25 17:24 UTC (permalink / raw)
  To: David Ahern
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski, Alexander Duyck,
	Russell King, Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On Wed, Aug 25, 2021 at 9:39 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 8/25/21 9:32 AM, Eric Dumazet wrote:

> >
>
> thanks for the pointer. I need to revisit my past attempt to get iperf3
> working with hugepages.

Another pointer, just in case this helps.

commit 72653ae5303c626ca29fcbcbb8165a894a104adf
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Aug 20 10:11:17 2020 -0700

    selftests: net: tcp_mmap: Use huge pages in send path

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linuxarm] Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support
  2021-08-25 17:24               ` Eric Dumazet
@ 2021-08-26  4:05                 ` David Ahern
  0 siblings, 0 replies; 23+ messages in thread
From: David Ahern @ 2021-08-26  4:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yunsheng Lin, David Miller, Jakub Kicinski, Alexander Duyck,
	Russell King, Marcin Wojtas, linuxarm, Yisen Zhuang, Salil Mehta,
	Thomas Petazzoni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
	Vlastimil Babka, Fenghua Yu, Roman Gushchin, Peter Xu, Tang,
	Feng, Jason Gunthorpe, mcroce, Hugh Dickins, Jonathan Lemon,
	Alexander Lobakin, Willem de Bruijn, wenxu, Cong Wang, Kevin Hao,
	Aleksandr Nogikh, Marco Elver, Yonghong Song, kpsingh,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, netdev, LKML, bpf,
	chenhao288, Hideaki YOSHIFUJI, David Ahern, memxor, linux,
	Antoine Tenart, Wei Wang, Taehee Yoo, Arnd Bergmann,
	Mat Martineau, aahringo, ceggers, yangbo.lu, Florian Westphal,
	xiangxia.m.yue, linmiaohe, Christoph Hellwig

On 8/25/21 10:24 AM, Eric Dumazet wrote:
> On Wed, Aug 25, 2021 at 9:39 AM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 8/25/21 9:32 AM, Eric Dumazet wrote:
> 
>>>
>>
>> thanks for the pointer. I need to revisit my past attempt to get iperf3
>> working with hugepages.
> 
> ANother pointer, just in case this helps.
> 
> commit 72653ae5303c626ca29fcbcbb8165a894a104adf
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Thu Aug 20 10:11:17 2020 -0700
> 
>     selftests: net: tcp_mmap: Use huge pages in send path
> 

Very helpful, thanks. I added support to iperf3, and it does show a big
drop in cpu utilization.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-08-26  4:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18  3:32 [PATCH RFC 0/7] add socket to netdev page frag recycling support Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 1/7] page_pool: refactor the page pool to support multi alloc context Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 2/7] skbuff: add interface to manipulate frag count for tx recycling Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 3/7] net: add NAPI api to register and retrieve the page pool ptr Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 4/7] net: pfrag_pool: add pfrag pool support based on page pool Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 5/7] sock: support refilling pfrag from pfrag_pool Yunsheng Lin
2021-08-18  3:32 ` [PATCH RFC 6/7] net: hns3: support tx recycling in the hns3 driver Yunsheng Lin
2021-08-18  8:57 ` [PATCH RFC 0/7] add socket to netdev page frag recycling support Eric Dumazet
2021-08-18  9:36   ` Yunsheng Lin
2021-08-23  9:25     ` [Linuxarm] " Yunsheng Lin
2021-08-23 15:04       ` Eric Dumazet
2021-08-24  8:03         ` Yunsheng Lin
2021-08-25 16:29         ` David Ahern
2021-08-25 16:32           ` Eric Dumazet
2021-08-25 16:38             ` David Ahern
2021-08-25 17:24               ` Eric Dumazet
2021-08-26  4:05                 ` David Ahern
2021-08-18 22:05 ` David Ahern
2021-08-19  8:18   ` Yunsheng Lin
2021-08-20 14:35     ` David Ahern
2021-08-23  3:32       ` Yunsheng Lin
2021-08-24  3:34         ` David Ahern
2021-08-24  8:41           ` Yunsheng Lin
