* [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO
@ 2020-03-25 14:08 yang_y_yi
  2020-03-25 14:37 ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: yang_y_yi @ 2020-03-25 14:08 UTC (permalink / raw)
  To: netdev; +Cc: u9012063, yangyi01, yang_y_yi

From: Yi Yang <yangyi01@inspur.com>

TPACKET_V3 performance is very poor in the TSO case, even worse than
in the non-TSO case. On kernels built with CONFIG_HZ=1000, setting
req.tp_retire_blk_tov = 1 helps a bit, but some Linux distributions
set CONFIG_HZ to 250, so req.tp_retire_blk_tov = 1 is rounded up to
an effective 4 ms and does not help at all.

This patch fixes that performance issue by retiring the current block
as early as possible in the TSO case, so that the userspace
application can consume it in time. It boosts throughput from
3.05 Gbps to 16.9 Gbps, a huge improvement.

Signed-off-by: Yi Yang <yangyi01@inspur.com>
---
 net/packet/af_packet.c | 42 ++++++++++++++++++++++++++++++++----------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e5b0986..cbe9052 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1005,7 +1005,8 @@ static void prb_fill_curr_block(char *curr,
 /* Assumes caller has the sk->rx_queue.lock */
 static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 					    struct sk_buff *skb,
-					    unsigned int len
+					    unsigned int len,
+					    bool retire_cur_block
 					    )
 {
 	struct tpacket_kbdq_core *pkc;
@@ -1041,7 +1042,8 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 	end = (char *)pbd + pkc->kblk_size;
 
 	/* first try the current block */
-	if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
+	if (BLOCK_NUM_PKTS(pbd) == 0 ||
+	    (!retire_cur_block && curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end)) {
 		prb_fill_curr_block(curr, pkc, pbd, len);
 		return (void *)curr;
 	}
@@ -1066,7 +1068,8 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 
 static void *packet_current_rx_frame(struct packet_sock *po,
 					    struct sk_buff *skb,
-					    int status, unsigned int len)
+					    int status, unsigned int len,
+					    bool retire_cur_block)
 {
 	char *curr = NULL;
 	switch (po->tp_version) {
@@ -1076,7 +1079,8 @@ static void *packet_current_rx_frame(struct packet_sock *po,
 					po->rx_ring.head, status);
 		return curr;
 	case TPACKET_V3:
-		return __packet_lookup_frame_in_block(po, skb, len);
+		return __packet_lookup_frame_in_block(po, skb, len,
+						      retire_cur_block);
 	default:
 		WARN(1, "TPACKET version not supported\n");
 		BUG();
@@ -2174,6 +2178,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	__u32 ts_status;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
+	struct virtio_net_hdr vnet_hdr;
+	int vnet_hdr_ok = 0;
+	bool retire_cur_block = false;
 
 	/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 	 * We may add members to them until current aligned size without forcing
@@ -2269,17 +2276,32 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 			do_vnet = false;
 		}
 	}
+
+	if (do_vnet) {
+		vnet_hdr_ok = virtio_net_hdr_from_skb(skb, &vnet_hdr,
+						      vio_le(), true, 0);
+		/* Improve performance by retiring current block for
+		 * TPACKET_V3 in case of TSO.
+		 */
+		if (vnet_hdr_ok == 0) {
+			retire_cur_block = true;
+		}
+	}
+
 	spin_lock(&sk->sk_receive_queue.lock);
 	h.raw = packet_current_rx_frame(po, skb,
-					TP_STATUS_KERNEL, (macoff+snaplen));
+					TP_STATUS_KERNEL, (macoff+snaplen),
+					retire_cur_block);
 	if (!h.raw)
 		goto drop_n_account;
 
-	if (do_vnet &&
-	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
-				    sizeof(struct virtio_net_hdr),
-				    vio_le(), true, 0))
-		goto drop_n_account;
+	if (do_vnet) {
+		if (vnet_hdr_ok != 0)
+			goto drop_n_account;
+		else
+			memcpy(h.raw + macoff - sizeof(struct virtio_net_hdr),
+			       &vnet_hdr, sizeof(vnet_hdr));
+	}
 
 	if (po->tp_version <= TPACKET_V2) {
 		packet_increment_rx_head(po, &po->rx_ring);
-- 
1.8.3.1
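
For context, this is roughly how a TPACKET_V3 receive ring is configured
from userspace and where req.tp_retire_blk_tov (in milliseconds) enters
the picture. A minimal sketch with illustrative sizes only, assuming fd
was created with socket(AF_PACKET, SOCK_RAW, ...):

#include <linux/if_packet.h>
#include <sys/socket.h>

static int setup_v3_rx_ring(int fd)
{
	int ver = TPACKET_V3;
	struct tpacket_req3 req = {
		.tp_block_size     = 1 << 16,	/* illustrative: 64 KiB blocks */
		.tp_block_nr       = 128,
		.tp_frame_size     = 1 << 11,
		.tp_frame_nr       = (1 << 16) / (1 << 11) * 128,
		.tp_retire_blk_tov = 1,		/* ms; rounded to jiffies, so 4 ms at HZ=250 */
	};

	if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver)))
		return -1;
	if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)))
		return -1;
	/* the ring is then mmap()ed: tp_block_size * tp_block_nr bytes */
	return 0;
}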




* Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-25 14:08 [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO yang_y_yi
@ 2020-03-25 14:37 ` Willem de Bruijn
  2020-03-26  0:43   ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
       [not found]   ` <8c7c4b8.a0a4.17112280afb.Coremail.yang_y_yi@163.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-25 14:37 UTC (permalink / raw)
  To: yang_y_yi; +Cc: Network Development, u9012063, yangyi01

On Wed, Mar 25, 2020 at 10:10 AM <yang_y_yi@163.com> wrote:
>
> From: Yi Yang <yangyi01@inspur.com>
>
> TPACKET_V3 performance is very poor in the TSO case, even worse than
> in the non-TSO case. On kernels built with CONFIG_HZ=1000, setting
> req.tp_retire_blk_tov = 1 helps a bit, but some Linux distributions
> set CONFIG_HZ to 250, so req.tp_retire_blk_tov = 1 is rounded up to
> an effective 4 ms and does not help at all.
>
> This patch fixes that performance issue by retiring the current block
> as early as possible in the TSO case, so that the userspace
> application can consume it in time. It boosts throughput from
> 3.05 Gbps to 16.9 Gbps, a huge improvement.
>
> Signed-off-by: Yi Yang <yangyi01@inspur.com>

I'm not convinced that special casing TSO packets is the right solution here.

We should consider converting TPACKET_V3 to hrtimer and allow usec
resolution block timer.
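
For a rough idea of what that would mean inside net/packet/af_packet.c
(sketch only: today the code uses a jiffies-based timer_list, and the
retire_blk_hrtimer field, the callback name and the tov_us variable
below are made up for illustration):

hrtimer_init(&pkc->retire_blk_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
pkc->retire_blk_hrtimer.function = prb_retire_rx_blk_hrtimer_expired;

/* re-armed with a microsecond timeout instead of a jiffies one: */
hrtimer_start(&pkc->retire_blk_hrtimer,
	      ns_to_ktime((u64)tov_us * NSEC_PER_USEC),
	      HRTIMER_MODE_REL);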

Would that solve your issue?


* 答复: [vger.kernel.org代发]Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-25 14:37 ` Willem de Bruijn
@ 2020-03-26  0:43   ` Yi Yang (杨燚)-云服务集团
  2020-03-26  1:20     ` Willem de Bruijn
       [not found]   ` <8c7c4b8.a0a4.17112280afb.Coremail.yang_y_yi@163.com>
  1 sibling, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-03-26  0:43 UTC (permalink / raw)
  To: willemdebruijn.kernel, yang_y_yi; +Cc: netdev, u9012063


By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Willem de Bruijn
Sent: March 25, 2020 22:38
To: yang_y_yi@163.com
Cc: Network Development <netdev@vger.kernel.org>; u9012063@gmail.com; Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Subject: [vger.kernel.org代发]Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO

On Wed, Mar 25, 2020 at 10:10 AM <yang_y_yi@163.com> wrote:
>
> From: Yi Yang <yangyi01@inspur.com>
>
> TPACKET_V3 performance is very poor in the TSO case, even worse than
> in the non-TSO case. On kernels built with CONFIG_HZ=1000, setting
> req.tp_retire_blk_tov = 1 helps a bit, but some Linux distributions
> set CONFIG_HZ to 250, so req.tp_retire_blk_tov = 1 is rounded up to
> an effective 4 ms and does not help at all.
>
> This patch fixes that performance issue by retiring the current block
> as early as possible in the TSO case, so that the userspace
> application can consume it in time. It boosts throughput from
> 3.05 Gbps to 16.9 Gbps, a huge improvement.
>
> Signed-off-by: Yi Yang <yangyi01@inspur.com>

I'm not convinced that special casing TSO packets is the right solution here.

We should consider converting TPACKET_V3 to hrtimer and allow usec resolution block timer.

Would that solve your issue?



* Re: Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO
       [not found]   ` <8c7c4b8.a0a4.17112280afb.Coremail.yang_y_yi@163.com>
@ 2020-03-26  1:16     ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-26  1:16 UTC (permalink / raw)
  To: yang_y_yi; +Cc: Network Development, u9012063, yangyi01

On Wed, Mar 25, 2020 at 10:46 AM yang_y_yi <yang_y_yi@163.com> wrote:
>
> Yes, hrtimer is better, but it will change the current API semantics.
>
> req.tp_retire_blk_tov means milliseconds; if we change it to microseconds, it will break the ABI.

That can be resolved by adding a new feature flag that reinterprets
the field in the request.

#define TP_FT_REQ_USEC      0x2
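
For illustration, userspace opting in to such a flag might then look
roughly like this (TP_FT_REQ_USEC is only a proposal here, not an
existing kernel define; ring sizes are illustrative and PACKET_VERSION
is assumed to have been set to TPACKET_V3 already):

static int setup_v3_ring_usec(int fd)
{
	struct tpacket_req3 req = {
		.tp_block_size       = 1 << 16,
		.tp_block_nr         = 128,
		.tp_frame_size       = 1 << 11,
		.tp_frame_nr         = (1 << 16) / (1 << 11) * 128,
		.tp_retire_blk_tov   = 100,		/* now 100 us instead of 100 ms */
		.tp_feature_req_word = TP_FT_REQ_USEC,	/* hypothetical; 0x1 is TP_FT_REQ_FILL_RXHASH today */
	};

	return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
}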

Please remember to use plain text and don't top paste.



> At 2020-03-25 22:37:59, "Willem de Bruijn" <willemdebruijn.kernel@gmail.com> wrote:
> >On Wed, Mar 25, 2020 at 10:10 AM <yang_y_yi@163.com> wrote:
> >>
> >> From: Yi Yang <yangyi01@inspur.com>
> >>
> >> TPACKET_V3 performance is very poor in the TSO case, even worse than
> >> in the non-TSO case. On kernels built with CONFIG_HZ=1000, setting
> >> req.tp_retire_blk_tov = 1 helps a bit, but some Linux distributions
> >> set CONFIG_HZ to 250, so req.tp_retire_blk_tov = 1 is rounded up to
> >> an effective 4 ms and does not help at all.
> >>
> >> This patch fixes that performance issue by retiring the current block
> >> as early as possible in the TSO case, so that the userspace
> >> application can consume it in time. It boosts throughput from
> >> 3.05 Gbps to 16.9 Gbps, a huge improvement.
> >>
> >> Signed-off-by: Yi Yang <yangyi01@inspur.com>
> >
> >I'm not convinced that special casing TSO packets is the right solution here.
> >
> >We should consider converting TPACKET_V3 to hrtimer and allow usec
> >resolution block timer.
> >
> >Would that solve your issue?


* Re: [vger.kernel.org代发]Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-26  0:43   ` 答复: [vger.kernel.org代发]Re: " Yi Yang (杨燚)-云服务集团
@ 2020-03-26  1:20     ` Willem de Bruijn
  2020-03-27  2:33       ` 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-26  1:20 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: yang_y_yi, netdev, u9012063

On Wed, Mar 25, 2020 at 8:45 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.
>

The key differentiating feature of TPACKET_V3 is the use of blocks to
efficiently pack packets and amortize wake ups.

If you want immediate notification for every packet, why not just use
TPACKET_V2?


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/        packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-26  1:20     ` Willem de Bruijn
@ 2020-03-27  2:33       ` Yi Yang (杨燚)-云服务集团
  2020-03-27  3:16         ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-03-27  2:33 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063


-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Willem de Bruijn
Sent: March 26, 2020 9:21
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
Subject: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO

On Wed, Mar 25, 2020 at 8:45 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.
>

The key differentiating feature of TPACKET_V3 is the use of blocks to efficiently pack packets and amortize wake ups.

If you want immediate notification for every packet, why not just use TPACKET_V2?

[Yi Yang] For non-TSO packet, TPACKET_V3 is much better than TPACKET_V2, but for TSO packet, it is bad, we prefer to use TPACKET_V3 for better performance.
[Yi Yang] 




* Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-27  2:33       ` 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: " Yi Yang (杨燚)-云服务集团
@ 2020-03-27  3:16         ` Willem de Bruijn
  2020-03-28  8:36           ` 答复: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-27  3:16 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: willemdebruijn.kernel, yang_y_yi, netdev, u9012063

> On Wed, Mar 25, 2020 at 8:45 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
> >
> > By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.
> >
>
> The key differentiating feature of TPACKET_V3 is the use of blocks to efficiently pack packets and amortize wake ups.
>
> If you want immediate notification for every packet, why not just use TPACKET_V2?
>
> For non-TSO packet, TPACKET_V3 is much better than TPACKET_V2, but for TSO packet, it is bad, we prefer to use TPACKET_V3 for better performance.

At high rate, blocks are retired and userspace is notified as soon as
a packet arrives that does not fit and requires dispatching a new
block. As such, max throughput is not timer dependent. The timer
exists to bound notification latency when packet arrival rate is slow.
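
For reference, the consumer pattern that this retirement model assumes,
as a rough sketch (handle_packet() is a placeholder for the
application's per-packet work; the ring setup and the <poll.h> /
<linux/if_packet.h> includes are assumed):

static void consume_block(int fd, struct tpacket_block_desc *bd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	/* wake up only when a whole block has been retired, either
	 * because it filled up or because the retire timer fired
	 */
	while (!(bd->hdr.bh1.block_status & TP_STATUS_USER))
		poll(&pfd, 1, -1);

	struct tpacket3_hdr *ppd = (struct tpacket3_hdr *)
		((char *)bd + bd->hdr.bh1.offset_to_first_pkt);

	for (unsigned int i = 0; i < bd->hdr.bh1.num_pkts; i++) {
		handle_packet((char *)ppd + ppd->tp_mac, ppd->tp_snaplen);
		ppd = (struct tpacket3_hdr *)((char *)ppd + ppd->tp_next_offset);
	}

	/* return the whole block to the kernel in one go */
	bd->hdr.bh1.block_status = TP_STATUS_KERNEL;
}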


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-27  3:16         ` Willem de Bruijn
@ 2020-03-28  8:36           ` Yi Yang (杨燚)-云服务集团
  2020-03-28 18:36             ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-03-28  8:36 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063



-----Original Message-----
From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
Sent: March 27, 2020 11:17
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO

> On Wed, Mar 25, 2020 at 8:45 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
> >
> > By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.
> >
>
> The key differentiating feature of TPACKET_V3 is the use of blocks to efficiently pack packets and amortize wake ups.
>
> If you want immediate notification for every packet, why not just use TPACKET_V2?
>
> For non-TSO packet, TPACKET_V3 is much better than TPACKET_V2, but for TSO packet, it is bad, we prefer to use TPACKET_V3 for better performance.

At high rate, blocks are retired and userspace is notified as soon as a packet arrives that does not fit and requires dispatching a new block. As such, max throughput is not timer dependent. The timer exists to bound notification latency when packet arrival rate is slow.

[Yi Yang] Per our iperf3 tcp test with TSO enabled, even if packet size is about 64K and block size is also 64K + 4K (to accommodate tpacket_vX header), we can't see high performance without this patch, I think some small packets before 64K big packets decide what performance it can reach, according to my trace, TCP packet size is increasing from less than 100 to 64K gradually, so it looks like how long this period took decides what performance it can reach. So yes, I don’t think hrtimer can help fix this issue very efficiently. In addition, I also noticed packet size pattern is 1514, 64K, 64K, 64K, 64K, ..., 1514, 64K even if it reaches 64K packet size, maybe that 1514 packet has big impact on performance, I just guess.

[Yi Yang] 




* Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-28  8:36           ` 答复: " Yi Yang (杨燚)-云服务集团
@ 2020-03-28 18:36             ` Willem de Bruijn
  2020-03-29  2:42               ` 答复: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-28 18:36 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: willemdebruijn.kernel, yang_y_yi, netdev, u9012063

On Sat, Mar 28, 2020 at 4:37 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
>
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
> Sent: March 27, 2020 11:17
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
> Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
>
> > On Wed, Mar 25, 2020 at 8:45 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
> > >
> > > By the way, even if we used hrtimer, it can't ensure so high performance improvement, the reason is every frame has different size, you can't know how many microseconds one frame will be available, early timer firing will be an unnecessary waste, late timer firing will reduce performance, so I still think the way this patch used is best so far.
> > >
> >
> > The key differentiating feature of TPACKET_V3 is the use of blocks to efficiently pack packets and amortize wake ups.
> >
> > If you want immediate notification for every packet, why not just use TPACKET_V2?
> >
> > For non-TSO packet, TPACKET_V3 is much better than TPACKET_V2, but for TSO packet, it is bad, we prefer to use TPACKET_V3 for better performance.
>
> At high rate, blocks are retired and userspace is notified as soon as a packet arrives that does not fit and requires dispatching a new block. As such, max throughput is not timer dependent. The timer exists to bound notification latency when packet arrival rate is slow.
>
> [Yi Yang] Per our iperf3 tcp test with TSO enabled, even if packet size is about 64K and block size is also 64K + 4K (to accommodate tpacket_vX header), we can't see high performance without this patch, I think some small packets before 64K big packets decide what performance it can reach, according to my trace, TCP packet size is increasing from less than 100 to 64K gradually, so it looks like how long this period took decides what performance it can reach. So yes, I don’t think hrtimer can help fix this issue very efficiently. In addition, I also noticed packet size pattern is 1514, 64K, 64K, 64K, 64K, ..., 1514, 64K even if it reaches 64K packet size, maybe that 1514 packet has big impact on performance, I just guess.

Again, the main issue is that the timer does not matter at high rate.
The 3 Gbps you report corresponds to ~6000 TSO packets, or 167 usec
inter arrival time. The timer, whether 1 or 4 ms, should never be
needed.
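
(Checking that estimate: 3 Gbit/s is about 375 MB/s; at roughly 64 KB
per GSO packet that is on the order of 6000 packets per second, i.e.
about 167 usec between packets, well under either a 1 ms or 4 ms
timeout.)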

There are too many unknown variables here. Besides block size, what is
tp_block_nr? What is the drop rate? Are you certain that you are not
causing drops by not reading fast enough? What happens when you
increase tp_block_size or tp_block_nr? It may be worthwhile to pin
iperf to one (set of) core(s) and the packet socket reader to another.
Let it busy spin and do minimal processing, just return blocks back to
the kernel.

If unsure about that, it may be interesting to instrument the kernel
and count how many block retire operations are from
prb_retire_rx_blk_timer_expired and how many from tpacket_rcv.
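
One rough way to get those counts, sketched as temporary debug counters
in net/packet/af_packet.c (illustrative only; exact placement next to
the existing prb_retire_current_block() calls is assumed):

static atomic_t retire_by_timer = ATOMIC_INIT(0);
static atomic_t retire_by_rcv   = ATOMIC_INIT(0);

/* in prb_retire_rx_blk_timer_expired(), on the path that retires
 * the current block:
 */
	atomic_inc(&retire_by_timer);

/* in __packet_lookup_frame_in_block(), on the "block is full" path
 * taken from tpacket_rcv():
 */
	atomic_inc(&retire_by_rcv);

/* and dump them somewhere convenient, e.g. in packet_release(): */
	pr_info("tpacket retire: timer=%d rcv=%d\n",
		atomic_read(&retire_by_timer),
		atomic_read(&retire_by_rcv));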

Note that do_vnet only changes whether a virtio_net_header is prefixed
to the data. Having that disabled (the common case) does not stop GSO
packets from arriving.


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-28 18:36             ` Willem de Bruijn
@ 2020-03-29  2:42               ` Yi Yang (杨燚)-云服务集团
  2020-03-30  1:51                 ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-03-29  2:42 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063

[-- Attachment #1: smime.p7m --]
[-- Type: application/pkcs7-mime, Size: 11836 bytes --]


* Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-29  2:42               ` 答复: " Yi Yang (杨燚)-云服务集团
@ 2020-03-30  1:51                 ` Willem de Bruijn
  2020-03-30  6:34                   ` 答复: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-30  1:51 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: willemdebruijn.kernel, yang_y_yi, netdev, u9012063

On Sat, Mar 28, 2020 at 10:43 PM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
> Sent: March 29, 2020 2:36
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com;
> netdev@vger.kernel.org; u9012063@gmail.com
> Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next]
> net/ packet: fix TPACKET_V3 performance issue in case of TSO
>
> On Sat, Mar 28, 2020 at 4:37 AM Yi Yang (杨燚)-云服务集团
> <yangyi01@inspur.com> wrote:
> >
> <yangyi01@inspur.com> wrote:
> > > >
> > > > By the way, even if we used hrtimer, it can't ensure so high
> performance improvement, the reason is every frame has different size, you
> can't know how many microseconds one frame will be available, early timer
> firing will be an unnecessary waste, late timer firing will reduce
> performance, so I still think the way this patch used is best so far.
> > > >
> > >
> > > The key differentiating feature of TPACKET_V3 is the use of blocks to
> efficiently pack packets and amortize wake ups.
> > >
> > > If you want immediate notification for every packet, why not just use
> TPACKET_V2?
> > >
> > > For non-TSO packet, TPACKET_V3 is much better than TPACKET_V2, but for
> TSO packet, it is bad, we prefer to use TPACKET_V3 for better performance.
> >
> > At high rate, blocks are retired and userspace is notified as soon as a
> packet arrives that does not fit and requires dispatching a new block. As
> such, max throughput is not timer dependent. The timer exists to bound
> notification latency when packet arrival rate is slow.
> >
> > [Yi Yang] Per our iperf3 tcp test with TSO enabled, even if packet size is
> about 64K and block size is also 64K + 4K (to accommodate tpacket_vX
> header), we can't see high performance without this patch, I think some
> small packets before 64K big packets decide what performance it can reach,
> according to my trace, TCP packet size is increasing from less than 100 to
> 64K gradually, so it looks like how long this period took decides what
> performance it can reach. So yes, I don’t think hrtimer can help fix this
> issue very efficiently. In addition, I also noticed packet size pattern is
> 1514, 64K, 64K, 64K, 64K, ..., 1514, 64K even if it reaches 64K packet size,
> maybe that 1514 packet has big impact on performance, I just guess.
>
> Again, the main issue is that the timer does not matter at high rate.
> The 3 Gbps you report corresponds to ~6000 TSO packets, or 167 usec inter
> arrival time. The timer, whether 1 or 4 ms, should never be needed.
>
> There are too many unknown variables here. Besides block size, what is
> tp_block_nr? What is the drop rate? Are you certain that you are not causing
> drops by not reading fast enough? What happens when you increase
> tp_block_size or tp_block_nr? It may be worthwhile to pin iperf to one (set
> of) core(s) and the packet socket reader to another.
> Let it busy spin and do minimal processing, just return blocks back to the
> kernel.
>
> If unsure about that, it may be interesting to instrument the kernel and
> count how many block retire operations are from
> prb_retire_rx_blk_timer_expired and how many from tpacket_rcv.
>
> Note that do_vnet only changes whether a virtio_net_header is prefixed to
> the data. Having that disabled (the common case) does not stop GSO packets
> from arriving.
>
> [Yi Yang] You can refer to the OVS DPDK patch
> (https://patchwork.ozlabs.org/patch/1257288/) for more details.
> tp_block_nr is 128 for the TSO case and the frame size is equal to the
> block size; I tried increasing the block size to multiple frames and
> also a bigger tp_block_nr, but neither helped. For TSO, we have to have
> a vnet header in the frame, otherwise TSO won't work. Our user scenario
> is OpenStack, but with OVS DPDK rather than OVS; no matter whether it
> is a tap interface or a veth interface, performance is very bad,
> because OVS DPDK uses a RAW socket to receive packets from veth and
> transmit packets to it. Our iperf3 TCP test case attaches two veth
> interfaces to an OVS DPDK bridge, puts the two veth peers into two
> network namespaces, and runs the iperf3 client in one namespace and the
> iperf3 server in the other, so the traffic goes back and forth between
> the two veth interfaces; OVS DPDK uses TPACKET_V3 to forward packets
> between the two veth interfaces.
>
> Here is an illustration of the traffic path (TPACKET_V3 is used on
> vethbr1 and vethbr2):
>
>   ns01 <-> veth1 <-> vethbr1 <-> OVS DPDK bridge <-> vethbr2 <-> veth2 <-> ns02
>
> I have used two pmd threads to handle vethbr1 and vethbr2 traffic,
> pinned to cores 2 and 3 respectively; the iperf3 server and client are
> pinned to cores 4 and 5. So the producer won't hit a buffer overflow;
> on the contrary, the consumer is starved. I printed the tpacket stats:
> no queue freezes and no packet drops, so I'm sure the buffer is large
> enough. I can see the consumer (pmd thread) being starved because it
> receives no packets in many loops; the pmd threads are very fast and
> have nothing to do except receive and transmit packets.
>
> My test script for reference:
>
> #!/bin/bash
>
> DB_SOCK=unix:/var/run/openvswitch/db.sock
> OVS_VSCTL="/home/yangyi/workspace/ovs-master/utilities/ovs-vsctl --db=${DB_SOCK}"
>
> ${OVS_VSCTL} add-br br-int1 -- set bridge br-int1 datapath_type=netdev
> protocols=OpenFlow10,OpenFlow12,OpenFlow13
> ip link add veth1 type veth peer name vethbr1
> ip link add veth2 type veth peer name vethbr2
> ip netns add ns01
> ip netns add ns02
>
> ip link set veth1 netns ns01
> ip link set veth2 netns ns02
>
> ip netns exec ns01 ifconfig lo 127.0.0.1 up
> ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up
> #ip netns exec ns01 ethtool -K veth1 tx off
>
> ip netns exec ns02 ifconfig lo 127.0.0.1 up
> ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up
> #ip netns exec ns02 ethtool -K veth2 tx off
>
> ifconfig vethbr1 0 up
> ifconfig vethbr2 0 up
>
>
> ${OVS_VSCTL} add-port br-int1 vethbr1
> ${OVS_VSCTL} add-port br-int1 vethbr2
>
> ip netns exec ns01 ping 10.15.1.3 -c 3
> ip netns exec ns02 ping 10.15.1.2 -c 3
>
> killall iperf3
> ip netns exec ns02 iperf3 -s -i 10 -D -A 4
> ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 -A 5 --get-server-output
>
> ------------------------
> iperf3 test result
> -----------------------
> [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> iperf3: no process found
> Connecting to host 10.15.1.3, port 5201
> [  4] local 10.15.1.2 port 44976 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes

Thanks for the detailed info.

So there is more going on there than a simple network tap. veth, which
calls netif_rx and thus schedules delivery with a napi after a softirq
(twice), tpacket for recv + send + ovs processing. And this is a
single flow, so more sensitive to batching, drops and interrupt
moderation than a workload of many flows.

If anything, I would expect the ACKs on the return path to be the more
likely cause for concern, as they are even less likely to fill a block
before the timer. The return path is a separate packet socket?

With initial small window size, I guess it might be possible for the
entire window to be in transit. And as no follow-up data will arrive,
this waits for the timeout. But at 3Gbps that is no longer the case.
Again, the timeout is intrinsic to TPACKET_V3. If that is
unacceptable, then TPACKET_V2 is a more logical choice. Here also in
relation to timely ACK responses.

Other users of TPACKET_V3 may be using fewer blocks of larger size. A
change to retire blocks after 1 gso packet will negatively affect
their workloads. At the very least this should be an optional feature,
similar to how I suggested converting to micro seconds.


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-30  1:51                 ` Willem de Bruijn
@ 2020-03-30  6:34                   ` Yi Yang (杨燚)-云服务集团
  2020-03-30 14:16                     ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-03-30  6:34 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063


-----Original Message-----
From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
Sent: March 30, 2020 9:52
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO

> iperf3 test result
> -----------------------
> [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> iperf3: no process found
> Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port 
> 44976 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes

Thanks for the detailed info.

So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.

If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?

With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.

Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after 1 gso packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to micro seconds.

[Yi Yang] My iperf3 test is TCP socket, return path is same socket as forward path. BTW this patch will retire current block only if vnet header is in packets, I don't know what else use cases will use vnet header except our user scenario. In addition, I also have more conditions to limit this, but it impacts on performance. I'll try if V2 can fix our issue, this will be only one way to fix our issue if not.

+
+       if (do_vnet) {
+               vnet_hdr_ok = virtio_net_hdr_from_skb(skb, &vnet_hdr,
+                                                     vio_le(), true, 0);
+               /* Improve performance by retiring current block for
+                * TPACKET_V3 in case of TSO.
+                */
+               if (vnet_hdr_ok == 0 && po->tp_version == TPACKET_V3 &&
+                   vnet_hdr.flags != 0 &&
+                   (vnet_hdr.gso_type == VIRTIO_NET_HDR_GSO_TCPV4 ||
+                    vnet_hdr.gso_type == VIRTIO_NET_HDR_GSO_TCPV6)) {
+                       retire_cur_block = true;
+               }
+       }
+




* Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
  2020-03-30  6:34                   ` 答复: " Yi Yang (杨燚)-云服务集团
@ 2020-03-30 14:16                     ` Willem de Bruijn
  2020-04-14  3:41                       ` 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 perform ance " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-03-30 14:16 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: yang_y_yi, netdev, u9012063

On Mon, Mar 30, 2020 at 2:35 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
> Sent: March 30, 2020 9:52
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
> Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 performance issue in case of TSO
>
> > iperf3 test result
> > -----------------------
> > [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> > iperf3: no process found
> > Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port
> > 44976 connected to 10.15.1.3 port 5201
> > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> > [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> > [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
>
> Thanks for the detailed info.
>
> So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.
>
> If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?
>
> With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
> Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.
>
> Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after 1 gso packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to micro seconds.
>
> [Yi Yang] My iperf3 test is TCP socket, return path is same socket as forward path. BTW this patch will retire current block only if vnet header is in packets, I don't know what else use cases will use vnet header except our user scenario. In addition, I also have more conditions to limit this, but it impacts on performance. I'll try if V2 can fix our issue, this will be only one way to fix our issue if not.
>

Thanks. Also interesting might be a short packet trace of packet
arrival on the bond device ports, taken at the steady state of 3 Gbps.
To observe when inter-arrival time exceeds the 167 usec mean. Also
informative would be to learn whether when retiring a block using your
patch, that block also holds one or more ACK packets along with the
GSO packet. As their delay might be the true source of throttling the sender.

I think we need to understand the underlying problem better to
implement a robust fix that works for a variety of configurations, and
does not cause accidental regressions. The current patch works for
your setup, but I'm afraid that it might paper over the real issue.

It is a peculiar aspect of TPACKET_V3 that blocks are retired not when
a packet is written that fills them, but when the next packet arrives
and cannot find room. Again, at sustained rate that delay should be
immaterial. But it might be okay to measure remaining space after
write and decide to retire if below some watermark. I would prefer
that watermark to be a ratio of block size rather than whether the
packet is gso or not.
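
Very roughly, such a watermark could be checked in
__packet_lookup_frame_in_block() once the current packet has been
placed; sketch only, and the 1/8 ratio, the exact placement and whether
to open the next block right away are all open questions:

/* after prb_fill_curr_block(curr, pkc, pbd, len): */
if (end - (curr + TOTAL_PKT_LEN_INCL_ALIGN(len)) < pkc->kblk_size / 8) {
	/* little room left for another packet: retire now instead of
	 * waiting for the next arrival or for the retire timer,
	 * reusing the same steps as the existing full-block path
	 */
	prb_retire_current_block(pkc, po, 0);
	prb_dispatch_next_block(pkc, po);
}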


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代        发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 perform        ance issue in case of TSO
  2020-03-30 14:16                     ` Willem de Bruijn
@ 2020-04-14  3:41                       ` Yi Yang (杨燚)-云服务集团
  2020-04-14 14:03                         ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-04-14  3:41 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063


Reply inline, sorry for late response.

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Willem de Bruijn
Sent: March 30, 2020 22:16
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
Subject: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 perform ance issue in case of TSO

On Mon, Mar 30, 2020 at 2:35 AM Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com> wrote:
>
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
> Sent: March 30, 2020 9:52
> To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
> Cc: willemdebruijn.kernel@gmail.com; yang_y_yi@163.com;
> netdev@vger.kernel.org; u9012063@gmail.com
> Subject: Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [PATCH net-next]
> net/ packet: fix TPACKET_V3 performance issue in case of TSO
>
> > iperf3 test result
> > -----------------------
> > [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> > iperf3: no process found
> > Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port
> > 44976 connected to 10.15.1.3 port 5201
> > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> > [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> > [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
>
> Thanks for the detailed info.
>
> So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.
>
> If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?
>
> With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
> Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.
>
> Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after 1 gso packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to micro seconds.
>
> [Yi Yang] My iperf3 test is TCP socket, return path is same socket as forward path. BTW this patch will retire current block only if vnet header is in packets, I don't know what else use cases will use vnet header except our user scenario. In addition, I also have more conditions to limit this, but it impacts on performance. I'll try if V2 can fix our issue, this will be only one way to fix our issue if not.
>

Thanks. Also interesting might be a short packet trace of packet arrival on the bond device ports, taken at the steady state of 3 Gbps.
To observe when inter-arrival time exceeds the 167 usec mean. Also informative would be to learn whether when retiring a block using your patch, that block also holds one or more ACK packets along with the GSO packet. As their delay might be the true source of throttling the sender.

I think we need to understand the underlying problem better to implement a robust fix that works for a variety of configurations, and does not cause accidental regressions. The current patch works for your setup, but I'm afraid that it might paper over the real issue.

It is a peculiar aspect of TPACKET_V3 that blocks are retired not when a packet is written that fills them, but when the next packet arrives and cannot find room. Again, at sustained rate that delay should be immaterial. But it might be okay to measure remaining space after write and decide to retire if below some watermark. I would prefer that watermark to be a ratio of block size rather than whether the packet is gso or not.

[Yi Yang] Sorry for the late reply, I missed this email. I did do timing for every received frame; the time interval is highly dynamic and I can't find any valuable clues, but I did find TCP ACK frames have a big impact on performance. They are small frames (size not more than 100 bytes); in the TPACKET_V3 case, a block will hold a bunch of such TCP ACK frames, so these ACK frames aren't received and sent back to the receiver in time. I tried TPACKET_V2, and its performance is beyond what I expected: on kernel 5.5.9 its performance is better than this patch, about 11Gbps; I also tried kernel 4.15.0 (from Ubuntu, which actually cherry-picked many fix patches from upstream, so it isn't the official 4.15.0), where its performance is about 14Gbps, worse than this patch (which is 17Gbps), so obviously the performance is kernel-related and platform-related. In the non-pmd case (i.e. sender and receiver are one thread and use the same CPU), TPACKET_V2 is much better than recvmmsg & sendmmsg. We have decided to use TPACKET_V2 for TSO. But we don't know how we can reach higher performance than 14Gbps; it looks like tpacket_v2/v3's cache flush operation has a side effect on performance (especially one flush per frame for TPACKET_V2).



* Re: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 perform ance issue in case of TSO
  2020-04-14  3:41                       ` 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ packet: fix TPACKET_V3 perform ance " Yi Yang (杨燚)-云服务集团
@ 2020-04-14 14:03                         ` Willem de Bruijn
  2020-04-15  3:33                           ` 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ pa cket: " Yi Yang (杨燚)-云服务集团
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-04-14 14:03 UTC (permalink / raw)
  To: Yi Yang (杨燚)-云服务集团
  Cc: yang_y_yi, netdev, u9012063

> > > iperf3 test result
> > > -----------------------
> > > [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> > > iperf3: no process found
> > > Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port
> > > 44976 connected to 10.15.1.3 port 5201
> > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> > > [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> > > [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
> >
> > Thanks for the detailed info.
> >
> > So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.
> >
> > If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?
> >
> > With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
> > Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.
> >
> > Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after 1 gso packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to micro seconds.
> >
> > [Yi Yang] My iperf3 test is TCP socket, return path is same socket as forward path. BTW this patch will retire current block only if vnet header is in packets, I don't know what else use cases will use vnet header except our user scenario. In addition, I also have more conditions to limit this, but it impacts on performance. I'll try if V2 can fix our issue, this will be only one way to fix our issue if not.
> >
>
> Thanks. Also interesting might be a short packet trace of packet arrival on the bond device ports, taken at the steady state of 3 Gbps.
> To observe when inter-arrival time exceeds the 167 usec mean. Also informative would be to learn whether when retiring a block using your patch, that block also holds one or more ACK packets along with the GSO packet. As their delay might be the true source of throttling the sender.
>
> I think we need to understand the underlying problem better to implement a robust fix that works for a variety of configurations, and does not cause accidental regressions. The current patch works for your setup, but I'm afraid that it might paper over the real issue.
>
> It is a peculiar aspect of TPACKET_V3 that blocks are retired not when a packet is written that fills them, but when the next packet arrives and cannot find room. Again, at sustained rate that delay should be immaterial. But it might be okay to measure remaining space after write and decide to retire if below some watermark. I would prefer that watermark to be a ratio of block size rather than whether the packet is gso or not.
>
> [Yi Yang] Sorry for the late reply, I missed this email. I did do timing for every received frame; the time interval is highly dynamic and I can't find any valuable clues, but I did find TCP ACK frames have a big impact on performance. They are small frames (size not more than 100 bytes); in the TPACKET_V3 case, a block will hold a bunch of such TCP ACK frames, so these ACK frames aren't received and sent back to the receiver in time. I tried TPACKET_V2, and its performance is beyond what I expected: on kernel 5.5.9 its performance is better than this patch, about 11Gbps; I also tried kernel 4.15.0 (from Ubuntu, which actually cherry-picked many fix patches from upstream, so it isn't the official 4.15.0), where its performance is about 14Gbps, worse than this patch (which is 17Gbps), so obviously the performance is kernel-related and platform-related. In the non-pmd case (i.e. sender and receiver are one thread and use the same CPU), TPACKET_V2 is much better than recvmmsg & sendmmsg. We have decided to use TPACKET_V2 for TSO. But we don't know how we can reach higher performance than 14Gbps; it looks like tpacket_v2/v3's cache flush operation has a side effect on performance (especially one flush per frame for TPACKET_V2).

Kernel 5.5.9 with TPACKET_V2 is better than this patch at 11 Gbps, but
Ubuntu 4.15.0 is worse than this patch at 14 Gbps (this patch is 17)?

How did you arrive at the conclusion that the cache flush operation is
the main bottleneck?

Good to hear that you verified that a main issue is the ACK delay.

Instead of packet sockets, you could also take a look at AF_XDP. There
seems to be documentation on how to deploy it with OVS.


* 答复: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代        发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ pa        cket: fix TPACKET_V3 perform ance issue in case of TSO
  2020-04-14 14:03                         ` Willem de Bruijn
@ 2020-04-15  3:33                           ` Yi Yang (杨燚)-云服务集团
  0 siblings, 0 replies; 16+ messages in thread
From: Yi Yang (杨燚)-云服务集团 @ 2020-04-15  3:33 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: yang_y_yi, netdev, u9012063


-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Willem de Bruijn
Sent: April 14, 2020 22:04
To: Yi Yang (杨燚)-云服务集团 <yangyi01@inspur.com>
Cc: yang_y_yi@163.com; netdev@vger.kernel.org; u9012063@gmail.com
Subject: [vger.kernel.org代发]Re: [vger.kernel.org代发]Re: [vger.kernel.org代 发]Re: [vger.kernel.org代 发]Re: [PATCH net-next] net/ pa cket: fix TPACKET_V3 perform ance issue in case of TSO

> > > iperf3 test result
> > > -----------------------
> > > [yangyi@localhost ovs-master]$ sudo ../run-iperf3.sh
> > > iperf3: no process found
> > > Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port
> > > 44976 connected to 10.15.1.3 port 5201
> > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> > > [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> > > [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
> >
> > Thanks for the detailed info.
> >
> > So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.
> >
> > If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?
> >
> > With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
> > Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.
> >
> > Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after 1 gso packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to micro seconds.
> >
> > [Yi Yang] My iperf3 test is TCP socket, return path is same socket as forward path. BTW this patch will retire current block only if vnet header is in packets, I don't know what else use cases will use vnet header except our user scenario. In addition, I also have more conditions to limit this, but it impacts on performance. I'll try if V2 can fix our issue, this will be only one way to fix our issue if not.
> >
>
> Thanks. Also interesting might be a short packet trace of packet arrival on the bond device ports, taken at the steady state of 3 Gbps.
> To observe when inter-arrival time exceeds the 167 usec mean. Also informative would be to learn whether when retiring a block using your patch, that block also holds one or more ACK packets along with the GSO packet. As their delay might be the true source of throttling the sender.
>
> I think we need to understand the underlying problem better to implement a robust fix that works for a variety of configurations, and does not cause accidental regressions. The current patch works for your setup, but I'm afraid that it might paper over the real issue.
>
> It is a peculiar aspect of TPACKET_V3 that blocks are retired not when a packet is written that fills them, but when the next packet arrives and cannot find room. Again, at sustained rate that delay should be immaterial. But it might be okay to measure remaining space after write and decide to retire if below some watermark. I would prefer that watermark to be a ratio of block size rather than whether the packet is gso or not.
>
> [Yi Yang] Sorry for the late reply, I missed this email. I did do
> timing for every received frame; the time interval is highly dynamic
> and I can't find any valuable clues, but I did find TCP ACK frames
> have a big impact on performance. They are small frames (size not more
> than 100 bytes); in the TPACKET_V3 case, a block will hold a bunch of
> such TCP ACK frames, so these ACK frames aren't received and sent back
> to the receiver in time. I tried TPACKET_V2, and its performance is
> beyond what I expected: on kernel 5.5.9 its performance is better than
> this patch, about 11Gbps; I also tried kernel 4.15.0 (from Ubuntu,
> which actually cherry-picked many fix patches from upstream, so it
> isn't the official 4.15.0), where its performance is about 14Gbps,
> worse than this patch (which is 17Gbps), so obviously the performance
> is kernel-related and platform-related. In the non-pmd case (i.e.
> sender and receiver are one thread and use the same CPU), TPACKET_V2
> is much better than recvmmsg & sendmmsg. We have decided to use
> TPACKET_V2 for TSO. But we don't know how we can reach higher
> performance than 14Gbps; it looks like tpacket_v2/v3's cache flush
> operation has a side effect on performance (especially one flush per
> frame for TPACKET_V2).

Kernel 5.5.9 with TPACKET_V2 is better than this patch at 11 Gbps, but Ubuntu 4.15.0 is worse than this patch at 14 Gbps (this patch is 17)?

[Yi Yang] It's true that the performance of kernel 5.5.9 is worse than Ubuntu kernel 4.15.0: when I tested this patch, 5.5.9 could only reach 11Gbps, but Ubuntu kernel 4.15.0 could reach 17Gbps, and I don't know why. The performance of recvmmsg & sendmmsg shows the same situation, i.e. Ubuntu kernel 4.15.0 is better than kernel 5.5.9, and the same goes for TPACKET_V2 for TSO, so maybe it is a performance regression on the kernel side. Default HZ is 1000 for kernel 5.5.9 but 250 for Ubuntu 4.15.0; I'm not sure whether that is one of the factors.

How did you arrive at the conclusion that the cache flush operation is the main bottleneck?

[Yi Yang] Cache flushing has high overhead, like a spinlock, especially when every frame triggers a cache flush. I know it is unavoidable for TPACKET, otherwise userspace can't see the kernel's changes in time.

Good to hear that you verified that a main issue is the ACK delay.

Instead of packet sockets, you could also take a look at AF_XDP. There seems to be documentation on how to deploy it with OVS.

[Yi Yang] Yes, current OVS can support AF_XDP, but it needs recent kernels, and for our use cases its performance isn't better than tpacket. Most importantly, tpacket has been available since 3.10.0, so all current Linux distributions can run it; that is the major reason why we prefer tpacket.


